Hacker News

Yes, but pure Mamba doesn't perform as well as a transformer (and neither did LSTMs). This is why you see hybrid architectures like Jamba = Mamba + transformer. The ability to attend to specific tokens is really key, and it's what is lost in recurrent models, where the sequence history is munged into a single state.
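The contrast can be sketched in a few lines of NumPy (a toy illustration, not either architecture's actual implementation): a recurrent update folds every token into one fixed-size state, while an attention query can pick out a specific earlier token directly.

```python
import numpy as np

T, d = 5, 4
x = np.eye(T, d)  # toy token embeddings: token i -> basis vector (for i < d)

# Recurrent-style update: the whole history is folded into ONE state vector,
# so no individual token can be addressed afterwards.
W = 0.1 * np.ones((d, d))
h = np.zeros(d)
for t in range(T):
    h = np.tanh(W @ h + x[t])  # 5 tokens of history compressed into d numbers

# Attention: a query can put most of its weight on one specific token.
q = x[2]                     # query matching token 2
scores = x @ q / np.sqrt(d)  # similarity of the query to every token
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.argmax())      # -> 2: token 2 is directly retrievable
```

With the orthogonal toy embeddings the attention weights peak exactly at token 2; the recurrent state `h` contains a mixture from which no single token can be recovered exactly.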


That's my point: it doesn't perform as well in terms of loss, even though it performs well enough in terms of compute.



