Hacker News

Yes, but pure Mamba doesn't perform as well as a transformer (and neither did LSTMs). This is why you see hybrid architectures like Jamba = Mamba + transformer. The ability to attend to specific tokens is really key, and it's what is lost in recurrent models, where the sequence history is munged into a single state.
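The contrast can be sketched in a few lines of NumPy (a toy illustration, not either architecture's actual implementation): a recurrent update folds every token into one fixed-size state, while an attention query can pick out a specific earlier token directly.

```python
import numpy as np

T, d = 5, 4
x = np.eye(T, d)  # toy token embeddings: token i -> basis vector (for i < d)

# Recurrent-style update: the whole history is folded into ONE state vector,
# so no individual token can be addressed afterwards.
W = 0.1 * np.ones((d, d))
h = np.zeros(d)
for t in range(T):
    h = np.tanh(W @ h + x[t])  # 5 tokens of history compressed into d numbers

# Attention: a query can put most of its weight on one specific token.
q = x[2]                     # query matching token 2
scores = x @ q / np.sqrt(d)  # similarity of the query to every token
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.argmax())      # -> 2: token 2 is directly retrievable
```

With the orthogonal toy embeddings the attention weights peak exactly at token 2; the recurrent state `h` contains a mixture from which no single token can be recovered exactly.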


That's my point: it doesn't perform as well in terms of loss, even though it performs well enough in terms of compute.



