Yes, but pure Mamba doesn't perform as well as a transformer (and neither did LSTMs). This is why you see hybrid architectures like Jamba = Mamba + transformer. The ability to attend to specific tokens is really key, and it's exactly what is lost in recurrent models, where the whole sequence history gets munged into a single fixed-size state.
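
To make the contrast concrete, here's a minimal toy sketch (not any real model's code, just made-up matrices and a single attention head with no learned projections): the recurrent update keeps only one fixed-size state no matter how long the sequence is, while the attention lookup lets the last token score every earlier token directly and pull out a specific one.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # toy sequence length and width
x = rng.standard_normal((T, d))  # toy token embeddings

# Recurrent / SSM-style update: history is squeezed into one state h,
# so a detail from an early token only survives if the update preserves it.
A, B = 0.9 * np.eye(d), np.eye(d)   # made-up state/input matrices
h = np.zeros(d)
for t in range(T):
    h = A @ h + B @ x[t]            # state stays size d forever

# Attention-style lookup: the last token's query scores every past token,
# so it can concentrate its weight on one specific earlier token.
q, K, V = x[-1], x, x
scores = K @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attended = weights @ V              # weighted mix over *all* past tokens
```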