Hacker News | new | past | comments | ask | show | jobs | submit | mesuvash's comments

That's actually correct and intentional. TurboQuant applies the same rotation matrix to every vector. The key insight is that any unit vector, when multiplied by a random orthogonal matrix, produces coordinates with a known distribution (Beta/arcsine in 2D, near-Gaussian in high-d). The randomness is in the matrix itself (generated once from a seed), not per-vector. Since the distribution is the same regardless of the input vector, a single precomputed quantization grid works for everything. I've updated the description to make this clearer.
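To make that claim concrete, here's a minimal numpy sketch (my own illustration, not TurboQuant's code): the first coordinate of a randomly rotated unit vector in 2D follows the arcsine distribution, piling up near ±1 regardless of which unit vector you start from.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Orthogonal matrix from QR of a Gaussian matrix (Haar-distributed
    # after the sign fix below)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

# First coordinate of a rotated 2D unit vector, over the randomness
# of the rotation matrix
x = np.array([1.0, 0.0])
coords = np.array([random_rotation(2) @ x for _ in range(20000)])[:, 0]

# The arcsine density concentrates near ±1: much more mass in the
# outer band than near zero
outer = np.mean(np.abs(coords) > 0.9)
inner = np.mean(np.abs(coords) < 0.1)
print(outer, inner)  # outer ≈ 0.29 vs inner ≈ 0.06
```

In the actual scheme the rotation is generated once from a seed and reused, but the distribution of each coordinate is the same either way, which is why one precomputed grid fits all inputs.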


Thanks. However, from this visualization it's not clear how the random rotation is beneficial. I guess it makes more sense for higher-dimensional vectors.


Yes, this is important in high dimensions, but sadly it's very hard to visualize. In 2D it looks unnecessary.


Yes, great catch. I simplified the grid just for visualization purposes.

I've updated the visualization. The grid is actually not uniformly spaced. Each coordinate is quantized independently using optimal centroids for the known coordinate distribution. In 2D, unit-circle coordinates follow the arcsine distribution (concentrating near ±1), so the centroids cluster at the edges, not the center.
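As a toy sketch of what "optimal centroids" means here (my own 1D Lloyd-Max fit on samples, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Coordinates of a uniform point on the unit circle follow the arcsine
# distribution; sample them as cos(U), U uniform on [0, 2π)
samples = np.cos(rng.uniform(0.0, 2.0 * np.pi, 100_000))

def lloyd_centroids(samples, k, iters=50):
    """1D Lloyd's algorithm: alternate nearest-centroid assignment and means."""
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)  # initial guess
    for _ in range(iters):
        assign = np.argmin(np.abs(samples[:, None] - centroids[None, :]), axis=1)
        centroids = np.array([samples[assign == j].mean() for j in range(k)])
    return centroids

c = lloyd_centroids(samples, k=4)
print(c)  # centroids hug ±1; spacing is tighter at the edges than around 0
```

With 2 bits per coordinate you get four centroids, symmetric about zero and pulled toward ±1, which is exactly the non-uniform grid the updated visualization shows.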


Cool! Thank you


Author here. Sorry, still working on refining the post. Will share once it's ready.


TurboQuant explained with an easy-to-understand (no-math) animation: https://mesuvash.github.io/blog/2026/turboquant-interactive/


Someone else linked that elsewhere in the comments, and while it's certainly a nice visual, it doesn't seem to accurately portray the paper. Isn't the grid supposed to have a weird alignment that depends on the bit depth? And there's supposed to be a second quantization step involving the residual.


Fair point. I've updated the animation to address this. The grid now uses the correct non-uniform centroids (optimal for the arcsine distribution in 2D), so you'll see grid lines cluster near the edges where unit-circle coordinates actually concentrate, rather than being evenly spaced. The spacing does change with bit depth.

On the second quantization step: the paper's inner-product variant uses (b-1) bits for the MSE quantizer shown here, then applies a 1-bit QJL (Quantized Johnson-Lindenstrauss) encoding of the residual to make dot-product estimates unbiased. I chose to omit QJL from the animation to keep it digestible as a visual, but I've added a note calling this out explicitly.
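For readers who want the shape of that two-step scheme in code, here's a toy sketch (the uniform grid and the variable names are my simplifications; the real MSE quantizer uses distribution-optimal centroids, and QJL has its own guarantees):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 8, 4

# Step 1: a (b-1)-bit per-coordinate MSE quantizer. A uniform grid stands
# in for the optimal centroids to keep the sketch short.
levels = np.linspace(-1.0, 1.0, 2 ** (b - 1))

def quantize(v):
    return levels[np.argmin(np.abs(v[:, None] - levels[None, :]), axis=1)]

x = rng.normal(size=d)
x /= np.linalg.norm(x)        # unit vector, as the paper assumes

x_hat = quantize(x)
r = x - x_hat                 # residual left over by the MSE step

# Step 2, QJL-flavored: keep only the signs of a random projection of the
# residual (1 bit each) plus its norm, to debias inner-product estimates
S = rng.normal(size=(d, d))   # hypothetical sketch matrix
sign_bits = np.sign(S @ r)
scale = np.linalg.norm(r)
```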


It looks nice! Fair enough about QJL - it seems to be nothing more than an unbiasing measure anyway.

I'm not sure if it's my own misunderstanding or if the paper [0] has something of an error. Section 3.1 starts out to the effect "let x be on the unit hypersphere" (but I'm fairly certain it's actually not). Neither algorithm 1 nor algorithm 2 show a normalization step prior to rotating x. Algorithm 2 line 8 shows that the scalar returned is actually the magnitude of the residual without accounting for QJL.

Anyway I'm pretty sure the authors inadvertently omitted that detail which really had me confused for a while there.

[0] https://arxiv.org/abs/2504.19874


IIUC, the paper's notation S^(d-1) means the unit sphere in R^d (e.g., the familiar unit circle is S^1 living in R^2). So I think x in the algorithm is already a unit vector.

Reference, Section 2 (Preliminaries): "We use the notation S^(d−1) to denote the hypersphere in R^d of radius 1."

Section 3.1: "Let x ∈ S^(d−1) be a (worst-case) vector on the unit sphere in dimension d."


Right, but in reality, IIUC, w ∈ R^d and it's x = w / ||w|| ∈ S^(d-1), and then, given r = x - Qmse^-1( Qmse( x ) ), the scalar you use is derived as ||r|| (I'm missing a couple of subscript twos there, I think).

I was primarily aiming to confirm my understanding given the author's omission but also the scalar is subtly different than in your linked explanation (although conceptually equivalent).
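That reading can be spelled out in a few lines of numpy (my interpretation of the omitted step, with a hypothetical coarse grid standing in for Qmse):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalize first: w ∈ R^d, then x = w / ||w|| ∈ S^(d-1)
w = rng.normal(size=16)
x = w / np.linalg.norm(w)

# Stand-in for Qmse^-1(Qmse(x)): snap each coordinate to a coarse grid
levels = np.linspace(-1.0, 1.0, 8)
x_hat = levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]

r = x - x_hat                  # residual
scale = np.linalg.norm(r)      # the stored scalar, ||r||_2
```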


I am glad you liked it :) You might like this https://mesuvash.github.io/blog/2026/rl_for_llm/ as well :)


Thanks for the pointers.

From my personal experience, autoencoders are amazing for dense input (images, audio, etc.), specifically when the input feature space is not large. However, in many real-world problems such as recommendation and ranking, the feature space is generally very sparse, e.g., clicks or purchases over a large item catalog (say 100M items). In such cases, scaling can be challenging with neural models, especially autoencoders.


>> I think the hashes could take some work. Any suggestions or things that are not clear?

Thanks for your feedback. I shall update the post accordingly.


Nothing but #respect. It's rare to see people give up a fortune for what they consider the right thing (at least from his perspective).


I think he's still making money from the in-game ads in already downloaded copies of Flappy Bird. Still cool. "Independent thinker". I like it.


Yes, especially as he doesn't want to sell it and let someone else deal with the crap. It's a shame that people want to cut down others who are successful.


Btw, if someone wants to drop out of the course, how can they do so?


I recall them saying you don't have to do anything. Just stop participating and they'll know you're inactive, or something like that.


Good to know, because I leaped before I looked: never had AP-level physics and barely any calculus, but I'm 1/3 of the way through an electronics cert.


I had the same question; looks like the parent post is correct. I found the FAQ here:

http://mitx.mit.edu/6002x-faq.html

"How do I drop the course?

You do not have to do anything. You can simply stop working on the course at any time you choose to do so.

What happens if I drop the course?

For the prototype course, learners achieving grades of "A," "B," or "C" will receive an electronic Certificate of completion with the learner's name and grade on it. If you receive a grade below a "C" or do not complete the course, you will not receive a Certificate and no grade record attaching your name to your participation in the class will be disclosed outside of MITx. You can also choose to opt for a no record at any time. However, the posts you make while enrolled in the class will remain visible."


Thanks for the info :)


Awesome. The MITx platform is superior to any other online learning platform I have ever seen. Very well done. Congrats.


