Ask HN: What can we learn about human cognition from the performance of LLMs
11 points by abrax3141 on May 27, 2023 | 3 comments

Some hypotheses (adapted from other posts):

* We have learned that Spreading Activation, when applied through a high-dimensional, non-symbolic network (the network formed by embedding vectors), may be able to account for abstraction in fluent language.

* We have learned that "fluent reasoning" (sometimes called "inline" or "online" reasoning), that is, the shallow reasoning embedded in fluent language, may be more powerful than usually thought.

* We have learned that "talking to yourself" (externally, in the case of GPTs, and potentially also internally in the case of human's "hearing yourself think") is able to successfully maintain enough short-term context to track naturally long chains of argument (via contextually-guided fluent reasoning, as above).

* We have learned that, to some extent, powerful "mental models" that support (again, at least fluent) reasoning can, in effect, be (functionally) represented and used in a highly distributed system.

* We have learned that meta-reasoning (which the LLMs do not do) may be important in augmenting fluent reasoning, and in tracking extended "trains of thought" (and thus extended dialogues).

* We have a new model of confabulation that fits into the fluent language model as implemented by LLMs.

* We have learned that people's "knowledge space" is quite amazing, given that they have ~10x the current LLM parameter count (~10T for the LLM, whereas an individual has potentially ~100T cortical parameters -- depending on what you count, of course), but a given individual only encodes a small number of languages and a small number of domains to any great depth (in addition to the standard operating procedures that almost all people encode). [That is, vs. the LLM encoding the whole damned internet in ~10 different languages.]

What else? (And, of course, it goes w/o saying that you'll argue about the above :-)



Transformer-based LLMs define a theory of time: each token representation has a vector of sincos(wt) values added to it, for a set of frequencies w, after which order is otherwise ignored (self-attention is permutation-invariant). (Each sincos defines 2 elements of the vector: sin(wt) and cos(wt). Use e^iwt if you prefer to think in complex numbers.)

So in "Your heart is stronger than your head", heart and head are 5 words apart, or ~8 tokens. So one gets sincos(w(t+0)), the other gets sincos(w(t+8)). That's the only thing that distinguishes it from the converse sentence, "Your head is stronger than your heart."

Chomsky had a much more symbolic theory of grammar. The fact that ChatGPT can answer questions about the above sentences (try them!) with order defined only by relative timestamps is remarkable.

Interestingly, if you throw in some extra words, like "Bob's head is stronger (and more cromulent) than his heart", it fails to answer questions about which is stronger. Possibly because the extra tokens push the sincos terms it had learned to use for "A is Xer than B" statements all the way around the circle.
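
To make the "wrapped around the circle" idea concrete: at any single frequency w, the encoding is periodic with period 2*pi/w, so shifting a phrase by that many tokens leaves that pair of coordinates unchanged (the 8-token period below is an arbitrary choice for illustration):

    import numpy as np

    w = 2 * np.pi / 8        # a single frequency whose period is 8 tokens
    for t in (0, 8, 3, 11):  # 0 vs. 8 and 3 vs. 11 collide at this frequency
        print(t, np.round([np.sin(w * t), np.cos(w * t)], 3))

In a full positional vector the other frequencies have different periods, so they don't all collide at the same shift.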

It'd be interesting to devise similar tests for people, to see what extraneous parentheticals can confuse them.


This is a great point illustrating the importance of how position is represented in the tokens, though I didn't find that ChatGPT had any problem with the second example you showed with the extra words - it answered correctly.

I'm not sure the cases you reference still work exactly as you describe these days, even though they did in the past (say, 2019). I agree the wraparound effect could occur like that, but the way position is encoded in the vector is usually built to wrap around at different levels (the different frequencies w), and these include padding tokens as non-zero values (implementation dependent). So for every sentence ending at 8 tokens, the model has many other encodings that indicate the loop, not just the one wraparound encoding. That should avoid the problem; the reasoning relates to spans of tokens, and it was a consideration when devising training for earlier transformer models like BART and the MASK/NSP tasks. But in practice, position is learned these days - it doesn't use sin/cosine anymore.

There is a technical difference between positional embeddings and positional encodings (and the trend since 2020 has been to learn position), which is quite interesting.
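
For what that difference looks like in code, here's a rough PyTorch-style sketch contrasting a fixed sinusoidal table with a learned position-embedding table (d_model, max_len, and the dummy batch are placeholder values):

    import torch
    import torch.nn as nn

    d_model, max_len = 64, 512

    # Fixed sinusoidal encodings (original Transformer): computed once, never trained.
    pos = torch.arange(max_len).unsqueeze(1).float()                      # (max_len, 1)
    w = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))  # (d_model/2,)
    sin_table = torch.zeros(max_len, d_model)
    sin_table[:, 0::2] = torch.sin(pos * w)
    sin_table[:, 1::2] = torch.cos(pos * w)

    # Learned positional embeddings (GPT/BERT style): a trainable lookup table.
    learned_table = nn.Embedding(max_len, d_model)

    token_emb = torch.randn(1, 10, d_model)  # dummy embeddings for a 10-token sentence
    positions = torch.arange(10)

    x_fixed = token_emb + sin_table[positions]        # add fixed encodings
    x_learned = token_emb + learned_table(positions)  # add learned embeddings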


Surfing Uncertainty makes a case that the whole damn brain seems to work on a predictive model. I found it convincing, but I'm very much a layman so that's all I can say with half confidence.



