Ask HN: What can we learn about human cognition from the performance of LLMs
11 points by abrax3141 on May 27, 2023 | 3 comments

Some hypotheses (adapted from other posts):

* We have learned that Spreading Activation, when applied through a high-dimensional, non-symbolic network (the network formed by embedding vectors), may be able to account for abstraction in fluent language.

* We have learned that "fluent reasoning" (sometimes called "inline" or "online" reasoning), that is, the shallow reasoning embedded in fluent language, may be more powerful than usually thought.

* We have learned that "talking to yourself" (externally, in the case of GPTs, and potentially also internally in the case of human's "hearing yourself think") is able to successfully maintain enough short-term context to track naturally long chains of argument (via contextually-guided fluent reasoning, as above).

* We have learned that, to some extent, powerful "mental models" that support (again, at least fluent) reasoning can, in effect, be (functionally) represented and used in a highly distributed system.

* We have learned that meta-reasoning (which the LLMs do not do) may be important in augmenting fluent reasoning, and in tracking extended "trains of thought" (and thus extended dialogues).

* We have a new model of confabulation that fits into the fluent language model as implemented by LLMs.

* We have learned that people's "knowledge space" is quite amazing, given that they have ~10x the current LLM parameter count (~10T for the LLM, whereas an individual has potentially ~100T cortical parameters -- depending on what you count, of course), but a given individual only encodes a small number of languages and a small number of domains to any great depth (in addition to the standard operating procedures that almost all people encode). [That is, vs. the LLM encoding the whole damned internet in ~10 different languages.]

What else? (And, of course, it goes w/o saying that you'll argue about the above :-)



Transformer-based LLMs define a theory of time: each token representation has a vector of sincos(wt) values added to it, for a set of frequencies w, after which order is otherwise ignored (self-attention is permutation-invariant). (Each sincos defines 2 elements of the vector: sin(wt) and cos(wt). Use e^iwt if you prefer to think in complex numbers.)

So in "Your heart is stronger than your head", heart and head are 5 words apart, or ~8 tokens. So one gets sincos(w(t+0)), the other gets sincos(w(t+8)). That's the only thing that distinguishes it from the converse sentence, "Your head is stronger than your heart."

Chomsky had a much more symbolic theory of grammar. The fact that ChatGPT can answer questions about the above sentences (try them!) with order defined only by relative timestamps is remarkable.

Interestingly, if you throw in some extra words, like "Bob's head is stronger (and more cromulent) than his heart", it fails to answer questions about which is stronger. Possibly because the extra tokens push the sincos terms it had learned to use for "A is Xer than B" statements all the way around the circle.
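
To make the "wrapped around the circle" idea concrete: at any single frequency w, the encoding is periodic with period 2*pi/w, so shifting a phrase by that many tokens leaves that pair of coordinates unchanged (the 8-token period below is an arbitrary choice for illustration):

    import numpy as np

    w = 2 * np.pi / 8        # a single frequency whose period is 8 tokens
    for t in (0, 8, 3, 11):  # 0 vs. 8 and 3 vs. 11 collide at this frequency
        print(t, np.round([np.sin(w * t), np.cos(w * t)], 3))

In a full positional vector the other frequencies have different periods, so they don't all collide at the same shift.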

It'd be interesting to devise similar tests for people, to see what extraneous parentheticals can confuse them.


This is a great point illustrating the importance of how position is represented in the tokens, though I didn't find that ChatGPT had any problem with the second example you showed with the extra words - it answered correctly.

I'm not sure the cases you reference still work exactly as you describe these days, even though they did in the past (say, 2019). I agree the wraparound effect could occur like that, but the way position is encoded in the vector is usually built to wrap around at different levels (the different frequencies w), and these include padding tokens as non-zero values (implementation dependent). So for every sentence ending at 8 tokens, the model has many other encodings that indicate the loop, not just the one wraparound encoding. That should avoid the problem; the reasoning relates to spans of tokens, and it was a consideration when devising training for earlier transformer models like BART and the MASK/NSP tasks. But in practice, position is learned these days - it doesn't use sin/cosine anymore.

There is a technical difference between positional embeddings and positional encodings (and the trend since 2020 has been to learn position), which is quite interesting.
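
For what that difference looks like in code, here's a rough PyTorch-style sketch contrasting a fixed sinusoidal table with a learned position-embedding table (d_model, max_len, and the dummy batch are placeholder values):

    import torch
    import torch.nn as nn

    d_model, max_len = 64, 512

    # Fixed sinusoidal encodings (original Transformer): computed once, never trained.
    pos = torch.arange(max_len).unsqueeze(1).float()                      # (max_len, 1)
    w = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))  # (d_model/2,)
    sin_table = torch.zeros(max_len, d_model)
    sin_table[:, 0::2] = torch.sin(pos * w)
    sin_table[:, 1::2] = torch.cos(pos * w)

    # Learned positional embeddings (GPT/BERT style): a trainable lookup table.
    learned_table = nn.Embedding(max_len, d_model)

    token_emb = torch.randn(1, 10, d_model)  # dummy embeddings for a 10-token sentence
    positions = torch.arange(10)

    x_fixed = token_emb + sin_table[positions]        # add fixed encodings
    x_learned = token_emb + learned_table(positions)  # add learned embeddings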


Surfing Uncertainty makes a case that the whole damn brain seems to work on a predictive model. I found it convincing, but I'm very much a layman so that's all I can say with half confidence.



