
> TBF there is no good explanation why it works

My mental justification for attention has always been that the output of the transformer is a sequence of new token vectors such that each individual output token vector incorporates contextual information from the surrounding input token vectors. I know it's incomplete, but it's better than nothing at all.
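That reading can be sketched numerically. Below is a minimal single-head self-attention pass in plain NumPy; the sizes and random matrices are illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # 4 input tokens, embedding size 8

X = rng.normal(size=(n, d))  # input token vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)  # how strongly each token attends to each other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1

out = weights @ V  # each output row is a context-weighted mix of all value vectors
assert out.shape == (n, d)
```

Each row of `out` is a blend of value vectors from every position, which matches the "each output token incorporates contextual information from the surrounding input tokens" picture.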



> TBF there is no good explanation why it works

I thought the general consensus was: "transformers allow neural networks to have adaptive weights".

As opposed to previous architectures, where every edge connecting two neurons always has the same weight.

EDIT: a good video, where it's actually explained better: https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay
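The "adaptive weights" framing can be made concrete with a toy sketch (simplified so that queries and keys are just the input itself; all names here are illustrative): a dense layer applies the same fixed matrix to every input, while attention derives its mixing matrix from the input, so different sequences see different effective weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
W = rng.normal(size=(d, d))  # a fixed dense layer: the same W for every input

x1, x2 = rng.normal(size=d), rng.normal(size=d)
y1, y2 = W @ x1, W @ x2  # identical mapping applied to both inputs

# "Adaptive weights": attention computes its mixing matrix from the input,
# so two different sequences get two different effective weight matrices.
def attn_weights(X):
    scores = X @ X.T / np.sqrt(X.shape[1])  # simplified: Q = K = X
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X1, X2 = rng.normal(size=(3, d)), rng.normal(size=(3, d))
A1, A2 = attn_weights(X1), attn_weights(X2)
assert not np.allclose(A1, A2)  # the effective weights differ per input
```

This is only a cartoon of the idea: real transformers use learned query/key/value projections, but the input-dependence of the attention matrix is the part the "adaptive weights" framing points at.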


You're effectively steering the predictions based on adjacent vectors (and precursors from the prompt). That mental model works fine.



