That's an implementation detail. The behaviour of trained transformer models remains similar even if you quantise them to 4-bit floats, or make every floating-point operation noisy. This model, by contrast, only works if you use double-precision floating point.
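To make the quantisation point concrete, here is a toy sketch of symmetric 4-bit integer quantisation of a weight matrix (real 4-bit schemes such as NF4 use non-uniform levels; the matrix and dimensions here are made up for illustration). The per-weight error stays modest even at 4 bits:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in weight matrix

# symmetric 4-bit integer quantisation: map each weight to one of 16
# levels spanning the observed range (int4 values run -8..7)
scale = np.abs(w).max() / 7
w_q = np.clip(np.round(w / scale), -8, 7)          # stored 4-bit codes
w_hat = w_q * scale                                # dequantised weights

# relative reconstruction error across the whole matrix
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(rel_err)
```

The error is a small fraction of the weight norm, which is why trained networks degrade gracefully under this kind of rounding rather than breaking outright.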
> pattern matches the "idea shape" of words in the "idea space"
It does much more than this. The first layer runs an attention mechanism over all previous tokens and spits out an activation representing some sum of all the relations between those tokens. The next layer then spits out an activation representing relations of relations, and so on up the stack. The LLM is capable of deducing a hierarchy of structural information embedded in the text.
That's unlikely. But they are an awful lot like Turing machines (the K/V cache is roughly analogous to the Turing tape), so their architecture is strongly predisposed to being able to find any algorithm, possibly including reasoning.
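The cache-as-tape analogy can be seen in a toy decode loop (random projections and a simplified query, just to show the shape of the mechanism; the analogy is loose, since a K/V cache is append-only rather than rewritable like a real tape). Each generated token writes one new K/V pair and can read back everything written so far:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))

k_cache, v_cache = [], []   # append-only "tape" of per-token entries

def decode_step(x_t):
    # write: append this token's key/value to the cache
    k_cache.append(x_t @ w_k)
    v_cache.append(x_t @ w_v)
    # read: attend over every cached entry (toy query = raw activation)
    q = x_t
    scores = np.array([q @ k for k in k_cache]) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return sum(wi * vi for wi, vi in zip(w, v_cache))

for t in range(6):
    out = decode_step(rng.normal(size=d))

print(len(k_cache))
```

The cache grows by one entry per step and is readable in full at every step, which is the unbounded-memory flavour the Turing-machine comparison is gesturing at.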