Right, so the 64 Probes look at Othello-GPT's internals and are trained on the known board-state-to-Othello-GPT-internals data. The article says:
It turns out that the error rates of these probes are reduced from 26.2% on a randomly-initialized Othello-GPT to only 1.7% on a trained Othello-GPT. This suggests that there exists a world model in the internal representation of a trained Othello-GPT.
I take that to mean that the 64 trained Probes are then shown other Othello-GPT internals and can tell us what the state of their particular 'square' is 98.3% of the time. (We know what the board would look like, but the probes don't.)
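That probing logic can be sketched on synthetic data. This is not the paper's code; the dimensions, noise level, and the assumption that square state is linearly encoded are all illustrative. The point is the comparison: a probe trained on activations that genuinely encode the square does far better than one trained on unstructured activations, which is the 26.2% vs 1.7% contrast in miniature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 512-dim "residual stream" activations. In the
# structured case, one square's state (empty/black/white) is linearly
# encoded along three fixed directions plus noise -- standing in for a
# trained Othello-GPT. In the unstructured case the activations are
# pure noise -- standing in for a randomly-initialized model.
D, N, CLASSES = 512, 3000, 3
labels = rng.integers(0, CLASSES, size=N)
directions = rng.normal(size=(CLASSES, D))

structured = directions[labels] + 0.5 * rng.normal(size=(N, D))
unstructured = rng.normal(size=(N, D))

def train_probe(X, y, steps=300, lr=0.1):
    """Multinomial logistic-regression probe (the simplest probe family)."""
    W = np.zeros((X.shape[1], CLASSES))
    onehot = np.eye(CLASSES)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - onehot) / len(y)
    return W

def accuracy(W, X, y):
    return float(((X @ W).argmax(axis=1) == y).mean())

# Train on 2000 examples, evaluate the probe on 1000 held-out ones.
tr, te = slice(0, 2000), slice(2000, None)
acc_structured = accuracy(train_probe(structured[tr], labels[tr]),
                          structured[te], labels[te])
acc_unstructured = accuracy(train_probe(unstructured[tr], labels[tr]),
                            unstructured[te], labels[te])
print(acc_structured, acc_unstructured)
```

On the structured activations the held-out accuracy lands near 1.0, while on the noise it stays near chance (about 1/3), mirroring the error-rate gap the article reports.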
As you say "Again, not a practitioner but once you are indirecting internal state through a 2 layer MLP it gets less obvious to me that the world model is really there."
But then they go back and actually mess around with Othello-GPT's internal state (using the Probes to work out how), changing black counters to white and so on, and this directly affects the next move Othello-GPT makes. They even do this for impossible board states (e.g. two unlinked sets of discs), and Othello-GPT still comes up with correct next moves.
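The intervention step can be sketched the same way. Again a toy, not the paper's procedure (they locate the board representation via the probes and edit the activations accordingly); here the "edit" is just moving the activation along assumed class directions, and the "move head" is a hypothetical linear readout:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 512

# Assumed encoding: a square's state lies along one of three directions.
directions = rng.normal(size=(3, D))  # 0 = empty, 1 = black, 2 = white

def probe_read(v):
    """Nearest-direction readout, standing in for a trained probe."""
    return int(np.argmax(directions @ v))

def intervene(v, old, new):
    """Edit the activation so the probe now reads class `new`."""
    return v - directions[old] + directions[new]

x = directions[1] + 0.1 * rng.normal(size=D)  # square currently "black"
x_flipped = intervene(x, old=1, new=2)        # flipped to "white"

# A hypothetical downstream "move head" reads the same activation, so
# the edit propagates to the predicted next move.
W_move = rng.normal(size=(D, 64)) / np.sqrt(D)
move_before = int(np.argmax(x @ W_move))
move_after = int(np.argmax(x_flipped @ W_move))
print(probe_read(x), probe_read(x_flipped), move_before, move_after)
```

The probe now reads "white" instead of "black", and anything downstream that consumes the same activation sees the edited board, which is the shape of the causal argument: change the representation, and the behaviour changes with it.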
So surely this proves that the Probes were actually pointing to an internal model? Because when you use them to alter that model in a way that should affect the next move, it changes Othello-GPT's behaviour in the expected way.