Those Large Language Models are just stochastic parrots! They can only predict the next token, that's it!
Noooooo! They model the world!
Look at the game of Othello, for example!
Research has shown that LLMs trained on Othello move sequences can learn to play quite well!
And what's most interesting, we can decode the board state from their hidden representations!
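That decoding is typically done with a linear probe: a simple classifier trained to read a board square's state off frozen activations. Here is a minimal numpy sketch of the idea, with synthetic "activations" standing in for a real model's hidden states (the dimensions and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for model activations: each "hidden state" is a
# 32-dim vector that linearly encodes whether one board square is occupied.
d_model, n_samples = 32, 2000
direction = rng.normal(size=d_model)          # the (unknown) feature direction
labels = rng.integers(0, 2, size=n_samples)   # 1 = square occupied
hidden = rng.normal(size=(n_samples, d_model)) + np.outer(labels * 2 - 1, direction)

# A linear probe is just logistic regression on the frozen activations.
w, b = np.zeros(d_model), 0.0
for _ in range(200):                          # plain gradient descent
    p = 1 / (1 + np.exp(-(hidden @ w + b)))   # predicted P(occupied)
    grad = p - labels                         # dLoss/dlogit for cross-entropy
    w -= 0.1 * (hidden.T @ grad) / n_samples
    b -= 0.1 * grad.mean()

acc = ((hidden @ w + b > 0) == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

If the probe reaches high accuracy, the board state is linearly decodable from the hidden states — the evidence behind the "they model the world" claim.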
Okay, I have a riddle for you. See these grey squares? Transformers represent them much worse than the others! What do you think makes them different?
Here is why: these squares don't affect which moves are legal! Transformers only compute what they need to predict the next token, haha!
Not so fast! They still represent them to some extent!
Transformers internally represent features that are not needed to predict the next token.
Why do they represent them?
Why worse than other features?
Let's look at how information flows in Transformers.
With respect to one position and layer in the residual stream, there are three possible paths for the features to affect the predictions: direct, pre-cached, and shared. Click on the tick boxes to learn about them.
Information flows straight up from the current token to predict the immediate next token.
Information from the current token flows diagonally through attention. At an intermediate layer, the model "pre-caches" features that might be useful at future positions.
So: Why do the "useless" features emerge?
The features that are useful for predicting the next token receive gradient signal along all three paths. But the "useless" features only benefit from pre-cached and shared paths — they miss the direct gradient!
This explains why "useless" features are represented better than chance: they still get learning signal through the pre-cached and shared paths!
And this explains why "useless" features are represented worse than "useful" ones: they receive less gradient signal overall, since they miss the direct path, which provides a strong learning signal!
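The link between gradient signal and probe accuracy can be illustrated with a toy residual stream. The write strengths below (1.0 vs 0.25) are made-up stand-ins for "all three paths" vs "pre-cached and shared only"; the point is that a weakly written feature is still decodable above chance, just less reliably:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual stream: each feature is written along a random direction,
# with a strength standing in for how much gradient signal it received.
# The "useful" feature is written strongly, the "useless" one weakly.
n, d = 4000, 16
f_useful = rng.integers(0, 2, n) * 2 - 1      # +/-1 feature values
f_useless = rng.integers(0, 2, n) * 2 - 1
dir_u, dir_v = rng.normal(size=d), rng.normal(size=d)

H = (np.outer(f_useful, dir_u) * 1.0          # strong write: all three paths
     + np.outer(f_useless, dir_v) * 0.25      # weak write: no direct path
     + rng.normal(size=(n, d)) * 2.0)         # residual-stream noise

def probe_acc(H, f):
    # Least-squares linear probe, evaluated on a held-out half of the data.
    Htr, Hte, ftr, fte = H[: n // 2], H[n // 2 :], f[: n // 2], f[n // 2 :]
    w, *_ = np.linalg.lstsq(Htr, ftr, rcond=None)
    return ((Hte @ w > 0) == (fte > 0)).mean()

acc_u, acc_v = probe_acc(H, f_useful), probe_acc(H, f_useless)
print(f"useful: {acc_u:.2f}, useless: {acc_v:.2f}")
```

The weakly written feature lands between chance (0.5) and the strongly written one — better than chance, worse than "useful", just as in the Othello probes.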
What about real LLMs?
We can rank the features in Gemma 2 based on how much they benefitted from direct vs pre-cached paths. Features on the left are more "direct" (they help predict the immediate next token), while features on the right are more "pre-cached" (they help predict future tokens).
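The ranking itself can be sketched as follows. The per-feature effect scores here are random placeholders; in the real analysis they would come from measuring how much each feature contributed along the direct vs pre-cached paths in Gemma 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-feature benefit scores: how much each feature helped
# the loss on the immediate next token ("direct") vs. on later tokens
# ("pre-cached").  Random stand-ins, not real measurements.
n_features = 10
direct = rng.random(n_features)
precached = rng.random(n_features)

# Score each feature by the share of its benefit that is pre-cached,
# then rank: left end = most "direct", right end = most "pre-cached".
score = precached / (direct + precached)
ranking = np.argsort(score)
print(ranking)
```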
It turns out that highly pre-cached features are often related to coding!
Hover over the bars to explore what kinds of features live at each end of the spectrum!
Interested? Have a look at our paper!