
AI Sceptic: Those Large Language Models are just stochastic parrots! They can only predict the next token, that's it!

AI Optimist: Noooooo! They model the world!

scroll down

How to play:

  • Black moves first
  • Place a piece to capture opponent pieces
  • Pieces are captured when trapped between your pieces
  • Legal moves are highlighted in green
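The capture rule above can be sketched as a legality check. This is a minimal toy (board encoding and function name are ours, not the site's actual game logic):

```python
# Minimal Othello legality sketch: a move is legal iff, in some direction,
# it traps a contiguous run of opponent pieces between the new piece and
# an existing friendly piece.
EMPTY, BLACK, WHITE = ".", "B", "W"

def is_legal(board, row, col, player):
    if board[row][col] != EMPTY:
        return False
    opponent = WHITE if player == BLACK else BLACK
    n = len(board)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == dc == 0:
                continue
            r, c = row + dr, col + dc
            seen_opponent = False
            # Walk over a run of opponent pieces in this direction.
            while 0 <= r < n and 0 <= c < n and board[r][c] == opponent:
                seen_opponent = True
                r += dr
                c += dc
            # Legal if the run ends on one of our own pieces.
            if seen_opponent and 0 <= r < n and 0 <= c < n and board[r][c] == player:
                return True
    return False

# 4x4 toy board: Black can capture the White piece at (1, 2) by playing (1, 3).
board = [list(row) for row in [
    "....",
    ".BW.",
    "....",
    "....",
]]
print(is_legal(board, 1, 3, BLACK))   # True
print(is_legal(board, 0, 0, BLACK))   # False
```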

Look at the game of Othello, for example!


Research has shown that LLMs trained to play Othello learn to do it quite well!

And, most interestingly, we can decode the board state from their hidden representations!
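To make "decode the board state" concrete, here is a minimal sketch of a linear probe. The hidden states and the encoding direction below are fabricated for illustration; the real probes are trained on the model's actual activations:

```python
import numpy as np

# Synthetic stand-in for real activations: we fabricate hidden states and
# pretend one board square's colour is a linear function of them. d_model,
# the "true direction", and the data are all made up for this sketch.
rng = np.random.default_rng(0)
d_model, n_samples = 64, 1000
true_dir = rng.normal(size=d_model)          # pretend encoding direction
H = rng.normal(size=(n_samples, d_model))    # pretend hidden states
y = (H @ true_dir > 0).astype(float)         # "is this square black?"

# Linear probe fitted by least squares on centred labels.
w, *_ = np.linalg.lstsq(H, y - 0.5, rcond=None)
acc = ((H @ w > 0).astype(float) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

Because the labels really are linearly encoded here, the probe recovers them far above chance; the interesting empirical fact is that the same works on a trained Othello model's activations.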


Okay, I have a riddle for you. See these grey squares? Transformers represent them much worse than others! What do you think makes them different?

[Board figure — legend: grey squares marked "poorly represented", the rest "well represented"]

Here is why: these squares don't affect which moves are legal! Transformers only compute what they need to predict the next token, haha!

Don't be so smug! They still represent them to some extent!


Transformers internally represent features that are not needed to predict the next token.

Why do they represent them?

Why worse than other features?

scroll down to learn the answer

Let's look at how information flows in Transformers.

For a given position and layer in the residual stream (highlighted in the diagram), there are three possible paths along which features can affect the predictions: direct, pre-cached, and shared. Click on the tick boxes to learn about them.

Information Paths:

[Diagram: tokens "all you need is" predicting "you need is love"; positions i = 1, 2, 3, 4; indices k = 0, 1, 2]

Direct Path

Information flows straight up from the current token to predict the immediate next token.

Pre-cached Path

Information from the current token flows diagonally through attention. At an intermediate layer, the model "pre-caches" features that might be useful at future positions.

Shared Paths

These are all the paths that don't pass through the considered residual stream position.
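As a toy illustration of how information moves along these paths (a hand-wired sketch, not a trained Transformer), consider a scalar feature written at one position and carried forward by residual connections plus a fixed causal attention pattern:

```python
import numpy as np

# Toy residual stream: 4 positions, 3 layers, scalar "features".
# Attention is a fixed causal-averaging matrix, purely for illustration.
n_pos, n_layers = 4, 3
attn = np.tril(np.ones((n_pos, n_pos))) / np.arange(1, n_pos + 1)[:, None]

stream = np.zeros(n_pos)
stream[0] = 1.0                      # feature enters at position 0, layer 0
for _ in range(n_layers):
    stream = stream + attn @ stream  # residual connection + attention update

# The feature is still readable at its own position (direct path) and has
# also been copied forward to later positions (pre-cached / shared paths).
print(stream)
```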


So: Why do the "useless" features emerge?

The features that are useful for predicting the next token receive gradient signal along all three paths. But the "useless" features only benefit from pre-cached and shared paths — they miss the direct gradient!

[Bar plot: integrated influence along the Direct, Pre-cached, and Shared paths for NTP-useful vs NTP-useless features]
Computed for Othello. See the paper for details!

This explains why "useless" features are represented better than chance: they still get learning signal through the pre-cached and shared paths!

And this explains why "useless" features are represented worse than "useful" ones: they receive less gradient signal overall, since they miss out on the direct path, which is a strong source of learning signal!
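A hand-worked chain-rule calculation on a two-loss toy (our invented numbers, not the paper's actual measurement) makes the asymmetry concrete: a quadratic loss at the current position ("direct") plus one at a future position reached through an attention weight ("pre-cached"):

```python
# All quantities below are invented for illustration.
a = 0.5                       # attention weight to the future position
w_useful = w_useless = 1.0    # how strongly each feature is written

direct = w_useful                       # read out at the current position
future = a * w_useful + a * w_useless   # both features are pre-cached

# total loss = direct**2 + future**2; gradients w.r.t. each weight:
grad_useful = 2 * direct + 2 * future * a   # direct + pre-cached terms
grad_useless = 2 * future * a               # pre-cached term only

print(grad_useful, grad_useless)   # 3.0 1.0
```

The "useless" feature's gradient is nonzero (it still learns), but strictly smaller than the "useful" feature's, which also collects the direct term.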


What about real LLMs?

We can rank the features in Gemma 2 based on how much they benefitted from direct vs pre-cached paths. Features on the left are more "direct" (they help predict the immediate next token), while features on the right are more "pre-cached" (they help predict future tokens).

Turns out that highly pre-cached features are often related to coding!

Hover over the bars to explore what kinds of features live at each end of the spectrum!
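One way to sketch such a ranking, with hypothetical feature names and made-up influence numbers purely for illustration (not measurements from Gemma 2):

```python
# Invented per-feature influence estimates along the two paths.
features = {
    "predicts-newline-after-colon": {"direct": 0.9, "precached": 0.1},
    "inside-python-code-block": {"direct": 0.2, "precached": 0.8},
}

def precached_score(f):
    # 0 = influence flows entirely through the direct path,
    # 1 = entirely through the pre-cached path.
    return f["precached"] / (f["direct"] + f["precached"])

ranked = sorted(features, key=lambda name: precached_score(features[name]))
print(ranked)   # most "direct" features first, most "pre-cached" last
```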


Interested? Have a look at our paper!

[QR code — Paper: links to the paper on OpenReview]
[QR code — Code: links to the GitHub repository]