Those Large Language Models are just stochastic parrots! They can only predict the next token, that's it!
Noooooo! They model the world!
Look at the game of Othello, for example!
Research has shown that LLMs trained on Othello move sequences can learn to play quite well!
And what's most interesting, we can decode the board state from their hidden representations!
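That decoding is typically done with a linear probe: a simple classifier trained to read a board square's state off frozen activations. Here is a minimal numpy sketch of the idea, with synthetic "activations" standing in for a real model's hidden states (the dimensions and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for model activations: each "hidden state" is a
# 32-dim vector that linearly encodes whether one board square is occupied.
d_model, n_samples = 32, 2000
direction = rng.normal(size=d_model)          # the (unknown) feature direction
labels = rng.integers(0, 2, size=n_samples)   # 1 = square occupied
hidden = rng.normal(size=(n_samples, d_model)) + np.outer(labels * 2 - 1, direction)

# A linear probe is just logistic regression on the frozen activations.
w, b = np.zeros(d_model), 0.0
for _ in range(200):                          # plain gradient descent
    p = 1 / (1 + np.exp(-(hidden @ w + b)))   # predicted P(occupied)
    grad = p - labels                         # dLoss/dlogit for cross-entropy
    w -= 0.1 * (hidden.T @ grad) / n_samples
    b -= 0.1 * grad.mean()

acc = ((hidden @ w + b > 0) == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

If the probe reaches high accuracy, the board state is linearly decodable from the hidden states — the evidence behind the "they model the world" claim.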
Okay, I have a riddle for you. See these grey squares? Transformers represent them much worse than the others! What do you think makes them different?
Here is why: these squares don't affect which moves are legal! Transformers only compute what they need to predict the next token, haha!
Not so fast! They still represent them to some extent!
Transformers internally represent features that are not needed to predict the next token.
Why do they represent them?
Why worse than other features?
Let's look at how information flows in Transformers.
With respect to one position and layer in the residual stream, there are three possible paths for the features to affect the predictions: direct, pre-cached, and shared. Click on the tick boxes to learn about them.
Information flows straight up from the current token to predict the immediate next token.
Information from the current token flows diagonally through attention. At an intermediate layer, the model "pre-caches" features that might be useful at future positions.
So: Why do the "useless" features emerge?
The features that are useful for predicting the next token receive gradient signal along all three paths. But the "useless" features only benefit from pre-cached and shared paths — they miss the direct gradient!
This explains why "useless" features are represented better than chance: they still get learning signal through the pre-cached and shared paths!
And this explains why "useless" features are represented worse than "useful" ones: they receive less gradient signal overall, since they miss the direct path, which provides a strong learning signal!
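The link between gradient signal and probe accuracy can be illustrated with a toy residual stream. The write strengths below (1.0 vs 0.25) are made-up stand-ins for "all three paths" vs "pre-cached and shared only"; the point is that a weakly written feature is still decodable above chance, just less reliably:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual stream: each feature is written along a random direction,
# with a strength standing in for how much gradient signal it received.
# The "useful" feature is written strongly, the "useless" one weakly.
n, d = 4000, 16
f_useful = rng.integers(0, 2, n) * 2 - 1      # +/-1 feature values
f_useless = rng.integers(0, 2, n) * 2 - 1
dir_u, dir_v = rng.normal(size=d), rng.normal(size=d)

H = (np.outer(f_useful, dir_u) * 1.0          # strong write: all three paths
     + np.outer(f_useless, dir_v) * 0.25      # weak write: no direct path
     + rng.normal(size=(n, d)) * 2.0)         # residual-stream noise

def probe_acc(H, f):
    # Least-squares linear probe, evaluated on a held-out half of the data.
    Htr, Hte, ftr, fte = H[: n // 2], H[n // 2 :], f[: n // 2], f[n // 2 :]
    w, *_ = np.linalg.lstsq(Htr, ftr, rcond=None)
    return ((Hte @ w > 0) == (fte > 0)).mean()

acc_u, acc_v = probe_acc(H, f_useful), probe_acc(H, f_useless)
print(f"useful: {acc_u:.2f}, useless: {acc_v:.2f}")
```

The weakly written feature lands between chance (0.5) and the strongly written one — better than chance, worse than "useful", just as in the Othello probes.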
What about real LLMs?
We can rank the features in Gemma 2 based on how much they benefitted from direct vs pre-cached paths. Features on the left are more "direct" (they help predict the immediate next token), while features on the right are more "pre-cached" (they help predict future tokens).
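The ranking itself can be sketched as follows. The per-feature effect scores here are random placeholders; in the real analysis they would come from measuring how much each feature contributed along the direct vs pre-cached paths in Gemma 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-feature benefit scores: how much each feature helped
# the loss on the immediate next token ("direct") vs. on later tokens
# ("pre-cached").  Random stand-ins, not real measurements.
n_features = 10
direct = rng.random(n_features)
precached = rng.random(n_features)

# Score each feature by the share of its benefit that is pre-cached,
# then rank: left end = most "direct", right end = most "pre-cached".
score = precached / (direct + precached)
ranking = np.argsort(score)
print(ranking)
```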
It turns out that highly pre-cached features are often related to coding!
Hover over the bars to explore what kinds of features live at each end of the spectrum!
Interested? Have a look at our paper!