- Read Logan’s paper in full
- Read the Cunningham et al. paper cited in the Anthropic paper
- Read “Finding Neurons in a Haystack” by Wes Gurnee
- Think about the differences between the residual stream and the MLP, “mathematically speaking”
- Think about what synthetic dataset would be good to construct (chess!)
Notes with Logan, Naomi, Laker (10/20/2023)
Directions Logan is interested in:
- Interpreting reward models that are learned by RLHF
- How RLHF works: train a reward model by learning a [dim_model x 1] vector that classifies text as “good” or “bad”. Then use reinforcement learning to fine-tune the model based on the reward model. Look up the original GPT-3 paper for its RLHF discussion.
- Ablate features found by the sparse autoencoder and see how much the reward model responds (e.g., ablating a curse-word feature makes the reward go from -10 to 0 ... so the reward model really hates curse words!)
- This is the one Logan is actively pursuing and has mostly finished, so don’t work on it.
- Could do RLHF training where the only trainable parameters are scalar multiples of the autoencoder features... that would be very interpretable, and then we could compare it against regular RLHF on evaluation benchmarks (see the parameterization sketch after this block).
- A confounding variable is that sparse autoencoders don’t perfectly reconstruct the activations.
- One way to get around this... instead of ablating a feature and using the autoencoder’s output directly, compute the difference vector that ablating the feature produces in the autoencoder’s reconstruction, then subtract that difference vector from the original activation vector to “start from what is correct” (see the ablation sketch after this block).
- Steps: (1) train a sparse autoencoder to learn feature directions, (2) run PPO where the special per-feature scales are the only learnable parameters [running evals could be a lot of work]. It would be huge if this interpretable RLHF works just as well as regular RLHF!
- Should we use a model like Pythia 70M that already has a reward model trained on it? Logan: we could train our own reward model, but that takes a lot of compute. A reward model was only released for the 6B Pythia; we could possibly train our own for 70M. Logan thinks a 6B-parameter model fits in 20-30 GB, so we’d want 48 GB GPUs, for example. We might want to load the 6B reward model in reduced/mixed precision; look for standard practices on Hugging Face for “loading models in lower precision” (see the loading sketch after this block).
- Resource: the goose lover (see below)
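Ablation sketch: a minimal version of the feature-ablation experiment with the difference-vector correction described above. All names here are hypothetical: `sae.encode`/`sae.decode` stand in for whatever interface the trained autoencoder exposes, and `reward_head` is a toy [dim_model x 1] linear head; in practice the patched activation would be fed back through the rest of the model and scored by the real reward model.

```python
import torch

def ablate_feature(sae, reward_head, resid_act, feature_idx):
    """resid_act: [batch, d_model] residual-stream activations at the hook point."""
    # Encode into the sparse feature basis and zero out one feature.
    feats = sae.encode(resid_act)                  # [batch, n_features]
    feats_ablated = feats.clone()
    feats_ablated[:, feature_idx] = 0.0

    # Difference vector: what the autoencoder says removing the feature changes.
    diff = sae.decode(feats) - sae.decode(feats_ablated)   # [batch, d_model]

    # Subtract the difference from the *original* activation, so the
    # autoencoder's reconstruction error does not confound the ablation.
    patched_act = resid_act - diff

    # Toy reward head (a [d_model x 1] linear map); compare scores before/after.
    reward_before = reward_head(resid_act).squeeze(-1)
    reward_after = reward_head(patched_act).squeeze(-1)
    return reward_before, reward_after
```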
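Parameterization sketch: one way the interpretable-RLHF idea could look, assuming the autoencoder is an `nn.Module` with `encode`/`decode`. The only trainable parameters are one scalar per feature, and the same difference-vector trick keeps reconstruction error out of what PPO optimizes.

```python
import torch
import torch.nn as nn

class FeatureScaler(nn.Module):
    """Rescales SAE features in the residual stream; only `scales` is trainable."""
    def __init__(self, sae, n_features):
        super().__init__()
        self.sae = sae
        for p in self.sae.parameters():        # freeze the autoencoder itself
            p.requires_grad_(False)
        self.scales = nn.Parameter(torch.ones(n_features))  # identity at init

    def forward(self, resid_act):
        feats = self.sae.encode(resid_act)                       # [batch, n_features]
        diff = self.sae.decode(feats) - self.sae.decode(feats * self.scales)
        return resid_act - diff                                  # apply only the change
```

During PPO (e.g., with TRLX) this module would sit at the chosen layer via a forward hook, and `scales` would be the only parameters handed to the optimizer, so each learned scale reads directly as “turn this feature up or down.”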
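Loading sketch: the standard Hugging Face pattern for loading a large model in reduced precision. The checkpoint id is a placeholder, not a real reward-model name, and wrapping the reward head as a sequence-classification model is just one common choice.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REWARD_MODEL_NAME = "path/to/6b-reward-model"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL_NAME,
    torch_dtype=torch.float16,   # half precision: ~12 GB of weights for 6B params
    device_map="auto",           # spread across available GPUs (requires accelerate)
)
reward_model.eval()
```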
- Find a metric for monosemanticity
- How many features do you want per data point? 10? 100? A percentage of the residual-stream dimension? And how long to train, in tokens? It seems like longer is better.
- One metric would be whether a feature only activates on a given token (e.g., “the”), known as a token-level feature. Logan tried this and got within the bounds of reason for sparsity, but with lots of contamination from polysemantic, non-token-level features. We could try Wes Gurnee’s more complex features (regex, French, etc.). Multi-token features include “words after an opening parenthesis,” for example. (A metric sketch follows this block.)
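Metric sketch: one possible token-level-feature score (not necessarily what Logan used): the fraction of a feature’s above-threshold activations that land on its single most common token. Near 1.0 suggests a token-level feature; lower values suggest polysemantic contamination.

```python
import torch

def token_level_score(feature_acts, token_ids, threshold=0.0):
    """feature_acts: [n_tokens] activations of one feature over a dataset;
    token_ids: [n_tokens] the token id at each position."""
    firing = feature_acts > threshold
    if firing.sum() == 0:
        return 0.0
    fired_tokens = token_ids[firing]
    counts = torch.bincount(fired_tokens)             # how often each token triggers the feature
    return (counts.max().float() / firing.sum()).item()
```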
- Circuits in features for chess models
- Pythia 70M has been fine-tuned to play chess... that could be a good toy model. We could train a sparse autoencoder on it and see how features relate to each other across different layers. “Features are just features, but how features combine is circuits.”
- Example: train a sparse autoencoder AE1 on the residual stream pre-MLP, train another one, AE2, right after the MLP, and try to relate the features, i.e., relate features at different points in the residual stream (see the sketch after this block).
- Chess is good because it’ll go viral and get people interested in sparse autoencoders. A chess LLM. The starting point for the class project is just finding features... that’s enough to get a good grade... the stretch goal is relating features to find chess circuits. That’s what Logan is working on, and we could collaborate.
- Last crazy idea: do sparse autoencoders on the chess model, but turn it all into code! In a sparse autoencoder, a feature is a function f: text → ℝ; its activation tells you how much to scale along that feature’s direction.
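Cross-layer sketch: one simple way to relate AE1 and AE2 features, assuming both expose `encode`: correlate their activations over the same tokens. Strongly correlated (feature i, feature j) pairs are candidate edges in a feature-level circuit.

```python
import torch

def feature_correlation(sae1, sae2, acts_pre_mlp, acts_post_mlp):
    """acts_*: [n_tokens, d_model] activations at the two hook points (pre- and post-MLP)."""
    f1 = sae1.encode(acts_pre_mlp)     # [n_tokens, n_feat1]
    f2 = sae2.encode(acts_post_mlp)    # [n_tokens, n_feat2]
    # Standardize each feature, then take the empirical correlation matrix.
    f1 = (f1 - f1.mean(0)) / (f1.std(0) + 1e-6)
    f2 = (f2 - f2.mean(0)) / (f2.std(0) + 1e-6)
    return f1.T @ f2 / f1.shape[0]     # [n_feat1, n_feat2]
```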
Advice/Resources:
- Pythia 70M could train in 10 minutes on one beefy Colab GPU
- Use an existing model? Yes! New models are extra work. If we want to make our own model, we could do a smaller synthetic one.
- Louis (“gooselover” on Eleuther Discord) is great at RLHF, everyone goes to him... if we need any help, talk to him!
- For our RLHF, we would probably end up using TRLX.
Next Steps:
- The two most promising directions are PPO with the special trainable parameters and chess!
- Practice loading Pythia 70M and the chess fine-tuned Pythia 70M, and running a forward pass.
- Practice loading Logan’s sparse autoencoder for Pythia 70M, running it, and looking at the features (see the sketch below).
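Starter sketch for those practice steps: load Pythia 70M from the Hugging Face Hub, run a forward pass with hidden states, and hook in an autoencoder. The chess fine-tune’s checkpoint id and the autoencoder path are placeholders here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-70m"   # swap in the chess fine-tune's id once we have it

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

inputs = tokenizer("1. e4 e5 2. Nf3 Nc6", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

print(out.logits.shape)                # [batch, seq, vocab]
resid = out.hidden_states[3]           # residual stream after layer 3: [batch, seq, d_model]

# Placeholder path for Logan's autoencoder; `encode` is an assumed interface.
# sae = torch.load("path/to/logans_sae_layer3.pt")
# feats = sae.encode(resid.reshape(-1, resid.shape[-1]))   # inspect which features fire
```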