Log:
- We met 4-8:30 pm and read through Neel Nanda’s replication.
- Key insight: The encoder weights act as feature detectors (they compute each feature’s activation), while the decoder weights store the feature directions themselves (the vectors used for reconstruction). See the SAE sketch after this list.
- Key insight: Since d_mlp = 4 * d_model, the projection W_out has rank at most d_model, so 75% of the MLP hidden space lies in its null space. Maybe it’s better/more efficient to train on the post-W_out activations (the MLP output) instead. See the null-space check after this list.
- Key insight: High kurtosis = heavy tails in the decoder weights, which indicates a privileged basis: a Gaussian is rotation-invariant, so non-Gaussian (heavy-tailed) weights along the basis directions mean the basis is not arbitrary. See the kurtosis sketch after this list.
- Question: What is going on with Neel’s low-frequency features?? They point in the same direction across seeds!
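
To make the encoder/decoder distinction concrete, here is a minimal SAE sketch in the spirit of Neel’s 1L-Sparse-Autoencoder repo (the shapes, init, and L1 coefficient below are placeholder choices of ours, not values from his code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: the encoder detects features, the decoder stores directions."""

    def __init__(self, d_act: int, d_hidden: int, l1_coeff: float = 3e-4):
        super().__init__()
        # Encoder: each column of W_enc is a "detector" that scores how strongly
        # a feature is present in the input activation vector.
        self.W_enc = nn.Parameter(torch.empty(d_act, d_hidden))
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        # Decoder: each ROW of W_dec is a feature direction itself -- the vector
        # added back into activation space when that feature fires.
        self.W_dec = nn.Parameter(torch.empty(d_hidden, d_act))
        self.b_dec = nn.Parameter(torch.zeros(d_act))
        nn.init.kaiming_uniform_(self.W_enc)
        nn.init.kaiming_uniform_(self.W_dec)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        # x: (batch, d_act) activations taken from the model.
        feature_acts = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        x_recon = feature_acts @ self.W_dec + self.b_dec
        recon_loss = (x_recon - x).pow(2).mean()
        sparsity_loss = self.l1_coeff * feature_acts.abs().sum(dim=-1).mean()
        return x_recon, feature_acts, recon_loss + sparsity_loss
```

The dictionary of feature directions lives entirely in W_dec; the encoder only decides how strongly each direction fired.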
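The 75% figure follows from the shapes: with d_mlp = 4 * d_model, W_out (d_mlp x d_model) has rank at most d_model, so at least three quarters of the hidden space maps to zero. A quick null-space check, assuming TransformerLens (model name and layer index are placeholders):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")  # placeholder model choice

# W_out maps the MLP hidden space (d_mlp) down to the residual stream (d_model).
W_out = model.blocks[0].mlp.W_out  # shape: (d_mlp, d_model)
d_mlp, d_model = W_out.shape

rank = torch.linalg.matrix_rank(W_out.float())
null_frac = 1 - rank.item() / d_mlp
print(f"d_mlp={d_mlp}, d_model={d_model}, rank={rank.item()}, "
      f"fraction of hidden space in the null space ~= {null_frac:.2f}")  # expect ~0.75
```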
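And the kurtosis point as a sketch: an isotropic Gaussian looks the same in every basis, so Gaussian-looking decoder weights would mean no privileged basis; heavy tails (high excess kurtosis) mean the weight mass is concentrated on a few basis directions. The toy tensors below stand in for rows of W_dec:

```python
import torch

def excess_kurtosis(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Excess kurtosis along `dim`; ~0 for a Gaussian, large for heavy tails."""
    mu = x.mean(dim=dim, keepdim=True)
    var = x.var(dim=dim, unbiased=False, keepdim=True)
    fourth = ((x - mu) ** 4).mean(dim=dim, keepdim=True)
    return (fourth / var ** 2).squeeze(dim) - 3.0

# Toy comparison: rotation-invariant Gaussian directions vs. basis-aligned ones.
gaussian_dirs = torch.randn(1000, 512)                 # no privileged basis
sparse_dirs = torch.zeros(1000, 512)                   # one-hot rows: fully basis-aligned
sparse_dirs[torch.arange(1000), torch.randint(0, 512, (1000,))] = 1.0

print(excess_kurtosis(gaussian_dirs).mean())  # ~0
print(excess_kurtosis(sparse_dirs).mean())    # large and positive
```

In practice we’d run something like excess_kurtosis(sae.W_dec, dim=-1) and look at the distribution over features.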
Next todos:
- Read various papers (on own time). Also read about RLHF.
https://github.com/neelnanda-io/1L-Sparse-Autoencoder
- Use Neel’s scrappy code to train a sparse autoencoder for a Pythia-70m model. This should be short! Let’s train it on the post-W_out activations (the MLP output). Then, put the autoencoder on Hugging Face, and load it in Neel’s colab!! And interpret it, looking for chess features. (Rough outline in the sketch after this list.)
- We have Thursday, Friday, maybe Saturday, and maybe Monday (Naomi only).
- Train on a 6-billion-parameter model next (on Friday?). This might require getting Pro+.
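
A rough outline of that training todo, assuming TransformerLens and the SparseAutoencoder sketch from the log above; the layer index, expansion factor, dataset, and repo id are placeholders, and Neel’s actual training loop (lr schedule, neuron resampling, etc.) does more than this:

```python
import torch
from transformer_lens import HookedTransformer
from huggingface_hub import HfApi

model = HookedTransformer.from_pretrained("pythia-70m", device="cpu")  # cpu to keep the sketch simple
d_model = model.cfg.d_model

# SparseAutoencoder is the class from the sketch earlier in this log.
sae = SparseAutoencoder(d_act=d_model, d_hidden=8 * d_model)  # expansion factor 8 is a guess
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

texts = ["placeholder text; use a real token stream, e.g. the Pile or chess PGNs"]
for step, text in enumerate(texts):
    tokens = model.to_tokens(text)
    # hook_mlp_out is the MLP output, i.e. the post-W_out activations.
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens, names_filter="blocks.3.hook_mlp_out")
    acts = cache["blocks.3.hook_mlp_out"].reshape(-1, d_model)  # layer 3 is a placeholder

    _, _, loss = sae(acts)
    loss.backward()
    opt.step()
    opt.zero_grad()

# Save and upload so it can be loaded in Neel's colab.
# Assumes you're logged in (huggingface-cli login) and the repo already exists.
torch.save(sae.state_dict(), "sae_pythia70m_mlp_out.pt")
HfApi().upload_file(
    path_or_fileobj="sae_pythia70m_mlp_out.pt",
    path_in_repo="sae_pythia70m_mlp_out.pt",
    repo_id="your-username/pythia-70m-sae",  # placeholder repo id
)
```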
Naomi’s plans:
- Explain why we’re working on sparse autoencoders + RLHF: we want to make RLHF interpretable. Explain what our ideas are and why we’re pursuing them.
- Explain the progress we’ve made so far, and why we’re working on it.
- Have opinions about the paper.
- Be prepared to talk about things I’ve done for HAIST. Give specific details, but keep it to 1.5 minutes; then tell a different story for another 1.5 minutes. High-density interview!