Log:
- We met 4-8:30 pm and read through Neel Nanda’s replication.
- Key insight: The encoder weights act as feature detectors (they compute each feature’s activation), while the decoder weights store the feature directions themselves (the vectors used for reconstruction). See the SAE sketch after this list.
- Key insight: Since d_mlp = 4 * d_model, the projection W_out has rank at most d_model, so 75% of the MLP hidden space lies in its null space. Maybe it’s better/more efficient to train on the post-W_out activations (the MLP output) instead. See the null-space check after this list.
- Key insight: High kurtosis = heavy tails in the decoder weights, which indicates a privileged basis: a Gaussian is rotation-invariant, so non-Gaussian (heavy-tailed) weights along the basis directions mean the basis is not arbitrary. See the kurtosis sketch after this list.
- Question: What is going on with Neel’s low-frequency features?? They point in the same direction across seeds!
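
To make the encoder/decoder distinction concrete, here is a minimal SAE sketch in the spirit of Neel’s 1L-Sparse-Autoencoder repo (the shapes, init, and L1 coefficient below are placeholder choices of ours, not values from his code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: the encoder detects features, the decoder stores directions."""

    def __init__(self, d_act: int, d_hidden: int, l1_coeff: float = 3e-4):
        super().__init__()
        # Encoder: each column of W_enc is a "detector" that scores how strongly
        # a feature is present in the input activation vector.
        self.W_enc = nn.Parameter(torch.empty(d_act, d_hidden))
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        # Decoder: each ROW of W_dec is a feature direction itself -- the vector
        # added back into activation space when that feature fires.
        self.W_dec = nn.Parameter(torch.empty(d_hidden, d_act))
        self.b_dec = nn.Parameter(torch.zeros(d_act))
        nn.init.kaiming_uniform_(self.W_enc)
        nn.init.kaiming_uniform_(self.W_dec)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        # x: (batch, d_act) activations taken from the model.
        feature_acts = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        x_recon = feature_acts @ self.W_dec + self.b_dec
        recon_loss = (x_recon - x).pow(2).mean()
        sparsity_loss = self.l1_coeff * feature_acts.abs().sum(dim=-1).mean()
        return x_recon, feature_acts, recon_loss + sparsity_loss
```

The dictionary of feature directions lives entirely in W_dec; the encoder only decides how strongly each direction fired.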
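The 75% figure follows from the shapes: with d_mlp = 4 * d_model, W_out (d_mlp x d_model) has rank at most d_model, so at least three quarters of the hidden space maps to zero. A quick null-space check, assuming TransformerLens (model name and layer index are placeholders):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")  # placeholder model choice

# W_out maps the MLP hidden space (d_mlp) down to the residual stream (d_model).
W_out = model.blocks[0].mlp.W_out  # shape: (d_mlp, d_model)
d_mlp, d_model = W_out.shape

rank = torch.linalg.matrix_rank(W_out.float())
null_frac = 1 - rank.item() / d_mlp
print(f"d_mlp={d_mlp}, d_model={d_model}, rank={rank.item()}, "
      f"fraction of hidden space in the null space ~= {null_frac:.2f}")  # expect ~0.75
```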
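And the kurtosis point as a sketch: an isotropic Gaussian looks the same in every basis, so Gaussian-looking decoder weights would mean no privileged basis; heavy tails (high excess kurtosis) mean the weight mass is concentrated on a few basis directions. The toy tensors below stand in for rows of W_dec:

```python
import torch

def excess_kurtosis(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Excess kurtosis along `dim`; ~0 for a Gaussian, large for heavy tails."""
    mu = x.mean(dim=dim, keepdim=True)
    var = x.var(dim=dim, unbiased=False, keepdim=True)
    fourth = ((x - mu) ** 4).mean(dim=dim, keepdim=True)
    return (fourth / var ** 2).squeeze(dim) - 3.0

# Toy comparison: rotation-invariant Gaussian directions vs. basis-aligned ones.
gaussian_dirs = torch.randn(1000, 512)                 # no privileged basis
sparse_dirs = torch.zeros(1000, 512)                   # one-hot rows: fully basis-aligned
sparse_dirs[torch.arange(1000), torch.randint(0, 512, (1000,))] = 1.0

print(excess_kurtosis(gaussian_dirs).mean())  # ~0
print(excess_kurtosis(sparse_dirs).mean())    # large and positive
```

In practice we’d run something like excess_kurtosis(sae.W_dec, dim=-1) and look at the distribution over features.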
Next todos:
- Read various papers (on own time). Also read about RLHF.
https://github.com/neelnanda-io/1L-Sparse-Autoencoder
- Use Neel’s scrappy code to train a sparse autoencoder for a Pythia-70m model. This should be short! Let’s train it on the post-W_out activations (the MLP output). Then, put the autoencoder on Hugging Face, and load it in Neel’s colab!! And interpret it, looking for chess features. (Rough outline in the sketch after this list.)
- We have Thursday, Friday, maybe Saturday, and maybe Monday (Naomi only).
- Train on a 6-billion-parameter model next (on Friday?). This might require getting Pro+.
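
A rough outline of that training todo, assuming TransformerLens and the SparseAutoencoder sketch from the log above; the layer index, expansion factor, dataset, and repo id are placeholders, and Neel’s actual training loop (lr schedule, neuron resampling, etc.) does more than this:

```python
import torch
from transformer_lens import HookedTransformer
from huggingface_hub import HfApi

model = HookedTransformer.from_pretrained("pythia-70m", device="cpu")  # cpu to keep the sketch simple
d_model = model.cfg.d_model

# SparseAutoencoder is the class from the sketch earlier in this log.
sae = SparseAutoencoder(d_act=d_model, d_hidden=8 * d_model)  # expansion factor 8 is a guess
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

texts = ["placeholder text; use a real token stream, e.g. the Pile or chess PGNs"]
for step, text in enumerate(texts):
    tokens = model.to_tokens(text)
    # hook_mlp_out is the MLP output, i.e. the post-W_out activations.
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens, names_filter="blocks.3.hook_mlp_out")
    acts = cache["blocks.3.hook_mlp_out"].reshape(-1, d_model)  # layer 3 is a placeholder

    _, _, loss = sae(acts)
    loss.backward()
    opt.step()
    opt.zero_grad()

# Save and upload so it can be loaded in Neel's colab.
# Assumes you're logged in (huggingface-cli login) and the repo already exists.
torch.save(sae.state_dict(), "sae_pythia70m_mlp_out.pt")
HfApi().upload_file(
    path_or_fileobj="sae_pythia70m_mlp_out.pt",
    path_in_repo="sae_pythia70m_mlp_out.pt",
    repo_id="your-username/pythia-70m-sae",  # placeholder repo id
)
```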
Naomi’s plans:
- Explain why we’re working on sparse autoencoders + RLHF: we want to make RLHF interpretable. Explain what our ideas are and why we’re pursuing them.
- Explain the progress we’ve made so far, and why we’re working on it.
- Have opinions about the paper.
- Be prepared to talk about things I’ve done for HAIST. Give specific details, but keep it to 1.5 minutes; then tell a different story for another 1.5 minutes. High-density interview!