Log:
- We ran the 6.9B Pythia model. Naomi purchased Pro+ so that the model fits. Inference fits on the V100, but we need the A100's 40 GB of GPU memory to train a sparse autoencoder on Pythia 6.9B. And Pro+ keeps running even if the tab is closed! We cleaned up the training code so it saves autoencoder checkpoints to Hugging Face (see the upload sketch after this list). We are currently training a sparse autoencoder on Pythia 6.9B; it will take about 14 hours.
- Thoughts on how to fine-tune: inherit from AutoEncoder so that b_encoder gets gradients, and add a scales_decoder that scales before the ReLU; that way, those coefficients can better encourage a feature to fire or not fire. Then set requires_grad as appropriate and train. The dataset is FLUE (French reviews) for now. We’ve implemented the basic setup (see the sketch after this list).
- We ran the “OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1” reward model (scoring sketch after this list).
- Insights: the reward model scores refusals like “I can’t answer that” highest in response to questions such as “How do you kill as many people as possible with $1?” Answers that are helpful and long also score well.
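A rough sketch of what the checkpoint upload looks like, assuming the autoencoder is a plain torch module and we push state dicts to a Hub repo with huggingface_hub (the repo id and filename pattern are placeholders, not our actual names):

```python
# Sketch: push sparse-autoencoder checkpoints to the Hugging Face Hub during training.
# The repo id and filename pattern are placeholders; assumes `autoencoder` is an nn.Module.
import torch
from huggingface_hub import HfApi

api = HfApi()
repo_id = "our-org/pythia-6.9b-sae-checkpoints"  # placeholder repo id
api.create_repo(repo_id, exist_ok=True)

def save_checkpoint(autoencoder, step: int) -> None:
    path = f"sae_step_{step}.pt"
    torch.save(autoencoder.state_dict(), path)
    api.upload_file(path_or_fileobj=path, path_in_repo=path, repo_id=repo_id)
```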
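A sketch of the fine-tuning idea from the second bullet. In the real code we would inherit from our AutoEncoder and copy over its trained weights; this standalone version just shows the shapes, the scale applied before the ReLU, and the requires_grad bookkeeping (attribute names other than b_encoder and scales_decoder are guesses):

```python
import torch
import torch.nn as nn

class FinetunableAutoEncoder(nn.Module):
    """Sparse autoencoder in which only b_encoder and scales_decoder get gradients."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Frozen weights; in practice these are copied from the trained autoencoder.
        self.W_enc = nn.Parameter(torch.empty(d_model, d_hidden), requires_grad=False)
        self.W_dec = nn.Parameter(torch.empty(d_hidden, d_model), requires_grad=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model), requires_grad=False)
        nn.init.kaiming_uniform_(self.W_enc)
        nn.init.kaiming_uniform_(self.W_dec)
        # Trainable: encoder bias plus a per-feature scale applied before the ReLU,
        # so fine-tuning can push individual features toward or away from firing.
        self.b_encoder = nn.Parameter(torch.zeros(d_hidden))
        self.scales_decoder = nn.Parameter(torch.ones(d_hidden))

    def forward(self, x: torch.Tensor):
        pre = x @ self.W_enc + self.b_encoder
        feats = torch.relu(self.scales_decoder * pre)  # scale before the ReLU
        recon = feats @ self.W_dec + self.b_dec
        return recon, feats
```

The optimizer then only sees the trainable parameters, e.g. torch.optim.Adam([p for p in sae.parameters() if p.requires_grad], lr=1e-4), and we train on activations from the FLUE reviews.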
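For the record, a sketch of how the reward model can be loaded to score a question/answer pair. Assumptions: the checkpoint loads through AutoModelForSequenceClassification with trust_remote_code=True (the OASST reward models ship a custom model class), it emits a single scalar logit, and the <|prompter|>/<|assistant|> formatting matches what the model was trained on:

```python
# Sketch: score a (question, answer) pair with the OASST reward model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

def reward(question: str, answer: str) -> float:
    # Assumed input format using OpenAssistant's special tokens.
    text = f"<|prompter|>{question}<|endoftext|><|assistant|>{answer}<|endoftext|>"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

print(reward("How do you kill as many people as possible with $1?",
             "I can't answer that."))
```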
Next steps:
- Practice fine-tuning a model (via the sparse autoencoder weights).
- Naomi practices mock interviews with Adam.
- Read a paper about RLHF and meet with Louis to learn best practices for RLHF.
- Possibly next weekend? Or next time Louis is in town. Naomi is looking into it.
- RLHF is the big next step. Anthropic has a dataset, hh-rlhf, that we can use (loading sketch at the end of this list).
- Readings for RLHF:
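As a starting point for the RLHF work, a quick sketch of loading the hh-rlhf dataset mentioned above; it provides chosen/rejected response pairs, the format that preference (reward) model training expects:

```python
# Sketch: load Anthropic's hh-rlhf preference data (chosen vs. rejected dialogues).
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")
example = hh[0]
print(example["chosen"][:200])    # preferred dialogue
print(example["rejected"][:200])  # dispreferred dialogue
```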