Roadmap for finishing the project:

  1. Train an interpretable sparse autoencoder
  2. Learn how to use RLHF (work through a notebook; can be done in parallel)
  3. Apply RLHF to the sparse autoencoder and interpret the results
  4. Write up a blog post (can be done in parallel)

Timeline:

  1. Saturday: inspect and tweak our sparse autoencoders until we are happy with them.
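For the Saturday inspection step, here is a minimal sketch of what "inspecting" a sparse autoencoder might involve. It assumes one common formulation: a ReLU encoder, a linear decoder, and an objective of squared reconstruction error plus an L1 penalty on feature activations. The project's actual architecture may differ. All names (`SparseAutoencoder`, `sae_loss`, `l0`, `dead_fraction`, `l1_coeff`) are hypothetical. The sketch only computes the objective and two common health metrics: L0, the number of active features per input, and the fraction of dead features that never fire on a batch. It does no training, which would normally be done by an optimizer in a framework such as PyTorch.

```python
import math
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * a for w, a in zip(row, x)) for row in W]


class SparseAutoencoder:
    """Hypothetical minimal SAE mapping d_model inputs to n_features hidden units."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_model)
        # Encoder: n_features rows, each of length d_model.
        self.W_enc = [[rng.gauss(0.0, scale) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Decoder: d_model rows, each of length n_features (tied to W_enc at init).
        self.W_dec = [[self.W_enc[j][i] for j in range(n_features)]
                      for i in range(d_model)]
        self.b_dec = [0.0] * d_model
        self.n_features = n_features

    def encode(self, x):
        # Subtract the decoder bias, project, add encoder bias, apply ReLU.
        centered = [a - b for a, b in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return relu(pre)

    def decode(self, f):
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]


def sae_loss(sae, x, l1_coeff):
    # Returns (total, reconstruction error, L1 sparsity term, feature activations).
    f = sae.encode(x)
    x_hat = sae.decode(f)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity, recon, sparsity, f


def l0(f):
    # Number of features active (non-zero) for a single input.
    return sum(1 for v in f if v > 0)


def dead_fraction(sae, batch):
    # Fraction of features that never fire on any input in the batch.
    fired = [False] * sae.n_features
    for x in batch:
        for j, v in enumerate(sae.encode(x)):
            if v > 0:
                fired[j] = True
    return fired.count(False) / len(fired)


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=4, n_features=8)
    total, recon, sparsity, f = sae_loss(sae, [1.0, -0.5, 0.25, 0.0], l1_coeff=1e-3)
    print(f"loss={total:.4f} recon={recon:.4f} L0={l0(f)}/{len(f)}")
    print("dead fraction:", dead_fraction(sae, [[1.0, -0.5, 0.25, 0.0], [0.0, 1.0, 0.0, -1.0]]))
```

In practice, `l1_coeff` trades reconstruction quality against sparsity. A high dead fraction or an L0 close to `n_features` are typical signs that the coefficient or the learning setup needs tweaking.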