Relevant papers

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
**Sparse Autoencoders Find Highly Interpretable Features in Language Models (Logan)**
Comparing Anthropic's Dictionary Learning to Ours
[Interim research report] Taking features out of superposition with sparse autoencoders
Sparse Autoencoders: Future Work
Finding Neurons in a Haystack by Wes Gurnee
Neel Nanda replication!!

PPO: https://huggingface.co/docs/trl/main/en/ppo_trainer

Upload a custom model to Hugging Face: https://huggingface.co/docs/transformers/custom_models#sharing-custom-models