Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
**Sparse Autoencoders Find Highly Interpretable Features in Language Models (Logan)**
[Interim research report] Taking features out of superposition with sparse autoencoders
Finding Neurons in a Haystack: Case Studies with Sparse Probing (Wes Gurnee et al.)
PPO: https://huggingface.co/docs/trl/main/en/ppo_trainer
Uploading a custom model to Hugging Face: https://huggingface.co/docs/transformers/custom_models#sharing-custom-models