SAELens

The SAELens training codebase exists to help researchers:

  • Train sparse autoencoders.
  • Analyse sparse autoencoders and neural network internals.
  • Generate insights which make it easier to create safe and aligned AI systems.

Please note these docs are in beta. We intend to make them cleaner and more comprehensive over time.

Quick Start

Installation

pip install sae-lens

Loading Sparse Autoencoders from Huggingface

To load a pretrained sparse autoencoder, you can use SAE.from_pretrained() as below. Note that we also return the original cfg dict from the Huggingface repo, which makes it easier to debug older configs that need special handling when importing an SAE, and a sparsity tensor if one is present in the repo. For an example repo structure, see here.

from sae_lens import SAE

sae, cfg_dict, sparsity = SAE.from_pretrained(
    release = "gpt2-small-res-jb", # see other options in sae_lens/pretrained_saes.yaml
    sae_id = "blocks.8.hook_resid_pre", # won't always be a hook point
    device = "cuda"
)

You can see other importable SAEs on this page.

Any SAE on Huggingface that was trained with SAELens can also be loaded using SAE.from_pretrained(). In this case, release is the name of the Huggingface repo and sae_id is the path to the SAE within that repo. You can find such SAEs on Huggingface via the saelens tag.
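For example, a community SAE trained with SAELens could be loaded directly from its Huggingface repo as in the sketch below. The repo name and internal path are hypothetical placeholders; substitute the actual repo id and SAE folder you want to load.

from sae_lens import SAE

# Hypothetical repo id and path, shown only to illustrate the convention:
# release = Huggingface repo id, sae_id = folder of the SAE inside that repo.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release = "your-username/your-saelens-sae-repo",
    sae_id = "blocks.4.hook_resid_post",
    device = "cuda"
)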

Loading Sparse Autoencoders from Disk

To load a sparse autoencoder that you've trained yourself and saved to disk, you can use SAE.load_from_disk() as below.

from sae_lens import SAE

sae = SAE.load_from_disk("/path/to/your/sae", device="cuda")
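Once loaded (from disk or from Huggingface), the SAE can be run on activations via its encode and decode methods. Below is a minimal sketch using dummy activations; in practice you would capture activations from the hook point the SAE was trained on (for example with TransformerLens) and pass those through the SAE instead.

import torch
from sae_lens import SAE

sae = SAE.load_from_disk("/path/to/your/sae", device="cpu")  # use "cuda" if available

# Dummy activations with the SAE's input width (batch, seq, d_in).
# In practice, capture these from the hook point the SAE was trained on.
activations = torch.randn(1, 8, sae.cfg.d_in)

feature_acts = sae.encode(activations)      # sparse feature activations
reconstruction = sae.decode(feature_acts)   # reconstructed activations

print(feature_acts.shape, reconstruction.shape)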

Importing SAEs from other libraries

You can import an SAE created with another library by writing a custom PretrainedSaeHuggingfaceLoader or PretrainedSaeDiskLoader for use with SAE.from_pretrained() or SAE.load_from_disk(), respectively. See the pretrained_sae_loaders.py file for more details, or ask on the Open Source Mechanistic Interpretability Slack. If you write a good custom loader for another library, please consider contributing it back to SAELens!
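The actual loader protocols are defined in pretrained_sae_loaders.py, and a real loader should match those signatures. As a rough, hypothetical sketch (the function signature, file names, and key mapping below are assumptions for illustration only), a custom disk loader reads a config and a set of weights from a folder and returns them in the format SAELens expects:

import json
from pathlib import Path

import torch

# Hypothetical sketch of a custom disk loader. Check PretrainedSaeDiskLoader in
# pretrained_sae_loaders.py for the actual signature and expected return values.
def my_custom_disk_loader(path, device="cpu", cfg_overrides=None):
    path = Path(path)

    # Assumed file layout of the other library: a JSON config and a weights file.
    cfg_dict = json.loads((path / "config.json").read_text())
    if cfg_overrides is not None:
        cfg_dict.update(cfg_overrides)

    # Load the weights and rename them to the parameter names SAELens uses
    # (W_enc, b_enc, W_dec, b_dec). The source key names here are illustrative.
    raw_state_dict = torch.load(path / "weights.pt", map_location=device)
    state_dict = {
        "W_enc": raw_state_dict["encoder.weight"].T,
        "b_enc": raw_state_dict["encoder.bias"],
        "W_dec": raw_state_dict["decoder.weight"].T,
        "b_dec": raw_state_dict["decoder.bias"],
    }

    return cfg_dict, state_dict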

Background and Further Reading

We highly recommend this tutorial.

For recent progress in SAEs, we recommend the LessWrong forum's Sparse Autoencoder tag.

Tutorials

I wrote these tutorials to show users how to do some basic exploration and training of SAEs:

  • Loading and Analysing Pre-Trained Sparse Autoencoders Open In Colab
  • Understanding SAE Features with the Logit Lens Open In Colab
  • Training a Sparse Autoencoder Open In Colab

Example WandB Dashboard

WandB Dashboards provide lots of useful insights while training SAEs. Here's a screenshot from one training run.

[Screenshot of a WandB dashboard from an SAE training run]

Citation

@misc{bloom2024saetrainingcodebase,
   title = {SAELens},
   author = {Joseph Bloom and Curt Tigges and Anthony Duong and David Chanin},
   year = {2024},
   howpublished = {\url{https://github.com/jbloomAus/SAELens}},
}