SAELens

License: MIT

The SAELens training codebase exists to help researchers:

  • Train sparse autoencoders.
  • Analyse sparse autoencoders and neural network internals.
  • Generate insights which make it easier to create safe and aligned AI systems.

Please note these docs are in beta. We intend to make them cleaner and more comprehensive over time.

Quick Start

Installation

pip install sae-lens
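
One way to confirm the install succeeded is to query the installed distribution's version with the Python standard library. This is a minimal sketch; the helper name `installed_version` is hypothetical and is not part of SAELens.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "sae-lens") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"sae-lens version: {found}" if found else "sae-lens is not installed")
```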

Loading Sparse Autoencoders from Hugging Face

To load a pretrained sparse autoencoder, use SAE.from_pretrained() as shown below. Alongside the SAE, it returns the original cfg dict from the Hugging Face repo, which makes it easier to debug older configs that are converted when an SAE is imported. It also returns a sparsity tensor if one is present in the repo. For an example repo structure, see here.

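import torch
from sae_lens import SAE

# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",  # see other options in sae_lens/pretrained_saes.yaml
    sae_id="blocks.8.hook_resid_pre",  # won't always be a hook point
    device=device,
)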
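
Because the sparsity tensor is only returned when the repo provides one, downstream code may want to handle its absence explicitly. The sketch below assumes that a missing sparsity tensor comes back as None, which the text above implies but does not state. The helper name `describe_load_result` is hypothetical and not part of SAELens.

```python
from typing import Any, Mapping, Optional


def describe_load_result(cfg_dict: Mapping[str, Any], sparsity: Optional[Any]) -> str:
    """Summarise the extra values returned alongside the SAE.

    Hypothetical helper for illustration. Assumes `sparsity` is None when the
    repo has no sparsity tensor, and otherwise exposes a `.shape` attribute.
    """
    lines = [f"config keys: {sorted(cfg_dict)}"]
    if sparsity is None:
        lines.append("sparsity: not provided by this repo")
    else:
        shape = tuple(getattr(sparsity, "shape", ()))
        lines.append(f"sparsity: tensor with shape {shape}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values standing in for a real from_pretrained() result.
    print(describe_load_result({"example_key": 1}, None))
```

With a real load, this would be called as `describe_load_result(cfg_dict, sparsity)` right after `SAE.from_pretrained(...)`.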

You can see other importable SAEs on this page.

Background and Further Reading

We highly recommend this tutorial.

For recent progress in SAEs, we recommend the LessWrong forum's Sparse Autoencoder tag.

Tutorials

These tutorials show users how to do some basic exploration of their SAEs:

  • Loading and Analysing Pre-Trained Sparse Autoencoders
  • Understanding SAE Features with the Logit Lens
  • Training a Sparse Autoencoder

Example WandB Dashboard

WandB dashboards provide many useful insights while training SAEs. Here's a screenshot from one training run.

screenshot

Citation

@misc{bloom2024saetrainingcodebase,
   title = {SAELens},
   author = {Joseph Bloom and Curt Tigges and David Chanin},
   year = {2024},
   howpublished = {\url{https://github.com/jbloomAus/SAELens}},
}