Project Overview

AI Interpretability Research Reproduction

A Sparse Autoencoder reproduction to understand how hidden representations form, overlap, and become inspectable inside transformer language models.

August 2025 - October 2025

01 Why It Matters

Problem

Modern language models can produce useful answers while still being difficult to inspect internally. That creates a real trust problem: when a model fails, hallucinates, refuses, or changes behavior under a small prompt shift, it is hard to know which internal representation caused the behavior.

The original Sparse Autoencoder line of work is motivated by polysemanticity, where one neuron can appear to represent multiple unrelated concepts. If neurons are not clean units of meaning, then debugging a model by looking at individual neurons is often misleading.

Transformer activations are high-dimensional, so the useful concepts may be directions in activation space rather than single neurons.
Anthropic reports decomposing a transformer layer with 512 neurons into more than 4,000 learned features, showing why neuron-level inspection can miss important structure.
A reproduction project was useful because interpretability papers are easy to summarize but much harder to understand without rebuilding the pipeline.
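
To make the "directions rather than neurons" point concrete, here is a toy sketch that is not taken from the project code. It assumes a hypothetical layer with only 2 neurons that has to represent 5 features, and it places those features as evenly spaced unit directions. The function names and the 0.1 threshold are illustrative choices. The ratio in the Anthropic example (more than 4,000 features from 512 neurons) is roughly eightfold; the toy uses 5 from 2 only to stay readable.

```python
import math

def feature_directions(n_features: int) -> list[tuple[float, float]]:
    """Place n_features unit vectors evenly around a circle (a 2-neuron space)."""
    return [
        (math.cos(2 * math.pi * k / n_features),
         math.sin(2 * math.pi * k / n_features))
        for k in range(n_features)
    ]

def features_moving_neuron(directions, neuron: int, threshold: float = 0.1) -> list[int]:
    """Indices of features whose direction has a noticeable component on one neuron."""
    return [i for i, d in enumerate(directions) if abs(d[neuron]) > threshold]

# Hypothetical setup: 5 features squeezed into 2 neurons.
directions = feature_directions(5)
features_on_neuron_0 = features_moving_neuron(directions, neuron=0)
features_on_neuron_1 = features_moving_neuron(directions, neuron=1)

print("features that move neuron 0:", features_on_neuron_0)
print("features that move neuron 1:", features_on_neuron_1)
```

In this toy, every neuron responds to several features at once, so reading a single neuron cannot tell you which feature was active. Each feature still has its own distinct direction, which is what a Sparse Autoencoder tries to recover.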

02 What I Did

Decision

I reproduced Sparse Autoencoder workflows to study how latent features can be learned from transformer activations and surfaced for analysis.

Instead of treating the paper as a black-box result, I focused on the mechanics: which activations are collected, how the autoencoder reconstructs them, how sparsity pressure changes the learned representation, and how a learned feature becomes interpretable enough to inspect.
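
The mechanics above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the project's actual code or the exact recipe from either paper: the activations are synthetic random vectors standing in for activations collected from a transformer layer, the sizes (16 inputs, 64 features) and the L1 coefficient are arbitrary, and the class and function names are hypothetical. Published recipes add refinements that this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: d_hidden is larger than d_model."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative, so "off" means exactly zero.
        features = F.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    mse = F.mse_loss(reconstruction, x)
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1, mse, l1

torch.manual_seed(0)
# Assumption: synthetic stand-in for activations collected from a real model.
activations = torch.randn(256, 16)

sae = SparseAutoencoder(d_model=16, d_hidden=64)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

initial_loss = None
for step in range(200):
    reconstruction, features = sae(activations)
    loss, mse, l1 = sae_loss(activations, reconstruction, features, l1_coeff=1e-3)
    if initial_loss is None:
        initial_loss = loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
final_loss = loss.item()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Raising `l1_coeff` trades reconstruction quality for sparser feature activations, which is the "sparsity pressure" knob described above.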

Studied feature superposition, polysemanticity, representation learning, and neuron-level interpretability limits.
Used Python, PyTorch, and TransformerLens to stay close to existing mechanistic interpretability tooling.
Compared the reproduction against the paper narrative so the result became a learning artifact, not just a runnable notebook.
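
One common way a learned feature becomes inspectable is to list the inputs on which it activates most strongly. The sketch below shows that step on its own. The tokens and activation values are invented for illustration and are not results from the reproduction; the helper name is hypothetical.

```python
def top_activating_tokens(tokens, activations, k=3):
    """Return up to k (token, activation) pairs with the highest positive activation."""
    ranked = sorted(zip(tokens, activations), key=lambda pair: pair[1], reverse=True)
    return [(tok, act) for tok, act in ranked[:k] if act > 0]

# Made-up example: one feature's activation on each token of a short text.
tokens = ["the", "DNA", "sequence", "of", "a", "gene", "is", "copied"]
feature_acts = [0.0, 2.1, 0.9, 0.0, 0.0, 1.7, 0.0, 0.2]

top = top_activating_tokens(tokens, feature_acts, k=3)
for tok, act in top:
    print(f"{tok!r}: {act}")
```

A list like this is only a starting hypothesis about what a feature represents. Reading many examples, including weakly activating ones, is where interpretation can become too confident if done carelessly.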

03 What I Learned

Learning & Impact

The project strengthened my interest in mechanistic interpretability and trustworthy AI, especially for systems deployed outside controlled lab settings.

The work also changed how I evaluate AI projects. I now pay more attention to whether a system can be inspected, debugged, and explained when its behavior becomes surprising, not only whether it performs well on a benchmark.

Built practical familiarity with Sparse Autoencoder workflows and the motivation behind monosemantic features.
Improved my ability to read interpretability papers through implementation.
Connected my cybersecurity background with AI transparency and failure analysis.

Takeaway

The project taught me to value reproducibility. Understanding a paper deeply often starts by rebuilding its assumptions carefully: what data enters the system, what objective is optimized, what gets visualized, and where interpretation may become too confident.

Research References

Cunningham et al., 2023

This paper studies Sparse Autoencoders as a scalable way to find more interpretable features in language-model activations.

Anthropic, 2023

Anthropic describes decomposing a layer of 512 neurons into more than 4,000 features, including concepts not visible from individual neurons alone.

Stack & Links

Python, PyTorch, TransformerLens
