How to interpret and manipulate intermediate representations (also called latent spaces, hidden states, intermediate states, or activations) in a neural network
Potential applications in steerability and alignment
Side note: alignment research is capability research
What is a representation?
Informational content of some form
For NNs, takes the form of a vector that is passed from layer to layer (see the extraction sketch below)
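As a concrete illustration, here is a minimal sketch of pulling these per-layer vectors out of a transformer, assuming a HuggingFace-style causal LM; "gpt2" and the layer index are placeholder choices, not anything specific to the papers discussed here:

```python
# Minimal sketch: extracting per-layer representations from a transformer.
# Assumes a HuggingFace-style model; "gpt2" is just a placeholder choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The movie was wonderful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: (embedding output, layer 1, ..., layer N),
# each of shape (batch, seq_len, hidden_dim).
hidden_states = outputs.hidden_states
rep = hidden_states[8][0, -1]  # e.g. the layer-8 representation of the last token
print(rep.shape)               # a single vector of size hidden_dim
```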
This paper presents a rigorous definition of the Linear Representation Hypothesis (LRH) and provides empirical evidence in support
LRH: high-level concepts are represented linearly in the representation space of a model
If LRH holds, we can use linear algebraic operations on representations to interpret and manipulate them
3 interpretations of the LRH:
Subspace: Each concept corresponds to a direction or subspace of the representation space
Measurement: The value/presence of a concept for a given input can be measured by using a linear probe on intermediate representations
Intervention: The value/presence of a concept for a given input can be adjusted with the appropriate linear algebraic operation (measurement and intervention are sketched in code after this list)
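To make the measurement and intervention views concrete, here is a sketch on synthetic data, assuming activations X and binary concept labels y have already been collected; the logistic-regression probe and the additive steering step are illustrative choices, not the specific method of either paper:

```python
# Sketch of the measurement and intervention views of the LRH, assuming we have
# already collected activations X (n_examples x hidden_dim) and concept labels y.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim = 768
concept_dir = rng.normal(size=hidden_dim)      # stand-in "true" concept direction
X = rng.normal(size=(1000, hidden_dim))        # synthetic activations
y = (X @ concept_dir > 0).astype(int)          # synthetic concept labels

# Measurement: a linear probe predicts the concept from the representation.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# Intervention: add a scaled concept direction to shift a representation
# toward (or, with a negative sign, away from) the concept.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit concept direction
alpha = 4.0                                          # steering strength (tunable)
x = X[0]
x_steered = x + alpha * w
print("concept probability before:", probe.predict_proba(x[None])[0, 1],
      "after:", probe.predict_proba(x_steered[None])[0, 1])
```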
The LRH underpins the entire endeavor of Representation Engineering (RepE)
Side note: I think it underpins vector search and retrieval too!
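As a rough illustration of the RepE recipe (read a concept direction from contrastive prompts, then add it back in to steer generation), here is a simplified sketch; it uses a difference of means rather than the paper's actual pipeline (which builds reading vectors with e.g. PCA over contrastive stimuli), and the model and layer index are placeholders:

```python
# Simplified RepE-flavored sketch: derive a concept direction from contrastive
# prompts, then inject it during generation via a forward hook.
# NOT the paper's exact pipeline; "gpt2" and LAYER = 8 are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 8  # transformer block to read from / steer at

def last_token_rep(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[i + 1] is the output of block i
    return out.hidden_states[LAYER + 1][0, -1]

# Contrastive prompt pairs that differ mainly in the target concept (sentiment here).
pos = ["The food was delicious.", "I loved every minute of it."]
neg = ["The food was awful.", "I hated every minute of it."]
direction = (torch.stack([last_token_rep(t) for t in pos]).mean(0)
             - torch.stack([last_token_rep(t) for t in neg]).mean(0))
direction = direction / direction.norm()

# Control: add the direction to that block's output on every forward pass.
def steer(module, inputs, output, alpha=6.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tokenizer("The weather today is", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=20)[0]))
handle.remove()  # restore the unmodified model
```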
Some results:
I wanted to introduce RepE with this paper to show that the approach is relatively principled, which is something other interpretability/explainability approaches can sometimes struggle with
Representation Engineering: A Top-Down Approach to AI Transparency