How to interpret and manipulate intermediate representations (also called latent spaces, hidden states, intermediate states, or activations) in a neural network
Potential applications in steerability and alignment
Side note: alignment research is capability research
What is a representation?
Informational content of some form
For NNs, takes the form of a vector that is passed from layer to layer (see the extraction sketch below)
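As a concrete illustration, here is a minimal sketch of pulling these per-layer vectors out of a transformer, assuming a HuggingFace-style causal LM; "gpt2" and the layer index are placeholder choices, not anything specific to the papers discussed here:

```python
# Minimal sketch: extracting per-layer representations from a transformer.
# Assumes a HuggingFace-style model; "gpt2" is just a placeholder choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The movie was wonderful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: (embedding output, layer 1, ..., layer N),
# each of shape (batch, seq_len, hidden_dim).
hidden_states = outputs.hidden_states
rep = hidden_states[8][0, -1]  # e.g. the layer-8 representation of the last token
print(rep.shape)               # a single vector of size hidden_dim
```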
This paper presents a rigorous definition of the Linear Representation Hypothesis (LRH) and provides empirical evidence in support
LRH: high-level concepts are represented linearly in the representation space of a model
If LRH holds, we can use linear algebraic operations on representations to interpret and manipulate them
3 interpretations of the LRH:
Subspace: Each concept corresponds to a direction or subspace of the representation space
Measurement: The value/presence of a concept for a given input can be measured by using a linear probe on intermediate representations
Intervention: The value/presence of a concept for a given input can be adjusted with the appropriate linear algebraic operation (measurement and intervention are sketched in code after this list)
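To make the measurement and intervention views concrete, here is a sketch on synthetic data, assuming activations X and binary concept labels y have already been collected; the logistic-regression probe and the additive steering step are illustrative choices, not the specific method of either paper:

```python
# Sketch of the measurement and intervention views of the LRH, assuming we have
# already collected activations X (n_examples x hidden_dim) and concept labels y.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim = 768
concept_dir = rng.normal(size=hidden_dim)      # stand-in "true" concept direction
X = rng.normal(size=(1000, hidden_dim))        # synthetic activations
y = (X @ concept_dir > 0).astype(int)          # synthetic concept labels

# Measurement: a linear probe predicts the concept from the representation.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# Intervention: add a scaled concept direction to shift a representation
# toward (or, with a negative sign, away from) the concept.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit concept direction
alpha = 4.0                                          # steering strength (tunable)
x = X[0]
x_steered = x + alpha * w
print("concept probability before:", probe.predict_proba(x[None])[0, 1],
      "after:", probe.predict_proba(x_steered[None])[0, 1])
```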
The LRH underpins the entire endeavor of Representation Engineering (RepE)
Side note: I think it underpins vector search and retrieval too!
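As a rough illustration of the RepE recipe (read a concept direction from contrastive prompts, then add it back in to steer generation), here is a simplified sketch; it uses a difference of means rather than the paper's actual pipeline (which builds reading vectors with e.g. PCA over contrastive stimuli), and the model and layer index are placeholders:

```python
# Simplified RepE-flavored sketch: derive a concept direction from contrastive
# prompts, then inject it during generation via a forward hook.
# NOT the paper's exact pipeline; "gpt2" and LAYER = 8 are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 8  # transformer block to read from / steer at

def last_token_rep(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[i + 1] is the output of block i
    return out.hidden_states[LAYER + 1][0, -1]

# Contrastive prompt pairs that differ mainly in the target concept (sentiment here).
pos = ["The food was delicious.", "I loved every minute of it."]
neg = ["The food was awful.", "I hated every minute of it."]
direction = (torch.stack([last_token_rep(t) for t in pos]).mean(0)
             - torch.stack([last_token_rep(t) for t in neg]).mean(0))
direction = direction / direction.norm()

# Control: add the direction to that block's output on every forward pass.
def steer(module, inputs, output, alpha=6.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tokenizer("The weather today is", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=20)[0]))
handle.remove()  # restore the unmodified model
```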
Some results:
I wanted to introduce RepE with this paper to show that the approach is relatively principled, which is something other interpretability/explainability approaches can sometimes struggle with
Representation Engineering: A Top-Down Approach to AI Transparency