Representation Engineering 101

Intro:

How to interpret and manipulate the intermediate representations of a neural network (also called latent-space vectors, hidden states, intermediate states, or activations)

Potential applications in steerability and alignment

Side note: alignment research is capability research

What is a representation?

Informational content of some form
Takes the form of a vector for NNs, which is passed from layer to layer
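
To make "a vector passed from layer to layer" concrete, here is a minimal sketch using a toy NumPy network. The layer sizes, the tanh nonlinearity, and the `forward` helper are all assumptions chosen for illustration; a real LLM would expose its hidden states through its own framework.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # input, two hidden layers, output (arbitrary toy sizes)
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, weights):
    """Run the toy network and record the vector emitted by every layer."""
    representations = []
    h = x
    for w in weights:
        h = np.tanh(h @ w)          # each layer maps one vector to the next
        representations.append(h)   # this vector is the layer's representation
    return h, representations

output, reps = forward(rng.normal(size=4), weights)
print([r.shape for r in reps])  # [(8,), (8,), (3,)]
```

Each entry of `reps` is one representation; RepE works on these vectors rather than on individual neurons or circuits.
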
RepE vs. MechInterp

Method     | Level of abstraction | Research difficulty | Application difficulty | Level of control
RepE       | Representation       | Medium              | Easy                   | Medium
MechInterp | Neurons/circuits     | Hard                | Hard                   | High

What I’ll talk about

Does this approach even make sense?
Potential applications

The Linear Representation Hypothesis and the Geometry of Large Language Models

This paper presents a rigorous definition of the Linear Representation Hypothesis (LRH) and provides empirical evidence in support of it
LRH: high-level concepts are represented linearly in the representation space of a model

If LRH holds, we can use linear algebraic operations on representations to interpret and manipulate them

3 interpretations of the LRH:

Subspace: Concepts are represented as a subspace (in the simplest case, a single direction) of the representation space
Measurement: The value/presence of a concept for a given input can be measured by using a linear probe on intermediate representations
Intervention: The value/presence of a concept for a given input can be adjusted with the appropriate linear algebraic operation
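
The measurement and intervention readings above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's procedure: the concept direction is random, the hidden state is synthetic, and the ordinary dot product is used as a simplifying assumption. The names `measure` and `intervene` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical unit-norm concept direction in a 16-dimensional representation space.
concept_dir = rng.normal(size=dim)
concept_dir /= np.linalg.norm(concept_dir)

def measure(hidden, direction):
    """Measurement: read the concept's value off as a linear projection."""
    return float(hidden @ direction)

def intervene(hidden, direction, target):
    """Intervention: shift along the direction so the projection equals target."""
    return hidden + (target - measure(hidden, direction)) * direction

hidden = rng.normal(size=dim)  # synthetic stand-in for one hidden state
steered = intervene(hidden, concept_dir, target=3.0)

print(measure(hidden, concept_dir))   # concept value before the intervention
print(measure(steered, concept_dir))  # concept value after the intervention
```

Because the shift is along `concept_dir` only, the part of the hidden state orthogonal to the concept direction is left unchanged.
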

LRH underpins the entire endeavor of RepE

Side note: I think vector search and retrieval too!

Some results:

I wanted to introduce RepE with this paper to show that this approach is relatively principled – something which other interpretability/explainability approaches can sometimes struggle with

Representation Engineering: A Top-Down Approach to AI Transparency

Survey on past work in RepE + experiments in steerability

Note: Honesty example application is a bit shaky. See Mistral 7b on Acid example

Methods

Concept vector identification

Dataset creation/selection

Instruct/base
Concept type
Continuous
Discrete
Binary
More than 2 classes
Data source: existing, new handcrafted, or new synthetic
Difficulty in specifying concepts?

Vector calculation

Binary: difference of hidden states between pairs

Single layer
Multi layer
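
A minimal sketch of the binary case, assuming the hidden states for contrastive pairs have already been collected into arrays of shape (pairs, layers, dim). The data here is synthetic, and averaging the pairwise differences then normalizing is one possible aggregation, not necessarily the one used in the papers. The helper name `concept_vectors` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_layers, dim = 32, 4, 8

# Synthetic stand-in for hidden states of contrastive pairs: (pairs, layers, dim).
# The positive example of each pair is shifted along a known direction plus noise.
true_dir = np.zeros(dim)
true_dir[0] = 1.0
negative = rng.normal(size=(n_pairs, n_layers, dim))
noise = 0.1 * rng.normal(size=(n_pairs, n_layers, dim))
positive = negative + 2.0 * true_dir + noise

def concept_vectors(pos, neg):
    """Mean pairwise difference per layer, normalized to unit length."""
    diff = (pos - neg).mean(axis=0)  # shape (layers, dim)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)

multi_layer = concept_vectors(positive, negative)  # one concept vector per layer
single_layer = multi_layer[2]                      # single-layer variant: pick one layer
```

The multi-layer variant keeps one vector per layer; the single-layer variant just selects a row, and which layer to pick is an empirical choice.
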

Continuous/Discrete

Linear probe?
Time: https://arxiv.org/abs/2312.13401
Difference in weights between base and fine-tuned model
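
For the linear-probe option, one possible sketch: fit a least-squares probe from hidden states to a continuous concept value and take the normalized probe weights as a candidate concept vector. The data is synthetic, and the use of ordinary least squares with a bias term is an assumption; a regularized probe would be an equally reasonable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 200, 8

# Synthetic data: the concept value is a noisy linear function of the hidden state.
true_w = rng.normal(size=dim)
hidden = rng.normal(size=(n, dim))
values = hidden @ true_w + 0.01 * rng.normal(size=n)

# Fit the probe by ordinary least squares, with a bias column appended.
design = np.hstack([hidden, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(design, values, rcond=None)
probe_w, probe_b = coef[:-1], coef[-1]

# The normalized probe weights serve as a candidate concept vector.
concept_vec = probe_w / np.linalg.norm(probe_w)
```

For discrete concepts the same idea applies with a classifier (for example logistic regression) in place of the regression.
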

Concept vector application

Potential applications

Steerability
Alignment
Adversarial robustness