---
title: "Logit Probe"
source: https://www.jemoka.com/posts/kbhlogitprobe/
---

Goals

Motivation: it is very difficult to obtain an interpretable, causal trace of facts. Let's fix that.

Facts

It is also difficult to pull apart what is a "fact" and what is a "syntactical relation". For instance, the task of

The Apple iPhone is made by American company <mask>.

is arguably more a matter of syntactical relation than of factual recall (the answer already appears in the prompt), unlike

The iPhone is made by American company <mask>.

For our purposes, however, we obviate this problem by saying that both of these cases are a recall of the fact triplet <iPhone, made_by, Apple>. Despite the syntactical relationship established by the first case, we define success as any intervention that edits this fact triplet without influencing anything else of the form:

The [company] [product] is made by [country] company [company].

The Probe

Definition

Maps: hidden mappings \(H^{(1)}, \dots, H^{(N)}\) and output projections \(W = W^{O}W^{I}\).

Spaces: embedding space \(U \subset \mathbb{R}^{\text{hidden}}\) and vocab space \(V \subset \mathbb{R}^{|V|}\), where \(|V|\) is vocab size.

LM: \(L = (W H^{(N)} \dots H^{(1)}): U \to V\), such that \(L u \in V\), for some word embedding \(u \in U\).

LM's distribution: \(\sigma L\), such that \(\sigma L u \in \triangle_{|V|}\).

The Logit Lens

The Logit Lens proposes that we can chop off the last few \(H\) and recover a distribution that's similar to the true output distribution. Empirically, given large enough \(N\), it is likely that:
\begin{equation} \arg\max_{j} \qty(W H^{(N)} \dots H^{(1)} u)_{j} = \arg\max_{j} \qty(W H^{(N-1)} \dots H^{(1)} u)_{j} = \arg\max_{j} \qty(W H^{(N-2)} \dots H^{(1)} u)_{j} \end{equation}
up to some finite depth before this effect breaks down.
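
To make the truncation concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the helper names (`matvec`, `argmax`, `logit_lens`, `near_identity`) are hypothetical, and each \(H^{(k)}\) is stood in for by a near-identity linear map, whereas a real transformer layer is nonlinear. It only shows the mechanics of decoding every intermediate state with the same \(W\).

```python
import random


def matvec(M, x):
    """Multiply a matrix M (a list of rows) by a vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def argmax(z):
    """Index of the largest entry of z."""
    return max(range(len(z)), key=z.__getitem__)


def logit_lens(layers, W, u):
    """Decode the hidden state after every layer with the same W.

    Returns one argmax token index per depth; the final entry is the
    argmax of the full model L u = W H^(N) ... H^(1) u. Softmax is
    monotone, so taking the argmax over logits is enough.
    """
    h = u
    predictions = []
    for H in layers:
        h = matvec(H, h)                          # apply H^(k)
        predictions.append(argmax(matvec(W, h)))  # decode early with W
    return predictions


def near_identity(n, eps, rng):
    """Toy stand-in for a layer: the identity plus small random noise."""
    return [[(1.0 if i == j else 0.0) + eps * rng.gauss(0, 1)
             for j in range(n)] for i in range(n)]


# Hypothetical toy sizes; nothing here comes from a trained model.
rng = random.Random(0)
hidden, vocab, depth = 4, 6, 5
layers = [near_identity(hidden, 0.01, rng) for _ in range(depth)]
W = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(vocab)]
u = [rng.gauss(0, 1) for _ in range(hidden)]

preds = logit_lens(layers, W, u)
print(preds)  # one predicted token index per truncation depth
```

Here `preds[-1]` corresponds to the full model, `preds[-2]` to dropping \(H^{(N)}\), and so on. The Logit Lens claim is that, for a real model, the last several entries tend to agree.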

A Sketch

Evidence suggests that storage of "factual" information is not typically axis-aligned in \(U\). That is, it's difficult to learn some binary mask \(m\), applied elementwise so that \(m \cdot u \in U\), that disrupts downstream production of a fact without knocking out other information.

However, we know that, due to the one-hot cross-entropy LM objective, "facts" (as defined above) are axis-aligned in \(V\). After all, a word \(v_{j}\) is represented by the \(j\)-th standard basis vector (i.e. a one-hot vector) in \(V\).
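
The contrast between the two spaces can be illustrated with a small sketch in plain Python. The matrix `W`, the vector `u`, and the helper names are made-up toy values and hypothetical names, not anything taken from a real model. The sketch suppresses one coordinate in \(V\), which removes exactly one word and leaves the relative odds of the others intact, and then zeroes one coordinate in \(U\), which shifts every logit when \(W\) is dense.

```python
import math


def matvec(M, x):
    """Multiply a matrix M (a list of rows) by a vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def softmax(z):
    """Numerically stable softmax; a -inf logit gets probability 0."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_token(logits, j):
    """Knock out word j by masking a single axis of V."""
    out = list(logits)
    out[j] = float("-inf")
    return out


# Hypothetical dense output projection (vocab = 4, hidden = 3).
W = [[1.0, 0.5, -0.3],
     [0.2, -1.0, 0.7],
     [-0.6, 0.4, 0.9],
     [0.8, 0.1, -0.5]]
u = [1.0, -2.0, 0.5]

logits = matvec(W, u)
p = softmax(logits)

# Masking in V: only word 1 disappears; the ratios between the
# remaining words are unchanged.
p_masked = softmax(suppress_token(logits, 1))

# Masking in U: zeroing one hidden coordinate moves every logit,
# because each row of this W has a nonzero first entry.
u_masked = [0.0] + u[1:]
logits_u = matvec(W, u_masked)
moved = [i for i in range(len(logits))
         if abs(logits_u[i] - logits[i]) > 1e-12]

print(p_masked)
print(moved)
```

This is only a toy picture of why an intervention phrased in \(V\) can be targeted at a single word, while a binary mask in \(U\) generally cannot.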