Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

Publisher summary

arXiv:2606.13720v1. Abstract: Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based intervent
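As a rough illustration of the difference-in-means idea mentioned in the summary, the sketch below computes a candidate direction as the mean harmful activation minus the mean harmless activation, then projects that direction out of a batch of activations. Everything here is an assumption made for illustration: the function names `diff_in_means_direction` and `ablate_direction` are hypothetical, the activations are synthetic Gaussian data rather than residual-stream activations from a real model, and the summary does not specify layers, token positions, or the interventions the paper actually evaluates.

```python
import numpy as np


def diff_in_means_direction(harmful, harmless):
    """Return the unit vector pointing from the harmless mean to the harmful mean.

    Both inputs are arrays of shape (n_samples, d_model).
    """
    direction = harmful.mean(axis=0) - harmless.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("class means coincide; direction is undefined")
    return direction / norm


def ablate_direction(activations, unit_direction):
    """Remove the component of each activation along unit_direction."""
    coeffs = activations @ unit_direction  # shape (n_samples,)
    return activations - np.outer(coeffs, unit_direction)


# Synthetic stand-in data (assumption): "harmful" activations are shifted
# along the first coordinate axis relative to "harmless" ones.
rng = np.random.default_rng(0)
d_model = 16
true_direction = np.zeros(d_model)
true_direction[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * true_direction

r_hat = diff_in_means_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)
```

After ablation, every row of `ablated` has zero component along `r_hat`, which is the sense in which a single direction can be removed. On this toy data `r_hat` lies close to the planted axis only because the data were constructed that way.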
