Jailbreak Attack Initializations as Extractors of Compliance Directions

Source

arxiv.org

Publisher summary· verbatim

arXiv:2502.09755v4. Abstract: Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantl
