Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Source

arxiv.org

Publisher summary (verbatim)

arXiv:2606.04778v1. Abstract: Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens.
