arXiv
Published June 4, 2026 at 4:00 AM
Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
Publisher summary (verbatim)
arXiv:2606.04778v1 (Announce Type: new)
Abstract: Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens.
Originally published on arXiv