arXiv · April 15, 2026, 4:00 AM
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
arXiv:2604.12384v1 Announce Type: new

Abstract: Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning: even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations…