arXiv · April 15, 2026, 4:00 AM
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
arXiv:2604.12384v1 Announce Type: new

Abstract: Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning: even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations…