arxiv Jul 21 bullish

TRACE: Trajectory-Based Safety Patch Learning for LLM Post-Training Realignment

arXiv:2607.16242v1 Announce Type: cross Abstract: Fine-Tuning-as-a-Service (FTaaS) platforms let users train large language models (LLMs) on customized tasks, but this pipeline could erode models' safety alignment. In practice, service providers need to recover models' safety without re-running full

#safety #fine-tuning #language-models

arxiv Apr 16

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

arXiv:2509.25843v2 Announce Type: replace Abstract: Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrase

#safety #alignment #jailbreaking

arxiv Apr 14

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

arXiv:2604.10673v1 Announce Type: new Abstract: AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too

#alignment #interpretability #evaluation