Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Publisher summary (verbatim)

arXiv:2604.20995v1. Abstract: Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior di
