LLM Safety From Within: Detecting Harmful Content with Internal Representations

Source

arxiv.org (full article)

Publisher summary (verbatim)

arXiv:2604.18519v1 Announce Type: new Abstract: Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal la
