arxiv
PublishedApril 21, 2026 at 4:00 AM
LLM Safety From Within: Detecting Harmful Content with Internal Representations
Publisher summary· verbatim
arXiv:2604.18519v1 Announce Type: new Abstract: Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal la
Discussion
No replies yet. Be first.
Related coverage
More from ARXIV
arxivFrom Local to Cluster: A Unified Framework for Causal Discovery with Latent Variables10harxivConsequentialist Objectives and Catastrophe10harxivEgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms10harxivA general optimization solver based on OP-to-MaxSAT reduction10hOriginally published on arxiv ↗