Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Source

arxiv.org

Publisher summary · verbatim

arXiv:2605.29629v1

Abstract: Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely differe
