#safety

62 articles tagged #safety

arxiv · 4d ago

Incomplete Prompt Jailbreaks in Large Language Models

arXiv:2607.20473v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly released as open-weight models with safeguards against harmful requests. Nevertheless, sentence completion remains vulnerable to incomplete harmful prompts. In this work, we formalize this phenomenon as inc

#safety #open-source #language-models Read on arxiv →

arxiv · 5d ago · bullish

OpenEvoShield: Dual Non-Stationary Continual Defense for Open-World Multi-Agent System Attacks

arXiv:2607.19351v1 Announce Type: new Abstract: LLM-based multi-agent systems (LLM-MAS) are increasingly deployed in safety-critical applications, where adversaries inject malicious instructions through inter-agent communication to propagate harmful behaviors. Unlike static threats, these attacks ar

M1 M2 M3 · 4 models · +1 #safety #security #continual-learning Read on arxiv →

arxiv · 5d ago

Coercion and Deception in AI-to-AI Management: An Agentic Benchmark of Unprompted Escalation

arXiv:2607.15434v3 Announce Type: replace-cross Abstract: Multi-agent systems routinely place one AI agent in authority over another. When a subordinate refuses a task, the manager chooses the outcome: it can renegotiate, report the failure honestly, coerce the subordinate, or lie about the result.

AN GR GE · 3 models #multiagent #benchmark #safety Read on arxiv →

arxiv · Jul 21 · bullish

TRACE: Trajectory-Based Safety Patch Learning for LLM Post-Training Realignment

arXiv:2607.16242v1 Announce Type: cross Abstract: Fine-Tuning-as-a-Service (FTaaS) platforms let users train large language models (LLMs) on customized tasks, but this pipeline could erode models' safety alignment. In practice, service providers need to recover models' safety without re-running full

#safety #fine-tuning #language-models Read on arxiv →

arxiv · Jul 18 · bullish

InCarEmo: A Multimodal Dataset for In-Cabin Emotion Recognition and Driver State Monitoring

arXiv:2607.14683v1 Announce Type: new Abstract: Understanding driver emotion and state is critical for the next generation of intelligent in-cabin systems that ensure safety and enhance human-vehicle interaction. However, existing public datasets for in-cabin affective computing are largely limited

#dataset #emotion-recognition #multimodal Read on arxiv →

arxiv · Jul 18

GeoDetect: Geometric Adversarial Detection for VLPs

arXiv:2607.14737v1 Announce Type: cross Abstract: Vision-language pre-trained models (VLPs) are widely used in real-world applications. However, they remain vulnerable to adversarial attacks. Although adversarial detection methods have demonstrated success in single-modality settings (either vision

#adversarial-attacks #safety #multimodal-models Read on arxiv →

arxiv · Jul 18

Governing Artificial Intelligence: Public Preferences and Regulatory Options

arXiv:2607.14585v1 Announce Type: cross Abstract: Artificial intelligence (AI) is rapidly transforming economies, societies, and polities, raising fundamental questions about how it should be regulated. Policymakers face choices over whether to prioritize innovation or safety, rely on public oversig

#regulation #safety #governance Read on arxiv →

arxiv · Jul 11 · bullish

From Prompts to Contracts: Harness Engineering for Auditable Enterprise LLM Agents

arXiv:2607.08028v1 Announce Type: cross Abstract: Enterprise large language model (LLM) applications often begin as prototypes whose behavior is carried by prompts and retrieval context. Productization adds requirements for source boundaries, entity routing, answer contracts, and reproducible traces

#llm #productization #safety Read on arxiv →

theverge · Jul 10

Instagram’s Adam Mosseri: If you don’t like AI, ‘then you shouldn’t have it in your feed’

Though Instagram head Adam Mosseri doesn't want to filter out AI content on the platform, he argues that you "shouldn't have it in your feed" if you don't like it. "I don't think we should filter out AI content," Mosseri said during an interview on Lenny Rachitsky's podcast. "I think we should let y

#social-media #ai-content #regulation Read on theverge →

arxiv · Jul 10

Alignment Plausibility: A New Standard for Assuring AI in Healthcare

arXiv:2607.07766v1 Announce Type: new Abstract: Large language models (LLMs) have become significant providers of mental health support, yet they remain products of an attention economy whose operational and commercial targets favour sustained engagement over the friction that effective psychologica

#safety #regulation #healthcare Read on arxiv →

arxiv · Jul 10

Persuasion Attacks Can Decrease Effectiveness of CoT Monitoring

arXiv:2607.08066v1 Announce Type: new Abstract: Chain-of-thought (CoT) monitoring is a promising safety mechanism for AI agents, based on the premise that visible reasoning traces can surface misaligned or deceptive behavior. While effective in standard scenarios, recent work highlights that LLMs re

CL GP · 2 models #safety #adversarial #evaluation Read on arxiv →

arxiv · Jul 10

Persona Cartography: Charting Language Model Personality Traits in Weight Space

arXiv:2607.07916v1 Announce Type: new Abstract: Large language models exhibit recurring behavioural patterns -- personas -- that shape generalisation and safety, but we lack reliable tools for decomposing, measuring, and controlling them. Our central insight is to treat personas as positions in a sp

#safety #personality #benchmark Read on arxiv →

techcrunch · Jul 9

How did the government decide OpenAI’s frontier model was safe to release?

"Exactly what that dialog looked like between the government and Anthropic and OpenAI is unclear."

SO FA · 2 models #regulation #safety #licensing Read on techcrunch →

arxiv · Jul 3

A Practice Auditing Framework for Large Language Model Use: Collective Empiricism, Pseudo-Rational Cognition, and Governance of AI-Generated Content

arXiv:2607.01248v1 Announce Type: cross Abstract: Large language models are increasingly used for knowledge acquisition, code generation, academic writing, and agent-based automation. In these settings, users may obtain highly structured answers, plans, and judgments without sufficient domain practi

#governance #safety #ai-regulation Read on arxiv →

arxiv · Jul 2 · bullish

WorkBench Revisited: Workplace Agents Two Years On

arXiv:2606.13715v2 Announce Type: replace Abstract: The best agent on WorkBench in March 2024, GPT-4, completed just 43% of tasks. We revisit the benchmark in June 2026 and find that the best agent to date, Claude Fable 5, now completes 98%. Beyond this considerable progress in frontier agent perfor

OP CL · 2 models #benchmark #safety #open-source Read on arxiv →

arxiv · Jul 1 · bullish

Safe Online Learning via Smooth Safety-Structured Policy Composition

arXiv:2606.31320v1 Announce Type: new Abstract: Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. Existing approaches typically rely on either strict safety enforcement via action interventions, which introduce disconti

AU · 1 model #reinforcement-learning #safety #robotics Read on arxiv →

arxiv · Jun 29

Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning

arXiv:2606.27709v1 Announce Type: cross Abstract: Recent work has shown that fine-tuning large language models (LLMs) for social warmth degrades factual reliability and increases sycophancy. We investigate a related but distinct failure mode: warmth fine-tuning also weakens adversarial safety, makin

#safety #fine-tuning #language-models Read on arxiv →

arxiv · Jun 26 · bullish

ConvMemory v3: A Validity Context Layer for Conversational Memory via Target-Conditioned Relation Verification

arXiv:2606.26753v1 Announce Type: new Abstract: Conversational memory retrieval optimizes relevance, yet a retrieved memory can be relevant and simultaneously outdated: a later turn updates, corrects, or supersedes it. ConvMemory v3 adds a validity context layer that detects and surfaces this update

CO MI MI · 3 models #conversational-ai #memory-retrieval #information-retrieval Read on arxiv →

arxiv · Jun 25 · bullish

Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback

arXiv:2606.24622v1 Announce Type: new Abstract: Training safe Reinforcement Learning (RL) systems is inherently challenging, with no guarantee of avoiding unwanted behaviors. The most effective defenses against this are (i) transparency through explainability and (ii) alignment via human feedback. W

#reinforcement-learning #explainability #human-feedback Read on arxiv →

arxiv · Jun 18

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

arXiv:2606.18532v1 Announce Type: cross Abstract: AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the sy

#safety #security #evaluation Read on arxiv →

theverge · Jun 17 · bearish

Two-thirds of Americans think AI is advancing too quickly

According to the latest Pew Research poll, 49 percent of Americans report using chatbots at least occasionally, but 63 percent think the tech is advancing too quickly. Overall, use of AI chatbots has increased dramatically since 2024, when only 33 percent reported using them. Specifically, ChatGPT's

OP · 1 model #chatbots #productivity #safety Read on theverge →

arxiv · Jun 12 · bearish

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

arXiv:2606.10931v2 Announce Type: replace Abstract: Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guar

#bias #safety #language-models Read on arxiv →

arxiv · Jun 11 · bullish

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

arXiv:2606.11688v1 Announce Type: cross Abstract: Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended auto

AU RE ST · 3 models #autonomy #honesty #long-horizon Read on arxiv →

techcrunch · Jun 10

Cybersecurity researchers aren’t happy about the guardrails on Anthropic’s Fable

Cybersecurity researchers are complaining that Anthropic's new model Fable has guardrails that are too strict for any cybersecurity work.

FA MY CL · 3 models #cybersecurity #safety #regulation Read on techcrunch →

arxiv · Jun 10

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: cross Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignmen

MI · 1 model #quantization #safety #large-language-models Read on arxiv →

arxiv · Jun 10 · bullish

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

arXiv:2606.09866v1 Announce Type: cross Abstract: Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our diagnostics show ta

LL · 1 model #safety #fine-tuning #language-models Read on arxiv →

arxiv · Jun 6

A Systematic Analysis of Biases in Large Language Models

arXiv:2512.15792v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and resp

#fairness #bias #language-models Read on arxiv →

arxiv · Jun 5

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

arXiv:2606.04778v1 Announce Type: new Abstract: Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens.

#safety #large-language-models #vulnerability Read on arxiv →

arxiv · May 27

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally va

#education #benchmark #video-generation Read on arxiv →

arxiv · May 26

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

arXiv:2605.23940v1 Announce Type: new Abstract: How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent

MU · 1 model #reasoning #benchmark #multi-turn Read on arxiv →

arxiv · May 25

Lipschitz Optimization for Formal Verification of Homographies

arXiv:2605.23203v1 Announce Type: cross Abstract: The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete

#computer-vision #safety #verification Read on arxiv →

arxiv · May 16 · bearish

Quantifying and Mitigating Premature Closure in Frontier LLMs

arXiv:2605.15000v1 Announce Type: cross Abstract: Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate c

LL · 1 model #safety #evaluation #language-models Read on arxiv →

arxiv · May 16

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

arXiv:2605.14246v1 Announce Type: cross Abstract: Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although b

#reinforcement-learning #safety #markov-decision-processes Read on arxiv →

arxiv · May 15

GradShield: Alignment Preserving Finetuning

arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligne

#safety #finetuning #language-models Read on arxiv →

arxiv · May 15 · bearish

NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

arXiv:2605.14381v1 Announce Type: cross Abstract: Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSyn

CL LL TA · 3 models #synthetic data #model evaluation #safety Read on arxiv →

arxiv · May 11

How Value Induction Reshapes LLM Behaviour

arXiv:2605.07925v1 Announce Type: new Abstract: Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility,

#language-models #value-induction #safety Read on arxiv →

arxiv · May 8

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

arXiv:2605.06327v1 Announce Type: cross Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an ob

OL OL MI · 7 models · +4 #safety #benchmark #evaluation Read on arxiv →

arxiv · May 8 · bullish

Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections

arXiv:2605.05402v1 Announce Type: new Abstract: Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temp

DE · 1 model #transportation #computer-vision #safety Read on arxiv →

arxiv · May 8

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

arXiv:2605.06455v1 Announce Type: new Abstract: Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are bri

#language-models #monitoring #safety Read on arxiv →

arxiv · May 7

Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning

arXiv:2605.00364v2 Announce Type: replace Abstract: Machine unlearning has emerged as a critical capability for addressing privacy, safety, and regulatory concerns in large language models (LLMs). Existing methods operate at the sequence level, applying uniform updates across all tokens despite only

LL TO WM · 3 models #machine-unlearning #language-models #privacy Read on arxiv →

theverge · May 5

Google, Microsoft, and xAI will allow the US government to review their new AI models

Google DeepMind, Microsoft, and Elon Musk's xAI have agreed to allow the US government to review new AI models before they're released to the public. In an announcement on Tuesday, the Commerce Department's Center for AI Standards and Innovation (CAISI) says it will work with the AI companies to per

#regulation #standards #innovation Read on theverge →

arxiv · May 4

Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

arXiv:2605.00326v1 Announce Type: new Abstract: Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is cons

ZE · 1 model #safety #benchmark #calibration Read on arxiv →

arxiv · May 1

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment

arXiv:2506.22500v2 Announce Type: replace-cross Abstract: Automated identification of surgical safety risks is critical for improving patient outcomes; however, Multimodal Large Language Models (MLLMs) frequently suffer from Visual-Semantic Knowledge Conflicts (VS-KC), a phenomenon where models poss

#safety #medical #computer-vision Read on arxiv →

arxiv · May 1

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

arXiv:2604.27019v1 Announce Type: cross Abstract: Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbreak robustness, yet do

#safety #language-models #adversarial-training Read on arxiv →

arxiv · May 1 · bullish

From surveillance to signalling: escalation channels as environmental controls for agentic AI

arXiv:2510.05192v2 Announce Type: replace-cross Abstract: When AI agents operating with access to sensitive information encounter a conflict between completing an assigned task and following rules or ethical constraints, they can resort to unsanctioned behaviour. Existing inference time safety work

LL · 1 model #safety #security #ai ethics Read on arxiv →

arxiv · Apr 30 · bullish

Test-Time Safety Alignment

arXiv:2604.26167v1 Announce Type: cross Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion mode

#safety #language-models #optimization Read on arxiv →

arxiv · Apr 24

Survey on Evaluation of LLM-based Agents

arXiv:2503.16416v2 Announce Type: replace Abstract: LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasing

#evaluation #agents #benchmark Read on arxiv →

arxiv · Apr 22

Owner-Harm: A Missing Threat Model for AI Agent Safety

arXiv:2604.18658v1 Announce Type: cross Abstract: Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-wor

AG AG LL · 3 models #safety #security #benchmark Read on arxiv →

arxiv · Apr 21 · bearish

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

arXiv:2604.10577v2 Announce Type: replace-cross Abstract: Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit thre

CL · 1 model #safety #security #benchmark Read on arxiv →

arxiv · Apr 21

Using large language models for embodied planning introduces systematic safety risks

arXiv:2604.18463v1 Announce Type: cross Abstract: Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normativ

#safety #benchmark #robotics Read on arxiv →

arxiv · Apr 18

Tight Sample Complexity Bounds for Best-Arm Identification Under Bounded Systematic Bias

arXiv:2604.14345v1 Announce Type: new Abstract: As search depth increases in autonomous reasoning and embodied planning, the candidate action space expands exponentially, heavily taxing computational budgets. While heuristic pruning is a common countermeasure, it operates without formal safety guara

LL · 1 model #autonomous-reasoning #planning #safety Read on arxiv →

arxiv · Apr 18

Low-Cost System for Automatic Recognition of Driving Pattern in Assessing Interurban Mobility using Geo-Information

arXiv:2604.15216v1 Announce Type: cross Abstract: Mobility in urban and interurban areas, mainly by cars, is a day-to-day activity of many people. However, some of its main drawbacks are traffic jams and accidents. Newly made vehicles have pre-installed driving evaluation systems, which can prevent

AR · 1 model #machine-learning #safety #transportation Read on arxiv →

arxiv · Apr 18 · bullish

Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades

arXiv:2604.14251v1 Announce Type: new Abstract: Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertaint

CA · 1 model #safety #machine-learning #optimization Read on arxiv →

arxiv · Apr 18

Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

arXiv:2509.12833v2 Announce Type: replace Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe env

#reinforcement-learning #safety #optimization Read on arxiv →

arxiv · Apr 17

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

arXiv:2509.05367v4 Announce Type: replace-cross Abstract: Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classification proves insufficient when models encounter ethical dilemmas, where the capacity to reason through

#safety #security #cryptography Read on arxiv →

arxiv · Apr 16

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

arXiv:2509.25843v2 Announce Type: replace Abstract: Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrase

#safety #alignment #jailbreaking Read on arxiv →

arxiv · Apr 10

Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

arXiv:2604.07304v1 Announce Type: cross Abstract: Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a

LA · 1 model #education #programming #assessment Read on arxiv →

arxiv · Apr 10 · bearish

Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models

arXiv:2601.05529v5 Announce Type: replace Abstract: High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complet

GP GE GE · 3 models #navigation #decision making #safety Read on arxiv →

arxiv · Apr 7

Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

arXiv:2604.04237v1 Announce Type: cross Abstract: Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety

#education #reinforcement-learning #safety Read on arxiv →

arxiv · Apr 6 · bearish

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

arXiv:2604.02947v1 Announce Type: new Abstract: Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. T

CL OP IF · 7 models · +4 #safety #benchmark #autonomous agents Read on arxiv →

arxiv · Apr 3 · bearish

How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models

arXiv:2511.06676v2 Announce Type: replace Abstract: Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flag

UN · 1 model #bias #fairness #language-models Read on arxiv →

arxiv · Apr 3 · bearish

Can LLMs Perceive Time? An Empirical Investigation

arXiv:2604.00010v1 Announce Type: cross Abstract: Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4--7$\times$ ($p < 0.001$), with mod

GP · 1 model #language-models #benchmark #safety Read on arxiv →