DataBubble
arxiv
PublishedMay 8, 2026 at 4:00 AM

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

Source
arxiv.org (full article)
Publisher summary· verbatim

arXiv:2605.06327v1 Announce Type: cross Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an ob
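The summary above is cut off before the paper's definition, so the following is not the authors' metric. It is only a minimal sketch of the general idea the title and abstract point to: score the same underlying request under an evaluation-looking framing and a deployment-looking framing, then compare the behavior rates across the pairs. The `PairedOutcome` schema, the `paired_rate_gap` helper, and the example outcomes are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedOutcome:
    """One underlying request scored under two framings (hypothetical schema)."""
    eval_framed: bool    # behavior observed (e.g. a refusal) on the evaluation-looking prompt
    deploy_framed: bool  # same behavior observed on the deployment-looking prompt


def paired_rate_gap(pairs):
    """Return (eval-framed rate minus deploy-framed rate, count of discordant pairs).

    This is a generic paired difference, assumed for illustration;
    it is not taken from the paper.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    n = len(pairs)
    eval_rate = sum(p.eval_framed for p in pairs) / n
    deploy_rate = sum(p.deploy_framed for p in pairs) / n
    discordant = sum(p.eval_framed != p.deploy_framed for p in pairs)
    return eval_rate - deploy_rate, discordant


# Made-up outcomes for illustration only; not data from the paper.
example = [
    PairedOutcome(True, True),
    PairedOutcome(True, False),
    PairedOutcome(False, False),
    PairedOutcome(True, False),
]
gap, discordant = paired_rate_gap(example)
print(gap, discordant)
```

A nonzero gap on real paired data would suggest that a benchmark score reflects the framing as well as the request, which is the fragility the abstract describes. Any serious use would also need a significance test over the discordant pairs.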

Models mentioned (2)
  • Llama-3.1-8B (meta-llama/Llama-3.1-8B): DL 1.3M · 0.0% · IN $0.10/Mtok
  • Llama-3.1-70B (meta-llama/Llama-3.1-70B)
Mentioned models (7)
  • OLMo-3
  • OLMo-3-Instruct
  • Mistral-Small-3.2
  • Phi-3.5-mini
  • Llama-3.1-8B (meta-llama/Llama-3.1-8B, 1.3M downloads)
  • Llama-3.1-70B (meta-llama/Llama-3.1-70B)
  • Llama-Guard-3-8B
Tags
#safety #benchmark #evaluation #language models


Originally published on arxiv ↗