DataBubble
Published June 11, 2026 at 4:00 AM

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

Source: arxiv.org
Publisher summary (verbatim)

arXiv:2606.11499v1 Announce Type: cross Abstract: The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence



Originally published on arxiv