arxiv
Published June 11, 2026 at 4:00 AM
Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
Publisher summary · verbatim
arXiv:2606.11499v1 Announce Type: cross
Abstract: The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence
Originally published on arxiv ↗