arxiv
Published May 27, 2026 at 4:00 AM
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
Publisher summary · verbatim
arXiv:2605.26133v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training
Originally published on arxiv ↗