arxiv · 4d ago

Efficient and Privacy Aware Edge Cloud Collaborative Inference for Large Language Models

arXiv:2607.13093v4 Announce Type: replace-cross Abstract: On-device LLM inference faces a trilemma of response latency, limited hardware resources and user privacy. Full cloud inference delivers strong computing power but exposes user prompts and dialogue data, while standalone on-device inference i

#privacy #edge-computing #cryptography Read on arxiv →

arxiv · Jul 1 · bullish

Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

arXiv:2603.19453v3 Announce Type: replace Abstract: We propose an LLM harness that generates code-based policy functions for multi-agent environments, evaluates them with self-play, and refines them using feedback from previous iterations. Following the recent line of work in feedback engineering (t

CL GE · 2 models · #multi-agent #feedback #game-theory Read on arxiv →

arxiv · Jul 1 · bullish

The Past Is Prologue: A Plug-in Controller for Selective Updates in Sequentially Evolving LLM Memory

arXiv:2606.31121v1 Announce Type: new Abstract: Sequentially evolving LLM memory enables agents to reuse past experience, but existing systems usually deploy each locally generated memory update without checking whether it improves future behavior. As a result, updates that help the current task may

JA · 1 model · #llm #memory-management #sequential-learning Read on arxiv →

arxiv · Jun 30 · bullish

Cognitive World Models for Process-Level Social Influence Evaluation

arXiv:2606.29495v1 Announce Type: new Abstract: Social influence dialogue changes user behavior by altering internal cognitive states. The central evaluation question is whether the user's beliefs, desires, intentions, and emotions measurably change over the course of conversation, a process-oriente

CO GP LL · 3 models · #social-influence #dialogue-evaluation #cognitive-modeling Read on arxiv →

techcrunch · Jun 30

Vibe-coding platform Base44 launches own model as AI startups seek defensibility

Wix-owned vibe-coding platform Base44 has started rolling out its own AI model, with hopes that it will eventually outperform frontier models.

BA ME · 2 models · #acquisition #ai #llm Read on techcrunch →

arxiv · Jun 19

Characterizing Narrative Content in Web-scale LLM Pretraining Data

arXiv:2606.19468v1 Announce Type: new Abstract: The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token ope

NA FA DO · 3 models · #narrative-analysis #pretraining #llm Read on arxiv →

arxiv · Jun 18 · bullish

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

arXiv:2508.04086v3 Announce Type: replace Abstract: Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce

TO · 1 model · #llm #dataset #open-source Read on arxiv →

arxiv · Jun 12

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

arXiv:2606.12834v1 Announce Type: new Abstract: As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction

#scientific-workflows #llm #agent-construction Read on arxiv →

arxiv · Jun 12 · bullish

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

arXiv:2606.12916v1 Announce Type: new Abstract: Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even o

MD · 1 model · #molecular-dynamics #llm #pipeline-automation Read on arxiv →

arxiv · Jun 11 · bullish

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

arXiv:2606.11688v1 Announce Type: cross Abstract: Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty, bounding what an agent may claim at termination, as a first-class metric for unattended auto

AU RE ST · 3 models · #autonomy #honesty #long-horizon Read on arxiv →

arxiv · Jun 6 · bullish

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

arXiv:2601.02880v2 Announce Type: replace Abstract: Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free frame

RE ZE SE · 3 models · #inference-time #reasoning #llm Read on arxiv →

arxiv · Jun 4 · bullish

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

arXiv:2606.04272v1 Announce Type: new Abstract: The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to interme

#reinforcement-learning #pre-training #fine-tuning Read on arxiv →

arxiv · May 29 · bullish

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

arXiv:2601.21909v2 Announce Type: replace Abstract: Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach do

#llm #reinforcement-learning #cognitive-architecture Read on arxiv →

arxiv · May 29

Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

arXiv:2605.29018v1 Announce Type: new Abstract: Although a growing body of research has begun to describe user-LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational tr

MI WI · 2 models · #user-behavior #llm #conversational-ai Read on arxiv →

arxiv · May 28 · bullish

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

arXiv:2605.27570v1 Announce Type: new Abstract: Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each se

LA · 1 model · #research #llm #parallel-processing Read on arxiv →

arxiv · May 8

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

arXiv:2605.05726v1 Announce Type: new Abstract: As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption b

#benchmark #llm #retrieval Read on arxiv →

arxiv · May 5

Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

arXiv:2605.00314v1 Announce Type: cross Abstract: An agent skill is a configuration package that equips an LLM-driven agent with a concrete capability, such as reading email, executing shell commands, or signing blockchain transactions. Each skill is a hybrid artifact: a structured half declares exec

LL · 1 model · #security #audit #llm Read on arxiv →

arxiv · Apr 21 · bullish

ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts

arXiv:2604.17648v1 Announce Type: new Abstract: Summarizing deeply nested discussion threads requires handling interleaved replies, quotes, and overlapping topics, which standard LLM summarizers struggle to capture reliably. We introduce ThreadSumm, a multi-stage LLM framework that treats thread sum

TH LL · 2 models · #summarization #llm #discussion-threads Read on arxiv →

arxiv · Apr 16

PrivacyReasoner: Can LLM Emulate a Human-like Privacy Mind?

arXiv:2601.09152v2 Announce Type: replace Abstract: Prior work on LLM-based privacy focuses on norm judgment over synthetic vignettes, rather than how people think about a specific data practice and formulate their opinions. We address this gap by designing PrivacyReasoner, an agent architecture gro

LL PR · 2 models · #privacy #llm #artificial-intelligence Read on arxiv →

arxiv · Apr 7

TimeSeek: Temporal Reliability of Agentic Forecasters

arXiv:2604.04220v1 Announce Type: new Abstract: We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with a

TI · 1 model · #benchmark #forecasting #evaluation Read on arxiv →