arxiv21h ago
arXiv:2606.11284v1 Announce Type: cross Abstract: Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an e
arxiv21h agobullish
arXiv:2505.13196v3 Announce Type: replace-cross Abstract: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algori
arxiv1d agobullish
arXiv:2606.11119v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, aris
arxiv1d agobullish
arXiv:2606.09896v1 Announce Type: cross Abstract: In modern online advertising platforms, Guaranteed Delivery (GD) contracts coexist and bid with Real-Time Bidding (RTB) auctions. Recent approaches either decouple GD and RTB optimization or rely on heuristic priority rules, and thus fail to effectiv
arxiv1d agobullish
arXiv:2606.10286v1 Announce Type: new Abstract: Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed-Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational
arxiv1d agobullish
arXiv:2606.09879v1 Announce Type: new Abstract: This study addresses on-device inference bottlenecks of Transformer models on Tenstorrent's Tensix architecture and proposes an operator fusion strategy that enhances data locality. RMSNorm is fused with matrix multiplication in self-attention and in t
arxiv5d agobullish
arXiv:2606.05688v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, h
arxivJun 3
arXiv:2509.08726v3 Announce Type: replace-cross Abstract: This paper focuses on the decentralized stochastic optimization problem $f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^m f_i(\mathbf{x})$ over a connected network of $n$ agents, where each local function has the form of $f_i(\mathbf{x}) = {\mathbb E}\l
arxivJun 3bullish
arXiv:2606.02684v1 Announce Type: cross Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most info
arxivJun 3bullish
arXiv:2606.03113v1 Announce Type: new Abstract: Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We reframe this optimization
arxivJun 3bullish
arXiv:2606.02679v1 Announce Type: new Abstract: Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local feature responses within t
arxivJun 2
arXiv:2606.00371v1 Announce Type: new Abstract: Muon optimizers improve neural-network training by replacing ill-conditioned momentum updates with approximately semi-orthogonal updates. This motivates a practical question: how much orthogonalization does Muon actually require? We study this question
arxivJun 2bullish
arXiv:2605.12969v3 Announce Type: replace-cross Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy
arxivJun 2bullish
arXiv:2510.05342v2 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing over
arxivJun 2bullish
arXiv:2606.00618v1 Announce Type: new Abstract: Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more eff
arxivMay 29
arXiv:2605.29913v1 Announce Type: cross Abstract: This paper investigates a multi-user indoor integrated sensing and communication (ISAC) system operating in the terahertz (THz) band, designed for adaptive communication based on gesture recognition. Leveraging gesture tracking through an extended Ka
arxivMay 29
arXiv:2510.10020v4 Announce Type: replace-cross Abstract: Generative models frequently suffer miscalibration, wherein statistics of the sampling distribution, such as the fraction of generations in a given class, deviate from desired values. We frame calibration as a constrained optimization problem
arxivMay 29bullish
arXiv:2605.29843v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-base
arxivMay 28
arXiv:2605.27739v1 Announce Type: cross Abstract: Deep neural network training often exhibits highly anisotropic loss geometry, where a few sharp dominant Hessian directions coexist with a large flatter bulk. Gradients tend to align disproportionately with these dominant directions, although stable
arxivMay 26bullish
arXiv:2510.07343v3 Announce Type: replace-cross Abstract: Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from $p(x_0 \mid y)$. While posterior sampling is valuable for capturing uncertainty and multi-modality, many classical and practical i
arxivMay 26
arXiv:2605.26072v1 Announce Type: new Abstract: Efficient learning of user preferences is crucial for many modern decision making systems but typically requires costly labeled data. Active learning reduces this cost, yet standard methods are computationally expensive due to pool-based evaluation. Fu
arxivMay 25bullish
arXiv:2601.17261v4 Announce Type: replace Abstract: Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically emp
arxivMay 25bullish
arXiv:2603.01655v2 Announce Type: replace Abstract: Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the interaction order. This bottleneck li
arxivMay 22bullish
arXiv:2605.21883v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existi
arxivMay 22bullish
arXiv:2508.09001v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory foot
arxivMay 21bullish
arXiv:2605.19781v1 Announce Type: new Abstract: Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update
arxivMay 21bullish
arXiv:2605.19916v1 Announce Type: cross Abstract: We introduce Contrastive FUSE, a fast and unified framework for scalable node representation learning in graphs with partially available pairwise node labels and no available node features. Unlike existing methods, we directly optimize a spectral con
arxivMay 19bullish
arXiv:2605.17137v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) into evolutionary frameworks has established a new paradigm for automated heuristic discovery. Despite their promise, these methods typically search in the discrete space of program syntax, relying on sto
arxivMay 19
arXiv:2605.13900v2 Announce Type: replace-cross Abstract: In large-scale multi-agent systems with shared resource constraints, an upstream planner must iteratively evaluate candidate resource plans -- assessing feasibility, aggregate response, and marginal cost -- before committing to one. Lagrangia
arxivMay 18
arXiv:2509.01685v2 Announce Type: replace-cross Abstract: We consider sampling from a Gibbs distribution by evolving finitely many particles. We propose a preconditioned version of a recently proposed noise-free sampling method, governed by approximating the score function with the numerically tract
arxivMay 18
arXiv:2605.15530v1 Announce Type: new Abstract: Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechani
arxivMay 16bullish
arXiv:2605.14773v1 Announce Type: cross Abstract: Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing
arxivMay 16
arXiv:2602.14881v2 Announce Type: replace-cross Abstract: We introduce a novel numerical framework for the exploration of Blaschke--Santal\'o diagrams, which are efficient tools characterizing the possible inequalities relating some given shape functionals. We introduce a parametrization of convex b
arxivMay 16
arXiv:2603.11729v3 Announce Type: replace-cross Abstract: In recent years, RAPTOR based algorithms have been considered the state-of-the-art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-base
arxivMay 15
arXiv:2605.14142v1 Announce Type: cross Abstract: Integration against a probability distribution given its unnormalized density is a central task in Bayesian inference and other fields. We introduce new methods for approximating such expectations with a small set of weighted samples -- i.e., a quadr
arxivMay 15
arXiv:2510.25240v3 Announce Type: replace-cross Abstract: We present a general strategy for turning generative models into candidate solution samplers for batch Bayesian optimization (BO). The use of generative models for BO enables large batch scaling as generative sampling, optimization of non-con
arxivMay 14bullish
arXiv:2605.12879v1 Announce Type: new Abstract: Doubly-stochastic attention has emerged as a transport-based alternative to row-softmax attention, with recent Transformer variants using it to reduce attention sinks and rank collapse while improving performance. In this family, the standard approach
arxivMay 14bullish
arXiv:2605.13784v1 Announce Type: new Abstract: Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-
arxivMay 13
arXiv:2602.08606v3 Announce Type: replace-cross Abstract: Motivated by applications in conditional sampling, given a probability measure $\mu$ and a diffeomorphism $\phi$, we consider the problem of simultaneously approximating $\phi$ and the pushforward $\phi_{\#}\mu$ by means of the flow of a cont
arxivMay 11
arXiv:2605.06821v1 Announce Type: cross Abstract: Cohen et al. (arXiv:2207.14484) observed that adaptive gradient methods such as Adam operate at the edge of stability. While there has been significant work on continuous-time modeling of gradient descent at the edge of stability, extending these mod
arxivMay 11bullish
arXiv:2604.24013v2 Announce Type: cross Abstract: The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhe
arxivMay 8
arXiv:2605.06469v1 Announce Type: cross Abstract: Self-optimizing control is a strategy for selecting controlled variables, where the economic objective guides the selection and design of controlled variables, with the expectation that maintaining the controlled variables at constant values can achi
arxivMay 7bullish
arXiv:2507.23501v2 Announce Type: replace Abstract: Ensembles are ubiquitous in off-policy actor-critic learning, yet their efficacy depends critically on how they are aggregated. Current methods typically rely on static rules or task-specific hyperparameters to balance overestimation bias and varia
arxivMay 5bullish
arXiv:2604.19021v2 Announce Type: replace Abstract: Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta
arxivMay 4
arXiv:2604.17423v2 Announce Type: replace Abstract: A unified framework for first-order optimization algorithms fornonconvex unconstrained optimization is proposed that uses adaptivelypreconditioned gradients and includes popular methods such as full anddiagonal AdaGrad, AdaNorm, as well as adpative
arxivMay 4bullish
arXiv:2605.00641v1 Announce Type: new Abstract: Both Dimensionality Reduction (DR) and Graph Drawing (GD) aim to visualize abstract, non-linear structures, yet rely on different optimization paradigms. This contrast is evident in Multidimensional Scaling (MDS), which typically depends on the SMACOF
arxivMay 1bullish
arXiv:2602.05371v3 Announce Type: replace Abstract: Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present
arxivApr 30bullish
arXiv:2508.19900v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL t
arxivApr 30bullish
arXiv:2508.06550v3 Announce Type: replace-cross Abstract: Bid shading plays a crucial role in Real-Time Bidding (RTB) by adaptively adjusting the bid to avoid advertisers overspending. Existing mainstream two-stage methods, which first model bid landscapes and then optimize surplus using operations
arxivApr 30bullish
arXiv:2604.26167v1 Announce Type: cross Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion mode
arxivApr 29bullish
arXiv:2604.22881v1 Announce Type: cross Abstract: Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization
arxivApr 29bullish
arXiv:2510.23215v2 Announce Type: replace-cross Abstract: Eigenvalue problems are among the most important topics in many scientific disciplines. With the recent surge and development of machine learning, neural eigenvalue methods have attracted significant attention as a forward pass of inference r
arxivApr 29bullish
arXiv:2604.14989v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. Their evaluation set
arxivApr 27bullish
arXiv:2604.22405v1 Announce Type: new Abstract: K-plane clustering (KPC), hyperplane clustering, and mixture regression all essentially fall within the same class of problems. This problem can be conceptualized as clustering in relatively high-dimensional K subspaces or K linear manifolds. Tradition
arxivApr 27bullish
arXiv:2604.21961v1 Announce Type: cross Abstract: Optimization problems are fundamental in diverse fields, such as engineering, economics, and scientific computing. However, current algorithms are mostly designed for specific problem types and exhibit limited generality in solving multiple types of
arxivApr 23bullish
arXiv:2308.03303v2 Announce Type: replace Abstract: Fine-tuning large language models (LLMs) is crucial for improving their performance on downstream tasks, but full-parameter fine-tuning (Full-FT) is computationally expensive and memory-intensive. Parameter-efficient fine-tuning (PEFT) methods, suc
arxivApr 23
arXiv:2604.20556v1 Announce Type: cross Abstract: Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and
arxivApr 18bullish
arXiv:2604.14251v1 Announce Type: new Abstract: Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertaint
arxivApr 18
arXiv:2509.12833v2 Announce Type: replace Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe env
arxivApr 17bullish
arXiv:2603.13683v2 Announce Type: replace Abstract: Although debiased large language models (LLMs) excel at handling known or low-bias prompts, they often fail on unfamiliar and high-bias prompts. We demonstrate via out-of-distribution (OOD) detection that these high-bias prompts cause a distributio
arxivApr 16bullish
arXiv:2507.09503v4 Announce Type: replace-cross Abstract: This paper proposes a neural stochastic optimization method for efficiently solving the two-stage stochastic unit commitment (2S-SUC) problem under high-dimensional uncertainty scenarios. The proposed method approximates the second-stage reco
arxivApr 15
arXiv:2505.23737v2 Announce Type: replace-cross Abstract: The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structu
arxivApr 14bullish
arXiv:2604.11109v1 Announce Type: cross Abstract: As high-performance computing and AI workloads become increasingly dependent on GPUs, maintaining high performance across rapidly evolving hardware generations has become a major challenge. Developers often spend months tuning scientific applications
arxivApr 13bullish
arXiv:2604.08584v1 Announce Type: cross Abstract: Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain Q&A, pushing attention and KV-cache to become the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it ofte
arxivApr 10bullish
arXiv:2510.24561v2 Announce Type: replace-cross Abstract: LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based
arxivApr 10bullish
arXiv:2512.00170v2 Announce Type: replace Abstract: Existing high-dimensional Bayesian optimization (BO) methods aim to overcome the curse of dimensionality by carefully encoding structural assumptions, from locality to sparsity to smoothness, into the optimization procedure. Surprisingly, we demons
arxivApr 10bullish
arXiv:2604.06779v1 Announce Type: new Abstract: We introduce Fleming-Viot Diffusion (FVD), an inference-time alignment method that resolves the diversity collapse commonly observed in Sequential Monte Carlo (SMC) based diffusion samplers. Existing SMC-based diffusion samplers often rely on multinomi
arxivApr 10bullish
arXiv:2604.07663v1 Announce Type: new Abstract: The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedd
arxivApr 9bullish
arXiv:2604.06836v1 Announce Type: new Abstract: Quantization is an effective way to reduce the memory cost of large-scale model training. However, most existing methods adopt fixed-precision policies, which ignore the fact that optimizer-state distributions vary significantly across layers and train
arxivApr 8bullish
arXiv:2509.22630v2 Announce Type: replace-cross Abstract: Recurrent neural networks (RNNs), such as linear attention and state-space models, have gained popularity due to their constant per-token complexity when processing long contexts. However, these recurrent models struggle with tasks that requi
arxivApr 8bullish
arXiv:2604.05887v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to
arxivApr 7bullish
arXiv:2603.13842v2 Announce Type: replace-cross Abstract: End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) thro
arxivApr 7bullish
arXiv:2601.22776v2 Announce Type: replace Abstract: Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on s
arxivApr 6bullish
arXiv:2604.02869v1 Announce Type: new Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Rela
arxivApr 6
arXiv:2602.16746v3 Announce Type: replace Abstract: Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention
arxivApr 6
arXiv:2604.02558v1 Announce Type: new Abstract: We address nonconvex learning problems over undirected networks. In particular, we focus on the challenge of designing an algorithm that is both communication-efficient and that guarantees the privacy of the agents' data. The first goal is achieved thr
arxivApr 4
arXiv:2604.01938v1 Announce Type: new Abstract: The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permuta
arxivApr 2bullish
arXiv:2603.22455v4 Announce Type: replace Abstract: Reusable skills let LLM agents package task-specific procedures, tool affordances, and execution guidance into modular building blocks. As skill ecosystems grow to tens of thousands of entries, exposing every skill at inference time becomes infeasi
arxivApr 2bullish
arXiv:2604.00260v1 Announce Type: new Abstract: Shuffling strategies for stochastic gradient descent (SGD), including incremental gradient, shuffle-once, and random reshuffling, are supported by rigorous convergence analyses for arbitrary within-epoch permutations. In particular, random reshuffling
arxivApr 2bullish
arXiv:2601.12604v2 Announce Type: replace Abstract: Policy gradient methods are known to be highly sensitive to the choice of policy parameterization. In particular, the widely used softmax parameterization can induce ill-conditioned optimization landscapes and lead to exponentially slow convergence
arxivApr 2
arXiv:2604.01117v1 Announce Type: new Abstract: Dependency networks (Heckerman et al., 2000) provide a flexible framework for modeling complex systems with many variables by combining independently learned local conditional distributions through pseudo-Gibbs sampling. Despite their computational adv