clip-vit-large-patch14 news

46 articles mentioning clip-vit-large-patch14

arxiv12h ago

AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping

arXiv:2502.11034v4 Announce Type: replace Abstract: Loss spikes remain a persistent obstacle in large-scale language model pretraining. While previous research has attempted to identify the root cause of loss spikes by investigating individual factors, we observe that, in practice, such spikes are t

arxiv1d ago

Brain-CLIPLM: Semantic Compression for EEG-to-Text Decoding

arXiv:2604.16370v3 Announce Type: replace-cross Abstract: Decoding natural language from non-invasive electroencephalography (EEG) remains constrained by low signal-to-noise ratio and limited information bandwidth. This raises a central question: can sentence-level language be reliably recovered fro

arxiv1d ago

Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

arXiv:2502.06818v4 Announce Type: replace Abstract: Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP

arxiv5d ago

The Hyperspherical Geometry of CLIP Latent Space: A Semantic Mixture Model

arXiv:2607.13660v1 Announce Type: new Abstract: Contrastive Language-Image Pretraining (CLIP) representations form a semantic embedding space governed by cosine similarity, reflecting an intrinsic hyperspherical geometry. However, existing probabilistic interpretations typically rely on Gaussian ass

arxivJul 14

SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception

arXiv:2607.11008v1 Announce Type: cross Abstract: Open-vocabulary dense perception (OVDP) aims to localize objects unseen during training by leveraging textual knowledge. Despite the remarkable progress of recent CLIP-based approaches, we identify a critical limitation: synonym-induced grounding inc

arxivJul 14

Beyond Euclidean Clipping: Overcoming Exploration Collapse in LLM RL via Riemannian Isometric Policy Optimization

arXiv:2607.10169v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a dominant paradigm for enhancing LLMs' reasoning capabilities. However, RL algorithms with PPO-Clip are inherently limited by exploration collapse. Subsequent works remain primarily heuristic and fail to identi

arxivJul 10

Vanilla SGD with Momentum Survives Heavy-Tailed Noise: Convergence Analysis without Gradient Clipping or Normalization

arXiv:2607.08104v1 Announce Type: new Abstract: Stochastic gradient descent (SGD) is a cornerstone of modern optimization. While its performance under heavy-tailed noise is often addressed through specialized modifications such as gradient clipping or normalization, we investigate a more fundamental

arxivJul 2

Selective Test-Time Debiasing for CLIP via Reward Gating

arXiv:2607.00423v1 Announce Type: new Abstract: Vision language models (VLMs) demonstrate strong zero-shot performance, but often perpetuate social stereotypes in person-centric queries, yielding skewed demographic distributions. Current debiasing methods apply uniform bias corrections across all in

thevergeJun 30

Google’s NotebookLM can sum up your research in a TikTok-style clip

Google's NotebookLM is adding a new way to catch up on your notes: TikTok-style AI videos. The new feature is rolling out to Google AI Ultra and Pro subscribers, allowing NotebookLM to generate 60-second vertical AI clips based on the sources you upload to the app. The example shared by Google detai

arxivJun 30

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

arXiv:2606.29586v1 Announce Type: cross Abstract: Vision-language foundation models have shown strong potential in medical image analysis. Although foundation models for ultrasound imaging have recently emerged, the domain remains particularly challenging due to severe speckle noise, acquisition var

arxivJun 30

CytoCLIP: Learning Cytoarchitectural Characteristics in Developing Human Brain Using Contrastive Language Image Pre-Training

arXiv:2601.12282v2 Announce Type: replace-cross Abstract: The functions of different regions of the human brain are closely linked to their distinct cytoarchitecture, which is defined by the spatial arrangement and morphology of the cells. Identifying brain regions by their cytoarchitecture enables

arxivJun 27

ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP

arXiv:2606.26794v1 Announce Type: cross Abstract: CLIP and its variants are widely adopted visual backbones in multimodal systems, but their pretraining remains dominated by descriptive image-text alignment. As downstream applications increasingly demand visually grounded commonsense inference and c

arxivJun 25

BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning

arXiv:2511.11421v2 Announce Type: replace-cross Abstract: Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them

arxivJun 24

KLip-PPO: A per-sample KL perspective on PPO-Clip

arXiv:2606.23932v1 Announce Type: new Abstract: Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback-

arxivJun 24

Robust and Fast Training via Per-Sample Clipping

arXiv:2605.02701v2 Announce Type: replace-cross Abstract: We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectati

arxivJun 17

Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors

arXiv:2606.17815v1 Announce Type: cross Abstract: Contrastive Language-Image Pre-training models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor, however, usually validate attacks on a small attack-native task,

arxivJun 15

What Drives Test-Time Adaptation for CLIP? A Controlled Empirical Study from an Update Perspective

arXiv:2606.14299v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently be

arxivJun 12

Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

arXiv:2506.01396v2 Announce Type: replace Abstract: Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is of

arxivJun 5

DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum

arXiv:2606.05435v1 Announce Type: new Abstract: Differentially private stochastic gradient descent (DP-SGD) has become the standard framework for privacy-preserving machine learning, yet its reliance on a fixed gradient clipping threshold to limit sensitivity remains a significant practical limitati

arxivJun 4

Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

arXiv:2602.05657v2 Announce Type: replace Abstract: The study of tail behaviour of SGD-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the err

arxivJun 2

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models

arXiv:2605.13178v2 Announce Type: replace-cross Abstract: In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens

arxivJun 2

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

arXiv:2606.02111v1 Announce Type: cross Abstract: As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inpu

arxivJun 2

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

arXiv:2512.12997v2 Announce Type: replace-cross Abstract: CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty cali

arxivJun 2

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

arXiv:2606.00172v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-r

arxivJun 1

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

arXiv:2603.19862v2 Announce Type: replace-cross Abstract: Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieva

arxivMay 29

Enhancing LLM Training via Spectral Clipping

arXiv:2603.14315v2 Announce Type: replace Abstract: While spectral-based optimizers like Muon operate directly on the spectrum of updates, standard adaptive methods such as AdamW do not account for the spectral structure of weights and gradients, leaving them vulnerable to two empirical issues in la

arxivMay 28

Adaptive Sampling and Clipping for Private Worst-Case Group Optimization

arXiv:2602.10820v2 Announce Type: replace Abstract: A central requirement for the acceptance of machine learning methods for human-centric tasks is that they should be fair, in the sense that they should work comparably well for individuals from different societal groups. A second, equally important

arxivMay 28

Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

arXiv:2605.27733v1 Announce Type: new Abstract: Training instabilities such as loss spikes are frequently the result of stochastic gradient noise. Because of rare expressions in language training data, and multiple layer composition, the noise impact is heavy-tailed and survives mini-batch averaging

arxivMay 28

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

arXiv:2605.28809v1 Announce Type: cross Abstract: Class-Incremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing similarity between visual and textual embeddings obtained from template prompts, e.g., ``a photo

arxivMay 27

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

arXiv:2601.12809v2 Announce Type: replace-cross Abstract: Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how le

arxivMay 27

MuCon: Clipped Muon Updates for LLM Training

arXiv:2605.26459v1 Announce Type: new Abstract: Muon-style optimizers take a matrix-valued momentum or preconditioned update $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$ and replace it with its canonical partial polar factor $\operatorname{Pol}(B) = U V^\top$. This maps every nonzero

arxivMay 27

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

arXiv:2605.26415v1 Announce Type: cross Abstract: Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumul

arxivMay 26

Posture Clip: Sit properly or I wont let you work

arXiv:2605.25664v1 Announce Type: cross Abstract: Poor posture is a significant concern due to its detrimental effects on health and productivity. This paper presents a collar-clipped device called PostureClip, designed to restrict users from sitting and working at a bent angle, by blacking out the

arxivMay 26

FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

arXiv:2510.10921v3 Announce Type: replace-cross Abstract: Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perfor

arxivMay 26

Efficient DP-SGD for LLMs with Randomized Clipping

arXiv:2605.24879v1 Announce Type: new Abstract: Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy pr

arxivMay 25

MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

arXiv:2510.26411v2 Announce Type: replace Abstract: Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language

arxivMay 22

CacheClip: Accelerating RAG with Effective KV Cache Reuse

arXiv:2510.10129v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely

arxivMay 22

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

arXiv:2605.22703v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-base

thevergeMay 21

AI video is moving beyond clip slop

This is Lowpass by Janko Roettgers, a newsletter on the ever-evolving intersection of tech and entertainment, syndicated just for The Verge subscribers once a week. Hollywood is cooked - or so a growing number of people on social media would like you to believe. Their purported proof: AI-generated c

arxivMay 20

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

arXiv:2605.19359v1 Announce Type: cross Abstract: Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of m

arxivMay 19

t-gems: text-guided exit modules for decreasing clip image encoder

arXiv:2605.17499v1 Announce Type: new Abstract: Multimodal deep neural networks enhance deep comprehension by integrating diverse data modalities. Data from different modalities are typically projected into a shared latent space for similarity computation, but this process is resource intensive due

arxivMay 19

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

arXiv:2512.23178v3 Announce Type: replace-cross Abstract: Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded ${\f

arxivMay 16

Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

arXiv:2605.14893v1 Announce Type: cross Abstract: Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, w

arxivMay 13

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

arXiv:2605.11838v1 Announce Type: new Abstract: Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically

arxivMay 13

DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models

arXiv:2605.10272v1 Announce Type: cross Abstract: Federated learning (FL) enables the collaborative training of large-scale language models (LLMs) across edge devices while keeping user data on-device. However, FL still exposes sensitive information through client-provided gradients. Differentially

arxivMay 12

Phases of Muon: When Muon Eclipses SignSGD

arXiv:2605.09552v1 Announce Type: cross Abstract: Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, includin
