Model Detail
clip-vit-large-patch14
clip-vit-large-patch14 is an AI model with 214M parameters released by OpenAI. The model is registered under the zero-shot-image-classification pipeline tag on Hugging Face.
Downloads of clip-vit-large-patch14 are up 35.1% over the trailing thirty days. These numbers are a signal, not a guarantee: week-over-week download counts on Hugging Face also reflect mirror traffic, CI scrapes, and one-off benchmarking runs.
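The page does not say how the thirty-day figure is computed. As a minimal sketch under one common assumption (the most recent 30 days compared with the 30 days before them), the arithmetic could look like the following. The function names and the daily counts are hypothetical and chosen only for illustration.

```python
def trailing_change_pct(daily_downloads, window=30):
    """Percent change of the latest `window` days versus the `window` days before.

    Assumption: this two-window comparison is only one plausible definition;
    the page does not document its own method.
    """
    if len(daily_downloads) < 2 * window:
        raise ValueError("need at least two full windows of daily counts")
    recent = sum(daily_downloads[-window:])
    previous = sum(daily_downloads[-2 * window:-window])
    if previous == 0:
        raise ValueError("previous window has no downloads to compare against")
    return (recent - previous) / previous * 100.0


def describe_trend(change_pct):
    """Label the sign of the change; a positive value is an uptrend."""
    if change_pct > 0:
        return "uptrend"
    if change_pct < 0:
        return "downtrend"
    return "flat"


# Made-up daily counts: 30 days at 1000/day, then 30 days at 1351/day.
daily = [1000] * 30 + [1351] * 30
change = trailing_change_pct(daily)
print(f"{change:+.1f}% ({describe_trend(change)})")
```

Because raw counts include mirror traffic and automated scrapes, a figure like this is best read alongside other usage signals rather than on its own.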
clip-vit-large-patch14 is best suited to workloads that match its zero-shot-image-classification pipeline tag, that is, assigning an image to one of a set of labels supplied at inference time without task-specific fine-tuning. Treat this as a starting point rather than a benchmark verdict: the right deployment choice depends on an evaluation suite that mirrors your workload.
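For readers who want to try the pipeline tag mentioned above, here is a hedged sketch using the Hugging Face `transformers` pipeline API. It assumes the model is published under the repository id `openai/clip-vit-large-patch14` (the page gives only the model name and the releasing organization) and that `transformers`, `torch`, and `Pillow` are installed. The helper names, the image path, and the candidate labels are hypothetical.

```python
def classify_image(image, candidate_labels,
                   model_id="openai/clip-vit-large-patch14"):
    """Score `image` against free-text labels with a zero-shot pipeline.

    `image` may be a local path, a URL, or a PIL image. `model_id` is an
    assumed repository id, not one stated on this page.
    """
    # Imported lazily so top_label() below works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("zero-shot-image-classification", model=model_id)
    # Returns a list of {"score": float, "label": str} dicts.
    return classifier(image, candidate_labels=candidate_labels)


def top_label(results):
    """Return the label with the highest score from the pipeline output."""
    return max(results, key=lambda r: r["score"])["label"]


# Example usage (downloads the model weights on first run; path is hypothetical):
# results = classify_image("example.jpg", ["a photo of a cat", "a photo of a dog"])
# print(top_label(results))
```

The candidate labels act as the class set, so wording them the way your real workload would ("a photo of a ...") is part of the evaluation and worth testing directly.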
Brain-CLIPLM: Semantic Compression for EEG-to-Text Decoding
arXiv:2604.16370v2. Decoding natural language from non-invasive electroencephalography (EEG) remains constrained by low signal-to-noise ratio and limited information bandwidth. This raises a central question: can sentence-level language be reliably recovered from such
DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum
arXiv:2606.05435v1. Differentially private stochastic gradient descent (DP-SGD) has become the standard framework for privacy-preserving machine learning, yet its reliance on a fixed gradient clipping threshold to limit sensitivity remains a significant practical limitati
Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
arXiv:2602.05657v2. The study of tail behaviour of SGD-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the err
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
arXiv:2605.13178v2. In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens
Jailbreaking Multimodal Large Language Models using Multi-Clip Video
arXiv:2606.02111v1. As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inpu
Calibrating Uncertainty for Zero-Shot Adversarial CLIP
arXiv:2512.12997v2. CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty cali