arxiv · 4h ago

Evaluation of Blood Vessel Segmentation Methods on Hard-to-Detect Vascular Structures

arXiv:2406.13128v2 Announce Type: replace-cross Abstract: Due to the intricate structure of vascular trees, minor segmentation errors can significantly alter connectivity patterns and increase variability in extracted morphological properties. Global metrics such as the Dice coefficient, precision,

#segmentation #computer-vision #machine-learning

arxiv · 6d ago · bullish

Vidu S1: A Real-Time Interactive Video Generation Model

arXiv:2607.03118v2 Announce Type: replace-cross Abstract: We introduce Vidu S1, a real-time interactive video generation model supporting voice control of digital characters. Users can control video generation content at any moment through voice instructions. Vidu S1 supports infinite-length real-ti

VI TU TU · 3 models · #video-generation #real-time #computer-vision

arxiv · Jul 21 · bullish

Distributed solar generation forecasting using attention-based deep neural networks for cloud movement prediction

arXiv:2411.10921v2 Announce Type: replace Abstract: Accurate forecasts of distributed solar generation are necessary to maintain grid stability amid the increased uptake of distributed solar photovoltaic (PV) systems. However, the high variability of solar generation over short time intervals (secon

CO SE · 2 models · #machine-learning #computer-vision #renewable-energy

arxiv · Jul 21 · bullish

Certified Training for Convolutional Perturbations

arXiv:2607.18195v1 Announce Type: cross Abstract: Vision models have been found to be susceptible to perturbations such as motion blur induced at runtime by a shaking camera. This impedes their deployment in critical applications since phenomena such as slightly blurred vision might lead to failures

#computer-vision #robustness #adversarial-training

arxiv · Jul 18

Depth-Dependent Hidden-State Collapse in Dynamical System Autoencoders for LiDAR Point-Cloud Classification

arXiv:2607.14463v1 Announce Type: new Abstract: We study Dynamical System Autoencoders (DSAE) for LiDAR point-cloud classification using spatial coordinates and Product Coefficient feature augmentations. The experiments compare separately trained DSAE architectures at encoder depths $K=1,\ldots,5$ a

DY · 1 model · #machine-learning #computer-vision #classification

arxiv · Jul 18 · bullish

A vision foundation model for single-cell biology via spatial gene cartography

arXiv:2607.14163v1 Announce Type: cross Abstract: Most single-cell foundation models are adapted from language models, representing each cell as a sequence of gene tokens. This discards the relationships among genes and often the magnitude of their expression. We present scVision, a vision foundatio

SC · 1 model · #single-cell #computer-vision #machine-learning

arxiv · Jul 16 · bullish

BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping

arXiv:2510.04876v3 Announce Type: replace-cross Abstract: Benthic habitat mapping is fundamental for understanding marine ecosystems, guiding conservation efforts, and supporting sustainable resource management. Yet, the scarcity of large, annotated datasets limits the development and benchmarking o

#machine-learning #dataset #computer-vision

arxiv · Jul 2 · bullish

Identifying Latent Concepts and Structures for Generalized Category Discovery

arXiv:2607.00620v1 Announce Type: cross Abstract: Generalized Category Discovery (GCD) aims to recognize known classes while autonomously discovering novel ones in open-world settings. However, current approaches primarily focus on designing clustering objectives, often overlooking a critical bottle

CO · 1 model · #computer-vision #representation-learning #open-world-recognition

arxiv · Jul 1 · bullish

Rethinking Garment Conditioning in Diffusion-based Virtual Try-On: Decouple, Don't Denoise

arXiv:2511.18775v2 Announce Type: replace-cross Abstract: Virtual Try-On (VTON) synthesizes realistic images of a person wearing a target garment, with broad applications in e-commerce and fashion. Diffusion-based dual-UNet methods achieve strong results but double the parameters by dedicating a sep

DE DU UN · 4 models · +1 · #computer-vision #research #state-of-the-art

arxiv · Jun 30 · bullish

InsertAnywhere: Geometrically Grounded and Optics-Aware Video Object Insertion

arXiv:2512.17504v2 Announce Type: replace-cross Abstract: Recent advances in diffusion models have enabled impressive video editing capabilities, yet production-grade Video Object Insertion (VOI) remains challenging due to inadequate 4D scene understanding and a lack of proper optical interactions,

#video-editing #computer-vision #generative-models

arxiv · Jun 30

Pose-Based Fall Detection System: Efficient Monitoring on Standard CPUs

arXiv:2503.19501v2 Announce Type: replace-cross Abstract: Falls among elderly residents in assisted living homes pose significant health risks, often leading to injuries and a decreased quality of life. Current fall detection solutions typically rely on sensor-based systems that require dedicated ha

ME · 1 model · #computer-vision #pose-estimation #fall-detection

arxiv · Jun 30 · bullish

LLM-based Multimodal Personality Recognition via Facial Action Unit-Text Semantic Fusion

arXiv:2606.29900v1 Announce Type: cross Abstract: Personality recognition in asynchronous video interviews (AVIs) has become increasingly important due to their widespread adoption in modern recruitment. Existing approaches often rely on large language models (LLMs) to analyze textual responses of i

LA · 1 model · #computer-vision #personality-recognition #multimodal

arxiv · Jun 29 · bullish

Not All Relations Rotate Alike: Transformation-Aware Decoupling for Viewpoint-Robust 3D Scene Graph Generation

arXiv:2606.27412v1 Announce Type: cross Abstract: 3D Scene Graph Generation (3DSGG) represents 3D scenes as structured object-relation-object graphs, providing a compact relational abstraction for spatial understanding. In embodied intelligence settings, the same 3D scene may be observed by agents f

TR · 1 model · #computer-vision #3d-scene-graph #robustness

arxiv · Jun 27 · bullish

Semantic Generative Tuning for Unified Multimodal Models

arXiv:2605.18714v2 Announce Type: replace-cross Abstract: Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation

#multimodal #computer-vision #generative-models

arxiv · Jun 27 · bullish

Event-Aware Instructed Assistant for Referring Video Segmentation

arXiv:2606.26994v1 Announce Type: cross Abstract: Existing referring video segmentation methods often treat a video as a single event consisting of multiple images, overlooking the fact that a video typically contains multiple distinct events. Under such a mechanism, the model needs to directly unde

EV · 1 model · #video-segmentation #computer-vision #artificial-intelligence

arxiv · Jun 27

VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image

arXiv:2602.04349v3 Announce Type: replace-cross Abstract: 3D editing has emerged as a critical research area to provide users with flexible control over 3D assets. While current editing approaches predominantly focus on 3D Gaussian Splatting or multi-view images, the direct editing of 3D meshes rema

VE VO · 2 models · #computer-vision #3d-editing #mesh-editing

arxiv · Jun 26

Hallucination in World Models is Predictable and Preventable

arXiv:2606.27326v1 Announce Type: new Abstract: Modern generative world models render increasingly realistic action-controllable futures, yet they frequently hallucinate: rollouts remain visually fluent while drifting from the ground-truth dynamics. We hypothesize that hallucination concentrates in

#world-models #machine-learning #computer-vision

arxiv · Jun 25 · bullish

MedPCFM: Improving Medical Point Cloud Completion by Integrating Point Transformers and Flow Matching

arXiv:2606.24433v1 Announce Type: cross Abstract: Medical point cloud completion is important for anatomical reconstruction and downstream clinical workflows, yet generative modeling in this setting remains insufficiently studied. We investigate completion through continuous-time generative modeling

PC PT PV · 4 models · +1 · #medical-imaging #point-cloud #generative-models

arxiv · Jun 20 · bullish

Human Universal Grasping

arXiv:2606.17054v1 Announce Type: cross Abstract: Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flo

HU · 1 model · #robotics #grasping #computer-vision

arxiv · Jun 16

Multi-Sensor Fusion for UAV Classification Based on Feature Maps of Image and Radar Data

arXiv:2410.16089v2 Announce Type: replace Abstract: The unique cost, flexibility, speed, and efficiency of modern UAVs make them an attractive choice in many applications in contemporary society. This, however, causes an ever-increasing number of reported malicious or accidental incidents, rendering

DE CO · 2 models · #uav-detection #deep-learning #signal-processing

arxiv · Jun 16 · bullish

Uncertainty Quality of VGGT: An Analysis on the DTU Benchmark Dataset

arXiv:2606.16479v1 Announce Type: cross Abstract: Visual Geometry Grounded Transformer (VGGT) has already attracted a great deal of attention in a short period of time, not least due to the Best Paper Award at CVPR-2025. Similar to DUSt3R and MASt3R, VGGT aims to bring about a paradigm shift by repl

VI DU MA · 3 models · #computer-vision #3d-reconstruction #uncertainty-estimation

arxiv · Jun 12 · bullish

Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification

arXiv:2606.11661v1 Announce Type: cross Abstract: Clothes-changing person re-identification (CC-ReID) aims to recognize individuals despite drastic appearance changes caused by clothing variation. While existing methods rely on adversarial learning to disentangle clothing features, we propose Ortho-

OR BA · 2 models · #computer-vision #machine-learning #reidentification

arxiv · Jun 11 · bullish

MARIC: Multi-Agent Reasoning for Image Classification

arXiv:2509.14860v2 Announce Type: replace-cross Abstract: Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate s

MA VL · 2 models · #computer-vision #multiagent-systems #image-classification

arxiv · Jun 6 · bullish

Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation

arXiv:2606.05785v1 Announce Type: cross Abstract: Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spa

YO · 1 model · #computer-vision #license-plate-recognition #real-time-processing

arxiv · Jun 2 · bullish

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

arXiv:2606.02552v1 Announce Type: cross Abstract: Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a

MD · 1 model · #depth-estimation #computer-vision #image-processing

arxiv · Jun 1 · bullish

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

arXiv:2605.31535v1 Announce Type: cross Abstract: Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We

RA · 1 model · #computer-vision #self-supervised #transformer

arxiv · May 29 · bullish

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

arXiv:2605.29539v1 Announce Type: cross Abstract: Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single

GI · 1 model · #computer-vision #object-detection #few-shot-learning

arxiv · May 28

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

arXiv:2605.27464v1 Announce Type: cross Abstract: AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond

HI · 1 model · #computer-vision #action-recognition #dataset

arxiv · May 25

Lipschitz Optimization for Formal Verification of Homographies

arXiv:2605.23203v1 Announce Type: cross Abstract: The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete

#computer-vision #safety #verification

arxiv · May 22 · bullish

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

arXiv:2510.09060v2 Announce Type: replace Abstract: Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address t

#text-to-image #diversity #computer-vision

arxiv · May 19 · bullish

Trajectory-Aware Adaptive Inference in Object Detection Models

arXiv:2605.16397v1 Announce Type: cross Abstract: The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby

YO · 1 model · #computer-vision #real-time #efficiency

arxiv · May 15 · bullish

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

arXiv:2605.13838v2 Announce Type: replace-cross Abstract: Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma

RE VA TR · 4 models · +1 · #animation #computer-vision #machine-learning

arxiv · May 11 · bullish

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

arXiv:2605.06317v2 Announce Type: replace-cross Abstract: Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, the

NA · 1 model · #navigation #computer-vision #path-planning

arxiv · May 8 · bullish

EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation

arXiv:2605.05674v1 Announce Type: cross Abstract: Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class sampl

EU OP · 2 models · #computer-vision #out-of-distribution #adapter-training

arxiv · May 8 · bullish

Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections

arXiv:2605.05402v1 Announce Type: new Abstract: Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temp

DE · 1 model · #transportation #computer-vision #safety

arxiv · May 8 · bullish

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

arXiv:2602.13310v2 Announce Type: replace-cross Abstract: Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into

VI · 1 model · #computer-vision #parallel-processing #multimodal-learning

arxiv · May 8 · bullish

Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

arXiv:2605.06058v1 Announce Type: new Abstract: Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-r

CO · 1 model · #explainability #document-visual-question-answering #machine-learning

arxiv · May 5 · bullish

Anomaly-Preference Image Generation

arXiv:2605.02439v1 Announce Type: cross Abstract: Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, resp

#anomaly-detection #machine-learning #computer-vision

arxiv · May 4 · bullish

Being-H0.7: A Latent World-Action Model from Egocentric Videos

arXiv:2605.00078v1 Announce Type: cross Abstract: Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations

BE · 1 model · #robotics #computer-vision #machine-learning

arxiv · May 1

Culture-inspired Multi-modal Color Palette Generation and Colorization: A Chinese Youth Subculture Case

arXiv:2102.05231v1 Announce Type: cross Abstract: Color is an essential component of graphic design, acting not only as a visual factor but also carrying cultural implications. However, existing research on algorithmic color palette generation and colorization largely ignores the cultural aspect. In

#colorization #computer-vision #generative-models

arxiv · May 1

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment

arXiv:2506.22500v2 Announce Type: replace-cross Abstract: Automated identification of surgical safety risks is critical for improving patient outcomes; however, Multimodal Large Language Models (MLLMs) frequently suffer from Visual-Semantic Knowledge Conflicts (VS-KC), a phenomenon where models poss

#safety #medical #computer-vision

arxiv · Apr 30 · bullish

Delineating Knowledge Boundaries for Honest Large Vision-Language Models

arXiv:2604.26419v1 Announce Type: cross Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that

#computer-vision #artificial-intelligence #trustworthiness

arxiv · Apr 29

OAMVOS:2nd Report for 5th PVUW MOSE Track

arXiv:2604.22837v1 Announce Type: cross Abstract: SAM-based dense trackers provide strong short-term mask propagation but remain fragile under long occlusion, fast motion, viewpoint change, and distractors. The problem is especially severe for small objects, where a few incorrect memory updates can

DA SA · 2 models · #computer-vision #object-tracking #occlusion

arxiv · Apr 27 · bullish

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

arXiv:2604.22045v1 Announce Type: cross Abstract: Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups o

VG RE DE · 5 models · +2 · #computer-vision #interpretability #image-classification

arxiv · Apr 27 · bullish

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

arXiv:2604.22260v1 Announce Type: cross Abstract: Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception an

UN · 1 model · #open-source #dataset #computer-vision

arxiv · Apr 24 · bullish

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

arXiv:2604.20825v1 Announce Type: new Abstract: Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage frame

FE · 1 model · #federated-learning #noisy-labels #robust-training

arxiv · Apr 24 · bullish

LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

arXiv:2604.20800v1 Announce Type: cross Abstract: Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse,

LE LE VQ · 3 models · #computer-vision #3d-reconstruction #human-object-interaction

arxiv · Apr 24

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

arXiv:2604.20822v1 Announce Type: cross Abstract: The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapp

#earth-observation #offshore-wind #machine-learning

arxiv · Apr 21 · bullish

GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

arXiv:2604.15495v1 Announce Type: new Abstract: Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static

GI · 1 model · #navigation #computer-vision #human-ai-interaction

arxiv · Apr 21

SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning

arXiv:2604.14373v2 Announce Type: replace-cross Abstract: Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a sa

SA BL OP · 4 models · +1 · #computer-vision #remote-sensing #vulnerability-index

arxiv · Apr 20 · bullish

Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models

arXiv:2604.15609v1 Announce Type: new Abstract: Test-Time Adaptation (TTA) for black-box models accessible only via APIs remains a largely unexplored challenge. Existing approaches such as post-hoc output refinement offer limited adaptive capacity, while Zeroth-Order Optimization (ZOO) enables input

BE GO OP · 6 models · +3 · #machine-learning #computer-vision #test-time-adaptation

arxiv · Apr 18 · bullish

Edge-preserving noise for diffusion models

arXiv:2410.01540v4 Announce Type: replace-cross Abstract: Classical diffusion models typically rely on isotropic Gaussian noise, treating all regions uniformly and overlooking structural information important for high-quality generation. We introduce an edge-preserving diffusion process that general

#diffusion #computer-vision #machine-learning

arxiv · Apr 18

The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform

arXiv:2604.13315v2 Announce Type: replace-cross Abstract: High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing,

#dataset #computer-vision #urban-planning

arxiv · Apr 18 · bullish

Improving Prostate Gland Segmentation Using Transformer based Architectures

arXiv:2506.14844v2 Announce Type: replace-cross Abstract: Inter reader variability and cross site domain shift challenge the automatic segmentation of prostate anatomy using T2 weighted MRI images. This study investigates whether transformer models can retain precision amid such heterogeneity. We co

UN SW 3D · 3 models · #medical-imaging #segmentation #transformer-models

arxiv · Apr 17 · bullish

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

arXiv:2604.14113v1 Announce Type: cross Abstract: GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher re

UI · 1 model · #computer-vision #localization #uncertainty-quantification

arxiv · Apr 16 · bullish

RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows

arXiv:2509.20490v4 Announce Type: replace-cross Abstract: Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods

RA · 1 model · #multiagent #medical-imaging #explainability

arxiv · Apr 14 · bullish

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

arXiv:2604.11539v1 Announce Type: cross Abstract: Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorp

VI CL · 2 models · #computer-vision #image-retrieval #adaptive-learning

arxiv · Apr 13 · bullish

MixFlow: Mixed Source Distributions Improve Rectified Flows

arXiv:2604.09181v1 Announce Type: cross Abstract: Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curva

DI RE MI · 3 models · #computer-vision #machine-learning #generative-models

arxiv · Apr 11 · bullish

TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

arXiv:2604.07960v1 Announce Type: cross Abstract: Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably,

LA · 1 model · #cad #language-models #autonomous-systems

arxiv · Apr 11

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

arXiv:2604.07831v1 Announce Type: cross Abstract: Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alig

#security #adversarial #computer-vision

arxiv · Apr 10 · bullish

RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection

arXiv:2505.17732v2 Announce Type: replace-cross Abstract: Accurate, fast, and reliable 3D perception is essential for autonomous driving. Recently, bird's-eye view (BEV)-based perception approaches have emerged as superior alternatives to perspective-based solutions, offering enhanced spatial unders

RQ · 1 model · #autonomous-driving #object-detection #computer-vision

arxiv · Apr 10 · bullish

Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

arXiv:2508.20765v2 Announce Type: replace-cross Abstract: The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, a

#video-understanding #abstract-concepts #foundation-models

arxiv · Apr 10 · bearish

CAAP: Capture-Aware Adversarial Patch Attacks on Palmprint Recognition Models

arXiv:2604.06987v1 Announce Type: cross Abstract: Palmprint recognition is deployed in security-critical applications, including access control and palm-based payment, due to its contactless acquisition and highly discriminative ridge-and-crease textures. However, the robustness of deep palmprint re

#security #adversarial-attacks #computer-vision

arxiv · Apr 9 · bullish

Visual prompting reimagined: The power of the Activation Prompts

arXiv:2604.06440v1 Announce Type: cross Abstract: Visual prompting (VP) has emerged as a popular method to repurpose pretrained vision models for adaptation to downstream tasks. Unlike conventional model fine-tuning techniques, VP introduces a universal perturbation directly into the input data to f

#computer-vision #fine-tuning #machine-learning

arxiv · Apr 9 · bearish

Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

arXiv:2604.07254v1 Announce Type: cross Abstract: Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory va

VG EF BA · 3 models · #computer-vision #machine-learning #explainability

arxiv · Apr 8 · bullish

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

arXiv:2604.06156v1 Announce Type: cross Abstract: MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. Fir

MM · 1 model · #multimodal-embedding #reasoning #computer-vision

arxiv · Apr 7

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

arXiv:2512.03666v2 Announce Type: replace-cross Abstract: A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain large

#computer-vision #benchmark #embodied-intelligence

arxiv · Apr 6 · bearish

Multimodal Language Models Cannot Spot Spatial Inconsistencies

arXiv:2604.00799v2 Announce Type: replace-cross Abstract: Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D ge

#computer-vision #machine-learning #evaluation

arxiv · Apr 6 · bullish

Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

arXiv:2604.02546v1 Announce Type: cross Abstract: Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based e

OP UN · 2 models · #computer-vision #3d-scene-understanding #transformer