arxiv · April 11, 2026 at 4:00 AM · 2 min read · neutral

Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Abstract: Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2604.08294 [cs.CV] (or arXiv:2604.08294v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.08294 (arXiv-issued DOI via DataCite, pending registration)
Submission history: From Miguel Monte E Freitas. [v1] Thu, 9 Apr 2026 14:29:19 UTC (1,342 KB)
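
The two observations in the abstract, accuracy close to chance and a tendency to predict correct execution regardless of the evidence, can be checked on any set of binary AQA predictions with a few lines of Python. The sketch below is illustrative only: the function name `bias_report`, the label strings, and the toy labels are assumptions made here, not the paper's code, data, or results.

```python
from collections import Counter


def bias_report(y_true, y_pred, positive="correct"):
    """Compare accuracy with simple baselines and measure the positive-prediction rate.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Accuracy obtained by always predicting the most frequent ground-truth label.
    majority_baseline = Counter(y_true).most_common(1)[0][1] / n
    # Expected accuracy of a uniform random guess over the observed label set.
    chance_baseline = 1 / len(set(y_true))
    # How often the model says "correct" versus how often that is actually true.
    predicted_positive_rate = sum(p == positive for p in y_pred) / n
    actual_positive_rate = sum(t == positive for t in y_true) / n
    return {
        "accuracy": accuracy,
        "majority_baseline": majority_baseline,
        "chance_baseline": chance_baseline,
        "predicted_positive_rate": predicted_positive_rate,
        "actual_positive_rate": actual_positive_rate,
    }


# Made-up toy labels: 3 of 8 clips are executed correctly,
# but the hypothetical model answers "correct" for 7 of 8.
y_true = ["correct", "incorrect", "correct", "incorrect",
          "incorrect", "correct", "incorrect", "incorrect"]
y_pred = ["correct"] * 7 + ["incorrect"]

report = bias_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A predicted positive rate far above the actual positive rate, combined with accuracy at or below the majority baseline, is the pattern the abstract describes as a bias towards predicting correct execution.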

