Multilingual Embedding Probes Fail to Generalize Across Learner Corpora

Abstract: Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance ($QWK \approx 0.7$), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. Under cross-corpus evaluation, however, performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge toward predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.

Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2604.07095 [cs.CL] (or arXiv:2604.07095v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.07095

Submission history
From: Laurits Lyngbaek
[v1] Wed, 8 Apr 2026 13:47:54 UTC (4,127 KB)
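The abstract reports probe performance in quadratic weighted kappa (QWK), which penalizes misclassifications by the squared distance between ordinal labels. A minimal pure-Python sketch of that metric is below; it assumes CEFR levels (A1–C2) are mapped to integers 0–5, which is a common convention but not specified by the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """Quadratic weighted kappa for ordinal labels 0..n_classes-1.

    Assumes CEFR levels A1..C2 are encoded as integers 0..5
    (an illustrative mapping, not taken from the paper).
    """
    n = len(y_true)
    # Observed confusion matrix: O[i][j] counts true label i, predicted j.
    O = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    # Marginal histograms of true and predicted labels.
    hist_true = [sum(row) for row in O]
    hist_pred = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic disagreement weight, 0 on the diagonal.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            # Expected counts if labels and predictions were independent.
            e = hist_true[i] * hist_pred[j] / n
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den

# Perfect agreement yields QWK = 1.0; chance-level agreement yields ~0.
print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]))  # → 1.0
```

A QWK near 0.7, as reported in-distribution, indicates substantial agreement with human raters; the cross-corpus collapse means predictions degrade toward chance-level agreement on out-of-distribution corpora.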