NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data
Inferring nationality from personal names is a critical capability for equity and bias monitoring and for personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment.
In this work, we construct a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
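The enrichment idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that "low-resource" means a country with fewer than some threshold of labeled names, and that an LLM is wrapped in a function returning candidate names for a country. The threshold `min_per_country`, the function `generate_names`, the stub `fake_generator`, and the toy records are all hypothetical; the abstract does not specify any of them.

```python
from collections import Counter


def augment_tail(records, generate_names, min_per_country):
    """Pad under-represented countries with generated names.

    records: list of (name, country) pairs from the real dataset.
    generate_names: hypothetical callable (country, k) -> list of k names,
        standing in for an LLM call.
    Returns a list of (name, country, is_synthetic) triples.
    """
    counts = Counter(country for _, country in records)
    # Real records are kept unchanged and flagged as non-synthetic.
    augmented = [(name, country, False) for name, country in records]
    for country, n in sorted(counts.items()):
        deficit = min_per_country - n
        if deficit > 0:
            # Only tail countries receive synthetic names.
            for name in generate_names(country, deficit):
                augmented.append((name, country, True))
    return augmented


def fake_generator(country, k):
    # Stub used in place of a real LLM so the sketch runs offline.
    return [f"synthetic_{country}_{i}" for i in range(k)]


# Toy data for illustration only.
records = [
    ("name_a", "DE"),
    ("name_b", "DE"),
    ("name_c", "DE"),
    ("name_d", "TO"),
]
result = augment_tail(records, fake_generator, min_per_country=3)
```

Keeping the `is_synthetic` flag on each row makes it possible to build separate real and synthetic-tail test sets, in line with the two evaluation settings the abstract describes.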
The paper spans 12 pages with 3 figures and 8 tables, and has been accepted at the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026). It is classified under Computation and Language (cs.CL), with ACM classes I.2.7 and H.3.3. It can be cited as arXiv:2604.10401 [cs.CL], or as arXiv:2604.10401v2 [cs.CL] for this version, with the DOI https://doi.org/10.48550/arXiv.2604.10401 and the journal reference Proceedings of Machine Learning Research 318 (2026). The first version was submitted on Sun, 12 Apr 2026 01:19:55 UTC, and the second on Mon, 20 Apr 2026 21:15:23 UTC.