Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation
Abstract: Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.08797 [cs.CL] (or arXiv:2604.08797v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.08797 (arXiv-issued DOI via DataCite, pending registration)
Submission history: From Sophie Wu. [v1] Thu, 9 Apr 2026 22:13:24 UTC (4,131 KB)