arxiv Jul 18 bullish

InCarEmo: A Multimodal Dataset for In-Cabin Emotion Recognition and Driver State Monitoring

arXiv:2607.14683v1 Announce Type: new Abstract: Understanding driver emotion and state is critical for the next generation of intelligent in-cabin systems that ensure safety and enhance human-vehicle interaction. However, existing public datasets for in-cabin affective computing are largely limited

#dataset #emotion-recognition #multimodal

arxiv Jul 11 bullish

PLURAL: A Global Dataset for Value Alignment

arXiv:2607.08034v1 Announce Type: new Abstract: Large language models (LLMs) are used worldwide, yet disproportionately reflect Western values, limiting their ability to represent diverse value systems. We introduce PLURAL, a large-scale, value-focused preference dataset grounded in the Integrated V

#dataset #diversity #values

arxiv Jul 10

Nigeria Machinery: A Low-Resource Industrial Dataset with a Domain-Grounded Reasoning Layer

arXiv:2607.07883v1 Announce Type: new Abstract: There is relatively little public, model-ready data on industrial machinery for African economies. This makes it hard to do quantitative analysis or to train language models on numeric tasks grounded in that setting. We release two things to help

#dataset #industrial #african-economies

arxiv Jun 19

Characterizing Narrative Content in Web-scale LLM Pretraining Data

arXiv:2606.19468v1 Announce Type: new Abstract: The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token ope

3 models #narrative-analysis #pretraining #llm

arxiv Jun 18 bullish

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

arXiv:2508.04086v3 Announce Type: replace Abstract: Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce

1 model #llm #dataset #open-source

arxiv May 28

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

arXiv:2605.27464v1 Announce Type: cross Abstract: AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond

1 model #computer-vision #action-recognition #dataset

arxiv May 5

ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

arXiv:2605.00116v1 Announce Type: cross Abstract: In this article, we introduce ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. The dataset consists of 42,012 premise-hypothesis pairs derived from official statutory

#nlp #dataset #legal

arxiv May 1

Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

arXiv:2602.16516v2 Announce Type: replace Abstract: This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CA

2 models #dataset #nlp #policy

arxiv May 1

K2MUSE: A human lower-limb multimodal walking dataset spanning task and acquisition variability for rehabilitation robotics

arXiv:2504.14602v2 Announce Type: replace-cross Abstract: The natural interaction and control performance of lower limb rehabilitation robots are closely linked to biomechanical information from various human locomotion activities. Multidimensional human motion data significantly deepen the understa

#rehabilitation #robotics #biomechanics

arxiv Apr 27 bullish

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

arXiv:2604.22260v1 Announce Type: cross Abstract: Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception an

1 model #open-source #dataset #computer-vision

arxiv Apr 21 bullish

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

arXiv:2604.17358v1 Announce Type: new Abstract: While recent Spoken Language Models (SLMs) have been actively deployed in real-world scenarios, they lack the capability to discern Third-Party Interruptions (TPI) from the primary user's ongoing flow, leaving them vulnerable to contextual failures. To

#spoken-language #dataset #evaluation

arxiv Apr 18

The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform

arXiv:2604.13315v2 Announce Type: replace-cross Abstract: High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing,

#dataset #computer-vision #urban-planning

arxiv Apr 17

Curation of a Palaeohispanic Dataset for Machine Learning

arXiv:2604.13070v1 Announce Type: cross Abstract: Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century B.C. Their study was really set in motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-sil

#language #dataset #machine-learning

arxiv Apr 4 bullish

J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

arXiv:2407.15828v2 Announce Type: replace Abstract: Spoken dialogue is essential for human-AI interactions, providing expressive capabilities beyond text. Developing effective spoken dialogue systems (SDSs) requires large-scale, high-quality, and diverse spoken dialogue corpora. However, existing da

#open-source #dataset #speech