Model Detail
Turkish-Gemma-9b-T1
Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues
arXiv:2604.13620v1 Announce Type: cross Abstract: Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems rely on simple silence detection, which often fails because human speech patterns involve irregular pauses. This causes bots to interru
HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval
arXiv:2604.10665v1 Announce Type: new Abstract: HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BE
TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization
arXiv:2604.07553v1 Announce Type: new Abstract: This study presents a framework for generating a gold-standard summary fully automatically and reproducibly from multiple human summaries of Turkish educational videos. As part of the study, a new dataset called TR-EduVSum was created, e
HUKUKBERT: Domain-Specific Language Model for Turkish Law
arXiv:2604.04790v1 Announce Type: cross Abstract: Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet studies specific to Turkish law remain limited due to the scarcity of domain-specific data and models. Although extensive mod
Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
arXiv:2508.14292v3 Announce Type: replace Abstract: Tokenization shapes how language models perceive morphology and meaning in NLP, yet widely used frequency-driven subword tokenizers (e.g., Byte Pair Encoding and WordPiece) can fragment morphologically rich and agglutinative languages in ways that
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
arXiv:2501.04828v2 Announce Type: replace Abstract: This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset,