Gherbal v4: Comprehensive Evaluation Report — Language Identification Across 214 Languages
Abstract
Overview
This technical report presents the most comprehensive language identification evaluation we are aware of: 10 models tested across 8 benchmarks under 5 scoping regimes, covering 214 languages.
Key Findings
- Gherbal v4 achieves 0.836 average accuracy at just 200 MB — the best accuracy-to-size ratio of any model tested
- Outperforms OpenLID v2 (0.824, 1,230 MB), GlotLID (0.803, 1,690 MB), and NLLB-LID (0.711, 1,180 MB)
- Only model to identify all 16 Arabic dialect variants tested
- Only model to detect Arabizi (Latin-script Darija) — all competitors score 0%
- Best-in-class on West/Central African languages (Kituba, Dyula, Kamba, Twi)
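The accuracy-to-size claim can be checked with simple arithmetic on the figures quoted above. This is a minimal sketch that assumes "accuracy-to-size ratio" means average accuracy divided by model size, and that 1 GB = 1,000 MB. The page does not define the ratio, and only the four models named above are included, not all ten from the evaluation.

```python
# (average accuracy, model size in MB), as quoted in the key findings above.
models = {
    "Gherbal v4": (0.836, 200),
    "OpenLID v2": (0.824, 1230),
    "GlotLID": (0.803, 1690),
    "NLLB-LID": (0.711, 1180),
}


def accuracy_per_gb(accuracy: float, size_mb: float) -> float:
    """Average accuracy divided by model size in GB (assumed definition)."""
    return accuracy / (size_mb / 1000)


ratios = {name: accuracy_per_gb(acc, mb) for name, (acc, mb) in models.items()}

# Print the models from best to worst ratio.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<11} {ratio:.2f} accuracy per GB")
```

Under this definition Gherbal v4 comes out at roughly 4.2 per GB, while the three larger models all fall below 1.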
Data Quality Pipeline
Gherbal v4 is trained on less than 3 GB of cleaned data, versus 21–45 GB for competitors. The four-pass cleaning pipeline consists of:
- Script validation
- Cross-language deduplication
- Self-prediction disambiguation
- Temperature resampling (p^0.3)
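The last step can be illustrated with a short sketch. It assumes "p^0.3" refers to the common temperature-sampling scheme in which each language is drawn with probability proportional to its share of the data raised to the power 0.3. The page does not spell out the exact procedure, so the function names, language codes, and corpus sizes below are hypothetical.

```python
import random


def temperature_weights(counts: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Sampling probability per language, proportional to (data share) ** alpha."""
    total = sum(counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}


def resample(corpus: dict[str, list[str]], n_samples: int,
             alpha: float = 0.3, seed: int = 0) -> list[tuple[str, str]]:
    """Draw (language, sentence) pairs using temperature-scaled language weights."""
    rng = random.Random(seed)
    weights = temperature_weights({lang: len(s) for lang, s in corpus.items()}, alpha)
    langs = list(weights)
    picks = rng.choices(langs, weights=[weights[lang] for lang in langs], k=n_samples)
    # Sentences are drawn with replacement, so small languages are upsampled.
    return [(lang, rng.choice(corpus[lang])) for lang in picks]


# Hypothetical sentence counts for a high-, mid-, and low-resource language.
counts = {"eng": 1_000_000, "ary": 10_000, "ktu": 100}
weights = temperature_weights(counts)
for lang, w in weights.items():
    share = counts[lang] / sum(counts.values())
    print(f"{lang}: raw share {share:.5f} -> sampling weight {w:.3f}")
```

An exponent below 1 flattens the distribution: the dominant language loses some of its share and the smallest languages gain, while the ordering by size is preserved.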
Resources
- Full Report (PDF)
- One-Pager (PDF)
- Language list
- Sawalni API — access Gherbal v4 via API
Related Research
Sawtone: A Universal Framework for Phonetic Similarity and Alignment Across Languages and Scripts
Introduces a cross-script phonetic alignment framework with modular language-specific adapters. Reports a BLEU score of 88% on transliteration and 87–95% phonetic alignment accuracy across language/script pairs. Includes a case study on preprocessing Moroccan Arabic (Darija) for LLM training.
GenAI for Moroccan Darija: Challenges and Early Results
Conference presentation at the 7th International Congress for Moroccan Arabic, discussing challenges and early results in applying generative AI to Moroccan Darija.
Gherbal: A Multilingual Classifier for Low-Resource Languages
Conference presentation at TIM'24, introducing Gherbal — a multilingual classifier designed for low-resource languages.