Gherbal v4: Comprehensive Evaluation Report — Language Identification Across 214 Languages

Abstract

This report presents a comprehensive evaluation of Gherbal v4 across 10 models, 8 benchmarks, and 5 scoping regimes. Gherbal v4 achieves an average accuracy of 0.836, outperforming OpenLID v2, GlotLID, and NLLB-LID. The report also covers Arabic dialect identification, Arabizi detection, African language coverage, and model efficiency analysis.

Overview

This technical report presents the most comprehensive language identification evaluation we are aware of: 10 models tested across 8 benchmarks under 5 scoping regimes, covering 214 languages.

Key Findings

Gherbal v4 achieves 0.836 average accuracy, with the best accuracy-to-compute ratio of any model tested
Outperforms OpenLID v2 (0.824), GlotLID (0.803), and NLLB-LID (0.711)
Only model to identify all 16 Arabic dialect variants tested
Only model to detect Arabizi (Latin-script Darija) — all competitors score 0%
Best-in-class on West/Central African languages (Kituba, Dyula, Kamba, Twi)

Data Quality Pipeline

Gherbal v4 is trained on curated data. The cleaning pipeline consists of four passes:

Script validation
Cross-language deduplication
Self-prediction disambiguation
Content-aware resampling
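
As an illustration only, the first two passes might look roughly like the sketch below. It is not the report's implementation: the function names, the `EXPECTED_SCRIPTS` mapping, the 0.8 threshold, and the rule of dropping any text that appears under more than one label are all assumptions made for this example.

```python
import unicodedata
from collections import defaultdict

# Hypothetical mapping from language code to accepted Unicode script prefixes.
# A language written in several scripts would simply list more than one.
EXPECTED_SCRIPTS = {"eng": ("LATIN",), "rus": ("CYRILLIC",), "arb": ("ARABIC",)}


def script_ratio(text, scripts):
    """Fraction of alphabetic characters whose Unicode name starts with one of `scripts`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(scripts))
    return hits / len(letters)


def validate_script(samples, min_ratio=0.8):
    """Pass 1 (assumed): keep (text, lang) pairs written mostly in an expected script."""
    kept = []
    for text, lang in samples:
        scripts = EXPECTED_SCRIPTS.get(lang)
        # Languages without a configured script are passed through unchanged.
        if scripts is None or script_ratio(text, scripts) >= min_ratio:
            kept.append((text, lang))
    return kept


def normalize(text):
    """Unicode-normalize, case-fold, and collapse whitespace before comparing texts."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def dedup_across_languages(samples):
    """Pass 2 (assumed): drop texts labeled with more than one language,
    and keep only the first copy of duplicates within a single language."""
    labels = defaultdict(set)
    for text, lang in samples:
        labels[normalize(text)].add(lang)

    seen = set()
    kept = []
    for text, lang in samples:
        key = normalize(text)
        if len(labels[key]) > 1 or key in seen:
            continue
        seen.add(key)
        kept.append((text, lang))
    return kept
```

Running `dedup_across_languages(validate_script(samples))` chains the two passes. Dropping a text that carries conflicting labels, rather than guessing which label is right, is a conservative choice for this sketch; the self-prediction disambiguation pass named above presumably handles such cases differently in the actual pipeline.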

Citation

Kamali, O. (2026). Gherbal v4: Comprehensive Evaluation Report — Language Identification Across 214 Languages. Omneity Labs Technical Report.

Tags: NLP, Low-Resource Languages, Language Identification, FastText, Benchmarks, Moroccan Darija

Related Research

Sawtone: A Universal Framework for Phonetic Similarity and Alignment Across Languages and Scripts

Introduces a cross-script phonetic alignment framework with modular language-specific adapters. Demonstrates 88% BLEU transliteration and 87–95% phonetic alignment accuracy across language/script pairs. Includes a case study on preprocessing Moroccan Arabic (Darija) for LLM training.

GenAI for Moroccan Darija: Challenges and Early Results

Conference presentation at the 7th International Congress for Moroccan Arabic, discussing challenges and early results in applying generative AI to Moroccan Darija.

Gherbal: A Multilingual Classifier for Low-Resource Languages

Conference presentation at TIM'24, introducing Gherbal — a multilingual classifier designed for low-resource languages.