Abstract

A comprehensive evaluation of Gherbal v4 alongside competing language identification models: 10 models in total, across 8 benchmarks and 5 scoping regimes. At 200 MB, Gherbal v4 achieves 0.836 average accuracy, outperforming OpenLID v2 (1,230 MB), GlotLID (1,690 MB), and NLLB-LID (1,180 MB). The report covers Arabic dialect identification, Arabizi detection, African language coverage, and model efficiency analysis.

Overview

This technical report presents the most comprehensive language identification evaluation we are aware of: 10 models tested across 8 benchmarks under 5 scoping regimes, covering 214 languages.

Key Findings

  • Gherbal v4 achieves 0.836 average accuracy at just 200 MB — the best accuracy-to-size ratio of any model tested
  • Outperforms OpenLID v2 (0.824, 1,230 MB), GlotLID (0.803, 1,690 MB), and NLLB-LID (0.711, 1,180 MB)
  • Only model to identify all 16 Arabic dialect variants tested
  • Only model to detect Arabizi (Latin-script Darija) — all competitors score 0%
  • Best-in-class on West/Central African languages (Kituba, Dyula, Kamba, Twi)
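The accuracy-to-size comparison can be reproduced from the figures quoted above. The sketch below is illustrative only: it assumes "accuracy-to-size ratio" means average accuracy divided by model size in GB (with 1 GB = 1,000 MB), which this page does not define, and it includes only the four models whose numbers are listed here, not all 10 evaluated.

```python
# (average accuracy, model size in MB), as quoted in the findings above.
models = {
    "Gherbal v4": (0.836, 200),
    "OpenLID v2": (0.824, 1230),
    "GlotLID": (0.803, 1690),
    "NLLB-LID": (0.711, 1180),
}


def accuracy_per_gb(accuracy: float, size_mb: float) -> float:
    """Assumed ratio definition: average accuracy per GB of model size."""
    return accuracy / (size_mb / 1000)


# Rank models from best to worst ratio.
ranked = sorted(
    models,
    key=lambda name: accuracy_per_gb(*models[name]),
    reverse=True,
)

for name in ranked:
    accuracy, size_mb = models[name]
    print(f"{name}: {accuracy_per_gb(accuracy, size_mb):.2f} accuracy/GB")
```

Under this definition the gap comes mostly from size: the accuracy differences among the top three models are a few points, while the size differences are roughly sixfold or more.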

Data Quality Pipeline

Gherbal v4 is trained on less than 3 GB of cleaned data, versus 21–45 GB for competitors. The 4-pass cleaning pipeline consists of:

  1. Script validation
  2. Cross-language deduplication
  3. Self-prediction disambiguation
  4. Temperature resampling (p^0.3)
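As an illustration of the last pass, here is a minimal sketch of temperature resampling with exponent 0.3. It assumes the common formulation in which each language's sampling probability is proportional to its share of the data raised to the power 0.3; the function name, language codes, and counts are hypothetical and not taken from the report.

```python
def temperature_weights(counts: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Return per-language sampling probabilities proportional to p ** alpha.

    p is each language's share of the raw data. With alpha < 1 the
    distribution is flattened, so low-resource languages are sampled
    more often than their raw share would suggest.
    """
    total = sum(counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: weight / norm for lang, weight in scaled.items()}


# Hypothetical sentence counts for a high-, mid-, and low-resource language.
counts = {"eng": 1_000_000, "ary": 10_000, "kmb": 1_000}
weights = temperature_weights(counts)

for lang, weight in weights.items():
    raw_share = counts[lang] / sum(counts.values())
    print(f"{lang}: raw share {raw_share:.4f} -> sampling weight {weight:.4f}")
```

The ordering of languages by size is preserved, but the largest language loses share and the smallest gains it, which is the usual motivation for this kind of resampling in multilingual training data.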

Citation

Kamali, O. (2026). Gherbal v4: Comprehensive Evaluation Report — Language Identification Across 214 Languages. Omneity Labs Technical Report.

Keywords: NLP, Low-Resource Languages, Language Identification, FastText, Benchmarks, Moroccan Darija