01
paper

Sawtone: A Universal Framework for Phonetic Similarity and Alignment Across Languages and Scripts

Lingua Posnaniensis, Vol. 67(1)

Introduces a cross-script phonetic alignment framework with modular language-specific adapters. Reports a transliteration BLEU score of 88% and phonetic alignment accuracy of 87–95% across language/script pairs. Includes a case study on preprocessing Moroccan Arabic (Darija) for LLM training.

Phonetics, Transliteration, Cross-Script, NLP
02
report

GenAI for Moroccan Darija: Challenges and Early Results

University of Navarra, Spain

Conference presentation at the 7th International Congress for Moroccan Arabic, discussing challenges and early results in applying generative AI to Moroccan Darija.

LLM, Moroccan Darija, Low-Resource Languages, NLP
03
report

Gherbal: A Multilingual Classifier for Low-Resource Languages

University Hassan II, Casablanca, Morocco

Conference presentation at TIM'24, introducing Gherbal, a multilingual classifier designed for low-resource languages.

NLP, Low-Resource Languages, Cultural AI, Language Identification
04
report

Gherbal v4: Comprehensive Evaluation Report — Language Identification Across 214 Languages

Omneity Labs

Comprehensive evaluation of Gherbal v4 across 10 models, 8 benchmarks, and 5 scoping regimes. Gherbal v4 achieves 0.836 average accuracy, outperforming OpenLID v2, GlotLID, and NLLB-LID. The report covers Arabic dialect identification, Arabizi detection, African language coverage, and model efficiency analysis.

NLP, Low-Resource Languages, Language Identification, FastText
05
report

Madmon: Specialized Multilingual Embeddings for Underrepresented Languages

Omneity Labs

Madmon is a multilingual text embeddings model specialized for Arabic dialects, Amazigh languages, and other underrepresented language families. Designed for retrieval, classification, clustering, and semantic search in multilingual and low-resource settings.

NLP, Low-Resource Languages, Embeddings, LLM