Madmon: Specialized Multilingual Embeddings for Underrepresented Languages
Abstract
Overview
Madmon is a high-fidelity embedding model that produces 768-dimensional vectors and accepts inputs of up to 8,192 tokens. It is trained on curated, specialized data for underrepresented languages.
Capabilities
- Embedding dimension: 768
- Max context length: 8,192 tokens
- Specialization: Arabic dialects (including Darija, Egyptian, Gulf, and Tunisian), Amazigh languages (including Tamazight, Tachelhit, Tarifit, and Kabyle), and other low-resource language families
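Documents longer than the 8,192-token context window have to be split before they can be embedded. The following is a minimal sketch of one way to do that, not Madmon's own preprocessing: `chunk_tokens` is a hypothetical helper, and it assumes the text has already been tokenized. Whitespace splitting is used only as a stand-in in the usage note, since the model's real tokenizer will generally produce different counts.

```python
from typing import List

MAX_CONTEXT_TOKENS = 8192  # context window stated in the capabilities list


def chunk_tokens(
    tokens: List[str],
    max_len: int = MAX_CONTEXT_TOKENS,
    overlap: int = 0,
) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens.

    Hypothetical helper for illustration. Consecutive windows share
    `overlap` tokens so that context is not cut off abruptly.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once this window has reached the end of the input.
        if start + max_len >= len(tokens):
            break
    return chunks
```

As a rough usage example, `chunk_tokens(text.split(), overlap=128)` yields windows that could each be embedded separately. The overlap value of 128 is an arbitrary choice for illustration.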
Use Cases
- Semantic search across multilingual corpora
- Document classification for Arabic dialect content
- Cross-lingual retrieval (e.g., Darija query → MSA documents)
- Clustering social media content by dialect
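The search and retrieval use cases above reduce to comparing embedding vectors. The sketch below ranks documents against a query by cosine similarity. It assumes the 768-dimensional vectors have already been obtained from Madmon. The Sawalni API call itself is not shown because its interface is not described here, and `cosine_similarity` and `rank_documents` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

EMBEDDING_DIM = 768  # vector size stated in the capabilities list


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def unit_vector(*hot: int) -> List[float]:
    """Toy stand-in for a real embedding: 1.0 at the given positions."""
    vec = [0.0] * EMBEDDING_DIM
    for i in hot:
        vec[i] = 1.0
    return vec


# Toy data only; real vectors would come from the embedding model.
query = unit_vector(0)
documents = [unit_vector(1), unit_vector(0), unit_vector(0, 1)]
ranking = rank_documents(query, documents, top_k=2)
```

For cross-lingual retrieval, the query vector would come from a Darija query and the document vectors from MSA documents, with the ranking step unchanged. A linear scan like this suits small corpora; larger collections would typically use an approximate nearest-neighbor index instead.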
Availability
Madmon is available through the Sawalni API for production use.
Citation
Related Research
Sawtone: A Universal Framework for Phonetic Similarity and Alignment Across Languages and Scripts
Introduces a cross-script phonetic alignment framework with modular, language-specific adapters. Reports a BLEU score of 88% on transliteration and 87–95% phonetic alignment accuracy across language and script pairs. Includes a case study on preprocessing Moroccan Arabic (Darija) for LLM training.
GenAI for Moroccan Darija: Challenges and Early Results
Conference presentation at the 7th International Congress for Moroccan Arabic, discussing challenges and early results in applying generative AI to Moroccan Darija.
Gherbal: A Multilingual Classifier for Low-Resource Languages
Conference presentation at TIM'24 introducing Gherbal, a multilingual classifier designed for low-resource languages.