Madmon: Specialized Multilingual Embeddings for Underrepresented Languages
Overview
Madmon extends Google's EmbeddingGemma — a 2-billion-parameter embedding model that produces 768-dimensional vectors with an 8,192-token context window — by further training on 5 billion tokens of specialized data curated for underrepresented languages.
Architecture
- Base model: Google EmbeddingGemma (2B parameters)
- Embedding dimension: 768
- Max context length: 8,192 tokens
- Training data: 5B tokens of curated multilingual data
- Focus languages: Arabic dialects (Darija, Egyptian, Gulf, Tunisian, etc.), Berber languages (Tamazight, Kabyle, Tachelhit), and other low-resource families
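Because the context window caps input at 8,192 tokens, longer documents must be split before embedding. A minimal chunking sketch with overlapping windows (the function and its defaults are illustrative, not part of Madmon; a real pipeline would count tokens with the model's own tokenizer rather than a pre-tokenized list):

```python
def chunk_tokens(tokens, max_len=8192, overlap=256):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that sentences
    straddling a boundary appear intact in at least one chunk.
    """
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - overlap
    return chunks

# Example: a 20,000-token document yields three windows at the default size.
doc = [f"tok{i}" for i in range(20_000)]
windows = chunk_tokens(doc)
```

Each window is then embedded independently and the resulting vectors indexed separately (or pooled), a common pattern for long-document retrieval.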
Use Cases
- Semantic search across multilingual corpora
- Document classification for Arabic dialect content
- Cross-lingual retrieval (e.g., a Darija query retrieving Modern Standard Arabic (MSA) documents)
- Clustering social media content by dialect
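All of these use cases reduce to the same primitive: comparing a query embedding against document embeddings by cosine similarity. A minimal retrieval sketch with mock 768-dimensional unit vectors (`embed` is a hypothetical stand-in, not the Madmon or Sawalni API; in practice it would call the model):

```python
import zlib

import numpy as np


def embed(text: str, dim: int = 768) -> np.ndarray:
    """Hypothetical stand-in for the model: a deterministic mock vector
    seeded from the text, normalized to unit length."""
    local = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = local.standard_normal(dim)
    return v / np.linalg.norm(v)


def search(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    """Rank docs by cosine similarity to the query embedding.

    Since all vectors are unit-normalized, the dot product equals
    cosine similarity.
    """
    q = embed(query)
    return sorted(docs, key=lambda d: float(embed(d) @ q), reverse=True)[:top_k]


docs = ["doc a", "doc b", "exact match"]
results = search("exact match", docs)
```

With a real embedding model, semantically related texts (e.g., a Darija query and an MSA document on the same topic) would land near each other in the vector space, so the same ranking loop performs cross-lingual retrieval unchanged.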
Availability
Madmon is available through the Sawalni API for production use.
Related Research
Sawtone: A Universal Framework for Phonetic Similarity and Alignment Across Languages and Scripts
Introduces a cross-script phonetic alignment framework with modular language-specific adapters, achieving 88% BLEU on transliteration and 87–95% phonetic alignment accuracy across language/script pairs. Includes a case study on preprocessing Moroccan Arabic (Darija) for LLM training.
GenAI for Moroccan Darija: Challenges and Early Results
Conference presentation at the 7th International Congress for Moroccan Arabic, discussing challenges and early results in applying generative AI to Moroccan Darija.
Gherbal: A Multilingual Classifier for Low-Resource Languages
Conference presentation at TIM'24, introducing Gherbal — a multilingual classifier designed for low-resource languages.