Madmon
Multilingual text embeddings specialized for Arabic dialects, Berber languages, and other underrepresented language families. Designed with a deep focus on the languages that frontier models ignore.
Architecture
Optimized for Underrepresented Languages
Madmon is a multilingual text-embedding model built for linguistically complex settings, such as dialectal variation and text written in several scripts. It accepts inputs of up to 8,192 tokens and produces 768-dimensional vectors optimized for retrieval precision.
We trained the model on curated multilingual data, with a focus on:
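For capacity planning, the 768-dimensional output translates directly into storage per vector. The sketch below is only an estimate: the 4-byte float32 storage format is an assumption (the page does not say how vectors are stored), and `index_size_bytes` is a hypothetical helper name.

```python
EMBEDDING_DIM = 768      # vector size stated on this page
BYTES_PER_VALUE = 4      # assumption: each component stored as float32


def index_size_bytes(num_vectors, dim=EMBEDDING_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Raw size of a flat, uncompressed vector index (no metadata, no overhead)."""
    return num_vectors * dim * bytes_per_value


per_vector = index_size_bytes(1)
one_million = index_size_bytes(1_000_000)

print(per_vector)                         # 3072 bytes per vector
print(round(one_million / 1024 ** 3, 2))  # size of one million vectors in GiB
```

Under these assumptions one million embeddings take roughly 2.86 GiB before any index overhead or compression.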
- Arabic dialects — Darija, Egyptian, Gulf, Tunisian, Levantine, and more
- Berber languages — Tamazight, Kabyle, Tachelhit in Latin, Arabic, and Tifinagh scripts
- Other low-resource families — African, Southeast Asian, and indigenous languages
Semantic Search
Search across multilingual corpora. A Darija query retrieves relevant Modern Standard Arabic (MSA), French, or English documents through a shared semantic space.
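Search over a shared embedding space is commonly implemented by ranking documents by cosine similarity to the query vector. The following is a minimal sketch of that idea, not Madmon's own retrieval code: the 4-dimensional vectors are toy stand-ins for real 768-dimensional embeddings, and the function names and document ids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors: in practice these would come from the embedding model.
query = [0.9, 0.1, 0.0, 0.2]          # e.g. an embedded Darija query
documents = {
    "msa_article": [0.8, 0.2, 0.1, 0.1],
    "french_faq": [0.1, 0.9, 0.3, 0.0],
    "english_blog": [0.7, 0.0, 0.1, 0.4],
}

results = rank_documents(query, documents, top_k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

With these toy values the MSA and English documents rank above the French one, because their vectors point in nearly the same direction as the query.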
Content Classification
Use the embeddings as input features to classify Arabic dialect content, social media posts, and other user-generated text.
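One simple way to classify on top of embeddings is a nearest-centroid rule: average the vectors of each labeled class, then assign new text to the class whose centroid is most similar. This is an illustrative sketch only; the page does not specify a classifier, and the 3-dimensional vectors, labels, and function names below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fit_centroids(labeled_vectors):
    """Map each label to the mean of its example vectors."""
    return {label: mean_vector(vecs) for label, vecs in labeled_vectors.items()}


def classify(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine_similarity(vec, centroids[label]))


# Toy training data: embeddings of a few labeled posts per dialect.
training = {
    "darija": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "egyptian": [[0.1, 1.0, 0.0], [0.0, 0.9, 0.2]],
}
centroids = fit_centroids(training)

predicted = classify([0.8, 0.3, 0.0], centroids)
print(predicted)
```

For larger labeled sets, a logistic regression or similar lightweight model trained on the same vectors is a common alternative.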
Cross-Lingual Retrieval
Match content across scripts and languages. Arabizi, Arabic-script, and Latin-script content is mapped to the same embedding space.
Clustering & Analytics
Group social media content by dialect, topic, or sentiment. Discover patterns in multilingual datasets.
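Grouping content by embedding is typically done with a clustering algorithm such as k-means. The sketch below is a small, dependency-free k-means with a deterministic farthest-point initialization, run on toy 2-dimensional vectors; the function names and data are hypothetical, and a production pipeline would more likely use an established library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iters=50):
    """Return (assignments, centroids) after Lloyd iterations converge or max_iters is hit."""
    centroids = farthest_point_init(points, k)
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the solution is stable
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return assignments, centroids


# Toy embeddings: two posts about one topic, two about another.
posts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments, centroids = kmeans(posts, k=2)
print(assignments)
```

The cluster ids themselves carry no meaning; interpreting a cluster as a dialect, topic, or sentiment requires inspecting its members.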
Resources
Try Madmon now
Generate embeddings for any text. 768-dimensional vectors optimized for multilingual content.