Abstract

Madmon is a multilingual text embedding model specialized for Arabic dialects, Amazigh languages, and other underrepresented language families. It is designed for retrieval, classification, clustering, and semantic search in multilingual and low-resource settings.

Overview

Madmon is an embedding model that produces 768-dimensional vectors and accepts inputs of up to 8,192 tokens. It is trained on curated, specialized data for underrepresented languages.

Capabilities

  • Embedding dimension: 768
  • Max context length: 8,192 tokens
  • Specialization: Arabic dialects (Darija, Egyptian, Gulf, Tunisian, etc.), Amazigh languages (Tamazight, Tachelhit, Tarifit, Kabyle, etc.), and other low-resource families
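
Documents longer than the 8,192-token context window have to be split before embedding. The sketch below shows one common way to do that, using fixed-size windows with an optional overlap. It is an illustration and not Madmon's documented preprocessing: the function name chunk_tokens and the overlap value are hypothetical, and the token list is a stand-in because the model's tokenizer is not described here.

```python
MAX_CONTEXT_TOKENS = 8192  # context length stated in the Capabilities list


def chunk_tokens(tokens, max_len=MAX_CONTEXT_TOKENS, overlap=0):
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next window.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window already reaches the end of the input.
        if start + max_len >= len(tokens):
            break
    return chunks


# Stand-in for real token IDs (assumption: any tokenizer output would do).
tokens = list(range(20000))
chunks = chunk_tokens(tokens, overlap=128)
print([len(c) for c in chunks])
```

Each chunk can then be embedded separately. How the resulting vectors are combined (for example, averaging them or indexing each chunk on its own) is an application-level choice.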

Use Cases

  • Semantic search across multilingual corpora
  • Document classification for Arabic dialect content
  • Cross-lingual retrieval (e.g., Darija query → MSA documents)
  • Clustering social media content by dialect
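
Semantic search and cross-lingual retrieval both come down to comparing a query vector against document vectors. The sketch below ranks documents by cosine similarity, a standard choice for embedding retrieval. Whether Madmon recommends this metric is an assumption. The vectors are toy placeholders and not real model output, and the helper names (cosine_similarity, rank_documents, toy_vector) are hypothetical.

```python
import math

EMBEDDING_DIM = 768  # dimension stated in the Capabilities list


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_vector(hot_index, dim=EMBEDDING_DIM):
    """Placeholder for a real embedding: a one-hot vector (illustrative data)."""
    vec = [0.0] * dim
    vec[hot_index] = 1.0
    return vec


# Toy stand-ins: in practice these would be embeddings of, for example,
# a Darija query and a set of MSA documents.
query = toy_vector(0)
docs = [
    toy_vector(5),                                                # unrelated
    toy_vector(0),                                                # same direction as the query
    [0.5 if i in (0, 5) else 0.0 for i in range(EMBEDDING_DIM)],  # partial match
]

ranking = rank_documents(query, docs)
print([index for index, _ in ranking])
```

With real embeddings, the same ranking function works unchanged. For large corpora, an approximate nearest-neighbor index would normally replace the linear scan.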

Availability

Madmon is available through the Sawalni API for production use.

Citation

Kamali, O. (2026). Madmon: Specialized Multilingual Embeddings for Underrepresented Languages. Omneity Labs Technical Report.
Keywords: NLP, Low-Resource Languages, Embeddings, LLM, Moroccan Darija

Related Research