Abstract

Madmon is a multilingual text embedding model built on Google's EmbeddingGemma architecture (2B parameters, 768-dimensional output, 8,192-token context window). It is further trained on 5 billion tokens of specialized data covering Arabic dialects, Berber languages, and other underrepresented language families, and is designed for retrieval, classification, clustering, and semantic search in multilingual and low-resource settings.

Overview

Madmon extends Google's EmbeddingGemma — a 2-billion-parameter embedding model that produces 768-dimensional vectors with an 8,192-token context window — by further training on 5 billion tokens of specialized data curated for underrepresented languages.

Architecture

  • Base model: Google EmbeddingGemma (2B parameters)
  • Embedding dimension: 768
  • Max context length: 8,192 tokens
  • Training data: 5B tokens of curated multilingual data
  • Focus languages: Arabic dialects (Darija, Egyptian, Gulf, Tunisian, etc.), Berber languages (Tamazight, Kabyle, Tachelhit), and other low-resource families

Use Cases

  • Semantic search across multilingual corpora
  • Document classification for Arabic dialect content
  • Cross-lingual retrieval (e.g., Darija query → MSA documents)
  • Clustering social media content by dialect
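All of the tasks above reduce to comparing Madmon's 768-dimensional vectors, most commonly by cosine similarity. A minimal retrieval sketch over precomputed embeddings (the toy 4-dimensional vectors below are illustrative placeholders, not real Madmon outputs):

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy 4-d stand-ins for Madmon's 768-d embeddings.
query = np.array([1.0, 0.0, 0.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # close to the query
    [0.0, 1.0, 0.0, 0.0],   # orthogonal
    [0.5, 0.5, 0.0, 0.0],   # partially related
])
print(cosine_top_k(query, docs, k=2))
```

In a cross-lingual setting (e.g., a Darija query against MSA documents), the same ranking applies because both sides are embedded into the shared vector space.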

Availability

Madmon is available through the Sawalni API for production use.
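As a sketch only: the endpoint URL, field names, and authentication scheme below are assumptions for illustration, not documented Sawalni API parameters — consult the API reference for the real schema. A batch embedding request from Python might be assembled like this:

```python
import json
import urllib.request

API_URL = "https://api.sawalni.com/v1/embeddings"  # hypothetical endpoint

def build_embed_request(texts, model="madmon", api_key="YOUR_API_KEY"):
    """Build an HTTP POST request for a batch of texts (field names are assumed)."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_embed_request(["شحال فالساعة؟", "What time is it?"])
# urllib.request.urlopen(req) would then return one 768-d vector per input.
```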

Citation

Kamali, O. (2026). Madmon: Specialized Multilingual Embeddings for Underrepresented Languages. Omneity Labs Technical Report.
Tags: NLP, Low-Resource Languages, Embeddings, LLM, Moroccan Darija