
Madmon

Multilingual text embeddings specialized for Arabic dialects, Berber languages, and other underrepresented language families. Built on Google's EmbeddingGemma architecture with a deep focus on the languages that frontier models ignore.

Base Model: EmbeddingGemma
Parameters: 2B
Embedding Dim: 768
Context Length: 8,192 tokens
Training Data: 5B tokens
Focus: Arabic dialects, Berber
Output: Dense vectors
License: API access

Architecture

Built on EmbeddingGemma

Madmon extends Google's EmbeddingGemma — a 2-billion-parameter model designed specifically for generating text embeddings. The base architecture produces 768-dimensional vectors with an 8,192-token context window.

We further trained this foundation on curated multilingual data, with a focus on:

  • Arabic dialects — Darija, Egyptian, Gulf, Tunisian, Levantine, and more
  • Berber languages — Tamazight, Kabyle, Tachelhit in Latin, Arabic, and Tifinagh scripts
  • Other low-resource families — African, Southeast Asian, and indigenous languages
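As a sketch of what calling such a model might look like: the endpoint URL, field names, and authentication are all assumptions for illustration, not part of the documentation above. Only the model name and the 768-dimensional output are taken from the specs.

```python
import json

# Hypothetical endpoint; the real URL and request schema are not documented here.
API_URL = "https://api.example.com/v1/embeddings"  # placeholder

def build_request(texts):
    """Build a JSON payload embedding a batch of texts with Madmon.

    Each returned vector is expected to be 768-dimensional (see specs above).
    """
    return {
        "model": "madmon",   # assumed model identifier
        "input": texts,      # list of strings, any supported language or script
    }

# Mixed-script batch: Arabizi and Arabic-script text in one request.
payload = build_request(["salam, labas?", "مرحبا كيف الحال"])
body = json.dumps(payload, ensure_ascii=False)
```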

Semantic Search

Search across multilingual corpora. A Darija query retrieves relevant MSA, French, or English documents through a shared semantic space.
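The retrieval step itself reduces to cosine similarity over embedding vectors. A minimal sketch, using toy 3-dimensional stand-ins for Madmon's real 768-dimensional output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy vectors: a Darija query and two documents (one relevant MSA, one unrelated).
query = [0.9, 0.1, 0.0]
docs = [
    [0.8, 0.2, 0.1],  # relevant document: close in the shared space
    [0.0, 0.1, 0.9],  # unrelated document
]
search(query, docs, k=1)  # → [0]
```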

Content Classification

Classify Arabic dialect content, social media posts, and user-generated text with high-quality representations.
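One simple way to classify on top of fixed embeddings is nearest-centroid: average the vectors of a few labeled examples per class, then assign new text to the closest centroid. A sketch with toy 2-dimensional vectors standing in for Madmon's 768-dimensional output (the labels and values are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_centroid(vec, centroids):
    """Return the label whose class centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Per-class centroids, each the mean embedding of a few labeled examples.
centroids = {
    "sports":   [0.9, 0.1],
    "politics": [0.1, 0.9],
}
nearest_centroid([0.8, 0.3], centroids)  # → "sports"
```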

Cross-Lingual Retrieval

Match content across scripts and languages. Arabizi, Arabic-script, and Latin-script content are mapped into the same embedding space.
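Because all scripts land in one space, cross-script matching is again a similarity lookup: pair each item with its best match on the other side, keeping only pairs above a threshold. A sketch with toy vectors (the threshold value and pairing strategy are illustrative choices, not part of the model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_pairs(latin_vecs, arabic_vecs, threshold=0.8):
    """Pair each Latin-script item with its best Arabic-script match, if any."""
    pairs = []
    for i, lv in enumerate(latin_vecs):
        best = max(range(len(arabic_vecs)),
                   key=lambda j: cosine(lv, arabic_vecs[j]))
        if cosine(lv, arabic_vecs[best]) >= threshold:
            pairs.append((i, best))
    return pairs

# Toy vectors: an Arabizi item should land next to its Arabic-script counterpart.
latin = [[0.9, 0.1, 0.1]]
arabic = [[0.1, 0.9, 0.2], [0.88, 0.12, 0.08]]
match_pairs(latin, arabic)  # → [(0, 1)]
```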

Clustering & Analytics

Group social media content by dialect, topic, or sentiment. Discover patterns in multilingual datasets.
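Grouping by topic or dialect can be done with any standard clustering algorithm over the vectors. A tiny k-means sketch (naive initialization from the first k points; in practice a library implementation with better initialization would be used), again on toy low-dimensional vectors:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def kmeans(vectors, k=2, iters=10):
    """Assign each vector a cluster label via plain Lloyd's iterations."""
    centroids = [list(v) for v in vectors[:k]]  # naive init: first k points
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels

# Toy vectors forming two obvious groups.
vecs = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
kmeans(vecs, k=2)  # → [0, 0, 1, 1]
```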

Try Madmon now

Generate embeddings for any text. 768-dimensional vectors optimized for multilingual content.
