Sawalni

Jan 2023

The first LLM and AI assistant for Moroccan languages. Supports Moroccan Arabic (Darija) in Arabic and Latin scripts (Arabizi), Hassaniya, Tachelhit, Tarifit, and standardized Moroccan Tamazight in Tifinagh. Multilingual foundation with deep Moroccan specialization. Served thousands of users in controlled and public deployments.

LLM, Moroccan Darija, Low-Resource Languages, NLP

wikilangs.org

Jan 2026

NLP models for 340+ Wikipedia languages, no GPU required. An open playground that lets researchers, educators, and language communities explore pre-trained NLP tools for languages with little to no commercial support.

NLP, Low-Resource Languages, Wikipedia, Datasets

WikiLLM

Feb 2026

A family of compact, locally-runnable language models trained on curated Wikipedia data with custom tokenizers per language family. Designed to set a reproducible open baseline for low-resource LLM training with publishable evaluation benchmarks.

LLM, Low-Resource Languages, Wikipedia, Tokenizer

Sawtone

Jan 2025

An open framework for cross-script phonetic alignment and text normalization. Built to solve the pre-processing problem for alloglottographic and non-standardized languages (like Darija). Published in Lingua Posnaniensis.

Phonetics, Transliteration, Cross-Script, NLP

Herd

Jan 2025

Browser superpowers and an MCP server for AI agents. Enables LLM agents to operate any website via function calling and Model Context Protocol integrations.

Agentic AI, MCP, Browser Automation, Open Source

hypersets

Jan 2025

Query terabytes of data with simple SQL; work with massive HuggingFace datasets without fully downloading them.

Datasets, Open Source

borgllm

Jan 2025

LLM infrastructure tooling for distributed and multi-provider setups.

LLM, Open Source

prepress

Jan 2026

Release management for Python, Rust, JavaScript, and Go.

Open Source

Semango

Jan 2025

Semantic analysis and annotation platform for multilingual corpora. Provides tools for cross-lingual semantic tagging, sense disambiguation, and corpus exploration.

NLP, Low-Resource Languages, Datasets, Open Source

wikipedia-monthly

Jun 2024

Monthly-updated, ready-to-use Wikipedia dumps for all 340+ language editions on HuggingFace. Used by leading AI labs including Nous Research as part of their training pipelines. The go-to dataset for researchers needing fresh, clean Wikipedia data at scale.

Datasets, Wikipedia, Open Source, Low-Resource Languages

Gherbal

Jan 2024

Industry-leading language identification (LID) model. State-of-the-art accuracy across standard benchmarks, and the only LID model capable of identifying Arabic dialects, Arabizi, and dozens of African languages that competitors miss entirely. Available via the Sawalni API.

NLP, Low-Resource Languages, Language Identification, Benchmarks

Madmon

Jan 2025

A multilingual text embeddings model specialized for Arabic dialects, Amazigh languages, and other underrepresented language families. Produces high-quality semantic representations for retrieval, classification, and clustering. Available via the Sawalni API.

NLP, Low-Resource Languages, Embeddings, LLM