Open Source
Projects
Tools, models, and platforms we're building in the open.
Sawalni
Jan 2023
The first LLM and AI assistant for Moroccan languages. Supports Moroccan Arabic (Darija) in Arabic and Latin scripts (Arabizi), Hassaniya, Tachelhit, Tarifit, and standardized Moroccan Tamazight in Tifinagh. Multilingual foundation with deep Moroccan specialization. Served thousands of users in controlled and public deployments.
wikilangs.org
Jan 2026
NLP models for 340+ Wikipedia languages — no GPU required. An open playground enabling researchers, educators, and language communities to explore pre-trained NLP tools for languages with little to no commercial support.
WikiLLM
Feb 2026
A family of compact, locally-runnable language models trained on curated Wikipedia data with custom tokenizers per language family. Designed to set a reproducible open baseline for low-resource LLM training with publishable evaluation benchmarks.
Sawtone
Jan 2025
An open framework for cross-script phonetic alignment and text normalization. Built to solve the pre-processing problem for alloglottographic and non-standardized languages (like Darija). Published in Lingua Posnaniensis.
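To give a feel for the kind of pre-processing problem described above, here is a minimal sketch of Arabizi normalization. It is not Sawtone's API or algorithm: the function names and the small digit-to-phoneme table are hypothetical, chosen only to illustrate how informal Latin-script Darija can be mapped to a more consistent phonetic key.

```python
import re

# Illustrative table only (an assumption, not Sawtone's data):
# digits commonly used in Arabizi to stand in for Arabic consonants,
# mapped here to IPA-like symbols.
ARABIZI_DIGITS = {"2": "ʔ", "3": "ʕ", "5": "x", "7": "ħ", "9": "q"}

# Matches a letter repeated three or more times (e.g. "aaa" in "bzaaaf").
ELONGATION = re.compile(r"([^\W\d_])\1{2,}")


def normalize_token(token: str) -> str:
    """Lowercase, collapse letter elongations, and map Arabizi digits."""
    token = ELONGATION.sub(r"\1", token.lower())
    # Tokens with no letters (plain numbers such as "2023") are left alone.
    if not any(ch.isalpha() for ch in token):
        return token
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in token)


def normalize_arabizi(text: str) -> str:
    """Normalize whitespace-separated tokens of Arabizi text."""
    return " ".join(normalize_token(t) for t in text.split())


if __name__ == "__main__":
    print(normalize_arabizi("Bzaaaf 3lach"))
    print(normalize_arabizi("7na f 2023"))
```

A real system would need far more than a lookup table (context-dependent spellings, vowel variation, code-switching), which is the gap a dedicated alignment framework is meant to address.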
Herd
Jan 2025
Browser superpowers and MCP server for AI agents — enables LLM agents to operate any website via function-calling and Model Context Protocol integrations.
hypersets
Jan 2025
Query terabytes of data with simple SQL; work with massive HuggingFace datasets without fully downloading them.
borgllm
Jan 2025
LLM infrastructure tooling for distributed and multi-provider setups.
prepress
Jan 2026
Release management for Python, Rust, JavaScript, and Go.
Semango
Jan 2025
Semantic analysis and annotation platform for multilingual corpora. Provides tools for cross-lingual semantic tagging, sense disambiguation, and corpus exploration.
wikipedia-monthly
Jun 2024
Monthly-updated, ready-to-use Wikipedia dumps for all 340+ language editions on HuggingFace. Used by leading AI labs including Nous Research as part of their training pipelines. The go-to dataset for researchers needing fresh, clean Wikipedia data at scale.
Gherbal
Jan 2024
Industry-leading language identification model. State-of-the-art accuracy across standard benchmarks — the only LID model capable of identifying Arabic dialects, Arabizi, and dozens of African languages that competitors miss entirely. Available via the Sawalni API.
Madmon
Jan 2025
A multilingual text embeddings model specialized for Arabic dialects, Amazigh languages, and other underrepresented language families. Produces high-quality semantic representations for retrieval, classification, and clustering. Available via the Sawalni API.