Open Source Projects
Tools, models, and platforms we're building in the open.
Sawalni
Jan 2023
The first LLM and AI assistant built specifically for Moroccan Darija, supporting both Arabic and Latin scripts. Trained from scratch on a custom corpus capturing Darija's linguistic and cultural subtleties. Served thousands of users in controlled and public deployments. Built by the Omneity Labs founder as a solo project in 2023.
wikilangs.org
Jan 2026
NLP models for 340+ Wikipedia languages — no GPU required. An open playground enabling researchers, educators, and language communities to explore pre-trained NLP tools for languages with little to no commercial support.
WikiLLM
Feb 2026
A family of compact, locally-runnable language models trained on curated Wikipedia data with custom tokenizers per language family. Designed to set a reproducible open baseline for low-resource LLM training with publishable evaluation benchmarks.
Sawtone
Jan 2025
An open framework for cross-script phonetic alignment and text normalization. Built to solve the pre-processing problem for alloglottographic and non-standardized languages (like Darija). Published in Lingua Posnaniensis.
Herd
Jan 2025
Browser superpowers for AI agents: a Model Context Protocol (MCP) server that lets LLM agents operate any website via function calling.
hypersets
Jan 2025
Query terabytes of data with plain SQL, including massive Hugging Face datasets, without downloading them in full.
borgllm
Jan 2025
LLM infrastructure tooling for distributed and multi-provider setups.
prepress
Jan 2026
Release management for Python, Rust, JavaScript, and Go.
Semango
Jan 2025
Semantic analysis and annotation platform for multilingual corpora. Provides tools for cross-lingual semantic tagging, sense disambiguation, and corpus exploration.
wikipedia-monthly
Jun 2024
Monthly-updated, ready-to-use Wikipedia dumps for all 340+ language editions on Hugging Face. Used by leading AI labs, including Nous Research, as part of their training pipelines. The go-to dataset for researchers needing fresh, clean Wikipedia data at scale.