Sawalni

Jan 2023

The first LLM and AI assistant built specifically for Moroccan Darija, supporting both Arabic and Latin scripts. Trained from scratch on a custom corpus capturing Darija's linguistic and cultural subtleties. Served thousands of users in controlled and public deployments. Built by the founder of Omneity Labs as a solo project in 2023.

LLM, Moroccan Darija, Low-Resource Languages, NLP

wikilangs.org

Jan 2026

NLP models for 340+ Wikipedia languages — no GPU required. An open playground enabling researchers, educators, and language communities to explore pre-trained NLP tools for languages with little to no commercial support.

NLP, Low-Resource Languages, Wikipedia, Datasets

WikiLLM

Feb 2026

A family of compact, locally-runnable language models trained on curated Wikipedia data with custom tokenizers per language family. Designed to set a reproducible open baseline for low-resource LLM training with publishable evaluation benchmarks.

LLM, Low-Resource Languages, Wikipedia, Tokenizer

Sawtone

Jan 2025

An open framework for cross-script phonetic alignment and text normalization. Built to solve the pre-processing problem for alloglottographic and non-standardized languages (like Darija). Published in Lingua Posnaniensis.

Phonetics, Transliteration, Cross-Script, NLP

Herd

Jan 2025

Browser superpowers and MCP server for AI agents — enables LLM agents to operate any website via function-calling and Model Context Protocol integrations.

Agentic AI, MCP, Browser Automation, Open Source

hypersets

Jan 2025

Query terabytes of data with simple SQL; work with massive HuggingFace datasets without fully downloading them.

Datasets, Open Source

borgllm

Jan 2025

LLM infrastructure tooling for distributed and multi-provider setups.

LLM, Open Source

prepress

Jan 2026

Release management for Python, Rust, JavaScript, and Go.

Open Source

Semango

Jan 2025

Semantic analysis and annotation platform for multilingual corpora. Provides tools for cross-lingual semantic tagging, sense disambiguation, and corpus exploration.

NLP, Low-Resource Languages, Datasets, Open Source

wikipedia-monthly

Jun 2024

Monthly-updated, ready-to-use Wikipedia dumps for all 340+ language editions on HuggingFace. Used by leading AI labs, including Nous Research, in their training pipelines. The go-to dataset for researchers needing fresh, clean Wikipedia data at scale.

Datasets, Wikipedia, Open Source, Low-Resource Languages