Hebrew ML Datasets Navigator
Trusted89/100Navigate the fragmented landscape of Hebrew and Yiddish ML datasets and models. Covers ivrit.ai (22K+ hours of Hebrew audio, whisper-large-v3 ASR variants, Yiddish models), Dicta (DictaLM 3.0 LLM family, DictaBERT variants, HeQ reading comprehension), the Israeli National NLP Program / NNLP-IL (HebrewSentiment, HebNLI), AlephBERT, and Knesset Plenums. Helps researchers and ML engineers pick the right dataset for a task by use case, license (commercial vs research), Hebrew register coverage, and model-dataset pairing. Use when choosing training data for a Hebrew NLP or ASR project, verifying license compatibility for a commercial product, finding a baseline model for a Hebrew downstream task, or exploring Yiddish ML resources. Do NOT use for Arabic NLP, general HuggingFace dataset discovery, or Hebrew OCR dataset selection (use hebrew-ocr-forms).
Trust score 89/100 (Trusted) · 15+ installs · 3 GitHub contributors · MIT license
The Israeli ML community punches above its weight, but the datasets and models are scattered. ivrit.ai publishes world-class Hebrew speech corpora on one HuggingFace org, Dicta publishes Hebrew LLMs and BERT variants on another, the Israeli National NLP Program maintains benchmarks under HebArabNlpProject. Licenses vary from fully commercial-friendly to research-only. A researcher trying to pick the right combination for fine-tuning a Hebrew sentiment classifier on customer support chat for a commercial product has to hunt across five orgs and read every dataset card.
npx skills-il add skills-il/developer-tools@v1.0.4-hebrew-ml-datasets-navigator --skill hebrew-ml-datasets-navigator -a claude-codeInstall on Claude.ai, Claude Desktop, ChatGPT, Manus, or other platforms
- 1. Click "Download ZIP" to download the skill files.
- 2. Open Claude Desktop and go to Customize > Skills.
- 3. Click "+" and select "Upload a skill", then upload the ZIP file.
- 4. Start a new conversation. The skill will activate automatically when relevant.
When to Apply
- When choosing training data for a Hebrew NLP or ASR project
- When verifying license compatibility for commercial use of a dataset
- When looking for a baseline model for a specific Hebrew task
- When building a Hebrew transcription stack and need to know what ivrit.ai offers
- When researching or building something in Yiddish and need to find resources
Try These Prompts
I want to train a sentiment classifier on Hebrew customer support chat for a commercial SaaS product. Which dataset should I use, which starting model, and what does the license say about attribution?
I am building a Hebrew podcast transcription product. What does ivrit.ai offer, which ASR model should I use in production with low latency, and how do I handle multiple speakers?
I need a Hebrew LLM that runs on consumer hardware (16GB VRAM max) for a Hebrew product. What does Dicta offer, what are the size differences, and what are the upstream licenses?
I am researching Yiddish and looking for datasets and models for speech recognition and text processing. What is available in 2026 and what are the licenses?
Frequently Asked Questions
Changelog
Added Jamba 1.6 and Jamba-Reasoning-3B to the model catalog, acknowledged the DictaLM 3.0 benchmark suite (Translation, Summarization, Winograd, Israeli Trivia, Diacritization), and redirected HebrewSentiment license claims to the live dataset card.
May 19, 2026
Added HEBREW-MMLU, CulturaX, FineWeb-2, ParaShoot, HeSum, academic resources. Stripped 27 em dashes.
Apr 25, 2026
Content fix: find_dataset.py catalog now matches the markdown reference. Added translation datasets (NeuLabs-TedTalks, OPUS-100, MADLAD-400) and additional Dicta text corpora.
Apr 13, 2026
Related Skills
Best practices for programmatic video creation using HyperFrames, plain HTML compositions with GSAP animations rendered to MP4, with full Hebrew and RTL support. Covers composition authoring, data-* timing attributes, GSAP timeline contract, layout-before-animation methodology, visual identity gate, Hebrew fonts via Google Fonts auto-fetch (Heebo, Rubik, Assistant), RTL text with dir="rtl", Hebrew captions via Whisper, Hebrew voiceover via external TTS (Kokoro doesn't support Hebrew), audio-reactive visuals, scene transitions, and bidirectional text with <bdi>. Use when building HTML-based video content or Hebrew social/marketing videos without React. Do NOT use for Remotion or general React video work.
Build and configure Make.com scenarios for Israeli business processes, including Morning (formerly Green Invoice) sync, iCount accounting, Monday.com board automation, Priority ERP data exports, WhatsApp Business messaging, and payment gateways (Cardcom, Tranzila, Grow, Bit). Covers Make.com AI Agents, the Make.com MCP server for exposing scenarios as agent tools, Israel 2026 Invoice Reform, community modules for Israeli apps, Hebrew data transformations, Data Store for VAT period tracking, and Shabbat-aware scheduling. Do NOT use for n8n workflows (use n8n-hebrew-workflows) or Zapier Zaps (use zapier-israeli-integrations).
Guide Israeli startup operations including company formation, Innovation Authority grants, investment agreements, R&D tax benefits, and employee stock options (Option 102). Use when user asks about starting a company in Israel, IIA grants, "Innovation Authority", SAFE agreements (Israeli), convertible notes, Option 102, employee stock options in Israel, R&D tax benefits, preferred enterprise, Yozma 2.0, Delaware flip, or Israeli startup legal/financial setup. Do NOT use for non-Israeli company formation or international tax advice. Always recommend consulting with Israeli lawyer and accountant for binding decisions.
Use at your own risk. Terms of Use · Security
Want to build your own skill? Try the Skill Creator · Submit a Skill