Hebrew LLM Eval Suite

Benchmark and compare LLMs on Hebrew reasoning, comprehension, sentiment, translation, and Israeli cultural knowledge. Wraps the HuggingFace Open Hebrew LLM Leaderboard tasks (HeQ, HebrewSentiment, Hebrew Winograd, translation) plus DictaLM 3.0 benchmark tasks (Summarization, Nikud, Israeli Trivia) into a reproducible evaluation harness. Runs evals against Claude, GPT, Gemini, AI21 Jamba, DictaLM, Llama, and local HuggingFace models. Produces comparison scorecards in JSON and markdown. Use when choosing an LLM for a Hebrew product, answering procurement questions about Hebrew performance, validating a fine-tuned Hebrew model, or tracking Hebrew regressions after a model upgrade. Do NOT use for Arabic NLP, ASR benchmarking, or general English benchmarks.

Trust score 86/100 (Trusted) · 25+ installs · 3 GitHub contributors · MIT license

The Problem

Israeli product teams pick LLMs blind. There is no standardized Hebrew benchmark that a PM can run in an afternoon to compare Claude against GPT against DictaLM against AI21 Jamba on their actual use case. The HuggingFace Open Hebrew LLM Leaderboard is built for base models and few-shot prompts, not for API-hosted chat models. DictaLM publishes benchmark results but only for its own suite. Teams end up guessing, testing informally, or trusting marketing claims.

skills-il · Developer Tools
Version 1.2.0 · MIT license
25 installs · 1,146 views
npx skills-il add skills-il/developer-tools@v1.2.0-hebrew-llm-eval-suite --skill hebrew-llm-eval-suite -a claude-code
Install on Claude.ai, Claude Desktop, ChatGPT, Manus, or other platforms
  1. Click "Download ZIP" to download the skill files.
  2. Open Claude Desktop and go to Customize > Skills.
  3. Click "+" and select "Upload a skill", then upload the ZIP file.
  4. Start a new conversation. The skill will activate automatically when relevant.

When to Apply

  • When choosing an LLM for a new Hebrew product and needing to justify the choice to leadership
  • When answering enterprise procurement questions about Hebrew performance
  • When validating whether a provider upgrade improved or regressed Hebrew quality
  • When validating a fine-tuned Hebrew model against a baseline
  • When comparing providers on a specific task: comprehension, translation, summarization, or diacritization

Try These Prompts

Summarization model pick

We are building a Hebrew news summarization feature and need to pick between Claude Sonnet, GPT-5, and DictaLM-3.0-24B. Run the relevant benchmarks (HeQ, DictaLM Summarization, Winograd) with 1000 samples and 3 runs, and recommend a model with reasoning.

Post-upgrade regression

Anthropic released a new version of claude-sonnet. Run the hebrew-core suite on the new and previous versions and tell me if there was any regression over 2 points on any benchmark.
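
The "regression over 2 points" check in this prompt can be sketched in a few lines of Python. This is a minimal illustration, not the skill's implementation: the function name `find_regressions`, the score dictionaries, and all numbers are hypothetical placeholders, and it assumes scores are on a 0-100 scale where higher is better.

```python
def find_regressions(previous, current, threshold=2.0):
    """Return {benchmark: drop} for benchmarks whose score fell by more than `threshold` points.

    `previous` and `current` map benchmark names to scores on a 0-100 scale
    (an assumption for this sketch). Benchmarks missing from `current` are skipped.
    """
    regressions = {}
    for benchmark, old_score in previous.items():
        if benchmark not in current:
            continue
        drop = old_score - current[benchmark]
        if drop > threshold:  # strictly "over" the threshold, as the prompt words it
            regressions[benchmark] = round(drop, 2)
    return regressions


# Hypothetical placeholder scores, not measured results.
previous_scores = {"HeQ": 80.0, "HebrewSentiment": 75.0, "Hebrew Winograd": 70.0}
current_scores = {"HeQ": 81.5, "HebrewSentiment": 72.0, "Hebrew Winograd": 69.0}

print(find_regressions(previous_scores, current_scores))
```

With these placeholder numbers only HebrewSentiment is flagged: it drops 3 points, while Hebrew Winograd drops 1 point and HeQ improves. In practice the threshold should be read against run-to-run noise, which is why the example prompts ask for several runs.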

Claude vs Jamba

I am building a Hebrew chatbot and deciding between Claude Haiku and AI21 Jamba 1.5 Mini. Compare them on HeQ, HebrewSentiment, and HebNLI with 500 samples and 3 runs, and provide a scorecard with a recommendation.

Local vs cloud

We have a data residency constraint requiring a local model. Run Hebrew benchmarks on DictaLM-3.0-Nemotron-12B-Instruct and compare to Claude Sonnet quality. How much quality am I giving up?

Changelog

v1.2.0

Added Gemini 3, Jamba 1.6, and Jamba-Reasoning-3B to the model roster; reconciled SKILL.md and run_eval.py model lists; relabeled scorecard table as illustrative placeholders, not measured results; added evidence.json.

May 20, 2026

v1.1.0

Added HEBREW-MMLU, lm-evaluation-harness and inspect_ai cross-references, Aya/Hebrew-Mistral/Hebrew-Gemma comparators, claude-opus-4-7, and a tokenizer fairness section; verified DictaLM 2.0/3.0; fixed the HE table row.

Apr 25, 2026
