Why not use Exact Match as the primary HeQ metric?

Because Exact Match (EM) on Hebrew is brittle. Final-letter (sofit) forms, nikud (diacritics), and whitespace variations make correct answers score EM 0 unless the text is normalized first. F1 with Hebrew-aware normalization is the more reliable metric.
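As a rough illustration of the point above, here is a minimal sketch of Hebrew-aware normalization followed by token-level F1. The function names `normalize_he` and `token_f1` are hypothetical, and the choices made here (strip nikud and cantillation marks, fold final letters to their regular forms, collapse whitespace) are assumptions about what such normalization might include, not the skill's actual implementation.

```python
import re
import unicodedata
from collections import Counter

# Hebrew points and cantillation marks (U+0591-U+05C7), excluding the
# maqaf (U+05BE), which is a hyphen and is turned into a space instead.
NIKUD = re.compile(r"[\u0591-\u05BD\u05BF-\u05C7]")

# Fold final (sofit) letters to regular forms (assumed normalization choice):
# final kaf, mem, nun, pe, tsadi -> kaf, mem, nun, pe, tsadi.
SOFIT = str.maketrans(
    "\u05DA\u05DD\u05DF\u05E3\u05E5",
    "\u05DB\u05DE\u05E0\u05E4\u05E6",
)


def normalize_he(text: str) -> str:
    """Strip nikud, fold sofit letters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u05BE", " ")
    text = NIKUD.sub("", text)
    text = text.translate(SOFIT)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_he(prediction).split()
    gold_tokens = normalize_he(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, a vocalized answer and its unvocalized gold string differ as raw strings (EM 0) but score F1 1.0 after normalization.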

How many samples per benchmark?

At least 500, ideally 1,000 or more. Small benchmarks such as Hebrew Winograd (fewer than 300 items) need multiple runs (at least 3), with the standard deviation reported, to give a reliable estimate.
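The multi-run reporting described above can be sketched as follows. The helper name `summarize_runs` and the output keys are hypothetical. The only assumption is that each run produces one score on a 0-100 scale.

```python
import statistics


def summarize_runs(scores: list[float], min_runs: int = 3) -> dict:
    """Return mean and sample standard deviation across repeated runs."""
    if len(scores) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(scores)}")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(scores),
    }


if __name__ == "__main__":
    # Three hypothetical Winograd accuracy scores from separate runs.
    print(summarize_runs([60.0, 62.0, 64.0]))
```

Reporting the standard deviation next to the mean shows whether a gap between two models is larger than the run-to-run noise.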

Is BLEU reliable for Hebrew translation?

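Not on its own. Hebrew attaches prefixes and suffixes directly to words, so a valid translation often misses exact word-level n-gram matches and BLEU underestimates its quality. Always report chrF alongside BLEU and manually check a sample of low-scoring outputs.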
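To show why a character-level metric is more forgiving of morphological variation, here is a simplified, self-contained character n-gram F-score in the spirit of chrF. It is a sketch for intuition only: the function name `simple_chrf` is hypothetical, the averaging scheme is a simplification, and its numbers will not match an established implementation such as sacreBLEU, which should be used for reported results.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision/recall, then F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # One side is too short for this n-gram order.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100 * (1 + b2) * p * r / (b2 * p + r)
```

A hypothesis that differs from the reference only by an attached prefix shares no whole word with it, yet it still shares most of its character n-grams and gets partial credit here.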

How do I fairly compare a base model (DictaLM-Base) to a chat model (Claude)?

Either use few-shot prompting on both, or use the instruction-tuned DictaLM variants (e.g. DictaLM-3.0-Nemotron-12B-Instruct). A zero-shot chat-prompt comparison against a base model unfairly disadvantages the base model.
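One way to apply the few-shot option is to build a single plain-text prompt and send the identical text to both the base model and the chat model. The sketch below assumes a simple question/answer layout. The function name `build_few_shot_prompt` and the default Hebrew labels are illustrative choices, not the format used by the leaderboard or by this skill.

```python
def build_few_shot_prompt(
    demos: list[tuple[str, str]],
    question: str,
    q_label: str = "שאלה",   # "Question" (assumed label)
    a_label: str = "תשובה",  # "Answer" (assumed label)
) -> str:
    """Format k solved demonstrations followed by the open question."""
    blocks = [f"{q_label}: {q}\n{a_label}: {a}" for q, a in demos]
    # The final block ends with an open answer slot that the model completes.
    blocks.append(f"{q_label}: {question}\n{a_label}:")
    return "\n\n".join(blocks)
```

Because the prompt ends with an open answer slot, a base model can complete it directly, and a chat model can be given the same text as its user message.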

How do I track regressions when providers silently upgrade models?

Log the exact version string returned by the API (for example claude-opus-4-6-20251001, not just claude-opus-4-6). Save a scorecard per version and re-run the suite on every major upgrade. The diff between the scorecards of two versions is your regression signal.
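The scorecard diff can be sketched as below. The scorecard shape (a mapping from benchmark name to a 0-100 score) and the function name `find_regressions` are assumptions for illustration. The actual JSON scorecard produced by the skill may be structured differently.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    threshold: float = 2.0,
) -> dict[str, float]:
    """Return benchmarks whose score dropped by more than `threshold` points."""
    regressions = {}
    for benchmark, old_score in previous.items():
        new_score = current.get(benchmark)
        if new_score is None:
            continue  # Benchmark was not run on the new version.
        drop = old_score - new_score
        if drop > threshold:
            regressions[benchmark] = round(drop, 2)
    return regressions


if __name__ == "__main__":
    # Hypothetical scorecards keyed by benchmark, one per logged model version.
    old = {"HeQ": 80.0, "HebrewSentiment": 70.0}
    new = {"HeQ": 77.5, "HebrewSentiment": 69.0}
    print(find_regressions(old, new))
```

Keying each saved scorecard by the full version string keeps the comparison tied to the exact model that produced the scores.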

Hebrew LLM Eval Suite

Trusted: 76/100


Benchmark and compare LLMs on Hebrew reasoning, comprehension, sentiment, translation, and Israeli cultural knowledge. Wraps the HuggingFace Open Hebrew LLM Leaderboard tasks (HeQ, HebrewSentiment, Hebrew Winograd, translation) plus DictaLM 3.0 benchmark tasks (Summarization, Nikud, Israeli Trivia) into a reproducible evaluation harness. Runs evals against Claude, GPT, Gemini, AI21 Jamba, DictaLM, Llama, and local HuggingFace models. Produces comparison scorecards in JSON and markdown. Use when choosing an LLM for a Hebrew product, answering procurement questions about Hebrew performance, validating a fine-tuned Hebrew model, or tracking Hebrew regressions after a model upgrade. Do NOT use for Arabic NLP, ASR benchmarking, or general English benchmarks.

The Problem

Israeli product teams pick LLMs blind. There is no standardized Hebrew benchmark that a PM can run in an afternoon to compare Claude against GPT against DictaLM against AI21 Jamba on their actual use case. The HuggingFace Open Hebrew LLM Leaderboard is built for base models and few-shot prompts, not for API-hosted chat models. DictaLM publishes benchmark results but only for its own suite. Teams end up guessing, testing informally, or trusting marketing claims.

Author: skills-il | Developer Tools | 53 installs | 2,577 views

Version 1.2.0 | MIT license

Updated: July 12, 2026 | Tags: llm-eval, benchmark, hebrew, HeQ, DictaLM, AI21-Jamba, Claude, GPT, ml, israel

How to use this skill


1. Click "Download ZIP" to download the skill files.
2. Open Claude Desktop and go to Customize > Skills.
3. Click "+" and select "Upload a skill", then upload the ZIP file.
4. Start a new conversation. The skill will activate automatically when relevant.


Developers can install via the command line (CLI):

npx skills-il add skills-il/developer-tools@v1.2.0-hebrew-llm-eval-suite --skill hebrew-llm-eval-suite -a claude-code

When to Apply

When choosing an LLM for a new Hebrew product and needing to justify the choice to leadership
When answering enterprise procurement questions about Hebrew performance
When validating whether a provider upgrade improved or regressed Hebrew quality
When validating a fine-tuned Hebrew model against a baseline
When comparing providers on a specific task: comprehension, translation, summarization, or diacritization

Try These Prompts

Summarization model pick

We are building a Hebrew news summarization feature and need to pick between Claude Sonnet, GPT-5, and DictaLM-3.0-24B. Run the relevant benchmarks (HeQ, DictaLM Summarization, Winograd) with 1000 samples and 3 runs, and recommend a model with reasoning.

Post-upgrade regression

Anthropic released a new version of claude-sonnet. Run the hebrew-core suite on the new and previous versions and tell me if there was any regression over 2 points on any benchmark.

Claude vs Jamba

I am building a Hebrew chatbot and deciding between Claude Haiku and AI21 Jamba 1.5 Mini. Compare them on HeQ, HebrewSentiment, and HebNLI with 500 samples and 3 runs, and provide a scorecard with a recommendation.

Local vs cloud

We have a data residency constraint requiring a local model. Run Hebrew benchmarks on DictaLM-3.0-Nemotron-12B-Instruct and compare to Claude Sonnet quality. How much quality am I giving up?

Frequently Asked Questions

Changelog

v1.2.0

Added Gemini 3, Jamba 1.6, and Jamba-Reasoning-3B to the model roster; reconciled SKILL.md and run_eval.py model lists; relabeled scorecard table as illustrative placeholders, not measured results; added evidence.json.

May 20, 2026

v1.1.0

HEBREW-MMLU, lm-evaluation-harness + inspect_ai cross-refs, verified DictaLM 2.0/3.0, Aya/Hebrew-Mistral/Hebrew-Gemma comparators, claude-opus-4-7, fixed HE table row, tokenizer fairness section.

Apr 25, 2026

