Benchmark and compare LLMs on Hebrew reasoning, comprehension, sentiment, translation, and Israeli cultural knowledge. Wraps the HuggingFace Open Hebrew LLM Leaderboard tasks (HeQ, HebrewSentiment, Hebrew Winograd, translation) plus DictaLM 3.0 benchmark tasks (Summarization, Nikud, Israeli Trivia) into a reproducible evaluation harness. Runs evals against Claude, GPT, Gemini, AI21 Jamba, DictaLM, Llama, and local HuggingFace models. Produces comparison scorecards in JSON and markdown. Use when choosing an LLM for a Hebrew product, answering procurement questions about Hebrew performance, validating a fine-tuned Hebrew model, or tracking Hebrew regressions after a model upgrade. Do NOT use for Arabic NLP, ASR benchmarking, or general English benchmarks.
Trust score 78/100 (Trusted) · 2 GitHub contributors · MIT license
Israeli product teams pick LLMs blind. There is no standardized Hebrew benchmark that a PM can run in an afternoon to compare Claude against GPT against DictaLM against AI21 Jamba on their actual use case. The HuggingFace Open Hebrew LLM Leaderboard is built for base models and few-shot prompts, not for API-hosted chat models. DictaLM publishes benchmark results but only for its own suite. Teams end up guessing, testing informally, or trusting marketing claims.
npx skills-il add skills-il/developer-tools --skill hebrew-llm-eval-suite -a claude-codeWe are building a Hebrew news summarization feature and need to pick between Claude Sonnet, GPT-5, and DictaLM-3.0-24B. Run the relevant benchmarks (HeQ, DictaLM Summarization, Winograd) with 1000 samples and 3 runs, and recommend a model with reasoning.
Anthropic released a new version of claude-sonnet. Run the hebrew-core suite on the new and previous versions and tell me if there was any regression over 2 points on any benchmark.
I am building a Hebrew chatbot and deciding between Claude Haiku and AI21 Jamba 1.5 Mini. Compare them on HeQ, HebrewSentiment, and HebNLI with 500 samples and 3 runs, and provide a scorecard with a recommendation.
We have a data residency constraint requiring a local model. Run Hebrew benchmarks on DictaLM-3.0-Nemotron-12B-Instruct and compare to Claude Sonnet quality. How much quality am I giving up?
Build and configure Make.com scenarios for Israeli business processes, including Morning (formerly Green Invoice) sync, iCount accounting, Monday.com board automation, Priority ERP data exports, WhatsApp Business messaging, and payment gateways (Cardcom, Tranzila, Grow, Bit). Covers Make.com AI Agents, Israel 2026 Invoice Reform, community modules for Israeli apps, Hebrew data transformations, Data Store for VAT period tracking, and Shabbat-aware scheduling. Do NOT use for n8n workflows (use n8n-hebrew-workflows) or Zapier Zaps (use zapier-israeli-integrations).
Build and manage shipping integrations with Israeli carriers, including Israel Post, Cheetah, HFD, and Mahir Li, plus locker pickup services (BOX2GO, Shlager, Done). Use when user asks about "shipping Israel", "Cheetah delivery", "meshloach", "shipping label", "HFD", "locker pickup Israel", or setting up carrier integrations for an e-commerce store. Covers carrier selection, Israeli address formatting, label generation, cross-carrier tracking system setup, and customer delivery notifications. Do NOT use for looking up a specific package tracking status (direct users to mypost.israelpost.co.il or hfd.co.il). Do NOT use for international shipping outside Israel or customs/import.
Build Hebrew voice bots and IVR systems with speech-to-text, text-to-speech, and telephony integration for Israeli businesses. Covers OpenAI Whisper Hebrew, Google Cloud STT/TTS, Azure Speech, Amazon Polly, IVR menu design for Sunday-Thursday hours, voicemail transcription, accent handling, and +972 phone integration. Do NOT use for text-based chatbots (use hebrew-chatbot-builder) or Hebrew NLP without voice (use hebrew-nlp-toolkit).
Want to build your own skill? Try the Skill Creator · Submit a Skill