How We Validate Skill Content: Fact-Checks, Meta-Skills, and Human Review
A detailed look at the layers of content validation we run when we touch a skill: per-skill fact-checks at update time, an independent LLM judge that verifies against live sources, domain-expert simulation, human review, and our roadmap for expert verification
The Real Problem With Skills
When people talk about skill quality, most think about evals: does the skill "work"? Does the agent complete the task it was given? That matters, but it's only half the picture.
The bigger challenge is the content itself. A skill that teaches an agent how to file a VAT report relies on tax rates, thresholds, dates, links to official forms, and Israeli Tax Authority procedures. All of those change. VAT rates increase, national insurance ceilings get updated, forms get replaced, procedures change. A skill that was perfect a year ago can today return a confidently wrong answer, which is a much bigger problem than a skill that simply doesn't work.
That's exactly why we built a multi-layered content validation system for every skill in the catalog. This guide explains how it works.
Layer 1: Per-Skill Fact Checks at Update Time
Whenever we touch a skill, whether creating a new one or updating an existing one, the content goes through three automated checks before any change ships. These checks rely on an LLM with live web access at runtime, so the model can verify time-sensitive facts (tax rates, ceilings, dates) against the live source instead of relying only on its training data. Every other layer in this guide builds on top of this foundation.
Link Validation
Every URL mentioned in the skill is hit with an HTTP request. If a link returns 404, 500, or redirects elsewhere, the update is blocked until it's fixed. This catches government pages that moved to a new structure, replaced forms, removed articles, and deleted repos.
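As a rough illustration of the rule above, here is a minimal Python sketch using only the standard library. The function names (`link_verdict`, `check_link`, `blocking_links`) and the "ok / broken / redirected" labels are assumptions made for this example, not the names used in our pipeline.

```python
import urllib.error
import urllib.request


def link_verdict(requested_url: str, status: int, final_url: str) -> str:
    """Classify one fetched link as 'ok', 'broken', or 'redirected'."""
    if status >= 400:
        return "broken"  # 404, 500, and other error responses
    if final_url.rstrip("/") != requested_url.rstrip("/"):
        return "redirected"  # the request ended up at a different address
    return "ok"


def check_link(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL (urllib follows redirects) and classify the result."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return link_verdict(url, response.status, response.geturl())
    except urllib.error.HTTPError as err:
        return link_verdict(url, err.code, url)
    except OSError:
        return "broken"  # DNS failure, refused connection, timeout


def blocking_links(urls: list[str]) -> dict[str, str]:
    """Return every link whose verdict should block the update."""
    verdicts = {url: check_link(url) for url in urls}
    return {url: verdict for url, verdict in verdicts.items() if verdict != "ok"}
```

An empty result from `blocking_links` would mean the update can proceed; anything else lists the links to fix first.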
Library Version Validation
Skills that depend on a specific library or tool (e.g., an SDK for a government service) are checked against npm and PyPI. If the version is outdated or the API was deprecated, that bubbles up for handling.
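A minimal sketch of what such a check could look like, assuming the public PyPI JSON endpoint and the npm registry's `latest` endpoint. The helper names and the three status labels are hypothetical, and the version parser is deliberately simplified (it ignores pre-release ordering).

```python
import json
import re
import urllib.request


def latest_version(package: str, registry: str) -> str:
    """Look up the newest published version on PyPI or npm."""
    if registry == "pypi":
        url, path = f"https://pypi.org/pypi/{package}/json", ("info", "version")
    elif registry == "npm":
        url, path = f"https://registry.npmjs.org/{package}/latest", ("version",)
    else:
        raise ValueError(f"unknown registry: {registry}")
    with urllib.request.urlopen(url, timeout=10) as response:
        data = json.load(response)
    for key in path:
        data = data[key]
    return data


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '2.31.0' into (2, 31, 0), dropping pre-release and build suffixes."""
    core = version.split("+")[0].split("-")[0]
    numbers = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    numbers += [0] * (3 - len(numbers))  # pad '2.1' to (2, 1, 0)
    return tuple(numbers)


def version_status(pinned: str, latest: str) -> str:
    """Compare the version a skill mentions with the newest release."""
    pinned_v, latest_v = parse_version(pinned), parse_version(latest)
    if pinned_v >= latest_v:
        return "current"
    if pinned_v[0] < latest_v[0]:
        return "major_behind"  # a major bump may have broken the API
    return "outdated"
```

In this sketch, `major_behind` is the case most worth surfacing, since it is where breaking API changes are most likely.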
LLM Content Analysis
The model reads the entire skill and looks for factually wrong or outdated information. The prompt is tuned to four specific things: regulatory or legal information that's changed, APIs or libraries that have been replaced or removed, version-specific claims that are no longer true, and references to discontinued services. It does not flag style issues, gaps, or wording preferences — only factual errors.
If a VAT rate changed, a government form was replaced, or a library went through a major version bump that breaks the API, the model catches it, proposes a fix, and we approve the diff before it ships. The three layers below build on top of this foundation.
Layer 2: LLM as a Judge (Independent Judge Review)
Layer 1 uses an LLM to read the skill and look for mistakes, a pattern commonly called "LLM as a judge." It works well for periodic review of existing content, but it has a known weakness: models tend to agree with themselves. If the same model that wrote "VAT is 18%" is asked to verify the number, it will usually approve the claim even when it's wrong. This is classic confirmation bias, and published evals show the accuracy gap between self-check and fresh-context check can reach tens of percentage points.
When we create a new skill or push a significant update, we run a stronger variant of the same pattern: an independent subagent in a fresh context that never saw the process that produced the content. It receives only the final files plus an explicit list of sources we collected along the way, and its only job is to verify every claim in the skill against the cited source.
How It Works in Practice
- Every numeric or regulatory claim in the skill is logged in an evidence file with an official URL: VAT rates, form names, ceilings, statute quotations, regulation numbers. If something doesn't make it into the evidence file, it doesn't pass the judge.
- The judge receives the content + the evidence file, and for every item it pulls the live source in real time and checks whether the claim actually appears there as written in the skill.
- The output is a structured verdict: PASS or FAIL, a list of claims that failed verification, and a list of claims that have no evidence entry at all.
- Only a PASS verdict with both lists empty advances to publishing. Anything else loops back for fixing.
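The gating logic in the steps above can be sketched as follows. This is an illustrative Python model, not our actual schema: the `JudgeVerdict` fields, the claim-id keys, and the `source_supports` callable (a stand-in for "pull the live source and check the claim appears there") are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class JudgeVerdict:
    verdict: str  # "PASS" or "FAIL"
    failed_claims: list[str] = field(default_factory=list)
    missing_evidence: list[str] = field(default_factory=list)

    def can_publish(self) -> bool:
        """Only PASS with both lists empty advances to publishing."""
        return (
            self.verdict == "PASS"
            and not self.failed_claims
            and not self.missing_evidence
        )


def run_judge(
    claims: dict[str, str],      # claim id -> claim text as written in the skill
    evidence: dict[str, str],    # claim id -> official URL from the evidence file
    source_supports: Callable[[str, str], bool],  # hypothetical live-source check
) -> JudgeVerdict:
    failed, missing = [], []
    for claim_id, claim_text in claims.items():
        url = evidence.get(claim_id)
        if url is None:
            missing.append(claim_id)  # no evidence entry at all
        elif not source_supports(url, claim_text):
            failed.append(claim_id)   # source does not back the claim
    verdict = "PASS" if not failed and not missing else "FAIL"
    return JudgeVerdict(verdict, failed, missing)
```

Keeping "failed verification" and "no evidence entry" as separate lists mirrors the verdict described above: the first needs a content fix, the second needs a source.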
Why Not Self-Check
The agent that wrote client.get('/v1/accounts') will happily confirm "GET /v1/accounts looks right" when asked to verify itself. A fresh subagent that only sees the final file, with no knowledge of who wrote what, treats every claim as foreign input and actually does the work: pulls the official docs and compares. That's why we dispatch a separate subagent with a clean message thread rather than another pass of the same model.
Expert Review Simulation
A second, complementary judge attacks from a totally different angle. Instead of asking "is everything that's written correct," it asks "is anything essential missing." We load a domain-expert persona (a CPA for tax skills, a lawyer for legal skills, a licensed nurse for health skills) and prompt the LLM: "as the expert, what would you expect to see here that you don't?"
The output is a list of gaps classified by severity (critical / major / minor). A critical finding blocks publishing until it's either fixed or explicitly justified as out of scope.
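One way to express that blocking rule, as a small Python sketch. The `Gap` record and its field names are hypothetical; only the three severity levels and the "fixed or explicitly justified" condition come from the process described here.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    description: str
    severity: str  # "critical", "major", or "minor"
    fixed: bool = False
    out_of_scope_reason: str = ""  # explicit justification, if any


def blocking_gaps(gaps: list[Gap]) -> list[Gap]:
    """Critical gaps that are neither fixed nor justified as out of scope."""
    return [
        gap
        for gap in gaps
        if gap.severity == "critical"
        and not gap.fixed
        and not gap.out_of_scope_reason.strip()
    ]
```

Publishing would proceed only when `blocking_gaps` returns an empty list; major and minor gaps are reported but do not block in this sketch.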
Eval Loop on Skill Descriptions
Beyond the content itself, we run an LLM judge on every skill's description field too. It receives 8-12 queries (positive, negative, and edge cases) and returns, for each, whether the skill should have triggered or not. If the match rate drops below 90%, the description gets refined until it passes. To prevent overfitting we split the queries 70/30 between training and a holdout set, and the 90% threshold is measured on the holdout. This minimizes the two failure modes that hurt users most: a relevant question where the skill silently fails to load, or an unrelated question where it fires for no reason.
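A minimal sketch of the split-and-measure step, assuming a fixed random seed and a `judge_says_trigger` callable that stands in for the LLM judge. Both are illustrative choices, as are the names `TriggerQuery`, `split_queries`, and `match_rate`.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TriggerQuery:
    text: str
    should_trigger: bool  # expected behaviour for this query


def split_queries(queries, holdout_fraction=0.3, seed=0):
    """Shuffle once, then return (training, holdout) lists."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    holdout_size = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[holdout_size:], shuffled[:holdout_size]


def match_rate(queries, judge_says_trigger: Callable[[str], bool]) -> float:
    """Share of queries where the judge agrees with the expected behaviour."""
    matches = sum(judge_says_trigger(q.text) == q.should_trigger for q in queries)
    return matches / len(queries)


def description_passes(holdout, judge_says_trigger, threshold=0.9) -> bool:
    """The threshold is measured on the holdout set only."""
    return match_rate(holdout, judge_says_trigger) >= threshold
```

With 10 queries, a 30% holdout is 3 queries, so a 90% threshold in practice means all three holdout queries must match (2 of 3 is about 67%).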
This is all the same broader family of patterns under the "LLM as a judge" umbrella, with two mechanisms that make them more reliable: clean context that prevents confirmation bias, and structured output criteria (JSON with explicit fields) that force the judge to return a verdict you can verify, not a vague opinion. It's not a substitute for a human, but it catches enough that the human in Layer 4 spends their time on real judgment calls rather than mechanical fact-checking.
Layer 3: Meta-Skills (Skills That Test Skills)
This is the part that best embodies dogfooding: we use skills themselves to test other skills. The same technology we ship to customers is what we use for quality control on ourselves.
Automated content audit
An internal skill both writes new skills and analyzes existing ones, returning a report on content issues: gaps, unclear phrasing, weak triggers, missing examples. We run it on every new skill before publishing, and on existing skills when something in the monitoring system raises a flag.
Periodic gap analysis on skills and MCPs
Other internal skills perform systematic gap analysis on an existing catalog item: what's missing compared to the standard of top-quality skills, what's changed since the last version, which sections need refreshing, and which sources need re-verifying. These are our main tools for refreshing older skills.
Layer 4: Human Review
Automation is great, but when it comes to Israeli regulation or legal content, a human has to be in the loop. The human layer includes:
Diff Review on Manual Updates
Every new skill and every significant manual update goes through a diff review before publishing. That's how we catch things that don't quite make sense, awkward phrasing, or editing mistakes. There's no human in the loop inside the layer-1 / layer-2 / layer-3 checks themselves — they run as autonomous steps — but a human approves the final diff before any change ships.
Conversation Flow Testing via Try Skill
Before publishing a new skill, we typically run an actual conversation against it through the "Try Skill" component. This isn't an automated test; it's a person talking to the skill, trying unexpected scenarios, and attempting to break it. Mistakes that automation misses often surface here.
Community Reports
Every skill in the catalog has a "report an issue" button that creates a record in the system. The system classifies reports into five categories: incorrect content, broken link, broken install instructions, security concern, and other. It's a 24/7 validation layer that doesn't require our team to initiate anything.
Layer 5: Security Scanner (Tank)
Tank is a dedicated security scanner that runs on every MCP and skill in the catalog. It checks six aspects:
- Package integrity and sandboxed runtime
- Static code analysis (Bandit + Semgrep)
- Prompt injection detection
- Exposed secrets scanning
- Dependency audit against the OSV database
- Structure and metadata validation
This relates less to content accuracy and more to safety, but it matters for overall trust. A skill can be factually correct and still dangerous to use if, for example, it runs external scripts without verification.
Layer 6: GitHub Verification
Alongside Tank, every skill is checked against the open agentskills.io spec and GitHub's own security signals. This layer is not about content accuracy; it's about the supply chain behind the code: that the repo is locked down, that the release is signed, and that what the CLI installs is provably what was pushed to GitHub. The output is a Security Scorecard with 15 signals across three tiers:
- Critical (5): spec compliance, secret scanning, code scanning, Sigstore-signed release, declared SPDX license
- Recommended (8): tag protection, branch protection, signed commits, SECURITY.md, MFA, CODEOWNERS, Dependabot, semver match
- Bonus (2): fresh release, tree SHA matches HEAD
When all five Critical signals pass, the skill earns the green "Verified ✓" badge and a bonus of up to 10 points on its trust score. Release-side signals (release tag, signed commit, Sigstore attestation) refresh on every push; repo-settings signals (code scanning, secret scanning, MFA, branch protection) refresh weekly via a dedicated workflow. The full breakdown lives in the GitHub Verification checklist.
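To make the tier structure concrete, here is a small Python sketch of the badge rule. The signal names follow the lists above, but the snake_case keys and the shape of the `scorecard` result are assumptions for illustration; how passed signals map to trust-score bonus points is not modeled.

```python
CRITICAL = (
    "spec_compliance", "secret_scanning", "code_scanning",
    "sigstore_signed_release", "spdx_license",
)
RECOMMENDED = (
    "tag_protection", "branch_protection", "signed_commits", "security_md",
    "mfa", "codeowners", "dependabot", "semver_match",
)
BONUS = ("fresh_release", "tree_sha_matches_head")


def scorecard(results: dict[str, bool]) -> dict[str, object]:
    """Summarize per-tier passes; a missing signal counts as not passed."""
    def passed(tier: tuple[str, ...]) -> int:
        return sum(1 for name in tier if results.get(name, False))

    return {
        "critical_passed": passed(CRITICAL),
        "recommended_passed": passed(RECOMMENDED),
        "bonus_passed": passed(BONUS),
        # The badge depends on the Critical tier only.
        "verified": passed(CRITICAL) == len(CRITICAL),
    }
```

Treating an absent signal as a failure is a conservative choice in this sketch: a skill cannot earn the badge because a check silently did not run.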
Layer 7: Trust Score
Every skill gets a calculated trust score, a number between 0 and 100 based on 6 metrics: code quality, permissions, data handling, publisher reputation, maintenance, and documentation, plus up to 10 bonus points from GitHub Verification. The score is displayed to users before installation so they know what they're working with.
We have a separate detailed guide on how the trust score is calculated and how to improve it. See the Trust Score Guide.
What's Next: Expert Verification
This is the direction we're most excited about. Automation catches a lot, the community catches some, but the highest standard will only come from humans who are real experts in the domain.
Expert Eye on Sensitive Domains
The first step is to add an expert reviewer for skills in legal and tax domains. The idea: before such a skill ships, someone with a real background in the domain (or an external advisor) goes over the content and approves it. It won't be a full review, but it will filter out a serious layer of mistakes in domains where the cost of a mistake is especially high.
Skill Certification
Beyond that, we're planning a deep professional review process. A certified accountant will review a tax skill. A lawyer will review a legal skill. Each expert will sign the content they approved, and the trust tier will be updated to "expert verified".
Users will see a clear difference between a skill that passed automated fact-check and a skill that passed professional certification. Both are correct, but the level of confidence is different.
Expanded Community Review
We're planning to let professional users report mistakes directly from the skill page, with a fix workflow that includes credit to the reporter. This will attract a community of professionals who care about the quality of content in their domain.
Live Source Integration
Instead of a skill "knowing" the VAT rate, it would read it in real time from the Israeli Tax Authority API. This would eliminate the need for manual updates to that fact entirely: the skill stays up to date because it doesn't store the fact at all and fetches it on every request.
The challenge here is API availability from regulatory bodies. Some ministries offer open APIs, some require registration, some don't provide any. We're pushing for more official data to be opened.
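The "fetch, don't store" idea can be sketched in a few lines of Python. Everything here is hypothetical: `LiveFact`, the injected `fetch` callable, and the placeholder rates in the usage example. No real Tax Authority endpoint is assumed.

```python
from typing import Callable


class LiveFact:
    """A value fetched from its source on every read and never cached."""

    def __init__(self, name: str, fetch: Callable[[], float]):
        self.name = name
        self._fetch = fetch  # hypothetical call to an official API

    def read(self) -> float:
        return self._fetch()


def vat_amount(net_price: float, vat_rate: LiveFact) -> float:
    """Compute VAT using whatever rate the source reports right now."""
    return round(net_price * vat_rate.read(), 2)


# Usage with a stand-in source whose value changes between calls.
current = {"rate": 0.10}  # placeholder value, not a real rate
rate = LiveFact("vat_rate", lambda: current["rate"])
before = vat_amount(100.0, rate)
current["rate"] = 0.20    # the source changes; the skill text does not
after = vat_amount(100.0, rate)
```

Because nothing is stored, `after` reflects the new rate without any edit to the skill. The trade-off is a hard dependency on the source being reachable at request time.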
Frequently Asked Questions
How long does it take from a regulatory change to a skill update? It depends on how the signal reaches us. If a user reports a change via the "Report an issue" button on the skill page, or if we spot a change in our own monitoring of Israeli legislation, we open the skill for an update that day or the next. Community reports are the main detection channel, and we strongly encourage practitioners who use a skill in their daily work to use the report button without hesitation.
What if the judge "fixes" something that wasn't broken? The person running the update sees the proposed diff before anything ships — if something looks off, they just don't approve the fix. That's effectively human-in-the-loop at the diff level, not at the judge's autonomous-decision level. The judge itself only flags factual errors, not style or subjective calls, which keeps false positives low.
Has every skill in the catalog gone through all the layers? Not necessarily. The Independent Judge layer (based on an inspectable evidence file) was added to the system recently, so skills we created or updated after that point went through all the layers; older skills go through them for the first time on their next update (the process creates the evidence file and runs the judges if they're missing). We don't do a retroactive backfill on untouched skills, so the rollout is gradual.
What about external skills (skills from repos outside the skills-il GitHub org)? External skills can't run the evidence-file layer because we don't have push access to the source repo, so the file would have nowhere to live. Externals still go through Tank security scanning, GitHub Verification, LLM-based prompt-injection review, and our trust-score calculation, but the evidence file and the Independent Judge against it apply only to skills we host. If an external skill is later imported into our org, the full pipeline kicks in.
How do you know your fact-check model itself isn't wrong? Good question, and there's no perfect answer. To minimize the risk, the system prompt is deliberately conservative: the model flags only what it can demonstrate is wrong, and never flags style or subjective calls. At low severity there are sometimes false positives. We periodically review the pipeline's output and tune the prompt based on error patterns.
What about skills that don't depend on regulation? Do they go through all these layers too? Yes, though some checks are less relevant. A code-editing skill won't see many regulatory findings, but it still goes through link validation, library version checks, security scanning (Tank), and human review at publishing time.
Further Reading
- Trust Score Guide - how the trust score is calculated and how to improve it
- Chatbot Security - on encryption, privacy, and real-time fact-checking during conversations
- Security Page - an overview of the Skills IL security stack