How long from a regulatory change to a skill update?

Up to a week, because the pipeline runs on a weekly cycle. For urgent cases, our team can run the script manually with the --skill or --category filters. This requires push access to the category repos and API keys, so end users cannot trigger it. Users who spot a regulatory change can report it from the skill page via the "Report an issue" button.

What if the pipeline fixes something that was not broken?

The weekly automated pipeline does not pause for manual approval; it pushes fixes directly to the category repo. Human oversight happens after the fact, by reviewing the diffs in the summary email and the commits. That is why the content model is deliberately conservative and flags only factual errors, not stylistic choices.

Do you really check every skill every week?

Yes, across all 130+ skills per cycle. URL validation uses fast HTTP HEAD requests, and version checks hit npm and PyPI. Content analysis goes through Claude and costs more, but one run per week is affordable.

How We Validate Skill Content: Fact-Checks, Meta-Skills, and Human Review

The Real Problem With Skills

When people talk about skill quality, most think about evals: does the skill "work"? Does the agent complete the task it was given? That matters, but it's only half the picture.

The bigger challenge is the content itself. A skill that teaches an agent how to file a VAT report relies on tax rates, thresholds, dates, links to official forms, and Israeli Tax Authority procedures. All of those change. VAT rates increase, national insurance ceilings get updated, forms get replaced, procedures change. A skill that was perfect a year ago can today return a confidently wrong answer, which is a much bigger problem than a skill that simply doesn't work.

That's exactly why we built a multi-layered content validation system for every skill in the catalog. This guide explains how it works.

Layer 1: Per-Skill Fact Checks at Update Time

Whenever we touch a skill, whether creating a new one or updating an existing one, the content goes through three checks before any change ships. The content analysis check relies on an LLM with live web access at runtime, so the model can verify time-sensitive facts (tax rates, ceilings, dates) against the live source instead of relying only on its training data. Every other layer in this guide builds on this foundation.

Link Validation

Every URL mentioned in the skill is hit with an HTTP request. If a link returns 404, 500, or redirects elsewhere, the update is blocked until it's fixed. This catches government pages that moved to a new structure, replaced forms, removed articles, and deleted repos.
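
As a rough illustration of the blocking rule above, here is a minimal Python sketch. It assumes the HTTP request has already been made and that we know the status code and the final URL after redirects. The names LinkResult, classify_link, and blocking_links are hypothetical, and treating "redirects elsewhere" as "the final host or path differs" is our assumption, not a description of the production check.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class LinkResult:
    url: str
    ok: bool
    reason: str


def classify_link(url: str, status: int, final_url: str) -> LinkResult:
    """Decide whether one already-fetched link should block the update."""
    if status >= 400:
        return LinkResult(url, False, f"HTTP {status}")
    # Assumption: a redirect counts as "elsewhere" when the host or path changes.
    # A scheme upgrade or a trailing slash alone is tolerated.
    before, after = urlsplit(url), urlsplit(final_url)
    same_host = before.netloc.lower() == after.netloc.lower()
    same_path = before.path.rstrip("/") == after.path.rstrip("/")
    if not (same_host and same_path):
        return LinkResult(url, False, f"redirected to {final_url}")
    return LinkResult(url, True, "ok")


def blocking_links(results: list[LinkResult]) -> list[LinkResult]:
    """Return the links that must be fixed before the update can ship."""
    return [result for result in results if not result.ok]
```

Keeping the classification separate from the network call makes the rule easy to test without hitting real government sites.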

Library Version Validation

Skills that depend on a specific library or tool (e.g., an SDK for a government service) are checked against npm and PyPI. If the version is outdated or the API was deprecated, that bubbles up for handling.
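
The comparison step can be sketched as follows, assuming the latest version string has already been fetched from the registry. The function names and the three status labels are hypothetical, and the parser only handles plain numeric versions with an optional leading "v"; pre-release ordering is ignored.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Parse the leading numeric part of a version such as '2.10.1' or 'v3.0.0-beta'."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    parts = [int(part) for part in match.group(1).split(".")]
    # Pad to three components so '1.0' and '1.0.0' compare as equal.
    return tuple(parts + [0] * (3 - len(parts)))


def version_status(pinned: str, latest: str) -> str:
    """Return 'current', 'outdated', or 'major-behind' for a pinned dependency."""
    have, newest = parse_version(pinned), parse_version(latest)
    if have >= newest:
        return "current"
    if have[0] < newest[0]:
        # A major version gap is the case most likely to break the API.
        return "major-behind"
    return "outdated"
```

Comparing integer tuples instead of raw strings avoids the classic mistake of ranking "2.9.0" above "2.10.1".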

LLM Content Analysis

The model reads the entire skill and looks for factually wrong or outdated information. The prompt is tuned to four specific things: regulatory or legal information that's changed, APIs or libraries that have been replaced or removed, version-specific claims that are no longer true, and references to discontinued services. It does not flag style issues, gaps, or wording preferences — only factual errors.

If a VAT rate changed, a government form was replaced, or a library went through a major version bump that breaks the API, the model catches it and proposes a fix, and we approve the diff before it ships.
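
One way to enforce the "factual errors only" scope is to filter the model's output after the fact. The sketch below assumes the model returns findings tagged with a category; the Category values, the Finding fields, and in_scope are hypothetical names for illustration, not the actual pipeline schema.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The four things the prompt is tuned to, per the description above.
    REGULATORY_CHANGE = "regulatory_change"
    API_REPLACED_OR_REMOVED = "api_replaced_or_removed"
    VERSION_CLAIM = "version_claim"
    DISCONTINUED_SERVICE = "discontinued_service"


@dataclass(frozen=True)
class Finding:
    category: str      # free-form label as returned by the model
    claim: str         # the sentence in the skill that is flagged
    proposed_fix: str  # the replacement text the model suggests


def in_scope(findings: list[Finding]) -> list[Finding]:
    """Keep only findings in the four factual categories; drop style, gaps, wording."""
    allowed = {category.value for category in Category}
    return [finding for finding in findings if finding.category in allowed]
```

A filter like this is a backstop: the prompt already asks for factual errors only, and the filter drops anything that slips through with an out-of-scope label.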

Layer 2: LLM as a Judge (Independent Judge Review)

Layer 1 uses an LLM to read the skill and look for mistakes, a pattern commonly called "LLM as a judge." It works well for periodic review of existing content, but it has a known weakness: models tend to agree with themselves. If the same model that wrote "VAT is 18%" is asked to verify the number, it will usually approve the claim even when it's wrong. This is classic confirmation bias, and published evals show the accuracy gap between self-check and fresh-context check can reach tens of percentage points.

When we create a new skill or push a significant update, we run a stronger variant of the same pattern: an independent subagent in a fresh context that never saw the process that produced the content. It receives only the final files plus an explicit list of sources we collected along the way, and its only job is to verify every claim in the skill against the cited source.

How It Works in Practice

Every numeric or regulatory claim in the skill is logged in an evidence file with an official URL: VAT rates, form names, ceilings, statute quotations, regulation numbers. If something doesn't make it into the evidence file, it doesn't pass the judge.
The judge receives the content and the evidence file, and for every item it pulls the live source in real time and checks whether the claim actually appears there as written in the skill.
The output is a structured verdict: PASS or FAIL, a list of claims that failed verification, and a list of claims that have no evidence entry at all.
Only a PASS verdict with both lists empty advances to publishing. Anything else loops back for fixing.
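
The verdict logic in the steps above can be sketched as follows. This is a simplified illustration: the evidence file is modeled as a dict from claim text to source URL, and source_confirms stands in for the judge's live lookup. Verdict, judge, and source_confirms are hypothetical names, not the real implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    failed_claims: list[str] = field(default_factory=list)
    unevidenced_claims: list[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        # PASS only when both lists are empty; anything else loops back for fixing.
        return "PASS" if not (self.failed_claims or self.unevidenced_claims) else "FAIL"


def judge(
    claims: list[str],
    evidence: dict[str, str],
    source_confirms: Callable[[str, str], bool],
) -> Verdict:
    """Check every claim against its cited source and collect the failures."""
    verdict = Verdict()
    for claim in claims:
        url = evidence.get(claim)
        if url is None:
            verdict.unevidenced_claims.append(claim)  # no evidence entry at all
        elif not source_confirms(claim, url):
            verdict.failed_claims.append(claim)  # source does not say this
    return verdict
```

Deriving the status from the two lists, rather than storing it separately, means a PASS with a non-empty list cannot occur.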

Why Not Self-Check

The agent that wrote client.get('/v1/accounts') will happily confirm "GET /v1/accounts looks right" when asked to verify itself. A fresh subagent that only sees the final file, with no knowledge of who wrote what, treats every claim as foreign input and actually does the work: pulls the official docs and compares. That's why we dispatch a separate subagent with a clean message thread rather than another pass of the same model.

Expert Review Simulation

A second, complementary judge attacks from a totally different angle. Instead of asking "is everything that's written correct," it asks "is anything essential missing." We load a domain-expert persona (a CPA for tax skills, a lawyer for legal skills, a licensed nurse for health skills) and prompt the LLM: "as the expert, what would you expect to see here that you don't?"

The output is a list of gaps classified by severity (critical / major / minor). A critical finding blocks publishing until it's either fixed or explicitly justified as out of scope.
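
The blocking rule for expert-review gaps is small enough to sketch directly. The Gap fields and the blocking_gaps helper are hypothetical names; only the severity levels and the "fixed or justified as out of scope" rule come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    description: str
    severity: str  # "critical", "major", or "minor"
    fixed: bool = False
    justified_out_of_scope: bool = False


def blocking_gaps(gaps: list[Gap]) -> list[Gap]:
    """Return critical gaps that are neither fixed nor justified as out of scope."""
    return [
        gap
        for gap in gaps
        if gap.severity == "critical" and not (gap.fixed or gap.justified_out_of_scope)
    ]
```

Publishing would proceed only when blocking_gaps returns an empty list; major and minor gaps are reported but do not block on their own.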

Eval Loop on Skill Descriptions

Beyond the content itself, we run an LLM judge on every skill's description field too. It receives 8-12 queries (positive, negative, and edge cases) and returns, for each, whether the skill should have triggered or not. If the match rate drops below 90%, the description gets refined until it passes. To prevent overfitting we split the queries 70/30 between training and a holdout set, and the 90% threshold is measured on the holdout. This minimizes the two failure modes that hurt users most: a relevant question where the skill silently fails to load, or an unrelated question where it fires for no reason.
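
A minimal sketch of the split and the holdout measurement, assuming the judge's answers are already collected in a dict keyed by query text. Query, split_queries, match_rate, and passes_threshold are hypothetical names, and the fixed random seed is our choice for reproducibility.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    text: str
    should_trigger: bool  # expected behavior for this query


def split_queries(
    queries: list[Query], holdout_fraction: float = 0.3, seed: int = 0
) -> tuple[list[Query], list[Query]]:
    """Shuffle deterministically and split into (training, holdout)."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    holdout_size = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[holdout_size:], shuffled[:holdout_size]


def match_rate(queries: list[Query], judged: dict[str, bool]) -> float:
    """Fraction of queries where the judge's trigger decision matches the expectation."""
    hits = sum(1 for query in queries if judged[query.text] == query.should_trigger)
    return hits / len(queries)


def passes_threshold(
    holdout: list[Query], judged: dict[str, bool], threshold: float = 0.9
) -> bool:
    """The threshold is measured on the holdout set only."""
    return match_rate(holdout, judged) >= threshold
```

With 8 to 12 queries the holdout holds only two to four of them, so under this sketch a 90% threshold means the description has to get every holdout query right.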

This is all the same broader family of patterns under the "LLM as a judge" umbrella, with two mechanisms that make them more reliable: clean context that prevents confirmation bias, and structured output criteria (JSON with explicit fields) that force the judge to return a verdict you can verify, not a vague opinion. It's not a substitute for a human, but it catches enough that the human in Layer 4 spends their time on real judgment calls rather than mechanical fact-checking.

Layer 3: Meta-Skills (Skills That Test Skills)

This is the part that best embodies dogfooding: we use skills themselves to test other skills. The same technology we ship to customers is what we use for quality control on ourselves.

Automated content audit

One internal skill not only writes new skills but also analyzes an existing skill and returns a report on content issues: gaps, unclear phrasing, weak triggers, missing examples. We run it on every new skill before publishing, and on existing skills when something in the monitoring system raises a flag.

Periodic gap analysis on skills and MCPs

Other internal skills perform systematic gap analysis on an existing catalog item: what's missing compared to the standard of top-quality skills, what's changed since the last version, which sections need refreshing, and which sources need re-verifying. These are our main tools for refreshing older skills.

Layer 4: Human Review

Automation is great, but when it comes to Israeli regulation or legal content, a human has to be in the loop. The human layer includes:

Diff Review on Manual Updates

Every new skill and every significant manual update goes through a diff review before publishing. That's how we catch things that don't quite make sense, awkward phrasing, or editing mistakes. There's no human in the loop inside the layer-1 / layer-2 / layer-3 checks themselves — they run as autonomous steps — but a human approves the final diff before any change ships.

Conversation Flow Testing via Try Skill

Before publishing a new skill, we typically run an actual conversation against it through the "Try Skill" component. This isn't an automated test. It's a person talking to the skill, trying unexpected scenarios, and attempting to break it. Mistakes that automation misses often surface here.

Community Reports

Every skill in the catalog has a "report an issue" button that creates a record in the system. The system classifies reports into five categories: incorrect content, broken link, broken install instructions, security concern, and other. It's a 24/7 validation layer that doesn't require our team to initiate anything.

Layer 5: Security Scanner (Tank)

Tank is a dedicated security scanner that runs on every MCP and skill in the catalog. It checks six aspects:

Package integrity and sandboxed runtime
Static code analysis (Bandit + Semgrep)
Prompt injection detection
Exposed secrets scanning
Dependency audit against the OSV database
Structure and metadata validation

This relates less to content accuracy and more to safety, but it matters for overall trust. A skill can be factually correct and still dangerous to use if, for example, it runs external scripts without verification.

Tank runs alongside NVIDIA SkillSpector, a scanner purpose-built for agent skills. It covers 16 vulnerability categories, including prompt injection, data exfiltration, supply chain issues (with live CVE lookups), and YARA malware signatures, and returns a 0-100 risk score. Tank was originally designed for MCP servers, so it can over-flag plain declarative skills; SkillSpector scans the actual skill files and gives a skill-tuned second opinion. The SkillSpector result is shown under "Security Analysis" on every skill page and feeds the trust score (Layer 7).

Layer 6: GitHub Verification

Alongside Tank, every skill is checked against the open agentskills.io spec and GitHub's own security signals. This layer is not about content accuracy. It's about the supply chain behind the code: that the repo is locked down, that the release is signed, and that what the CLI installs is provably what was pushed to GitHub. The output is a Security Scorecard with 15 signals across three tiers:

Critical (5): spec compliance, secret scanning, code scanning, Sigstore-signed release, declared SPDX license
Recommended (8): tag protection, branch protection, signed commits, SECURITY.md, MFA, CODEOWNERS, Dependabot, semver match
Bonus (2): fresh release, tree SHA matches HEAD

When all five Critical signals pass, the skill earns the green "Verified ✓" badge and a bonus of up to 10 points on its trust score. Release-side signals (release tag, signed commit, Sigstore attestation) refresh on every push; repo-settings signals (code scanning, secret scanning, MFA, branch protection) refresh weekly via a dedicated workflow. The full breakdown lives in the GitHub Verification checklist.
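
The badge rule can be expressed as a small check over the scorecard. The signal identifiers below are hypothetical keys derived from the tier lists above, not the real field names, and a missing signal is treated as a failure by assumption.

```python
SCORECARD_TIERS: dict[str, list[str]] = {
    "critical": [
        "spec_compliance", "secret_scanning", "code_scanning",
        "sigstore_signed_release", "spdx_license",
    ],
    "recommended": [
        "tag_protection", "branch_protection", "signed_commits", "security_md",
        "mfa", "codeowners", "dependabot", "semver_match",
    ],
    "bonus": ["fresh_release", "tree_sha_matches_head"],
}


def is_verified(signals: dict[str, bool]) -> bool:
    """The Verified badge requires every Critical signal; a missing signal counts as failed."""
    return all(signals.get(name, False) for name in SCORECARD_TIERS["critical"])


def tier_summary(signals: dict[str, bool]) -> dict[str, str]:
    """Summarize passed signals per tier, e.g. {'critical': '5/5', ...}."""
    return {
        tier: f"{sum(signals.get(name, False) for name in names)}/{len(names)}"
        for tier, names in SCORECARD_TIERS.items()
    }
```

Note that Recommended and Bonus signals appear in the summary but do not affect the badge in this sketch.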

Layer 7: Trust Score

Every skill gets a calculated trust score, a number between 0 and 100 based on 6 metrics: code quality, permissions, data handling, publisher reputation, maintenance, and documentation, plus up to 10 bonus points from GitHub Verification. The score is displayed to users before installation so they know what they're working with.

We have a separate detailed guide on how the trust score is calculated and how to improve it. See the Trust Score Guide.

What's Next: Expert Verification

This is the direction we're most excited about. Automation catches a lot, the community catches some, but the highest standard will only come from humans who are real experts in the domain.

Expert Eye on Sensitive Domains

The first step is to add an expert reviewer for skills in legal and tax domains. The idea: before such a skill ships, someone with a real background in the domain (or an external advisor) goes over the content and approves it. It won't be a full review, but it will filter out a serious layer of mistakes in domains where the cost of a mistake is especially high.

Skill Certification

Beyond that, we're planning a deep professional review process. A certified accountant will review a tax skill. A lawyer will review a legal skill. Each expert will sign the content they approved, and the trust tier will be updated to "expert verified".

Users will see a clear difference between a skill that passed automated fact-check and a skill that passed professional certification. Both are correct, but the level of confidence is different.

Expanded Community Review

We're planning to let professional users report mistakes directly from the skill page, with a fix workflow that includes credit to the reporter. This will attract a community of professionals who care about the quality of content in their domain.

Live Source Integration

Instead of a skill "knowing" the VAT rate, it would read the rate in real time from the Israeli Tax Authority API. This would eliminate the need for manual updates entirely: the skill stays up to date because it doesn't store the fact at all and fetches it on every request.

The challenge here is API availability from regulatory bodies. Some ministries offer open APIs, some require registration, some don't provide any. We're pushing for more official data to be opened.
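
To make the idea concrete, here is a minimal sketch of a calculation that never stores the rate. The fetch_vat_rate callable is a hypothetical stand-in for a call to an official data source; no real Israeli Tax Authority endpoint is assumed, and the rates in the usage example are purely illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Callable


def vat_due(net_amount: Decimal, fetch_vat_rate: Callable[[], Decimal]) -> Decimal:
    """Compute VAT on a net amount using a rate fetched at call time."""
    rate = fetch_vat_rate()  # fetched on every call, never cached or hard-coded
    return (net_amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Usage with a fake source whose answer changes between calls (illustrative rates).
rates = iter([Decimal("0.17"), Decimal("0.18")])
first = vat_due(Decimal("1000"), lambda: next(rates))
second = vat_due(Decimal("1000"), lambda: next(rates))
```

Because the rate is injected as a callable, the same function works against a real API client in production and a fake source in tests.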

Frequently Asked Questions

How long does it take from a regulatory change to a skill update? It depends on how the signal reaches us. If a user reports a change via the "Report an issue" button on the skill page, or if we spot a change in our own monitoring of Israeli legislation, we open the skill for an update that day or the next. Community reports are the main detection channel, and we strongly encourage practitioners who use a skill in their daily work to use the report button without hesitation.

What if the judge "fixes" something that wasn't broken? The person running the update sees the proposed diff before anything ships — if something looks off, they just don't approve the fix. That's effectively human-in-the-loop at the diff level, not at the judge's autonomous-decision level. The judge itself only flags factual errors, not style or subjective calls, which keeps false positives low.

Has every skill in the catalog gone through all the layers? Not necessarily. The Independent Judge layer (based on an inspectable evidence file) was added to the system recently, so skills we created or updated after that point went through all the layers; older skills go through them for the first time on their next update (the process creates the evidence file and runs the judges if they're missing). We don't do a retroactive backfill on untouched skills, so the rollout is gradual.

What about external skills (skills from repos outside the skills-il GitHub org)? External skills can't run the evidence-file layer because we don't have push access to the source repo — the file would have nowhere to live. Externals still go through Tank security scanning, GitHub Verification, LLM-based prompt-injection review, and our trust-score calculation, but the evidence file and the Independent Judge against it apply only to skills we host. If an external skill is later imported into our org, the full pipeline kicks in.

How do you know your fact-check model itself isn't wrong? Good question, and there's no perfect answer. To minimize the risk, the system prompt is deliberately conservative: only flag what it can demonstrate is wrong, and never flag style or subjective calls. At low severity there are sometimes false positives. We periodically review the pipeline's output and tune the prompt based on error patterns.

What about skills that don't depend on regulation? Do they go through all these layers too? Yes, though some checks are less relevant. A code-editing skill won't see many regulatory findings, but it still goes through link validation, library version checks, security scanning (Tank), and human review at publishing time.