How We Validate Skill Content: Fact-Checks, Meta-Skills, and Human Review
A detailed look at the layers of content validation we run on every skill in the catalog: weekly automated fact-checks, multi-source cross-checking, meta-skills that test other skills, human review, and our roadmap for expert verification
The Real Problem With Skills
When people talk about skill quality, most think about evals: does the skill "work"? Does the agent complete the task it was given? That matters, but it's only half the picture.
The bigger challenge is the content itself. A skill that teaches an agent how to file a VAT report relies on tax rates, thresholds, dates, links to official forms, and Israeli Tax Authority procedures. All of those change. VAT rates increase, national insurance ceilings get updated, forms get replaced, procedures change. A skill that was perfect a year ago can today return a confidently wrong answer, which is a much bigger problem than a skill that simply doesn't work.
That's exactly why we built a multi-layered content validation system for every skill in the catalog. This guide explains how it works.
Layer 1: Automated Fact-Check Pipeline
The core of the system is a weekly pipeline that runs every Sunday on GitHub Actions and processes every skill in the catalog. It performs three main checks:
Link Validation
Every URL mentioned inside a skill is hit with an HTTP request. If a link returns 404, 500, or redirects elsewhere, it gets flagged. This catches common cases: government pages that moved to a new structure, replaced forms, removed articles, deleted repos.
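A minimal sketch of this check, using only the Python standard library. The function names and the exact flagging rules are illustrative, not the production pipeline's code:

```python
# Hypothetical sketch of the link-validation step: extract URLs from a skill's
# text and flag any that error out or redirect to a different page.
import re
import urllib.request
from urllib.error import HTTPError, URLError

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def extract_urls(text: str) -> list[str]:
    """Pull every http(s) URL out of a skill's body."""
    return URL_RE.findall(text)

def should_flag(status: int, final_url: str, original_url: str) -> bool:
    """Flag hard errors (4xx/5xx) and redirects that land somewhere else."""
    return status >= 400 or final_url.rstrip("/") != original_url.rstrip("/")

def check_url(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL should be flagged for review."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return should_flag(resp.status, resp.url, url)
    except HTTPError as e:
        return should_flag(e.code, url, url)
    except URLError:
        return True  # unreachable host counts as broken
```

HEAD requests keep the weekly run cheap: no response body is downloaded, only the status line and the final URL after redirects.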
Version Validation
Skills that depend on a specific library or tool (e.g., an SDK for a government service) are checked against npm and PyPI. If the version is outdated or the API was deprecated, that bubbles up for handling.
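A sketch of what such a check could look like. PyPI's JSON API and the npm registry's `latest` dist-tag endpoint are real public endpoints; the function names and the simple numeric comparison are assumptions:

```python
# Illustrative version check against the two public registries.
import json
import urllib.request

def latest_pypi_version(package: str) -> str:
    """Fetch the latest released version from PyPI's JSON API."""
    with urllib.request.urlopen(f"https://pypi.org/pypi/{package}/json") as r:
        return json.load(r)["info"]["version"]

def latest_npm_version(package: str) -> str:
    """Fetch the latest dist-tag from the npm registry."""
    with urllib.request.urlopen(f"https://registry.npmjs.org/{package}/latest") as r:
        return json.load(r)["version"]

def parse(v: str) -> tuple[int, ...]:
    """Turn '1.2.3' into (1, 2, 3); pre-release suffixes are ignored here."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def is_outdated(pinned: str, latest: str) -> bool:
    """True when the skill pins a version older than the published latest."""
    return parse(pinned) < parse(latest)
```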
Content Analysis
After the deterministic checks, we run a Claude (Sonnet) model that reads the skill and looks for factually wrong or outdated information. The prompt is tuned to four specific things: regulatory or legal information that's changed, APIs or libraries that have been replaced or removed, version-specific claims that are no longer true, and references to discontinued services. It does not flag style issues, gaps, or wording preferences, only factual errors.
If a VAT rate changed, a government form was replaced, or a library went through a major version bump that breaks the API, the model catches it. It leans on its training knowledge to do this rather than on an explicit cross-reference against a live source database at runtime.
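The prompt and the finding schema below are a guess at the shape described above, not the production prompt; only `parse_findings` is deterministic enough to run without a model behind it:

```python
# Hypothetical schema for the content-analysis step. The four categories mirror
# the four things the prompt is tuned to; names are illustrative.
import json

ALLOWED_CATEGORIES = {
    "regulatory_change", "deprecated_api",
    "stale_version_claim", "discontinued_service",
}
ALLOWED_SEVERITIES = {"low", "medium", "high"}

FACT_CHECK_PROMPT = """Read the skill below and list ONLY factual errors:
regulatory/legal information that changed, APIs or libraries replaced or
removed, version-specific claims no longer true, and references to
discontinued services. Do NOT flag style, gaps, or wording. Reply as a JSON
list of {"category": ..., "severity": "low|medium|high", "claim": ..., "why": ...}."""

def parse_findings(model_reply: str) -> list[dict]:
    """Keep only well-formed findings; drop anything outside the allowed schema."""
    findings = json.loads(model_reply)
    return [
        f for f in findings
        if f.get("category") in ALLOWED_CATEGORIES
        and f.get("severity") in ALLOWED_SEVERITIES
    ]
```

Filtering the reply against a fixed schema is one way to keep a conservative prompt honest: anything the model invents outside the four categories is dropped before it reaches the rewrite step.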
What Happens When the Pipeline Finds a Gap
This is where it gets interesting: the pipeline doesn't just detect problems, it fixes them. The flow is:
- The fact-check model returns a list of findings at varying severity levels (low / medium / high)
- A rewrite model (Claude) generates an updated version of the skill, keeping SKILL_HE.md in sync
- The system bumps the version automatically via semver (major/minor/patch is derived from the finding types)
- Commit and tag are pushed to the category's GitHub repo
- The sync pipeline picks up the change and updates the database
- We receive an email summary of everything that changed in that cycle
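The version-bump step in the flow above could be sketched like this. The actual derivation rule isn't spelled out here, so the high→major / medium→minor / low→patch policy is an assumption:

```python
# Illustrative semver bump derived from fact-check finding severities.
def bump_for(findings: list[dict]) -> str:
    """Map the worst finding severity to a semver part (assumed policy)."""
    sevs = {f["severity"] for f in findings}
    if "high" in sevs:
        return "major"
    if "medium" in sevs:
        return "minor"
    return "patch"

def bump(version: str, part: str) -> str:
    """Apply a semver bump to a 'major.minor.patch' string."""
    major, minor, patch = (int(p) for p in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```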
Layer 2: Meta-Skills (Skills That Test Skills)
This is the part that best embodies dogfooding: we use skills themselves to test other skills. The same technology we ship to customers is what we use for our own quality control.
skill-creator
An internal skill that can not only write new skills but also analyze an existing skill and return a report on content issues: gaps, unclear phrasing, weak triggers, missing examples. We run it on every new skill before publishing, and on existing skills when something in the monitoring system raises a flag.
update-skill and update-mcp
Two internal skills that perform systematic gap analysis on an existing catalog item: what's missing compared to the standard of top-quality skills, what's changed since the last version, which sections need refreshing, and which sources need re-verifying. These are our main tools for refreshing older skills.
Layer 3: Human Review
Automation is great, but when it comes to Israeli regulation or legal content, a human has to be in the loop. The human layer includes:
Diff Review on Manual Updates
Every new skill and every significant manual update goes through an admin workflow that includes a diff review before publishing. That's how we catch things that don't quite make sense, awkward phrasing, or editing mistakes. To be clear: the weekly automated fact-check pipeline runs end-to-end without a human in the loop. The human review of its diffs happens post-hoc, once the commit lands in the category repo.
Conversation Flow Testing via Try Skill
Before publishing a new skill, we typically run an actual conversation against it through the "Try Skill" component. This isn't an automated test, it's a person talking to the skill, trying unexpected scenarios, and attempting to break it. Mistakes that automation misses often surface here.
Community Reports
Every skill in the catalog has a "report an issue" button that creates a record in the system. The system classifies reports into five categories: incorrect content, broken link, broken install instructions, security concern, and other. It's a 24/7 validation layer that doesn't require our team to initiate anything.
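The five report categories map naturally onto a small record type; the names below are illustrative, not the production schema:

```python
# Hypothetical shape of a community report record.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class ReportCategory(Enum):
    INCORRECT_CONTENT = "incorrect_content"
    BROKEN_LINK = "broken_link"
    BROKEN_INSTALL = "broken_install_instructions"
    SECURITY_CONCERN = "security_concern"
    OTHER = "other"

@dataclass
class SkillReport:
    skill_id: str
    category: ReportCategory
    details: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

A closed enum keeps the triage queue clean: every incoming report lands in exactly one of the five buckets, so security concerns can be routed separately from broken links.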
Layer 4: Security Scanner (Tank)
Tank is a dedicated security scanner that runs on every MCP and skill in the catalog. It checks six aspects:
- Package integrity and sandboxed runtime
- Static code analysis (Bandit + Semgrep)
- Prompt injection detection
- Exposed secrets scanning
- Dependency audit against the OSV database
- Structure and metadata validation
This relates less to content accuracy and more to safety, but it matters for overall trust. A skill can be factually correct and still dangerous to use if, for example, it runs external scripts without verification.
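Of Tank's six checks, the dependency audit is the easiest to illustrate: OSV exposes a public `/v1/query` endpoint that takes a package name, ecosystem, and version. The function names here are illustrative, not Tank's actual code:

```python
# Sketch of a dependency audit against the OSV database.
import json
import urllib.request

def osv_query_payload(name: str, version: str, ecosystem: str = "PyPI") -> dict:
    """Build the request body for OSV's /v1/query endpoint."""
    return {"version": version, "package": {"name": name, "ecosystem": ecosystem}}

def audit_dependency(name: str, version: str, ecosystem: str = "PyPI") -> list[dict]:
    """Return the list of known vulnerabilities for one pinned dependency."""
    body = json.dumps(osv_query_payload(name, version, ecosystem)).encode()
    req = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("vulns", [])
```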
Layer 5: Trust Score
Every skill gets a calculated trust score, a number between 0 and 100 based on six metrics: code quality, permissions, data handling, publisher reputation, maintenance, and documentation. The score is displayed to users before installation so they know what they're working with.
We have a separate detailed guide on how the trust score is calculated and how to improve it. See the Trust Score Guide.
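As a rough illustration only: the six metric names come from the text, but the equal weights below are an assumption, not the real formula (see the Trust Score Guide for that):

```python
# Illustrative trust-score combination; weights are assumed, not the real ones.
WEIGHTS = {
    "code_quality": 1, "permissions": 1, "data_handling": 1,
    "publisher_reputation": 1, "maintenance": 1, "documentation": 1,
}

def trust_score(metrics: dict[str, float]) -> int:
    """Combine per-metric scores (each 0-100) into one 0-100 number."""
    total = sum(WEIGHTS.values())
    raw = sum(metrics[name] * w for name, w in WEIGHTS.items()) / total
    return round(max(0.0, min(100.0, raw)))
```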
What's Next: Expert Verification
This is the direction we're most excited about. Automation catches a lot, the community catches some, but the highest standard will only come from humans who are real experts in the domain.
Expert Eye on Sensitive Domains
The first step is to add an expert reviewer for skills in legal and tax domains. The idea: before such a skill ships, someone with a real background in the domain (or an external advisor) goes over the content and approves it. It won't be a full review, but it will catch a significant share of mistakes in domains where the cost of an error is especially high.
Skill Certification
Beyond that, we're planning a deep professional review process. A certified accountant will review a tax skill. A lawyer will review a legal skill. Each expert will sign the content they approved, and the trust tier will be updated to "expert verified".
Users will see a clear difference between a skill that passed automated fact-check and a skill that passed professional certification. Both are correct, but the level of confidence is different.
Expanded Community Review
We're planning to let professional users report mistakes directly from the skill page, with a fix workflow that includes credit to the reporter. This will attract a community of professionals who care about the quality of content in their domain.
Live Source Integration
Instead of a skill "knowing" the VAT rate, it reads it in real time from the Israeli Tax Authority API. This eliminates manual updates entirely: the skill is always current because it doesn't store the fact at all; it fetches it on every request.
The challenge here is API availability from regulatory bodies. Some ministries offer open APIs, some require registration, some don't provide any. We're pushing for more official data to be opened.
Frequently Asked Questions
How long does it take from a regulatory change to a skill update?
Through the automated path, up to a week (one weekly pipeline cycle). The pipeline skips skills that were checked less than seven days ago, so there are no redundant runs. For urgent cases (e.g., a VAT rate change that takes effect immediately), our team can run the script manually with --skill or --category filters (this requires push access to the category repos and API keys, so end users can't trigger it). If you spot a regulatory change, report it from the skill page via the "Report an issue" button and we'll push a manual sync if needed.
What if the pipeline "fixes" something that wasn't broken?
The weekly automated pipeline does not pause for manual approval; it pushes fixes directly to the category repo. Human oversight happens post-hoc: we see the diffs in the summary email and in the commits, and if something looks off we can roll it back. That's why the content model is deliberately conservative, flagging only factual errors rather than stylistic choices.
Do you really check every skill every week?
Yes, all 130+ skills in each cycle. URL validation is fast HTTP HEAD requests, version checks are npm/PyPI calls, and the content analysis goes through Claude and costs more, but one run per week is affordable.
How do you know your fact-check model itself isn't wrong?
Good question; there's no perfect answer. To minimize the risk, the system prompt is deliberately conservative: flag only what the model can demonstrate is wrong, and never flag style or subjective calls. At low severity there are sometimes false positives. We periodically review the pipeline's output and tune the prompt based on error patterns.
What about skills that don't depend on regulation? Do they go through all these layers too?
Yes, though some checks are less relevant. A code-editing skill won't see many regulatory findings, but it still goes through link validation, library version checks, security scanning (Tank), and human review at publishing time.
Further Reading
- Trust Score Guide - how the trust score is calculated and how to improve it
- Chatbot Security - on encryption, privacy, and real-time fact-checking during conversations
- Security Page - an overview of the Skills IL security stack