skill-eval

Evaluate skills: trigger testing, A/B benchmarks, structure validation, head-to-head bake-offs.

Best for Skill developers

Works with GitHub

#skill evaluation #testing #benchmarking #bake-off #self-improvement

⌘source

author: @notque
repo: notque/vexjoy-agent
language: Python

✦overview.md

Key Features

·Trigger testing and A/B benchmarks
·Head-to-head bake-off comparisons
·Structure validation and quality grading
·Self-improvement loop integration
·Reference loading for guidance files

Use Cases

→Run A/B tests on two skill implementations
→Validate skill structure and behavior
→Compare your own skill against a peer version for quality
→Trigger automated skill improvement workflows

Best for

✓Skill developers
✓Quality assurance
✓Benchmarking teams

Not ideal for

!One-off script evaluation
!Non-code artifacts

FAQs

skills/skill-eval/SKILL.md

name

skill-eval

description

Evaluate skills: trigger testing, A/B benchmarks, structure validation, head-to-head bake-offs.

user-invocable:false

argument-hint:<skill-name>

allowed-tools:["Read","Write","Bash","Grep","Glob","Agent"]

routing:{"triggers":["improve skill","test skill","eval skill","benchmark skill","skill triggers","skill quality","self-improve skill","skill self-improvement","improve skill with variants","bake-off","bake off","head-to-head","head to head","compare implementations","grade two versions","which skill is better"],"pairs_with":["agent-evaluation","verification-before-completion"],"complexity":"Medium-Complex","category":"meta"}
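One way to picture trigger testing against the `triggers` list above is a small harness that checks whether sample prompts would fire the skill. This is a minimal sketch, not the router's actual logic: it assumes case-insensitive substring matching, and `matched_triggers`, `run_trigger_tests`, and the sample prompts are hypothetical names and data introduced only for illustration.

```python
# A subset of the trigger phrases from the routing block above.
TRIGGERS = [
    "improve skill",
    "test skill",
    "eval skill",
    "benchmark skill",
    "bake-off",
    "bake off",
    "head-to-head",
    "which skill is better",
]


def matched_triggers(prompt, triggers):
    """Return the triggers found in the prompt.

    Assumption: matching is a case-insensitive substring test. The real
    router may use a different strategy.
    """
    lowered = prompt.lower()
    return [t for t in triggers if t.lower() in lowered]


def run_trigger_tests(cases, triggers):
    """Check (prompt, should_fire) pairs and return the ones that fail.

    Each failure is a tuple of (prompt, expected, actual).
    """
    failures = []
    for prompt, should_fire in cases:
        fired = bool(matched_triggers(prompt, triggers))
        if fired != should_fire:
            failures.append((prompt, should_fire, fired))
    return failures


if __name__ == "__main__":
    # Hypothetical prompts: two that should route here, one that should not.
    cases = [
        ("Please run a bake-off between these two skills", True),
        ("Can you test skill foo?", True),
        ("Write a haiku about autumn", False),
    ]
    print(run_trigger_tests(cases, TRIGGERS))
```

Including negative cases matters as much as positive ones: a trigger list that fires on unrelated prompts is as much a defect as one that misses the intended phrasing.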

Skill Evaluation & Improvement

Measure and improve skill quality through empirical testing — because structure doesn't guarantee behavior, and measurement beats assumption. Also covers head-to-head bake-offs of two peer implementations of the same artifact (Mode F).

Reference Loading Table

Signal	Load These Files	Why
tasks related to this reference	`schemas.md`	Loads detailed guidance from `schemas.md`.
tasks related to this reference	`self-improve-loop.md`	Loads detailed guidance from `self-improve-loop.md`.
"bake-off", "head-to-head", "compare implementations", "grade two versions", "which Feynman skill is better"	`bake-off-methodology.md`	Loads the bake-off rubric, anti-rationalization gate, fold-filter, and worked Feynman example.

Instructions

Phase 1: ASSESS — Determine what to evaluate

Step 1: Identify the skill

# Validate skill structure first
python3 -m scripts.skill_eval.quick_validate <path/to/skill>

This checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
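The checks listed above can be sketched in a few lines of Python. This is not the source of `scripts.skill_eval.quick_validate`; it is an illustrative approximation under stated assumptions: frontmatter is a `---`-fenced block of flat `key: value` lines, and the angle-bracket rule applies to the description field. `parse_frontmatter` and `validate_skill` are hypothetical names.

```python
import re
from pathlib import Path

KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MAX_DESCRIPTION_CHARS = 1024


def parse_frontmatter(text):
    """Parse flat `key: value` lines between '---' fences.

    Assumed layout; returns None when the fences are missing or unclosed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("\"'")
    return None  # closing fence never found


def validate_skill(skill_dir):
    """Return a list of problems; an empty list means every check passed."""
    skill_md = Path(skill_dir) / "SKILL.md"
    if not skill_md.is_file():
        return ["SKILL.md not found"]
    fields = parse_frontmatter(skill_md.read_text(encoding="utf-8"))
    if fields is None:
        return ["missing or malformed frontmatter"]
    problems = []
    for required in ("name", "description"):
        if not fields.get(required):
            problems.append(f"missing required field: {required}")
    name = fields.get("name", "")
    if name and not KEBAB_CASE.match(name):
        problems.append(f"name is not kebab-case: {name}")
    description = fields.get("description", "")
    if len(description) >= MAX_DESCRIPTION_CHARS:
        problems.append("description is 1024 characters or longer")
    # Assumption: the angle-bracket check targets the description only.
    if "<" in description or ">" in description:
        problems.append("description contains angle brackets")
    return problems
```

Returning a list of problems rather than raising on the first failure lets a single run report every structural issue at once.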

...

$install

npx skills add notque/vexjoy-agent --skill skill-eval

Safety assessment

★

Clarity score

How clear and easy to understand the SKILL.md instructions are, rated from 1 to 5.

3/5

good

Mostly clear, but there are still a few confusing or poorly structured parts.

◎

Actionability score

How directly an agent can act on the SKILL.md instructions, rated from 1 to 5.

4/5

high

Mostly actionable with clear steps; only a few small gaps remain.

~community cookbook

May 7, 2026


