eval-harness

Formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

Best for AI development teams · Works with GitHub · Low risk

#evaluation #testing #claude #edd #metrics

⌘source

author: @affaan-m
repo: affaan-m/everything-claude-code
language: JavaScript

✦overview.md

Key Features

·Formal evaluation framework for Claude Code sessions
·Implements eval-driven development (EDD) principles
·Defines pass/fail criteria for task completion
·Measures agent reliability with pass@k metrics
·Creates regression test suites for prompt or agent changes
·Benchmarks agent performance across model versions

Use Cases

→Setting up eval-driven development for AI-assisted workflows
→Creating regression test suites after prompt or agent modifications
→Benchmarking Claude's performance across different model versions

Best for

✓AI development teams
✓Prompt engineering workflows

skills/eval-harness/SKILL.md

name: eval-harness
description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob

Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

When to Activate

Setting up eval-driven development (EDD) for AI-assisted workflows
Defining pass/fail criteria for Claude Code task completion
Measuring agent reliability with pass@k metrics
Creating regression test suites for prompt or agent changes
Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

Define expected behavior BEFORE implementation
Run evals continuously during development
Track regressions with each change
Use pass@k metrics for reliability measurement
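
The text names pass@k but does not say how to compute it. As a minimal sketch, assuming the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) for n recorded attempts of which c passed, the metric could be calculated as follows. The function name `passAtK` and its parameters are illustrative and are not part of the skill.

```javascript
// Estimate pass@k: the probability that at least one of k attempts passes,
// given n recorded attempts of which c passed.
// Assumption: this uses the standard unbiased estimator; the skill itself
// does not specify a formula.
function passAtK(n, c, k) {
  if (![n, c, k].every(Number.isInteger)) {
    throw new TypeError("n, c and k must be integers");
  }
  if (n < 1 || c < 0 || c > n || k < 1 || k > n) {
    throw new RangeError("require n >= 1, 0 <= c <= n, 1 <= k <= n");
  }
  // Fewer than k failures: every sample of k attempts contains a pass.
  if (n - c < k) return 1;

  // C(n-c, k) / C(n, k) written as a product to avoid large factorials.
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// Example: 10 attempts, 3 passed.
console.log(passAtK(10, 3, 1).toFixed(3)); // "0.300"
console.log(passAtK(10, 3, 5).toFixed(3)); // "0.917"
```

With k = 1 the estimate reduces to the plain pass rate c / n; larger k answers the question "how likely is success if the agent gets k tries?".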

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
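
The template above can also be kept as data and rendered on demand. The following is a sketch only: the object shape (`name`, `task`, `criteria`, `expectedOutput`), both function names, and the rule that an eval passes only when every criterion is met are assumptions made for illustration, not something the skill defines.

```javascript
// Render a capability eval definition in the bracketed template format.
// Assumption: the field names on evalDef are hypothetical.
function renderCapabilityEval(evalDef) {
  const lines = [
    `[CAPABILITY EVAL: ${evalDef.name}]`,
    `Task: ${evalDef.task}`,
    "Success Criteria:",
    // Met criteria are shown as checked boxes, unmet ones as empty boxes.
    ...evalDef.criteria.map((c) => `  - [${c.met ? "x" : " "}] ${c.text}`),
    `Expected Output: ${evalDef.expectedOutput}`,
  ];
  return lines.join("\n");
}

// Assumed pass rule: at least one criterion exists and all are met.
function capabilityEvalPasses(evalDef) {
  return evalDef.criteria.length > 0 && evalDef.criteria.every((c) => c.met);
}

// Hypothetical example data.
const exampleEval = {
  name: "feature-name",
  task: "Description of what Claude should accomplish",
  criteria: [
    { text: "Criterion 1", met: true },
    { text: "Criterion 2", met: true },
    { text: "Criterion 3", met: false },
  ],
  expectedOutput: "Description of expected result",
};

console.log(renderCapabilityEval(exampleEval));
console.log(capabilityEvalPasses(exampleEval)); // false
```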

Regression Evals

Ensure changes don't break existing functionality:

[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:

...
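
The regression template is truncated in this listing, so the sketch below does not attempt to reconstruct it. It only illustrates the general idea of a regression check: compare test outcomes recorded at a baseline against the current run and report tests that passed before but do not pass now. The function name `findRegressions`, the result format (test name mapped to a boolean), and the test names are all hypothetical.

```javascript
// Return the names of tests that passed at the baseline but do not pass now.
// Assumption: results are plain objects mapping test name -> boolean.
function findRegressions(baseline, current) {
  const regressions = [];
  for (const [name, passedBefore] of Object.entries(baseline)) {
    // A test missing from the current run is treated as not passing.
    if (passedBefore && current[name] !== true) {
      regressions.push(name);
    }
  }
  return regressions;
}

// Hypothetical results for illustration.
const baselineResults = { login: true, logout: true, search: false, export: true };
const currentResults = { login: true, logout: false, search: false };

console.log(findRegressions(baselineResults, currentResults)); // [ 'logout', 'export' ]
```

Tests that were already failing at the baseline (`search` here) are not reported, since they are not regressions introduced by the change.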

$install

npx skills add affaan-m/everything-claude-code --skill eval-harness

Safety assessment

Clarity score

How clear and easy to understand the SKILL.md instructions are, rated from 1 to 5.

4/5 (very good)

Clear and well structured, with only minor parts that might need a second read.

Actionability score

How directly an agent can act on the SKILL.md instructions, rated from 1 to 5.

3/5 (medium)

Partially actionable with several concrete steps, but still missing important details.
