advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

Best for: Building automated evaluati… | Works with GitHub | Low risk

#llm #evaluation #judge #bias #rubric #scoring #comparison

⌘source

author: @sickn33
repo: sickn33/antigravity-awesome-skills
language: Python

✦overview.md

Use Cases

→Building automated evaluation pipelines for LLM outputs
→Comparing multiple model responses to select the best one
→Establishing consistent quality standards across evaluation teams
→Debugging evaluation systems that show inconsistent results
→Designing A/B tests for prompt or model changes
→Creating rubrics for human or automated evaluation

Not ideal for

!Use this skill only when the task clearly matches the scope described above.
!Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
!Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

skills/advanced-evaluation/SKILL.md

name: advanced-evaluation
description:
risk: safe
source: community
date_added: "2026-03-18T00:00:00.000Z"

Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

When to Use

Activate this skill when:

Building automated evaluation pipelines for LLM outputs
Comparing multiple model responses to select the best one
Establishing consistent quality standards across evaluation teams
Debugging evaluation systems that show inconsistent results
Designing A/B tests for prompt or model changes
Creating rubrics for human or automated evaluation
Analyzing correlation between automated and human judgments
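
For the last item in the list, agreement between automated and human judgments is often summarized with a rank correlation. The sketch below is illustrative and not part of the original skill: the function names (`average_ranks`, `pearson`, `spearman`) are hypothetical, and it assumes both score lists cover the same responses in the same order.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)


def spearman(judge_scores, human_scores):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if len(judge_scores) < 2:
        raise ValueError("need at least two paired scores")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))
```

A value near 1 means the judge orders responses much as the human raters do; a value near 0 means the orderings are unrelated. Rank correlation is used here because judge and human scales need not be calibrated to each other, only ordered consistently.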

Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

Direct Scoring: A single LLM rates one response on a defined scale.
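
As a rough illustration of direct scoring, the sketch below builds a prompt for one response, sends it to a judge, and parses an integer score. Everything in it is an assumption made for illustration and is not taken from the skill itself: the function names, the `SCORE: <n>` output convention, the default 1 to 5 scale, and the `judge` parameter, which stands in for whatever function calls your model and returns its text.

```python
import re
from typing import Callable


def build_direct_scoring_prompt(response: str, criterion: str,
                                scale_min: int = 1, scale_max: int = 5) -> str:
    """Assemble a prompt asking the judge to rate one response on one criterion."""
    return (
        f"Rate the response below for {criterion} on a scale from "
        f"{scale_min} to {scale_max}.\n"
        "Explain your reasoning briefly, then end with a line of the form "
        "'SCORE: <integer>'.\n\n"
        f"Response:\n{response}\n"
    )


def parse_score(judge_output: str, scale_min: int = 1, scale_max: int = 5) -> int:
    """Extract the integer after 'SCORE:' and check it lies on the scale."""
    match = re.search(r"SCORE:\s*(-?\d+)", judge_output)
    if match is None:
        raise ValueError("judge output contains no 'SCORE: <integer>' line")
    score = int(match.group(1))
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside {scale_min}-{scale_max}")
    return score


def direct_score(response: str, criterion: str, judge: Callable[[str], str],
                 scale_min: int = 1, scale_max: int = 5) -> int:
    """Run one direct-scoring evaluation with a caller-supplied judge function."""
    prompt = build_direct_scoring_prompt(response, criterion, scale_min, scale_max)
    return parse_score(judge(prompt), scale_min, scale_max)
```

Passing the judge in as a plain callable keeps the sketch independent of any particular model client, and makes it easy to substitute a stub in tests. Rejecting unparseable or out-of-range scores, rather than clamping them, surfaces judge failures instead of hiding them in the aggregate.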

...

$install

npx skills add sickn33/antigravity-awesome-skills --skill advanced-evaluation

Safety assessment

★

Clarity score

How clear and easy to understand the SKILL.md instructions are, rated from 1 to 5.

1/5

poor

The SKILL.md content is hard to understand and quite ambiguous.

◎

Actionability score

How directly an agent can act on the SKILL.md instructions, rated from 1 to 5.

1/5

not actionable

The SKILL.md is hard to act on; an agent would not know what to do.

~community cookbook

April 18, 2026

~you might also like

tdd-workflows-tdd-cycle

tdd-workflow

tdd-orchestrator

tdd-workflows-tdd-green

tdd-workflows-tdd-red

testing-patterns
