llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Best for AI application development… · Works with GitHub · Low risk

#llm #evaluation #metrics #testing #benchmarking

⌘source

author: @wshobson
repo: wshobson/agents
language: Python

✦overview.md

Key Features

·Implements automated metrics for text generation, classification, and retrieval
·Supports human evaluation and A/B testing
·Detects performance regressions before deployment
·Compares different models or prompts
·Establishes baselines and tracks progress over time

Use Cases

→Measuring LLM application performance systematically
→Validating improvements from prompt changes
→Building confidence in production systems
→Debugging unexpected model behavior

Best for

✓AI application development teams
✓MLOps and production monitoring

plugins/llm-application-dev/skills/llm-evaluation/SKILL.md

name

llm-evaluation

description

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

Measuring LLM application performance systematically
Comparing different models or prompts
Detecting performance regressions before deployment
Validating improvements from prompt changes
Building confidence in production systems
Establishing baselines and tracking progress over time
Debugging unexpected model behavior
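
As a rough illustration of detecting a regression against a baseline, the sketch below compares per-example scores from two runs over the same evaluation set. The function name `check_regression`, the `tolerance` parameter, and the sample scores are hypothetical and not taken from the skill itself; a real setup would likely add a significance test on top of the mean comparison.

```python
from statistics import mean


def check_regression(baseline_scores, candidate_scores, tolerance=0.02):
    """Compare mean scores of two runs over the same evaluation examples.

    `tolerance` is an assumed threshold: a drop in the mean larger than this
    is flagged as a regression.
    """
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("both runs must score the same evaluation examples")
    base = mean(baseline_scores)
    cand = mean(candidate_scores)
    delta = cand - base
    return {
        "baseline": base,
        "candidate": cand,
        "delta": delta,
        "regressed": delta < -tolerance,
    }


# Hypothetical per-example scores in [0, 1] for the current prompt and a new one.
baseline = [0.8, 0.9, 0.7, 1.0]
worse_prompt = [0.8, 0.8, 0.6, 0.9]
similar_prompt = [0.8, 0.9, 0.7, 0.95]

worse_report = check_regression(baseline, worse_prompt)
similar_report = check_regression(baseline, similar_prompt)
```

A check like this could run in CI before deployment, with the stored baseline updated only when a change is accepted.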

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

Text Generation:

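BLEU: N-gram precision against reference text, with a brevity penalty (translation)
ROUGE: Recall-oriented n-gram overlap with reference text (summarization)
METEOR: Unigram matching with stemming and synonym support, balancing precision and recall
BERTScore: Embedding-based similarity
Perplexity: How well a language model predicts a text (lower is better)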
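
To make the overlap-based metrics above concrete, here is a minimal sketch using only the standard library. It computes clipped n-gram precision (the building block of BLEU, without the brevity penalty or the geometric mean over n-gram orders), ROUGE-N recall, and perplexity from token log-probabilities. The function names and the whitespace tokenization are assumptions made for illustration; a production evaluation would normally rely on an established metrics library instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter (multiset) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference (counts clipped)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate (counts clipped)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Assumed toy data: whitespace tokenization of one reference and one candidate.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

unigram_precision = clipped_precision(candidate, reference, n=1)
bigram_precision = clipped_precision(candidate, reference, n=2)
unigram_recall = rouge_n_recall(candidate, reference, n=1)
ppl = perplexity([math.log(0.5), math.log(0.25)])
```

Precision penalizes extra words in the candidate, while recall penalizes words missing from it, which is why the two families suit translation and summarization differently.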

Classification:

Accuracy: Percentage correct
Precision/Recall/F1: Class-specific performance
Confusion Matrix: Error patterns
AUC-ROC: How reliably predicted scores rank positives above negatives
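
The classification metrics above can be sketched for the binary case as follows. This is a plain-Python illustration under assumed names (`confusion_counts`, `precision_recall_f1`, `auc_roc`) and toy labels. The AUC here uses the pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counting half), which is quadratic in the number of examples and only suitable for small sets.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc_roc(y_true, scores, positive=1):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed toy data: gold labels, hard predictions, and model scores.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

The four counts returned by `confusion_counts` are the cells of a 2x2 confusion matrix, so the error pattern (here one false positive and one false negative) can be read directly from them.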

Retrieval (RAG):

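MRR: Mean Reciprocal Rank (average of 1/rank of the first relevant result)
NDCG: Normalized Discounted Cumulative Gain (graded relevance, discounted by position)
Precision@K: Fraction of the top K results that are relevant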
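
A minimal sketch of the three retrieval metrics above, assuming each query yields a ranked list of document IDs plus either a set of relevant IDs or a mapping from ID to graded relevance. The function names, document IDs, and gain values are hypothetical. The NDCG variant shown uses linear gains with a log2 position discount; other formulations (for example exponential gains) exist.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gains; `gains` maps doc ID to graded relevance."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(i + 2)
        for i, doc in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# Assumed toy data: two queries with hypothetical document IDs.
query_a = (["d3", "d1", "d7", "d2"], {"d1", "d2"})
query_b = (["d5", "d9"], {"d5"})

p_at_2 = precision_at_k(query_a[0], query_a[1], k=2)
mrr = mean_reciprocal_rank([query_a, query_b])
graded = {"d1": 3, "d2": 1}
ndcg = ndcg_at_k(query_a[0], graded, k=4)
perfect = ndcg_at_k(["d1", "d2"], graded, k=2)
```

For a RAG pipeline, these would typically be computed on the retriever output before generation, so retrieval quality can be tracked separately from answer quality.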

...

$install

npx skills add wshobson/agents --skill llm-evaluation

Safety assessment

★

Clarity score

How clear and easy to understand the SKILL.md instructions are, rated from 1 to 5.

3/5

good

Mostly clear, but there are still a few confusing or poorly structured parts.

◎

Actionability score

How directly an agent can act on the SKILL.md instructions, rated from 1 to 5.

3/5

medium

Partially actionable with several concrete steps, but still missing important details.

~community cookbook

April 18, 2026
