A team swapped gpt-4o for claude-sonnet-4-6 to save money. A week later,
customer support noticed answer quality had dropped on 15% of refund
tickets. The regression was invisible in code review and invisible in CI
because no golden set existed.
Fix: a versioned golden set, a stacked eval pipeline (LangSmith + ragas + deepeval + custom trajectory scoring), and a PR-blocking regression gate backed by a paired Wilcoxon significance test. The tooling exists; the patterns for wiring it into a statistically honest loop are scattered across five doc sites.
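The paired Wilcoxon check behind that regression gate can be sketched in pure Python. This is a normal-approximation version, not scipy's exact implementation; the function name and tie handling are illustrative, and per-example scores are assumed to be paired by golden-set example:

```python
import math

def wilcoxon_paired(baseline, candidate):
    """Paired Wilcoxon signed-rank test, normal approximation.

    baseline/candidate: per-example scores in the same order.
    Returns (W_plus, two-sided p-value)."""
    # Drop zero differences, as the classic test does.
    diffs = [c - b for b, c in zip(baseline, candidate) if c != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank |d| ascending, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    # W+ = sum of ranks where the candidate improved on the baseline.
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (w_plus - mean) / math.sqrt(var)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

With 100 golden-set examples the normal approximation is adequate; for small n, an exact implementation (e.g. scipy's) is the safer choice.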
Build a 100-example JSONL golden set, then layer the evaluators:
- wire LangSmith evaluate() with a custom correctness evaluator
- add the ragas quartet (faithfulness, answer relevance, context precision/recall) for RAG
- add deepeval LLM-as-judge with an N=3 judge quorum
- score LangGraph trajectories on coverage, precision, and order
- gate PRs on a 2% aggregate drop or a 5% per-example drop
Pin: langchain-core 1.0.x, langgraph 1.0.x, langsmith>=0.2, ragas>=0.2,
deepeval>=2.0. Pain-catalog anchors: P01, P11, P12, P22, P33.
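The PR gate described above can be sketched as follows. The 2% aggregate and 5% per-example thresholds come from the spec; the function name and the dict-of-scores shape are assumptions for illustration:

```python
def regression_gate(baseline, candidate, agg_drop=0.02, per_example_drop=0.05):
    """Return a list of failure reasons; an empty list means the PR passes.

    baseline/candidate: {example_id: score} for the same golden set."""
    failures = []
    # Aggregate check: mean score must not drop more than agg_drop.
    agg_b = sum(baseline.values()) / len(baseline)
    agg_c = sum(candidate.values()) / len(candidate)
    if agg_b - agg_c > agg_drop:
        failures.append(
            f"aggregate dropped {agg_b - agg_c:.3f} (threshold {agg_drop})"
        )
    # Per-example check: no single example may drop more than per_example_drop.
    for ex_id, b in baseline.items():
        c = candidate.get(ex_id, 0.0)  # missing result counts as a zero score
        if b - c > per_example_drop:
            failures.append(f"{ex_id}: {b:.2f} -> {c:.2f}")
    return failures
```

In CI, a non-empty return would fail the check and block the merge; the Wilcoxon test adds a significance layer on top so a 2% drop from noise on a small set does not block a good PR.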
Compatibility: langchain-core >= 1.0, < 2.0 and langgraph >= 1.0, < 2.0 for the system under eval.
Install: npx skills add jeremylongshore/claude-code-plugins-plus-skills --skill langchain-eval-harness
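A golden-set line and loader might look like this. The id/input/expected/tags field names are a hypothetical schema, not the skill's mandated format:

```python
import io
import json

# One JSON object per line; here a StringIO stands in for golden_set.jsonl.
sample = io.StringIO(
    '{"id": "refund-001", '
    '"input": "Can I get a refund after 30 days?", '
    '"expected": "Refunds are granted within 30 days of purchase.", '
    '"tags": ["refund", "policy"]}\n'
)

def load_golden_set(fp):
    """Parse one JSON object per non-blank line of a JSONL file."""
    return [json.loads(line) for line in fp if line.strip()]

examples = load_golden_set(sample)
```

Versioning the file in git alongside the code is what makes the regression gate reproducible: every PR is scored against the same pinned examples.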