A team swapped gpt-4o for claude-sonnet-4-6 to save money. A week later,
customer support noticed answer quality had dropped on 15% of refund
tickets. The regression was invisible in code review and invisible in CI
because no golden set existed.
Fix: a versioned golden set, a stacked eval pipeline (LangSmith + ragas + deepeval + custom trajectory scoring), and a PR-blocking regression gate backed by a paired Wilcoxon significance test. The tooling exists; the patterns for wiring it into a statistically honest loop are scattered across five doc sites.
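The paired Wilcoxon check behind that regression gate can be sketched in pure Python. This is a normal-approximation version, not scipy's exact implementation; the function name and tie handling are illustrative, and per-example scores are assumed to be paired by golden-set example:

```python
import math

def wilcoxon_paired(baseline, candidate):
    """Paired Wilcoxon signed-rank test, normal approximation.

    baseline/candidate: per-example scores in the same order.
    Returns (W_plus, two-sided p-value)."""
    # Drop zero differences, as the classic test does.
    diffs = [c - b for b, c in zip(baseline, candidate) if c != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank |d| ascending, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    # W+ = sum of ranks where the candidate improved on the baseline.
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (w_plus - mean) / math.sqrt(var)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

With 100 golden-set examples the normal approximation is adequate; for small n, an exact implementation (e.g. scipy's) is the safer choice.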
Build a 100-example JSONL golden set, then layer the evaluators:
- wire LangSmith evaluate() with a custom correctness evaluator
- add the ragas quartet (faithfulness, answer relevance, context precision/recall) for RAG
- add deepeval LLM-as-judge with an N=3 judge quorum
- score LangGraph trajectories on coverage, precision, and order
- gate PRs on a 2% aggregate drop or a 5% per-example drop
Pin: langchain-core 1.0.x, langgraph 1.0.x, langsmith>=0.2, ragas>=0.2,
deepeval>=2.0. Pain-catalog anchors: P01, P11, P12, P22, P33.
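The PR gate described above can be sketched as follows. The 2% aggregate and 5% per-example thresholds come from the spec; the function name and the dict-of-scores shape are assumptions for illustration:

```python
def regression_gate(baseline, candidate, agg_drop=0.02, per_example_drop=0.05):
    """Return a list of failure reasons; an empty list means the PR passes.

    baseline/candidate: {example_id: score} for the same golden set."""
    failures = []
    # Aggregate check: mean score must not drop more than agg_drop.
    agg_b = sum(baseline.values()) / len(baseline)
    agg_c = sum(candidate.values()) / len(candidate)
    if agg_b - agg_c > agg_drop:
        failures.append(
            f"aggregate dropped {agg_b - agg_c:.3f} (threshold {agg_drop})"
        )
    # Per-example check: no single example may drop more than per_example_drop.
    for ex_id, b in baseline.items():
        c = candidate.get(ex_id, 0.0)  # missing result counts as a zero score
        if b - c > per_example_drop:
            failures.append(f"{ex_id}: {b:.2f} -> {c:.2f}")
    return failures
```

In CI, a non-empty return would fail the check and block the merge; the Wilcoxon test adds a significance layer on top so a 2% drop from noise on a small set does not block a good PR.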
Compatibility: langchain-core >= 1.0, < 2.0 and langgraph >= 1.0, < 2.0 for the system under eval.
Install: npx skills add jeremylongshore/claude-code-plugins-plus-skills --skill langchain-eval-harness
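A golden-set line and loader might look like this. The id/input/expected/tags field names are a hypothetical schema, not the skill's mandated format:

```python
import io
import json

# One JSON object per line; here a StringIO stands in for golden_set.jsonl.
sample = io.StringIO(
    '{"id": "refund-001", '
    '"input": "Can I get a refund after 30 days?", '
    '"expected": "Refunds are granted within 30 days of purchase.", '
    '"tags": ["refund", "policy"]}\n'
)

def load_golden_set(fp):
    """Parse one JSON object per non-blank line of a JSONL file."""
    return [json.loads(line) for line in fp if line.strip()]

examples = load_golden_set(sample)
```

Versioning the file in git alongside the code is what makes the regression gate reproducible: every PR is scored against the same pinned examples.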