skill-eval-harness
Summary
The skill-eval-harness skill provides a framework for evaluating language model outputs against custom criteria, enabling developers to automate quality assessment of generated text.
- It helps ensure consistency, accuracy, and adherence to guidelines by running structured evaluations on model responses.
Install & Usage
mkdir -p .claude/skills && curl -o .claude/skills/skill-eval-harness.md https://raw.githubusercontent.com/adewale/skill-eval-harness/main/SKILL.md
/skill-eval-harness
Use Cases
Usage Examples
/skill-eval-harness evaluate 'What is the capital of France?' against fact-check dataset
Run an evaluation on the last 10 responses for tone consistency using the guidelines in tone_rules.json
/skill-eval-harness compare 'Explain quantum computing' with two different prompts and score clarity
Frequently Asked Questions
What is skill-eval-harness?
The skill-eval-harness skill provides a framework for evaluating language model outputs against custom criteria, enabling developers to automate quality assessment of generated text. It helps ensure consistency, accuracy, and adherence to guidelines by running structured evaluations on model responses.
How to install skill-eval-harness?
To install skill-eval-harness, run the following command, which creates the skills directory if it does not exist and downloads the skill file into it:
mkdir -p .claude/skills && curl -o .claude/skills/skill-eval-harness.md https://raw.githubusercontent.com/adewale/skill-eval-harness/main/SKILL.md
Then invoke the skill with /skill-eval-harness in Claude Code.
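The install step can also be wrapped in a small script that stops on a failed download instead of leaving an error page or an empty file in place. This is a minimal sketch, not part of the skill itself: the function name install_skill and its two arguments are hypothetical, and only the URL and destination path in the usage note come from the instructions above.

```shell
# Hypothetical helper: download a skill file and verify the result.
# Usage: install_skill <url> <destination-path>
install_skill() {
  url="$1"
  dest="$2"

  # Create the target directory if it does not exist yet.
  mkdir -p "$(dirname "$dest")"

  # -f fails on HTTP errors, -sS stays quiet but still prints errors,
  # -L follows redirects.
  curl -fsSL -o "$dest" "$url" || return 1

  # -s is true only when the file exists and is non-empty.
  if [ ! -s "$dest" ]; then
    echo "download produced an empty file: $dest" >&2
    return 1
  fi

  echo "installed $dest"
}
```

To install this skill with the helper, call install_skill "https://raw.githubusercontent.com/adewale/skill-eval-harness/main/SKILL.md" ".claude/skills/skill-eval-harness.md" from the project root.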
What is skill-eval-harness best for?
skill-eval-harness is a skill in the General category, created by adewale.
What can I use skill-eval-harness for?
You can use skill-eval-harness to:
- Assess the factual accuracy of Claude's answers against a provided knowledge base.
- Evaluate the tone and style of generated content to match brand guidelines.
- Run automated regression tests on model outputs after prompt changes.
- Compare multiple model responses to select the best one based on predefined metrics.
- Validate that code snippets generated by Claude compile and pass unit tests.
- Measure the conciseness and relevance of summaries produced by the model.