
Skill Evaluation

Evaluate any agent skill against 12 best-practice criteria from Anthropic and agentskills.io. Produces a structured markdown scorecard with per-criterion scores (0–100), category classification, bonus patterns, and prioritized improvement actions.

  • Evaluate skill quality before publishing
  • Audit existing skills for improvement opportunities
  • Compare two skills side-by-side
  • Check compliance with industry best practices
npx skills add https://github.com/fabricioctelles/skills -s skill-evaluation

| # | Criterion | Weight |
|---|-----------|--------|
| 1 | Don’t state the obvious | 2x |
| 2 | Gotchas section | 2x |
| 3 | Progressive disclosure | 2x |
| 4 | Avoids railroading | 1x |
| 5 | Setup flow | 1x |
| 6 | Description for trigger | 2x |
| 7 | Memory mechanism | 1x |
| 8 | Scripts & libraries | 1x |
| 9 | On-demand hooks | 1x |
| 10 | Conciseness | 2x |
| 11 | Coherent scope | 1x |
| 12 | Grounded in expertise | 2x |

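The weights in the criteria table above suggest that the overall score is some weighted combination of the per-criterion scores. The page does not state the exact formula, so the sketch below assumes a plain weighted mean; the function name `weighted_overall` and the example scores are hypothetical and only illustrate how 2x criteria pull the total harder than 1x criteria.

```python
# Weights copied from the criteria table. The aggregation formula itself
# (a plain weighted mean) is an ASSUMPTION; the skill may compute it differently.
WEIGHTS = {
    "Don't state the obvious": 2,
    "Gotchas section": 2,
    "Progressive disclosure": 2,
    "Avoids railroading": 1,
    "Setup flow": 1,
    "Description for trigger": 2,
    "Memory mechanism": 1,
    "Scripts & libraries": 1,
    "On-demand hooks": 1,
    "Conciseness": 2,
    "Coherent scope": 1,
    "Grounded in expertise": 2,
}


def weighted_overall(scores: dict[str, float]) -> float:
    """Return the weighted mean of per-criterion scores (each 0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical scores for illustration only: every criterion at 50,
    # except a missing gotchas section scored 0.
    example = {name: 50 for name in WEIGHTS}
    example["Gotchas section"] = 0
    print(round(weighted_overall(example), 1))
```

Under this assumption, a zero on a 2x criterion costs twice as much as a zero on a 1x criterion, which is one reason a missing gotchas section would be a high-priority fix.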
  • Validation loops
  • Output templates
  • Procedures over declarations
  • Defaults over menus

| | Skill Evaluation | agentskills.io evals |
|---|---|---|
| Evaluates | Skill structure quality | Skill output quality in use |
| Method | Static inspection | Test cases + benchmark |
| When | Is the skill well-built? | Does it work in practice? |
| Output | Scorecard 0-100 + grade | pass_rate + tokens + time |

Use this skill first for solid structure, then run evals to validate real-world performance.

# Skill Evaluation — skill-evaluation
> Evaluated: 2026-06-27
> Evaluator: skill-evaluation v1.0.0
## Summary
| Metric | Value |
|--------|-------|
| Overall Score | 62/100 |
| Grade | B |
| Category | code-quality-and-review |
| Files | 2 |
| Has references/ | yes |
| Has scripts/ | no |
## Scorecard
| # | Criterion | Score | Notes |
|---|-----------|-------|-------|
| 1 | Don't state the obvious | 85 | Framework is specific, not generic |
| 2 | Gotchas section | 0 | Absent — no pitfall warnings |
| 3 | Progressive disclosure | 55 | 1 reference file, template inline |
| 6 | Description for trigger | 90 | Multiple concrete triggers |
| 10 | Conciseness | 70 | 223 lines, output template could be ref |
| 11 | Coherent scope | 95 | Does ONE thing well |
| 12 | Grounded in expertise | 80 | 3 authoritative sources |
## Top 3 Improvements
1. Gotchas (0) — Add multi-client evaluation pitfalls
2. Scripts (0) — Create quick-check.sh for measurable criteria
3. Memory (0) — Keep evaluations.log for progress tracking
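
Some of the summary fields in the sample report above (file count, presence of `references/` and `scripts/`, line count) are mechanically measurable. As a rough sketch of how such facts could be gathered, the helper below walks a skill directory. It assumes the main file is named `SKILL.md`; the function `summarize_skill` and its return keys are hypothetical and are not part of the skill itself.

```python
from pathlib import Path


def summarize_skill(skill_dir: str) -> dict:
    """Collect simple, measurable facts about a skill directory.

    ASSUMPTION: the skill's main file is SKILL.md at the directory root.
    """
    root = Path(skill_dir)
    # Count every regular file under the skill directory, recursively.
    files = [p for p in root.rglob("*") if p.is_file()]
    skill_md = root / "SKILL.md"
    line_count = (
        len(skill_md.read_text(encoding="utf-8").splitlines())
        if skill_md.is_file()
        else None  # no SKILL.md found, so no line count to report
    )
    return {
        "files": len(files),
        "has_references": (root / "references").is_dir(),
        "has_scripts": (root / "scripts").is_dir(),
        "skill_md_lines": line_count,
    }
```

Calling `summarize_skill("path/to/skill")` returns a small dictionary that could feed the Summary table. The judgment-based criteria still need a reader or an agent to score them.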

📄 Full documentation on GitHub