skill-forge
NewCreate, evaluate, and improve agent skills (SKILL.md format) using an evidence-based methodology distilled from 244 top skills across the Anthropic, OpenAI, vendor, and community ecosystems. Use when the user wants to turn a repeated task, workflow, or expertise into a skill, asks to create or scaffold a SKILL.md, wants an existing skill judged, scored, or audited, or asks to improve a skill's structure, description, or triggering (中文触发:做成技能、创建 skill、沉淀为 skill、写一个技能、评估/评判这个 skill、优化 skill 描述、skill 体检). For routing between already-installed skills use agent-skill-hub instead; for Lark API skills, lark-skill-maker supplies the domain template while this skill governs quality and testing. NOT for writing ordinary prompts, CLAUDE.md/AGENTS.md rule files, or subagent definitions.
Overview
Skill Forge
Create, judge, and improve skills. Every rule in this skill traces to evidence from a corpus study of 1400+ public skills, deep-analyzed in waves (anthropics/skills, obra/superpowers, openai/skills, Vercel/Cloudflare/Supabase/Expo/HF/Microsoft/Trail of Bits, Matt Pocock, Addy Osmani, baoyu, marketing/PM verticals, K-Dense). Rule IDs like U7/S8/R15 and the spike exemplars in references/exemplars.md trace to that study (methodology summarized in the project README). The skill works standalone; the evidence is provenance, not a runtime dependency.
A skill is not a document. It is three layers of machinery: a corrective patch for the model's bad defaults (not a restatement of domain knowledge), an optimal allocation of knowledge across context budget (progressive disclosure), and a falsifiable quality loop (tests that can fail). Drop any layer and you produce one of the twenty documented anti-patterns.
Mode routing
| User intent sounds like | Mode | Read next |
|---|---|---|
| "turn this into a skill", "create a skill for X", 做成技能 | A: Create | below + references/type-playbooks.md |
| "is this skill good?", "audit/score this skill", 评估这个skill | B: Judge | references/rubric.md |
| "improve/fix/refresh this skill", 优化这个skill | C: Improve | references/rubric.md then below |
| "the skill doesn't trigger / triggers too much" | D: Tune description | references/description-engineering.md |
Output language for user-facing reports: Chinese. Skill bodies you produce: English by default (instructions read by models), bilingual trigger terms in the description when the user works in Chinese; follow the user's explicit preference if stated.
Mode A: Create
Eight steps. Steps 1-2 are gates — they can kill the project, which is a success, not a failure. Do not skip gates to be agreeable; a skill that shouldn't exist costs context on every future session.
Step 1 — Supply gate (R1)
A skill must be built from real, observed expertise. Acceptable sources:
- •(a) A task actually completed in this or a recorded conversation — extract the steps that worked, the corrections the user made, the input/output formats observed.
- •(b) A named existing asset: runbook, incident report, review comments, a personal knowledge base you can search (Obsidian vault, notes directory, wiki), or an upstream skill to adapt.
- •(c) Documented external truth you can fetch and cite (official docs, changelogs, spec).
If none exists, say so and offer two honest paths: do the task once for real and harvest it, or produce a thin draft with explicit OPEN QUESTIONS the user must answer. Never fabricate procedures from general knowledge — the corpus shows this produces "vague, generic procedures" (the #1 supply-side failure; U9).
Step 2 — Deletion test (R2)
Ask: if this skill did not exist, would the model acting on its priors get it wrong?
- •Wrong → full skill justified.
- •Mostly right → build the thin version only: version pitfalls + safety lines + ecosystem routing, a few dozen lines. (Evidence: scikit-learn-style API restatements fail this test; pydeseq2-style trap collections pass it.)
- •Right → recommend no skill; a one-line note in an existing skill or rules file may suffice.
Also apply per-section while drafting: "delete this paragraph — does model behavior change?" If not, cut it.
Step 3 — Type and neighbors (R3, R4)
Classify into one of six types; the type selects the structure template, freedom defaults, and test mechanism. Read references/type-playbooks.md for the chosen type before drafting.
| Type | One-line test |
|---|---|
| technique | Changes how the model works on any task (discipline, style) |
| workflow | Multi-step process with stages, gates, handoffs |
| reference | Knowledge the model must consult (rules, APIs, conventions) |
| tool-wrapper | Teaches a CLI/API/script the model wouldn't know |
| domain | Domain judgment, taste, conventions, compliance |
| meta | Operates on skills/agents/rules themselves |
Then scan for neighbors: list your installed skill directories (~/.agents/skills/, ~/.claude/skills/, and any project .agents/skills/) plus a skill registry/index if you keep one. If any overlap exists, the new skill MUST contain a disambiguation line ("X is handled here; for Y use Z") and the neighbor should gain a reciprocal pointer. Overlapping skills without mutual routing silently steal each other's triggers — observed even inside official repos.
Step 4 — Description engineering (R5)
The description is the skill's only activation surface — write it before the body, against references/description-engineering.md. Core rules: three parts (what-scope + "Use when" triggers + NOT-boundary/routing); trigger terms ranked by specificity (error strings, proper nouns, file extensions, verbatim user phrases > generic verbs, max 3 generic verbs); bilingual triggers for Chinese-language usage; never summarize the workflow's steps in the description (verified failure: the model follows the summary and skips the body); no "Always use ... even if" unless it is the domain's only entry point.
Step 5 — Draft the body (R7-R14)
Budget first: target 100-350 lines, hard ceiling 500; total knowledge >1500 lines forces a references/ split (one level deep only; >10k-word files need a grep-anchor index). Critical constraints go in the first screen, never after line 1000.
While drafting, enforce the type playbook plus these cross-type rules:
- •Anti-default inventory (R10): list the model's known bad defaults in this domain (aesthetic, language, behavior, stale knowledge) and write each as a checkable countermeasure — a ban table, numeric threshold, or "X not Y" override. This is the highest-value content a skill can carry (U7).
- •Freedom per subtask (R8): fragile/sequential subtasks get scripts or verbatim templates (low freedom); judgment subtasks get principles + anti-pattern fences + false-positive lists (high freedom). Mix freely within one skill.
- •Output contract (R12): deliverable-producing skills define the artifact's format (field table, fixed sections, badge, machine-parseable line) on the first screen.
- •Run-time verification (R16): build "assume there are problems" checks into the skill itself — a fix-and-verify loop, an until-clean command, a quality gate on the artifact. Generation is not completion.
- •Tables over prose (R13) for decisions, routing, and error handling; prose only to explain why a rule exists (explained rules generalize; bare MUSTs don't).
- •Negative boundary (R14): every skill gets "When NOT to use" with an alternative. Style/persistent skills also get an exit phrase and a safety exception.
- •Freshness protocol (R11): any version/price/limit/model-name gets one of:
checked <date>stamp + refresh path + "docs win" conflict rule; deliberate omission + official link; or a regenerate command. - •Discipline three-piece (R9): technique/discipline skills require a Rationalization table (excuse → reality), Red Flags list, and a letter-equals-spirit clause — sourced from observed excuses (Step 6 RED run), not invented ones.
- •Danger guards (R21): skills that scan, write, publish, or attack get triple authorization statements,
disable-model-invocation: truewhere user-initiated-only, and narrowallowed-tools. Prefer harness enforcement (hooks, allowed-tools) over polite instructions (R22): what can be mechanically blocked must not rely on model goodwill.
Step 6 — Test (R15, R16, R17)
Route by type; protocols and minimal local variants in references/testing-protocols.md:
| Type | Primary | Minimum acceptable |
|---|---|---|
| technique | Pressure test (RED baseline, record verbatim excuses → countermeasures) | 1 no-skill pressure scenario + verify the skill blocks the observed excuse |
| workflow | Eval loop (with-skill vs baseline subagents) | 2 real prompts, both arms, human compare |
| reference | Deletion test + reference lint + trigger eval | Lint + 3 trigger probes |
| tool-wrapper | Golden tests (sample input → expected output shape) | 1 golden test executed |
| domain | Behavior assertions (process/format, not content) + human review | evals.json with 2 cases |
| meta | Eval loop + self-compliance lint (it obeys its own rules) | Lint passes on itself |
Store test prompts in evals/evals.json (schema: skill_name, evals[].id/prompt/expected_output/assertions). Evals are development assets: they must exist and be kept; whether they ship in a package is a secondary choice. When improving an existing skill, snapshot it first — the old version is your baseline.
Step 7 — Lint (R18-R20)
Run the validator; it must pass with zero errors before delivery:
python3 ~/.agents/skills/skill-forge/scripts/validate_skill.py <skill-dir> [--json] [--strict]It checks: frontmatter field whitelist and name/description spec rules, reference existence (every `references/`, `scripts/`, `assets/` path mentioned must exist — dangling references are the highest-severity documented defect), line budgets, imperative-inflation density (MANDATORY/⚠️/CAPS), keyword-stuffing signals, secret patterns in examples, and promised-vs-actual file counts. The validator is also the Mode B pre-pass.
Step 8 — Install and register
- Install to
~/.agents/skills/<name>/for a user-level skill, or<repo>/.agents/skills/<name>/for a project skill. (~/.agents/skills/is the cross-tool convention read natively by Codex and other agents.) - Claude Code native discovery: symlink
ln -s ~/.agents/skills/<name> ~/.claude/skills/<name>(verify it resolves). - If your setup maintains a skill registry/index/catalog, rebuild it and add a route row (trigger, input, output, path, priority) so the new skill is discoverable and de-duplicated against neighbors.
- Restart the agent session so skill discovery refreshes.
Mode B: Judge
Read references/rubric.md and score the target against the 14 signals in causal order — S8 deletion test first (should it exist?), then S1 triggering (can it activate?), then S6/S7 structure and noise (can it execute?), then S10/S11 verification and contract (can its output be trusted?). A failure upstream makes downstream scores moot. Run the validator (Step 7) as the mechanical pre-pass; never hand-check what the script checks.
Default judging tier is validator + manual reasoning against the anchors. Signals whose detection column says subagent (S8 baseline run, S1 trigger probes) may be scored manually at this tier — state in the report that the probe was not executed; escalate to real probe runs only when the verdict hinges on them or the user asks for a deep audit.
Deliver: per-signal score with quoted evidence, the 2-3 highest-leverage fixes mapped to playbook patterns, and an honest verdict line: keep / improve / rebuild thin / retire. Do not pad findings to seem thorough — 3 real issues beat 10 cosmetic ones (the-fool's rule: cap challenges, no nitpick-stacking).
Mode C: Improve
- Judge first (Mode B) — fixes must trace to scored failures, not taste.
- Snapshot the current version (
cp -rto a workspace) as the eval baseline. - Apply fixes per the relevant playbook section; re-run lint and the type's test mechanism against the snapshot.
- Improvement philosophy (evidence-backed): generalize from feedback instead of overfitting to the test examples; cut what isn't pulling its weight (read transcripts — wasted model actions point at cuttable text); explain why instead of stacking MUSTs; if several test runs each hand-wrote a similar helper, that helper becomes
scripts/.
Rationalizations to reject (this skill's own discipline)
| Excuse | Reality |
|---|---|
| "The request is clear, skip the supply gate" | Clear requests still produce generic slop without observed expertise; the gate takes one question |
| "It's a small skill, no tests needed" | The minimum tier exists precisely for small skills; zero tests = "generation is the finish line", the documented top failure mode |
| "Lint later, deliver first" | Dangling references are written exactly at delivery time; lint costs seconds |
| "User is in a hurry, skip the deletion test" | A skill that shouldn't exist taxes every future session's context — slower forever to save a minute now |
| "I'll write the description from the body summary" | Verified failure mode: models execute the description's summary and skip the body |
Red flags you are about to violate this skill: writing the body before the description; a description containing step words ("first analyzes, then..."); delivering with "the skill is ready" without a lint run in the same turn; inventing an excuse table instead of running a RED observation.
When NOT to use
- •Routing/catalog questions about installed skills →
agent-skill-hub. - •Lark API skill content →
lark-skill-makerfor the domain template; come back here for description, tests, and lint. - •One-off prompts, CLAUDE.md/AGENTS.md rules, subagent definitions → not skills; write them directly.
- •In Claude Code with the user explicitly asking for Anthropic's interactive eval-viewer flow →
anthropic-skills:skill-creatorplugin remains available; itsrun_loop.py(description optimization) and eval-viewer can be reused from here for Step 6/D since they only requireclaude -p.
Resource map
| File | Load when |
|---|---|
references/rubric.md | Mode B/C; final QA of Mode A |
references/type-playbooks.md | Mode A Step 3-5; Mode C fixes |
references/description-engineering.md | Step 4; Mode D |
references/testing-protocols.md | Step 6; Mode C regression |
references/anti-patterns.md | Drafting QA; judging evidence lookup |
references/exemplars.md | Mode A Step 5 — imitate the verified best exemplar of the chosen type (52 spike exemplars, citation-verified) |
scripts/validate_skill.py | Step 7; Mode B pre-pass (execute, don't read) |
scripts/verify_citations.py | Maintaining evidence-backed assets — quotes must exist verbatim (execute) |
scripts/build_exemplars.py | Regenerate exemplars.md from verified JSON (execute) |
Install & Usage
mkdir -p .claude/skillsAdd the configuration to .claude/skills/skill-forge.md
/skill-forgeSecurity Audits
Frequently Asked Questions
What is skill-forge?
Create, evaluate, and improve agent skills (SKILL.md format) using an evidence-based methodology distilled from 244 top skills across the Anthropic, OpenAI, vendor, and community ecosystems. Use when the user wants to turn a repeated task, workflow, or expertise into a skill, asks to create or scaffold a SKILL.md, wants an existing skill judged, scored, or audited, or asks to improve a skill's structure, description, or triggering (中文触发:做成技能、创建 skill、沉淀为 skill、写一个技能、评估/评判这个 skill、优化 skill 描述、skill 体检). For routing between already-installed skills use agent-skill-hub instead; for Lark API skills, lark-skill-maker supplies the domain template while this skill governs quality and testing. NOT for writing ordinary prompts, CLAUDE.md/AGENTS.md rule files, or subagent definitions.
How to install skill-forge?
To install skill-forge: create the skills directory (mkdir -p .claude/skills), then add the config to .claude/skills/skill-forge.md. Finally, /skill-forge in Claude Code.
What is skill-forge best for?
skill-forge is a community categorized under General. It is designed for: testing, api, agent. Created by truthandstories07-sudo.