BeClaude

skill-forge

New
1GitHub TrendingGeneralby truthandstories07-sudo

Create, evaluate, and improve agent skills (SKILL.md format) using an evidence-based methodology distilled from 244 top skills across the Anthropic, OpenAI, vendor, and community ecosystems. Use when the user wants to turn a repeated task, workflow, or expertise into a skill, asks to create or scaffold a SKILL.md, wants an existing skill judged, scored, or audited, or asks to improve a skill's structure, description, or triggering (中文触发:做成技能、创建 skill、沉淀为 skill、写一个技能、评估/评判这个 skill、优化 skill 描述、skill 体检). For routing between already-installed skills use agent-skill-hub instead; for Lark API skills, lark-skill-maker supplies the domain template while this skill governs quality and testing. NOT for writing ordinary prompts, CLAUDE.md/AGENTS.md rule files, or subagent definitions.

First seen 6/12/2026
Community PluginView Source

Overview

Skill Forge

Create, judge, and improve skills. Every rule in this skill traces to evidence from a corpus study of 1400+ public skills, deep-analyzed in waves (anthropics/skills, obra/superpowers, openai/skills, Vercel/Cloudflare/Supabase/Expo/HF/Microsoft/Trail of Bits, Matt Pocock, Addy Osmani, baoyu, marketing/PM verticals, K-Dense). Rule IDs like U7/S8/R15 and the spike exemplars in references/exemplars.md trace to that study (methodology summarized in the project README). The skill works standalone; the evidence is provenance, not a runtime dependency.

A skill is not a document. It is three layers of machinery: a corrective patch for the model's bad defaults (not a restatement of domain knowledge), an optimal allocation of knowledge across context budget (progressive disclosure), and a falsifiable quality loop (tests that can fail). Drop any layer and you produce one of the twenty documented anti-patterns.

Mode routing

User intent sounds likeModeRead next
"turn this into a skill", "create a skill for X", 做成技能A: Createbelow + references/type-playbooks.md
"is this skill good?", "audit/score this skill", 评估这个skillB: Judgereferences/rubric.md
"improve/fix/refresh this skill", 优化这个skillC: Improvereferences/rubric.md then below
"the skill doesn't trigger / triggers too much"D: Tune descriptionreferences/description-engineering.md

Output language for user-facing reports: Chinese. Skill bodies you produce: English by default (instructions read by models), bilingual trigger terms in the description when the user works in Chinese; follow the user's explicit preference if stated.

Mode A: Create

Eight steps. Steps 1-2 are gates — they can kill the project, which is a success, not a failure. Do not skip gates to be agreeable; a skill that shouldn't exist costs context on every future session.

Step 1 — Supply gate (R1)

A skill must be built from real, observed expertise. Acceptable sources:

  • (a) A task actually completed in this or a recorded conversation — extract the steps that worked, the corrections the user made, the input/output formats observed.
  • (b) A named existing asset: runbook, incident report, review comments, a personal knowledge base you can search (Obsidian vault, notes directory, wiki), or an upstream skill to adapt.
  • (c) Documented external truth you can fetch and cite (official docs, changelogs, spec).

If none exists, say so and offer two honest paths: do the task once for real and harvest it, or produce a thin draft with explicit OPEN QUESTIONS the user must answer. Never fabricate procedures from general knowledge — the corpus shows this produces "vague, generic procedures" (the #1 supply-side failure; U9).

Step 2 — Deletion test (R2)

Ask: if this skill did not exist, would the model acting on its priors get it wrong?

  • Wrong → full skill justified.
  • Mostly right → build the thin version only: version pitfalls + safety lines + ecosystem routing, a few dozen lines. (Evidence: scikit-learn-style API restatements fail this test; pydeseq2-style trap collections pass it.)
  • Right → recommend no skill; a one-line note in an existing skill or rules file may suffice.

Also apply per-section while drafting: "delete this paragraph — does model behavior change?" If not, cut it.

Step 3 — Type and neighbors (R3, R4)

Classify into one of six types; the type selects the structure template, freedom defaults, and test mechanism. Read references/type-playbooks.md for the chosen type before drafting.

TypeOne-line test
techniqueChanges how the model works on any task (discipline, style)
workflowMulti-step process with stages, gates, handoffs
referenceKnowledge the model must consult (rules, APIs, conventions)
tool-wrapperTeaches a CLI/API/script the model wouldn't know
domainDomain judgment, taste, conventions, compliance
metaOperates on skills/agents/rules themselves

Then scan for neighbors: list your installed skill directories (~/.agents/skills/, ~/.claude/skills/, and any project .agents/skills/) plus a skill registry/index if you keep one. If any overlap exists, the new skill MUST contain a disambiguation line ("X is handled here; for Y use Z") and the neighbor should gain a reciprocal pointer. Overlapping skills without mutual routing silently steal each other's triggers — observed even inside official repos.

Step 4 — Description engineering (R5)

The description is the skill's only activation surface — write it before the body, against references/description-engineering.md. Core rules: three parts (what-scope + "Use when" triggers + NOT-boundary/routing); trigger terms ranked by specificity (error strings, proper nouns, file extensions, verbatim user phrases > generic verbs, max 3 generic verbs); bilingual triggers for Chinese-language usage; never summarize the workflow's steps in the description (verified failure: the model follows the summary and skips the body); no "Always use ... even if" unless it is the domain's only entry point.

Step 5 — Draft the body (R7-R14)

Budget first: target 100-350 lines, hard ceiling 500; total knowledge >1500 lines forces a references/ split (one level deep only; >10k-word files need a grep-anchor index). Critical constraints go in the first screen, never after line 1000.

While drafting, enforce the type playbook plus these cross-type rules:

  • Anti-default inventory (R10): list the model's known bad defaults in this domain (aesthetic, language, behavior, stale knowledge) and write each as a checkable countermeasure — a ban table, numeric threshold, or "X not Y" override. This is the highest-value content a skill can carry (U7).
  • Freedom per subtask (R8): fragile/sequential subtasks get scripts or verbatim templates (low freedom); judgment subtasks get principles + anti-pattern fences + false-positive lists (high freedom). Mix freely within one skill.
  • Output contract (R12): deliverable-producing skills define the artifact's format (field table, fixed sections, badge, machine-parseable line) on the first screen.
  • Run-time verification (R16): build "assume there are problems" checks into the skill itself — a fix-and-verify loop, an until-clean command, a quality gate on the artifact. Generation is not completion.
  • Tables over prose (R13) for decisions, routing, and error handling; prose only to explain why a rule exists (explained rules generalize; bare MUSTs don't).
  • Negative boundary (R14): every skill gets "When NOT to use" with an alternative. Style/persistent skills also get an exit phrase and a safety exception.
  • Freshness protocol (R11): any version/price/limit/model-name gets one of: checked <date> stamp + refresh path + "docs win" conflict rule; deliberate omission + official link; or a regenerate command.
  • Discipline three-piece (R9): technique/discipline skills require a Rationalization table (excuse → reality), Red Flags list, and a letter-equals-spirit clause — sourced from observed excuses (Step 6 RED run), not invented ones.
  • Danger guards (R21): skills that scan, write, publish, or attack get triple authorization statements, disable-model-invocation: true where user-initiated-only, and narrow allowed-tools. Prefer harness enforcement (hooks, allowed-tools) over polite instructions (R22): what can be mechanically blocked must not rely on model goodwill.

Step 6 — Test (R15, R16, R17)

Route by type; protocols and minimal local variants in references/testing-protocols.md:

TypePrimaryMinimum acceptable
techniquePressure test (RED baseline, record verbatim excuses → countermeasures)1 no-skill pressure scenario + verify the skill blocks the observed excuse
workflowEval loop (with-skill vs baseline subagents)2 real prompts, both arms, human compare
referenceDeletion test + reference lint + trigger evalLint + 3 trigger probes
tool-wrapperGolden tests (sample input → expected output shape)1 golden test executed
domainBehavior assertions (process/format, not content) + human reviewevals.json with 2 cases
metaEval loop + self-compliance lint (it obeys its own rules)Lint passes on itself

Store test prompts in evals/evals.json (schema: skill_name, evals[].id/prompt/expected_output/assertions). Evals are development assets: they must exist and be kept; whether they ship in a package is a secondary choice. When improving an existing skill, snapshot it first — the old version is your baseline.

Step 7 — Lint (R18-R20)

Run the validator; it must pass with zero errors before delivery:

bash
python3 ~/.agents/skills/skill-forge/scripts/validate_skill.py <skill-dir> [--json] [--strict]

It checks: frontmatter field whitelist and name/description spec rules, reference existence (every `references/`, `scripts/`, `assets/` path mentioned must exist — dangling references are the highest-severity documented defect), line budgets, imperative-inflation density (MANDATORY/⚠️/CAPS), keyword-stuffing signals, secret patterns in examples, and promised-vs-actual file counts. The validator is also the Mode B pre-pass.

Step 8 — Install and register

  1. Install to ~/.agents/skills/<name>/ for a user-level skill, or <repo>/.agents/skills/<name>/ for a project skill. (~/.agents/skills/ is the cross-tool convention read natively by Codex and other agents.)
  2. Claude Code native discovery: symlink ln -s ~/.agents/skills/<name> ~/.claude/skills/<name> (verify it resolves).
  3. If your setup maintains a skill registry/index/catalog, rebuild it and add a route row (trigger, input, output, path, priority) so the new skill is discoverable and de-duplicated against neighbors.
  4. Restart the agent session so skill discovery refreshes.

Mode B: Judge

Read references/rubric.md and score the target against the 14 signals in causal order — S8 deletion test first (should it exist?), then S1 triggering (can it activate?), then S6/S7 structure and noise (can it execute?), then S10/S11 verification and contract (can its output be trusted?). A failure upstream makes downstream scores moot. Run the validator (Step 7) as the mechanical pre-pass; never hand-check what the script checks.

Default judging tier is validator + manual reasoning against the anchors. Signals whose detection column says subagent (S8 baseline run, S1 trigger probes) may be scored manually at this tier — state in the report that the probe was not executed; escalate to real probe runs only when the verdict hinges on them or the user asks for a deep audit.

Deliver: per-signal score with quoted evidence, the 2-3 highest-leverage fixes mapped to playbook patterns, and an honest verdict line: keep / improve / rebuild thin / retire. Do not pad findings to seem thorough — 3 real issues beat 10 cosmetic ones (the-fool's rule: cap challenges, no nitpick-stacking).

Mode C: Improve

  1. Judge first (Mode B) — fixes must trace to scored failures, not taste.
  2. Snapshot the current version (cp -r to a workspace) as the eval baseline.
  3. Apply fixes per the relevant playbook section; re-run lint and the type's test mechanism against the snapshot.
  4. Improvement philosophy (evidence-backed): generalize from feedback instead of overfitting to the test examples; cut what isn't pulling its weight (read transcripts — wasted model actions point at cuttable text); explain why instead of stacking MUSTs; if several test runs each hand-wrote a similar helper, that helper becomes scripts/.

Rationalizations to reject (this skill's own discipline)

ExcuseReality
"The request is clear, skip the supply gate"Clear requests still produce generic slop without observed expertise; the gate takes one question
"It's a small skill, no tests needed"The minimum tier exists precisely for small skills; zero tests = "generation is the finish line", the documented top failure mode
"Lint later, deliver first"Dangling references are written exactly at delivery time; lint costs seconds
"User is in a hurry, skip the deletion test"A skill that shouldn't exist taxes every future session's context — slower forever to save a minute now
"I'll write the description from the body summary"Verified failure mode: models execute the description's summary and skip the body

Red flags you are about to violate this skill: writing the body before the description; a description containing step words ("first analyzes, then..."); delivering with "the skill is ready" without a lint run in the same turn; inventing an excuse table instead of running a RED observation.

When NOT to use

  • Routing/catalog questions about installed skills → agent-skill-hub.
  • Lark API skill content → lark-skill-maker for the domain template; come back here for description, tests, and lint.
  • One-off prompts, CLAUDE.md/AGENTS.md rules, subagent definitions → not skills; write them directly.
  • In Claude Code with the user explicitly asking for Anthropic's interactive eval-viewer flow → anthropic-skills:skill-creator plugin remains available; its run_loop.py (description optimization) and eval-viewer can be reused from here for Step 6/D since they only require claude -p.

Resource map

FileLoad when
references/rubric.mdMode B/C; final QA of Mode A
references/type-playbooks.mdMode A Step 3-5; Mode C fixes
references/description-engineering.mdStep 4; Mode D
references/testing-protocols.mdStep 6; Mode C regression
references/anti-patterns.mdDrafting QA; judging evidence lookup
references/exemplars.mdMode A Step 5 — imitate the verified best exemplar of the chosen type (52 spike exemplars, citation-verified)
scripts/validate_skill.pyStep 7; Mode B pre-pass (execute, don't read)
scripts/verify_citations.pyMaintaining evidence-backed assets — quotes must exist verbatim (execute)
scripts/build_exemplars.pyRegenerate exemplars.md from verified JSON (execute)

Install & Usage

1
Create the skills directory
mkdir -p .claude/skills
2
Download the skill file

Add the configuration to .claude/skills/skill-forge.md

3
Invoke in Claude Code
/skill-forge
View source on GitHub
testingapiagent

Security Audits

LicenseUnknownSourceWarnRepositoryPass

Frequently Asked Questions

What is skill-forge?

Create, evaluate, and improve agent skills (SKILL.md format) using an evidence-based methodology distilled from 244 top skills across the Anthropic, OpenAI, vendor, and community ecosystems. Use when the user wants to turn a repeated task, workflow, or expertise into a skill, asks to create or scaffold a SKILL.md, wants an existing skill judged, scored, or audited, or asks to improve a skill's structure, description, or triggering (中文触发:做成技能、创建 skill、沉淀为 skill、写一个技能、评估/评判这个 skill、优化 skill 描述、skill 体检). For routing between already-installed skills use agent-skill-hub instead; for Lark API skills, lark-skill-maker supplies the domain template while this skill governs quality and testing. NOT for writing ordinary prompts, CLAUDE.md/AGENTS.md rule files, or subagent definitions.

How to install skill-forge?

To install skill-forge: create the skills directory (mkdir -p .claude/skills), then add the config to .claude/skills/skill-forge.md. Finally, /skill-forge in Claude Code.

What is skill-forge best for?

skill-forge is a community categorized under General. It is designed for: testing, api, agent. Created by truthandstories07-sudo.