skill-forge

Q: How to install skill-forge?

Create the skills directory: mkdir -p .claude/skills. Then add the config to .claude/skills/skill-forge.md. Finally, invoke /skill-forge in Claude Code.

Q: What is skill-forge best for?

skill-forge is categorized under General. It covers: testing, api, agent.

1 · GitHub Trending · General · by truthandstories07-sudo

Create, evaluate, and improve agent skills (SKILL.md format) using an evidence-based methodology distilled from 244 top skills across the Anthropic, OpenAI, vendor, and community ecosystems. Use when the user wants to turn a repeated task, workflow, or expertise into a skill, asks to create or scaffold a SKILL.md, wants an existing skill judged, scored, or audited, or asks to improve a skill's structure, description, or triggering (Chinese triggers: 做成技能、创建 skill、沉淀为 skill、写一个技能、评估/评判这个 skill、优化 skill 描述、skill 体检). For routing between already-installed skills use agent-skill-hub instead; for Lark API skills, lark-skill-maker supplies the domain template while this skill governs quality and testing. NOT for writing ordinary prompts, CLAUDE.md/AGENTS.md rule files, or subagent definitions.

First seen 6/12/2026

Community Plugin

Overview

Skill Forge

Create, judge, and improve skills. Every rule in this skill traces to evidence from a corpus study of 1400+ public skills, deep-analyzed in waves (anthropics/skills, obra/superpowers, openai/skills, Vercel/Cloudflare/Supabase/Expo/HF/Microsoft/Trail of Bits, Matt Pocock, Addy Osmani, baoyu, marketing/PM verticals, K-Dense). Rule IDs like U7/S8/R15 and the spike exemplars in references/exemplars.md trace to that study (methodology summarized in the project README). The skill works standalone; the evidence is provenance, not a runtime dependency.

A skill is not a document. It is three layers of machinery: a corrective patch for the model's bad defaults (not a restatement of domain knowledge), an optimal allocation of knowledge across context budget (progressive disclosure), and a falsifiable quality loop (tests that can fail). Drop any layer and you produce one of the twenty documented anti-patterns.

Mode routing

User intent sounds like	Mode	Read next
"turn this into a skill", "create a skill for X", 做成技能	A: Create	below + `references/type-playbooks.md`
"is this skill good?", "audit/score this skill", 评估这个skill	B: Judge	`references/rubric.md`
"improve/fix/refresh this skill", 优化这个skill	C: Improve	`references/rubric.md` then below
"the skill doesn't trigger / triggers too much"	D: Tune description	`references/description-engineering.md`

Output language for user-facing reports: Chinese. Skill bodies you produce: English by default (instructions read by models), bilingual trigger terms in the description when the user works in Chinese; follow the user's explicit preference if stated.

Mode A: Create

Eight steps. Steps 1-2 are gates — they can kill the project, which is a success, not a failure. Do not skip gates to be agreeable; a skill that shouldn't exist costs context on every future session.

Step 1 — Supply gate (R1)

A skill must be built from real, observed expertise. Acceptable sources:

•(a) A task actually completed in this or a recorded conversation — extract the steps that worked, the corrections the user made, the input/output formats observed.
•(b) A named existing asset: runbook, incident report, review comments, a personal knowledge base you can search (Obsidian vault, notes directory, wiki), or an upstream skill to adapt.
•(c) Documented external truth you can fetch and cite (official docs, changelogs, spec).

If none exists, say so and offer two honest paths: do the task once for real and harvest it, or produce a thin draft with explicit OPEN QUESTIONS the user must answer. Never fabricate procedures from general knowledge — the corpus shows this produces "vague, generic procedures" (the #1 supply-side failure; U9).

Step 2 — Deletion test (R2)

Ask: if this skill did not exist, would the model acting on its priors get it wrong?

•Wrong → full skill justified.
•Mostly right → build the thin version only: version pitfalls + safety lines + ecosystem routing, a few dozen lines. (Evidence: scikit-learn-style API restatements fail this test; pydeseq2-style trap collections pass it.)
•Right → recommend no skill; a one-line note in an existing skill or rules file may suffice.

Also apply per-section while drafting: "delete this paragraph — does model behavior change?" If not, cut it.

Step 3 — Type and neighbors (R3, R4)

Classify into one of six types; the type selects the structure template, freedom defaults, and test mechanism. Read references/type-playbooks.md for the chosen type before drafting.

Type	One-line test
technique	Changes how the model works on any task (discipline, style)
workflow	Multi-step process with stages, gates, handoffs
reference	Knowledge the model must consult (rules, APIs, conventions)
tool-wrapper	Teaches a CLI/API/script the model wouldn't know
domain	Domain judgment, taste, conventions, compliance
meta	Operates on skills/agents/rules themselves

Then scan for neighbors: list your installed skill directories (~/.agents/skills/, ~/.claude/skills/, and any project .agents/skills/) plus a skill registry/index if you keep one. If any overlap exists, the new skill MUST contain a disambiguation line ("X is handled here; for Y use Z") and the neighbor should gain a reciprocal pointer. Overlapping skills without mutual routing silently steal each other's triggers — observed even inside official repos.
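The neighbor scan can be sketched as a short script. This is a minimal sketch, assuming each installed skill is a directory that contains a SKILL.md file; the function name `find_installed_skills` is hypothetical and not part of skill-forge.

```python
from pathlib import Path


def find_installed_skills(roots):
    """Return {skill_name: directory} for every <root>/<name>/SKILL.md found.

    Assumption: one skill per directory, identified by a SKILL.md file.
    """
    found = {}
    for root in roots:
        root = Path(root).expanduser()
        if not root.is_dir():
            continue  # a missing skills directory is normal, not an error
        for skill_md in sorted(root.glob("*/SKILL.md")):
            # The first root listed wins when the same name appears twice.
            found.setdefault(skill_md.parent.name, skill_md.parent)
    return found
```

Calling it with `["~/.agents/skills", "~/.claude/skills", ".agents/skills"]` yields the candidate list to read for overlap; deciding whether two skills actually overlap remains a judgment call on their descriptions.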

Step 4 — Description engineering (R5)

The description is the skill's only activation surface — write it before the body, against references/description-engineering.md. Core rules: three parts (what-scope + "Use when" triggers + NOT-boundary/routing); trigger terms ranked by specificity (error strings, proper nouns, file extensions, verbatim user phrases > generic verbs, max 3 generic verbs); bilingual triggers for Chinese-language usage; never summarize the workflow's steps in the description (verified failure: the model follows the summary and skips the body); no "Always use ... even if" unless it is the domain's only entry point.
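Some of these description rules lend themselves to a mechanical first pass. The sketch below is a rough heuristic, not the check used by the validator: the function name `check_description`, the step-word list, and the literal "Use when"/"NOT" matching are all assumptions made for illustration.

```python
import re

# Hypothetical step-word list; a real check would need tuning against false positives.
STEP_WORDS = re.compile(r"\b(first|then|next|finally)\b", re.IGNORECASE)
ALWAYS_EVEN_IF = re.compile(r"\balways use\b.*\beven if\b", re.IGNORECASE | re.DOTALL)


def check_description(description):
    """Return a list of heuristic problems found in a skill description."""
    problems = []
    if "use when" not in description.lower():
        problems.append('missing "Use when" triggers')
    if not re.search(r"\bNOT\b", description):
        problems.append("missing NOT-boundary or routing line")
    if STEP_WORDS.search(description):
        problems.append("contains step words; may summarize the workflow")
    if ALWAYS_EVEN_IF.search(description):
        problems.append('uses "Always use ... even if" phrasing')
    return problems
```

An empty list only means none of these surface patterns fired; trigger specificity and boundary quality still need the manual review described in references/description-engineering.md.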

Step 5 — Draft the body (R7-R14)

Budget first: target 100-350 lines, hard ceiling 500; total knowledge >1500 lines forces a references/ split (one level deep only; >10k-word files need a grep-anchor index). Critical constraints go in the first screen, never after line 1000.
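The line budget can be expressed as a small check. This is a sketch under the thresholds stated above (target 100-350, ceiling 500, references/ split above 1500 lines of total knowledge); the function name and message wording are hypothetical.

```python
def classify_line_budget(body_lines, total_knowledge_lines=None):
    """Compare a skill body's line count against the stated budget.

    Returns a list of findings; an empty list means the body is on target.
    """
    findings = []
    if body_lines > 500:
        findings.append("error: body exceeds the 500-line hard ceiling")
    elif body_lines > 350:
        findings.append("warning: body is above the 100-350 line target")
    elif body_lines < 100:
        # Thin skills (Step 2) are legitimately shorter, so this is only a note.
        findings.append("note: body is below the 100-350 line target")
    if total_knowledge_lines is not None and total_knowledge_lines > 1500:
        findings.append("action: split knowledge into references/ (one level deep)")
    return findings
```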

While drafting, enforce the type playbook plus these cross-type rules:

•Anti-default inventory (R10): list the model's known bad defaults in this domain (aesthetic, language, behavior, stale knowledge) and write each as a checkable countermeasure — a ban table, numeric threshold, or "X not Y" override. This is the highest-value content a skill can carry (U7).
•Freedom per subtask (R8): fragile/sequential subtasks get scripts or verbatim templates (low freedom); judgment subtasks get principles + anti-pattern fences + false-positive lists (high freedom). Mix freely within one skill.
•Output contract (R12): deliverable-producing skills define the artifact's format (field table, fixed sections, badge, machine-parseable line) on the first screen.
•Run-time verification (R16): build "assume there are problems" checks into the skill itself — a fix-and-verify loop, an until-clean command, a quality gate on the artifact. Generation is not completion.
•Tables over prose (R13) for decisions, routing, and error handling; prose only to explain why a rule exists (explained rules generalize; bare MUSTs don't).
•Negative boundary (R14): every skill gets "When NOT to use" with an alternative. Style/persistent skills also get an exit phrase and a safety exception.
•Freshness protocol (R11): any version/price/limit/model-name gets one of: checked <date> stamp + refresh path + "docs win" conflict rule; deliberate omission + official link; or a regenerate command.
•Discipline three-piece (R9): technique/discipline skills require a Rationalization table (excuse → reality), Red Flags list, and a letter-equals-spirit clause — sourced from observed excuses (Step 6 RED run), not invented ones.
•Danger guards (R21): skills that scan, write, publish, or attack get triple authorization statements, disable-model-invocation: true where user-initiated-only, and narrow allowed-tools. Prefer harness enforcement (hooks, allowed-tools) over polite instructions (R22): what can be mechanically blocked must not rely on model goodwill.

Step 6 — Test (R15, R16, R17)

Route by type; protocols and minimal local variants in references/testing-protocols.md:

Type	Primary	Minimum acceptable
technique	Pressure test (RED baseline, record verbatim excuses → countermeasures)	1 no-skill pressure scenario + verify the skill blocks the observed excuse
workflow	Eval loop (with-skill vs baseline subagents)	2 real prompts, both arms, human compare
reference	Deletion test + reference lint + trigger eval	Lint + 3 trigger probes
tool-wrapper	Golden tests (sample input → expected output shape)	1 golden test executed
domain	Behavior assertions (process/format, not content) + human review	evals.json with 2 cases
meta	Eval loop + self-compliance lint (it obeys its own rules)	Lint passes on itself

Store test prompts in evals/evals.json (schema: skill_name, evals[].id/prompt/expected_output/assertions). Evals are development assets: they must exist and be kept; whether they ship in a package is a secondary choice. When improving an existing skill, snapshot it first — the old version is your baseline.
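As an illustration of that schema, the sketch below builds one evals document and checks that the named fields are present. Only the field names come from the text; the value types (string IDs, a list of assertion strings), the placeholder values, and the helper `validate_evals` are assumptions.

```python
import json

REQUIRED_EVAL_KEYS = {"id", "prompt", "expected_output", "assertions"}


def validate_evals(doc):
    """Return a list of schema problems for an evals.json document (a dict)."""
    errors = []
    if not isinstance(doc.get("skill_name"), str) or not doc["skill_name"]:
        errors.append("skill_name must be a non-empty string")
    evals = doc.get("evals")
    if not isinstance(evals, list) or not evals:
        errors.append("evals must be a non-empty list")
        return errors
    for i, case in enumerate(evals):
        if not isinstance(case, dict):
            errors.append(f"evals[{i}] must be an object")
            continue
        missing = REQUIRED_EVAL_KEYS - set(case)
        if missing:
            errors.append(f"evals[{i}] missing: {', '.join(sorted(missing))}")
    return errors


# Hypothetical example document; every value here is a placeholder.
example = {
    "skill_name": "example-skill",
    "evals": [
        {
            "id": "case-1",
            "prompt": "A real prompt observed from a user",
            "expected_output": "Description of the expected artifact",
            "assertions": ["Output contains the required sections"],
        }
    ],
}

evals_json = json.dumps(example, indent=2)  # text to store in evals/evals.json
```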

Step 7 — Lint (R18-R20)

Run the validator; it must pass with zero errors before delivery:

python3 ~/.agents/skills/skill-forge/scripts/validate_skill.py <skill-dir> [--json] [--strict]

It checks: frontmatter field whitelist and name/description spec rules, reference existence (every `references/`, `scripts/`, `assets/` path mentioned must exist — dangling references are the highest-severity documented defect), line budgets, imperative-inflation density (MANDATORY/⚠️/CAPS), keyword-stuffing signals, secret patterns in examples, and promised-vs-actual file counts. The validator is also the Mode B pre-pass.
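To make the reference-existence check concrete, here is a minimal sketch of how dangling paths could be detected. It is not the code of validate_skill.py, whose implementation is not shown here; the regular expression and the function name `find_dangling_references` are assumptions.

```python
import re
from pathlib import Path

# Matches relative paths such as references/rubric.md or scripts/validate_skill.py.
PATH_PATTERN = re.compile(r"\b(?:references|scripts|assets)/\w[\w./-]*")


def find_dangling_references(skill_dir):
    """Return paths mentioned in SKILL.md that do not exist under skill_dir."""
    skill_dir = Path(skill_dir)
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    # Strip a trailing sentence period picked up by the pattern.
    mentioned = {match.rstrip(".") for match in PATH_PATTERN.findall(text)}
    return sorted(p for p in mentioned if not (skill_dir / p).exists())
```

A non-empty result would correspond to the highest-severity defect named above and should block delivery.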

Step 8 — Install and register

Install to ~/.agents/skills/<name>/ for a user-level skill, or <repo>/.agents/skills/<name>/ for a project skill. (~/.agents/skills/ is the cross-tool convention read natively by Codex and other agents.)
For Claude Code native discovery, create a symlink: ln -s ~/.agents/skills/<name> ~/.claude/skills/<name> (verify that it resolves).
If your setup maintains a skill registry/index/catalog, rebuild it and add a route row (trigger, input, output, path, priority) so the new skill is discoverable and de-duplicated against neighbors.
Restart the agent session so skill discovery refreshes.
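The symlink step, including the "verify it resolves" check, might look like this in script form. It is a sketch assuming the two default roots named above and a POSIX-style filesystem; `link_skill` is a hypothetical helper, not something shipped with the skill.

```python
from pathlib import Path


def link_skill(name, agents_root="~/.agents/skills", claude_root="~/.claude/skills"):
    """Symlink an installed skill into the Claude Code skills directory."""
    source = Path(agents_root).expanduser() / name
    if not (source / "SKILL.md").is_file():
        raise FileNotFoundError(f"no SKILL.md in {source}")
    link = Path(claude_root).expanduser() / name
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # replace a stale link
    elif link.exists():
        # Refuse to overwrite a real directory or file.
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(source, target_is_directory=True)
    # Verify the link resolves to the installed skill.
    if link.resolve() != source.resolve():
        raise RuntimeError(f"{link} does not resolve to {source}")
    return link
```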

Mode B: Judge

Read references/rubric.md and score the target against the 14 signals in causal order — S8 deletion test first (should it exist?), then S1 triggering (can it activate?), then S6/S7 structure and noise (can it execute?), then S10/S11 verification and contract (can its output be trusted?). A failure upstream makes downstream scores moot. Run the validator (Step 7) as the mechanical pre-pass; never hand-check what the script checks.
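The causal ordering can be illustrated with a short gate function. This is a simplified sketch: it covers only the six signals named above, collapses each score to pass/fail, and uses hypothetical names (`STAGES`, `judge_in_causal_order`); the real rubric in references/rubric.md has 14 signals and its own scoring anchors.

```python
# Stages in causal order; each stage lists the signal IDs named in the text.
STAGES = [
    ("should it exist?", ["S8"]),
    ("can it activate?", ["S1"]),
    ("can it execute?", ["S6", "S7"]),
    ("can its output be trusted?", ["S10", "S11"]),
]


def judge_in_causal_order(passed):
    """Map {signal_id: bool} to a list of (stage, status) pairs.

    Status is "pass", "fail", or "moot" once an upstream stage has failed.
    """
    report = []
    blocked = False
    for question, signals in STAGES:
        if blocked:
            report.append((question, "moot"))
            continue
        ok = all(passed.get(signal, False) for signal in signals)
        report.append((question, "pass" if ok else "fail"))
        blocked = not ok
    return report
```

Marking downstream stages as moot keeps the report honest: a skill that cannot trigger is not credited for having a good output contract.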

Default judging tier is validator + manual reasoning against the anchors. Signals whose detection column says subagent (S8 baseline run, S1 trigger probes) may be scored manually at this tier — state in the report that the probe was not executed; escalate to real probe runs only when the verdict hinges on them or the user asks for a deep audit.

Deliver: per-signal score with quoted evidence, the 2-3 highest-leverage fixes mapped to playbook patterns, and an honest verdict line: keep / improve / rebuild thin / retire. Do not pad findings to seem thorough — 3 real issues beat 10 cosmetic ones (the-fool's rule: cap challenges, no nitpick-stacking).

Mode C: Improve

Judge first (Mode B) — fixes must trace to scored failures, not taste.
Snapshot the current version (cp -r to a workspace) as the eval baseline.
Apply fixes per the relevant playbook section; re-run lint and the type's test mechanism against the snapshot.
Improvement philosophy (evidence-backed): generalize from feedback instead of overfitting to the test examples; cut what isn't pulling its weight (read transcripts — wasted model actions point at cuttable text); explain why instead of stacking MUSTs; if several test runs each hand-wrote a similar helper, that helper becomes scripts/.
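The snapshot step can be scripted so the baseline is never skipped. A minimal sketch, assuming a timestamped copy in a workspace directory is an acceptable equivalent of cp -r; the function name and the naming scheme are hypothetical.

```python
import shutil
import time
from pathlib import Path


def snapshot_skill(skill_dir, workspace):
    """Copy skill_dir into workspace as <name>-<timestamp> and return the copy's path."""
    skill_dir = Path(skill_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(workspace) / f"{skill_dir.name}-{stamp}"
    # copytree creates missing parent directories and fails if target already exists,
    # so an earlier snapshot is never silently overwritten.
    shutil.copytree(skill_dir, target)
    return target
```

The returned path is the baseline arm for the eval loop: run the type's test mechanism against both the snapshot and the edited skill.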

Rationalizations to reject (this skill's own discipline)

Excuse	Reality
"The request is clear, skip the supply gate"	Clear requests still produce generic slop without observed expertise; the gate takes one question
"It's a small skill, no tests needed"	The minimum tier exists precisely for small skills; zero tests = "generation is the finish line", the documented top failure mode
"Lint later, deliver first"	Dangling references are written exactly at delivery time; lint costs seconds
"User is in a hurry, skip the deletion test"	A skill that shouldn't exist taxes every future session's context — slower forever to save a minute now
"I'll write the description from the body summary"	Verified failure mode: models execute the description's summary and skip the body

Red flags you are about to violate this skill: writing the body before the description; a description containing step words ("first analyzes, then..."); delivering with "the skill is ready" without a lint run in the same turn; inventing an excuse table instead of running a RED observation.

When NOT to use

•Routing/catalog questions about installed skills → agent-skill-hub.
•Lark API skill content → lark-skill-maker for the domain template; come back here for description, tests, and lint.
•One-off prompts, CLAUDE.md/AGENTS.md rules, subagent definitions → not skills; write them directly.
•In Claude Code with the user explicitly asking for Anthropic's interactive eval-viewer flow → anthropic-skills:skill-creator plugin remains available; its run_loop.py (description optimization) and eval-viewer can be reused from here for Step 6/D since they only require claude -p.

Resource map

File	Load when
`references/rubric.md`	Mode B/C; final QA of Mode A
`references/type-playbooks.md`	Mode A Step 3-5; Mode C fixes
`references/description-engineering.md`	Step 4; Mode D
`references/testing-protocols.md`	Step 6; Mode C regression
`references/anti-patterns.md`	Drafting QA; judging evidence lookup
`references/exemplars.md`	Mode A Step 5 — imitate the verified best exemplar of the chosen type (52 spike exemplars, citation-verified)
`scripts/validate_skill.py`	Step 7; Mode B pre-pass (execute, don't read)
`scripts/verify_citations.py`	Maintaining evidence-backed assets — quotes must exist verbatim (execute)
`scripts/build_exemplars.py`	Regenerate exemplars.md from verified JSON (execute)

Install & Usage

Create the skills directory

mkdir -p .claude/skills

Download the skill file

Add the configuration to .claude/skills/skill-forge.md

Invoke in Claude Code

/skill-forge

testing, api, agent

Security Audits

License: Unknown · Source: Warn · Repository: Pass

Frequently Asked Questions

What is skill-forge best for?

skill-forge is a community plugin categorized under General. It is designed for: testing, api, agent. Created by truthandstories07-sudo.