BeClaude
Industry · 2026-06-28

Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills

Originally published by Hacker News

Skills for Claude Code and Codex are hard to test. What I mean by hard is that there's no standard way to do it. You evaluate the skill once on something, it looks like it works. You publish it. Then the new super model releases (GLM 5.2 anyone?), it will quietly break for some part, and you...

The Reliability Gap in AI Skill Evaluation

The Hacker News post introducing Caliper highlights a growing pain point in the AI development ecosystem: the lack of standardized, reproducible testing for AI coding skills. The author describes a common frustration: a skill appears to work during initial evaluation, only to break silently after a model update. This is not a hypothetical scenario; it reflects the reality of working with fast-moving foundation models, whose behavior can shift unpredictably between versions.

What Happened

Caliper is a tool designed to provide pass@k reliability testing for skills built for Claude Code and Codex. The pass@k metric, borrowed from code generation research, measures the probability that at least one of k generated solutions passes a given test. By applying this to custom skills, Caliper aims to give developers a quantitative, repeatable way to assess whether a skill works across repeated runs and across model versions, rather than on a single trial. The tool targets the specific problem that manual or ad-hoc testing tends to miss regressions introduced by new model releases.
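
For concreteness, pass@k is commonly estimated from n sampled runs of which c passed, using the unbiased estimator from the code generation literature: 1 - C(n-c, k) / C(n, k). The Python sketch below shows that calculation. The function name and arguments are illustrative assumptions; the post does not say how Caliper itself computes the metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled runs, c of which passed.

    Hypothetical helper for illustration; not part of Caliper's API.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k runs failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up numbers: a skill that passed 3 of 10 runs.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
    print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

With 3 passes in 10 runs, pass@1 is 0.3 while pass@5 is roughly 0.92, so the choice of k strongly affects how reliable a flaky skill appears.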

Why It Matters

The core issue here is evaluation drift. As models like Claude and GPT are updated, their behavior on specific tasks can change in subtle ways. A skill that relied on a particular phrasing or reasoning pattern may degrade without any code change. This creates a trust deficit: developers cannot confidently publish skills because they have no assurance of future performance. Caliper’s approach matters because it introduces a systematic feedback loop into the skill development workflow. Without such tooling, the ecosystem risks fragmentation, where skills become brittle and users lose confidence in third-party contributions.
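
One way to picture such a feedback loop is to run the same fixed test cases against two model versions and flag any case whose pass rate drops noticeably. The sketch below assumes the pass/fail outcomes have already been collected; the function names, the data layout, and the 0.2 threshold are all hypothetical and are not taken from Caliper.

```python
def pass_rate(runs: list[bool]) -> float:
    """Fraction of runs that passed."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(runs) / len(runs)


def find_regressions(
    baseline: dict[str, list[bool]],
    candidate: dict[str, list[bool]],
    max_drop: float = 0.2,  # assumed tolerance, chosen only for illustration
) -> list[str]:
    """Return test cases whose pass rate fell by more than max_drop."""
    regressed = []
    for case, base_runs in baseline.items():
        if case not in candidate:
            continue  # case was not run on the candidate model; skip it
        drop = pass_rate(base_runs) - pass_rate(candidate[case])
        if drop > max_drop:
            regressed.append(case)
    return sorted(regressed)


if __name__ == "__main__":
    # Made-up outcomes for the same skill on an old and a new model.
    old_model = {"rename-symbol": [True] * 5, "write-tests": [True, True, False, True, True]}
    new_model = {"rename-symbol": [True, False, False, True, False], "write-tests": [True, True, True, False, True]}
    print(find_regressions(old_model, new_model))
```

Comparing per-case rates rather than a single aggregate keeps a regression in one case from being masked by stable results in the others.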

For the broader AI industry, this signals a maturation phase. Early adoption of AI coding assistants focused on raw capability—can the model write code? The next frontier is reliability engineering: can we guarantee that a skill works under defined conditions, across model versions, and with measurable confidence? Caliper is a small but significant step toward that goal.

Implications for AI Practitioners

  • Skill developers now have a tool to benchmark their work against model updates, reducing the risk of silent failures. This encourages more rigorous testing before publication.
  • End users benefit from higher-quality skills that carry explicit reliability metrics, enabling informed choices about which skills to trust.
  • Platform providers (Anthropic, OpenAI, etc.) should take note: the demand for standardized evaluation suggests that the community is outgrowing ad-hoc testing. Investing in official evaluation frameworks could reduce fragmentation.
  • The pass@k metric is a sensible choice, but practitioners should be aware of its limitations: it measures correctness on fixed tests, not robustness to edge cases or novel inputs. Caliper is a starting point, not a complete solution.

Key Takeaways

  • Caliper introduces pass@k reliability testing for Claude Code and Codex skills, addressing the problem of silent failures after model updates.
  • The tool enables reproducible, quantitative evaluation, which is critical as foundation models continue to evolve rapidly.
  • For AI practitioners, adopting systematic testing frameworks like Caliper can slow trust erosion and improve skill quality across the ecosystem.
  • The emergence of such tools signals a shift from capability-focused development to reliability engineering in AI-assisted coding.