Industry 2026-06-27

Ask HN: How do we measure software in LLM era?

A bit of a rant. Sorry! With the probabilistic pluggable 'brain' existing in parts of the solution how are you measuring anything is better or worse? I am at a loss to quantify whether anything is improving or worsening anything. It probably is also because of the various metrics that keeps...

The Measurement Crisis in Probabilistic Software

A recent Hacker News thread captures a growing frustration among developers: how do we measure software quality when LLMs introduce probabilistic, non-deterministic components into our stacks? The poster’s lament—that traditional metrics feel inadequate when a “probabilistic pluggable brain” sits at the heart of a solution—reflects a genuine industry blind spot.

What Happened

The discussion was started by a developer struggling to quantify whether LLM-integrated software is improving or degrading. Traditional software measures such as unit test pass/fail results, test coverage, and cyclomatic complexity assume that the same input always produces the same output. Operational metrics such as latency percentiles and error rates still apply, but they say little about whether an answer is any good. An LLM call that returns a slightly different answer each time, or hallucinates under certain prompts, breaks the determinism assumption. The poster also attributes part of the difficulty to the variety of metrics in use.

Why It Matters

This is not a niche concern. LLMs are now embedded in customer support chatbots, code generation tools, document summarizers, and even safety-critical systems like medical triage. When a model’s output varies by prompt phrasing, temperature setting, or model update, how do teams know if a new version is actually better? The risk is twofold:

False confidence: Teams may ship “improvements” that only appear better due to random variation in a small test set.
Stagnation: Teams may avoid changes because they cannot reliably measure impact, leading to technical debt and missed optimizations.

Traditional A/B testing with statistical significance can help, but it is slow and expensive for LLM-heavy applications. Moreover, user satisfaction often depends on subjective factors—helpfulness, tone, factual accuracy—that resist simple pass/fail classification.
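The false-confidence risk can be made concrete with a two-proportion z-test on pass counts. The following is a minimal sketch, not a prescribed method: the function name and the example counts are hypothetical, and the normal approximation it relies on is itself rough for very small samples.

```python
import math


def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in pass rates, B minus A."""
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis that A and B are equally good.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: the same 70% vs 80% pass rates at two sample sizes.
z_small, p_small = two_proportion_z_test(35, 50, 40, 50)
z_large, p_large = two_proportion_z_test(700, 1000, 800, 1000)
print(f"n=50 per arm:   z={z_small:.2f}, p={p_small:.3f}")
print(f"n=1000 per arm: z={z_large:.2f}, p={p_large:.2g}")
```

With 50 cases per version, a ten-point gap in pass rate is well within what chance alone can produce; with 1,000 cases per version the same gap is very unlikely to be noise.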

Implications for AI Practitioners

First, adopt multi-metric evaluation frameworks. No single number captures LLM quality. Combine automated metrics (BLEU, ROUGE, BERTScore for text; exact match or unit-test pass rates for code) with human evaluation (preference ratings, task completion rates) and system-level metrics (latency, cost per call, error rates). Weight them according to your use case, and report the individual metrics alongside any combined score so that a gain in one does not hide a loss in another.
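One way to combine such metrics is a weighted average over values normalised to the range 0 to 1. This is only an illustrative sketch: the metric names, weights, and latency bounds below are assumptions chosen for the example, not recommendations.

```python
def normalise_lower_is_better(value: float, best: float, worst: float) -> float:
    """Map a lower-is-better metric (e.g. latency) to [0, 1], where 1 is best."""
    if best == worst:
        raise ValueError("best and worst must differ")
    score = (worst - value) / (worst - best)
    return max(0.0, min(1.0, score))  # clamp values outside the expected range


def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metrics already normalised to [0, 1], higher is better."""
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


# Hypothetical values for one model version.
metrics = {
    "rouge_l": 0.42,           # automated text-overlap metric
    "human_preference": 0.70,  # share of raters preferring this version
    "latency": normalise_lower_is_better(1.2, best=0.5, worst=3.0),  # seconds
}
weights = {"rouge_l": 0.2, "human_preference": 0.5, "latency": 0.3}
print(round(composite_score(metrics, weights), 3))
```

The weights encode a judgment about the use case, so they are worth recording next to the score whenever it is reported.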

Second, invest in regression test suites that capture edge cases and known failure modes. These should include adversarial prompts, ambiguous inputs, and high-stakes scenarios. Because an LLM may pass a case on one run and fail it on the next, run each case several times and track the pass rate over time; a sustained drop signals degradation.
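Tracking pass rates rather than single pass/fail results might look like the sketch below. It runs one test case repeatedly and reports the pass rate with a Wilson score interval. The `flaky_case` function is a hypothetical stand-in for a real test that calls a model and checks its output.

```python
import math
import random
from typing import Callable


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def pass_rate(run_case: Callable[[], bool], trials: int) -> tuple[float, tuple[float, float]]:
    """Run a non-deterministic test case `trials` times; return rate and interval."""
    passes = sum(1 for _ in range(trials) if run_case())
    return passes / trials, wilson_interval(passes, trials)


# Hypothetical stand-in for "call the model, then check the answer".
_rng = random.Random(0)


def flaky_case() -> bool:
    return _rng.random() < 0.9  # simulates a case that passes about 90% of the time


rate, (low, high) = pass_rate(flaky_case, trials=50)
print(f"pass rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Comparing intervals between model versions, rather than raw rates, makes it harder to mistake run-to-run noise for a regression.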

Third, embrace continuous monitoring in production. Offline evaluation is not sufficient on its own. Measure real-world outcomes: user retention, support ticket deflection, time-to-resolution for coding assistants. These lagging indicators can reflect real quality better than a synthetic benchmark, although they respond slowly and are influenced by factors other than the model.

Finally, be transparent about uncertainty. When reporting metrics, include confidence intervals and note the probabilistic nature of results. This prevents stakeholders from overinterpreting small improvements.
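As one way to attach a confidence interval to a reported improvement, the sketch below uses a percentile bootstrap on the difference in mean evaluation scores between two versions. The function name, the resample count, and the scores are all illustrative assumptions, and other interval methods would be equally valid.

```python
import random
import statistics


def bootstrap_diff_ci(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for mean(scores_b) - mean(scores_a)."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each version's scores with replacement, then compare means.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(statistics.fmean(resample_b) - statistics.fmean(resample_a))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical per-example scores (1 = acceptable answer, 0 = not) for two versions.
version_a = [1.0] * 30 + [0.0] * 20
version_b = [1.0] * 34 + [0.0] * 16
low, high = bootstrap_diff_ci(version_a, version_b)
print(f"difference in mean score: 95% interval [{low:+.2f}, {high:+.2f}]")
```

If the interval includes zero, the honest summary is that the data do not yet show an improvement, even when the point estimate is positive.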

Key Takeaways

Traditional deterministic metrics (test coverage, error rates) are insufficient for LLM-integrated software; new evaluation paradigms are needed.
Practitioners should combine automated, human, and system-level metrics to capture the multi-dimensional nature of LLM quality.
Production monitoring of user outcomes (retention, task completion) often provides more reliable signals than offline benchmarks.
Statistical rigor—confidence intervals, A/B testing—is essential to distinguish genuine improvements from random variation in probabilistic outputs.
