How to evaluate clustering with ground truth?
arXiv:2606.27061v1. Abstract: External indexes can be used for cluster evaluation when ground truth is available. We review the most common external validity indexes focusing on set-matching-based measures. We recommend centroid index (CI), because it is an intuitive cluster-level...
A Quiet Shift in How We Judge Clustering Quality
A new arXiv preprint (2606.27061v1) takes up an often-overlooked problem: how to evaluate a clustering when ground truth labels are available. The authors review external validity indexes, which are metrics that compare a clustering result against a known correct partition, with a focus on set-matching-based measures. They recommend the centroid index (CI) over more traditional measures such as the adjusted Rand index (ARI) and normalized mutual information (NMI).
The core argument is that CI operates at the cluster level rather than the pair level. ARI counts how often pairs of points are assigned to the same or different clusters in the two partitions. CI instead counts how many clusters are wrongly allocated: a value of 0 means the clustering has the same cluster-level structure as the ground truth, and a value of k means k clusters are missing or misplaced. This is a subtle but important distinction. A seemingly high ARI can mask the fact that an algorithm merged several small true clusters into one large blob, or split a single true cluster into many fragments. CI exposes these structural errors more transparently.
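To make the cluster-level idea concrete, here is a minimal sketch of the centroid-based form of CI as it is commonly described: map every centroid in one solution to its nearest centroid in the other, count the centroids that nothing maps to ("orphans"), and take the larger count over both directions. The function names and the toy data are illustrative assumptions, and the preprint may define details differently (for example, a set-matching variant that works on partitions rather than centroids).

```python
import numpy as np

def centroids_from_labels(X, labels):
    """Mean of the points in each cluster, one row per distinct label."""
    return np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])

def orphan_count(A, B):
    """Map each centroid in A to its nearest centroid in B and
    count the centroids in B that no centroid in A maps to."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return len(B) - len(np.unique(nearest))

def centroid_index(A, B):
    """Symmetric CI: the larger orphan count of the two mapping directions."""
    return max(orphan_count(A, B), orphan_count(B, A))

# Toy 1-D data (hypothetical): three true clusters near 1, 10.5 and 21.5.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0],
              [20.0], [21.0], [22.0], [23.0]])
truth = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

# A result that merges the first two true clusters and splits the third.
merged_split = np.array([0, 0, 0, 0, 0, 1, 2, 2, 2])
# A result with the same groups as the truth under different label names.
relabeled = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

true_centroids = centroids_from_labels(X, truth)
ci_merged_split = centroid_index(true_centroids, centroids_from_labels(X, merged_split))
ci_relabeled = centroid_index(true_centroids, centroids_from_labels(X, relabeled))

print("CI, merged and split result:", ci_merged_split)
print("CI, relabeled but correct result:", ci_relabeled)
```

In this toy case the merged-and-split result has one wrongly allocated cluster even though it produces the right number of clusters, while the relabeled result scores 0 because CI does not depend on label names.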
Why This Matters Beyond the Math
For AI practitioners, this is not merely an academic exercise. The choice of evaluation metric directly shapes which clustering algorithms are deemed “good” and which are discarded during model selection. If your metric rewards pairwise agreement, you may inadvertently favor algorithms that produce many small, homogeneous clusters over those that recover the true underlying structure. CI’s cluster-level focus aligns more closely with how humans interpret clustering results: we care about whether the algorithm found the right groups, not whether it got every pairwise relationship correct.
This is particularly relevant in domains where clusters have semantic meaning, such as customer segmentation, disease subtyping, or document topic discovery. In these settings, a merged cluster that conflates two distinct customer personas or disease subtypes is a failure, even if the pairwise agreement remains high.
Implications for AI Practitioners
First, practitioners should re-examine their evaluation pipelines. If you are using ARI or NMI as your primary external validation metric, consider adding CI as a complementary measure. It may reveal structural flaws that other metrics smooth over.
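For the pair-level side of such a pipeline, the sketch below computes ARI from its standard pair-counting definition using only the Python standard library, so the number can be reported next to a cluster-level score. The function name and the toy labels are illustrative assumptions, not taken from the preprint, and the sketch does not handle the degenerate case where both partitions are trivial (the denominator is then zero).

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI from pair counts: (agreeing pairs - expected) / (maximum - expected)."""
    n = len(truth)
    # Pairs of points that share a cluster in both partitions.
    same_in_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that share a cluster within each partition separately.
    same_in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_in_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = same_in_truth * same_in_pred / comb(n, 2)
    maximum = (same_in_truth + same_in_pred) / 2
    return (same_in_both - expected) / (maximum - expected)

# Hypothetical labels: three true clusters of sizes 3, 2 and 4.
truth = [0, 0, 0, 1, 1, 2, 2, 2, 2]
# A result that merges the first two true clusters and splits the third.
merged_split = [0, 0, 0, 0, 0, 1, 2, 2, 2]
# The same grouping as the truth under different label names.
relabeled = [2, 2, 2, 0, 0, 1, 1, 1, 1]

ari_merged_split = adjusted_rand_index(truth, merged_split)
ari_relabeled = adjusted_rand_index(truth, relabeled)

print(f"ARI, merged and split result: {ari_merged_split:.3f}")
print(f"ARI, relabeled but correct result: {ari_relabeled:.3f}")
```

A pair-level score like this is a single fraction that does not say which clusters were merged or split, which is why pairing it with a cluster-level count can be informative.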
Second, the paper underscores a broader lesson: no single metric is sufficient. CI has its own limitations. It can be overly strict when ground truth clusters are highly imbalanced or when noise points exist. The authors’ recommendation should be interpreted as “use CI as a primary tool, not the only tool.”
Finally, this work highlights the value of interpretable evaluation. In an era of increasingly complex models, metrics that produce intuitive, cluster-level scores help bridge the gap between algorithmic output and human understanding. CI tells you, in plain terms, how many clusters you got wrong, with 0 meaning the cluster-level structure matches the ground truth.
Key Takeaways
- The centroid index (CI) is recommended over ARI and NMI for external cluster validation because it measures cluster-level correctness rather than pairwise agreement.
- CI provides more intuitive and actionable feedback, especially when clusters have real-world semantic meaning.
- Practitioners should add CI to their evaluation toolkit but continue using multiple metrics to capture different failure modes.
- The choice of evaluation metric can significantly influence which clustering algorithms are selected, making metric selection a consequential design decision.