PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets
arXiv:2606.19597v1. Abstract: Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference...
A Shift from Scalar MOS to Pairwise Preference in Speech Quality Assessment
A new paper, PrefSQA, proposes rethinking how speech quality is evaluated. Instead of relying on the traditional Mean Opinion Score (MOS), a scalar average of human ratings, the researchers advocate for a pairwise preference prediction framework. This approach asks raters to compare two speech samples and choose the better one, rather than assigning an absolute numerical score.
The core problem PrefSQA addresses is the inherent noise in MOS labels. Human raters vary widely in their interpretation of scales like 1-to-5, and listening test conditions introduce further inconsistency. This "labeling noise" creates a ceiling effect for any model trained to predict MOS, as the target itself is unreliable. By shifting to relative comparisons, the researchers aim to capture a more consistent signal: humans are generally better at saying "A is clearer than B" than at assigning a precise quality number to A in isolation.
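To make the relative-comparison idea concrete, here is a minimal sketch of a pairwise preference objective. It assumes a Bradley-Terry style formulation, in which a model assigns each sample a latent quality score and the probability that A is preferred over B depends only on the score difference. This is a generic technique chosen for illustration; the summary above does not confirm that PrefSQA uses this exact loss, and the function names are hypothetical.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that sample A is preferred over B (assumed Bradley-Terry form).

    Computes sigmoid(score_a - score_b) in a numerically stable way.
    """
    diff = score_a - score_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_loss(score_a: float, score_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of the observed human preference.

    Equivalent to log(1 + exp(-margin)), where margin is the score of the
    preferred sample minus the score of the other sample.
    """
    margin = score_a - score_b if a_preferred else score_b - score_a
    # Stable softplus(-margin): avoids overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical latent scores for two clips; only their difference matters.
    print(preference_probability(3.8, 3.1))
    print(pairwise_loss(3.8, 3.1, a_preferred=True))
    print(pairwise_loss(3.8, 3.1, a_preferred=False))
```

Because the loss depends only on the difference between the two scores, a constant offset applied to both scores (for example, a rater who scores everything one point lower) has no effect on it, which is one reason relative labels may be less sensitive to scale interpretation.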
Why This Matters
This is not a minor methodological tweak. It strikes at the foundation of how speech quality is measured in both research and industry. For decades, MOS has been the gold standard for evaluating codecs, text-to-speech systems, and voice enhancement algorithms. If PrefSQA's approach proves robust, it could render many existing MOS datasets less useful, as they were collected under noisy scalar-label regimes.
The paper also emphasizes a critical, often overlooked point: the quality of the dataset matters more than the complexity of the model. No amount of architectural sophistication can compensate for a training set riddled with rater disagreement. PrefSQA's success hinges on collecting high-quality preference judgments, which are more expensive and time-consuming to gather than simple MOS ratings, but may yield far more reliable training signals.
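One way to act on rater disagreement is to measure it per pair and keep only the pairs where raters largely agree. The sketch below is an illustration under stated assumptions, not the paper's curation procedure: the `PairJudgment` type, the "A"/"B" vote encoding, and the 0.8 agreement threshold are all hypothetical choices.

```python
from collections import Counter
from typing import NamedTuple


class PairJudgment(NamedTuple):
    """Votes collected for one (A, B) comparison; hypothetical data layout."""
    pair_id: str
    votes: list[str]  # each vote is "A" or "B"


def majority_and_agreement(votes: list[str]) -> tuple[str, float]:
    """Return the majority label and the fraction of raters who chose it."""
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    return label, top_count / len(votes)


def filter_consistent_pairs(
    judgments: list[PairJudgment], min_agreement: float = 0.8
) -> list[tuple[str, str, float]]:
    """Keep pairs whose majority agreement meets the (assumed) threshold."""
    kept = []
    for judgment in judgments:
        if not judgment.votes:
            continue  # no ratings collected, nothing to learn from
        label, agreement = majority_and_agreement(judgment.votes)
        if agreement >= min_agreement:
            kept.append((judgment.pair_id, label, agreement))
    return kept


if __name__ == "__main__":
    data = [
        PairJudgment("pair-001", ["A", "A", "A", "A", "B"]),  # 80% agree on A
        PairJudgment("pair-002", ["A", "B", "B", "A"]),       # split vote
    ]
    for pair_id, label, agreement in filter_consistent_pairs(data):
        print(pair_id, label, agreement)
```

Filtering trades dataset size for label reliability; a softer alternative would be to keep every pair and use the agreement fraction as a training weight.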
Implications for AI Practitioners
For engineers building speech applications, this research has several practical consequences:
- Rethinking Evaluation Pipelines: If you currently use a MOS prediction model to benchmark your TTS or denoising system, consider testing pairwise preference as an alternative. It may better reflect real user perception, especially for subtle quality differences.
- Data Collection Strategy: The findings underscore that investing in cleaner, preference-based annotation is likely more valuable than scaling up noisy MOS-labeled data. Practitioners should budget for more rigorous rater training and inter-rater consistency checks.
- Model Architecture Choices: PrefSQA suggests that a simpler model trained on high-quality preference pairs can outperform a complex model trained on noisy MOS data. This is a reminder to prioritize data quality over model size.
- Benchmarking Caution: Be wary of comparing systems using MOS values from different studies, as rater populations and test conditions introduce systematic biases. Preference-based metrics may offer more transferable comparisons.
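As a starting point for the evaluation change described above, an existing scalar quality predictor can be scored against human pairwise preferences without retraining it. The sketch below computes the fraction of human-judged pairs that the predictor orders the same way. The sample IDs, scores, and the half-credit rule for ties are assumptions made for illustration, not details from the paper.

```python
def pairwise_accuracy(
    predicted: dict[str, float],
    preferences: list[tuple[str, str]],
) -> float:
    """Fraction of human preferences that the predicted scores agree with.

    predicted:   maps a sample ID to the model's predicted quality score.
    preferences: (winner_id, loser_id) pairs from human comparisons.
    Ties in predicted score receive half credit (an assumed convention).
    """
    if not preferences:
        raise ValueError("need at least one preference pair")
    credit = 0.0
    for winner, loser in preferences:
        if predicted[winner] > predicted[loser]:
            credit += 1.0
        elif predicted[winner] == predicted[loser]:
            credit += 0.5
    return credit / len(preferences)


if __name__ == "__main__":
    # Hypothetical predicted scores for three clips.
    scores = {"clip_a": 4.1, "clip_b": 3.2, "clip_c": 3.2}
    # Hypothetical human judgments: the first ID in each pair was preferred.
    human = [("clip_a", "clip_b"), ("clip_c", "clip_a"), ("clip_b", "clip_c")]
    print(pairwise_accuracy(scores, human))
```

Since the metric uses only the ordering within each pair, it is unaffected by a constant shift in a predictor's scores, which is part of why preference-based comparisons may transfer across studies more readily than raw MOS values.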
Key Takeaways
- PrefSQA replaces noisy scalar MOS labels with pairwise preference judgments, which are more consistent across raters and listening conditions.
- The research highlights that dataset quality, specifically minimizing labeling noise, matters more than model complexity for reliable speech quality prediction.
- AI practitioners should consider shifting evaluation from absolute scoring to relative comparisons, and invest in cleaner annotation pipelines rather than larger, noisier datasets.
- Existing MOS-based benchmarks may have hidden reliability issues, making cross-study comparisons less trustworthy than commonly assumed.