A Reproducible Benchmark of Lightweight CNNs: Accuracy, Efficiency, and the Impact of Pretrained Initialization
arXiv:2505.03303v3. Abstract: Lightweight convolutional neural networks are often compared using results obtained with different training recipes, input settings, and pretrained checkpoints. Such differences make architecture rankings difficult to interpret. This study...
The Reproducibility Crisis Hits Lightweight CNNs
A new preprint on arXiv tackles a persistent problem in applied deep learning: the lack of standardized benchmarking for lightweight convolutional neural networks (CNNs). The study systematically evaluates how different training recipes, input resolutions, and, crucially, pretrained initialization strategies distort architecture comparisons. By controlling these variables, the authors provide a cleaner ranking of models like MobileNetV3, EfficientNet-Lite, and ShuffleNetV2 across accuracy and efficiency metrics.
The core finding is unsurprising yet important: pretrained initialization dramatically alters performance rankings. A model that appears superior when trained from scratch may fall behind when fine-tuned from ImageNet weights, and vice versa. This is not a minor detail. It means many published comparisons are effectively comparing apples to oranges, because they implicitly treat initialization as a neutral factor.
Why This Matters for the Field
The lightweight CNN space has become a crowded marketplace of architectures, each claiming Pareto-optimal trade-offs between speed and accuracy. Yet practitioners routinely find that reproducing these claims is difficult. The problem is not malice but methodology: researchers optimize training hyperparameters for their own models, use different data augmentation pipelines, and rely on different pretrained checkpoints. The result is a fragmented evidence base where the "best" model depends more on the evaluation setup than on architectural merit.
This study matters because it quantifies the problem. By holding training recipes constant and varying only initialization, the authors show that pretrained weights can shift accuracy by 2-5% on ImageNet, which is enough to flip rankings entirely. For edge deployment, where every percentage point of accuracy is hard-won, this is not noise; it is systematic bias.
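To make the rank-flip argument concrete, the sketch below compares the ordering of the same models under two initializations. The model names and accuracy values are invented for illustration and are not taken from the preprint; each pretrained score sits roughly 2 to 4 points above its from-scratch counterpart.

```python
# Illustrative top-1 accuracies (%). These numbers are invented for this
# sketch and are NOT results from the preprint.
scratch = {"model_a": 71.0, "model_b": 70.2, "model_c": 68.9}
pretrained = {"model_a": 73.1, "model_b": 74.0, "model_c": 72.5}


def ranking(scores):
    """Return model names ordered from highest to lowest accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


def flipped_pairs(scores_x, scores_y):
    """List model pairs whose relative order differs between two settings."""
    names = sorted(scores_x)
    flips = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap_x = scores_x[a] - scores_x[b]
            gap_y = scores_y[a] - scores_y[b]
            # Opposite signs mean the pair swapped places.
            if gap_x * gap_y < 0:
                flips.append((a, b))
    return flips


print("from scratch:", ranking(scratch))
print("pretrained:  ", ranking(pretrained))
print("flipped pairs:", flipped_pairs(scratch, pretrained))
```

In this toy case the gap between the top two models is under one point in either setting, so an uneven gain from pretraining is enough to swap them.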
Implications for AI Practitioners
First, never trust a single benchmark. If a paper claims Model A outperforms Model B, check whether both were trained with identical protocols. The authors provide a reproducible framework that others can adopt, and practitioners should demand this level of rigor before selecting an architecture for production.
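One way to run that check is to write down each reported training protocol as a dictionary and list the settings that differ. The sketch below is a minimal version of that idea; the field names (`epochs`, `input_resolution`, `augmentation`, `init`) and the values are hypothetical and do not come from the preprint or from any particular paper.

```python
def protocol_diff(run_a, run_b, ignore=("architecture",)):
    """Return {setting: (value_a, value_b)} for every setting that differs.

    Keys listed in `ignore` are expected to differ and are skipped.
    A setting missing from one run shows up with None on that side.
    """
    keys = (set(run_a) | set(run_b)) - set(ignore)
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in sorted(keys)
        if run_a.get(key) != run_b.get(key)
    }


# Hypothetical protocols as they might be transcribed from two papers.
run_a = {
    "architecture": "model_a",
    "epochs": 300,
    "input_resolution": 224,
    "augmentation": "randaugment",
    "init": "imagenet_pretrained",
}
run_b = {
    "architecture": "model_b",
    "epochs": 150,
    "input_resolution": 224,
    "augmentation": "randaugment",
    "init": "random",
}

for setting, (a, b) in protocol_diff(run_a, run_b).items():
    print(f"{setting}: {a!r} vs {b!r}")
```

An empty result means the two runs differ only in architecture, which is the condition under which an accuracy gap can reasonably be attributed to the model itself.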
Second, pretrained initialization is a design choice, not a free lunch. Many teams default to ImageNet-pretrained weights without considering whether the target domain matches ImageNet's distribution. This study reinforces that initialization is a hyperparameter that should be tuned alongside learning rate and batch size.
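Treating initialization as a hyperparameter can be as simple as adding it to the search grid. The sketch below enumerates such a grid with the standard library; the option names and values are placeholders chosen for illustration, not settings recommended by the study.

```python
from itertools import product

# Hypothetical search space: initialization sits alongside the usual knobs.
search_space = {
    "init": ["random", "imagenet_pretrained"],
    "learning_rate": [0.01, 0.1],
    "batch_size": [128, 256],
}


def expand_grid(space):
    """Return one config dict per combination of the listed values."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[key] for key in keys))
    ]


configs = expand_grid(search_space)
print(len(configs), "configurations")
for config in configs:
    print(config)
```

Each configuration would then be trained and evaluated with the same recipe, so that any difference between the `random` and `imagenet_pretrained` runs reflects initialization alone.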
Third, lightweight CNNs are not yet commoditized. The field still lacks a clear winner, and the best model for a given latency budget depends on the specific deployment constraints. The study's controlled benchmarks offer a more reliable starting point, but practitioners should still validate on their own data and hardware.
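Validating on your own hardware mostly means timing the model where it will actually run. The following is a minimal, framework-agnostic timing harness, assuming the model can be wrapped in a zero-argument callable; `dummy_inference` is a stand-in workload, not a real network, and the warmup and run counts are arbitrary defaults.

```python
import math
import statistics
import time


def measure_latency_ms(fn, warmup=10, runs=50):
    """Time a zero-argument callable and return median and p95 latency in ms."""
    # Warmup calls are discarded so caches and lazy initialization settle.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def dummy_inference():
    """Stand-in for a single forward pass; replace with a real model call."""
    return sum(i * i for i in range(10_000))


print(measure_latency_ms(dummy_inference))
```

On accelerators that execute work asynchronously, the callable should block until the result is ready; otherwise the timer measures only how long it takes to enqueue the work.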
Key Takeaways
- Pretrained initialization can shift accuracy by 2-5%, enough to flip rankings and make many published comparisons unreliable.
- Standardized training recipes are essential for meaningful architecture comparisons; this study provides a reproducible baseline.
- Practitioners should treat initialization as a tunable hyperparameter, not a fixed default.
- No single lightweight CNN dominates across all efficiency-accuracy trade-offs; selection requires domain-specific validation.