Research | 2026-06-30

New AI Methods Tackle Small, Noisy Datasets with Genetic Programming and Counterfactual Augmentation

Originally published by Arxiv CS.AI

Two new papers propose ways to improve regression modeling on small datasets: one uses cross-validated island-model genetic programming for interpretable symbolic regression on small, wide datasets, and the other introduces counterfactual residual data augmentation to make models more robust on small, noisy datasets.

What Happened

Two recent arXiv preprints address the challenge of building reliable regression models from limited data. The first, "Evolutional Math: Cross-Validated Island-Model Genetic Programming for Interpretable Symbolic Regression on Small, Wide Datasets," tackles the problem of overfitting in symbolic regression when the number of features exceeds the number of samples. The authors propose a genetic programming approach that uses an island model with cross-validation to evolve compact, interpretable expressions that generalize better. The second paper, "Counterfactual Residual Data Augmentation for Regression," introduces a data augmentation technique that generates synthetic training samples by perturbing residuals in a counterfactual manner, improving model performance on noisy, small datasets.
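To make the first idea concrete, here is a minimal sketch of island-model genetic programming with a cross-validated fitness. The summary above does not give the paper's algorithmic details, so everything specific here is an assumption made for illustration: the tuple encoding of expressions, the per-fold linear scaling, the parsimony penalty, the ring migration scheme, and all function names (`cv_fitness`, `evolve_islands`, and so on) are hypothetical and are not taken from the paper.

```python
import math
import random

# Expression trees as nested tuples (hypothetical encoding for this sketch):
#   ("x", i) -> feature i, ("c", v) -> constant v, (op, left, right) -> binary node
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
OP_NAMES = sorted(OPS)


def random_tree(n_features, depth, rng):
    """Grow a random expression of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("c", rng.uniform(-2.0, 2.0))
    op = rng.choice(OP_NAMES)
    return (op, random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))


def evaluate(tree, row):
    """Compute the expression's value on one data row."""
    if tree[0] == "x":
        return row[tree[1]]
    if tree[0] == "c":
        return tree[1]
    return OPS[tree[0]](evaluate(tree[1], row), evaluate(tree[2], row))


def size(tree):
    """Node count, used as a parsimony (compactness) penalty."""
    return 1 if tree[0] in ("x", "c") else 1 + size(tree[1]) + size(tree[2])


def mutate(tree, n_features, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if tree[0] in ("x", "c") or rng.random() < 0.3:
        return random_tree(n_features, 2, rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, n_features, rng), right)
    return (op, left, mutate(right, n_features, rng))


def fit_scaling(preds, targets):
    """Least-squares intercept a and slope b so that a + b * pred fits targets."""
    n = len(targets)
    mp, mt = sum(preds) / n, sum(targets) / n
    var = sum((p - mp) * (p - mp) for p in preds)
    if var < 1e-12:
        return mt, 0.0  # constant expression: fall back to the training mean
    b = sum((p - mp) * (t - mt) for p, t in zip(preds, targets)) / var
    return mt - b * mp, b


def cv_fitness(tree, X, y, k=4, parsimony=0.01):
    """Mean held-out squared error over k folds plus a size penalty (lower is better)."""
    preds = [evaluate(tree, row) for row in X]
    if any(not math.isfinite(p) or abs(p) > 1e6 for p in preds):
        return math.inf  # reject numerically unstable expressions
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        a, b = fit_scaling([preds[i] for i in train], [y[i] for i in train])
        total += sum((a + b * preds[i] - y[i]) ** 2 for i in test) / len(test)
    return total / k + parsimony * size(tree)


def evolve_islands(X, y, n_islands=3, pop_size=20, generations=15,
                   migrate_every=5, seed=0):
    """Evolve isolated populations and periodically migrate each island's best."""
    rng = random.Random(seed)
    n_features = len(X[0])
    fitness = lambda tree: cv_fitness(tree, X, y)
    islands = [[random_tree(n_features, 2, rng) for _ in range(pop_size)]
               for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        for idx, pop in enumerate(islands):
            survivors = sorted(pop, key=fitness)[: pop_size // 2]  # elitist truncation
            children = [mutate(rng.choice(survivors), n_features, rng)
                        for _ in range(pop_size - len(survivors))]
            islands[idx] = survivors + children
        if gen % migrate_every == 0:
            # Ring migration: island i's best overwrites the last child on island i + 1.
            bests = [min(pop, key=fitness) for pop in islands]
            for idx, best in enumerate(bests):
                islands[(idx + 1) % n_islands][-1] = best
    return min((tree for pop in islands for tree in pop), key=fitness)
```

Calling `evolve_islands(X, y)` on a list of feature rows `X` and a list of targets `y` (with at least `k` samples) returns the expression with the lowest cross-validated score. The design point the sketch tries to capture is that an expression is scored only on rows it was not scaled on, and larger expressions pay a penalty, so a formula that merely memorizes a handful of samples should rank below a compact one that generalizes. A real system would likely add crossover, richer operators, and constant optimization.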

Why It Matters

Small, wide datasets (more features than samples) are common in fields like clinical trials, biostatistics, and engineering pilot studies, where collecting large samples is expensive or impractical. Traditional machine learning methods often overfit such data or produce uninterpretable black-box models. The two papers target these weaknesses directly: the genetic programming method yields explicit mathematical formulas that domain experts can read and validate, while the counterfactual augmentation method is described as applicable to any regression model, improving robustness without additional data collection. Together, they address a gap in AI for high-stakes, data-scarce domains.

Implications for AI Practitioners

For practitioners working with small datasets, these methods provide actionable tools. The island-model genetic programming approach can be used to discover interpretable relationships in fields like epidemiology or engineering, where understanding the underlying mechanism is as important as prediction accuracy. The counterfactual augmentation technique is model-agnostic and can be integrated into existing regression pipelines to reduce overfitting and improve generalization. Both methods emphasize the importance of validation strategies (cross-validation, counterfactual reasoning) to avoid spurious correlations. However, practitioners should note that genetic programming can be computationally expensive, and the augmentation method requires careful tuning to avoid introducing bias.
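As a rough illustration of how residual-based augmentation could slot into an existing regression pipeline, the sketch below uses a classical residual bootstrap as a stand-in: it fits a base model, then creates synthetic targets by asking what each sample's target would have been had it received another sample's residual. This is only one plausible reading of "perturbing residuals in a counterfactual manner"; the paper's actual procedure is not described in this summary. The function names (`fit_one_feature`, `residual_augment`) and the one-feature base model are hypothetical choices made to keep the example self-contained.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Model = Callable[[Row], float]
Fitter = Callable[[Sequence[Row], Sequence[float]], Model]


def fit_one_feature(X: Sequence[Row], y: Sequence[float]) -> Model:
    """Hypothetical stand-in base model: least squares on feature 0 only."""
    n = len(y)
    mx = sum(row[0] for row in X) / n
    my = sum(y) / n
    var = sum((row[0] - mx) * (row[0] - mx) for row in X)
    slope = 0.0 if var == 0 else sum(
        (row[0] - mx) * (t - my) for row, t in zip(X, y)) / var
    intercept = my - slope * mx
    return lambda row: intercept + slope * row[0]


def residual_augment(X: Sequence[Row], y: Sequence[float], fit: Fitter,
                     n_copies: int = 2, seed: int = 0
                     ) -> Tuple[List[Row], List[float]]:
    """Return the original data plus n_copies residual-bootstrap copies.

    Assumption: each synthetic target is the fitted value at the same inputs
    plus a residual drawn at random from the whole training set.
    """
    rng = random.Random(seed)
    model = fit(X, y)  # any fitter works, which keeps the step model-agnostic
    fitted = [model(row) for row in X]
    residuals = [t - f for t, f in zip(y, fitted)]
    X_aug: List[Row] = list(X)
    y_aug: List[float] = list(y)
    for _ in range(n_copies):
        for row, f in zip(X, fitted):
            X_aug.append(row)                        # inputs are left unchanged
            y_aug.append(f + rng.choice(residuals))  # target gets a swapped residual
    return X_aug, y_aug
```

The augmented pair `(X_aug, y_aug)` can then be passed to whichever regressor the pipeline already trains. One caveat relevant to the tuning concern above: residuals computed on the training data tend to understate the noise when the base model overfits, so out-of-fold residuals may be a safer source, and the number of synthetic copies is a knob that can bias the final model toward the base model's predictions.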

Key Takeaways

Interpretable symbolic regression on small, wide datasets is achievable with cross-validated island-model genetic programming, producing compact formulas that resist overfitting.
Counterfactual residual augmentation offers a novel way to generate synthetic data for regression, improving model robustness without extra real data.
Both methods are particularly relevant for high-stakes domains like healthcare and engineering, where data is scarce and interpretability is crucial.
Practitioners should validate these approaches on their own datasets, as performance may vary depending on noise levels and feature correlations.
