small-sample-analysis
End-to-end methodology for supervised machine learning on small datasets (typically 30-200 samples) where standard "throw XGBoost at it" approaches fail. Use this skill whenever the user is building a predictive model on a small dataset, especially when sample-to-feature ratios are tight, when interpretability matters as much as accuracy, when the user needs to justify model choices to non-technical stakeholders, or when they need a rigorous "diagnose-improve-verify" workflow rather than just a final model. Trigger this even if the user only asks for a specific piece (e.g. "help me pick features", "validate this model"), since small-sample problems require the full methodology to avoid silent overfitting. Also trigger for store-selection / site-selection problems, B2B sales analytics, biomedical studies, A/B test analysis with limited cohorts, and any "we only have N stores/patients/experiments and need to predict Y" scenario.
Overview
Small Sample Analysis
A complete methodology for building defensible predictive models on small datasets (typically n < 200, often n < 50).
When this skill applies
Small-sample analysis differs fundamentally from standard ML workflows. The defaults that work on 100k+ rows actively harm models on small data:
- XGBoost/LightGBM: overfit catastrophically; CV-R² often negative
- Single train/test split: variance too high to draw conclusions
- Stepwise feature selection: picks noise as signal
- Headline metric reporting (just R²): hides systematic bias
This skill captures a methodology that handles these pitfalls explicitly.
Triggers:
- Sample size mentioned as small (< 200, especially < 50)
- Feature-to-sample ratio is concerning (p/n > 0.1)
- User asks "why not XGBoost" or shows confusion about model choice
- User needs to justify decisions to non-technical stakeholders
- User uses words like "stores", "patients", "experiments", "cohorts" with limited counts
- Any predictive modeling task where the user needs interpretability + rigor
Output language
Match the user's natural language for all deliverables (Notebook markdown, Word body, chart labels, slides). Code, math notation, and standard ML abbreviations (Ridge, SHAP, R², MAPE) stay in English regardless. For CJK text, set CJK-capable fonts in matplotlib (e.g. Noto Sans CJK JP) and docx (e.g. Microsoft YaHei) so that characters do not render as missing-glyph boxes (□□□).
Core principles
Drill these into every analysis:
- Simplicity beats flexibility at small n. Ridge regression often beats XGBoost.
- Cross-validation is non-negotiable. Never report training R² as the headline.
- Interpretability is a hard requirement. Black-box models can't be justified to stakeholders.
- Multi-method triangulation. Confirm key findings using 2+ independent methods.
- Diagnose-improve-verify is a loop, not a one-shot. Expose model flaws openly.
- Statistical significance is shaky at n < 50. Be honest about p-value limitations.
The 11-step workflow
This is the canonical end-to-end pipeline. Don't skip steps — each catches failures the next assumes are absent.
Step 1: Establish the sample constraint upfront
First thing in any small-sample project: state n explicitly, derive the feature budget (n/5 is the rule of thumb for OLS; regularization relaxes it slightly), and discuss the constraint with the user. If they are surprised by it, the rest of the analysis is at risk.
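The budget arithmetic can be sketched as below. The function name `feature_budget` and the floor of one feature are illustrative assumptions, not part of the skill; only the n/5 rule comes from the text.

```python
def feature_budget(n_samples: int, samples_per_feature: int = 5) -> int:
    """Maximum feature count under the n / samples_per_feature rule of thumb."""
    if n_samples <= 0 or samples_per_feature <= 0:
        raise ValueError("n_samples and samples_per_feature must be positive")
    # Integer division rounds down; keep at least one feature (an assumption).
    return max(1, n_samples // samples_per_feature)


for n in (30, 48, 120):
    print(f"n={n}: at most {feature_budget(n)} features")
```

With n = 48 this gives a budget of 9 features, a concrete number to put in front of the user before any modeling starts.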
Step 2: Data cleaning + consistency checks
Always run hard consistency validations (e.g. sub-totals should sum to the total within a tolerance such as 1e-12). Document failures; never silently fill.
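A minimal sketch of such a check, using only the standard library. The column names (`dine_in`, `delivery`, `total`) and the helper `find_total_mismatches` are hypothetical; the point is that mismatches are collected and reported rather than patched.

```python
import math


def find_total_mismatches(rows, part_keys, total_key, abs_tol=1e-12):
    """Return (row_index, parts_sum, total) for every row whose parts do not add up."""
    mismatches = []
    for i, row in enumerate(rows):
        # fsum avoids accumulating floating-point error across the parts.
        parts_sum = math.fsum(row[k] for k in part_keys)
        if not math.isclose(parts_sum, row[total_key], rel_tol=0.0, abs_tol=abs_tol):
            mismatches.append((i, parts_sum, row[total_key]))
    return mismatches


# Hypothetical records: the second row's parts are 4.25 short of its total.
rows = [
    {"dine_in": 120.0, "delivery": 80.0, "total": 200.0},
    {"dine_in": 95.5, "delivery": 40.25, "total": 140.0},
]
bad = find_total_mismatches(rows, ["dine_in", "delivery"], "total")
for idx, parts_sum, total in bad:
    print(f"row {idx}: parts sum to {parts_sum}, reported total is {total}")
```

An absolute tolerance of 1e-12 suits values of modest magnitude; for very large totals a relative tolerance may be the more sensible choice.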
Step 3: EDA with adversarial questions
Don't just plot distributions. Ask:
- Are there hidden subgroups (e.g. "M0 is artificially high from launch subsidies")?
- Are there counter-intuitive correlations that need business explanation?
- Do early time periods reflect a non-equilibrium state that contaminates features?
Step 4: Feature engineering with budget enforcement
Apply the n/5 budget from Step 1. If unsure, see references/feature_engineering.md. Use dual-scheme designs (static-only vs static+early-operational) when the business has multiple decision time points.
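One way to enforce the budget is to rank features by absolute Pearson correlation with the target and keep only as many as the budget allows. The sketch below assumes that ranking rule (the quick reference card mentions "Top-10 by |r|"); the helper name `top_k_by_abs_corr` and the synthetic data are illustrative only.

```python
import numpy as np


def top_k_by_abs_corr(X, y, k):
    """Indices of the k columns of X with the largest |Pearson r| against y, plus all r."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; give them r = 0 instead of dividing by zero.
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(r))
    return order[:k], r


# Synthetic example: 40 samples, 20 candidate features, only columns 2 and 7 matter.
rng = np.random.default_rng(0)
n, p = 40, 20
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(scale=0.5, size=n)

budget = n // 5  # the n/5 rule gives 8 features here
selected, r = top_k_by_abs_corr(X, y, budget)
print("selected columns:", sorted(selected.tolist()))
```

Selecting features on the full dataset and then cross-validating leaks information from the held-out folds, so in a real pipeline this ranking should be recomputed inside each training fold.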
Step 5: Model selection with explicit rejection logic
Don't just say "we used Ridge". Document why each alternative was rejected (XGBoost overfits, OLS unstable under collinearity, Lasso over-shrinks). See references/model_selection.md.
Step 6: Baseline comparison (mean prediction)
Always report relative to the "predict the mean" baseline. R² alone is meaningless; "12% RMSE reduction over baseline" is meaningful.
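The baseline comparison is simple arithmetic. The sketch below uses made-up numbers and assumes the baseline prediction (14.0) is a mean computed on training data; the function names are illustrative.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def rmse_reduction_vs_baseline(y_true, model_pred, baseline_pred):
    """Fractional RMSE reduction of the model relative to the baseline."""
    base = rmse(y_true, baseline_pred)
    return (base - rmse(y_true, model_pred)) / base


# Toy held-out values and predictions (fabricated purely for illustration).
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
baseline_pred = np.full_like(y_true, 14.0)  # assumed training-set mean
model_pred = np.array([11.0, 12.0, 15.0, 15.0, 17.0])

reduction = rmse_reduction_vs_baseline(y_true, model_pred, baseline_pred)
print(f"RMSE reduction over predict-the-mean: {reduction:.1%}")
```

Under cross-validation, the baseline for each fold should be the mean of that fold's training targets, so the baseline gets no more information than the model does.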
Step 7: 5-fold CV with out-of-fold metrics
Never report a single train/test split. Use out-of-fold (OOF) predictions for all metrics. For n < 50, consider LOOCV as additional validation.
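A sketch of out-of-fold evaluation with scikit-learn, assuming a standardized Ridge pipeline with α = 30 (the default from the quick reference card) and synthetic data. `cross_val_predict` returns, for every sample, a prediction from a model that never saw that sample during fitting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real small dataset.
rng = np.random.default_rng(42)
n, p = 60, 8
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=30.0))
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(model, X, y, cv=cv)

oof_r2 = r2_score(y, oof_pred)
oof_rmse = float(np.sqrt(mean_squared_error(y, oof_pred)))
train_r2 = model.fit(X, y).score(X, y)  # for contrast only, never the headline

print(f"train R2 = {train_r2:.3f}, OOF R2 = {oof_r2:.3f}, OOF RMSE = {oof_rmse:.3f}")
```

Swapping `KFold` for `sklearn.model_selection.LeaveOneOut` gives the LOOCV variant suggested for very small n.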
Step 8: Residual diagnosis (4-quadrant)
Standard 4-panel: residual vs predicted, residual vs actual (the regression-to-mean detector), histogram, Q-Q plot. See references/residual_diagnosis.md for interpretation patterns. Honest exposure of model flaws beats hiding them.
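The residual-vs-actual panel can be summarized by a single number: the slope of residuals regressed on actual values. The sketch below assumes the convention residual = actual − predicted, under which a positive slope means high actuals are under-predicted and low actuals over-predicted. The helper name is illustrative.

```python
import numpy as np


def residual_vs_actual_slope(y_true, y_pred):
    """OLS slope of residuals (actual - predicted) regressed on actual values."""
    y_true = np.asarray(y_true, dtype=float)
    resid = y_true - np.asarray(y_pred, dtype=float)
    slope, _intercept = np.polyfit(y_true, resid, deg=1)
    return float(slope)


# Toy case: predictions shrunk halfway toward the mean.
y_true = np.array([5.0, 8.0, 10.0, 12.0, 15.0])
y_shrunk = y_true.mean() + 0.5 * (y_true - y_true.mean())

slope = residual_vs_actual_slope(y_true, y_shrunk)
print(f"residual-vs-actual slope: {slope:.2f}")
```

Some positive slope is expected from any imperfect model, so judge it by size and compare it across model variants rather than treating its mere presence as a defect.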
Step 9: Multi-method cross-validation
Confirm key business findings using 2+ independent methods. Examples:
- Decision tree splits vs SHAP non-linearity findings (both should find the same thresholds)
- Supervised feature importance vs unsupervised clustering (both should identify high-value subgroups)
- Adjusted Rand Index (ARI) between cluster assignments and tier assignments
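For the ARI comparison, scikit-learn's `adjusted_rand_score` can be applied directly to the two label vectors. The labels below are fabricated for illustration: tiers would come from the supervised analysis and clusters from an unsupervised method.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical tier labels for nine stores (0 = A, 1 = B, 2 = C).
tiers = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Cluster ids are arbitrary, so a relabeled but identical grouping still scores 1.0.
clusters_agree = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# A grouping that cuts across every tier.
clusters_unrelated = [0, 1, 2, 0, 1, 2, 0, 1, 2]

ari_agree = adjusted_rand_score(tiers, clusters_agree)
ari_unrelated = adjusted_rand_score(tiers, clusters_unrelated)
print(f"agreeing clusters: ARI = {ari_agree:.2f}")
print(f"unrelated clusters: ARI = {ari_unrelated:.2f}")
```

ARI is corrected for chance, so values near zero (or below) mean the two groupings agree no better than random assignment would.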
Step 10: Iteration loop (diagnose → improve → verify)
When residuals show systematic issues:
- Generate hypotheses from the diagnosis (e.g. "regression to mean → need segment dummies")
- Test multiple variants systematically (see references/iteration_workflow.md)
- Pick the winner by held-out CV metrics
- Critical: re-diagnose after improvement. New residuals, new α sensitivity. See references/overfit_validation.md for the 5-method overfitting check.
Step 11: Deliverable assembly
Small-sample projects often serve stakeholders who need to understand the reasoning. Produce:
- A complete Notebook with markdown explaining every decision
- A Word/PDF report with formula appendices
- A presentation deck + Q&A handbook (anticipate the "why not XGBoost" question)
See references/deliverable_templates.md for layouts and the canonical 14-chapter structure.
Reference files
When working through the steps, consult the appropriate reference:
- references/model_selection.md: Choosing between Ridge / RF / GBM / decision trees, with rejection logic for XGBoost/Lasso/NN
- references/feature_engineering.md: Top-K selection, dual-scheme designs, segment dummies, derived ratios
- references/residual_diagnosis.md: 4-quadrant interpretation, regression-to-mean detection, slope analysis
- references/iteration_workflow.md: How to set up A/B/C/D variant comparisons, threshold sources (data-driven vs business-driven)
- references/overfit_validation.md: The 5-method overfitting check (train/CV gap, LOOCV, permutation test, learning curves, α sensitivity)
- references/cross_validation_methods.md: Triangulating findings across decision trees, SHAP, clustering
- references/deliverable_templates.md: Canonical 14-chapter structure, Notebook/Word/PPT layouts, math expression formatting
Read the file when you reach that step. Don't load them all upfront.
Common pitfalls (failure modes to actively prevent)
When you see the user heading toward these, intervene:
- "Let me just throw XGBoost at it" → Show CV-R² < 0 evidence, redirect to Ridge
- "R² = 0.4 is bad, model is useless" → Compare to baseline; small-sample R² of 0.2-0.4 is often state-of-the-art
- "Let me use all 30 features" → Enforce n/5 budget; explain the curse of dimensionality
- "Train R² is 0.85, ship it" → Force out-of-fold CV; expect 50%+ drop
- "The improvement is significant" → Run permutation test; p might be 0.08 not 0.005
- "This model is perfect" → Run residual diagnosis; find the regression-to-mean
- "Just use 80% quantile as threshold" → Compare data-driven (decision tree) vs heuristic thresholds; data-driven typically wins
- "Why does competition correlate positively with sales?" → Don't dismiss; small-sample data often reveals counter-intuitive truths. Investigate the "shared cause" (good locations attract both competitors and customers)
Honest reporting standards
When writing the final deliverables, proactively disclose:
- Sample size and n/p ratio
- Out-of-fold metrics, not training metrics
- Baseline comparison ("X% over predict-the-mean")
- Residual diagnosis findings, especially regression-to-mean
- p-values from permutation tests (often p ≈ 0.05-0.10 at small n; say so)
- Models tried and rejected, with the reasoning
- "Impossible triangle" trade-offs encountered (accuracy / robustness / extremes: pick two)
Suppressing these gets the model rejected in real audit / peer review. Including them builds credibility.
Quick reference card
| Decision point | Default answer | When to deviate |
|---|---|---|
| Algorithm | Ridge regression, α=30 | If n > 500, can try gradient boosting |
| Feature count | Top-10 by \|r\| | Reduce if n < 30 |
| CV scheme | 5-fold | Use LOOCV if n < 30 |
| Baseline | Predict the mean | Use prior model if iterating |
| α grid | {0.1, 1, 5, 10, 30, 100, 300} | Narrow once you know roughly |
| Significance threshold | p < 0.05 strict, p < 0.10 marginal | Lower at very small n |
| Overfitting check | Train/CV gap, LOOCV, permutation, learning curve, α sensitivity | All 5 if final model is being claimed |
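The α grid from the table can be swept with out-of-fold RMSE as the yardstick. This is a sketch on synthetic data, assuming a standardized Ridge pipeline; a flat RMSE curve around the chosen α suggests the result is not sensitive to that choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ALPHAS = [0.1, 1, 5, 10, 30, 100, 300]  # grid from the quick reference card

# Synthetic data standing in for a real small dataset.
rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Same folds for every alpha so the comparison is like-for-like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof_rmse = {}
for alpha in ALPHAS:
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    pred = cross_val_predict(model, X, y, cv=cv)
    oof_rmse[alpha] = float(np.sqrt(np.mean((y - pred) ** 2)))

best_alpha = min(oof_rmse, key=oof_rmse.get)
for alpha in ALPHAS:
    marker = "  <- lowest OOF RMSE" if alpha == best_alpha else ""
    print(f"alpha={alpha:>5}: OOF RMSE = {oof_rmse[alpha]:.3f}{marker}")
```

Picking α on the same folds that produce the headline metric is mildly optimistic; nested CV is the stricter option when the final number will be scrutinized.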
Install & Usage
mkdir -p .claude/skills && curl -o .claude/skills/small-sample-analysis.md https://raw.githubusercontent.com/jiachengwang-punch/small-sample-analysis/main/SKILL.md
Then run /small-sample-analysis in Claude Code.
Frequently Asked Questions
How to install small-sample-analysis?
To install small-sample-analysis, run the following command, which creates the skills directory and downloads the skill file: mkdir -p .claude/skills && curl -o .claude/skills/small-sample-analysis.md https://raw.githubusercontent.com/jiachengwang-punch/small-sample-analysis/main/SKILL.md
Then run /small-sample-analysis in Claude Code.
What is small-sample-analysis best for?
small-sample-analysis is a skill categorized under General. It is designed for: testing. Created by jiachengwang-punch.