BeClaude

small-sample-analysis

GitHub Trending | General | by jiachengwang-punch

End-to-end methodology for supervised machine learning on small datasets (typically 30-200 samples) where standard "throw XGBoost at it" approaches fail. Use this skill whenever the user is building a predictive model on a small dataset, especially when sample-to-feature ratios are tight, when interpretability matters as much as accuracy, when the user needs to justify model choices to non-technical stakeholders, or when they need a rigorous "diagnose-improve-verify" workflow rather than just a final model. Trigger this even if the user only asks for a specific piece (e.g. "help me pick features", "validate this model"), since small-sample problems require the full methodology to avoid silent overfitting. Also trigger for store-selection / site-selection problems, B2B sales analytics, biomedical studies, A/B test analysis with limited cohorts, and any "we only have N stores/patients/experiments and need to predict Y" scenario.

First seen 5/23/2026

Overview

Small Sample Analysis

A complete methodology for building defensible predictive models on small datasets (typically n < 200, often n < 50).

When this skill applies

Small-sample analysis differs fundamentally from standard ML workflows. The defaults that work on 100k+ rows actively harm models on small data:

  • XGBoost/LightGBM — overfit catastrophically; CV-R² often negative
  • Single train/test split — variance too high to draw conclusions
  • Stepwise feature selection — picks noise as signal
  • Headline metric reporting (just R²) — hides systematic bias

This skill captures a methodology that handles these pitfalls explicitly.

Triggers:

  • Sample size mentioned as small (< 200, especially < 50)
  • Feature-to-sample ratio is concerning (p/n > 0.1)
  • User asks "why not XGBoost" or shows confusion about model choice
  • User needs to justify decisions to non-technical stakeholders
  • User uses words like "stores", "patients", "experiments", "cohorts" with limited counts
  • Any predictive modeling task where the user needs interpretability + rigor

Output language

Match the user's natural language for all deliverables (Notebook markdown, Word body, chart labels, slides). Code, math notation, and standard ML abbreviations (Ridge, SHAP, R², MAPE) stay in English regardless. For non-Latin scripts, set CJK-capable fonts in matplotlib (Noto Sans CJK JP) and docx (Microsoft YaHei) to prevent □□□ rendering bugs.

Core principles

Drill these into every analysis:

  1. Simplicity beats flexibility at small n. Ridge regression often beats XGBoost.
  2. Cross-validation is non-negotiable. Never report training R² as the headline.
  3. Interpretability is a hard requirement. Black-box models can't be justified to stakeholders.
  4. Multi-method triangulation. Confirm key findings using 2+ independent methods.
  5. Diagnose-improve-verify is a loop, not a one-shot. Expose model flaws openly.
  6. Statistical significance is shaky at n < 50. Be honest about p-value limitations.

The 11-step workflow

This is the canonical end-to-end pipeline. Don't skip steps — each catches failures the next assumes are absent.

Step 1: Establish the sample constraint upfront

First thing in any small-sample project: state n explicitly, derive the feature budget (a common OLS rule of thumb is at most n/5 features, which can be relaxed slightly under regularization), and discuss the constraints with the user. If they're surprised by the constraint, the rest of the analysis is at risk.
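As a minimal sketch of the budget arithmetic, assuming the n/5 rule is applied with integer division (the function names `feature_budget` and `budget_report` are illustrative, not part of the skill):

```python
def feature_budget(n_samples: int, samples_per_feature: int = 5) -> int:
    """Maximum number of features under an n / k rule of thumb (k = 5 by default)."""
    if n_samples <= 0 or samples_per_feature <= 0:
        raise ValueError("n_samples and samples_per_feature must be positive")
    return n_samples // samples_per_feature


def budget_report(n_samples: int, n_candidate_features: int) -> str:
    """One-line summary to show the user before any modeling starts."""
    budget = feature_budget(n_samples)
    ratio = n_candidate_features / n_samples
    status = "within" if n_candidate_features <= budget else "over"
    return (f"n={n_samples}, candidates={n_candidate_features}, "
            f"budget={budget}, p/n={ratio:.2f} ({status} budget)")


# Hypothetical project: 42 samples, 30 candidate features.
print(budget_report(n_samples=42, n_candidate_features=30))
# prints: n=42, candidates=30, budget=8, p/n=0.71 (over budget)
```

Stating the result this way puts the sample constraint in front of the user before any feature work begins.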

Step 2: Data cleaning + consistency checks

Always run hard consistency validations (e.g. sub-totals should sum to the total within a tolerance of 1e-12). Document failures; never silently fill or correct them.
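One way such a check might look, as a sketch using only the standard library. The column names (`dine_in`, `delivery`, `total`) and the rows are made up for illustration, and treating 1e-12 as both a relative and an absolute tolerance is an assumption:

```python
import math


def find_subtotal_mismatches(rows, part_keys, total_key, tol=1e-12):
    """Return (row_index, parts_sum, total) for every row whose parts do not sum to the total."""
    mismatches = []
    for i, row in enumerate(rows):
        parts_sum = math.fsum(row[k] for k in part_keys)  # fsum limits rounding error
        if not math.isclose(parts_sum, row[total_key], rel_tol=tol, abs_tol=tol):
            mismatches.append((i, parts_sum, row[total_key]))
    return mismatches


# Hypothetical rows: the second one is inconsistent on purpose.
rows = [
    {"dine_in": 120.5, "delivery": 79.5, "total": 200.0},
    {"dine_in": 90.0, "delivery": 60.0, "total": 155.0},
]
bad = find_subtotal_mismatches(rows, ["dine_in", "delivery"], "total")
for idx, parts_sum, total in bad:
    # Report the failure instead of overwriting either value.
    print(f"row {idx}: parts sum to {parts_sum}, recorded total is {total}")
```

The function only reports mismatches; deciding whether the parts or the total is wrong stays a documented, manual step.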

Step 3: EDA with adversarial questions

Don't just plot distributions. Ask:

  • Are there hidden subgroups (e.g. "M0 is artificially high from launch subsidies")?
  • Are there counter-intuitive correlations that need business explanation?
  • Do early time periods reflect a non-equilibrium state that contaminates features?

Step 4: Feature engineering with budget enforcement

Apply the n/5 budget from Step 1. If unsure, see references/feature_engineering.md. Use dual-scheme designs (static-only vs static+early-operational) when the business has multiple decision time points.

Step 5: Model selection with explicit rejection logic

Don't just say "we used Ridge". Document why each alternative was rejected (XGBoost overfits, OLS unstable under collinearity, Lasso over-shrinks). See references/model_selection.md.

Step 6: Baseline comparison (mean prediction)

Always report relative to the "predict the mean" baseline. R² alone is meaningless; "12% RMSE reduction over baseline" is meaningful.
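A small sketch of the baseline comparison, using only the standard library. It assumes the baseline is the mean of the evaluated targets themselves; under cross-validation the baseline mean would more properly come from each training fold. The numbers are invented for illustration:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of paired values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmse_reduction_vs_mean(y_true, y_pred):
    """Fractional RMSE reduction relative to always predicting the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    baseline_rmse = rmse(y_true, [mean_y] * len(y_true))
    return 1.0 - rmse(y_true, y_pred) / baseline_rmse


# Hypothetical targets and out-of-fold predictions.
y_true = [10, 12, 14, 16, 18]
y_pred = [11, 12, 15, 15, 17]
reduction = rmse_reduction_vs_mean(y_true, y_pred)
print(f"RMSE reduction over predict-the-mean: {reduction:.1%}")
```

When both errors are computed on the same samples, the two views are linked by R² = 1 - (model RMSE / baseline RMSE)², so a 12% RMSE reduction corresponds to R² = 1 - 0.88² ≈ 0.23.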

Step 7: 5-fold CV with out-of-fold metrics

Never report a single train/test split. Use out-of-fold (OOF) predictions for all metrics. For n < 50, consider LOOCV as additional validation.
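The out-of-fold idea can be sketched with NumPy alone. This is an illustration under stated assumptions, not the skill's own implementation: ridge is solved in closed form with features standardized inside each fold, α defaults to 30 as in the quick reference card, and the data are synthetic. The helper names are hypothetical.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha):
    """Closed-form ridge on standardized features with an unpenalized intercept."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    Z_train = (X_train - mu) / sigma
    Z_test = (X_test - mu) / sigma  # scaled with training statistics only
    y_mean = y_train.mean()
    p = Z_train.shape[1]
    coef = np.linalg.solve(Z_train.T @ Z_train + alpha * np.eye(p),
                           Z_train.T @ (y_train - y_mean))
    return y_mean + Z_test @ coef


def oof_predictions(X, y, alpha=30.0, n_splits=5, seed=0):
    """Each sample is predicted by a model that never saw it during fitting."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_splits)
    oof = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        oof[test_idx] = ridge_fit_predict(X[train_mask], y[train_mask], X[test_idx], alpha)
    return oof


def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic example: 60 samples, 8 features, 2 of which carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=60)

oof = oof_predictions(X, y)
print(f"5-fold OOF R^2: {r2(y, oof):.3f}")

# Setting n_splits to n turns the same routine into LOOCV.
loo = oof_predictions(X, y, n_splits=len(y))
print(f"LOOCV R^2:      {r2(y, loo):.3f}")
```

Standardizing inside each fold keeps test-fold information out of the fitted model, which matters more as n shrinks.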

Step 8: Residual diagnosis (4-quadrant)

Standard 4-panel: residual vs predicted, residual vs actual (the regression-to-mean detector), histogram, Q-Q plot. See references/residual_diagnosis.md for interpretation patterns. Honest exposure of model flaws beats hiding them.
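The residual-vs-actual panel can be summarized by a single slope. The sketch below assumes residuals are defined as actual minus predicted; the function names and numbers are illustrative:

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def residual_vs_actual_slope(y_true, y_pred):
    """Slope of (actual - predicted) regressed on actual.

    A positive slope means high actuals are under-predicted and low actuals
    are over-predicted, i.e. predictions are compressed toward the mean.
    """
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return ols_slope(y_true, residuals)


# Hypothetical predictions that sit halfway between each actual and the mean (30).
y_true = [10, 20, 30, 40, 50]
y_pred = [20, 25, 30, 35, 40]
slope = residual_vs_actual_slope(y_true, y_pred)
print(f"residual-vs-actual slope: {slope:.2f}")
# prints: residual-vs-actual slope: 0.50
```

Some positive slope is expected from any imperfect model, so the size of the slope, and whether it shrinks after an iteration, is usually more informative than its sign.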

Step 9: Multi-method cross-validation

Confirm key business findings using 2+ independent methods. Examples:

  • Decision tree splits vs SHAP non-linearity findings (both should find the same thresholds)
  • Supervised feature importance vs unsupervised clustering (both should identify high-value subgroups)
  • ARI between cluster assignments and tier assignments
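For the ARI comparison, here is a self-contained sketch of the adjusted Rand index computed from pair counts. The cluster and tier labels are invented, and returning 1.0 for the degenerate case where both partitions are trivial is a convention chosen here:

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs of samples that share a label in both partitions, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical in structure
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical cluster assignments vs business tier assignments.
clusters = [0, 0, 1, 1, 2, 2]
tiers = ["A", "A", "B", "B", "C", "C"]
print(adjusted_rand_index(clusters, tiers))
# prints: 1.0
```

Values near 1 indicate that the unsupervised clusters and the tiers recover the same grouping, while values near 0 indicate agreement no better than chance.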

Step 10: Iteration loop (diagnose → improve → verify)

When residuals show systematic issues:

  1. Generate hypotheses from the diagnosis (e.g. "regression to mean → need segment dummies")
  2. Test multiple variants systematically (see references/iteration_workflow.md)
  3. Pick the winner by held-out CV metrics
  4. Critical: re-diagnose after improvement. New residuals, new α sensitivity. See references/overfit_validation.md for the 5-method overfitting check.

Step 11: Deliverable assembly

Small-sample projects often serve stakeholders who need to understand the reasoning. Produce:

  • A complete Notebook with markdown explaining every decision
  • A Word/PDF report with formula appendices
  • A presentation deck + Q&A handbook (anticipate the "why not XGBoost" question)

See references/deliverable_templates.md for layouts and the canonical 14-chapter structure.

Reference files

When working through the steps, consult the appropriate reference:

  • references/model_selection.md — Choosing between Ridge / RF / GBM / decision trees, with rejection logic for XGBoost/Lasso/NN
  • references/feature_engineering.md — Top-K selection, dual-scheme designs, segment dummies, derived ratios
  • references/residual_diagnosis.md — 4-quadrant interpretation, regression-to-mean detection, slope analysis
  • references/iteration_workflow.md — How to set up A/B/C/D variant comparisons, threshold sources (data-driven vs business-driven)
  • references/overfit_validation.md — The 5-method overfitting check (train/CV gap, LOOCV, permutation test, learning curves, α sensitivity)
  • references/cross_validation_methods.md — Triangulating findings across decision trees, SHAP, clustering
  • references/deliverable_templates.md — Canonical 14-chapter structure, Notebook/Word/PPT layouts, math expression formatting

Read the file when you reach that step. Don't load them all upfront.

Common pitfalls (failure modes to actively prevent)

When you see the user heading toward these, intervene:

  1. "Let me just throw XGBoost at it" → Show CV-R² < 0 evidence, redirect to Ridge
  2. "R² = 0.4 is bad, model is useless" → Compare to baseline; small-sample R² of 0.2-0.4 is often state-of-the-art
  3. "Let me use all 30 features" → Enforce n/5 budget; explain the curse of dimensionality
  4. "Train R² is 0.85, ship it" → Force out-of-fold CV; expect 50%+ drop
  5. "The improvement is significant" → Run permutation test; p might be 0.08 not 0.005
  6. "This model is perfect" → Run residual diagnosis; find the regression-to-mean
  7. "Just use 80% quantile as threshold" → Compare data-driven (decision tree) vs heuristic thresholds; data-driven typically wins
  8. "Why does competition correlate positively with sales?" → Don't dismiss; small-sample data often reveals counter-intuitive truths. Investigate the "shared cause" (good locations attract both competitors and customers)
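For pitfall 5, one possible permutation test is a paired sign-flip test on per-sample out-of-fold errors. This is a sketch of one variant, not necessarily the one described in references/overfit_validation.md. It assumes both models were scored on the same samples, and the function name is hypothetical:

```python
import random


def paired_permutation_pvalue(errors_old, errors_new, n_permutations=10_000, seed=0):
    """One-sided p-value for 'the new model has lower mean error'.

    Under the null hypothesis the two models are interchangeable, so the sign
    of each paired difference is equally likely to be positive or negative.
    """
    diffs = [old - new for old, new in zip(errors_old, errors_new)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-sample squared errors for an old and a new model variant.
errors_old = [4.0, 2.5, 3.0, 5.5, 1.0, 2.0, 3.5, 4.5]
errors_new = [3.0, 2.0, 3.5, 4.0, 1.5, 1.0, 3.0, 4.0]
p_value = paired_permutation_pvalue(errors_old, errors_new)
print(f"one-sided p-value: {p_value:.3f}")
```

Out-of-fold errors are not fully independent across folds, so the resulting p-value is best read as a rough guide rather than an exact significance level.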

Honest reporting standards

When writing the final deliverables, proactively disclose:

  • Sample size and n/p ratio
  • Out-of-fold metrics, not training metrics
  • Baseline comparison ("X% over predict-the-mean")
  • Residual diagnosis findings, especially regression-to-mean
  • p-values from permutation tests (often p ≈ 0.05-0.10 at small n — say so)
  • Models tried and rejected, with the reasoning
  • "Impossible triangle" trade-offs encountered (accuracy / robustness / extremes — pick two)

Suppressing these gets the model rejected in real audit / peer review. Including them builds credibility.

Quick reference card

Decision point            Default answer                                                    When to deviate
Algorithm                 Ridge regression, α=30                                            If n > 500, can try gradient boosting
Feature count             Top-10 by |r|                                                     Reduce if n < 30
CV scheme                 5-fold                                                            Use LOOCV if n < 30
Baseline                  Predict the mean                                                  Use prior model if iterating
α grid                    {0.1, 1, 5, 10, 30, 100, 300}                                     Narrow once you know roughly
Significance threshold    p < 0.05 strict, p < 0.10 marginal                                Lower at very small n
Overfitting check         Train/CV gap, LOOCV, permutation, learning curve, α sensitivity   All 5 if final model is being claimed

Install & Usage

  1. Create the skills directory:
     mkdir -p .claude/skills
  2. Download the skill file:
     mkdir -p .claude/skills && curl -o .claude/skills/small-sample-analysis.md https://raw.githubusercontent.com/jiachengwang-punch/small-sample-analysis/main/SKILL.md
  3. Invoke in Claude Code:
     /small-sample-analysis

Security Audits

License: Unknown | Source: Warn | Repository: Pass

Frequently Asked Questions

How to install small-sample-analysis?

To install small-sample-analysis, create the skills directory and download the skill file in one command: mkdir -p .claude/skills && curl -o .claude/skills/small-sample-analysis.md https://raw.githubusercontent.com/jiachengwang-punch/small-sample-analysis/main/SKILL.md. Then run /small-sample-analysis in Claude Code.

What is small-sample-analysis best for?

small-sample-analysis is a skill categorized under General. It is designed for: testing. Created by jiachengwang-punch.