28 Cheatsheets

One-page topical summaries, aggregated here as a printable reference. Each is a self-contained reminder of the syntax, the assumptions, and the diagnostics that go with the methods it covers.

28.1 Foundations

28.1.1 Toolchain, tidy data, and `ggplot2`## The research workflow

Question → Measurements → Design → Acquisition → Description → Analysis → Interpretation → Validation → Knowledge

Biological vs statistical significance: “p < 0.05” ≠ “effect matters”.

28.2 R / RStudio / Quarto / renv

Task	Command
Install packages	`install.packages("pkg")`
Pin project	`renv::init()` / `renv::snapshot()` / `renv::restore()`
Render one file	`quarto render path/to/file.qmd`
Preview with live reload	`quarto preview`
Render only slides	`quarto render file.qmd --to revealjs`

28.3 Data types and scales

Scale	Example	Statistic
Nominal	sex, site	counts, proportions
Ordinal	pain 0–10	median, IQR, rank tests
Interval	°C	mean, SD
Ratio	kg, time	mean, SD, ratios

Accuracy ≠ precision. Bias shifts the target; variance spreads the shots.

28.4 Tidy data

One variable per column.
One observation per row.
One observational unit per table.

28.5 `dplyr` verbs

df |>
  filter(age >= 18) |>
  select(id, age, sbp) |>
  mutate(sbp_z = (sbp - mean(sbp)) / sd(sbp)) |>
  group_by(arm) |>
  summarise(mean_sbp = mean(sbp, na.rm = TRUE), n = n())

Joins: left_join, inner_join, anti_join by a key.

Missingness: is.na(), drop_na(), tidyr::complete().

28.6 `ggplot2` grammar

ggplot(data, aes(x, y, colour = group)) +
  geom_point() +
  facet_wrap(~ factor) +
  labs(x = "x label", y = "y label") +
  theme_minimal()

Layer order: data → aes → geom → facet → scales → labels → theme.

28.7 Plot families

One variable continuous → histogram, density.
Two continuous → scatter, sometimes binned / smoothed.
One categorical + one continuous → boxplot, violin, strip.
Two categorical → bar with fill, mosaic.

28.8 Decision rule

Cannot read the data into a tibble → fix data entry first.
Tibble reads but has missing or mixed types → clean before modelling.
Plot is ugly or cluttered → trust the plot, not the summary statistic.

28.9 Common pitfalls

Running statistics on untyped / partially-missing data.
Adjusting colour scales instead of simplifying the chart.
Reporting mean + SD for a skewed variable.
Forgetting set.seed() in any simulation chunk.

28.10 Further reading

Wickham, R for Data Science (2e).
Posit ggplot2 and dplyr cheatsheets.

28.10.1 Descriptive statistics, Bayes, and distributions## Descriptive statistics

Situation	Summary
Roughly symmetric, no outliers	mean ± SD
Skewed or has outliers	median, IQR, range
Categorical	counts and proportions
Grouped	Table 1 via `gtsummary::tbl_summary()`

28.11 Contingency tables

table(df$x, df$y)
prop.table(table(df$x, df$y), margin = 1)

28.12 Bayes’ theorem

\(P(A \mid B) = \dfrac{P(B \mid A) P(A)}{P(B)}\)

In odds form: posterior odds = prior odds × likelihood ratio.

28.13 Diagnostic testing

Metric	Formula
Sensitivity	TP / (TP + FN)
Specificity	TN / (TN + FP)
PPV	TP / (TP + FP)
NPV	TN / (TN + FN)
LR+	Se / (1 − Sp)
LR−	(1 − Se) / Sp

PPV collapses at low prevalence — even a 95%-accurate test is mostly false positives.

28.14 Discrete distributions

Distribution	Use	R
Bernoulli(p)	single trial	`rbinom(n, 1, p)`
Binomial(n, p)	# successes	`dbinom / pbinom / rbinom`
Poisson(λ)	rare events	`dpois / ppois / rpois`
Negative binomial	overdispersed counts	`MASS::rnegbin`, `dnbinom`

Binomial → Poisson as n grows and p shrinks with np = λ.

28.15 Continuous distributions

Distribution	Use
Normal(μ, σ)	almost everything via the CLT
Student t(ν)	inference about means, small n
χ²(ν)	variance, counts, GoF
F(ν₁, ν₂)	variance ratios, ANOVA
Exponential(λ)	time between events, memoryless

R uses the d / p / q / r prefix: density, CDF, quantile, random.

28.16 Q-Q plot

ggplot(df, aes(sample = x)) + stat_qq() + stat_qq_line()

S-shape → heavy tails. U-shape → skew. Plot before the test.

28.17 Decision rule

Reporting a mean? First check a histogram.
Testing a proportion? First sketch a 2×2 table.
Choosing a distribution? Simulate before believing.

28.18 Common pitfalls

Quoting PPV without disclosing prevalence.
Assuming normality because n > 30.
Using mean ± SD on skewed data.

28.19 Further reading

Altman, Practical Statistics for Medical Research.
Gelman et al., Bayesian Data Analysis, ch. 1.

28.19.1 Sampling, estimation, and one-sample tests## Populations vs samples

Parameter = truth about the population (\(\mu\), \(\sigma\), \(\pi\)).
Estimator = statistic computed on a sample (\(\bar x\), \(s\), \(\hat p\)).
Bias = \(E[\hat\theta] - \theta\). MSE = bias² + variance.

28.20 Central limit theorem

For iid samples with finite variance, as \(n \to \infty\): \(\bar X \sim N\!\left(\mu, \sigma / \sqrt n\right)\)

Holds roughly by \(n \approx 30\) for most non-pathological distributions.

28.21 Standard error

\(\mathrm{SE}(\bar X) = \sigma / \sqrt n\), estimated by \(s / \sqrt n\).

28.22 Bootstrap

B <- 2000
boot_mean <- replicate(B, mean(sample(x, replace = TRUE)))
quantile(boot_mean, c(0.025, 0.975))   # 95% percentile CI
sd(boot_mean)                           # bootstrap SE

28.23 Permutation test

obs  <- mean(a) - mean(b)
pool <- c(a, b)
null <- replicate(5000, {
  idx <- sample(seq_along(pool), length(a))
  mean(pool[idx]) - mean(pool[-idx])
})
mean(abs(null) >= abs(obs))   # two-sided p

28.24 Maximum likelihood

Given data \(x\) and model \(f(x; \theta)\):

\(\ell(\theta) = \sum \log f(x_i; \theta)\).
\(\hat\theta_{\text{MLE}} = \arg\max_\theta \ell(\theta)\).
Fisher information \(I(\theta)\); asymptotic SE = \(1/\sqrt{I(\hat\theta)}\).

28.25 One-sample tests (five-step template)

Hypothesis → Visualise → Assumptions → Conduct → Conclude.

Test	R	Assumptions
One-sample t	`t.test(x, mu = 0)`	roughly normal or large n
One-proportion	`prop.test(k, n, p = 0.5)` or `binom.test`	\(np, n(1-p) \geq 5\) for `prop.test`

28.26 Hypothesis-testing vocabulary

Type I: reject true H₀ (probability \(\alpha\)).
Type II: accept false H₀ (probability \(\beta\)).
Power = \(1 - \beta\).
A p-value is not the probability that H₀ is true.

28.27 Decision rule

Want a CI? Bootstrap first, then ask whether a formula applies.
Want a test? State H₀ and α in writing before running it.
Want to know if n is big enough? Simulate the CLT for your outcome.

28.28 Common pitfalls

Reporting SE when you meant SD (or vice versa).
Using the 95% CI to “prove” there is no effect.
Running many tests and quoting the smallest p-value.

28.29 Further reading

Harrell, Biostatistics for Biomedical Research, ch. 3–5.
Efron & Tibshirani, An Introduction to the Bootstrap.

28.29.1 Two-group comparisons, effect sizes, and power## Choosing a comparison

Design	Continuous outcome	Binary outcome
Two independent groups	`t.test(y ~ g)` (Welch); Wilcoxon rank-sum	`prop.test` / `fisher.test`
Paired / pre-post	`t.test(pre, post, paired = TRUE)`; Wilcoxon signed-rank	McNemar
> 2 groups, continuous	`aov(y ~ g)`; Kruskal-Wallis	chi-square

28.30 Effect sizes

Statistic	What it reports
Mean difference + 95% CI	primary for continuous
Cohen’s d	standardised mean difference
Hedges’ g	small-sample correction of d
Risk ratio (RR) / odds ratio (OR)	two proportions
Risk difference	absolute, clinically intuitive

effectsize::cohens_d(y ~ g)

28.31 Two proportions

prop.test(c(tA, tB), c(nA, nB))         # asymptotic
fisher.test(matrix(c(tA, nA - tA,
                     tB, nB - tB), 2))  # exact, small cells

Report RR (or OR) with 95% CI, not just the p-value.

28.32 Correlation

Method	Captures	Assumptions
Pearson	linear association, continuous	bivariate normal, no outliers
Spearman	monotonic association	rank-based, robust
Kendall	concordant/discordant pairs	robust, slow on large data

cor.test(x, y, method = "spearman")

28.33 Non-parametric tests

Test	Replaces
Wilcoxon rank-sum (Mann-Whitney)	two-sample t
Wilcoxon signed-rank	paired t
Kruskal-Wallis	one-way ANOVA
Sign test	paired t, when even ranks fail

28.34 Power and sample size

pwr::pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
                type = "two.sample")

Family	Function
two-sample t	`pwr.t.test`
two proportions	`pwr.2p.test`, `pwr.2p2n.test`
correlation	`pwr.r.test`
one-way ANOVA	`pwr.anova.test`

Simulation-based power for anything the textbooks skip: simr::powerSim.

28.35 Reporting with `gtsummary`

library(gtsummary)
trial |>
  tbl_summary(by = arm, statistic = list(all_continuous() ~ "{mean} ({sd})")) |>
  add_p() |>
  add_overall()

28.36 Decision rule

Ask: one-group, two-group, paired, or many-group?
Check: normal-ish, or should I use a rank-based test?
Report: point estimate + 95% CI + effect size; p-value last, not first.

28.37 Common pitfalls

Equal-variance t-test (default in some languages) when variances differ.
Reporting OR when the audience expects RR (or vice versa).
Forgetting that the p-value of a paired test depends on the pairing.
Chaining correlations until one is “significant”.

28.38 Further reading

Altman, Machin et al., Statistics with Confidence.
gtsummary docs.

28.39 Regression

28.39.1 Linear regression and diagnostics## Correlation vs regression

Correlation: symmetric; both variables random (Model II).
Regression: asymmetric; predictor fixed, outcome random (Model I).
Use Model II (e.g. smatr::sma) when both variables have error.

28.40 Simple linear regression

\(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)\)

fit <- lm(y ~ x, data = df)
broom::tidy(fit, conf.int = TRUE)
broom::glance(fit)

28.41 Multiple regression

fit <- lm(y ~ x1 + x2 + x1:x2, data = df)

Concept	Fix
Confounding	adjust by including the confounder
Interaction	include `x1:x2` (or `x1 * x2`)
Non-linearity	`splines::ns(x, 4)` or `I(x^2)`
Scale sensitivity	centre / standardise predictors

28.42 Diagnostics

plot(fit)                   # base: 4 diagnostic plots
performance::check_model(fit)   # tidy overview
car::vif(fit)               # multicollinearity

Plot	What it flags
Residuals vs fitted	non-linearity, non-constant variance
Normal Q-Q	non-normal residuals
Scale-location	heteroscedasticity
Leverage / Cook’s	influential outliers

VIF > 5 → investigate collinearity; > 10 → fix it.

28.43 Robust / weighted

MASS::rlm(y ~ x, data = df)          # robust to outliers
lm(y ~ x, weights = 1 / var_i)        # weighted LS

# Heteroscedasticity-consistent SEs
library(sandwich); library(lmtest)
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))

28.44 Decision rule

Residual plot ugly? Transform, add terms, or switch link.
VIF explodes? Drop one of the collinear predictors or combine them.
Outlier with high Cook’s distance? Investigate before deleting.
Variance depends on X? HC3 SEs or weighted LS.

28.45 Common pitfalls

Interpreting \(\beta_1\) as “effect of X” under confounding.
Reporting \(R^2\) on in-sample data as if it were generalisable.
Including an interaction but only reporting main effects.
Centring categorical predictors (don’t — use reference coding).

28.46 Further reading

Harrell, Regression Modeling Strategies, ch. 4–5.
Fox, Applied Regression Analysis.

28.46.1 ANOVA, mixed models, and non-linear regression## One-way ANOVA

aov(y ~ group, data = df) |> summary()

ANOVA is a linear model with a categorical predictor. The F-test compares between-group to within-group variance.

28.47 Contrasts with `emmeans`

library(emmeans)
emm <- emmeans(fit, ~ group)
pairs(emm, adjust = "tukey")   # all pairwise
contrast(emm, list(TrtVsCtrl = c(-1, 1, 1, 1) / 3))

Pre-specify contrasts before looking at the data; correct for multiplicity.

28.48 Two-way / factorial ANOVA

aov(y ~ A * B, data = df) |> summary()
emmip(fit, A ~ B)   # interaction plot

Interaction means “effect of A differs by level of B”. Report the interaction first; main effects are conditional.

28.49 Repeated measures / blocking

RCBD: aov(y ~ treatment + Error(block)).
Repeated measures: move to a mixed model.

library(lme4); library(lmerTest)
lmer(y ~ treatment + time + (1 | subject), data = df)

28.50 GAMs — smooth non-linear terms

library(mgcv); library(gratia)
fit <- gam(y ~ s(x, k = 10) + z, data = df)
summary(fit)      # edf tells you how "wiggly"
draw(fit)         # smooth + CI

edf ≈ 1 → nearly linear; > 4 → clearly non-linear.

28.51 Non-linear regression (`nls`)

# Michaelis-Menten: y = Vmax * x / (K + x)
fit <- nls(y ~ Vmax * x / (K + x),
           data = df, start = list(Vmax = 1, K = 1))

Start values matter. If it fails, plot first to guess reasonable starts.

28.52 Decision rule

Categorical predictor, > 2 levels → ANOVA + contrasts.
Factorial design → include interaction, report it first.
Effect obviously curved → GAM with spline; else try nls.
Repeated measures → mixed model, not repeated-measures ANOVA.

28.53 Common pitfalls

Tukey HSD without pre-specified contrasts of interest.
ANOVA p < 0.05 reported alone — without naming which groups differ.
Forcing a GAM onto monotonic data that nls fits cleanly.
Ignoring the random effect in clustered designs (pseudoreplication).

28.54 Further reading

Wood, Generalized Additive Models, 2e.
emmeans vignette.

28.54.1 Generalised linear models and prediction evaluation## GLM link functions

Outcome	Family	Link	R
Continuous	gaussian	identity	`lm()`
Binary	binomial	logit	`glm(y ~ x, family = binomial)`
Count	poisson	log	`glm(y ~ x, family = poisson, offset = log(t))`
Overdispersed count	neg. binomial	log	`MASS::glm.nb`
Ordinal	cumulative	logit	`MASS::polr`
Nominal	multinomial	logit	`nnet::multinom`

28.55 Logistic regression

fit <- glm(y ~ x1 + x2, data = df, family = binomial)
exp(coef(fit))                        # odds ratios
exp(confint(fit))                     # 95% CI on OR
predict(fit, newdata = nd, type = "response")

Check for perfect separation (huge SEs).
Interpret ORs cautiously; RR is more intuitive for the audience.

28.56 ANCOVA in an RCT

Adjust for baseline; do not analyse the change score.

lm(y_followup ~ arm + y_baseline, data = trial)

More efficient than simple t-test on change scores when baseline and follow-up are correlated.

28.57 Poisson / negative binomial

glm(cases ~ x + offset(log(person_years)),
    family = poisson, data = df)
MASS::glm.nb(cases ~ x + offset(log(person_years)), data = df)

Check dispersion: sum(residuals(fit, type = "pearson")^2) / df.residual. If > 1.5, switch to NB.

28.58 Evaluation — calibration + discrimination

Metric	Means	R
Calibration plot	predicted vs observed	`rms::val.prob`, manual bin
ROC / AUC	rank ordering	`pROC::roc(y, phat)`
Brier score	overall accuracy	`mean((phat - y)^2)`
Calibration slope / intercept	systematic bias	from logistic recalibration

library(pROC)
roc_obj <- roc(y, phat)
auc(roc_obj); ci.auc(roc_obj)
plot(roc_obj)

28.59 Decision rule

Binary outcome → logistic; report OR + 95% CI.
Count outcome → Poisson; check overdispersion; NB if needed.
Trial analysis → ANCOVA, not change score.
Prediction model → calibration curve first, ROC second, decision curves third.

28.60 Common pitfalls

Quoting AUC without calibration (a discriminating but miscalibrated model is dangerous).
Ignoring offsets in count data.
Using ordinal logit when the proportional-odds assumption fails.
Presenting logistic regression coefficients on the log-odds scale without OR.

28.61 Further reading

Harrell, RMS, ch. 10–12.
Steyerberg, Clinical Prediction Models.

28.61.1 Agreement, survival primer, and reporting## Don’t dichotomise

Cutting a continuous variable at the median loses ≈ 37% of the information for a linear association. Keep predictors continuous; model non-linearity with a spline if needed.

28.62 Change scores vs ANCOVA

Change scores suffer from regression to the mean.
ANCOVA (adjust for baseline) is the efficient analysis in an RCT.
In observational studies with unequal baselines, both approaches answer subtly different questions — state which.

28.63 Agreement & reliability

Statistic	Use	R
Cohen’s κ	two raters, categorical	`irr::kappa2(cbind(r1, r2))`
Weighted κ	ordinal categorical	`irr::kappa2(..., weight = "squared")`
ICC(2,1) / ICC(3,1)	continuous	`psych::ICC(df)`
Bland-Altman	two methods, continuous	plot \(\bar{xy}\) vs \(y - x\)

mean_diff <- mean(y - x)
loa <- mean_diff + c(-1.96, 1.96) * sd(y - x)
ggplot(df, aes((x + y)/2, y - x)) + geom_point() +
  geom_hline(yintercept = c(mean_diff, loa), linetype = 2)

28.64 Survival primer

Concept	R
Kaplan-Meier	`survfit(Surv(time, event) ~ g)` + `ggsurvfit::ggsurvfit`
Log-rank test	`survdiff(Surv(time, event) ~ g)`
Cox PH model	`coxph(Surv(time, event) ~ x1 + x2)`
PH check	`cox.zph(fit)`

Report a hazard ratio with 95% CI.
PH violated? → time-varying coefficient or stratify.
Interpret HR only after checking for non-informative censoring.

28.65 Decision curves, NRI, IDI

Decision curve: net benefit vs threshold probability — dominates “treat all” and “treat none” when useful.
NRI: how many events move up / non-events move down in risk.
IDI: mean change in predicted probability by class.
Report decision curve first; NRI/IDI as secondary.

28.66 Explanation vs prediction (Shmueli)

Explanatory	Predictive
Goal: inference about β	Goal: minimise out-of-sample loss
Tools: ANOVA, diagnostics, intervals	Tools: CV, regularisation, ensembles
Metric: interval coverage, bias	Metric: RMSE, AUC, Brier on hold-out

28.67 Reporting guidelines

Design	Guideline
Randomised trial	CONSORT
Observational	STROBE
Diagnostic-accuracy	STARD
Prediction model (incl. AI)	TRIPOD / TRIPOD-AI
Systematic review	PRISMA
Animal research	ARRIVE

28.68 Decision rule

Continuous predictor + continuous outcome → no median splits.
Two-group with repeated measurement → ANCOVA.
New rater / device → Bland-Altman + ICC.
Time-to-event → KM + Cox, always check PH.
Prediction model → TRIPOD checklist at submission.

28.69 Common pitfalls

Reporting κ on near-constant outcomes (high % agreement, low κ).
Quoting median survival that is never reached.
Forgetting to mention PH assumption checking.
Claiming a “prediction model” built on the same data used for evaluation.

28.70 Further reading

Harrell, BBR, ch. 17–18.
Royston & Altman, Prognosis research.

28.71 Design & Causal Inference

28.71.1 Study designs and power## Observational designs

Design	Strengths	Weaknesses
Cohort	temporality, incidence	expensive, loss to follow-up
Case-control	rare outcomes, cheap	recall + selection bias
Cross-sectional	prevalence, screening	no temporality
Case-crossover	within-person comparison	for transient exposures only

STROBE checklist: https://www.strobe-statement.org/

28.72 Trial designs

Type	When
Parallel-group	most common; independent arms
Crossover	stable chronic condition, carry-over washable
Cluster	intervention at group level (schools, clinics)
Factorial	two or more interventions, test interactions
Adaptive	pre-specified modifications based on interim data
Non-inferiority	new treatment not worse than control by margin \(\Delta\)
Equivalence	two-sided non-inferiority

CONSORT for RCTs: https://www.consort-statement.org/

28.73 Bench / translational

Blocking reduces nuisance variation (plate, day, operator).
Factorial tests interactions efficiently.
Split-plot handles two levels of randomisation.
Pseudoreplication: technical replicates ≠ biological replicates.

28.74 Power — closed form

library(pwr)
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
           type = "two.sample")
pwr.2p.test(h = ES.h(0.3, 0.2), power = 0.80)
pwr.r.test(r = 0.3, power = 0.80)
pwr.anova.test(k = 4, f = 0.25, power = 0.80)

Effect-size conventions (Cohen): small \(d\) = 0.2, medium 0.5, large 0.8.

28.75 Power — simulation

library(simr)
# Build a pilot model, increase N, or tweak fixef, then:
ps <- powerSim(model, nsim = 500, test = fixed("arm"))
ps

Simulation wins for any design the textbook skips: mixed models, adaptive rules, non-standard outcomes.

28.76 Decision rule

Randomise if you can. If not, draw a DAG and name the biases.
Power calculation before the protocol freeze, not after data collection.
Cluster randomisation → design effect \(1 + (\bar m - 1)\rho\); inflate N.
Bench experiments → treat batch, plate, and operator as random effects.

28.77 Common pitfalls

Planning a cluster RCT without inflating N for design effect.
Borrowing a pilot effect size without acknowledging noise.
Running a non-inferiority trial as if it were superiority.
Treating technical replicates as if they were biological.

28.78 Further reading

Matthews, Introduction to Randomized Controlled Clinical Trials.
Chow & Liu, Design and Analysis of Clinical Trials.

28.78.1 Missing data, mixed models, and longitudinal## Missingness mechanisms

Mechanism	Definition	Complete-case analysis?
MCAR	missingness independent of data	unbiased but inefficient
MAR	missingness depends on observed data	biased, imputation helps
MNAR	missingness depends on unobserved values	imputation under model only

Test is impossible without assumptions — reason about the mechanism.

28.79 Multiple imputation with `mice`

library(mice)
imp <- mice(df, m = 20, seed = 42, printFlag = FALSE)
fit <- with(imp, lm(y ~ x1 + x2))
pooled <- pool(fit)
summary(pooled, conf.int = TRUE)

Rule of thumb: m ≥ 100 × fraction of missing information, but 20 is a fine starting point.

28.80 Linear mixed models

library(lme4); library(lmerTest)
lmer(y ~ treatment * time + (1 | subject), data = df)
lmer(y ~ treatment * time + (time | subject), data = df)  # random slope

Random intercept: each cluster has its own baseline.
Random slope: each cluster has its own trajectory.
ICC \(\rho = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2)\).

REML for variance components; ML only for likelihood-ratio tests on fixed effects.

28.81 GLMMs & GEE

library(lme4)
glmer(y ~ x + (1 | cluster), data = df, family = binomial)

library(geepack)
geeglm(y ~ x, id = cluster, data = df,
       family = binomial, corstr = "exchangeable")

GLMM: subject-specific effects, good for prediction.
GEE: population-average effects, robust to working-correlation misspecification (when combined with robust SEs).

28.82 Time series basics

library(forecast)
ts_obj <- ts(x, frequency = 12)
decompose(ts_obj) |> plot()
auto.arima(ts_obj)

library(changepoint)
cpt.mean(x, method = "PELT") |> plot()

28.83 Decision rule

Missing data > 5% → at least sensitivity analysis; prefer MI.
Repeated measurements → mixed model, not repeated-measures ANOVA.
Binary outcome with clusters → GLMM (if subject effects matter) or GEE (if marginal effect is the estimand).
Time series with trend / seasonality → decompose before modelling.

28.84 Common pitfalls

Last-observation-carried-forward; usually biased.
Imputing after transformation rather than in the right scale.
Ignoring random effects when clusters are small in number (degrees of freedom correction).
Quoting GEE coefficients as if they were subject-specific.

28.85 Further reading

van Buuren, Flexible Imputation of Missing Data.
Fitzmaurice, Laird, Ware, Applied Longitudinal Analysis.

28.85.1 Causal inference methods## Time-varying covariates / landmarking

Counting-process data: split follow-up into intervals.
Landmark analysis: define a landmark time \(t^*\), condition on surviving to \(t^*\), classify by status at \(t^*\).

library(survival)
coxph(Surv(tstart, tstop, event) ~ x, data = long)

Immortal-time bias: exposed group can only be exposed if they live long enough to be exposed → spurious survival benefit.

28.86 Competing risks

library(tidycmprsk); library(ggsurvfit)
cif <- cuminc(Surv(time, event_f) ~ x, data = df)
ggsurvfit::ggcuminc(cif, outcome = "event1")

# Fine-Gray subdistribution hazard for event1
crr(Surv(time, event_f) ~ x, data = df, failcode = "event1")

Cause-specific HR answers “given you are event-free, what is the hazard for event k?”
Subdistribution HR answers “what is the hazard for event k in the whole cohort, treating competing events as obstacles?”

28.87 DAGs

library(dagitty); library(ggdag)
dag <- dagitty("dag { X -> Y ; C -> X ; C -> Y }")
adjustmentSets(dag, exposure = "X", outcome = "Y")
impliedConditionalIndependencies(dag)

Adjust for confounders, not colliders.
Adjusting for a mediator estimates the direct effect only.

28.88 Propensity scores / IPTW

library(MatchIt); library(cobalt)
m <- matchit(treat ~ x1 + x2, data = df, method = "nearest")
love.plot(m)   # check balance: SMD < 0.1 is the goal
fit <- lm(y ~ treat, data = match.data(m))

IPTW: weight by \(1/\hat{e}_i\) for treated, \(1/(1-\hat{e}_i)\) for untreated. Stabilised weights stay closer to 1; trim extremes to avoid outliers.

28.89 G-methods, IV, DiD, RDD

Method	Identifying assumption
G-formula / IPTW	positivity + no unmeasured confounding
Instrumental variables	exclusion restriction + relevance
Difference-in-differences	parallel trends
Regression discontinuity	continuity at the cutoff

28.90 Heterogeneous treatment effects

Causal forest (grf::causal_forest) or meta-learners (S-, T-, X-).
Report conditional ATEs with adjusted intervals; avoid spurious subgroup p-values.

28.91 Decision rule

Treatment assigned over time → time-varying Cox.
Multiple competing events → cumulative incidence, Fine-Gray.
Observational causal question → DAG first, choose identification strategy second.
Want the whole distribution of treatment effect? → causal forest.

28.92 Common pitfalls

Adjusting for a post-treatment variable (induces collider bias).
Quoting HR from a Cox model that violates PH.
Matching but never checking balance.
Calling IV estimates “generalisable” when they are local to compliers.

28.93 Further reading

Hernán & Robins, Causal Inference: What If.
Therneau & Grambsch, Modeling Survival Data.

28.93.1 Synthesis and pre-registration## Systematic review workflow

PICO: Population, Intervention, Comparator, Outcome.
Register protocol on PROSPERO.
Build search across ≥ 2 databases; peer-review with a librarian.
Two-screener title/abstract and full-text.
Extract, appraise (RoB 2 / ROBINS-I), synthesise.
Report with the PRISMA 2020 checklist and flow diagram.

28.94 Meta-analysis

library(metafor)
dat <- escalc(measure = "RR",
              ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = studies)
fit <- rma(yi, vi, data = dat, method = "REML")
fit
forest(fit, header = TRUE)
funnel(fit); regtest(fit)

Statistic	Meaning
\(\tau^2\)	between-study variance
\(I^2\)	% of total variance due to heterogeneity
Egger’s test	funnel-plot asymmetry (small-study effects)

Fixed-effect only when heterogeneity is effectively zero. Default to random effects.

28.95 Network meta-analysis

library(netmeta)
net <- netmeta(TE, seTE, treat1, treat2, studlab,
               data = dat, sm = "MD", random = TRUE)
netgraph(net)
netrank(net, small.values = "good")    # SUCRA ranking
decomp.design(net)                     # inconsistency

Transitivity + consistency are the core assumptions.

28.96 Infectious disease — SIR / SEIR

library(deSolve)
sir <- function(t, y, p) with(as.list(c(y, p)), {
  N <- S + I + R
  list(c(-beta * S * I / N,
          beta * S * I / N - gamma * I,
          gamma * I))
})
out <- ode(c(S = 999, I = 1, R = 0),
           times = 0:120,
           func = sir,
           parms = c(beta = 0.3, gamma = 0.1))

\(R_0 = \beta / \gamma\). Herd-immunity threshold \(\approx 1 - 1/R_0\).

28.97 Pre-registration and SAP

A pre-registration specifies, before data analysis:

Primary outcome + exact estimand.
Analysis model (fixed terms, adjustments, interactions).
Handling of missing data.
Multiplicity control.
Sample-size justification.

OSF and ClinicalTrials.gov are the two common registries.

28.98 Decision rule

More than a handful of studies on a question → plan a systematic review.
Multiple comparators of interest → network meta-analysis.
Outbreak modelling → start with SIR, extend only as data justify.
Confirmatory analysis → pre-registered; exploratory analyses are labelled as such in the manuscript.

28.99 Common pitfalls

Declaring a review “systematic” without a pre-registered protocol.
Pooling studies with fundamentally different populations / outcomes.
Interpreting SUCRA rankings as if they were certain.
Calling a compartmental model a “forecast” without uncertainty.

28.100 Further reading

Higgins et al., Cochrane Handbook for Systematic Reviews.
Keeling & Rohani, Modeling Infectious Diseases in Humans and Animals.
Nosek et al. (2018), The preregistration revolution.

28.101 Machine Learning & High-Dimensional Data

28.101.1 Cross-validation, regularisation, and unsupervised## Cross-validation

Scheme	Use
k-fold (k = 5 or 10)	baseline, fast
Repeated k-fold	reduces Monte Carlo noise
Leave-one-out	small n, high variance
Stratified k-fold	imbalanced class labels
Grouped / blocked	correlated units (patients, sites)
Nested	honest evaluation of tuned models

Nested CV is the only defensible way to estimate generalisation error when you are also tuning hyperparameters.

28.102 Regularisation

library(glmnet)
x <- model.matrix(y ~ . - 1, df); y <- df$y
fit <- cv.glmnet(x, y, alpha = 1)     # alpha = 1 = lasso; 0 = ridge
coef(fit, s = "lambda.1se")           # sparse solution
plot(fit)                              # CV curve

lambda.min: empirical minimum of CV error.
lambda.1se: more regularised, simpler, within 1 SE of min.
Elastic net: alpha between 0 and 1 blends ridge and lasso.

28.103 Multivariate workhorses

Method	Purpose
PCA	variance → low-d summary; `prcomp(x, scale. = TRUE)`
FA	latent-variable modelling; `psych::fa`
CCA	correlation structure between two sets
LDA	supervised low-d projection; `MASS::lda`

pc <- prcomp(x, scale. = TRUE)
summary(pc)   # proportion of variance per PC

Scale before PCA when variables are on different units.

28.104 Clustering

km <- kmeans(scale(x), centers = 4, nstart = 25)
hc <- hclust(dist(scale(x)), method = "ward.D2")
library(mclust); bic <- mclustBIC(x)      # model-based, BIC-selected

Choose k by silhouette, gap statistic, or BIC.
Scale first. K-means assumes round clusters of similar size.

28.105 Non-linear embeddings

library(uwot)
umap_xy <- umap(scale(x), n_neighbors = 15, min_dist = 0.1)

library(Rtsne)
ts <- Rtsne(scale(x), perplexity = 30)$Y

Embeddings are visualisations, not distances you should input into downstream stats. Global geometry is often distorted.

28.106 Decision rule

Lots of predictors, small n → lasso; report lambda.1se.
Exploring structure → PCA + UMAP on scaled data.
Clustering for biology → model-based + sensitivity to k.
Tuning anything → nested CV.

28.107 Common pitfalls

Selecting features on full data, then estimating error by k-fold.
PCA before scaling when features are on different units.
Interpreting UMAP distances as real distances.
Using AUC on rank-ordered hold-out error as if it were CV error.

28.108 Further reading

Hastie, Tibshirani, Friedman, The Elements of Statistical Learning.
James et al., ISLR2.

28.108.1 Trees, boosting, and interpretability## Tree-based models

Model	R	Strengths
CART	`rpart::rpart`	interpretable single tree
Random forest	`ranger::ranger`	robust, OOB error, variable importance
XGBoost	`xgboost::xgb.train`	top tabular performance, careful tuning
LightGBM	`lightgbm::lgb.train`	fast on large data

library(ranger)
fit <- ranger(y ~ ., data = df, importance = "permutation",
              num.trees = 500, mtry = floor(sqrt(ncol(df) - 1)))
fit$prediction.error   # OOB error

28.109 Interpretability

library(DALEX); library(iml)

ex  <- explain(fit, data = df |> select(-y), y = df$y)
ip  <- model_parts(ex)           # permutation importance
pd  <- model_profile(ex, variables = "x1")   # partial dependence
sh  <- Shapley$new(Predictor$new(fit, data = df, y = df$y),
                   x.interest = df[1, ])     # SHAP

PDP assumes feature independence; if features are correlated, use ALE plots.

28.110 Tabular neural networks with `torch`

# sketch only — typically in a loop on GPU
library(torch)
nn <- nn_sequential(
  nn_linear(p, 64), nn_relu(),
  nn_linear(64, 1)
)
opt <- optim_adam(nn$parameters)

On small biomedical tabular data, boosted trees usually beat NN. Use NN when you need end-to-end training with images, sequences, or text.

28.111 tidymodels pipeline

library(tidymodels)
rec  <- recipe(y ~ ., data = df) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

mod  <- rand_forest(mtry = tune(), trees = 500) |>
  set_engine("ranger") |> set_mode("classification")

wf   <- workflow() |> add_recipe(rec) |> add_model(mod)
cv   <- vfold_cv(df, v = 5, strata = y)
tuned <- tune_grid(wf, cv, grid = 20)
collect_metrics(tuned)

28.112 Decision rule

Tabular, < 100k rows → gradient-boosted trees.
Need explanations → SHAP + PDP, validate with held-out.
Images / sequences → CNN / transformer, use torch.
Reproducibility → tidymodels pipeline + fixed seed + locked recipe.

28.113 Common pitfalls

Reporting feature importance from a model trained on the full data.
Interpreting PDPs when features are strongly correlated.
Claiming NN superiority without CV on the same split as a tree model.
Forgetting to set a seed for any tune / split operation.

28.114 Further reading

Biecek & Burzykowski, Explanatory Model Analysis.
Molnar, Interpretable Machine Learning.

28.114.1 Bayesian and survival ML## Bayesian thinking

\(\underbrace{p(\theta \mid y)}_{\text{posterior}} \propto \underbrace{p(y \mid \theta)}_{\text{likelihood}} \underbrace{p(\theta)}_{\text{prior}}\)

Priors are part of the model — specify and defend them.
Posterior summaries: mean, median, 95% credible interval.
No p-values; use posterior probability of direction, Bayes factors, or LOO.

28.115 `brms` / Stan

library(brms)
fit <- brm(y ~ x + (1 | group), data = df,
           prior = c(prior(normal(0, 1), class = "b"),
                     prior(exponential(1), class = "sd")),
           chains = 4, iter = 2000, seed = 42)

summary(fit)
loo(fit)               # leave-one-out cross-validation
pp_check(fit)           # posterior predictive checks

Prior predictive check before fitting: simulate \(y\) from the prior and confirm the implications are plausible.

28.116 Biomarker statistics

Question	Statistic
Does a biomarker classify?	AUC, cut-point (Youden’s index)
Does it add over an existing model?	ΔAUC, NRI, IDI, decision curves
Does it move prognosis?	Calibration plus discrimination
Is a cut-off reproducible?	Bootstrap CI on the cut-off

pROC::roc(y, biomarker) |>
  pROC::coords("best", best.method = "youden")

28.117 Survival ML

Model	R
Random survival forest	`randomForestSRC::rfsrc(Surv(...) ~ ., data)`
Gradient-boosted Cox	`xgboost` with `objective = "survival:cox"`
DeepSurv (conceptual)	`torch` with partial-likelihood loss

Evaluate with time-dependent AUC, integrated Brier score, and IPA (index of prediction accuracy = 1 − Brier(model) / Brier(null)).

library(timeROC)
r <- timeROC(T = df$time, delta = df$event,
             marker = pred, cause = 1, times = c(365, 730))

28.118 External validation

Never trust the apparent performance on training data.
Minimum: split-sample or CV on internal data.
Better: external validation in a second cohort.
Report calibration slope / intercept, discrimination, and decision curves.

28.119 Decision rule

Rare outcome with need for uncertainty → Bayesian, not bootstrap of MLE.
New biomarker → NRI + decision curve, not just ΔAUC.
Survival prediction → time-dependent Brier / IPA, not global AUC.
Any clinical claim → external validation before publication.

28.120 Common pitfalls

Reporting a posterior with default flat priors on unscaled predictors.
Selecting the Youden cut-off on the full data and using the same data to evaluate sensitivity / specificity.
Quoting C-statistic at a single time point and calling it survival ML.
Publishing a prediction model without TRIPOD-compliant reporting.

28.121 Further reading

Gelman et al., Bayesian Data Analysis, 3e.
Royston & Altman, External validation of a Cox prognostic model.

28.121.1 Omics, FDR, and TRIPOD-AI## Bulk RNA-seq

library(DESeq2)
dds <- DESeqDataSetFromMatrix(counts, coldata, ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "B", "A"))

library(edgeR)
dge <- DGEList(counts, group = coldata$condition)
dge <- calcNormFactors(dge)
design <- model.matrix(~ condition, coldata)
fit <- glmQLFit(estimateDisp(dge, design), design)
qlf <- glmQLFTest(fit, coef = 2)
topTags(qlf)

Package	Distribution	Notes
DESeq2	negative binomial	shrinks LFCs; standard for small cohorts
edgeR	negative binomial	QL-F test has good type-I control
limma-voom	mean-variance weights	fast on very large studies

28.122 Enrichment

library(fgsea)
fgsea(pathways, stats = stat_vector, nperm = 10000) |>
  arrange(padj)

# Over-representation with a hypergeometric test
phyper(k - 1, K, N - K, n, lower.tail = FALSE)

Always pre-rank by a signed statistic (log-FC × -log10 p), not by raw p alone.

28.123 Single-cell RNA-seq (Seurat pattern)

library(Seurat)
so <- CreateSeuratObject(counts) |>
  NormalizeData() |>
  FindVariableFeatures() |>
  ScaleData() |>
  RunPCA() |>
  FindNeighbors() |>
  FindClusters(resolution = 0.4) |>
  RunUMAP(dims = 1:20)
DimPlot(so)

28.124 FDR, knockoffs, replication

Correction	When
Bonferroni	few planned tests, strict control
Benjamini-Hochberg	many tests, accept proportion of false discoveries
Knockoffs	controlled variable selection with FDR
Permutation / SEA	small n, non-standard models

p.adjust(pvals, method = "BH")

Replication crisis shorthand: pre-register + report effect sizes + share data. Power, not p-values.

28.125 TRIPOD-AI + fairness

TRIPOD-AI extends TRIPOD for prediction models built with ML.
Report: data provenance, training/evaluation splits, subgroup performance, calibration, external validation.
Fairness: stratify AUC / calibration by sex, ancestry, age bands. Disparities in calibration are often larger than in AUC.

28.126 Reproducibility at scale with `targets`

# _targets.R
library(targets)
tar_option_set(packages = c("tidyverse"))
list(
  tar_target(raw, read_csv("data/raw.csv")),
  tar_target(clean, clean_data(raw)),
  tar_target(fit, fit_model(clean)),
  tar_target(report, quarto::quarto_render("report.qmd"))
)

Rscript -e 'targets::tar_make()'

28.127 Decision rule

Bulk DE → DESeq2 for small samples, edgeR/limma for large.
Many hypotheses → BH by default; knockoffs for variable selection.
scRNA-seq → Seurat pipeline with batch correction + cluster annotation.
ML prediction model → TRIPOD-AI + subgroup analysis.
Any multi-step analysis → targets pipeline.

28.128 Common pitfalls

Running DE on a pathway-level aggregate rather than gene-level.
Cherry-picking a pathway database post-hoc.
Over-clustering scRNA-seq data until “known” groups appear.
Calling an ML model “fair” after checking only overall AUC.
Submitting a paper without a sessionInfo() or a lock file.

28.129 Further reading

Love, Huber, Anders (2014), Moderated estimation of fold change…
Collins et al. (2024), TRIPOD-AI.
Peng, Reproducible Research with R.

This book was built by the bookdown R package.

27 Writing a report

Colophon