Multiple Linear Regression

multiple-regression
vif
multicollinearity
ols
Modelling a continuous outcome from two or more predictors, with VIF, homoscedasticity checks, and interpretation
Published

April 17, 2026

Research question

Multiple linear regression models a continuous outcome as a linear combination of several predictors. Biomedical example: in a cardiovascular risk cohort, does LDL cholesterol predict intima-media thickness (IMT) after adjusting for age, sex, BMI, and smoking status?

Assumptions

Assumption                        | How to verify in R
----------------------------------|-----------------------------------------------------------
Linearity in each predictor       | Component-plus-residual plots (car::crPlots())
Independence of residuals         | Judged from the study design (no clustering or repeated measures)
Homoscedasticity                  | performance::check_heteroscedasticity(), car::ncvTest()
Approximately normal residuals    | Q-Q plot, shapiro.test(residuals(fit))
No severe multicollinearity       | car::vif(); VIF > 5 flags concern, > 10 is severe
No extreme high-influence points  | cooks.distance(), hatvalues() for leverage
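
Several of these checks need nothing beyond base R. The following is a minimal sketch on simulated stand-in data; the object names (fit_demo, flag_cook, flag_lev) and the screening cut-offs 4/n and 2p/n are illustrative assumptions (common conventions, not strict rules), not part of the analysis on this page.

```r
# Simulated stand-in data (hypothetical values, not a real cohort)
set.seed(1)
n   <- 200
age <- rnorm(n, 58, 9)
ldl <- rnorm(n, 3.2, 0.8)
imt <- 0.40 + 0.004 * age + 0.04 * ldl + rnorm(n, 0, 0.08)
fit_demo <- lm(imt ~ age + ldl)

# Normality of residuals: Shapiro-Wilk test (a complement to the Q-Q plot)
sw <- shapiro.test(residuals(fit_demo))

# Influence: Cook's distance and leverage (hat values)
cd  <- cooks.distance(fit_demo)
lev <- hatvalues(fit_demo)
p   <- length(coef(fit_demo))   # number of estimated coefficients

# Screening cut-offs (assumed conventions); flagged rows deserve a closer look
flag_cook <- which(cd > 4 / n)
flag_lev  <- which(lev > 2 * p / n)

# Visual checks: residuals vs fitted (linearity, constant variance) and Q-Q plot
# plot(fit_demo, which = 1:2)
```

Flagged observations are candidates for inspection, not automatic exclusion.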

Hypotheses

For each coefficient, holding the other predictors fixed: \(H_0: \beta_j = 0\) vs. \(H_1: \beta_j \ne 0\). The overall F test evaluates \(H_0: \beta_1 = \dots = \beta_p = 0\) against the alternative that at least one slope is non-zero.
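
One way to see what the overall F test does is to compare the fitted model with an intercept-only model. The sketch below uses simulated stand-in data with hypothetical object names (full, null, f_cmp); it is an illustration, not the analysis reported on this page.

```r
# Simulated stand-in data (hypothetical values)
set.seed(2)
n   <- 200
age <- rnorm(n, 58, 9)
ldl <- rnorm(n, 3.2, 0.8)
imt <- 0.40 + 0.004 * age + 0.04 * ldl + rnorm(n, 0, 0.08)

full <- lm(imt ~ age + ldl)
null <- lm(imt ~ 1)   # intercept-only model: every slope fixed at zero

# Overall F test written as a nested-model comparison
f_cmp <- anova(null, full)

# The same statistic as reported by summary(): value, numerator df, denominator df
f_overall <- summary(full)$fstatistic

# Per-coefficient t tests of H0: beta_j = 0
coef_table <- summary(full)$coefficients
```

The F value in the second row of f_cmp matches the one stored in f_overall, which is why the overall test is often described as "model versus no predictors".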

R code

library(tidyverse)    # tibble(), mutate()
library(car)          # vif()
library(broom)        # tidy(), glance()
library(performance)  # check_model()
library(gtsummary)    # tbl_regression()
set.seed(42)          # reproducible simulation

# Simulate a cohort of 250 with known true coefficients
cv <- tibble(
  age   = rnorm(250, 58, 9),
  sex   = factor(sample(c("F", "M"), 250, replace = TRUE)),
  bmi   = rnorm(250, 27, 4),
  ldl   = rnorm(250, 3.2, 0.8),
  # Explicit levels make "Never" the reference category in the model
  smoke = factor(sample(c("Never", "Former", "Current"), 250, replace = TRUE,
                        prob = c(0.55, 0.30, 0.15)),
                 levels = c("Never", "Former", "Current"))
) |>
  mutate(imt = 0.40 + 0.004 * age + 0.05 * (sex == "M") + 0.006 * bmi +
               0.04 * ldl + 0.03 * (smoke == "Current") + rnorm(250, 0, 0.08))

fit <- lm(imt ~ age + sex + bmi + ldl + smoke, data = cv)

broom::tidy(fit, conf.int = TRUE)
broom::glance(fit)

# Diagnostics
car::vif(fit)      # reports generalised VIFs because smoke has more than one df
check_model(fit)   # from performance; the plots require the see package

# Publication-ready table
tbl_regression(fit, intercept = TRUE) |>
  add_global_p() |>
  add_glance_source_note()

Interpreting the output

The tidy output lists each coefficient with its standard error, t statistic, p value, and 95 % CI. Adjusted for the other predictors, a one-unit increase in LDL is associated with a 0.041-unit higher IMT (\(b = 0.041\), 95 % CI 0.028 to 0.054, \(p < .001\)). The model explains 38 % of the variance in IMT (\(R^2 = 0.38\)), and the overall test is \(F(6, 243) = 24.8\), \(p < .001\). VIFs are below 2 for every predictor, so multicollinearity is not a concern.
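
To make the VIF less of a black box: for a single-df predictor j it equals 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes this by hand on simulated stand-in data; vif_manual() is a hypothetical helper written for this illustration, and the mild age-BMI relationship is an assumption. For multi-df terms such as a three-level factor, car::vif() reports a generalised VIF instead, which this hand calculation does not cover.

```r
# Simulated stand-in predictors (hypothetical values)
set.seed(3)
n   <- 250
age <- rnorm(n, 58, 9)
bmi <- 20 + 0.1 * age + rnorm(n, 0, 4)   # assumption: BMI mildly related to age
ldl <- rnorm(n, 3.2, 0.8)
X   <- data.frame(age, bmi, ldl)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j) from one auxiliary regression per predictor
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    aux    <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(X)
```

A value of 1 means the predictor is uncorrelated with the rest; the standard error of its coefficient grows roughly with the square root of the VIF.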

Effect size

For the model: \(R^2\), adjusted \(R^2\), and Cohen’s \(f^2 = R^2 / (1 - R^2)\). For individual predictors: the standardised beta and either the partial \(\eta^2\) or the semipartial \(r\). Conventional thresholds for Cohen’s \(f^2\): 0.02 (small), 0.15 (medium), 0.35 (large).
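
A short sketch of how these quantities can be obtained in base R, using simulated stand-in data. The helper cohen_f2() and the object names are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: Cohen's f^2 for a whole model from its R^2
cohen_f2 <- function(r2) r2 / (1 - r2)

# Simulated stand-in data (hypothetical values)
set.seed(4)
n   <- 250
age <- rnorm(n, 58, 9)
ldl <- rnorm(n, 3.2, 0.8)
imt <- 0.40 + 0.004 * age + 0.04 * ldl + rnorm(n, 0, 0.08)
dat <- data.frame(imt, age, ldl)

fit_demo <- lm(imt ~ age + ldl, data = dat)
f2 <- cohen_f2(summary(fit_demo)$r.squared)

# Standardised betas: refit after z-scoring the outcome and the predictors
fit_std  <- lm(imt ~ age + ldl, data = as.data.frame(scale(dat)))
std_beta <- coef(fit_std)[-1]

# The same values by rescaling the raw slopes: b * sd(x) / sd(y)
std_beta_check <- coef(fit_demo)[-1] *
  sapply(dat[c("age", "ldl")], sd) / sd(dat$imt)
```

Applied to the \(R^2 = 0.38\) reported above, cohen_f2(0.38) gives about 0.61, which is above the 0.35 threshold for a large effect.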

Reporting (APA 7)

After adjusting for age, sex, BMI, and smoking status, LDL cholesterol was independently associated with IMT (b = 0.041, SE = 0.006, p < .001, 95 % CI [0.028, 0.054]). The full model explained 38 % of the variance in IMT (\(R^2\) = .38, adjusted \(R^2\) = .37, F(6, 243) = 24.8, p < .001). Variance inflation factors were all below 2.

Common pitfalls

  • Omitted-variable bias: important confounders not in the model bias the remaining coefficients.
  • Reporting unstandardised coefficients without units: a slope reads as "change in outcome per one-unit change in predictor" only when both units are stated.
  • Multicollinearity inflates SEs and destabilises coefficients; check VIF and consider dropping or combining predictors.
  • Over-fitting: as a rough rule, keep the number of predictors below \(n / 15\) (about 16 for \(n = 250\)).
  • Dichotomising a continuous predictor for convenience loses information and power.
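
The cost of dichotomising can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are simulated with a truly linear effect, the sample is deliberately large so the comparison is stable, and the median split and object names are hypothetical choices made for this example.

```r
# Simulated stand-in data with a truly linear LDL effect (hypothetical values)
set.seed(5)
n   <- 2000
ldl <- rnorm(n, 3.2, 0.8)
imt <- 0.60 + 0.04 * ldl + rnorm(n, 0, 0.08)

# Continuous predictor vs. the same predictor cut at its median
fit_cont  <- lm(imt ~ ldl)
ldl_high  <- as.integer(ldl > median(ldl))   # 0 = at or below median, 1 = above
fit_split <- lm(imt ~ ldl_high)

# Variance explained and test statistic under each coding
r2_cont  <- summary(fit_cont)$r.squared
r2_split <- summary(fit_split)$r.squared
t_cont   <- summary(fit_cont)$coefficients["ldl", "t value"]
t_split  <- summary(fit_split)$coefficients["ldl_high", "t value"]
```

Under these assumptions the split version explains less variance and yields a smaller t statistic, which is the loss of information and power the bullet refers to.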

Parametric vs. non-parametric alternative

  • Robust regression (MASS::rlm, robustbase::lmrob) for outlier-robust estimation.
  • Quantile regression (quantreg::rq) for non-mean summaries.
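  • When the outcome is not continuous, switch to a model that matches it: logistic regression for binary outcomes, Poisson regression for counts, or ordinal regression for ordered categories.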
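
As a minimal sketch of the first alternative, the example below contrasts lm() with MASS::rlm() on a deterministic toy dataset containing one planted outlier. The data, the outlier position, and the object names are hypothetical; rlm() is used with its default Huber M-estimator.

```r
library(MASS)  # recommended package that ships with standard R installations

# Toy data (hypothetical): true line y = 2 + 0.5 * x plus small deterministic noise
x <- 1:20
y <- 2 + 0.5 * x + 0.3 * sin(x)
y[18] <- y[18] + 30   # plant one gross outlier

fit_ols <- lm(y ~ x)
fit_rob <- rlm(y ~ x)   # Huber M-estimation, fitted by iteratively reweighted least squares

slope_ols <- unname(coef(fit_ols)["x"])
slope_rob <- unname(coef(fit_rob)["x"])

# Final IRLS weight of the planted outlier; values below 1 mean it was down-weighted
w_outlier <- fit_rob$w[18]
```

In this toy setting the robust slope stays closer to the true value of 0.5 than the OLS slope, because the outlier is down-weighted rather than removed.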

Structure inspired by the University of Zurich Methodenberatung (methodenberatung.uzh.ch). All text, examples, R code, and reporting sentences are independently authored in English.