The Bias-Variance Tradeoff
Introduction
Under squared loss, a supervised learner's expected prediction error decomposes into squared bias (the model class is too rigid to represent the truth), variance (sensitivity of the fitted model to the particular training sample), and irreducible noise. The bias-variance tradeoff is the single most important mental model for choosing between a rigid and a flexible estimator.
Prerequisites
Regression basics; expected value and variance of estimators.
Theory
For squared loss and a target \(f(x) + \varepsilon\) with \(\mathbb{E}[\varepsilon] = 0\), \(\text{Var}(\varepsilon) = \sigma^2\):
\[\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \sigma^2.\]
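The decomposition follows in two steps, using the same symbols as above. First split off the noise:
\[\mathbb{E}[(y - \hat{f}(x))^2] = \mathbb{E}[(f(x) + \varepsilon - \hat{f}(x))^2] = \mathbb{E}[(f(x) - \hat{f}(x))^2] + \sigma^2,\]
since \(\varepsilon\) is independent of \(\hat{f}(x)\) and has mean zero, so the cross term vanishes. Then add and subtract \(\mathbb{E}[\hat{f}(x)]\) inside the remaining square and expand:
\[\mathbb{E}[(f(x) - \hat{f}(x))^2] = (\mathbb{E}[\hat{f}(x)] - f(x))^2 + \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2],\]
where the cross term again vanishes because \(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\) has mean zero.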
Complex models tend to have low bias and high variance; simple models the reverse. Regularisation, averaging (bagging), and dropout are all mechanisms for moving along this curve, trading a little bias for a larger reduction in variance.
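The tradeoff can even be computed in closed form for a toy shrinkage estimator. The sketch below (all names and constants are illustrative, not from the text) scales the sample mean of n observations by a factor c: shrinking below 1 adds bias \((c - 1)\mu\) but cuts variance to \(c^2 \sigma^2 / n\), and the total excess MSE is minimised at an intermediate c strictly below 1.

```r
# Illustrative sketch: shrinkage estimator c * xbar of a mean mu.
# Bias is (c - 1) * mu, variance is c^2 * sigma^2 / n; their sum
# (excess MSE) is minimised strictly below c = 1.
mu <- 2; sigma <- 1; n <- 10            # assumed toy values
c_grid <- seq(0, 1.2, by = 0.01)
bias2    <- ((c_grid - 1) * mu)^2       # squared bias at each c
variance <- c_grid^2 * sigma^2 / n      # variance at each c
mse <- bias2 + variance
c_opt <- c_grid[which.min(mse)]
c_opt   # close to the analytic optimum mu^2 / (mu^2 + sigma^2 / n)
```

The analytic minimiser is \(c^\* = \mu^2 / (\mu^2 + \sigma^2/n)\), which is always below 1: accepting some bias is worth it whenever variance is nonzero.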
Assumptions
Squared loss and iid samples; trained estimator is a random function of the training set.
R Implementation
set.seed(2026)

# Ground truth and a fixed test grid
true_f <- function(x) sin(x)
x_test <- seq(0, 2 * pi, length.out = 100)
f_true <- true_f(x_test)

n_reps <- 200   # independent training sets per degree
n_train <- 30   # points per training set

# Refit a degree-`degree` polynomial on n_reps fresh training sets and
# return the average squared bias and average variance over the test grid.
simulate_fit <- function(degree) {
  preds <- replicate(n_reps, {
    x_train <- runif(n_train, 0, 2 * pi)
    y_train <- true_f(x_train) + rnorm(n_train, 0, 0.3)
    mdl <- lm(y_train ~ poly(x_train, degree))
    predict(mdl, newdata = data.frame(x_train = x_test))
  })
  bias <- rowMeans(preds) - f_true    # pointwise bias of the average fit
  variance <- apply(preds, 1, var)    # pointwise variance across refits
  list(bias2 = mean(bias^2), variance = mean(variance))
}

sapply(c(1, 3, 7, 15), simulate_fit)

Output & Results
As polynomial degree grows, bias^2 falls and variance rises; the sum (excess MSE beyond noise) is minimised at an intermediate flexibility.
Interpretation
“Polynomial degree 3 balanced bias (0.12) and variance (0.09), minimising excess MSE; degree 15 reduced bias to 0.001 but inflated variance to 0.45, hurting generalisation.”
Practical Tips
- Overfitting corresponds to high variance (the same structure fits differently across training resamples).
- Underfitting corresponds to high bias (structurally too restrictive to capture the signal).
- Ensembling (bagging, random forests) reduces variance while leaving bias roughly unchanged.
- Regularisation (ridge, lasso) increases bias but reduces variance – tune by CV.
- Deep learning’s “double descent” is a modern complication; the classical curve is the best starting intuition.
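As an illustration of the variance-reduction claim for bagging, the sketch below (setup and constants are assumptions for illustration, reusing the sine target from the simulation above) compares a single high-variance learner with an average of bootstrap refits, measuring how much predictions vary across fresh training sets.

```r
# Illustrative sketch: bagging a high-variance learner (degree-10 polynomial).
set.seed(1)
true_f <- function(x) sin(x)
x_test <- seq(0.5, 2 * pi - 0.5, length.out = 50)  # interior grid, away from unstable edges

# One unbagged fit: degree-10 polynomial, predictions on the test grid
one_fit <- function(x_tr, y_tr) {
  mdl <- lm(y_tr ~ poly(x_tr, 10))
  predict(mdl, newdata = data.frame(x_tr = x_test))
}

# Bagged fit: average B bootstrap refits of the same learner
bagged_fit <- function(x_tr, y_tr, B = 25) {
  rowMeans(replicate(B, {
    idx <- sample(length(x_tr), replace = TRUE)
    one_fit(x_tr[idx], y_tr[idx])
  }))
}

# Average pointwise variance of predictions across fresh training sets
sim_var <- function(fit_fun, n_reps = 60) {
  preds <- replicate(n_reps, {
    x_tr <- runif(30, 0, 2 * pi)
    y_tr <- true_f(x_tr) + rnorm(30, 0, 0.3)
    fit_fun(x_tr, y_tr)
  })
  mean(apply(preds, 1, var))
}

v_single <- sim_var(one_fit)
v_bagged <- sim_var(bagged_fit)
v_bagged < v_single  # bagging should reduce variance here
```

Because the bootstrap fits are positively correlated, averaging does not divide variance by B, but for an unstable learner like a high-degree polynomial the reduction is still substantial.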