Hypothesis testing as a game between nature and the statistician

statistical-foundations

hypothesis-testing

minimax

Reframe classical hypothesis testing as a two-player minimax game in which nature chooses the true state and the statistician selects a decision rule. Derive the Neyman-Pearson lemma as the minimax optimal strategy and explore power analysis as strategic optimisation.

Author

Raban Heller

Published

May 8, 2026

Keywords

hypothesis testing, minimax, Neyman-Pearson, power analysis, game theory, decision theory

_common.R — shared setup for all #equilibria tutorials

Source this at the top of every article: source(here::here(“R”, “_common.R”))

suppressPackageStartupMessages({ library(tidyverse) library(here) library(scales) library(knitr) library(kableExtra) })

Source publication theme and helpers

source(here::here(“R”, “theme_publication.R”)) source(here::here(“R”, “plotly_helpers.R”))

Global knitr options

knitr::opts_chunk$set( fig.align = “center”, fig.retina = 2, out.width = “100%”, dpi = 300, dev = c(“png”, “pdf”), fig.path = “figures/” )

Okabe-Ito colorblind-safe palette

okabe_ito <- c( “#E69F00”, “#56B4E9”, “#009E73”, “#F0E442”, “#0072B2”, “#D55E00”, “#CC79A7”, “#999999” )

Set default ggplot theme

theme_set(theme_publication())

Introduction & motivation

Statisticians routinely talk about hypothesis tests in procedural terms: choose a significance level, compute a test statistic, compare to a critical value. But underneath this recipe lies a deep strategic structure. The data-generating process can be viewed as a move by Nature, who selects the true parameter value from the parameter space. The statistician responds by choosing a decision rule — accept or reject the null hypothesis — that minimises the worst-case error. This is precisely a two-player zero-sum game, and the minimax theorem guarantees the existence of an optimal strategy.

Viewing hypothesis testing through a game-theoretic lens clarifies several otherwise opaque concepts. The significance level $\alpha$ becomes a budget constraint on the statistician’s strategy set. The Neyman–Pearson lemma emerges not as an ad-hoc optimality result but as the unique minimax solution for simple-versus-simple testing. Power analysis — choosing sample sizes to achieve a target detection rate — is revealed as a form of strategic resource allocation against an adversarial Nature who may place the true effect size at the hardest-to-detect value. This perspective unifies frequentist testing, Wald’s statistical decision theory, and modern minimax estimation under a single conceptual roof.

This tutorial formalises the game, derives the optimal test as a likelihood ratio, and visualises the resulting trade-offs between Type I and Type II error across a range of effect sizes and sample sizes.

Mathematical formulation

Define a two-player game $\Gamma = (\Theta, \mathcal{D}, L)$ where:

Nature chooses $\theta \in \Theta = \{\theta_0, \theta_1\}$ (null vs. alternative),
Statistician chooses a decision rule $\delta: \mathcal{X} \to \{0, 1\}$ from the set $\mathcal{D}$ of all measurable decision functions,
Loss $L(\theta, \delta)$ penalises errors: $L(\theta_0, 1) = 1$ (Type I), $L(\theta_1, 0) = 1$ (Type II), and $L = 0$ otherwise.

The statistician’s risk under $\theta$ is $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$. The minimax decision rule solves:

\[ \delta^* = \arg\min_{\delta \in \mathcal{D}} \max_{\theta \in \Theta} R(\theta, \delta) \]

For testing $H_0: X \sim N(\mu_0, \sigma^2)$ against $H_1: X \sim N(\mu_1, \sigma^2)$, the Neyman–Pearson lemma states that the most powerful test at level $\alpha$ rejects when the likelihood ratio exceeds a threshold:

\[ \Lambda(x) = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} > k_\alpha \]

The power of this test is:

\[ \beta(\mu_1) = P\!\left(\bar{X} > \mu_0 + z_{1-\alpha} \cdot \frac{\sigma}{\sqrt{n}} \;\middle|\; \mu = \mu_1\right) = \Phi\!\left(\frac{\mu_1 - \mu_0}{\sigma / \sqrt{n}} - z_{1-\alpha}\right) \]

R implementation

We compute the power function across a grid of effect sizes $\delta = (\mu_1 - \mu_0) / \sigma$ and sample sizes $n$, visualising how the statistician’s detection ability improves with more data.

alpha <- 0.05
z_alpha <- qnorm(1 - alpha)
effect_sizes <- seq(0.1, 1.5, by = 0.05)
sample_sizes <- c(10, 30, 50, 100, 200)

power_df <- expand.grid(delta = effect_sizes, n = sample_sizes) |>
  as_tibble() |>
  mutate(
    power = pnorm(delta * sqrt(n) - z_alpha),
    n_label = paste0("n = ", n)
  )

# Show a summary table for delta = 0.5
power_df |>
  filter(abs(delta - 0.5) < 0.01) |>
  select(n_label, delta, power) |>
  knitr::kable(
    digits = 3,
    col.names = c("Sample size", "Effect size (delta)", "Power"),
    caption = "Power at delta = 0.5 for various sample sizes (alpha = 0.05)"
  )

Power at delta = 0.5 for various sample sizes (alpha = 0.05)
Sample size	Effect size (delta)	Power
n = 10	0.5	0.475
n = 30	0.5	0.863
n = 50	0.5	0.971
n = 100	0.5	1.000
n = 200	0.5	1.000

Static publication-ready figure

The power curves below map the statistician’s strategic capacity: for each sample size, the curve traces the probability of correctly rejecting $H_0$ as a function of the true effect size. Larger samples shift the curve leftward, making even small effects detectable.

p_static <- ggplot(power_df, aes(x = delta, y = power, colour = n_label,
                                  text = paste0("Effect: ", round(delta, 2),
                                                "\nPower: ", round(power, 3),
                                                "\n", n_label))) +
  geom_line(linewidth = 0.9) +
  geom_hline(yintercept = 0.8, linetype = "dashed", colour = "grey50") +
  annotate("text", x = 1.4, y = 0.83, label = "80% power",
           colour = "grey40", size = 3.5) +
  scale_colour_manual(values = okabe_ito[1:5]) +
  labs(
    x = expression("Effect size " * delta * " = (" * mu[1] - mu[0] * ") / " * sigma),
    y = "Power (1 - Type II error)",
    colour = NULL,
    title = "Power as a strategic resource",
    subtitle = expression("One-sided z-test at " * alpha * " = 0.05")
  ) +
  coord_cartesian(ylim = c(0, 1)) +
  theme_publication()

save_pub_fig(p_static, "figures/hypothesis-testing-static")

Error in `ggplot2::ggsave()`:
! Cannot find directory 'figures'.
ℹ Please supply an existing directory or use `create.dir = TRUE`.

p_static

Figure 1: Figure 1. Power curves for the one-sided z-test at alpha = 0.05 across five sample sizes. The dashed horizontal line marks the conventional 80% power threshold. Larger samples enable detection of smaller effects. Okabe-Ito palette.

Interactive figure

Hover over the power curves to read off the exact power at any effect size and sample size combination. This is especially useful for sample-size planning, where the analyst needs to find the smallest $n$ that pushes the curve above the 80% threshold for a given expected effect.

to_plotly_pub(p_static, tooltip = c("text"))

Figure 2

Interpretation

The power curves make the strategic structure of hypothesis testing visually explicit. Nature’s strongest move is to place the true effect size just above zero, where the statistician’s test has almost no power regardless of sample size. The statistician’s counter-strategy is to increase $n$, which compresses the null and alternative sampling distributions apart and shifts the power curve leftward. At $n = 200$, even a modest effect of $\delta = 0.3$ is detected with over 80% probability, whereas $n = 10$ requires an effect exceeding $\delta = 0.9$ to reach the same threshold.

This framing also illuminates the role of $\alpha$. The significance level is not an arbitrary convention but a constraint on the statistician’s strategy set — it bounds the probability of a false alarm. Relaxing $\alpha$ (raising it toward 0.10) uniformly increases power at the cost of more Type I errors, exactly as a player might trade defensive solidity for offensive capability. The Neyman–Pearson lemma guarantees that the likelihood ratio test extracts maximal power from the available $\alpha$ budget, making it the unique minimax-optimal response to Nature’s choice.

A limitation of the simple normal model is that it assumes known variance. When $\sigma$ is estimated, the test statistic follows a $t$-distribution and the power formula requires noncentral $t$ calculations. This extension is straightforward but shifts the power curves slightly downward for small $n$.

References

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:

@online{heller2026,
  author = {Heller, Raban},
  title = {Hypothesis Testing as a Game Between Nature and the
    Statistician},
  date = {2026-05-08},
  url = {https://r-heller.github.io/equilibria/tutorials/statistical-foundations/hypothesis-testing-game-theoretic/},
  langid = {en}
}

For attribution, please cite this work as:

Heller, Raban. 2026. “Hypothesis Testing as a Game Between Nature and the Statistician.” May 8. https://r-heller.github.io/equilibria/tutorials/statistical-foundations/hypothesis-testing-game-theoretic/.

--- # ============================================================================ # #equilibria — Article Frontmatter # ============================================================================ title: "Hypothesis testing as a game between nature and the statistician" description: > Reframe classical hypothesis testing as a two-player minimax game in which nature chooses the true state and the statistician selects a decision rule. Derive the Neyman-Pearson lemma as the minimax optimal strategy and explore power analysis as strategic optimisation. author: "Raban Heller" date: 2026-05-08 categories: - statistical-foundations - hypothesis-testing - minimax keywords: ["hypothesis testing", "minimax", "Neyman-Pearson", "power analysis", "game theory", "decision theory"] labels: ["statistics", "minimax", "neyman-pearson"] tier: 1 bibliography: ../../../references.bib vgwort: "TODO_VGWORT_statistical-foundations_hypothesis-testing-game-theoretic" image: thumbnail.png image-alt: "ROC-style plot showing the trade-off between Type I and Type II error as a strategic frontier" citation: type: webpage url: https://r-heller.github.io/equilibria/tutorials/statistical-foundations/hypothesis-testing-game-theoretic/ license: "CC BY-SA 4.0" draft: false has_static_fig: true has_interactive_fig: true ---    {{< include ../../../R/_common.R >}} ```{r} #| label: setup #| include: false source(here::here("R", "_common.R")) library(ggplot2) library(dplyr) library(tidyr) library(plotly) ``` ## Introduction & motivation Statisticians routinely talk about hypothesis tests in procedural terms: choose a significance level, compute a test statistic, compare to a critical value. But underneath this recipe lies a deep strategic structure. The data-generating process can be viewed as a move by Nature, who selects the true parameter value from the parameter space. The statistician responds by choosing a decision rule --- accept or reject the null hypothesis --- that minimises the worst-case error. This is precisely a two-player zero-sum game, and the minimax theorem guarantees the existence of an optimal strategy. Viewing hypothesis testing through a game-theoretic lens clarifies several otherwise opaque concepts. The significance level $\alpha$ becomes a budget constraint on the statistician's strategy set. The Neyman--Pearson lemma emerges not as an ad-hoc optimality result but as the unique minimax solution for simple-versus-simple testing. Power analysis --- choosing sample sizes to achieve a target detection rate --- is revealed as a form of strategic resource allocation against an adversarial Nature who may place the true effect size at the hardest-to-detect value. This perspective unifies frequentist testing, Wald's statistical decision theory, and modern minimax estimation under a single conceptual roof. This tutorial formalises the game, derives the optimal test as a likelihood ratio, and visualises the resulting trade-offs between Type I and Type II error across a range of effect sizes and sample sizes. ## Mathematical formulation Define a two-player game $\Gamma = (\Theta, \mathcal{D}, L)$ where: - **Nature** chooses $\theta \in \Theta = \{\theta_0, \theta_1\}$ (null vs. alternative), - **Statistician** chooses a decision rule $\delta: \mathcal{X} \to \{0, 1\}$ from the set $\mathcal{D}$ of all measurable decision functions, - **Loss** $L(\theta, \delta)$ penalises errors: $L(\theta_0, 1) = 1$ (Type I), $L(\theta_1, 0) = 1$ (Type II), and $L = 0$ otherwise. The statistician's risk under $\theta$ is $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$. The minimax decision rule solves: $$ \delta^* = \arg\min_{\delta \in \mathcal{D}} \max_{\theta \in \Theta} R(\theta, \delta) $$ For testing $H_0: X \sim N(\mu_0, \sigma^2)$ against $H_1: X \sim N(\mu_1, \sigma^2)$, the **Neyman--Pearson lemma** states that the most powerful test at level $\alpha$ rejects when the likelihood ratio exceeds a threshold: $$ \Lambda(x) = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} > k_\alpha $$ The power of this test is: $$ \beta(\mu_1) = P\!\left(\bar{X} > \mu_0 + z_{1-\alpha} \cdot \frac{\sigma}{\sqrt{n}} \;\middle|\; \mu = \mu_1\right) = \Phi\!\left(\frac{\mu_1 - \mu_0}{\sigma / \sqrt{n}} - z_{1-\alpha}\right) $$ ## R implementation We compute the power function across a grid of effect sizes $\delta = (\mu_1 - \mu_0) / \sigma$ and sample sizes $n$, visualising how the statistician's detection ability improves with more data. ```{r} #| label: power-computation alpha <- 0.05 z_alpha <- qnorm(1 - alpha) effect_sizes <- seq(0.1, 1.5, by = 0.05) sample_sizes <- c(10, 30, 50, 100, 200) power_df <- expand.grid(delta = effect_sizes, n = sample_sizes) |> as_tibble() |> mutate( power = pnorm(delta * sqrt(n) - z_alpha), n_label = paste0("n = ", n) ) # Show a summary table for delta = 0.5 power_df |> filter(abs(delta - 0.5) < 0.01) |> select(n_label, delta, power) |> knitr::kable( digits = 3, col.names = c("Sample size", "Effect size (delta)", "Power"), caption = "Power at delta = 0.5 for various sample sizes (alpha = 0.05)" ) ``` ## Static publication-ready figure The power curves below map the statistician's strategic capacity: for each sample size, the curve traces the probability of correctly rejecting $H_0$ as a function of the true effect size. Larger samples shift the curve leftward, making even small effects detectable. ```{r} #| label: fig-hypothesis-testing-static #| fig-cap: "Figure 1. Power curves for the one-sided z-test at alpha = 0.05 across five sample sizes. The dashed horizontal line marks the conventional 80% power threshold. Larger samples enable detection of smaller effects. Okabe-Ito palette." #| dev: [png, pdf] #| fig-width: 7 #| fig-height: 5 #| dpi: 300 p_static <- ggplot(power_df, aes(x = delta, y = power, colour = n_label, text = paste0("Effect: ", round(delta, 2), "\nPower: ", round(power, 3), "\n", n_label))) + geom_line(linewidth = 0.9) + geom_hline(yintercept = 0.8, linetype = "dashed", colour = "grey50") + annotate("text", x = 1.4, y = 0.83, label = "80% power", colour = "grey40", size = 3.5) + scale_colour_manual(values = okabe_ito[1:5]) + labs( x = expression("Effect size " * delta * " = (" * mu[1] - mu[0] * ") / " * sigma), y = "Power (1 - Type II error)", colour = NULL, title = "Power as a strategic resource", subtitle = expression("One-sided z-test at " * alpha * " = 0.05") ) + coord_cartesian(ylim = c(0, 1)) + theme_publication() save_pub_fig(p_static, "figures/hypothesis-testing-static") p_static ``` ## Interactive figure Hover over the power curves to read off the exact power at any effect size and sample size combination. This is especially useful for sample-size planning, where the analyst needs to find the smallest $n$ that pushes the curve above the 80% threshold for a given expected effect. ```{r} #| label: fig-hypothesis-testing-interactive to_plotly_pub(p_static, tooltip = c("text")) ``` ## Interpretation The power curves make the strategic structure of hypothesis testing visually explicit. Nature's strongest move is to place the true effect size just above zero, where the statistician's test has almost no power regardless of sample size. The statistician's counter-strategy is to increase $n$, which compresses the null and alternative sampling distributions apart and shifts the power curve leftward. At $n = 200$, even a modest effect of $\delta = 0.3$ is detected with over 80% probability, whereas $n = 10$ requires an effect exceeding $\delta = 0.9$ to reach the same threshold. This framing also illuminates the role of $\alpha$. The significance level is not an arbitrary convention but a constraint on the statistician's strategy set --- it bounds the probability of a false alarm. Relaxing $\alpha$ (raising it toward 0.10) uniformly increases power at the cost of more Type I errors, exactly as a player might trade defensive solidity for offensive capability. The Neyman--Pearson lemma guarantees that the likelihood ratio test extracts maximal power from the available $\alpha$ budget, making it the unique minimax-optimal response to Nature's choice. A limitation of the simple normal model is that it assumes known variance. When $\sigma$ is estimated, the test statistic follows a $t$-distribution and the power formula requires noncentral $t$ calculations. This extension is straightforward but shifts the power curves slightly downward for small $n$. ## Extensions & related tutorials The game-theoretic view extends naturally to composite hypotheses, where Nature chooses from a continuum $\Theta_1$ and the minimax test must guard against the least favourable distribution. Sequential testing (Wald's SPRT) adds a temporal dimension: the statistician dynamically decides when to stop sampling, turning the game into a stochastic optimal stopping problem. Multiple testing corrections (Bonferroni, Benjamini--Hochberg) can be interpreted as budget allocation across simultaneous games. - [Building a game theory R package from scratch](../../r-package-development/building-game-theory-r-package/) - [Modelling strategic interaction with VAR models](../../time-series-econometrics/strategic-interaction-var-models/) - [Algorithmic fairness through the lens of game theory](../../ethics-applications/algorithmic-fairness-game-theory/) ## References ::: {#refs} :::