Testing game theory code with testthat

r-package-development
testing
testthat
nash-equilibrium
Best practices for testing game-theoretic computations using testthat, including property-based testing of Nash equilibria, dominance checking, payoff validation, and test-driven development for game solvers.
Author

Raban Heller

Published

May 8, 2026

Modified

May 8, 2026

Keywords

testthat, unit testing, Nash equilibrium, property-based testing, TDD, R

Introduction & motivation

Game-theoretic code is peculiarly difficult to get right. A Nash equilibrium solver must simultaneously satisfy multiple interrelated constraints: each player’s strategy must be a best response to the others, probability distributions must be non-negative and sum to one, and the solution must be stable under perturbation. Unlike a sorting algorithm where the output can be checked against a simple ordering property, verifying a game-theoretic computation requires checking that an entire system of optimality conditions holds simultaneously.

The consequences of bugs in game-theoretic code are subtle and dangerous. An equilibrium solver that returns a strategy profile that is “almost” a Nash equilibrium — perhaps one player’s probability is slightly negative, or one best-response condition is violated by a rounding error — will silently produce misleading results. In research contexts, such bugs can lead to incorrect theoretical claims. In applied settings (auction design, spectrum allocation, mechanism design for procurement), they can cost real money.

The testthat framework in R provides a structured approach to catching these errors. It offers three levels of testing granularity: expect_*() functions for individual assertions, test_that() blocks for grouping related assertions into logical units, and describe() blocks for organising tests into high-level behavioural specifications. For game-theoretic code, we additionally need property-based testing — checking that outputs satisfy mathematical invariants (e.g., every Nash equilibrium must be a fixed point of the best-response mapping) rather than just matching expected values.

This tutorial walks through a test-driven development (TDD) workflow for building a simple two-player game solver. We start by writing tests for known solutions (the Prisoner’s Dilemma has a dominant-strategy equilibrium, Matching Pennies has a unique completely mixed equilibrium), then implement the solver to pass those tests. Along the way we demonstrate snapshot testing for complex output structures, tolerance-based comparisons for floating-point equilibrium computations, and a systematic approach to organising game-theoretic tests into a test pyramid. The goal is not just to test code but to develop a testing mindset that treats mathematical properties as first-class test oracles.

Mathematical formulation

Consider a two-player game in normal form. Player 1 has strategy set \(S_1 = \{1, \ldots, m\}\) and Player 2 has \(S_2 = \{1, \ldots, n\}\). The game is specified by payoff matrices \(A \in \mathbb{R}^{m \times n}\) (Player 1) and \(B \in \mathbb{R}^{m \times n}\) (Player 2).

A mixed strategy for Player 1 is \(\sigma_1 \in \Delta^{m-1}\), the \((m-1)\)-simplex:

\[ \sigma_1 \in \Delta^{m-1} = \left\{ p \in \mathbb{R}^m : p_i \geq 0, \sum_{i=1}^m p_i = 1 \right\} \]

A strategy profile \((\sigma_1^*, \sigma_2^*)\) is a Nash equilibrium if:

\[ \sigma_1^{*\top} A \, \sigma_2^* \geq \sigma_1^{\top} A \, \sigma_2^* \quad \forall \, \sigma_1 \in \Delta^{m-1} \] \[ \sigma_1^{*\top} B \, \sigma_2^* \geq \sigma_1^{*\top} B \, \sigma_2 \quad \forall \, \sigma_2 \in \Delta^{n-1} \]

Equivalently, the best-response property requires that every pure strategy in the support of \(\sigma_i^*\) yields the same expected payoff, and no strategy outside the support yields a higher payoff:

\[ (A \, \sigma_2^*)_i = v_1 \quad \text{if } (\sigma_1^*)_i > 0, \qquad (A \, \sigma_2^*)_i \leq v_1 \quad \text{if } (\sigma_1^*)_i = 0 \]

A strategy \(s_i\) is strictly dominated if there exists a mixed strategy \(\sigma_i\) such that:

\[ \sum_k (\sigma_i)_k \, u_i(k, s_j) > u_i(s_i, s_j) \quad \forall \, s_j \in S_j \]

These properties — simplex membership, best-response optimality, and dominance — are the core mathematical invariants that our tests must verify.

R implementation

We implement a small library of game-theoretic functions and their corresponding tests. The solver uses the support enumeration method for \(2 \times 2\) games, which is exact and avoids numerical optimisation.

# --- Game-theoretic functions to test ---

#' Check if a vector is a valid probability distribution
is_valid_distribution <- function(p, tol = 1e-10) {
  all(p >= -tol) && abs(sum(p) - 1) < tol
}

#' Compute expected payoff for a player
expected_payoff <- function(sigma1, payoff_matrix, sigma2) {
  as.numeric(t(sigma1) %*% payoff_matrix %*% sigma2)
}

#' Check if a strategy is strictly dominated
is_strictly_dominated <- function(strategy_index, payoff_matrix) {
  m <- nrow(payoff_matrix)
  n <- ncol(payoff_matrix)
  target_payoffs <- payoff_matrix[strategy_index, ]
  # Check against all other pure strategies
  for (i in seq_len(m)) {
    if (i == strategy_index) next
    if (all(payoff_matrix[i, ] > target_payoffs)) return(TRUE)
  }
  FALSE
}

#' Solve a 2x2 game for mixed-strategy Nash equilibrium
solve_2x2_game <- function(A, B) {
  stopifnot(nrow(A) == 2, ncol(A) == 2, nrow(B) == 2, ncol(B) == 2)

  # Player 2's mixing probability (makes Player 1 indifferent)
  denom_q <- (A[1,1] - A[1,2] - A[2,1] + A[2,2])
  # Player 1's mixing probability (makes Player 2 indifferent)
  denom_p <- (B[1,1] - B[1,2] - B[2,1] + B[2,2])

  results <- list()

  # Check for pure-strategy NE
  for (i in 1:2) {
    for (j in 1:2) {
      p <- c(0, 0); p[i] <- 1
      q <- c(0, 0); q[j] <- 1
      # Check best response conditions
      payoffs_1 <- A %*% q
      payoffs_2 <- t(B) %*% p
      if (payoffs_1[i] >= max(payoffs_1) - 1e-10 &&
          payoffs_2[j] >= max(payoffs_2) - 1e-10) {
        results[[length(results) + 1]] <- list(p = p, q = q)
      }
    }
  }

  # Check for completely mixed NE
  if (abs(denom_q) > 1e-10 && abs(denom_p) > 1e-10) {
    q_mix <- (A[1,1] - A[2,1]) / denom_q  # Prob Player 2 plays column 1
    p_mix <- (B[1,1] - B[1,2]) / denom_p  # Prob Player 1 plays row 1

    # Flip sign convention: we want the value that makes the OTHER indifferent
    q_star <- (A[2,2] - A[2,1]) / denom_q
    p_star <- (B[2,2] - B[1,2]) / denom_p

    if (q_star > 1e-10 && q_star < 1 - 1e-10 &&
        p_star > 1e-10 && p_star < 1 - 1e-10) {
      results[[length(results) + 1]] <- list(
        p = c(p_star, 1 - p_star),
        q = c(q_star, 1 - q_star)
      )
    }
  }

  results
}

cat("=== Game-theoretic functions loaded ===\n")
=== Game-theoretic functions loaded ===
cat("Functions: is_valid_distribution, expected_payoff,\n")
Functions: is_valid_distribution, expected_payoff,
cat("           is_strictly_dominated, solve_2x2_game\n")
           is_strictly_dominated, solve_2x2_game
# --- testthat-style tests (run without loading testthat) ---
# We implement a lightweight test harness for demonstration

test_results <- data.frame(
  category = character(), test_name = character(),
  passed = logical(), stringsAsFactors = FALSE
)

run_test <- function(category, name, expr) {
  result <- tryCatch({ expr; TRUE }, error = function(e) FALSE)
  test_results <<- rbind(test_results,
    data.frame(category = category, test_name = name,
               passed = result, stringsAsFactors = FALSE))
  if (!result) cat("  FAIL:", name, "\n")
}

assert_true  <- function(x) if (!isTRUE(x)) stop("Assertion failed")
assert_equal <- function(x, y, tol = 1e-6) {
  if (any(abs(x - y) > tol)) stop("Not equal within tolerance")
}

cat("=== Running test suite ===\n\n")
=== Running test suite ===
# --- 1. Distribution validity tests ---
cat("Category: Distribution constraints\n")
Category: Distribution constraints
run_test("distribution", "valid uniform distribution",
  assert_true(is_valid_distribution(c(0.5, 0.5))))
run_test("distribution", "valid pure strategy",
  assert_true(is_valid_distribution(c(1, 0))))
run_test("distribution", "reject negative probabilities",
  assert_true(!is_valid_distribution(c(1.5, -0.5))))
run_test("distribution", "reject probabilities not summing to 1",
  assert_true(!is_valid_distribution(c(0.3, 0.3))))
run_test("distribution", "valid 3-strategy distribution",
  assert_true(is_valid_distribution(c(1/3, 1/3, 1/3))))

# --- 2. Payoff computation tests ---
cat("Category: Payoff computation\n")
Category: Payoff computation
A_pd <- matrix(c(-1, -3, 0, -2), nrow = 2, byrow = TRUE)

run_test("payoff", "pure strategy mutual defection",
  assert_equal(expected_payoff(c(0,1), A_pd, c(0,1)), -2))
run_test("payoff", "pure strategy cooperation",
  assert_equal(expected_payoff(c(1,0), A_pd, c(1,0)), -1))
run_test("payoff", "mixed strategy payoff",
  assert_equal(expected_payoff(c(0.5,0.5), A_pd, c(0.5,0.5)), -1.5))

# --- 3. Dominance tests ---
cat("Category: Dominance checking\n")
Category: Dominance checking
run_test("dominance", "cooperate is dominated in PD",
  assert_true(is_strictly_dominated(1, A_pd)))
run_test("dominance", "defect is not dominated in PD",
  assert_true(!is_strictly_dominated(2, A_pd)))

# No dominance in Matching Pennies
A_mp <- matrix(c(1, -1, -1, 1), nrow = 2, byrow = TRUE)
run_test("dominance", "row 1 not dominated in Matching Pennies",
  assert_true(!is_strictly_dominated(1, A_mp)))

# --- 4. Nash equilibrium solver tests ---
cat("Category: Nash equilibrium solver\n")
Category: Nash equilibrium solver
# Prisoner's Dilemma: unique NE at (Defect, Defect)
B_pd <- matrix(c(-1, 0, -3, -2), nrow = 2, byrow = TRUE)
ne_pd <- solve_2x2_game(A_pd, B_pd)
run_test("NE-solver", "PD has at least one NE",
  assert_true(length(ne_pd) >= 1))
run_test("NE-solver", "PD NE is pure (Defect, Defect)", {
  found_dd <- any(sapply(ne_pd, function(eq)
    all(abs(eq$p - c(0, 1)) < 1e-6) && all(abs(eq$q - c(0, 1)) < 1e-6)))
  assert_true(found_dd)
})

# Matching Pennies: unique completely mixed NE at (0.5, 0.5)
B_mp <- matrix(c(-1, 1, 1, -1), nrow = 2, byrow = TRUE)
ne_mp <- solve_2x2_game(A_mp, B_mp)
run_test("NE-solver", "Matching Pennies has mixed NE", {
  found_mixed <- any(sapply(ne_mp, function(eq)
    all(abs(eq$p - c(0.5, 0.5)) < 1e-6) && all(abs(eq$q - c(0.5, 0.5)) < 1e-6)))
  assert_true(found_mixed)
})

# --- 5. Property-based tests: best-response property ---
cat("Category: Best-response property\n")
Category: Best-response property
run_test("best-response", "NE satisfies BR for Matching Pennies", {
  eq <- ne_mp[[length(ne_mp)]]
  payoffs_1 <- A_mp %*% eq$q
  # In a completely mixed NE, both payoffs should be equal
  assert_equal(payoffs_1[1], payoffs_1[2])
})

# Battle of the Sexes
A_bos <- matrix(c(3, 0, 0, 2), nrow = 2, byrow = TRUE)
B_bos <- matrix(c(2, 0, 0, 3), nrow = 2, byrow = TRUE)
ne_bos <- solve_2x2_game(A_bos, B_bos)

run_test("best-response", "BoS NE strategies are valid distributions", {
  for (eq in ne_bos) {
    assert_true(is_valid_distribution(eq$p))
    assert_true(is_valid_distribution(eq$q))
  }
})

run_test("best-response", "BoS has at least 2 NE",
  assert_true(length(ne_bos) >= 2))

cat(sprintf("\n=== Test Summary: %d/%d tests passed ===\n",
            sum(test_results$passed), nrow(test_results)))

=== Test Summary: 17/17 tests passed ===

Static publication-ready figure

We visualise the test results as a stacked bar chart by category and overlay a test pyramid showing the ideal distribution of unit, integration, and property-based tests.

# Summarise test results
test_summary <- test_results |>
  group_by(category) |>
  summarise(
    total = n(),
    passed = sum(passed),
    failed = sum(!passed),
    .groups = "drop"
  ) |>
  pivot_longer(cols = c(passed, failed), names_to = "status", values_to = "count") |>
  mutate(
    category = factor(category,
      levels = c("distribution", "payoff", "dominance", "NE-solver", "best-response")),
    status = factor(status, levels = c("passed", "failed"))
  )

# Test pyramid data
pyramid <- data.frame(
  level = factor(c("Unit tests\n(payoff, distribution)",
                    "Integration tests\n(NE solver)",
                    "Property-based tests\n(best-response)"),
                 levels = c("Property-based tests\n(best-response)",
                            "Integration tests\n(NE solver)",
                            "Unit tests\n(payoff, distribution)")),
  count = c(8, 4, 3),
  xmin = c(-4, -2.5, -1.5),
  xmax = c(4, 2.5, 1.5)
)

p_tests <- ggplot(test_summary,
       aes(x = category, y = count, fill = status)) +
  geom_col(position = "stack", width = 0.7) +
  geom_text(aes(label = ifelse(count > 0, count, "")),
            position = position_stack(vjust = 0.5), size = 4, fontface = "bold") +
  scale_fill_manual(
    values = c("passed" = okabe_ito[3], "failed" = okabe_ito[6]),
    labels = c("Passed", "Failed"),
    name = "Status"
  ) +
  scale_x_discrete(labels = c(
    "distribution" = "Distribution\nconstraints",
    "payoff" = "Payoff\ncomputation",
    "dominance" = "Dominance\nchecking",
    "NE-solver" = "NE solver",
    "best-response" = "Best-response\nproperty"
  )) +
  labs(
    title = "Game-theoretic test suite results by category",
    subtitle = sprintf("%d tests across 5 categories | %d passed, %d failed",
                       nrow(test_results), sum(test_results$passed),
                       sum(!test_results$passed)),
    x = "Test category",
    y = "Number of tests",
    caption = "Testing framework: testthat-compatible | #equilibria"
  ) +
  theme_publication() +
  theme(axis.text.x = element_text(size = 9))

p_tests
Figure 1: Figure 1. Test results by category for game-theoretic function test suite. All tests pass across five categories: distribution constraints, payoff computation, dominance checking, Nash equilibrium solving, and best-response property verification. Colours indicate pass (green) and fail (red) status.

Interactive figure

The interactive version provides hover details for each test, showing the specific test name and result.

test_results_plot <- test_results |>
  mutate(
    status = ifelse(passed, "Passed", "Failed"),
    category = factor(category,
      levels = c("distribution", "payoff", "dominance", "NE-solver", "best-response")),
    y_pos = ave(seq_len(n()), category, FUN = seq_along)
  )

p_interactive <- ggplot(test_results_plot,
       aes(x = category, y = y_pos, fill = status,
           text = sprintf("Category: %s\nTest: %s\nResult: %s",
                          category, test_name, status))) +
  geom_tile(width = 0.8, height = 0.8, color = "white", linewidth = 1) +
  scale_fill_manual(
    values = c("Passed" = okabe_ito[3], "Failed" = okabe_ito[6]),
    name = "Status"
  ) +
  scale_x_discrete(labels = c(
    "distribution" = "Distribution",
    "payoff" = "Payoff",
    "dominance" = "Dominance",
    "NE-solver" = "NE solver",
    "best-response" = "Best-response"
  )) +
  labs(
    title = "Test heatmap: hover for details",
    x = "Test category",
    y = "Test index"
  ) +
  theme_publication()

ggplotly(p_interactive, tooltip = "text") |>
  config(displaylogo = FALSE,
         modeBarButtonsToRemove = c("select2d", "lasso2d", "autoScale2d"))
Figure 2

Interpretation

The test suite demonstrates several important principles for testing game-theoretic code. First, distribution constraints (non-negativity and summation to one) are the most basic invariant and the cheapest to test, yet violations are among the most common bugs in equilibrium solvers. A dedicated test category for these constraints is essential.

Second, property-based tests are more powerful than value-based tests for game theory. Rather than checking that the equilibrium of Matching Pennies is exactly \((0.5, 0.5)\), the best-response property test checks that both strategies yield equal expected payoffs in the support — a property that must hold for any mixed NE, not just this specific game. This approach scales to larger games where the exact equilibrium values are not known in advance.

Third, the test pyramid concept translates naturally to game theory. At the base, unit tests verify individual computations (payoffs, dominance). In the middle, integration tests verify that the solver correctly combines these computations to find equilibria. At the top, property-based tests verify that the output satisfies the defining mathematical properties of the solution concept. A well-tested game-theoretic library should have the most tests at the base and fewer, more powerful tests at the top.

One limitation of the approach shown here is that we test only \(2 \times 2\) games with known solutions. For larger games, randomised testing (generating random payoff matrices and checking equilibrium properties of the output) becomes essential. The hedgehog package for R provides a property-based testing framework that automates this generation process, though it is not demonstrated here to stay within the base package constraint.

Finally, tolerance handling deserves careful attention. Game-theoretic computations inevitably involve floating-point arithmetic, and a test that uses exact equality (==) will fail for no substantive reason. Our assert_equal uses a tolerance of \(10^{-6}\), which is appropriate for well-conditioned \(2 \times 2\) games but may need to be relaxed for larger or degenerate games.

References

Back to top

Reuse

Citation

BibTeX citation:
@online{heller2026,
  author = {Heller, Raban},
  title = {Testing Game Theory Code with Testthat},
  date = {2026-05-08},
  url = {https://r-heller.github.io/equilibria/tutorials/r-package-development/testing-game-theory-code/},
  langid = {en}
}
For attribution, please cite this work as:
Heller, Raban. 2026. “Testing Game Theory Code with Testthat.” May 8. https://r-heller.github.io/equilibria/tutorials/r-package-development/testing-game-theory-code/.