From Perceptron to deep learning — a historical R implementation

ai-ml-foundations-and-applications

neural-networks

perceptron

deep-learning

historical

Trace the evolution of neural networks from Rosenblatt’s 1958 Perceptron through the Minsky-Papert critique to backpropagation and modern deep learning, implementing key milestones in R.

Author

Raban Heller

Published

May 8, 2026

Modified

May 8, 2026

Keywords

perceptron, neural networks, deep learning, backpropagation, Rosenblatt, Minsky, Rumelhart, XOR problem

Introduction & motivation

The history of neural networks is a story of grand ambitions, devastating critiques, and eventual triumph — and it is deeply intertwined with game-theoretic concepts of learning, adaptation, and optimization. Frank Rosenblatt (1958) introduced the Perceptron as a model of learning inspired by biological neurons: a simple linear classifier that adjusts weights based on errors, provably converging to a solution if the data are linearly separable. The excitement was immense — the New York Times reported that the Navy had built a machine that could “think.” Then came the devastating critique: Minsky and Papert (1969) proved that a single-layer perceptron cannot learn the XOR function or any non-linearly-separable pattern, and their pessimistic framing effectively froze neural network research for over a decade (the first “AI winter”). The revival came with Rumelhart et al. (1986), who popularized backpropagation — an efficient algorithm for training multi-layer networks that could overcome the XOR limitation. This led ultimately to the deep learning revolution crystallized by Krizhevsky et al. (2012)’s AlexNet, which won the ImageNet competition and launched the current era of AI. This tutorial implements each milestone in pure R: the single-layer perceptron, its failure on XOR, a two-layer network with backpropagation that solves XOR, and a comparison of learning dynamics — providing hands-on understanding of why depth matters and how gradient-based learning works.

Mathematical formulation

Perceptron (Rosenblatt 1958): Given input $\mathbf{x} \in \mathbb{R}^d$ with bias term, weights $\mathbf{w} \in \mathbb{R}^{d+1}$, the perceptron computes:

\[ \hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x}) \]

The update rule for a misclassified example $(\mathbf{x}_i, y_i)$ is $\mathbf{w} \leftarrow \mathbf{w} + \eta \, y_i \, \mathbf{x}_i$, where $\eta$ is the learning rate. The Perceptron Convergence Theorem guarantees convergence in finite steps if a separating hyperplane exists.

XOR Problem (Minsky & Papert 1969): The function $\text{XOR}(x_1, x_2) = x_1 \oplus x_2$ is not linearly separable. No single hyperplane in $\mathbb{R}^2$ can separate the positive examples $(0,1), (1,0)$ from the negative examples $(0,0), (1,1)$. Therefore no single-layer perceptron can learn it.

Multi-layer network with backpropagation (Rumelhart, Hinton & Williams 1986): A two-layer network with hidden units and nonlinear activation $\sigma$ (sigmoid) computes:

\[ \mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1), \quad \hat{y} = \sigma(\mathbf{w}_2^\top \mathbf{h} + b_2) \]

Training minimizes binary cross-entropy via gradient descent, with gradients computed by the chain rule (backpropagation). This architecture can learn XOR and, with sufficient width and depth, can approximate any continuous function (Universal Approximation Theorem).

R implementation

The Perceptron

# Perceptron learning algorithm
perceptron_train <- function(X, y, eta = 0.1, max_iter = 100) {
  n <- nrow(X)
  d <- ncol(X)
  w <- rep(0, d)
  errors_per_epoch <- numeric(max_iter)

  for (epoch in 1:max_iter) {
    n_errors <- 0
    for (i in 1:n) {
      y_hat <- sign(sum(w * X[i, ]))
      if (y_hat == 0) y_hat <- -1
      if (y_hat != y[i]) {
        w <- w + eta * y[i] * X[i, ]
        n_errors <- n_errors + 1
      }
    }
    errors_per_epoch[epoch] <- n_errors
    if (n_errors == 0) {
      errors_per_epoch[(epoch+1):max_iter] <- 0
      break
    }
  }
  list(weights = w, errors = errors_per_epoch, converged_epoch = epoch)
}

# --- Linearly separable data (AND gate) ---
X_and <- cbind(1, matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE))
y_and <- c(-1, -1, -1, 1)  # AND: only (1,1) -> +1

result_and <- perceptron_train(X_and, y_and)
cat(sprintf("AND gate: converged in %d epochs\n", result_and$converged_epoch))

AND gate: converged in 6 epochs

cat("Weights:", round(result_and$weights, 3), "\n")

Weights: -0.2 0.2 0.1

# --- XOR data ---
X_xor <- cbind(1, matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE))
y_xor <- c(-1, 1, 1, -1)  # XOR

result_xor <- perceptron_train(X_xor, y_xor, max_iter = 100)
cat(sprintf("\nXOR: errors after 100 epochs = %d (never converges)\n",
            result_xor$errors[100]))


XOR: errors after 100 epochs = 4 (never converges)

Two-layer network with backpropagation

sigmoid <- function(z) 1 / (1 + exp(-z))

# Two-layer network for XOR
mlp_train_xor <- function(eta = 1.0, n_hidden = 4, max_iter = 5000, seed = 42) {
  set.seed(seed)

  # XOR data (using 0/1 encoding for sigmoid output)
  X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
  y <- c(0, 1, 1, 0)

  # Initialize weights
  W1 <- matrix(rnorm(2 * n_hidden, sd = 1), nrow = 2, ncol = n_hidden)
  b1 <- rep(0, n_hidden)
  w2 <- rnorm(n_hidden, sd = 1)
  b2 <- 0

  loss_history <- numeric(max_iter)

  for (iter in 1:max_iter) {
    # Forward pass
    z1 <- X %*% W1 + matrix(b1, nrow = 4, ncol = n_hidden, byrow = TRUE)
    h <- sigmoid(z1)
    z2 <- h %*% w2 + b2
    y_hat <- sigmoid(z2)

    # Binary cross-entropy loss
    eps <- 1e-8
    loss <- -mean(y * log(y_hat + eps) + (1 - y) * log(1 - y_hat + eps))
    loss_history[iter] <- loss

    # Backward pass
    dz2 <- (y_hat - y) / 4  # gradient of loss w.r.t. z2
    dw2 <- as.numeric(t(h) %*% dz2)
    db2 <- sum(dz2)

    dh <- outer(dz2, w2)
    dz1 <- dh * h * (1 - h)  # sigmoid derivative
    dW1 <- t(X) %*% dz1
    db1 <- colSums(dz1)

    # Update
    W1 <- W1 - eta * dW1
    b1 <- b1 - eta * db1
    w2 <- w2 - eta * dw2
    b2 <- b2 - eta * db2
  }

  # Final predictions
  z1 <- X %*% W1 + matrix(b1, nrow = 4, ncol = n_hidden, byrow = TRUE)
  h <- sigmoid(z1)
  y_hat <- sigmoid(h %*% w2 + b2)

  list(predictions = round(y_hat, 3), loss_history = loss_history,
       W1 = W1, b1 = b1, w2 = w2, b2 = b2)
}

mlp_result <- mlp_train_xor()

Error in `dh * h`:
! non-conformable arrays

cat("XOR predictions after backprop training:\n")

XOR predictions after backprop training:

cat(sprintf("  (0,0) -> %.3f (target: 0)\n", mlp_result$predictions[1]))

Error:
! object 'mlp_result' not found

cat(sprintf("  (0,1) -> %.3f (target: 1)\n", mlp_result$predictions[2]))

Error:
! object 'mlp_result' not found

cat(sprintf("  (1,0) -> %.3f (target: 1)\n", mlp_result$predictions[3]))

Error:
! object 'mlp_result' not found

cat(sprintf("  (1,1) -> %.3f (target: 0)\n", mlp_result$predictions[4]))

Error:
! object 'mlp_result' not found

Static publication-ready figure

# Prepare data
and_df <- tibble(epoch = 1:100, errors = result_and$errors, task = "Perceptron — AND")
xor_perc_df <- tibble(epoch = 1:100, errors = result_xor$errors, task = "Perceptron — XOR")
xor_mlp_df <- tibble(epoch = 1:5000, loss = mlp_result$loss_history, task = "MLP — XOR (backprop)")

Error:
! object 'mlp_result' not found

# Panel 1 & 2: Perceptron errors
perc_df <- bind_rows(and_df, xor_perc_df)
p1 <- ggplot(perc_df, aes(x = epoch, y = errors, color = task)) +
  geom_line(linewidth = 0.8) +
  facet_wrap(~task, scales = "free_y") +
  scale_color_manual(values = c(okabe_ito[3], okabe_ito[6])) +
  labs(x = "Epoch", y = "Classification errors") +
  theme_publication() +
  theme(legend.position = "none", strip.text = element_text(face = "bold"))

# Panel 3: MLP loss
p2 <- ggplot(xor_mlp_df, aes(x = epoch, y = loss)) +
  geom_line(color = okabe_ito[5], linewidth = 0.8) +
  labs(x = "Epoch", y = "Cross-entropy loss",
       subtitle = "MLP — XOR (backprop)") +
  theme_publication()

Error:
! object 'xor_mlp_df' not found

# Combine using patchwork-style approach (side by side)
gridExtra::grid.arrange(p1, p2, ncol = 2, widths = c(2, 1),
                         top = grid::textGrob("Neural network learning: from Perceptron to backpropagation",
                                              gp = grid::gpar(fontsize = 14, fontface = "bold")))

Error:
! object 'p2' not found

Interactive figure

# Visualise the MLP decision boundary for XOR
grid_pts <- expand.grid(
  x1 = seq(-0.5, 1.5, length.out = 100),
  x2 = seq(-0.5, 1.5, length.out = 100)
)

# Forward pass through trained network
X_grid <- as.matrix(grid_pts)
z1 <- X_grid %*% mlp_result$W1 + matrix(mlp_result$b1, nrow = nrow(X_grid),
                                          ncol = length(mlp_result$b1), byrow = TRUE)

Error:
! object 'mlp_result' not found

h <- sigmoid(z1)

Error:
! object 'z1' not found

y_grid <- sigmoid(h %*% mlp_result$w2 + mlp_result$b2)

Error:
! object 'h' not found

grid_pts$prediction <- as.numeric(y_grid)

Error:
! object 'y_grid' not found

# XOR training points
xor_points <- tibble(
  x1 = c(0, 0, 1, 1), x2 = c(0, 1, 0, 1),
  label = c("0", "1", "1", "0")
)

p_boundary <- ggplot() +
  geom_tile(data = grid_pts, aes(x = x1, y = x2, fill = prediction),
            alpha = 0.7) +
  scale_fill_gradient2(low = okabe_ito[6], mid = "white", high = okabe_ito[3],
                        midpoint = 0.5, name = "P(y=1)") +
  geom_point(data = xor_points, aes(x = x1, y = x2, color = label),
             size = 4, shape = 16) +
  scale_color_manual(values = c("0" = okabe_ito[6], "1" = okabe_ito[3]),
                      name = "True label") +
  coord_fixed() +
  labs(
    title = "MLP decision boundary for XOR",
    subtitle = "Two-layer network with 4 hidden units learns the non-linear boundary",
    x = expression(x[1]), y = expression(x[2])
  ) +
  theme_publication()

ggplotly(p_boundary) |>
  config(displaylogo = FALSE,
         modeBarButtonsToRemove = c("select2d", "lasso2d"))

Error:
! object 'prediction' not found

Interpretation

This historical journey through neural network milestones illustrates a fundamental principle that connects machine learning to game theory: the structure of the hypothesis space determines what can be learned. The single-layer perceptron, despite having a beautiful convergence guarantee for linearly separable data, is fundamentally limited — as Minsky and Papert (1969) proved, it cannot represent functions like XOR that require nonlinear decision boundaries. This is not a failure of the learning algorithm but of the model class. The addition of a hidden layer with nonlinear activations transforms the problem: the hidden units learn internal representations that make the data linearly separable in a higher-dimensional feature space, and backpropagation (Rumelhart et al. (1986)) provides an efficient gradient-based method for optimizing the entire network end-to-end. The XOR example — trivial by modern standards — was historically pivotal because it demonstrated that depth matters: a two-layer network with just 4 hidden units solves what no single-layer network can. The learning curves tell this story visually: the perceptron’s error on XOR never reaches zero (it oscillates), while the MLP’s loss decreases smoothly. This lesson scaled up to the deep learning revolution: Krizhevsky et al. (2012) showed that very deep convolutional networks with millions of parameters, trained on massive datasets using the same backpropagation principle, could achieve superhuman performance on image classification. The connection to game theory is not incidental — training a neural network is an optimization problem, adversarial training involves minimax games (GANs), and multi-agent reinforcement learning combines deep learning with strategic interaction.

References

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25. https://doi.org/10.1145/3065386.

Minsky, Marvin, and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press.

Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386–408. https://doi.org/10.1037/h0042519.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:

@online{heller2026,
  author = {Heller, Raban},
  title = {From {Perceptron} to Deep Learning — a Historical {R}
    Implementation},
  date = {2026-05-08},
  url = {https://r-heller.github.io/equilibria/tutorials/ai-ml-foundations-and-applications/perceptron-to-deep-learning-historical-r-implementation/},
  langid = {en}
}

For attribution, please cite this work as:

Heller, Raban. 2026. “From Perceptron to Deep Learning — a Historical R Implementation.” May 8. https://r-heller.github.io/equilibria/tutorials/ai-ml-foundations-and-applications/perceptron-to-deep-learning-historical-r-implementation/.

--- title: "From Perceptron to deep learning — a historical R implementation" description: "Trace the evolution of neural networks from Rosenblatt's 1958 Perceptron through the Minsky-Papert critique to backpropagation and modern deep learning, implementing key milestones in R." author: "Raban Heller" date: 2026-05-08 date-modified: 2026-05-08 categories: - ai-ml-foundations-and-applications - neural-networks - perceptron - deep-learning - historical keywords: ["perceptron", "neural networks", "deep learning", "backpropagation", "Rosenblatt", "Minsky", "Rumelhart", "XOR problem"] labels: ["ai-ml-history", "neural-networks"] tier: 1 bibliography: ../../../references.bib vgwort: "TODO_VGWORT_ai-ml-foundations-and-applications_perceptron-to-deep-learning" image: thumbnail.png image-alt: "Learning curves showing perceptron convergence on linearly separable data and failure on XOR" citation: type: webpage url: https://r-heller.github.io/equilibria/tutorials/ai-ml-foundations-and-applications/perceptron-to-deep-learning-historical-r-implementation/ license: "CC BY-SA 4.0" draft: false has_static_fig: true has_interactive_fig: true has_shiny_app: false --- ```{r} #| label: setup #| include: false library(ggplot2) library(dplyr) library(tidyr) library(plotly) okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") theme_publication <- function(base_size = 12) { theme_minimal(base_size = base_size) + theme( plot.title = element_text(size = base_size * 1.2, face = "bold"), plot.subtitle = element_text(size = base_size * 0.9, color = "grey40"), axis.line = element_line(color = "grey30", linewidth = 0.3), panel.grid.minor = element_blank(), legend.position = "bottom", plot.margin = margin(10, 10, 10, 10) ) } ``` ## Introduction & motivation The history of neural networks is a story of grand ambitions, devastating critiques, and eventual triumph — and it is deeply intertwined with game-theoretic concepts of learning, adaptation, and optimization. Frank @rosenblatt_1958 introduced the Perceptron as a model of learning inspired by biological neurons: a simple linear classifier that adjusts weights based on errors, provably converging to a solution if the data are linearly separable. The excitement was immense — the New York Times reported that the Navy had built a machine that could "think." Then came the devastating critique: @minsky_papert_1969 proved that a single-layer perceptron cannot learn the XOR function or any non-linearly-separable pattern, and their pessimistic framing effectively froze neural network research for over a decade (the first "AI winter"). The revival came with @rumelhart_hinton_williams_1986, who popularized backpropagation — an efficient algorithm for training multi-layer networks that could overcome the XOR limitation. This led ultimately to the deep learning revolution crystallized by @krizhevsky_2012's AlexNet, which won the ImageNet competition and launched the current era of AI. This tutorial implements each milestone in pure R: the single-layer perceptron, its failure on XOR, a two-layer network with backpropagation that solves XOR, and a comparison of learning dynamics — providing hands-on understanding of why depth matters and how gradient-based learning works. ## Mathematical formulation **Perceptron** (Rosenblatt 1958): Given input $\mathbf{x} \in \mathbb{R}^d$ with bias term, weights $\mathbf{w} \in \mathbb{R}^{d+1}$, the perceptron computes: $$ \hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x}) $$ The update rule for a misclassified example $(\mathbf{x}_i, y_i)$ is $\mathbf{w} \leftarrow \mathbf{w} + \eta \, y_i \, \mathbf{x}_i$, where $\eta$ is the learning rate. The **Perceptron Convergence Theorem** guarantees convergence in finite steps if a separating hyperplane exists. **XOR Problem** (Minsky & Papert 1969): The function $\text{XOR}(x_1, x_2) = x_1 \oplus x_2$ is not linearly separable. No single hyperplane in $\mathbb{R}^2$ can separate the positive examples $(0,1), (1,0)$ from the negative examples $(0,0), (1,1)$. Therefore no single-layer perceptron can learn it. **Multi-layer network with backpropagation** (Rumelhart, Hinton & Williams 1986): A two-layer network with hidden units and nonlinear activation $\sigma$ (sigmoid) computes: $$ \mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1), \quad \hat{y} = \sigma(\mathbf{w}_2^\top \mathbf{h} + b_2) $$ Training minimizes binary cross-entropy via gradient descent, with gradients computed by the chain rule (backpropagation). This architecture can learn XOR and, with sufficient width and depth, can approximate any continuous function (Universal Approximation Theorem). ## R implementation ### The Perceptron ```{r} #| label: perceptron # Perceptron learning algorithm perceptron_train <- function(X, y, eta = 0.1, max_iter = 100) { n <- nrow(X) d <- ncol(X) w <- rep(0, d) errors_per_epoch <- numeric(max_iter) for (epoch in 1:max_iter) { n_errors <- 0 for (i in 1:n) { y_hat <- sign(sum(w * X[i, ])) if (y_hat == 0) y_hat <- -1 if (y_hat != y[i]) { w <- w + eta * y[i] * X[i, ] n_errors <- n_errors + 1 } } errors_per_epoch[epoch] <- n_errors if (n_errors == 0) { errors_per_epoch[(epoch+1):max_iter] <- 0 break } } list(weights = w, errors = errors_per_epoch, converged_epoch = epoch) } # --- Linearly separable data (AND gate) --- X_and <- cbind(1, matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)) y_and <- c(-1, -1, -1, 1) # AND: only (1,1) -> +1 result_and <- perceptron_train(X_and, y_and) cat(sprintf("AND gate: converged in %d epochs\n", result_and$converged_epoch)) cat("Weights:", round(result_and$weights, 3), "\n") # --- XOR data --- X_xor <- cbind(1, matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)) y_xor <- c(-1, 1, 1, -1) # XOR result_xor <- perceptron_train(X_xor, y_xor, max_iter = 100) cat(sprintf("\nXOR: errors after 100 epochs = %d (never converges)\n", result_xor$errors[100])) ``` ### Two-layer network with backpropagation ```{r} #| label: backprop-xor sigmoid <- function(z) 1 / (1 + exp(-z)) # Two-layer network for XOR mlp_train_xor <- function(eta = 1.0, n_hidden = 4, max_iter = 5000, seed = 42) { set.seed(seed) # XOR data (using 0/1 encoding for sigmoid output) X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE) y <- c(0, 1, 1, 0) # Initialize weights W1 <- matrix(rnorm(2 * n_hidden, sd = 1), nrow = 2, ncol = n_hidden) b1 <- rep(0, n_hidden) w2 <- rnorm(n_hidden, sd = 1) b2 <- 0 loss_history <- numeric(max_iter) for (iter in 1:max_iter) { # Forward pass z1 <- X %*% W1 + matrix(b1, nrow = 4, ncol = n_hidden, byrow = TRUE) h <- sigmoid(z1) z2 <- h %*% w2 + b2 y_hat <- sigmoid(z2) # Binary cross-entropy loss eps <- 1e-8 loss <- -mean(y * log(y_hat + eps) + (1 - y) * log(1 - y_hat + eps)) loss_history[iter] <- loss # Backward pass dz2 <- (y_hat - y) / 4 # gradient of loss w.r.t. z2 dw2 <- as.numeric(t(h) %*% dz2) db2 <- sum(dz2) dh <- outer(dz2, w2) dz1 <- dh * h * (1 - h) # sigmoid derivative dW1 <- t(X) %*% dz1 db1 <- colSums(dz1) # Update W1 <- W1 - eta * dW1 b1 <- b1 - eta * db1 w2 <- w2 - eta * dw2 b2 <- b2 - eta * db2 } # Final predictions z1 <- X %*% W1 + matrix(b1, nrow = 4, ncol = n_hidden, byrow = TRUE) h <- sigmoid(z1) y_hat <- sigmoid(h %*% w2 + b2) list(predictions = round(y_hat, 3), loss_history = loss_history, W1 = W1, b1 = b1, w2 = w2, b2 = b2) } mlp_result <- mlp_train_xor() cat("XOR predictions after backprop training:\n") cat(sprintf(" (0,0) -> %.3f (target: 0)\n", mlp_result$predictions[1])) cat(sprintf(" (0,1) -> %.3f (target: 1)\n", mlp_result$predictions[2])) cat(sprintf(" (1,0) -> %.3f (target: 1)\n", mlp_result$predictions[3])) cat(sprintf(" (1,1) -> %.3f (target: 0)\n", mlp_result$predictions[4])) ``` ## Static publication-ready figure ```{r} #| label: fig-learning-curves #| fig-cap: "Figure 1. Learning curves for the Perceptron (AND gate — converges) and two-layer network (XOR — solved by backpropagation). Left: the Perceptron converges to zero errors on the linearly separable AND gate within a few epochs. Centre: the Perceptron fails on XOR, oscillating with persistent errors — confirming the Minsky-Papert impossibility result. Right: a two-layer network with 4 hidden units learns XOR via backpropagation, with loss decreasing smoothly to near zero. Okabe-Ito palette." #| dev: [png, pdf] #| fig-width: 10 #| fig-height: 4 #| dpi: 300 # Prepare data and_df <- tibble(epoch = 1:100, errors = result_and$errors, task = "Perceptron — AND") xor_perc_df <- tibble(epoch = 1:100, errors = result_xor$errors, task = "Perceptron — XOR") xor_mlp_df <- tibble(epoch = 1:5000, loss = mlp_result$loss_history, task = "MLP — XOR (backprop)") # Panel 1 & 2: Perceptron errors perc_df <- bind_rows(and_df, xor_perc_df) p1 <- ggplot(perc_df, aes(x = epoch, y = errors, color = task)) + geom_line(linewidth = 0.8) + facet_wrap(~task, scales = "free_y") + scale_color_manual(values = c(okabe_ito[3], okabe_ito[6])) + labs(x = "Epoch", y = "Classification errors") + theme_publication() + theme(legend.position = "none", strip.text = element_text(face = "bold")) # Panel 3: MLP loss p2 <- ggplot(xor_mlp_df, aes(x = epoch, y = loss)) + geom_line(color = okabe_ito[5], linewidth = 0.8) + labs(x = "Epoch", y = "Cross-entropy loss", subtitle = "MLP — XOR (backprop)") + theme_publication() # Combine using patchwork-style approach (side by side) gridExtra::grid.arrange(p1, p2, ncol = 2, widths = c(2, 1), top = grid::textGrob("Neural network learning: from Perceptron to backpropagation", gp = grid::gpar(fontsize = 14, fontface = "bold"))) ``` ## Interactive figure ```{r} #| label: fig-decision-boundary-interactive # Visualise the MLP decision boundary for XOR grid_pts <- expand.grid( x1 = seq(-0.5, 1.5, length.out = 100), x2 = seq(-0.5, 1.5, length.out = 100) ) # Forward pass through trained network X_grid <- as.matrix(grid_pts) z1 <- X_grid %*% mlp_result$W1 + matrix(mlp_result$b1, nrow = nrow(X_grid), ncol = length(mlp_result$b1), byrow = TRUE) h <- sigmoid(z1) y_grid <- sigmoid(h %*% mlp_result$w2 + mlp_result$b2) grid_pts$prediction <- as.numeric(y_grid) # XOR training points xor_points <- tibble( x1 = c(0, 0, 1, 1), x2 = c(0, 1, 0, 1), label = c("0", "1", "1", "0") ) p_boundary <- ggplot() + geom_tile(data = grid_pts, aes(x = x1, y = x2, fill = prediction), alpha = 0.7) + scale_fill_gradient2(low = okabe_ito[6], mid = "white", high = okabe_ito[3], midpoint = 0.5, name = "P(y=1)") + geom_point(data = xor_points, aes(x = x1, y = x2, color = label), size = 4, shape = 16) + scale_color_manual(values = c("0" = okabe_ito[6], "1" = okabe_ito[3]), name = "True label") + coord_fixed() + labs( title = "MLP decision boundary for XOR", subtitle = "Two-layer network with 4 hidden units learns the non-linear boundary", x = expression(x[1]), y = expression(x[2]) ) + theme_publication() ggplotly(p_boundary) |> config(displaylogo = FALSE, modeBarButtonsToRemove = c("select2d", "lasso2d")) ``` ## Interpretation This historical journey through neural network milestones illustrates a fundamental principle that connects machine learning to game theory: **the structure of the hypothesis space determines what can be learned**. The single-layer perceptron, despite having a beautiful convergence guarantee for linearly separable data, is fundamentally limited — as @minsky_papert_1969 proved, it cannot represent functions like XOR that require nonlinear decision boundaries. This is not a failure of the learning algorithm but of the model class. The addition of a hidden layer with nonlinear activations transforms the problem: the hidden units learn internal representations that make the data linearly separable in a higher-dimensional feature space, and backpropagation (@rumelhart_hinton_williams_1986) provides an efficient gradient-based method for optimizing the entire network end-to-end. The XOR example — trivial by modern standards — was historically pivotal because it demonstrated that depth matters: a two-layer network with just 4 hidden units solves what no single-layer network can. The learning curves tell this story visually: the perceptron's error on XOR never reaches zero (it oscillates), while the MLP's loss decreases smoothly. This lesson scaled up to the deep learning revolution: @krizhevsky_2012 showed that very deep convolutional networks with millions of parameters, trained on massive datasets using the same backpropagation principle, could achieve superhuman performance on image classification. The connection to game theory is not incidental — training a neural network is an optimization problem, adversarial training involves minimax games (GANs), and multi-agent reinforcement learning combines deep learning with strategic interaction. ## Extensions & related tutorials - [Mixed-strategy Nash equilibrium in 2×2 games](../../foundations/nash-equilibrium-mixed/) — optimization and equilibrium concepts underlying learning. - [Reinforcement learning meets game theory](../rl-game-theory/) — when agents learn strategies through interaction. - [GANs as zero-sum games](../gans-zero-sum/) — generative adversarial networks as minimax optimization. - [Neural network game solvers](../nn-game-solvers/) — using deep learning to find Nash equilibria. - [Multi-agent RL in social dilemmas](../marl-social-dilemmas/) — combining deep learning with strategic interaction. ## References ::: {#refs} :::