From Perceptron to deep learning — a historical R implementation

Trace the evolution of neural networks from Rosenblatt’s 1958 Perceptron through the Minsky-Papert critique to backpropagation and modern deep learning, implementing key milestones in R.
Author

Raban Heller

Published

May 8, 2026

Modified

May 8, 2026

Keywords

perceptron, neural networks, deep learning, backpropagation, Rosenblatt, Minsky, Rumelhart, XOR problem

Introduction & motivation

The history of neural networks is a story of grand ambitions, devastating critiques, and eventual triumph, and it is closely tied to game-theoretic ideas of learning, adaptation, and optimization. Frank Rosenblatt (1958) introduced the Perceptron as a model of learning inspired by biological neurons: a simple linear classifier that adjusts its weights after each error and provably converges to a solution if the data are linearly separable. The excitement was immense: the New York Times reported that the Navy had built a machine that could “think.”

Then came the critique. Minsky and Papert (1969) proved that a single-layer perceptron cannot represent the XOR function or any other pattern that is not linearly separable, and their pessimistic framing is widely credited with stalling neural network research for over a decade (the first “AI winter”). The revival came with Rumelhart et al. (1986), who popularized backpropagation, an efficient algorithm for training multi-layer networks that overcome the XOR limitation. This line of work led ultimately to the deep learning revolution crystallized by AlexNet (Krizhevsky et al. 2012), which won the 2012 ImageNet competition and launched the current era of AI.

This tutorial implements each milestone in pure R: the single-layer perceptron, its failure on XOR, a two-layer network with backpropagation that solves XOR, and a comparison of learning dynamics. The goal is a hands-on understanding of why depth matters and how gradient-based learning works.

Mathematical formulation

Perceptron (Rosenblatt 1958): Given an input \(\mathbf{x} \in \mathbb{R}^{d+1}\) (the \(d\) features plus a constant 1 for the bias term), weights \(\mathbf{w} \in \mathbb{R}^{d+1}\), and labels \(y \in \{-1, +1\}\), the perceptron computes:

\[ \hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x}) \]

The update rule for a misclassified example \((\mathbf{x}_i, y_i)\) is \(\mathbf{w} \leftarrow \mathbf{w} + \eta \, y_i \, \mathbf{x}_i\), where \(\eta > 0\) is the learning rate. The Perceptron Convergence Theorem guarantees convergence after a finite number of updates if a separating hyperplane exists.
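To make the update rule concrete, here is a minimal sketch of a single update on one example. The numeric values and the names `score_before`, `score_after`, and `margin_gain` are chosen purely for illustration and are not part of the tutorial's code. The sketch shows that one update raises the signed margin \(y_i \, \mathbf{w}^\top \mathbf{x}_i\) by exactly \(\eta \, \lVert \mathbf{x}_i \rVert^2\), which is why repeated updates push a misclassified point toward the correct side.

```r
# One perceptron update on a single example (illustrative values only).
w   <- c(0, 0, 0)   # weights, bias weight first
x_i <- c(1, 1, 1)   # input with a leading 1 for the bias term
y_i <- 1            # true label in {-1, +1}
eta <- 0.1          # learning rate

score_before <- sum(w * x_i)      # w'x before the update (0 here)
w_new <- w + eta * y_i * x_i      # w <- w + eta * y_i * x_i
score_after <- sum(w_new * x_i)   # w'x after the update

# Change in the signed margin y_i * w'x_i caused by the update
margin_gain <- y_i * score_after - y_i * score_before
print(margin_gain)
print(eta * sum(x_i^2))           # equals margin_gain
```

The two printed values agree because \(y_i^2 = 1\), so the label drops out of the margin change.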

XOR Problem (Minsky & Papert 1969): The function \(\text{XOR}(x_1, x_2) = x_1 \oplus x_2\) is not linearly separable. No single hyperplane in \(\mathbb{R}^2\) can separate the positive examples \((0,1), (1,0)\) from the negative examples \((0,0), (1,1)\). Therefore no single-layer perceptron can learn it.
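The non-separability claim can be checked with a short argument. As a sketch, write the perceptron's score as \(s(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2\) and assume the convention that the prediction is \(+1\) when \(s > 0\) and \(-1\) otherwise (the symbols \(s, w_0, w_1, w_2\) are introduced here for illustration). Classifying all four points correctly would require:

\[ s(0,1) = w_0 + w_2 > 0, \quad s(1,0) = w_0 + w_1 > 0, \quad s(0,0) = w_0 \le 0, \quad s(1,1) = w_0 + w_1 + w_2 \le 0 \]

Adding the first two inequalities gives \(2 w_0 + w_1 + w_2 > 0\), while adding the last two gives \(2 w_0 + w_1 + w_2 \le 0\). Both cannot hold, so no choice of weights works.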

Multi-layer network with backpropagation (Rumelhart, Hinton & Williams 1986): A two-layer network with hidden units and nonlinear activation \(\sigma\) (sigmoid) computes:

\[ \mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1), \quad \hat{y} = \sigma(\mathbf{w}_2^\top \mathbf{h} + b_2) \]

Training minimizes binary cross-entropy via gradient descent, with gradients computed by the chain rule (backpropagation). This architecture can learn XOR. More generally, the Universal Approximation Theorem states that a single hidden layer with enough units, followed by a linear output, can approximate any continuous function on a compact domain to arbitrary accuracy.
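For reference, the chain rule can be worked out for a single example. The following is a sketch under the notation above, with \(\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1\), \(z_2 = \mathbf{w}_2^\top \mathbf{h} + b_2\), and per-example loss \(\ell = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]\); the symbols \(\delta_2\), \(\boldsymbol{\delta}_1\), and \(\ell\) are introduced here for illustration. Because \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), the output-layer terms simplify to:

\[ \delta_2 = \frac{\partial \ell}{\partial z_2} = \hat{y} - y, \quad \frac{\partial \ell}{\partial \mathbf{w}_2} = \delta_2 \, \mathbf{h}, \quad \frac{\partial \ell}{\partial b_2} = \delta_2 \]

Propagating one layer back, with \(\odot\) denoting the elementwise product:

\[ \boldsymbol{\delta}_1 = \frac{\partial \ell}{\partial \mathbf{z}_1} = \delta_2 \, \mathbf{w}_2 \odot \mathbf{h} \odot (1 - \mathbf{h}), \quad \frac{\partial \ell}{\partial W_1} = \boldsymbol{\delta}_1 \mathbf{x}^\top, \quad \frac{\partial \ell}{\partial \mathbf{b}_1} = \boldsymbol{\delta}_1 \]

The gradient of the mean loss over \(n\) examples is the average of these per-example terms.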

R implementation

The Perceptron

# Perceptron learning algorithm
perceptron_train <- function(X, y, eta = 0.1, max_iter = 100) {
  n <- nrow(X)
  d <- ncol(X)
  w <- rep(0, d)
  errors_per_epoch <- numeric(max_iter)

  for (epoch in 1:max_iter) {
    n_errors <- 0
    for (i in 1:n) {
      y_hat <- sign(sum(w * X[i, ]))
      if (y_hat == 0) y_hat <- -1
      if (y_hat != y[i]) {
        w <- w + eta * y[i] * X[i, ]
        n_errors <- n_errors + 1
      }
    }
    errors_per_epoch[epoch] <- n_errors
    if (n_errors == 0) break  # remaining entries of errors_per_epoch are already 0
  }
  list(weights = w, errors = errors_per_epoch, converged_epoch = epoch)
}

# --- Linearly separable data (AND gate) ---
X_and <- cbind(1, matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE))
y_and <- c(-1, -1, -1, 1)  # AND: only (1,1) -> +1

result_and <- perceptron_train(X_and, y_and)
cat(sprintf("AND gate: converged in %d epochs\n", result_and$converged_epoch))
#> AND gate: converged in 6 epochs
cat("Weights:", round(result_and$weights, 3), "\n")
#> Weights: -0.2 0.2 0.1
# --- XOR data ---
X_xor <- cbind(1, matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE))
y_xor <- c(-1, 1, 1, -1)  # XOR

result_xor <- perceptron_train(X_xor, y_xor, max_iter = 100)
cat(sprintf("\nXOR: errors after 100 epochs = %d (never converges)\n",
            result_xor$errors[100]))
#>
#> XOR: errors after 100 epochs = 4 (never converges)

Two-layer network with backpropagation

sigmoid <- function(z) 1 / (1 + exp(-z))

# Two-layer network for XOR
mlp_train_xor <- function(eta = 1.0, n_hidden = 4, max_iter = 5000, seed = 42) {
  set.seed(seed)

  # XOR data (using 0/1 encoding for sigmoid output)
  X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
  y <- c(0, 1, 1, 0)

  # Initialize weights
  W1 <- matrix(rnorm(2 * n_hidden, sd = 1), nrow = 2, ncol = n_hidden)
  b1 <- rep(0, n_hidden)
  w2 <- rnorm(n_hidden, sd = 1)
  b2 <- 0

  loss_history <- numeric(max_iter)

  for (iter in 1:max_iter) {
    # Forward pass
    z1 <- X %*% W1 + matrix(b1, nrow = 4, ncol = n_hidden, byrow = TRUE)
    h <- sigmoid(z1)
    z2 <- h %*% w2 + b2
    y_hat <- sigmoid(z2)

    # Binary cross-entropy loss
    eps <- 1e-8
    loss <- -mean(y * log(y_hat + eps) + (1 - y) * log(1 - y_hat + eps))
    loss_history[iter] <- loss

    # Backward pass
    dz2 <- (y_hat - y) / 4  # gradient of loss w.r.t. z2
    dw2 <- as.numeric(t(h) %*% dz2)
    db2 <- sum(dz2)

    # dz2 is a 4 x 1 matrix and w2 a length-n_hidden vector, so this matrix
    # product gives the 4 x n_hidden gradient w.r.t. the hidden activations
    dh <- dz2 %*% t(w2)
    dz1 <- dh * h * (1 - h)  # sigmoid derivative
    dW1 <- t(X) %*% dz1
    db1 <- colSums(dz1)

    # Update
    W1 <- W1 - eta * dW1
    b1 <- b1 - eta * db1
    w2 <- w2 - eta * dw2
    b2 <- b2 - eta * db2
  }

  # Final predictions
  z1 <- X %*% W1 + matrix(b1, nrow = 4, ncol = n_hidden, byrow = TRUE)
  h <- sigmoid(z1)
  y_hat <- sigmoid(h %*% w2 + b2)

  list(predictions = round(as.numeric(y_hat), 3), loss_history = loss_history,
       W1 = W1, b1 = b1, w2 = w2, b2 = b2)
}

mlp_result <- mlp_train_xor()
cat("XOR predictions after backprop training:\n")
cat(sprintf("  (0,0) -> %.3f (target: 0)\n", mlp_result$predictions[1]))
cat(sprintf("  (0,1) -> %.3f (target: 1)\n", mlp_result$predictions[2]))
cat(sprintf("  (1,0) -> %.3f (target: 1)\n", mlp_result$predictions[3]))
cat(sprintf("  (1,1) -> %.3f (target: 0)\n", mlp_result$predictions[4]))
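Hand-written backward passes are easy to get subtly wrong, so a finite-difference check is a useful sanity test. The sketch below is an assumption-laden illustration rather than part of the tutorial's pipeline: the helper `bce_loss`, the variables `num_dW1`, `eps_fd`, and `max_abs_diff`, the choice of 3 hidden units, and the random parameters are all hypothetical. It compares the backpropagated gradient with respect to `W1` against central differences of the loss on the XOR data.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# XOR data with 0/1 targets
X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
y <- c(0, 1, 1, 0)

# Mean binary cross-entropy of a two-layer sigmoid network (hypothetical helper)
bce_loss <- function(W1, b1, w2, b2) {
  h <- sigmoid(sweep(X %*% W1, 2, b1, "+"))   # add b1[j] to column j
  y_hat <- as.numeric(sigmoid(h %*% w2 + b2))
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

# Arbitrary parameters at which to test the gradient
set.seed(1)
n_hidden <- 3
W1 <- matrix(rnorm(2 * n_hidden), nrow = 2)
b1 <- rnorm(n_hidden)
w2 <- rnorm(n_hidden)
b2 <- rnorm(1)

# Analytic gradient w.r.t. W1 via backpropagation
h     <- sigmoid(sweep(X %*% W1, 2, b1, "+"))
y_hat <- as.numeric(sigmoid(h %*% w2 + b2))
dz2   <- (y_hat - y) / nrow(X)   # plain vector of length 4
dh    <- outer(dz2, w2)          # 4 x n_hidden
dz1   <- dh * h * (1 - h)        # sigmoid derivative
dW1   <- t(X) %*% dz1            # 2 x n_hidden

# Central finite differences for the same gradient
eps_fd  <- 1e-5
num_dW1 <- matrix(0, nrow = 2, ncol = n_hidden)
for (i in seq_len(2)) {
  for (j in seq_len(n_hidden)) {
    Wp <- W1; Wp[i, j] <- Wp[i, j] + eps_fd
    Wm <- W1; Wm[i, j] <- Wm[i, j] - eps_fd
    num_dW1[i, j] <- (bce_loss(Wp, b1, w2, b2) - bce_loss(Wm, b1, w2, b2)) / (2 * eps_fd)
  }
}

max_abs_diff <- max(abs(dW1 - num_dW1))
print(max_abs_diff)  # should be tiny if the backward pass is correct
```

Here `dz2` is converted to a plain vector before calling `outer()`, so the result has the same 4 x `n_hidden` shape as `h`. A large discrepancy between `dW1` and `num_dW1` would point to a shape or sign error in the backward pass.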

Static publication-ready figure

# Packages used for the figures. okabe_ito (a colour palette) and
# theme_publication() are assumed to be defined in the site's shared setup code.
library(ggplot2)
library(tibble)
library(dplyr)

# Prepare data
and_df <- tibble(epoch = seq_along(result_and$errors), errors = result_and$errors,
                 task = "Perceptron: AND")
xor_perc_df <- tibble(epoch = seq_along(result_xor$errors), errors = result_xor$errors,
                      task = "Perceptron: XOR")
xor_mlp_df <- tibble(epoch = seq_along(mlp_result$loss_history),
                     loss = mlp_result$loss_history, task = "MLP: XOR (backprop)")

# Panel 1 & 2: Perceptron errors
perc_df <- bind_rows(and_df, xor_perc_df)
p1 <- ggplot(perc_df, aes(x = epoch, y = errors, color = task)) +
  geom_line(linewidth = 0.8) +
  facet_wrap(~task, scales = "free_y") +
  scale_color_manual(values = c(okabe_ito[3], okabe_ito[6])) +
  labs(x = "Epoch", y = "Classification errors") +
  theme_publication() +
  theme(legend.position = "none", strip.text = element_text(face = "bold"))

# Panel 3: MLP loss
p2 <- ggplot(xor_mlp_df, aes(x = epoch, y = loss)) +
  geom_line(color = okabe_ito[5], linewidth = 0.8) +
  labs(x = "Epoch", y = "Cross-entropy loss",
       subtitle = "MLP: XOR (backprop)") +
  theme_publication()

# Combine the panels side by side with gridExtra
gridExtra::grid.arrange(p1, p2, ncol = 2, widths = c(2, 1),
                         top = grid::textGrob("Neural network learning: from Perceptron to backpropagation",
                                              gp = grid::gpar(fontsize = 14, fontface = "bold")))

Interactive figure

# Visualise the MLP decision boundary for XOR
grid_pts <- expand.grid(
  x1 = seq(-0.5, 1.5, length.out = 100),
  x2 = seq(-0.5, 1.5, length.out = 100)
)

# Forward pass through trained network
X_grid <- as.matrix(grid_pts)
z1 <- X_grid %*% mlp_result$W1 + matrix(mlp_result$b1, nrow = nrow(X_grid),
                                          ncol = length(mlp_result$b1), byrow = TRUE)
h <- sigmoid(z1)
y_grid <- sigmoid(h %*% mlp_result$w2 + mlp_result$b2)
grid_pts$prediction <- as.numeric(y_grid)
# XOR training points
xor_points <- tibble(
  x1 = c(0, 0, 1, 1), x2 = c(0, 1, 0, 1),
  label = c("0", "1", "1", "0")
)

p_boundary <- ggplot() +
  geom_tile(data = grid_pts, aes(x = x1, y = x2, fill = prediction),
            alpha = 0.7) +
  scale_fill_gradient2(low = okabe_ito[6], mid = "white", high = okabe_ito[3],
                        midpoint = 0.5, name = "P(y=1)") +
  geom_point(data = xor_points, aes(x = x1, y = x2, color = label),
             size = 4, shape = 16) +
  scale_color_manual(values = c("0" = okabe_ito[6], "1" = okabe_ito[3]),
                      name = "True label") +
  coord_fixed() +
  labs(
    title = "MLP decision boundary for XOR",
    subtitle = "Two-layer network with 4 hidden units learns the non-linear boundary",
    x = expression(x[1]), y = expression(x[2])
  ) +
  theme_publication()

# ggplotly() and config() come from the plotly package
plotly::ggplotly(p_boundary) |>
  plotly::config(displaylogo = FALSE,
                 modeBarButtonsToRemove = c("select2d", "lasso2d"))

Interpretation

This historical journey through neural network milestones illustrates a fundamental principle that connects machine learning to game theory: the structure of the hypothesis space determines what can be learned. The single-layer perceptron, despite its elegant convergence guarantee for linearly separable data, is fundamentally limited: as Minsky and Papert (1969) proved, it cannot represent functions like XOR that require nonlinear decision boundaries. This is a failure of the model class, not of the learning algorithm.

Adding a hidden layer with nonlinear activations transforms the problem. The hidden units learn internal representations in which the data become linearly separable, and backpropagation (Rumelhart et al. 1986) provides an efficient gradient-based method for optimizing the entire network end to end. The XOR example, trivial by modern standards, was historically pivotal because it demonstrated that depth matters: a two-layer network with just 4 hidden units solves what no single-layer network can. The learning curves tell the same story: the perceptron's error count on XOR never reaches zero, while the MLP's cross-entropy loss can be driven down by gradient descent.

This lesson scaled up to the deep learning revolution: Krizhevsky et al. (2012) showed that a deep convolutional network with millions of parameters, trained on a large labeled dataset with the same backpropagation principle, could beat the previous state of the art in image classification by a wide margin. The connection to game theory is not incidental: training a neural network is an optimization problem, adversarial training involves minimax games (GANs), and multi-agent reinforcement learning combines deep learning with strategic interaction.
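The idea that a hidden layer makes XOR linearly separable can be illustrated without any training. The sketch below uses a hand-built hidden layer, not a learned one: two threshold units computing OR and AND of the inputs. The names `step_fn`, `h_or`, `h_and`, and the thresholds 0.5 and 1.5 are assumptions chosen for illustration, and the example uses two hidden units, not the four used in the tutorial.

```r
# Hand-built hidden layer (not learned): OR and AND of the two inputs.
step_fn <- function(z) as.numeric(z > 0)

X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
y_xor <- c(-1, 1, 1, -1)

h_or  <- step_fn(X[, 1] + X[, 2] - 0.5)   # fires if at least one input is 1
h_and <- step_fn(X[, 1] + X[, 2] - 1.5)   # fires only if both inputs are 1
H <- cbind(h_or, h_and)

# In (h_or, h_and) space a single line separates the classes:
# predict +1 when h_or - h_and - 0.5 > 0, else -1.
score  <- H[, "h_or"] - H[, "h_and"] - 0.5
y_pred <- ifelse(score > 0, 1, -1)

print(cbind(X, H, y_pred))
```

In the hidden space the four inputs map to only three distinct points, (0,0), (1,0), and (1,1), and the two positive examples coincide at (1,0), which a single line isolates. A trained network with sigmoid units finds a smooth representation that serves the same purpose.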

References

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25. https://doi.org/10.1145/3065386.
Minsky, Marvin, and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press.
Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386–408. https://doi.org/10.1037/h0042519.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.

Citation

BibTeX citation:
@online{heller2026,
  author = {Heller, Raban},
  title = {From {Perceptron} to Deep Learning — a Historical {R}
    Implementation},
  date = {2026-05-08},
  url = {https://r-heller.github.io/equilibria/tutorials/ai-ml-foundations-and-applications/perceptron-to-deep-learning-historical-r-implementation/},
  langid = {en}
}
For attribution, please cite this work as:
Heller, Raban. 2026. “From Perceptron to Deep Learning — a Historical R Implementation.” May 8. https://r-heller.github.io/equilibria/tutorials/ai-ml-foundations-and-applications/perceptron-to-deep-learning-historical-r-implementation/.