UMAP and t-SNE: Overview

Multivariate Statistics
umap
tsne
embedding
visualisation
Modern non-linear dimensionality reduction for visualisation: UMAP, t-SNE, and their caveats
Published: April 17, 2026

Introduction

UMAP (uniform manifold approximation and projection) and t-SNE (t-distributed stochastic neighbour embedding) are non-linear dimensionality-reduction techniques widely used to visualise high-dimensional data such as single-cell expression profiles, images, and learned embeddings.

Prerequisites

PCA, non-linear dimensionality reduction.

Theory

Both methods optimise a 2D or 3D embedding such that pairwise similarities (computed in the original high-dimensional space) are preserved.

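  • t-SNE: converts pairwise distances in the original space to Gaussian-based neighbour probabilities, models similarities in the embedding with a heavy-tailed Student-t kernel (one degree of freedom), and minimises the KL divergence between the two distributions. Non-parametric: it yields no mapping that can be applied to new points.
  • UMAP: builds a fuzzy topological representation of the data (a weighted k-nearest-neighbour graph) and optimises the embedding to match it under a cross-entropy loss. Typically faster than t-SNE and often reported to preserve more global structure, although this depends on initialisation and parameter choices.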
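To make the t-SNE objective concrete, the sketch below evaluates the KL divergence for a given 2D layout in base R. It is a simplification, not what Rtsne does internally: it uses a single fixed Gaussian bandwidth (`sigma`, an assumed value) instead of calibrating a bandwidth per point from the perplexity, and it only evaluates the loss rather than optimising it. The function name `tsne_kl` is hypothetical.

```r
# Simplified t-SNE loss: fixed Gaussian bandwidth, no perplexity search.
tsne_kl <- function(X, Y, sigma = 1) {
  # Squared Euclidean distances in the original and embedded spaces
  d_hi <- as.matrix(dist(X))^2
  d_lo <- as.matrix(dist(Y))^2

  # High-dimensional affinities: Gaussian kernel, row-normalised, symmetrised
  P <- exp(-d_hi / (2 * sigma^2))
  diag(P) <- 0
  P <- P / rowSums(P)
  P <- (P + t(P)) / (2 * nrow(P))

  # Low-dimensional affinities: Student-t kernel with one degree of freedom
  Q <- 1 / (1 + d_lo)
  diag(Q) <- 0
  Q <- Q / sum(Q)

  # KL(P || Q), skipping entries where P is zero
  keep <- P > 0
  sum(P[keep] * log(P[keep] / Q[keep]))
}

X <- scale(as.matrix(iris[, 1:4]))
Y_pca <- prcomp(X)$x[, 1:2]                      # structured 2D layout
set.seed(1)
Y_rand <- matrix(rnorm(2 * nrow(X)), ncol = 2)   # unstructured 2D layout

kl_pca  <- tsne_kl(X, Y_pca)
kl_rand <- tsne_kl(X, Y_rand)
```

A layout that keeps neighbours together (here the first two principal components) should score a lower divergence than a random scatter; t-SNE searches for the layout that minimises this quantity.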

Both are stochastic: results vary from run to run unless the random seed is fixed.
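The effect of the seed can be checked directly. The sketch below assumes Rtsne's default single-threaded setting and uses a hypothetical helper, `embed_tsne`, that resets R's random number generator before each run.

```r
library(Rtsne)

X <- as.matrix(iris[, 1:4])

# Hypothetical helper: reset the RNG, then compute a t-SNE embedding
embed_tsne <- function(seed) {
  set.seed(seed)
  Rtsne(X, perplexity = 15, check_duplicates = FALSE)$Y
}

run_a <- embed_tsne(2026)
run_b <- embed_tsne(2026)   # same seed as run_a
run_c <- embed_tsne(1)      # different seed

same_seed_match <- isTRUE(all.equal(run_a, run_b))
diff_seed_match <- isTRUE(all.equal(run_a, run_c))
```

Runs that share a seed should give the same coordinates, while a different seed gives a different layout of the same data. Neither layout is more correct than the other.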

Assumptions

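A similarity-preserving embedding is informative for the data at hand, i.e. the chosen distance (Euclidean by default) reflects meaningful differences between observations.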

R Implementation

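library(umap)
library(Rtsne)

set.seed(2026)  # both methods are stochastic; fix the seed for reproducibility
X <- as.matrix(iris[, 1:4])

# UMAP (package defaults: n_neighbors = 15, min_dist = 0.1)
u <- umap(X)
plot(u$layout, col = iris$Species, pch = 16,
     xlab = "UMAP 1", ylab = "UMAP 2", main = "UMAP of iris")

# t-SNE; iris contains duplicated rows, so the duplicate check is disabled.
# The result is not named `t`, to avoid masking base::t().
tsne_fit <- Rtsne(X, perplexity = 15, check_duplicates = FALSE)
plot(tsne_fit$Y, col = iris$Species, pch = 16,
     xlab = "t-SNE 1", ylab = "t-SNE 2", main = "t-SNE of iris")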

Output & Results

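Each method returns a 150 × 2 matrix of embedding coordinates ($layout for UMAP, $Y for t-SNE). Plotted and coloured by species, the points fall into visibly separated clusters.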

Interpretation

“UMAP and t-SNE both separated the three iris species; UMAP preserved the setosa-versicolor gap more faithfully than t-SNE.”
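A statement about how faithfully an embedding preserves structure is easier to defend with a number. One common check is neighbourhood overlap: the share of each point's k nearest neighbours in the original space that are still among its k nearest neighbours in the embedding. The sketch below is a minimal base R version; the function name `knn_overlap` and the choice k = 10 are assumptions, and a PCA projection stands in for the embedding so the example runs without extra packages.

```r
# Mean fraction of each point's k nearest neighbours (original space)
# that are also among its k nearest neighbours in the embedding.
knn_overlap <- function(X, Y, k = 10) {
  nn <- function(M) {
    D <- as.matrix(dist(M))
    diag(D) <- Inf                       # a point is not its own neighbour
    t(apply(D, 1, function(r) order(r)[seq_len(k)]))
  }
  hi <- nn(X)
  lo <- nn(Y)
  mean(vapply(seq_len(nrow(hi)),
              function(i) length(intersect(hi[i, ], lo[i, ])) / k,
              numeric(1)))
}

X <- as.matrix(iris[, 1:4])
Y <- prcomp(X)$x[, 1:2]   # stand-in embedding; substitute the $layout or $Y matrix
score <- knn_overlap(X, Y, k = 10)
```

Values near 1 mean local neighbourhoods are largely intact. The score says nothing about distances between clusters, so it complements a visual comparison rather than replacing it.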

Practical Tips

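  • Distances in the embedding do not faithfully reflect distances in the original space; interpret only relative neighbourhoods (which points lie close together), not absolute or between-cluster distances.
  • t-SNE perplexity (typically 5-50) sets the effective neighbourhood size; try several values. Rtsne requires 3 * perplexity < nrow(X) - 1.
  • UMAP min_dist controls how tightly points are packed locally; n_neighbors sets the balance between local and global structure.
  • Apparent cluster sizes and densities in the embedding are largely artefacts of the algorithm; do not interpret them.
  • On very high-dimensional data, run PCA first to reduce noise and computation before UMAP / t-SNE. Rtsne does this by default (pca = TRUE, initial_dims = 50).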
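The tuning advice above can be tried as a small parameter sweep. The sketch below is illustrative only: the grids of perplexity and min_dist values are arbitrary choices, and UMAP settings are changed by editing a copy of `umap.defaults`.

```r
library(umap)
library(Rtsne)

set.seed(2026)
X <- as.matrix(iris[, 1:4])

# t-SNE at several perplexities (each satisfies 3 * perplexity < nrow(X) - 1)
perplexities <- c(5, 15, 30, 45)
tsne_runs <- lapply(perplexities, function(p) {
  Rtsne(X, perplexity = p, check_duplicates = FALSE)$Y
})

# UMAP at several min_dist values, set through a copy of the default config
min_dists <- c(0.01, 0.1, 0.5)
umap_runs <- lapply(min_dists, function(md) {
  cfg <- umap.defaults
  cfg$min_dist <- md
  umap(X, config = cfg)$layout
})

# One panel per setting: t-SNE on the top row, UMAP on the bottom row
op <- par(mfrow = c(2, 4), mar = c(2, 2, 2, 1))
for (i in seq_along(tsne_runs)) {
  plot(tsne_runs[[i]], col = iris$Species, pch = 16,
       main = paste("t-SNE, perplexity", perplexities[i]))
}
for (i in seq_along(umap_runs)) {
  plot(umap_runs[[i]], col = iris$Species, pch = 16,
       main = paste("UMAP, min_dist", min_dists[i]))
}
par(op)
```

Structure that appears at only one setting deserves less trust than structure that persists across the sweep.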