UMAP and t-SNE: Overview
Multivariate Statistics
Keywords: umap, tsne, embedding, visualisation
Modern non-linear dimensionality reduction for visualisation: UMAP, t-SNE, and their caveats
Introduction
UMAP (uniform manifold approximation and projection) and t-SNE (t-distributed stochastic neighbour embedding) are non-linear dimensionality-reduction techniques that dominate the visualisation of high-dimensional data (single-cell transcriptomics, imaging, learned embeddings).
Prerequisites
PCA, non-linear dimensionality reduction.
Theory
Both methods optimise a 2D or 3D embedding such that pairwise similarities (computed in the original high-dimensional space) are preserved.
- t-SNE: converts high-dimensional distances to Gaussian conditional probabilities and low-dimensional distances to Student-t probabilities, then minimises the KL divergence between the two distributions. Non-parametric.
- UMAP: models the data as a fuzzy topological structure and optimises a cross-entropy objective; typically faster than t-SNE and often credited with preserving more global structure, though this claim is debated.
Both methods are stochastic: repeated runs give different layouts unless the random seed is fixed.
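The perplexity calibration at the heart of t-SNE's input similarities can be sketched in base R. This is an illustrative re-implementation, not code from Rtsne: for one point, a Gaussian kernel width is found by binary search so that the entropy of the resulting neighbour probabilities matches a target perplexity.

```r
# Calibrate one point's Gaussian similarities to a target perplexity,
# as t-SNE does for every point (base R only; names are illustrative).
perplexity_probs <- function(d2, target_perplexity = 15,
                             tol = 1e-5, max_iter = 50) {
  # d2: squared distances from point i to all other points
  lo <- 0; hi <- Inf; beta <- 1    # beta = 1 / (2 * sigma^2)
  for (iter in 1:max_iter) {
    p <- exp(-d2 * beta)
    p <- p / sum(p)
    H <- -sum(p[p > 0] * log(p[p > 0]))
    perp <- exp(H)                 # perplexity in natural-log form
    if (abs(perp - target_perplexity) < tol) break
    if (perp > target_perplexity) {   # too flat: sharpen the kernel
      lo <- beta
      beta <- if (is.finite(hi)) (lo + hi) / 2 else beta * 2
    } else {                          # too peaked: widen the kernel
      hi <- beta
      beta <- (lo + hi) / 2
    }
  }
  p
}

set.seed(1)
X <- matrix(rnorm(50 * 4), ncol = 4)
d2 <- colSums((t(X[-1, ]) - X[1, ])^2)  # squared distances from point 1
p <- perplexity_probs(d2, target_perplexity = 15)
sum(p)  # neighbour probabilities sum to 1
```

Rtsne performs this calibration for every point before optimising the low-dimensional layout; perplexity therefore fixes the effective number of neighbours each point "sees".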
Assumptions
Similarity-preserving embedding is informative for the data type.
R Implementation
library(umap)
library(Rtsne)
set.seed(2026)
X <- as.matrix(iris[, 1:4])

# UMAP
u <- umap(X)
plot(u$layout, col = iris$Species, pch = 16,
     main = "UMAP of iris")

# t-SNE (avoid naming the result `t`, which masks base::t)
tsne_out <- Rtsne(X, perplexity = 15, check_duplicates = FALSE)
plot(tsne_out$Y, col = iris$Species, pch = 16,
     main = "t-SNE of iris")
Output & Results
2D embeddings with separated clusters (species for iris).
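Beyond eyeballing the plots, embedding quality can be quantified by neighbourhood preservation: the fraction of each point's k nearest neighbours in the original space that remain among its k nearest neighbours in the 2D layout. The sketch below uses base R only, with the first two principal components standing in for a UMAP/t-SNE layout (the helper name and k = 10 are illustrative choices).

```r
# Neighbourhood-preservation check for any 2D layout (base R only).
X <- scale(as.matrix(iris[, 1:4]))
layout2d <- prcomp(X)$x[, 1:2]   # stand-in for a UMAP/t-SNE layout
k <- 10

knn_ids <- function(M, k) {
  D <- as.matrix(dist(M))
  diag(D) <- Inf                  # a point is not its own neighbour
  t(apply(D, 1, function(d) order(d)[1:k]))
}

orig <- knn_ids(X, k)
emb  <- knn_ids(layout2d, k)
overlap <- mean(sapply(seq_len(nrow(X)),
                       function(i) length(intersect(orig[i, ], emb[i, ])) / k))
round(overlap, 2)  # fraction of shared neighbours, between 0 and 1
```

Swapping `layout2d` for `u$layout` or `tsne_out$Y` gives the same diagnostic for the actual embeddings; values near 1 indicate faithful local structure.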
Interpretation
“UMAP and t-SNE both separated the three iris species; UMAP preserved the setosa-versicolor gap more faithfully than t-SNE.”
Practical Tips
- Distances in the embedding carry no metric meaning; interpret them only as relative neighbourhoods.
- t-SNE perplexity (5-50) controls neighbourhood scale; tune it.
- UMAP min_dist controls local packing density.
- Cluster sizes in the embedding are meaningless; do not interpret them.
- Run PCA first to reduce noise before UMAP / t-SNE on very high-dimensional data.