Pearson Correlation

pearson

correlation

linear

Measuring linear association between two continuous, approximately bivariate-normal variables

Published

April 17, 2026

Research question

Pearson’s \(r\) quantifies the strength and direction of the linear association between two continuous variables. Two biomedical examples: (1) In patients with chronic heart failure, is left-ventricular ejection fraction linearly associated with six-minute-walk distance? (2) In a Parkinson’s cohort, does striatal dopamine-transporter binding correlate with motor symptom severity?

Assumptions

Assumption	How to verify in R
Both variables continuous	Check the measurement scale of each variable (e.g. `str()`)
Approximate bivariate normality	`shapiro_test` on each variable (checks marginal normality only); scatter plot with marginal histograms
Linear relationship (no obvious curvature)	Scatter plot
No extreme bivariate outliers	Mahalanobis distance, or a boxplot of each variable

If the relationship is monotonic but not linear, use Spearman. If variables are heavily skewed, apply a transformation or use Spearman / Kendall.
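The Mahalanobis-distance check listed in the table is not shown in the R code of this page, so here is a minimal sketch of one way to do it. The data frame `dat`, the planted outlier, and the 0.999 chi-square cutoff are all assumptions made for illustration; the cutoff in particular is a common convention rather than a fixed rule.

```r
# Hypothetical example data; in practice use your own two-column data frame.
set.seed(1)
dat <- data.frame(x = rnorm(50, 40, 8))
dat$y <- 200 + 7 * dat$x + rnorm(50, 0, 45)

# Append one artificial bivariate outlier (plausible x, implausible y for that x)
dat <- rbind(dat, data.frame(x = 60, y = 150))

# Squared Mahalanobis distance of each row from the bivariate centroid
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# Under bivariate normality d2 is approximately chi-square with 2 df.
# The 0.999 quantile is an assumed, conventional cutoff.
cutoff  <- qchisq(0.999, df = 2)
flagged <- which(d2 > cutoff)

dat[flagged, ]  # rows worth inspecting before trusting Pearson's r
```

Flagged rows are candidates for inspection, not automatic deletion; comparing \(r\) with and without them shows how much they drive the result.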

Hypotheses

\[H_0: \rho = 0 \qquad H_1: \rho \ne 0\]
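For reference, the two-sided test of \(H_0\) is usually based on the statistic below, where \(n\) denotes the number of paired observations (a symbol introduced here, not used elsewhere on this page). Under \(H_0\) and bivariate normality it follows a \(t\) distribution with \(n - 2\) degrees of freedom:

\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}\]

The confidence interval reported by `cor.test` is based on Fisher’s \(z\) transformation, which is approximately normal with standard error \(1/\sqrt{n-3}\):

\[z = \operatorname{artanh}(r), \qquad \rho \in \tanh\!\left(z \pm \frac{z_{1-\alpha/2}}{\sqrt{n-3}}\right)\]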

R code

library(tidyverse); library(rstatix); library(broom); library(ggstatsplot)
set.seed(42)

# 80 heart-failure patients: LVEF (%) and 6MWD (m)
hf <- tibble(
  lvef = round(rnorm(80, 38, 8)),
  mwd  = round(200 + 7 * lvef + rnorm(80, 0, 45))
)

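# Assumptions: Shapiro-Wilk test on each variable (marginal normality only),
# followed by a scatter plot to check linearity and outliers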
hf |> shapiro_test(lvef, mwd)

ggplot(hf, aes(lvef, mwd)) +
  geom_point(colour = "#2A9D8F", size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", colour = "#F4A261") +
  labs(x = "LVEF (%)", y = "6-min walk (m)") +
  theme_minimal()

# Pearson correlation with confidence interval
cor.test(hf$lvef, hf$mwd, method = "pearson") |> tidy()

# Tidy version via rstatix
hf |> cor_test(lvef, mwd, method = "pearson")

# Inline stats plot
ggscatterstats(data = hf, x = lvef, y = mwd, type = "parametric",
               xlab = "LVEF (%)", ylab = "6-min walk (m)")

Interpreting the output

Pearson \(r = 0.78\) with a 95 % CI of \([0.68, 0.86]\), \(t(78) \approx 11\), \(p < .001\). The linear association is strong: higher ejection fraction is associated with longer walking distance.
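As a sanity check, the test statistic and an approximate confidence interval can be recomputed by hand from the rounded values quoted above. This is a sketch that assumes only \(r = 0.78\) and \(n = 80\); because \(r\) is rounded to two decimals here, the interval limits may differ in the last digit from what `cor.test` prints from the unrounded data.

```r
# Values taken from the output reported in the text (rounded)
r <- 0.78
n <- 80

# t statistic with n - 2 degrees of freedom, and two-sided p-value
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Approximate 95% CI via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 1)
signif(p_val, 2)
round(ci, 2)
```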

Effect size

Pearson’s \(r\) is itself a standardized effect size. Cohen’s conventional benchmarks for \(|r|\) are 0.10 (small), 0.30 (medium), and 0.50 (large). \(r^2\) is the proportion of variance in one variable that is linearly accounted for by the other.
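A small sketch of how these two quantities could be derived in R from the \(r\) reported in this example. The label "negligible" for values below 0.10 is an assumption of this sketch, not part of Cohen’s benchmarks.

```r
r <- 0.78  # value reported in this example

# Proportion of variance shared linearly
r_squared <- r^2

# Label |r| using the benchmarks 0.10, 0.30, 0.50;
# "negligible" (below 0.10) is a label assumed for this sketch.
label <- cut(abs(r),
             breaks = c(0, 0.10, 0.30, 0.50, 1),
             labels = c("negligible", "small", "medium", "large"),
             right = FALSE, include.lowest = TRUE)

r_squared
as.character(label)
```

The benchmarks are rough conventions; what counts as a meaningful correlation depends on the field and the measurement reliability of both variables.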

Reporting (APA 7)

LVEF was positively correlated with 6-minute-walk distance (r = .78, 95% CI [.68, .86], t(78) = 11.0, p < .001). The two variables share 60.8% of their variance.

Common pitfalls

Pearson is sensitive to outliers; one or two leverage points can dominate \(r\).
A near-zero Pearson does not imply no relationship; it means no linear relationship. Check the scatter plot for curvature.
Correlation is not causation: a strong \(r\) between two variables does not establish that one causes the other; a confounder may drive both.
Testing many correlations without multiple-testing correction inflates false positives (the “correlation matrix” trap).
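Two of the pitfalls above are easy to demonstrate with simulated data. The sketch below uses made-up values throughout: a parabola for the curvature case and ten unrelated random "biomarkers" for the multiple-testing case. Holm is chosen here as one reasonable correction among several offered by `p.adjust`.

```r
# Pitfall: near-zero Pearson despite a perfect (non-linear) relationship
x <- -10:10
y <- x^2
r_curved <- cor(x, y)  # essentially 0: a symmetric parabola has no linear trend

# Pitfall: many tests. Simulate 10 mutually unrelated variables (hypothetical data)
set.seed(123)
m <- matrix(rnorm(50 * 10), ncol = 10)
pairs_idx <- combn(ncol(m), 2)  # all 45 variable pairs

# Raw p-value of the Pearson test for each pair
p_raw <- apply(pairs_idx, 2, function(ij) {
  cor.test(m[, ij[1]], m[, ij[2]])$p.value
})

# Holm correction controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")

sum(p_raw  < 0.05)  # uncorrected "hits" among truly null pairs
sum(p_holm < 0.05)  # hits remaining after correction
```

Any uncorrected hits in the second part are false positives by construction, since the columns of `m` are generated independently.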

Parametric vs. non-parametric alternative

Non-parametric: Spearman rank correlation and Kendall’s tau.
Directed: simple linear regression when one variable is considered the predictor.
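The alternatives listed above can be sketched in base R as follows. The vectors `lvef` and `mwd` are simulated here with the same recipe as the heart-failure example so the block runs on its own; `exact = FALSE` is an assumed choice to avoid warnings when the rounded values contain ties.

```r
set.seed(42)
# Hypothetical data, simulated with the same recipe as the example above
lvef <- round(rnorm(80, 38, 8))
mwd  <- round(200 + 7 * lvef + rnorm(80, 0, 45))

# Rank-based alternatives; exact = FALSE uses the asymptotic p-value (ties present)
rho <- cor.test(lvef, mwd, method = "spearman", exact = FALSE)
tau <- cor.test(lvef, mwd, method = "kendall",  exact = FALSE)

# Directed alternative: regress walking distance on ejection fraction
fit <- lm(mwd ~ lvef)

# In simple linear regression the standardized slope equals Pearson's r
b_std <- coef(fit)[["lvef"]] * sd(lvef) / sd(mwd)
all.equal(b_std, cor(lvef, mwd))

unname(rho$estimate)
unname(tau$estimate)
coef(fit)
```

The regression carries the same information about linear association as \(r\) but adds an interpretable slope in the original units (metres per percentage point of LVEF).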