Pearson Correlation

Tags: pearson, correlation, linear, r
Measuring linear association between two continuous, approximately bivariate-normal variables
Published: April 17, 2026

Research question

Pearson’s \(r\) quantifies the strength and direction of the linear association between two continuous variables. Biomedical examples: (1) in patients with chronic heart failure, is left-ventricular ejection fraction linearly associated with six-minute-walk distance? (2) In a Parkinson’s cohort, does striatal dopamine-transporter binding correlate with motor symptom severity?
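
For reference, the sample coefficient for \(n\) paired observations \((x_i, y_i)\) is conventionally defined as follows. The symbols \(\bar{x}\) and \(\bar{y}\) (sample means) are notation introduced here for illustration.

\[r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}\]

The value lies between \(-1\) and \(1\); the sign gives the direction of the association and the magnitude its strength.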

Assumptions

| Assumption | How to verify in R |
|---|---|
| Both variables continuous | Check the measurement scale |
| Approximate bivariate normality | shapiro_test on each variable; scatter plot with marginal histograms |
| Linear relationship (no obvious curvature) | Scatter plot |
| No extreme bivariate outliers | Mahalanobis distance, or a boxplot of each variable |

If the relationship is monotonic but not linear, use Spearman. If variables are heavily skewed, apply a transformation or use Spearman / Kendall.
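
The Mahalanobis-distance check listed in the table can be sketched in base R as below. This is a minimal illustration, not a prescribed procedure: the simulated variables mirror the heart-failure example used later on this page, and the cutoff (the 99.9th percentile of a chi-squared distribution with 2 degrees of freedom) is an assumed convention chosen for the example.

```r
# Minimal sketch: flag bivariate outliers via squared Mahalanobis distance.
set.seed(42)
lvef <- round(rnorm(80, 38, 8))                   # simulated LVEF (%)
mwd  <- round(200 + 7 * lvef + rnorm(80, 0, 45))  # simulated 6MWD (m)

X  <- cbind(lvef, mwd)
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances

# Assumed cutoff: under bivariate normality d2 is approximately chi-squared(2)
cutoff <- qchisq(0.999, df = 2)

flagged <- which(d2 > cutoff)  # row indices of candidate outliers (may be empty)
flagged
```

Flagged rows are candidates for inspection, not automatic exclusions; looking at them in the scatter plot is usually the next step.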

Hypotheses

\[H_0: \rho = 0 \qquad H_1: \rho \ne 0\]
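
As a worked reference for how this hypothesis is tested: under \(H_0\) and bivariate normality, the sample correlation \(r\) from \(n\) pairs is converted to a \(t\) statistic with \(n - 2\) degrees of freedom.

\[t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \sim t_{n-2}\]

The confidence interval reported by cor.test is based on Fisher’s \(z\) transformation rather than on the \(t\) statistic:

\[z = \operatorname{artanh}(r), \qquad \mathrm{SE}(z) = \frac{1}{\sqrt{n - 3}}, \qquad \text{CI}_{1-\alpha} = \tanh\!\left(z \pm \frac{z_{1-\alpha/2}}{\sqrt{n - 3}}\right)\]

Here \(z_{1-\alpha/2}\) is the standard normal quantile (about 1.96 for a 95% interval).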

R code

library(tidyverse)    # tibble, ggplot2
library(rstatix)      # shapiro_test(), cor_test()
library(broom)        # tidy()
library(ggstatsplot)  # ggscatterstats()
set.seed(42)          # reproducible simulation

# 80 heart-failure patients: LVEF (%) and 6MWD (m)
hf <- tibble(
  lvef = round(rnorm(80, 38, 8)),
  mwd  = round(200 + 7 * lvef + rnorm(80, 0, 45))
)

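# Assumptions: Shapiro-Wilk on each margin
# (marginal normality is necessary but not sufficient for bivariate normality)
hf |> shapiro_test(lvef, mwd)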

ggplot(hf, aes(lvef, mwd)) +
  geom_point(colour = "#2A9D8F", size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", colour = "#F4A261") +
  labs(x = "LVEF (%)", y = "6-min walk (m)") +
  theme_minimal()

# Pearson correlation with confidence interval
cor.test(hf$lvef, hf$mwd, method = "pearson") |> tidy()

# Tidy version via rstatix
hf |> cor_test(lvef, mwd, method = "pearson")

# Inline stats plot
ggscatterstats(data = hf, x = lvef, y = mwd, type = "parametric",
               xlab = "LVEF (%)", ylab = "6-min walk (m)")

Interpreting the output

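For the simulated data, Pearson \(r = 0.78\), 95% CI \([0.68, 0.86]\), \(t(78) \approx 11\), \(p < .001\). The linear association is strong and positive: higher ejection fraction is associated with longer walking distance. Because the interval excludes zero, \(H_0: \rho = 0\) is rejected at the 5% level.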
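
The reported statistics can be cross-checked by hand. The sketch below recomputes the \(t\) statistic, the two-sided \(p\)-value, and a Fisher-\(z\) confidence interval from the rounded values \(r = 0.78\) and \(n = 80\). Because \(r\) is rounded here, the interval limits may differ in the second decimal from what cor.test returns on the full-precision estimate.

```r
# Minimal sketch: recompute test statistic and CI from r and n alone.
r  <- 0.78   # rounded estimate quoted in the text
n  <- 80     # sample size of the simulated cohort
df <- n - 2

# t statistic and two-sided p-value for H0: rho = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% CI via Fisher's z transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 3)
p_val
```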

Effect size

Pearson’s \(r\) is itself the effect size. Cohen’s conventional benchmarks for \(|r|\) are 0.10 (small), 0.30 (medium), and 0.50 (large). \(r^2\) is the proportion of variance the two variables share linearly.
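
As a small illustration, the benchmarks can be applied programmatically. The helper name label_cohen_r and the "negligible" label for values below 0.10 are assumptions made for this sketch; Cohen’s benchmarks themselves are rough conventions, not strict cut points.

```r
# Hypothetical helper: map |r| onto Cohen's conventional benchmarks.
label_cohen_r <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.10, 0.30, 0.50, 1),
      labels = c("negligible", "small", "medium", "large"),
      right = FALSE,          # intervals are closed on the left: [0.10, 0.30) etc.
      include.lowest = TRUE)  # so that |r| = 1 falls in the last interval
}

r <- 0.78
label_cohen_r(r)  # benchmark category for the example estimate
r^2               # proportion of linearly shared variance
```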

Reporting (APA 7)

LVEF was positively correlated with 6-minute-walk distance (r = .78, 95% CI [.68, .86], t(78) = 11.0, p < .001). The two variables share about 60.8% of their variance.

Common pitfalls

  • Pearson is sensitive to outliers; one or two leverage points can dominate \(r\).
  • A near-zero Pearson does not imply no relationship; it means no linear relationship. Check the scatter plot for curvature.
  • Correlation is not causation: a strong \(r\) between two variables does not establish one causes the other; confounders are always possible.
  • Testing many correlations without multiple-testing correction inflates false positives (the “correlation matrix” trap).
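
Three of the pitfalls above can be demonstrated with a few lines of base R. This is a sketch on made-up data: the seed, the position of the added outlier, and the vector of raw p-values are all hypothetical, and Holm’s method is used only as one example of a multiple-testing correction.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

# 1. Outlier sensitivity: two unrelated variables plus one extreme point
x <- rnorm(30)
y <- rnorm(30)
r_clean   <- cor(x, y)                 # independent noise
r_outlier <- cor(c(x, 10), c(y, 10))   # same data with one point at (10, 10)

# 2. Curvature: a perfect quadratic relationship on a symmetric range
u <- seq(-3, 3, length.out = 61)
v <- u^2
r_curve <- cor(u, v)                   # essentially zero despite exact dependence

# 3. Multiple testing: adjust a set of hypothetical raw p-values
p_raw  <- c(0.001, 0.012, 0.030, 0.045, 0.200)
p_holm <- p.adjust(p_raw, method = "holm")

round(c(clean = r_clean, with_outlier = r_outlier, curve = r_curve), 3)
data.frame(p_raw, p_holm)
```

Comparing r_clean with r_outlier shows how a single point can manufacture a correlation, and the Holm-adjusted column shows which of the nominally significant results survive correction.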

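Parametric vs. non-parametric alternative

Pearson’s \(r\) is the parametric option. When linearity or approximate bivariate normality is doubtful, the rank-based alternatives are Spearman’s \(\rho\) and Kendall’s \(\tau\), which assess monotonic rather than linear association.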
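
The rank-based alternatives named under Assumptions (Spearman, Kendall) are available through the same cor.test function. The sketch below compares all three on simulated data that are monotonic but strongly non-linear and skewed; the data-generating model is an assumption chosen purely to make the contrast visible.

```r
# Minimal sketch: Pearson vs. rank-based coefficients on skewed, monotonic data.
set.seed(42)
x <- rnorm(80)
y <- exp(2 * x + rnorm(80, 0, 0.3))  # monotonic in x, but far from linear

pearson  <- cor.test(x, y, method = "pearson")
spearman <- cor.test(x, y, method = "spearman")
kendall  <- cor.test(x, y, method = "kendall")

# Collect the three point estimates side by side
c(pearson  = unname(pearson$estimate),
  spearman = unname(spearman$estimate),
  kendall  = unname(kendall$estimate))
```

On data like these the rank-based coefficients stay high because the ordering is preserved, while Pearson’s \(r\) is pulled down by the curvature and the long right tail.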

Further reading

  • Bishara, A. J., & Hittner, J. B. (2012). Testing the significance of a correlation with nonnormal data. Psychological Methods, 17(3), 399–417.

Structure inspired by the University of Zurich Methodenberatung (methodenberatung.uzh.ch). All text, examples, R code, and reporting sentences are independently authored in English.