Cohen’s Kappa

Clinical Biostatistics
kappa
agreement
reliability
Chance-corrected agreement between two raters on categorical data
Published

April 17, 2026

Introduction

Cohen’s kappa (1960) measures agreement between two raters on a categorical scale, correcting for the agreement expected by chance. It is the de facto standard for inter-rater reliability on nominal outcomes and is widely reported in reliability studies of imaging, pathology, and diagnostic assessments.

Prerequisites

Categorical data; proportion agreement.

Theory

\[\kappa = \frac{p_o - p_e}{1 - p_e},\] where \(p_o\) is observed agreement and \(p_e\) is agreement expected by chance (product of marginals). \(\kappa = 1\) means perfect agreement; 0 means chance-level; negative means worse than chance.
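As a concrete illustration, kappa can be computed directly from the definition. The 2x2 table below is made up for the example; the marginal products give \(p_e\):

```r
# Hypothetical 2x2 table of two raters' calls (rows = rater 1, cols = rater 2)
tab <- matrix(c(40, 10,
                10, 40), nrow = 2, byrow = TRUE)
n <- sum(tab)

p_o <- sum(diag(tab)) / n                      # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement from marginals
kappa <- (p_o - p_e) / (1 - p_e)
kappa  # 0.6 here: p_o = 0.8, p_e = 0.5
```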

Landis-Koch (1977) benchmarks: < 0.00 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect. These cut-offs are conventional rather than theoretically derived.
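For reporting, a small helper (illustrative only, not from any package) maps a kappa estimate onto the Landis-Koch labels:

```r
# Map kappa values to Landis-Koch benchmark labels (right-closed intervals,
# so 0.60 is "moderate" and 0.80 is "substantial")
landis_koch <- function(kappa) {
  cut(kappa,
      breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1),
      labels = c("poor", "slight", "fair", "moderate",
                 "substantial", "almost perfect"))
}
landis_koch(c(0.15, 0.58, 0.85))  # slight, moderate, almost perfect
```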

Assumptions

Two raters; categories are exclusive and exhaustive; ratings are independent between raters; the same subject is rated by both.

R Implementation

library(psych)

set.seed(2026)
n <- 100
# Simulate two raters with ~80% agreement
rater1 <- factor(sample(c("pos", "neg"), n, replace = TRUE))
agree <- rbinom(n, 1, 0.8)
rater2 <- ifelse(agree == 1, as.character(rater1),
                 ifelse(rater1 == "pos", "neg", "pos"))
rater2 <- factor(rater2, levels = levels(rater1))

tab <- table(rater1, rater2)
tab

# psych::cohen.kappa accepts an n x 2 set of ratings or a contingency table;
# a data frame keeps the factor labels, whereas cbind() coerces factors to integer codes
cohen.kappa(data.frame(rater1, rater2))
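The bootstrap confidence interval noted in the tips can be sketched in base R by resampling subjects. The helper `kappa_hat` below is illustrative (not a package function) and reuses the simulated `rater1`/`rater2` from above:

```r
# Percentile-bootstrap 95% CI for kappa: resample subjects with replacement
kappa_hat <- function(x, y) {
  tab <- table(x, y)  # factors share levels, so the table stays 2 x 2
  n <- sum(tab)
  p_o <- sum(diag(tab)) / n
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2
  (p_o - p_e) / (1 - p_e)
}

set.seed(1)
boot_kappa <- replicate(2000, {
  idx <- sample(length(rater1), replace = TRUE)
  kappa_hat(rater1[idx], rater2[idx])
})
quantile(boot_kappa, c(0.025, 0.975))  # percentile 95% CI
```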

Output & Results

The cross-table shows roughly 80% raw agreement; Cohen's kappa is about 0.6, and the 95% CI is fairly wide, reflecting the modest sample size (n = 100).

Interpretation

“Inter-rater agreement was moderate (kappa = 0.58, 95% CI 0.41-0.75) per the Landis-Koch benchmarks; observed agreement was 80% versus 52% expected by chance.”

Practical Tips

  • Kappa depends on prevalence; very low-prevalence categories produce small kappa even with high agreement (kappa paradox).
  • Report both kappa and observed agreement for transparency.
  • For > 2 raters, use Fleiss’ kappa or intraclass correlation.
  • For ordinal data, use weighted kappa to credit partial agreement.
  • Confidence interval on kappa is typically via bootstrap or delta method.
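The prevalence effect in the first tip is easy to demonstrate. Both hypothetical tables below show identical 90% raw agreement, but the skewed-prevalence one yields a much smaller kappa because chance agreement is inflated:

```r
# Kappa from a 2x2 contingency table (same formula as the theory section)
kappa_from_tab <- function(tab) {
  n <- sum(tab)
  p_o <- sum(diag(tab)) / n
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2
  (p_o - p_e) / (1 - p_e)
}

balanced <- matrix(c(45,  5,
                      5, 45), nrow = 2, byrow = TRUE)  # ~50% prevalence
skewed   <- matrix(c(85,  5,
                      5,  5), nrow = 2, byrow = TRUE)  # rare "positive" category
kappa_from_tab(balanced)  # 0.8
kappa_from_tab(skewed)    # ~0.44, despite the same 90% observed agreement
```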