Cohen’s Kappa
Introduction
Cohen’s kappa (1960) measures agreement between two raters on a categorical scale, correcting for the agreement expected by chance. It is the de facto standard for inter-rater reliability on nominal outcomes and is widely reported in reliability studies of imaging, pathology, and diagnostic assessments.
Prerequisites
Categorical data; proportion agreement.
Theory
\[\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_k p_{1k}\,p_{2k},\] where \(p_o\) is the observed proportion of agreement and \(p_e\) is the agreement expected by chance, with \(p_{1k}\) and \(p_{2k}\) the marginal proportions of category \(k\) for raters 1 and 2. \(\kappa = 1\) means perfect agreement; 0 means chance-level agreement; negative values mean agreement worse than chance.
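For example, if both raters call 50 % of cases positive and agree on 80 % of cases, then \(p_e = 0.5 \times 0.5 + 0.5 \times 0.5 = 0.50\) and \(\kappa = (0.80 - 0.50)/(1 - 0.50) = 0.60\).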
Landis-Koch (1977) benchmarks: below 0 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
Assumptions
Two raters; categories are exclusive and exhaustive; ratings are independent between raters; the same subject is rated by both.
R Implementation
library(psych)
set.seed(2026)
n <- 100
# Simulate two raters with ~80% agreement
rater1 <- factor(sample(c("pos", "neg"), n, replace = TRUE))
agree <- rbinom(n, 1, 0.8)  # 1 = the two raters agree on this subject
rater2 <- ifelse(agree == 1, as.character(rater1),
                 ifelse(rater1 == "pos", "neg", "pos"))  # flip the label on disagreement
rater2 <- factor(rater2, levels = levels(rater1))
tab <- table(rater1, rater2)
tab
cohen.kappa(cbind(rater1, rater2))  # Cohen's kappa with confidence bounds
Output & Results
Cross-table of ratings and Cohen’s kappa (~0.6); 95 % CI reflects the modest sample size.
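As a cross-check, kappa can be computed directly from the cross-table tab above; a minimal sketch that should match the unweighted estimate from cohen.kappa() up to rounding:
p_o <- sum(diag(tab)) / n                      # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement from the marginals
(p_o - p_e) / (1 - p_e)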
Interpretation
“Inter-rater agreement was substantial (kappa = 0.58, 95 % CI 0.41-0.75) per Landis-Koch; observed agreement 80 % with chance agreement 52 %.”
Practical Tips
- Kappa depends on prevalence; very low-prevalence categories can yield a small (even negative) kappa despite high observed agreement (the kappa paradox); see the sketch after this list.
- Report both kappa and observed agreement for transparency.
- For more than two raters, use Fleiss' kappa or an intraclass correlation coefficient.
- For ordinal data, use weighted kappa to credit partial agreement; a short example follows the list.
- Confidence intervals for kappa are typically obtained via an asymptotic (delta-method) standard error or the bootstrap; a bootstrap sketch closes the section.
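To illustrate the kappa paradox, a minimal sketch with made-up counts: observed agreement is 90 %, yet kappa comes out slightly negative because one category is rare and the marginals are highly skewed.
# Illustrative counts: 90 % observed agreement, but "neg" is rare
paradox <- matrix(c(90, 5,
                    5,  0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(rater1 = c("pos", "neg"),
                                  rater2 = c("pos", "neg")))
p_o <- sum(diag(paradox)) / sum(paradox)                          # 0.90
p_e <- sum(rowSums(paradox) * colSums(paradox)) / sum(paradox)^2  # 0.905
(p_o - p_e) / (1 - p_e)                                           # about -0.05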
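For the ordinal case, a sketch with simulated 3-point grades (the variable names and the simulation are illustrative); cohen.kappa() reports both an unweighted and a weighted kappa, and the weighted version credits one-step near-misses.
# Simulate ordinal grades (1-3) from two raters with occasional one-step disagreements
grade1 <- sample(1:3, n, replace = TRUE)
shift  <- sample(c(-1, 0, 1), n, replace = TRUE, prob = c(0.1, 0.8, 0.1))
grade2 <- pmin(pmax(grade1 + shift, 1), 3)
cohen.kappa(cbind(grade1, grade2))  # weighted kappa credits near-misses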
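A nonparametric bootstrap CI can be obtained by resampling subjects; a minimal sketch, where kappa_hat is a hypothetical helper written here for illustration.
# Percentile bootstrap CI for kappa by resampling subjects
kappa_hat <- function(r1, r2) {
  t2  <- table(r1, r2)
  p_o <- sum(diag(t2)) / sum(t2)
  p_e <- sum(rowSums(t2) * colSums(t2)) / sum(t2)^2
  (p_o - p_e) / (1 - p_e)
}
boot_kappa <- replicate(2000, {
  i <- sample.int(n, replace = TRUE)
  kappa_hat(rater1[i], rater2[i])
})
quantile(boot_kappa, c(0.025, 0.975))  # percentile 95 % CI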