AI Alignment as a Principal-Agent Problem

ethics-applications
ai-alignment
principal-agent
Model the AI alignment challenge as a principal-agent game with moral hazard, implementing optimal contract design under limited liability and participation constraints, and analysing how increasing agent capability affects the principal’s ability to maintain alignment in R.
Author

Raban Heller

Published

May 8, 2026

Modified

May 8, 2026

Keywords

AI alignment, principal-agent, moral hazard, contract theory, incentive design

Introduction & motivation

The challenge of aligning artificial intelligence systems with human values and intentions can be productively understood through the lens of principal-agent theory, one of the most extensively developed frameworks in microeconomics and contract theory. In the classical principal-agent model, a principal (employer, shareholder, regulator) delegates a task to an agent (employee, manager, regulated firm) whose actions are not perfectly observable. The core problem is moral hazard: the agent may choose actions that serve their own interests rather than the principal’s, because the principal cannot directly verify the agent’s effort or intentions. This misalignment of objectives, combined with informational asymmetry, creates a fundamental tension that contract theory seeks to resolve through incentive design.

The analogy to AI alignment is both natural and illuminating. The principal is a human (or humanity collectively) who delegates cognitive and decision-making tasks to an AI agent. The AI agent may have objectives that differ from the principal’s – not necessarily through malice, but through misspecification of the objective function, distributional shift in the deployment environment, reward hacking, or emergent goal-directed behaviour that was not anticipated during training. The “hidden action” problem maps directly: the principal cannot perfectly observe the AI’s reasoning process, its internal representations, or the full consequences of its actions in complex environments. Just as a manager cannot observe every decision an employee makes throughout the day, a human operator cannot monitor every computation or inference step of a sophisticated AI system.

What makes the AI alignment problem particularly challenging from a principal-agent perspective is the capability dimension. In the classical economic model, the agent’s ability is typically fixed or changes slowly over time. In the AI setting, the agent’s capability can increase dramatically – through scaling, fine-tuning, or recursive self-improvement – potentially surpassing the principal’s ability to evaluate the agent’s actions or even understand its decision-making process. This introduces a dynamic that has no close parallel in traditional contract theory: as the agent becomes more capable, the information asymmetry between principal and agent grows, the set of actions available to the agent expands, and the principal’s ability to design effective incentive contracts may erode. Understanding this dynamic is crucial for AI safety research, which must develop alignment techniques that remain effective as AI systems become more powerful.

In this tutorial, we implement the classical moral hazard model with binary effort choice, compute optimal contracts under limited liability and participation constraints, and then extend the model to capture the distinctive features of the AI alignment problem. We analyse two alignment strategies – monitoring (investing in the ability to observe the agent’s actions) and incentives (designing reward structures that make aligned behaviour profitable for the agent) – and study how their relative effectiveness changes as agent capability increases. We show that monitoring becomes increasingly costly and less effective as agent capability grows, while incentive-based approaches face their own limits when the agent’s outside options improve with capability. This analysis provides a formal framework for the informal intuition that “alignment becomes harder as AI becomes more capable” and suggests that robust alignment will likely require a combination of approaches rather than reliance on any single mechanism.

Mathematical formulation

Classical moral hazard model. A risk-neutral principal hires a risk-averse agent to perform a task.

  • Agent chooses effort \(e \in \{e_L, e_H\}\) (low or high effort)
  • Output \(y\) depends stochastically on effort: \(P(y = y_H \mid e_H) = p_H > p_L = P(y = y_H \mid e_L)\)
  • The agent has cost of effort \(c(e_H) = c > 0\), \(c(e_L) = 0\)
  • Agent’s utility: \(u(w) - c(e) = \sqrt{w} - c(e)\) (concave in wage)
  • Limited liability: \(w \geq 0\)

Principal’s problem (induce high effort): \[ \max_{w_H, w_L} \quad p_H \cdot y_H + (1-p_H) \cdot y_L - [p_H w_H + (1-p_H) w_L] \]

Subject to:

Incentive compatibility (IC): \[ p_H \sqrt{w_H} + (1-p_H)\sqrt{w_L} - c \geq p_L \sqrt{w_H} + (1-p_L)\sqrt{w_L} \]

Participation (IR): \(p_H \sqrt{w_H} + (1-p_H)\sqrt{w_L} - c \geq \bar{u}\)

Limited liability (LL): \(w_H \geq 0\), \(w_L \geq 0\)

AI extension. Agent capability \(\alpha \geq 1\) affects:

  1. Output: \(y_H(\alpha) = \alpha \cdot y_H^{base}\) (more capable agent produces more)
  2. Monitoring cost: \(m(\alpha) = m_0 \cdot \alpha^\gamma\) (harder to monitor capable agents)
  3. Outside option: \(\bar{u}(\alpha) = \bar{u}_0 \cdot \alpha^\delta\) (capable agents have alternatives)
  4. Probability of detection under monitoring: \(d(\alpha) = d_0 \cdot \alpha^{-\beta}\) (capable agents evade monitoring)

R implementation

set.seed(42)

# --- Classical moral hazard model ---
y_H <- 100  # high output
y_L <- 20   # low output
p_H <- 0.8  # P(high output | high effort)
p_L <- 0.3  # P(high output | low effort)
c_effort <- 1  # cost of high effort
u_bar <- 2     # reservation utility

cat("=== Classical Moral Hazard Model ===\n")
=== Classical Moral Hazard Model ===
cat(sprintf("Outputs: y_H=%d, y_L=%d\n", y_H, y_L))
Outputs: y_H=100, y_L=20
cat(sprintf("Probabilities: p_H=%.1f, p_L=%.1f\n", p_H, p_L))
Probabilities: p_H=0.8, p_L=0.3
cat(sprintf("Effort cost: c=%d, Reservation utility: u_bar=%d\n", c_effort, u_bar))
Effort cost: c=1, Reservation utility: u_bar=2
# Solve for optimal contract (analytical for binary case)
# IC binds: (p_H - p_L)(sqrt(w_H) - sqrt(w_L)) = c
# IR binds: p_H * sqrt(w_H) + (1-p_H) * sqrt(w_L) - c = u_bar

# From IC: sqrt(w_H) - sqrt(w_L) = c / (p_H - p_L)
# From IR: p_H * sqrt(w_H) + (1-p_H) * sqrt(w_L) = u_bar + c

solve_contract <- function(p_H, p_L, c_eff, u_bar_val) {
  delta_p <- p_H - p_L
  # IC: sqrt(w_H) - sqrt(w_L) = c/delta_p
  # IR: p_H * sqrt(w_H) + (1-p_H) * sqrt(w_L) = u_bar + c

  # Let a = sqrt(w_H), b = sqrt(w_L)
  # a - b = c/delta_p
  # p_H * a + (1-p_H) * b = u_bar + c
  # => p_H*(b + c/delta_p) + (1-p_H)*b = u_bar + c
  # => b + p_H * c/delta_p = u_bar + c
  # => b = u_bar + c - p_H * c/delta_p
  # => b = u_bar + c*(1 - p_H/delta_p)

  b <- u_bar_val + c_eff * (1 - p_H / delta_p)
  a <- b + c_eff / delta_p

  # Apply limited liability
  b <- max(b, 0)
  a <- max(a, b + c_eff / delta_p)

  w_L <- b^2
  w_H <- a^2

  expected_wage <- p_H * w_H + (1 - p_H) * w_L
  expected_output <- p_H * y_H + (1 - p_H) * y_L
  principal_profit <- expected_output - expected_wage
  agent_utility <- p_H * sqrt(w_H) + (1 - p_H) * sqrt(w_L) - c_eff

  list(w_H = w_H, w_L = w_L, expected_wage = expected_wage,
       principal_profit = principal_profit, agent_utility = agent_utility,
       info_rent = expected_wage - (u_bar_val + c_eff)^2)
}

contract <- solve_contract(p_H, p_L, c_effort, u_bar)
cat(sprintf("\nOptimal contract:\n"))

Optimal contract:
cat(sprintf("  w_H (high output wage) = %.2f\n", contract$w_H))
  w_H (high output wage) = 11.56
cat(sprintf("  w_L (low output wage)  = %.2f\n", contract$w_L))
  w_L (low output wage)  = 1.96
cat(sprintf("  Expected wage = %.2f\n", contract$expected_wage))
  Expected wage = 9.64
cat(sprintf("  Principal's profit = %.2f\n", contract$principal_profit))
  Principal's profit = 74.36
cat(sprintf("  Agent's utility = %.2f (reservation = %d)\n",
            contract$agent_utility, u_bar))
  Agent's utility = 2.00 (reservation = 2)
# First-best benchmark (observable effort)
fb_wage <- (u_bar + c_effort)^2  # just enough to satisfy IR
fb_profit <- p_H * y_H + (1 - p_H) * y_L - fb_wage
cat(sprintf("\nFirst-best (observable effort):\n"))

First-best (observable effort):
cat(sprintf("  Fixed wage = %.2f\n", fb_wage))
  Fixed wage = 9.00
cat(sprintf("  Principal's profit = %.2f\n", fb_profit))
  Principal's profit = 75.00
cat(sprintf("  Agency cost = %.2f\n", fb_profit - contract$principal_profit))
  Agency cost = 0.64
# --- AI Alignment Extension: Capability Growth ---
cat("\n=== AI Alignment: Capability Growth Analysis ===\n")

=== AI Alignment: Capability Growth Analysis ===
alpha_range <- seq(1, 10, by = 0.25)

# Parameters for capability scaling
y_H_base <- 100
m_0 <- 5        # base monitoring cost
gamma <- 1.5    # monitoring cost scaling exponent
d_0 <- 0.9      # base detection probability
beta <- 0.5     # detection decay exponent
u_bar_0 <- 2    # base reservation utility
delta <- 0.8    # reservation utility scaling exponent

alignment_results <- tibble(
  alpha = alpha_range,
  # No alignment (agent shirks)
  profit_no_align = p_L * (alpha_range * y_H) + (1 - p_L) * y_L,
  # Monitoring approach
  monitoring_cost = m_0 * alpha_range^gamma,
  detection_prob = pmin(1, d_0 * alpha_range^(-beta)),
  # Incentive approach
  outside_option = u_bar_0 * alpha_range^delta,
  output_capability = alpha_range * y_H_base
)

# Compute profits under each strategy
for (i in seq_along(alpha_range)) {
  alpha <- alpha_range[i]
  y_H_curr <- alpha * y_H_base
  u_bar_curr <- u_bar_0 * alpha^delta

  # Strategy 1: Monitoring
  mon_cost <- m_0 * alpha^gamma
  det_prob <- min(1, d_0 * alpha^(-beta))
  # With monitoring, effective p_L becomes p_L * (1 - det_prob) + p_H * det_prob
  # (detection deters shirking partially)
  effective_p <- p_L + det_prob * (p_H - p_L)
  mon_output <- effective_p * y_H_curr + (1 - effective_p) * y_L
  alignment_results$profit_monitoring[i] <- mon_output - mon_cost

  # Strategy 2: Incentive contract
  ct <- solve_contract(p_H, p_L, c_effort, u_bar_curr)
  inc_output <- p_H * y_H_curr + (1 - p_H) * y_L
  alignment_results$profit_incentive[i] <- inc_output - ct$expected_wage

  # Strategy 3: Combined
  combined_cost <- mon_cost * 0.5  # partial monitoring
  combined_det <- min(1, d_0 * alpha^(-beta) * 0.7)
  ct_combined <- solve_contract(p_H, p_L, c_effort, u_bar_curr * 0.8)
  alignment_results$profit_combined[i] <- p_H * y_H_curr + (1 - p_H) * y_L -
    ct_combined$expected_wage - combined_cost
}

cat("Capability vs alignment strategy profits:\n")
Capability vs alignment strategy profits:
for (a in c(1, 3, 5, 7, 10)) {
  idx <- which.min(abs(alpha_range - a))
  cat(sprintf("  alpha=%.0f: No-align=%.0f, Monitor=%.0f, Incentive=%.0f, Combined=%.0f\n",
              alpha_range[idx],
              alignment_results$profit_no_align[idx],
              alignment_results$profit_monitoring[idx],
              alignment_results$profit_incentive[idx],
              alignment_results$profit_combined[idx]))
}
  alpha=1: No-align=44, Monitor=75, Incentive=74, Combined=74
  alpha=3: No-align=104, Monitor=151, Incentive=210, Combined=207
  alpha=5: No-align=164, Monitor=205, Incentive=335, Combined=329
  alpha=7: No-align=224, Monitor=247, Incentive=453, Combined=443
  alpha=10: No-align=314, Monitor=295, Incentive=618, Combined=601
# --- Critical capability threshold ---
# Find where monitoring becomes worse than no alignment
monitor_break <- alpha_range[which(alignment_results$profit_monitoring <
                                     alignment_results$profit_no_align)[1]]
cat(sprintf("\nCritical thresholds:\n"))

Critical thresholds:
cat(sprintf("  Monitoring becomes unprofitable at alpha ~ %.1f\n",
            ifelse(is.na(monitor_break), Inf, monitor_break)))
  Monitoring becomes unprofitable at alpha ~ 9.0
# Find optimal strategy at each capability level
alignment_results <- alignment_results |>
  mutate(
    best_strategy = case_when(
      profit_combined >= profit_monitoring & profit_combined >= profit_incentive ~ "Combined",
      profit_incentive >= profit_monitoring ~ "Incentive",
      TRUE ~ "Monitoring"
    )
  )

cat(sprintf("  Best strategy at alpha=1: %s\n",
            alignment_results$best_strategy[1]))
  Best strategy at alpha=1: Monitoring
cat(sprintf("  Best strategy at alpha=10: %s\n",
            tail(alignment_results$best_strategy, 1)))
  Best strategy at alpha=10: Incentive

Static publication-ready figure

strategy_long <- alignment_results |>
  select(alpha, profit_no_align, profit_monitoring, profit_incentive, profit_combined) |>
  pivot_longer(cols = -alpha, names_to = "strategy", values_to = "profit") |>
  mutate(strategy = case_when(
    strategy == "profit_no_align" ~ "No alignment",
    strategy == "profit_monitoring" ~ "Monitoring only",
    strategy == "profit_incentive" ~ "Incentive only",
    strategy == "profit_combined" ~ "Combined"
  ))

p_strategies <- ggplot(strategy_long, aes(x = alpha, y = profit, color = strategy)) +
  geom_line(linewidth = 1.1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "grey60") +
  scale_color_manual(values = c("No alignment" = okabe_ito[8],
                                 "Monitoring only" = okabe_ito[1],
                                 "Incentive only" = okabe_ito[2],
                                 "Combined" = okabe_ito[3]),
                      name = "Strategy") +
  labs(title = "AI alignment strategies vs agent capability",
       subtitle = "Principal's expected profit under different alignment approaches",
       x = expression(alpha~" (agent capability)"),
       y = "Principal's expected profit") +
  theme_publication()

p_strategies
Figure 1: Figure 1. Principal’s profit under three alignment strategies as AI agent capability increases. Monitoring (orange) degrades rapidly because monitoring costs grow super-linearly with capability while detection probability falls. Incentive-based alignment (blue) scales better because higher capability generates more output, but rising outside options erode the principal’s surplus. The combined approach (green) leverages both mechanisms and dominates at intermediate capability levels. The dashed grey line shows the no-alignment baseline (agent shirks). Okabe-Ito palette.

Interactive figure

# Show the composition: output vs costs
composition_df <- alignment_results |>
  select(alpha, monitoring_cost, detection_prob, outside_option, output_capability) |>
  pivot_longer(cols = -alpha, names_to = "metric", values_to = "value") |>
  mutate(
    metric = case_when(
      metric == "monitoring_cost" ~ "Monitoring cost",
      metric == "detection_prob" ~ "Detection probability",
      metric == "outside_option" ~ "Agent outside option",
      metric == "output_capability" ~ "Output (scaled /100)"
    ),
    value = ifelse(metric == "Output (scaled /100)", value / 100, value),
    text = paste0("Capability: ", round(alpha, 1),
                  "\nMetric: ", metric,
                  "\nValue: ", round(value, 2))
  )

p_composition <- ggplot(composition_df, aes(x = alpha, y = value,
                                              color = metric, text = text)) +
  geom_line(linewidth = 0.9) +
  scale_color_manual(values = c("Monitoring cost" = okabe_ito[6],
                                 "Detection probability" = okabe_ito[1],
                                 "Agent outside option" = okabe_ito[7],
                                 "Output (scaled /100)" = okabe_ito[3]),
                      name = "Metric") +
  labs(title = "How capability affects alignment parameters",
       subtitle = "Trade-offs: more output but harder to monitor and costlier to incentivise",
       x = expression(alpha~" (agent capability)"),
       y = "Value") +
  theme_publication()

ggplotly(p_composition, tooltip = "text") |>
  config(displaylogo = FALSE,
         modeBarButtonsToRemove = c("select2d", "lasso2d"))
Figure 2

Interpretation

The principal-agent framework reveals several structural features of the AI alignment problem that are difficult to see without formal modelling. The most fundamental insight is that alignment is not a binary property but a matter of degree, determined by the interaction of information asymmetry, incentive design, and relative capability. In the classical model, the “agency cost” – the difference between the first-best outcome (observable effort) and the second-best (hidden action) – quantifies the price of misalignment. Our computation shows that this cost is non-trivial even in the simplest binary effort model: the principal must pay an “information rent” to the agent to make high effort incentive-compatible, reducing the principal’s profit relative to the full-information benchmark.

The capability growth analysis introduces the dynamic that makes AI alignment qualitatively different from traditional principal-agent problems. As the agent’s capability parameter \(\alpha\) increases, three things happen simultaneously. First, the potential output increases linearly, creating ever greater benefits from delegation. Second, monitoring costs grow super-linearly (with exponent \(\gamma = 1.5\)), because more capable agents operate in higher-dimensional action spaces that are intrinsically harder to oversee. Third, the agent’s outside option increases (though sub-linearly), reflecting the fact that more capable AI systems have more alternative uses. The interaction of these forces produces a striking pattern: monitoring-based alignment degrades rapidly with capability, eventually becoming worse than no alignment at all (when monitoring costs exceed the benefit of aligning the agent). This formalises the informal intuition that “you can’t just watch a superintelligent AI” – the monitoring approach faces fundamental scaling limits.

Incentive-based alignment scales better with capability, because it leverages the agent’s own rationality rather than fighting against it. By designing the reward structure so that the agent’s optimal action (given their own objectives) coincides with the principal’s desired action, the incentive approach avoids the need for direct observation. However, incentive alignment faces its own scaling challenge: as the agent’s outside option grows with capability, the principal must offer increasingly generous contracts to satisfy the participation constraint, eroding the principal’s surplus. In the limit of very high capability, the agent captures most of the value from the relationship, and the principal’s net benefit approaches zero – they have effectively lost control not through misalignment but through bargaining power.

The combined approach – partial monitoring supplemented by incentive contracts – dominates either pure approach across most of the capability range. This is consistent with current AI safety practice, which typically combines multiple alignment techniques: reward modelling (incentives), interpretability tools (monitoring), red-teaming (verification), and constitutional AI methods (constraint-based). Our model suggests that the optimal mix shifts as capability increases: at low capability, monitoring is cheap and effective, so it should dominate; at high capability, incentive design becomes more important because monitoring has diminishing returns. This has practical implications for the research portfolio allocation in AI safety: as AI systems become more capable, investment should shift from interpretability and monitoring toward incentive design, value alignment, and corrigibility research.

A critical limitation of the principal-agent framework is that it assumes the agent is rational and responds optimally to the contract offered. In the AI setting, this assumption may break down in both directions: the AI might be more rational than the model assumes (finding loopholes in incentive structures that the principal did not anticipate) or less rational (exhibiting emergent behaviours that are not well-described by optimisation of any explicit objective). The framework also assumes that the principal can commit to a contract, which requires that the principal retains the power to modify or terminate the agent’s access to resources – an assumption that becomes questionable at very high capability levels. Despite these limitations, the principal-agent framework provides a rigorous and productive lens for thinking about alignment, and many of the specific predictions it generates (monitoring scales poorly, incentive alignment requires growing information rents, combined approaches dominate) are robust to reasonable modifications of the model assumptions.

References

Back to top

Reuse

Citation

BibTeX citation:
@online{heller2026,
  author = {Heller, Raban},
  title = {AI {Alignment} as a {Principal-Agent} {Problem}},
  date = {2026-05-08},
  url = {https://r-heller.github.io/equilibria/tutorials/ethics-applications/ai-alignment-principal-agent/},
  langid = {en}
}
For attribution, please cite this work as:
Heller, Raban. 2026. “AI Alignment as a Principal-Agent Problem.” May 8. https://r-heller.github.io/equilibria/tutorials/ethics-applications/ai-alignment-principal-agent/.