---
title: "Understanding Ability-Risk Tuning"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Understanding Ability-Risk Tuning}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
```

# Why this vignette exists

This vignette is meant to alleviate some common misunderstandings of mixed-subjects IRT calibration. While the core idea of augmenting human data with LLM-generated data and estimating a $\lambda$ value to tune how much you want the LLM-generated to contribute is straightforward, the nuances of the ability-risk objective are not. In this vignette, we will go into the data requirements for mixed-subject calibrations and the ability-risk objective itself to help set users up for success. If there is one major takeaway from this document, it is that the tuning parameter $\lambda$ is not asking whether LLM-generated responses look human in the aggregate. It is, instead, asking whether an LLM-based response-generation procedure can predict the **row-level** human response structure enough to reduce downstream ability-estimation error. The most important object is therefore the **paired prediction matrix**, $P$.


# Key Intuition

## The three response matrices

Mixed-subjects IRT requires three item response matrices. Let $J$ be the number of items, $n$ be the number of observed human respondents, and $N$ be the number of additional generated respondents.

### 1. Observed human responses: $O$

This is the real human pilot calibration response matrix, with structure

$$O \in \{0,1\}^{n \times J}.$$

Each row is the full observed response string from one human respondent. Each column corresponds to an item. Each entry is an observed, dichotomously scored, response.

### 2. Paired LLM-predicted human responses: $P$

This is the LLM-predicted response matrix for the **same human respondents** in $O$, with structure

$$P \in \{0,1\}^{n \times J}.$$

Pay special attention to the requirement of "same human respondents." Row $i$ in $P$ must be predicted responses for the respondent in row $i$ in $O$. Column $j$ in $P$ must correspond to column $j$ in $O$. This is the diagnostic matrix. It tells the method whether the LLM response-generation procedure is informative about the human response process and defines the magnitude and confidence in the mixed-subjects correction.

Importantly, this means that when generating $P$, users need to transmit both the content of the items being responded to and some sort of information about the rows of $O$ to the LLM generating the predicted responses. This can be in the form of narrative or covariate information about the respondents, held out item responses, or something else.

### 3. Additional LLM-generated responses: $G$

This is the larger synthetic or LLM-generated response matrix, with structure

$$G \in \{0,1\}^{N \times J}.$$

Typically, $N \gg n$ to maximize the potential improvement in post-calibration ability estimation precision. The rows in $G$ are _not_ paired with rows in $O$, but instead additional generated respondents. However, the crucial requirement is that $G$ is meant to be sampled from the same distribution as $P$, meaning it should be created using a procedure that is as close as possible to the procedure used to create $P$ and some amount of information about the ability distribution in $P$.

A useful way to remember the design is:

| Matrix | Shape | Rows | Purpose |
|--------|-------|------|---------|
| $O$ | $n \times J$ | observed human respondents | anchor the calibration target to the human population |
| $P$ | $n \times J$ | LLM predictions for those same human respondents | estimate how the LLM procedure agrees with or deviates from human responses |
| $G$ | $N \times J$ | additional LLM-generated rows | provide extra precision derived from synthetic information after correction |

Another helpful rule of thumb is that if $P$ is not row-aligned with $O$, you should expect $\lambda \to 0$.

## The mixed-subjects IRT objective

The recommended scalar-$\lambda$ workflow in `mixedsubjectsirt` uses a marginal maximum likelihood (MML) objective of the form

$$\hat\gamma_\lambda =\arg\min_\gamma\big[L_O^{\mathrm{marg}}(\gamma)+\lambda\big(L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\big)\big].$$

Here $\gamma = \{a_1, \ldots, a_J, d_1, \ldots, d_J \}$ is the vector of item parameters. For a 2PL model, we write item $j$'s response probability as

$$p_j(\theta;\gamma_j) =\mathrm{logit}^{-1}(d_j + a_j \theta),$$

where $d_j$ is the intercept and $a_j$ is the discrimination.

The three pieces of the objective each have different jobs in the mixed-subjects loss:
$$L_O^{\mathrm{marg}}(\gamma)$$
is the usual human-data marginal IRT likelihood evaluated at $\gamma$. This term anchors the calibration to humans.
$$L_G^{\mathrm{marg}}(\gamma)$$
is the likelihood contribution from the additional generated response rows.
$$L_P^{\mathrm{marg}}(\gamma)$$
is the correction term. It estimates what the same LLM-generation procedure says about the humans whose actual responses are observed.

The term $L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}$ is the prediction-powered correction. The generated matrix $G$ adds information, while the paired prediction matrix $P$ allows us to estimate the LLM procedure's bias, noise, and covariance with the human response process.


## What lambda is learning

The tuning parameter $\lambda$ weights how much the LLM-generated responses are allowed to contribute to parameter estimation. When $\lambda = 0$,

$$
L_O^{\mathrm{marg}}(\gamma)
+
\lambda
\{L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\}
=
L_O^{\mathrm{marg}}(\gamma),
$$

so the method falls back to human-only calibration. When $\lambda > 0$, the method borrows information from $G$, corrected by $P$.

This means $\lambda=0$ is not a failure of the software, instead it is a protective outcome that occurs when the LLM responses do not reduce expected ability-estimation error. A high $\lambda$ requires more than plausible synthetic response rows. It requires the paired LLM predictions in $P$ to track the human responses in $O$ in a way that is consistent across respondents.

## Ability-risk tuning

There are two related (but distinct) tuning ideas in the package. The original PPI++ paper minimizes the standard errors of the estimated parameters (the trace of the covariance matrix). The function `tune_lambda_ppi_score()` implements this.

The function `tune_lambda_ability_risk()` asks a more practical psychometric question: Which value of $\lambda$ minimizes _expected downstream ability-estimation error_?

The approximate target is
$$
\widehat R(\lambda)
=
\frac{1}{M}
\sum_{m=1}^{M}
g_m^\top
\widehat\Sigma_{\gamma,\lambda}
g_m,
$$
where:

- $y_m$ is a target response pattern;
- $\hat\theta(y_m;\hat\gamma_\lambda)$ is the ability estimate produced from that response pattern;
- $g_m = \nabla_\gamma \hat\theta(y_m;\hat\gamma_\lambda)$ is gradient of the ability estimate, which captures the sensitivity of the ability estimate to item-parameter error;
- $\widehat\Sigma_{\gamma,\lambda}$ is the full covariance matrix of the item-parameter estimates under the mixed-subjects fit.

$g_m^\top \widehat\Sigma_{\gamma,\lambda} g_m$ thus propagates item-parameter uncertainty and covariance structure into ability-estimation uncertainty.

This matters because ability-risk tuning is not the same as minimizing average item-parameter standard errors. The off-diagonal elements of the item-parameter covariance matrix now matter, describing how uncertainty in one item parameter moves with uncertainty in another. Some covariance patterns may cancel out for ability scoring; others may distort the scale in high-information regions. Ability-risk tuning weights these covariance patterns by their downstream impact on $\hat\theta$, averaged over an expected ability distribution.


## Why row alignment matters

The easiest way to understand $\lambda$ is to compare three cases.

### Case A: perfect paired prediction

Suppose

$$
P = O.
$$

Here, the paired LLM prediction is exactly equal to the human response matrix. From the perspective of maximizing the contribution of $G$ to estimation, this is the best possible version of $P$. In this case, $\lambda$ should be large, though not necessarily $\lambda = 1$. The finite-$N$ benchmark is

$$
\lambda_{\max}
=
\frac{1}{1+n/N}
=
\frac{N}{n+N}.
$$

If $n=400$ and $N=1200$, we see

$$
\lambda_{\max} = \frac{1200}{400+1200}=0.75.
$$

So even when $P=O$, if $N \not\gg n$, $\lambda < 1$ should be expected.


### Case B: row-shuffled perfect predictions

Now suppose

$$
P = \mathrm{shuffle\_rows}(O).
$$

The marginal item means are identical. The total-score distribution is identical. The item difficulty information is identical. The only thing that's changed is row $i$ in $P$ is no longer a prediction for row $i$ in $O$. This should produce $\lambda \approx 0$, because the row-aligned covariance structure has been destroyed. To observe this for both $\lambda$ tuning objectives, you can run:

```{r shuffled-diagnostic, eval = FALSE}
predicted_shuffled <- observed[sample(nrow(observed)), ]

lambda_shuffled_ppi <- tune_lambda_ppi_score(
  observed    = observed,
  predicted   = predicted_shuffled,
  item_pars   = human_fit$pars,
  n_generated = nrow(generated)
)

lambda_shuffled_ppi$lambda

lambda_shuffled_ability_risk <- tune_lambda_ability_risk(
  observed    = observed,
  predicted   = predicted_shuffled,
  item_pars   = human_fit$pars,
  n_generated = nrow(generated)
)

lambda_shuffled_ability_risk$lambda
```


### Case C: same DGP, fresh Bernoulli draw

Suppose both $O$ and $P$ are generated from the same IRT model:

$$
O_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\},
$$

$$
P_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\}.
$$

This is "same DGP," but $P$ is still a fresh stochastic response. It is not the same as $O$. The two matrices share person ability and item parameters, but they do not share response noise.

For a single item,

$$
\operatorname{Cov}(O_{ij},P_{ij})
=
\operatorname{Var}_\theta[p_j(\theta)].
$$

But

$$
\operatorname{Var}(P_{ij})
=
\operatorname{Var}_\theta[p_j(\theta)]
+
\mathbb{E}_\theta[p_j(\theta)\{1-p_j(\theta)\}].
$$

The Bernoulli noise in $P$ dilutes the control-variate signal. As a result, a fresh same-DGP draw may produce a modest $\lambda$, not a large one. This distinction is important: Merely producing the same item parameters does not imply strong paired prediction.

## What kind of LLM data produces higher lambda?

A useful $P$ matrix has to predict row-level response structure. A good $P$ should have these properties:

1. **Same rows as $O$**: row $i$ in $P$ predicts row $i$ in $O$.
2. **Same item columns as $O$**.
3. **Target response not leaked**: when predicting $P_{ij}$, the prompt must not include $O_{ij}$.
4. **Construct-relevant respondent information**: covariates or context should help infer the respondent's likely response.
5. **Within-person response structure**: the LLM should be able to infer something about the respondent and their knowledge or ability level from other responses or covariates.
6. **Same procedure for $P$ and $G$**: the generated matrix should be produced by an analogous procedure to the paired predicted matrix.

### One approach to row alignment: leave-one-item-out prediction

When you have a human response matrix $O$, an approach to build $P$ is leave-one-item-out response prediction. For each respondent $i$ and item $j$:

1. Mask response $O_{ij}$.
2. Give the LLM the text and responses for the other items $O_{i,-j}$.
3. Give the LLM the text for item $j$.
4. Ask it to predict the response to item $j$.
5. Store the result as $P_{ij}$.


### Another approach: covariate-based prediction

If you have construct-relevant covariates, you can build $P$ without using other item responses or augment the LOO-response prediction approach outlined above.

Example covariates include:

- grade level;
- age;
- prior achievement;
- language background;
- prior placement scores;
- response-time or engagement indicators;
- classroom, school, or instructional context;
- demographic variables, where appropriate and ethically justified.


### Something that probably won't work: item-text-only generation

Item-text-only generation usually predicts column properties (like item parameters and relative item spacings), not row-aligned responses. Row-aligned responses are important because they allow the method to link the underlying ability distributions between human respondents and LLM-generated respondents. This is why an item-text-only approach may produce synthetic data that looks plausible in aggregate, but still produces $\lambda=0$.

## How to generate $G$

The generated matrix $G$ should be produced using a procedure that mirrors the procedure used to produce $P$.

If $P$ is generated with leave-one-item-out prompts, then $G$ should also be generated with leave-one-item-out or masked-item prompts.

One possible procedure:

1. Sample or resample a respondent profile.
2. Create a partial response context.
3. Mask one target item.
4. Ask the LLM to predict the masked response.
5. Repeat until a full generated response row is built.


The most important rule is that $G$ and $P$ should be generated by the same prediction mechanism. If $P$ is generated with row-aligned covariates and response history, but $G$ is generated by asking the LLM to invent full response strings from scratch, then $L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}$ may not be zero in expectation, producing asymptotically biased parameter estimates.

## Summary

Mixed-subjects IRT is useful when the LLM response-generation procedure captures respondent-level response structure. The generated matrix $G$ matters, but the paired matrix $P$ is what lets the method learn whether $G$ should be trusted. The key understanding is that $G$ supplies extra synthetic rows, but $P$ tells us how much to trust them.

If $P$ is row-aligned and predictive of $O$, $\lambda$ can be positive and the generated data can improve calibration. If $P$ only reproduces marginal item difficulty, or if its rows are not aligned with $O$, then $\lambda$ should shrink toward zero. This is a _feature_, not a bug.




# Technical Explanation

The [Choosing Lambda](lambda-tuning.html) vignette explains which tuning function to use and when. This section derives the mathematics those functions implement: how item-parameter uncertainty is propagated into ability scores, and why minimizing the resulting ability risk is a fundamentally different objective from the original PPI++ trace criterion.

Throughout, let $\gamma = \{a_1, \dots, a_J, d_1, \dots, d_J\}$ collect the $2J$ item parameters of a $J$-item 2PL model, ordered as all discriminations followed by all intercepts. This is the ordering convention used by package functions like `fit$par`, `vcov()`, and `ability_gradient()`. The item response function is again

$$
p_{j}(\theta;\gamma_j) \;=\; \Pr(Y_j = 1 \mid \theta)
            \;=\; \operatorname{logit}^{-1}\!\big(d_j + a_j \theta\big).
$$

## Overview: four objects, one objective

Ability-risk tuning chains four quantities together:

1. The mixed-subjects estimator $\hat\gamma(\lambda)$, a function of the tuning parameter $\lambda$.
2. Its sandwich covariance $\Sigma_\gamma(\lambda) = \operatorname{Cov}(\hat\gamma)$.
3. The ability estimate $\hat\theta_i(\gamma)$ for a response pattern $y_i$, together with its gradient $g_i = \partial \hat\theta_i / \partial \gamma$.
4. The propagated risk $g_i' \Sigma_\gamma(\lambda)\, g_i$, averaged over a target    population.

Tuning chooses $\lambda$ to minimize that average. The sections below build up each link in turn.

## 1. The estimator and its estimating equation

The mixed-subjects estimator minimizes a PPI++-style combined objective over human (observed), paired-predicted, and generated responses,

$$
L_\lambda(\gamma)
  \;=\; L_O(\gamma)
      \;+\; \lambda\,\big[L_G(\gamma)
                         - L_P(\gamma)\big],
$$

where each $L$ is a marginal (or Bock–Aitkin expected-count) negative log-likelihood. The estimator $\hat\gamma(\lambda)$ solves the estimating equation

$$
\Psi_\lambda(\gamma)
  \;=\; \psi_O(\gamma)
      + \lambda\,\big[\psi_G(\gamma) - \psi_P(\gamma)\big]
  \;=\; 0,
\qquad \psi = \nabla_\gamma L.
$$

Setting $\lambda = 0$ recovers the human-only calibration; $\lambda > 0$ borrows strength from the LLM responses while the $-\psi_P$ term de-biases that contribution at the population level (the PPI++ correction). The per-person score contributions to $\psi$ are, for item $j$,

$$
s_{ij}^{a} = (y_{ij} - \bar p_{ij})\,\bar\theta_i, \qquad
s_{ij}^{d} = (y_{ij} - \bar p_{ij}),
$$

evaluated under the posterior over $\theta$.

## 2. The sandwich covariance of $\hat\gamma$

$\hat\gamma(\lambda)$ has asymptotic covariance of the form

$$
\Sigma_\gamma(\lambda)
  \;=\; A_\lambda^{-1}\, B_\lambda\, A_\lambda^{-1},
$$

with bread $A_\lambda = \mathbb{E}[\nabla_\gamma \Psi_\lambda]$ and meat $B_\lambda = \operatorname{Cov}(\Psi_\lambda)$.

**Bread.** Normally we would combine the three Hessians block by block,

$$
A_\lambda \;=\; H_O
              + \lambda\,\big(H_G - H_P \big),
$$

where each $H$ is the appropriate Hessian. Since he MML estimator marginalizes over $\theta$, its bread must use Louis's (1982) observed-information identity $A^{\text{marg}} = H^{\text{comp}} - I^{\text{miss}}$, subtracting the missing information `louis_missing_info()` from the complete-data Hessian. Using the complete-data Hessian alone would overstate efficiency. This is what the package computes by default.

**Meat.** The meat is the covariance of the labeled correction plus the independent generated contribution,

$$
B_\lambda \;=\; \frac{1}{n}\operatorname{Cov}\!\big(S_{\text{obs}} - \lambda S_{\text{pred}}\big)
              \;+\; \frac{\lambda^2}{N}\operatorname{Cov}\!\big(S_{\text{gen}}\big),
$$

with $n$ labeled and $N$ generated subjects. The `vcov()` S3 method dispatches automatically.

## 3. Ability scoring and the implicit gradient

Given item parameters $\gamma$, the bounded maximum-likelihood ability estimate for response pattern $y_i$ solves the scoring equation

$$
S(\theta;\gamma, y_i)
  \;=\; \sum_{j} a_j\,\big(y_{ij} - p_j(\theta)\big) \;=\; 0,
$$

which `score_theta()` finds by 1-D optimization on the interval `bounds`. The risk machinery needs the sensitivity of that solution $\hat\theta_i$ to the item parameters, $g_i = \partial \hat\theta_i / \partial \gamma$. Because $\hat\theta_i$ is defined *implicitly* by $S(\hat\theta_i; \gamma) = 0$, the implicit function theorem gives

$$
\frac{\partial \hat\theta_i}{\partial \gamma_k}
  \;=\; -\,\Big(\frac{\partial S}{\partial \theta}\Big)^{-1}
          \frac{\partial S}{\partial \gamma_k}.
$$

The denominator is the (negative) test information at $\hat\theta_i$,

$$
\frac{\partial S}{\partial \theta}
  \;=\; -\sum_j a_j^2\, p_j(1 - p_j),
$$

and the numerators, for the discrimination and intercept of item $j$, are

$$
\frac{\partial S}{\partial a_j} = (y_{ij} - p_j) - a_j\, p_j(1 - p_j)\,\hat\theta_i,
\qquad
\frac{\partial S}{\partial d_j} = -\,a_j\, p_j(1 - p_j).
$$

**Where the gradient is undefined.** The implicit-function argument requires an *interior* optimum. At a boundary estimate (all-correct or all-incorrect patterns push $\hat\theta_i$ to a bound), the score equation does not hold and the gradient is theoretically undefined; `ability_gradient()` returns `NA` for those rows, and they drop out of the risk average via `na.rm = TRUE`. Rows with vanishing test information $(|\partial S/\partial\theta| < \varepsilon)$ are treated the same way.

## 4. Delta-method propagation and the risk

With $g_i$ in hand, the delta method propagates item-parameter uncertainty into
the score:

$$
\rho_i(\lambda) = \operatorname{Var}\big(\hat\theta_i\big)
  \;\approx\; g_i'\, \Sigma_\gamma(\lambda)\, g_i.
$$

Averaging over a target population of $M$ response patterns gives the scalar objective that the tuners minimize,

$$
R(\lambda)
  \;=\; \frac{1}{M}\sum_{i=1}^{M} \rho_i(\lambda)
  \;=\; \mathbb{E}_{\text{target}}\!\big[\, g'\,\Sigma_\gamma(\lambda)\, g \,\big].
$$

The expectation is over the *target* population's ability distribution, which is why `target_resp` matters: tuning for the observed calibration sample (`target_resp = observed`) and tuning for a separate operational scoring population generally give different $\lambda$.

## 5. Why this differs from the PPI++ trace objective

The original PPI++ tuning rule minimizes the trace of the item-parameter covariance,

$$
\lambda^{\star}_{\text{PPI}} = \arg\min_\lambda \operatorname{Tr}\big[\Sigma_\gamma(\lambda)\big],
$$

which `tune_lambda_ppi_score()` evaluates in closed form. Writing $\Sigma_\gamma = (\sigma_{kl})$, the two objectives expand as

$$
\operatorname{Tr}(\Sigma_\gamma) = \sum_k \sigma_{kk},
\qquad
g'\Sigma_\gamma g = \sum_{k} g_k^2\,\sigma_{kk}
                   + \sum_{k \neq l} g_k g_l\,\sigma_{kl}.
$$

The trace sees only the diagonal variances and weights every parameter equally. The ability risk weights each variance by $g_k^2$ (how much that particular parameter moves the score) and uses the off-diagonal covariances $\sigma_{kl}$, which encode the scale/identification structure of the 2PL. Errors in $a_j$ and $d_j$ that are correlated in a direction that leaves $\hat\theta$ unchanged are penalized by the trace but (correctly) ignored by the ability risk. The two criteria therefore select different $\lambda$ in general; use the ability risk for operational scoring and the trace as a theoretical diagnostic.

## Summary

| Symbol | Meaning | Computed by |
|--------|---------|-------------|
| $\hat\gamma(\lambda)$ | Mixed-subjects item parameters | `fit_mixed_subjects_mml()` |
| $\Sigma_\gamma(\lambda)$ | Sandwich covariance of $\hat\gamma$ | `vcov()` → `vcov_mixed_subjects_mml()` |
| $\hat\theta_i$ | Bounded ML ability score | `score_theta()` |
| $g_i = \partial\hat\theta_i/\partial\gamma$ | Implicit ability gradient | `ability_gradient()` |
| $g_i'\Sigma_\gamma g_i$ | Propagated score variance | `ability_risk()` |
| $R(\lambda) = \mathbb{E}[g'\Sigma_\gamma g]$ | Ability-risk objective | `tune_lambda_ability_risk()` |

For the practical workflow built on these pieces — cross-fitting, target
populations, and the choice between estimators — see the
[Choosing Lambda](lambda-tuning.html) vignette.
