Introduction to aiDIF: Detecting Differential Item Functioning in AI-Scored Assessments

Background

When AI systems score essays, short-answer responses, or structured tasks, a critical fairness question arises: does the AI scoring engine shift item difficulty differently across demographic groups?

Classical DIF methods test whether an item performs differently across groups within a single scoring condition. aiDIF extends this to a paired design:

  1. Human-scoring DIF — robust M-estimation of item-level bias
  2. AI-scoring DIF — the same analysis applied to AI-scored data
  3. Differential AI Scoring Bias (DASB) — a new test for group-dependent parameter shifts from human to AI scoring

The Example Dataset

make_aidif_eg() returns a built-in example with item parameter MLEs for 6 items in two groups under both scoring conditions. Items 1, 3, and 5 carry planted DIF, and item 3 additionally carries a planted group-dependent AI scoring bias. The returned object is structured as:

eg <- make_aidif_eg()
str(eg, max.level = 2)
#> List of 2
#>  $ human:List of 3
#>   ..$ par.names:List of 2
#>   ..$ est      :List of 2
#>   ..$ var.cov  :List of 2
#>  $ ai   :List of 3
#>   ..$ par.names:List of 2
#>   ..$ est      :List of 2
#>   ..$ var.cov  :List of 2

Fitting the Model

fit_aidif() runs the robust IRLS engine under each scoring condition and performs the DASB test.

mod <- fit_aidif(
  human_mle = eg$human,
  ai_mle    = eg$ai,
  alpha     = 0.05
)
print(mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring  — robust scale est: -0.5776  (SE: 0.0747)
#>                — DIF items flagged: 3 / 6
#> AI scoring     — robust scale est: -0.5921  (SE: 0.0748)
#>                — DIF items flagged: 3 / 6
#> DASB test      — items with differential AI bias: 1 / 6

Full Report

summary(mod)
#> =============================================================
#>  AI Differential Item Functioning Analysis (aiDIF)
#> =============================================================
#> 
#> --- Human Scoring DIF ----------------------------------------
#>   Robust scale estimate:  -0.5776  (SE: 0.0747)
#>   Wald DIF tests:
#>            delta     se       z  p_val
#> item1_d1  0.5693 0.0759  7.4995 0.0000
#> item2_d1  0.0366 0.1060  0.3448 0.7303
#> item3_d1  0.2302 0.0623  3.6953 0.0002
#> item4_d1  0.0163 0.0931  0.1756 0.8606
#> item5_d1  0.2700 0.0693  3.8947 0.0001
#> item6_d1 -0.1181 0.1232 -0.9584 0.3379
#> 
#> --- AI Scoring DIF -------------------------------------------
#>   Robust scale estimate:  -0.5921  (SE: 0.0748)
#>   Wald DIF tests:
#>            delta     se       z  p_val
#> item1_d1  0.5756 0.0761  7.5596 0.0000
#> item2_d1  0.0466 0.1046  0.4458 0.6557
#> item3_d1  0.5499 0.0619  8.8820 0.0000
#> item4_d1  0.0046 0.0926  0.0495 0.9605
#> item5_d1  0.3308 0.0695  4.7559 0.0000
#> item6_d1 -0.1455 0.1240 -1.1737 0.2405
#> 
#> --- Differential AI Scoring Bias (DASB) ---------------------
#>   H0: AI scoring shift does not differ across groups
#>   (Positive DASB => AI scoring disadvantages focal group)
#> 
#>       shift_g1 shift_g2  DASB   se      z  p_val
#> item1     0.13     0.12 -0.01 0.14 -0.071 0.9431
#> item2     0.08     0.07 -0.01 0.14 -0.071 0.9431
#> item3     0.11     0.54  0.43 0.14  3.071 0.0021
#> item4     0.12     0.09 -0.03 0.14 -0.214 0.8303
#> item5     0.07     0.13  0.06 0.14  0.429 0.6682
#> item6     0.11     0.08 -0.03 0.14 -0.214 0.8303
#> 
#> --- AI-Effect Classification ---------------------------------
#>   stable_clean  : not flagged in either condition
#>   stable_dif    : flagged in both (same direction)
#>   introduced    : flagged only under AI scoring
#>   masked        : flagged only under human scoring
#>   new_direction : flagged in both, opposite direction
#> 
#>          human_delta ai_delta human_flag ai_flag       status
#> item1_d1      0.5693   0.5756       TRUE    TRUE   stable_dif
#> item2_d1      0.0366   0.0466      FALSE   FALSE stable_clean
#> item3_d1      0.2302   0.5499       TRUE    TRUE   stable_dif
#> item4_d1      0.0163   0.0046      FALSE   FALSE stable_clean
#> item5_d1      0.2700   0.3308       TRUE    TRUE   stable_dif
#> item6_d1     -0.1181  -0.1455      FALSE   FALSE stable_clean
#> 
#>   Status counts:
#> 
#> stable_clean   stable_dif 
#>            3            3

The DASB Test

scoring_bias_test() can also be called directly.

sb <- scoring_bias_test(eg$human, eg$ai)
print(sb)
#>       shift_g1 shift_g2  DASB   se      z  p_val
#> item1     0.13     0.12 -0.01 0.14 -0.071 0.9431
#> item2     0.08     0.07 -0.01 0.14 -0.071 0.9431
#> item3     0.11     0.54  0.43 0.14  3.071 0.0021
#> item4     0.12     0.09 -0.03 0.14 -0.214 0.8303
#> item5     0.07     0.13  0.06 0.14  0.429 0.6682
#> item6     0.11     0.08 -0.03 0.14 -0.214 0.8303

Item 3 is flagged as significant (p = 0.0021), reflecting the planted group-dependent AI scoring bias.
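As a sanity check, the DASB column is simply the difference of the two shift columns, i.e. the between-group difference of the AI-minus-human parameter shifts. Using the values printed above:

shift_g1 <- c(0.13, 0.08, 0.11, 0.12, 0.07, 0.11)
shift_g2 <- c(0.12, 0.07, 0.54, 0.09, 0.13, 0.08)
shift_g2 - shift_g1   # matches the DASB column: -0.01 -0.01 0.43 -0.03 0.06 -0.03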

AI-Effect Classification

eff <- ai_effect_summary(mod$dif_human, mod$dif_ai)
print(eff)
#>          human_delta ai_delta human_flag ai_flag       status
#> item1_d1      0.5693   0.5756       TRUE    TRUE   stable_dif
#> item2_d1      0.0366   0.0466      FALSE   FALSE stable_clean
#> item3_d1      0.2302   0.5499       TRUE    TRUE   stable_dif
#> item4_d1      0.0163   0.0046      FALSE   FALSE stable_clean
#> item5_d1      0.2700   0.3308       TRUE    TRUE   stable_dif
#> item6_d1     -0.1181  -0.1455      FALSE   FALSE stable_clean
Status          Meaning
introduced      AI scoring creates DIF not present under human scoring
masked          AI scoring hides DIF that existed under human scoring
new_direction   DIF flagged in both conditions, but in opposite directions
stable_dif      DIF detected in both conditions (same direction)
stable_clean    No DIF in either condition

Visualisations

plot(mod, type = "dif_forest")   # human vs AI DIF side by side
plot(mod, type = "dasb")         # DASB bar chart with error bars
plot(mod, type = "weights")      # bi-square anchor weights

Simulation

dat <- simulate_aidif_data(
  n_items    = 8,
  n_obs      = 600,
  dif_items  = c(1, 2),
  dif_mag    = 0.5,
  dasb_items = 5,
  dasb_mag   = 0.4,
  seed       = 123
)
sim_mod <- fit_aidif(dat$human, dat$ai)
print(sim_mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring  — robust scale est: -0.2670  (SE: 0.0322)
#>                — DIF items flagged: 4 / 8
#> AI scoring     — robust scale est: 0.0536  (SE: 0.0363)
#>                — DIF items flagged: 5 / 8
#> DASB test      — items with differential AI bias: 1 / 8
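The downstream tools work on the simulated fit as well. For example, the AI-effect classification can be rerun to see how the planted effects are categorised (using the same dif_human and dif_ai components accessed from mod above):

sim_eff <- ai_effect_summary(sim_mod$dif_human, sim_mod$dif_ai)
print(sim_eff)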

References