The two-way fixed effects (TWFE) model has long been a standard approach to staggered difference-in-differences (DID) designs. However, TWFE can produce biased estimates of the average treatment effect (ATT) if treatment effects are heterogeneous across time or treatment cohorts.
Wooldridge (2025) shows that an “extended” TWFE (ETWFE) model can eliminate bias due to heterogeneity. The ETWFE model lets the treatment interact with time and cohort dummies. It also compares newly treated units only to units that have not yet been treated, or will never be treated – eliminating “forbidden comparisons” with units that have already been treated.
When heterogeneity is substantial and the sample size is adequate to detect it, ETWFE can produce highly informative estimates than TWFE. ETWFE estimates are less biased and can be more precise as well. Inspection of interactions can reveal how effects vary across cohorts and time.
Under some circumstances, however, ETWFE can also make estimates in worse. If there is no heterogeneity, or if the heterogeneity is too mild and the sample is too small to detect it reliably, then two problems may arise:
ETWFE estimates of the ATT may be more variable than TWFE estimates—with larger standard errors, and larger mean squared error if the increase in variance exceeds any reduction in bias. The increase in variance comes from ETWFE using a smaller analytic sample and adding interactions that may not be necessary.
The interaction effects may be mostly or entirely noise rather than signal. That is, most of the apparent variance across time or cohorts may be estimation error rather than true heterogeneity.
Under these circumstances, the simpler TWFE model may be preferable.
The selectTWFE package addresses these issues in three
ways
selectTWFE chooses between the TWFE and ETWFE models
using Cochran’s (1954) Q statistic, which tests whether
treatment effects vary significantly across cohort-time cells.
selectTWFE chooses ETWFE when heterogeneity is significant
and selects TWFE if the null hypothesis of homogeneity cannot be
rejected.
selectTWFE estimates how much of the effects
variance across time or cohorts represents true heterogeneity rather
than estimation error. It estimats this using the heterogeneity fraction
statistic, \(I^2\) (Higgins et al.,
2003).
To reduce noise, selectTWFE uses empirical Bayes to
shrink time- and cohort-specific effects toward the average treatment
effect (Efron & Morris, 1975). It also widens confidence intervals
around the time- and cohort-specific effects, using the Bonferroni
correction to account for multiple estimates.
We use the mpdta dataset included in
selectTWFE, which contains log teen employment in 500 US
counties from 2003-2007, with staggered state-level minimum wage
increases.
library(selectTWFE)
data("mpdta", package = "selectTWFE")
result_mpdta <- select_twfe(
fml = lemp ~ lpop,
tvar = year,
gvar = first.treat,
data = mpdta,
ivar = countyreal,
cgroup = "never",
vcov = ~countyreal
)
print(result_mpdta)
#> Model selection: ETWFE wins (Cochran's Q test)
#>
#> ATT SE
#> ETWFE (selected) -0.0420 0.0109
#> TWFE -0.0365 0.0133
#>
#> ETWFE vs. TWFE:
#> Cochran's Q statistic: 22.3745
#> Degrees of freedom: 6
#> Cochran's Q p-value: 0.001035 (p < 0.050 selects ETWFE)
#> └--> ETWFE wins
#>
#>
#> Heterogeneity fraction (I^2): 0.73
#> Use plot() to see shrunken estimates by time and cohort.The Cochran’s Q test shows significant evidence of heterogeneity (p-value < 0.05), so ETWFE is selected. Note that the ATT estimates from the two models are similar (-0.042 vs -0.037), but ETWFE has a lower standard error. The heterogeneity fraction \(I^2 = 0.77\) indicates that roughly three-quarters of the variance in cohort-time estimates reflects true heterogeneity rather than sampling noise.
The event study suggests a growing negative effect over time. This is only visible in the 2004 and 2006 cohorts since the 2007 cohort has only one year post-treatment. Note that all cohort-time estimates are shrunk toward the ATT and have confidence intervals widened to adjust for multiple estimates.
When treatment effects are constant across cohorts and time periods, vanilla TWFE is correctly specified and ETWFE wastes degrees of freedom on interaction terms that are all equal. The Q test should find no significant heterogeneity and select TWFE.
set.seed(42)
n_units <- 500
n_periods <- 6
# Assign cohorts: treated in period 3, 4, or 5; some never treated
cohort <- sample(c(3, 4, 5, NA), n_units, replace = TRUE,
prob = c(.25, .25, .25, .25))
unit_fe <- rnorm(n_units, sd = 1)
time_fe <- rnorm(n_periods, sd = 0.5)
panel <- expand.grid(unit = 1:n_units, period = 1:n_periods)
panel$cohort <- cohort[panel$unit]
panel$unit_fe <- unit_fe[panel$unit]
panel$time_fe <- time_fe[panel$period]
panel$treated <- !is.na(panel$cohort) & panel$period >= panel$cohort
# Homogeneous treatment effect = 1.0 for all cohorts and periods
panel$y <- panel$unit_fe + panel$time_fe +
1.0 * panel$treated + rnorm(nrow(panel), sd = 0.5)
panel$gvar <- ifelse(is.na(panel$cohort), Inf, panel$cohort)
result_homo <- select_twfe(
fml = y ~ 1,
tvar = period,
gvar = gvar,
data = panel,
ivar = unit,
cgroup = "never",
vcov = ~unit
)
print(result_homo)
#> Model selection: TWFE wins (Cochran's Q test)
#>
#> ATT SE
#> TWFE (selected) 0.9233 0.0320
#> ETWFE 0.9011 0.0421
#>
#> ETWFE vs. TWFE:
#> Cochran's Q statistic: 3.8922
#> Degrees of freedom: 8
#> Cochran's Q p-value: 0.866716 (p < 0.050 selects ETWFE)
#> └--> TWFE winsAs expected, TWFE is selected. The Cochran’s Q test finds no significant evidence of heterogeneity (p-value ≥ 0.05), since the true effects are homogeneous. TWFE is also more precise because the never-treated group is large, making the restriction to the smaller ETWFE comparison group costlier than any bias reduction benefit.
When treatment effects vary substantially across cohorts, TWFE is biased and the Q test detects significant heterogeneity, leading to selection of ETWFE.
set.seed(42)
# Substantial heterogeneous treatment effects: 0.5, 1.0, 2.0 by cohort
effect_map <- c("3" = 0.5, "4" = 1.0, "5" = 2.0)
effects <- ifelse(
is.na(panel$cohort), 0,
effect_map[as.character(panel$cohort)]
)
panel$y <- panel$unit_fe + panel$time_fe +
effects * panel$treated + rnorm(nrow(panel), sd = 0.5)
result_hetero <- select_twfe(
fml = y ~ 1,
tvar = period,
gvar = gvar,
data = panel,
ivar = unit,
cgroup = "never",
vcov = ~unit
)
print(result_hetero)
#> Model selection: ETWFE wins (Cochran's Q test)
#>
#> ATT SE
#> ETWFE (selected) 1.0312 0.0448
#> TWFE 1.2136 0.0394
#>
#> ETWFE vs. TWFE:
#> Cochran's Q statistic: 382.6950
#> Degrees of freedom: 8
#> Cochran's Q p-value: 0.000000 (p < 0.050 selects ETWFE)
#> └--> ETWFE wins
#>
#>
#> Heterogeneity fraction (I^2): 0.98
#> Use plot() to see shrunken estimates by time and cohort.
plot(result_hetero)The Cochran’s Q test shows highly significant heterogeneity (small p-value), so ETWFE is selected. This is correct because the true effects vary substantially (0.5, 1.0, 2.0 by cohort), causing TWFE to be biased.
When treatment effects differ only slightly across cohorts and the sample is small, the Q test may or may not detect heterogeneity depending on the noise level and realized cohort composition. We illustrate with very mild heterogeneity (cohort effects of 0.90, 1.0, and 1.10) in a small sample (30 units).
Note that, while the Q test selects the “wrong” model here, the selected TWFE model actually estimates the ATT more accurately than the “right” ETWFE model. The TWFE point estimate is closer to the true ATT of 1, and the TWFE has a smaller standard error as well. This illustrates the general point that in small samples a simpler model can produce better estimates even when it is not quite correct (Hastie, Tibshirani, and Friedman 2009; the bias-variance tradeoff).
set.seed(42)
n_units_small <- 30 # Very small sample
# Assign cohorts and build panel with small sample
cohort_small <- sample(c(3, 4, 5, NA), n_units_small, replace = TRUE,
prob = c(.25, .25, .25, .25))
unit_fe_small <- rnorm(n_units_small, sd = 1)
time_fe_small <- rnorm(n_periods, sd = 0.5)
panel_small <- expand.grid(unit = 1:n_units_small, period = 1:n_periods)
panel_small$cohort <- cohort_small[panel_small$unit]
panel_small$unit_fe <- unit_fe_small[panel_small$unit]
panel_small$time_fe <- time_fe_small[panel_small$period]
panel_small$treated <- !is.na(panel_small$cohort) & panel_small$period >= panel_small$cohort
# Very mild heterogeneous treatment effects
effect_map_veryMild <- c("3" = 0.90, "4" = 1.0, "5" = 1.10)
effects_veryMild <- ifelse(
is.na(panel_small$cohort), 0,
effect_map_veryMild[as.character(panel_small$cohort)]
)
panel_small$y <- panel_small$unit_fe + panel_small$time_fe +
effects_veryMild * panel_small$treated + rnorm(nrow(panel_small), sd = 0.5)
panel_small$gvar <- ifelse(is.na(panel_small$cohort), Inf, panel_small$cohort)
result_mild <- select_twfe(
fml = y ~ 1,
tvar = period,
gvar = gvar,
data = panel_small,
ivar = unit,
cgroup = "never",
vcov = ~unit
)
print(result_mild)
#> Model selection: TWFE wins (Cochran's Q test)
#>
#> ATT SE
#> TWFE (selected) 1.1384 0.1444
#> ETWFE 1.2099 0.1655
#>
#> ETWFE vs. TWFE:
#> Cochran's Q statistic: 8.4753
#> Degrees of freedom: 8
#> Cochran's Q p-value: 0.388472 (p < 0.050 selects ETWFE)
#> └--> TWFE winsFor a comprehensive Monte Carlo study that validates the Q criterion
across many replications and sample sizes, see
paper/monte_carlo.R in the package repository. That script
runs a 3×2 factorial design (3 heterogeneity levels × 2 sample sizes)
with 1,000 replications per cell. The results confirm that the Q test
correctly detects heterogeneity: rejection rates are low under
homogeneity and high under true heterogeneity. The results confirm that
the Q test outperforms an alternative model-selection criterion that
tries to minimize the MSE of the ATT.
This appendix describes in more detail the statistics used by
selectTWFE.
Cochran’s Q statistic measures heterogeneity across cohort-time effects:
\[Q = \sum_k w_k(\hat{\delta}_k - \bar{\delta}_w)^2\]
where \(w_k = 1/\hat{\sigma}_k^2\) are inverse-variance weights, \(\hat{\delta}_k\) are the cohort-time effects, and \(\bar{\delta}_w\) is the precision-weighted mean effect.
Under the null hypothesis of homogeneous effects, \(Q \sim \chi^2_{k-1}\) where \(k\) is the number of cohort-time cells.
Decision rule: - If p-value \(< \alpha\) (default \(\alpha = 0.05\)): Evidence of significant heterogeneity → Select ETWFE - If p-value \(\geq \alpha\): No significant heterogeneity → Select TWFE
Users can adjust the significance level via the alpha
parameter (e.g., alpha = 0.01 for conservative testing,
alpha = 0.10 for exploratory analysis).
When ETWFE is selected, selectTWFE also reports the
heterogeneity fraction \(I^2\) (Higgins et al. 2003):
\[I^2 = \max\!\left(0,\; \frac{Q - (K-1)}{Q}\right)\]
where \(Q\) is Cochran’s heterogeneity statistic and \(K\) is the number of cohort-time cells. \(I^2\) ranges from 0 (all variance is sampling noise) to 1 (all variance is genuine heterogeneity).
When ETWFE is selected and shrink_heterogeneity = TRUE
(the default), the cohort-time ATT estimates shown in the event study
plot are shrunk toward the grand mean ATT using the empirical Bayes
formula:
\[\hat{\theta}_i^{EB} = \bar{\theta} + \left(1 - \frac{\sigma^2_i}{\sigma^2_i + \hat{\tau}^2}\right)(\hat{\theta}_i - \bar{\theta})\]
Precisely estimated cells shrink least; noisy cells shrink most. This produces a more honest picture of treatment effect heterogeneity.
The standard errors of the cohort-time effects are also shrunk by a
factor of \(\sqrt{1 - B_i}\), where
\(B_i = \sigma^2_i / (\sigma^2_i +
\hat\tau^2)\) is the shrinkage factor. This is the naive
empirical Bayes posterior SD, \(\sqrt{(1 -
B_i)\sigma^2_i}\). The original (unshrunk) SEs are retained in
the se_raw column of the returned
estimates_shrunk data frame.
The error bars on the event study plot are Bonferroni-adjusted 95% confidence intervals. With \(K\) cohort-time cells displayed, each interval uses the critical value \(z_{1 - 0.025/K}\) rather than \(1.96\), so that the familywise probability of at least one interval failing to cover its target is at most 5% under a global null. The plot also shows a solid horizontal reference line at the estimated ATT.