---
title: "Roy Model: Sector Choice with a Latent Ability Factor"
author: "Greg Veramendi"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Roy Model: Sector Choice with a Latent Ability Factor}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

In a Roy model, individuals choose the sector that offers the highest utility,
and only their sector-specific wage is observed. An unobserved ability factor
affects both test scores and potential wages, and the choice is driven by the
implied gain. **factorana** can jointly estimate:

1.  A **measurement system** for ability (multiple test scores).
2.  **Potential outcomes** in each sector (wages, partially observed).
3.  A **discrete choice** of sector (probit).

All three components share the same latent ability factor, and the model is
identified because test scores contain information about ability independent
of sector choice and realised wages.

The methodology is described in
Heckman, Humphries and Veramendi (2016)
[doi:10.1016/j.jeconom.2015.12.001](https://doi.org/10.1016/j.jeconom.2015.12.001) and
(2018) [doi:10.1086/698760](https://doi.org/10.1086/698760).

## Simulate data

```{r}
library(factorana)

set.seed(108)
n <- 500

# Observed covariates and the unobserved ability factor
x1 <- rnorm(n)           # shifts wages
x2 <- rnorm(n)           # shifts wages and sector choice
f  <- rnorm(n, sd = 1)   # latent ability (unobserved in practice)

# Test scores measure ability with error (shared scale across tests)
T1 <- 2.0 + 1.0 * f + rnorm(n, 0, 0.5)
T2 <- 1.5 + 1.2 * f + rnorm(n, 0, 0.6)
T3 <- 1.0 + 0.8 * f + rnorm(n, 0, 0.4)

# Potential wages in each sector
wage0 <- 2.0 + 0.5 * x1 + 0.3 * x2 + 0.5 * f + rnorm(n, 0, 0.6)
wage1 <- 2.5 + 0.6 * x1 +             1.0 * f + rnorm(n, 0, 0.7)

# Sector choice: higher ability -> more likely to pick sector 1
z_sector <- 0.0 + 0.4 * x2 + 0.8 * f
sector <- as.numeric(runif(n) < pnorm(z_sector))

# Only the wage in the chosen sector is observed
wage <- ifelse(sector == 1, wage1, wage0)

dat <- data.frame(
  intercept = 1,
  x1 = x1, x2 = x2,
  T1 = T1, T2 = T2, T3 = T3,
  wage = wage,
  sector = sector,
  eval_tests = 1L,            # always observe all three tests
  eval_wage0 = 1L - sector,   # wage0 observed iff sector = 0
  eval_wage1 = sector,        # wage1 observed iff sector = 1
  eval_sector = 1L            # sector choice always observed
)
```

## Model specification

Only one latent factor (ability). Every component loads on that factor, so
`loading_normalization` is a scalar here.

```{r}
fm <- define_factor_model(n_factors = 1, n_types = 1)
```

### Measurement equations (tests)

Fix T1's loading at 1.0 to pin the factor scale; leave T2 and T3 free.

```{r}
mc_T1 <- define_model_component(
  name = "T1", data = dat, outcome = "T1", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = 1.0,
  evaluation_indicator = "eval_tests"
)
mc_T2 <- define_model_component(
  name = "T2", data = dat, outcome = "T2", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_tests"
)
mc_T3 <- define_model_component(
  name = "T3", data = dat, outcome = "T3", factor = fm,
  covariates = "intercept", model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_tests"
)
```

### Potential outcomes (wages)

Each wage equation is a linear model evaluated only on the subsample that
chose that sector. `evaluation_indicator` tells **factorana** which rows
contribute to which component's likelihood.

```{r}
mc_wage0 <- define_model_component(
  name = "wage0", data = dat, outcome = "wage", factor = fm,
  covariates = c("intercept", "x1", "x2"), model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_wage0"
)
mc_wage1 <- define_model_component(
  name = "wage1", data = dat, outcome = "wage", factor = fm,
  covariates = c("intercept", "x1"), model_type = "linear",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_wage1"
)
```

### Sector-choice equation

A probit with a free factor loading (positive ability raises the probability
of choosing sector 1 when the simulated value is positive).

```{r}
mc_sector <- define_model_component(
  name = "sector", data = dat, outcome = "sector", factor = fm,
  covariates = c("intercept", "x2"), model_type = "probit",
  loading_normalization = NA_real_,
  evaluation_indicator = "eval_sector"
)
```

### Assemble the system

```{r}
ms <- define_model_system(
  components = list(mc_T1, mc_T2, mc_T3, mc_wage0, mc_wage1, mc_sector),
  factor     = fm
)
```

## Estimation

```{r}
ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)

fit <- estimate_model_rcpp(
  model_system = ms,
  data         = dat,
  control      = ctrl,
  optimizer    = "nlminb",
  parallel     = FALSE,
  verbose      = FALSE
)

fit$convergence   # 0 == strict convergence
fit$loglik
```

## Results

```{r}
components_table(fit, digits = 3)
```

Estimated loadings on the ability factor track the simulated values: the test
equations recover (1.0, 1.2, 0.8); the wage equations recover roughly 0.5 in
sector 0 and 1.0 in sector 1, capturing the positive selection on ability into
the high-return sector.

## Recover factor scores

Given the fitted model, we can back out each individual's posterior mean
ability.

```{r}
fscores <- estimate_factorscores_rcpp(
  fit, dat, control = ctrl, parallel = FALSE, verbose = FALSE
)
head(fscores[, c("obs_id", "factor_1", "se_factor_1", "converged")])

# Correlation of estimated scores with the true (simulated) ability
cor(fscores$factor_1, f)
```

## Two-stage estimation (optional)

For larger applications it is often convenient to estimate the measurement
system first, then freeze those parameters and estimate the structural part.
See `?estimate_model_rcpp` (argument `previous_stage` in
`define_model_system()`) and `?set_adaptive_quadrature_cpp` for the adaptive
integration that makes the second stage cheap.