---
title: "spboost: Nonlinear Spatial Autoregressive Models with Boosting and CFE"
author: "Ghislain Geniaux"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{spboost: Nonlinear Spatial Autoregressive Models with Boosting and CFE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)
library(spboost)
```

Flexible nonlinear spatial autoregressive models: a gradient boosting approach with closed-form estimation

Geniaux, 2026.


# Scientific motivation

Spatial autoregressive models are a central tool in spatial econometrics because observations located close to each other often interact. The spatial lag model introduced in the classical spatial econometric literature (Ord, 1975; Anselin, 1988) formalizes this idea through a row-standardized spatial weights matrix $W$, which defines spatial lags such as $Wy$. In empirical work, the parameter attached to $Wy$ is usually interpreted as a spillover or diffusion effect: outcomes observed at one location partly depend on outcomes observed at neighbouring locations.

The difficulty is that spatial dependence and nonlinear misspecification are hard to distinguish empirically. McMillen (2003) shows that residual spatial autocorrelation may arise because the regression function is misspecified, not necessarily because there is a genuine spatial spillover. This point is especially important when covariates are themselves spatially structured. A flexible regression component can correctly recover nonlinear covariate effects, but if it is too flexible it can also absorb the same low-frequency spatial signal that the researcher wants to attribute to a structural spatial parameter --- a phenomenon analysed as the *geoadditive identification trap* in Geniaux (2026, Section 3.4). Recent work on identification in spatial spillover models emphasizes this tension and the need to distinguish functional-form flexibility from structural spatial dependence (Debarsy, 2025).

`spboost` is motivated by this identification problem, building on the methodological developments in Geniaux (2026), *"Flexible nonlinear spatial autoregressive models: a gradient boosting approach with closed-form estimation"*. It combines nonlinear regression tools with spatial autoregressive models while keeping the spatial parameter explicitly identified and amenable to standard diagnostics. The nonlinear component $f(X)$ can be fitted with component-wise boosting (`mboost`), penalized splines (`mgcv`), MARS (`earth`), or tree/linear boosting (`xgboost`). Component-wise boosting performs regularized variable selection while fitting additive effects (Bühlmann and van de Geer, 2011), thereby connecting the variable-selection literature in statistical learning to spatial econometric estimation.

The second ingredient is computational. Likelihood-based spatial models require repeated evaluation of log-determinants such as $\log |I-\rho W|$, which becomes expensive for large samples. Smirnov (2020, doi:10.1111/gean.12268) proposed a Closed-Form Estimator (CFE) for linear spatial dependence models that avoids this repeated determinant calculation. The CFE is computationally very fast because it relies on three auxiliary regressions without spatial dependence, while achieving accuracy close to maximum-likelihood estimation in the settings studied by Smirnov (2020, doi:10.1111/gean.12268) for the estimation of the spatial parameter. The methodological question behind `spboost` is whether this determinant-free idea remains valid when $f(X)$ is estimated nonparametrically. Geniaux (2026) gives a positive answer for additive estimators that admit a fixed smoothing-matrix or projection representation: a formal asymptotic result establishes the consistency of the Closed-Form Estimator and the validity of its asymptotic variance formula under standard regularity conditions on the additive components, with no additional regularity requirement induced by the spatial autoregressive structure. For boosting-based estimators, the formal proof for the greedy operator remains open, but extensive finite-sample evidence from the Monte Carlo experiments of Geniaux (2026) supports the same conclusion.

The package focuses on three Gaussian spatial data-generating structures:

$$
y = \rho Wy + f(X) + \varepsilon \qquad \text{(SAR)}
$$

$$
y = f(X) + (I - \lambda W)^{-1}\varepsilon \qquad \text{(SEM)}
$$

$$
y = \rho Wy + f(X) + (I - \lambda W_2)^{-1}\varepsilon \qquad \text{(SARAR)}.
$$

Here $f(X)$ is an additive or machine-learning regression component. In the boosting backend, it is written as

$$
f(X) = \beta_0 + \sum_j h_j(X_j),
$$

where each $h_j$ is selected and regularized through component-wise gradient boosting. The SARAR specification supports two distinct row-standardized weight matrices --- $W$ for the spatial lag and $W_2$ for the error process --- the common case $W_2 = W$ corresponding to the SARAR model studied in Geniaux (2026). The main interface, `spbgam()`, allows the user to specify the spatial model (`DGP`) and the estimation method (`method`).

Two families of spatial-parameter estimators are implemented. Methods ending in `ML` use a profile maximum-likelihood criterion and evaluate the spatial log-determinant. Methods ending in `CFE` use the determinant-free Closed-Form Estimator of Smirnov (2020, doi:10.1111/gean.12268), extended to nonlinear additive regression in Geniaux (2026). In the nonlinear setting, CFE proceeds by estimating auxiliary residuals from regressions involving $y$, $Wy$, and, for SAR, $W^2y$, and then solving a quadratic estimating equation for the spatial parameter.

The theoretical status of the estimators differs across estimation backends. The formal asymptotic result applies to nonlinear additive estimators that admit a fixed smoothing-matrix representation, including penalized spline GAM estimators and MARS estimators when the retained basis dimension is fixed. Component-wise boosting estimators are supported by consistent finite-sample evidence from the simulations of Geniaux (2026) and by their asymptotic connection to penalized spline regularization, but the formal proof for the greedy boosting operator remains open. Tree-based `xgboost` variants are provided as experimental predictive benchmarks whose validity in the CFE framework requires empirical validation on a case-by-case basis.

# Main interface

The executable function signature is:

```{r}
args(spbgam)
```

The key arguments are:

| Argument | Meaning |
|---|---|
| `formula` | Regression formula. Its syntax must match the backend: `mboost` base learners for `BSPA`, `mgcv` smooths for `GAM`, ordinary formula terms for `MARS` and `XGBOOST`. |
| `data` | Data frame containing the response and covariates. |
| `W` | Row-standardized spatial weights matrix for SAR or SEM dependence. |
| `W2` | Optional second row-standardized matrix for the error process in SARAR models. |
| `DGP` | Spatial model: `"SAR"`, `"SEM"`, or `"SARAR"`. |
| `method` | Estimator/backend combination, such as `"BSPA_SAR_CFE"` or `"MARS_SEM_ML"`. |
| `control` | List of backend and tuning controls. |

# Model and estimator catalogue

The method names follow a compact grammar:

```text
<backend>_<model>_<spatial estimator>
```

The backend determines how \(f(X)\) is estimated; the model determines the
spatial structure; the suffix determines how the spatial parameter is
estimated.

| Prefix | Regression component | Typical formula syntax | Returned object |
|---|---|---|---|
| `BSPA` | Component-wise boosting with spline base learners from `mboost` | `Y ~ bbs(X1) + bbs(X2) + bols(X3)` | `mboost` object augmented with class `spboost` |
| `BLA` | Component-wise boosting with linear base learners | `Y ~ bols(X1) + bols(X2)` | `mboost` object augmented with class `spboost` |
| `GAM` | Penalized spline GAM from `mgcv` | `Y ~ s(X1) + s(X2)` | `gam`/`bam` object augmented with class `spboost` |
| `MARS` | Multivariate Adaptive Regression Splines through `earth::earth` | `Y ~ X1 + X2 + X3` | `earth` object augmented with class `spboost` |
| `XGBOOST` | Linear or tree-based gradient boosting through `xgboost` | `Y ~ X1 + X2 + X3` | `xgb.Booster` object augmented with class `spboost` |
| `LM` | Linear benchmark | `Y ~ X1 + X2 + X3` | Linear spatial model object augmented with spatial fields |

Available combinations through `spbgam()` are:

| Spatial model | Equation | Available methods |
|---|---|---|
| `SAR` | \(y=\rho Wy+f(X)+\varepsilon\) | `BSPA_SAR_ML`, `BSPA_SAR_CFE`, `BLA_SAR_CFE`, `MARS_SAR_ML`, `MARS_SAR_CFE`, `GAM_SAR_ML`, `GAM_SAR_CFE`, `GAM_SAR_FIVA`, `GAM_SAR_2SLSA`, `XGBOOST_SAR_ML`, `XGBOOST_SAR_CFE`, `LM_SAR_ML`, `BLA_SAR_ML` |
| `SEM` | \(y=f(X)+(I-\lambda W)^{-1}\varepsilon\) | `BSPA_SEM_ML`, `BSPA_SEM_CFE`, `BSPA_SEM_CFE_iter`, `BSPA_SEM_CFE_BRUT`, `MARS_SEM_ML`, `MARS_SEM_CFE`, `GAM_SEM_CFE`, `BLA_SEM_ML` |
| `SARAR` | \(y=\rho Wy+f(X)+(I-\lambda W_2)^{-1}\varepsilon\) | `BSPA_SARAR_ML`, `BSPA_SARAR_CFE`, `BLA_SARAR_ML` |

In user-facing discussions, `spboost_earth_cfe` corresponds to the `MARS_*_CFE`
branch, because the implementation uses `earth::earth`; `spboost_xgboost_cfe`
corresponds to the `XGBOOST_SAR_CFE` branch.

# Theoretical status

The table below is intentionally conservative. It separates formal asymptotic
results, finite-sample empirical validation, and experimental methods.

| Status | Estimators | Interpretation |
|---|---|---|
| Formal asymptotic guarantees | `GAM_SAR_CFE`, `GAM_SEM_CFE`; `MARS_SAR_CFE` and `MARS_SEM_CFE` when `nprune` is fixed | The CFE residual projection argument applies to additive estimators with a fixed smoothing-matrix or projection representation. Consistency and asymptotic variance validity follow under the regularity and identification conditions stated in the methodological paper. |
| Finite-sample validated by simulation evidence | `BSPA_SAR_CFE`, `BSPA_SEM_CFE`, `BSPA_SEM_CFE_iter`, `BSPA_SEM_CFE_BRUT`; `MARS_*_CFE` when `nprune` is selected by cross-validation | The formal proof does not directly cover the discrete model-selection path, but Monte Carlo evidence shows behaviour close to the corresponding GAM/MARS reference estimators in the studied designs. |
| Likelihood reference estimators | `*_ML` methods such as `BSPA_SAR_ML`, `MARS_SAR_ML`, `BSPA_SEM_ML`, `BSPA_SARAR_ML` | These are natural robustness checks for CFE results. They are usually more computationally demanding because they rely on profile likelihood and log-determinant evaluation. |
| Experimental spatial estimators | `GAM_SAR_FIVA`, `GAM_SAR_2SLSA`, `XGBOOST_SAR_CFE`, `XGBOOST_SAR_ML`, `BSPA_SARAR_CFE` | These are useful for benchmarking, prediction, or exploratory analysis, but their theoretical status is more limited in the current framework. Tree-based `xgboost` methods do not admit the smoothing-matrix representation used by the CFE proof. |

For SEM models with high spatial error dependence near the boundary, CFE
fallbacks may be triggered. In that case, it is good practice to compare
against the corresponding `ML` estimator and to inspect the `rho_method` field
stored in the fitted object.

# Returned objects

All successful fits are given class `spboost`, in addition to the class of the
underlying backend. Common fields include:

| Field | Meaning |
|---|---|
| `rho` | Estimated spatial autoregressive parameter for SAR; in SEM routines this field often stores the estimated error parameter \(\lambda\). |
| `lambda` | Error-process parameter for SARAR methods, when available. |
| `fitted` | Full fitted value on the response scale. |
| `fittedYS` | Fitted nonlinear component after the spatial transformation used internally. |
| `residuals` | Full-model residuals. |
| `rmse` | Root mean squared error of full-model residuals. |
| `var_rho` | CFE asymptotic variance estimate when available and numerically stable. |
| `rho_method` | Diagnostic label for the CFE branch, such as `cfe_exact`, `cfe_ratio_R`, or another fallback. |
| `spbgam_meta` | Metadata added by `spbgam()`, including the method, DGP, stopping rule, and CV information. |

The package provides `summary.spboost()`, `predict.spboost()`, and
`fitted_decomp_spboost()` methods for model inspection, prediction, and
decomposition of fitted values by variable.

# Formula syntax

The formula must match the backend.

For component-wise boosting:

```{r}
formula_bspa <- Y ~ bbs(X1) + bbs(X2) + bbs(X3) + bols(X4)
```

For `mgcv` GAM backends:

```{r}
formula_gam <- Y ~ s(X1) + s(X2) + s(X3)
```

For `earth`/MARS and `xgboost`:

```{r}
formula_plain <- Y ~ X1 + X2 + X3 + X4
```

The MARS backend converts compatible formulas to an `earth::earth` formula.
For reproducible inference with the current asymptotic argument, fix
`control_earth$nprune`. If `use_cv_nprune = TRUE`, the estimator is still
available and useful, but its theoretical status becomes empirical rather
than formally covered by the fixed-projection proof.

# Example 1: SAR model with fixed hyperparameters

The first example simulates a nonlinear SAR process and fits two `BSPA`
estimators with known hyperparameters. The CFE version is fast and
determinant-free; the ML version is a useful reference because it estimates
\(\rho\) by profile likelihood.

```{r}
library(spboost)
library(mboost)

set.seed(1)
sim <- dgp(
  n = 1000,
  rho = 0.6,
  betas = c(0, 0.5, 1, -1),
  sigma2 = 1,
  model = "SAR",
  nonlin = TRUE,
  myseed = 1
)

fit_sar_cfe <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 150, nu = 0.1)
  )
)

fit_sar_cfe$rho
fit_sar_cfe$rmse
fit_sar_cfe$rho_method
summary(fit_sar_cfe)
```

The likelihood counterpart uses the same regression formula and stopping
iteration:

```{r}
fit_sar_ml <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_ML",
  control = list(
    control_gamboost = boost_control(mstop = 150, nu = 0.1)
  )
)

c(CFE = fit_sar_cfe$rho, ML = fit_sar_ml$rho)
summary(fit_sar_ml)
```

The fitted nonlinear components can be inspected visually. Because additive
components are identifiable up to constants, the true and estimated
component-specific curves are centered before plotting.

```{r, fig.width=8, fig.height=12}
decomp_cfe_ex1 <- fitted_decomp_spboost(fit_sar_cfe)
decomp_ml_ex1 <- fitted_decomp_spboost(fit_sar_ml)

center_curve <- function(x) as.numeric(x - mean(x, na.rm = TRUE))
plot_component_ex1 <- function(true_col, fit_col, title) {
  ord <- order(sim$data[[fit_col]])
  x_component <- sim$data[[fit_col]][ord]
  true_component <- center_curve(sim$data[[true_col]])[ord]
  cfe_component <- center_curve(decomp_cfe_ex1[[fit_col]])[ord]
  ml_component <- center_curve(decomp_ml_ex1[[fit_col]])[ord]

  plot(
    x_component,
    ml_component,
    type = "l",
    lwd = 1,
    col = "#D95F02",
    xlab = fit_col,
    ylab = "Centered component",
    main = title,
    ylim = range(c(true_component, cfe_component, ml_component), na.rm = TRUE)
  )
  lines(x_component, cfe_component, col = "#1B9E77", lwd = 1)
  lines(x_component, true_component, col = "black", lwd = 3)
}

old_par_ex1 <- par(mfrow = c(3, 1), mar = c(4, 4, 2, 1))
plot_component_ex1("f1", "X1", "Component f1(X1)")
legend(
  "topleft",
  legend = c("True component", "BSPA_SAR_CFE", "BSPA_SAR_ML"),
  col = c("black", "#1B9E77", "#D95F02"),
  lwd = c(3, 1, 1),
  bty = "n"
)
plot_component_ex1("f2", "X2", "Component f2(X2)")
plot_component_ex1("f3", "X3", "Component f3(X3)")
par(old_par_ex1)
```

# Example 2: Internal cross-validation, random and spatial

The same SAR model can be estimated with internal cross-validation for the
boosting stopping iteration. With `cv_mode = "random"`, folds are random.
With `cv_mode = "spatial_block"`, each fold holds out one spatial cluster
built with `blockCV::cv_cluster()`. With `cv_mode = "spatial_hex"`, folds 
are defined as spatial hexagonal blocks built from the coordinate columns 
`x` and `y`. However, this strategy is generally not recommended for 
autoregressive models because it may strongly disrupt the underlying 
neighborhood graph structure (see Geniaux, 2026). The
spatial options require `blockCV` and `sf`; if they are not installed, the
package falls back to a safe random split. Here `mstop = 500` is not the final
fixed stopping iteration: it is the maximum value searched by the internal CV
procedure. Use `cv_plot = TRUE` to display the spatial folds as points coloured
by block; the plotted coordinates are also stored in
`fit$mstop_selection$cv_plot_data`.

```{r, fig.show='hide'}
fit_sar_cv_random <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    # Maximum mstop considered by the internal CV search.
    control_gamboost = boost_control(mstop = 500, nu = 0.1),
    mstop_init = 100,
    mstop_criterion = "CV",
    nfold = 5,
    ncore = 1,
    cv_mode = "random",
    cv_seed_spatial = 1001
  )
)

fit_sar_cv_spatial <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    # Maximum mstop considered by the internal CV search.
    control_gamboost = boost_control(mstop = 500, nu = 0.1),
    mstop_init = 100,
    mstop_criterion = "CV",
    nfold = 5,
    ncore = 1,
    cv_mode = "spatial_block",
    cv_coord_vars_spatial = c("x", "y"),
    cv_hex_size_spatial = 0.15,
    cv_seed_spatial = 1001,
    cv_plot = TRUE
  )
)

fit_sar_cv_random$spbgam_meta$mstop_final
fit_sar_cv_spatial$spbgam_meta$mstop_final
summary(fit_sar_cv_random)
summary(fit_sar_cv_spatial)
```

The same fold assignment can be plotted explicitly from the fitted object.
This is useful when knitting the vignette because the spatial CV diagnostic is
then a standard top-level figure.

```{r, fig.width=8, fig.height=6, echo=FALSE, eval=TRUE}
cv_plot_data <- fit_sar_cv_spatial$mstop_selection$cv_plot_data

if (is.null(cv_plot_data) &&
    requireNamespace("blockCV", quietly = TRUE) &&
    requireNamespace("sf", quietly = TRUE)) {
  set.seed(1001)
  pts_cv <- sf::st_as_sf(
    sim$data,
    coords = c("x", "y"),
    crs = 3857,
    remove = FALSE
  )
  cv_blocks <- blockCV::cv_cluster(x = pts_cv, k = 5, report = FALSE)
  cv_plot_data <- data.frame(
    x = sim$data$x,
    y = sim$data$y,
    fold = as.integer(cv_blocks$folds_ids)
  )
}

stopifnot(!is.null(cv_plot_data))

fold_levels <- sort(unique(cv_plot_data$fold))
fold_cols <- grDevices::hcl.colors(length(fold_levels), palette = "Dark 3")
fold_index <- match(cv_plot_data$fold, fold_levels)

plot(
  cv_plot_data$x,
  cv_plot_data$y,
  col = fold_cols[fold_index],
  pch = 19,
  xlab = "x",
  ylab = "y",
  main = "Spatial block CV folds"
)
legend(
  "topright",
  legend = paste("fold", fold_levels),
  col = fold_cols,
  pch = 19,
  bty = "n",
  cex = 0.8
)
```

# Example 3: Comparison table for SAR estimators

The next block compares six estimators on the same simulated SAR sample:
linear `lagsarlm`, non-spatial `gamboost`, `BSPA_SAR_CFE`, `BSPA_SAR_ML`,
`GAM_SAR_CFE`, and `GAM_SAR_ML`. The table reports the response RMSE, the
error on \(\rho\), and the RMSE of the estimated structural component \(f(X)\).
For the non-spatial `gamboost` benchmark, \(\rho\) is not estimated.

The `GAM_SAR_*` versions are especially useful once the specification is known
or has been selected in a preliminary BSPA workflow. They avoid the
variable-selection boosting loop and are therefore much faster on large
samples; in particular, `GAM_SAR_CFE` can switch to `mgcv::bam()` internally.

```{r}
library(spatialreg)
library(spdep)

listw <- spdep::mat2listw(sim$W, style = "W")

fit_lagsarlm <- spatialreg::lagsarlm(
  Y ~ X1 + X2 + X3,
  data = sim$data,
  listw = listw,
  method = "LU",
  zero.policy = TRUE
)

fit_gamboost <- mboost::gamboost(
  Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  control = boost_control(mstop = 150, nu = 0.1)
)

fit_gam_sar_cfe <- spbgam(
  formula = Y ~ s(X1, k = 5) + s(X2, k = 5) + s(X3, k = 5),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "GAM_SAR_CFE",
  control = list(ncore = 1)
)

fit_gam_sar_ml <- spbgam(
  formula = Y ~ s(X1, k = 5) + s(X2, k = 5) + s(X3, k = 5),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "GAM_SAR_ML",
  control = list(ncore = 1)
)

true_f <- rowSums(sim$data[, c("f0", "f1", "f2", "f3")])
wy <- as.numeric(sim$W %*% sim$data$Y)

f_lagsarlm <- as.numeric(
  {
    x_lag <- model.matrix(Y ~ X1 + X2 + X3, sim$data)
    b_lag <- coef(fit_lagsarlm)[colnames(x_lag)]
    x_lag %*% b_lag
  }
)
f_gamboost <- as.numeric(fitted(fit_gamboost))
f_cfe <- as.numeric(fit_sar_cfe$fittedYS)
f_ml <- as.numeric(fit_sar_ml$fittedYS)
f_gam_cfe <- as.numeric(fit_gam_sar_cfe$fittedYS)
f_gam_ml <- as.numeric(fit_gam_sar_ml$fittedYS)

comparison_sar <- data.frame(
  estimator = c(
    "SAR(lagsarlm)", "gamboost", "BSPA_SAR_CFE", "BSPA_SAR_ML",
    "GAM_SAR_CFE", "GAM_SAR_ML"
  ),
  rmse_y = c(
    sqrt(mean((sim$data$Y - suppressMessages(fitted(fit_lagsarlm)))^2)),
    sqrt(mean((sim$data$Y - fitted(fit_gamboost))^2)),
    fit_sar_cfe$rmse,
    fit_sar_ml$rmse,
    fit_gam_sar_cfe$rmse,
    fit_gam_sar_ml$rmse
  ),
  rmse_rho = c(
    abs(fit_lagsarlm$rho - 0.6),
    NA_real_,
    abs(fit_sar_cfe$rho - 0.6),
    abs(fit_sar_ml$rho - 0.6),
    abs(fit_gam_sar_cfe$rho - 0.6),
    abs(fit_gam_sar_ml$rho - 0.6)
  ),
  rmse_f = c(
    sqrt(mean((true_f - f_lagsarlm)^2)),
    sqrt(mean((true_f - f_gamboost)^2)),
    sqrt(mean((true_f - f_cfe)^2)),
    sqrt(mean((true_f - f_ml)^2)),
    sqrt(mean((true_f - f_gam_cfe)^2)),
    sqrt(mean((true_f - f_gam_ml)^2))
  )
)

comparison_sar
```

# Example 4: SEM model with spatial-error filtering

For SEM models, the spatial parameter governs the error process. The SEM-CFE
methods build auxiliary residual regressions and can optionally tune the
boosting stopping iteration using a SEM-filtered criterion. In this example,
the simulated DGP is SEM and the target parameter is \(\lambda\), stored in the
`rho` field by the current SEM routines.

```{r}
sim_sem <- dgp(
  n = 1000,
  rho = 0.5,
  betas = c(0, 0.5, 1, -1),
  sigma2 = 1,
  model = "SEM",
  nonlin = TRUE,
  myseed = 2
)

fit_sem_cfe <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sem$data,
  W = sim_sem$W,
  DGP = "SEM",
  method = "BSPA_SEM_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 250, nu = 0.1),
    mstop_criterion = "CV",
    cv_mode = "random",
    nfold = 5,
    ncore = 1
  )
)

fit_sem_cfe$rho
fit_sem_cfe$rmse
fit_sem_cfe$rho_method
summary(fit_sem_cfe)
```

When the estimated spatial error parameter is close to the boundary, compare
with `BSPA_SEM_ML`:

```{r}
fit_sem_ml <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sem$data,
  W = sim_sem$W,
  DGP = "SEM",
  method = "BSPA_SEM_ML",
  control = list(
    control_gamboost = boost_control(mstop = 250, nu = 0.1)
  )
)

c(CFE = fit_sem_cfe$rho, ML = fit_sem_ml$rho)
summary(fit_sem_ml)
```

A SEM comparison table can be built in the same spirit as the SAR table. Here
`errorsarlm` is the linear spatial-error reference, while `gamboost` remains a
non-spatial nonlinear benchmark.

```{r}
fit_errorsarlm <- spatialreg::errorsarlm(
  Y ~ X1 + X2 + X3,
  data = sim_sem$data,
  listw = spdep::mat2listw(sim_sem$W, style = "W"),
  method = "LU",
  zero.policy = TRUE
)

fit_gamboost_sem <- mboost::gamboost(
  Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sem$data,
  control = boost_control(mstop = 250, nu = 0.1)
)

true_f_sem <- rowSums(sim_sem$data[, c("f0", "f1", "f2", "f3")])
f_errorsarlm <- as.numeric(
  {
    x_err <- model.matrix(Y ~ X1 + X2 + X3, sim_sem$data)
    b_err <- coef(fit_errorsarlm)[colnames(x_err)]
    x_err %*% b_err
  }
)
f_gamboost_sem <- as.numeric(fitted(fit_gamboost_sem))
f_sem_cfe <- as.numeric(fit_sem_cfe$fittedYS)
f_sem_ml <- as.numeric(fit_sem_ml$fittedYS)
lambda_errorsarlm <- if (!is.null(fit_errorsarlm$lambda)) {
  fit_errorsarlm$lambda
} else {
  fit_errorsarlm$lambda.se
}
if (length(lambda_errorsarlm) == 0L) lambda_errorsarlm <- NA_real_
scalar_or_na <- function(x) {
  if (is.null(x) || length(x) == 0L) return(NA_real_)
  as.numeric(x[1])
}

comparison_sem <- data.frame(
  estimator = c("SEM(errorsarlm)", "gamboost", "BSPA_SEM_CFE", "BSPA_SEM_ML"),
  rmse_y = c(
    sqrt(mean((sim_sem$data$Y - suppressMessages(fitted(fit_errorsarlm)))^2)),
    sqrt(mean((sim_sem$data$Y - fitted(fit_gamboost_sem))^2)),
    scalar_or_na(fit_sem_cfe$rmse),
    scalar_or_na(fit_sem_ml$rmse)
  ),
  rmse_lambda = c(
    abs(as.numeric(lambda_errorsarlm[1]) - 0.5),
    NA_real_,
    abs(scalar_or_na(fit_sem_cfe$rho) - 0.5),
    abs(scalar_or_na(fit_sem_ml$rho) - 0.5)
  ),
  rmse_f = c(
    sqrt(mean((true_f_sem - f_errorsarlm)^2)),
    sqrt(mean((true_f_sem - f_gamboost_sem)^2)),
    sqrt(mean((true_f_sem - f_sem_cfe)^2)),
    sqrt(mean((true_f_sem - f_sem_ml)^2))
  )
)

comparison_sem
```

# Example 5: MARS / earth CFE branch

The `MARS_*_CFE` methods use `earth::earth` for the nonlinear component. This
branch is useful when piecewise-linear effects are expected and a compact
model is desired.

```{r}
fit_mars <- spbgam(
  formula = Y ~ X1 + X2 + X3,
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "MARS_SAR_CFE",
  control = list(
    control_earth = list(
      degree = 1,
      nprune = 12,
      trace = 0
    )
  )
)

fit_mars$rho
fit_mars$rmse
fit_mars$rho_method
```

For theory-oriented work, fixing `nprune` keeps the estimator within the
projection argument described above. For predictive work, `nprune` can also be
selected by internal cross-validation:

```{r}
fit_mars_cv <- spbgam(
  formula = Y ~ X1 + X2 + X3,
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "MARS_SAR_CFE",
  control = list(
    control_earth = list(
      degree = 1,
      use_cv_nprune = TRUE,
      cv_nfold = 5,
      cv_ncore = 1,
      trace = 0
    )
  )
)
```

# Example 6: Experimental xgboost CFE branch

The `XGBOOST_SAR_CFE` estimator can be useful as a predictive benchmark, but
it falls outside the formal smoothing-matrix framework. Use it as an
experimental method and validate it empirically against CFE/ML additive
estimators.

```{r}
fit_xgb <- spbgam(
  formula = Y ~ X1 + X2 + X3,
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "XGBOOST_SAR_CFE",
  control = list(
    mstop_init = 100,
    myparams_xgboost = list(
      booster = "gbtree",
      eta = 0.05,
      max_depth = 2,
      subsample = 0.8,
      colsample_bytree = 0.8,
      nfold = 5,
      early_stopping_rounds = 5,
      nthread = 1,
      verbose = 0
    )
  )
)

fit_xgb$rho
fit_xgb$rmse
```

# Example 7: Prediction and decomposition

For fitted `spboost` objects, use the S3 prediction method. Out-of-sample
prediction with spatial correction requires the full-sample `W` matrix
covering both training and prediction locations.

```{r}
pred_in <- predict(fit_sar_cfe)
head(pred_in)

decomp <- fitted_decomp_spboost(fit_sar_cfe)
head(decomp)
```

For a new dataset, refit on a training subset and provide a full-sample
weights matrix covering both training and prediction locations:

```{r}
train_id <- 1:800
test_id <- 801:1000

W_train <- sim$W[train_id, train_id, drop = FALSE]
row_sum_train <- Matrix::rowSums(W_train)
W_train <- Matrix::Diagonal(
  x = ifelse(row_sum_train > 0, 1 / row_sum_train, 0)
) %*% W_train

fit_sar_train <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data[train_id, ],
  W = W_train,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 100, nu = 0.1)
  )
)

pred_new <- predict(
  fit_sar_train,
  newdata = sim$data[test_id, ],
  data = sim$data[train_id, ],
  W = sim$W,
  type = "BPN"
)

head(pred_new)

# diff RMSE train - test
fit_sar_train$rmse
rmse_test<-sqrt(mean((pred_new-sim$data[test_id, 'Y'])^2))
rmse_test
```

# Example 8: SARAR models

SARAR models include both a spatial lag in the response and a spatially
structured error process. They require two spatial matrices unless `W2 = W` is
acceptable for the application.

```{r}
sim_sarar <- dgp(
  n = 1000,
  rho = 0.5,
  lambda = -0.3,
  betas = c(0, 0.5, 1, -1),
  sigma2 = 1,
  model = "SARAR",
  nonlin = TRUE,
  myseed = 3
)

fit_sarar <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sarar$data,
  W = sim_sarar$W,
  W2 = sim_sarar$W2,
  DGP = "SARAR",
  method = "BSPA_SARAR_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 200, nu = 0.1),
    iter_max = 3,
    tol_lambda = 1e-4
  )
)

fit_sarar$rho
fit_sarar$lambda
fit_sarar$rmse
```

Because SARAR-CFE involves joint determinant-free estimation of two spatial
parameters, it should be treated more cautiously than the marginal SAR and SEM
branches.

# Practical recommendations

Start with the simplest estimator that matches the scientific question. For
inference-oriented nonlinear SAR or SEM work with known smooth structure,
`GAM_*_CFE` or fixed-`nprune` `MARS_*_CFE` are the closest to the formal
asymptotic theory. For automatic variable selection, use `BSPA_*` methods and
compare CFE results with their `ML` counterparts. For prediction-oriented
benchmarks, MARS and `xgboost` can be valuable, but `xgboost` CFE should be
reported as experimental.

For very large datasets, a practical workflow is to search for the model
specification with `BSPA_*_CFE` on representative subsamples, using boosting
as a variable-selection and formula-discovery device. Once the relevant
covariates, nonlinear terms, and spatial specification are known, refit a
`GAM_*_CFE` model on the full sample with the selected formula. On large
samples, the GAM-CFE branch can use `mgcv::bam()`, which is extremely fast for
large additive models. This combines the exploratory strength of boosting with
the stronger asymptotic status and computational stability of the GAM-CFE
branch at full scale.

Always inspect:

```{r}
fit <- fit_sar_cfe
summary(fit)
fit$rho
fit$lambda
fit$rho_method
fit$rmse
fit$spbgam_meta
```

When `rho_method` is not an exact CFE branch, report the fallback and compare
with the likelihood counterpart. When covariates or geographic smoothers are
strongly spatially structured, keep the model deliberately parsimonious and
check sensitivity to `mstop`, formula specification, and the choice of
cross-validation regime.

# References

Anselin, L. (1988). *Spatial Econometrics: Methods and Models*. Kluwer.

Buhlmann, P. and van de Geer, S. (2011). *Statistics for High-Dimensional
Data: Methods, Theory and Applications*. Springer.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system.
*Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining*, 785-794.

Debarsy, N. and Le Gallo, J. (2025). Identification of spatial spillovers:
Do's and don'ts. *Journal of Economic Surveys*, 39(5), 2152-2173.

Friedman, J. H. (1991). Multivariate adaptive regression splines. *The Annals
of Statistics*, 19(1), 1-67.

Geniaux, G. (2026). *Flexible nonlinear spatial autoregressive models: a
gradient boosting approach with closed-form estimation*, 20th WORLD CONFERENCE OF THE SPATIAL ECONOMETRICS ASSOCIATION & 24th INTERNATIONAL WORKSHOP ON SPATIAL ECONOMETRICS AND STATISTICS, Paris, June 17-18, 2026

McMillen, D. P. (2003). Spatial autocorrelation or model misspecification?
*International Regional Science Review*, 26(2), 208-217.

Ord, K. (1975). Estimation methods for models of spatial interaction. *Journal
of the American Statistical Association*, 70(349), 120-126.

Smirnov, O. A. (2020). A closed-form consistent estimator for linear models
with spatial dependence. *Geographical Analysis*. doi:10.1111/gean.12268.

Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal
likelihood estimation of semiparametric generalized linear models. *Journal of
the Royal Statistical Society: Series B*, 73(1), 3-36.
