---
title: "Chained Multiple Imputation Workflows with mimar"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Chained Multiple Imputation Workflows with mimar}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Overview

Missing data are ubiquitous in applied statistics. A large proportion of
real data sets contain at least some incomplete cases, and the way missingness
is handled can substantially affect point estimates, standard errors, and
downstream inference. Complete-case analysis discards potentially
informative observations and, unless data are missing completely at random,
introduces selection bias. Single imputation fills missing values once but
ignores imputation uncertainty, leading to standard errors that are too
narrow.

Multiple imputation addresses both problems. By generating $m$ plausible
completions of the data and combining results using Rubin's pooling rules
[@rubin1987], it propagates uncertainty about what the missing values might
have been into the final estimates. The chained-equation formulation, sometimes
called fully conditional specification, makes multiple imputation tractable for
mixed-type data by modeling each incomplete variable conditionally on all
others, cycling through variables until the chained summaries appear stable
[@vanbuuren2011mice].

`mimar` is designed for analyses where missing-data handling, artificial
missingness benchmarks, imputation diagnostics, and downstream pooling should
live in one compact grammar. Its contribution is not to replace the mature
special-purpose imputation ecosystem, but to make a complete workflow concise:
describe the missingness, create benchmark amputations when needed, impute with
native or learner-backed update rules, inspect diagnostics, evaluate recovered
cells when truth is available, and pool post-fit quantities.

`mimar` organizes this workflow into a small set of composable functions:

```r
describe()        # characterise missingness structure
ampute()          # introduce controlled artificial missingness
imputer_registry()# inspect available imputation methods
impute()          # run the chained multiple imputation
complete()        # extract a single completed dataset
evaluate()        # score imputation quality against known truth
pool()            # apply Rubin's pooling rules
plot()            # visualise any of the above objects
```

The central design decision is that `mimar` owns the outer chained-equation
loop and multiple-imputation bookkeeping. Each imputer is simply a
variable-level update rule slotted into that loop. This separation means that
native parametric methods, donor-based methods, and machine-learning-backed
methods are called identically and obey the same convergence and
type-restoration logic:

```r
impute(data, imputer = "pmm",     m = 5, maxit = 5)
impute(data, imputer = "rf",      m = 5, maxit = 5)
impute(data, imputer = "xgboost", m = 5, maxit = 5)
```

No external modelling framework is required. Learner-backed imputers call their
original packages directly, and those backend packages are installed with
`mimar` so all registered imputers are available without extra installation
steps.

## Succinct Use

For everyday work, `impute()` is the only function you need. The input data
can contain `NA`, and the completed outputs returned by `complete()` do not.

```{r}
succinct_dat <- data.frame(
  x = c(1, NA, 3, 4, NA),
  z = factor(c("a", "b", NA, "a", "b"))
)

i <- mimar::impute(succinct_dat, imputer = "knn", m = 2, maxit = 2, seed = 1)
mimar::complete(i, 1)
mimar::complete(i, "all")
```

The workflow can be summarized as:

| Stage | Function | Main object |
|---|---|---|
| Missingness description | `describe()` | `mimar_missing` |
| Artificial missingness | `ampute()` | `mimar_amputation` |
| Imputation | `impute()` | `mimar_imputation` |
| Completed data extraction | `complete()` | data frame or list |
| Quality assessment | `evaluate()` | `mimar_evaluation` |
| Post-fit pooling | `pool()` | `mimar_pool` |
| Diagnostics | `plot()` | `ggplot` |

```{r}
library(mimar)

set.seed(1)
dat <- data.frame(
  age   = rnorm(120, 50, 10),
  bmi   = rnorm(120, 25,  4),
  sex   = factor(sample(c("F", "M"),          120, TRUE)),
  group = factor(sample(c("A", "B", "C"),     120, TRUE)),
  smoker = sample(c(TRUE, FALSE),             120, TRUE)
)

head(dat)
```

## Characterizing the Missingness Structure

Before imputing, it is good practice to understand the extent and pattern of
missingness. `describe()` returns a summary object that reports per-variable
missingness proportions, complete-case counts, and the missingness pattern
matrix.

Formally, let $X$ be an $n \times p$ data matrix and define the missingness
indicator

$$
R_{ij} =
\begin{cases}
1, & X_{ij}\ \text{is missing},\\
0, & X_{ij}\ \text{is observed}.
\end{cases}
$$

The per-variable missingness proportion is

$$
\hat{\pi}_j = \frac{1}{n}\sum_{i=1}^{n} R_{ij},
$$

and the complete-case count is

$$
n_{\mathrm{cc}} = \sum_{i=1}^{n} I(R_{i1}=0,\ldots,R_{ip}=0).
$$

These summaries answer the first practical question: how much information
would be lost under complete-case analysis, and which variables are driving
that loss?

```{r}
d <- describe(dat)
d
summary(d)
```

```{r, fig.width=7, fig.height=4}
plot(d)
```

## Imputer Registry

`mimar` ships with native imputers and learner-backed methods. The registry
documents each method, whether it is implemented natively or through a backend
package, and which target types it supports.

```{r}
imputer_registry()
```

`describe("imputers")` returns the same registry through the unified
`describe` interface, which also exposes the corresponding print and plot
methods.

```{r}
describe("imputers")
```

## Simulating Missingness for Benchmarking

When the goal is to evaluate imputation quality, a natural approach is to
start with a complete dataset, introduce missingness under a known mechanism,
impute, and compare the recovered values to the truth. `ampute()` supports
this workflow by generating artificial missingness and tracking three masks:

- `mask_original`: cells already missing before amputation;
- `mask_added`: cells made missing by `ampute()`;
- `mask_total`: all missing cells after amputation.

Only `mask_added` cells have known truth and are used by `evaluate()`.

### Missing completely at random

Under MCAR, each cell in the target variable is independently assigned to be
missing with constant probability:

$$
A_{ij} \sim \operatorname{Bernoulli}(\pi),
\qquad X_{ij} \leftarrow \mathrm{NA}\ \text{if}\ A_{ij}=1,
$$

where $\pi$ is `prop`. MCAR is the only mechanism under which complete-case
analysis is unbiased, making it a useful baseline when comparing imputers.

### Missing at random

Under MAR, the probability of missingness depends on observed covariates but
not on the missing value itself. `mimar` constructs a standardized linear
score $s_i$ from the `by` variables and maps it through a logistic function:

$$
p_i = \frac{1}{1+\exp[-(\alpha+s_i)]}.
$$

The intercept $\alpha$ is calibrated so that $n^{-1}\sum_i p_i \approx
\texttt{prop}$. This ensures that the targeted missingness proportion is
respected on average while allowing individual probabilities to vary with
the observed predictors.

### Missing not at random

Under MNAR, $s_i$ is derived from the target variable itself, which is the
mechanism most harmful to inference because the missingness is informative
about the value being missed. For numeric targets, `ampute()` can emphasize
the right tail, left tail, or both tails of the distribution.

```{r}
a <- ampute(
  dat,
  prop      = 0.25,
  mechanism = "MAR",
  target    = c("bmi", "group"),
  by        = c("age", "sex"),
  seed      = 1
)

a
summary(a)
```

## Quick Visual Checks

The first thing most users want to see is whether the diagnostics are actually
visible and legible. This compact set of plots appears early in the vignette so
the rendered HTML shows the core visual outputs without requiring a long
scroll.

```{r, fig.width=7, fig.height=4, fig.show='hold'}
i_knn <- impute(a, imputer = "knn", m = 3, maxit = 3, seed = 1)
plot(i_knn, type = "density", variable = "bmi")
plot(i_knn, type = "xy", formula = bmi ~ age | sex)
```

## The Chained Imputation Algorithm

The fully conditional specification approach imputes each incomplete variable
in turn, conditioning on all other variables. This is repeated for $T$
iterations and replicated $m$ times to obtain $m$ independent completed
datasets.

For incomplete variable $X_j$, let

$$
O_j = \{i : R_{ij}=0\}, \qquad M_j = \{i : R_{ij}=1\}.
$$

At iteration $t$, with the current completed data $\widetilde X^{(t)}$,
`mimar` fits an imputation model using the observed rows of variable $j$:

$$
\hat f_j^{(t)}:\widetilde X_{-j,O_j}^{(t)} \rightarrow X_{j,O_j}.
$$

Missing values are then updated by applying the learned model to the missing
rows:

$$
\widetilde X_{j,M_j}^{(t+1)}
= g_j\!\left(\hat f_j^{(t)},\widetilde X_{-j,M_j}^{(t)}\right),
$$

where $g_j$ is the prediction or donor-sampling step specific to the chosen
imputer. Observed values are never overwritten; $\widetilde X_{j,O_j}^{(t)}$
is reset to $X_{j,O_j}$ after each update.

The full algorithm, written compactly, is as follows. Let $h$ be the
imputer, $T$ the number of chained iterations, $m$ the number of
imputations, and $\mathcal{B}(O_j)$ a bootstrap sample from the observed
rows of variable $j$.

$$
\begin{array}{ll}
\textbf{Input:} &
X,\ R,\ h,\ m,\ T.\\[2mm]
\textbf{Initialize:} &
\widetilde X^{(0)} \leftarrow \operatorname{init}(X),
\quad\text{medians for numeric, modes for categorical variables.}\\[2mm]
\textbf{For } k=1,\ldots,m: &
\widetilde X_k^{(0)} \leftarrow \widetilde X^{(0)}.\\[1mm]
& \textbf{For } t=1,\ldots,T:\\[1mm]
& \quad \textbf{For each } j \text{ with } M_j \ne \varnothing:\\[1mm]
& \quad\quad B_j \leftarrow \mathcal{B}(O_j).\\[1mm]
& \quad\quad \hat f_{jk}^{(t)} \leftarrow
  \operatorname{fit}_h\!\left(\widetilde X_{k,B_j,-j}^{(t-1)}, X_{B_j,j}\right).\\[1mm]
& \quad\quad \widetilde X_{k,M_j,j}^{(t)} \leftarrow
  \operatorname{restore}_j\!\left[
  g_h\!\left(\hat f_{jk}^{(t)}, \widetilde X_{k,M_j,-j}^{(t-1)}\right)
  \right].\\[1mm]
& \quad\quad \widetilde X_{k,O_j,j}^{(t)} \leftarrow X_{O_j,j}.\\[2mm]
\textbf{Return:} &
\{\widetilde X_1^{(T)},\ldots,\widetilde X_m^{(T)}\}
\text{ as a \texttt{mimar\_imputation} object.}
\end{array}
$$

The bootstrap draw within each iteration is what introduces between-imputation
variability and ensures that the $m$ completed datasets are not identical.
The `restore` step converts predictions back to the target storage type
(numeric, integer, `Date`, logical, factor, or character), which allows the
same loop to handle numeric-only, categorical-only, and mixed-type data frames
without special-casing at the call site.

## Implemented Update Equations

The imputers differ in how $\hat f_j$ is fitted and how $g_j$ generates draws.
The choice of imputer determines both the distributional assumptions made about
each variable and the computational cost per iteration.

### Mean, Median, Mode, and Naive

These constant-predictor imputers ignore the predictor variables entirely. For
numeric targets, `mean` and `median` use

$$
\widetilde x_{ij} = \bar x_j
\quad\text{or}\quad
\widetilde x_{ij} = \operatorname{median}(x_{O_j,j}).
$$

For categorical targets, `mode` imputes the most frequent observed category:

$$
\widetilde x_{ij}
= \arg\max_c \sum_{i \in O_j} I(x_{ij}=c).
$$

`naive` selects median for numeric-like variables and mode for
categorical-like variables automatically. These methods are primarily useful
as baselines and for initialising the chain; they do not preserve the
conditional distribution of the missing variable given observed covariates.

### Normal Linear Draw

`norm` fits an ordinary linear model

$$
X_j = X_{-j}\beta_j + \epsilon_j, \qquad \epsilon_j \sim N(0,\sigma_j^2),
$$

and draws from the predictive distribution at missing rows:

$$
\widetilde X_{j,M_j} = \hat X_{-j,M_j}\hat\beta_j + e,
\qquad e \sim N(0,\hat\sigma_j^2).
$$

The stochastic draw is important: deterministic regression imputation
underestimates the variance of the imputed variable and collapses the
distribution around the regression surface.

### Predictive Mean Matching

`pmm` combines linear prediction with donor sampling, which addresses the
main limitation of `norm` for non-Gaussian variables: imputed values are
constrained to the observed support. The method fits the same linear model but
instead of drawing from the normal predictive distribution, it selects a donor
from the observed rows whose predicted value is closest to the missing row's
prediction:

$$
d(i,\ell) = |\hat\mu_i - \hat\mu_\ell|, \qquad
\widetilde x_{ij} = x_{\ell j},\quad \ell \sim \mathcal{D}_i.
$$

This makes PMM robust to departures from normality and ensures that imputed
values respect the range and granularity of the observed data
[@vanbuuren2011mice]. `spmm` uses the same donor-matching rule in a
single-step variant.

### Logistic and Polytomous Regression

For binary targets, `logreg` models the conditional probability directly:

$$
\Pr(X_j = 1 \mid X_{-j}) = \operatorname{logit}^{-1}(X_{-j}\beta_j),
$$

and imputes by drawing

$$
\widetilde X_{ij} \sim \operatorname{Bernoulli}(\hat p_i).
$$

For multiclass targets, `polyreg` fits one-vs-rest logistic models for each
class $c = 1,\ldots,C$ and normalises the resulting probabilities:

$$
\widetilde p_{ic}
= \frac{\hat p_{ic}}{\sum_{r=1}^{C}\hat p_{ir}},
$$

then draws from $\operatorname{Categorical}(\widetilde p_i)$. Stochastic
class draws, rather than hard majority-class assignment, are necessary for
preserving the uncertainty that multiple imputation is designed to propagate.

### Donor Methods: knn, hotdeck, famd

`knn` and `hotdeck` are donor-based imputers that do not assume any
parametric form for the conditional distribution. Predictors are encoded
into a numeric design matrix $Z$. The donor distance for missing row $i$ is

$$
d(i,\ell) = \left\|Z_i - Z_\ell\right\|_2.
$$

`knn` samples the imputed value from the $k$ nearest observed donors.
`hotdeck` uses the same standardized distance inside the chained-equation
loop, operating as a stochastic hot-deck update. `famd` extends the donor
approach to mixed-type predictors by first projecting observations into
FAMD-assisted coordinates with `missMDA` [@josse2016missmda], then applying
donor matching in that space.

### Learner-Backed Imputers

`rf`, `ranger`, `rpart`, `nbayes`, `svm`, `bart`, `glmnet`, `gbm`, and
`xgboost` treat the imputation of each variable as a supervised learning
problem. The learner is fitted on observed rows and predicts (or draws from)
the conditional distribution at missing rows:

$$
\hat f_j = \mathcal{A}_h(X_{-j,O_j}, X_{j,O_j}).
$$

Numeric targets use regression predictions; binary and multiclass targets use
sampled class probabilities when the backend exposes them. These learner-backed
methods are pragmatic stochastic chained update rules. They are useful for
predictive recovery and sensitivity analyses, but users should inspect
between-imputation variability, convergence-screening plots, and downstream
model sensitivity rather than assuming that every learner automatically yields
proper multiple-imputation uncertainty. `rf` operates as
a variable-level supervised update inside `mimar`'s chained loop; it is not
a whole-dataframe imputation engine. This is conceptually related to the
missForest strategy [@stekhoven2012missforest], but the outer
iteration, multiple-imputation replication, and type handling are controlled
by `mimar`. The `ranger`, `glmnet`, `gbm`, and `xgboost` backends follow
@wright2017ranger, @friedman2010glmnet, @friedman2001gbm, and
@chen2016xgboost, respectively.

## Running the Imputation

All imputers are called through the same `impute()` interface. The `m`
argument controls the number of completed datasets and `maxit` the number
of chained-equation cycles per dataset. Increasing `maxit` allows the
conditional distributions more cycles to converge; in practice, 5 to 10
iterations is often sufficient for well-behaved data, though trace diagnostics
and sensitivity analyses should be checked for complex missingness patterns.

When a learner-backed imputer is requested, `mimar` applies that imputer to
each incomplete variable rather than silently substituting another method.
This keeps benchmark comparisons interpretable: `imputer = "rf"` means that
random-forest updates are used for numeric, binary, and multiclass targets,
and the same principle holds for the other learner-backed imputers.

```{r}
i_knn <- impute(a, imputer = "knn", m = 3, maxit = 3, seed = 1)
i_knn
summary(i_knn)
complete(i_knn, 1)
```

The vignette uses `knn` for the main diagnostics because it is dependency-light,
fast on the small example data, and produces visible stochastic donor variation
across imputations. The same plotting calls work for the other registered
imputers.

The `m` completed datasets are conditionally independent once their random
seeds have been fixed. `mimar` therefore parallelises at the completed-dataset
level when `ncore > 1`. Internally, the package maps the same single-chain
imputation routine over `1:m` with `functionals::fmap()`. This is the natural
parallel boundary: each worker receives the original incomplete data, the
initial median/mode completion, the imputer specification, and a deterministic
seed offset. Worker `k` uses `seed + k - 1`, so a run remains reproducible for
a fixed `seed`, `m`, `maxit`, and imputer.

```r
i_parallel <- impute(
  a,
  imputer = "knn",
  m       = 5,
  maxit   = 5,
  seed    = 1,
  ncore   = 2
)
```

The default `ncore = 1` is deliberately conservative. It avoids parallel
overhead for small data and gives the most predictable behavior in constrained
execution environments. Increasing `ncore` is most useful when `m` is greater
than one and each chain is expensive, for example with random forests,
gradient boosting, penalized regression, or BART. Parallelism does not change
the chained-equation model; it only changes how the independent completed
datasets are scheduled.

Methods that differ in their statistical assumptions can be compared by
running them on the same amputed object:

```{r}
i_knn_small <- impute(a, imputer = "knn",     m = 1, maxit = 2, seed = 1)
i_hotdeck <- impute(a, imputer = "hotdeck", m = 1, maxit = 2, seed = 1)

summary(i_knn_small)
summary(i_hotdeck)
```

Learner-backed methods are available through the same call:

```r
impute(a, imputer = "rf",      m = 5, maxit = 5, seed = 1)
impute(a, imputer = "glmnet",  m = 5, maxit = 5, seed = 1)
impute(a, imputer = "xgboost", m = 5, maxit = 5, seed = 1)
impute(a, imputer = "superlearner", m = 5, maxit = 5, seed = 1)
```

## Tuning Imputers

Learner-backed imputers expose their hyperparameters through a reusable
specification object, or directly through `...` when calling `impute()`. The
same hyperparameter set is applied across the full chained-imputation
pipeline.

```{r}
rf_spec <- imputer("rf", num.trees = 500)
xgb_spec <- imputer("xgboost", nrounds = 100, max_depth = 3)

describe(a)

i_knn <- impute(a, imputer = "knn", m = 3, maxit = 3, seed = 1, donors = 10)

summary(i_knn)
```

The learner-backed specifications can then be used in the same pipeline:

```r
i_rf <- impute(a, imputer = rf_spec, m = 3, maxit = 3, seed = 1)
i_xgb <- impute(a, imputer = xgb_spec, m = 3, maxit = 3, seed = 1)
```

## Super Learner Imputation

`superlearner` builds a cross-validated ensemble of candidate imputers. For each
incomplete variable, the candidate library is evaluated on observed cells,
non-negative weights are assigned from the observed-cell prediction loss, and
the weighted ensemble is used inside the same chained-imputation loop.

```{r}
sl_spec <- imputer(
  "superlearner",
  library = c("pmm", "knn", "rpart"),
  folds = 3,
  metalearner = "inverse_loss"
)

i_sl <- impute(a, imputer = sl_spec, m = 2, maxit = 2, seed = 1)
summary(i_sl)
```

The shorter `sl` alias is equivalent:

```r
impute(a, imputer = "sl", m = 5, maxit = 5, seed = 1)
```

## Evaluating Imputation Quality

When the imputation object was produced from an amputed dataset, the true
values of the artificially introduced missing cells are known. `evaluate()`
exploits this by scoring only cells where `mask_added == TRUE`, isolating
imputation error from any pre-existing missingness.

For numeric variables with true values $y_i$ and imputed values $\hat y_i$
over the $q$ recovered cells:

$$
\operatorname{RMSE} =
\sqrt{\frac{1}{q}\sum_{i=1}^{q}(\hat y_i-y_i)^2},
\qquad
\operatorname{MAE} =
\frac{1}{q}\sum_{i=1}^{q}|\hat y_i-y_i|,
\qquad
\operatorname{Bias} =
\frac{1}{q}\sum_{i=1}^{q}(\hat y_i-y_i).
$$

RMSE penalises large errors more heavily than MAE; comparing the two
gives a sense of whether the imputer is making a few large mistakes or many
small ones. Bias diagnoses systematic over- or under-imputation, which can
occur when the imputation model is misspecified relative to the true
conditional distribution.

For categorical variables,

$$
\operatorname{Accuracy}
= \frac{1}{q}\sum_{i=1}^{q} I(\hat y_i=y_i).
$$

Balanced accuracy averages per-class recall and is more informative when
class frequencies are unequal among the missing cells.

```{r}
e <- evaluate(i_knn)
e
describe(e)
head(e$recovery_by_imputation)
```

```{r, fig.width=7, fig.height=4}
plot(i_knn)
plot(i_knn, type = "missing")
plot(i_knn, type = "density")
plot(e)
```

## Diagnostic Plotting

Diagnostic plotting is not a cosmetic step in multiple imputation. The pooled
analysis assumes that the imputed values are plausible draws from the
conditional distribution implied by the imputation model. No single plot proves
that assumption or establishes convergence, but several simple graphics expose
common failures:
chains that have not stabilised, imputed values with implausible marginal
distributions, categorical levels that are over- or under-produced, and
relationships between variables that are distorted among the imputed cells.

`mimar` keeps lightweight iteration diagnostics in the imputation object. For
each completed dataset, iteration, and incomplete variable, it records the
number of imputed cells and, for numeric variables, the mean and standard
deviation of the current imputed values. These summaries are small enough to
store by default but rich enough to support convergence-screening trace plots.

```{r}
head(i_knn$diagnostics$trace)
```

The trace plot shows how a summary of the imputed values changes over the
chained iterations. Strong monotone trends or chains that remain separated can
signal unstable chained updates, feedback between derived variables, or a poor
predictor specification. Flat traces are not automatically good: deterministic
or constant imputers such as `naive` can be flat because they do not generate
meaningful stochastic updates. Trace plots are screening tools, not formal
proofs of convergence, and are most informative for stochastic methods such as
`pmm`, `norm`, `logreg`, `polyreg`, and learner-backed imputers.

```{r, fig.width=7, fig.height=4}
plot(i_knn, type = "trace", statistic = "mean")
```

Density plots compare the observed distribution with the distribution of
imputed values. The observed data are drawn once, while imputed values are
drawn separately for each completed dataset. This mirrors the common
diagnostic style where the observed curve acts as a reference and each
imputation contributes its own imputed curve. Density diagnostics are drawn as
lines rather than filled areas so that several completed datasets can be
compared without the later layers obscuring the earlier ones. The marginal
distributions do not have to be identical under MAR, because missingness may
depend on observed covariates. Large differences should still be investigated
because they may represent either a real conditional difference or an
imputation-model problem.

```{r, fig.width=7, fig.height=4}
plot(i_knn, type = "density", variable = "bmi")
```

Box-and-whisker diagnostics provide the same comparison in a more robust form.
Imputation number `0` denotes the observed values and `1:m` denote the
completed datasets. This is useful when densities are unstable because there
are few imputed values or because the variable has a skewed distribution with
outliers.

```{r, fig.width=7, fig.height=4}
plot(i_knn, type = "boxplot", variable = "bmi")
```

The strip plot displays individual observed and imputed values by imputation
number. It is especially helpful for small datasets, variables with few missing
cells, and situations where overplotting is limited. A tight pile of imputed
points at one value reveals deterministic or overly concentrated imputation.

```{r, fig.width=7, fig.height=4}
plot(i_knn, type = "strip", variable = "bmi")
```

Bivariate diagnostics examine whether the imputed rows preserve relationships
between variables. The formula interface follows the familiar `y ~ x` pattern,
with an optional conditioning variable after `|`. Observed points are shown
separately from rows where either plotted variable had to be imputed.

```{r, fig.width=7, fig.height=4}
plot(i_knn, type = "xy", formula = bmi ~ age | sex)
```

For categorical variables, density and boxplot diagnostics are not meaningful.
The proportion plot compares the category distribution among observed values
with the category distribution generated within each imputation. A stratified
formula can be used when a categorical discrepancy may be explained by another
observed variable.

```{r, fig.width=7, fig.height=4}
plot(i_knn, type = "proportion", variable = "group")
plot(i_knn, type = "proportion", formula = group ~ sex)
```

## Pooling Across Imputations

The purpose of generating $m > 1$ completed datasets is to obtain valid
standard errors that reflect uncertainty about the missing data. Analysing
each completed dataset separately and then combining results using Rubin's
rules [@rubin1987; @marshall2009pooling] accomplishes this.

`pool()` is designed for post-fit quantities computed after imputation. The
statistical target is the quantity, not the storage container. A quantity may be
a scalar estimate, a vector of regression coefficients, a covariance-aware
parameter vector, a matrix of survival probabilities, or a scalar performance
metric. Data frames are accepted only as a tidy adapter for scalar estimates
that are naturally stored one row per term and imputation.

For a single quantity, let $Q_k$ be its estimate and $U_k$ the complete-data
variance estimate from completed dataset $k$. The pooled point estimate and
between-imputation variance are:

$$
\bar Q = \frac{1}{m}\sum_{k=1}^{m}Q_k,
\qquad
B = \frac{1}{m-1}\sum_{k=1}^{m}(Q_k-\bar Q)^2.
$$

The total variance combines within- and between-imputation components:

$$
\bar U = \frac{1}{m}\sum_{k=1}^{m}U_k,
\qquad
T = \bar U + \left(1+\frac{1}{m}\right)B.
$$

The term $(1+m^{-1})B$ is the excess variance attributable to missingness.
The fraction of missing information, $\lambda = (1+m^{-1})B/T$, quantifies
how much the missingness inflates uncertainty relative to a complete-data
analysis; variables with large $\lambda$ are those for which inference is
most sensitive to the imputation model.

The result is a `mimar_pool` object whose `pooled` table contains the pooled
quantity, total standard error, confidence interval, and variance diagnostics
for each `term`.

```{r}
pool(c(0.10, 0.11, 0.09), std.error = c(0.04, 0.05, 0.04), name = "age")
```

For several coefficients from the same fitted model, the vector and covariance
matrix should be pooled together. This keeps the joint uncertainty structure,
which is needed for multivariate Wald tests and downstream contrasts.

```{r}
betas <- list(
  c(age = 0.10, bmi = 0.30),
  c(age = 0.11, bmi = 0.32),
  c(age = 0.09, bmi = 0.29)
)
covariances <- list(
  diag(c(0.04, 0.08)^2),
  diag(c(0.05, 0.09)^2),
  diag(c(0.04, 0.08)^2)
)

pooled_betas <- pool(betas, covariance = covariances)
pooled_betas
pooled_betas$estimate
pooled_betas$variance
```

Matrix-valued quantities are supplied as a list with one matrix per imputation.
For example, a survival probability matrix might have time points in rows and
covariate profiles in columns. Unless a full joint covariance is available,
`pool()` treats each cell as a scalar estimand and pools element by element.

```{r}
survival_probabilities <- list(
  matrix(c(0.90, 0.80, 0.70, 0.60), nrow = 2),
  matrix(c(0.91, 0.79, 0.72, 0.61), nrow = 2),
  matrix(c(0.89, 0.81, 0.71, 0.59), nrow = 2)
)

pooled_survival <- pool(survival_probabilities)
pooled_survival
pooled_survival$estimate
```

The tidy data-frame adapter remains useful for model outputs that already have
one scalar row per term and imputation. In the next example there are six input
rows because two quantities, `age` and `bmi`, are estimated in each of three
imputations. `pool()` returns two pooled quantities: one row for `age` and one
row for `bmi`.

```{r}
external_results <- data.frame(
  term       = rep(c("age", "bmi"), each = 3),
  estimate   = c(0.10, 0.11, 0.09, 0.30, 0.32, 0.29),
  std.error  = c(0.04, 0.05, 0.04, 0.08, 0.09, 0.08),
  imputation = rep(1:3, times = 2)
)

p <- pool(external_results)
p
```

When no reliable complete-data variance is available, as is common for bounded
performance measures, Marshall et al. recommend robust summaries. Accordingly,
`pool()` reports the median, interquartile range, and range by default for such
quantities. Use `rule = "mean"` only when a mean and between-imputation standard
error are substantively appropriate.

## Limitations and Future Work

`mimar` is intentionally compact. It standardizes missingness description,
amputation, imputation, diagnostics, evaluation, and pooling, but it does not
yet expose all controls available in larger imputation systems. In particular,
there is no full predictor-matrix interface, passive-imputation grammar for
deterministic derived variables, formal convergence statistic, or automatic
tuning layer for learner-backed imputers.

The learner-backed methods should be interpreted as supervised update rules
inside a chained stochastic workflow. They can improve predictive recovery, but
they do not by themselves guarantee congeniality with a downstream analysis
model or valid uncertainty quantification in every setting. For inferential
analyses, users should compare methods, inspect trace and distribution
diagnostics, evaluate recovery when benchmark truth is available, and assess
whether substantive conclusions are robust to plausible imputation choices.

## References