
Bayesian Household Transmission Modeling in R
HHBayes is an R package for simulating and estimating infectious disease transmission within households. It couples a stochastic, age-structured household simulator with Bayesian inference via Stan, enabling researchers to study within-household spread, evaluate intervention strategies, and reconstruct transmission chains from longitudinal diagnostic data.
Household transmission studies generate rich but complex data — repeated tests on every member, overlapping infections, reinfections, and age-dependent viral kinetics. HHBayes provides an integrated pipeline:
# install.packages("devtools")
devtools::install_github("keli5734/HHBayes")HHBayes depends on RStan, which requires a working C++ toolchain. See the RStan Getting Started Guide for platform-specific setup.
library(HHBayes)
library(rstan)
library(ggpubr)
# 0. Generate surveillance dataset
study_start <- "2024-07-01"
study_end <- "2025-06-30"
dates_weekly <- seq(from = as.Date(study_start), to = as.Date(study_end), by = "week")
surveillance_data <- data.frame(
date = dates_weekly,
# Random epidemic curve (low start, peak middle, low end)
cases = 0.1 + 100 * exp(-0.0002 * (as.numeric(dates_weekly - mean(dates_weekly)))^2) + abs(rnorm( length(dates_weekly),mean = 0, sd = 10))
)
# 1. Simulate 50 households with ODE-based viral dynamics
sim <- simulate_multiple_households_comm(
n_households = 50,
viral_testing = "viral load",
start_date = "2024-07-01",
end_date = "2025-06-30",
seed = 123,
surveillance_df = surveillance_data
)
# 2. Summarize attack rates
rates <- summarize_attack_rates(sim)
rates$primary_by_role
# 3. Prepare data and fit the Bayesian model
stan_input <- prepare_stan_data(sim$diagnostic_df,
use_vl_data = 1,
surveillance_df = surveillance_data,
study_start_date = as.Date("2024-07-01"),
study_end_date = as.Date("2025-06-30"),
seed = 123)
options(mc.cores = parallel::detectCores())
fit <- fit_household_model(stan_input, iter = 2000, chains = 4)
# 4. Visualize
plot_posterior_distributions(fit)The sections below walk through a complete analysis: from defining a study scenario, through simulation, data preparation, Bayesian model fitting, and visualization.
The simulation is anchored to a calendar period. You can optionally
supply external surveillance data (a dataframe with date
and cases columns) to drive time-varying community
importation. The case counts are automatically normalized and
interpolated to daily resolution.
study_start <- "2024-07-01"
study_end <- "2025-06-30"
# Weekly surveillance data with a mid-study epidemic peak
dates_weekly <- seq(as.Date(study_start), as.Date(study_end), by = "week")
surveillance_data <- data.frame(
date = dates_weekly,
cases = 0.1 + 100 * exp(-0.0002 * (as.numeric(dates_weekly - mean(dates_weekly)))^2) +
abs(rnorm(length(dates_weekly), 0, 10))
)If no surveillance data is provided, you can supply custom
seasonal_forcing_list vectors or leave it flat (constant
community force).
By default, all household members contact each other equally. A
role_mixing_matrix lets you specify differential
contact weights between roles. Each cell represents the
relative contact intensity between a source (row) and target
(column):
role_mixing_weights <- matrix(c(
# Target: Infant Toddler Adult Elderly
0.0, 0.5, 1.0, 0.5, # Source: Infant (high contact with adults)
0.5, 0.9, 0.7, 0.5, # Source: Toddler (high peer contact)
1.0, 0.7, 0.6, 0.7, # Source: Adult
0.5, 0.5, 0.7, 0.0 # Source: Elderly (limited infant contact)
), nrow = 4, byrow = TRUE,
dimnames = list(
c("infant", "toddler", "adult", "elderly"),
c("infant", "toddler", "adult", "elderly")))This matrix is expanded internally to an N x N individual-level
contact matrix for each household, based on the role of each member. For
example, in a household with roles
c("adult", "adult", "infant"), the adult-to-infant weight
(1.0) is applied to those pairs.
Each simulated household is assembled randomly according to a
demographic profile. The household_profile_list controls
the probability of different compositions:
household_profile <- list(
prob_adults = c(0, 0, 1), # 0% chance of 0 or one adult, 100% chance of two
prob_infant = 1.0, # 100% chance of one infant
prob_siblings = c(0, 0.8, 0.2), # 80% chance of one toddler, 20% chance of two
prob_elderly = c(0.7, 0.1, 0.2) # 70% no elderly, 10% one, 20% two
)| Field | Values | Meaning |
|---|---|---|
prob_adults |
c(P0, P1, P2) |
Probability of 0, 1, or 2 adults |
prob_infant |
0 to 1 |
Probability of an infant being present |
prob_siblings |
c(P0, P1, P2) |
Probability of 0, 1, or 2 toddler siblings |
prob_elderly |
c(P0, P1, P2) |
Probability of 0, 1, or 2 elderly members |
Interventions (vaccination, masking, prophylaxis, etc.) are specified
as a list via covariates_config. Each entry defines an
intervention with a name, what it affects, its efficacy, and
role-specific coverage probabilities:
sim_config <- list(
list(
name = "vacc_status",
efficacy = 0.5, # 50% reduction
effect_on = "both", # "susceptibility", "infectivity", or "both"
coverage = list(
infant = 0.00,
toddler = 0.30,
adult = 0.80,
elderly = 0.90
)
)
)How efficacy works: For each person, a Bernoulli
draw (based on coverage) determines if they receive the
intervention. If they do, their susceptibility and/or infectivity is
multiplied by (1 - efficacy). Multiple interventions stack
multiplicatively.
Setting efficacy = 0 (as in the baseline example)
creates the covariate column without any actual effect — useful for
testing the pipeline.
sim_res <- simulate_multiple_households_comm(
n_households = 50,
viral_testing = "viral load", # or "Ct"
infectious_shape = 10,
infectious_scale = 1, # Mean infectious period = 10 days
waning_shape = 6,
waning_scale = 10, # Mean immunity = 60 days
surveillance_interval = 4, # Test every 4 days
start_date = study_start,
end_date = study_end,
surveillance_df = surveillance_data,
covariates_config = sim_config,
household_profile_list = household_profile,
role_mixing_matrix = role_mixing_weights,
seed = 123
)The output is a list with two dataframes:
| Output | Description |
|---|---|
sim_res$hh_df |
One row per person per infection episode: household ID, person ID, role, infection time, recovery times, and covariate assignments. |
sim_res$diagnostic_df |
Simulated test results at each surveillance time point: viral load or Ct value, binary test result, and episode ID. |
rates <- summarize_attack_rates(sim_res)
rates$primary_overall # Overall primary attack rate
rates$primary_by_role # Attack rate by age group
rates$reinf_overall # Overall reinfection summary
rates$reinf_by_role # Reinfections by age groupOverlays simulated infections (stacked bars by age group) with the surveillance line:
plot_epidemic_curve(sim_res, surveillance_data,
start_date_str = study_start, bin_width = 7)This step transforms the diagnostic dataframe into the structured list that Stan expects. It handles column renaming, infection window imputation, viral load gap-filling, covariate matrix construction, and prior specification.
Important: Covariates from the simulation live in
hh_df, but Stan needs them joined to
diagnostic_df:
# Extract 1-row-per-person covariate table
person_covariates <- sim_res$hh_df %>%
dplyr::select(hh_id, person_id, vacc_status) %>%
dplyr::distinct()
# Join to diagnostic data
df_for_stan <- sim_res$diagnostic_df %>%
dplyr::left_join(person_covariates, by = c("hh_id", "person_id"))Define priors for the Bayesian model. Supported
distributions are "normal", "uniform", and
"lognormal":
my_priors <- list(
beta1 = list(dist = "normal", params = c(-5, 1)),
beta2 = list(dist = "normal", params = c(-5, 1)),
alpha = list(dist = "normal", params = c(-7, 1)),
phi_role = list(dist = "normal", params = c(0, 1)),
kappa_role = list(dist = "normal", params = c(0, 1)),
vl_midpoint = list(dist = "normal", params = c(4, 1.0)), # adjust if Ct values
vl_slope = list(dist = "normal", params = c(1.0, 0.5)) # adjust if Ct values
)Supply viral load imputation parameters (used to fill gaps in observed viral data):
VL_params_list <- list(
adult = list(v_p = 4.14, t_p = 5.09, lambda_g = 2.31, lambda_d = 2.71),
infant = list(v_p = 5.84, t_p = 4.09, lambda_g = 2.82, lambda_d = 1.01),
toddler = list(v_p = 5.84, t_p = 4.09, lambda_g = 2.82, lambda_d = 1.01),
elderly = list(v_p = 2.95, t_p = 5.10, lambda_g = 3.15, lambda_d = 0.87)
)Build the Stan input:
stan_input <- prepare_stan_data(
df_clean = df_for_stan,
surveillance_df = surveillance_data,
study_start_date = as.Date(study_start),
study_end_date = as.Date(study_end),
use_vl_data = TRUE,
use_curve_logic = FALSE,
delta = 0,
imputation_params = VL_params_list,
priors = my_priors,
role_mixing_matrix = role_mixing_weights)fit_household_model() runs Hamiltonian Monte Carlo (HMC)
sampling via rstan::sampling(). You can exclude internal
bookkeeping parameters to keep the output clean:
options(mc.cores = parallel::detectCores())
fit <- fit_household_model(
stan_input,
pars = c("log_phi_by_role_raw",
"log_kappa_by_role_raw",
"log_beta1",
"log_beta2",
"log_alpha_comm",
"g_curve_est",
"V_term_calc"),
include = FALSE,
iter = 2000,
warmup = 1000,
chains = 4,
seed = 123,
init_fun = NULL,
core = 4
)
print(fit, probs = c(0.025, 0.5, 0.975))# Posterior distributions of susceptibility (phi) and infectivity (kappa) by role
plot_posterior_distributions(fit)
# Reconstruct who-infected-whom transmission chains
chains <- reconstruct_transmission_chains(fit, stan_input, min_prob_threshold = 0.001)
# Plot a specific household's infection timeline with transmission arrows
plot_household_timeline(chains, stan_input, target_hh_id = 11)| Parameter | Default | Description |
|---|---|---|
beta1 |
0.008 |
Baseline within-household transmission rate (contact-driven). |
beta2 |
0.008 |
Viral-load-dependent transmission rate. Total per-infector force =
beta1 + beta2 * f(VL). |
delta |
0 |
Household size scaling. Force scaled by
(1/max(n_h,1))^delta in larger households. |
alpha_comm_by_role |
5e-4 |
Daily community acquisition rate. |
phi_by_role |
c(adult=1, infant=3, toddler=4, elderly=1) |
Susceptibility multipliers by role. |
kappa_by_role |
c(adult=1, infant=1, toddler=1.2, elderly=1) |
Infectivity multipliers by role. |
| Parameter | Default | Description |
|---|---|---|
infectious_shape |
3 |
Gamma shape for infectious duration. |
infectious_scale |
1 |
Gamma scale for infectious duration. Mean = shape x scale. |
waning_shape |
16 |
Gamma shape for immunity waning. |
waning_scale |
10 |
Gamma scale for immunity waning. Mean = 160 days. |
max_infections |
Inf |
Max infections per person. Set to 1 to disable
reinfection. |
| Parameter | Default | Description |
|---|---|---|
viral_testing |
"viral load" |
"viral load" (log10 scale) or "Ct" (cycle
threshold). |
model_type |
"empirical" |
"empirical" (parametric curves) or "ODE"
(within-host ODE). |
vl_midpoint |
3.0 or 40 |
Reference log10 VL for infectiousness scaling or Ct at 50% infectiousness (sigmoid) |
vl_slope |
2.5 or 2 |
Power-law exponent for log10 VL scaling. or Steepness of Ct-infectiousness sigmoid. |
| Parameter | Default | Description |
|---|---|---|
surveillance_interval |
1 |
Days between routine tests. |
test_daily |
FALSE |
Switch to daily testing after first household detection. |
perfect_detection |
TRUE |
If FALSE, detection depends on viral load
thresholds. |
detect_threshold_log10 |
1e-6 |
Min log10 VL for positive test. |
detect_threshold_Ct |
40 |
Max Ct for positive test (set 35-40 for
realistic PCR). |
| Parameter | Default | Description |
|---|---|---|
use_vl_data |
TRUE |
Use actual viral load data vs. generation interval curve. |
use_curve_logic |
TRUE |
When use_vl_data = FALSE, estimate generation interval
curve vs. constant infectivity. |
delta |
0 |
Household size scaling. Force scaled by
(1/max(n_h,1))^delta. |
vl_type = 1)| Parameter | Default | Description |
|---|---|---|
vl_midpoint |
3.0 |
Reference log10 VL for infectiousness scaling. |
vl_slope |
2.5 |
Power-law exponent for VL scaling:
(VL/vl_midpoint)^vl_slope. |
vl_type = 0)| Parameter | Default | Description |
|---|---|---|
vl_midpoint |
35.0 |
Ct value at 50% infectiousness. |
vl_slope |
2.0 |
Steepness of Ct-infectiousness sigmoid. |
| Function | Description |
|---|---|
simulate_multiple_households_comm() |
Simulate multi-household outbreaks |
prepare_stan_data() |
Format data for Stan model |
fit_household_model() |
Fit the Bayesian model via HMC |
summarize_attack_rates() |
Attack rates and reinfection summaries |
reconstruct_transmission_chains() |
Posterior who-infected-whom probabilities |
fill_missing_viral_data() |
Impute missing Ct / viral load |
plot_posterior_distributions() |
Violin plots of phi and kappa |
plot_covariate_effects() |
Forest plot of intervention effects |
plot_epidemic_curve() |
Epidemic curve overlaid with surveillance |
plot_household_timeline() |
Per-household timeline with transmission arrows |
prepare_stan_data() |
Format data for Stan with conditional parameter setup |
The daily probability of infection for susceptible individual \(i\) in household \(h\) at time \(t\) is:
\[P(\text{infection}_{ih}(t)) = 1 - \exp\left(-\lambda_{ih}(t)\right)\]
where the total force of infection combines community and household sources:
\[\lambda_{ih}(t) = \phi_{r_i} \cdot \exp(\boldsymbol{\beta}_{\text{susc}} \cdot \mathbf{X}_{\text{susc},i}) \cdot \left( \alpha_{\text{comm}} \cdot S(t) + (\frac{1}{\max(1,n_h)})^{\delta} \cdot \sum_{j \in h, j \ne i} C_{ij} \cdot \kappa_{r_j} \cdot \exp(\boldsymbol{\beta}_{\text{inf}} \cdot \mathbf{X}_{\text{inf},j}) \cdot \left(\beta_1 + \beta_2 \cdot f(VL_j(t))\right) \right)\]
| Parameter | Description |
|---|---|
| \(\phi_{r_i}\) | Role-specific susceptibility multiplier |
| \(\kappa_{r_j}\) | Role-specific infectivity multiplier |
| \(\alpha_{\text{comm}}\) | Baseline Community risk |
| \(S(t)\) | Seasonal forcing term |
| \(\delta\) | Household size scaling |
| \(C_{ij}\) | Contact weight between persons \(i\) and \(j\) |
| \(f(VL)\) | Viral load scaling function (power-law or sigmoid) |
| \(\boldsymbol{\beta}_{\text{susc}}\) | Log-linear coefficients for susceptibility covariates |
| \(\boldsymbol{\beta}_{\text{inf}}\) | Log-linear coefficients for infectivity covariates |
| \(\mathbf{X}_{\text{susc},i}\) | Covariate vector for person \(i\) (susceptibility) |
| \(\mathbf{X}_{\text{inf},j}\) | Covariate vector for person \(j\) (infectivity) |
Individuals progress through S -> I -> R ->
S with Gamma-distributed duration. Inference is performed via
HMC in Stan (adapt_delta = 0.95,
max_treedepth = 15).
The simulation applies fixed efficacies during data generation, but the Stan model estimates log-linear coefficients to recover those effects:
Simulation time:
# Vaccination reduces susceptibility by 50%
if (vaccinated) susceptibility *= 0.5Stan estimation:
# Stan estimates beta_susc ≈ -0.693, because exp(-0.693) ≈ 0.5
susceptibility = phi_role * exp(beta_susc * vacc_status)Interpretation: - β = 0: No effect
(exp(0) = 1)
- β = -0.693: 50% reduction
(exp(-0.693) ≈ 0.5) - β = +0.693: 2× increase
(exp(0.693) ≈ 2.0) - Multiple covariates: Effects multiply
on natural scale, add on log scale
The model supports four scenarios for modeling
infectiousness, controlled by use_vl_data and
vl_type:
Scenario 1: Log10 Viral Load Data
(use_vl_data = TRUE, vl_type = 1)
v_component = (max(0, VL) / vl_midpoint)^vl_slopeEstimated parameters: vl_midpoint,
vl_slope
Scenario 2: Ct Value Data
(use_vl_data = TRUE, vl_type = 0)
v_component = 1 / (1 + exp((vl_midpoint-Ct) / vl_slope))Estimated parameters: vl_midpoint,
vl_slope
Scenario 3: Generation Interval Curve
(use_vl_data = FALSE, use_curve_logic = TRUE)
v_component = g_curve[days_since_infection]Where g_curve is a normalized Gamma distribution.
Estimated parameters: gen_shape,
gen_rate
Scenario 4: Constant Infectiousness
(use_vl_data = FALSE, use_curve_logic = FALSE)
v_component = 1.0 # No viral dynamicsNo additional parameters estimated.
The final infectiousness is always:
beta1 + beta2 * v_component
The Stan model jointly estimates a core set of parameters, plus
optional parameters that activate based on the data you supply to
prepare_stan_data().
The Stan model uses conditional parameter estimation - different parameters are estimated based on your data configuration:
| Parameter | Description |
|---|---|
beta1, beta2 |
Baseline and viral-dependent transmission rates |
alpha_comm |
Community importation rate |
phi_role |
Role-specific susceptibility multipliers |
kappa_role |
Role-specific infectivity multipliers |
vl_midpoint, vl_slope |
Log10 viral load scaling or Ct value scaling |
| Parameter | Description | When Estimated |
|---|---|---|
gen_shape, gen_rate |
Generation interval curve | use_vl_data = FALSE AND
use_curve_logic = TRUE |
beta_susc |
Susceptibility covariate effects | covariates_susceptibility provided |
beta_inf |
Infectivity covariate effects | covariates_infectivity provided |
All parameters support flexible priors ("normal",
"uniform", "lognormal"):
my_priors <- list(
phi_role = list(dist = "normal", params = c(0, 0.5)), # Tighter susceptibility prior
kappa_role = list(dist = "lognormal", params = c(0, 0.3)), # LogNormal infectivity prior
)# Baseline: no covariates → estimates core params + VL curve only
stan_input <- prepare_stan_data(
...,
use_vl_data = 1,
covariates_susceptibility = NULL,
covariates_infectivity = NULL
)
# With covariates → also estimates vaccine/masking effects
stan_input <- prepare_stan_data(
...,
use_vl_data = 1,
covariates_susceptibility = c("vacc_status"),
covariates_infectivity = c("vacc_status", "masked")
)When K_susc = 0 or K_inf = 0, the
corresponding coefficient vectors do not enter the model. This means the
same Stan model adapts automatically — no need to switch between
different model files.
This work was supported by a grant from the National Institutes of Health (R01AI137093). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
If you use HHBayes in your work, please cite:
Ke Li & Yiren Hou (2026). HHBayes: Bayesian Household Transmission Modeling in R. GitHub: https://github.com/keli5734/HHBayes