The Canadian Health Measures Survey (CHMS) is a repeated cross-sectional survey conducted by Statistics Canada since 2007. Each cycle collects questionnaire, physical measurement, and laboratory data from a nationally representative sample of Canadians. Combining data across cycles enables trend analyses and increases statistical power, but the raw data is not directly comparable across cycles for several reasons.
Variable names sometimes change between cycles. For example, the accelerometer variable for moderate-to-vigorous physical activity on day 1 is named amsdmva1 in cycle 1, but ammdmva1 in cycles 2–6. A researcher pooling cycles must know about every such rename, or risk silent misalignment.
chmsflow handles these renames through metadata in variable-details.csv. The variableStart column uses a mixed format to specify cycle-specific exceptions:
cycle1::amsdmva1, [ammdmva1]
This means: use amsdmva1 for cycle 1, and ammdmva1 for all other cycles. The recodeflow package reads this format and applies the correct mapping automatically. For the full list of variable naming patterns, see inst/metadata/schemas/chms/chms_database_config.yaml in the package source.
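To make the mixed format concrete, here is a minimal sketch of how such a string could be resolved for a given cycle. The helper `resolve_variable_start()` is hypothetical and written only for illustration; it is not part of chmsflow or recodeflow, and it only covers the cycle-exception-plus-default pattern shown above.

```r
# Hypothetical helper (not part of chmsflow or recodeflow): pick the source
# variable name for one cycle from a mixed-format variableStart string.
resolve_variable_start <- function(variable_start, cycle) {
  # Split the entry on commas and trim surrounding whitespace
  parts <- trimws(strsplit(variable_start, ",", fixed = TRUE)[[1]])

  # Cycle-specific exceptions look like "cycle1::amsdmva1"
  is_specific <- grepl("::", parts, fixed = TRUE)
  for (entry in strsplit(parts[is_specific], "::", fixed = TRUE)) {
    if (identical(entry[1], cycle)) {
      return(entry[2])
    }
  }

  # Otherwise fall back to the default name given in square brackets
  default <- parts[grepl("^\\[.*\\]$", parts)]
  if (length(default) == 1) {
    return(gsub("^\\[|\\]$", "", default))
  }
  NA_character_
}

resolve_variable_start("cycle1::amsdmva1, [ammdmva1]", "cycle1")  # "amsdmva1"
resolve_variable_start("cycle1::amsdmva1, [ammdmva1]", "cycle3")  # "ammdmva1"
```

In practice the lookup is done by the package from the metadata, so a helper like this is only useful for understanding or auditing individual entries.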
Even when variable names are stable, the coding categories may differ. CHMS response codes for missing data, valid skips, and refusals vary across variables and cycles. chmsflow’s recoding rules in variable-details.csv define explicit mappings for each response code, ensuring consistent treatment across cycles.
Cycles 1–2 store medication data in a wide format: up to 80 columns of ATC code and time-last-taken pairs per respondent. Cycles 3–6 use a long format: one row per medication per respondent with two columns (meucatc and npi_25b). chmsflow provides separate functions for each format (recode_meds_cycles1to2() and recode_meds_cycles3to6()) with identical call signatures, so the downstream workflow is the same regardless of cycle. See Recoding medications for details.
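The difference between the two layouts can be illustrated with a toy example. The sketch below reshapes a small wide table into the long layout using base R. The wide column names (`atc_1`, `mhr_1`, and so on), the ID column, and the helper `wide_to_long()` are assumptions made for illustration; this is not how chmsflow's medication functions are implemented.

```r
# Toy wide-format data: one row per respondent, one ATC/time pair per slot.
# Column names here are hypothetical, not the actual CHMS names.
wide <- data.frame(
  clinicid = c(1, 2),
  atc_1 = c("C09AA05", "A10BA02"),
  mhr_1 = c(1, 2),
  atc_2 = c("C07AB02", NA),
  mhr_2 = c(3, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack each medication slot into one row per medication
wide_to_long <- function(data, id, n_slots) {
  slots <- lapply(seq_len(n_slots), function(i) {
    data.frame(
      id = data[[id]],
      meucatc = data[[paste0("atc_", i)]],
      npi_25b = data[[paste0("mhr_", i)]],
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, slots)
  long <- long[!is.na(long$meucatc), ]  # drop unused medication slots
  long <- long[order(long$id), ]        # group rows by respondent
  rownames(long) <- NULL
  long
}

long <- wide_to_long(wide, id = "clinicid", n_slots = 2)
long
```

Respondent 1 ends up with two rows and respondent 2 with one, which is the shape cycles 3–6 deliver directly.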
chmsflow builds on the recodeflow package, which separates recoding rules from recoding logic. The rules live in two CSV metadata files:
- variables.csv – lists every harmonized variable with its name, label, section, subject, type, and unit
- variable-details.csv – defines how each raw CHMS value maps to a harmonized value, row by row

The recoding logic lives in recodeflow::rec_with_table(), which reads the metadata and applies the mappings. This separation means that adding or correcting a recoding rule is a CSV edit, not a code change.
For detailed schema documentation, see Variable schema reference.
variableStart column

The variableStart column in variable-details.csv tells rec_with_table() where to find the source data. It supports several formats:
| Format | Meaning | Example |
|---|---|---|
| [variable_name] | Same name across all cycles | [clc_age] |
| cycle1::name1, [default_name] | Cycle-specific exception with a default | cycle1::amsdmva1, [ammdmva1] |
| DerivedVar::[var1, var2, ...] | Computed by a function from listed inputs | DerivedVar::[lab_bcre, pgdcgt, clc_sex, clc_age] |
| Func::function_name | The R function that computes the derived variable (in recTo) | Func::calculate_gfr |
recStart and recEnd columns

These columns define the mapping from source values to harmonized values:
- recStart = "copy" passes the value unchanged
- recStart and recEnd define source ranges (e.g., [1, 3]), and the row’s recTo column gives the harmonized value
- recTo = "NA::a" or recTo = "NA::b" maps specific source codes to tagged NA values

CHMS uses numeric codes for missing data (e.g., 996 for valid skip, 997–999 for don’t know / refusal / not stated). chmsflow converts these to haven::tagged_na() values that preserve the reason for missingness:
- tagged_na("a") – valid skip (not applicable to this respondent)
- tagged_na("b") – missing (don’t know, refusal, or not stated)

This distinction matters for analysis. For example, a respondent who was never asked about smoking (valid skip) should be treated differently from one who refused to answer (missing). See Missing data (tagged_na) for a full explanation.
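The conversion can be sketched by hand on a small vector. The example below assumes the haven package is installed and uses made-up raw values with the 996 and 997–999 codes mentioned above; chmsflow applies the equivalent mapping from the metadata rather than with code like this.

```r
library(haven)

# Hypothetical raw values: 996 = valid skip, 997-999 = don't know / refusal / not stated
raw <- c(12, 996, 997, 30, 999)

harmonized <- raw
harmonized[raw == 996] <- tagged_na("a")        # valid skip
harmonized[raw %in% 997:999] <- tagged_na("b")  # missing

is.na(harmonized)               # both tags behave as ordinary NA
na_tag(harmonized)              # recovers the reason: NA "a" "b" NA "b"
is_tagged_na(harmonized, "a")   # TRUE only for valid skips
mean(harmonized, na.rm = TRUE)  # 21, computed from the two observed values
```

Because tagged values still count as NA, standard functions ignore them with na.rm = TRUE, while na_tag() lets an analyst separate valid skips from true non-response when that matters.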
Some harmonized variables cannot be created by simple value mapping. These are computed by R functions referenced in variable-details.csv with the Func:: prefix. Examples include:
- calculate_gfr() – estimated glomerular filtration rate from creatinine, ethnicity, sex, and age
- derive_hypertension() – hypertension status from adjusted blood pressure, medication use, and comorbidities
- calculate_pack_years() – smoking pack-years from smoking history variables

The DerivedVar:: prefix in variableStart lists the input variables that must be present in the data before the function can run. See Derived variables for details.
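Because a derived variable can only be computed when all of its inputs are present, it can be useful to check this before recoding. The following is a minimal sketch under that assumption; `derived_inputs()` and `missing_inputs()` are hypothetical helpers, not functions provided by chmsflow or recodeflow.

```r
# Hypothetical helper: extract the input names from a DerivedVar:: entry
derived_inputs <- function(variable_start) {
  inner <- sub("^DerivedVar::\\[(.*)\\]$", "\\1", trimws(variable_start))
  trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
}

# Hypothetical helper: report which required inputs are absent from the data
missing_inputs <- function(data, variable_start) {
  setdiff(derived_inputs(variable_start), names(data))
}

# Toy data frame that lacks pgdcgt
toy <- data.frame(lab_bcre = 80, clc_sex = 1, clc_age = 45)
missing_inputs(toy, "DerivedVar::[lab_bcre, pgdcgt, clc_sex, clc_age]")  # "pgdcgt"
```

An empty result means every listed input is available in the data.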
- Some source variable names differ between cycles and are resolved through cycle-specific variableStart entries. These are documented in the metadata but can be surprising if you inspect the raw CHMS data directly.
- Researchers should review variable-details.csv to confirm that the recoding rules are appropriate for their specific research question. The mappings follow clinical and epidemiological conventions but may not suit every analytic purpose.
- Learn how tagged_na() preserves the reason for missingness in Missing data (tagged_na).