Cell Key Perturbation


This method produces a frequency table whose counts have had cell key perturbation applied to protect against disclosure.

Cell key perturbation adds small amounts of noise to frequency tables. Noise is added to change the counts that appear in the frequency table by small amounts, for example a 14 is changed to a 15. This noise introduces uncertainty in the counts and makes it harder to identify individuals, especially when taking the ‘difference’ between two similar tables. It protects against the risk of disclosure by differencing since it cannot be determined whether a difference between two similar tables represents a real person, or is caused by the perturbation.
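The differencing risk described above can be illustrated with a minimal sketch. The counts and noise values below are made up by hand for illustration; they are not produced by the package.

```r
# Hypothetical, hand-made counts: not produced by the package.
# Table A covers a whole region; table B covers the same region minus one street.
true_a <- c(male = 14, female = 21)
true_b <- c(male = 13, female = 21)

# Without protection, differencing the two tables exposes a single individual.
true_diff <- true_a - true_b
print(true_diff)

# Illustrative perturbed versions: each cell is shifted by a small amount
# (noise values chosen by hand for this example).
pert_a <- true_a + c(male = 1, female = -1)
pert_b <- true_b + c(male = -1, female = 0)

# The observed difference now mixes real change with noise, so it cannot be
# determined whether a real person accounts for it.
pert_diff <- pert_a - pert_b
print(pert_diff)
```

In this sketch the unperturbed difference for `male` is exactly 1, while the perturbed difference is 3, which no longer identifies a single person.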

Cell Key Perturbation is consistent and repeatable, so the same cells are always perturbed in the same way.
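One way this consistency can arise is sketched below. This is an assumption about the general cell key approach, not the package's own code: it assumes each record carries a fixed record key, that the cell key is the sum of the record keys in a cell modulo 256, and that the noise is looked up from a fixed table using the cell key. The functions `cell_key()` and `toy_noise()` are hypothetical and exist only for this illustration.

```r
# Hypothetical sketch, not the package's implementation.
# Assumption: cell key = (sum of record keys in the cell) mod 256.
cell_key <- function(record_keys, modulus = 256) {
  sum(record_keys) %% modulus
}

# Hypothetical lookup rule standing in for a real ptable: it maps a cell key
# to a small, fixed amount of noise.
toy_noise <- function(ckey) {
  c(-1, 0, 0, 1)[ckey %% 4 + 1]
}

# Record keys of the records that fall in one cell of a table.
keys_in_cell <- c(84, 108, 212, 86)

# The same records give the same cell key, and so the same noise, whichever
# table the cell appears in and however often the code is run.
ckey_first  <- cell_key(keys_in_cell)
ckey_second <- cell_key(rev(keys_in_cell))  # record order does not matter

noise_first  <- toy_noise(ckey_first)
noise_second <- toy_noise(ckey_second)
```

Because nothing in the lookup is random at run time, repeating the calculation on the same records always returns the same perturbation.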

It is expected that users will tabulate 1 to 4 variables for a particular geography level - for example, tabulate age by sex at local authority level. 

BigQuery

The BigQuery version allows users to perform perturbation without reading raw data into local memory. The package creates the frequency table and runs perturbation with an SQL query. Then, it converts the final perturbed table into a data.table as an output.

This allows users to run the method on large datasets without exceeding local memory limits.

Terminology

User Instructions

Installing the method

This method requires R version 3.5 or higher and uses the data.table package.

You can install the released version of cellkeyperturbation from CRAN:

install.packages("cellkeyperturbation")

In your code you can load the cell key perturbation package using:

library(cellkeyperturbation)

Using the method

You can call the main functions for cell key perturbation with the following parameters:

# for data.table
create_perturbed_table(data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)

# for BigQuery
create_perturbed_table_bigquery(con, data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)

Parameters specific for BigQuery version:

Parameters specific for data.table version:

Common parameters for both versions:

Worked Example with Synthetic Data (data.table)

This is an example showing how to create a perturbed table from synthetic test data provided in the package (micro and ptable_10_5). You can access and view these data tables after loading the package.

library(cellkeyperturbation)
View(micro)
View(ptable_10_5)

For testing purposes, you can also generate different sample data, or add random record keys to your own test data, with the following code:

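# Generate synthetic microdata and a matching ptable for testing
data <- generate_test_data(size = 1000, rkey_range = 255, seed = 123)
ptable <- generate_ptable_10_5_rule(ckey_range = 255)

# Or read your own test data and add random record keys to it
library(data.table)
data <- fread("input_microdata.csv")
data <- generate_random_rkey(data, rkey_range = 255, seed = 123)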

Example rows of a microdata table are shown below:

record_key  var1  var5  var8
84          2     9     D
108         1     9     C
212         1     1     D
212         2     2     A
86          2     4     A

Example rows of a ptable are shown below:

pcv  ckey  pvalue
1    0     -1
1    1     -1
1    2     -1
750  255   0

Use the following code to generate the perturbed table using the sample microdata and perturbation table provided:

perturbed_table <- create_perturbed_table(
  data       = micro,
  ptable     = ptable_10_5,
  geog       = c("var1"),
  tab_vars   = c("var5","var8"),
  record_key = "record_key",
  threshold  = 10
)

Interpreting the Output

The output is a data.table containing a frequency table whose counts have been perturbed as specified in the ptable.

For most ptables, the most obvious effect is that all counts below the threshold of 10 have been removed. Suppressing counts below the threshold is a condition that needs to be met when exporting data from IDS (Integrated Data Service) and many other secure environments such as SRS (Secure Research Service).

The perturbation code will treat categories for missing data in the same way as it treats other categories. If you would like to exclude missing data from your outputs, you will need to remove the missing data categories either before or after applying the perturbation.
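As a minimal sketch of removing missing data before perturbation, the example below filters a small hand-made table. It assumes missing values are stored as NA in `var8`; your data may use a different code for missing categories, in which case the filter condition would need to change. The table `toy_micro` is hypothetical.

```r
library(data.table)

# Hypothetical microdata: missing values of var8 are assumed to be stored
# as NA. Your data may use a different code for missing categories.
toy_micro <- data.table(
  record_key = c(84, 108, 212, 212, 86),
  var1       = c(2, 1, 1, 2, 2),
  var5       = c(9, 9, 1, 2, 4),
  var8       = c("D", NA, "D", "A", "A")
)

# Drop records with a missing category before applying the perturbation.
micro_complete <- toy_micro[!is.na(var8)]

# The same kind of filter can instead be applied to the perturbed table
# afterwards, because missing categories are treated like any other category.
```

The filtered table can then be passed as the data argument in place of the original microdata.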

The table will be in the following format:

var1  var5  var8  pre_sdc_count  ckey  pcv  pvalue  count
1     1     A     10             173   10   0       10
1     1     B     10             88    10   0       10
1     1     C     7              180   7    -7      nan
1     1     D     14             66    14   1       15
1     2     A     11             190   11   -1      10

The table contains the variables used to summarise the data (in this example var1, var5 and var8), and five other columns: pre_sdc_count, ckey, pcv, pvalue and count.

The columns you are most likely interested in are the variables, which are the categories you’ve summarised by, plus the count column.

WARNING! - The ckey, pcv, pre_sdc_count and pvalue columns should be dropped before the contingency table is published. Otherwise, the perturbation can be unpicked and the output will be disclosive.
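A minimal sketch of dropping these columns before publication is shown below. The table is a hand-built stand-in for the output of create_perturbed_table(), using the example rows from the format table, and the suppressed count is assumed to be represented as NA.

```r
library(data.table)

# Stand-in for the output of create_perturbed_table(), built by hand from
# the example rows. The suppressed count is assumed to be stored as NA.
perturbed_table <- data.table(
  var1          = c(1, 1, 1, 1, 1),
  var5          = c(1, 1, 1, 1, 2),
  var8          = c("A", "B", "C", "D", "A"),
  pre_sdc_count = c(10, 10, 7, 14, 11),
  ckey          = c(173, 88, 180, 66, 190),
  pcv           = c(10, 10, 7, 14, 11),
  pvalue        = c(0, 0, -7, 1, -1),
  count         = c(10, 10, NA, 15, 10)
)

# Columns that would allow the perturbation to be unpicked.
drop_cols <- c("ckey", "pcv", "pre_sdc_count", "pvalue")

# Keep only the tabulation variables and the perturbed count.
keep_cols <- setdiff(names(perturbed_table), drop_cols)
publishable <- perturbed_table[, keep_cols, with = FALSE]
```

Selecting the columns to keep into a new object leaves the full table untouched, which may be useful for internal checks inside the secure environment.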

Appendix - Help Pages

The package includes further help pages, such as the "Introduction to Cell Key Perturbation" vignette and documentation for each function. You can access these pages by selecting the cellkeyperturbation package name in the Packages tab of RStudio, or by using:

help(package=cellkeyperturbation)