| Title: | Cell Key Perturbation |
|---|---|
| Description: | Provides functions to generate frequency tables and apply cell key perturbation to protect against statistical disclosure in tabular outputs. The implemented methods are described in "Cell Key Perturbation User Guide" <https://github.com/ONSdigital/cell-key-perturbation-R/blob/main/documentation/SML_UserDoc_CKP_R.md>. Developed at the UK Office for National Statistics. |
| Authors: | Iain Dove [aut, cph], Ahmet Aydin [aut, cre] |
| Maintainer: | Ahmet Aydin <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 3.0.0 |
| Built: | 2026-05-29 08:11:37 UTC |
| Source: | https://github.com/onsdigital/cell-key-perturbation-r |
Check perturbed table for missingness in tabulation variables
check_for_na(DT, cols)
| Argument | Description |
|---|---|
| DT | – |
| cols | – |
A warning message if any tabulation variable contains missing values.
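A minimal base-R sketch of this kind of check, assuming the function simply warns when any listed column holds `NA`. The helper name `warn_if_na` and the sample table are hypothetical and are not the package's implementation.

```r
# Hypothetical helper: warn when any tabulation column contains NA.
warn_if_na <- function(tab, cols) {
  # Count missing values in each requested column.
  na_counts <- vapply(cols, function(col) sum(is.na(tab[[col]])), integer(1))
  bad <- names(na_counts)[na_counts > 0]
  if (length(bad) > 0) {
    warning("Missing values found in: ", paste(bad, collapse = ", "))
  }
  # Return the offending column names invisibly for further inspection.
  invisible(bad)
}

# Made-up table: var1 has one missing value, var5 has none.
tab <- data.frame(var1 = c(1, 2, NA), var5 = c("a", "b", "c"))
flagged <- suppressWarnings(warn_if_na(tab, c("var1", "var5")))
```

Returning the column names lets a caller decide whether to drop, recode or keep the incomplete rows before tabulating.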
create_perturbed_table() creates a frequency table whose counts have had
cell key perturbation applied.
A p-table (ptable) must be supplied; it determines which cells are
perturbed and by how much.
The data must contain a 'record key' variable which, together with the
ptable, makes the process repeatable and consistent.
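The role of the record key can be illustrated with a small base-R sketch. It assumes the cell key is the sum of the record keys in a cell taken modulo 256 (for keys in 0-255); the modulus, the helper `cell_key` and the toy data are assumptions made for illustration only.

```r
# Toy microdata: one row per unit, with a record key in 0-255.
micro_toy <- data.frame(
  var1 = c(1, 1, 1, 2, 2),
  record_key = c(200L, 100L, 17L, 250L, 9L)
)

# Assumed rule: cell key = sum of the record keys in the cell, modulo 256.
cell_key <- function(keys, modulus = 256L) sum(keys) %% modulus

ckeys <- tapply(micro_toy$record_key, micro_toy$var1, cell_key)

# Shuffling the rows leaves every cell key unchanged, so the same records
# always lead to the same lookup in the ptable.
shuffled <- micro_toy[c(5, 3, 1, 4, 2), ]
ckeys_shuffled <- tapply(shuffled$record_key, shuffled$var1, cell_key)
```

Because the key depends only on which records fall in a cell, the same cell receives the same perturbation whenever it appears, which is what makes the output repeatable.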
create_perturbed_table( data, ptable, geog, tab_vars, record_key, use_existing_ons_id = TRUE, threshold = 10 )
| Argument | Description |
|---|---|
| data | A The data should contain one row per statistical unit (person, household, business or other) and one column per variable (age, sex, health status) |
| ptable | A |
| geog | A |
| tab_vars | A |
| record_key | A |
| use_existing_ons_id | A |
| threshold | An |
Returns a data.table giving a frequency table which has had
cell key perturbation applied according to the ptable supplied.
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(1)
}
geog <- "var1"
tab_vars <- c("var5", "var8")
record_key <- "record_key"
perturbed_table <- create_perturbed_table(micro, ptable_10_5, geog, tab_vars, record_key)
# Alternatively
perturbed_table <- create_perturbed_table(data = micro, ptable = ptable_10_5, geog = c(), tab_vars = c("var1", "var5", "var8"), record_key = "record_key", threshold = 10)
This function runs the perturbation method fully in BigQuery (via SQL) and only downloads the result, which allows handling large datasets efficiently.
create_perturbed_table_bigquery( con, data, ptable, geog, tab_vars, record_key, use_existing_ons_id = TRUE, threshold = 10, return_query = FALSE )
| Argument | Description |
|---|---|
| con | – |
| data | – |
| ptable | – |
| geog | – |
| tab_vars | – |
| record_key | – |
| use_existing_ons_id | – |
| threshold | – |
| return_query | – |
Function workflow:
1. Generate the BigQuery SQL query that runs the perturbation.
2. If return_query = TRUE, return the query text and exit; otherwise, continue with the remaining steps.
3. Validate inputs using BigQuery.
4. Run the perturbation using BigQuery.
5. Convert the perturbed table to a data.table and sort it.
The query built by this function does the following when executed:
- Computes counts and cell keys for each unique combination of geographic and tabulation variables.
- Includes zero-count cells by generating the full Cartesian product of variable combinations.
- Calculates the pseudo cell value (pcv) so that ptable rows 501-750 are reused for cell values above 750.
- Applies perturbation values from the perturbation table, looked up by cell key and pcv.
- Suppresses cells below the specified threshold by setting their perturbed count to NULL.
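The steps above can be sketched locally in base R on toy data. This is an illustration under stated assumptions, not the SQL the function generates: the pcv formula `(count - 1) %% 250 + 501`, the modulo-256 cell key, the comparison of the threshold against the perturbed count, and every name below (`pseudo_cell_value`, `micro_toy`, `ptable_toy`) are assumptions.

```r
# Assumed pcv rule: counts above 750 cycle through ptable rows 501-750.
pseudo_cell_value <- function(count) {
  ifelse(count > 750, (count - 1) %% 250 + 501, count)
}

micro_toy <- data.frame(
  region = c("N", "N", "N", "S", "S"),
  sex = c("F", "F", "M", "F", "F"),
  record_key = c(10L, 250L, 3L, 77L, 200L)
)
micro_toy$n <- 1L

# Counts and summed record keys per observed combination.
cells <- aggregate(cbind(n, record_key) ~ region + sex, data = micro_toy, FUN = sum)
names(cells) <- c("region", "sex", "count", "key_sum")
cells$ckey <- cells$key_sum %% 256

# Full Cartesian product, so zero-count cells appear too.
grid <- expand.grid(region = unique(micro_toy$region),
                    sex = unique(micro_toy$sex),
                    stringsAsFactors = FALSE)
cells <- merge(grid, cells, by = c("region", "sex"), all.x = TRUE)
cells$count[is.na(cells$count)] <- 0
cells$ckey[is.na(cells$ckey)] <- 0
cells$pcv <- pseudo_cell_value(cells$count)

# Toy ptable: +1 for even cell keys, -1 for odd ones (illustrative only).
ptable_toy <- expand.grid(pcv = 1:3, ckey = 0:255)
ptable_toy$pvalue <- ifelse(ptable_toy$ckey %% 2 == 0, 1L, -1L)

# Look up pvalue by (pcv, ckey); cells with no matching row keep pvalue 0.
idx <- match(paste(cells$pcv, cells$ckey), paste(ptable_toy$pcv, ptable_toy$ckey))
cells$pvalue <- ifelse(is.na(idx), 0L, ptable_toy$pvalue[idx])

# Perturb, then suppress cells below the threshold (NA plays the role of NULL).
threshold <- 2
cells$pcount <- cells$count + cells$pvalue
cells$pcount[cells$pcount < threshold] <- NA
```

The toy threshold of 2 is chosen only so that one cell survives suppression in such a small example.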
When return_query = FALSE: a data.table containing the perturbed
frequency table, sorted by geog and tab_vars.
When return_query = TRUE: a character string containing the query.
# --- Return query text without executing it ---
query <- create_perturbed_table_bigquery(
  con = NULL,
  data = "my-gcp-project.survey.microdata",
  ptable = "my-gcp-project.sdc.ptable",
  geog = c("Region"),
  tab_vars = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = "Record_Key",
  threshold = 10,
  return_query = TRUE
)
cat(query)
generate_ptable_10_5_rule() generates a sample p-table based on the 10-5 rule,
that is, a suppression threshold of 10 and rounding to the nearest 5.
generate_ptable_10_5_rule(max_pcv = 750, ckey_range = 255)
| Argument | Description |
|---|---|
| max_pcv | Max value for pcv. Default is 750. |
| ckey_range | The max range for cell keys. Default is 255. |
A data.table assigning a pvalue to each ckey and pcv combination
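The shape of such a table can be sketched with `expand.grid()`. This is only a stand-in for the grid of keys: it assumes pcv runs from 1 to `max_pcv` and ckey from 0 to `ckey_range`, and the `pvalue` column is a placeholder rather than the values the package generates.

```r
# Hypothetical stand-in for the key grid behind a p-table:
# every (pcv, ckey) pair appears exactly once.
max_pcv <- 750
ckey_range <- 255
grid <- expand.grid(pcv = seq_len(max_pcv), ckey = 0:ckey_range)

# Placeholder only; real pvalues come from the package.
grid$pvalue <- 0L

n_rows <- nrow(grid)
```

With the default arguments this gives 750 x 256 = 192000 rows.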
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(1)
}
ptable <- generate_ptable_10_5_rule()
generate_random_rkey() attaches randomly generated record keys to microdata
tables for testing purposes.
generate_random_rkey(data, rkey_range = 255, seed = NULL)
| Argument | Description |
|---|---|
| data | A data.table or data.frame containing the microdata |
| rkey_range | The max range for record keys. Default is 255. |
| seed | A seed for the random number generator |
A data.table with a new integer column record_key
library(data.table)
data <- data.table(id = 1:1000)
data <- generate_random_rkey(data, rkey_range = 255, seed = 2005)
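For intuition, the same idea can be sketched in base R without the package. The helper `add_random_rkey` is hypothetical, and it assumes keys are drawn uniformly from 0 to `rkey_range` with replacement; the package's own sampling scheme may differ.

```r
# Hypothetical stand-in: attach random record keys to a data frame.
add_random_rkey <- function(df, rkey_range = 255, seed = NULL) {
  # Fixing the seed makes the draw reproducible across runs.
  if (!is.null(seed)) set.seed(seed)
  # One key per row, drawn from 0..rkey_range with replacement.
  df$record_key <- sample(0:rkey_range, nrow(df), replace = TRUE)
  df
}

a <- add_random_rkey(data.frame(id = 1:1000), seed = 2005)
b <- add_random_rkey(data.frame(id = 1:1000), seed = 2005)
```

Two calls with the same seed give identical keys, which is useful when test output has to be compared between runs.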
This function creates a new record key column by taking the modulo 4096 of
the ons_id column. It converts ons_id to numeric, preserving NA for
non-numeric values, and assigns the result as an integer.
generate_record_key_from_ons_id(data, record_key_col)
| Argument | Description |
|---|---|
| data | A |
| record_key_col | A character string specifying the name of the new record key column to create. |
The function checks that data is a data.table.
Non-numeric values in ons_id are converted to NA.
The record key is computed as ons_id %% 4096 and stored as integer.
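The rule above can be tried on a plain vector. The helper `record_key_from_id` and the sample identifiers are made up for illustration; the sketch works on a vector rather than a data.table and is not the package's code.

```r
# Sketch of the described rule applied to a vector of identifiers.
record_key_from_id <- function(ons_id) {
  # Non-numeric strings become NA; the coercion warning is silenced.
  id_num <- suppressWarnings(as.numeric(ons_id))
  # Modulo 4096, stored as integer; NA stays NA.
  as.integer(id_num %% 4096)
}

keys <- record_key_from_id(c("4096", "4097", "123456789", "abc", NA))
```

Here "4096" maps to 0 and "4097" to 1, while "abc" and the missing value both give `NA`.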
A data.table with the new record key column added.
generate_test_data() creates sample microdata containing randomly
generated columns and record keys for testing purposes.
Note: You can set a seed for the random number generator to obtain the same output across runs. However, the sample microdata included in the package differs from this output, because it was generated with the corresponding Python package to keep test output consistent.
generate_test_data(size = 1000, rkey_range = 255, seed = NULL)
| Argument | Description |
|---|---|
| size | Number of rows in the sample microdata. Default is 1000. |
| rkey_range | The max range for record keys. Default is 255. |
| seed | A seed for the random number generator |
A data.table containing randomly generated microdata and record keys
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(1)
}
data <- generate_test_data(size = 1000)
data <- generate_test_data(size = 1000, rkey_range = 255, seed = 111)
A data set containing randomly generated data to showcase the cell key perturbation method.
data(micro)
A data.table containing 1000 observations of 11 variables
record_key. record key value (0-255)
var1. example variable 1 (1-5)
var2. example variable 2 (1,2)
var3. example variable 3 (1-4)
var4. example variable 4 (1-4)
var5. example variable 5 (1-10)
var6. example variable 6 (1-5)
var7. example variable 7 (1-5)
var8. example variable 8 (A-D)
var9. example variable 9 (A-H)
var10. example variable 10 (1-49)
A data set containing the rules to apply cell key perturbation with a threshold of 10, and rounding to base 5. In other words, counts less than 10 will be removed, and all others will be rounded to the nearest 5.
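The net effect described here can be written as a small function. This is a sketch of the stated rule only, assuming the threshold is compared against the original count; it does not reproduce the ptable lookup by cell key, and the name `apply_10_5` is hypothetical.

```r
# Assumed net effect of a 10-5 rule on counts:
# below the threshold -> removed (NA), otherwise rounded to the nearest 5.
apply_10_5 <- function(count, threshold = 10, base = 5) {
  ifelse(count < threshold, NA, base * round(count / base))
}

res <- apply_10_5(c(3, 9, 10, 12, 13, 748))
```

For these inputs the first two counts are removed and the rest become 10, 10, 15 and 750.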
data(ptable_10_5)
A data.table containing 192000 observations of 3 variables
pcv. perturbation cell value (1-750)
ckey. cell key value (0-255)
pvalue. perturbation value to be applied
Validates BigQuery inputs for a perturbation process.
- Validate input arguments
  - Check that at least one variable is specified for geog or tab_vars
  - Check that geog and tab_vars are either character vectors or NULL
  - Check that the specified record_key is a character vector or NULL
  - Check that threshold is a non-negative integer
- Validate that microdata and ptable contain the required columns
  - Check that data contains the specified geog, tab_vars and record_key
  - Check that ptable contains the required columns
- Validate the range of record keys and cell keys
- Validate that data has sufficient records with record keys to apply perturbation
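The argument checks in the first group above need no database connection, so they can be sketched locally. The helper `check_args` and its error messages are hypothetical; the package's own checks and wording may differ.

```r
# Hypothetical local checks mirroring the argument validations listed above.
check_args <- function(geog, tab_vars, record_key, threshold) {
  is_chr_or_null <- function(x) is.null(x) || is.character(x)
  if (length(geog) + length(tab_vars) == 0) {
    stop("Specify at least one variable in geog or tab_vars.")
  }
  if (!is_chr_or_null(geog) || !is_chr_or_null(tab_vars)) {
    stop("geog and tab_vars must be character vectors or NULL.")
  }
  if (!is_chr_or_null(record_key)) {
    stop("record_key must be a character vector or NULL.")
  }
  # threshold must be a single, non-missing, whole, non-negative number.
  if (!is.numeric(threshold) || length(threshold) != 1 || is.na(threshold) ||
      threshold < 0 || threshold != round(threshold)) {
    stop("threshold must be a non-negative integer.")
  }
  invisible(TRUE)
}

ok <- check_args("Region", c("AgeGroup"), "Record_Key", 10)
err <- tryCatch(
  check_args(NULL, NULL, "Record_Key", 10),
  error = function(e) conditionMessage(e)
)
```

Failing early on the arguments avoids sending a query that could never succeed.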
validate_inputs_bigquery( con, data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold )
| Argument | Description |
|---|---|
| con | – |
| data | – |
| ptable | – |
| geog | – |
| tab_vars | – |
| record_key | – |
| use_existing_ons_id | – |
| threshold | – |
Invisibly returns TRUE on success. Raises an error (via stop()) or issues a warning if any validation fails.