Package 'cellkeyperturbation' reference manual

Title:	Cell Key Perturbation
Description:	Provides functions to generate frequency tables and apply cell key perturbation to protect against statistical disclosure in tabular outputs. The implemented methods are described in "Cell Key Perturbation User Guide" <https://github.com/ONSdigital/cell-key-perturbation-R/blob/main/documentation/SML_UserDoc_CKP_R.md>. Developed at the UK Office for National Statistics.
Authors:	Iain Dove [aut, cph], Ahmet Aydin [aut, cre]
Maintainer:	Ahmet Aydin <[email protected]>
License:	MIT + file LICENSE
Version:	3.0.0
Built:	2026-05-29 08:11:37 UTC
Source:	https://github.com/onsdigital/cell-key-perturbation-r

Check perturbed table for missingness in tabulation variables

Description

Check perturbed table for missingness in tabulation variables

Usage

check_for_na(DT, cols)

Arguments

DT

– data.table Perturbed frequency table

cols

– character vector Tabulation variables

Value

Warning message if any tabulation variable contains missing values
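
The kind of check described above can be sketched in base R. This is a minimal illustration only: warn_if_na() is a hypothetical helper, not the package's check_for_na() implementation, and the toy table is made up.

```r
# Hypothetical helper (not the package's implementation): warn when any of
# the listed tabulation columns contains missing values.
warn_if_na <- function(tab, cols) {
  has_na <- vapply(cols, function(col) anyNA(tab[[col]]), logical(1))
  if (any(has_na)) {
    warning("Missing values found in: ", paste(cols[has_na], collapse = ", "))
  }
  invisible(has_na)
}

# Toy perturbed table with one missing value in var1
tab <- data.frame(
  var1  = c(1, 2, NA),
  var5  = c("A", "B", "C"),
  count = c(10, 15, 20)
)

# Emits a warning that names var1; flags records which columns had NAs
flags <- warn_if_na(tab, c("var1", "var5"))
```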

Create a frequency table with cell key perturbation applied

Description

create_perturbed_table() creates a frequency table which has had cell key perturbation applied to the counts. A p-table file needs to be supplied which determines which cells are perturbed. The data needs to contain a 'record key' variable which along with the ptable allows the process to be repeatable and consistent.

Usage

create_perturbed_table(
  data,
  ptable,
  geog,
  tab_vars,
  record_key,
  use_existing_ons_id = TRUE,
  threshold = 10
)

Arguments

data

A data.table containing the data to be tabulated and perturbed

The data should contain one row per statistical unit (person, household, business or other) and one column per variable (age, sex, health status)

ptable

A data.table containing the ptable file which determines when perturbation is applied.

geog

A character vector giving the column name in data that contains the desired geography level for the frequency table. This can be an empty vector, c(), if no geography level is required.

tab_vars

A character vector giving the column names in data of the variables to be tabulated. This can be an empty vector, c(), provided a geography level is supplied.

record_key

A character string giving the name of the column in data that holds the record keys required for perturbation. If data contains "ons_id" and use_existing_ons_id = TRUE, set record_key = NULL, as the record key will be generated from "ons_id".

use_existing_ons_id

A logical indicating whether to create record keys from ons_id if ons_id exists in data. It has no effect if the microdata does not contain ons_id. Default is TRUE.

threshold

An integer specifying the value below which counts are suppressed, with a default value of 10.

Value

Returns a data.table giving a frequency table which has had cell key perturbation applied according to the ptable supplied.
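
To show why record keys make the result repeatable, here is a toy base-R sketch of the general idea. It rests on assumptions not stated in this manual: that a cell key is the sum of the cell's record keys modulo 256, and that the perturbation value is looked up by the (pcv, ckey) pair. The microdata and the tiny p-table below are invented for illustration and do not implement the 10-5 rule or suppression.

```r
# Toy microdata (made-up values): one row per unit, record keys in 0-255
micro_toy <- data.frame(
  region     = c("A", "A", "A", "B", "B"),
  record_key = c(12L, 200L, 77L, 255L, 3L)
)

# Count and cell key per cell.
# ASSUMPTION: cell key = sum of the cell's record keys modulo 256
count <- tapply(micro_toy$record_key, micro_toy$region, length)
ckey  <- tapply(micro_toy$record_key, micro_toy$region,
                function(k) sum(k) %% 256L)
cells <- data.frame(
  region = names(count),
  count  = as.integer(count),
  ckey   = as.integer(ckey)
)

# Invented toy p-table: -1 for even cell keys, +1 for odd cell keys
ptable_toy <- expand.grid(pcv = 1:3, ckey = 0:255)
ptable_toy$pvalue <- ifelse(ptable_toy$ckey %% 2 == 0, -1L, 1L)

# Look up the perturbation value by (pcv, ckey) and apply it
cells$pcv <- cells$count
perturbed <- merge(cells, ptable_toy, by = c("pcv", "ckey"))
perturbed$pert_count <- perturbed$count + perturbed$pvalue
print(perturbed[, c("region", "count", "ckey", "pvalue", "pert_count")])
```

Because the cell key depends only on the records in the cell, rerunning the sketch on the same records yields the same perturbed counts.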

Examples

if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(1)
}

geog <- "var1"
tab_vars <- c("var5","var8")
record_key <- "record_key"
perturbed_table <- create_perturbed_table(micro,
                                          ptable_10_5,
                                          geog,
                                          tab_vars,
                                          record_key)

# Alternatively
perturbed_table <- create_perturbed_table(data = micro,
                                          ptable = ptable_10_5,
                                          geog = c(),
                                          tab_vars = c("var1","var5","var8"),
                                          record_key = "record_key",
                                          threshold = 10)

Create a perturbed frequency table in BigQuery and return it as a data frame

Description

This function runs the perturbation method fully in BigQuery (via SQL) and only downloads the result, which allows handling large datasets efficiently.

Usage

create_perturbed_table_bigquery(
  con,
  data,
  ptable,
  geog,
  tab_vars,
  record_key,
  use_existing_ons_id = TRUE,
  threshold = 10,
  return_query = FALSE
)

Arguments

con

–DBIConnection. An active BigQuery connection created with DBI::dbConnect()

data

–character. BigQuery table name for microdata in full format: "<PROJECT>.<DATASET>.<TABLE>". One row per statistical unit (person, household, business, etc.), and one column per variable (e.g. age, sex, health status)

ptable

–character. BigQuery table name for the p-table in full format: "<PROJECT>.<DATASET>.<TABLE>".

geog

–character vector. Column name containing the desired geography level for the frequency table, e.g., c("Region") or c("LocalAuthority"). Use c() if no geography breakdown is required.

tab_vars

–character vector. Column names to tabulate, e.g., c("Age", "Health", "Occupation").

record_key

–character. Column name with record keys required for perturbation, e.g., "Record_Key". If data contains "ons_id" and use_existing_ons_id = TRUE, set record_key = NULL, as the record key will be generated from "ons_id".

use_existing_ons_id

–logical. Whether to create record keys from ons_id if ons_id exists in data. It has no effect if the microdata does not contain ons_id. Default is TRUE.

threshold

–integer. Suppression threshold; perturbed counts below this value are suppressed. Default 10.

return_query

–logical. If TRUE, returns the generated SQL query without executing it. Default FALSE.

Details

Function workflow:

Generate BigQuery SQL query to run perturbation on BigQuery
If return_query = TRUE, return the query text and exit; otherwise, continue with the remaining steps
Validate inputs using BigQuery
Run perturbation using BigQuery
Convert perturbed table to data.table and sort

The query built by this function does the following when executed:

Computes counts and cell keys for each unique combination of geographic and tabulation variables.
Includes zero-count cells by generating the full cartesian product of variable combinations.
Calculates pcv so that ptable rows 501-750 are reused for cell values above 750.
Applies perturbation values from a perturbation table based on cell keys and pseudo cell values (pcv).
Suppresses cells below a specified threshold by setting their perturbed count to NULL.
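
The reuse of p-table rows 501-750 for large counts can be sketched as a small function. The exact formula is an assumption: this manual only states that rows 501-750 are reused above 750, and the modulo mapping below is one mapping consistent with that statement. pseudo_cell_value() is a hypothetical name, and the sketch is written in R rather than SQL.

```r
# ASSUMED mapping (not confirmed by this manual):
#   counts up to 750 map to themselves;
#   larger counts cycle through 501..750 in blocks of 250.
pseudo_cell_value <- function(count) {
  ifelse(count <= 750, count, (count - 1) %% 250 + 501)
}

pcv <- pseudo_cell_value(c(10, 750, 751, 1000, 1001))
print(pcv)
# 10 750 501 750 501
```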

Value

When return_query = FALSE: a data.table containing the perturbed frequency table, sorted by geog and tab_vars.
When return_query = TRUE: a character string containing the query.

Examples

# --- Return query text without executing it ---
query <- create_perturbed_table_bigquery(
  con        = NULL,
  data       = "my-gcp-project.survey.microdata",
  ptable     = "my-gcp-project.sdc.ptable",
  geog       = c("Region"),
  tab_vars   = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = "Record_Key",
  threshold  = 10,
  return_query = TRUE
)
cat(query)

Generate ptable (10-5 rule)

Description

generate_ptable_10_5_rule() generates a sample p-table based on the 10-5 rule, which means a suppression threshold of 10 and rounding to the nearest 5.

Usage

generate_ptable_10_5_rule(max_pcv = 750, ckey_range = 255)

Arguments

max_pcv

Max value for pcv. Default is 750.

ckey_range

The max range for cell keys. Default is 255.

Value

A data.table assigning a pvalue to each ckey and pcv combination

Examples

if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(1)
}
ptable <- generate_ptable_10_5_rule()

Generate and attach random record keys to microdata

Description

generate_random_rkey() attaches randomly generated record keys to microdata tables for testing purposes.

Usage

generate_random_rkey(data, rkey_range = 255, seed = NULL)

Arguments

data

A data.table or data.frame containing the microdata

rkey_range

The max range for record keys. Default is 255.

seed

A seed for the random number generator

Value

A data.table with a new integer column record_key
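
A base-R sketch of what attaching random record keys might look like follows. It assumes keys are drawn uniformly from 0 to rkey_range inclusive, which the manual does not confirm. attach_random_rkey() is a hypothetical stand-in that works on a plain data.frame, not the package function.

```r
# Hypothetical stand-in for illustration only.
# ASSUMPTION: keys are sampled uniformly from 0..rkey_range (inclusive).
attach_random_rkey <- function(data, rkey_range = 255, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # a fixed seed makes the keys repeatable
  data$record_key <- sample(0:rkey_range, nrow(data), replace = TRUE)
  data
}

toy   <- data.frame(id = 1:1000)
run_1 <- attach_random_rkey(toy, rkey_range = 255, seed = 2005)
run_2 <- attach_random_rkey(toy, rkey_range = 255, seed = 2005)

# Same seed, same keys
print(identical(run_1$record_key, run_2$record_key))
```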

Examples

library(data.table)
data <- data.table(id = 1:1000)
data <- generate_random_rkey(data, rkey_range = 255, seed = 2005)

Generate Record Key from ONS ID

Description

This function creates a new record key column by taking the modulo 4096 of the ons_id column. It converts ons_id to numeric, preserving NA for non-numeric values, and assigns the result as an integer.

Usage

generate_record_key_from_ons_id(data, record_key_col)

Arguments

data

A data.table containing the ons_id column.

record_key_col

A character string specifying the name of the new record key column to create.

Details

The function checks that data is a data.table.
Non-numeric values in ons_id are converted to NA.
The record key is computed as ons_id %% 4096 and stored as integer.

Value

A data.table with the new record key column added.
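
The arithmetic described in Details can be reproduced on a plain vector in base R. This is a sketch of the computation only, with made-up ons_id values. It is not the package function, which operates on a data.table.

```r
# Made-up ons_id values, including one non-numeric entry
ons_id <- c("4097", "8192", "abc", "123")

# Non-numeric values become NA (the coercion warning is silenced here)
numeric_id <- suppressWarnings(as.numeric(ons_id))

# Record key = ons_id modulo 4096, stored as integer
record_key <- as.integer(numeric_id %% 4096)
print(record_key)
# 1 0 NA 123
```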

Generate sample microdata

Description

generate_test_data() creates sample microdata containing randomly generated columns and record keys for testing purposes.

Note: You can set a seed for the random number generator to obtain the same output across runs. However, the sample microdata included in the package differs from this output, as it was generated with the corresponding Python package for consistency in test output.

Usage

generate_test_data(size = 1000, rkey_range = 255, seed = NULL)

Arguments

size

Number of rows in the sample microdata. Default is 1000.

rkey_range

The max range for record keys. Default is 255.

seed

A seed for the random number generator

Value

A data.table containing randomly generated microdata and record keys

Examples

if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(1)
}
data <- generate_test_data(size = 1000)
data <- generate_test_data(size = 1000, rkey_range = 255, seed = 111)

Example data (micro)

Description

A data set containing randomly generated data to showcase the cell key perturbation method.

Usage

data(micro)

Format

A data.table containing 1000 observations of 11 variables

Details

record_key. record key value (0-255)
var1. example variable 1 (1-5)
var2. example variable 2 (1,2)
var3. example variable 3 (1-4)
var4. example variable 4 (1-4)
var5. example variable 5 (1-10)
var6. example variable 6 (1-5)
var7. example variable 7 (1-5)
var8. example variable 8 (A-D)
var9. example variable 9 (A-H)
var10. example variable 10 (1-49)

Perturbation table

Description

A data set containing the rules to apply cell key perturbation with a threshold of 10, and rounding to base 5. In other words, counts less than 10 will be removed, and all others will be rounded to the nearest 5.
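
The net effect described above (counts below 10 removed, all others rounded to the nearest 5) can be illustrated on a few made-up counts. This sketch shows only the stated outcome of the rule. It does not use the p-table lookup, and apply_10_5_rule() is a hypothetical name.

```r
# Hypothetical illustration of the stated 10-5 outcome, not the ptable lookup
apply_10_5_rule <- function(count, threshold = 10, base = 5) {
  ifelse(count < threshold, NA, base * round(count / base))
}

result <- apply_10_5_rule(c(3, 9, 10, 12, 13, 47))
print(result)
# NA NA 10 10 15 45
```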

Usage

data(ptable_10_5)

Format

A data.table containing 192000 observations of 3 variables

Details

pcv. perturbation cell value (1-750)
ckey. cell key value (0-255)
pvalue. perturbation value to be applied
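
The row count given under Format is consistent with the ranges listed above, assuming the table holds one row per (pcv, ckey) pair: 750 pcv values times 256 cell keys gives 192000 rows. The sketch below builds only the grid of key pairs and leaves out pvalue.

```r
# ASSUMPTION: one row for every combination of pcv (1-750) and ckey (0-255)
grid <- expand.grid(pcv = 1:750, ckey = 0:255)

print(nrow(grid))
# 192000
```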

Validate Inputs Before Perturbation using BigQuery

Description

Validates BigQuery inputs for a perturbation process.

Validate input arguments
- Check that at least one variable is specified for geog or tab_vars
- Check geog and tab_vars are either character vectors or NULL
- Check the specified record_key is a character vector or NULL
- Check threshold is a non-negative integer
Validate microdata and ptable contain required columns
- Check data contains the specified geog, tab_vars and record_key columns
- Check ptable contains the required columns
Validate the range of record keys and cell keys
Validate data has sufficient records with record keys to apply perturbation
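
The argument checks in the first group above can be sketched in base R without a BigQuery connection. validate_args() is a hypothetical helper written for illustration. The error messages are invented, and the checks against the BigQuery tables themselves are not covered.

```r
# Hypothetical helper: covers only the argument checks, not the table checks
validate_args <- function(geog, tab_vars, record_key, threshold) {
  is_chr_or_null <- function(x) is.null(x) || is.character(x)

  if (length(geog) == 0 && length(tab_vars) == 0) {
    stop("Specify at least one variable in geog or tab_vars.")
  }
  if (!is_chr_or_null(geog) || !is_chr_or_null(tab_vars)) {
    stop("geog and tab_vars must be character vectors or NULL.")
  }
  if (!is_chr_or_null(record_key)) {
    stop("record_key must be a character vector or NULL.")
  }
  if (!is.numeric(threshold) || length(threshold) != 1 || is.na(threshold) ||
      threshold < 0 || threshold != round(threshold)) {
    stop("threshold must be a non-negative integer.")
  }
  invisible(TRUE)
}

# Passes: c() is NULL, and tab_vars supplies a variable
ok <- validate_args(c(), c("var5", "var8"), "record_key", 10)

# Fails: neither geog nor tab_vars is supplied
err <- try(validate_args(c(), c(), "record_key", 10), silent = TRUE)
```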

Usage

validate_inputs_bigquery(
  con,
  data,
  ptable,
  geog,
  tab_vars,
  record_key,
  use_existing_ons_id,
  threshold
)

Arguments

con

–DBIConnection. An active BigQuery connection created with DBI::dbConnect()

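data

–character. BigQuery table name for microdata in full format: "<PROJECT>.<DATASET>.<TABLE>". One row per statistical unit (person, household, business, etc.), and one column per variable (e.g. age, sex, health status)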

ptable

–character. BigQuery table name for the p-table in full format: "<PROJECT>.<DATASET>.<TABLE>".

geog

–character vector. Column name containing the desired geography level for the frequency table, e.g., c("Region") or c("LocalAuthority"). Use c() if no geography breakdown is required.

tab_vars

–character vector. Column names to tabulate, e.g., c("Age", "Health", "Occupation").

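record_key

–character. Column name with record keys required for perturbation, e.g., "Record_Key". If data contains "ons_id" and use_existing_ons_id = TRUE, set record_key = NULL, as the record key will be generated from "ons_id".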

use_existing_ons_id

–logical. Whether to create record keys from ons_id if ons_id exists in data. It has no effect if the microdata does not contain ons_id. Default is TRUE.

threshold

–integer. Suppression threshold; perturbed counts below this value are suppressed. Default 10.

Value

Invisibly returns TRUE on success. Raises an error (via stop()) or a warning if any validation fails.
