
Creates a small, synthetic dataset that mimics the structure of UK Biobank phenotype data on the RAP. Useful for developing and testing derive_*, assoc_*, and plot_* functions without requiring real UKB data access.

Usage

ops_toy(scenario = "cohort", n = 1000L, seed = 42L)

Arguments

scenario

(character) Data structure to generate:

  • "cohort": wide participant-level table with raw UKB field columns for the full derive_* → assoc_* → plot_* pipeline.

  • "association": analysis-ready table with covariates already as factors, BMI/TDI binned, and two pre-derived disease outcomes (dm_*, htn_*) including status, date, timing, and follow-up columns. Use this for assoc_* examples and testing without running the derive pipeline.

  • "forest": association results table matching assoc_coxph() output, for testing plot_forest(). n = number of exposures (default 8).

n

(integer) Number of participants (or exposures for "forest"). Default 1000L for "cohort", 2000L for "association", 8L for "forest".

seed

(integer or NULL) Random seed for reproducibility. Pass NULL for a different dataset on every call. Default 42L.

Value

A data.table with UKB-style column names. See Details for the columns included in each scenario.

Details

This dataset is entirely synthetic. Column names follow RAP conventions (e.g. p41270, p20002_i0_a0).

scenario = "cohort"

Includes the following column groups:

  • Demographics: eid, p31, p34, p53_i0, p21022

  • Covariates: p21001_i0, p20116_i0, p1558_i0, p21000_i0, p22189, p54_i0

  • Genetic PCs: p22009_a1–p22009_a10

  • Self-report disease: p20002_i0_a0–a4, p20008_i0_a0–a4

  • Self-report cancer: p20001_i0_a0–a4, p20006_i0_a0–a4

  • HES: p41270 (JSON array), p41280_a0–a8

  • Cancer registry: p40006_i0–i2, p40011_i0–i2, p40012_i0–i2, p40005_i0–i2

  • Death registry: p40001_i0, p40002_i0_a0–a2, p40000_i0

  • First occurrence: p131742

  • GRS: grs_bmi, grs_raw, grs_finngen

  • Messy columns: messy_allna, messy_empty, messy_label
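The array-style column ranges listed above (for example p22009_a1–p22009_a10) can be expanded into explicit name vectors before selecting columns. The following is a minimal base R sketch; the variable names and the stand-in table `toy` are hypothetical and are not produced by ops_toy().

```r
# Expand the documented column ranges into explicit name vectors.
pc_cols  <- paste0("p22009_a", 1:10)    # genetic PCs: a1 to a10
sr_cols  <- c(paste0("p20002_i0_a", 0:4),
              paste0("p20008_i0_a", 0:4)) # self-report disease: a0 to a4
hes_cols <- c("p41270", paste0("p41280_a", 0:8)) # HES: a0 to a8

# Hypothetical stand-in table holding only a subset of these columns,
# used here to show a defensive selection that tolerates missing columns.
toy <- data.frame(eid = 1:3, p22009_a1 = rnorm(3), p22009_a2 = rnorm(3))
present_pcs <- intersect(pc_cols, names(toy))
pcs <- toy[, present_pcs, drop = FALSE]
```

Using intersect() keeps the selection from failing when a scenario omits some of the columns, as the "association" scenario does for the raw derive inputs.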

scenario = "association"

Analysis-ready table. All derive inputs (raw arrays, HES JSON, registry fields) are omitted; derive outputs are pre-computed with internally consistent relationships:

  • Demographics: eid, p31 (factor), p53_i0 (IDate), p21022

  • Covariates: p21001_i0, bmi_cat (factor, derived from p21001_i0), p20116_i0 (factor), p1558_i0 (factor), p21000_i0 (factor), p22189, tdi_cat (factor, derived from p22189 quartiles), p54_i0 (factor)

  • Genetic PCs: p22009_a1–p22009_a10

  • GRS: grs_bmi (continuous exposure)

  • DM outcome: dm_status, dm_date, dm_timing, dm_followup_end, dm_followup_years (type 2 diabetes, ~12% prevalence)

  • HTN outcome: htn_status, htn_date, htn_timing, htn_followup_end, htn_followup_years (hypertension, ~28% prevalence)

Internal relationships guaranteed:

  • bmi_cat is cut from p21001_i0 (breaks 18.5 / 25 / 30)

  • tdi_cat is cut from p22189 quartiles

  • dm_date is NA if and only if dm_status is FALSE

  • dm_timing: 0 = no disease, 1 = prevalent, 2 = incident, NA = no date

  • dm_followup_years is NA for prevalent cases (dm_timing == 1)
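The relationships listed above can be illustrated, and checked, with plain base R. This is a sketch on hand-made values, not the function's own implementation: the category labels, the left-closed intervals (right = FALSE), the quartile labels, and all data values are assumptions chosen for illustration. Only the breaks (18.5 / 25 / 30), the quartile binning, and the NA rules come from the documentation.

```r
# Hypothetical BMI values standing in for p21001_i0
bmi <- c(17.9, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2)

# Documented breaks 18.5 / 25 / 30; labels and interval closure are assumed
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE
)

# Hypothetical deprivation index values standing in for p22189
tdi <- c(-5.1, -3.2, -2.0, -0.4, 0.9, 2.7, 4.8, 6.3)

# Quartile binning; include.lowest keeps the minimum in the first bin
tdi_cat <- cut(
  tdi,
  breaks = quantile(tdi, probs = seq(0, 1, 0.25), na.rm = TRUE),
  labels = paste0("Q", 1:4),
  include.lowest = TRUE
)

# Hand-made outcome rows: no disease, prevalent case, incident case
toy <- data.frame(
  dm_status         = c(FALSE, TRUE, TRUE),
  dm_date           = as.Date(c(NA, "2005-03-01", "2014-07-15")),
  dm_timing         = c(0L, 1L, 2L),
  dm_followup_years = c(12.4, NA, 5.1)
)

# dm_date is missing exactly when dm_status is FALSE
date_rule_ok <- identical(is.na(toy$dm_date), !toy$dm_status)

# Follow-up time is missing for prevalent cases (dm_timing == 1)
followup_rule_ok <- all(is.na(toy$dm_followup_years[toy$dm_timing %in% 1L]))
```

The same two logical checks can be run against a table returned with scenario = "association" to confirm the documented guarantees hold for a given n and seed.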

Examples

# cohort: raw UKB-style columns, feed into derive pipeline
dt <- ops_toy(n = 100)
#>  ops_toy: 100 participants | 75 columns | scenario = "cohort" | seed = 42
dt <- derive_missing(dt)
#>  derive_missing: replaced 47 values across 3 columns (action = "na").

# association: analysis-ready, feed directly into assoc_* functions
dt <- ops_toy(scenario = "association", n = 500)
#>  ops_toy: 500 participants | 33 columns | scenario = "association" | seed = 42
dt <- dt[dm_timing != 1L]   # exclude prevalent cases

# forest: results table for plot_forest()
dt <- ops_toy(scenario = "forest")
#>  ops_toy: 24 rows | 11 columns | scenario = "forest" | seed = 42
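One point to keep in mind with the prevalent-case filter in the association example: data.table treats an NA in a logical i expression as FALSE, so dm_timing != 1L also drops any row whose dm_timing is NA. Whether the toy data contains such rows is not stated here, so the table below is a hypothetical stand-in with made-up values.

```r
library(data.table)

# Hypothetical stand-in for the "association" table; values are made up
toy <- data.table(
  eid       = 1:5,
  dm_timing = c(0L, 1L, 2L, NA, 0L)
)

# NA in the logical condition counts as FALSE, so the NA row is removed too
strict <- toy[dm_timing != 1L]

# Keep rows with unknown timing explicitly while excluding prevalent cases
lenient <- toy[is.na(dm_timing) | dm_timing != 1L]
```

Which variant is appropriate depends on how the downstream analysis should treat participants with unknown timing.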