Creates a small, synthetic dataset that mimics the structure of UK Biobank
phenotype data on the RAP. Useful for developing and testing derive_*,
assoc_*, and plot_* functions without requiring real UKB data access.
Arguments
- scenario
(character) Data structure to generate:
  - "cohort": wide participant-level table with raw UKB field columns for the full derive_* → assoc_* → plot_* pipeline.
  - "association": analysis-ready table with covariates already as factors, BMI/TDI binned, and two pre-derived disease outcomes (dm_*, htn_*) including status, date, timing, and follow-up columns. Use this for assoc_* examples and testing without running the derive pipeline.
  - "forest": association results table matching assoc_coxph() output, for testing plot_forest(). n = number of exposures (default 8).
- n
(integer) Number of participants (or exposures for "forest"). Default 1000L for "cohort", 2000L for "association", 8L for "forest".
- seed
(integer or NULL) Random seed for reproducibility. Pass NULL for a different dataset on every call. Default 42L.
Value
A data.table with UKB-style column names. See Details for the
columns included in each scenario.
Details
This dataset is entirely synthetic. Column names follow RAP conventions
(e.g. p41270, p20002_i0_a0).
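To make the naming convention concrete, here is a minimal base R sketch that splits a RAP-style column name into its field ID, instance index, and array index. The helper parse_rap_name() is hypothetical and not part of this package; it assumes names follow the p<field>[_i<instance>][_a<array>] pattern seen in the examples above, and returns NA for any component that is absent.

```r
# Hypothetical helper (not part of the package): split RAP-style column
# names into field ID, instance index, and array index.
parse_rap_name <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)

  # Return the integer following `prefix` in the first matching piece,
  # or NA if no piece has the form <prefix><digits>.
  pick <- function(p, prefix) {
    hit <- grep(paste0("^", prefix, "[0-9]+$"), p, value = TRUE)
    if (length(hit) == 0L) NA_integer_ else as.integer(sub(prefix, "", hit[1L]))
  }

  data.frame(
    name     = x,
    field    = vapply(parts, pick, integer(1), prefix = "p"),
    instance = vapply(parts, pick, integer(1), prefix = "i"),
    array    = vapply(parts, pick, integer(1), prefix = "a"),
    stringsAsFactors = FALSE
  )
}

res <- parse_rap_name(c("p41270", "p20002_i0_a0", "p22009_a10", "eid"))
print(res)
```

Non-field columns such as eid or grs_bmi come back with NA in all three components, which can be a convenient way to separate raw UKB fields from derived columns.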
scenario = "cohort"
Includes the following column groups:
- Demographics: eid, p31, p34, p53_i0, p21022
- Covariates: p21001_i0, p20116_i0, p1558_i0, p21000_i0, p22189, p54_i0
- Genetic PCs: p22009_a1–p22009_a10
- Self-report disease: p20002_i0_a0–a4, p20008_i0_a0–a4
- Self-report cancer: p20001_i0_a0–a4, p20006_i0_a0–a4
- HES: p41270 (JSON array), p41280_a0–a8
- Cancer registry: p40006_i0–i2, p40011_i0–i2, p40012_i0–i2, p40005_i0–i2
- Death registry: p40001_i0, p40002_i0_a0–a2, p40000_i0
- First occurrence: p131742
- GRS: grs_bmi, grs_raw, grs_finngen
- Messy columns: messy_allna, messy_empty, messy_label
scenario = "association"
Analysis-ready table. All derive inputs (raw arrays, HES JSON, registry fields) are omitted; derive outputs are pre-computed with internally consistent relationships:
- Demographics: eid, p31 (factor), p53_i0 (IDate), p21022
- Covariates: p21001_i0, bmi_cat (factor, derived from p21001_i0), p20116_i0 (factor), p1558_i0 (factor), p21000_i0 (factor), p22189, tdi_cat (factor, derived from p22189 quartiles), p54_i0 (factor)
- Genetic PCs: p22009_a1–p22009_a10
- GRS: grs_bmi (continuous exposure)
- DM outcome: dm_status, dm_date, dm_timing, dm_followup_end, dm_followup_years (type 2 diabetes, ~12% prevalence)
- HTN outcome: htn_status, htn_date, htn_timing, htn_followup_end, htn_followup_years (hypertension, ~28% prevalence)
Internal relationships guaranteed:
- bmi_cat is cut from p21001_i0 (breaks 18.5 / 25 / 30)
- tdi_cat is cut from p22189 quartiles
- dm_date is NA iff dm_status = FALSE
- dm_timing: 0 = no disease, 1 = prevalent, 2 = incident, NA = no date
- dm_followup_years is NA for prevalent cases (dm_timing == 1)
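The relationships listed above can be illustrated with a small base R sketch. The rows below are hand-made for illustration and are not output from ops_toy(). The category labels, the left-closed intervals (right = FALSE), and the use of quantile() with its default settings are assumptions, since only the break points and the word "quartiles" are documented here.

```r
# Hand-made rows for illustration only; NOT values produced by ops_toy().
toy <- data.frame(
  p21001_i0         = c(22.4, 27.1, 17.9, 33.0),
  dm_status         = c(TRUE, TRUE, FALSE, FALSE),
  dm_date           = as.Date(c("2005-03-01", "2014-06-15", NA, NA)),
  dm_timing         = c(1L, 2L, 0L, 0L),   # 1 = prevalent, 2 = incident, 0 = no disease
  dm_followup_years = c(NA, 6.3, 12.1, 11.8)
)

# BMI bins at the documented breaks 18.5 / 25 / 30.
# Labels and interval closure are assumptions.
toy$bmi_cat <- cut(
  toy$p21001_i0,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  right  = FALSE,
  labels = c("Underweight", "Normal", "Overweight", "Obese")
)

# Quartile bins for a made-up deprivation-index vector (assumed approach).
tdi <- c(-5.2, -3.1, -2.0, -0.4, 0.9, 2.3, 4.8, 7.5)
tdi_cat <- cut(
  tdi,
  breaks = quantile(tdi, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE,
  labels = c("Q1", "Q2", "Q3", "Q4")
)

# Check the documented invariants on the hand-made rows.
stopifnot(
  # dm_date is NA exactly when dm_status is FALSE
  identical(is.na(toy$dm_date), !toy$dm_status),
  # follow-up time is NA for prevalent cases
  all(is.na(toy$dm_followup_years[which(toy$dm_timing == 1L)])),
  # non-cases carry timing code 0
  all(toy$dm_timing[!toy$dm_status] == 0L)
)
```

The same stopifnot() checks could be pointed at the columns of an "association" table as a quick sanity test before fitting models.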
Examples
# cohort: raw UKB-style columns, feed into derive pipeline
dt <- ops_toy(n = 100)
#> ✔ ops_toy: 100 participants | 75 columns | scenario = "cohort" | seed = 42
dt <- derive_missing(dt)
#> ✔ derive_missing: replaced 47 values across 3 columns (action = "na").
# association: analysis-ready, feed directly into assoc_* functions
dt <- ops_toy(scenario = "association", n = 500)
#> ✔ ops_toy: 500 participants | 33 columns | scenario = "association" | seed = 42
dt <- dt[dm_timing != 1L] # exclude prevalent cases
# forest: results table for plot_forest()
dt <- ops_toy(scenario = "forest")
#> ✔ ops_toy: 24 rows | 11 columns | scenario = "forest" | seed = 42
