Operational Utilities: Setup, Diagnostics, and Pipeline Tracking

Overview

The ops_* functions are a set of lightweight utilities that sit outside the main analysis pipeline. They help you verify your environment before starting, explore data quality, and track how your cohort changes at each processing step.

Function	Purpose
`ops_setup()`	Check dx CLI, RAP authentication, and R package dependencies
`ops_toy()`	Generate synthetic UKB-like data for development and testing
`ops_na()`	Summarise missing values (NA and `""`) across all columns
`ops_snapshot()`	Record pipeline checkpoints and track dataset changes

ops_setup() may query the dx CLI and RAP authentication status as part of its health check. All other functions operate entirely locally: ops_toy() generates data in memory and ops_na() only reads its input; ops_snapshot() and its companions track columns and optionally clean them up; ops_withdraw() removes withdrawn participants in place. None of them read from or write to RAP storage.

`ops_setup()` — Environment Health Check

Run ops_setup() once after installing ukbflow to confirm that all required components are in place before starting a real analysis.

library(ukbflow)

ops_setup()
#> ── ukbflow environment check ──────────────────────────────────────────────
#> ℹ ukbflow 0.1.0  |  R 4.4.1  |  2026-03-09
#> ── 1. dx-toolkit ──────────────────────────────────────────────────────────
#> ✔ dx: /usr/local/bin/dx  (dx-toolkit v0.375.0)
#> ── 2. RAP authentication ───────────────────────────────────────────────────
#> ✔ user: evan.zhou
#> ✔ project: project-GXk9...
#> ── 3. R packages ───────────────────────────────────────────────────────────
#> ✔ cli  3.6.3  [core]
#> ✔ data.table  1.15.4  [core]
#> ✔ survival  3.7.0  [assoc_coxph]
#> ✔ forestploter  1.1.1  [plot_forest]
#> ...
#> ───────────────────────────────────────────────────────────────────────────
#> ✔ 15 passed
#> ! 2 optional / warning

For programmatic use (e.g. inside scripts or CI), set verbose = FALSE and inspect the returned list:

result <- ops_setup(verbose = FALSE)
result$summary
#> $pass
#> [1] 15
#> $warn
#> [1] 2
#> $fail
#> [1] 0

# Gate the rest of your script on a clean environment
stopifnot(result$summary$fail == 0)
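
In CI logs, a failure message that names the counts can be easier to act on than a bare stopifnot(). The following is a minimal sketch, assuming only the summary structure shown above (a list with pass, warn, and fail counts). The helper name require_clean_env is hypothetical and not part of ukbflow, and the summary is mocked so the snippet runs without the package:

```r
# Hypothetical helper (not part of ukbflow): stop with a readable message
# when a summary list of the shape shown above reports failed checks.
require_clean_env <- function(summary) {
  if (summary$fail > 0) {
    stop(sprintf("environment check failed: %d failed, %d warnings",
                 summary$fail, summary$warn), call. = FALSE)
  }
  invisible(TRUE)
}

# Mocked summary with the same fields as result$summary
mock_summary <- list(pass = 15, warn = 2, fail = 0)
require_clean_env(mock_summary)   # returns invisibly when nothing failed
```

In a real script the mocked list would be replaced by ops_setup(verbose = FALSE)$summary.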

Individual checks can be disabled when only a subset is needed:

# Check R package dependencies only (skip dx and RAP auth)
ops_setup(check_dx = FALSE, check_auth = FALSE)

`ops_toy()` — Synthetic UKB Data

ops_toy() generates a realistic but entirely synthetic dataset that mimics the structure of UKB phenotype data on the RAP. Use it to develop and test derive_*, assoc_*, and plot_* functions without needing real UKB data access.

Cohort scenario

The default "cohort" scenario produces a wide participant-level table that covers all major UKB data domains:

dt <- ops_toy()
#> ✔ ops_toy: 1000 participants | 75 columns | scenario = "cohort" | seed = 42

dim(dt)
#> [1] 1000   75

names(dt)
#>  [1] "eid"          "p31"          "p34"          "p53_i0"
#>  [5] "p21022"       "p21001_i0"    "p20116_i0"    "p1558_i0"
#>  ...

Column groups included:

Group	Columns
Demographics	`eid`, `p31`, `p34`, `p53_i0`, `p21022`
Covariates	`p21001_i0`, `p20116_i0`, `p1558_i0`, `p21000_i0`, `p22189`, `p54_i0`
Genetic PCs	`p22009_a1` – `p22009_a10`
Self-report disease	`p20002_i0_a0` – `a4`, `p20008_i0_a0` – `a4`
Self-report cancer	`p20001_i0_a0` – `a4`, `p20006_i0_a0` – `a4`
HES	`p41270` (JSON array), `p41280_a0` – `a8`
Cancer registry	`p40006_i0` – `i2`, `p40011_i0` – `i2`, `p40012_i0` – `i2`, `p40005_i0` – `i2`
Death registry	`p40001_i0`, `p40002_i0_a0` – `a2`, `p40000_i0`
First occurrence	`p131742`
GRS columns	`grs_bmi`, `grs_raw`, `grs_finngen`
Messy columns	`messy_allna`, `messy_empty`, `messy_label`

The messy columns deliberately stress-test derive_missing() and ops_na() against common data quality issues (all-NA columns, empty strings, non-standard missing labels).

Feed the output directly into the derive pipeline:

dt <- ops_toy()
dt <- derive_missing(dt)
dt <- derive_covariate(dt,
  as_numeric = "p21001_i0",
  as_factor  = c("p31", "p20116_i0")
)

Forest scenario

The "forest" scenario returns a results table matching the output of assoc_coxph(), useful for developing and testing plot_forest() without running a real Cox model:

dt_forest <- ops_toy(scenario = "forest")
#> ✔ ops_toy: 24 rows | 11 columns | scenario = "forest" | seed = 42

plot_forest(
  data  = dt_forest[model == "Fully adjusted"],
  est   = dt_forest[model == "Fully adjusted", HR],
  lower = dt_forest[model == "Fully adjusted", CI_lower],
  upper = dt_forest[model == "Fully adjusted", CI_upper]
)

Reproducibility

Results are reproducible by default (seed = 42). Pass seed = NULL for a different dataset on every call:

dt1 <- ops_toy(seed = 1)
dt2 <- ops_toy(seed = 1)
identical(dt1, dt2)   # TRUE

dt_random <- ops_toy(seed = NULL)   # different every call
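
The seeding behaviour described above matches the usual set.seed() pattern in R. A minimal base-R sketch of that pattern follows; make_toy() is a hypothetical stand-in, and how ops_toy() handles the seed internally is an assumption rather than something this document shows:

```r
# Hypothetical stand-in for a seeded generator (not ops_toy() itself).
make_toy <- function(n = 5, seed = 42) {
  # seed = NULL skips set.seed(), so the current RNG state is used
  if (!is.null(seed)) set.seed(seed)
  data.frame(
    eid = seq_len(n),
    bmi = round(rnorm(n, mean = 27, sd = 4), 1)
  )
}

same <- identical(make_toy(seed = 1), make_toy(seed = 1))
same   # TRUE: same seed, same draws
```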

`ops_na()` — Missing Value Diagnostics

ops_na() scans every column for NA and empty strings (""), returning counts and percentages sorted by missingness. Counting "" as missing is intentional — UKB exports frequently use empty strings as placeholders for absent text values, so ops_na() reports effective missingness rather than a plain is.na() count. It is designed to be called before derive_missing() to understand the data quality profile of a freshly extracted UKB dataset.

dt <- ops_toy()
ops_na(dt)
#> ── ops_na ──────────────────────────────────────────────────────────────────
#> ℹ 1000 rows | 65 columns | threshold = 0%
#> ✖ messy_allna   1000 / 1000  (100.00%)
#> ✖ p41280_a4     1000 / 1000  (100.00%)
#> ✖ p20002_i0_a4   976 / 1000  ( 97.60%)
#> ✖ p131742        916 / 1000  ( 91.60%)
#> ...
#> ────────────────────────────────────────────────────────────────────────────
#> ✖ 41 columns ≥ 10% missing
#> ✔ 24 columns complete (0% missing)

Columns with ≥ 10% missing are flagged in red (✖); those between 0% and 10% in yellow (!). The summary block (totals) is always printed regardless of the threshold setting.
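
To make the idea of effective missingness concrete, here is a base-R sketch that counts both NA and "" per column. It illustrates the counting rule only and is not the implementation of ops_na(); the function name count_missing and the toy data are invented for this example:

```r
# Base-R sketch of "effective missingness": NA and "" both count as missing.
# Illustration only; not ops_na()'s implementation.
count_missing <- function(df) {
  n_na <- vapply(df, function(x) {
    # Only character columns can hold empty strings
    empty <- if (is.character(x)) !is.na(x) & x == "" else FALSE
    sum(is.na(x) | empty)
  }, integer(1))
  out <- data.frame(
    column = names(df),
    n_na   = unname(n_na),
    pct_na = round(100 * unname(n_na) / nrow(df), 2),
    stringsAsFactors = FALSE
  )
  out[order(-out$n_na), ]   # most-missing columns first
}

toy <- data.frame(
  eid   = 1:4,
  label = c("a", "", NA, "d"),   # one empty string, one NA
  allna = NA,                    # entirely missing
  stringsAsFactors = FALSE
)
count_missing(toy)
```

Here label is reported with 2 missing values out of 4, whereas a plain is.na() count would report 1.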

Controlling CLI output with `threshold`

Use threshold to hide low-missingness columns in the per-column listing when the dataset has many columns. The summary block and returned data.table are always complete.

# Only list columns with > 50% missing in the console output
ops_na(dt, threshold = 50)

# Hide nearly all per-column lines (only columns with > 99% missing are still listed)
ops_na(dt, threshold = 99)

Programmatic use

ops_na() returns a data.table invisibly, regardless of threshold:

result <- ops_na(dt, verbose = FALSE)
result
#>           column  n_na pct_na
#>           <char> <int>  <num>
#>  1:  messy_allna  1000  100.0
#>  2:    p41280_a4  1000  100.0
#>  ...

# Identify columns to drop before modelling
cols_to_drop <- result[pct_na > 90, column]
dt[, (cols_to_drop) := NULL]

`ops_snapshot()` — Pipeline Checkpoints

ops_snapshot() records a lightweight summary of your dataset at each processing step and stores it in the session cache. Each subsequent call automatically computes deltas (Δ) against the previous snapshot, making it easy to track how rows, columns, and missingness change through the pipeline.

Recording snapshots

dt <- ops_toy()
ops_snapshot(dt, label = "raw")
#> ── snapshot: raw ───────────────────────────────────────────────────────────
#>   rows      1,000
#>   cols         65
#>   NA cols      41
#>   size       0.61 MB
#> ────────────────────────────────────────────────────────────────────────────

dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
#> ── snapshot: after_derive_missing ──────────────────────────────────────────
#>   rows      1,000  (= 0)
#>   cols         65  (= 0)
#>   NA cols      43  (+2)
#>   size       0.61 MB  (= 0)
#> ────────────────────────────────────────────────────────────────────────────

dt <- dt[p31 == "Female"]
ops_snapshot(dt, label = "female_only")
#> ── snapshot: female_only ───────────────────────────────────────────────────
#>   rows        570  (-430)
#>   cols         65  (= 0)
#>   NA cols      43  (= 0)
#>   size       0.36 MB  (-0.25 MB)
#> ────────────────────────────────────────────────────────────────────────────

When label is omitted, snapshots are named snapshot_1, snapshot_2, etc. automatically. Labels should be unique within a session: if the same label is used twice, the history row is appended again but the stored column list is overwritten — which can cause ops_snapshot_cols() and ops_snapshot_diff() to behave unexpectedly.
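
When labels are generated in a loop or from step names, one way to avoid accidental reuse is to de-duplicate them up front. This is a small base-R sketch using make.unique(); the step names are invented for illustration:

```r
# Base-R sketch: derive collision-free labels before recording snapshots.
steps  <- c("raw", "filtered", "filtered", "derived")
labels <- make.unique(steps, sep = "_")
labels
#> [1] "raw"        "filtered"   "filtered_1" "derived"
```

Each element of labels could then be passed as the label argument of one ops_snapshot() call.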

Viewing the full history

Call ops_snapshot() with no arguments to print and return the complete history data.table:

ops_snapshot()
#> ── ops_snapshot history ────────────────────────────────────────────────────
#>    idx                label timestamp  nrow  ncol n_na_cols size_mb
#>  1:  1                  raw  14:30:01  1000    65        41    0.61
#>  2:  2 after_derive_missing  14:30:05  1000    65        43    0.61
#>  3:  3          female_only  14:30:08   570    65        43    0.36
#> ────────────────────────────────────────────────────────────────────────────
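
The deltas printed at each checkpoint can also be recomputed from the history if you want them as columns. The sketch below uses base R and copies the values from the history table above into a plain data.frame, so it runs without the package; the column names d_nrow and d_n_na_cols are made up for this example:

```r
# Values copied from the history table above; deltas recomputed in base R.
history <- data.frame(
  label     = c("raw", "after_derive_missing", "female_only"),
  nrow      = c(1000, 1000, 570),
  n_na_cols = c(41, 43, 43),
  stringsAsFactors = FALSE
)

# The first snapshot has no predecessor, so its delta is NA
history$d_nrow      <- c(NA, diff(history$nrow))
history$d_n_na_cols <- c(NA, diff(history$n_na_cols))
history[, c("label", "d_nrow", "d_n_na_cols")]
```

The row deltas come out as 0 and -430, and the NA-column deltas as +2 and 0, matching the values printed at each checkpoint.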

Silent recording

Set verbose = FALSE to record a snapshot without printing anything — useful inside functions or automated scripts:

ops_snapshot(dt, label = "pre_assoc", verbose = FALSE)

Resetting history

ops_snapshot(reset = TRUE)
#> ✔ Snapshot history cleared.

Session scope: the snapshot history lives in ukbflow’s session cache and is cleared when the R session ends or when ops_snapshot(reset = TRUE) is called. It is not written to disk.

Snapshot Helpers

`ops_snapshot_cols()` — column names at a checkpoint

Returns the column names recorded at a given snapshot label, minus protected columns (eid, sex, age, age_at_recruitment, and any registered via ops_set_safe_cols()). The primary use is building a drop vector after the raw columns are no longer needed.

raw_cols <- ops_snapshot_cols("raw")
# raw_cols is a character vector of droppable column names

Pass keep to protect additional columns beyond the defaults:

raw_cols <- ops_snapshot_cols("raw", keep = "p53_i0")
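
Conceptually, the result is the recorded column list minus everything protected. The following base-R sketch shows that set logic only; the recorded columns are invented, and the real function may differ in details such as ordering:

```r
# Set logic only; not the implementation of ops_snapshot_cols().
recorded  <- c("eid", "p31", "p53_i0", "p21001_i0")          # invented snapshot columns
protected <- c("eid", "sex", "age", "age_at_recruitment")    # defaults listed above
keep      <- "p53_i0"                                        # extra column to protect

droppable <- setdiff(recorded, union(protected, keep))
droppable   # "p31" and "p21001_i0"
```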

`ops_snapshot_diff()` — compare two checkpoints

Returns lists of columns added and removed between two snapshots — useful for auditing what derive_* functions produced.

result <- ops_snapshot_diff("raw", "after_derive_missing")
result$added    # columns added in this step
result$removed  # columns dropped in this step
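
The added and removed lists correspond to two set differences between the column names at each checkpoint. A base-R sketch of that comparison, with invented column names ("bmi_cat" is a made-up derived column):

```r
# Set differences between two column lists; illustration only.
raw_cols     <- c("eid", "p31", "p21001_i0")
derived_cols <- c("eid", "p31", "p21001_i0", "bmi_cat")

diff_cols <- list(
  added   = setdiff(derived_cols, raw_cols),   # present only in the later snapshot
  removed = setdiff(raw_cols, derived_cols)    # present only in the earlier snapshot
)
diff_cols
```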

`ops_snapshot_remove()` — drop raw columns after deriving

Removes from data the raw columns captured at a snapshot, keeping any derived columns added since. Built-in safe columns (eid, etc.) and columns supplied in keep are always retained.

# After deriving, drop the original raw columns
dt <- ops_snapshot_remove(dt, from = "raw")
#> ✔ ops_snapshot_remove: dropped 60 raw columns, 15 remaining.

For data.table input the operation is by reference (in-place); for data.frame input a new data.table is returned and the original is not modified.

`ops_set_safe_cols()` — register study-specific protected columns

Adds column names to the session safe list so they are never dropped by ops_snapshot_cols() or ops_snapshot_remove().

ops_set_safe_cols(c("date_baseline", "age_at_recruitment"))

# Clear registered safe cols
ops_set_safe_cols(reset = TRUE)

`ops_withdraw()` — Exclude Withdrawn Participants

UK Biobank periodically issues withdrawal files listing participants who have revoked consent. ops_withdraw() reads the headerless single-column CSV supplied by UKB and removes matching rows from your dataset. Two snapshots (before_withdraw / after_withdraw) are recorded automatically.

dt <- ops_withdraw(dt, file = "w854944_20260310.csv")
#> ── snapshot: before_withdraw ───────────────────────────────────────────────
#>   rows      502,492
#>   ...
#> ── snapshot: after_withdraw ────────────────────────────────────────────────
#>   rows      502,489  (-3)
#>   ...
#> ℹ Withdrawal file: w854944_20260310.csv (312 IDs)
#> ✖ Excluded: 3 participants found in data
#> ✔ Remaining: 502,489 participants

Run this immediately after loading your extracted dataset, before any derive_* steps, so withdrawn participants never enter the analysis.
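
For readers who want to see what the exclusion amounts to, here is a base-R sketch of reading a headerless single-column file and filtering on eid. It is not the implementation of ops_withdraw() and records no snapshots; the file contents, IDs, and cohort are invented for illustration:

```r
# Base-R sketch of the exclusion step; not ops_withdraw()'s implementation.
# Write an invented withdrawal file: one ID per line, no header.
withdraw_file <- tempfile(fileext = ".csv")
writeLines(c("1000002", "1000005", "9999999"), withdraw_file)

cohort <- data.frame(
  eid = 1000001:1000005,
  p31 = c("Female", "Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Headerless, single column: read with header = FALSE and take column 1
withdrawn <- read.csv(withdraw_file, header = FALSE)[[1]]
kept <- cohort[!cohort$eid %in% withdrawn, ]

n_excluded <- nrow(cohort) - nrow(kept)
n_excluded   # IDs in the file that were actually present in the cohort
```

As in the output above, the number excluded can be smaller than the number of IDs in the file, because a withdrawal file may list participants who are not in your extract.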

Typical Workflow

The four core ops_* functions (ops_setup(), ops_toy(), ops_na(), and ops_snapshot()) bracket the core pipeline:

library(ukbflow)

# 1. Verify environment before starting
ops_setup()

# 2. Generate test data (or extract real data from RAP)
dt <- ops_toy()

# 3. Inspect data quality before processing
ops_na(dt)

# 4. Run pipeline with checkpoints
ops_snapshot(dt, label = "raw")

dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")

dt <- derive_covariate(dt,
  as_numeric = "p21001_i0",
  as_factor  = c("p31", "p20116_i0")
)
ops_snapshot(dt, label = "after_derive_covariate")

# 5. Review full pipeline history
ops_snapshot()

Getting Help

?ops_setup, ?ops_toy, ?ops_na, ?ops_snapshot
?ops_snapshot_cols, ?ops_snapshot_diff, ?ops_snapshot_remove, ?ops_set_safe_cols
?ops_withdraw
vignette("get-started") — end-to-end pipeline overview
vignette("derive") — disease phenotype derivation
GitHub Issues
