Overview
The ops_* functions are a set of lightweight utilities
that sit outside the main analysis pipeline. They help you verify your
environment before starting, explore data quality, and track how your
cohort changes at each processing step.
| Function | Purpose |
|---|---|
| ops_setup() | Check dx CLI, RAP authentication, and R package dependencies |
| ops_toy() | Generate synthetic UKB-like data for development and testing |
| ops_na() | Summarise missing values (NA and "") across all columns |
| ops_snapshot() | Record pipeline checkpoints and track dataset changes |
ops_setup() may query dx CLI and RAP authentication
status as part of its health check. All other functions operate entirely
locally: ops_toy() and ops_na() are read-only;
ops_snapshot() and its companions track and optionally
clean up columns; ops_withdraw() removes withdrawn
participants in-place. None of them read from or write to RAP
storage.
ops_setup() — Environment Health Check
Run ops_setup() once after installing ukbflow to confirm
that all required components are in place before starting a real
analysis.
library(ukbflow)
ops_setup()
#> ── ukbflow environment check ──────────────────────────────────────────────
#> ℹ ukbflow 0.1.0 | R 4.4.1 | 2026-03-09
#> ── 1. dx-toolkit ──────────────────────────────────────────────────────────
#> ✔ dx: /usr/local/bin/dx (dx-toolkit v0.375.0)
#> ── 2. RAP authentication ───────────────────────────────────────────────────
#> ✔ user: evan.zhou
#> ✔ project: project-GXk9...
#> ── 3. R packages ───────────────────────────────────────────────────────────
#> ✔ cli 3.6.3 [core]
#> ✔ data.table 1.15.4 [core]
#> ✔ survival 3.7.0 [assoc_coxph]
#> ✔ forestploter 1.1.1 [plot_forest]
#> ...
#> ───────────────────────────────────────────────────────────────────────────
#> ✔ 15 passed
#> ! 2 optional / warning
For programmatic use (e.g. inside scripts or CI), set
verbose = FALSE and inspect the returned list:
result <- ops_setup(verbose = FALSE)
result$summary
#> $pass
#> [1] 15
#> $warn
#> [1] 2
#> $fail
#> [1] 0
# Gate the rest of your script on a clean environment
stopifnot(result$summary$fail == 0)
Individual checks can be disabled when only a subset is needed:
# Check R package dependencies only (skip dx and RAP auth)
ops_setup(check_dx = FALSE, check_auth = FALSE)
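To make the pass/fail summary more concrete, here is a minimal base-R sketch of the same kind of check. It is not ukbflow's implementation: check_env() and its arguments are hypothetical names, and it only looks for a CLI on the PATH and tests whether packages can be loaded.

```r
# Hypothetical helper: NOT ukbflow's implementation, just the same idea in base R.
check_env <- function(packages = c("stats", "utils"), cli = "dx") {
  cli_path <- unname(Sys.which(cli))  # "" when the executable is not on the PATH
  pkg_ok <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  checks <- c(cli = nzchar(cli_path), pkg_ok)
  list(
    checks  = checks,
    summary = list(pass = sum(checks), fail = sum(!checks))
  )
}

res <- check_env()
res$summary$fail == 0  # TRUE only if the CLI and every package were found
```

Returning counts rather than stopping inside the check leaves the decision to the caller, which is what makes the stopifnot() gate possible.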
ops_toy() — Synthetic UKB Data
ops_toy() generates a realistic but entirely synthetic
dataset that mimics the structure of UKB phenotype data on the RAP. Use
it to develop and test derive_*, assoc_*, and
plot_* functions without needing real UKB data access.
Cohort scenario
The default "cohort" scenario produces a wide
participant-level table that covers all major UKB data domains:
dt <- ops_toy()
#> ✔ ops_toy: 1000 participants | 75 columns | scenario = "cohort" | seed = 42
dim(dt)
#> [1] 1000 75
names(dt)
#> [1] "eid" "p31" "p34" "p53_i0"
#> [5] "p21022" "p21001_i0" "p20116_i0" "p1558_i0"
#> ...
Column groups included:
| Group | Columns |
|---|---|
| Demographics | eid, p31, p34, p53_i0, p21022 |
| Covariates | p21001_i0, p20116_i0, p1558_i0, p21000_i0, p22189, p54_i0 |
| Genetic PCs | p22009_a1 – p22009_a10 |
| Self-report disease | p20002_i0_a0 – a4, p20008_i0_a0 – a4 |
| Self-report cancer | p20001_i0_a0 – a4, p20006_i0_a0 – a4 |
| HES | p41270 (JSON array), p41280_a0 – a8 |
| Cancer registry | p40006_i0 – i2, p40011_i0 – i2, p40012_i0 – i2, p40005_i0 – i2 |
| Death registry | p40001_i0, p40002_i0_a0 – a2, p40000_i0 |
| First occurrence | p131742 |
| GRS columns | grs_bmi, grs_raw, grs_finngen |
| Messy columns | messy_allna, messy_empty, messy_label |
The messy columns deliberately stress-test
derive_missing() and ops_na() against common
data quality issues (all-NA columns, empty strings, non-standard missing
labels).
Feed the output directly into the derive pipeline:
dt <- ops_toy()
dt <- derive_missing(dt)
dt <- derive_covariate(dt,
as_numeric = "p21001_i0",
as_factor = c("p31", "p20116_i0")
)
Forest scenario
The "forest" scenario returns a results table matching
the output of assoc_coxph(), useful for developing and
testing plot_forest() without running a real Cox model:
dt_forest <- ops_toy(scenario = "forest")
#> ✔ ops_toy: 24 rows | 11 columns | scenario = "forest" | seed = 42
plot_forest(
data = dt_forest[model == "Fully adjusted"],
est = dt_forest[model == "Fully adjusted", HR],
lower = dt_forest[model == "Fully adjusted", CI_lower],
upper = dt_forest[model == "Fully adjusted", CI_upper]
)
ops_na() — Missing Value Diagnostics
ops_na() scans every column for NA
and empty strings (""), returning counts
and percentages sorted by missingness. Counting "" as
missing is intentional — UKB exports frequently use empty strings as
placeholders for absent text values, so ops_na() reports
effective missingness rather than a plain is.na()
count. It is designed to be called before derive_missing()
to understand the data quality profile of a freshly extracted UKB
dataset.
dt <- ops_toy()
ops_na(dt)
#> ── ops_na ──────────────────────────────────────────────────────────────────
#> ℹ 1000 rows | 65 columns | threshold = 0%
#> ✖ messy_allna 1000 / 1000 (100.00%)
#> ✖ p41280_a4 1000 / 1000 (100.00%)
#> ✖ p20002_i0_a4 976 / 1000 ( 97.60%)
#> ✖ p131742 916 / 1000 ( 91.60%)
#> ...
#> ────────────────────────────────────────────────────────────────────────────
#> ✖ 41 columns ≥ 10% missing
#> ✔ 24 columns complete (0% missing)
Columns with ≥ 10% missing are flagged in red (✖); those
between 0% and 10% in yellow (!). The summary block
(totals) is always printed regardless of the threshold
setting.
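The idea of effective missingness (NA plus empty strings) can be sketched in base R. This is only an approximation for illustration, not the source of ops_na(); effective_missing() and the toy data frame are hypothetical.

```r
# Illustrative only: a base-R approximation, not ops_na()'s source code.
effective_missing <- function(df) {
  n_na <- vapply(df, function(x) {
    # "" is only treated as missing in character columns
    empty <- is.character(x) & !is.na(x) & x == ""
    sum(is.na(x) | empty)
  }, integer(1))
  out <- data.frame(
    column = names(df),
    n_na   = unname(n_na),
    pct_na = round(100 * unname(n_na) / nrow(df), 2),
    stringsAsFactors = FALSE
  )
  out[order(-out$pct_na), ]  # most-missing columns first
}

toy <- data.frame(
  eid   = 1:4,
  label = c("a", "", NA, "d"),
  bmi   = c(21.5, NA, 30.1, 27.0),
  stringsAsFactors = FALSE
)
effective_missing(toy)
```

A plain is.na() count would report one missing value in label; counting "" as well gives two.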
Controlling CLI output with threshold
Use threshold to silence low-missingness columns from
the per-column listing when the dataset has many columns. The summary
block and returned data.table are always complete.
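The sketch below mimics the difference between what is listed and what is returned, using a small hand-made table. It assumes threshold is a percentage (the CLI header prints threshold = 0%) and that the comparison is inclusive. Both are assumptions, and the p21001_i0 figure is invented.

```r
# Hypothetical summary table shaped like ops_na()'s return value
# (column, n_na, pct_na); the p21001_i0 row is invented for illustration.
miss <- data.frame(
  column = c("messy_allna", "p131742", "p21001_i0", "eid"),
  n_na   = c(1000L, 916L, 12L, 0L),
  pct_na = c(100, 91.6, 1.2, 0),
  stringsAsFactors = FALSE
)

threshold <- 10  # assumed to be a percentage

# Rows a per-column listing would show at this threshold
listed <- miss[miss$pct_na >= threshold, ]
print(listed)

# The full table still contains every column
nrow(miss)
```

With ukbflow attached, the corresponding call would presumably be ops_na(dt, threshold = 10).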
Programmatic use
ops_na() returns a data.table invisibly,
regardless of threshold:
result <- ops_na(dt, verbose = FALSE)
result
#> column n_na pct_na
#> <char> <int> <num>
#> 1: messy_allna 1000 100.0
#> 2: p41280_a4 1000 100.0
#> ...
# Identify columns to drop before modelling
cols_to_drop <- result[pct_na > 90, column]
dt[, (cols_to_drop) := NULL]
ops_snapshot() — Pipeline Checkpoints
ops_snapshot() records a lightweight summary of your
dataset at each processing step and stores it in the session cache. Each
subsequent call automatically computes deltas (Δ) against the previous
snapshot, making it easy to track how rows, columns, and missingness
change through the pipeline.
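As a rough mental model of the delta tracking, the following base-R sketch keeps a history in an environment and compares each new summary with the previous row. It is not ukbflow's internals; .snap and record_snapshot() are hypothetical names.

```r
# Hypothetical mini-recorder, NOT ukbflow's internals.
.snap <- new.env()
.snap$history <- NULL

record_snapshot <- function(df, label) {
  now <- data.frame(
    label     = label,
    nrow      = nrow(df),
    ncol      = ncol(df),
    n_na_cols = sum(vapply(df, anyNA, logical(1))),
    stringsAsFactors = FALSE
  )
  prev <- .snap$history
  if (!is.null(prev)) {
    last  <- prev[nrow(prev), ]
    delta <- c(rows    = now$nrow - last$nrow,
               cols    = now$ncol - last$ncol,
               na_cols = now$n_na_cols - last$n_na_cols)
  } else {
    delta <- NULL  # the first snapshot has nothing to compare against
  }
  .snap$history <- rbind(prev, now)  # append to the session history
  invisible(delta)
}

df <- data.frame(id = 1:5, x = c(1, NA, 3, 4, 5))
record_snapshot(df, "raw")
d <- record_snapshot(df[df$id > 2, ], "filtered")
d  # change in rows, cols, and NA-containing columns since "raw"
```

Only summary numbers are stored, not the data itself, which is why recording a checkpoint stays cheap.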
Recording snapshots
dt <- ops_toy()
ops_snapshot(dt, label = "raw")
#> ── snapshot: raw ───────────────────────────────────────────────────────────
#> rows 1,000
#> cols 65
#> NA cols 41
#> size 0.61 MB
#> ────────────────────────────────────────────────────────────────────────────
dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
#> ── snapshot: after_derive_missing ──────────────────────────────────────────
#> rows 1,000 (= 0)
#> cols 65 (= 0)
#> NA cols 43 (+2)
#> size 0.61 MB (= 0)
#> ────────────────────────────────────────────────────────────────────────────
dt <- dt[p31 == "Female"]
ops_snapshot(dt, label = "female_only")
#> ── snapshot: female_only ───────────────────────────────────────────────────
#> rows 570 (-430)
#> cols 65 (= 0)
#> NA cols 43 (= 0)
#> size 0.36 MB (-0.25 MB)
#> ────────────────────────────────────────────────────────────────────────────
When label is omitted, snapshots are named
snapshot_1, snapshot_2, etc. automatically.
Labels should be unique within a session: if the same label is used
twice, the history row is appended again but the stored column list is
overwritten — which can cause ops_snapshot_cols() and
ops_snapshot_diff() to behave unexpectedly.
Viewing the full history
Call ops_snapshot() with no arguments to print and
return the complete history data.table:
ops_snapshot()
#> ── ops_snapshot history ────────────────────────────────────────────────────
#> idx label timestamp nrow ncol n_na_cols size_mb
#> 1: 1 raw 14:30:01 1000 65 41 0.61
#> 2: 2 after_derive_missing 14:30:05 1000 65 43 0.61
#> 3: 3 female_only 14:30:08 570 65 43 0.36
#> ────────────────────────────────────────────────────────────────────────────
Silent recording
Set verbose = FALSE to record a snapshot without
printing anything — useful inside functions or automated scripts:
ops_snapshot(dt, label = "pre_assoc", verbose = FALSE)
Resetting history
ops_snapshot(reset = TRUE)
#> ✔ Snapshot history cleared.
Session scope: the snapshot history lives in ukbflow’s session cache and is
cleared when the R session ends or when ops_snapshot(reset = TRUE) is called.
It is not written to disk.
Snapshot Helpers
ops_snapshot_cols() — column names at a checkpoint
Returns the column names recorded at a given snapshot label, minus
protected columns (eid, sex, age,
age_at_recruitment, and any registered via
ops_set_safe_cols()). The primary use is building a drop
vector after the raw columns are no longer needed.
raw_cols <- ops_snapshot_cols("raw")
# raw_cols is a character vector of droppable column names
Pass keep to protect additional columns beyond the
defaults:
raw_cols <- ops_snapshot_cols("raw", keep = "p53_i0")
ops_snapshot_diff() — compare two checkpoints
Returns lists of columns added and removed between two snapshots —
useful for auditing what derive_* functions produced.
result <- ops_snapshot_diff("raw", "after_derive_missing")
result$added # columns added in this step
result$removed # columns dropped in this step
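Conceptually, the comparison is a pair of set differences over the column names stored for each label. A minimal base-R sketch, with hypothetical column lists (bmi_cat is an invented derived column):

```r
# Column names as they might be stored for two checkpoints (hypothetical values).
cols_raw     <- c("eid", "p31", "p21001_i0", "messy_label")
cols_derived <- c("eid", "p31", "p21001_i0", "bmi_cat")

diff_cols <- list(
  added   = setdiff(cols_derived, cols_raw),  # present only in the later snapshot
  removed = setdiff(cols_raw, cols_derived)   # present only in the earlier snapshot
)
diff_cols
```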
ops_snapshot_remove() — drop raw columns after
deriving
Removes the raw columns captured at a snapshot from
data, keeping any derived columns added since. Built-in
safe columns (eid, etc.) and columns supplied in
keep are always retained.
# After deriving, drop the original raw columns
dt <- ops_snapshot_remove(dt, from = "raw")
#> ✔ ops_snapshot_remove: dropped 60 raw columns, 15 remaining.
For data.table input the operation is by reference
(in-place); for data.frame input a new
data.table is returned and the original is not
modified.
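The drop logic described here (raw columns minus protected ones) can be sketched on a plain data.frame. drop_raw_cols() is a hypothetical stand-in, not ops_snapshot_remove() itself. It mirrors the data.frame case, where a new object is returned and the input is left untouched.

```r
# Hypothetical stand-in for ops_snapshot_remove(); uses a plain data.frame.
drop_raw_cols <- function(df, raw_cols, safe = "eid", keep = character()) {
  to_drop <- setdiff(raw_cols, c(safe, keep))      # protected names are never dropped
  df[, setdiff(names(df), to_drop), drop = FALSE]  # returns a new object
}

# Columns present at the "raw" checkpoint
raw_cols <- c("eid", "p31", "p53_i0", "p21001_i0")

df <- data.frame(
  eid       = 1:3,
  p31       = c("Female", "Male", "Female"),
  p53_i0    = as.Date("2008-01-01") + 0:2,
  p21001_i0 = c(22.1, 27.4, 30.3),
  bmi       = c(22.1, 27.4, 30.3)  # derived after the "raw" checkpoint
)

names(drop_raw_cols(df, raw_cols, keep = "p53_i0"))
```

The derived column bmi survives because it was never part of the raw column list.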
ops_set_safe_cols() — register study-specific protected
columns
Adds column names to the session safe list so they are never dropped
by ops_snapshot_cols() or
ops_snapshot_remove().
ops_set_safe_cols(c("date_baseline", "age_at_recruitment"))
# Clear registered safe cols
ops_set_safe_cols(reset = TRUE)
ops_withdraw() — Exclude Withdrawn Participants
UK Biobank periodically issues withdrawal files listing participants
who have revoked consent. ops_withdraw() reads the
headerless single-column CSV supplied by UKB and removes matching rows
from your dataset. Two snapshots (before_withdraw /
after_withdraw) are recorded automatically.
dt <- ops_withdraw(dt, file = "w854944_20260310.csv")
#> ── snapshot: before_withdraw ───────────────────────────────────────────────
#> rows 502,492
#> ...
#> ── snapshot: after_withdraw ────────────────────────────────────────────────
#> rows 502,489 (-3)
#> ...
#> ℹ Withdrawal file: w854944_20260310.csv (312 IDs)
#> ✖ Excluded: 3 participants found in data
#> ✔ Remaining: 502,489 participants
Run this immediately after loading your extracted dataset, before any
derive_* steps, so withdrawn participants never enter the
analysis.
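The exclusion step itself is simple to reason about. The base-R sketch below reads a headerless single-column file and filters on eid. It is a hypothetical stand-in for what ops_withdraw() is described as doing, with made-up IDs written to a temporary file.

```r
# Hypothetical stand-in for the exclusion step; the file contents are made up.
withdraw_file <- tempfile(fileext = ".csv")
writeLines(c("1000002", "1000005", "9999999"), withdraw_file)  # headerless, one ID per line

cohort <- data.frame(eid = 1000001:1000006, bmi = c(22, 25, 31, 28, 24, 27))

# header = FALSE because the file has no column name row
withdrawn <- read.csv(withdraw_file, header = FALSE, col.names = "eid")$eid
n_before  <- nrow(cohort)
cohort    <- cohort[!cohort$eid %in% withdrawn, ]

c(ids_in_file = length(withdrawn),
  excluded    = n_before - nrow(cohort),
  remaining   = nrow(cohort))
```

The file can list more IDs than are excluded, because only IDs that actually appear in the dataset remove a row. That is consistent with the example output, where 312 IDs in the file lead to 3 exclusions.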
Typical Workflow
The ops_* functions wrap around the core pipeline. Check the environment
and data quality first, record checkpoints between derive steps, and
review the history at the end:
library(ukbflow)
# 1. Verify environment before starting
ops_setup()
# 2. Generate test data (or extract real data from RAP)
dt <- ops_toy()
# 3. Inspect data quality before processing
ops_na(dt)
# 4. Run pipeline with checkpoints
ops_snapshot(dt, label = "raw")
dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
dt <- derive_covariate(dt,
as_numeric = "p21001_i0",
as_factor = c("p31", "p20116_i0")
)
ops_snapshot(dt, label = "after_derive_covariate")
# 5. Review full pipeline history
ops_snapshot()
Getting Help
- ?ops_setup, ?ops_toy, ?ops_na, ?ops_snapshot
- ?ops_snapshot_cols, ?ops_snapshot_diff, ?ops_snapshot_remove, ?ops_set_safe_cols
- ?ops_withdraw
- vignette("get-started"): end-to-end pipeline overview
- vignette("derive"): disease phenotype derivation
- GitHub Issues
