RAP-native R workflow for UK Biobank analysis
[!NOTE] 2026-04: ukbflow is now available on CRAN! Install with
install.packages("ukbflow").
Overview
ukbflow provides a streamlined, RAP-native R workflow for UK Biobank analysis, from phenotype extraction and disease derivation to association analysis and publication-quality figures. All functions are designed to run within the UK Biobank Research Analysis Platform (RAP), in compliance with the 2024+ data policy requiring individual-level data to remain in the cloud.
Installation
# From CRAN (recommended)
install.packages("ukbflow")
# Latest development version from GitHub
pak::pkg_install("evanbio/ukbflow")
# or
remotes::install_github("evanbio/ukbflow")
Requirements: R ≥ 4.1 · dxpy (dx-toolkit, required for RAP interaction)
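Before installing, it can help to confirm that the session meets the requirements stated above. The sketch below uses base R only. It assumes that dx-toolkit exposes its command-line client on the PATH as `dx`; the helper name `check_requirements` is hypothetical and is not part of ukbflow.

```r
# Hypothetical helper (not part of ukbflow): checks the stated requirements
# with base R only.
check_requirements <- function(min_r = "4.1") {
  # getRversion() returns a version object that compares against a string
  r_ok <- getRversion() >= min_r
  # Assumption: dx-toolkit installs a `dx` executable on the PATH.
  # Sys.which() returns "" when the executable cannot be found.
  dx_path <- unname(Sys.which("dx"))
  list(r_ok = r_ok, dx_found = nzchar(dx_path), dx_path = dx_path)
}

status <- check_requirements()
if (!status$r_ok) message("R >= 4.1 is required; found ", getRversion())
if (!status$dx_found) message("`dx` not found; install dx-toolkit (dxpy) first")
```

Finding `dx` on the PATH only shows that the client is installed. It does not show that the session is authenticated to RAP.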
Key Features
- Connection: Authenticate to RAP via dx-toolkit and manage project selection (auth_login, auth_select_project)
- Data Access: Retrieve phenotype data from the UKB dataset on RAP; monitor asynchronous jobs (fetch_metadata, extract_batch, job_wait)
- Data Processing: Harmonize multi-source records and derive an analysis-ready cohort: decode field IDs and value codes, build ICD-10 case definitions, compute follow-up time (decode_names, decode_values, derive_icd10, derive_followup, derive_case)
- Association Analysis: Cox, logistic, and linear regression with an automatic three-model adjustment framework; subgroup analysis, dose-response trend, and Fine-Gray competing risks (assoc_coxph, assoc_logistic, assoc_subgroup)
- Genomic Scoring: Distributed plink2 scoring on RAP worker nodes: BGEN → PGEN conversion, multi-chromosome GRS computation, and standardisation (grs_bgen2pgen, grs_score, grs_standardize)
- Visualization: Publication-ready forest plots and Table 1, saved in all major formats at 300 dpi (plot_forest, plot_tableone)
- Utilities: Verify the environment before analysis; generate synthetic UKB-like data for development; diagnose missing values; track cohort changes across pipeline steps; exclude withdrawn participants (ops_setup, ops_toy, ops_na, ops_snapshot, ops_withdraw)
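To make the follow-up step in the list above concrete, here is a minimal base-R sketch of one common convention: follow-up ends at the earliest of the event date, the death date, or an administrative censoring date, and death is treated as censoring. This illustrates the general technique under those assumptions. It does not describe how derive_followup is implemented. The function `followup_years` and the toy columns are hypothetical.

```r
# Hypothetical illustration in base R; not the ukbflow implementation.
followup_years <- function(baseline, event, death, censor_date) {
  # Work on day counts so pmin() can skip missing dates via na.rm = TRUE;
  # censor_date is never missing, so every row gets an end date.
  end_num <- pmin(as.numeric(event), as.numeric(death),
                  as.numeric(censor_date), na.rm = TRUE)
  # An event counts only if it happens on or before the end of follow-up.
  status <- as.integer(!is.na(event) & as.numeric(event) <= end_num)
  # Convert days to years using the average year length.
  years <- (end_num - as.numeric(baseline)) / 365.25
  data.frame(status = status, years = years)
}

toy <- data.frame(
  baseline = as.Date(c("2008-01-01", "2009-06-15", "2010-03-01")),
  event    = as.Date(c("2015-01-01", NA, NA)),
  death    = as.Date(c(NA, "2012-06-15", NA))
)
out <- followup_years(toy$baseline, toy$event, toy$death,
                      censor_date = as.Date("2022-10-31"))
print(out)
```

In the toy data, the first participant has an event, the second is censored at death, and the third is censored administratively. Analyses that treat death as a competing risk rather than as censoring need a separate status code for death.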
Quick Start
library(ukbflow)
# Simulate UKB-style data locally (on RAP: replace with extract_batch() + job_wait())
data <- ops_toy(n = 5000, seed = 2026) |>
  derive_missing()
# Derive lung cancer outcome (ICD-10 C34) and follow-up time
data <- data |>
  derive_icd10(name = "lung", icd10 = "C34",
               source = c("cancer_registry", "hes")) |>
  derive_followup(name = "lung",
                  event_col = "lung_icd10_date",
                  baseline_col = "p53_i0",
                  censor_date = as.Date("2022-10-31"),
                  death_col = "p40000_i0")
# Define exposure: ever vs. never smoker
data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]
# Cox regression: smoking → lung cancer (3-model adjustment)
res <- assoc_coxph(data,
                   outcome_col = "lung_icd10",
                   time_col = "lung_followup_years",
                   exposure_col = "smoking_ever",
                   covariates = c("p21022", "p31", "p22189"))
# Forest plot
res_df <- as.data.frame(res)
plot_forest(
  data = res_df,
  est = res_df$HR,
  lower = res_df$CI_lower,
  upper = res_df$CI_upper,
  ci_column = 2L
)
Documentation
- Get Started: Installation and end-to-end workflow
- Function Reference: Complete API documentation
- Vignettes: Module-by-module tutorials
Getting Help
- Browse the function reference for detailed documentation
- Read vignettes for step-by-step examples
- Report issues on GitHub
