
RAP-native R workflow for UK Biobank analysis



[!NOTE] 🎉 2026-04: ukbflow is now available on CRAN! Install with install.packages("ukbflow").

Overview

ukbflow is an R-native, RAP-aware workflow system for UK Biobank controlled-data analysis. It provides a coherent workflow layer for phenotype extraction, disease derivation, association analysis, audit records, and publication-quality outputs. It is designed to support workflows on the UK Biobank Research Analysis Platform (RAP) under the 2024+ data policy requiring individual-level data to remain in the cloud; users remain responsible for ensuring that only permitted summary-level outputs are downloaded.

Installation

# From CRAN (recommended)
install.packages("ukbflow")

# Latest development version from GitHub
pak::pkg_install("evanbio/ukbflow")

# or
remotes::install_github("evanbio/ukbflow")

Requirements: R ≥ 4.1 · dxpy (dx-toolkit, required for RAP interaction)

pip install dxpy

GRS workflows additionally require plink2 to be available in the RAP job environment.
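The requirements above can be checked from an R session before starting. The following is a minimal base-R sketch, not ukbflow's own ops_setup(): the helper name check_prereqs is hypothetical, and it assumes that the dx command-line tool (installed by dxpy) and plink2 are expected on the PATH.

```r
# Hypothetical helper for illustration; not part of ukbflow.
# Reports whether each prerequisite appears to be available locally.
check_prereqs <- function() {
  c(
    r_ok   = getRversion() >= "4.1.0",              # native pipe |> needs R >= 4.1
    dx     = nzchar(unname(Sys.which("dx"))),       # dx CLI from dxpy (assumed on PATH)
    plink2 = nzchar(unname(Sys.which("plink2")))    # only needed for GRS workflows
  )
}

status <- check_prereqs()
print(status)
```

A FALSE for plink2 on a local machine is not necessarily a problem, since GRS scoring is described as running inside RAP jobs rather than locally.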

Key Features

Connection: Authenticate to RAP via dx-toolkit and manage project selection (auth_login, auth_select_project)

Data Access: Retrieve phenotype data from the UKB dataset on RAP; monitor asynchronous jobs (fetch_metadata, extract_batch, job_wait)

Data Processing: Harmonize multi-source records and derive an analysis-ready cohort: decode field IDs and value codes, build ICD-10 case definitions, compute follow-up time (decode_names, decode_values, derive_icd10, derive_followup, derive_case)

Association Analysis: Cox, logistic, and linear regression with an automatic three-model adjustment framework; subgroup analysis, dose-response trend, and Fine-Gray competing risks (assoc_coxph, assoc_logistic, assoc_subgroup)

Genomic Scoring: Distributed plink2 scoring on RAP worker nodes: BGEN → PGEN conversion, multi-chromosome GRS computation, and standardization (grs_bgen2pgen, grs_score, grs_standardize)

Visualization: Publication-ready forest plots and Table 1 outputs in common manuscript formats (plot_forest, plot_tableone)

Utilities: Verify the environment before analysis; search approved project fields and look up common UKB field IDs; generate synthetic UKB-like data for development; diagnose missing values; track cohort changes across pipeline steps; exclude withdrawn participants (ops_setup, ops_fields, ops_fields_common, ops_toy, ops_na, ops_snapshot, ops_withdraw)

Analysis Audit: Create lightweight JSON manifests for reproducibility: field IDs, dataset snapshots, derived phenotype summaries, model result tables, RAP jobs, and session metadata (audit_start, audit_fields, audit_snapshot, audit_pheno, audit_model, audit_job, audit_write)
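To make the idea of a three-model adjustment framework concrete, here is a base-R sketch using logistic regression on simulated data. It does not call ukbflow and does not reproduce its implementation. The choice of models (unadjusted, age and sex, fully adjusted) is an assumption made for illustration, as are all variable names.

```r
# Simulated data; every column name here is illustrative only.
set.seed(1)
n <- 2000
toy <- data.frame(
  age         = rnorm(n, mean = 57, sd = 8),
  sex         = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  deprivation = rnorm(n)
)
toy$exposure <- rbinom(n, 1, plogis(-0.5 + 0.02 * (toy$age - 57)))
toy$case     <- rbinom(n, 1, plogis(-2 + 0.5 * toy$exposure + 0.03 * (toy$age - 57)))

# Assumed nesting of covariate sets: each model adds to the previous one.
covariate_sets <- list(
  unadjusted     = character(0),
  age_sex        = c("age", "sex"),
  fully_adjusted = c("age", "sex", "deprivation")
)

# Fit one logistic model and return the exposure odds ratio with a Wald 95% CI.
fit_one <- function(covs) {
  f   <- reformulate(c("exposure", covs), response = "case")
  fit <- glm(f, data = toy, family = binomial())
  est <- coef(summary(fit))["exposure", ]
  z   <- qnorm(0.975)
  data.frame(
    OR       = exp(est[["Estimate"]]),
    CI_lower = exp(est[["Estimate"]] - z * est[["Std. Error"]]),
    CI_upper = exp(est[["Estimate"]] + z * est[["Std. Error"]])
  )
}

res <- do.call(rbind, lapply(covariate_sets, fit_one))
res$model <- names(covariate_sets)
print(res)
```

Fitting the same exposure under nested covariate sets and stacking the results yields one row per model, which is the shape a forest plot or results table typically expects.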

Supported Phenotype Sources

ukbflow currently focuses on common UK Biobank disease-phenotype sources that are routinely available in phenotype extraction workflows:

| Source | Code system / field type | Main function(s) |
|---|---|---|
| Self-reported illness / cancer | UKB fields 20002 / 20001 | derive_selfreport() |
| HES inpatient diagnoses | ICD-10, any-position field 41270 with dates from 41280; primary/secondary position is not currently configurable | derive_hes() |
| First Occurrence fields | UKB precomputed p131xxx dates | derive_first_occurrence() |
| Cancer registry | ICD-10, histology, behaviour, diagnosis date | derive_cancer_registry() |
| Death registry | ICD-10 primary / secondary cause of death | derive_death_registry() |
| Multi-source ICD-10 phenotype | HES, death, First Occurrence, cancer registry | derive_icd10() |
| Final case definition | Self-report plus ICD-10-derived status/date | derive_case() |

ICD-9, OPCS-4, Read v2, CTV3, and other GP / primary-care code systems are not part of the current public API.
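The general idea behind a multi-source ICD-10 phenotype can be sketched in base R as prefix matching on codes, followed by taking the earliest matching date per participant. This is an illustration only, not how derive_icd10() is implemented. The long-format records table, the helper name derive_prefix_case, and the assumption that codes are stored without dots (for example C341) are all made up for the example.

```r
# Hypothetical long-format records: one row per diagnosis per source.
records <- data.frame(
  eid    = c(1, 1, 2, 3, 3),
  source = c("hes", "cancer_registry", "hes", "death", "hes"),
  icd10  = c("C341", "C349", "I251", "C34", "J449"),
  date   = as.Date(c("2015-03-01", "2014-11-20", "2012-06-05",
                     "2019-01-15", "2010-02-02")),
  stringsAsFactors = FALSE
)

# Illustrative helper (not a ukbflow function): flag participants with any
# code starting with `prefix` in the chosen sources, and keep the earliest date.
derive_prefix_case <- function(records, prefix, sources) {
  hit  <- startsWith(records$icd10, prefix) & records$source %in% sources
  hits <- records[hit, ]
  hits <- hits[order(hits$eid, hits$date), ]
  first <- hits[!duplicated(hits$eid), c("eid", "date")]

  out <- data.frame(eid = sort(unique(records$eid)))
  idx <- match(out$eid, first$eid)
  out$case      <- !is.na(idx)
  out$case_date <- first$date[idx]   # NA for non-cases
  out
}

lung <- derive_prefix_case(records, prefix = "C34",
                           sources = c("cancer_registry", "hes"))
print(lung)
```

In this toy input, participant 3 has a C34 code only in the death source, which is excluded from the selected sources, so that participant is not flagged as a case.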

Limitations

ukbflow is a workflow system, not a replacement for the underlying RAP and statistical tools. It orchestrates and records common analysis steps around dx-toolkit / DNAnexus jobs, R modelling functions, plotting packages, and PLINK2-based GRS workflows. It does not provide a general DAG scheduler, estimate RAP costs, replace the DNAnexus interface, or determine study design, covariate choice, phenotype validity, or causal interpretation. Current public phenotype helpers focus on the UKB sources listed above and do not cover GP / primary-care code systems.

Quick Start

library(ukbflow)

# Simulate UKB-style data locally (on RAP: replace with extract_batch() + job_wait())
data <- ops_toy(n = 5000, seed = 2026) |>
  derive_missing()

# Derive lung cancer outcome (ICD-10 C34) and follow-up time
data <- data |>
  derive_icd10(name = "lung", icd10 = "C34",
               source = c("cancer_registry", "hes")) |>
  derive_followup(name         = "lung",
                  event_col    = "lung_icd10_date",
                  baseline_col = "p53_i0",
                  censor_date  = as.Date("2022-10-31"),
                  death_col    = "p40000_i0")

# Define exposure: ever vs. never smoker
data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]

# Cox regression: smoking → lung cancer (3-model adjustment)
res <- assoc_coxph(data,
  outcome_col  = "lung_icd10",
  time_col     = "lung_followup_years",
  exposure_col = "smoking_ever",
  covariates   = c("p21022", "p31", "p22189"))

# Forest plot
res_df <- as.data.frame(res)
plot_forest(
  data      = res_df,
  est       = res_df$HR,
  lower     = res_df$CI_lower,
  upper     = res_df$CI_upper,
  ci_column = 2L
)
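The follow-up step in the example above takes an event date, a baseline date, a death date, and an administrative censoring date. One common way to combine these is sketched below in base R. It is an assumption about the general approach, not a description of derive_followup() internals. In particular, treating death as a censoring event and dividing days by 365.25 are illustrative choices, and the column names are invented.

```r
# Toy cohort: one event, one death without the event, one still under follow-up.
toy <- data.frame(
  baseline = as.Date(c("2008-05-01", "2009-01-15", "2010-07-30")),
  event    = as.Date(c("2015-03-01", NA, NA)),
  death    = as.Date(c(NA, "2018-06-10", NA))
)
censor_date <- as.Date("2022-10-31")

# Follow-up ends at the earliest of event, death, or administrative censoring.
end_date <- pmin(toy$event, toy$death, censor_date, na.rm = TRUE)

# Event indicator: 1 only if the event itself is what ended follow-up.
toy$event_flag <- as.integer(!is.na(toy$event) & toy$event <= end_date)

# Person-time in years (365.25 days per year is an assumed convention).
toy$followup_years <- as.numeric(end_date - toy$baseline) / 365.25

print(toy)
```

In a competing-risks analysis, death would usually be coded as a separate event type rather than folded into censoring, so the indicator above would need a third level.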

Documentation

Getting Help

License

MIT License © 2026 Yibin Zhou