Overview
The audit_* functions create a lightweight analysis
manifest. They are not a workflow engine: the goal is to add small audit
records at natural points in an ordinary ukbflow analysis, using objects
that already exist in the script.
A typical audit captures:
- the analysis name, ukbflow version, session information, and optional RAP context;
- the UKB field IDs requested for extraction;
- dataset snapshots at key stages, including row count, column count, missingness count, object size, and complete column names;
- derived phenotype summaries from standard
derive_*column names; - association result tables returned by
assoc_*; - DNAnexus job IDs and lightweight job metadata when available;
- a JSON manifest that can be saved with the analysis outputs.
The examples below use synthetic data from ops_toy() and
can be developed without RAP access. In a real RAP project, the same
audit calls sit next to extract_batch(),
job_result(), derive_*(), and
assoc_*() calls.
Start an Audit
Start one audit object near the beginning of the analysis.
library(ukbflow)
aud <- audit_start("smoking_lung_cancer")
audaudit_start() records the analysis name, start time,
ukbflow version, R session information, and current DNAnexus
user/project when available. If the dx CLI or RAP context is
unavailable, those fields are recorded as NA without
failing.
Record Field IDs
Field IDs are usually already stored in a vector before extraction. Reuse that object directly in the audit.
fields <- c(
31, 53, 21022, 21001, 20116, 1558, 22189, 54,
22009, 20001, 20006, 40006, 40011, 40012, 40005, 40000
)
aud <- audit_fields(aud, fields, label = "analysis_fields")
# In a RAP workflow this same vector can be used for extraction:
# job_id <- extract_batch(field_id = fields, file = "lung_analysis_pheno")
# aud <- audit_job(aud, job_id, "phenotype_extraction")The manifest stores the declared field IDs, an optional dataset name, a label, the number of fields, and a timestamp.
audit_job() records the DNAnexus job ID and any
lightweight metadata available from
dx describe job-XXXX --json, such as job state and output
file ID. It does not estimate RAP cost; use the DNAnexus / RAP billing
interface for cost review.
Snapshot Data States
Use snapshots at points where the dataset changes meaningfully: raw data, after phenotype derivation, after exclusions, and immediately before modelling.
data <- ops_toy(scenario = "cohort", n = 1000, seed = 2026)
aud <- audit_snapshot(aud, data, "raw")
data <- derive_missing(data)
aud <- audit_snapshot(aud, data, "after_missing")Each audit snapshot stores the full column names. Retrieve them by label when you need to inspect or compare the data structure recorded in the manifest.
raw_cols <- audit_cols(aud, "raw")
head(raw_cols)Record Phenotype Summaries
After running derive_* functions,
audit_pheno() can summarise phenotype columns that follow
ukbflow’s standard naming convention. It only needs the audit object,
the data, and the phenotype prefix.
data <- derive_selfreport(
data,
name = "lung_cancer",
regex = "lung cancer",
field = "cancer"
)
data <- derive_icd10(
data,
name = "lung",
icd10 = "^C3[34]",
match = "regex",
source = "cancer_registry",
behaviour = 3L
)
data <- derive_case(
data,
name = "lung",
selfreport_col = "lung_cancer_selfreport",
selfreport_date_col = "lung_cancer_selfreport_date"
)
data <- derive_timing(data, name = "lung", baseline_col = "p53_i0")
data <- derive_followup(
data,
name = "lung",
event_col = "lung_date",
baseline_col = "p53_i0",
censor_date = as.Date("2022-10-31"),
death_col = "p40000_i0",
lost_col = FALSE
)
aud <- audit_pheno(aud, data, "lung")
aud <- audit_snapshot(aud, data, "after_phenotype")audit_pheno() records whichever components exist:
self-report, ICD-10, per-source ICD-10 columns, combined status/date,
timing, and follow-up. Missing components are marked as not present
rather than treated as errors.
Record Cohort Assembly
Audit snapshots work well for cohort exclusions because they record row count, column count, missingness count, and column names at each stage.
aud <- audit_snapshot(aud, data, "before_exclusions")
data <- data[lung_timing != 1L | is.na(lung_timing)]
aud <- audit_snapshot(aud, data, "after_excluding_prevalent")
data[, smoking_ever := factor(
ifelse(p20116_i0 == "Never", "Never", "Ever"),
levels = c("Never", "Ever")
)]
data <- data[
!is.na(smoking_ever) &
!is.na(p31) &
!is.na(p21022) &
!is.na(p1558_i0) &
!is.na(p54_i0)
]
aud <- audit_snapshot(aud, data, "analysis_ready")For UKB withdrawal files, run ops_withdraw() early in
the pipeline and then record an audit snapshot.
ops_withdraw() itself records before/after snapshots in the
session-level ops_snapshot() history.
withdraw_file <- tempfile(fileext = ".csv")
writeLines(as.character(data$eid[1:3]), withdraw_file)
data <- ops_withdraw(data, file = withdraw_file)
aud <- audit_snapshot(aud, data, "after_withdraw")Record Model Results
Association result tables are usually small and already contain the
most useful model summary. audit_model() stores the result
table directly. If the covariate vector already exists in your script,
pass it along.
covars <- c(
"p21022",
"p31",
"p1558_i0",
"p54_i0"
)
res <- assoc_coxph(
data = data,
outcome_col = "lung_status",
time_col = "lung_followup_years",
exposure_col = "smoking_ever",
covariates = covars
)
aud <- audit_model(
aud,
result = res,
label = "smoking_lung_cox",
covariates = covars
)The model record stores the full result table, inferred method, exposures, model labels, optional covariates, and a timestamp.
Review and Write the Manifest
Use summary() for a short directory-style overview.
summary(aud)Write the manifest as JSON alongside the analysis outputs.
audit_write(aud, "ukbflow-audit.json", overwrite = TRUE)The resulting JSON contains the audit metadata, extraction field records, snapshots, phenotype summaries, model result records, and session information.
Suggested Audit Points
For most analyses, these are enough:
-
audit_start()after loading ukbflow. -
audit_fields()next to the field vector used for extraction. -
audit_snapshot()after loading raw data. -
audit_snapshot()andaudit_pheno()after phenotype derivation. -
audit_snapshot()after each major cohort exclusion. -
audit_snapshot()immediately before modelling. -
audit_model()after each main association result. -
audit_job()next to long-running RAP jobs when a job ID is available. -
audit_write()at the end of the script.
Keep the audit close to the real workflow. Do not duplicate logic just for the manifest; record objects that already exist in the analysis.
