Survival Analysis Setup for UKB Outcomes

Overview

After disease case definitions have been derived (see vignette("derive")), three additional functions prepare the data for time-to-event analysis:

Function	Output columns	Purpose
`derive_timing()`	`{name}_timing`	Classify prevalent vs. incident disease
`derive_age()`	`age_at_{name}`	Age at event (years)
`derive_followup()`	`{name}_followup_end`, `{name}_followup_years`	Follow-up end date and duration

Prerequisite: {name}_status and {name}_date must already be present; they are produced by the derivation steps described in vignette("derive"). The examples below assume the full disease derivation pipeline has been run on an ops_toy() dataset, so the baseline date column is p53_i0 and age at recruitment is p21022.

Step 1: Classify Timing — Prevalent vs. Incident

derive_timing() compares the disease date to the UKB baseline assessment date and assigns each participant to one of four categories:

Value	Meaning
`0`	No disease (`status` is `FALSE`)
`1`	Prevalent — disease date on or before baseline
`2`	Incident — disease date strictly after baseline
`NA`	Case with no recorded date; timing cannot be determined

library(ukbflow)

# Build on the derive pipeline from vignette("derive")
df <- ops_toy(n = 500)
df <- derive_missing(df)
df <- derive_covariate(df, as_factor = c("p31", "p20116_i0"))
df <- derive_selfreport(df, name = "dm", regex = "type 2 diabetes")
df <- derive_icd10(df, name = "dm", icd10 = "E11", source = c("hes", "death"))
df <- derive_case(df, name = "dm")

# Uses {name}_status and {name}_date by default
df <- derive_timing(df, name = "dm", baseline_col = "p53_i0")

Supply explicit column names when the defaults do not apply:

df <- derive_timing(df,
  name         = "dm",
  status_col   = "dm_status",
  date_col     = "dm_date",
  baseline_col = "p53_i0"
)

Call derive_timing() once per timing variable needed: for example, once for the combined case definition and once per individual source (HES, self-report, etc.).
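The classification rule from the table above can be sketched in base R. This is only an illustration of the rule as documented here, not the package's implementation; classify_timing() and the toy columns are hypothetical names introduced for the example.

```r
# Hypothetical helper (not part of ukbflow): mirrors the 0 / 1 / 2 / NA rule.
classify_timing <- function(status, event_date, baseline_date) {
  timing <- rep(NA_integer_, length(status))
  timing[which(!status)] <- 0L                                   # no disease
  has_date <- status & !is.na(event_date)
  timing[which(has_date & event_date <= baseline_date)] <- 1L    # prevalent
  timing[which(has_date & event_date >  baseline_date)] <- 2L    # incident
  timing                                  # cases without a date remain NA
}

# Toy data, made up for illustration
toy <- data.frame(
  status   = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  date     = as.Date(c(NA, "2005-03-01", "2010-06-15", "2012-01-10", NA)),
  baseline = as.Date(rep("2010-06-15", 5))
)
toy$timing <- classify_timing(toy$status, toy$date, toy$baseline)
print(toy)
```

Note that the third row, whose event date equals the baseline date, is classified as prevalent, matching the "on or before baseline" rule.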

Step 2: Age at Event

derive_age() computes age at the time of the event for cases, and returns NA for non-cases and cases without a date.

$\text{age\_at\_event} = \text{age\_at\_recruitment} + \frac{\text{event\_date} - \text{baseline\_date}}{365.25}$

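The date difference is measured in days. The divisor 365.25 is the average length of a year once leap years are included, so the conversion from days to years stays accurate to well under a month across the full UKB follow-up window.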
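As a worked instance of the formula, the arithmetic can be reproduced by hand in base R. The dates and the recruitment age below are made-up values chosen for illustration, not taken from any dataset.

```r
# Illustrative values only
age_at_recruitment <- 55
baseline_date <- as.Date("2008-03-10")
event_date    <- as.Date("2014-09-10")

# Subtracting two Date objects gives a difference in days
days_elapsed <- as.numeric(event_date - baseline_date)

# Same formula as above: recruitment age plus elapsed years
age_at_event <- age_at_recruitment + days_elapsed / 365.25
print(age_at_event)   # roughly 61.5
```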

# Auto-detects {name}_date and {name}_status; produces age_at_{name} column.
df <- derive_age(df,
  name         = "dm",
  baseline_col = "p53_i0",
  age_col      = "p21022"
)

Supply explicit column mappings when names do not follow the default {name}_date / {name}_status pattern:

df <- derive_age(df,
  name         = "dm",
  baseline_col = "p53_i0",
  age_col      = "p21022",
  date_cols    = c(dm = "dm_date"),
  status_cols  = c(dm = "dm_status")
)

Step 3: Follow-Up Time

derive_followup() computes the follow-up end date as the earliest of:

The outcome event date (if the participant is a case)
Date of death (field 40000; competing event)
Date lost to follow-up (field 191)
The administrative censoring date

Follow-up time in years is then the time elapsed from the baseline date to this follow-up end date.
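The "earliest of" rule can be sketched with pmin() in base R. This is a minimal sketch under stated assumptions, not the code inside derive_followup(): followup_sketch() and the toy columns are hypothetical, and the conversion to years assumes the same 365.25 divisor as the age formula, which this document does not confirm for follow-up time.

```r
# Hypothetical helper (not ukbflow's implementation)
followup_sketch <- function(baseline, status, event, death, lost, censor_date) {
  event[which(!status)] <- NA                 # event date counts only for cases
  censor <- rep(censor_date, length(baseline))
  # Earliest available date per participant; missing dates are ignored
  end <- pmin(event, death, lost, censor, na.rm = TRUE)
  years <- as.numeric(end - baseline) / 365.25   # assumed conversion
  years[which(years <= 0)] <- NA_real_        # prevalent cases: no valid follow-up
  data.frame(followup_end = end, followup_years = years)
}

# Toy data, made up for illustration
toy <- data.frame(
  baseline = as.Date(rep("2010-01-01", 4)),
  status   = c(TRUE, FALSE, FALSE, TRUE),
  event    = as.Date(c("2015-01-01", NA, NA, "2009-06-01")),
  death    = as.Date(c(NA, "2018-07-01", NA, NA))
)
res <- followup_sketch(
  toy$baseline, toy$status, toy$event, toy$death,
  lost        = as.Date(rep(NA, 4)),          # no loss-to-follow-up dates here
  censor_date = as.Date("2022-10-31")
)
print(res)
```

In this toy data the first participant ends follow-up at the event, the second at death, the third at the administrative censoring date, and the fourth (event before baseline) gets NA follow-up years.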

df <- derive_followup(df,
  name         = "dm",
  event_col    = "dm_date",
  baseline_col = "p53_i0",
  censor_date  = as.Date("2022-10-31"),   # set to your study's cut-off date
  death_col    = "p40000_i0",
  lost_col     = FALSE                    # not available in ops_toy
)

Output columns:

Column	Type	Description
`dm_followup_end`	IDate	Earliest competing date
`dm_followup_years`	numeric	Years from baseline to end

Prevalent cases receive `NA` follow-up time

Participants whose event date falls on or before the baseline date (prevalent cases, {name}_timing == 1) have {name}_followup_years set to NA, because a zero or negative follow-up time has no meaning in time-to-event analysis. Use derive_timing() to identify and exclude prevalent cases before fitting a Cox model (see the full pipeline example below).

Auto-detection of death and lost-to-follow-up columns

When death_col and lost_col are NULL (default), derive_followup() looks them up automatically from the field cache (UKB fields 40000 and 191). Pass FALSE to explicitly disable a competing event:

df <- derive_followup(df,
  name         = "dm",
  event_col    = "dm_date",
  baseline_col = "p53_i0",
  censor_date  = as.Date("2022-10-31"),
  death_col    = FALSE,
  lost_col     = FALSE
)

Full Survival-Ready Pipeline

After completing all three steps, the data contains everything needed to fit a Cox proportional hazards model:

library(survival)

# Incident analysis: exclude prevalent cases. With a data.table, rows where the
# condition is NA are dropped too, so cases with undetermined timing are excluded.
df_incident <- df[dm_timing != 1L]

fit <- coxph(
  Surv(dm_followup_years, dm_status) ~
    p20116_i0 + p21022 + p31 + p1558_i0,
  data = df_incident
)
summary(fit)

Column roles in the model:

Column	Role
`dm_status`	Event indicator (logical)
`dm_followup_years`	Time variable
`dm_timing`	Filter: exclude prevalent (`== 1`)
`age_at_dm`	Age at diagnosis (descriptive / secondary analysis)
`p20116_i0`	Exposure of interest (smoking status)
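If the data are held in a plain data.frame rather than a data.table, the NA rows need to be excluded explicitly, because base R subsetting with an NA condition returns NA-filled rows. The sketch below shows this on simulated data; the toy values, the smoker column, and the simulation settings are assumptions made for illustration and carry no epidemiological meaning.

```r
library(survival)

set.seed(1)
n <- 200
# Simulated stand-in for the derived columns (values are arbitrary)
toy <- data.frame(
  dm_timing = sample(c(0L, 1L, 2L, NA), n, replace = TRUE,
                     prob = c(0.60, 0.10, 0.25, 0.05)),
  smoker    = factor(sample(c("never", "current"), n, replace = TRUE),
                     levels = c("never", "current")),
  dm_followup_years = rexp(n, rate = 0.1)
)
# Cases are everyone except timing == 0 (including undetermined timing)
toy$dm_status <- is.na(toy$dm_timing) | toy$dm_timing != 0L

# Explicit NA handling: keep only non-cases (0) and incident cases (2)
keep <- !is.na(toy$dm_timing) & toy$dm_timing != 1L
toy_incident <- toy[keep, ]

fit <- coxph(Surv(dm_followup_years, dm_status) ~ smoker, data = toy_incident)
print(table(toy_incident$dm_timing, useNA = "ifany"))
```

After filtering, the event indicator is TRUE only for incident cases, which is what the Cox model should count as events.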

Getting Help

?derive_timing, ?derive_age, ?derive_followup
vignette("derive") — disease phenotype derivation
vignette("decode") — decoding column names and values
GitHub Issues