Purpose

This scope note describes the phenotype sources currently supported by ukbflow, the default reconciliation rules used by the disease-derivation helpers, and the code systems that are outside the current public API.

ukbflow provides workflow helpers for common UK Biobank phenotype extraction and derivation tasks. It does not replace study-specific phenotype validation, clinical case-definition decisions, or the UK Biobank Showcase.

Supported Sources

ukbflow currently focuses on common disease-phenotype sources that are routinely available in UK Biobank phenotype extraction workflows.

| Source | UKB field(s) / structure | Code system or field type | Main function(s) | Current behavior |
|---|---|---|---|---|
| Self-reported illness | 20002 with corresponding report dates | UKB self-report coding | derive_selfreport() | Matches user-supplied labels / regex and returns status plus earliest report date |
| Self-reported cancer | 20001 with corresponding report dates | UKB self-report cancer coding | derive_selfreport(field = "cancer") | Matches user-supplied labels / regex and returns status plus earliest report date |
| HES inpatient diagnoses | 41270 with dates from 41280 | ICD-10 any-position diagnosis field | derive_hes() | Treats any matching ICD-10 code in 41270 as a case; primary / secondary diagnosis position is not currently configurable |
| First Occurrence | p131xxx date fields | UKB precomputed first occurrence fields | derive_first_occurrence() | Uses the supplied First Occurrence date field as an event source |
| Cancer registry | 40006, 40011, 40012, 40005 | ICD-10, histology, behaviour, diagnosis date | derive_cancer_registry() | Supports ICD-10 matching with optional histology and behaviour filters |
| Death registry | 40001, 40002, 40000 | ICD-10 primary / secondary cause of death | derive_death_registry() | Searches primary and secondary death-cause fields and returns status plus death date |
| Multi-source ICD-10 phenotype | HES, First Occurrence, cancer registry, death registry | ICD-10-derived sources | derive_icd10() | Combines selected source-specific helpers into one ICD-10-derived status and earliest date |
| Final case definition | Self-report plus ICD-10-derived status/date | Source reconciliation | derive_case() | Uses an any-source OR rule by default: self-report or ICD-10-derived status can define a case |
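
To make the "any matching code is a case" behavior concrete, the following base-R sketch applies an ICD-10 prefix match to a toy long-format diagnosis table and keeps the earliest matching date per participant. It does not call ukbflow, and the data layout, column names (`eid`, `icd10`, `date`), and pattern are assumptions made for illustration only; the package's internal representation may differ.

```r
# Toy long-format diagnosis records (hypothetical layout, not the ukbflow format)
hes <- data.frame(
  eid   = c(1, 1, 2, 3, 3),
  icd10 = c("I210", "E119", "J459", "I259", "I219"),
  date  = as.Date(c("2012-03-01", "2010-06-15", "2011-01-20",
                    "2015-09-09", "2013-02-02")),
  stringsAsFactors = FALSE
)

# Illustrative case definition: any code beginning with I21, in any position
pattern <- "^I21"
hits <- hes[grepl(pattern, hes$icd10), ]

# Keep the earliest matching record per participant
hits <- hits[order(hits$eid, hits$date), ]
first_hit <- hits[!duplicated(hits$eid), c("eid", "date")]

# Participant-level status and date columns
participants <- data.frame(eid = c(1, 2, 3))
idx <- match(participants$eid, first_hit$eid)
participants$status <- !is.na(idx)
participants$date   <- first_hit$date[idx]  # NA where no code matched
```

Participants without a matching code keep `status = FALSE` and a missing date, which mirrors the status-plus-date shape described in the table.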

Default Reconciliation

derive_case() applies an any-source reconciliation rule by default. The final status is TRUE if either the ICD-10-derived status or the self-report status is TRUE. The final date is the earliest available date across included sources.
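
The rule described above can be sketched in base R as follows. This is an illustration of the any-source OR logic on toy data, not the implementation of derive_case(); the column names (`icd10_status`, `icd10_date`, `sr_status`, `sr_date`) are hypothetical.

```r
# Toy participant-level inputs (hypothetical column names)
pheno <- data.frame(
  eid          = 1:4,
  icd10_status = c(TRUE, FALSE, TRUE, FALSE),
  icd10_date   = as.Date(c("2012-03-01", NA, "2014-05-10", NA)),
  sr_status    = c(FALSE, TRUE, TRUE, FALSE),
  sr_date      = as.Date(c(NA, "2009-07-01", "2011-11-30", NA))
)

# Any-source OR rule: a case if either source flags the participant
pheno$case_status <- pheno$icd10_status | pheno$sr_status

# Earliest available date across sources; stays NA only if both dates are NA
pheno$case_date <- pmin(pheno$icd10_date, pheno$sr_date, na.rm = TRUE)
```

In this toy data, participant 3 is flagged by both sources and receives the earlier self-report date, while participant 4 has no evidence from either source and remains a non-case with a missing date.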

This default is a workflow convention, not a medical-record confirmation rule. For stricter case definitions, users should construct source-specific phenotypes explicitly, for example by using an ICD-10-derived name without matching self-report columns, or by controlling the sources passed to derive_icd10(source = ...).
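
As a rough illustration of what a stricter definition might look like once source-specific columns exist, the base-R sketch below contrasts an ICD-10-only rule with a rule that requires both sources to agree. The column names are hypothetical, and which rule is appropriate remains a study-specific decision.

```r
# Toy source-specific columns (hypothetical names)
pheno <- data.frame(
  eid          = 1:3,
  icd10_status = c(TRUE, FALSE, TRUE),
  icd10_date   = as.Date(c("2012-03-01", NA, "2014-05-10")),
  sr_status    = c(FALSE, TRUE, TRUE)
)

# Stricter rule 1: ICD-10-derived evidence only; self-report is ignored
pheno$case_icd10_only      <- pheno$icd10_status
pheno$case_icd10_only_date <- pheno$icd10_date

# Stricter rule 2: require both sources to flag the participant
pheno$case_both <- pheno$icd10_status & pheno$sr_status
```

Under either rule, participant 2 (self-report only) is no longer counted as a case, whereas the default OR rule would include them.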

Not Currently Supported

The following code systems and source types are not part of the current public API:

| Source or code system | Current status |
|---|---|
| ICD-9 | Not currently supported |
| OPCS-4 procedure codes | Not currently supported |
| Read v2 primary-care codes | Not currently supported |
| CTV3 primary-care codes | Not currently supported |
| General GP / primary-care phenotype parsing | Not currently supported |
| HES primary / secondary diagnosis-position selection | Not currently exposed as a public argument |

These exclusions are intentional scope boundaries for the current release. Users can still derive custom variables outside ukbflow and then use derive_timing(), derive_followup(), and the assoc_* functions once analysis-ready status and date columns have been constructed.
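
One way to build such custom columns is sketched below in base R, using toy OPCS-4 procedure records and an exact code list. The record layout, the code list, and the output column names (`proc_status`, `proc_date`) are assumptions for illustration; the sketch ends with basic checks that the result has a logical status and a Date column before it is handed to downstream functions.

```r
# Hypothetical long-format procedure records (one row per recorded procedure)
procs <- data.frame(
  eid   = c(10, 10, 11, 13),
  opcs4 = c("K401", "W371", "K402", "X999"),
  date  = as.Date(c("2011-04-01", "2016-08-12", "2013-01-15", "2012-02-02")),
  stringsAsFactors = FALSE
)
code_list <- c("K401", "K402", "K403", "K404")  # illustrative code list only

cohort <- data.frame(eid = c(10, 11, 12, 13))

# Exact code-list match, then earliest matching record per participant
hits <- procs[procs$opcs4 %in% code_list, ]
hits <- hits[order(hits$eid, hits$date), ]
first <- hits[!duplicated(hits$eid), c("eid", "date")]
names(first)[2] <- "proc_date"

# Left-join onto the cohort so non-cases are retained with a missing date
cohort <- merge(cohort, first, by = "eid", all.x = TRUE)
cohort$proc_status <- !is.na(cohort$proc_date)

# Basic sanity checks on the analysis-ready columns
stopifnot(
  is.logical(cohort$proc_status),
  inherits(cohort$proc_date, "Date"),
  all(!is.na(cohort$proc_date[cohort$proc_status]))
)
```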

Design Principle

The phenotype helpers are intentionally explicit and source-aware. ukbflow prioritizes tested helpers for common UKB sources over broad, under-specified parsing of every possible clinical coding system.

For complex phenotypes, the recommended workflow is:

  1. Identify approved fields in the active RAP project.
  2. Extract the required fields with extract_batch() or related helpers.
  3. Use source-specific derive_* helpers where supported.
  4. Build custom status/date columns for sources outside the current public API.
  5. Use derive_timing() and derive_followup() to construct analysis-ready survival variables.
  6. Pass explicit covariates and model choices to the assoc_* functions.
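
To illustrate the kind of variables step 5 produces, the base-R sketch below derives prevalent/incident flags and follow-up time from status and date columns. It is not the implementation of derive_timing() or derive_followup(), whose arguments and outputs are not described here; the column names, the censoring date, and the choice to set follow-up to missing for prevalent cases are all assumptions.

```r
# Toy analysis-ready inputs (hypothetical column names)
cohort <- data.frame(
  eid           = 1:4,
  baseline_date = as.Date(c("2008-01-10", "2009-05-20", "2010-03-03", "2007-11-11")),
  case_status   = c(TRUE, TRUE, FALSE, FALSE),
  case_date     = as.Date(c("2012-03-01", "2007-02-01", NA, NA)),
  death_date    = as.Date(c(NA, NA, "2015-06-30", NA))
)
# Hypothetical administrative censoring date, repeated per participant
censor_date <- rep(as.Date("2020-12-31"), nrow(cohort))

# Prevalent: case dated on or before baseline; incident: case dated after it
cohort$prevalent <- cohort$case_status & !is.na(cohort$case_date) &
  cohort$case_date <= cohort$baseline_date
cohort$incident <- cohort$case_status & !cohort$prevalent

# End of follow-up: earliest of event, death, and censoring dates
cohort$end_date <- pmin(cohort$case_date, cohort$death_date, censor_date,
                        na.rm = TRUE)

# Follow-up in years; prevalent cases are set to NA (an assumed convention)
days <- as.numeric(difftime(cohort$end_date, cohort$baseline_date, units = "days"))
cohort$followup_years <- days / 365.25
cohort$followup_years[cohort$prevalent] <- NA
```

Here participant 2 is a prevalent case and contributes no follow-up time, participant 3 is censored at death, and participant 4 is censored at the administrative end date.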