Overview
Raw UKB phenotype data contains encoded column names and values that need to be converted before analysis.
| Source | Column names | Column values |
|---|---|---|
| extract_pheno() | participant.p31 | Raw integer codes; needs decode_values() |
| extract_batch() | p31, p53_i0 | Usually already decoded; decode_values() typically not needed |
Both outputs need decode_names() to convert field ID
column names to human-readable snake_case.
Call order matters: when using extract_pheno() output, always run decode_values() before decode_names(), because value decoding relies on the numeric field ID still being present in the column name.
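To see why the order matters, the toy sketch below pulls the field ID out of raw and already-renamed column names. It uses base R only; field_id_from_name() is a hypothetical helper, not part of ukbflow, and the package's real parsing may differ. Once a column has been renamed, there is no ID left to look up.

```r
# Hypothetical helper (not a ukbflow function): extract the numeric field ID
# from names like "participant.p31" or "p53_i0"; return NA when none is found.
field_id_from_name <- function(x) {
  pattern <- "^(participant\\.)?p([0-9]+)(_.*)?$"
  ids <- rep(NA_character_, length(x))
  hit <- grepl(pattern, x)
  ids[hit] <- sub(pattern, "\\2", x[hit])
  ids
}

raw_names     <- c("participant.eid", "participant.p31", "p53_i0")
decoded_names <- c("eid", "sex", "date_of_attending_assessment_centre_i0")

ids_raw     <- field_id_from_name(raw_names)      # IDs recoverable for p31 and p53_i0
ids_decoded <- field_id_from_name(decoded_names)  # nothing recoverable after renaming
```

In this sketch, every entry of ids_decoded is NA, which is the situation decode_values() would face if decode_names() had already run.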
Recommended Workflow
library(ukbflow)
df <- extract_pheno(c(31, 54, 20116, 21022))
df <- decode_values(df) # 0/1 → "Female"/"Male", etc.
df <- decode_names(df) # participant.p31 → sex
Step 1: Decode Values
decode_values() converts raw integer codes to
human-readable labels for categorical fields that have UKB encoding
mappings. Continuous, date, text, and already-decoded fields are left
unchanged.
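Conceptually, decoding a categorical field is a lookup from code to label. The base R sketch below assumes the 0/1 coding for sex mentioned in the workflow above. The object names are illustrative, and this is not a description of how ukbflow implements the step internally.

```r
# Toy encoding for one categorical field (assumed: 0 = Female, 1 = Male)
sex_encoding <- c("0" = "Female", "1" = "Male")

# Raw integer codes as they might appear in an extract_pheno() column
raw_sex <- c(0L, 1L, 1L, NA, 0L)

# Look up each code by name; missing codes stay NA
decoded_sex <- unname(sex_encoding[as.character(raw_sex)])
```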
df <- decode_values(df)
#> ✔ Decoded 3 categorical columns; 2 non-categorical columns unchanged.
It requires two metadata files from the UKB Showcase. Download them once with:
fetch_metadata(dest_dir = "data/metadata")
Then point decode_values() to the same directory (default matches fetch_metadata()):
df <- decode_values(df, metadata_dir = "data/metadata")
Step 2: Decode Names
decode_names() renames columns from field ID format to
snake_case labels using the approved UKB field dictionary available to
your project.
df <- decode_names(df)
#> ✔ Renamed 5 columns.
Name conversion examples
| Raw name | Decoded name |
|---|---|
| participant.eid | eid |
| participant.p31 | sex |
| participant.p21022 | age_at_recruitment |
| participant.p53_i0 | date_of_attending_assessment_centre_i0 |
| p31 | sex |
| p53_i0 | date_of_attending_assessment_centre_i0 |
Both extract_pheno() format
(participant.p31) and extract_batch() format
(p31) are handled automatically.
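The conversions in the table can be reproduced with a small base R sketch. Everything here is an assumption made for illustration: toy_dictionary stands in for the real field dictionary, and to_snake_case() and rename_one() are hypothetical helpers, not ukbflow functions.

```r
# Toy stand-in for the field dictionary (field ID -> field title)
toy_dictionary <- c(
  "31"    = "Sex",
  "53"    = "Date of attending assessment centre",
  "21022" = "Age at recruitment"
)

# Hypothetical helper: lower-case a title and join its words with underscores
to_snake_case <- function(title) {
  x <- gsub("[^a-z0-9]+", "_", tolower(title))
  gsub("^_+|_+$", "", x)
}

# Hypothetical helper: rename one column, keeping any suffix such as "_i0"
rename_one <- function(col) {
  bare <- sub("^participant\\.", "", col)        # accept both input formats
  pattern <- "^p([0-9]+)(_.*)?$"
  if (!grepl(pattern, bare)) return(bare)        # e.g. "eid" has no field ID
  id <- sub(pattern, "\\1", bare)
  suffix <- sub(pattern, "\\2", bare)
  if (!id %in% names(toy_dictionary)) return(bare)
  paste0(to_snake_case(toy_dictionary[[id]]), suffix)
}

raw <- c("participant.eid", "participant.p31", "participant.p21022",
         "participant.p53_i0", "p31", "p53_i0")
decoded <- vapply(raw, rename_one, character(1), USE.NAMES = FALSE)
```

Stripping the participant. prefix first is what lets a single rule cover both the extract_pheno() and extract_batch() formats in this sketch.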
Long names
Some UKB field titles are verbose. Names exceeding
max_nchar characters (default: 60) are flagged with a warning.
Lower the threshold to flag more names:
df <- decode_names(df, max_nchar = 30)
#> ! 1 column name longer than 30 characters - consider renaming manually:
#> • date_of_attending_assessment_centre_i0
Rename manually to something concise:
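A minimal base R sketch of a manual rename, using a toy data frame with made-up values. The short name assessment_date is only an illustration, not a ukbflow convention.

```r
# Toy data frame standing in for decode_names() output (values are invented)
df_toy <- data.frame(
  eid = 1:2,
  sex = c("Female", "Male"),
  date_of_attending_assessment_centre_i0 = as.Date(c("2008-03-01", "2009-07-15"))
)

# List names longer than 30 characters, mirroring the max_nchar = 30 check
long_names <- names(df_toy)[nchar(names(df_toy)) > 30]

# Replace the long name with a shorter one of your choosing
old_name <- "date_of_attending_assessment_centre_i0"
names(df_toy)[names(df_toy) == old_name] <- "assessment_date"
```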
Getting Help
- ?decode_values, ?decode_names
- vignette("extract"): extracting phenotype data
- GitHub Issues
