
Overview

Raw UKB phenotype data contains encoded column names and values that need to be converted before analysis.

| Source | Column names | Column values |
|---|---|---|
| extract_pheno() | participant.p31 | Raw integer codes; needs decode_values() |
| extract_batch() | p31, p53_i0 | Usually already decoded; decode_values() typically not needed |

Both outputs need decode_names() to convert field ID column names to human-readable snake_case.

Call order matters: when using extract_pheno() output, always run decode_values() before decode_names(), because value decoding relies on the numeric field ID still being present in the column name.


library(ukbflow)

df <- extract_pheno(c(31, 54, 20116, 21022))
df <- decode_values(df)   # 0/1 → "Female"/"Male", etc.
df <- decode_names(df)    # participant.p31 → sex
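
To see why the order matters, here is a minimal base R sketch of how a field ID might be recovered from a column name. The helper field_id_from_name() and its regular expression are hypothetical illustrations, not part of ukbflow; they only show that once a column has been renamed, the field ID can no longer be read from it.

```r
# Hypothetical helper (not a ukbflow function): extract the numeric field ID
# from a UKB-style column name. Returns NA when no field ID is present.
field_id_from_name <- function(col_names) {
  # Optional "participant." prefix, then "p" + digits, then an optional
  # suffix such as "_i0" (instance) or "_a1" (array).
  pattern <- "^(participant\\.)?p([0-9]+)(_.*)?$"
  ids <- ifelse(grepl(pattern, col_names),
                sub(pattern, "\\2", col_names),
                NA_character_)
  as.integer(ids)
}

raw_names     <- c("participant.p31", "p53_i0", "participant.eid")
renamed_names <- c("sex", "date_of_attending_assessment_centre_i0", "eid")

ids_raw     <- field_id_from_name(raw_names)      # IDs found for p31 and p53
ids_renamed <- field_id_from_name(renamed_names)  # nothing left to look up
```

With the raw names, the field IDs 31 and 53 are still recoverable; with the renamed columns every lookup comes back as NA, which is why value decoding has to happen first.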

Step 1: Decode Values

decode_values() converts raw integer codes to human-readable labels for categorical fields that have UKB encoding mappings. Continuous, date, text, and already-decoded fields are left unchanged.

df <- decode_values(df)
#> ✔ Decoded 3 categorical columns; 2 non-categorical columns unchanged.

It requires two metadata files from the UKB Showcase. Download them once with:

fetch_metadata(dest_dir = "data/metadata")

Then point decode_values() to the same directory (default matches fetch_metadata()):

df <- decode_values(df, metadata_dir = "data/metadata")

What gets decoded

| Column | Raw value | Decoded value |
|---|---|---|
| p31 | 0 / 1 | "Female" / "Male" |
| p54 | 11010 | "Leeds" |
| p20116_i0 | 0 / 1 / 2 | "Never" / "Previous" / "Current" |

Codes absent from the encoding table (including UKB missing codes -1, -3, -7) are returned as NA.
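
The NA behaviour can be pictured as a plain lookup. The sketch below is an assumption about the general technique, not ukbflow's actual implementation: decode_codes() is a hypothetical helper, and smoking_labels contains only the three codes shown in the table above.

```r
# Illustrative encoding table for field 20116, limited to the codes listed above.
smoking_labels <- c("0" = "Never", "1" = "Previous", "2" = "Current")

# Hypothetical helper (not a ukbflow function): map raw codes to labels
# through a named-vector lookup.
decode_codes <- function(codes, labels) {
  unname(labels[as.character(codes)])
}

raw <- c(0, 2, -3, 1, NA)
decoded <- decode_codes(raw, smoking_labels)
# Codes with no entry in `labels` (here -3) and missing inputs both become NA.
```

Because an unmatched code and a genuinely missing value both end up as NA, it may be worth keeping a copy of the raw column if you need to tell them apart later.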


Step 2: Decode Names

decode_names() renames columns from field ID format to snake_case labels using the approved UKB field dictionary available to your project.

df <- decode_names(df)
#> ✔ Renamed 5 columns.

Name conversion examples

| Raw name | Decoded name |
|---|---|
| participant.eid | eid |
| participant.p31 | sex |
| participant.p21022 | age_at_recruitment |
| participant.p53_i0 | date_of_attending_assessment_centre_i0 |
| p31 | sex |
| p53_i0 | date_of_attending_assessment_centre_i0 |

Both extract_pheno() format (participant.p31) and extract_batch() format (p31) are handled automatically.
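
As a rough illustration of the title-to-name step, the following base R sketch converts a field title to snake_case and appends the instance suffix. The helper to_snake_case() is hypothetical, and the exact rules decode_names() applies may differ; this only reproduces the examples in the table above.

```r
# Hypothetical helper (not a ukbflow function): field title -> snake_case name.
to_snake_case <- function(title, suffix = "") {
  name <- tolower(title)
  name <- gsub("[^a-z0-9]+", "_", name)  # runs of non-alphanumerics become "_"
  name <- gsub("^_+|_+$", "", name)      # trim leading/trailing underscores
  paste0(name, suffix)
}

name_53    <- to_snake_case("Date of attending assessment centre", suffix = "_i0")
name_21022 <- to_snake_case("Age at recruitment")
name_31    <- to_snake_case("Sex")
```

Keeping the instance suffix (_i0) on the end preserves the distinction between repeat assessments after renaming.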

Long names

Some UKB field titles are verbose, so the decoded names can get long. Any name longer than max_nchar characters (default: 60) triggers a warning. Lower the threshold to flag more names:

df <- decode_names(df, max_nchar = 30)
#> ! 1 column name longer than 30 characters - consider renaming manually:
#> • date_of_attending_assessment_centre_i0

Rename manually to something concise:

names(df)[names(df) == "date_of_attending_assessment_centre_i0"] <- "date_baseline"
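
When several columns need shorter names, a lookup vector avoids repeating the line above for each one. This is a base R sketch on a toy data frame; df_demo, rename_map, and the example dates are made up for illustration and do not come from ukbflow.

```r
# Toy data frame standing in for decode_names() output (values are invented).
df_demo <- data.frame(
  eid = 1:2,
  sex = c("Female", "Male"),
  date_of_attending_assessment_centre_i0 = as.Date(c("2008-03-01", "2009-07-15"))
)

# List the names that exceed a chosen length threshold.
max_nchar  <- 30
long_names <- names(df_demo)[nchar(names(df_demo)) > max_nchar]

# Hypothetical rename map: old name = "new name".
rename_map <- c(date_of_attending_assessment_centre_i0 = "date_baseline")

# Rename only the columns present in the map; leave the rest untouched.
hits <- names(df_demo) %in% names(rename_map)
names(df_demo)[hits] <- unname(rename_map[names(df_demo)[hits]])
```

Listing long_names first gives a quick checklist of which entries the rename map still needs.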

Getting Help