Extracting Phenotype Data

Overview

UK Biobank (UKB) phenotype data is stored in a proprietary .dataset format on the Research Analysis Platform (RAP) and cannot be read directly. The extract_* functions provide R interfaces for discovering approved fields and extracting phenotype data via the DNAnexus dx extract_dataset and table-exporter tools.

Two workflows are available:

Function	Mode	Scale	Output
`extract_batch()`	Async job	Large / production (typically 50+ fields)	job ID → CSV on RAP cloud
`extract_pheno()`	Synchronous	Small (quick checks)	data.table in memory

extract_batch() is the recommended approach for any analysis you intend to save or reproduce. extract_pheno() is provided only for quick interactive inspection inside the RAP environment.

Prerequisites

Ensure you are authenticated and have selected your project:

library(ukbflow)

auth_login()
auth_select_project("project-XXXXXXXXXXXX")

Step 1: Browse Available Fields

Before extracting, use extract_ls() to explore what fields are approved for your project:

# List all approved fields (cached after first call)
extract_ls()

# Search by keyword
extract_ls(pattern = "cancer")
extract_ls(pattern = "p31|p53|p21022")

# Force refresh after switching projects or datasets
extract_ls(refresh = TRUE)

The result is a data.frame with two columns:

Column	Example
`field_name`	`participant.p53_i0`
`title`	`Date of attending assessment centre | Instance 0`

Fields reflect your project’s approved data only — not all UKB fields are present.
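The keyword examples above suggest that pattern is treated as a regular expression. As a minimal sketch of what that implies, the base R snippet below filters a mock data.frame shaped like the result described above. It does not call extract_ls(); the rows are illustrative, and participant.p310_i0 is a hypothetical field name included only to show that an unanchored pattern can match more than intended.

```r
# Mock of the two-column result; the rows are illustrative, not real output.
fields <- data.frame(
  field_name = c("participant.p31", "participant.p53_i0",
                 "participant.p310_i0",   # hypothetical field, for illustration
                 "participant.p21022"),
  title      = c("Sex",
                 "Date of attending assessment centre | Instance 0",
                 "Hypothetical field | Instance 0",
                 "Age at recruitment"),
  stringsAsFactors = FALSE
)

# An unanchored pattern such as "p31" also matches p310.
loose <- fields$field_name[grepl("p31", fields$field_name)]

# Anchoring on the field boundary keeps only field 31.
strict <- fields$field_name[grepl("\\.p31(_|$)", fields$field_name)]

# Recover the numeric field IDs from the field names.
field_ids <- unique(as.integer(
  sub("^participant\\.p([0-9]+).*$", "\\1", fields$field_name)
))
```

Here loose contains two names while strict contains only participant.p31, so anchoring is worth considering whenever a short field ID is a prefix of a longer one.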

Step 2: Extract Data

Recommended: `extract_batch()`

For large-scale or production extractions, submit an asynchronous table-exporter job on the RAP cloud:

# Submit extraction job
job_id <- extract_batch(c(31, 53, 21022, 22189))

# Custom output name
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  file     = "ukb_demographics"
)

# High priority (faster queue, higher cost)
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  priority = "high"
)

The job runs asynchronously on the RAP cloud, and the output CSV is saved to your RAP project. Use the job_* functions to monitor the job and retrieve the result:

job_status(job_id)        # check progress
job_path(job_id)          # get cloud file path once complete
job_result(job_id)        # read result as data.table (inside RAP only)
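Because the job is asynchronous, a script usually has to wait for it before reading the result. The helper below is a minimal polling sketch, not part of ukbflow: wait_for_job() is a hypothetical name, and it assumes the status is available as a single string that eventually equals one of the DNAnexus terminal states ("done", "failed", "terminated"). Whether job_status() returns the state in exactly that form is an assumption, so the demo uses a mock status function.

```r
# Poll a status function until the job reaches a terminal state.
# `status_fn` is any zero-argument function returning one string; on the RAP
# this might be function() job_status(job_id), assuming it returns the state
# as text (not confirmed by the source).
wait_for_job <- function(status_fn,
                         terminal  = c("done", "failed", "terminated"),
                         interval  = 60,    # seconds between polls
                         max_polls = 120) {
  for (i in seq_len(max_polls)) {
    state <- status_fn()
    if (state %in% terminal) {
      return(state)
    }
    Sys.sleep(interval)
  }
  stop("Job did not reach a terminal state after ", max_polls, " polls")
}

# Demo with a mock that reports "running" twice and then "done".
mock_states <- c("running", "running", "done")
poll_count  <- 0
mock_status <- function() {
  poll_count <<- poll_count + 1
  mock_states[min(poll_count, length(mock_states))]
}

final_state <- wait_for_job(mock_status, interval = 0)
```

Passing the status function as an argument keeps the loop independent of how the status is obtained, so it can be exercised locally before being pointed at a real job.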

Instance type

extract_batch() automatically selects an appropriate instance based on the number of columns:

Columns	Instance
≤ 20	`mem1_ssd1_v2_x4`
≤ 100	`mem1_ssd1_v2_x8`
≤ 500	`mem1_ssd1_v2_x16`
> 500	`mem1_ssd1_v2_x36`

You can override this with the instance_type argument if needed.
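The thresholds in the table can be written as a small lookup. This is only a sketch that mirrors the table, under the assumption that the boundaries are inclusive as shown; pick_instance() is a hypothetical helper, not the package's internal implementation.

```r
# Map a column count to the instance type listed in the table above.
pick_instance <- function(n_cols) {
  stopifnot(is.numeric(n_cols), length(n_cols) == 1, n_cols >= 1)
  if (n_cols <= 20) {
    "mem1_ssd1_v2_x4"
  } else if (n_cols <= 100) {
    "mem1_ssd1_v2_x8"
  } else if (n_cols <= 500) {
    "mem1_ssd1_v2_x16"
  } else {
    "mem1_ssd1_v2_x36"
  }
}

small <- pick_instance(4)     # a handful of columns
large <- pick_instance(750)   # a wide extraction
```

A helper like this can be handy for estimating which tier a planned extraction is likely to land in before submitting it.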

Quick inspection: `extract_pheno()`

For small-scale interactive checks inside the RAP RStudio environment:

df <- extract_pheno(c(31, 53, 21022))

extract_pheno() is restricted to the RAP environment and returns data in memory only. For any analysis intended to be saved or reproduced, use extract_batch().

Note: extract_pheno() returns raw coded values (e.g. 1/0 for Sex, numeric codes for diseases). Use the decode_* series to convert codes to human-readable labels.
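To illustrate what "raw coded values" means, the base R sketch below labels a mock Sex vector by hand, using the UK Biobank coding for field 31 (0 = Female, 1 = Male). It is shown only to make the coding concrete; it does not use or describe the decode_* functions, and the input values are made up.

```r
# Mock raw values as they might appear in a Sex column (illustrative data).
sex_raw <- c(1, 0, 0, 1)

# Attach human-readable labels: 0 = Female, 1 = Male.
sex <- factor(sex_raw, levels = c(0, 1), labels = c("Female", "Male"))

# Tabulate the labelled values.
sex_counts <- table(sex)
```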

A Note on Column Names

Column naming differs between the two extraction methods:

extract_batch() — no prefix:

Column	Meaning
`eid`	Participant ID
`p31`	Field 31 (Sex)
`p53_i0`	Field 53, Instance 0
`p20002_i0_a0`	Field 20002, Instance 0, Array 0

extract_pheno() — participant. prefix:

Column	Meaning
`participant.eid`	Participant ID
`participant.p31`	Field 31 (Sex)
`participant.p53_i0`	Field 53, Instance 0
`participant.p20002_i0_a0`	Field 20002, Instance 0, Array 0
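If code needs to handle output from both methods, one option is to normalise the names first. The sketch below strips the participant. prefix and splits each name into field, instance, and array parts, following the naming pattern in the tables above. parse_cols() and grab() are hypothetical helpers written in base R, not part of ukbflow.

```r
# Extract the first capture group as an integer, or NA where there is no match.
grab <- function(x, pattern) {
  out <- rep(NA_integer_, length(x))
  hit <- grepl(pattern, x)
  out[hit] <- as.integer(sub(pattern, "\\1", x[hit]))
  out
}

# Normalise column names and split them into field / instance / array.
parse_cols <- function(x) {
  x <- sub("^participant\\.", "", x)   # drop the extract_pheno() prefix
  data.frame(
    column   = x,
    field    = grab(x, "^p([0-9]+).*$"),
    instance = grab(x, "^.*_i([0-9]+).*$"),
    array    = grab(x, "^.*_a([0-9]+)$"),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_cols(c("participant.eid", "participant.p31",
                       "participant.p53_i0", "p20002_i0_a0"))
```

Applying names(df) <- sub("^participant\\.", "", names(df)) to an extract_pheno() result would give the same column names as extract_batch(), so downstream code can refer to columns such as p53_i0 regardless of the source.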

Getting Help

?extract_ls, ?extract_pheno, ?extract_batch
vignette("auth") — authentication setup
GitHub Issues