Skip to contents

Overview

UKB phenotype data is stored in a proprietary .dataset format on the RAP and cannot be read directly. The extract_* functions provide R interfaces for discovering approved fields and extracting phenotype data via the DNAnexus dx extract_dataset and table-exporter tools.

Two workflows are available:

Function Mode Scale Output
extract_batch() Async job Large / production (typically 50+ fields) job ID → CSV on RAP cloud
extract_pheno() Synchronous Small (quick checks) data.table in memory

extract_batch() is the recommended approach for any serious analysis. extract_pheno() is provided for quick interactive inspection inside the RAP environment only.


Prerequisites

Ensure you are authenticated and have selected your project:

library(ukbflow)

auth_login()
auth_select_project("project-XXXXXXXXXXXX")

Step 1: Browse Available Fields

Before extracting, use extract_ls() to explore what fields are approved for your project:

# List all approved fields (cached after first call)
extract_ls()

# Search by keyword
extract_ls(pattern = "cancer")
extract_ls(pattern = "p31|p53|p21022")

# Force refresh after switching projects or datasets
extract_ls(refresh = TRUE)

The result is a data.frame with two columns:

Column Example
field_name participant.p53_i0
title Date of attending assessment centre \| Instance 0

Fields reflect your project’s approved data only — not all UKB fields are present.


Step 2: Extract Data

For large-scale or production extractions, submit an asynchronous table-exporter job on the RAP cloud:

# Submit extraction job
job_id <- extract_batch(c(31, 53, 21022, 22189))

# Custom output name
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  file     = "ukb_demographics"
)

# High priority (faster queue, higher cost)
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  priority = "high"
)

The job runs asynchronously on the RAP cloud. The output CSV is saved to your RAP project and can be monitored with the job_ series:

job_status(job_id)        # check progress
job_path(job_id)          # get cloud file path once complete
job_result(job_id)        # read result as data.table (inside RAP only)

Instance type

extract_batch() automatically selects an appropriate instance based on the number of columns:

Columns Instance
≤ 20 mem1_ssd1_v2_x4
≤ 100 mem1_ssd1_v2_x8
≤ 500 mem1_ssd1_v2_x16
> 500 mem1_ssd1_v2_x36

You can override this with the instance_type argument if needed.


Quick inspection: extract_pheno()

For small-scale interactive checks inside the RAP RStudio environment:

df <- extract_pheno(c(31, 53, 21022))

extract_pheno() is restricted to the RAP environment and returns data in memory only. For any analysis intended to be saved or reproduced, use extract_batch().

Note: extract_pheno() returns raw coded values (e.g. 1/0 for Sex, numeric codes for diseases). Use the decode_* series to convert codes to human-readable labels.


A Note on Column Names

Column naming differs between the two extraction methods:

extract_batch() — no prefix:

Column Meaning
eid Participant ID
p31 Field 31 (Sex)
p53_i0 Field 53, Instance 0
p20002_i0_a0 Field 20002, Instance 0, Array 0

extract_pheno()participant. prefix:

Column Meaning
participant.eid Participant ID
participant.p31 Field 31 (Sex)
participant.p53_i0 Field 53, Instance 0
participant.p20002_i0_a0 Field 20002, Instance 0, Array 0

Getting Help