
Overview

The grs_* functions provide an end-to-end pipeline for computing and validating polygenic / genetic risk scores within the UK Biobank Research Analysis Platform (RAP). Because individual-level genotype data cannot be downloaded locally, all computationally intensive steps are executed as cloud jobs via the DNAnexus Swiss Army Knife app.

| Function | Where it runs | Purpose |
|---|---|---|
| grs_check() | Local or RAP | Validate and export a SNP weights file |
| grs_bgen2pgen() | Submits RAP jobs | Convert UKB imputed BGEN files to PGEN |
| grs_score() | Submits RAP jobs | Score genotypes across chromosomes with plink2 |
| grs_standardize() | Local or RAP | Z-score standardise GRS columns |
| grs_zscore() | Local or RAP | Alias for grs_standardize() |
| grs_validate() | Local or RAP | Assess predictive performance (OR/HR, AUC, C-index) |

Typical pipeline:

grs_check()  ->  grs_bgen2pgen()  ->  grs_score()  ->  grs_standardize()  ->  grs_validate()

Prerequisite: you must be authenticated to UKB RAP and have a project selected. See vignette("auth").


Step 1: Validate the Weights File – grs_check()

Before uploading to RAP, validate your SNP weights file. The function reads the file, checks required columns, flags problems, and writes a plink2-compatible space-delimited output.

Required columns:

| Column | Description |
|---|---|
| snp | SNP identifier (expected rs + digits format) |
| effect_allele | Effect allele: one of A / T / C / G |
| beta | Effect size (log-OR or beta); must be numeric |

library(ukbflow)

# Local: weights file on your machine
w <- grs_check("weights.csv", dest = "weights_clean.txt")
#> Read 'weights.csv': 312 rows, 5 columns.
#> ✔ No NA values.
#> ✔ No duplicate SNPs.
#> ✔ All SNP IDs match rs[0-9]+ format.
#> ✔ All effect alleles are A/T/C/G.
#> Beta summary:
#>   Range     : -0.412 to 0.589
#>   Mean |beta|: 0.183
#>   Positive  : 187 (59.9%)
#>   Negative  : 125 (40.1%)
#>   Zero      : 0
#> ✔ Weights file passed checks: 312 SNPs ready for UKB RAP.
#> ✔ Saved: 'weights_clean.txt'
# On RAP (RStudio) -- use /mnt/project/ paths directly
w <- grs_check(
  file = "/mnt/project/weights/weights.csv",
  dest = "/mnt/project/weights/weights_clean.txt"
)

The validated weights are also returned invisibly for inspection.
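To experiment with grs_check() before pointing it at real data, you can build a small toy weights file. The sketch below also spells out, by hand, the same checks grs_check() reports above (the file name and values are illustrative only; real weights come from a published GWAS or a PGS Catalog entry):

```r
# A minimal toy weights file with the three required columns
toy <- data.frame(
  snp           = c("rs123", "rs456", "rs789"),
  effect_allele = c("A", "C", "T"),
  beta          = c(0.12, -0.05, 0.31)
)
write.csv(toy, "toy_weights.csv", row.names = FALSE)

# The same validations grs_check() performs, done manually:
stopifnot(
  !anyNA(toy),                                       # no NA values
  !anyDuplicated(toy$snp),                           # no duplicate SNPs
  all(grepl("^rs[0-9]+$", toy$snp)),                 # rs[0-9]+ ID format
  all(toy$effect_allele %in% c("A", "T", "C", "G")), # valid alleles
  is.numeric(toy$beta)                               # numeric effect sizes
)
```

If any of these conditions fails on your own file, grs_check() is the place to see the detailed diagnostics.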


Step 2: Convert BGEN to PGEN – grs_bgen2pgen()

UKB imputed genotype data is stored in BGEN format on RAP. grs_bgen2pgen() submits one Swiss Army Knife job per chromosome to convert BGEN files to the faster PGEN format with a MAF filter applied via plink2.

This function is called from your local machine or RAP RStudio – the heavy lifting runs entirely in the cloud.

# Test on the smallest chromosome first
ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high")
#> Uploading 'grs_bgen2pgen_std.R' to RAP ...
#> ✔ Uploaded: '/grs_bgen2pgen_std.R'
#> Submitting 1 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ chr22 -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ 1/1 job(s) submitted. Monitor with job_ls().
# Full run: small chromosomes on standard, large on upgraded instance
ids_small <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/")
ids_large <- grs_bgen2pgen(chr = 1:16,  dest = "/pgen/", instance = "large")

# Monitor progress (job_wait() takes one job ID at a time)
job_ls()
for (id in c(ids_small, ids_large)) job_wait(id)

Instance types:

| Preset | DNAnexus instance | Cores | RAM | SSD | Recommended for |
|---|---|---|---|---|---|
| "standard" | mem2_ssd1_v2_x4 | 4 | 12 GB | 200 GB | chr 17–22 |
| "large" | mem2_ssd2_v2_x8 | 8 | 28 GB | 640 GB | chr 1–16 |

Key arguments:

| Argument | Default | Description |
|---|---|---|
| chr | 1:22 | Chromosomes to process |
| dest | (required) | RAP output path for PGEN files |
| maf | 0.01 | MAF filter passed to plink2 --maf |
| instance | "standard" | Instance type preset |
| priority | "low" | Job priority ("low" or "high") |

Storage warning: chromosomes 1–16 may exceed the 200 GB SSD on "standard" instances. Use instance = "large" for these.
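Conceptually, each per-chromosome job boils down to a plink2 conversion command. The sketch below builds such a command string in R so you can see where the maf argument ends up; the actual file names, paths, and modifiers used inside grs_bgen2pgen() may differ:

```r
# Sketch of the per-chromosome plink2 call (file names are placeholders)
make_cmd <- function(chr, maf = 0.01) {
  sprintf(
    paste(
      "plink2 --bgen ukb_c%d.bgen ref-first --sample ukb_c%d.sample",
      "--maf %g --make-pgen --out chr%d"
    ),
    chr, chr, maf, chr
  )
}
cat(make_cmd(22))
#> plink2 --bgen ukb_c22.bgen ref-first --sample ukb_c22.sample --maf 0.01 --make-pgen --out chr22
```

This is why the maf value matters downstream: it is baked into the PGEN output names/contents that grs_score() later looks up.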


Step 3: Calculate GRS Scores – grs_score()

Once PGEN files are ready, grs_score() uploads your weights file(s) and submits one plink2 scoring job per GRS. Each job scores all 22 chromosomes and saves a single CSV to RAP.

ids <- grs_score(
  file     = c(
    grs_a = "weights/grs_a_weights.txt",
    grs_b = "weights/grs_b_weights.txt"
  ),
  pgen_dir = "/mnt/project/pgen",
  dest     = "/grs/",
  priority = "high"
)
#> ── Uploading 2 weight file(s) to RAP ────────────────────────────────────────
#> Uploading 'grs_a_weights.txt' ...
#> ✔ Uploaded: '/grs_a_weights.txt'
#> Uploading 'grs_b_weights.txt' ...
#> ✔ Uploaded: '/grs_b_weights.txt'
#> Submitting 2 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ grs_a -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ grs_b -> 'job-Gyyyyyyyyyyyyyyyyyyyyyyyy'
#> ✔ 2/2 job(s) submitted. Monitor with job_ls().

for (id in ids) job_wait(id)   # job_wait() takes one job ID at a time

When running from RAP RStudio with weights already at the project root, the upload step is skipped automatically:

# On RAP: weights already at /mnt/project/grs_a_weights.txt
ids <- grs_score(
  file     = c(grs_a = "/mnt/project/grs_a_weights.txt"),
  pgen_dir = "/mnt/project/pgen",
  dest     = "/grs/"
)
#> ℹ grs_a_weights.txt already at RAP root, skipping upload.

Important: the maf argument must match the value used in grs_bgen2pgen() so that the correct PGEN files are located.

Output per job: /grs/<GRS_name>_scores.csv with columns IID and the GRS score named GRS_<name>.
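Under the hood, each scoring job amounts to a plink2 --score call per chromosome. The sketch below builds such a command string; the column positions 1 2 3 assume the snp / effect_allele / beta order of the validated weights file, and the real grs_score() job may use different file names or modifiers:

```r
# Sketch of a per-chromosome plink2 scoring call (names are placeholders)
score_cmd <- function(chr, weights = "weights_clean.txt") {
  sprintf(
    "plink2 --pfile chr%d --score %s 1 2 3 header-read --out grs_chr%d",
    chr, weights, chr
  )
}
cat(score_cmd(1))
#> plink2 --pfile chr1 --score weights_clean.txt 1 2 3 header-read --out grs_chr1
```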


Step 4: Standardise GRS Columns – grs_standardize()

After downloading the score CSVs from RAP (via fetch_file()) and merging them into your analysis dataset, Z-score standardise each GRS column so that effect estimates are interpretable as per-SD associations.
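If you computed more than one GRS, the downloaded CSVs can be combined on IID before merging into your cohort. A sketch with data.table (the toy files below stand in for the CSVs fetched from RAP):

```r
library(data.table)

# Toy score files standing in for the CSVs fetched via fetch_file()
fwrite(data.table(IID = 1:3, GRS_a = c(0.1, 0.5, 0.2)), "GRS_a_scores.csv")
fwrite(data.table(IID = 1:3, GRS_b = c(1.2, 0.8, 1.0)), "GRS_b_scores.csv")

# Read every *_scores.csv and merge them on IID into one wide table
files  <- list.files(pattern = "_scores\\.csv$")
scores <- Reduce(function(x, y) merge(x, y, by = "IID", all = TRUE),
                 lapply(files, fread))
scores
```

The resulting table has one row per IID and one column per GRS, ready to join onto the analysis dataset.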

# Auto-detect all columns containing "grs" (case-insensitive)
cohort <- grs_standardize(cohort)
#> Auto-detected 2 GRS column(s): 'GRS_a', 'GRS_b'
#> ✔ GRS_a -> GRS_a_z  [mean=0.0000, sd=1.0000]
#> ✔ GRS_b -> GRS_b_z  [mean=0.0000, sd=1.0000]
# Or specify columns explicitly
cohort <- grs_standardize(cohort, grs_cols = c("GRS_a", "GRS_b"))

grs_zscore() is an alias and produces an identical result.

The original columns are preserved; _z columns are inserted immediately after their source column.
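The transform itself is the ordinary z-score, (x - mean) / sd. A by-hand equivalent on simulated data (this sketch assumes complete data; how grs_standardize() handles NAs is not shown here):

```r
# Z-score standardisation by hand: subtract the mean, divide by the SD
set.seed(1)
grs   <- rnorm(1000, mean = 2, sd = 0.5)  # raw GRS on an arbitrary scale
grs_z <- (grs - mean(grs)) / sd(grs)      # per-SD scale

round(mean(grs_z), 4)  # ~0
round(sd(grs_z), 4)    # 1
```

After this transform a regression coefficient on grs_z is interpretable as the effect per 1-SD increase in the score.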


Step 5: Validate Predictive Performance – grs_validate()

grs_validate() runs four complementary validation analyses for each GRS:

| Analysis | What it measures |
|---|---|
| Per SD | OR (logistic) or HR (Cox) per 1-SD increase in GRS |
| High vs Low | OR / HR comparing top 20% vs bottom 20% |
| Trend test | P-trend across Q1–Q4 quartiles |
| Discrimination | AUC (logistic) or C-index (Cox) with 95% CI |

Logistic (cross-sectional)

res <- grs_validate(
  data        = cohort,
  grs_cols    = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome"
)
#> ── Creating GRS groups ───────────────────────────────────────────────────────
#> ── Effect per SD (OR) ───────────────────────────────────────────────────────
#> ── High vs Low ──────────────────────────────────────────────────────────────
#> ── Trend test ───────────────────────────────────────────────────────────────
#> ── AUC ──────────────────────────────────────────────────────────────────────
#> ✔ Validation complete.

res$per_sd
res$discrimination

Cox (survival)

res <- grs_validate(
  data        = cohort,
  grs_cols    = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome",
  time_col    = "followup_years",
  covariates  = c("age", "sex", paste0("pc", 1:10))
)

res$per_sd          # HR per SD x model
res$high_vs_low     # HR: top 20% vs bottom 20%
res$trend           # p-trend across Q1–Q4
res$discrimination  # C-index with 95% CI

Return value: a named list with four data.table elements.

| Element | Columns (logistic) | Columns (Cox) |
|---|---|---|
| per_sd | exposure, term, model, OR, CI_lower, CI_upper, p_value | same with HR |
| high_vs_low | same as per_sd | same with HR |
| trend | exposure, model, p_trend | same |
| discrimination | GRS, AUC, CI_lower, CI_upper | GRS, C_index, CI_lower, CI_upper |

AUC calculation requires the pROC package. Install with install.packages("pROC") if needed.
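To see what the "Per SD" row amounts to, here is the unadjusted logistic case done by hand on simulated data (a sketch only; grs_validate() additionally fits covariate-adjusted models and computes the other three analyses):

```r
# Per-SD OR by hand: logistic regression of outcome on a standardised GRS
set.seed(42)
n     <- 2000
grs_z <- rnorm(n)                             # already standardised
p     <- plogis(-2 + 0.4 * grs_z)             # simulate with log-OR 0.4 per SD
y     <- rbinom(n, 1, p)

fit <- glm(y ~ grs_z, family = binomial)
or  <- exp(coef(fit)["grs_z"])                # OR per 1-SD increase
ci  <- exp(confint.default(fit)["grs_z", ])   # Wald 95% CI
round(c(OR = unname(or), ci), 2)
```

The estimated OR should land near exp(0.4) ≈ 1.49, the value used to simulate the data.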


Complete Pipeline Example

library(ukbflow)

# 1. Validate weights (local)
grs_check("weights.csv", dest = "weights_clean.txt")

# 2. Convert BGEN -> PGEN on RAP (submit jobs)
ids_std  <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/", maf = 0.01)
ids_lrg  <- grs_bgen2pgen(chr = 1:16,  dest = "/pgen/", maf = 0.01, instance = "large")
for (id in c(ids_std, ids_lrg)) job_wait(id)

# 3. Score GRS on RAP (submit jobs)
score_ids <- grs_score(
  file     = c(grs_a = "weights_clean.txt"),
  pgen_dir = "/mnt/project/pgen",
  maf      = 0.01,   # must match grs_bgen2pgen()
  dest     = "/grs/"
)
job_wait(score_ids)

# 4. Download score CSV from RAP
fetch_file("/grs/GRS_a_scores.csv", dest = "GRS_a_scores.csv")

# 5. Merge into analysis dataset and standardise
# cohort: your analysis data.table with IID and phenotype columns
scores <- data.table::fread("GRS_a_scores.csv")   # columns: IID, GRS_a
cohort <- scores[cohort, on = "IID"]               # right-join: keep all cohort rows
cohort <- grs_standardize(cohort, grs_cols = "GRS_a")

# 6. Validate
res <- grs_validate(
  data        = cohort,
  grs_cols    = "GRS_a_z",
  outcome_col = "outcome",
  time_col    = "followup_years",
  covariates  = c("age", "sex", paste0("pc", 1:10))
)

res$per_sd
res$discrimination

Getting Help