Overview
The grs_* functions provide an end-to-end pipeline for
computing and validating polygenic / genetic risk scores within the UK
Biobank Research Analysis Platform (RAP). Because individual-level
genotype data cannot be downloaded locally, all computationally
intensive steps are executed as cloud jobs via the DNAnexus Swiss Army
Knife app.
| Function | Where it runs | Purpose |
|---|---|---|
grs_check() |
Local or RAP | Validate and export a SNP weights file |
grs_bgen2pgen() |
Submits RAP jobs | Convert UKB imputed BGEN files to PGEN |
grs_score() |
Submits RAP jobs | Score genotypes across chromosomes with plink2 |
grs_standardize() |
Local or RAP | Z-score standardise GRS columns |
grs_zscore() |
Local or RAP | Alias for grs_standardize()
|
grs_validate() |
Local or RAP | Assess predictive performance (OR/HR, AUC, C-index) |
Typical pipeline:
grs_check() -> grs_bgen2pgen() -> grs_score() -> grs_standardize() -> grs_validate()
Prerequisite: you must be authenticated to UKB RAP and have a project selected. See
vignette("auth").
Step 1: Validate the Weights File – grs_check()
Before uploading to RAP, validate your SNP weights file. The function reads the file, checks required columns, flags problems, and writes a plink2-compatible space-delimited output.
Required columns:
| Column | Description |
|---|---|
snp |
SNP identifier (expected rs + digits format) |
effect_allele |
Effect allele: one of A / T / C / G |
beta |
Effect size (log-OR or beta); must be numeric |
library(ukbflow)
# Local: weights file on your machine
w <- grs_check("weights.csv", dest = "weights_clean.txt")
#> Read 'weights.csv': 312 rows, 5 columns.
#> ✔ No NA values.
#> ✔ No duplicate SNPs.
#> ✔ All SNP IDs match rs[0-9]+ format.
#> ✔ All effect alleles are A/T/C/G.
#> Beta summary:
#> Range : -0.412 to 0.589
#> Mean |beta|: 0.183
#> Positive : 187 (59.9%)
#> Negative : 125 (40.1%)
#> Zero : 0
#> ✔ Weights file passed checks: 312 SNPs ready for UKB RAP.
#> ✔ Saved: 'weights_clean.txt'
# On RAP (RStudio) -- use /mnt/project/ paths directly
w <- grs_check(
file = "/mnt/project/weights/weights.csv",
dest = "/mnt/project/weights/weights_clean.txt"
)The validated weights are also returned invisibly for inspection.
Step 2: Convert BGEN to PGEN – grs_bgen2pgen()
UKB imputed genotype data is stored in BGEN format on RAP.
grs_bgen2pgen() submits one Swiss Army Knife job per
chromosome to convert BGEN files to the plink2-native PGEN format with a
MAF filter applied via plink2. This staged workflow separates one-time
genotype conversion from repeated GRS scoring, so the converted PGEN
files can be reused across multiple score files.
The trade-off is additional RAP project storage: PGEN outputs should be kept only when they will be reused, and removed according to the project’s storage and cost policy when no longer needed.
This function is called from your local machine or RAP RStudio – the heavy lifting runs entirely in the cloud.
# Test on the smallest chromosome first
ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high")
#> Uploading 'grs_bgen2pgen_std.R' to RAP ...
#> ✔ Uploaded: '/grs_bgen2pgen_std.R'
#> Submitting 1 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ chr22 -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ 1/1 job(s) submitted. Monitor with job_ls().
# Full run: small chromosomes on standard, large on upgraded instance
ids_small <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/")
ids_large <- grs_bgen2pgen(chr = 1:16, dest = "/pgen/", instance = "large")
# Monitor progress (job_wait() takes one job ID at a time)
job_ls()
for (id in c(ids_small, ids_large)) job_wait(id)Instance types:
| Preset | DNAnexus instance | Cores | RAM | SSD | Recommended for |
|---|---|---|---|---|---|
"standard" |
mem2_ssd1_v2_x4 |
4 | 12 GB | 200 GB | chr 15–22 |
"large" |
mem2_ssd2_v2_x8 |
8 | 28 GB | 640 GB | chr 1–16 |
Key arguments:
| Argument | Default | Description |
|---|---|---|
chr |
1:22 |
Chromosomes to process |
dest |
— | RAP output path for PGEN files (required) |
maf |
0.01 |
MAF filter passed to plink2 --maf
|
instance |
"standard" |
Instance type preset |
priority |
"low" |
Job priority ("low" or "high") |
Storage warning: chromosomes 1–16 may exceed the 200 GB SSD on
"standard"instances. Useinstance = "large"for these.
Step 3: Calculate GRS Scores – grs_score()
Once PGEN files are ready, grs_score() uploads your
weights file(s) and submits one plink2 scoring job per GRS. Each job
scores all 22 chromosomes and saves a single CSV to RAP.
ids <- grs_score(
file = c(
grs_a = "weights/grs_a_weights.txt",
grs_b = "weights/grs_b_weights.txt"
),
pgen_dir = "/mnt/project/pgen",
dest = "/grs/",
priority = "high"
)
#> ── Uploading 2 weight file(s) to RAP ────────────────────────────────────────
#> Uploading 'grs_a_weights.txt' ...
#> ✔ Uploaded: '/grs_a_weights.txt'
#> Uploading 'grs_b_weights.txt' ...
#> ✔ Uploaded: '/grs_b_weights.txt'
#> Submitting 2 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ grs_a -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ grs_b -> 'job-Gyyyyyyyyyyyyyyyyyyyyyyyy'
#> ✔ 2/2 job(s) submitted. Monitor with job_ls().
job_wait(ids)When running from RAP RStudio with weights already at the project root, the upload step is skipped automatically:
# On RAP: weights already at /mnt/project/grs_a_weights.txt
ids <- grs_score(
file = c(grs_a = "/mnt/project/grs_a_weights.txt"),
pgen_dir = "/mnt/project/pgen",
dest = "/grs/"
)
#> ℹ grs_a_weights.txt already at RAP root, skipping upload.Important: the
mafargument must match the value used ingrs_bgen2pgen()so that the correct PGEN files are located.
Output per job: /grs/<GRS_name>_scores.csv with
columns IID and the GRS score named
GRS_<name>.
Step 4: Standardise GRS Columns –
grs_standardize()
After downloading the score CSVs from RAP (via
fetch_file()) and merging them into your analysis dataset,
Z-score standardise each GRS column so that effect estimates are
interpretable as per-SD associations.
# Auto-detect all columns containing "grs" (case-insensitive)
cohort <- grs_standardize(cohort)
#> Auto-detected 2 GRS column(s): 'GRS_a', 'GRS_b'
#> ✔ GRS_a -> GRS_a_z [mean=0.0000, sd=1.0000]
#> ✔ GRS_b -> GRS_b_z [mean=0.0000, sd=1.0000]
# Or specify columns explicitly
cohort <- grs_standardize(cohort, grs_cols = c("GRS_a", "GRS_b"))grs_zscore() is an alias and produces an identical
result.
The original columns are preserved; _z columns are
inserted immediately after their source column.
Step 5: Validate Predictive Performance –
grs_validate()
grs_validate() runs four complementary validation
analyses for each GRS:
| Analysis | What it measures |
|---|---|
| Per SD | OR (logistic) or HR (Cox) per 1-SD increase in GRS |
| High vs Low | OR / HR comparing top 20% vs bottom 20% |
| Trend test | P-trend across Q1–Q4 quartiles |
| Discrimination | AUC (logistic) or C-index (Cox) with 95% CI |
Logistic (cross-sectional)
res <- grs_validate(
data = cohort,
grs_cols = c("GRS_a_z", "GRS_b_z"),
outcome_col = "outcome"
)
#> ── Creating GRS groups ───────────────────────────────────────────────────────
#> ── Effect per SD (OR) ───────────────────────────────────────────────────────
#> ── High vs Low ──────────────────────────────────────────────────────────────
#> ── Trend test ───────────────────────────────────────────────────────────────
#> ── AUC ──────────────────────────────────────────────────────────────────────
#> ✔ Validation complete.
res$per_sd
res$discriminationCox (survival)
res <- grs_validate(
data = cohort,
grs_cols = c("GRS_a_z", "GRS_b_z"),
outcome_col = "outcome",
time_col = "followup_years",
covariates = c("age", "sex", paste0("pc", 1:10))
)
res$per_sd # HR per SD x model
res$high_vs_low # HR: top 20% vs bottom 20%
res$trend # p-trend across Q1–Q4
res$discrimination # C-index with 95% CIReturn value: a named list with four
data.table elements.
| Element | Columns (logistic) | Columns (Cox) |
|---|---|---|
per_sd |
exposure, term, model,
OR, CI_lower, CI_upper,
p_value
|
same with HR
|
high_vs_low |
same as per_sd
|
same with HR
|
trend |
exposure, model, p_trend
|
same |
discrimination |
GRS, AUC, CI_lower,
CI_upper
|
GRS, C_index, CI_lower,
CI_upper
|
AUC calculation requires the
pROCpackage. Install withinstall.packages("pROC")if needed.
Complete Pipeline Example
library(ukbflow)
# 1. Validate weights (local)
grs_check("weights.csv", dest = "weights_clean.txt")
# 2. Convert BGEN -> PGEN on RAP (submit jobs)
ids_std <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/", maf = 0.01)
ids_lrg <- grs_bgen2pgen(chr = 1:16, dest = "/pgen/", maf = 0.01, instance = "large")
for (id in c(ids_std, ids_lrg)) job_wait(id)
# 3. Score GRS on RAP (submit jobs)
score_ids <- grs_score(
file = c(grs_a = "weights_clean.txt"),
pgen_dir = "/mnt/project/pgen",
maf = 0.01, # must match grs_bgen2pgen()
dest = "/grs/"
)
job_wait(score_ids)
# 4. Download score CSV from RAP
fetch_file("/grs/GRS_a_scores.csv", dest = "GRS_a_scores.csv")
# 5. Merge into analysis dataset and standardise
# cohort: your analysis data.table with IID and phenotype columns
scores <- data.table::fread("GRS_a_scores.csv") # columns: IID, GRS_a
cohort <- scores[cohort, on = "IID"] # right-join: keep all cohort rows
cohort <- grs_standardize(cohort, grs_cols = "GRS_a")
# 6. Validate
res <- grs_validate(
data = cohort,
grs_cols = "GRS_a_z",
outcome_col = "outcome",
time_col = "followup_years",
covariates = c("age", "sex", paste0("pc", 1:10))
)
res$per_sd
res$discriminationGetting Help
-
?grs_check,?grs_bgen2pgen,?grs_score -
?grs_standardize,?grs_validate -
vignette("auth")– RAP authentication and project selection -
vignette("job")– monitoring and managing RAP jobs -
vignette("assoc")– association analysis functions used insidegrs_validate() - GitHub Issues
