Download

Overview

The download module covers four related tasks:

Task                        Function
Download one file safely    download_url()
Download multiple files     download_batch()
Download gene annotation    download_gene_ref()
Download GEO series data    download_geo()

library(evanverse)

The module is intentionally small. It does not try to replace a full data management workflow; it gives evanverse users predictable helpers for common download jobs that appear repeatedly in analysis projects.

Single-File Downloads

download_url() downloads one URL to one destination path.

dest <- file.path(tempdir(), "README.md")

download_url(
  url  = "https://raw.githubusercontent.com/tidyverse/ggplot2/main/README.md",
  dest = dest
)

By default, an existing destination file is treated as complete and the download is skipped.

download_url(
  url       = "https://raw.githubusercontent.com/tidyverse/ggplot2/main/README.md",
  dest      = dest,
  overwrite = FALSE
)
ℹ File already exists, skipping: '/tmp/RtmpGXC9gI/README.md'

Set overwrite = TRUE to download again. With the default resume = TRUE, an existing non-empty file at dest is treated as a partial download and used as the resume target, so the transfer is written directly to dest.

download_url(
  url       = "https://raw.githubusercontent.com/tidyverse/ggplot2/main/README.md",
  dest      = dest,
  overwrite = TRUE,
  resume    = TRUE
)

When resume = FALSE, the function downloads to a temporary file first and moves it into place only after success. This avoids leaving a new partial file at dest after a failed download.

download_url(
  url       = "https://raw.githubusercontent.com/tidyverse/ggplot2/main/README.md",
  dest      = dest,
  overwrite = TRUE,
  resume    = FALSE
)
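
The temporary-file pattern described above can be sketched in base R. This is a minimal illustration, not the evanverse implementation: write_via_temp() and its writer argument are hypothetical names, and a local function stands in for the network transfer so the example runs offline.

```r
# Hypothetical helper (not part of evanverse): write via a temp file,
# then move the result into place only if the writer succeeded.
write_via_temp <- function(dest, writer) {
  # Create the temp file next to dest so the final move stays on one filesystem.
  tmp <- tempfile(pattern = "partial-", tmpdir = dirname(dest))
  on.exit(unlink(tmp), add = TRUE)  # remove leftovers if the writer fails

  writer(tmp)  # an error here leaves dest untouched

  # file.rename() may refuse to replace an existing file on some platforms,
  # so fall back to an overwriting copy.
  if (!file.rename(tmp, dest)) {
    if (!file.copy(tmp, dest, overwrite = TRUE)) {
      stop("Could not move file into place: ", dest, call. = FALSE)
    }
  }
  invisible(dest)
}

dest <- file.path(tempdir(), "example.txt")
writeLines("old contents", dest)

# A failing writer only ever touches the temp file, never dest.
try(
  write_via_temp(dest, function(path) {
    writeLines("partial", path)
    stop("simulated network failure")
  }),
  silent = TRUE
)
readLines(dest)

# A successful writer replaces dest in one step.
write_via_temp(dest, function(path) writeLines("new contents", path))
readLines(dest)
```

Because the partial data never lands at dest, a later call that skips existing files cannot mistake a failed download for a complete one.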

Use shorter timeouts and fewer retries for interactive work; use longer values for unattended scripts.

download_url(
  url     = "https://example.com/data.csv",
  dest    = file.path(tempdir(), "data.csv"),
  timeout = 60,
  retries = 1
)
Error:
! Failed to download <https://example.com/data.csv>: HTTP response code
  said error [example.com]: The requested URL returned error: 404

Batch Downloads

download_batch() downloads multiple URLs into one directory. File names are derived from the URL path.

urls <- c(
  "https://httpbin.org/robots.txt",
  "https://httpbin.org/encoding/utf8"
)

paths <- download_batch(
  urls     = urls,
  dest_dir = file.path(tempdir(), "downloads")
)

Existing files are skipped when overwrite = FALSE.

download_batch(
  urls      = urls,
  dest_dir  = file.path(tempdir(), "downloads"),
  overwrite = FALSE
)

The function rejects duplicate destination file names. This matters when two different URLs end with the same file name.

download_batch(
  urls = c(
    "https://host-a.example/data.csv",
    "https://host-b.example/data.csv"
  ),
  dest_dir = tempdir()
)
Error:
! `destination filenames derived from urls` must not contain duplicate
  value: "/tmp/RtmpGXC9gI/data.csv".
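
To spot such collisions before calling download_batch(), the file names can be checked up front. The sketch below assumes that names are derived roughly like basename() on the URL; the exact derivation rule in evanverse may differ, for example for URLs with query strings.

```r
urls <- c(
  "https://host-a.example/data.csv",
  "https://host-b.example/data.csv",
  "https://host-a.example/notes.txt"
)

# Assumption: the destination name is the last path component of the URL.
file_names <- basename(urls)

# Names that occur more than once would collide inside a single dest_dir.
dupes <- unique(file_names[duplicated(file_names)])
dupes

# URLs that share a colliding name.
urls[file_names %in% dupes]
```

One way around a collision is to download the colliding URLs separately, either with download_url() and an explicit dest, or with one download_batch() call per host into different directories.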

Blank URLs are rejected early so errors point to the input rather than to curl or file-name derivation.

download_batch(
  urls     = c("https://example.com/data.csv", ""),
  dest_dir = tempdir()
)
Error:
! `urls` must not contain NA or empty string values.
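
When URLs come from a spreadsheet or a manifest file, it can help to locate the offending entries before the call. A minimal sketch of an equivalent check in base R (the variable names are illustrative only):

```r
urls <- c("https://example.com/data.csv", "", NA)

# nzchar() alone treats NA as non-empty, so test for NA explicitly.
bad <- is.na(urls) | !nzchar(urls)

which(bad)         # positions of the blank or missing entries
clean <- urls[!bad]
clean
```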

Gene Annotation

download_gene_ref() retrieves a standard Ensembl gene annotation table for human or mouse using biomaRt.

human_ref <- download_gene_ref("human")

head(human_ref)

The result contains stable columns used elsewhere in evanverse workflows:

Column                            Meaning
ensembl_id                        Ensembl gene ID
symbol                            Gene symbol
entrez_id                         Entrez gene ID
gene_type                         Ensembl biotype
chromosome, start, end, strand    Genomic location
description                       Ensembl description
species                           Requested species
ensembl_version                   Current Ensembl version when available
download_date                     Local download date

Use dest to save the table as an RDS file. The .rds extension is appended when it is omitted.

download_gene_ref(
  species = "mouse",
  dest    = file.path(tempdir(), "mouse_gene_ref")
)
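
The saved file can be read back with readRDS(). The sketch below shows the round trip with a small stand-in table so it runs without a network connection. with_rds_ext() is a hypothetical helper that mimics the extension rule described above, and the case-insensitive match is an assumption rather than documented behavior.

```r
# Hypothetical helper: append ".rds" unless the path already ends with it.
with_rds_ext <- function(path) {
  if (grepl("\\.rds$", path, ignore.case = TRUE)) path else paste0(path, ".rds")
}

dest <- with_rds_ext(file.path(tempdir(), "mouse_gene_ref"))
basename(dest)

# Stand-in table with placeholder values; a real call returns the full annotation.
toy_ref <- data.frame(
  ensembl_id = "PLACEHOLDER_ID",
  symbol     = "PLACEHOLDER_SYMBOL",
  stringsAsFactors = FALSE
)

saveRDS(toy_ref, dest)
restored <- readRDS(dest)
identical(restored, toy_ref)
```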

biomaRt is required for this function. If it is not installed, the function fails with an installation hint instead of a namespace error.
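
A check of this kind is commonly written with requireNamespace(). The following is a sketch of the general pattern, not the code evanverse uses; require_pkg() and its hint argument are hypothetical.

```r
# Hypothetical helper: stop with a readable hint when a package is missing.
require_pkg <- function(pkg, hint) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      sprintf("Package '%s' is required for this function. %s", pkg, hint),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# "stats" ships with R, so this check passes silently.
require_pkg("stats", hint = "It is part of base R.")

# For a Bioconductor package such as biomaRt, the hint can point to BiocManager.
biomart_hint <- 'Install it with BiocManager::install("biomaRt").'
```

Calling require_pkg("biomaRt", hint = biomart_hint) at the top of a script surfaces a missing dependency before any network work starts.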

GEO Series

download_geo() downloads a GEO series and returns a list with three elements:

Output                Contents
gse_object            GEO expression object from GEOquery::getGEO()
supplemental_files    Downloaded supplemental files, when available
platform_info         Platform ID and downloaded platform annotation files

geo <- download_geo(
  gse_id   = "GSE121212",
  dest_dir = file.path(tempdir(), "GSE121212")
)

names(geo)

GEO accessions must use the GSE prefix followed by digits.

download_geo("GSE12345", dest_dir = tempdir())

GEOquery is required for this function. Supplemental and platform downloads can fail independently of the main series matrix, so downstream code should check the returned file vectors before reading them.
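
A defensive check before reading supplemental files might look like the sketch below. It assumes that supplemental_files is a character vector of local paths. The geo list here is a local mock so the example runs offline, and the file names are invented for illustration.

```r
# Mock of the returned list: one file that exists and one that does not.
existing <- file.path(tempdir(), "supp_table.txt")
writeLines("id\tvalue", existing)
missing_path <- tempfile(pattern = "never-downloaded-")  # path is never created

geo <- list(supplemental_files = c(existing, missing_path))

# Keep only paths that are present on disk.
supp   <- geo$supplemental_files
usable <- supp[!is.na(supp) & file.exists(supp)]

if (length(usable) == 0) {
  message("No supplemental files available; skipping.")
} else {
  for (path in usable) {
    message("Ready to read: ", basename(path))
  }
}
```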

Design Notes

  • download_url() uses temporary files for fresh downloads so failed calls do not create a destination file that later calls would treat as complete.
  • Resume mode writes directly to the destination because that is the intended curl resume behavior.
  • download_batch() retries only failed files after each attempt.
  • download_batch() does not currently remove destination files left behind by failed downloads, because curl::multi_download() may have written over a user file that existed before the call. Safe cleanup would need to distinguish files created by the current call from files the user already owned.
  • External data sources are network-dependent. Tests that hit httpbin, Ensembl, or GEO are skipped on CRAN and CI, but should still be run locally when validating this module.
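
The "retry only failed files" note can be illustrated with a small serial loop. This is a sketch of the idea only: download_batch() relies on curl::multi_download(), whereas retry_failed() and flaky_fetch() below are hypothetical stand-ins that run without a network connection.

```r
# Hypothetical helper: call fetch_one() per URL, retrying only the failures.
retry_failed <- function(urls, fetch_one, retries = 2) {
  pending <- urls
  for (attempt in seq_len(retries + 1)) {
    if (length(pending) == 0) break
    ok <- vapply(
      pending,
      function(u) tryCatch({ fetch_one(u); TRUE }, error = function(e) FALSE),
      logical(1)
    )
    pending <- pending[!ok]  # successful URLs are never attempted again
  }
  pending  # URLs that failed on every attempt
}

# Stand-in fetcher that counts attempts and fails once for URL "b".
attempts <- new.env()
flaky_fetch <- function(u) {
  n <- if (is.null(attempts[[u]])) 1 else attempts[[u]] + 1
  assign(u, n, envir = attempts)
  if (u == "b" && n == 1) stop("simulated failure")
  invisible(u)
}

still_failed <- retry_failed(c("a", "b"), flaky_fetch)
length(still_failed)
attempts[["a"]]
attempts[["b"]]
```

Tracking the pending set explicitly keeps successful files from being downloaded again on later attempts.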