Overview
The base module provides eleven utility functions covering four areas:
| Area | Functions |
|---|---|
| Data frame utilities | df2list(), df2vect(), recode_column(), view() |
| File system utilities | file_ls(), file_info(), file_tree() |
| Gene ID conversion | gene2entrez(), gene2ensembl() |
| GMT file parsing | gmt2df(), gmt2list() |
1 Data Frame Utilities
df2list() — Split a data frame into a named list
Groups one column’s values by another column and returns a named list. Useful for building marker lists, gene set inputs, or any grouping operation that downstream functions expect as a list.
df <- data.frame(
cell_type = c("T_cell", "T_cell", "B_cell", "B_cell", "B_cell"),
marker = c("CD3D", "CD3E", "CD79A", "MS4A1", "CD19"),
stringsAsFactors = FALSE
)
df2list(df, group_col = "cell_type", value_col = "marker")
#> $T_cell
#> [1] "CD3D" "CD3E"
#>
#> $B_cell
#> [1] "CD79A" "MS4A1" "CD19"
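To see the grouping step in isolation, the following base-R sketch produces the same shape of output with split(). It is an illustration only and is not taken from the df2list() source. The factor() call is an assumption made here so that groups stay in order of first appearance, as in the output above.

```r
# Base-R sketch of the grouping step; not the package's implementation.
df <- data.frame(
  cell_type = c("T_cell", "T_cell", "B_cell", "B_cell", "B_cell"),
  marker    = c("CD3D", "CD3E", "CD79A", "MS4A1", "CD19"),
  stringsAsFactors = FALSE
)

# split() orders its result by factor level, so fix the levels to the
# order of first appearance to avoid an alphabetical re-sort.
groups  <- factor(df$cell_type, levels = unique(df$cell_type))
markers <- split(df$marker, groups)

names(markers)
markers$B_cell
```

Without the explicit levels, split() would return B_cell before T_cell.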
df2vect() — Extract a named vector from a data frame
Extracts two columns and returns a named vector, using one column as names and the other as values. The original value type is preserved.
df <- data.frame(
gene = c("TP53", "BRCA1", "MYC"),
score = c(0.91, 0.74, 0.55),
stringsAsFactors = FALSE
)
df2vect(df, name_col = "gene", value_col = "score")
#> TP53 BRCA1 MYC
#> 0.91 0.74 0.55
The name column must not contain NA, empty strings, or duplicates; all three are caught at input and raise an informative error.
bad <- data.frame(id = c("a", "a"), val = 1:2)
df2vect(bad, "id", "val")
#> Error in `df2vect()`:
#> ! `name_col` contains duplicate values.
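The three input checks described above can be sketched in base R as follows. The helper name named_vector_sketch() and the wording of its error messages are hypothetical; df2vect() itself may implement the checks differently.

```r
# Hypothetical helper illustrating the checks; not the df2vect() source.
named_vector_sketch <- function(df, name_col, value_col) {
  nms <- as.character(df[[name_col]])
  if (anyNA(nms))             stop("`name_col` contains NA values.")
  if (any(nms == ""))         stop("`name_col` contains empty strings.")
  if (anyDuplicated(nms) > 0) stop("`name_col` contains duplicate values.")
  # setNames() leaves the value column's type untouched
  # (numeric stays numeric, character stays character).
  setNames(df[[value_col]], nms)
}

df <- data.frame(
  gene  = c("TP53", "BRCA1", "MYC"),
  score = c(0.91, 0.74, 0.55),
  stringsAsFactors = FALSE
)
scores <- named_vector_sketch(df, "gene", "score")
scores[["BRCA1"]]
```

Failing early on duplicate names matters because lookups by name, such as scores[["BRCA1"]], only ever return the first match.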
recode_column() — Map column values via a named vector
Replaces values in a column using a named vector (dict).
Values with no match in dict receive the value of default (NA unless specified). Set
name to write the result to a new column instead of overwriting the
source.
df <- data.frame(
gene = c("TP53", "BRCA1", "EGFR", "XYZ"),
stringsAsFactors = FALSE
)
dict <- c("TP53" = "Tumour suppressor", "EGFR" = "Oncogene")
# Overwrite in place
recode_column(df, column = "gene", dict = dict)
#> gene
#> 1 Tumour suppressor
#> 2 <NA>
#> 3 Oncogene
#> 4 <NA>
# Write to a new column, keep original; use a custom fallback
recode_column(df, column = "gene", dict = dict,
name = "role", default = "Unknown")
#> gene role
#> 1 TP53 Tumour suppressor
#> 2 BRCA1 Unknown
#> 3 EGFR Oncogene
#> 4 XYZ Unknown
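The lookup behind this kind of recoding can be sketched with plain named-vector indexing. The function recode_sketch() below is a hypothetical stand-in written for this illustration and is not the recode_column() implementation.

```r
# Hypothetical sketch of a dictionary lookup with a fallback value.
recode_sketch <- function(x, dict, default = NA) {
  # Indexing a named vector by name yields NA for names it does not contain.
  out <- unname(dict[as.character(x)])
  out[is.na(out)] <- default
  out
}

genes <- c("TP53", "BRCA1", "EGFR", "XYZ")
dict  <- c("TP53" = "Tumour suppressor", "EGFR" = "Oncogene")

recode_sketch(genes, dict)                       # unmatched values become NA
recode_sketch(genes, dict, default = "Unknown")  # unmatched values become "Unknown"
```

One limitation of this sketch: a dict entry whose value is itself NA cannot be told apart from an unmatched key.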
view() — Interactive table viewer
Returns an interactive reactable widget with search,
filtering, sorting, and pagination. In RStudio the widget renders in the
Viewer pane; in other environments it renders in the default HTML
output.
view(iris, n = 10)
view() requires the reactable package. If it is not installed, the function raises a clear error rather than falling back silently.
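A common way to fail loudly on a missing optional dependency is a requireNamespace() guard. The helper below is a hypothetical sketch of that pattern, with an assumed error message. It is not the check that view() actually performs.

```r
# Hypothetical guard for an optional dependency; not the view() source.
require_pkg_sketch <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      sprintf("Package '%s' is required but not installed.", pkg),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# 'stats' ships with R, so this call succeeds silently.
require_pkg_sketch("stats")
```

Calling the guard at the top of a function makes the failure happen before any work is done.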
2 File System Utilities
file_ls() — List files with metadata
Returns a data frame of file metadata for all files in a directory.
Columns: file, size_MB,
modified_time, path.
# All files in the current directory
file_ls(".")
#> file size_MB modified_time path
#> 1 DESCRIPTION 0.002 2026-03-20 14:22:01 F:/project/evanverse/DESCRIPTION
#> 2 NAMESPACE 0.002 2026-03-20 14:22:01 F:/project/evanverse/NAMESPACE
#> ...
# R source files only, searched recursively
file_ls("R", recursive = TRUE, pattern = "\\.R$")
file_info() — Metadata for specific files
Returns the same four-column data frame as file_ls() but
for an explicit vector of file paths rather than a directory scan.
file_info(c("DESCRIPTION", "NAMESPACE"))
#> file size_MB modified_time path
#> 1 DESCRIPTION 0.002 2026-03-20 14:22:01 F:/project/evanverse/DESCRIPTION
#> 2 NAMESPACE 0.002 2026-03-20 14:22:01 F:/project/evanverse/NAMESPACE
Duplicate paths in the input are silently deduplicated. Missing files raise an error listing all unresolved paths.
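The deduplication and missing-file behaviour can be sketched with base R's file.exists() and file.info(). The function file_info_sketch() is hypothetical, and the size conversion (bytes divided by 1024^2, rounded to three decimals) is an assumption. The package may compute size_MB differently.

```r
# Hypothetical base-R sketch of the four-column metadata table.
file_info_sketch <- function(paths) {
  paths <- unique(paths)  # silently drop duplicate inputs
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    # Report every unresolved path in one error, not just the first.
    stop("Files not found: ", paste(not_found, collapse = ", "))
  }
  info <- file.info(paths)
  data.frame(
    file          = basename(paths),
    size_MB       = round(info$size / 1024^2, 3),  # assumed unit conversion
    modified_time = info$mtime,
    path          = normalizePath(paths, winslash = "/"),
    stringsAsFactors = FALSE
  )
}

# Demonstrate on a temporary file, passed twice on purpose.
tmp_file <- tempfile(fileext = ".txt")
writeLines("hello", tmp_file)
res <- file_info_sketch(c(tmp_file, tmp_file))
nrow(res)
```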
file_tree() — Print a directory tree
Prints the directory structure in tree format. Returns the lines invisibly so output can be captured if needed.
file_tree(".", max_depth = 2)
#> F:/project/evanverse
#> +-- DESCRIPTION
#> +-- NAMESPACE
#> +-- R
#> |   +-- base.R
#> |   +-- plot.R
#> |   +-- utils.R
#> +-- tests
#>     +-- testthat
3 Gene ID Conversion
Both gene2entrez() and gene2ensembl()
accept a character vector of gene symbols and return a three-column data
frame: the original input (symbol), the case-normalised
form used for matching (symbol_std), and the converted
ID.
Reference table
Matching is performed against a ref data frame with
columns symbol, entrez_id, and
ensembl_id. Two sources are available:
| Source | When to use |
|---|---|
| toy_gene_ref() | Examples, tests, offline work: 20 genes, no network |
| download_gene_ref() | Production analysis: full genome via biomaRt |
# Fast, offline reference for development
ref <- toy_gene_ref(species = "human")
# Full reference for analysis (requires network + Bioconductor)
# ref <- download_gene_ref(species = "human")
Case normalisation
| Species | Rule applied to both input and reference |
|---|---|
| "human" | toupper(): "tp53" and "TP53" both match TP53 |
| "mouse" | tolower(): "TRP53" and "Trp53" both match Trp53 |
Unmatched symbols are returned with NA in the ID column
rather than dropped.
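The matching rule can be illustrated with base R's match(): normalise both sides, look up positions, and let unmatched positions fall through as NA. The function symbols_to_ids_sketch() and the two-row reference table below are hypothetical stand-ins. They are not the package's code or toy_gene_ref() itself.

```r
# Hypothetical two-gene reference; the real ref tables have more columns.
ref <- data.frame(
  symbol    = c("TP53", "BRCA1"),
  entrez_id = c("7157", "672"),
  stringsAsFactors = FALSE
)

# Sketch of case-normalised matching that keeps unmatched inputs as NA.
symbols_to_ids_sketch <- function(query, ref, id_col, species = "human") {
  normalise <- if (species == "human") toupper else tolower
  query_std <- normalise(query)
  idx <- match(query_std, normalise(ref$symbol))  # NA where no match exists
  out <- data.frame(
    symbol     = query,
    symbol_std = query_std,
    stringsAsFactors = FALSE
  )
  out[[id_col]] <- ref[[id_col]][idx]  # indexing by NA yields NA
  out
}

res <- symbols_to_ids_sketch(c("tp53", "BRCA1", "GHOST"), ref, "entrez_id")
res
```

Because match() returns one position per query element, the result always has exactly one row per input symbol, in input order.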
gene2entrez()
ref <- toy_gene_ref(species = "human")
gene2entrez(c("tp53", "BRCA1", "GHOST"), ref = ref, species = "human")
#> symbol symbol_std entrez_id
#> 1 tp53 TP53 7157
#> 2 BRCA1 BRCA1 672
#> 3 GHOST GHOST <NA>
gene2ensembl()
ref_mouse <- toy_gene_ref(species = "mouse")
gene2ensembl(c("Trp53", "TRP53", "FakeGene"), ref = ref_mouse, species = "mouse")
#> symbol symbol_std ensembl_id
#> 1 Trp53 trp53 ENSMUSG00000059552
#> 2 TRP53 trp53 ENSMUSG00000059552
#> 3 FakeGene fakegene <NA>
4 GMT File Parsing
GMT (Gene Matrix Transposed) is the standard format for gene set
collections such as MSigDB. Each line encodes one gene set as
tab-separated fields: the term name, a description, and then the
gene symbols.
toy_gmt() writes a minimal GMT file to a temp path for
offline use:
tmp <- toy_gmt(n = 3)
readLines(tmp)
#> [1] "HALLMARK_P53_PATHWAY\tGenes regulated by p53\tTP53\tBRCA1\tMYC\t..."
#> [2] "HALLMARK_MTORC1_SIGNALING\tGenes upregulated by mTORC1\tPTEN\t..."
#> [3] "HALLMARK_HYPOXIA\tGenes upregulated under hypoxia\tMTOR\tHIF1A\t..."
gmt2df() — Long-format data frame
Returns one row per gene, making the output directly compatible with
dplyr and data.table workflows.
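To make the long format concrete, the sketch below builds a one-row-per-gene table from a small named list in base R. The gene sets are made up for illustration. The column names term and gene follow the combined workflow later in this document; whether gmt2df() returns further columns (for example a description) is not shown here and is left open.

```r
# Made-up gene sets used only to illustrate the long-format shape.
gs <- list(
  SET_A = c("TP53", "MDM2"),
  SET_B = c("MYC")
)

# One row per (term, gene) pair: repeat each set name once per member.
long <- data.frame(
  term = rep(names(gs), lengths(gs)),
  gene = unlist(gs, use.names = FALSE),
  stringsAsFactors = FALSE
)
long

# Set sizes fall out of a simple tabulation on the long table.
table(long$term)
```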
gmt2list() — Named list of gene vectors
Returns a named list where each element is a character vector of gene
symbols. This is the format expected by most gene set enrichment tools
(e.g., fgsea, clusterProfiler).
gs <- gmt2list(tmp)
names(gs)
#> [1] "HALLMARK_P53_PATHWAY" "HALLMARK_MTORC1_SIGNALING"
#> [3] "HALLMARK_HYPOXIA"
gs[["HALLMARK_P53_PATHWAY"]]
#> [1] "TP53" "BRCA1" "MYC" "EGFR" "PTEN" "CDK2" "MDM2"
#> [8] "RB1" "CDKN2A" "AKT1"
Lines with fewer than 3 tab-separated fields are skipped with a warning and removed from the result. If every line is malformed, both functions return NULL rather than raising an error; this is the current behaviour. Always check for a NULL return when parsing files from untrusted sources.
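The skip-and-warn rule and the NULL return can be sketched with a small base-R parser, followed by the kind of NULL check recommended above. parse_gmt_sketch() is a hypothetical illustration operating on lines already read into memory. It is not the gmt2list() implementation, and its warning text is invented.

```r
# Hypothetical parser mirroring the documented malformed-line behaviour.
parse_gmt_sketch <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  ok <- lengths(fields) >= 3  # term, description, at least one gene
  if (any(!ok)) warning(sum(!ok), " malformed line(s) skipped.")
  if (!any(ok)) return(NULL)  # nothing usable: NULL, not an error
  fields <- fields[ok]
  sets <- lapply(fields, function(f) f[-(1:2)])  # drop term and description
  names(sets) <- vapply(fields, function(f) f[1], character(1))
  sets
}

lines <- c("SET_A\tdemo set\tTP53\tMDM2", "BROKEN_LINE")
gs <- suppressWarnings(parse_gmt_sketch(lines))

# Guard against the all-malformed case before using the result.
if (is.null(gs)) {
  stop("No valid gene sets found in the input.")
}
names(gs)
```

The same is.null() guard applies to the return value of gmt2df() and gmt2list() when the input file is not trusted.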
5 A Combined Workflow
Gene ID conversion and GMT parsing compose naturally. The example below reads a GMT file, converts all gene symbols to Entrez IDs, and produces a named list of ID vectors ready for enrichment analysis.
library(evanverse)
# 1. Parse GMT into long format
tmp <- toy_gmt(n = 5)
df <- gmt2df(tmp)
# 2. Convert symbols to Entrez IDs
ref <- toy_gene_ref(species = "human")
id_map <- gene2entrez(df$gene, ref = ref, species = "human")
# 3. Attach IDs and drop unmatched
df$entrez_id <- id_map$entrez_id
df <- df[!is.na(df$entrez_id), ]
# 4. Rebuild named list with Entrez IDs
gs_entrez <- df2list(df, group_col = "term", value_col = "entrez_id")
gs_entrez[["HALLMARK_P53_PATHWAY"]]
#> [1] "7157" "672" "4609" "1956" "5728" "1017" "4193" "5925" "1029" "207"