Base Utilities

Overview

The base module provides thirteen utility functions covering five areas:

Area	Functions
Data frame utilities	`df2list()`, `df2vect()`, `recode_column()`, `view()`
File system utilities	`file_ls()`, `file_info()`, `file_tree()`
Gene ID conversion	`gene2entrez()`, `gene2ensembl()`
GMT file parsing	`gmt2df()`, `gmt2list()`
Math utilities	`perm()`, `comb()`

library(evanverse)

1 Data Frame Utilities

`df2list()` — Split a data frame into a named list

Groups one column’s values by another column and returns a named list. Useful for building marker lists, gene set inputs, or any grouping operation that downstream functions expect as a list.

df <- data.frame(
  cell_type = c("T_cell", "T_cell", "B_cell", "B_cell", "B_cell"),
  marker    = c("CD3D", "CD3E", "CD79A", "MS4A1", "CD19"),
  stringsAsFactors = FALSE
)

df2list(df, group_col = "cell_type", value_col = "marker")
#> $T_cell
#> [1] "CD3D" "CD3E"
#>
#> $B_cell
#> [1] "CD79A" "MS4A1" "CD19"
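
For readers who want to see the grouping spelled out, roughly the same result can be produced in base R. This is only a sketch of the behaviour shown above, not the implementation of `df2list()`; fixing the factor levels with `unique()` is an assumption made here so that groups appear in order of first appearance, as in the printed output.

```r
df <- data.frame(
  cell_type = c("T_cell", "T_cell", "B_cell", "B_cell", "B_cell"),
  marker    = c("CD3D", "CD3E", "CD79A", "MS4A1", "CD19"),
  stringsAsFactors = FALSE
)

# split() orders groups by factor level, so set the levels to
# first-appearance order instead of the default alphabetical order
groups  <- factor(df$cell_type, levels = unique(df$cell_type))
markers <- split(df$marker, groups)

names(markers)
markers$B_cell
```

Without the explicit levels, `split()` would return `B_cell` before `T_cell`.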

`df2vect()` — Extract a named vector from a data frame

Extracts two columns and returns a named vector, using one column as names and the other as values. The original value type is preserved.

df <- data.frame(
  gene  = c("TP53", "BRCA1", "MYC"),
  score = c(0.91, 0.74, 0.55),
  stringsAsFactors = FALSE
)

df2vect(df, name_col = "gene", value_col = "score")
#>  TP53 BRCA1   MYC
#>  0.91  0.74  0.55

The name column must not contain NA, empty strings, or duplicates — all three are caught at input and raise an informative error.

bad <- data.frame(id = c("a", "a"), val = 1:2)
df2vect(bad, "id", "val")
#> Error in `df2vect()`:
#> ! `name_col` contains duplicate values.
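
The three checks described above can be sketched in base R as follows. `named_vector_sketch()` is a hypothetical helper written for illustration; it is not part of evanverse, and the real function's error messages and check order may differ.

```r
# Hypothetical helper, not part of evanverse: mirrors the documented name checks
named_vector_sketch <- function(df, name_col, value_col) {
  nms <- df[[name_col]]
  if (anyNA(nms)) stop("`name_col` contains NA values.")
  if (any(!nzchar(as.character(nms)))) stop("`name_col` contains empty strings.")
  if (anyDuplicated(nms) > 0) stop("`name_col` contains duplicate values.")
  # setNames() keeps the type of the value column (numeric stays numeric)
  stats::setNames(df[[value_col]], nms)
}

scores <- named_vector_sketch(
  data.frame(gene = c("TP53", "BRCA1", "MYC"), score = c(0.91, 0.74, 0.55)),
  "gene", "score"
)
scores[["BRCA1"]]
```

Rejecting duplicates up front matters because a lookup such as `scores[["BRCA1"]]` returns only the first element with that name, so duplicate names would otherwise hide values silently.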

`recode_column()` — Map column values via a named vector

Replaces values in a column using a named vector (dict) whose names are the original values and whose elements are the replacements. Unmatched values receive the scalar default (NA unless specified). Set name to write to a new column instead of overwriting the source. Explicit NA values in dict count as matches and are kept as NA rather than being replaced with default.

df <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR", "XYZ"),
  stringsAsFactors = FALSE
)

dict <- c("TP53" = "Tumour suppressor", "EGFR" = "Oncogene")

# Overwrite in place
recode_column(df, column = "gene", dict = dict)
#>                gene
#> 1 Tumour suppressor
#> 2              <NA>
#> 3          Oncogene
#> 4              <NA>

# Write to a new column, keep original; use a custom fallback
recode_column(df, column = "gene", dict = dict,
              name = "role", default = "Unknown")
#>    gene              role
#> 1  TP53 Tumour suppressor
#> 2 BRCA1           Unknown
#> 3  EGFR          Oncogene
#> 4   XYZ           Unknown
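
The difference between an unmatched value and an explicit NA in the dictionary is easiest to see in a small base-R sketch. `recode_sketch()` is a hypothetical helper for illustration only, and the `"XYZ" = NA` entry is added here to exercise that case; none of this is evanverse's actual code.

```r
# Hypothetical helper, not part of evanverse: dictionary lookup via match()
recode_sketch <- function(x, dict, default = NA) {
  idx <- match(x, names(dict))   # NA where x has no entry in dict
  out <- unname(dict[idx])
  out[is.na(idx)] <- default     # only true misses get the fallback
  out
}

genes <- c("TP53", "BRCA1", "EGFR", "XYZ")
dict  <- c("TP53" = "Tumour suppressor", "EGFR" = "Oncogene", "XYZ" = NA)

recode_sketch(genes, dict, default = "Unknown")
```

Here `BRCA1` has no entry and becomes `"Unknown"`, while `XYZ` is matched to an explicit NA and stays NA. Testing `is.na(idx)` rather than `is.na(out)` is what keeps the two cases apart.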

`view()` — Interactive table viewer

Returns an interactive reactable widget with search, filtering, sorting, and pagination. In RStudio the widget renders in the Viewer pane; in other environments it renders in the default HTML output.

view(iris, n = 10)

view() requires the reactable package. If it is not installed, the function raises a clear error rather than falling back silently.
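
A common way to implement this kind of guard for an optional dependency is `requireNamespace()`. The sketch below shows the general pattern with a hypothetical helper, `require_pkg_sketch()`; whether `view()` uses exactly this approach or wording is an assumption.

```r
# Hypothetical helper, not part of evanverse: fail loudly when an
# optional package is missing instead of degrading silently
require_pkg_sketch <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      sprintf("Package '%s' is required. Install it with install.packages(\"%s\").", pkg, pkg),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

require_pkg_sketch("stats")  # ships with R, so this passes silently
```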

2 File System Utilities

`file_ls()` — List files with metadata

Returns a data frame of file metadata for all files in a directory. Columns: file, size_MB, modified_time, path.

# All files in the current directory
file_ls(".")
#>              file size_MB       modified_time                          path
#> 1  DESCRIPTION   0.002  2026-03-20 14:22:01  F:/project/evanverse/DESCRIPTION
#> 2    NAMESPACE   0.002  2026-03-20 14:22:01  F:/project/evanverse/NAMESPACE
#> ...

# R source files only, searched recursively
file_ls("R", recursive = TRUE, pattern = "\\.R$")

`file_info()` — Metadata for specific files

Returns the same four-column data frame as file_ls() but for an explicit vector of file paths rather than a directory scan.

file_info(c("DESCRIPTION", "NAMESPACE"))
#>          file size_MB       modified_time                          path
#> 1 DESCRIPTION   0.002  2026-03-20 14:22:01  F:/project/evanverse/DESCRIPTION
#> 2   NAMESPACE   0.002  2026-03-20 14:22:01  F:/project/evanverse/NAMESPACE

Duplicate paths in the input are silently deduplicated. Missing files raise an error listing all unresolved paths.
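
The deduplication and missing-file behaviour can be approximated with base R's `file.info()`. `file_info_sketch()` below is a hypothetical stand-in, not the package's code; in particular, treating 1 MB as 1024^2 bytes and rounding to three decimals are assumptions.

```r
# Hypothetical helper, not part of evanverse: file metadata via base R
file_info_sketch <- function(paths) {
  paths   <- unique(paths)                 # drop repeated paths
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Files not found: ", paste(missing, collapse = ", "))
  }
  info <- file.info(paths)
  data.frame(
    file          = basename(paths),
    size_MB       = round(info$size / 1024^2, 3),  # assumes 1 MB = 1024^2 bytes
    modified_time = info$mtime,
    path          = normalizePath(paths, winslash = "/"),
    stringsAsFactors = FALSE
  )
}

# Demonstrate on two throwaway files, passing one path twice
tmp_files <- file.path(tempdir(), c("a.txt", "b.txt"))
for (f in tmp_files) writeLines("hello", f)
meta <- file_info_sketch(c(tmp_files, tmp_files[1]))
nrow(meta)
```

Collecting every missing path before calling `stop()` means a single run reports all unresolved files rather than only the first one.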

`file_tree()` — Print a directory tree

Prints the directory structure in tree format. Returns the lines invisibly so output can be captured if needed.

file_tree(".", max_depth = 2)
#> F:/project/evanverse
#> +-- DESCRIPTION
#> +-- NAMESPACE
#> +-- R
#> |   +-- base.R
#> |   +-- plot.R
#> |   +-- utils.R
#> +-- tests
#>     +-- testthat

3 Gene ID Conversion

Both gene2entrez() and gene2ensembl() accept a character vector of gene symbols and return a three-column data frame: the original input (symbol), the case-normalised form used for matching (symbol_std), and the converted ID (entrez_id or ensembl_id).

Reference table

Matching is performed against a ref data frame with columns symbol, entrez_id, and ensembl_id. Two sources are available:

Source	When to use
`toy_gene_ref()`	Examples, tests, offline work — 20 genes, no network
`download_gene_ref()`	Production analysis — full genome via biomaRt

# Fast, offline reference for development
ref <- toy_gene_ref(species = "human")

# Full reference for analysis (requires network + Bioconductor)
# ref <- download_gene_ref(species = "human")

Case normalisation

Species	Rule applied to both input and reference
`"human"`	`toupper()` — `"tp53"` and `"TP53"` both match `TP53`
`"mouse"`	`tolower()` — `"TRP53"` and `"Trp53"` both match `Trp53`

Unmatched symbols are returned with NA in the ID column rather than dropped. If the reference table contains duplicated symbols after case normalisation, the first match is used and a warning is emitted.
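
The matching rules above can be sketched in a few lines of base R. Both `match_symbols_sketch()` and the two-row `ref_small` table are hypothetical and exist only for illustration; the actual functions may be implemented differently.

```r
# Hypothetical helper and toy reference, not part of evanverse
match_symbols_sketch <- function(symbols, ref, species = c("human", "mouse")) {
  species   <- match.arg(species)
  normalise <- if (species == "human") toupper else tolower
  ref_std   <- normalise(ref$symbol)
  if (anyDuplicated(ref_std) > 0) {
    warning("Reference has duplicated symbols after normalisation; using first match.")
  }
  symbol_std <- normalise(symbols)
  data.frame(
    symbol     = symbols,
    symbol_std = symbol_std,
    # match() returns the first hit, and NA for symbols with no hit
    entrez_id  = ref$entrez_id[match(symbol_std, ref_std)],
    stringsAsFactors = FALSE
  )
}

ref_small <- data.frame(
  symbol    = c("TP53", "BRCA1"),
  entrez_id = c("7157", "672"),
  stringsAsFactors = FALSE
)
res <- match_symbols_sketch(c("tp53", "BRCA1", "GHOST"), ref_small, "human")
res$entrez_id
```

Because unmatched symbols are kept as NA rows, the result always has one row per input symbol, in input order.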

`gene2entrez()`

ref <- toy_gene_ref(species = "human")

gene2entrez(c("tp53", "BRCA1", "GHOST"), ref = ref, species = "human")
#>   symbol symbol_std entrez_id
#> 1   tp53       TP53      7157
#> 2  BRCA1      BRCA1       672
#> 3  GHOST      GHOST      <NA>

`gene2ensembl()`

ref_mouse <- toy_gene_ref(species = "mouse")

gene2ensembl(c("Trp53", "TRP53", "FakeGene"), ref = ref_mouse, species = "mouse")
#>     symbol symbol_std          ensembl_id
#> 1    Trp53      trp53  ENSMUSG00000059552
#> 2    TRP53      trp53  ENSMUSG00000059552
#> 3 FakeGene   fakegene                <NA>

4 GMT File Parsing

GMT (Gene Matrix Transposed) is the standard format for gene set collections such as MSigDB. Each line encodes one gene set as tab-separated fields: the term name, a description, and then the gene symbols.

toy_gmt() writes a minimal GMT file to a temp path for offline use:

tmp <- toy_gmt(n = 3)
readLines(tmp)
#> [1] "HALLMARK_P53_PATHWAY\tGenes regulated by p53\tTP53\tBRCA1\tMYC\t..."
#> [2] "HALLMARK_MTORC1_SIGNALING\tGenes upregulated by mTORC1\tPTEN\t..."
#> [3] "HALLMARK_HYPOXIA\tGenes upregulated under hypoxia\tMTOR\tHIF1A\t..."

`gmt2df()` — Long-format data frame

Returns one row per term-gene pair, making the output directly compatible with dplyr and data.table workflows.

df <- gmt2df(tmp)
head(df, 4)
#>                      term               description  gene
#> 1   HALLMARK_P53_PATHWAY  Genes regulated by p53   TP53
#> 2   HALLMARK_P53_PATHWAY  Genes regulated by p53  BRCA1
#> 3   HALLMARK_P53_PATHWAY  Genes regulated by p53    MYC
#> 4   HALLMARK_P53_PATHWAY  Genes regulated by p53   EGFR

`gmt2list()` — Named list of gene vectors

Returns a named list where each element is a character vector of gene symbols. This is the format expected by most gene set enrichment tools (e.g., fgsea, clusterProfiler).

gs <- gmt2list(tmp)
names(gs)
#> [1] "HALLMARK_P53_PATHWAY"      "HALLMARK_MTORC1_SIGNALING"
#> [3] "HALLMARK_HYPOXIA"

gs[["HALLMARK_P53_PATHWAY"]]
#>  [1] "TP53"   "BRCA1"  "MYC"    "EGFR"   "PTEN"   "CDK2"   "MDM2"
#>  [8] "RB1"    "CDKN2A" "AKT1"

Lines with fewer than 3 tab-separated fields are skipped with a warning and do not appear in the result. If every line is malformed, both functions raise an error because no valid gene set can be returned.
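
To make the format and the malformed-line handling concrete, here is a minimal GMT reader in base R. `read_gmt_sketch()` is a hypothetical helper, not the code behind `gmt2df()` or `gmt2list()`, and the two-line input file is made up for the demonstration.

```r
# Hypothetical helper, not part of evanverse: minimal GMT reader
read_gmt_sketch <- function(path) {
  fields <- strsplit(readLines(path, warn = FALSE), "\t", fixed = TRUE)
  ok     <- lengths(fields) >= 3
  if (!all(ok)) warning(sum(!ok), " malformed line(s) skipped.")
  if (!any(ok)) stop("No valid gene sets found in ", path)
  fields <- fields[ok]
  # Fields 1 and 2 are term and description; everything after is a gene
  genes <- lapply(fields, function(f) f[-(1:2)])
  names(genes) <- vapply(fields, `[`, character(1), 1)
  genes
}

gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "HALLMARK_P53_PATHWAY\tGenes regulated by p53\tTP53\tBRCA1\tMYC",
  "BROKEN_LINE\tdescription only"
), gmt_path)

gene_sets <- suppressWarnings(read_gmt_sketch(gmt_path))
gene_sets[["HALLMARK_P53_PATHWAY"]]
```

The second line has only two fields, so it is dropped with a warning and only the first gene set is returned.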

5 Math Utilities

`perm()` — Ordered arrangements

Calculates the number of ordered arrangements of k items from n distinct items:

perm(8, 4)
#> [1] 1680

perm(5, 6)
#> [1] 0

perm(n, 0) returns 1. If k > n, the result is 0.
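
The count is the standard P(n, k) = n! / (n - k)!. A base-R sketch that reproduces the results above, using a hypothetical helper `perm_sketch()` rather than the package's own code:

```r
# Hypothetical helper, not part of evanverse: P(n, k) = n! / (n - k)!
perm_sketch <- function(n, k) {
  if (k > n) return(0)
  # Multiply the k largest factors n, n - 1, ..., n - k + 1;
  # for k = 0 the product is empty and prod() returns 1
  prod(n - seq_len(k) + 1)
}

perm_sketch(8, 4)
#> [1] 1680
```

Multiplying only the k needed factors avoids computing two large factorials and dividing them, which overflows sooner for big n.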

`comb()` — Unordered combinations

Calculates the number of ways to choose k items from n distinct items:

comb(8, 4)
#> [1] 70

comb(10, 3)
#> [1] 120

comb(n, 0) and comb(n, n) return 1. If k > n, the result is 0. Very large inputs warn before returning an infinite result.
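
For comparison, base R's `choose()` computes the same binomial coefficient, and the relation to ordered arrangements is C(n, k) = P(n, k) / k!. The snippet below checks both with plain base R; it does not call `comb()` itself.

```r
n <- 8
k <- 4

# Base R's choose() computes the binomial coefficient C(n, k)
choose(n, k)
#> [1] 70

# Each unordered selection of k items can be ordered in k! ways,
# so C(n, k) = P(n, k) / k!
prod(n - seq_len(k) + 1) / factorial(k)
#> [1] 70
```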

6 A Combined Workflow

Gene ID conversion and GMT parsing compose naturally. The example below reads a GMT file, converts all gene symbols to Entrez IDs, and produces a named list of ID vectors ready for enrichment analysis.

library(evanverse)

# 1. Parse GMT into long format
tmp <- toy_gmt(n = 5)
df  <- gmt2df(tmp)

# 2. Convert symbols to Entrez IDs
ref    <- toy_gene_ref(species = "human")
id_map <- gene2entrez(df$gene, ref = ref, species = "human")

# 3. Attach IDs and drop unmatched
df$entrez_id <- id_map$entrez_id
df <- df[!is.na(df$entrez_id), ]

# 4. Rebuild named list with Entrez IDs
gs_entrez <- df2list(df, group_col = "term", value_col = "entrez_id")
gs_entrez[["HALLMARK_P53_PATHWAY"]]
#> [1] "7157" "672"  "4609" "1956" "5728" "1031" "4193" "5925" "1029" "207"
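
Step 3 attaches IDs by position, which relies on the conversion result having one row per input symbol in input order (consistent with unmatched symbols being kept as NA rather than dropped). A defensive check can make that assumption explicit. The data frames below are small stand-ins with illustrative values, not output from the package.

```r
# Stand-in data shaped like the objects in steps 1 and 2 (values are illustrative)
df <- data.frame(
  term = c("SET_A", "SET_A", "SET_B"),
  gene = c("TP53", "GHOST", "BRCA1"),
  stringsAsFactors = FALSE
)
id_map <- data.frame(
  symbol    = c("TP53", "GHOST", "BRCA1"),
  entrez_id = c("7157", NA, "672"),
  stringsAsFactors = FALSE
)

# Guard the positional assignment: same length, same order
stopifnot(nrow(id_map) == nrow(df), identical(id_map$symbol, df$gene))

df$entrez_id <- id_map$entrez_id
df <- df[!is.na(df$entrez_id), ]
split(df$entrez_id, df$term)
```

If the check fails, joining on the symbol column (for example with `merge()`) is a safer alternative than positional assignment.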

Getting Help

?df2list, ?df2vect, ?recode_column, ?view
?file_ls, ?file_info, ?file_tree
?gene2entrez, ?gene2ensembl
?gmt2df, ?gmt2list
?perm, ?comb
?toy_gene_ref, ?toy_gmt, ?download_gene_ref
GitHub Issues
