Toy
The toy module provides small deterministic fixtures for examples, demos, offline workflows, and tests. It should make package examples runnable without network access while still exercising realistic package contracts such as gene ID conversion and GMT parsing.
This module is intentionally narrow. It is not a source of real biological reference data, and it should not drift into a general simulation framework. Its value is that the outputs are small, stable, and composable with the rest of evanverse.
Scope
R/toy.R currently exports two functions:
| Group | Functions | Role |
|---|---|---|
| Gene reference fixtures | toy_gene_ref() | Return a compact human or mouse reference table for offline gene ID conversion |
| GMT fixtures | toy_gmt() | Write a temporary GMT file with built-in gene sets for parser and workflow examples |
The public wrappers live in R/toy.R. Internal data construction and temporary GMT writing live in R/utils_toy.R.
Design Contract
Fixtures Should Be Deterministic
toy_gene_ref() and toy_gmt() should return stable content across calls. Temporary file paths from toy_gmt() may differ, but the written GMT content should be predictable.
This matters because these helpers are used in examples and tests. They should not depend on network access, package caches, remote annotation services, or random seeds.
Toy Gene References Should Be Clean
toy_gene_ref() returns a data frame with this column contract:
| Column | Meaning |
|---|---|
| symbol | Gene symbol |
| ensembl_id | Ensembl gene ID |
| entrez_id | Entrez gene ID |
| gene_type | Simple gene type label |
| species | "human" or "mouse" |
| ensembl_version | Reference version label |
| download_date | Stable fixture date |
Within each species, non-missing symbols and Ensembl IDs should be unique. The default toy reference should not trigger duplicate-symbol warnings in gene2entrez() or gene2ensembl().
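The uniqueness rule can be checked with base R alone. The sketch below is a minimal illustration under stated assumptions: check_unique_within_species() is a hypothetical helper, and the small data frame is a hand-built stand-in for toy_gene_ref() output with placeholder Ensembl IDs. Neither is part of the package.

```r
# Hypothetical helper (not part of evanverse): TRUE when every non-missing
# value of `col` is unique within each species.
check_unique_within_species <- function(ref, col) {
  ok <- vapply(split(ref[[col]], ref$species), function(x) {
    x <- x[!is.na(x)]          # missing values are exempt from the rule
    !anyDuplicated(x)          # anyDuplicated() returns 0 when all are unique
  }, logical(1))
  all(ok)
}

# Illustrative stand-in for a reference table. The Ensembl IDs are
# placeholders, not the values stored in the real fixture.
ref <- data.frame(
  symbol     = c("TP53", "BRCA1", "Trp53", "Brca1"),
  ensembl_id = c("ENS_H1", "ENS_H2", "ENS_M1", "ENS_M2"),
  species    = c("human", "human", "mouse", "mouse"),
  stringsAsFactors = FALSE
)

symbols_ok <- check_unique_within_species(ref, "symbol")
ensembl_ok <- check_unique_within_species(ref, "ensembl_id")
```

Splitting by species before checking means the same symbol may appear once per species without counting as a duplicate, which matches the "within each species" wording of the contract.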
GMT Genes Should Map Through The Human Reference
The human toy_gene_ref() includes the symbols used by toy_gmt():
TP53, BRCA1, MYC, EGFR, PTEN, CDK2, MDM2, RB1, CDKN2A, AKT1, MTOR, PIK3CA, KRAS, BRAF, NRAS, VEGFA, HIF1A, STAT3, JAK2, and BCL2.
That makes this offline workflow valid:
- Write a GMT file with toy_gmt().
- Parse it with gmt2df().
- Convert gene values with gene2entrez(..., ref = toy_gene_ref("human")).
- Rebuild grouped Entrez-ID sets with df2list().
The combined workflow should not produce all-NA IDs.
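The data flow of these four steps can be mirrored in base R. This is only a sketch of the shape of the workflow: it does not call toy_gmt(), gmt2df(), gene2entrez(), or df2list(), whose exact signatures are not restated here. The set names, descriptions, and the small reference table are assumptions made for illustration, and the IDs held in toy_gene_ref() may differ.

```r
# Base-R mirror of the four steps; none of these calls are the package API.

# 1. Write a small GMT file (term, description, genes; tab-separated).
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "SET_A\tillustrative set A\tTP53\tMDM2\tCDKN2A",
  "SET_B\tillustrative set B\tKRAS\tBRAF\tNRAS"
), gmt_path)

# 2. Parse it into a long data frame with one row per term-gene pair.
fields <- strsplit(readLines(gmt_path), "\t", fixed = TRUE)
gmt_df <- do.call(rbind, lapply(fields, function(f) {
  data.frame(term = f[1], description = f[2], gene = f[-(1:2)],
             stringsAsFactors = FALSE)
}))

# 3. Map symbols to Entrez IDs through a reference table (illustrative rows).
ref <- data.frame(
  symbol    = c("TP53", "MDM2", "CDKN2A", "KRAS", "BRAF", "NRAS"),
  entrez_id = c("7157", "4193", "1029", "3845", "673", "4893"),
  stringsAsFactors = FALSE
)
gmt_df$entrez_id <- ref$entrez_id[match(gmt_df$gene, ref$symbol)]

# 4. Rebuild the grouped sets, now keyed by Entrez ID.
entrez_sets <- split(gmt_df$entrez_id, gmt_df$term)

# The contract: no symbol from the GMT file should fail to map.
stopifnot(!anyNA(gmt_df$entrez_id))
```

The final stopifnot() is the base-R analogue of the cross-function contract: if the reference table lacked any GMT symbol, match() would return NA and the check would fail.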
Input Counts Are Positive And Capped
Both exported functions validate n with .assert_count(), so n must be a single positive integer.
The available built-in fixture sizes are finite:
| Function | Default | Maximum returned |
|---|---|---|
| toy_gene_ref() | 20 rows | 100 rows |
| toy_gmt() | 5 gene sets | 5 gene sets |
Values above the available maximum are silently capped. This is currently part of the tested behavior.
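A minimal sketch of the validate-then-cap behavior, assuming nothing about how .assert_count() is actually written: assert_count() and take_fixture_rows() below are hypothetical stand-ins, and the five-row data frame is a placeholder fixture.

```r
# Hypothetical stand-in for the internal .assert_count() check:
# accept only a single, finite, positive whole number.
assert_count <- function(n) {
  ok <- is.numeric(n) && length(n) == 1L && !is.na(n) &&
    is.finite(n) && n >= 1 && n == floor(n)
  if (!ok) stop("`n` must be a single positive integer.", call. = FALSE)
  invisible(n)
}

# Hypothetical helper: validate `n`, then silently cap it at the fixture size.
take_fixture_rows <- function(fixture, n) {
  assert_count(n)
  fixture[seq_len(min(n, nrow(fixture))), , drop = FALSE]
}

fixture <- data.frame(id = 1:5)
nrow(take_fixture_rows(fixture, 3))    # 3
nrow(take_fixture_rows(fixture, 50))   # 5, capped without a warning
```

Turning the cap into a warning would only need a warning() call before the min(), which is why the choice stays cheap to revisit.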
Temporary GMT Output Should Be Parser-Compatible
Each line written by toy_gmt() follows the GMT layout: a term, a description, then one or more genes, with all fields separated by tabs.
The output should be directly compatible with both gmt2df() and gmt2list(). Term names should be non-empty and distinct.
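The line contract can be expressed as a small base-R check. check_gmt_lines() is a hypothetical helper written for this note, not a package function, and the two sample lines are made up; in practice the input would be the lines read from the path that toy_gmt() returns.

```r
# Hypothetical checker (not part of evanverse) for the GMT line contract.
check_gmt_lines <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  terms  <- vapply(fields, function(f) f[1], character(1))
  c(
    enough_fields  = all(lengths(fields) >= 3L),  # term + description + >= 1 gene
    terms_nonempty = all(!is.na(terms) & nzchar(terms)),
    terms_distinct = !anyDuplicated(terms)
  )
}

# Illustrative input; set names and descriptions are assumptions.
lines <- c(
  "SET_A\tillustrative set A\tTP53\tMDM2",
  "SET_B\tillustrative set B\tKRAS"
)
result <- check_gmt_lines(lines)
```

Returning a named logical vector rather than a single TRUE/FALSE makes it clear which part of the contract a malformed file breaks.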
Review Notes
The latest review focused on a real mismatch between implementation, docs, and tests:
- toy_gmt() used common cancer/pathway symbols such as TP53 and BRCA1.
- The old human toy_gene_ref() did not contain those symbols, so documented gene ID conversion examples returned NA.
- The old human reference contained duplicated symbols and duplicated Ensembl IDs, causing duplicate-reference warnings in ordinary examples.
- The toy vignette hand-written output was stale.
- Tests checked that toy_gmt() parsed, but did not check semantic compatibility between toy_gmt() and toy_gene_ref().
The fixes changed toy_gene_ref() to a clean deterministic reference whose human symbols align with toy_gmt(), removed the stale internal reference blocks, updated docs, and added contract tests for cross-function compatibility.
Tests
The focused toy test suite lives in tests/testthat/test-toy.R.
Latest focused run:
devtools::test(filter = "toy")
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 62 ]
Base-module tests were also rerun because toy fixtures are used by base gene ID conversion and GMT parser examples:
devtools::test(filter = "base")
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 125 ]
The important tests are contract tests:
- toy_gmt() returns an existing .gmt path and writes the expected number of lines;
- GMT lines have non-empty distinct terms and parser-compatible fields;
- toy_gmt() output is compatible with gmt2df() and gmt2list();
- toy_gmt() genes are mappable with human toy_gene_ref();
- toy_gene_ref() returns the documented columns in order;
- human and mouse references include key example genes and expected IDs;
- symbols and Ensembl IDs are unique within each species;
- invalid n values error for both exported functions.
Open Questions
- Whether silent capping should remain the long-term behavior or become a warning when n exceeds available fixture rows/sets.
- Whether fixture data should include a small controlled duplicate-reference variant for testing warning paths, instead of relying on ad hoc test data.
- Whether toy_gmt() should expose the built-in gene-set data as a data frame in addition to writing a temporary file.
- Whether the synthetic filler IDs should remain simple stable placeholders or be replaced with a larger curated set of real IDs.