Single-Cell Data Formats
Purpose
Single-cell data files are not all the same thing. Some files are raw or filtered count matrices; others are saved analysis objects with metadata, embeddings, clusters, and annotations.
This page keeps a small map of common formats and how to load them.
Format Overview
Add formats one by one. Do not mix different storage formats in the same row.
| Format | Files | Typical Loader | Notes |
|---|---|---|---|
| 10x Matrix | matrix.mtx.gz, barcodes.tsv.gz, features.tsv.gz | Seurat::Read10X() | Three-file sparse matrix format from Cell Ranger |
| 10x Non-Canonical Matrix | non-standard .mtx/.tsv names | Matrix::readMM() | 10x-like files with non-standard names |
| SingleCellExperiment RDS | .rds | readRDS() | Saved Bioconductor single-cell object |
| 10x H5 | .h5 | Seurat::Read10X_h5() | Compact HDF5 version of 10x output |
| AnnData | .h5ad | zellkonverter::readH5AD() | Python AnnData object |
| Loom | .loom | BiocIO::import(), scanpy.read_loom() | Older HDF5-based exchange format |
| Seurat RDS | .rds | readRDS() | Saved R object |
| Trajectory Object RDS | .rds | readRDS() | Saved pseudotime or trajectory analysis object |
10x Matrix
10x Matrix usually means the three sparse matrix files exported by Cell Ranger:
- matrix.mtx.gz
- barcodes.tsv.gz
- features.tsv.gz or genes.tsv.gz
The usual modern file is features.tsv.gz; older outputs may use genes.tsv.gz.
The most common R entry point is Seurat. Seurat provides Read10X() for loading this folder directly.
data <- Seurat::Read10X(
  data.dir = "filtered_feature_bc_matrix/", # folder with 10x matrix files
  gene.column = 2, # use gene symbols from features.tsv.gz
  cell.column = 1, # use cell barcodes from barcodes.tsv.gz
  unique.features = TRUE, # make duplicated feature names unique
  strip.suffix = FALSE # keep barcode suffixes such as -1
)
Then create a Seurat object:
seurat_obj <- Seurat::CreateSeuratObject(
  counts = data, # count matrix from Read10X()
  min.cells = 3, # keep genes detected in at least 3 cells
  min.features = 200, # keep cells with at least 200 detected genes
  project = "sample", # project or sample name
  assay = "RNA" # assay name for gene expression
)
Parameter notes:
- data.dir: folder containing matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz.
- gene.column: 2 usually means gene symbols; 1 usually means gene IDs.
- cell.column: 1 means the barcode column.
- unique.features: whether feature names should be made unique. Usually keep TRUE.
- strip.suffix: whether to remove trailing barcode suffixes such as -1. Usually keep FALSE unless suffixes interfere with merging or matching metadata.
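The loading step returns a sparse count matrix, or a named list of matrices when the data contain more than one feature type. CreateSeuratObject() is the next step: it turns a single count matrix into a Seurat object, so pick one element first if the loader returned a list.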
Check after loading:
- cells are columns and genes/features are rows
- gene names or feature IDs look correct
- sample identity is added to metadata after object creation
- values are raw counts, not normalized expression
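The structural checks above can be sketched on a toy matrix. This is a minimal illustration, not part of any package: `check_count_matrix()` is a hypothetical helper, the gene names and barcodes are made up, and the barcode pattern is only a heuristic.

```r
library(Matrix)  # sparse matrix classes, as returned by 10x loaders

# Toy stand-in for loader output: 3 genes x 4 cells (made-up names).
counts <- Matrix::sparseMatrix(
  i = c(1, 2, 3, 1, 2),
  j = c(1, 1, 2, 3, 4),
  x = c(5, 1, 2, 7, 3),
  dims = c(3, 4),
  dimnames = list(
    c("GeneA", "GeneB", "GeneC"),
    c("AAAC-1", "AAAG-1", "AACT-1", "AAGT-1")
  )
)

# Hypothetical helper: summarize orientation and naming checks.
check_count_matrix <- function(mat) {
  list(
    n_features = nrow(mat),
    n_cells = ncol(mat),
    # heuristic: 10x barcodes are nucleotide strings, often with a "-1" suffix
    columns_look_like_barcodes =
      all(grepl("^[ACGT]+(-[0-9]+)?$", colnames(mat))),
    duplicated_features = anyDuplicated(rownames(mat)) > 0,
    duplicated_cells = anyDuplicated(colnames(mat)) > 0
  )
}

checks <- check_count_matrix(counts)
str(checks)
```

If `columns_look_like_barcodes` is FALSE while the row names look like barcodes, the matrix is probably transposed.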
10x Non-Canonical Matrix
Some downloaded datasets are 10x-like matrices, but the file names are not standard.
Example:
GSM000000_sample_matrix.mtx.gz
GSM000000_sample_barcodes.tsv.gz
GSM000000_sample_features.tsv.gz
When possible, the cleanest solution is to copy or rename files into a standard 10x folder:
sample/
matrix.mtx.gz
barcodes.tsv.gz
features.tsv.gz
Then use Seurat::Read10X().
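The copy step can be scripted. The sketch below assumes the three files share a common prefix, as in the example above; `stage_10x_folder()` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory rather than real data.

```r
# Hypothetical helper: copy prefixed 10x-like files into a folder that uses
# the standard file names.
stage_10x_folder <- function(src_dir, prefix, out_dir) {
  standard <- c("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz")
  src <- file.path(src_dir, paste0(prefix, standard))
  absent <- src[!file.exists(src)]
  if (length(absent) > 0) {
    stop("Missing input files: ", paste(basename(absent), collapse = ", "))
  }
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  ok <- file.copy(src, file.path(out_dir, standard), overwrite = TRUE)
  if (!all(ok)) stop("Failed to copy some files into ", out_dir)
  invisible(out_dir)
}

# Demonstration with empty placeholder files (not real matrices).
src_dir <- file.path(tempdir(), "download")
dir.create(src_dir, showWarnings = FALSE)
placeholder_names <- paste0(
  "GSM000000_sample_",
  c("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz")
)
file.create(file.path(src_dir, placeholder_names))

out_dir <- stage_10x_folder(
  src_dir = src_dir,
  prefix = "GSM000000_sample_",
  out_dir = file.path(tempdir(), "sample")
)
list.files(out_dir)
```

Copying rather than renaming keeps the original download untouched, which makes it easier to trace a file back to its source.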
If renaming is not practical, read the files manually:
mat <- Matrix::readMM("GSM000000_sample_matrix.mtx.gz")
mat <- as(mat, "CsparseMatrix") # column-compressed sparse format, as returned by Read10X()
barcodes <- readr::read_tsv("GSM000000_sample_barcodes.tsv.gz", col_names = FALSE)
features <- readr::read_tsv("GSM000000_sample_features.tsv.gz", col_names = FALSE)
rownames(mat) <- make.unique(features$X2) # gene symbols
colnames(mat) <- barcodes$X1 # cell barcodes
Before merging samples, add a sample prefix to cell barcodes:
colnames(mat) <- paste("sample", colnames(mat), sep = "_")
Then create a Seurat object:
seurat_obj <- Seurat::CreateSeuratObject(
  counts = mat,
  min.cells = 3,
  min.features = 200,
  project = "sample",
  assay = "RNA"
)
What to remember:
- Prefer standard 10x names when possible.
- Manual readMM() is useful for messy downloaded files.
- Always check whether rows are genes and columns are cells.
- Add sample prefixes before merging multiple samples.
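The reason for prefixing can be seen on two toy samples that reuse the same barcodes. This is a small sketch with made-up values; the sample names `sample1` and `sample2` are placeholders.

```r
library(Matrix)

# Two toy samples with identical barcodes, as separate runs often have.
make_toy <- function(values) {
  Matrix::Matrix(
    matrix(values, nrow = 2,
           dimnames = list(c("GeneA", "GeneB"), c("AAAC-1", "AAAG-1"))),
    sparse = TRUE
  )
}
s1 <- make_toy(c(1, 0, 2, 3))
s2 <- make_toy(c(0, 4, 1, 0))

# Without prefixes, the combined matrix has duplicated cell names.
collides <- anyDuplicated(colnames(cbind(s1, s2))) > 0

# Prefix each sample, confirm the genes match, then combine.
colnames(s1) <- paste("sample1", colnames(s1), sep = "_")
colnames(s2) <- paste("sample2", colnames(s2), sep = "_")
stopifnot(identical(rownames(s1), rownames(s2)))

merged <- cbind(s1, s2)
colnames(merged)
```

The `identical(rownames(...))` guard matters because `cbind()` combines by position; samples with different gene sets need to be aligned first.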
SingleCellExperiment RDS
SingleCellExperiment, often shortened as SCE, is the core single-cell data object in the Bioconductor ecosystem.
SCE is not a file format by itself. It is an R object type. It is often saved as an .rds file, but .rds can store many kinds of R objects, so always check what was loaded.
Load in R:
sce <- readRDS("sample_sce.rds")
Check after loading:
class(sce) # should include "SingleCellExperiment"
SummarizedExperiment::assayNames(sce) # available assays, such as counts or logcounts
colnames(SummarizedExperiment::colData(sce)) # cell metadata columns
What to remember:
- SCE belongs to the Bioconductor ecosystem.
- assays store expression matrices.
- colData stores cell metadata.
- rowData stores gene or feature metadata.
- .rds is only the storage wrapper; the object inside needs to be checked.
Convert SCE to Seurat
For learning, the clearest route is to extract the count matrix and metadata manually, then create a Seurat object.
counts <- SummarizedExperiment::assay(sce, "counts") # raw count matrix
metadata <- as.data.frame(SummarizedExperiment::colData(sce)) # cell metadata
all(colnames(counts) == rownames(metadata)) # check cell order before creating object
seurat_obj <- Seurat::CreateSeuratObject(
  counts = counts,
  meta.data = metadata,
  min.cells = 3,
  min.features = 200,
  project = "sample",
  assay = "RNA"
)
If the Seurat object already exists, metadata can also be added later:
seurat_obj <- Seurat::AddMetaData(
  object = seurat_obj,
  metadata = metadata
)
Shortcut route:
seurat_obj <- Seurat::as.Seurat(
  x = sce,
  counts = "counts",
  data = "logcounts"
)
Before using the shortcut, check assayNames(sce). Some SCE objects do not have both counts and logcounts.
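When the cell order check in the manual route returns FALSE, the metadata can be reordered to match the count matrix instead of being discarded. A minimal sketch with toy inputs follows; `align_metadata()` is a hypothetical helper, not a Seurat or Bioconductor function.

```r
# Toy inputs: metadata rows are in a different order than the count columns.
counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("cell1", "cell2", "cell3"))
)
metadata <- data.frame(
  condition = c("treated", "control", "control"),
  row.names = c("cell3", "cell1", "cell2")
)

# Hypothetical helper: reorder metadata rows to follow the count columns and
# stop if any cell has no metadata row.
align_metadata <- function(counts, metadata) {
  idx <- match(colnames(counts), rownames(metadata))
  if (anyNA(idx)) {
    stop("Cells without metadata: ",
         paste(colnames(counts)[is.na(idx)], collapse = ", "))
  }
  metadata[idx, , drop = FALSE]
}

metadata <- align_metadata(counts, metadata)
identical(colnames(counts), rownames(metadata))
```

Matching by name is safer than assuming position, because a silent mismatch would attach the wrong condition or sample label to each cell.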
10x H5
10x H5 is the compact HDF5 version of Cell Ranger matrix output. It is not the same storage format as the three-file 10x Matrix folder.
The key point: Read10X_h5() is for 10x-formatted H5 files, not for every .h5 file.
Common file:
filtered_feature_bc_matrix.h5
Load in R with Seurat:
data <- Seurat::Read10X_h5(
  filename = "filtered_feature_bc_matrix.h5", # 10x H5 file from Cell Ranger
  use.names = TRUE, # use gene symbols instead of gene IDs
  unique.features = TRUE # make duplicated feature names unique
)
Then create a Seurat object:
seurat_obj <- Seurat::CreateSeuratObject(
  counts = data,
  min.cells = 3,
  min.features = 200,
  project = "sample",
  assay = "RNA"
)
Check after loading:
- whether the file contains one modality or multiple feature types
- whether gene expression is named as expected
- whether the resulting object is a matrix or a list of matrices
- whether this is really a 10x H5 file, not AnnData .h5ad or a generic .h5
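The matrix-versus-list check can be wrapped in a small guard before object creation. In this sketch, `pick_gene_expression()` is a hypothetical helper, and the modality name "Gene Expression" is an assumption based on typical multi-modal 10x output; inspect `names()` of the loaded list to confirm it for a given file.

```r
# Hypothetical helper: return one gene expression matrix whether the loader
# produced a single matrix or a named list of matrices.
# Assumption: the RNA modality is named "Gene Expression".
pick_gene_expression <- function(x, modality = "Gene Expression") {
  if (!is.list(x)) {
    return(x)  # already a single matrix
  }
  if (!modality %in% names(x)) {
    stop("Modality '", modality, "' not found. Available: ",
         paste(names(x), collapse = ", "))
  }
  x[[modality]]
}

# Toy stand-ins for loader output (dense matrices for brevity).
rna <- matrix(1:4, nrow = 2,
              dimnames = list(c("GeneA", "GeneB"), c("c1", "c2")))
adt <- matrix(5:6, nrow = 1, dimnames = list("CD3", c("c1", "c2")))

single <- pick_gene_expression(rna)
multi <- pick_gene_expression(
  list(`Gene Expression` = rna, `Antibody Capture` = adt)
)
identical(single, multi)
```

The result can then be passed as `counts` when creating the object, so the same code path handles both single-modality and multi-modality files.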
AnnData h5ad
AnnData is the main Python object format used by Scanpy and many public atlas datasets. The file extension is .h5ad.
Do not confuse .h5ad with 10x H5. Both are HDF5-based files, but their internal structures are different. Seurat::Read10X_h5() is for 10x H5, not AnnData.
A .h5ad file may contain more than one kind of information:
- expression matrix in X
- raw counts or normalized values in layers
- cell metadata in obs
- gene metadata in var
- embeddings such as PCA or UMAP in obsm
- clustering, colors, and other analysis results in uns
Load in R:
sce <- zellkonverter::readH5AD(
  file = "sample.h5ad", # AnnData file
  reader = "python" # use Python/anndata reader
)
If the object is large:
sce <- zellkonverter::readH5AD(
  file = "sample.h5ad",
  reader = "python",
  use_hdf5 = TRUE # keep assay data HDF5-backed when possible
)
Check after loading:
- colData(sce) for cell metadata
- rowData(sce) for gene metadata
- assayNames(sce) for counts or normalized matrices
- reduced dimensions, if present
What to remember:
- readH5AD() returns a SingleCellExperiment object in R.
- reader = "python" uses Python anndata through zellkonverter and basilisk.
- use_hdf5 = TRUE is about HDF5-backed storage, not about choosing a pure R reader.
- Always check which assay represents raw counts before creating a Seurat object or doing downstream analysis.
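As a memory aid, the AnnData slots listed earlier can be paired with the place they usually end up in the SingleCellExperiment. The lookup below is a sketch based on the usual zellkonverter conversion and should be treated as a guide, not a guarantee; `where_in_sce()` is a hypothetical helper.

```r
# Lookup: AnnData slot -> where to look in the converted object.
# Assumption: default conversion settings; confirm on the loaded object.
anndata_to_sce <- c(
  X      = "assay(sce, \"X\")",
  layers = "other entries in assayNames(sce)",
  obs    = "colData(sce)",
  var    = "rowData(sce)",
  obsm   = "reducedDims(sce)",
  uns    = "metadata(sce)"
)

# Hypothetical helper: look up one slot and fail loudly on unknown names.
where_in_sce <- function(slot) {
  if (!slot %in% names(anndata_to_sce)) {
    stop("Unknown AnnData slot: ", slot)
  }
  unname(anndata_to_sce[slot])
}

where_in_sce("obsm")
```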
Loom
Loom is an older HDF5-based single-cell exchange format. The file extension is .loom.
It appears in some older single-cell workflows and RNA velocity pipelines. For new analysis, .h5ad, Seurat RDS, or SingleCellExperiment RDS are usually easier to work with.
Use Loom mainly when a dataset is only distributed as .loom.
Load in R with Bioconductor:
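library(LoomExperiment) # provides the import method for .loom files
sce <- BiocIO::import(
  con = "sample.loom",
  type = "SingleCellLoomExperiment" # extends SingleCellExperiment
)
If the first assay is the count matrix but is not named counts, rename it before downstream use: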
if (!"counts" %in% SummarizedExperiment::assayNames(sce)) {
  SummarizedExperiment::assayNames(sce)[1] <- "counts"
}
In Python:
import scanpy as sc
adata = sc.read_loom("sample.loom")
Check after loading:
- whether the file stores counts or processed values
- whether cell and gene metadata were imported
- whether velocity-specific layers are present
What to remember:
- Loom is a container format, not a complete analysis workflow.
- It is less common in current single-cell practice than .h5ad.
- If possible, convert Loom to AnnData or Seurat/SCE before continuing analysis.
Seurat RDS
Seurat RDS means a Seurat object saved as an .rds file.
Like SCE RDS, .rds is only the R storage wrapper. The important part is the object inside. A Seurat RDS may already contain counts, normalized data, metadata, reductions, clusters, annotations, and previous analysis results.
Load in R:
seurat_obj <- readRDS("sample_seurat.rds")
Check after loading:
class(seurat_obj) # should include "Seurat"
Seurat::Assays(seurat_obj) # available assays
head(seurat_obj@meta.data) # cell metadata
Common checks:
- Which assay is active?
- Is raw count data still present?
- What metadata columns describe sample, condition, batch, patient, or cell type?
- Are reductions such as PCA or UMAP already stored?
- Were clusters or annotations created by the data provider?
Trajectory Object RDS
Trajectory or pseudotime results may also be saved as .rds files.
This is not a raw data format. It usually stores an analysis object created after preprocessing, dimensionality reduction, clustering, or trajectory inference.
Common examples:
- Monocle CellDataSet or cell_data_set
- Slingshot results stored inside a SingleCellExperiment
- pseudotime columns stored in Seurat or SCE metadata
Load in R:
traj_obj <- readRDS("trajectory_object.rds")
Check after loading:
class(traj_obj) # object type, such as CellDataSet or cell_data_set
If it is a Monocle 3 object (cell_data_set), inspect its cell metadata and pseudotime-related fields:
colnames(SummarizedExperiment::colData(traj_obj)) # for a Monocle 2 CellDataSet, use Biobase::pData(traj_obj) instead
What to remember:
- Pseudotime objects are usually downstream analysis results.
- They may not contain raw counts suitable for starting a fresh analysis.
- Always check the object class before deciding how to use it.
- If pseudotime is stored as metadata, the object may still be Seurat or SCE.
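The class check can be turned into a small dispatcher that names the object family. This is a sketch: `describe_saved_object()` is a hypothetical helper, and the toy objects below are plain lists with a class attribute, used only to exercise it. Real objects come from readRDS().

```r
# Hypothetical helper: map a loaded object to the family it belongs to.
# Order matters: Monocle 3's cell_data_set extends SingleCellExperiment,
# so it is tested before the more general SCE class.
describe_saved_object <- function(obj) {
  if (inherits(obj, "Seurat")) return("Seurat object")
  if (inherits(obj, "cell_data_set")) return("Monocle 3 cell_data_set")
  if (inherits(obj, "CellDataSet")) return("Monocle 2 CellDataSet")
  if (inherits(obj, "SingleCellExperiment")) return("SingleCellExperiment")
  if (is.matrix(obj)) return("plain matrix")
  paste("other:", paste(class(obj), collapse = "/"))
}

# Toy stand-ins with S3 class attributes (not real analysis objects).
fake_cds <- structure(list(), class = c("cell_data_set", "SingleCellExperiment"))
fake_sce <- structure(list(), class = "SingleCellExperiment")
fake_seurat <- structure(list(), class = "Seurat")

describe_saved_object(fake_cds)
describe_saved_object(fake_sce)
describe_saved_object(fake_seurat)
describe_saved_object(data.frame(x = 1))
```

An object reported as Seurat or SingleCellExperiment may still carry pseudotime in its metadata columns, so the metadata check remains necessary.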
Convert Monocle to Seurat
Some GEO datasets provide only a Monocle object. To reproduce analysis in Seurat, extract the expression matrix and cell metadata, then create a new Seurat object.
First check the object type:
cds <- readRDS("monocle_object.rds")
class(cds)
Monocle 3 objects are usually cell_data_set objects:
SummarizedExperiment::assayNames(cds)
counts <- SummarizedExperiment::assay(cds, "counts")
metadata <- as.data.frame(SummarizedExperiment::colData(cds))
all(colnames(counts) == rownames(metadata))
seurat_obj <- Seurat::CreateSeuratObject(
  counts = counts,
  meta.data = metadata,
  min.cells = 3,
  min.features = 200,
  project = "sample",
  assay = "RNA"
)
Monocle 2 objects are usually CellDataSet objects:
expr_mat <- Biobase::exprs(cds)
metadata <- Biobase::pData(cds)
features <- Biobase::fData(cds)
all(colnames(expr_mat) == rownames(metadata))
seurat_obj <- Seurat::CreateSeuratObject(
  counts = expr_mat,
  meta.data = metadata,
  min.cells = 3,
  min.features = 200,
  project = "sample",
  assay = "RNA"
)
Important check:
range(expr_mat)
expr_mat[1:5, 1:5]
If the matrix contains many decimal values, it may already be normalized expression rather than raw counts. In that case, the converted Seurat object can still be useful for inspecting metadata or visualization, but it should not be treated as a fresh raw-count object without caution.
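The decimal-value check can be automated instead of judged by eye. The sketch below is a heuristic only: `looks_like_raw_counts()` is a hypothetical helper, and some pipelines store rounded or scaled values that would pass or fail it for the wrong reason.

```r
library(Matrix)

# Hypothetical helper: TRUE when every value is a finite, non-negative
# whole number.
looks_like_raw_counts <- function(mat) {
  # For numeric sparse matrices, inspect only the stored values;
  # the implicit zeros are whole numbers by definition.
  values <- if (methods::is(mat, "dsparseMatrix")) mat@x else as.vector(mat)
  all(is.finite(values)) && all(values >= 0) && all(values == round(values))
}

# Toy matrices: raw counts and a log-transformed copy.
raw <- Matrix::sparseMatrix(
  i = c(1, 2, 2), j = c(1, 1, 2), x = c(3, 1, 4), dims = c(2, 2)
)
normalized <- log1p(raw)

looks_like_raw_counts(raw)
looks_like_raw_counts(normalized)
```

Running it on `expr_mat` before creating the object gives an early warning that the matrix has already been normalized.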
Note
The first question is always: is this file a count matrix or a saved analysis object?
That determines whether the next step is object creation, metadata inspection, QC, or checking what processing has already been done.
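A first guess at that question can come from the file name alone. The sketch below is a rough triage step: `guess_sc_format()` is a hypothetical helper, and its labels are deliberately hedged because an extension says nothing certain about the content.

```r
# Hypothetical helper: guess the storage format from the file name.
# The name is only a hint; the content still needs to be checked.
guess_sc_format <- function(path) {
  name <- tolower(basename(path))
  if (grepl("\\.h5ad$", name)) return("AnnData (.h5ad)")
  if (grepl("\\.loom$", name)) return("Loom")
  if (grepl("\\.h5$", name)) return("HDF5, possibly 10x H5")
  if (grepl("\\.rds$", name)) return("saved R object; check class() after readRDS()")
  if (grepl("\\.mtx(\\.gz)?$", name)) return("sparse matrix, possibly 10x Matrix")
  "unknown"
}

vapply(
  c("sample.h5ad", "filtered_feature_bc_matrix.h5", "sample_seurat.rds",
    "GSM000000_sample_matrix.mtx.gz"),
  guess_sc_format,
  character(1)
)
```

The .h5ad pattern is tested before .h5 on purpose, so the order of the checks documents which formats are easy to confuse.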