Single-Cell Sample Organization
Purpose
The first practical question is not only “what file format is this”, but also “how many biological samples are inside this file or folder”.
The key distinction:
File is not sample.
Folder is not sample.
Object is not necessarily sample.
A single file can contain multiple samples, and a single sample can be stored as multiple files. The analysis structure should follow biological samples, not file count.
Four Situations
| Situation | Example | Main Concern |
|---|---|---|
| Single sample, single file | one 10x H5, one .h5ad, one Seurat RDS | load and add/check metadata |
| Single sample, multiple files | one 10x Matrix folder with matrix.mtx.gz, barcodes.tsv.gz, features.tsv.gz | treat the folder as one sample |
| Multiple samples, single file or folder | one .h5ad or Seurat RDS containing many samples | metadata must identify samples |
| Multiple samples, multiple files or folders | one 10x folder, H5, h5ad, or RDS per sample | read separately, add metadata, then decide list vs merged object |
Single Sample, Single File
This is simple. Load the file, create or inspect the object, then make sure sample metadata exists.
Examples:
- one filtered_feature_bc_matrix.h5
- one sample.h5ad
- one sample_seurat.rds
Main checks:
- Is this really one biological sample?
- Does the object already contain sample_id, condition, or batch?
- Are the cell names unique enough if this object will later be merged?
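The checks above can be sketched in base R. This is a minimal sketch, assuming `meta` stands in for the cell-level metadata table (for a Seurat object, `seurat_obj[[]]`); the toy barcodes and the filled-in labels "sample_A", "control", and "batch_1" are illustrative only.

```r
# Toy stand-in for cell-level metadata; row names play the role of cell names.
meta <- data.frame(
  nCount_RNA = c(1200, 950, 3100),
  row.names  = c("AAACCTG-1", "AAACGGG-1", "AAAGATG-1")
)

# Which of the expected sample-level columns are missing?
required <- c("sample_id", "condition", "batch")
missing_cols <- setdiff(required, colnames(meta))
print(missing_cols)

# For a single-sample object, every cell gets the same value (assumed labels).
fill_values <- c(sample_id = "sample_A", condition = "control", batch = "batch_1")
for (col in missing_cols) {
  meta[[col]] <- fill_values[[col]]
}

# Cell names must be unique within the object ...
stopifnot(anyDuplicated(rownames(meta)) == 0)

# ... and a sample prefix keeps them unique after a later merge.
prefixed_names <- paste(meta$sample_id, rownames(meta), sep = "_")
print(prefixed_names)
```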
Single Sample, Multiple Files
This is also simple when it is a 10x Matrix folder.
sample/
  matrix.mtx.gz
  barcodes.tsv.gz
  features.tsv.gz
There are multiple files, but they form one count matrix from one sample.
Main checks:
- Do the three files belong to the same sample?
- Are the names standard enough for Seurat::Read10X()?
- Should this sample get a prefix before future merging?
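One way to check the file names before reading is sketched below in base R. The helper `missing_10x_files()` is hypothetical (it is not part of Seurat), and the temporary folder with empty placeholder files exists only to make the sketch runnable.

```r
# Hypothetical helper: report which standard 10x Matrix files are missing.
missing_10x_files <- function(dir) {
  expected <- c("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz")
  expected[!file.exists(file.path(dir, expected))]
}

# Demonstration on a temporary folder with empty placeholder files.
sample_dir <- tempfile("sample_")
dir.create(sample_dir)
file.create(file.path(sample_dir, c("matrix.mtx.gz", "barcodes.tsv.gz")))

# features.tsv.gz was deliberately left out, so it should be reported.
missing_files <- missing_10x_files(sample_dir)
print(missing_files)

# Only call the reader once nothing is missing, for example:
# if (length(missing_files) == 0) {
#   counts <- Seurat::Read10X(data.dir = sample_dir)
# }
```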
Multiple Samples, Single File Or Folder
This happens when a provider gives one processed object containing all samples.
Examples:
- one .h5ad containing all donors
- one Seurat RDS containing all conditions
- one SCE RDS containing many samples
This is still manageable if metadata is good. The important question is whether cell-level metadata already contains sample information.
Main checks:
- Which column defines sample?
- Which column defines condition?
- Which column defines patient or donor?
- Is there a batch or sequencing-run column?
- Was the object already normalized, integrated, clustered, or annotated?
If no sample column exists, downstream analysis becomes risky because sample-level comparisons, pseudobulk, and composition analysis depend on sample identity.
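A quick way to answer the sample and condition questions is to tabulate the metadata. The sketch below uses a toy data frame in place of real cell-level metadata; the column names `sample_id`, `condition`, and `donor` are assumptions, since providers often use different names.

```r
# Toy cell-level metadata; in practice this would be seurat_obj[[]] or similar.
meta <- data.frame(
  sample_id = c("S1", "S1", "S2", "S2", "S2", "S3"),
  condition = c("control", "control", "treated", "treated", "treated", "control"),
  donor     = c("D1", "D1", "D2", "D2", "D2", "D3")
)

# Stop early if no sample column is present.
sample_col <- "sample_id"  # assumed name
stopifnot(sample_col %in% colnames(meta))

# Cells per sample: very small or very large counts deserve a second look.
cells_per_sample <- table(meta[[sample_col]])
print(cells_per_sample)

# Sample-by-condition table: each sample should fall under exactly one condition.
sample_by_condition <- table(meta[[sample_col]], meta$condition)
print(sample_by_condition)
conditions_per_sample <- rowSums(sample_by_condition > 0)
```

A sample that appears under more than one condition usually means the chosen column is not really the sample column, or that labels were mixed up upstream.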
Multiple Samples, Multiple Files Or Folders
This is the case that needs the most care.
Common pattern:
sample_A/filtered_feature_bc_matrix/
sample_B/filtered_feature_bc_matrix/
sample_C/filtered_feature_bc_matrix/
or:
sample_A.h5ad
sample_B.h5ad
sample_C.h5ad
Each file or folder usually corresponds to one biological sample. The practical question is how to store objects after loading.
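The folder pattern above can be turned into a named vector of paths, so that sample names travel with the data from the start. This is a sketch in base R; the temporary project layout is created only for illustration, and using the folder name as the sample ID is an assumption about how the data were delivered.

```r
# Build a toy project layout in a temporary directory (illustration only).
project_dir <- tempfile("project_")
for (s in c("sample_A", "sample_B", "sample_C")) {
  dir.create(file.path(project_dir, s, "filtered_feature_bc_matrix"),
             recursive = TRUE)
}

# One top-level folder per sample; the folder name becomes the sample ID.
sample_ids <- sort(list.dirs(project_dir, full.names = FALSE, recursive = FALSE))
matrix_dirs <- file.path(project_dir, sample_ids, "filtered_feature_bc_matrix")
names(matrix_dirs) <- sample_ids

# Confirm every sample actually has the expected subfolder.
stopifnot(all(dir.exists(matrix_dirs)))
print(names(matrix_dirs))
```

Looping over `matrix_dirs` with `lapply()` then yields a list that is already named by sample.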
Option 1: List Of Objects
Read each sample separately and store the result in a named list.
Conceptually:
objects <- list(
  sample_A = seurat_A,
  sample_B = seurat_B,
  sample_C = seurat_C
)
This is useful before merging because each sample can be inspected independently.
Good for:
- per-sample QC
- checking cell counts per sample
- adding sample-specific metadata
- avoiding barcode collisions before merge
- keeping raw per-sample objects available
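The barcode collision point can be illustrated without any single-cell package. The sketch below uses made-up barcodes; in practice the vectors would come from `colnames()` of each per-sample object.

```r
# Toy barcodes per sample; "AAACCTG-1" deliberately appears in both.
barcodes <- list(
  sample_A = c("AAACCTG-1", "AAACGGG-1"),
  sample_B = c("AAACCTG-1", "AAAGATG-1")
)

# Barcodes that occur in more than one sample would clash after a merge.
all_barcodes <- unlist(barcodes, use.names = FALSE)
collisions <- unique(all_barcodes[duplicated(all_barcodes)])
print(collisions)

# Prefixing each barcode with its sample name removes the clash.
prefixed <- unlist(
  Map(function(id, bc) paste(id, bc, sep = "_"), names(barcodes), barcodes),
  use.names = FALSE
)
stopifnot(anyDuplicated(prefixed) == 0)
print(prefixed)
```

This is the same kind of prefix that `add.cell.ids` applies in Seurat's `merge()`.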
For Seurat, each object should already have metadata such as:
seurat_obj$sample_id <- "sample_A"
seurat_obj$condition <- "control"
seurat_obj$batch <- "batch_1"
Option 2: One Merged Object
After metadata and cell names are clean, multiple sample objects can be merged into one object.
Conceptually:
merged_obj <- merge(
  x = objects[[1]],
  y = objects[-1],
  add.cell.ids = names(objects)
)
The merged object is useful for analysis steps that operate across all cells:
- shared QC visualization
- normalization workflow
- dimensionality reduction
- clustering
- annotation
- integration setup
But the merged object must still preserve sample identity in metadata. Without sample_id, the merged object loses the biological replicate structure.
Join Layers After Merge
In Seurat v5, merging objects may keep sample-specific layers inside the same assay.
After merge, check layers:
Seurat::Layers(merged_obj[["RNA"]])
You may see layers such as:
counts.sample_A
counts.sample_B
counts.sample_C
If the next analysis step expects one joined layer, use JoinLayers():
merged_obj <- Seurat::JoinLayers(merged_obj)
Then check again:
Seurat::Layers(merged_obj[["RNA"]])
What to remember:
- This is mainly a Seurat v5 layer issue.
- Do not call it blindly before checking layers.
- Use it when downstream functions expect a unified layer or give layer-related errors.
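The "check first, then join" advice could look like the sketch below. It assumes Seurat v5 is installed (so that `Layers()` and `JoinLayers()` are available after `library(Seurat)`); `make_toy()` is a hypothetical helper that builds tiny objects from random counts, and the exact names of the split layers may differ from the ones listed above.

```r
library(Seurat)  # assumes Seurat v5, which attaches SeuratObject
library(Matrix)

# Hypothetical helper: a tiny Seurat object with random counts.
make_toy <- function(sample_id, n_cells, seed) {
  set.seed(seed)
  counts <- Matrix(rpois(20 * n_cells, lambda = 2), nrow = 20, sparse = TRUE)
  rownames(counts) <- paste0("gene", 1:20)
  colnames(counts) <- paste0("cell", seq_len(n_cells))  # same names in every sample
  obj <- CreateSeuratObject(counts = counts, project = sample_id)
  obj$sample_id <- sample_id
  obj
}

objects <- list(
  sample_A = make_toy("sample_A", 30, seed = 1),
  sample_B = make_toy("sample_B", 40, seed = 2)
)

merged_obj <- merge(
  x = objects[[1]],
  y = objects[-1],
  add.cell.ids = names(objects)
)

# Sample identity should survive the merge as metadata.
print(table(merged_obj$sample_id))

# Join only when more than one counts layer is present.
counts_layers <- grep("^counts", Layers(merged_obj[["RNA"]]), value = TRUE)
if (length(counts_layers) > 1) {
  merged_obj <- JoinLayers(merged_obj)
}
print(Layers(merged_obj[["RNA"]]))
```

The guard makes the call a no-op on objects whose layers are already joined, which is one way to avoid calling `JoinLayers()` blindly.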
List Or Merged Object
The practical answer is usually both.
Keep:
- a list of per-sample objects for sample-level checking
- a merged object for joint analysis
Do not think of list vs merged object as an either/or choice. The list is the safer staging structure; the merged object is the working analysis structure.
Note
For multiple samples, the most important metadata columns are usually sample_id, condition, patient_id, and batch.
The goal is not just to load cells. The goal is to preserve the sample structure needed for QC, merging, integration, pseudobulk, and composition analysis.
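As a small illustration of why sample identity matters for composition analysis, the sketch below computes per-sample cell-type proportions in base R. The data frame is a toy stand-in for merged cell-level metadata, and `cell_type` is a hypothetical annotation column.

```r
# Toy merged metadata; cell_type is a hypothetical annotation column.
meta <- data.frame(
  sample_id = c("S1", "S1", "S1", "S1", "S2", "S2", "S2", "S2"),
  condition = c(rep("control", 4), rep("treated", 4)),
  cell_type = c("T", "T", "B", "B", "T", "T", "T", "B")
)

# Cell counts per sample and cell type.
counts_by_sample <- table(meta$sample_id, meta$cell_type)

# Row-wise proportions: each sample's composition sums to 1.
composition <- prop.table(counts_by_sample, margin = 1)
print(composition)
```

Without `sample_id`, only a pooled proportion across all cells could be computed, which hides the variation between samples.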