PCA
Purpose
PCA is the first major dimensionality reduction step in the standard Seurat workflow.
It summarizes high-dimensional gene expression variation into principal components. These PCs are then used for neighbor graph construction, clustering, UMAP, and other downstream steps.
Typical position:
NormalizeData() -> FindVariableFeatures() -> ScaleData() -> RunPCA()
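As a rough illustration of what the PCA step computes, here is a minimal base R sketch on simulated data. It is not Seurat's implementation: the toy matrix and all variable names (expr, embeddings, loadings, pc_sdev) are assumptions made for this example. Cells are rows and genes are columns here, which is the transpose of how Seurat stores expression data.

```r
set.seed(1)

# Simulated toy matrix (not real data): 200 cells (rows) x 30 genes (columns)
n_cells <- 200
n_genes <- 30
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)

# Add a shared signal to the first 10 genes so that one dominant axis exists
shared_signal <- rnorm(n_cells)
expr[, 1:10] <- expr[, 1:10] + 2 * shared_signal

# Center and scale each gene (similar in spirit to ScaleData())
expr_scaled <- scale(expr, center = TRUE, scale = TRUE)

# PCA through a singular value decomposition of the scaled matrix
sv <- svd(expr_scaled)
embeddings <- sv$u %*% diag(sv$d)    # cell coordinates: cells x PCs
loadings <- sv$v                     # gene weights: genes x PCs
pc_sdev <- sv$d / sqrt(n_cells - 1)  # standard deviation of each PC

# Cross-check against base R's prcomp() on the same scaled matrix
pca_ref <- prcomp(expr_scaled, center = FALSE, scale. = FALSE)

dim(embeddings)
round(pc_sdev[1:5], 2)
```

The embeddings matrix plays the role of the cell embeddings that downstream steps consume, and the loadings matrix shows which genes drive each PC.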
Before PCA
RunPCA() works on the scaled data, so ScaleData() should already have been run on the features you plan to use.
Check that variable features exist:
length(Seurat::VariableFeatures(seu))
head(Seurat::VariableFeatures(seu), 10)
Check that scaling has been run:
scaled_data <- Seurat::GetAssayData(
  object = seu,
  assay = "RNA",
  slot = "scale.data" # Seurat v5 deprecates slot in favor of layer = "scale.data"
)
dim(scaled_data) # features x cells
Run PCA
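Before calling RunPCA(), the two checks from the previous section can be combined into one helper. The sketch below is a hypothetical base R function (check_pca_inputs is not part of Seurat) and it runs on a toy matrix, so the gene and cell names are made up for illustration.

```r
# Hypothetical helper: confirm that variable features exist and that
# every one of them is present in the scaled matrix (features x cells).
check_pca_inputs <- function(variable_features, scaled_matrix) {
  stopifnot(is.character(variable_features), is.matrix(scaled_matrix))
  missing <- setdiff(variable_features, rownames(scaled_matrix))
  list(
    n_variable_features = length(variable_features),
    n_scaled_features = nrow(scaled_matrix),
    n_cells = ncol(scaled_matrix),
    missing_from_scaled = missing,
    ok = length(variable_features) > 0 &&
      nrow(scaled_matrix) > 0 &&
      length(missing) == 0
  )
}

# Toy input for illustration only: 3 genes x 4 cells
toy_scaled <- matrix(
  rnorm(12),
  nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), paste0("cell", 1:4))
)
toy_variable_features <- c("GeneA", "GeneB", "GeneD")

result <- check_pca_inputs(toy_variable_features, toy_scaled)
result$missing_from_scaled  # "GeneD"
result$ok                   # FALSE

# With a Seurat object, the inputs could instead be (assumed usage):
# check_pca_inputs(Seurat::VariableFeatures(seu), scaled_data)
```

A non-empty missing_from_scaled usually means ScaleData() was run on a different feature set than the one passed to RunPCA().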
Standard PCA:
seu <- Seurat::RunPCA(
  object = seu, # Seurat object after ScaleData()
  features = Seurat::VariableFeatures(seu), # use variable features
  npcs = 50, # number of PCs to compute
  verbose = TRUE # show progress messages
)
What to remember:
- PCA usually uses variable features.
- PCA is stored as a reduction inside the Seurat object.
- The number of computed PCs should be larger than the number you expect to use later.
Check PCA
Check reductions:
Seurat::Reductions(seu)
Extract PCA embeddings:
pca_embeddings <- Seurat::Embeddings(
  object = seu,
  reduction = "pca"
)
dim(pca_embeddings)
pca_embeddings[1:5, 1:5]
View top feature loadings for selected PCs:
print(
  x = seu[["pca"]],
  dims = 1:5,
  nfeatures = 5
)
Visualize PCA
Basic PCA plot:
Seurat::DimPlot(
  object = seu,
  reduction = "pca"
)
Color PCA by metadata:
Seurat::DimPlot(
  object = seu,
  reduction = "pca",
  group.by = "sample_id"
)
Show PC loadings:
Seurat::VizDimLoadings(
  object = seu,
  dims = 1:2,
  reduction = "pca"
)
Heatmap for top PC genes:
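Seurat::DimHeatmap(
  object = seu,
  dims = 1:6,
  cells = 500, # plot only the 500 cells with the most extreme PC scores
  balanced = TRUE # show equal numbers of genes with positive and negative scores
)
Choose PCs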
Use an elbow plot to inspect how much variation each PC captures. ElbowPlot() plots the standard deviation of each PC against its rank:
p_elbow <- Seurat::ElbowPlot(
  object = seu,
  reduction = "pca",
  ndims = 50
) +
  ggplot2::labs(title = "Elbow Plot of PCA")
p_elbow
The elbow plot helps choose how many PCs to use for downstream steps such as FindNeighbors(), FindClusters(), and RunUMAP().
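To put numbers next to the elbow plot, the per-PC standard deviations can be converted into shares of variance. The sketch below uses a hypothetical helper (summarise_pc_variance is not a Seurat function), and the standard deviations are made-up values for illustration. The shares are relative to the PCs that were computed, not to the total variance in the data, and the 90% cutoff is an arbitrary example rather than a recommended rule.

```r
# Hypothetical helper: turn per-PC standard deviations into variance shares
summarise_pc_variance <- function(pc_sdev) {
  stopifnot(is.numeric(pc_sdev), length(pc_sdev) > 0, all(pc_sdev >= 0))
  pc_var <- pc_sdev^2
  prop_var <- pc_var / sum(pc_var)
  data.frame(
    pc = seq_along(pc_sdev),
    sdev = pc_sdev,
    prop_var = prop_var,           # share of variance among the computed PCs
    cum_prop_var = cumsum(prop_var)
  )
}

# Made-up standard deviations, for illustration only
example_sdev <- c(8, 5, 3, 2, 1.5, 1.2, 1.1, 1.0, 1.0, 0.9)
pc_summary <- summarise_pc_variance(example_sdev)
print(pc_summary)

# First PC at which the cumulative share reaches an arbitrary 90% cutoff
n_pcs_90 <- which(pc_summary$cum_prop_var >= 0.9)[1]
n_pcs_90  # 4

# With a Seurat object, the input could instead be (assumed usage):
# pc_sdev <- Seurat::Stdev(seu, reduction = "pca")
```

A table like this is a complement to the elbow plot, not a replacement: the cutoff still needs to be judged against the plot and the downstream results.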
Note
PCA is not the final visualization. It is mainly an intermediate representation for graph construction, clustering, and UMAP.
The chosen number of PCs affects downstream results, so it should be checked rather than copied blindly.