Single-cell Transcriptomics Tutorial

Single-cell RNA-seq data analysis: from cells to cell states.

A practical tutorial for single-cell and single-nucleus RNA-seq projects. It covers experimental design, metadata, FASTQ-to-count processing, quality control, doublet detection, ambient RNA, normalization, clustering, cell-type annotation, batch integration, marker discovery, differential expression, trajectory analysis, visualization and reproducible reporting.

1. Overview: what is single-cell RNA-seq data analysis?

Single-cell RNA-seq data analysis converts sequencing reads from individual cells or nuclei into a cell-by-gene expression matrix. The analysis then identifies high-quality cells, clusters transcriptionally similar cells, annotates cell types, compares conditions and interprets cell-state changes.

  • Generate matrix: Assign reads to cell barcodes, UMIs and genes.
  • Resolve cells: Filter low-quality cells, doublets and technical artefacts.
  • Interpret biology: Cluster cells, annotate cell types and compare conditions.
Core principle: single-cell RNA-seq analysis combines sequencing QC, statistical modeling and biological interpretation. A beautiful UMAP is not enough; conclusions must be supported by QC, metadata, replicates and marker evidence.

2. Single-cell RNA-seq assay types

Assay | Typical objective | Analysis focus
Droplet 3′ scRNA-seq | Profile many cells at gene level. | Cell calling, UMI counting, clustering, annotation and differential state analysis.
Droplet 5′ scRNA-seq | Gene expression plus immune receptor profiling. | Cell annotation, TCR/BCR analysis and clonotype integration.
Single-nucleus RNA-seq | Profile frozen or difficult tissues. | Intronic reads, nuclear RNA signal, cell-type annotation and ambient RNA handling.
Plate-based scRNA-seq | Higher sensitivity per cell, fewer cells. | Per-cell QC, full-length transcript information and batch design.
CITE-seq | RNA plus antibody-derived tags. | Joint RNA/protein analysis and cell-type annotation.
Multiome RNA+ATAC | RNA expression plus chromatin accessibility in the same cells. | Joint embedding, regulatory analysis and cell-state interpretation.

3. Experimental design

Single-cell projects are often limited by biological replication and batch structure. Good design is essential for reliable comparison between conditions.

Questions to answer early

  • What is the biological question: cell-type discovery, condition comparison, rare cell detection, developmental trajectory or therapy response?
  • How many biological replicates or donors are available per condition?
  • Are samples multiplexed or processed separately?
  • What tissue dissociation, nuclei isolation or preservation method was used?
  • What chemistry and library type were used: 3′, 5′, full-length, single-nucleus, CITE-seq or multiome?
  • What cell types are expected, and are marker genes known?
  • Are batch variables balanced across conditions?
  • Will differential expression be tested at cell level, pseudobulk sample level or both?
Many single-cell datasets contain thousands of cells but only a few biological replicates. For condition-level inference, donor/sample replication matters more than the total cell count alone.

4. Metadata for single-cell studies

Metadata connects cell barcodes to samples, donors, conditions, batches and technical factors. It is central for integration, differential expression and interpretation.

Example sample metadata
sample_id	donor_id	condition	batch	chemistry	tissue	fastq_path
S1	D001	control	A	10x_3prime_v3	PBMC	data/fastq/S1/
S2	D002	control	A	10x_3prime_v3	PBMC	data/fastq/S2/
S3	D003	treated	B	10x_3prime_v3	PBMC	data/fastq/S3/
S4	D004	treated	B	10x_3prime_v3	PBMC	data/fastq/S4/

Recommended metadata fields

  • Sample ID, donor ID, condition, replicate and batch.
  • Library chemistry, reference version, sequencing run and lane.
  • Tissue, time point, treatment, genotype, disease status or other covariates.
  • Expected cells, loaded cells, recovered cells and viability if available.
  • Multiplexing information, hashtag oligos or genotype demultiplexing metadata when relevant.
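
Metadata checks are easy to automate before any reads are processed. The sketch below is a minimal, hypothetical helper: `check_design` and the three-donor threshold are assumptions for illustration, not part of any library. It parses the first four columns of the example sample metadata above and reports duplicate sample IDs, low replication and batch/condition confounding. On the example rows it flags that every control sample sits in batch A and every treated sample in batch B.

```python
import csv
import io
from collections import defaultdict

# First four columns of the example sample metadata (tab-separated).
METADATA_TSV = """sample_id\tdonor_id\tcondition\tbatch
S1\tD001\tcontrol\tA
S2\tD002\tcontrol\tA
S3\tD003\ttreated\tB
S4\tD004\ttreated\tB
"""


def check_design(tsv_text, min_donors=3):
    """Return a list of design warnings (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    warnings = []

    # Sample IDs must be unique so every barcode maps to one sample.
    ids = [r["sample_id"] for r in rows]
    if len(ids) != len(set(ids)):
        warnings.append("duplicate sample_id values")

    donors = defaultdict(set)      # condition -> donors
    conditions = defaultdict(set)  # batch -> conditions
    for r in rows:
        donors[r["condition"]].add(r["donor_id"])
        conditions[r["batch"]].add(r["condition"])

    # min_donors is an assumed threshold; choose one that fits the study.
    for condition, d in sorted(donors.items()):
        if len(d) < min_donors:
            warnings.append(f"{condition}: only {len(d)} donor(s)")

    # If every batch holds a single condition, the two cannot be separated.
    if len(conditions) > 1 and all(len(c) == 1 for c in conditions.values()):
        warnings.append("batch is fully confounded with condition")
    return warnings


design_warnings = check_design(METADATA_TSV)
for w in design_warnings:
    print("WARNING:", w)
```

A fully confounded design cannot separate treatment effects from batch effects, so this kind of check is most useful before libraries are prepared.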

5. Input files and reference resources

Input | Typical format | Use
Raw sequencing files | BCL or FASTQ.GZ | Starting point for count-matrix generation.
Cell barcode whitelist | Tool-specific list | Defines valid cell barcodes for a chemistry.
Reference genome/transcriptome | FASTA, GTF, tool-specific index | Maps reads to genes and transcripts.
Feature-barcode files | CSV/TSV | Defines antibody tags, hashtags, CRISPR guides or custom features.
Sample metadata | TSV/CSV | Defines conditions, donors, batches and sample-level variables.
Cell-level metadata | TSV/CSV or object metadata | Stores QC metrics, clusters, annotations and analysis results.

6. FASTQ-to-count matrix generation

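Most single-cell workflows first generate a count matrix. Cell Ranger and Seurat store it with genes as rows and cells as columns, whereas Scanpy's AnnData stores cells as rows and genes as columns, so check the orientation when moving between tools. Droplet-based workflows use cell barcodes and UMIs to assign reads to cells and molecules.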
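
The core idea of UMI counting can be shown in a few lines. The sketch below uses made-up reads and a hypothetical `count_umis` helper: reads that share a cell barcode, gene and UMI are treated as PCR copies of one molecule and counted once. Real tools also correct sequencing errors in barcodes and UMIs, which this sketch ignores.

```python
from collections import defaultdict

# Toy aligned reads as (cell_barcode, umi, gene); all values are made up.
reads = [
    ("AAAC", "TTGCA", "CD3E"),
    ("AAAC", "TTGCA", "CD3E"),   # PCR duplicate of the read above
    ("AAAC", "GGATC", "CD3E"),
    ("AAAC", "CCTAG", "MS4A1"),
    ("CCGT", "TTGCA", "MS4A1"),
    ("CCGT", "TTGCA", "MS4A1"),  # PCR duplicate
]


def count_umis(reads):
    """Count distinct UMIs per (cell barcode, gene) pair."""
    seen = defaultdict(set)
    for barcode, umi, gene in reads:
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


umi_counts = count_umis(reads)
for (barcode, gene), n in sorted(umi_counts.items()):
    print(barcode, gene, n)
```

Six reads collapse to four molecules here, which is why UMI counts rather than read counts fill the matrix.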

Tool family | Common use | Outputs
Cell Ranger | 10x Genomics-style processing. | Filtered and raw feature-barcode matrices, BAM, web summary and metrics.
STARsolo | Flexible open-source single-cell preprocessing. | Gene count matrices and cell-calling outputs.
kallisto/bustools | Fast lightweight quantification. | BUS files, count matrices and barcode/UMI summaries.
alevin-fry | Fast single-cell quantification and UMI resolution. | Count matrices and quantification summaries.
Conceptual Cell Ranger count command
# Note: Cell Ranger 8.0 and later also require an explicit --create-bam=true or --create-bam=false.
cellranger count \
  --id=S1 \
  --transcriptome=reference/refdata-gex-GRCh38 \
  --fastqs=data/fastq/S1 \
  --sample=S1 \
  --localcores=16 \
  --localmem=64

The reference must match the organism, genome assembly and gene annotation used in downstream interpretation. Single-nucleus data often benefit from references or settings that count intronic reads.

7. Raw sequencing and pipeline QC

Before analyzing cells, inspect sequencing and matrix-generation metrics at the sample level.

  • Read depth: Total reads and reads per cell determine sensitivity.
  • Valid barcodes: Low valid barcode rate can indicate chemistry or sample-index problems.
  • Mapping rate: Low mapping can indicate wrong reference, poor sample quality or contamination.
  • Fraction reads in cells: Low values can indicate ambient RNA, low cell recovery or empty droplets.
  • Sequencing saturation: Indicates whether additional sequencing would recover many new UMIs.
  • Median genes per cell: Reflects cell quality, complexity and tissue type.
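
Two of these metrics are simple ratios. The sketch below assumes one common definition of sequencing saturation, one minus the ratio of unique molecules to total reads, which to my understanding matches how Cell Ranger reports it. The function names and numbers are made up for illustration.

```python
def sequencing_saturation(n_reads, n_unique_molecules):
    """1 - unique/total: the share of reads that re-sequence a known molecule."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 1.0 - n_unique_molecules / n_reads


def fraction_reads_in_cells(reads_per_barcode, cell_barcodes):
    """Share of reads whose barcode was called as a cell."""
    total = sum(reads_per_barcode.values())
    in_cells = sum(n for bc, n in reads_per_barcode.items() if bc in cell_barcodes)
    return in_cells / total


# Toy numbers, made up for illustration.
saturation = sequencing_saturation(n_reads=1_000_000, n_unique_molecules=400_000)
reads_per_barcode = {"cell_1": 6000, "cell_2": 3000, "empty_1": 700, "empty_2": 300}
in_cells = fraction_reads_in_cells(reads_per_barcode, {"cell_1", "cell_2"})
print(f"saturation={saturation:.2f}, fraction_reads_in_cells={in_cells:.2f}")
```

A saturation of 0.60 means that 60% of reads hit molecules already seen, so deeper sequencing would still add new UMIs but with diminishing returns.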

8. Cell-level quality control

Cell-level QC filters out low-quality barcodes, damaged cells, potential empty droplets and extreme outliers. Thresholds should be dataset-specific.

Metric | Meaning | Interpretation
Total UMIs | Number of captured molecules per cell. | Very low values suggest empty or poor-quality cells; very high values may indicate doublets.
Detected genes | Genes with nonzero counts per cell. | Low values suggest poor quality; very high values can indicate doublets.
Mitochondrial fraction | Fraction of counts from mitochondrial genes. | High values often indicate stressed or damaged cells, but expectations vary by tissue.
Ribosomal fraction | Fraction of ribosomal gene counts. | Can reflect biology, sample quality or technical variation.
Hemoglobin, immunoglobulin or stress genes | Context-specific high-abundance gene families. | May indicate contamination, tissue composition or stress responses.
Example Scanpy QC metrics
import scanpy as sc

adata = sc.read_10x_mtx("results/cellranger/S1/outs/filtered_feature_bc_matrix")
# "MT-" matches human mitochondrial gene symbols; mouse symbols use "mt-".
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Example dataset-specific filters, combined into one boolean mask
keep = (
    (adata.obs["n_genes_by_counts"] > 200)
    & (adata.obs["pct_counts_mt"] < 20)
    & (adata.obs["total_counts"] < 50000)
)
# Copy so that adata is a standalone object rather than a view
adata = adata[keep, :].copy()
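
The fixed cut-offs above are only examples. One way to derive dataset-specific thresholds is to flag barcodes that lie several median absolute deviations (MADs) from the median. The sketch below is plain Python so it runs without Scanpy; `mad_bounds`, the choice of three MADs and the toy counts are assumptions for illustration.

```python
import math
import statistics


def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at median +/- n_mads * MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


# Toy total UMI counts per barcode (made up): five typical cells,
# one near-empty barcode and one suspiciously large one.
total_counts = [4200, 5100, 4800, 5300, 4600, 150, 61000]

# Work on the log scale because UMI totals are strongly right-skewed.
log_counts = [math.log10(c) for c in total_counts]
lower, upper = mad_bounds(log_counts, n_mads=3.0)

flagged = [c for c, lc in zip(total_counts, log_counts) if not lower <= lc <= upper]
print("flagged totals:", flagged)
```

With real data the same function can be applied per sample to `adata.obs["total_counts"]`, `adata.obs["n_genes_by_counts"]` or `adata.obs["pct_counts_mt"]`. Inspect the resulting bounds on a violin plot before filtering, because a MAD of zero or a bimodal distribution makes them misleading.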

9. Empty droplets and cell calling

Droplet-based single-cell data include many barcodes from empty droplets containing ambient RNA. Cell-calling algorithms distinguish real cells from background barcodes.

  • Knee plot: Visualizes barcode rank versus UMI count to separate cells from background.
  • EmptyDrops-style methods: Statistically evaluate whether barcodes differ from ambient background.
  • Expected cells: Platform settings can influence initial cell-calling thresholds.
  • Manual review: Marker genes and QC plots should support final decisions.
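
To make the role of the expected-cells setting concrete, the sketch below implements a simplified version of the "order of magnitude" heuristic used by early Cell Ranger releases: take a robust maximum among the top expected barcodes and call every barcode within ten-fold of it. The function name and toy counts are assumptions, and current tools add EmptyDrops-style testing on top of rules like this.

```python
def call_cells(umi_per_barcode, expected_cells):
    """Call barcodes whose UMI total is within 10-fold of the top barcodes."""
    ranked = sorted(umi_per_barcode.values(), reverse=True)
    top = ranked[:expected_cells]
    if not top:
        raise ValueError("no barcodes to rank")
    # Robust maximum: skip the top 1% of the expected cells, which may be
    # multiplets, and use the next value as the reference.
    reference = top[int(0.01 * len(top))]
    cutoff = reference / 10
    called = {bc for bc, n in umi_per_barcode.items() if n >= cutoff}
    return called, cutoff


# Toy UMI totals (made up): five cells and four background barcodes.
umi_per_barcode = {
    "c1": 9000, "c2": 8000, "c3": 7500, "c4": 6000, "c5": 1200,
    "e1": 300, "e2": 250, "e3": 90, "e4": 40,
}
called, cutoff = call_cells(umi_per_barcode, expected_cells=5)
print("cutoff:", cutoff, "called:", sorted(called))
```

Small cells with low RNA content can fall below a cutoff like this, which is one reason statistical methods and manual marker review are used as well.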

10. Doublet detection

Doublets occur when two or more cells share the same barcode or droplet. They can appear as artificial hybrid clusters and distort marker-gene analysis.

Doublet signal | Possible interpretation | Action
High UMI and gene count | Barcode contains multiple cells. | Flag with doublet detection tools and inspect clusters.
Mixed lineage markers | Potential doublet or real transitional state. | Review markers, tissue biology and replicate consistency.
Isolated small cluster | Cluster may be doublet-enriched. | Check doublet scores and marker combinations.
Multiplexing conflict | Hashtag or genotype multiplet. | Remove or label according to demultiplexing results.
Doublet rates depend on cell loading, platform and tissue. Remove likely technical doublets, but avoid removing real activated or transitional cell states without evidence.
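
A rough expectation for the number of doublets helps when judging doublet scores. The sketch below assumes a linear rule of thumb of about 0.8% doublets per 1,000 recovered cells, a figure often quoted for droplet platforms; treat `rate_per_1000_cells` as an assumption and replace it with the value for your platform and loading. The second function uses the fact that, in an equal pool of n samples, a fraction 1 - 1/n of doublets mixes two different samples and is therefore visible to hashtag or genotype demultiplexing.

```python
def expected_doublet_rate(n_recovered_cells, rate_per_1000_cells=0.008):
    """Linear rule of thumb: the rate grows with the number of cells recovered."""
    return min(1.0, rate_per_1000_cells * n_recovered_cells / 1000)


def cross_sample_fraction(n_pooled_samples):
    """Share of doublets that combine two different samples in an equal pool."""
    return 1.0 - 1.0 / n_pooled_samples


n_cells = 10_000
rate = expected_doublet_rate(n_cells)
n_doublets = rate * n_cells
visible = cross_sample_fraction(4)
print(f"expected doublet rate: {rate:.1%} (~{n_doublets:.0f} barcodes)")
print(f"visible to demultiplexing in a 4-sample pool: {visible:.0%}")
```

The remaining within-sample doublets still need expression-based detection.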

11. Ambient RNA correction

Ambient RNA comes from lysed or damaged cells and can contaminate droplets. It is especially relevant in fragile tissues, nuclei preparations and samples with high cell death.

  • Signs of ambient RNA: Low-level expression of abundant genes from unrelated cell types.
  • Correction tools: SoupX, CellBender-style workflows and related methods can estimate background.
  • Marker review: Check whether correction improves marker specificity without erasing true signal.
  • Document choices: Ambient correction can affect downstream clustering and differential expression.
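
The idea behind background estimation can be sketched with toy numbers. The code below is a conceptual illustration only, not the algorithm of SoupX or CellBender: it builds an ambient profile from empty droplets and subtracts the expected ambient counts from one cell, given an assumed contamination fraction `rho`. All names and values are made up.

```python
def ambient_profile(empty_droplets):
    """Gene fractions pooled over empty droplets (sums to 1)."""
    totals = {}
    for droplet in empty_droplets:
        for gene, n in droplet.items():
            totals[gene] = totals.get(gene, 0) + n
    grand_total = sum(totals.values())
    return {gene: n / grand_total for gene, n in totals.items()}


def subtract_ambient(cell, profile, rho):
    """Remove the expected ambient counts, never going below zero."""
    n_umis = sum(cell.values())
    corrected = {}
    for gene, n in cell.items():
        expected_ambient = rho * n_umis * profile.get(gene, 0.0)
        corrected[gene] = max(0.0, n - expected_ambient)
    return corrected


# Toy data: hemoglobin dominates the background in this made-up sample.
empty_droplets = [{"HBB": 8, "CD3E": 2}, {"HBB": 7, "MS4A1": 3}]
profile = ambient_profile(empty_droplets)

t_cell = {"CD3E": 90, "HBB": 10}
corrected = subtract_ambient(t_cell, profile, rho=0.1)
print(corrected)
```

Most of the HBB signal in the T cell is attributed to background while CD3E is barely changed, which is the behaviour to look for when reviewing markers after correction.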

12. Normalization, variable genes and scaling

Normalization adjusts for sequencing depth and technical variability before dimensionality reduction and clustering. The best method depends on workflow and dataset structure.

Step | Purpose | Notes
Library-size normalization | Scale cells to comparable total counts. | Common for exploratory analysis.
Log transformation | Stabilize expression range. | Often used before PCA and visualization.
Variable-gene selection | Identify genes informative for cell-state differences. | Exclude technical genes when appropriate.
Regression/scaling | Remove selected technical effects. | Use cautiously; overcorrection can remove biology.
SCTransform-style modeling | Model technical variation with regularized methods. | Common in Seurat workflows.
Example Scanpy preprocessing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
# Keep log-normalized values for all genes before subsetting and scaling
adata.raw = adata
adata = adata[:, adata.var["highly_variable"]].copy()
sc.pp.scale(adata, max_value=10)
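
To see what library-size normalization and the log transformation do to a single cell, the sketch below reproduces the arithmetic by hand in plain Python: scale each cell to `target_sum` total counts, then apply the natural log of one plus the value. The two toy cells are made up and have the same composition at different sequencing depths.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell to target_sum total counts, then take log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(n / total * target_sum) for gene, n in counts.items()}


# Same composition (10% CD3E), ten-fold difference in depth.
shallow_cell = {"CD3E": 5, "ACTB": 45}
deep_cell = {"CD3E": 50, "ACTB": 450}

shallow = normalize_log1p(shallow_cell)
deep = normalize_log1p(deep_cell)
print(round(shallow["CD3E"], 3), round(deep["CD3E"], 3))
```

Both cells end up with the same normalized CD3E value, log(1 + 1000), because the depth difference has been removed. The shallow cell's value still rests on only five molecules, so it is far noisier.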

13. Dimensionality reduction

Dimensionality reduction summarizes high-dimensional gene expression patterns for clustering, visualization and interpretation.

  • PCA: Primary linear reduction used for neighbor graphs and clustering.
  • UMAP: Popular nonlinear visualization for cell relationships.
  • t-SNE: Alternative nonlinear visualization, useful in some contexts.
  • Diffusion maps: Useful for gradual trajectories and developmental continua.
UMAP distances and cluster separations are visual summaries, not direct measures of biological distance. Always interpret with marker genes, QC and sample metadata.

14. Cell clustering

Clustering groups cells with similar expression profiles. Resolution parameters influence how many clusters are detected.

Example Scanpy PCA, neighbors, clustering and UMAP
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.leiden(adata, resolution=0.6)
sc.tl.umap(adata)

Clustering review checklist

  • Do clusters contain cells from multiple biological replicates?
  • Are clusters driven by batch, donor, mitochondrial fraction or cell cycle?
  • Do clusters have coherent marker genes?
  • Are rare clusters real or likely doublets, debris or contamination?
  • Does a higher or lower resolution change biological conclusions?
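
The first two checklist items can be screened automatically. The sketch below computes, for each cluster, the fraction of cells contributed by its most common sample and flags clusters above an assumed 90% threshold. The helper name, threshold and toy labels are illustrative.

```python
from collections import Counter, defaultdict


def dominant_sample_fraction(cluster_labels, sample_labels):
    """For each cluster, the share of cells from its most common sample."""
    per_cluster = defaultdict(Counter)
    for cluster, sample in zip(cluster_labels, sample_labels):
        per_cluster[cluster][sample] += 1
    return {
        cluster: max(counts.values()) / sum(counts.values())
        for cluster, counts in per_cluster.items()
    }


# Toy labels: cluster "0" is shared by all samples, cluster "1" is not.
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
samples = ["S1", "S2", "S3", "S4", "S3", "S3", "S3", "S3"]

fractions = dominant_sample_fraction(clusters, samples)
flagged_clusters = sorted(c for c, f in fractions.items() if f > 0.9)
print(fractions, "flagged:", flagged_clusters)
```

With an AnnData object the inputs would typically be `adata.obs["leiden"]` and a sample column such as `adata.obs["sample_id"]` (the column name is an assumption). A flagged cluster is not necessarily an artefact, but it should be explained by biology or removed.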

15. Cell-type annotation

Cell-type annotation assigns biological identities to clusters or individual cells using marker genes, reference datasets and expert review.

Annotation approach | Strengths | Cautions
Manual marker-based annotation | Transparent and biologically interpretable. | Requires domain knowledge and good marker selection.
Reference mapping | Fast and useful for known tissues. | Can fail when references differ by disease, species, platform or tissue state.
Automated classifiers | Scalable for large datasets. | Predictions require review and confidence assessment.
Multi-modal annotation | RNA plus protein, ATAC or immune receptors improves confidence. | Requires correct modality integration.
Good annotation often uses several evidence sources: canonical markers, differential markers, sample distribution, reference labels and expected tissue biology.
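
Marker-based annotation can be made explicit and reviewable with a simple scoring rule. The sketch below averages the cluster-level expression of a few canonical PBMC markers per cell type and refuses to label a cluster when the best and second-best scores are close. The marker panel is deliberately small, and the `min_margin` threshold and expression values are assumptions.

```python
# Small illustrative panel of canonical PBMC markers.
MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
    "Monocyte": ["LYZ", "CD14"],
}


def annotate_cluster(mean_expr, markers, min_margin=0.5):
    """Label a cluster by its best marker-set score, or leave it unresolved."""
    scores = {
        label: sum(mean_expr.get(g, 0.0) for g in genes) / len(genes)
        for label, genes in markers.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score - second_score < min_margin:
        return "unresolved", scores
    return best_label, scores


# Toy cluster-level mean log-expression values (made up).
clear_cluster = {"CD3D": 2.1, "CD3E": 1.9, "MS4A1": 0.1, "LYZ": 0.2}
mixed_cluster = {"CD3D": 1.0, "CD3E": 1.0, "MS4A1": 1.0, "CD79A": 0.8}

label_clear, _ = annotate_cluster(clear_cluster, MARKERS)
label_mixed, _ = annotate_cluster(mixed_cluster, MARKERS)
print(label_clear, label_mixed)
```

The mixed cluster expresses T-cell and B-cell markers at similar levels, so it is left unresolved and becomes a candidate for doublet review rather than receiving a forced label.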

16. Batch correction and sample integration

Integration reduces technical differences between samples, donors or batches. It is useful when comparing shared cell types across samples, but it can remove real biological differences if overused.

Situation | Recommended thinking | Notes
Multiple donors per condition | Integrate for shared cell-type annotation, then use sample-aware statistics. | Do not treat cells as independent replicates for condition-level inference.
Strong technical batch | Consider integration tools and inspect pre/post integration plots. | Check whether known biology is preserved.
Different tissues or states | Be cautious with aggressive integration. | Overcorrection may force distinct biology together.
Different platforms | Use dedicated mapping or integration approaches. | Feature overlap and chemistry differences matter.

17. Marker-gene discovery

Marker genes help describe clusters, annotate cell types and identify cell-state changes. Marker testing should consider expression frequency, effect size and sample representation.

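Example marker discovery in Scanpy
import os

# Scanpy computes log fold changes assuming log-normalized input, so test on
# adata.raw (used by default when set) or a log-normalized layer, not scaled values.
sc.tl.rank_genes_groups(
    adata,
    groupby="leiden",
    method="wilcoxon"
)

markers = sc.get.rank_genes_groups_df(adata, group=None)
os.makedirs("results/markers", exist_ok=True)
markers.to_csv("results/markers/cluster_markers.csv", index=False)

Cluster marker tests can be inflated when many cells come from the same donor. For condition comparisons, pseudobulk or mixed models are often more appropriate.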

18. Differential expression and pseudobulk analysis

Differential expression in single-cell studies can be tested within cell types or states. For comparisons between conditions, pseudobulk analysis aggregates counts per sample and cell type, preserving biological replication.

  • Subset cell type: Analyze comparable cells, such as T cells or macrophages.
  • Aggregate by sample: Create pseudobulk counts per donor/sample and cell type.
  • Model condition: Use DESeq2, edgeR, limma-voom or other count models with covariates.

Method | Best use | Caution
Cell-level testing | Marker discovery and exploratory cluster differences. | Can overstate significance if sample structure is ignored.
Pseudobulk | Condition-level differential expression with replicates. | Requires enough cells per sample and cell type.
Mixed models | Complex designs with donor or batch effects. | Computationally heavier and requires statistical care.
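
The aggregation step is a grouped sum of raw counts. The sketch below shows it in plain Python with made-up cells; the `pseudobulk` helper and the `min_cells` threshold are assumptions. The input must be raw UMI counts, not normalized or scaled values, because count models such as DESeq2 and edgeR estimate their own size factors.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts per (sample, cell type); also count contributing cells.

    cells: iterable of (sample_id, cell_type, {gene: raw_count}).
    """
    sums = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for sample, cell_type, counts in cells:
        key = (sample, cell_type)
        n_cells[key] += 1
        for gene, n in counts.items():
            sums[key][gene] += n
    return {key: dict(genes) for key, genes in sums.items()}, dict(n_cells)


# Toy cells (made up).
cells = [
    ("S1", "T cell", {"IFNG": 1, "ACTB": 10}),
    ("S1", "T cell", {"IFNG": 0, "ACTB": 12}),
    ("S1", "B cell", {"MS4A1": 3, "ACTB": 8}),
    ("S3", "T cell", {"IFNG": 4, "ACTB": 9}),
]
profiles, n_cells = pseudobulk(cells)

# Drop profiles built from too few cells (threshold is an assumption).
min_cells = 2
kept = sorted(key for key, n in n_cells.items() if n >= min_cells)
print(profiles[("S1", "T cell")], "kept:", kept)
```

Each retained profile then acts as one column of a bulk-style count matrix, with one column per sample and cell type, so the number of replicates equals the number of samples rather than the number of cells.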

19. Cell composition and differential abundance

Single-cell data can reveal changes in cell-type proportions between groups. This requires careful sample-level analysis because cells from the same sample are not independent.

  • Cell counts per type: Summarize cell-type proportions by sample.
  • Compositional effects: Relative proportions can change even when absolute abundance is unknown.
  • Sample-aware testing: Use donor-level or sample-level statistics when comparing groups.
  • Validation: Flow cytometry, imaging or deconvolution can support abundance claims.
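
A short worked example shows why compositional effects matter. In the made-up tissue below the absolute number of T cells is identical in both conditions, yet their proportion halves because monocytes expand. Single-cell data only observe the proportions.

```python
def proportions(counts):
    """Convert absolute cell numbers to fractions of the total."""
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


# Hypothetical absolute cell numbers in the tissue (not observable by scRNA-seq).
control = {"T cell": 500, "Monocyte": 500}
treated = {"T cell": 500, "Monocyte": 1500}

p_control = proportions(control)
p_treated = proportions(treated)
print("T cell proportion:", p_control["T cell"], "->", p_treated["T cell"])
```

A drop in T-cell proportion here reflects monocyte expansion rather than T-cell loss, which is why abundance claims benefit from orthogonal validation.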

20. Trajectory and pseudotime analysis

Trajectory methods infer continuous transitions such as differentiation, activation or cell-cycle progression. They are useful when biology is gradual rather than discrete.

Question | Analysis concept | Notes
How do cells transition between states? | Pseudotime ordering. | Requires a plausible starting point and biological validation.
Are there branching lineages? | Trajectory graph or lineage inference. | Branch interpretation depends on sampling and cell-state continuity.
Which genes change over a trajectory? | Pseudotime-associated expression. | Can reveal programs of differentiation or activation.
What is RNA velocity? | Spliced/unspliced RNA dynamics. | Requires suitable data and careful model assumptions.

21. Cell-cell communication analysis

Cell-cell communication tools infer potential ligand-receptor interactions between annotated cell types or states. These analyses generate hypotheses about signaling, not direct proof of physical interactions.

  • Ligand expression: Potential signal produced by one cell type.
  • Receptor expression: Potential responsiveness in another cell type.
  • Pathway summaries: Group interactions by signaling families or pathways.
  • Validation: Spatial, protein or perturbation data improves confidence.
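
One simple way to score a candidate interaction is the product of the mean ligand expression in the sender and the mean receptor expression in the receiver. The sketch below applies that rule to the CXCL12/CXCR4 pair with made-up expression values. It is one scoring choice among many; published tools typically add permutation tests, curated databases and handling of multi-subunit receptors.

```python
def interaction_score(mean_expr, sender, receiver, ligand, receptor):
    """Product of mean ligand (sender) and mean receptor (receiver) expression."""
    ligand_level = mean_expr[sender].get(ligand, 0.0)
    receptor_level = mean_expr[receiver].get(receptor, 0.0)
    return ligand_level * receptor_level


# Toy mean expression per annotated cell type (made up).
mean_expr = {
    "Fibroblast": {"CXCL12": 2.0, "CXCR4": 0.1},
    "T cell": {"CXCL12": 0.0, "CXCR4": 1.5},
}

fib_to_t = interaction_score(mean_expr, "Fibroblast", "T cell", "CXCL12", "CXCR4")
t_to_fib = interaction_score(mean_expr, "T cell", "Fibroblast", "CXCL12", "CXCR4")
print("Fibroblast -> T cell:", fib_to_t, "| T cell -> Fibroblast:", t_to_fib)
```

The score is directional and depends entirely on the cell-type annotation, so annotation errors propagate directly into inferred interactions.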

22. CITE-seq, immune profiling and multiome extensions

Many single-cell projects include additional modalities. Analysis should preserve modality-specific QC and integrate modalities only after appropriate preprocessing.

Modality | What it adds | Analysis focus
CITE-seq / ADT | Surface protein measurements. | Protein QC, background correction and RNA-protein annotation.
TCR/BCR sequencing | Immune receptor clonotypes. | Clonal expansion, lineage, antigen-response hypotheses and pairing with cell states.
scATAC-seq | Chromatin accessibility. | Peak calling, gene activity, motif analysis and regulatory interpretation.
Multiome RNA+ATAC | Expression and accessibility from the same cells. | Joint embeddings, peak-gene links and regulatory programs.

23. Visualization

Visualizations should show both biological results and technical quality. Avoid drawing conclusions from UMAP shape alone.

Plot | Purpose | What to inspect
UMAP/t-SNE by cluster | Visualize cell groups. | Cluster structure, isolated outliers and annotation consistency.
UMAP by sample or batch | Detect batch effects. | Whether clusters are sample-specific or shared across replicates.
QC violin plots | Review UMI, gene and mitochondrial metrics. | Thresholds and outlier populations.
Dot plots | Show marker expression across clusters. | Marker specificity and annotation support.
Heatmaps | Summarize marker or pathway expression. | Cluster and sample-level patterns.
Composition plots | Show cell-type proportions by sample. | Group trends and sample variability.

24. Example single-cell RNA-seq workflow

The following simplified workflow illustrates a common Scanpy route after a 10x-style count matrix has been generated. Real projects should adapt thresholds and models to tissue, chemistry and study design.

Minimal Scanpy analysis example
import scanpy as sc

# Read 10x matrix
adata = sc.read_10x_mtx(
    "results/cellranger/S1/outs/filtered_feature_bc_matrix",
    var_names="gene_symbols",
    cache=True
)

# QC metrics
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Dataset-specific filtering (copy so adata is a standalone object, not a view)
keep = (adata.obs["n_genes_by_counts"] > 200) & (adata.obs["pct_counts_mt"] < 20)
adata = adata[keep, :].copy()

# Preserve raw counts before normalization
adata.layers["counts"] = adata.X.copy()

# Normalize and transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Variable genes, PCA, neighbors and clustering
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
adata.raw = adata  # keep log-normalized values for all genes
adata = adata[:, adata.var["highly_variable"]].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.leiden(adata, resolution=0.6)
sc.tl.umap(adata)

# Cluster markers (tested on adata.raw, i.e. log-normalized values, when it is set)
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")

# Save results
adata.write("results/objects/S1_scanpy_processed.h5ad")

25. Deliverables and reporting

  • FASTQ-to-count processing summaries and sample-level QC reports.
  • Raw and filtered count matrices.
  • Cell-level metadata with QC metrics, clusters, cell-type labels and sample information.
  • Doublet and ambient RNA assessment reports.
  • UMAP/t-SNE embeddings and PCA or neighbor graph outputs.
  • Cluster marker tables and cell-type annotation evidence.
  • Differential-expression and pseudobulk tables when comparing conditions.
  • Cell-composition summaries by sample and group.
  • Trajectory, cell-cell communication or multiome outputs if included.
  • Reproducible analysis object files such as h5ad, Seurat RDS or loom, plus methods, software versions and limitations.
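
Software versions are easy to capture automatically at the end of a run. The sketch below uses only the Python standard library; the package list is an assumption and should match what the analysis actually imports.

```python
import json
import platform
from importlib import metadata


def software_versions(packages):
    """Collect the Python version and installed versions of the given packages."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Package list is illustrative; extend it to cover the whole workflow.
manifest = software_versions(["scanpy", "anndata", "numpy"])
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Writing this manifest next to the saved analysis object keeps the methods section and the results in sync.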

26. Single-cell RNA-seq analysis cheat sheet

Step | Common tools | Main outputs
FASTQ to matrix | Cell Ranger, STARsolo, kallisto/bustools, alevin-fry | Cell-by-gene count matrices and processing metrics.
QC | Seurat, Scanpy, scater, scran | UMI, gene, mitochondrial and sample-level QC plots.
Doublets | Scrublet, DoubletFinder, scDblFinder, Solo-style methods | Doublet scores and flagged cells.
Ambient RNA | SoupX, CellBender-style workflows | Corrected or flagged expression matrices.
Normalization | Seurat, Scanpy, scran, SCTransform-style methods | Normalized expression and variable genes.
Clustering | Seurat, Scanpy, igraph/leidenalg | Clusters, embeddings and marker genes.
Annotation | Marker genes, SingleR, Azimuth-style mapping, CellTypist-style classifiers | Cell-type labels and confidence summaries.
Integration | Seurat integration, Harmony, Scanorama, scVI-style models | Integrated embeddings and batch-corrected representations.
Differential expression | DESeq2/edgeR pseudobulk, MAST-style methods, mixed models | Cell-type-specific differential-expression tables.
Reporting | R Markdown, Quarto, notebooks, workflow reports, AI-assisted summaries | QC, annotation, plots, methods, interpretation and limitations.

Frequently asked questions

What is single-cell RNA-seq data analysis?

Single-cell RNA-seq data analysis is the computational processing of sequencing reads from individual cells or nuclei to generate cell-by-gene count matrices, perform quality control, identify cell types, study cell states, compare conditions and interpret biological mechanisms.

What is the difference between single-cell RNA-seq and bulk RNA-seq?

Bulk RNA-seq measures average expression across a mixture of cells, while single-cell RNA-seq measures expression for thousands to millions of individual cells or nuclei, enabling cell-type discovery and cell-state analysis.

What are UMIs and why are they important?

Unique molecular identifiers are short barcodes attached to RNA molecules before amplification. They help count original molecules rather than PCR-amplified reads and are central to many droplet-based single-cell workflows.

What are the main QC metrics in single-cell RNA-seq?

Important metrics include total UMIs per cell, number of detected genes, mitochondrial gene fraction, ribosomal fraction, doublet score, ambient RNA contamination, cell complexity, batch effects and expected marker-gene expression.

What is doublet detection?

Doublet detection identifies barcodes that likely contain two or more cells. Doublets can create artificial hybrid cell states and distort clustering or cell-type annotation.

What is ambient RNA?

Ambient RNA is RNA released from lysed or damaged cells that can contaminate droplets or wells. It can create low-level expression of genes from other cell types and should be evaluated, especially in fragile tissues.

Should single-cell data be integrated across batches?

Integration is often useful when combining samples, donors, runs or technologies, but it should be done carefully. Overcorrection can remove real biology, while undercorrection can leave technical batch effects.

Can AI help with single-cell RNA-seq?

AI can help summarize QC, annotate clusters, integrate literature, prioritize marker genes, draft reports and interpret cell-state changes, while the analysis should remain reproducible, versioned and reviewed by domain experts.