1. Overview: what is single-cell RNA-seq data analysis?
Single-cell RNA-seq data analysis converts sequencing reads from individual cells or nuclei into a cell-by-gene expression matrix. The analysis then identifies high-quality cells, clusters transcriptionally similar cells, annotates cell types, compares conditions and interprets cell-state changes.
Generate matrix: Assign reads to cell barcodes, UMIs and genes.
Resolve cells: Filter low-quality cells, doublets and technical artefacts.
Interpret biology: Cluster cells, annotate cell types and compare conditions.
Core principle: single-cell RNA-seq analysis combines sequencing QC, statistical modeling and biological interpretation. A beautiful UMAP is not enough; conclusions must be supported by QC, metadata, replicates and marker evidence.
2. Single-cell RNA-seq assay types
Assay | Typical objective | Analysis focus
Droplet 3′ scRNA-seq | Profile many cells at gene level. | Cell calling, UMI counting, clustering, annotation and differential state analysis.
Droplet 5′ scRNA-seq | Gene expression plus immune receptor profiling. | Cell annotation, TCR/BCR analysis and clonotype integration.
Single-nucleus RNA-seq | Profile frozen or difficult tissues. | Intronic reads, nuclear RNA signal, cell-type annotation and ambient RNA handling.
Plate-based scRNA-seq | Higher sensitivity per cell, fewer cells. | Per-cell QC, full-length transcript information and batch design.
CITE-seq | RNA plus antibody-derived tags. | Joint RNA/protein analysis and cell-type annotation.
Multiome RNA+ATAC | RNA expression plus chromatin accessibility in the same cells. | Joint embedding, regulatory analysis and cell-state interpretation.
3. Experimental design
Single-cell projects are often limited by biological replication and batch structure. Good design is essential for reliable comparison between conditions.
Questions to answer early
What is the biological question: cell-type discovery, condition comparison, rare cell detection, developmental trajectory or therapy response?
How many biological replicates or donors are available per condition?
Are samples multiplexed or processed separately?
What tissue dissociation, nuclei isolation or preservation method was used?
What chemistry and library type were used: 3′, 5′, full-length, single-nucleus, CITE-seq or multiome?
What cell types are expected, and are marker genes known?
Are batch variables balanced across conditions?
Will differential expression be tested at cell level, pseudobulk sample level or both?
Many single-cell datasets contain thousands of cells but only a few biological replicates. For condition-level inference, donor/sample replication matters more than the total cell count alone.
4. Metadata for single-cell studies
Metadata connects cell barcodes to samples, donors, conditions, batches and technical factors. It is central for integration, differential expression and interpretation.
Example sample metadata
sample_id donor_id condition batch chemistry tissue fastq_path
S1 D001 control A 10x_3prime_v3 PBMC data/fastq/S1/
S2 D002 control A 10x_3prime_v3 PBMC data/fastq/S2/
S3 D003 treated B 10x_3prime_v3 PBMC data/fastq/S3/
S4 D004 treated B 10x_3prime_v3 PBMC data/fastq/S4/
Recommended metadata fields
Sample ID, donor ID, condition, replicate and batch.
Library chemistry, reference version, sequencing run and lane.
Tissue, time point, treatment, genotype, disease status or other covariates.
Expected cells, loaded cells, recovered cells and viability if available.
Multiplexing information, hashtag oligos or genotype demultiplexing metadata when relevant.
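A sample sheet can be checked for batch/condition confounding before any analysis starts. The sketch below uses only the Python standard library and a tab-separated copy of four columns from the example sheet above; the function names are illustrative, not part of any tool. In that example every control sample is in batch A and every treated sample is in batch B, so the check reports a fully confounded layout.

```python
import csv
import io
from collections import defaultdict

# Tab-separated copy of four columns from the example sample sheet.
SAMPLE_SHEET = (
    "sample_id\tdonor_id\tcondition\tbatch\n"
    "S1\tD001\tcontrol\tA\n"
    "S2\tD002\tcontrol\tA\n"
    "S3\tD003\ttreated\tB\n"
    "S4\tD004\ttreated\tB\n"
)

def batches_by_condition(tsv_text):
    """Map each condition to the set of batches it appears in."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        seen[row["condition"]].add(row["batch"])
    return dict(seen)

def is_confounded(tsv_text):
    """True when no batch is shared between any two conditions."""
    groups = list(batches_by_condition(tsv_text).values())
    return len(groups) > 1 and all(
        a.isdisjoint(b) for i, a in enumerate(groups) for b in groups[i + 1:]
    )

layout = batches_by_condition(SAMPLE_SHEET)
confounded = is_confounded(SAMPLE_SHEET)
print(layout, confounded)
```

When the check returns True, batch and condition effects cannot be separated statistically, which is worth knowing before choosing an integration or differential expression strategy.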
5. Input files and reference resources
Input | Typical format | Use
Raw sequencing files | BCL or FASTQ.GZ | Starting point for count-matrix generation.
Cell barcode whitelist | Tool-specific list | Defines valid cell barcodes for a chemistry.
Reference genome/transcriptome | FASTA, GTF, tool-specific index | Maps reads to genes and transcripts.
Feature-barcode files | CSV/TSV | Defines antibody tags, hashtags, CRISPR guides or custom features.
Sample metadata | TSV/CSV | Defines conditions, donors, batches and sample-level variables.
Cell-level metadata | TSV/CSV or object metadata | Stores QC metrics, clusters, annotations and analysis results.
6. FASTQ-to-count matrix generation
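Most single-cell workflows first generate a count matrix of genes by cells. 10x-style feature-barcode matrices store genes as rows and cell barcodes as columns, while some tools, such as Scanpy, hold the transposed matrix with cells as rows. Droplet-based workflows use cell barcodes and UMIs to assign reads to cells and molecules.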
Tool family | Common use | Outputs
Cell Ranger | 10x Genomics-style processing. | Filtered and raw feature-barcode matrices, BAM, web summary and metrics.
STARsolo | Flexible open-source single-cell preprocessing. | Gene count matrices and cell-calling outputs.
kallisto/bustools | Fast lightweight quantification. | BUS files, count matrices and barcode/UMI summaries.
alevin-fry | Fast single-cell quantification and UMI resolution. |
The reference must match the organism, genome assembly and gene annotation used in downstream interpretation. Single-nucleus data often benefit from references or settings that count intronic reads.
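The core idea of UMI counting can be shown in a few lines: reads that share a cell barcode, UMI and gene are treated as copies of one original molecule. This is a minimal sketch with made-up barcodes and reads. Production tools additionally correct sequencing errors in barcodes and UMIs and resolve multi-gene reads, which is not attempted here.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell_barcode, umi, gene) tuples.
reads = [
    ("AAAC", "TTGCA", "CD3E"),
    ("AAAC", "TTGCA", "CD3E"),   # same barcode, UMI and gene: a PCR duplicate
    ("AAAC", "GGATC", "CD3E"),
    ("AAAC", "CCTAG", "MS4A1"),
    ("CCGT", "TTGCA", "CD3E"),   # same UMI in another cell is a separate molecule
]

def umi_counts(reads):
    """Count distinct UMIs per (barcode, gene) pair."""
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        umis[(barcode, gene)].add(umi)
    return {key: len(value) for key, value in umis.items()}

counts = umi_counts(reads)
print(counts)
```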
7. Raw sequencing and pipeline QC
Before analyzing cells, inspect sequencing and matrix-generation metrics at the sample level.
Read depth: Total reads and reads per cell determine sensitivity.
Valid barcodes: Low valid barcode rate can indicate chemistry or sample-index problems.
Mapping rate: Low mapping can indicate wrong reference, poor sample quality or contamination.
Fraction reads in cells: Low values can indicate ambient RNA, low cell recovery or empty droplets.
Sequencing saturation: Indicates whether additional sequencing would recover many new UMIs.
Median genes per cell: Reflects cell quality, complexity and tissue type.
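Sequencing saturation is commonly reported as one minus the ratio of unique molecules to reads. The sketch below assumes that definition, with "unique" meaning distinct barcode/UMI/gene combinations among the reads considered. Individual pipelines may restrict which reads enter the calculation, so the value should be compared only within one pipeline.

```python
def sequencing_saturation(n_reads, n_unique_molecules):
    """Fraction of reads that re-sample an already observed molecule."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    if not 0 <= n_unique_molecules <= n_reads:
        raise ValueError("unique molecules must lie between 0 and n_reads")
    return 1.0 - n_unique_molecules / n_reads

# Hypothetical sample: 1,000 usable reads collapsing to 400 unique molecules.
saturation = sequencing_saturation(1000, 400)
print(round(saturation, 3))
```

A value near 1 suggests that deeper sequencing would mostly re-sample known molecules, while a value near 0 suggests the library is still far from exhausted.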
8. Cell-level quality control
Cell-level QC filters out low-quality barcodes, damaged cells, potential empty droplets and extreme outliers. Thresholds should be dataset-specific.
Metric | Meaning | Interpretation
Total UMIs | Number of captured molecules per cell. | Very low values suggest empty or poor-quality cells; very high values may indicate doublets.
Detected genes | Genes with nonzero counts per cell. | Low values suggest poor quality; very high values can indicate doublets.
Mitochondrial fraction | Fraction of counts from mitochondrial genes. | High values often indicate stressed or damaged cells, but expectations vary by tissue.
Ribosomal fraction | Fraction of ribosomal gene counts. | Can reflect biology, sample quality or technical variation.
Hemoglobin, immunoglobulin or stress genes | Context-specific high-abundance gene families. | May indicate contamination, tissue composition or stress responses.
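The first three metrics in the table can be computed directly from per-cell counts. The sketch below uses a toy dictionary of counts and assumes human gene symbols, where mitochondrial genes start with "MT-". The 20% cutoff is a placeholder for illustration only; as noted above, thresholds should be chosen per dataset.

```python
# Hypothetical raw counts: barcode -> {gene: UMI count}.
counts = {
    "cell_1": {"CD3E": 5, "MT-CO1": 1, "RPL13": 4},
    "cell_2": {"MT-CO1": 8, "MT-ND1": 6, "CD3E": 1},
    "cell_3": {"MS4A1": 3, "RPS6": 2, "MT-CO1": 0},
}

def qc_metrics(gene_counts, mito_prefix="MT-"):
    """Total UMIs, detected genes and mitochondrial fraction for one cell."""
    total = sum(gene_counts.values())
    detected = sum(1 for n in gene_counts.values() if n > 0)
    mito = sum(n for g, n in gene_counts.items() if g.startswith(mito_prefix))
    return {
        "total_umis": total,
        "detected_genes": detected,
        "mito_fraction": mito / total if total else 0.0,
    }

metrics = {barcode: qc_metrics(c) for barcode, c in counts.items()}

# Placeholder threshold for illustration; not a recommendation.
MAX_MITO_FRACTION = 0.2
flagged = sorted(b for b, m in metrics.items() if m["mito_fraction"] > MAX_MITO_FRACTION)
print(metrics, flagged)
```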
9. Cell calling and empty droplets
Droplet-based single-cell data include many barcodes from empty droplets containing ambient RNA. Cell-calling algorithms distinguish real cells from background barcodes.
Knee plot: Visualizes barcode rank versus UMI count to separate cells from background.
EmptyDrops-style methods: Statistically evaluate whether barcodes differ from ambient background.
Expected cells: Platform settings can influence initial cell-calling thresholds.
Manual review: Marker genes and QC plots should support final decisions.
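To illustrate how an expected-cell setting can shape the initial threshold, the sketch below implements a simple rank-based heuristic: take the top `expected_cells` barcodes, use a count near their 99th percentile as a robust maximum, and keep every barcode with at least one tenth of that value. This is similar in spirit to order-of-magnitude cutoffs used by some pipelines, but it is an assumption-laden toy, not an EmptyDrops-style statistical test.

```python
def call_cells(umi_per_barcode, expected_cells):
    """Return barcodes whose UMI total is at least 10% of a robust maximum."""
    if not umi_per_barcode or expected_cells < 1:
        return set()
    ranked = sorted(umi_per_barcode.values(), reverse=True)
    top = ranked[:expected_cells]
    # Count near the 99th percentile of the top barcodes (ranked descending).
    robust_max = top[int(0.01 * len(top))]
    threshold = robust_max / 10
    return {bc for bc, n in umi_per_barcode.items() if n >= threshold}

# Hypothetical UMI totals: three cell-like barcodes and three background barcodes.
umi_totals = {"c1": 5000, "c2": 4200, "c3": 3900, "e1": 120, "e2": 80, "e3": 40}
cells = call_cells(umi_totals, expected_cells=3)
print(sorted(cells))
```

Changing `expected_cells` shifts the robust maximum and therefore the cutoff, which is one reason the final call should be reviewed against knee plots and marker genes.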
10. Doublet detection
Doublets occur when two or more cells share the same barcode or droplet. They can appear as artificial hybrid clusters and distort marker-gene analysis.
Doublet signal | Possible interpretation | Action
High UMI and gene count | Barcode contains multiple cells. | Flag with doublet detection tools and inspect clusters.
Mixed lineage markers | Potential doublet or real transitional state. | Review markers, tissue biology and replicate consistency.
Isolated small cluster | Cluster may be doublet-enriched. | Check doublet scores and marker combinations.
Multiplexing conflict | Hashtag or genotype multiplet. | Remove or label according to demultiplexing results.
Doublet rates depend on cell loading, platform and tissue. Remove likely technical doublets, but avoid removing real activated or transitional cell states without evidence.
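The "mixed lineage markers" signal can be screened with a simple co-expression check. The sketch below flags barcodes that carry markers from more than one lineage; the marker panel, the `min_count` cutoff and the toy cells are all hypothetical, and a flag is a prompt for review rather than grounds for removal. Dedicated doublet tools typically go further and score barcodes against simulated doublets.

```python
# Hypothetical marker panel for a PBMC-like sample.
LINEAGE_MARKERS = {
    "T cell": {"CD3E", "CD3D"},
    "B cell": {"MS4A1", "CD79A"},
    "Monocyte": {"LYZ", "CD14"},
}

def lineages_detected(gene_counts, min_count=2):
    """Lineages whose summed marker counts reach `min_count` in one cell."""
    return sorted(
        lineage
        for lineage, markers in LINEAGE_MARKERS.items()
        if sum(gene_counts.get(g, 0) for g in markers) >= min_count
    )

def flag_mixed_lineage(cells, min_count=2):
    """Return barcodes expressing markers of two or more lineages."""
    flagged = {}
    for barcode, gene_counts in cells.items():
        found = lineages_detected(gene_counts, min_count)
        if len(found) > 1:
            flagged[barcode] = found
    return flagged

cells = {
    "cell_a": {"CD3E": 4, "CD3D": 2},
    "cell_b": {"CD3E": 3, "MS4A1": 5, "CD79A": 2},
    "cell_c": {"LYZ": 9, "CD3E": 1},
}
candidates = flag_mixed_lineage(cells)
print(candidates)
```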
11. Ambient RNA correction
Ambient RNA comes from lysed or damaged cells and can contaminate droplets. It is especially relevant in fragile tissues, nuclei preparations and samples with high cell death.
Signs of ambient RNA: Low-level expression of abundant genes from unrelated cell types.
Correction tools: SoupX, CellBender-style workflows and related methods can estimate background.
Document choices: Ambient correction can affect downstream clustering and differential expression.
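The idea behind background estimation can be sketched as follows: pool counts from barcodes that are almost certainly empty to obtain an ambient gene profile, then compute how many counts per gene a real cell would be expected to pick up at a given contamination fraction. This is a conceptual illustration only, not the algorithm of SoupX or CellBender; the `max_umis` cutoff and the contamination fraction are assumed inputs here, whereas real tools estimate them from the data.

```python
def ambient_profile(counts, max_umis=10):
    """Gene fractions pooled over barcodes with at most `max_umis` total UMIs."""
    pooled = {}
    for gene_counts in counts.values():
        if sum(gene_counts.values()) <= max_umis:
            for gene, n in gene_counts.items():
                pooled[gene] = pooled.get(gene, 0) + n
    total = sum(pooled.values())
    return {g: n / total for g, n in pooled.items()} if total else {}

def expected_ambient(gene_counts, profile, contamination):
    """Expected ambient counts per gene for one cell at a given contamination."""
    total = sum(gene_counts.values())
    return {g: contamination * total * frac for g, frac in profile.items()}

# Hypothetical data: two near-empty droplets and one real cell.
counts = {
    "empty_1": {"HBB": 3, "LYZ": 1},
    "empty_2": {"HBB": 3, "LYZ": 1},
    "cell_1": {"CD3E": 80, "HBB": 15, "LYZ": 5},
}
profile = ambient_profile(counts)
background = expected_ambient(counts["cell_1"], profile, contamination=0.1)
print(profile, background)
```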
12. Normalization, variable genes and scaling
Normalization adjusts for sequencing depth and technical variability before dimensionality reduction and clustering. The best method depends on workflow and dataset structure.
Step | Purpose | Notes
Library-size normalization | Scale cells to comparable total counts. | Common for exploratory analysis.
Log transformation | Stabilize expression range. | Often used before PCA and visualization.
Variable-gene selection | Identify genes informative for cell-state differences. | Exclude technical genes when appropriate.
Regression/scaling | Remove selected technical effects. | Use cautiously; overcorrection can remove biology.
SCTransform-style modeling | Model technical variation with regularized methods. |
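The first two rows of the table combine into the familiar "normalize to a target sum, then log1p" transformation. The sketch below shows it for single cells using the standard library; the target of 10,000 counts per cell is a common convention assumed here, not a requirement.

```python
import math

def normalize_log1p(gene_counts, target_sum=10_000):
    """Scale one cell to `target_sum` total counts, then apply log(1 + x)."""
    total = sum(gene_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in gene_counts}
    return {
        gene: math.log1p(n / total * target_sum)
        for gene, n in gene_counts.items()
    }

# Two hypothetical cells with the same composition at different depths.
shallow = normalize_log1p({"CD3E": 1, "ACTB": 9})
deep = normalize_log1p({"CD3E": 10, "ACTB": 90})
print(shallow, deep)
```

Both cells end up with the same normalized values, which is the point of library-size scaling: a tenfold difference in depth no longer looks like a tenfold difference in expression.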
13. Dimensionality reduction
Dimensionality reduction summarizes high-dimensional gene expression patterns for clustering, visualization and interpretation.
PCA: Primary linear reduction used for neighbor graphs and clustering.
UMAP: Popular nonlinear visualization for cell relationships.
t-SNE: Alternative nonlinear visualization, useful in some contexts.
Diffusion maps: Useful for gradual trajectories and developmental continua.
UMAP distances and cluster separations are visual summaries, not direct measures of biological distance. Always interpret with marker genes, QC and sample metadata.
14. Cell clustering
Clustering groups cells with similar expression profiles. Resolution parameters influence how many clusters are detected.
Example Scanpy PCA, neighbors, clustering and UMAP
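A minimal sketch of these steps in Scanpy might look like the following. It assumes `adata` is an AnnData object that has already been normalized and log-transformed, that it contains more cells and genes than `n_pcs`, and that a Leiden backend is installed. The parameter values are placeholders to tune per dataset, and the function name is illustrative.

```python
def cluster_and_embed(adata, n_pcs=30, n_neighbors=15, resolution=1.0):
    """PCA -> neighbor graph -> Leiden clusters -> UMAP on a normalized AnnData."""
    # Imported here so the sketch can be loaded where scanpy is not installed.
    import scanpy as sc

    sc.pp.pca(adata, n_comps=n_pcs)                       # linear reduction
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
    sc.tl.leiden(adata, resolution=resolution, key_added="leiden")
    sc.tl.umap(adata)                                     # 2D layout for plotting
    return adata
```

Cluster labels land in `adata.obs["leiden"]`. Rerunning with a few different `resolution` values is a cheap way to see whether conclusions depend on cluster granularity.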
Do clusters contain cells from multiple biological replicates?
Are clusters driven by batch, donor, mitochondrial fraction or cell cycle?
Do clusters have coherent marker genes?
Are rare clusters real or likely doublets, debris or contamination?
Does a higher or lower resolution change biological conclusions?
15. Cell-type annotation
Cell-type annotation assigns biological identities to clusters or individual cells using marker genes, reference datasets and expert review.
Annotation approach | Strengths | Cautions
Manual marker-based annotation | Transparent and biologically interpretable. | Requires domain knowledge and good marker selection.
Reference mapping | Fast and useful for known tissues. | Can fail when references differ by disease, species, platform or tissue state.
Automated classifiers | Scalable for large datasets. | Predictions require review and confidence assessment.
Multi-modal annotation | RNA plus protein, ATAC or immune receptors improves confidence. | Requires correct modality integration.
Good annotation often uses several evidence sources: canonical markers, differential markers, sample distribution, reference labels and expected tissue biology.
16. Batch correction and sample integration
Integration reduces technical differences between samples, donors or batches. It is useful when comparing shared cell types across samples, but it can remove real biological differences if overused.
Situation | Recommended thinking | Notes
Multiple donors per condition | Integrate for shared cell-type annotation, then use sample-aware statistics. | Do not treat cells as independent replicates for condition-level inference.
Strong technical batch | Consider integration tools and inspect pre/post integration plots. | Check whether known biology is preserved.
Different tissues or states | Be cautious with aggressive integration. | Overcorrection may force distinct biology together.
Different platforms | Use dedicated mapping or integration approaches. | Feature overlap and chemistry differences matter.
17. Marker-gene discovery
Marker genes help describe clusters, annotate cell types and identify cell-state changes. Marker testing should consider expression frequency, effect size and sample representation.
Cluster marker tests can be inflated when many cells come from the same donor. For condition comparisons, pseudobulk or mixed models are often more appropriate.
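The three considerations named above (expression frequency, effect size and sample representation) can be summarized per gene and cluster before any formal test. The sketch below computes them for a toy set of cells; the record layout, the pseudocount of 1 in the fold change and the example values are assumptions for illustration.

```python
import math

def marker_stats(cells, gene, cluster):
    """Summarize one gene in one cluster versus all other cells.

    `cells` is a list of dicts with keys 'cluster', 'sample' and 'expr'
    (gene -> normalized expression).
    """
    inside = [c for c in cells if c["cluster"] == cluster]
    outside = [c for c in cells if c["cluster"] != cluster]
    if not inside or not outside:
        raise ValueError("both groups need at least one cell")

    def frac(group):
        return sum(1 for c in group if c["expr"].get(gene, 0) > 0) / len(group)

    def mean(group):
        return sum(c["expr"].get(gene, 0) for c in group) / len(group)

    expressing_samples = {c["sample"] for c in inside if c["expr"].get(gene, 0) > 0}
    return {
        "pct_in": frac(inside),
        "pct_out": frac(outside),
        "log2fc": math.log2((mean(inside) + 1) / (mean(outside) + 1)),
        "n_samples": len(expressing_samples),
    }

cells = [
    {"cluster": "0", "sample": "S1", "expr": {"GNLY": 4.0}},
    {"cluster": "0", "sample": "S2", "expr": {"GNLY": 2.0}},
    {"cluster": "0", "sample": "S1", "expr": {}},
    {"cluster": "1", "sample": "S1", "expr": {}},
    {"cluster": "1", "sample": "S2", "expr": {"GNLY": 1.0}},
]
stats = marker_stats(cells, "GNLY", "0")
print(stats)
```

A marker with a large fold change but `n_samples` of 1 is a candidate donor effect rather than a cluster property.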
18. Differential expression and pseudobulk analysis
Differential expression in single-cell studies can be tested within cell types or states. For comparisons between conditions, pseudobulk analysis aggregates counts per sample and cell type, preserving biological replication.
Subset cell type: Analyze comparable cells, such as T cells or macrophages.
Aggregate by sample: Create pseudobulk counts per donor/sample and cell type.
Model condition: Use DESeq2, edgeR, limma-voom or other count models with covariates.
Method | Best use | Caution
Cell-level testing | Marker discovery and exploratory cluster differences. | Can overstate significance if sample structure is ignored.
Pseudobulk | Condition-level differential expression with replicates. | Requires enough cells per sample and cell type.
Mixed models | Complex designs with donor or batch effects. | Computationally heavier and requires statistical care.
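The aggregation step of a pseudobulk analysis is a grouped sum of raw counts. The sketch below builds one count vector per sample and cell type from a toy list of cells and also records how many cells went into each group, since groups with very few cells are usually excluded. The record layout is an assumption; the resulting table would then go to a count model such as those named above.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts per (sample, cell_type) and count contributing cells."""
    sums = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for cell in cells:
        key = (cell["sample"], cell["cell_type"])
        n_cells[key] += 1
        bucket = sums[key]  # created even when the cell has no counts
        for gene, n in cell["counts"].items():
            bucket[gene] += n
    return {k: dict(v) for k, v in sums.items()}, dict(n_cells)

# Hypothetical annotated cells with raw (not normalized) counts.
cells = [
    {"sample": "S1", "cell_type": "T", "counts": {"IFNG": 2, "CD3E": 5}},
    {"sample": "S1", "cell_type": "T", "counts": {"CD3E": 3}},
    {"sample": "S1", "cell_type": "B", "counts": {"MS4A1": 4}},
    {"sample": "S3", "cell_type": "T", "counts": {"IFNG": 6, "CD3E": 4}},
]
bulk, sizes = pseudobulk(cells)
print(bulk, sizes)
```

Each (sample, cell type) pair becomes one observation, so the effective sample size for condition-level testing is the number of samples, not the number of cells.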
19. Cell composition and differential abundance
Single-cell data can reveal changes in cell-type proportions between groups. This requires careful sample-level analysis because cells from the same sample are not independent.
Cell counts per type: Summarize cell-type proportions by sample.
Compositional effects: Relative proportions can change even when absolute abundance is unknown.
Sample-aware testing: Use donor-level or sample-level statistics when comparing groups.
Validation: Flow cytometry, imaging or deconvolution can support abundance claims.
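Per-sample proportions are the usual starting table for differential abundance. The sketch below computes them from one (sample, cell type) label per cell; the labels are made up for illustration. Because proportions within a sample sum to one, an increase in one type forces a decrease elsewhere, which is the compositional effect noted above.

```python
from collections import Counter, defaultdict

def proportions_by_sample(labels):
    """Cell-type proportions per sample from (sample_id, cell_type) pairs."""
    per_sample = defaultdict(Counter)
    for sample_id, cell_type in labels:
        per_sample[sample_id][cell_type] += 1
    return {
        sample_id: {t: n / sum(c.values()) for t, n in c.items()}
        for sample_id, c in per_sample.items()
    }

# Hypothetical annotations, one entry per cell.
labels = [
    ("S1", "T"), ("S1", "T"), ("S1", "T"), ("S1", "B"),
    ("S3", "T"), ("S3", "B"),
]
props = proportions_by_sample(labels)
print(props)
```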
20. Trajectory and pseudotime analysis
Trajectory methods infer continuous transitions such as differentiation, activation or cell-cycle progression. They are useful when biology is gradual rather than discrete.
Question | Analysis concept | Notes
How do cells transition between states? | Pseudotime ordering. | Requires a plausible starting point and biological validation.
Are there branching lineages? | Trajectory graph or lineage inference. | Branch interpretation depends on sampling and cell-state continuity.
Which genes change over a trajectory? | Pseudotime-associated expression. | Can reveal programs of differentiation or activation.
What is RNA velocity? | Spliced/unspliced RNA dynamics. | Requires suitable data and careful model assumptions.
21. Cell-cell communication analysis
Cell-cell communication tools infer potential ligand-receptor interactions between annotated cell types or states. These analyses generate hypotheses about signaling, not direct proof of physical interactions.
Ligand expression: Potential signal produced by one cell type.
Receptor expression: Potential responsiveness in another cell type.
Pathway summaries: Group interactions by signaling families or pathways.
Validation: Spatial, protein or perturbation data improve confidence.
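One of the simplest ways to score a candidate interaction is the product of mean ligand expression in the sender and mean receptor expression in the receiver. The sketch below applies that scoring to a toy table of cell-type means; the expression values are invented, and published tools differ in how they score and test interactions, so this illustrates the concept rather than any specific method.

```python
def lr_score(mean_expr, sender, receiver, ligand, receptor):
    """Product of mean ligand (sender) and mean receptor (receiver) expression."""
    return mean_expr[sender].get(ligand, 0.0) * mean_expr[receiver].get(receptor, 0.0)

def rank_interactions(mean_expr, pairs):
    """Score every ordered sender/receiver combination for each ligand-receptor pair."""
    scores = []
    for ligand, receptor in pairs:
        for sender in mean_expr:
            for receiver in mean_expr:
                if sender != receiver:
                    score = lr_score(mean_expr, sender, receiver, ligand, receptor)
                    scores.append((score, sender, receiver, ligand, receptor))
    return sorted(scores, reverse=True)

# Hypothetical mean normalized expression per annotated cell type.
mean_expr = {
    "Monocyte": {"TNF": 2.0},
    "T cell": {"TNF": 0.1, "TNFRSF1B": 1.5},
}
ranked = rank_interactions(mean_expr, [("TNF", "TNFRSF1B")])
print(ranked)
```

A high score only says that both genes are expressed in the respective populations; whether the cells are close enough to signal is a separate question.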
22. CITE-seq, immune profiling and multiome extensions
Many single-cell projects include additional modalities. Analysis should preserve modality-specific QC and integrate modalities only after appropriate preprocessing.
Modality | What it adds | Analysis focus
CITE-seq / ADT | Surface protein measurements. | Protein QC, background correction and RNA-protein annotation.
TCR/BCR sequencing | Immune receptor clonotypes. | Clonal expansion, lineage, antigen-response hypotheses and pairing with cell states.
scATAC-seq | Chromatin accessibility. | Peak calling, gene activity, motif analysis and regulatory interpretation.
Multiome RNA+ATAC | Expression and accessibility from the same cells. | Joint embeddings, peak-gene links and regulatory programs.
23. Visualization
Visualizations should show both biological results and technical quality. Avoid drawing conclusions from UMAP shape alone.
Plot | Purpose | What to inspect
UMAP/t-SNE by cluster | Visualize cell groups. | Cluster structure, isolated outliers and annotation consistency.
UMAP by sample or batch | Detect batch effects. | Whether clusters are sample-specific or shared across replicates.
QC violin plots | Review UMI, gene and mitochondrial metrics. | Thresholds and outlier populations.
Dot plots | Show marker expression across clusters. | Marker specificity and annotation support.
Heatmaps | Summarize marker or pathway expression. | Cluster and sample-level patterns.
Composition plots | Show cell-type proportions by sample. | Group trends and sample variability.
24. Example single-cell RNA-seq workflow
The following simplified workflow illustrates a common Scanpy route after a 10x-style count matrix has been generated. Real projects should adapt thresholds and models to tissue, chemistry and study design.
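A sketch of the loading, QC and normalization part of such a route is given below. It assumes a directory containing a 10x-style matrix (matrix, barcodes and features files), human gene symbols with the "MT-" prefix for mitochondrial genes, and placeholder thresholds; the function name and parameters are illustrative and should be adapted.

```python
def load_and_preprocess(matrix_dir, min_genes=200, max_mito_pct=20.0):
    """Load a 10x-style matrix, apply basic QC filters and normalize."""
    # Imported here so the sketch can be loaded where scanpy is not installed.
    import scanpy as sc

    adata = sc.read_10x_mtx(matrix_dir, var_names="gene_symbols")
    adata.var_names_make_unique()

    # Per-cell QC metrics, including percent mitochondrial counts.
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
    )

    # Placeholder filters; choose thresholds from the QC plots of each dataset.
    sc.pp.filter_cells(adata, min_genes=min_genes)
    adata = adata[adata.obs["pct_counts_mt"] < max_mito_pct].copy()

    # Keep raw counts for later pseudobulk work, then normalize.
    adata.layers["counts"] = adata.X.copy()
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    return adata
```

Doublet detection, ambient RNA handling, clustering and annotation would follow from the returned object. Keeping the raw counts in a separate layer is a design choice that makes sample-level aggregation possible later without reloading the data.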
25. Frequently asked questions
What is single-cell RNA-seq data analysis?
Single-cell RNA-seq data analysis is the computational processing of sequencing reads from individual cells or nuclei to generate cell-by-gene count matrices, perform quality control, identify cell types, study cell states, compare conditions and interpret biological mechanisms.
What is the difference between single-cell RNA-seq and bulk RNA-seq?
Bulk RNA-seq measures average expression across a mixture of cells, while single-cell RNA-seq measures expression for thousands to millions of individual cells or nuclei, enabling cell-type discovery and cell-state analysis.
What are UMIs and why are they important?
Unique molecular identifiers are short barcodes attached to RNA molecules before amplification. They help count original molecules rather than PCR-amplified reads and are central to many droplet-based single-cell workflows.
What are the main QC metrics in single-cell RNA-seq?
Important metrics include total UMIs per cell, number of detected genes, mitochondrial gene fraction, ribosomal fraction, doublet score, ambient RNA contamination, cell complexity, batch effects and expected marker-gene expression.
What is doublet detection?
Doublet detection identifies barcodes that likely contain two or more cells. Doublets can create artificial hybrid cell states and distort clustering or cell-type annotation.
What is ambient RNA?
Ambient RNA is RNA released from lysed or damaged cells that can contaminate droplets or wells. It can create low-level expression of genes from other cell types and should be evaluated, especially in fragile tissues.
Should single-cell data be integrated across batches?
Integration is often useful when combining samples, donors, runs or technologies, but it should be done carefully. Overcorrection can remove real biology, while undercorrection can leave technical batch effects.
Can AI help with single-cell RNA-seq?
AI can help summarize QC, annotate clusters, integrate literature, prioritize marker genes, draft reports and interpret cell-state changes, while the analysis should remain reproducible, versioned and reviewed by domain experts.