RNA-seq data analysis transforms sequencing reads into quantitative and interpretable information about the transcriptome. Depending on the experiment, it may quantify gene expression, identify differential expression, study transcript isoforms, detect alternative splicing, evaluate allele-specific expression, discover gene fusions or support pathway-level interpretation.
Preprocessing: Validate metadata, run FASTQ QC, trim when needed and confirm library type.
Quantification: Map or pseudoalign reads, assign reads to genes or estimate transcript abundance.
Interpretation: Normalize, model comparisons, visualize results and interpret biological pathways.
Core principle: RNA-seq analysis is not only a sequence-processing task. It is a statistical experiment where replication, metadata, batch effects and model design determine whether conclusions are reliable.
2. RNA-seq assay types
Different RNA-seq protocols produce different signals and require different analysis choices.
Assay type | Typical objective | Analysis focus
Bulk mRNA-seq | Measure polyadenylated transcript expression. | Gene expression, transcript abundance, differential expression and pathway analysis.
Total RNA-seq | Capture coding and noncoding RNA, often after rRNA depletion. | mRNA, lncRNA, intronic signal, pre-mRNA and broader transcriptome analysis.
Small RNA-seq | Profile miRNAs, piRNAs or other small RNAs. | Adapter trimming, short-read mapping, small-RNA references and length distributions.
Stranded RNA-seq | Preserve transcript orientation. | Correct strandedness settings, antisense transcription and overlapping genes.
Long-read RNA-seq | Resolve full-length transcripts and isoforms. | Isoform discovery, transcript structure, fusion transcripts and splice complexity.
Single-cell RNA-seq | Measure expression at cell level. | Cell calling, UMI counting, doublets, clustering and cell-type annotation.
3. Experimental design before analysis
Good RNA-seq analysis begins before sequencing. Differential expression and downstream interpretation depend strongly on biological replication, balanced batches and well-defined contrasts.
Questions to answer early
What is the biological question and primary comparison?
How many biological replicates are available per group?
Are there known batch variables such as sequencing run, library-preparation date, donor, sex, tissue, treatment time or plate?
Is the library stranded or unstranded?
Is the protocol poly(A), total RNA, ribo-depleted, small RNA, 3′ counting, full-length or single-cell?
Should the analysis focus on genes, transcripts, isoforms, splice junctions, fusions or pathway signatures?
Which reference genome and annotation version will be used?
Are there expected confounders that must be included in the statistical model?
RNA-seq without biological replication can be useful for exploration, but it has limited power for statistically reliable differential-expression testing.
4. Input files and reference resources
RNA-seq workflows require raw reads, metadata, a reference genome or transcriptome and a compatible gene annotation.
Input | Typical format | Use
Raw reads | FASTQ or FASTQ.GZ | Primary sequencing reads before QC, alignment or quantification.
Sample sheet | TSV/CSV | Connects FASTQ files to samples, groups, batches and covariates.
Reference genome | FASTA | Used for splice-aware alignment and coordinate-based analyses.
Gene annotation | GTF or GFF3 | Defines genes, transcripts and exons for counting and interpretation.
Transcriptome | FASTA | Used by transcript quantification tools such as Salmon or Kallisto.
Optional references | rRNA, ERCC spike-ins, custom transcript sets | Used for contamination checks, spike-in normalization or custom analyses.
5. Sample metadata for RNA-seq
Metadata defines the statistical design. Poor metadata can make otherwise high-quality sequencing data difficult to interpret.
Example RNA-seq sample sheet
sample_id group batch condition sex fastq_1 fastq_2
S1 control A untreated F S1_R1.fastq.gz S1_R2.fastq.gz
S2 control B untreated M S2_R1.fastq.gz S2_R2.fastq.gz
S3 treated A drug F S3_R1.fastq.gz S3_R2.fastq.gz
S4 treated B drug M S4_R1.fastq.gz S4_R2.fastq.gz
Recommended metadata fields
sample_id: stable identifier used in all files.
group or condition: biological group for comparisons.
batch: sequencing run, preparation batch or processing batch.
replicate or donor identifier.
Covariates such as sex, age, time point, treatment dose, tissue or cell type when relevant.
FASTQ file names and read layout.
Library protocol, strandedness, read length and reference/annotation version.
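The checks implied by these fields can be sketched in Python. This is a minimal sketch, assuming a tab-separated sheet with the column names from the example above; the function name `validate_sample_sheet` and the decision to treat every example column as required are illustrative assumptions, not part of any standard tool.

```python
import csv
import io

# Column names follow the example sheet; treating all of them as required
# is an assumption of this sketch.
REQUIRED = ["sample_id", "group", "batch", "condition", "sex", "fastq_1", "fastq_2"]

def validate_sample_sheet(text):
    """Return a list of human-readable problems found in a TSV sample sheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    seen_ids, seen_fastq = set(), set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sid = row["sample_id"]
        if sid in seen_ids:
            problems.append(f"line {line_no}: duplicate sample_id {sid}")
        seen_ids.add(sid)
        for col in REQUIRED:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty value in {col}")
        for col in ("fastq_1", "fastq_2"):
            path = row[col]
            if path and path in seen_fastq:
                problems.append(f"line {line_no}: FASTQ listed twice: {path}")
            if path:
                seen_fastq.add(path)
    return problems

sheet = (
    "sample_id\tgroup\tbatch\tcondition\tsex\tfastq_1\tfastq_2\n"
    "S1\tcontrol\tA\tuntreated\tF\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n"
    "S4\ttreated\tB\tdrug\tM\tS4_R1.fastq.gz\tS4_R2.fastq.gz\n"
)
print(validate_sample_sheet(sheet))  # prints []
```

Running a check like this before any alignment starts catches duplicated identifiers and reused FASTQ files while they are still cheap to fix.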
6. FASTQ quality control
FASTQ QC checks raw read quality before alignment or quantification. RNA-seq-specific interpretation should consider library type, insert size, transcript abundance and possible rRNA contamination.
Per-base quality scores and quality decay toward read ends.
Adapter contamination, especially in short inserts and small RNA-seq.
Read counts per sample and unexpected sample outliers.
GC content distribution and overrepresented sequences.
Duplication patterns, interpreted in the context of expression levels and library complexity.
R1/R2 consistency for paired-end libraries.
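To illustrate the first item in the list, the following sketch computes the mean Phred score at each read position. It assumes four-line FASTQ records and Phred+33 quality encoding; the function names are hypothetical, and dedicated tools such as FastQC report far more than this.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def mean_quality_per_position(records, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    totals, counts = [], []
    for _, _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two toy reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nII5\n")
print(mean_quality_per_position(read_fastq(toy)))  # prints [40.0, 40.0, 30.0, 40.0]
```

For compressed input, a handle from `gzip.open(path, "rt")` can be passed to `read_fastq` in place of the in-memory example.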
7. Adapter trimming and filtering
Trimming removes adapters, primers or low-quality read ends. For RNA-seq, incorrect trimming can affect mappability and quantification, so preprocessing should be guided by QC.
Trim when: Adapters, primers or poor-quality tails are visible and likely to affect alignment or quantification.
Be careful with: Small RNA-seq, 3′ RNA-seq, UMI-containing reads and protocols with special barcode structures.
8. Reference genome, transcriptome and annotation preparation
Reference choice affects mapping, counting and interpretation. The genome FASTA, transcriptome FASTA and GTF/GFF annotation should come from compatible releases.
Genome FASTA: Used by splice-aware aligners such as STAR and HISAT2.
GTF/GFF annotation: Defines genes and transcripts for counting and splice-junction support.
Transcriptome FASTA: Used by Salmon, Kallisto and transcript-level workflows.
Version tracking: Record source, release number, assembly, download date and checksums.
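Version tracking can be partly automated. The sketch below records SHA-256 checksums for reference files in a small manifest; the manifest layout, the function names and the placeholder values in the demo are assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large FASTA files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(paths, source, release, download_date):
    """Collect provenance fields and per-file checksums in one dictionary."""
    return {
        "source": source,
        "release": release,
        "download_date": download_date,
        "files": {os.path.basename(p): sha256_of(p) for p in paths},
    }

# Demo on a tiny temporary FASTA; all field values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "genome.fa")
    with open(fasta, "wb") as fh:
        fh.write(b">chr1\nACGT\n")
    manifest = build_manifest([fasta], "example-source", "example-release", "example-date")
    print(json.dumps(manifest, indent=2))
```

Storing such a manifest next to the results makes it possible to confirm later that the genome, annotation and transcriptome really came from the same release.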
For differential expression, transcript-level estimates are often summarized to gene level using tximport-style workflows before DESeq2 or edgeR analysis.
12. Gene-level counting with featureCounts or HTSeq
Gene-level counting assigns aligned reads to annotated genomic features. Correct strandedness and annotation choice are critical.
Use for paired-end libraries when counting fragments.
-s 0 | Unstranded counting. | Use only if the library is unstranded.
-s 1 | Stranded counting. | Forward-stranded convention for featureCounts.
-s 2 | Reverse-stranded counting. | Many common stranded RNA-seq kits are reverse-stranded, but confirm experimentally.
Wrong strandedness can dramatically reduce assigned reads and distort expression estimates. Confirm strandedness with library documentation and QC tools.
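One way to turn a strandedness check into a counting setting is sketched below. It assumes a QC tool has already reported the fraction of reads consistent with the forward and the reverse orientation; the function name and the cut-offs (0.8, and 0.4 to 0.6 for unstranded) are hypothetical choices for illustration, not published thresholds.

```python
def featurecounts_strand_flag(forward_fraction, reverse_fraction, threshold=0.8):
    """Suggest a featureCounts -s value (0, 1 or 2), or None if ambiguous.

    The threshold values are illustrative assumptions; borderline libraries
    should be inspected manually rather than classified automatically.
    """
    total = forward_fraction + reverse_fraction
    if total <= 0:
        raise ValueError("no strand-informative reads")
    fwd = forward_fraction / total
    if fwd >= threshold:
        return 1  # forward-stranded
    if fwd <= 1 - threshold:
        return 2  # reverse-stranded
    if 0.4 <= fwd <= 0.6:
        return 0  # roughly balanced: unstranded
    return None   # in between: do not guess

print(featurecounts_strand_flag(0.03, 0.97))  # prints 2
print(featurecounts_strand_flag(0.50, 0.50))  # prints 0
```

Returning `None` for intermediate values is deliberate: a library that is neither clearly stranded nor clearly unstranded usually points to a protocol or contamination problem that a flag cannot fix.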
13. Count matrices and expression tables
Most statistical RNA-seq workflows start from a matrix where rows are genes or transcripts and columns are samples.
Expression value | Use | Caution
Raw counts | Differential-expression testing with DESeq2, edgeR and related methods. | Not directly comparable between samples without modeling or normalization.
Normalized counts | Visualization, clustering and exploratory analysis. | Normalization method should match the analysis objective.
TPM | Descriptive expression abundance and within-sample transcript comparison. | Not usually the preferred input for count-based differential-expression testing.
FPKM/RPKM | Older abundance summaries. | Less preferred in many modern workflows compared with TPM or count-based models.
Keep raw counts, normalized counts and annotation tables separate. Do not overwrite raw matrices with transformed values.
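The relationship between raw counts and TPM can be shown with a short calculation. The sketch assumes feature lengths in kilobases are known (quantification tools typically use effective lengths instead); the function returns a new list, so the raw counts stay untouched.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

raw_counts = [100, 200, 300]
lengths_kb = [1.0, 2.0, 3.0]

values = tpm(raw_counts, lengths_kb)
# All three features have the same reads-per-kilobase rate, so they share
# the same TPM even though their raw counts differ threefold.
print([round(v, 1) for v in values])  # prints [333333.3, 333333.3, 333333.3]
print(raw_counts)                     # prints [100, 200, 300]
```

Because TPM values within a sample always sum to one million, an increase in one highly expressed gene lowers the TPM of every other gene, which is one reason count-based models are preferred for testing.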
14. Sample-level RNA-seq QC
After quantification, check whether samples behave as expected. Sample-level QC often detects swaps, outliers, batch effects, failed libraries and metadata errors.
Library size: Total assigned reads or fragments per sample.
Detected genes: Number of genes with nonzero or sufficient counts.
PCA or MDS: Visualize sample relationships, groups, batches and outliers.
Correlation heatmap: Check replicate similarity and sample-level consistency.
Biotype composition: Evaluate protein-coding, rRNA, mitochondrial, intronic or intergenic signal.
Marker genes: Confirm expected cell type, tissue, sex-specific or treatment-responsive markers.
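Three of these checks reduce to simple arithmetic on the count matrix. The following sketch uses invented toy counts and hypothetical function names; the `min_count` default and the log2(count + 1) transform are assumptions chosen for illustration.

```python
import math

def library_size(column):
    """Total assigned reads or fragments for one sample."""
    return sum(column)

def detected_genes(column, min_count=1):
    """Number of genes reaching min_count (the default of 1 is an assumption)."""
    return sum(1 for c in column if c >= min_count)

def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy counts (invented): one list per sample, same gene order in each.
counts = {"S1": [10, 0, 250, 40], "S2": [12, 1, 230, 35], "S3": [0, 90, 5, 300]}

for name, column in counts.items():
    print(name, library_size(column), detected_genes(column))

# Correlate log2(count + 1) values so a few large genes do not dominate.
logged = {k: [math.log2(c + 1) for c in v] for k, v in counts.items()}
print(round(pearson(logged["S1"], logged["S2"]), 3))
print(round(pearson(logged["S1"], logged["S3"]), 3))
```

In this toy data S1 and S2 behave like replicates while S3 does not, which is the kind of pattern a correlation heatmap makes visible across a whole project.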
15. Normalization and transformation
RNA-seq count data require normalization because samples differ in library size, RNA composition and other technical factors.
Method concept | Common use | Notes
Size-factor normalization | DESeq2-style differential expression. | Accounts for sequencing depth and RNA composition under model assumptions.
TMM normalization | edgeR-style workflows. | Useful for count-based analysis with compositional correction.
VST/rlog/logCPM | PCA, clustering and heatmaps. | Transform counts for visualization and distance-based methods.
TPM | Abundance summaries and some transcript-level comparisons. | Useful descriptively, but not a substitute for count-based statistical testing.
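The size-factor idea in the first row can be illustrated with the median-of-ratios calculation. This is a plain-Python sketch of the concept, not a replacement for the DESeq2 implementation; the function name and toy counts are assumptions.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: one list per sample, all in the same gene order.
    Genes with a zero in any sample are skipped because their geometric
    mean across samples is zero.
    """
    n_genes = len(counts[0])
    log_geo_means = []
    for g in range(n_genes):
        values = [sample[g] for sample in counts]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = []
    for sample in counts:
        log_ratios = [
            math.log(sample[g]) - m
            for g, m in enumerate(log_geo_means)
            if m is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors

# The second sample was "sequenced" twice as deeply; the last gene has a zero.
toy = [[10, 20, 30, 0], [20, 40, 60, 5]]
print([round(f, 3) for f in size_factors(toy)])  # prints [0.707, 1.414]
```

Dividing each sample's counts by its factor puts the samples on a common scale. Using the median rather than the total makes the estimate robust to a handful of very highly expressed genes.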
16. Differential-expression analysis
Differential-expression analysis estimates whether gene expression differs between conditions while accounting for biological variability and relevant covariates.
Prepare design: Define groups, batches, covariates and contrasts.
Model counts: Use DESeq2, edgeR, limma-voom or another suitable method.
The design formula should match the experiment. Including the wrong covariates, omitting major batches or overfitting small studies can change conclusions.
17. Batch effects and confounders
Batch effects are technical or experimental differences that are unrelated to the main biological comparison. They can dominate RNA-seq data if not considered.
Batch source | How it appears | Mitigation
Library preparation batch | Samples cluster by preparation date or kit lot. | Balance groups across batches and include batch in the model.
Sequencing run or lane | Run-specific differences in read depth or composition. | Randomize samples, use lane merging carefully and model run when needed.
Donor or patient | Individual differences dominate expression. | Use paired or blocking designs when appropriate.
RNA quality | 5′/3′ bias, lower detected genes and degradation signatures. | Include quality metrics, filter failed samples or redesign if needed.
Cell composition | Expression changes reflect altered cell mixtures. | Interpret with marker genes, deconvolution or sorted-cell designs.
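Whether batch can be separated from the comparison of interest is visible in the metadata alone. The sketch below cross-tabulates group against batch and flags the case where every batch contains a single group; the function names and the two toy designs are assumptions for illustration.

```python
from collections import defaultdict

def cross_tabulate(samples, row_key, col_key):
    """Count samples for each (row_key, col_key) combination."""
    table = defaultdict(int)
    for s in samples:
        table[(s[row_key], s[col_key])] += 1
    return dict(table)

def batch_nested_in_group(samples, group_key="group", batch_key="batch"):
    """True when every batch holds samples from only one group.

    In that situation a batch term cannot be separated from the group effect.
    """
    groups_per_batch = defaultdict(set)
    for s in samples:
        groups_per_batch[s[batch_key]].add(s[group_key])
    return all(len(groups) == 1 for groups in groups_per_batch.values())

confounded = [
    {"sample_id": "S1", "group": "control", "batch": "A"},
    {"sample_id": "S2", "group": "control", "batch": "A"},
    {"sample_id": "S3", "group": "treated", "batch": "B"},
    {"sample_id": "S4", "group": "treated", "batch": "B"},
]
balanced = [
    {"sample_id": "S1", "group": "control", "batch": "A"},
    {"sample_id": "S2", "group": "control", "batch": "B"},
    {"sample_id": "S3", "group": "treated", "batch": "A"},
    {"sample_id": "S4", "group": "treated", "batch": "B"},
]

print(cross_tabulate(confounded, "group", "batch"))
print(batch_nested_in_group(confounded))  # prints True
print(batch_nested_in_group(balanced))    # prints False
```

No statistical model can rescue the first layout, because any difference between the groups could equally be a difference between the batches. That is why balancing has to happen at the design stage.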
18. Isoforms and alternative splicing
Standard gene-level differential expression may miss transcript switching, exon usage and alternative splicing. Isoform and splicing analyses require additional methods and careful interpretation.
Transcript abundance: Estimate transcript-level expression with Salmon, Kallisto or similar tools.
Differential transcript usage: Test whether isoform proportions change between conditions.
Exon usage: Evaluate differential exon inclusion or exclusion.
Splice junctions: Use junction-spanning reads from splice-aware aligners.
Short-read RNA-seq can infer many splicing patterns, but long-read RNA sequencing can provide stronger evidence for full-length isoforms.
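The reason gene-level analysis can miss transcript switching is easy to show numerically. The sketch below converts transcript abundances into within-gene isoform fractions. The transcript and gene names and the TPM values are invented, and the code only computes proportions; it does not perform a statistical test of differential transcript usage.

```python
from collections import defaultdict

def isoform_fractions(transcript_tpm, tx2gene):
    """Fraction of each gene's total abundance contributed by each transcript."""
    gene_totals = defaultdict(float)
    for tx, value in transcript_tpm.items():
        gene_totals[tx2gene[tx]] += value
    fractions = {}
    for tx, value in transcript_tpm.items():
        total = gene_totals[tx2gene[tx]]
        fractions[tx] = value / total if total > 0 else float("nan")
    return fractions

# Hypothetical gene with two isoforms; total abundance is 100 in both conditions.
tx2gene = {"tx1": "geneA", "tx2": "geneA"}
control = {"tx1": 80.0, "tx2": 20.0}
treated = {"tx1": 30.0, "tx2": 70.0}

print(isoform_fractions(control, tx2gene))
print(isoform_fractions(treated, tx2gene))
```

The gene total is identical in both conditions, so a gene-level test would report no change, while the dominant isoform has switched from tx1 to tx2.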
19. Expressed variants and fusion transcripts
RNA-seq can support additional analyses beyond expression, especially in cancer and disease studies.
Analysis | What it detects | Caution
Fusion transcripts | Chimeric transcripts or gene fusions. | Requires fusion-specific tools and careful artefact filtering.
Expressed variants | Variants present in RNA reads. | RNA editing, allele-specific expression and mapping artefacts complicate interpretation.
Allele-specific expression | Imbalanced expression of alleles. | Requires genotype-aware analysis and bias control.
Viral or microbial reads | Non-host expressed sequences. | Requires contamination-aware interpretation and proper references.
20. Functional and pathway interpretation
Differential-expression tables become more informative when interpreted at pathway, gene-set and biological-process levels.
Overrepresentation analysis: Tests whether selected significant genes are enriched for pathways or GO terms.
Gene set enrichment: Uses ranked gene lists and can detect coordinated moderate changes.
Pathway databases: Reactome, Gene Ontology, KEGG-style resources and curated gene sets are common options.
Biological review: Interpret gene sets together with study design, cell composition and known biology.
Pathway results are hypotheses and summaries, not proof of mechanism. They depend on gene universe, annotation version, ranking metric and database content.
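The dependence on the gene universe is explicit in the test that underlies overrepresentation analysis. A minimal sketch of the one-sided hypergeometric test follows; the function name and the numbers are invented for illustration, and real tools add multiple-testing correction across pathways.

```python
from math import comb

def enrichment_p_value(universe, pathway, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `universe` genes,
    of which `pathway` belong to the gene set (hypergeometric upper tail)."""
    upper = min(pathway, selected)
    numerator = sum(
        comb(pathway, i) * comb(universe - pathway, selected - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / comb(universe, selected)

# Toy numbers: 20 tested genes, 5 in the pathway, 4 significant, 3 overlapping.
print(round(enrichment_p_value(20, 5, 4, 3), 4))  # prints 0.032

# The same overlap in a universe of 200 genes is far more surprising.
print(enrichment_p_value(200, 5, 4, 3) < enrichment_p_value(20, 5, 4, 3))  # prints True
```

The second call shows why the universe should contain only genes that could have been detected in the experiment: inflating it makes every overlap look more significant.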
21. RNA-seq visualization
Visualizations help assess data quality, communicate results and detect unexpected patterns.
Plot | Purpose | What to inspect
PCA / MDS | Sample relationships and batch effects. | Group separation, outliers and batch-driven clustering.
Sample correlation heatmap | Replicate similarity. | Unexpectedly weak replicate correlation or sample swaps.
MA plot | Differential-expression pattern versus abundance. | Global shifts, low-count noise and strong outliers.
Volcano plot | Significance and effect size summary. | Genes with strong fold change and adjusted significance.
Heatmap | Expression patterns for selected genes. | Clustering, group-specific signatures and batch patterns.
Genome browser tracks | Read coverage and splice junction visualization. | Gene-level evidence, isoform changes and mapping artefacts.
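The coordinates behind an MA plot are straightforward to compute. The sketch below uses the classical definition, with M as the log2 ratio and A as the average log2 abundance. The pseudocount of 1 and the toy values are assumptions, and plotting functions in specific packages may place a different abundance measure on the x-axis.

```python
import math

def ma_values(mean_control, mean_treated, pseudocount=1.0):
    """Return (A, M) per gene: A = average log2 abundance, M = log2 ratio."""
    points = []
    for x, y in zip(mean_control, mean_treated):
        log_x = math.log2(x + pseudocount)
        log_y = math.log2(y + pseudocount)
        points.append((0.5 * (log_x + log_y), log_y - log_x))
    return points

# Toy normalized means for two genes (invented values).
print(ma_values([3, 15], [7, 15]))  # prints [(2.5, 1.0), (4.0, 0.0)]
```

A cloud of points centred away from M = 0 suggests a normalization problem, while a wide spread at low A reflects the noise of low counts.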
22. Note on single-cell RNA-seq
Single-cell RNA-seq shares some RNA-seq concepts but requires specialized processing. Cell barcodes, UMIs, empty droplets, doublets, ambient RNA, cell-cycle effects, normalization, clustering and cell-type annotation are central.
Cell-level QC: UMIs per cell, genes per cell, mitochondrial fraction and doublet scores.
Matrix generation: Cell-by-gene matrices are produced from barcode and UMI-aware workflows.
Clustering: Cells are grouped by expression profiles and annotated with marker genes.
Different statistics: Single-cell differential testing should consider donor structure and pseudobulk designs when appropriate.
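The cell-level QC metrics can be sketched for a single cell as follows. The example assumes human gene symbols, where mitochondrial genes carry the `MT-` prefix; the barcodes, counts and filtering cut-offs are invented, and suitable thresholds depend on tissue and protocol.

```python
def cell_qc(cell_counts, mito_prefix="MT-"):
    """Per-cell QC metrics from a {gene: UMI count} dictionary."""
    total = sum(cell_counts.values())
    genes = sum(1 for v in cell_counts.values() if v > 0)
    mito = sum(v for g, v in cell_counts.items() if g.startswith(mito_prefix))
    return {
        "umis": total,
        "genes": genes,
        "mito_fraction": mito / total if total else 0.0,
    }

# Toy cells keyed by barcode (invented values).
cells = {
    "AAAC": {"ACTB": 50, "GAPDH": 40, "MT-CO1": 10},
    "TTTG": {"ACTB": 2, "GAPDH": 0, "MT-CO1": 8},
}

# Hypothetical cut-offs for illustration only.
MIN_UMIS, MAX_MITO = 20, 0.2

for barcode, counts in cells.items():
    qc = cell_qc(counts)
    keep = qc["umis"] >= MIN_UMIS and qc["mito_fraction"] <= MAX_MITO
    print(barcode, qc, "keep" if keep else "filter")
```

The second barcode has few UMIs and a high mitochondrial fraction, a combination commonly associated with damaged cells or empty droplets.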
23. Example RNA-seq analysis workflow
The following simplified workflow illustrates a common bulk RNA-seq route. Real projects should adapt parameters to the library protocol, reference and statistical design.
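As one way to make such a route concrete, the sketch below assembles, but does not run, command lines for read QC, splice-aware alignment, gene counting and report aggregation. It is an illustration under stated assumptions rather than a recommended parameter set. The function name, the output directory `qc`, the file-naming scheme and the thread count are hypothetical, trimming is omitted because it depends on the QC results, and the strandedness value must come from the library protocol.

```python
import shlex

def bulk_rnaseq_commands(sample, r1, r2, star_index, gtf, strand, threads=8):
    """Build command lines for one paired-end sample (nothing is executed).

    strand is the featureCounts -s value (0, 1 or 2) for the library.
    """
    # STAR appends this suffix to the chosen output prefix.
    bam = f"{sample}.Aligned.sortedByCoord.out.bam"
    return [
        # Raw-read QC; the output directory 'qc' is assumed to exist.
        ["fastqc", "-o", "qc", r1, r2],
        # Splice-aware alignment of gzipped FASTQ to a prebuilt STAR index.
        ["STAR", "--runThreadN", str(threads), "--genomeDir", star_index,
         "--readFilesIn", r1, r2, "--readFilesCommand", "zcat",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", f"{sample}."],
        # Gene-level counting; recent featureCounts releases also need
        # --countReadPairs to count fragments rather than reads.
        ["featureCounts", "-p", "-s", str(strand), "-T", str(threads),
         "-a", gtf, "-o", f"{sample}.counts.txt", bam],
        # Aggregate logs and QC reports found in the working directory.
        ["multiqc", "."],
    ]

for cmd in bulk_rnaseq_commands(
    "S1", "S1_R1.fastq.gz", "S1_R2.fastq.gz", "star_index", "genes.gtf", strand=2
):
    print(shlex.join(cmd))
```

Keeping the commands as lists makes them easy to log with the report and to pass to a workflow manager later, instead of retyping them per sample.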
RNA-seq reports should include both computational outputs and interpretation. The exact deliverables depend on whether the project focuses on expression, isoforms, splicing, fusions or pathway analysis.
Recommended deliverables
FASTQ QC reports and final MultiQC report.
Alignment or quantification logs.
Sorted BAM files and indexes if genome alignment was performed.
PCA, sample correlation and outlier assessment plots.
Differential-expression tables with log2 fold changes and adjusted p-values.
Pathway or gene-set enrichment results when requested.
Methods section with software versions, references and model design.
Interpretation summary with limitations and suggested follow-up analyses.
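The adjusted p-values in these tables are commonly Benjamini-Hochberg adjusted, which is the default in DESeq2 and edgeR. A plain-Python sketch of that adjustment is shown below; the function name and the toy p-values are assumptions for illustration.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

toy_p = [0.01, 0.04, 0.03, 0.005]
print([round(p, 3) for p in benjamini_hochberg(toy_p)])  # prints [0.02, 0.04, 0.04, 0.02]
```

Reporting both raw and adjusted values, together with the number of genes tested, lets readers see how strongly the correction depends on the filtering applied before testing.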
25. Reproducibility and workflow automation
RNA-seq projects often need to be rerun as metadata changes, samples are added or references are updated. Reproducible workflows reduce manual errors and make results auditable.
Version everything: Record FASTQ checksums, reference releases, annotation versions and software versions.
Use workflow managers: Nextflow, Snakemake or equivalent systems help scale and document analysis.
Save environments: Use conda/mamba YAML files, containers or locked workflow profiles.
Preserve statistical design: Keep sample metadata, contrast definitions and model formulas with the report.
26. RNA-seq data analysis cheat sheet
Step | Common tools | Main outputs
FASTQ QC | FastQC, fastp, MultiQC | Raw-read QC and project-level summary.
Trimming | fastp, Cutadapt, Trim Galore | Trimmed FASTQ files and trimming reports.
Genome alignment | STAR, HISAT2 | Splice-aware BAM files, logs and junction information.
Transcript quantification | Salmon, Kallisto | Transcript abundance estimates and quantification summaries.
Gene counting | featureCounts, HTSeq-count, STAR GeneCounts | Gene-level count matrix.
Sample QC | MultiQC, R/Python, RSeQC, RNA-SeQC | PCA, correlations, assignment rate, gene body coverage and strandedness checks.
Differential expression | DESeq2, edgeR, limma-voom | Log2 fold changes, adjusted p-values and ranked gene tables.
What is RNA-seq data analysis?
RNA-seq data analysis is the computational processing of RNA sequencing reads to measure gene or transcript expression, detect differential expression, study splicing, identify expressed variants or fusions, and interpret biological pathways or signatures.
What is the difference between RNA-seq alignment and transcript quantification?
Alignment maps reads to a genome or transcriptome and can produce BAM files for visualization, splice-junction analysis and counting. Transcript quantification estimates transcript or gene abundance, often with tools such as Salmon or Kallisto, and may not require full genomic BAM files.
Which tools are commonly used for RNA-seq analysis?
Common tools include FastQC and MultiQC for QC, fastp or Cutadapt for trimming, STAR or HISAT2 for splice-aware alignment, featureCounts or HTSeq for counting, Salmon or Kallisto for transcript quantification, and DESeq2, edgeR or limma-voom for differential expression.
How many reads are needed for RNA-seq?
Required depth depends on organism, library type, transcriptome complexity and analysis goal. Many bulk mRNA-seq studies use tens of millions of reads per sample, while low-input, total RNA, allele-specific, splicing or low-abundance transcript analyses may require more.
Is trimming required for RNA-seq?
Trimming is required when adapter sequences, primers or poor-quality tails are detected and may affect mapping or quantification. It should be based on QC results rather than applied blindly.
What is strandedness in RNA-seq?
Strandedness describes whether read orientation preserves information about the original RNA strand. Correct strandedness settings are essential for accurate gene counting, especially for overlapping genes and antisense transcription.
What is the difference between raw counts, TPM and FPKM?
Raw counts are integer read counts used by many differential-expression methods. TPM and FPKM/RPKM are normalized abundance measures useful for within-sample or descriptive expression summaries, but raw counts are usually preferred for statistical differential-expression testing with DESeq2 or edgeR-style models.
Can AI help with RNA-seq analysis?
AI can help summarize QC reports, detect unusual samples, explain differential-expression results, prioritize genes and draft pathway interpretation. However, the analysis should remain reproducible, versioned and based on explicit statistical models and expert review.