RNA-seq Analysis Tutorial

RNA-seq data analysis: from FASTQ files to biological interpretation.

A practical tutorial for bulk RNA-seq and related transcriptomics projects. It covers experimental design, metadata, FASTQ quality control, trimming, reference preparation, splice-aware alignment, transcript quantification, gene counting, normalization, exploratory analysis, differential expression, pathway interpretation and reproducible reporting.

1. Overview: what is RNA-seq data analysis?

RNA-seq data analysis transforms sequencing reads into quantitative and interpretable information about the transcriptome. Depending on the experiment, it may quantify gene expression, identify differential expression, study transcript isoforms, detect alternative splicing, evaluate allele-specific expression, discover gene fusions or support pathway-level interpretation.

  • Preprocessing: validate metadata, run FASTQ QC, trim when needed and confirm library type.
  • Quantification: map or pseudoalign reads, then assign reads to genes or estimate transcript abundance.
  • Interpretation: normalize, model comparisons, visualize results and interpret biological pathways.

Core principle: RNA-seq analysis is not only a sequence-processing task. It is a statistical experiment where replication, metadata, batch effects and model design determine whether conclusions are reliable.

2. RNA-seq assay types

Different RNA-seq protocols produce different signals and require different analysis choices.

Assay type | Typical objective | Analysis focus
Bulk mRNA-seq | Measure polyadenylated transcript expression. | Gene expression, transcript abundance, differential expression and pathway analysis.
Total RNA-seq | Capture coding and noncoding RNA, often after rRNA depletion. | mRNA, lncRNA, intronic signal, pre-mRNA and broader transcriptome analysis.
Small RNA-seq | Profile miRNAs, piRNAs or other small RNAs. | Adapter trimming, short-read mapping, small-RNA references and length distributions.
Stranded RNA-seq | Preserve transcript orientation. | Correct strandedness settings, antisense transcription and overlapping genes.
Long-read RNA-seq | Resolve full-length transcripts and isoforms. | Isoform discovery, transcript structure, fusion transcripts and splice complexity.
Single-cell RNA-seq | Measure expression at cell level. | Cell calling, UMI counting, doublets, clustering and cell-type annotation.

3. Experimental design before analysis

Good RNA-seq analysis begins before sequencing. Differential expression and downstream interpretation depend strongly on biological replication, balanced batches and well-defined contrasts.

Questions to answer early

  • What is the biological question and primary comparison?
  • How many biological replicates are available per group?
  • Are there known batch variables such as sequencing run, library-preparation date, donor, sex, tissue, treatment time or plate?
  • Is the library stranded or unstranded?
  • Is the protocol poly(A), total RNA, ribo-depleted, small RNA, 3′ counting, full-length or single-cell?
  • Should the analysis focus on genes, transcripts, isoforms, splice junctions, fusions or pathway signatures?
  • Which reference genome and annotation version will be used?
  • Are there expected confounders that must be included in the statistical model?

RNA-seq without biological replication can be useful for exploration, but within-group variability cannot be estimated from a single sample per group, so differential-expression tests on such data are not statistically reliable.

4. Input files and reference resources

RNA-seq workflows require raw reads, metadata, a reference genome or transcriptome and a compatible gene annotation.

Input | Typical format | Use
Raw reads | FASTQ or FASTQ.GZ | Primary sequencing reads before QC, alignment or quantification.
Sample sheet | TSV/CSV | Connects FASTQ files to samples, groups, batches and covariates.
Reference genome | FASTA | Used for splice-aware alignment and coordinate-based analyses.
Gene annotation | GTF or GFF3 | Defines genes, transcripts and exons for counting and interpretation.
Transcriptome | FASTA | Used by transcript quantification tools such as Salmon or Kallisto.
Optional references | rRNA, ERCC spike-ins, custom transcript sets | Used for contamination checks, spike-in normalization or custom analyses.

5. Sample metadata for RNA-seq

Metadata defines the statistical design. Poor metadata can make otherwise high-quality sequencing data difficult to interpret.

Example RNA-seq sample sheet (each group is represented in both batches, so batch and group effects can be separated)
sample_id	group	batch	condition	sex	fastq_1	fastq_2
S1	control	A	untreated	F	S1_R1.fastq.gz	S1_R2.fastq.gz
S2	control	B	untreated	M	S2_R1.fastq.gz	S2_R2.fastq.gz
S3	treated	B	drug	F	S3_R1.fastq.gz	S3_R2.fastq.gz
S4	treated	A	drug	M	S4_R1.fastq.gz	S4_R2.fastq.gz

Recommended metadata fields

  • sample_id: stable identifier used in all files.
  • group or condition: biological group for comparisons.
  • batch: sequencing run, preparation batch or processing batch.
  • replicate or donor identifier.
  • Covariates such as sex, age, time point, treatment dose, tissue or cell type when relevant.
  • FASTQ file names and read layout.
  • Library protocol, strandedness, read length and reference/annotation version.
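
The fields above can be checked mechanically before any tool is run. The following is a minimal sketch, assuming a tab-separated sheet that uses the column names from the example and paired-end data (so `fastq_2` is required). The function name `validate_sheet` and the demo file are hypothetical, introduced only for illustration.

```bash
# validate_sheet FILE: report missing columns, duplicate sample IDs and ragged rows.
# Prints one line per problem and returns 1, or prints "OK" and returns 0.
validate_sheet() {
  awk -F'\t' '
    NR == 1 {
      ncol = NF
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split("sample_id group batch fastq_1 fastq_2", req, " ")
      for (i = 1; i <= n; i++)
        if (!(req[i] in col)) { print "missing column: " req[i]; bad = 1 }
      sid = col["sample_id"]
      next
    }
    NF != ncol   { print "line " NR ": expected " ncol " fields, found " NF; bad = 1 }
    seen[$sid]++ { print "line " NR ": duplicate sample_id " $sid; bad = 1 }
    END { if (bad) exit 1; print "OK" }
  ' "$1"
}

# Demo on a deliberately broken sheet (hypothetical data): S1 appears twice.
tmp=$(mktemp -d)
printf 'sample_id\tgroup\tbatch\tfastq_1\tfastq_2\n'        >  "$tmp/samples.tsv"
printf 'S1\tcontrol\tA\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n'   >> "$tmp/samples.tsv"
printf 'S1\ttreated\tB\tS3_R1.fastq.gz\tS3_R2.fastq.gz\n'   >> "$tmp/samples.tsv"
validate_sheet "$tmp/samples.tsv" || echo "fix the sample sheet before running the pipeline"
```

A real project would probably extend this with a check that every listed FASTQ file exists on disk.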

6. FASTQ quality control

FASTQ QC checks raw read quality before alignment or quantification. RNA-seq-specific interpretation should consider library type, insert size, transcript abundance and possible rRNA contamination.

FASTQ QC with FastQC and MultiQC
mkdir -p results/qc/fastqc results/qc/multiqc

fastqc data/fastq/*.fastq.gz \
  --outdir results/qc/fastqc \
  --threads 8

multiqc results/qc \
  --outdir results/qc/multiqc

Key checks

  • Per-base quality scores and quality decay toward read ends.
  • Adapter contamination, especially in short inserts and small RNA-seq.
  • Read counts per sample and unexpected sample outliers.
  • GC content distribution and overrepresented sequences.
  • Duplication patterns, interpreted in the context of expression levels and library complexity.
  • R1/R2 consistency for paired-end libraries.
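
Read counts per sample and R1/R2 consistency can be checked without any QC software. This is a small sketch, assuming gzip-compressed FASTQ files with exactly four lines per record; the helper names `count_reads` and `check_pair` and the tiny demo files are hypothetical.

```bash
# count_reads FILE.fastq.gz: number of records, assuming 4 lines per read.
# A non-integer result points to a truncated or malformed file.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# check_pair R1 R2: print both counts and succeed only when they are equal.
check_pair() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  printf '%s\t%s\t%s\n' "$1" "$n1" "$n2"
  [ "$n1" = "$n2" ]
}

# Demo with two tiny files: R1 holds two reads, R2 only one.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmp/S1_R1.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n'                     | gzip > "$tmp/S1_R2.fastq.gz"
check_pair "$tmp/S1_R1.fastq.gz" "$tmp/S1_R2.fastq.gz" || echo "R1/R2 read counts differ"
```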

7. Adapter trimming and filtering

Trimming removes adapters, primers or low-quality read ends. For RNA-seq, incorrect trimming can affect mappability and quantification, so preprocessing should be guided by QC.

  • Trim when: adapters, primers or poor-quality tails are visible and likely to affect alignment or quantification.
  • Be careful with: small RNA-seq, 3′ RNA-seq, UMI-containing reads and protocols with special barcode structures.

Paired-end trimming with fastp
mkdir -p results/trimmed results/qc/fastp

fastp \
  --in1 data/fastq/S1_R1.fastq.gz \
  --in2 data/fastq/S1_R2.fastq.gz \
  --out1 results/trimmed/S1_R1.trimmed.fastq.gz \
  --out2 results/trimmed/S1_R2.trimmed.fastq.gz \
  --html results/qc/fastp/S1.fastp.html \
  --json results/qc/fastp/S1.fastp.json \
  --thread 8

8. Reference genome, transcriptome and annotation preparation

Reference choice affects mapping, counting and interpretation. The genome FASTA, transcriptome FASTA and GTF/GFF annotation should come from compatible releases.

  • Genome FASTA: used by splice-aware aligners such as STAR and HISAT2.
  • GTF/GFF annotation: defines genes and transcripts for counting and splice-junction support.
  • Transcriptome FASTA: used by Salmon, Kallisto and transcript-level workflows.
  • Version tracking: record source, release number, assembly, download date and checksums.

STAR index generation
mkdir -p reference/star_index

STAR \
  --runThreadN 16 \
  --runMode genomeGenerate \
  --genomeDir reference/star_index \
  --genomeFastaFiles reference/genome.fa \
  --sjdbGTFfile reference/annotation.gtf \
  --sjdbOverhang 149  # read length minus 1; 149 assumes 150-bp reads

Salmon transcriptome index
mkdir -p reference/salmon_index

salmon index \
  -t reference/transcripts.fa \
  -i reference/salmon_index \
  -p 16

Do not mix annotation releases. Count matrices generated with different GTF releases may not be directly comparable, because gene identifiers and gene models change between releases.
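
One common symptom of incompatible references is a naming mismatch between the genome FASTA and the annotation, for example `chr1` in one file and `1` in the other. The sketch below lists GTF sequence names that have no FASTA header. It assumes uncompressed files; the function name `seqnames_missing_from_fasta` and the demo files are hypothetical.

```bash
# seqnames_missing_from_fasta GENOME.fa ANNOTATION.gtf
# Prints each GTF sequence name (column 1) that has no matching FASTA header.
seqnames_missing_from_fasta() {
  awk '
    FNR == NR { if (/^>/) { sub(/^>/, "", $1); fa[$1] = 1 } ; next }  # first file: FASTA headers
    /^#/      { next }                                                 # skip GTF comment lines
    !($1 in fa) && !reported[$1]++ { print $1 }
  ' "$1" "$2"
}

# Demo: the FASTA uses "chr1"/"chr2", the GTF uses "1" and "chr2".
tmp=$(mktemp -d)
printf '>chr1 test sequence\nACGT\n>chr2\nGGCC\n'            >  "$tmp/genome.fa"
printf '1\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n'       >  "$tmp/annotation.gtf"
printf 'chr2\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n'    >> "$tmp/annotation.gtf"
seqnames_missing_from_fasta "$tmp/genome.fa" "$tmp/annotation.gtf"
```

An empty result does not prove that the releases match, but any output is a strong hint that the two files come from different sources.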

9. Alignment vs transcript quantification

RNA-seq workflows can follow a genome-alignment route, a transcript-quantification route or both.

Strategy | Common tools | Best use
Splice-aware genome alignment | STAR, HISAT2 | Genome-browser visualization, splice junctions, fusion discovery, variant-aware analysis and gene counting.
Gene counting from BAM | featureCounts, HTSeq-count | Gene-level count matrices for differential expression.
Transcript quantification | Salmon, Kallisto | Fast transcript-level abundance estimation and gene-level summarization with tximport-style workflows.
Hybrid workflow | STAR plus Salmon, or alignment plus quantification | Projects needing both BAM-level QC and transcript-level estimates.

10. Splice-aware alignment with STAR and HISAT2

Splice-aware aligners can map reads that cross exon-exon junctions. This is essential for many RNA-seq analyses that require genomic coordinates.

STAR paired-end alignment with gene counts
mkdir -p results/star/S1

STAR \
  --runThreadN 16 \
  --genomeDir reference/star_index \
  --readFilesIn data/fastq/S1_R1.fastq.gz data/fastq/S1_R2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix results/star/S1/ \
  --outSAMtype BAM SortedByCoordinate \
  --quantMode GeneCounts

samtools index results/star/S1/Aligned.sortedByCoord.out.bam
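
HISAT2 paired-end alignment
mkdir -p results/hisat2 results/logs

# Assumes an index built beforehand with hisat2-build (prefix reference/hisat2/genome).
# For stranded paired-end libraries, add --rna-strandness RF (reverse) or FR (forward).
hisat2 \
  -x reference/hisat2/genome \
  -1 data/fastq/S1_R1.fastq.gz \
  -2 data/fastq/S1_R2.fastq.gz \
  -p 16 \
  2> results/logs/S1.hisat2.log | \
  samtools sort -@ 8 -o results/hisat2/S1.sorted.bam -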

samtools index results/hisat2/S1.sorted.bam

Alignment QC for RNA-seq

  • Uniquely mapped reads and multi-mapping reads.
  • Reads assigned to exons, introns, intergenic regions and splice junctions.
  • Strandedness and gene-assignment rate.
  • rRNA fraction and mitochondrial fraction.
  • Gene body coverage and 5′/3′ bias.
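
The first of these checks can be read directly from STAR's `Log.final.out`. The sketch below assumes the `label | value` layout of that file; the function name `star_metric`, the mock log and the 70% threshold are hypothetical placeholders, not recommended values.

```bash
# star_metric LOGFILE "label": print the value that follows "label |", without the % sign.
star_metric() {
  awk -F'|' -v key="$2" '
    index($1, key) { v = $2; gsub(/[ \t%]/, "", v); print v; exit }
  ' "$1"
}

# Mock log with three lines in the same layout (values are made up for the demo).
tmp=$(mktemp -d)
printf '%s |\t%s\n' \
  "Number of input reads" "1000000" \
  "Uniquely mapped reads %" "91.25%" \
  "% of reads mapped to multiple loci" "4.10%" > "$tmp/Log.final.out"

uniq_pct=$(star_metric "$tmp/Log.final.out" "Uniquely mapped reads %")
multi_pct=$(star_metric "$tmp/Log.final.out" "% of reads mapped to multiple loci")
echo "uniquely mapped: ${uniq_pct}%  multi-mapped: ${multi_pct}%"

# Flag the sample when the unique mapping rate is below a project-specific threshold.
if awk -v p="$uniq_pct" 'BEGIN { exit !(p < 70) }'; then
  echo "low unique mapping rate"
fi
```

For a whole project, MultiQC collects the same values across samples; a helper like this is mainly useful for scripted pass/fail gates.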

11. Transcript quantification with Salmon and Kallisto

Transcript quantification tools estimate transcript abundance quickly and are widely used for gene-level and transcript-level expression analysis.

Salmon paired-end quantification
mkdir -p results/salmon/S1

salmon quant \
  -i reference/salmon_index \
  -l A \
  -1 data/fastq/S1_R1.fastq.gz \
  -2 data/fastq/S1_R2.fastq.gz \
  -p 16 \
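  --validateMappings \
  -o results/salmon/S1
# Selective alignment is the default from Salmon 1.0 onward, so --validateMappings only matters for older releases.

Kallisto paired-end quantification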
mkdir -p results/kallisto/S1

kallisto quant \
  -i reference/kallisto/transcripts.idx \
  -o results/kallisto/S1 \
  -t 16 \
  data/fastq/S1_R1.fastq.gz \
  data/fastq/S1_R2.fastq.gz

For differential expression, transcript-level estimates are often summarized to gene level using tximport-style workflows, which need a transcript-to-gene mapping, before DESeq2 or edgeR analysis.
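
Gene-level summarization needs a table that maps each transcript to its gene. One way to derive it from the annotation is sketched below, assuming a GTF whose ninth column carries `gene_id "..."` and `transcript_id "..."` attributes. The function name `gtf_to_tx2gene` and the demo annotation are hypothetical, and identifiers in the transcriptome FASTA may need harmonizing (for example version suffixes) before the table matches the quantification output.

```bash
# gtf_to_tx2gene ANNOTATION.gtf: print "transcript_id<TAB>gene_id", one line per transcript.
gtf_to_tx2gene() {
  awk -F'\t' '
    $3 == "transcript" || $3 == "exon" {
      tx = ""; gene = ""
      # 15 = length of: transcript_id "   ;  9 = length of: gene_id "
      if (match($9, /transcript_id "[^"]+"/)) tx   = substr($9, RSTART + 15, RLENGTH - 16)
      if (match($9, /gene_id "[^"]+"/))       gene = substr($9, RSTART + 9,  RLENGTH - 10)
      if (tx != "" && gene != "") print tx "\t" gene
    }
  ' "$1" | sort -u
}

# Demo annotation: gene g1 has two transcripts, gene g2 has one.
tmp=$(mktemp -d)
{
  printf '1\tdemo\ttranscript\t1\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf '1\tdemo\texon\t1\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf '1\tdemo\ttranscript\t1\t500\t.\t+\t.\tgene_id "g1"; transcript_id "t2";\n'
  printf '2\tdemo\ttranscript\t1\t700\t.\t-\t.\tgene_id "g2"; transcript_id "t3";\n'
} > "$tmp/annotation.gtf"
gtf_to_tx2gene "$tmp/annotation.gtf" > "$tmp/tx2gene.tsv"
cat "$tmp/tx2gene.tsv"
```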

12. Gene-level counting with featureCounts or HTSeq

Gene-level counting assigns aligned reads to annotated genomic features. Correct strandedness and annotation choice are critical.

featureCounts for paired-end RNA-seq
mkdir -p results/counts

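# Subread 2.0.2 and later: -p only declares paired-end input, and --countReadPairs
# is needed to count fragments. Older releases do not accept --countReadPairs; there -p alone counts fragments.
featureCounts \
  -T 8 \
  -p --countReadPairs \
  -s 2 \
  -a reference/annotation.gtf \
  -o results/counts/gene_counts.txt \
  results/star/*/Aligned.sortedByCoord.out.bam

Option | Meaning | Important note
-p | Paired-end input. | From Subread 2.0.2 this only declares paired-end input; earlier releases also counted fragments.
--countReadPairs | Count fragments (read pairs) instead of reads. | Combine with -p in Subread 2.0.2 and later.
-s 0 | Unstranded counting. | Use only if the library is unstranded.
-s 1 | Stranded counting. | Forward-stranded convention for featureCounts.
-s 2 | Reverse-stranded counting. | Many common stranded RNA-seq kits are reverse-stranded, but confirm experimentally.

Wrong strandedness can dramatically reduce assigned reads and distort expression estimates. Confirm strandedness with library documentation and QC tools.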
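
When STAR was run with `--quantMode GeneCounts`, its `ReadsPerGene.out.tab` file offers a quick strandedness check: column 2 holds unstranded counts, column 3 counts for the forward-stranded convention and column 4 counts for the reverse-stranded convention. The sketch below compares columns 3 and 4. The function name `guess_strandedness`, the 0.9/0.1 cut-offs and the mock counts are assumptions for illustration only.

```bash
# guess_strandedness ReadsPerGene.out.tab
# Sums per-gene counts in column 3 (forward) and column 4 (reverse),
# skipping the four N_* summary rows at the top of the file.
guess_strandedness() {
  awk -F'\t' '
    $1 !~ /^N_/ { fwd += $3; rev += $4 }
    END {
      total = fwd + rev
      if (total == 0) { print "undetermined"; exit }
      f = fwd / total
      if      (f > 0.9) call = "forward (featureCounts -s 1)"
      else if (f < 0.1) call = "reverse (featureCounts -s 2)"
      else              call = "unstranded or mixed (featureCounts -s 0)"
      printf "forward fraction %.2f: %s\n", f, call
    }
  ' "$1"
}

# Mock file (made-up counts) that mimics a reverse-stranded library.
tmp=$(mktemp -d)
{
  printf 'N_unmapped\t100\t100\t100\n'
  printf 'N_multimapping\t50\t50\t50\n'
  printf 'N_noFeature\t200\t900\t210\n'
  printf 'N_ambiguous\t10\t2\t8\n'
  printf 'geneA\t500\t10\t490\n'
  printf 'geneB\t300\t5\t295\n'
} > "$tmp/ReadsPerGene.out.tab"
guess_strandedness "$tmp/ReadsPerGene.out.tab"
```

Intermediate fractions deserve a closer look with a dedicated QC tool rather than an automatic call.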

13. Count matrices and expression tables

Most statistical RNA-seq workflows start from a matrix where rows are genes or transcripts and columns are samples.

Expression value | Use | Caution
Raw counts | Differential-expression testing with DESeq2, edgeR and related methods. | Not directly comparable between samples without modeling or normalization.
Normalized counts | Visualization, clustering and exploratory analysis. | Normalization method should match the analysis objective.
TPM | Descriptive expression abundance and within-sample transcript comparison. | Not usually the preferred input for count-based differential-expression testing.
FPKM/RPKM | Older abundance summaries. | Less preferred in many modern workflows compared with TPM or count-based models.

Keep raw counts, normalized counts and annotation tables separate. Do not overwrite raw matrices with transformed values.
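
The featureCounts output from section 12 is not yet such a matrix: it starts with a comment line and carries Chr, Start, End, Strand and Length columns before the counts. A minimal conversion is sketched below, assuming that layout and BAM paths of the form `results/star/<sample>/Aligned.sortedByCoord.out.bam`. The function name `featurecounts_to_matrix` and the mock input are hypothetical.

```bash
# featurecounts_to_matrix gene_counts.txt
# Drops the leading "#" comment line and columns 2-6, then reduces each BAM path
# in the header to the name of its sample directory.
featurecounts_to_matrix() {
  grep -v '^#' "$1" | cut -f1,7- | awk -F'\t' -v OFS='\t' '
    NR == 1 {
      for (i = 2; i <= NF; i++) {
        n = split($i, part, "/")
        if (n >= 2) $i = part[n - 1]   # results/star/S1/Aligned...bam -> S1
      }
    }
    { print }
  '
}

# Mock featureCounts output with one gene and two samples.
tmp=$(mktemp -d)
{
  printf '# Program:featureCounts (demo header line)\n'
  printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tresults/star/S1/Aligned.sortedByCoord.out.bam\tresults/star/S2/Aligned.sortedByCoord.out.bam\n'
  printf 'geneA\t1\t100\t900\t+\t801\t10\t12\n'
} > "$tmp/gene_counts.txt"
featurecounts_to_matrix "$tmp/gene_counts.txt" > "$tmp/gene_counts_matrix.tsv"
cat "$tmp/gene_counts_matrix.tsv"
```

The raw `gene_counts.txt` stays untouched, in line with the advice to keep original matrices separate from derived tables.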

14. Sample-level RNA-seq QC

After quantification, check whether samples behave as expected. Sample-level QC often detects swaps, outliers, batch effects, failed libraries and metadata errors.

  • Library size: total assigned reads or fragments per sample.
  • Detected genes: number of genes with nonzero or sufficient counts.
  • PCA or MDS: visualize sample relationships, groups, batches and outliers.
  • Correlation heatmap: check replicate similarity and sample-level consistency.
  • Biotype composition: evaluate protein-coding, rRNA, mitochondrial, intronic or intergenic signal.
  • Marker genes: confirm expected cell type, tissue, sex-specific or treatment-responsive markers.

15. Normalization and transformation

RNA-seq count data require normalization because samples differ in library size, RNA composition and other technical factors.

Method concept | Common use | Notes
Size-factor normalization | DESeq2-style differential expression. | Accounts for sequencing depth and RNA composition under model assumptions.
TMM normalization | edgeR-style workflows. | Useful for count-based analysis with compositional correction.
VST/rlog/logCPM | PCA, clustering and heatmaps. | Transform counts for visualization and distance-based methods.
TPM | Abundance summaries and some transcript-level comparisons. | Useful descriptively, but not a substitute for count-based statistical testing.
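
To make the TPM row concrete: each count is first divided by the feature length, and the resulting rates are rescaled so that they sum to one million within the sample. The sketch below works on a single sample and uses plain annotated lengths, whereas tools such as Salmon use effective lengths. The function name `counts_to_tpm` and the three-gene input are hypothetical.

```bash
# counts_to_tpm FILE: input has a header row and columns gene, length_bp, count.
# TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
counts_to_tpm() {
  awk -F'\t' '
    NR == 1 { next }
    { gene[NR] = $1; rate[NR] = $3 / $2; total += rate[NR] }
    END {
      print "gene\ttpm"
      for (i = 2; i <= NR; i++)
        printf "%s\t%.1f\n", gene[i], (total > 0 ? rate[i] / total * 1e6 : 0)
    }
  ' "$1"
}

# geneA and geneB have the same count, but geneB is twice as long.
tmp=$(mktemp -d)
printf 'gene\tlength_bp\tcount\ngeneA\t1000\t100\ngeneB\t2000\t100\ngeneC\t500\t25\n' > "$tmp/counts.tsv"
counts_to_tpm "$tmp/counts.tsv"
```

In the demo, geneA receives twice the TPM of geneB despite identical counts, which is the length correction at work. Because the values always sum to one million per sample, TPM is a within-sample proportion and not a replacement for count-based testing.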

16. Differential-expression analysis

Differential-expression analysis estimates whether gene expression differs between conditions while accounting for biological variability and relevant covariates.

  • Prepare design: define groups, batches, covariates and contrasts.
  • Model counts: use DESeq2, edgeR, limma-voom or another suitable method.
  • Review results: inspect log2 fold change, adjusted p-value, expression level and QC context.

Minimal DESeq2-style R example
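library(DESeq2)

counts <- read.delim("results/counts/gene_counts_matrix.tsv", row.names = 1, check.names = FALSE)
metadata <- read.delim("data/samples.tsv", row.names = 1)

# Count columns must match metadata rows, in the same order
stopifnot(all(rownames(metadata) %in% colnames(counts)))
counts <- counts[, rownames(metadata)]

# Design variables should be factors; the first level of group is the reference
metadata$batch <- factor(metadata$batch)
metadata$group <- factor(metadata$group, levels = c("control", "treated"))

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = metadata,
  design = ~ batch + group
)

# Drop genes with fewer than 10 reads in total across all samples
dds <- dds[rowSums(counts(dds)) >= 10, ]
dds <- DESeq(dds)

# log2 fold change of treated relative to control
res <- results(dds, contrast = c("group", "treated", "control"))
res <- res[order(res$padj), ]

dir.create("results/differential_expression", recursive = TRUE, showWarnings = FALSE)
write.csv(as.data.frame(res), "results/differential_expression/treated_vs_control_deseq2.csv")

The design formula should match the experiment. Including the wrong covariates, omitting major batches or overfitting small studies can change conclusions. If batch and group are perfectly confounded, for example all controls in one batch and all treated samples in another, the model matrix is not full rank and DESeq2 stops with an error.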
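
The result table written by `write.csv` can be screened from the command line. The sketch below assumes the usual DESeq2 column names (`log2FoldChange`, `padj`) and gene identifiers without embedded commas. The function name `filter_de`, the default cut-offs (adjusted p-value below 0.05, absolute log2 fold change of at least 1) and the mock table are illustrative assumptions, not recommendations.

```bash
# filter_de FILE.csv [PADJ_MAX] [MIN_ABS_LFC]
# Prints gene, log2FoldChange and padj for rows that pass both cut-offs; NA rows are skipped.
filter_de() {
  awk -F',' -v padj_max="${2:-0.05}" -v lfc_min="${3:-1}" '
    { gsub(/"/, "") }                                        # strip the quotes added by write.csv
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # locate columns by name
    {
      p = $(col["padj"]); l = $(col["log2FoldChange"])
      if (p == "NA" || l == "NA") next
      if (p + 0 < padj_max && (l + 0 >= lfc_min || l + 0 <= -lfc_min)) print $1 "\t" l "\t" p
    }
  ' "$1"
}

# Mock DESeq2 output (values are made up for the demo).
tmp=$(mktemp -d)
cat > "$tmp/deseq2.csv" <<'EOF'
"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
"geneA",1500.2,2.31,0.25,9.24,2.4e-20,1.1e-17
"geneB",80.5,-0.42,0.30,-1.40,0.16,0.41
"geneC",12.1,3.05,1.20,2.54,0.011,NA
"geneD",640.0,-1.75,0.33,-5.30,1.2e-07,3.0e-05
EOF
filter_de "$tmp/deseq2.csv"
```

geneC is dropped because its adjusted p-value is NA, which DESeq2 reports for genes removed by independent filtering or flagged as outliers.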

17. Batch effects and confounders

Batch effects are technical or experimental differences that are unrelated to the main biological comparison. They can dominate RNA-seq data if not considered.

Batch source | How it appears | Mitigation
Library preparation batch | Samples cluster by preparation date or kit lot. | Balance groups across batches and include batch in the model.
Sequencing run or lane | Run-specific differences in read depth or composition. | Randomize samples, use lane merging carefully and model run when needed.
Donor or patient | Individual differences dominate expression. | Use paired or blocking designs when appropriate.
RNA quality | 5′/3′ bias, lower detected genes and degradation signatures. | Include quality metrics, filter failed samples or redesign if needed.
Cell composition | Expression changes reflect altered cell mixtures. | Interpret with marker genes, deconvolution or sorted-cell designs.
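
Whether groups are balanced across batches can be read from a cross-tabulation of the sample sheet. The sketch below assumes a tab-separated sheet with a header row and values without spaces. The helper names `crosstab` and `is_confounded` and the demo sheet are hypothetical, and the check only covers the simple case where one variable is fully nested in the other.

```bash
# crosstab FILE.tsv COL_A COL_B: number of samples per combination of two named columns.
crosstab() {
  awk -F'\t' -v a="$2" -v b="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; ca = col[a]; cb = col[b]; next }
    { print $ca "\t" $cb }
  ' "$1" | sort | uniq -c | awk '{ print $2 "\t" $3 "\t" $1 }'
}

# is_confounded FILE.tsv COL_A COL_B: succeed when each level of one column occurs
# with exactly one level of the other, so the two effects cannot be separated.
is_confounded() {
  crosstab "$1" "$2" "$3" | awk -F'\t' '
    { pairs++ }
    !seen_a[$1]++ { na++ }
    !seen_b[$2]++ { nb++ }
    END { exit !(na > 1 && nb > 1 && (pairs == na || pairs == nb)) }
  '
}

# Demo sheet in which all controls sit in batch A and all treated samples in batch B.
tmp=$(mktemp -d)
printf 'sample_id\tgroup\tbatch\nS1\tcontrol\tA\nS2\tcontrol\tA\nS3\ttreated\tB\nS4\ttreated\tB\n' > "$tmp/confounded.tsv"
crosstab "$tmp/confounded.tsv" group batch
if is_confounded "$tmp/confounded.tsv" group batch; then
  echo "group and batch cannot be separated in this design"
fi
```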

18. Isoforms and alternative splicing

Standard gene-level differential expression may miss transcript switching, exon usage and alternative splicing. Isoform and splicing analyses require additional methods and careful interpretation.

  • Transcript abundance: estimate transcript-level expression with Salmon, Kallisto or similar tools.
  • Differential transcript usage: test whether isoform proportions change between conditions.
  • Exon usage: evaluate differential exon inclusion or exclusion.
  • Splice junctions: use junction-spanning reads from splice-aware aligners.

Short-read RNA-seq can infer many splicing patterns, but long-read RNA sequencing can provide stronger evidence for full-length isoforms.

19. Expressed variants and fusion transcripts

RNA-seq can support additional analyses beyond expression, especially in cancer and disease studies.

Analysis | What it detects | Caution
Fusion transcripts | Chimeric transcripts or gene fusions. | Requires fusion-specific tools and careful artefact filtering.
Expressed variants | Variants present in RNA reads. | RNA editing, allele-specific expression and mapping artefacts complicate interpretation.
Allele-specific expression | Imbalanced expression of alleles. | Requires genotype-aware analysis and bias control.
Viral or microbial reads | Non-host expressed sequences. | Requires contamination-aware interpretation and proper references.

20. Functional and pathway interpretation

Differential-expression tables become more informative when interpreted at pathway, gene-set and biological-process levels.

  • Overrepresentation analysis: tests whether selected significant genes are enriched for pathways or GO terms.
  • Gene set enrichment: uses ranked gene lists and can detect coordinated moderate changes.
  • Pathway databases: Reactome, Gene Ontology, KEGG-style resources and curated gene sets are common options.
  • Biological review: interpret gene sets together with study design, cell composition and known biology.

Pathway results are hypotheses and summaries, not proof of mechanism. They depend on gene universe, annotation version, ranking metric and database content.
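
The dependence on the gene universe is easiest to see in the test that commonly underlies overrepresentation analysis, a one-sided hypergeometric (Fisher's exact) test. The notation here is introduced only for this illustration: N is the number of genes in the universe, K the number of universe genes annotated to the pathway, n the number of significant genes and k the overlap between the two.

```latex
P(X \ge k) \;=\; \sum_{i=k}^{\min(n,\,K)}
  \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
```

Every term depends on N and K, so restricting the universe to expressed genes instead of using all annotated genes changes the p-value even when the overlap k stays the same.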

21. RNA-seq visualization

Visualizations help assess data quality, communicate results and detect unexpected patterns.

Plot | Purpose | What to inspect
PCA / MDS | Sample relationships and batch effects. | Group separation, outliers and batch-driven clustering.
Sample correlation heatmap | Replicate similarity. | Unexpectedly weak replicate correlation or sample swaps.
MA plot | Differential-expression pattern versus abundance. | Global shifts, low-count noise and strong outliers.
Volcano plot | Significance and effect size summary. | Genes with strong fold change and adjusted significance.
Heatmap | Expression patterns for selected genes. | Clustering, group-specific signatures and batch patterns.
Genome browser tracks | Read coverage and splice junction visualization. | Gene-level evidence, isoform changes and mapping artefacts.

22. Note on single-cell RNA-seq

Single-cell RNA-seq shares some RNA-seq concepts but requires specialized processing. Cell barcodes, UMIs, empty droplets, doublets, ambient RNA, cell-cycle effects, normalization, clustering and cell-type annotation are central.

  • Cell-level QC: UMIs per cell, genes per cell, mitochondrial fraction and doublet scores.
  • Matrix generation: cell-by-gene matrices are produced from barcode and UMI-aware workflows.
  • Clustering: cells are grouped by expression profiles and annotated with marker genes.
  • Different statistics: single-cell differential testing should consider donor structure and pseudobulk designs when appropriate.

23. Example RNA-seq analysis workflow

The following simplified workflow illustrates a common bulk RNA-seq route. Real projects should adapt parameters to the library protocol, reference and statistical design.

Project folders
mkdir -p reference data/fastq data/metadata \
  results/qc/{fastqc,multiqc,fastp,alignment} \
  results/star results/salmon results/counts \
  results/differential_expression results/figures results/logs

FASTQ QC
fastqc data/fastq/*.fastq.gz \
  --outdir results/qc/fastqc \
  --threads 8

multiqc results/qc \
  --outdir results/qc/multiqc

STAR alignment for one sample
sample="S1"

mkdir -p "results/star/${sample}"

STAR \
  --runThreadN 16 \
  --genomeDir reference/star_index \
  --readFilesIn "data/fastq/${sample}_R1.fastq.gz" "data/fastq/${sample}_R2.fastq.gz" \
  --readFilesCommand zcat \
  --outFileNamePrefix "results/star/${sample}/" \
  --outSAMtype BAM SortedByCoordinate \
  --quantMode GeneCounts

samtools index "results/star/${sample}/Aligned.sortedByCoord.out.bam"
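
The single-sample command extends to a whole project by looping over the sample sheet. The sketch below assumes that `sample_id` is the first column of a tab-separated sheet and that FASTQ files follow the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming used above. The helpers `run` and `align_all` and the `DRY_RUN` switch are hypothetical conveniences, not part of STAR or samtools.

```bash
# run CMD ARGS...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# align_all SAMPLES.tsv: one STAR run per sample_id (first column; header row skipped).
# Stops at the first failing command.
align_all() {
  tab=$(printf '\t')
  while IFS="$tab" read -r sample rest; do
    case "$sample" in ""|sample_id) continue ;; esac
    run mkdir -p "results/star/${sample}" || return 1
    # stdin is redirected so the aligner cannot consume the sample sheet
    run STAR \
      --runThreadN 16 \
      --genomeDir reference/star_index \
      --readFilesIn "data/fastq/${sample}_R1.fastq.gz" "data/fastq/${sample}_R2.fastq.gz" \
      --readFilesCommand zcat \
      --outFileNamePrefix "results/star/${sample}/" \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts < /dev/null || return 1
    run samtools index "results/star/${sample}/Aligned.sortedByCoord.out.bam" || return 1
  done < "$1"
}

# Dry run on a two-sample sheet: prints the commands without executing them.
tmp=$(mktemp -d)
printf 'sample_id\tgroup\nS1\tcontrol\nS2\ttreated\n' > "$tmp/samples.tsv"
DRY_RUN=1
align_all "$tmp/samples.tsv"
```

Reviewing the dry-run output before setting `DRY_RUN=0` is a cheap way to catch path or naming mistakes. For larger projects, a workflow manager handles the same loop with retries and resource control.
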
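
featureCounts gene counting
# --countReadPairs applies to Subread 2.0.2 and later; omit it on older releases, where -p alone counts fragments.
featureCounts \
  -T 8 \
  -p --countReadPairs \
  -s 2 \
  -a reference/annotation.gtf \
  -o results/counts/gene_counts.txt \
  results/star/*/Aligned.sortedByCoord.out.bam

Alternative Salmon quantification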
sample="S1"

salmon quant \
  -i reference/salmon_index \
  -l A \
  -1 "data/fastq/${sample}_R1.fastq.gz" \
  -2 "data/fastq/${sample}_R2.fastq.gz" \
  -p 16 \
  --validateMappings \
  -o "results/salmon/${sample}"

Aggregate QC
multiqc results \
  --outdir results/qc/multiqc_final

24. RNA-seq deliverables and reports

RNA-seq reports should include both computational outputs and interpretation. The exact deliverables depend on whether the project focuses on expression, isoforms, splicing, fusions or pathway analysis.

Recommended deliverables

  • FASTQ QC reports and final MultiQC report.
  • Alignment or quantification logs.
  • Sorted BAM files and indexes if genome alignment was performed.
  • Gene-level and/or transcript-level count matrices.
  • Normalized expression tables for visualization.
  • PCA, sample correlation and outlier assessment plots.
  • Differential-expression tables with log2 fold changes and adjusted p-values.
  • Pathway or gene-set enrichment results when requested.
  • Methods section with software versions, references and model design.
  • Interpretation summary with limitations and suggested follow-up analyses.

25. Reproducibility and workflow automation

RNA-seq projects often need to be rerun as metadata changes, samples are added or references are updated. Reproducible workflows reduce manual errors and make results auditable.

  • Version everything: record FASTQ checksums, reference releases, annotation versions and software versions.
  • Use workflow managers: Nextflow, Snakemake or equivalent systems help scale and document analysis.
  • Save environments: use conda/mamba YAML files, containers or locked workflow profiles.
  • Preserve statistical design: keep sample metadata, contrast definitions and model formulas with the report.
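
Checksums are the simplest of these records to automate. The sketch below writes and verifies a SHA-256 manifest for a directory, assuming GNU coreutils `sha256sum` is available (on macOS, `shasum -a 256` is the usual substitute). The function names `write_manifest` and `verify_manifest` and the demo files are hypothetical.

```bash
# write_manifest DIR OUT: SHA-256 of every regular file under DIR, sorted by path.
# OUT should live outside DIR so that the manifest does not list itself.
write_manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 ) > "$2"
}

# verify_manifest DIR MANIFEST: succeed only when every listed file is unchanged.
verify_manifest() {
  ( cd "$1" && sha256sum -c --quiet - ) < "$2"
}

# Demo with a tiny reference directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/reference"
printf '>chr1\nACGT\n'       > "$tmp/reference/genome.fa"
printf '# demo annotation\n' > "$tmp/reference/annotation.gtf"
write_manifest "$tmp/reference" "$tmp/reference.sha256"
verify_manifest "$tmp/reference" "$tmp/reference.sha256" && echo "reference unchanged"

printf 'N\n' >> "$tmp/reference/genome.fa"   # simulate an accidental edit
verify_manifest "$tmp/reference" "$tmp/reference.sha256" 2>/dev/null \
  || echo "reference changed since the manifest was written"
```

Software versions can be captured next to the manifest, for example from `STAR --version`, `salmon --version` and `samtools --version`.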

26. RNA-seq data analysis cheat sheet

Step | Common tools | Main outputs
FASTQ QC | FastQC, fastp, MultiQC | Raw-read QC and project-level summary.
Trimming | fastp, Cutadapt, Trim Galore | Trimmed FASTQ files and trimming reports.
Genome alignment | STAR, HISAT2 | Splice-aware BAM files, logs and junction information.
Transcript quantification | Salmon, Kallisto | Transcript abundance estimates and quantification summaries.
Gene counting | featureCounts, HTSeq-count, STAR GeneCounts | Gene-level count matrix.
Sample QC | MultiQC, R/Python, RSeQC, RNA-SeQC | PCA, correlations, assignment rate, gene body coverage and strandedness checks.
Differential expression | DESeq2, edgeR, limma-voom | Log2 fold changes, adjusted p-values and ranked gene tables.
Functional interpretation | clusterProfiler, ReactomePA, fgsea, GSEA-style workflows | Pathways, GO terms and gene-set enrichment results.
Visualization | R, Python, IGV, MultiQC | PCA, heatmaps, volcano plots, MA plots and genome-browser tracks.
Reporting | R Markdown, Quarto, workflow reports, AI-assisted summaries | Methods, QC, results, plots, interpretation and limitations.

Frequently asked questions

What is RNA-seq data analysis?

RNA-seq data analysis is the computational processing of RNA sequencing reads to measure gene or transcript expression, detect differential expression, study splicing, identify expressed variants or fusions, and interpret biological pathways or signatures.

What is the difference between RNA-seq alignment and transcript quantification?

Alignment maps reads to a genome or transcriptome and can produce BAM files for visualization, splice-junction analysis and counting. Transcript quantification estimates transcript or gene abundance, often with tools such as Salmon or Kallisto, and may not require full genomic BAM files.

Which tools are commonly used for RNA-seq analysis?

Common tools include FastQC and MultiQC for QC, fastp or Cutadapt for trimming, STAR or HISAT2 for splice-aware alignment, featureCounts or HTSeq for counting, Salmon or Kallisto for transcript quantification, and DESeq2, edgeR or limma-voom for differential expression.

How many reads are needed for RNA-seq?

Required depth depends on organism, library type, transcriptome complexity and analysis goal. Many bulk mRNA-seq studies use tens of millions of reads per sample, while low-input, total RNA, allele-specific, splicing or low-abundance transcript analyses may require more.

Is trimming required for RNA-seq?

Trimming is required when adapter sequences, primers or poor-quality tails are detected and may affect mapping or quantification. It should be based on QC results rather than applied blindly.

What is strandedness in RNA-seq?

Strandedness describes whether read orientation preserves information about the original RNA strand. Correct strandedness settings are essential for accurate gene counting, especially for overlapping genes and antisense transcription.

What is the difference between raw counts, TPM and FPKM?

Raw counts are integer read counts used by many differential-expression methods. TPM and FPKM/RPKM are normalized abundance measures useful for within-sample or descriptive expression summaries, but raw counts are usually preferred for statistical differential-expression testing with DESeq2 or edgeR-style models.

Can AI help with RNA-seq analysis?

AI can help summarize QC reports, detect unusual samples, explain differential-expression results, prioritize genes and draft pathway interpretation. However, the analysis should remain reproducible, versioned and based on explicit statistical models and expert review.