Short read alignment: from FASTQ files to analysis-ready BAM files.
A practical tutorial for aligning short-read sequencing data to reference genomes or transcriptomes. It covers reference preparation, aligner selection, indexing, paired-end alignment, read groups, SAM/BAM/CRAM processing, alignment QC, troubleshooting and reproducible workflow design.
Short-read alignment maps sequencing reads to a reference genome, transcriptome or custom reference. Reads are usually stored in FASTQ files before alignment and in SAM, BAM or CRAM files after alignment. The alignment step creates genomic coordinates that make downstream analyses possible.
Alignment: The aligner maps reads to reference sequences and records mapping position, quality and flags.
Output: Sorted and indexed BAM/CRAM files plus alignment QC reports and logs.
Core principle: alignment is not only a technical conversion step. The chosen reference, parameters, read groups and post-processing choices directly affect downstream biological and clinical interpretation.
2. Key alignment concepts
Understanding a few core terms makes alignment reports and BAM files easier to interpret.
| Concept | Meaning | Why it matters |
| --- | --- | --- |
| Mapping quality | A score estimating confidence that a read is placed correctly. | Low mapping quality often indicates multi-mapping or repetitive regions. |
| CIGAR string | Compact representation of matches, insertions, deletions, clipping and skipped regions. | Essential for variant calling, splicing, indel interpretation and read visualization. |
| Primary alignment | The main alignment selected for a read. | Most downstream tools operate primarily on primary alignments. |
| Secondary alignment | Alternative alignment for a multi-mapping read. | Important in repeats, paralogs, pseudogenes and multi-copy sequences. |
| Supplementary alignment | Part of a split alignment. | Important for structural variants, chimeric reads and some fusion analyses. |
| Proper pair | Paired-end reads aligned in the expected orientation and distance. | Low proper pairing can indicate library, reference or alignment problems. |
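To make the CIGAR entry concrete, the sketch below counts how many reference bases and read bases a CIGAR string covers. The helper name cigar_spans is hypothetical and is not part of any alignment tool. The operator rules follow the SAM specification: M, = and X consume both sequences, I and S consume only the read, D and N consume only the reference, and H and P consume neither.

```shell
# cigar_spans: print "<reference_bases> <read_bases>" for one CIGAR string.
# Hypothetical helper for illustration; not part of samtools.
cigar_spans() {
  printf '%s\n' "$1" | awk '{
    ref = 0; qry = 0; s = $0
    # Consume one <length><operator> token at a time.
    while (match(s, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(s, 1, RLENGTH - 1) + 0
      op  = substr(s, RLENGTH, 1)
      if (op ~ /[MDN=X]/) ref += len   # operators that consume the reference
      if (op ~ /[MIS=X]/) qry += len   # operators that consume the read
      s = substr(s, RLENGTH + 1)
    }
    # Anything left over means the string was not a valid CIGAR.
    if (s != "") { print "invalid CIGAR" > "/dev/stderr"; exit 1 }
    print ref, qry
  }'
}

# 5 soft-clipped bases, 90 aligned, a 2-base deletion, 55 aligned
cigar_spans 5S90M2D55M   # prints: 147 150
```

A spliced RNA-seq read such as 30M1000N46M spans 1076 reference bases while the read itself is only 76 bases long, which is why splice-aware tools and viewers must interpret the N operator.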
3. Input files and project metadata
Alignment starts with clean input definitions. The most common problems are wrong sample names, mixed genome assemblies, incorrect read layout and missing metadata.
Minimum input information
FASTQ file paths for R1 and, if paired-end, R2 reads.
Sample identifiers that match the sample sheet and downstream reports.
Organism and reference genome assembly, for example GRCh38 or mm39.
sample_id group fastq_1 fastq_2
S1 control S1_R1.fastq.gz S1_R2.fastq.gz
S2 control S2_R1.fastq.gz S2_R2.fastq.gz
S3 treated S3_R1.fastq.gz S3_R2.fastq.gz
S4 treated S4_R1.fastq.gz S4_R2.fastq.gz
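A sample sheet like the one above can be checked before any alignment job starts. The following is a minimal sketch, assuming a whitespace-delimited sheet with exactly the four columns shown (sample_id, group, fastq_1, fastq_2). The function name check_sample_sheet and the demo file are hypothetical, and the check covers structure only, not whether the FASTQ files exist.

```shell
# check_sample_sheet: report structural problems in a whitespace-delimited
# sample sheet with a header row. Exit status is non-zero if any problem is found.
check_sample_sheet() {
  awk '
    NR == 1 { next }                       # skip the header row
    NF == 0 { next }                       # ignore blank lines
    NF != 4 { printf "line %d: expected 4 columns, found %d\n", NR, NF; bad = 1; next }
    seen[$1]++ { printf "line %d: duplicate sample_id %s\n", NR, $1; bad = 1 }
    $3 == $4 { printf "line %d: fastq_1 and fastq_2 are the same file\n", NR; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Demo with a temporary copy of the first two rows of the example sheet
sheet=$(mktemp)
cat > "$sheet" <<'EOF'
sample_id group fastq_1 fastq_2
S1 control S1_R1.fastq.gz S1_R2.fastq.gz
S2 control S2_R1.fastq.gz S2_R2.fastq.gz
EOF
check_sample_sheet "$sheet" && echo "sample sheet OK"
```

Failing early on a duplicated sample_id or a copy-pasted FASTQ path is cheaper than discovering the problem after hours of alignment.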
4. Reference genome preparation and indexing
Aligners require an index built from the reference sequence. The index must match the exact FASTA file used for downstream interpretation. If a BAM file is created with one reference but analysed with another, coordinate and variant interpretation errors can occur.
Use one assembly consistently: Do not mix GRCh37/hg19 and GRCh38/hg38 files unless coordinate conversion is explicitly planned.
Keep reference metadata: Save source URL, download date, assembly name, annotation version and checksum.
Index once, reuse carefully: Reference indexes can be large. Store them in a versioned reference folder.
Match annotation to reference: RNA-seq and gene counting require annotation files compatible with the same assembly.
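One way to keep reference metadata is a small sidecar file written next to the FASTA. The sketch below is an illustration, not a standard format: the function name write_reference_metadata, the .metadata.tsv suffix, the demo assembly name and the example.org URL are all placeholders. It assumes GNU coreutils sha256sum is available (on macOS, shasum -a 256 is the usual substitute).

```shell
# write_reference_metadata: record provenance for a reference FASTA.
# Arguments: FASTA path, assembly name, annotation version, source URL.
write_reference_metadata() {
  fasta=$1; assembly=$2; annotation=$3; source_url=$4
  {
    printf 'file\t%s\n' "$(basename "$fasta")"
    printf 'assembly\t%s\n' "$assembly"
    printf 'annotation_version\t%s\n' "$annotation"
    printf 'source_url\t%s\n' "$source_url"
    # Time the record was written; note the download date separately if it differs.
    printf 'recorded_utc\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'sha256\t%s\n' "$(sha256sum "$fasta" | cut -d' ' -f1)"
  } > "$fasta.metadata.tsv"
}

# Demo on a tiny placeholder FASTA in a temporary directory
demo_dir=$(mktemp -d)
printf '>chrTest\nACGTACGT\n' > "$demo_dir/genome.fa"
write_reference_metadata "$demo_dir/genome.fa" "demo-assembly" "demo-annotation-v1" \
  "https://example.org/genome.fa"
cat "$demo_dir/genome.fa.metadata.tsv"
```

Comparing the stored checksum before each run is a cheap guard against silently swapped or truncated reference files.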
Reference indexing examples
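# Create output directories for the aligner-specific indexes
mkdir -p reference/bowtie2 reference/hisat2 reference/star_index
# FASTA index used by many tools
samtools faidx reference/genome.fa
# BWA index (BWA-MEM2 needs its own index, built with bwa-mem2 index)
bwa index reference/genome.fa
# Bowtie2 index
bowtie2-build reference/genome.fa reference/bowtie2/genome
# HISAT2 index
hisat2-build reference/genome.fa reference/hisat2/genome
# STAR genome index, example for RNA-seq
# --sjdbOverhang should be read length minus 1; 149 assumes 150 bp reads
STAR \
--runThreadN 16 \
--runMode genomeGenerate \
--genomeDir reference/star_index \
--genomeFastaFiles reference/genome.fa \
--sjdbGTFfile reference/annotation.gtf \
--sjdbOverhang 149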
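The STAR manual recommends setting --sjdbOverhang to the maximum read length minus 1, so the value 149 in the example suits 150 bp reads. The sketch below estimates that value from a gzipped FASTQ by inspecting the first 10,000 reads. The function name max_read_length and the tiny demo file are hypothetical, and sampling only the start of a file is an assumption that may miss longer reads later on.

```shell
# max_read_length: longest sequence among the first 10,000 reads of a gzipped FASTQ.
# FASTQ records are 4 lines each, and the sequence is line 2 of every record.
max_read_length() {
  gzip -cd < "$1" | head -n 40000 |
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Demo with a two-read placeholder FASTQ (read lengths 10 and 6)
demo_fq=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip -c > "$demo_fq"
read_len=$(max_read_length "$demo_fq")
echo "--sjdbOverhang $((read_len - 1))"   # prints: --sjdbOverhang 9
```

If reads were trimmed to variable lengths before alignment, the maximum length rather than the mean is the relevant input for this parameter.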
5. Choosing an aligner
Aligner choice depends on the assay and downstream endpoint. A tool that is excellent for DNA-seq may be inappropriate for RNA-seq because RNA-seq reads can span exon-exon junctions.
| Assay | Common tools | Typical output |
| --- | --- | --- |
| Short-read DNA-seq | BWA-MEM, BWA-MEM2, Bowtie2, minimap2 | Genome-aligned BAM/CRAM for variant calling, coverage and CNV analysis. |
| RNA-seq | STAR, HISAT2, Salmon, Kallisto | Splice-aware BAM, transcript quantification or gene-count matrix. |
| ChIP-seq / ATAC-seq | Bowtie2, BWA, BWA-MEM2 | Genome-aligned BAM for peak calling and signal tracks. |
| Small RNA-seq | Bowtie, Bowtie2, STAR small-RNA-aware workflows | Short alignments to genome or small-RNA references. |
| Metagenomics | Bowtie2, minimap2, BWA or taxonomic classifiers | Host-filtered reads, microbial alignments or taxonomic profiles. |
6. DNA-seq alignment with BWA or BWA-MEM2
BWA-MEM and BWA-MEM2 are commonly used for short-read DNA-seq alignment, especially for whole-genome, exome and targeted-panel workflows. BWA-MEM2 is designed as a faster implementation that produces highly similar alignments for many practical workflows.
Bowtie2 prints its alignment summary to stderr, so redirecting stderr to a log file with 2> preserves important QC information.
8. RNA-seq alignment with STAR and HISAT2
RNA-seq reads can span exon-exon junctions. Splice-aware aligners such as STAR and HISAT2 are designed to map these reads correctly. For expression-only projects, transcript quantifiers such as Salmon or Kallisto may be more efficient.
RNA-seq counting requires correct strandedness settings. Incorrect strandedness can strongly reduce assigned reads and distort differential-expression results.
9. Paired-end alignment and insert size
In paired-end sequencing, each DNA fragment is read from both ends. The aligner uses expected orientation and distance between reads to improve mapping and detect unusual fragments.
Properly paired reads: Reads that align in an expected orientation and distance. Low values can indicate library or reference problems.
Insert size: Estimated fragment length between read pairs. Unexpected distributions may indicate library-preparation issues.
Singletons: Only one read in a pair maps confidently. High singleton rates can indicate quality, contamination or reference mismatch.
Discordant pairs: Pairs with unexpected orientation or distance. These can be artefacts or signal structural variation.
10. Read groups and sample metadata
Read groups store sample, library, platform and run information inside BAM files. They are often required for variant-calling workflows and are useful for tracking batches and lanes.
| Field | Meaning | Example |
| --- | --- | --- |
| ID | Read group identifier, often lane or run specific. | |
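As an illustration of how a read group can be assembled, the sketch below builds an @RG header line from sample, library, flowcell and lane names. ID, SM, LB, PL and PU are standard SAM read-group tags, but the function name make_read_group, the sample.flowcell.lane naming scheme and the example values are assumptions made for this example only.

```shell
# make_read_group: build an @RG header line for one sample sequenced on one lane.
# Fields are separated by a literal backslash-t, the form that bwa mem -R accepts.
make_read_group() {
  sample=$1; library=$2; flowcell=$3; lane=$4
  printf '@RG\\tID:%s.%s.%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\\tPU:%s.%s' \
    "$sample" "$flowcell" "$lane" "$sample" "$library" "$flowcell" "$lane"
}

rg=$(make_read_group S1 lib1 FLOWCELL1 1)
printf '%s\n' "$rg"
# prints: @RG\tID:S1.FLOWCELL1.1\tSM:S1\tLB:lib1\tPL:ILLUMINA\tPU:FLOWCELL1.1

# Typical use (requires bwa and an indexed reference, so it is not run here):
# bwa mem -R "$rg" reference/genome.fa S1_R1.fastq.gz S1_R2.fastq.gz
```

Generating the string from the sample sheet, rather than typing it per sample, keeps SM values identical to the sample identifiers used in downstream reports.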
Avoid storing huge uncompressed SAM files unless needed temporarily. Pipe aligner output directly into sorting or conversion whenever possible.
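A piped workflow of this kind might look like the sketch below, which streams BWA-MEM output straight into samtools sort so that no intermediate SAM file is written. The function name align_and_sort, the log-file naming and the default thread count are assumptions. The function is only defined here, because running it requires bwa, samtools and a reference indexed with bwa index.

```shell
# align_and_sort: align paired-end reads and write a sorted, indexed BAM
# without creating an intermediate SAM file.
# Arguments: reference FASTA, R1 FASTQ, R2 FASTQ, output BAM, threads (default 4).
align_and_sort() {
  ref=$1; r1=$2; r2=$3; out_bam=$4; threads=${5:-4}
  # bwa writes progress messages to stderr; keep them as a per-sample log.
  bwa mem -t "$threads" "$ref" "$r1" "$r2" 2> "${out_bam%.bam}.bwa.log" |
    samtools sort -@ "$threads" -o "$out_bam" - &&
    samtools index "$out_bam"
}

# Example call, using paths from the sample sheet and reference layout in this tutorial:
# align_and_sort reference/genome.fa S1_R1.fastq.gz S1_R2.fastq.gz S1.sorted.bam 8
```

In a plain POSIX shell the exit status of a pipeline reflects only the last command, so a failure inside bwa mem may go unnoticed; checking the log file, or enabling pipefail in shells that support it, helps close that gap.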
12. Filtering alignments
Filtering removes reads that are unmapped, low quality, secondary, supplementary, duplicated or otherwise unsuitable for a specific analysis. Filtering choices must match the biological question.
Common SAMtools filtering examples
# Keep mapped reads only
samtools view -b -F 4 sample.sorted.bam > sample.mapped.bam
# Keep reads with mapping quality at least 30
samtools view -b -q 30 sample.sorted.bam > sample.mapq30.bam
# Exclude unmapped (4), secondary (256) and supplementary (2048) alignments; 4 + 256 + 2048 = 2308
samtools view -b -F 2308 sample.sorted.bam > sample.primary_mapped.bam
# Index filtered BAM
samtools index sample.primary_mapped.bam
Filtering can change coverage, peak calls, variant calls and expression estimates. Always document filters and avoid applying generic filters without understanding downstream effects.
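Because -F and -f take sums of FLAG bits, it helps to see which bits a given number contains before using it as a filter. The sketch below decodes an integer into descriptive labels. The bit values come from the SAM specification, while the function name decode_flag and the label spellings are chosen for this example. The samtools flags subcommand offers a similar conversion when samtools is installed.

```shell
# decode_flag: list the SAM FLAG bits that are set in an integer.
decode_flag() {
  flag=$1
  names=""
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped 16:reverse \
              32:mate_reverse 64:read1 128:read2 256:secondary 512:qc_fail \
              1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}     # numeric bit value before the colon
    name=${pair#*:}     # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then names="$names $name"; fi
  done
  printf '%s\n' "${names# }"
}

decode_flag 2308   # prints: unmapped secondary supplementary
decode_flag 99     # prints: paired proper_pair mate_reverse read1
```

Decoding a mask before applying it is a quick way to confirm that a filter removes exactly the alignment classes intended for the assay.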
13. Duplicate marking and removal
Duplicate marking identifies reads that appear to originate from the same original DNA fragment. Many DNA-seq workflows mark duplicates before variant calling, but duplicate handling differs by assay.
| Assay | Typical duplicate handling | Notes |
| --- | --- | --- |
| WGS / WES | Mark duplicates; variant callers may ignore or model them. | High duplicate rate can reduce effective coverage. |
| Targeted panels | Often mark duplicates, but UMI-aware workflows may be required. | Deep panels need assay-specific duplicate strategy. |
| RNA-seq | Usually not removed for gene-expression counting. | High duplication can reflect high expression rather than PCR artefacts. |
| ChIP-seq / ATAC-seq | Often marked or filtered depending on pipeline and depth. | Interpret duplication with library complexity and enrichment metrics. |
| UMI workflows | Use UMI-aware consensus or deduplication. | Coordinate-only duplicate marking can be inappropriate. |
Reference genome FASTA source, index information and checksum.
Sample sheet and read-group definitions.
Software versions and exact commands or workflow configuration.
MultiQC summary report, where available.
Notes about failed or flagged samples.
For regulated or clinical-genomics-style projects, alignment is part of the validated analysis chain. Tool versions, parameters and reference data should be locked and auditable.
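Recording software versions can be automated at the start of each run. The sketch below is one possible approach, with a hypothetical function name record_tool_versions and a tool:flag argument convention invented for this example. It relies on samtools, bowtie2, STAR and hisat2 each accepting --version; bwa reports its version only in its usage message and would need separate handling.

```shell
# record_tool_versions: print "tool<TAB>first line of its version output" per tool.
# Each argument is "tool:flag" because the version flag can differ between tools.
record_tool_versions() {
  for entry in "$@"; do
    tool=${entry%%:*}
    flag=${entry#*:}
    if command -v "$tool" >/dev/null 2>&1; then
      version=$("$tool" "$flag" 2>&1 | head -n 1)
    else
      version="NOT FOUND"      # tool is not on PATH in this environment
    fi
    printf '%s\t%s\n' "$tool" "$version"
  done
}

record_tool_versions samtools:--version bowtie2:--version STAR:--version hisat2:--version
```

Redirecting this output into a file stored with the alignment logs gives each BAM a traceable software record, and a NOT FOUND entry flags an incomplete environment before any data are processed.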
19. Short-read alignment cheat sheet
| Task | Common command or tool | Purpose |
| --- | --- | --- |
| FASTQ QC | fastqc, multiqc | Inspect raw reads before alignment. |
| Adapter trimming | fastp, cutadapt | Remove adapters or low-quality bases when needed. |
| DNA-seq alignment | bwa mem, bwa-mem2 mem | Map short DNA reads to a genome. |
| ChIP/ATAC alignment | bowtie2, bwa | Map reads for peak calling or signal tracks. |
| RNA-seq alignment | STAR, hisat2 | Splice-aware mapping. |
| Sort BAM | samtools sort | Prepare BAM for indexing and downstream tools. |
| Index BAM | samtools index | Enable fast random access. |
| Alignment QC | samtools flagstat/stats/idxstats, MultiQC | Summarize alignment quality. |
| Coverage | mosdepth, bedtools coverage | Measure depth and breadth of coverage. |
Useful alignment tools and documentation
These resources are useful starting points for short-read alignment workflows and file processing.
What is short-read alignment?
Short-read alignment is the process of mapping sequencing reads, usually tens to a few hundred bases long, to a reference genome, transcriptome or other reference sequence. The output is commonly a SAM, BAM or CRAM alignment file.
Which aligner should I use for DNA-seq?
For many short-read DNA-seq projects, BWA-MEM, BWA-MEM2, Bowtie2 or minimap2 short-read modes are common choices. The best choice depends on read length, organism, reference, downstream variant calling and institutional validation requirements.
Which aligner should I use for RNA-seq?
RNA-seq reads usually require splice-aware alignment or transcript-level quantification. STAR and HISAT2 are common splice-aware aligners, while Salmon and Kallisto are common transcript quantification tools when full genomic BAM files are not required.
Do I always need to trim reads before alignment?
No. Trimming should be based on FASTQ quality-control results and the assay. Adapter contamination, low-quality tails or primer sequences often justify trimming, but unnecessary aggressive trimming can reduce mappability.
What is the difference between SAM, BAM and CRAM?
SAM is a text alignment format, BAM is its compressed binary equivalent, and CRAM is a reference-based compressed alignment format. BAM is widely used for analysis, while CRAM can reduce storage requirements if the reference is managed carefully.
What is a good mapping rate?
There is no universal mapping-rate threshold. Expected mapping rate depends on organism, library type, read quality, contamination, reference quality and assay. The value should be interpreted together with duplication, coverage, insert size and sample metadata.
Should PCR duplicates be removed after alignment?
It depends on the assay. Duplicate marking is standard in many DNA-seq workflows, but removal can be inappropriate for RNA-seq, amplicon sequencing, single-cell data or very high-depth targeted assays unless the method specifically requires it.
Can AI help with short-read alignment workflows?
AI can help summarize alignment metrics, detect unusual samples, triage QC reports, generate documentation and assist with workflow configuration. The alignment itself should remain reproducible, versioned and based on explicit parameters.