NGS Alignment Tutorial

Short-read alignment: from FASTQ files to analysis-ready BAM files.

A practical tutorial for aligning short-read sequencing data to reference genomes or transcriptomes. It covers reference preparation, aligner selection, indexing, paired-end alignment, read groups, SAM/BAM/CRAM processing, alignment QC, troubleshooting and reproducible workflow design.

1. Overview: what is short-read alignment?

Short-read alignment maps sequencing reads to a reference genome, transcriptome or custom reference. Reads are usually stored in FASTQ files before alignment and in SAM, BAM or CRAM files after alignment. The alignment step assigns each mapped read a reference coordinate, which downstream analyses such as variant calling, counting and peak calling depend on.

Input: FASTQ files, sample sheet, read layout, reference genome and optional annotation.
Alignment: The aligner maps reads to reference sequences and records mapping position, quality and flags.
Output: Sorted and indexed BAM/CRAM files plus alignment QC reports and logs.
Core principle: alignment is not only a technical conversion step. The chosen reference, parameters, read groups and post-processing choices directly affect downstream biological and clinical interpretation.

2. Key alignment concepts

Understanding a few core terms makes alignment reports and BAM files easier to interpret.

Concept | Meaning | Why it matters
Mapping quality | A score estimating confidence that a read is placed correctly. | Low mapping quality often indicates multi-mapping or repetitive regions.
CIGAR string | Compact representation of matches, insertions, deletions, clipping and skipped regions. | Essential for variant calling, splicing, indel interpretation and read visualization.
Primary alignment | The main alignment selected for a read. | Most downstream tools operate primarily on primary alignments.
Secondary alignment | Alternative alignment for a multi-mapping read. | Important in repeats, paralogs, pseudogenes and multi-copy sequences.
Supplementary alignment | Part of a split alignment. | Important for structural variants, chimeric reads and some fusion analyses.
Proper pair | Paired-end reads aligned in the expected orientation and distance. | Low proper pairing can indicate library, reference or alignment problems.
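
Several of the concepts above (primary, secondary, supplementary, proper pair) are stored as bits in the SAM FLAG field. The sketch below decodes a FLAG integer into readable names. The bit values follow the SAM specification; the function name decode_flag and the short labels are illustrative choices, not part of any tool.

```shell
# Decode a SAM FLAG integer into the names of the bits that are set.
# Bit values follow the SAM specification; the labels are our own.
decode_flag() {
  flag=$1
  names=""
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse 32:mate_reverse 64:read1 128:read2 \
              256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}     # number before the colon
    name=${pair#*:}     # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      names="${names:+$names }$name"
    fi
  done
  echo "$names"
}

decode_flag 99     # first mate of a proper pair, mate on the reverse strand
decode_flag 2064   # supplementary alignment on the reverse strand
```

A read with none of the secondary or supplementary bits set is a primary alignment, so there is no dedicated "primary" bit to decode.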

3. Input files and project metadata

Alignment starts with clean input definitions. The most common problems are wrong sample names, mixed genome assemblies, incorrect read layout and missing metadata.

Minimum input information

  • FASTQ file paths for R1 and, if paired-end, R2 reads.
  • Sample identifiers that match the sample sheet and downstream reports.
  • Organism and reference genome assembly, for example GRCh38 or mm39.
  • Library type: DNA-seq, RNA-seq, ChIP-seq, ATAC-seq, WES, panel, metagenomics or other.
  • Read length, paired-end or single-end status and expected insert size if known.
  • Whether reads were already trimmed or filtered.
  • Required downstream analysis: variants, counts, peaks, coverage, fusion detection or visualization.
Example sample sheet
sample_id	group	fastq_1	fastq_2
S1	control	S1_R1.fastq.gz	S1_R2.fastq.gz
S2	control	S2_R1.fastq.gz	S2_R2.fastq.gz
S3	treated	S3_R1.fastq.gz	S3_R2.fastq.gz
S4	treated	S4_R1.fastq.gz	S4_R2.fastq.gz
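
Because wrong or duplicated sample names are among the most common input problems, it can help to check the sample sheet before any alignment starts. The following is a minimal sketch that assumes the tab-separated layout shown above; the function name check_sample_sheet and the demo file are hypothetical.

```shell
# Check a tab-separated sample sheet: expected header, non-empty and unique
# sample IDs, and a fastq_1 entry on every row. Prints each problem found
# and returns a non-zero status if there is at least one.
check_sample_sheet() {
  awk -F '\t' '
    NR == 1 {
      if ($1 != "sample_id" || $3 != "fastq_1") { print "unexpected header: " $0; bad = 1 }
      next
    }
    $1 == ""   { print "line " NR ": empty sample_id"; bad = 1 }
    seen[$1]++ { print "line " NR ": duplicate sample_id " $1; bad = 1 }
    $3 == ""   { print "line " NR ": missing fastq_1"; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Demo on a throwaway sheet with a deliberately duplicated sample ID.
sheet=$(mktemp)
printf 'sample_id\tgroup\tfastq_1\tfastq_2\n' > "$sheet"
printf 'S1\tcontrol\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n' >> "$sheet"
printf 'S1\ttreated\tS3_R1.fastq.gz\tS3_R2.fastq.gz\n' >> "$sheet"
check_sample_sheet "$sheet" || echo "sample sheet needs fixing"
rm -f "$sheet"
```

The check only inspects the sheet itself; confirming that the listed FASTQ files exist on disk would be a natural next step.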

4. Reference genome preparation and indexing

Aligners require an index built from the reference sequence. The index must match the exact FASTA file used for downstream interpretation. If a BAM file is created with one reference but analysed with another, coordinate and variant interpretation errors can occur.

Use one assembly consistently: Do not mix GRCh37/hg19 and GRCh38/hg38 files unless coordinate conversion is explicitly planned.
Keep reference metadata: Save source URL, download date, assembly name, annotation version and checksum.
Index once, reuse carefully: Reference indexes can be large. Store them in a versioned reference folder.
Match annotation to reference: RNA-seq and gene counting require annotation files compatible with the same assembly.
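
One possible way to keep reference metadata is to write a small provenance file next to the FASTA. This is a sketch only: the function name write_reference_metadata, the output file name and the field names are assumptions, and the date recorded is the day the function runs, which matches the download date only if it is run right after download.

```shell
# Write a small provenance record next to a reference FASTA.
# Arguments: FASTA path, assembly name, source URL (all supplied by the user).
write_reference_metadata() {
  fasta=$1; assembly=$2; source_url=$3
  # sha256sum is typical on Linux; shasum -a 256 is the usual macOS equivalent.
  if command -v sha256sum >/dev/null 2>&1; then
    checksum=$(sha256sum "$fasta" | awk '{print $1}')
  else
    checksum=$(shasum -a 256 "$fasta" | awk '{print $1}')
  fi
  {
    echo "assembly: $assembly"
    echo "source_url: $source_url"
    echo "recorded_on: $(date +%Y-%m-%d)"
    echo "sha256: $checksum"
  } > "$fasta.metadata.txt"
}

# Example call (placeholder URL):
# write_reference_metadata reference/genome.fa GRCh38 "https://example.org/genome.fa.gz"
```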
Reference indexing examples
# FASTA index used by many tools
samtools faidx reference/genome.fa

# BWA index
bwa index reference/genome.fa

# Bowtie2 index
bowtie2-build reference/genome.fa reference/bowtie2/genome

# HISAT2 index
hisat2-build reference/genome.fa reference/hisat2/genome

# STAR genome index, example for RNA-seq (--sjdbOverhang is read length minus 1; 149 assumes 150 bp reads)
STAR \
  --runThreadN 16 \
  --runMode genomeGenerate \
  --genomeDir reference/star_index \
  --genomeFastaFiles reference/genome.fa \
  --sjdbGTFfile reference/annotation.gtf \
  --sjdbOverhang 149
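
The --sjdbOverhang value in the STAR example is tied to read length (longest read length minus 1). As a rough sketch, the value can be estimated from the FASTQ file itself. The function name suggest_sjdb_overhang and the choice to sample only the first 10,000 reads are assumptions made for illustration.

```shell
# Suggest --sjdbOverhang as (longest read length - 1), estimated from the
# first 10,000 reads (40,000 lines) of a gzipped FASTQ file.
suggest_sjdb_overhang() {
  gzip -dc "$1" | head -n 40000 |
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }   # sequence lines only
         END { if (max > 0) print max - 1 }'
}

# Example call:
# suggest_sjdb_overhang data/fastq/sample_R1.fastq.gz
```

If reads were trimmed to variable lengths, the longest read in the sample drives the result, which is the behaviour wanted here.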

5. Choosing an aligner

Aligner choice depends on the assay and downstream endpoint. A tool that is excellent for DNA-seq may be inappropriate for RNA-seq because RNA-seq reads can span exon-exon junctions.

Assay | Common tools | Typical output
Short-read DNA-seq | BWA-MEM, BWA-MEM2, Bowtie2, minimap2 | Genome-aligned BAM/CRAM for variant calling, coverage and CNV analysis.
RNA-seq | STAR, HISAT2, Salmon, Kallisto | Splice-aware BAM, transcript quantification or gene-count matrix.
ChIP-seq / ATAC-seq | Bowtie2, BWA, BWA-MEM2 | Genome-aligned BAM for peak calling and signal tracks.
Small RNA-seq | Bowtie, Bowtie2, STAR small-RNA-aware workflows | Short alignments to genome or small-RNA references.
Metagenomics | Bowtie2, minimap2, BWA or taxonomic classifiers | Host-filtered reads, microbial alignments or taxonomic profiles.

6. DNA-seq alignment with BWA or BWA-MEM2

BWA-MEM and BWA-MEM2 are commonly used for short-read DNA-seq alignment, especially for whole-genome, exome and targeted-panel workflows. BWA-MEM2 is a faster reimplementation of the BWA-MEM algorithm; its developers report alignments identical to BWA-MEM, but it needs its own index built with bwa-mem2 index.

Paired-end DNA-seq alignment with BWA-MEM
mkdir -p results/alignments

bwa mem -t 16 \
  reference/genome.fa \
  data/fastq/sample_R1.fastq.gz \
  data/fastq/sample_R2.fastq.gz | \
  samtools sort -@ 8 -o results/alignments/sample.sorted.bam

samtools index results/alignments/sample.sorted.bam

Common BWA considerations

  • Use read groups when downstream variant calling requires sample/library/run metadata.
  • Pipe directly to samtools sort to avoid huge intermediate SAM files.
  • Check mapping rate, duplicates and insert size after alignment.
  • For clinical or regulated workflows, use a validated aligner version and fixed parameters.

7. Bowtie2 for flexible short-read alignment

Bowtie2 is widely used for ChIP-seq, ATAC-seq, microbial sequencing, metagenomic host depletion and many custom short-read mapping tasks.

Paired-end alignment with Bowtie2
mkdir -p results/alignments results/logs

bowtie2 \
  -x reference/bowtie2/genome \
  -1 data/fastq/sample_R1.fastq.gz \
  -2 data/fastq/sample_R2.fastq.gz \
  -p 16 \
  2> results/logs/sample.bowtie2.log | \
  samtools sort -@ 8 -o results/alignments/sample.sorted.bam

samtools index results/alignments/sample.sorted.bam
Bowtie2 prints its alignment summary to stderr, so redirecting stderr (2>) to a log file preserves important QC information.
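
Once the Bowtie2 summary is saved to a log, the overall alignment rate can be pulled out for every sample. The sketch below relies on the final summary line ending in "overall alignment rate"; the function name bowtie2_overall_rate is hypothetical and the log naming follows the example above.

```shell
# Print the overall alignment rate (without the % sign) from a Bowtie2 log.
bowtie2_overall_rate() {
  awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# Summarize all logs in the folder used above as sample<TAB>rate.
for log in results/logs/*.bowtie2.log; do
  [ -e "$log" ] || continue    # skip cleanly when no logs exist yet
  printf '%s\t%s\n' "$(basename "$log" .bowtie2.log)" "$(bowtie2_overall_rate "$log")"
done
```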

8. RNA-seq alignment with STAR and HISAT2

RNA-seq reads can span exon-exon junctions. Splice-aware aligners such as STAR and HISAT2 are designed to map these reads correctly. For expression-only projects, transcript quantifiers such as Salmon or Kallisto may be more efficient.

STAR paired-end RNA-seq alignment
mkdir -p results/star/sample

STAR \
  --runThreadN 16 \
  --genomeDir reference/star_index \
  --readFilesIn data/fastq/sample_R1.fastq.gz data/fastq/sample_R2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix results/star/sample/ \
  --outSAMtype BAM SortedByCoordinate \
  --quantMode GeneCounts

samtools index results/star/sample/Aligned.sortedByCoord.out.bam
HISAT2 paired-end RNA-seq alignment
mkdir -p results/hisat2 results/logs

hisat2 \
  -x reference/hisat2/genome \
  -1 data/fastq/sample_R1.fastq.gz \
  -2 data/fastq/sample_R2.fastq.gz \
  -p 16 \
  2> results/logs/sample.hisat2.log | \
  samtools sort -@ 8 -o results/hisat2/sample.sorted.bam

samtools index results/hisat2/sample.sorted.bam
RNA-seq counting requires correct strandedness settings. Incorrect strandedness can strongly reduce assigned reads and distort differential-expression results.
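
When STAR is run with --quantMode GeneCounts as above, its ReadsPerGene.out.tab file reports unstranded, forward-strand and reverse-strand counts in columns 2, 3 and 4, after four summary rows. Comparing columns 3 and 4 gives a quick strandedness hint. This is a heuristic sketch: the function name infer_strandedness and the 0.8 / 0.2 cut-offs are assumptions, not established thresholds.

```shell
# Guess library strandedness from a STAR ReadsPerGene.out.tab file by
# comparing total forward-strand (column 3) and reverse-strand (column 4) counts.
infer_strandedness() {
  awk -F '\t' '
    NR <= 4 { next }                 # skip the four N_* summary rows
    { fwd += $3; rev += $4 }
    END {
      total = fwd + rev
      if (total == 0) { print "undetermined"; exit }
      frac = fwd / total
      if (frac > 0.8)      print "forward"     # assumed cut-off
      else if (frac < 0.2) print "reverse"     # assumed cut-off
      else                 print "unstranded"
    }' "$1"
}

# Example call:
# infer_strandedness results/star/sample/ReadsPerGene.out.tab
```

The result is only a hint and should be checked against the library-preparation protocol before choosing the counting column or strandedness option.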

9. Paired-end alignment and insert size

In paired-end sequencing, each DNA fragment is read from both ends. The aligner uses expected orientation and distance between reads to improve mapping and detect unusual fragments.

Properly paired reads: Reads that align in the expected orientation and distance. Low values can indicate library or reference problems.
Insert size: Estimated length of the original fragment, inferred from where the two mates map. Unexpected distributions may indicate library-preparation issues.
Singletons: Pairs in which only one read maps confidently. High singleton rates can indicate quality, contamination or reference mismatch.
Discordant pairs: Pairs with unexpected orientation or distance. These can be artefacts or can signal structural variation.
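
For a quick look at insert size, the template length (TLEN, column 9 of a SAM record) can be summarized directly from alignment text. The sketch below reads SAM records on stdin and reports the median absolute TLEN, counting each properly paired fragment once via its first mate. The function name insert_size_median is hypothetical, and samtools stats or a dedicated QC tool gives a fuller distribution.

```shell
# Median absolute template length (TLEN) from SAM text on stdin.
# Counts each fragment once: primary, properly paired, first-in-pair records.
insert_size_median() {
  awk -F '\t' '
    /^@/ { next }                                  # skip header lines
    {
      flag = $2
      proper = int(flag / 2) % 2                   # bit 0x2: properly paired
      first  = int(flag / 64) % 2                  # bit 0x40: first in pair
      second = int(flag / 256) % 2                 # bit 0x100: secondary
      suppl  = int(flag / 2048) % 2                # bit 0x800: supplementary
      tlen = ($9 < 0) ? -$9 : $9
      if (proper && first && !second && !suppl && tlen > 0) print tlen
    }' | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Example call:
# samtools view sample.sorted.bam | insert_size_median
```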

10. Read groups and sample metadata

Read groups store sample, library, platform and run information inside BAM files. They are often required for variant-calling workflows and are useful for tracking batches and lanes.

Field | Meaning | Example
ID | Read group identifier, often lane or run specific. | run1_lane1
SM | Sample name. | S1
LB | Library identifier. | S1_lib1
PL | Sequencing platform. | ILLUMINA
PU | Platform unit, often flow cell and lane. | flowcell.lane
BWA alignment with read group
bwa mem -t 16 \
  -R '@RG\tID:S1_lane1\tSM:S1\tLB:S1_lib1\tPL:ILLUMINA\tPU:flowcell1.lane1' \
  reference/genome.fa \
  data/fastq/S1_R1.fastq.gz \
  data/fastq/S1_R2.fastq.gz | \
  samtools sort -@ 8 -o results/alignments/S1.sorted.bam
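
Instead of typing flow cell and lane by hand, the read-group string can often be derived from the first FASTQ header. The sketch below assumes Illumina-style headers of the form @instrument:run:flowcell:lane:tile:x:y; headers rewritten by archives or other platforms will not fit this pattern. The function name rg_from_fastq_header and the example header are hypothetical.

```shell
# Build a bwa-style @RG string from an Illumina FASTQ header.
# Arguments: header line, sample name, library name.
rg_from_fastq_header() {
  header=$1; sample=$2; library=$3
  # Fields 3 and 4 of the colon-separated part before the first space
  # are assumed to be flow cell ID and lane.
  flowcell=$(printf '%s\n' "$header" | cut -d ' ' -f 1 | cut -d ':' -f 3)
  lane=$(printf '%s\n' "$header" | cut -d ' ' -f 1 | cut -d ':' -f 4)
  # \\t prints a literal backslash-t, which bwa mem -R converts to a tab.
  printf '@RG\\tID:%s.%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\\tPU:%s.%s\n' \
    "$flowcell" "$lane" "$sample" "$library" "$flowcell" "$lane"
}

# Demo with a made-up header:
rg_from_fastq_header '@A00123:45:HXXXXDSXX:2:1101:1000:1000 1:N:0:ACGT' S1 S1_lib1

# In practice the header would come from the file itself, for example:
# header=$(gzip -dc data/fastq/S1_R1.fastq.gz | head -n 1)
```

The resulting string can be passed to bwa mem with -R in place of the hand-written one shown above.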

11. SAM, BAM and CRAM post-processing

Most aligners produce SAM or BAM output. Post-processing usually includes sorting, indexing and sometimes conversion to CRAM.

Step | Command example | Purpose
Sort BAM | samtools sort -o sample.sorted.bam sample.bam | Order alignments by genomic coordinate.
Index BAM | samtools index sample.sorted.bam | Enable fast regional access and genome-browser viewing.
View header | samtools view -H sample.sorted.bam | Inspect reference contigs, read groups and program records.
Convert to CRAM | samtools view -C -T reference.fa -o sample.cram sample.sorted.bam | Reduce storage using reference-based compression.
Avoid storing huge uncompressed SAM files unless needed temporarily. Pipe aligner output directly into sorting or conversion whenever possible.

12. Filtering alignments

Filtering removes reads that are unmapped, low quality, secondary, supplementary, duplicated or otherwise unsuitable for a specific analysis. Filtering choices must match the biological question.

Common SAMtools filtering examples
# Keep mapped reads only
samtools view -b -F 4 sample.sorted.bam > sample.mapped.bam

# Keep reads with mapping quality at least 30
samtools view -b -q 30 sample.sorted.bam > sample.mapq30.bam

# Exclude unmapped, secondary and supplementary alignments
samtools view -b -F 2308 sample.sorted.bam > sample.primary_mapped.bam

# Index filtered BAM
samtools index sample.primary_mapped.bam
Filtering can change coverage, peak calls, variant calls and expression estimates. Always document filters and avoid applying generic filters without understanding downstream effects.
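
The value 2308 used with -F above is the sum of three FLAG bits: unmapped (4), secondary (256) and supplementary (2048). To make such filters easier to document, a mask can be built from names rather than written as a bare number. The helper name flag_mask is hypothetical; the bit values follow the SAM specification.

```shell
# Combine named SAM FLAG bits into a single integer for samtools view -F / -f.
flag_mask() {
  mask=0
  for name in "$@"; do
    case $name in
      unmapped)      bit=4 ;;
      secondary)     bit=256 ;;
      qc_fail)       bit=512 ;;
      duplicate)     bit=1024 ;;
      supplementary) bit=2048 ;;
      *) echo "unknown flag name: $name" >&2; return 1 ;;
    esac
    mask=$(( mask | bit ))
  done
  echo "$mask"
}

flag_mask unmapped secondary supplementary             # the mask used above
flag_mask unmapped secondary supplementary duplicate   # additionally excludes marked duplicates
```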

13. Duplicate marking and removal

Duplicate marking identifies reads that appear to originate from the same original DNA fragment. Many DNA-seq workflows mark duplicates before variant calling, but duplicate handling differs by assay.

Assay | Typical duplicate handling | Notes
WGS / WES | Mark duplicates; variant callers may ignore or model them. | High duplicate rate can reduce effective coverage.
Targeted panels | Often mark duplicates, but UMI-aware workflows may be required. | Deep panels need assay-specific duplicate strategy.
RNA-seq | Usually not removed for gene-expression counting. | High duplication can reflect high expression rather than PCR artefacts.
ChIP-seq / ATAC-seq | Often marked or filtered depending on pipeline and depth. | Interpret duplication with library complexity and enrichment metrics.
UMI workflows | Use UMI-aware consensus or deduplication. | Coordinate-only duplicate marking can be inappropriate.
Mark duplicates with samtools markdup
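# fixmate needs name-sorted input; -m adds the mate score tags that markdup uses
samtools sort -n -o sample.name_sorted.bam sample.sorted.bam
samtools fixmate -m sample.name_sorted.bam sample.fixmate.bam
samtools sort -o sample.position_sorted.bam sample.fixmate.bam
samtools markdup sample.position_sorted.bam sample.markdup.bam
samtools index sample.markdup.bam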

14. Alignment quality control

Alignment QC evaluates whether mapped reads behave as expected. These metrics often reveal contamination, wrong reference, poor library quality, low complexity or sample outliers.

Mapping rate: Fraction of reads aligned to the reference.
Proper pairing: Fraction of paired-end reads aligned in expected orientation and distance.
Duplicate rate: Fraction of reads marked as duplicates.
Insert-size distribution: Fragment length profile for paired-end libraries.
Chromosome distribution: Read distribution across reference contigs, mitochondria, decoys and sex chromosomes.
Coverage: Depth and breadth across genome or target regions.
Alignment QC commands
mkdir -p results/qc/alignment

samtools flagstat sample.sorted.bam \
  > results/qc/alignment/sample.flagstat.txt

samtools stats sample.sorted.bam \
  > results/qc/alignment/sample.samtools.stats.txt

samtools idxstats sample.sorted.bam \
  > results/qc/alignment/sample.idxstats.txt

multiqc results/qc/alignment \
  --outdir results/qc/multiqc_alignment
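
The flagstat text files written above can be reduced to a couple of headline numbers per sample. The sketch below computes the mapped and properly paired percentages from the QC-passed counts. The function name flagstat_rates and the output labels are assumptions; the parsing relies on the usual "N + M description" layout of flagstat lines.

```shell
# Extract mapping and proper-pair percentages from samtools flagstat output.
# Uses the QC-passed count (first number) on each line.
flagstat_rates() {
  awk '
    $4 == "in" && $5 == "total"        { total  = $1 }
    $4 == "mapped"                     { mapped = $1 }   # not the "primary mapped" line
    $4 == "paired" && $5 == "in"       { paired = $1 }
    $4 == "properly" && $5 == "paired" { proper = $1 }
    END {
      if (total > 0)  printf "mapped_pct\t%.2f\n", 100 * mapped / total
      if (paired > 0) printf "proper_pair_pct\t%.2f\n", 100 * proper / paired
    }' "$1"
}

# Example call:
# flagstat_rates results/qc/alignment/sample.flagstat.txt
```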

15. Assay-specific alignment notes

Assay | Alignment considerations | Downstream impact
WGS | Use full reference, read groups, duplicate marking and coverage QC. | Variant calling, CNV and SV detection depend on uniform alignment and coverage.
WES / panels | Use target BED files for coverage metrics; watch off-target reads. | Sensitivity depends on depth and uniformity across targets.
RNA-seq | Use splice-aware aligners or transcript quantification; check strandedness. | Incorrect strandedness or annotation mismatch reduces assigned reads.
ChIP-seq | Handle duplicates and multi-mapping carefully; remove blacklisted regions if appropriate. | Peak calling and FRiP metrics depend on clean alignment.
ATAC-seq | Check mitochondrial reads, Tn5 shift conventions and fragment sizes. | Open-chromatin peaks and TSS enrichment depend on correct processing.
Small RNA-seq | Adapter trimming is critical; reads are short and can map to multiple loci. | Reference choice and multi-mapping rules strongly influence counts.

16. Troubleshooting common alignment problems

Problem | Possible cause | What to check
Low mapping rate | Wrong organism, wrong reference, contamination, poor read quality or adapters. | FastQC, taxonomic classification, reference assembly and trimming report.
Low proper pairing | Wrong read pairing, unexpected insert size, contamination or structural variation. | FASTQ pairing, insert-size metrics and library information.
High mitochondrial reads | Sample degradation, ATAC-seq nuclei preparation issue or expected tissue biology. | Assay type, library prep and chromosome distribution.
High duplicate rate | Low input, PCR over-amplification, over-sequencing or targeted assay design. | Library complexity, UMI status and assay expectations.
Chromosome names mismatch | Mixing references with chr1 vs 1 naming conventions. | Reference FASTA, GTF/BED, VCF and BAM headers.
RNA-seq has low assigned reads | Wrong strandedness, wrong annotation, wrong organism or poor RNA quality. | Strandedness check, annotation source and gene body coverage.
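
A chromosome-name mismatch (chr1 vs 1) is easy to detect before running anything expensive. As a sketch, the contig names in the FASTA index (.fai, from samtools faidx) can be compared with the first column of a GTF or BED file. The function name contig_overlap and the output labels are hypothetical.

```shell
# Compare contig names in a FASTA index (.fai) with column 1 of a GTF/BED file.
# Reports how many names are shared and how many occur only in the annotation.
contig_overlap() {
  fai=$1; annotation=$2
  a=$(mktemp); b=$(mktemp)
  cut -f 1 "$fai" | sort -u > "$a"
  grep -v '^#' "$annotation" | cut -f 1 | sort -u > "$b"
  echo "shared: $(comm -12 "$a" "$b" | wc -l | tr -d ' ')"
  echo "annotation_only: $(comm -13 "$a" "$b" | wc -l | tr -d ' ')"
  rm -f "$a" "$b"
}

# Example call:
# contig_overlap reference/genome.fa.fai reference/annotation.gtf
```

A shared count of zero, or a large annotation_only count, suggests that the two files use different naming conventions or different assemblies.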

17. Example short-read alignment workflow

This example shows a compact DNA-seq style workflow. It should be adapted to the project, reference, assay and validation requirements.

Project folders
mkdir -p reference data/fastq results/alignments results/qc/alignment results/logs
Build reference indexes
samtools faidx reference/genome.fa
bwa index reference/genome.fa
Align, sort and index one sample
sample="S1"

bwa mem -t 16 \
  -R "@RG\tID:${sample}\tSM:${sample}\tLB:${sample}_lib1\tPL:ILLUMINA" \
  reference/genome.fa \
  "data/fastq/${sample}_R1.fastq.gz" \
  "data/fastq/${sample}_R2.fastq.gz" \
  2> "results/logs/${sample}.bwa.log" | \
  samtools sort -@ 8 -o "results/alignments/${sample}.sorted.bam"

samtools index "results/alignments/${sample}.sorted.bam"
Collect alignment QC
samtools flagstat "results/alignments/${sample}.sorted.bam" \
  > "results/qc/alignment/${sample}.flagstat.txt"

samtools stats "results/alignments/${sample}.sorted.bam" \
  > "results/qc/alignment/${sample}.samtools.stats.txt"

samtools idxstats "results/alignments/${sample}.sorted.bam" \
  > "results/qc/alignment/${sample}.idxstats.txt"
Loop over samples from a sample sheet
tail -n +2 data/samples.tsv | while IFS=$'\t' read -r sample_id group fastq_1 fastq_2; do
  echo "Aligning ${sample_id}"

  bwa mem -t 16 \
    -R "@RG\tID:${sample_id}\tSM:${sample_id}\tLB:${sample_id}_lib1\tPL:ILLUMINA" \
    reference/genome.fa \
    "data/fastq/${fastq_1}" \
    "data/fastq/${fastq_2}" \
    2> "results/logs/${sample_id}.bwa.log" | \
    samtools sort -@ 8 -o "results/alignments/${sample_id}.sorted.bam"

  samtools index "results/alignments/${sample_id}.sorted.bam"
done

18. Alignment deliverables and reporting

A complete alignment deliverable should include analysis-ready alignment files and enough documentation for another analyst to reproduce the results.

Recommended deliverables

  • Sorted and indexed BAM or CRAM files.
  • Alignment logs from the aligner.
  • Alignment QC metrics: mapping rate, proper pairing, duplicates, insert-size and chromosome distribution.
  • Reference genome FASTA source, index information and checksum.
  • Sample sheet and read-group definitions.
  • Software versions and exact commands or workflow configuration.
  • MultiQC summary report, where available.
  • Notes about failed or flagged samples.
For regulated or clinical-genomics-style projects, alignment is part of the validated analysis chain. Tool versions, parameters and reference data should be locked and auditable.

19. Short-read alignment cheat sheet

Task | Common command or tool | Purpose
FASTQ QC | fastqc, multiqc | Inspect raw reads before alignment.
Adapter trimming | fastp, cutadapt | Remove adapters or low-quality bases when needed.
DNA-seq alignment | bwa mem, bwa-mem2 mem | Map short DNA reads to a genome.
ChIP/ATAC alignment | bowtie2, bwa | Map reads for peak calling or signal tracks.
RNA-seq alignment | STAR, hisat2 | Splice-aware mapping.
Sort BAM | samtools sort | Prepare BAM for indexing and downstream tools.
Index BAM | samtools index | Enable fast random access.
Alignment QC | samtools flagstat/stats/idxstats, MultiQC | Summarize alignment quality.
Coverage | mosdepth, bedtools coverage | Measure depth and breadth of coverage.

Frequently asked questions

What is short-read alignment?

Short-read alignment is the process of mapping sequencing reads, usually tens to a few hundred bases long, to a reference genome, transcriptome or other reference sequence. The output is commonly a SAM, BAM or CRAM alignment file.

Which aligner should I use for DNA-seq?

For many short-read DNA-seq projects, BWA-MEM, BWA-MEM2, Bowtie2 or minimap2 with its short-read preset (-ax sr) are common choices. The best choice depends on read length, organism, reference, downstream variant calling and institutional validation requirements.

Which aligner should I use for RNA-seq?

RNA-seq reads usually require splice-aware alignment or transcript-level quantification. STAR and HISAT2 are common splice-aware aligners, while Salmon and Kallisto are common transcript quantification tools when full genomic BAM files are not required.

Do I always need to trim reads before alignment?

No. Trimming should be based on FASTQ quality-control results and the assay. Adapter contamination, low-quality tails or primer sequences often justify trimming, but unnecessary aggressive trimming can reduce mappability.

What is the difference between SAM, BAM and CRAM?

SAM is a text alignment format, BAM is its compressed binary equivalent, and CRAM is a reference-based compressed alignment format. BAM is widely used for analysis, while CRAM can reduce storage requirements if the reference is managed carefully.

What is a good mapping rate?

There is no universal mapping-rate threshold. Expected mapping rate depends on organism, library type, read quality, contamination, reference quality and assay. The value should be interpreted together with duplication, coverage, insert size and sample metadata.

Should PCR duplicates be removed after alignment?

It depends on the assay. Duplicate marking is standard in many DNA-seq workflows, but removal can be inappropriate for RNA-seq, amplicon sequencing, single-cell data or very high-depth targeted assays unless the method specifically requires it.

Can AI help with short-read alignment workflows?

AI can help summarize alignment metrics, detect unusual samples, triage QC reports, generate documentation and assist with workflow configuration. The alignment itself should remain reproducible, versioned and based on explicit parameters.