Bioinformatics Scripts

Read alignment with Bowtie, Bowtie2, BWA, HISAT2 and STAR.

This tutorial summarizes common Linux commands for building reference indexes and aligning NGS reads with short-read and RNA-seq aligners. It covers Bowtie and Bowtie2 for small RNA and general short reads, BWA for DNA-seq, HISAT2 for splice-aware RNA-seq, and STAR for high-performance transcriptome and genome alignment.

Overview

Read alignment maps sequencing reads to a reference genome, transcriptome, custom FASTA file, or RNA-class reference. The output is usually a SAM or BAM file that can be used for read counting, variant calling, peak calling, transcript quantification, or downstream quality control.

The original SciBerg page provides command examples for Bowtie, Bowtie2, BWA, HISAT2, and STAR. This updated page preserves the core commands and adds modern installation options, workflow context, validation steps, and reusable examples.

Input       FASTQ files and a matching reference FASTA or prebuilt reference index.
Processing  Build an index, run the aligner with suitable parameters, and produce SAM or BAM output.
Output      SAM/BAM alignment files, mapping summaries, log files, and downstream analysis-ready alignments.
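
Before building an index or starting a long alignment, it can help to confirm that the inputs look sane. The sketch below counts FASTQ records and FASTA sequences with standard Unix tools. The function names are illustrative and not part of any aligner, and the FASTQ check assumes the common four-line record layout with newline-terminated files.

```bash
# Sketch: basic input checks before alignment (function names are hypothetical).

# Print a file to stdout, decompressing .gz files on the fly.
read_maybe_gz() {
  case "$1" in
    *.gz) gzip -dc -- "$1" ;;
    *)    cat -- "$1" ;;
  esac
}

# Count FASTQ records, assuming four lines per record.
# Returns non-zero if the line count is not a multiple of 4.
fastq_read_count() {
  local lines
  lines=$(read_maybe_gz "$1" | wc -l | tr -d ' ')
  if (( lines % 4 != 0 )); then
    echo "error: $1 has $lines lines, not a multiple of 4" >&2
    return 1
  fi
  echo $(( lines / 4 ))
}

# Count FASTA sequences by counting header lines that start with ">".
fasta_sequence_count() {
  read_maybe_gz "$1" | grep -c '^>' || true
}
```

For example, `fastq_read_count INPUT.fastq.gz` prints the number of reads, and `fasta_sequence_count reference.fa` prints the number of reference sequences.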

Choosing an aligner

Different aligners are optimized for different read lengths, error models, and biological assays. The table below gives a practical starting point.

Aligner            Typical use
Bowtie             Short reads and small RNA fragments where very strict mismatch control is needed.
Bowtie2            General-purpose short-read alignment against genome, transcriptome, or custom FASTA references.
BWA-MEM            Common choice for DNA-seq, WGS, WES, and longer short reads mapped to a genome.
HISAT2             Splice-aware RNA-seq aligner suitable for genome-based transcriptome analysis.
STAR               Fast splice-aware RNA-seq aligner for genome and transcriptome outputs, requiring substantial memory for large genomes.
Custom references  For RNA-class, contaminant, viral, or small custom databases, Bowtie or Bowtie2 is often convenient.

Install the aligners

The original page uses manual binary and source installations. For modern reproducible analysis, Conda or Mamba environments are often easier to manage.

Recommended: install with Mamba or Conda

mamba create -n ngs-align -c conda-forge -c bioconda \
  bowtie bowtie2 bwa hisat2 star samtools

mamba activate ngs-align

bowtie --version
bowtie2 --version
bwa 2>&1 | grep -i '^Version'   # bwa has no --version flag; the usage text includes the version
hisat2 --version
STAR --version
samtools --version

Manual installation examples from the original workflow

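# Bowtie and Bowtie2 binary downloads, version examples from the original page.
# -O sets the local filename; without it wget saves each file as "download".
wget -O bowtie-1.2.3-linux-x86_64.zip https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.2.3/bowtie-1.2.3-linux-x86_64.zip/download
wget -O bowtie2-2.3.5.1-linux-x86_64.zip https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.5.1/bowtie2-2.3.5.1-linux-x86_64.zip/download
unzip bowtie-1.2.3-linux-x86_64.zip
unzip bowtie2-2.3.5.1-linux-x86_64.zip

# BWA source example
wget -O bwa-0.7.17.tar.bz2 https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.17.tar.bz2/download
tar xvjf bwa-0.7.17.tar.bz2
cd bwa-0.7.17
make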

Version note

The original SciBerg page lists historical versions such as Bowtie 1.2.3, Bowtie2 2.3.5.1, BWA 0.7.17, HISAT2 2.1.0, and STAR 2.6.1a. For new projects, use current stable releases and record exact versions in the methods section.
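
One way to record exact versions is to capture them in a small tab-separated file alongside the alignment logs. The following is a minimal sketch. The helper names `tool_version` and `record_versions` are hypothetical, and the pattern used for each tool is an assumption about which output line carries the version.

```bash
# Sketch: write "tool<TAB>version line" for each aligner (helper names are hypothetical).

# Usage: tool_version NAME PATTERN COMMAND [ARGS...]
# Prints the first output line matching PATTERN, or "not found" if COMMAND is absent.
tool_version() {
  local name=$1 pattern=$2
  shift 2
  if ! command -v "$1" >/dev/null 2>&1; then
    printf '%s\tnot found\n' "$name"
    return 0
  fi
  # Some tools (for example bwa) print usage on stderr and exit non-zero,
  # so merge both streams and ignore the exit status.
  local line
  line=$("$@" 2>&1 | grep -E -- "$pattern" | head -n 1) || true
  printf '%s\t%s\n' "$name" "$line"
}

record_versions() {
  tool_version bowtie   '.'        bowtie --version
  tool_version bowtie2  '.'        bowtie2 --version
  tool_version bwa      '^Version' bwa
  tool_version hisat2   '.'        hisat2 --version
  tool_version STAR     '.'        STAR --version
  tool_version samtools '.'        samtools --version
}
```

A typical call would be `record_versions > tool_versions.tsv`, run inside the same environment that performs the alignment.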

Bowtie and Bowtie2 alignment

Bowtie and Bowtie2 are convenient for mapping reads to genomes, transcriptomes, or custom FASTA references. Bowtie is especially useful when strict mismatch control is needed for short RNA fragments.

Build Bowtie and Bowtie2 indexes

# Bowtie index
bowtie-build reference.fa reference

# Bowtie2 index
bowtie2-build reference.fa reference

Bowtie mapping with mismatch control

# 0 mismatches: preferred for small RNA fragments of approximately 15–25 nt
bowtie -p 8 -v 0 reference INPUT.fastq -S INPUT_over_reference.sam

# 1 mismatch: often acceptable for RNA fragments of approximately 25–100 nt
bowtie -p 8 -v 1 reference INPUT.fastq -S INPUT_over_reference.sam

# 2 mismatches: often acceptable for longer fragments, depending on project goals
bowtie -p 8 -v 2 reference INPUT.fastq -S INPUT_over_reference.sam
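
The length ranges in the comments above can be turned into a small helper that suggests a `-v` value from the mean read length of a FASTQ file. This is only a sketch of that rule of thumb. The function names are hypothetical, a length of exactly 25 nt is assigned to the 0-mismatch group by assumption, and project goals may call for a different choice.

```bash
# Sketch: suggest a Bowtie -v value from read length (function names are hypothetical).

# Mean read length of a FASTQ file (plain or .gz), rounded to the nearest integer.
mean_read_length() {
  local fq=$1
  {
    if [[ $fq == *.gz ]]; then gzip -dc -- "$fq"; else cat -- "$fq"; fi
  } | awk 'NR % 4 == 2 { total += length($0); n++ }
           END { if (n == 0) exit 1; printf "%d\n", total / n + 0.5 }'
}

# Map a read length to a -v value using the ranges quoted in the commands above.
bowtie_v_for_length() {
  local len=$1
  if   (( len <= 25 ));  then echo 0   # small RNA fragments
  elif (( len <= 100 )); then echo 1   # mid-length RNA fragments
  else                        echo 2   # longer fragments
  fi
}
```

For example, `bowtie_v_for_length "$(mean_read_length INPUT.fastq)"` prints the suggested mismatch setting, which can then be passed to `bowtie -v`.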

Bowtie2 mapping

# Single-end reads
bowtie2 \
  -q \
  -p 8 \
  -x reference \
  -U INPUT.fastq \
  -S INPUT_over_reference.sam

# Paired-end reads
bowtie2 \
  -q \
  -p 8 \
  -x reference \
  -1 INPUT_R1.fastq.gz \
  -2 INPUT_R2.fastq.gz \
  -S INPUT_over_reference.sam
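
When many paired-end samples follow the `_R1` / `_R2` naming used above, the mate file can be derived from the first file instead of being typed by hand. The helper below is a sketch under that naming assumption. `mate2_for` is a hypothetical name, and other naming schemes (for example `_1` / `_2`) would need a different pattern.

```bash
# Sketch: derive the R2 filename from an R1 filename (hypothetical helper).
# Swaps the last "_R1" in the file name that is followed by "." or "_".
mate2_for() {
  local r1=$1 r2
  r2=$(printf '%s\n' "$r1" | sed -E 's/^(.*)_R1([._][^/]*)$/\1_R2\2/')
  if [[ $r2 == "$r1" ]]; then
    echo "error: no _R1 token found in $r1" >&2
    return 1
  fi
  printf '%s\n' "$r2"
}
```

In a loop over `*_R1.fastq.gz` files, the result can be passed to the `-2` option, for example `-1 "$r1" -2 "$(mate2_for "$r1")"`.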

Index prefix detail

In commands such as bowtie2 -x reference, the value after -x is the index prefix, not necessarily the FASTA filename. Keep index names simple and consistent.
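
The distinction between FASTA filename and index prefix can be made explicit in scripts. The sketch below derives a simple prefix from a FASTA path and checks whether Bowtie2 index files already exist for a prefix, in either the small (`.bt2`) or large (`.bt2l`) format. Both function names are hypothetical.

```bash
# Sketch: index prefix handling for Bowtie2 (function names are hypothetical).

# Suggest an index prefix from a FASTA path by dropping the directory,
# an optional .gz suffix, and a .fa, .fasta, or .fna extension.
index_prefix_from_fasta() {
  local base
  base=$(basename "$1")
  base=${base%.gz}
  base=${base%.fa}
  base=${base%.fasta}
  base=${base%.fna}
  printf '%s\n' "$base"
}

# Succeed if a Bowtie2 index with this prefix exists (small or large format).
bowtie2_index_exists() {
  [[ -f "$1.1.bt2" || -f "$1.1.bt2l" ]]
}
```

For example, `bowtie2_index_exists reference || bowtie2-build reference.fa reference` builds the index only when it is missing.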

BWA alignment

BWA-MEM is commonly used for DNA sequencing data such as WGS, WES, targeted DNA panels, and other genomic reads.

Build a BWA index

bwa index reference.fa

Map single-end reads with BWA-MEM

bwa mem \
  -t 8 \
  reference.fa \
  INPUT.fastq \
  > INPUT_over_reference.sam

Map paired-end reads with BWA-MEM

bwa mem \
  -t 8 \
  reference.fa \
  INPUT_R1.fastq.gz \
  INPUT_R2.fastq.gz \
  > INPUT_over_reference.sam

Convert SAM to sorted BAM

# samtools sort reads SAM input directly, so a separate "view -bS" step is not needed
samtools sort -@ 8 -o INPUT_over_reference.sorted.bam INPUT_over_reference.sam

samtools index INPUT_over_reference.sorted.bam

HISAT2 alignment

HISAT2 is a splice-aware aligner for RNA-seq data. It can use genome, SNP-aware, and transcriptome-aware index configurations. The original SciBerg page notes that building human GRCh38 transcriptome-aware indexes can require very large memory and suggests using prebuilt indexes.

Download prebuilt HISAT2 indexes

# Historical URLs from the original SciBerg workflow
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
tar -xzf grch38.tar.gz

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_snp.tar.gz
tar -xzf grch38_snp.tar.gz

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_tran.tar.gz
tar -xzf grch38_tran.tar.gz

Map single-end RNA-seq reads with HISAT2

# -x takes the index basename (here the prebuilt grch38_tran index), not a FASTA file
hisat2 \
  -q \
  -p 8 \
  --dta \
  -x grch38_tran/genome_tran \
  -U INPUT.fastq \
  -S INPUT.sam

Paired-end RNA-seq example

hisat2 \
  -q \
  -p 8 \
  --dta \
  -x grch38_tran/genome_tran \
  -1 INPUT_R1.fastq.gz \
  -2 INPUT_R2.fastq.gz \
  -S INPUT.sam

Transcriptome references

For direct alignment to transcriptome FASTA references rather than genome indexes, Bowtie or Bowtie2 can be more appropriate. For splice-aware genome alignment, use HISAT2 or STAR.

STAR alignment

STAR is a high-performance aligner widely used for RNA-seq. It supports genome alignment, sorted BAM output, splice-junction-aware alignment, and transcriptome output for downstream quantification.

Generate STAR genome indexes without annotation

STAR \
  --runThreadN 8 \
  --runMode genomeGenerate \
  --genomeDir folder_with_STAR_indexes_no_gtf/ \
  --genomeFastaFiles folder_with_reference_genome/genome.fa

Generate STAR genome indexes with annotation GTF

STAR \
  --runThreadN 8 \
  --runMode genomeGenerate \
  --sjdbGTFfile folder_with_gtf_file/genome_annotation.gtf \
  --genomeDir folder_with_STAR_indexes_with_gtf/ \
  --genomeFastaFiles folder_with_reference_genome/genome.fa

Generate STAR indexes for transcriptome FASTA

STAR \
  --runThreadN 8 \
  --limitGenomeGenerateRAM 60000000000 \
  --runMode genomeGenerate \
  --genomeDir folder_with_STAR_indexes_for_transcriptome/ \
  --genomeFastaFiles folder_with_reference_transcriptome/transcriptome.fa
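
Transcriptome FASTA files contain many short sequences, and the STAR manual describes a rule of thumb for references with a large number of sequences: set `--genomeChrBinNbits` to roughly min(18, log2(max(GenomeLength / NumberOfReferences, ReadLength))) to reduce memory use. The sketch below computes that value with awk. The function name is hypothetical, the default read length of 100 is an assumption, and the result should be checked against the manual for the installed STAR version.

```bash
# Sketch: suggest a --genomeChrBinNbits value for a FASTA file (hypothetical helper).
# Usage: star_chr_bin_nbits FASTA [READ_LENGTH]
star_chr_bin_nbits() {
  local fasta=$1 read_length=${2:-100}
  awk -v rl="$read_length" '
    /^>/ { nref++; next }          # count reference sequences
    { total += length($0) }        # sum sequence length over all other lines
    END {
      if (nref == 0) exit 1
      x = total / nref             # mean reference length
      if (x < rl) x = rl           # max(mean reference length, read length)
      bits = int(log(x) / log(2) + 0.5)   # log2, rounded to the nearest integer
      if (bits > 18) bits = 18
      print bits
    }' "$fasta"
}
```

The printed value can be added to the index command above, for example `--genomeChrBinNbits "$(star_chr_bin_nbits transcriptome.fa 100)"`.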

Map reads with STAR and produce SAM output

STAR \
  --runThreadN 8 \
  --genomeDir folder_with_STAR_indexes_no_gtf/ \
  --sjdbGTFfile folder_with_gtf_file/genome_annotation.gtf \
  --readFilesIn INPUT.fastq \
  --outFileNamePrefix INPUT

Map reads and produce BAM output

# Unsorted BAM
STAR \
  --runThreadN 8 \
  --genomeDir folder_with_STAR_indexes_no_gtf/ \
  --sjdbGTFfile folder_with_gtf_file/genome_annotation.gtf \
  --readFilesIn INPUT.fastq \
  --outSAMtype BAM Unsorted \
  --outFileNamePrefix INPUT

# Coordinate-sorted BAM
STAR \
  --runThreadN 8 \
  --genomeDir folder_with_STAR_indexes_no_gtf/ \
  --sjdbGTFfile folder_with_gtf_file/genome_annotation.gtf \
  --readFilesIn INPUT.fastq \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix INPUT

STAR transcriptome output

# Genome alignment plus transcriptome BAM output
STAR \
  --runThreadN 8 \
  --genomeDir folder_with_STAR_indexes_no_gtf/ \
  --sjdbGTFfile folder_with_gtf_file/genome_annotation.gtf \
  --readFilesIn INPUT.fastq \
  --quantMode TranscriptomeSAM \
  --outFileNamePrefix INPUT

# Direct alignment to a transcriptome reference index
STAR \
  --runThreadN 8 \
  --genomeDir folder_with_STAR_indexes_for_transcriptome/ \
  --readFilesIn INPUT.fastq \
  --outFileNamePrefix INPUT

Compressed FASTQ files with STAR

For .fastq.gz input files, add --readFilesCommand zcat to STAR commands so reads are decompressed on the fly.
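
In scripts that handle both plain and compressed FASTQ files, the option can be chosen from the file extension. The helper below is a minimal sketch. The function name is hypothetical, and it assumes `zcat` and `bzcat` are available on the system (on macOS, `gunzip -c` is often used in place of `zcat`).

```bash
# Sketch: pick the STAR decompression option from the file extension (hypothetical helper).
# Prints nothing for uncompressed input.
star_read_files_command() {
  case "$1" in
    *.gz)  echo "--readFilesCommand zcat" ;;
    *.bz2) echo "--readFilesCommand bzcat" ;;
    *)     : ;;
  esac
}
```

The output is meant to be expanded unquoted so that it splits into option and value, for example `STAR --readFilesIn "$fq" $(star_read_files_command "$fq") ...`.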

Reusable alignment workflow skeleton

The script below shows a simple Bowtie2 alignment workflow that builds an index if needed, maps single-end reads, converts SAM to sorted BAM, and records mapping statistics.

#!/usr/bin/env bash
set -euo pipefail

THREADS=8
REFERENCE="reference.fa"
INDEX_PREFIX="reference_index/reference"
FASTQ="INPUT.fastq.gz"
OUT_PREFIX="INPUT_over_reference"

mkdir -p "$(dirname "$INDEX_PREFIX")" alignments logs

# Build Bowtie2 index if missing
if [[ ! -f "${INDEX_PREFIX}.1.bt2" && ! -f "${INDEX_PREFIX}.1.bt2l" ]]; then
  bowtie2-build "$REFERENCE" "$INDEX_PREFIX"
fi

# Align reads
bowtie2 \
  -q \
  -p "$THREADS" \
  -x "$INDEX_PREFIX" \
  -U "$FASTQ" \
  -S "alignments/${OUT_PREFIX}.sam" \
  2> "logs/${OUT_PREFIX}.bowtie2.log"

# Convert SAM to sorted BAM
samtools sort -@ "$THREADS" \
  -o "alignments/${OUT_PREFIX}.sorted.bam" \
  "alignments/${OUT_PREFIX}.sam"

samtools index "alignments/${OUT_PREFIX}.sorted.bam"

# Optional: remove large intermediate SAM after validation
# rm "alignments/${OUT_PREFIX}.sam"

samtools flagstat "alignments/${OUT_PREFIX}.sorted.bam" \
  > "logs/${OUT_PREFIX}.flagstat.txt"

echo "Done. Alignment files are in alignments/ and logs are in logs/"
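
The Bowtie2 log written by the script ends with a line such as `94.04% overall alignment rate`. A short follow-up check can extract that number and flag samples that fall below a chosen threshold. The sketch below assumes the standard Bowtie2 summary format. The function names and the default threshold of 70 are arbitrary illustrations, not recommendations.

```bash
# Sketch: read the overall alignment rate from a Bowtie2 log (hypothetical helpers).

# Print the rate as a number without the % sign; fail if the line is missing.
bowtie2_overall_rate() {
  awk '/overall alignment rate/ { sub(/%/, "", $1); print $1; found = 1 }
       END { exit !found }' "$1"
}

# Succeed only if the rate is at or above MIN (default 70, an arbitrary example).
# Returns 2 if the log has no summary line.
check_alignment_rate() {
  local log=$1 min=${2:-70} rate
  rate=$(bowtie2_overall_rate "$log") || return 2
  awk -v r="$rate" -v m="$min" 'BEGIN { exit !(r + 0 >= m + 0) }'
}
```

For example, `check_alignment_rate logs/INPUT_over_reference.bowtie2.log 80 || echo "low mapping rate"` prints a warning when the sample maps poorly.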

Next steps after read alignment

After alignment, inspect mapping statistics and continue with the assay-specific downstream analysis.

  1. Review aligner logs, mapping rate, multimapping rate, and uniquely mapped read counts.
  2. Convert SAM to BAM, then sort and index the BAM files if required.
  3. Run quality checks such as samtools flagstat, samtools idxstats, or project-specific QC tools.
  4. Proceed to read counting, variant calling, peak calling, transcript quantification, or visualization.
  5. Document aligner name, version, index source, reference release, command line, and parameters.
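
To make the mapping statistics in step 1 concrete, the sketch below tallies SAM records by FLAG bits using only awk. It is a rough illustration of what tools such as samtools flagstat report, not a replacement for them. The function name is hypothetical, and the input is assumed to be an uncompressed SAM file.

```bash
# Sketch: summarize SAM records by FLAG bits without samtools (hypothetical helper).
sam_flag_summary() {
  awk -F '\t' '
    /^@/ { next }                                   # skip header lines
    {
      flag = $2
      total++
      if (int(flag / 4) % 2)    unmapped++          # 0x4: read unmapped
      if (int(flag / 256) % 2)  secondary++         # 0x100: secondary alignment
      if (int(flag / 2048) % 2) supplementary++     # 0x800: supplementary alignment
    }
    END {
      printf "records\t%d\n", total
      printf "unmapped\t%d\n", unmapped
      printf "secondary\t%d\n", secondary
      printf "supplementary\t%d\n", supplementary
    }' "$1"
}
```

Running `sam_flag_summary INPUT_over_reference.sam` prints four tab-separated counts. Records that are neither unmapped, secondary, nor supplementary correspond to primary mapped alignments.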