Bioinformatics Scripts

Manipulating SAM and BAM files with samtools.

SAM and BAM files are the standard alignment outputs of tools such as Bowtie, Bowtie2, BWA, HISAT2, and STAR. This tutorial shows how to install samtools, convert SAM to BAM, sort alignments by coordinate, create indexes, inspect headers, and generate basic mapping summaries.

Overview

samtools is one of the most widely used command-line toolkits for working with SAM, BAM, and CRAM alignment files. After read alignment, typical operations include converting SAM to BAM, sorting BAM files, indexing sorted BAM files, extracting subsets of reads, and generating mapping statistics.

The original SciBerg page provides the core installation and conversion commands. This updated page preserves those commands and adds practical examples for modern NGS workflows.

Input: SAM or BAM files produced by short-read or RNA-seq aligners.
Processing: Convert, sort, index, filter, and summarize alignment files.
Output: Sorted BAM files, BAM indexes, mapping statistics, and analysis-ready alignments.

Install samtools

The original SciBerg workflow installs samtools from source. For reproducible projects, a Conda or Mamba environment is usually the easiest option.

Recommended: install with Mamba or Conda

mamba create -n samtools-env -c conda-forge -c bioconda samtools
mamba activate samtools-env

samtools --version

Source-install example from the original SciBerg workflow

# Get the source archive
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2

# Unpack, compile, and install
tar -vxjf samtools-1.9.tar.bz2
cd samtools-1.9
make
sudo make install

Version note

The original command uses samtools 1.9 as an example. For new projects, use a current stable release and record the exact samtools version in your workflow documentation.
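One way to record the version is to write the first line of the tool's version output to a file kept with the project logs. The sketch below is a minimal example; the function name record_tool_version and the output path in the usage comment are hypothetical, not part of samtools.

```shell
#!/usr/bin/env bash
# Sketch: write the first line of "<tool> --version" to a file.
# record_tool_version is a hypothetical helper name chosen for this example.
record_tool_version() {
  local tool="$1" outfile="$2"
  if ! command -v "$tool" >/dev/null 2>&1; then
    # Leave a visible marker so a missing tool is not silently ignored.
    echo "$tool: not found on PATH" > "$outfile"
    return 1
  fi
  # The first line of the version output names the release.
  "$tool" --version 2>&1 | head -n 1 > "$outfile"
}

# Example call (assumes a logs/ directory already exists):
# record_tool_version samtools logs/samtools.version.txt
```

The same helper works for any tool that accepts --version, so aligner versions can be logged the same way.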

SAM versus BAM files

SAM is a human-readable text format. BAM is the compressed binary representation of SAM and is more efficient for storage, indexing, and downstream analysis.

SAM: Text-based alignment format. Easy to inspect but large and slow for big datasets.
BAM: Compressed binary alignment format. Preferred for most downstream workflows.
Coordinate-sorted BAM: Alignments sorted by genomic position. Required for indexing and many genome-browser workflows.
BAM index: A .bai or related index file that allows fast random access to genomic regions.
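When a file extension is missing or untrustworthy, the first bytes of the file give a rough hint about its format. The sketch below is a heuristic only: BAM is BGZF-compressed and therefore starts with the gzip magic bytes, but so does any other gzip file. The function name guess_alignment_format is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: guess an alignment file's format from its first four bytes.
# guess_alignment_format is a hypothetical helper, not a samtools command.
guess_alignment_format() {
  local file="$1" magic
  # Hex-encode the first four bytes, for example "1f8b0804" or "4352414d".
  magic="$(head -c 4 "$file" | od -An -tx1 | tr -d ' \n')"
  case "$magic" in
    1f8b*)    echo "gzip/BGZF (likely BAM)" ;;
    4352414d) echo "CRAM" ;;                       # ASCII "CRAM"
    40*)      echo "text starting with @ (likely SAM with header)" ;;
    *)        echo "unknown" ;;
  esac
}

# Example call:
# guess_alignment_format INPUT.bam
```

A headerless SAM file starts with a read name rather than @, so it is reported as unknown here.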

Keep intermediate files only when needed

SAM files can be very large. Once a sorted BAM file and logs have been validated, intermediate SAM files are often removed to save disk space.
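A cautious cleanup step checks the BAM before deleting the SAM it came from. The sketch below defaults to samtools quickcheck, which performs a fast sanity check and is not a full validation of the alignments. The function name remove_intermediate_if_valid is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: delete an intermediate SAM only after the BAM passes a check.
# remove_intermediate_if_valid is a hypothetical helper for this example.
remove_intermediate_if_valid() {
  local bam="$1" sam="$2"
  shift 2
  # Any further arguments form the check command.
  # With none given, fall back to "samtools quickcheck".
  if [ "$#" -eq 0 ]; then
    set -- samtools quickcheck
  fi
  if "$@" "$bam"; then
    rm -f -- "$sam"
    echo "removed $sam"
  else
    echo "kept $sam: $bam failed the check" >&2
    return 1
  fi
}

# Example call:
# remove_intermediate_if_valid INPUT_sorted.bam INPUT.sam
```

Passing the check command as extra arguments makes it easy to swap in a stricter test, for example a wrapper that also compares read counts.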

Convert SAM to BAM

The core conversion command from the original SciBerg page converts a SAM file into a BAM file using multiple threads.

samtools view -b -@ [insert number of threads] INPUT.sam -o INPUT.bam

Concrete example

samtools view \
  -b \
  -@ 8 \
  INPUT.sam \
  -o INPUT.bam
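The thread count in the commands above is a placeholder. One way to fill it in is to take the number of available cores and cap it. The sketch below assumes nproc or getconf is available and falls back to 1 otherwise; pick_threads and the default cap of 8 are hypothetical choices.

```shell
#!/usr/bin/env bash
# Sketch: choose a thread count for -@ from the available cores, with a cap.
# pick_threads and its default cap of 8 are hypothetical.
pick_threads() {
  local cap="${1:-8}" n
  # nproc is GNU coreutils; getconf is the fallback on other systems.
  n="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  if [ "$n" -gt "$cap" ]; then
    n="$cap"
  fi
  echo "$n"
}

# Example call:
# samtools view -b -@ "$(pick_threads 8)" INPUT.sam -o INPUT.bam
```

On shared servers a lower cap is often the polite choice.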

Inspect the BAM header

samtools view -H INPUT.bam | head

View the first few alignments

samtools view INPUT.bam | head
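For BAM files with long headers, a count of header records by type is often easier to read than the first ten lines. The sketch below reads header text from standard input, so it can sit at the end of a samtools view -H pipe. The function name summarize_header is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count SAM header records by type (@HD, @SQ, @RG, @PG, @CO).
# summarize_header is a hypothetical helper that reads header text on stdin.
summarize_header() {
  awk -F'\t' '/^@/ { n[$1]++ } END { for (t in n) print t, n[t] }' | sort
}

# Example call:
# samtools view -H INPUT.bam | summarize_header
```

The number of @SQ records equals the number of reference sequences the reads were aligned against.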

Sort BAM files by coordinate

Coordinate sorting prepares BAM files for indexing, genome-browser visualization, variant calling, and many downstream tools.

Sort an existing BAM file

samtools sort -@ [insert number of threads] INPUT.bam -o INPUT_sorted.bam

Concrete example

samtools sort \
  -@ 8 \
  INPUT.bam \
  -o INPUT_sorted.bam

Convert SAM directly to coordinate-sorted BAM

The original SciBerg page also provides a one-step conversion and sorting command.

samtools sort -@ [insert number of threads] -o INPUT_sorted.bam INPUT.sam

Concrete one-step example

samtools sort \
  -@ 8 \
  -o INPUT_sorted.bam \
  INPUT.sam
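The sort order that a file declares is stored in the SO field of the @HD header line. The sketch below reads it from header text on standard input. It reports only what the header declares and does not verify the order of the records themselves. The function name header_sort_order is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the sort order declared in the @HD header line.
# header_sort_order is a hypothetical helper that reads header text on stdin.
header_sort_order() {
  awk -F'\t' '
    $1 == "@HD" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
    }
    # No @HD line or no SO field: the order is not declared.
    END { if (!found) print "unknown" }
  '
}

# Example call:
# samtools view -H INPUT_sorted.bam | header_sort_order
```

After samtools sort with default options, the expected value is coordinate.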

Index and inspect sorted BAM files

After sorting, create a BAM index so genome browsers and region-specific tools can access alignments efficiently.

Create an index

samtools index INPUT_sorted.bam
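An index that is older than its BAM file usually means the BAM was regenerated without re-indexing. The sketch below checks for a .bai or .csi file next to the BAM and compares modification times. It assumes the index uses the default name, BAM path plus suffix; index_is_fresh is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: succeed if an index exists next to the BAM and is not older than it.
# index_is_fresh is a hypothetical helper; it assumes default index naming.
index_is_fresh() {
  local bam="$1" idx
  for idx in "$bam.bai" "$bam.csi"; do
    # -nt is true when the left file is newer than the right one.
    if [ -f "$idx" ] && [ ! "$bam" -nt "$idx" ]; then
      return 0
    fi
  done
  return 1
}

# Example call:
# index_is_fresh INPUT_sorted.bam || samtools index INPUT_sorted.bam
```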

Generate mapping statistics

# Overall alignment statistics
samtools flagstat INPUT_sorted.bam > INPUT_sorted.flagstat.txt

# Per-reference mapped and unmapped read counts
samtools idxstats INPUT_sorted.bam > INPUT_sorted.idxstats.txt

# Summary statistics
samtools stats INPUT_sorted.bam > INPUT_sorted.stats.txt
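The flagstat file can be reduced to a single mapping-rate figure for quick comparison across samples. The sketch below assumes the default text output of samtools flagstat, where each line starts with the QC-passed count, and it uses only those QC-passed counts. The function name mapping_rate is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compute the percentage of mapped records from flagstat text on stdin.
# mapping_rate is a hypothetical helper; it assumes the default flagstat format
# "<passed> + <failed> <description>" and ignores QC-failed counts.
mapping_rate() {
  awk '
    $4 == "in" && $5 == "total" { total = $1 }
    # Matches the "mapped" line only, not "primary mapped".
    $4 == "mapped"              { mapped = $1 }
    END {
      if (total > 0) printf "%.2f\n", 100 * mapped / total
      else print "NA"
    }
  '
}

# Example call:
# mapping_rate < INPUT_sorted.flagstat.txt
```

Because flagstat counts alignment records, secondary and supplementary alignments are included in both totals.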

Inspect a specific region

# Example genomic region
samtools view INPUT_sorted.bam chr1:1000000-1010000 | head

Chromosome naming compatibility

Region queries only work when the chromosome name matches the BAM header exactly: a query for chr1 will not match a reference named 1.
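A naming mismatch can be caught before running the query by looking up the region's reference name in the @SQ header lines. The sketch below is one way to do that. It assumes reference names contain no colon, which does not hold for some contigs such as HLA alleles. The function name region_contig_in_header is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the reference name of a region appears in the @SQ lines.
# region_contig_in_header is a hypothetical helper; header text comes on stdin.
region_contig_in_header() {
  local region="$1" contig
  # Strip the ":start-end" part, so "chr1:1000000-1010000" becomes "chr1".
  # Assumption: the reference name itself contains no colon.
  contig="${region%%:*}"
  awk -F'\t' -v want="$contig" '
    $1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i == "SN:" want) ok = 1 }
    END { exit !ok }
  '
}

# Example call:
# samtools view -H INPUT_sorted.bam | region_contig_in_header chr1:1000000-1010000 \
#   || echo "chr1 is not a reference name in this BAM" >&2
```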

Useful filtering examples

samtools view can keep or exclude reads by alignment flag, mapping quality, or genomic region, and can include the header in its output.

Extract only mapped reads

# Exclude reads that have flag 4 (read unmapped) set
samtools view -b -F 4 INPUT_sorted.bam -o INPUT_mapped.bam

Extract only unmapped reads

# Keep only reads that have flag 4 (read unmapped) set
samtools view -b -f 4 INPUT_sorted.bam -o INPUT_unmapped.bam
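The values passed to -f and -F are bit flags, so a single number can combine several properties. The sketch below decodes a flag value into the twelve bit names defined by the SAM format, using the same labels as the samtools flags subcommand, which offers this conversion directly. The function name decode_flag is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decode a SAM flag integer into the names of the bits that are set.
# decode_flag is a hypothetical helper; names are listed from bit 0x1 to 0x800.
decode_flag() {
  local flag="$1" out="" i=0 bit name
  for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE READ1 READ2 \
              SECONDARY QCFAIL DUP SUPPLEMENTARY; do
    bit=$((1 << i))
    if [ $((flag & bit)) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
    i=$((i + 1))
  done
  echo "${out:-NONE}"
}

decode_flag 4    # prints UNMAP
decode_flag 99   # prints PAIRED,PROPER_PAIR,MREVERSE,READ1
```

Adding bit values gives combined filters: for example, -F 260 excludes reads that are unmapped (4) or secondary (256).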

Filter by mapping quality

# Keep alignments with mapping quality at least 30
samtools view -b -q 30 INPUT_sorted.bam -o INPUT_mapq30.bam
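Before choosing a threshold, it helps to know how many alignments would survive it. Mapping quality is the fifth column of each SAM record, so a count can be made from the text output. The sketch below does this with awk; samtools view -c -q 30 gives a comparable count directly. The function name count_mapq_at_least is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count SAM records on stdin whose MAPQ (column 5) meets a threshold.
# count_mapq_at_least is a hypothetical helper; header lines are skipped.
count_mapq_at_least() {
  local min="$1"
  awk -F'\t' -v min="$min" '
    !/^@/ && $5 + 0 >= min + 0 { n++ }
    END { print n + 0 }
  '
}

# Example call:
# samtools view INPUT_sorted.bam | count_mapq_at_least 30
```

MAPQ scales differ between aligners, so a threshold of 30 does not mean the same thing for every tool.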

Extract reads from a genomic region

# Region queries need a coordinate-sorted, indexed BAM (run samtools index first)
samtools view -b INPUT_sorted.bam chr1:1000000-2000000 -o INPUT_chr1_region.bam

Preserve headers in SAM output

samtools view -h INPUT_sorted.bam chr1:1000000-2000000 > INPUT_chr1_region.sam

Reusable SAM-to-sorted-BAM workflow

The script below converts one SAM file to a coordinate-sorted BAM file, creates an index, and writes QC summaries. The last step, removing the intermediate unsorted BAM file, is optional and left commented out.

#!/usr/bin/env bash
set -euo pipefail

THREADS=8
INPUT_SAM="INPUT.sam"
# Strip any directory and the .sam suffix so outputs always land in OUTDIR
PREFIX="$(basename "$INPUT_SAM" .sam)"
OUTDIR="bam"
LOGDIR="logs"

mkdir -p "$OUTDIR" "$LOGDIR"

# Convert SAM to BAM
samtools view \
  -b \
  -@ "$THREADS" \
  "$INPUT_SAM" \
  -o "$OUTDIR/${PREFIX}.bam"

# Sort BAM by genomic coordinate
samtools sort \
  -@ "$THREADS" \
  "$OUTDIR/${PREFIX}.bam" \
  -o "$OUTDIR/${PREFIX}.sorted.bam"

# Index sorted BAM
samtools index "$OUTDIR/${PREFIX}.sorted.bam"

# Alignment summaries
samtools flagstat "$OUTDIR/${PREFIX}.sorted.bam" \
  > "$LOGDIR/${PREFIX}.sorted.flagstat.txt"

samtools idxstats "$OUTDIR/${PREFIX}.sorted.bam" \
  > "$LOGDIR/${PREFIX}.sorted.idxstats.txt"

samtools stats "$OUTDIR/${PREFIX}.sorted.bam" \
  > "$LOGDIR/${PREFIX}.sorted.stats.txt"

# Optional: remove intermediate unsorted BAM after validation
# rm "$OUTDIR/${PREFIX}.bam"

echo "Done."
echo "Sorted BAM: $OUTDIR/${PREFIX}.sorted.bam"
echo "Index:      $OUTDIR/${PREFIX}.sorted.bam.bai"
echo "Logs:       $LOGDIR/"
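For a directory of SAM files, the same steps can be planned per file before anything is run. The sketch below only prints the commands it would execute, so the plan can be reviewed first. It assumes file names without spaces; plan_batch and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print sort and index commands for every .sam file in a directory.
# plan_batch is a hypothetical helper; it prints commands and runs nothing.
# Assumption: paths contain no spaces or shell metacharacters.
plan_batch() {
  local indir="$1" outdir="$2" threads="${3:-8}" sam prefix
  for sam in "$indir"/*.sam; do
    # With no matches the glob stays literal, so skip non-existent paths.
    [ -e "$sam" ] || continue
    prefix="$(basename "$sam" .sam)"
    echo "samtools sort -@ $threads -o $outdir/$prefix.sorted.bam $sam"
    echo "samtools index $outdir/$prefix.sorted.bam"
  done
}

# Example call: review the plan, then run it.
# plan_batch sam bam 8
# plan_batch sam bam 8 | bash
```

Printing the commands first also leaves a record of exactly what was run, which can be saved with the project logs.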

More compact one-step version

samtools sort \
  -@ 8 \
  -o INPUT.sorted.bam \
  INPUT.sam

samtools index INPUT.sorted.bam
samtools flagstat INPUT.sorted.bam > INPUT.sorted.flagstat.txt

Next steps after SAM/BAM processing

Once BAM files are sorted and indexed, they can be used for assay-specific downstream analysis.

  1. Review flagstat, idxstats, and aligner log files.
  2. Check whether mapping rates, multimapping levels, and reference-specific counts are consistent with the experiment.
  3. Use sorted BAM files for read counting, variant calling, peak calling, genome-browser visualization, or downstream QC.
  4. Keep raw alignment commands, reference names, tool versions, and processing scripts with the project documentation.
  5. Archive intermediate SAM files only if required, because they can consume substantial disk space.