Bioinformatics Scripts

Primary analysis of FASTQ files with FastQC.

This tutorial introduces the first quality-control step for raw sequencing reads. FastQC provides a quick, visual overview of high-throughput sequencing data before trimming, alignment, quantification, variant calling, or downstream statistical analysis.

Overview

FASTQ files are the standard starting point for many NGS workflows. Before any downstream analysis, raw reads should be inspected for quality, sequence composition, GC distribution, adapter contamination, duplicate reads, and overrepresented sequences.
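For orientation, each FASTQ record normally spans four lines (identifier, sequence, a `+` separator, and the quality string), so the read count is the line count divided by four. The sketch below assumes that four-line layout with no wrapped sequence lines; the function name `count_reads` is illustrative and not part of FastQC.

```shell
# count_reads FILE: print the number of reads in a FASTQ or FASTQ.GZ file.
# Assumes four lines per record (no wrapped sequence or quality lines).
count_reads() {
  case "$1" in
    *.gz) n_lines=$(gzip -dc "$1" | wc -l) ;;   # decompress to stdout, count lines
    *)    n_lines=$(wc -l < "$1") ;;
  esac
  echo $(( n_lines / 4 ))
}
```

For example, `count_reads sample_R1.fastq.gz` prints the number of reads in that file, which can be compared with the total that FastQC reports under Basic Statistics.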

FastQC is a widely used quality-control tool for high-throughput sequencing data. It can import FASTQ, BAM, and SAM files, provides summary graphs and tables, and exports an HTML report for permanent documentation.

Input: FASTQ, FASTQ.GZ, BAM, or SAM files from sequencing pipelines.
Analysis: Modular checks for quality, content, GC bias, duplication, and adapters.
Output: HTML reports, summary files, and optional extracted report directories.

Install FastQC

FastQC requires a suitable Java Runtime Environment. For reproducible bioinformatics environments, package managers such as Conda or Mamba are usually the simplest installation route.
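Because FastQC runs on Java, it can help to confirm that a Java runtime is visible before a manual installation (the Conda/Mamba route normally brings its own). This is a minimal sketch; the helper name `have_tool` is an assumption made for illustration.

```shell
# have_tool NAME: succeed if NAME is an executable on the current PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

if have_tool java; then
  java -version 2>&1 | head -n 1   # java prints its version to stderr
else
  echo "java not found on PATH; install a JRE or use the Conda/Mamba route" >&2
fi
```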

Option 1: Install with Mamba or Conda

# Create an environment for NGS quality-control tools
mamba create -n ngs-qc -c conda-forge -c bioconda fastqc

# Activate the environment
mamba activate ngs-qc

# Check the installation
fastqc --version

Option 2: Manual download

If you prefer a manual installation, download FastQC from Babraham Bioinformatics and make the wrapper script executable.

# Example version-specific installation
# Check the FastQC download page and update the version number if needed

wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.0.zip
unzip fastqc_v0.12.0.zip

cd FastQC
chmod 755 fastqc

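# Optional: symlink the wrapper into a directory that is already on your PATH
# (needs admin rights; replace /path/to with the real location)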
sudo ln -s /path/to/FastQC/fastqc /usr/local/bin/fastqc

Tip

When working on shared servers or HPC systems, avoid using sudo unless permitted by your administrator. Instead, add the FastQC folder to your personal PATH or use a Conda/Mamba environment.
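One way to follow this tip without sudo is to prepend the unpacked FastQC folder to your own PATH. The location `$HOME/tools/FastQC` below is a hypothetical example, not a path the installer creates; change it to wherever you unpacked the archive.

```shell
# Hypothetical install location; override by setting FASTQC_HOME beforehand.
FASTQC_HOME="${FASTQC_HOME:-$HOME/tools/FastQC}"

# Prepend the folder to PATH for the current shell session only.
export PATH="$FASTQC_HOME:$PATH"

# To make this permanent, add the same export line to ~/.bashrc
# (or the startup file of whichever shell you use).
```

After this, `fastqc --version` should resolve to the wrapper script in that folder.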

Run FastQC

FastQC can be launched interactively, but command-line execution is usually preferred for reproducible NGS projects and batch processing.

Launch the graphical interface

fastqc

Analyze one or more FASTQ files

# Analyze individual files
fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# Create an output directory and process all FASTQ files
mkdir -p fastqc_reports
fastqc -o fastqc_reports *.fastq.gz

Use multiple CPU threads

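# Process up to 8 files in parallel (-t sets how many files are handled at once;
# a single file is not processed faster with more threads)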
fastqc -t 8 -o fastqc_reports *.fastq.gz
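Since each FastQC thread works on one file at a time, requesting more threads than there are input files or CPU cores brings no benefit, and every thread reserves additional memory. The helper below is a small sketch for choosing a sensible value; `pick_threads` is an illustrative name, not a FastQC option.

```shell
# pick_threads N_FILES N_CORES: print the smaller of the two values, at least 1.
pick_threads() {
  n=$1
  if [ "$2" -lt "$n" ]; then n=$2; fi
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

# Example: 4 input files on a 16-core machine
pick_threads 4 16
```

The result can be passed to FastQC as `-t "$(pick_threads 4 16)"`, with the file and core counts substituted for your own system.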

Extract report folders automatically

# Produce HTML reports and also unpack each zipped report into its own folder
fastqc --extract -t 8 -o fastqc_reports *.fastq.gz
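Each extracted folder (for example `sample_R1_fastqc/`) contains a tab-separated `summary.txt` with one line per module: status, module name, and input file name. The following sketch lists every module that did not pass; the function name `flagged_modules` is an assumption made for illustration.

```shell
# flagged_modules DIR: print file name, status, and module for every WARN or
# FAIL line in the summary.txt files of extracted FastQC folders under DIR.
flagged_modules() {
  for summary in "$1"/*_fastqc/summary.txt; do
    [ -f "$summary" ] || continue          # skip if the pattern matched nothing
    awk -F '\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' "$summary"
  done
}
```

Running `flagged_modules fastqc_reports` gives a compact list of samples and modules worth opening in the HTML report first.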

Interpret FastQC reports

FastQC reports should not be interpreted mechanically. A warning or failure flag does not always mean a sample is unusable; the interpretation depends on library type, read length, sequencing platform, organism, and downstream analysis goal.

Per-base sequence quality: Shows whether quality decreases across read positions, especially toward read ends.
Per-sequence quality scores: Summarizes whether many reads have globally low quality.
Per-base sequence content: Highlights positional nucleotide biases that may reflect library preparation or technical effects.
Per-sequence GC content: Can reveal contamination, unusual library composition, or organism-specific GC profiles.
Sequence duplication levels: Helps estimate redundancy, PCR duplication, and library complexity.
Adapter content: Indicates whether trimming may be needed before alignment or quantification.
Overrepresented sequences: Lists abundant sequences that may represent adapters, contaminants, rRNA, or biological signal.
Sequence length distribution: Shows whether reads have fixed or variable lengths after sequencing or preprocessing.
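The numbers behind each of these modules are also stored in `fastqc_data.txt` inside the report archive (or the extracted folder), where every module starts with a `>>Module name` line and ends with `>>END_MODULE`. As a rough sketch, the block for a single module can be pulled out with awk; `module_data` is a hypothetical helper name.

```shell
# module_data FILE "MODULE NAME": print the lines between the ">>MODULE NAME"
# header and the following ">>END_MODULE" in a fastqc_data.txt file.
module_data() {
  awk -F '\t' -v name=">>$2" '
    $1 == name           { inside = 1; next }   # module header line (name<TAB>status)
    $1 == ">>END_MODULE" { inside = 0 }
    inside               { print }
  ' "$1"
}
```

For example, `module_data sample_R1_fastqc/fastqc_data.txt "Adapter Content"` prints the per-position table for that module, with the module name written as it appears in the file.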

Important interpretation note

Small RNA-seq, ATAC-seq, amplicon sequencing, metagenomics, and bisulfite sequencing can produce FastQC patterns that look unusual compared with standard RNA-seq or WGS data. Always interpret QC metrics in the context of the experimental design.

Batch analysis example

In most projects, FastQC is applied to all FASTQ files in a folder and results are collected into a dedicated QC directory.

#!/usr/bin/env bash
set -euo pipefail

INPUT_DIR="fastq"
OUTPUT_DIR="fastqc_reports"
THREADS=8

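# Collect input files; nullglob makes an unmatched pattern expand to nothing
shopt -s nullglob
fastq_files=("$INPUT_DIR"/*.fastq.gz)
if (( ${#fastq_files[@]} == 0 )); then
  echo "No .fastq.gz files found in $INPUT_DIR" >&2
  exit 1
fi

mkdir -p "$OUTPUT_DIR"

fastqc \
  -t "$THREADS" \
  -o "$OUTPUT_DIR" \
  "${fastq_files[@]}"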

After running FastQC, open the HTML files in fastqc_reports in a web browser. For larger projects, the individual reports can be combined into a single summary dashboard with an aggregation tool such as MultiQC.
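Before reviewing results, a quick completeness check can catch inputs that were skipped. FastQC names each report after the input file, so `sample_R1.fastq.gz` yields `sample_R1_fastqc.html`. The sketch below relies on that naming pattern and on inputs ending in `.fastq.gz`; `missing_reports` is an illustrative helper name.

```shell
# missing_reports INPUT_DIR OUTPUT_DIR: print every *.fastq.gz file in
# INPUT_DIR that has no matching <name>_fastqc.html report in OUTPUT_DIR.
missing_reports() {
  for fq in "$1"/*.fastq.gz; do
    [ -e "$fq" ] || continue               # skip if the pattern matched nothing
    base=$(basename "$fq" .fastq.gz)
    [ -f "$2/${base}_fastqc.html" ] || echo "$fq"
  done
}
```

`missing_reports fastq fastqc_reports` prints nothing when every input has a report.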

Next steps after primary FASTQ QC

The next step depends on the sequencing assay and QC findings. Some datasets may require trimming or filtering before alignment, while others may proceed directly into alignment, quantification, variant calling, or taxonomic classification.

  1. Review per-base quality, adapter content, overrepresented sequences, GC distribution, and duplication.
  2. Decide whether adapter trimming or quality filtering is required.
  3. Choose the appropriate downstream pipeline for RNA-seq, DNA-seq, single-cell, small RNA-seq, ChIP-seq, ATAC-seq, bisulfite-seq, or metagenomics.
  4. Document all commands, versions, references, and parameters for reproducibility.
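For the documentation step, one lightweight option is to log each command with a timestamp at the moment it runs. This is only a sketch of the idea; the helper `log_run` and the log file name are assumptions, and workflow managers offer more complete provenance tracking.

```shell
# log_run LOGFILE COMMAND [ARGS...]: append a UTC timestamp and the command
# line to LOGFILE, then run the command unchanged.
log_run() {
  logfile=$1
  shift
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$logfile"
  "$@"
}
```

For example, `log_run qc_commands.log fastqc --version` followed by `log_run qc_commands.log fastqc -t 8 -o fastqc_reports fastq/*.fastq.gz` records both the invocation and the expanded file list alongside the results.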