NGS Quality-Control Tutorial

Quality control of NGS data: from raw reads to reliable results.

A practical tutorial for evaluating next-generation sequencing data quality across FASTQ files, trimming, alignment, coverage, RNA-seq, DNA-seq, targeted panels, ChIP-seq, ATAC-seq, single-cell and long-read workflows. The goal is to identify technical problems early and document which data are reliable for interpretation.

1. Overview: what is NGS quality control?

NGS quality control is the process of evaluating whether sequencing data are technically suitable for downstream analysis and biological interpretation. QC is not limited to checking raw reads. It also includes metadata validation, run-level metrics, preprocessing results, alignment performance, coverage, sample identity, contamination, batch effects and assay-specific metrics.

Detect: Find low-quality reads, adapters, contamination, low complexity, poor mapping and outlier samples.
Decide: Choose whether to trim, filter, resequence, exclude, merge or keep samples with documented limitations.
Document: Record QC metrics, thresholds, failed samples and the reasoning behind each decision.
Core principle: QC should answer whether the data are fit for the intended analysis, not whether every metric looks perfect.

2. Main stages of NGS QC

Different QC stages detect different problems. A complete project usually needs multiple checks.

Stage | What is checked | Common outputs
Metadata QC | Sample names, groups, batches, read layout, strandedness, organism and reference version. | Validated sample sheet and project design notes.
Run-level QC | Instrument performance, yield, Q30, cluster density, loading, barcode balance and demultiplexing. | Run report, demultiplexing report, index statistics.
Raw FASTQ QC | Read quality, adapters, GC distribution, duplication, overrepresented sequences and read length. | FastQC and MultiQC reports.
Preprocessing QC | Effect of trimming, filtering, read removal and retained bases. | fastp, Cutadapt or trimming reports.
Alignment QC | Mapping rate, duplicates, insert size, chromosome distribution, coverage and contamination. | SAMtools, Picard, Qualimap, mosdepth and MultiQC summaries.
Assay-specific QC | RNA-seq bias, DNA-seq variant metrics, ChIP/ATAC enrichment, single-cell cell quality, long-read read length. | Assay-specific tables, figures and pass/fail notes.

3. Sequencing run-level QC

Run-level QC evaluates whether the sequencing instrument and demultiplexing performed as expected. These metrics often come from instrument software or the sequencing provider.

Total yield: Total bases or reads generated by the run. Compare with the expected output for the flow cell, chemistry and read length.
Q30 or quality distribution: Fraction of bases above a quality-score threshold (Q30 corresponds to an estimated error rate of 1 in 1,000). Interpretation depends on platform and read position.
Index balance: Uneven index representation can indicate pooling problems or sample-index issues.
Undetermined reads: A high fraction of undetermined reads may indicate an incorrect sample sheet, index mismatches, poor index-read quality or index hopping.
If run-level metrics fail, downstream FASTQ QC may reveal symptoms, but the root cause may be library loading, reagent quality, sample sheet errors or instrument issues.
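
Index balance can be checked with a few lines of shell once per-sample read counts are available. The sketch below assumes a two-column, tab-separated table (sample name, read count) exported from the demultiplexing report; the table layout, the function name index_balance and the 0.5 cut-off are illustrative assumptions, not a standard.

```shell
# index_balance: read "sample<TAB>reads" lines on stdin and print each
# sample's percentage of all reads, flagging samples far below the mean.
# The input layout and the default 0.5 ratio are illustrative assumptions.
index_balance() {
  local min_ratio="${1:-0.5}"
  awk -F'\t' -v min_ratio="$min_ratio" '
    { name[NR] = $1; reads[NR] = $2; total += $2 }
    END {
      if (NR == 0 || total == 0) exit 1
      mean = total / NR
      for (i = 1; i <= NR; i++) {
        flag = (reads[i] < min_ratio * mean) ? "LOW" : "OK"
        printf "%s\t%.1f\t%s\n", name[i], 100 * reads[i] / total, flag
      }
    }'
}

# Demo with made-up counts: S3 falls below half of the mean read count.
printf 'S1\t400\nS2\t450\nS3\t150\n' | index_balance 0.5
```

A flagged sample is a prompt to review pooling and the sample sheet, not an automatic failure.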

4. Raw FASTQ QC metrics

FASTQ QC is usually the first analysis step after receiving data. It checks read-level properties before alignment or quantification.

Metric | What it means | Typical interpretation
Per-base quality | Quality score along read positions. | Quality often decreases toward read ends; severe decline may require trimming or resequencing.
Adapter content | Presence of adapter sequences in reads. | Common in short inserts, small RNA-seq and over-sequenced libraries; trimming is usually needed.
GC content | Distribution of GC percentage across reads. | Strong deviation from expected distribution can suggest contamination or library bias.
Duplication | Repeated identical sequences. | May indicate low complexity, PCR duplication or expected biology depending on assay.
Overrepresented sequences | Sequences occurring more often than expected. | May represent adapters, primers, ribosomal reads, highly abundant transcripts or contamination.
Sequence length | Read-length distribution. | Uniform in many short-read datasets; variable after trimming or in long-read data.
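
To make two of these metrics concrete, the following sketch computes read count, mean read length and overall GC percentage directly from a FASTQ stream. It is a minimal illustration, assuming the standard four-line record layout with no wrapped sequences; the function name fastq_summary is hypothetical, and FastQC remains the more complete tool.

```shell
# fastq_summary: print read count, mean read length and overall GC percentage
# for an uncompressed FASTQ stream on stdin.
# Assumes four lines per record (header, sequence, "+", qualities).
fastq_summary() {
  awk '
    NR % 4 == 2 {
      seq = toupper($0)
      reads++
      bases += length(seq)
      gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
    }
    END {
      if (reads == 0 || bases == 0) { print "reads=0"; exit }
      printf "reads=%d mean_len=%.1f gc_pct=%.1f\n", reads, bases / reads, 100 * gc / bases
    }'
}

# Demo with two made-up reads.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCAT\n+\nIIIIII\n' | fastq_summary
```

For a compressed file, the stream can be supplied with, for example, gzip -dc data/fastq/sample_R1.fastq.gz | fastq_summary.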

5. FastQC and MultiQC

FastQC provides per-sample reports for FASTQ files. MultiQC aggregates many tool outputs into a project-level report, making it easier to compare samples and detect outliers.

Run FastQC and MultiQC
mkdir -p results/qc/fastqc results/qc/multiqc

fastqc data/fastq/*.fastq.gz \
  --outdir results/qc/fastqc \
  --threads 8

multiqc results/qc \
  --outdir results/qc/multiqc

How to read MultiQC results

  • Look first for sample outliers, not only pass/fail icons.
  • Compare read counts across samples and libraries.
  • Check whether R1 and R2 behave differently in paired-end data.
  • Review adapter content before deciding trimming settings.
  • Check whether unusual GC content is present in many samples or only one sample.

6. Trimming and filtering QC

Trimming removes adapters, primers, low-quality ends or reads below a minimum length. Filtering removes reads that fail defined criteria. QC should be repeated after trimming to confirm that preprocessing improved the data.

Trim when: adapter sequences are present, read ends are poor-quality, primer sequences must be removed or insert size is shorter than read length.
Avoid over-trimming when: quality decline is mild, the aligner handles soft clipping well or trimming would create very short reads.

Example paired-end trimming with fastp
mkdir -p results/trimmed results/qc/fastp

fastp \
  --in1 data/fastq/sample_R1.fastq.gz \
  --in2 data/fastq/sample_R2.fastq.gz \
  --out1 results/trimmed/sample_R1.trimmed.fastq.gz \
  --out2 results/trimmed/sample_R2.trimmed.fastq.gz \
  --html results/qc/fastp/sample.fastp.html \
  --json results/qc/fastp/sample.fastp.json \
  --thread 8
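
One simple way to confirm that trimming behaved sensibly is to compare read counts before and after preprocessing. The sketch below assumes a hypothetical three-column, tab-separated table (sample, raw reads, trimmed reads) assembled from the trimming reports; the function name retention_report and the 80% review threshold are illustrative assumptions.

```shell
# retention_report: read "sample<TAB>raw_reads<TAB>trimmed_reads" on stdin and
# print the percentage of reads retained, marking samples below a threshold.
# The table layout and the default threshold of 80 are illustrative only.
retention_report() {
  local min_pct="${1:-80}"
  awk -F'\t' -v min_pct="$min_pct" '
    $2 > 0 {
      pct = 100 * $3 / $2
      printf "%s\t%.1f\t%s\n", $1, pct, (pct < min_pct ? "REVIEW" : "OK")
    }'
}

# Demo with made-up counts.
printf 'sampleA\t1000\t950\nsampleB\t1000\t600\n' | retention_report 80
```

A low retained fraction does not by itself mean the sample failed; it indicates that the adapter content, insert size or trimming settings deserve a closer look.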

7. Contamination and unexpected sequence content

Contamination can come from other species, adapters, primers, rRNA, microbial reads, index misassignment, PhiX, sample swaps or laboratory handling. The relevant checks depend on the organism and assay.

Signal | Possible cause | What to check
Unexpected GC profile | Microbial contamination, mixed species, biased library. | Kraken-style classification, alignment to expected reference, overrepresented sequences.
High rRNA fraction | Incomplete depletion or poor mRNA enrichment. | RNA-seq mapping categories, SortMeRNA-style checks, gene biotype counts.
High undetermined indexes | Incorrect index definitions or poor barcode quality. | Demultiplexing report and sample sheet.
Reads mapping to wrong organism | Contamination, sample mix-up or wrong reference. | Taxonomic classification and sample metadata.

Contamination interpretation should consider biology. For example, metagenomic samples are expected to contain many organisms, whereas a human WGS sample is not.

8. Alignment-level QC

Alignment QC evaluates how well reads map to the reference and whether mapped reads have expected properties.

Basic BAM QC with SAMtools
mkdir -p results/qc/alignment

samtools flagstat results/alignments/sample.sorted.bam \
  > results/qc/alignment/sample.flagstat.txt

samtools stats results/alignments/sample.sorted.bam \
  > results/qc/alignment/sample.samtools.stats.txt

samtools idxstats results/alignments/sample.sorted.bam \
  > results/qc/alignment/sample.idxstats.txt

Metric | Why it matters | Interpretation
Mapping rate | Shows how many reads align to the reference. | Low mapping may indicate contamination, wrong reference, poor quality or library issues.
Properly paired reads | Important for paired-end libraries. | Low values may indicate insert-size problems, contamination or alignment issues.
Duplicate rate | Reflects repeated fragments or PCR duplication. | Interpret in context of assay, input amount and sequencing depth.
Insert size | Reflects fragment length distribution. | An unexpected distribution can indicate library-preparation issues.
Chromosome distribution | Shows where reads map. | Unexpectedly high mitochondrial, ribosomal or off-target mapping may require investigation.
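
Chromosome distribution can be summarized from samtools idxstats output, which lists one tab-separated row per contig (name, length, mapped reads, unmapped reads). The sketch below computes the percentage of mapped reads on a single contig, such as the mitochondrial genome. The function name contig_fraction is hypothetical, and the contig name (for example chrM or MT) depends on the reference in use.

```shell
# contig_fraction: given idxstats-style rows on stdin (contig, length,
# mapped reads, unmapped reads; tab-separated), print the percentage of
# mapped reads assigned to the contig named in the first argument.
contig_fraction() {
  local contig="$1"
  awk -F'\t' -v contig="$contig" '
    { total += $3 }
    $1 == contig { hit += $3 }
    END {
      if (total == 0) { print "NA"; exit }
      printf "%.2f\n", 100 * hit / total
    }'
}

# Demo with made-up rows; the final "*" row holds unplaced, unmapped reads.
printf 'chr1\t1000\t900\t0\nchrM\t16\t100\t0\n*\t0\t0\t50\n' | contig_fraction chrM
```

On real data the same function could be fed from results/qc/alignment/sample.idxstats.txt.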

9. Coverage QC

Coverage QC measures how deeply and uniformly genomic regions are sequenced. It is central for WGS, WES, targeted panels and many DNA-seq workflows.

Mean coverage: Average depth across the genome or target regions.
Coverage uniformity: How evenly reads cover target regions. Poor uniformity can reduce sensitivity.
Breadth of coverage: Fraction of bases covered at or above thresholds such as 10×, 20×, 30× or 100×.
GC bias: Coverage loss in high-GC or low-GC regions can affect variant detection and copy-number analysis.

Coverage with mosdepth
mkdir -p results/qc/coverage

mosdepth \
  --threads 8 \
  results/qc/coverage/sample \
  results/alignments/sample.sorted.bam
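
Mean depth and breadth of coverage can be derived from per-base depth intervals. The sketch below assumes BED-like rows (contig, start, end, depth), which is the layout of the per-base output that mosdepth writes by default; the function name coverage_summary and the 20× threshold are illustrative assumptions, and mosdepth's own summary and distribution files report similar numbers directly.

```shell
# coverage_summary: read "contig<TAB>start<TAB>end<TAB>depth" intervals on
# stdin and print the length-weighted mean depth and the percentage of bases
# covered at or above a minimum depth (default 20, an illustrative value).
coverage_summary() {
  local min_depth="${1:-20}"
  awk -F'\t' -v min_depth="$min_depth" '
    {
      span = $3 - $2
      bases += span
      depth_sum += span * $4
      if ($4 >= min_depth) covered += span
    }
    END {
      if (bases == 0) { print "NA"; exit }
      printf "mean_depth=%.2f breadth_pct=%.2f\n", depth_sum / bases, 100 * covered / bases
    }'
}

# Demo: 100 bp at 30x, 50 bp at 10x and 50 bp at 0x.
printf 'chr1\t0\t100\t30\nchr1\t100\t150\t10\nchr1\t150\t200\t0\n' | coverage_summary 20
```

The demo shows why mean depth alone is not enough: a respectable average can hide a large share of bases below the reporting threshold.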

10. RNA-seq quality control

RNA-seq QC includes both sequencing quality and transcriptome-specific properties. Good RNA-seq QC checks whether libraries reflect the intended RNA population and experimental design.

Metric | What it detects | Possible action
Mapping rate | Read alignment to genome or transcriptome. | Check reference, contamination, library type and read quality.
Assigned reads | Reads assigned to genes or transcripts. | Check annotation version, strandedness and feature-counting settings.
rRNA fraction | Residual ribosomal RNA. | Assess depletion/enrichment success and usable depth.
Strand specificity | Whether library strandedness matches analysis settings. | Correct counting parameters or clarify library preparation.
Gene body coverage | 5′/3′ bias and degradation effects. | Investigate RNA quality, protocol and sample handling.
PCA and clustering | Sample outliers, batch effects and group separation. | Review metadata, batches and possible sample swaps.

RNA-seq QC should evaluate whether biological replicates cluster together and whether known batches or covariates explain unwanted variation.
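
The assigned-read fraction can be calculated from a counting summary. The sketch below assumes a featureCounts-style summary layout: a header row starting with Status, then one row per category (Assigned, Unassigned_NoFeatures and so on) with the count for the first BAM file in column two. The function name assigned_pct is hypothetical, and the exact categories vary between tool versions.

```shell
# assigned_pct: read a featureCounts-style summary table on stdin and print
# the percentage of reads in the "Assigned" category for the first sample
# column. The table layout is an assumption; check it against your own file.
assigned_pct() {
  awk -F'\t' '
    NR == 1 { next }                      # skip the "Status" header row
    { total += $2 }
    $1 == "Assigned" { assigned = $2 }
    END {
      if (total == 0) { print "NA"; exit }
      printf "%.2f\n", 100 * assigned / total
    }'
}

# Demo with made-up counts.
printf 'Status\tsample.bam\nAssigned\t800\nUnassigned_NoFeatures\t150\nUnassigned_Ambiguity\t50\n' | assigned_pct
```

A low assigned fraction with a normal mapping rate often points to annotation or strandedness settings rather than to read quality.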

11. DNA-seq and variant-calling QC

DNA-seq QC checks whether the data are suitable for variant detection and whether variant calls show expected statistical properties.

Pre-variant QC: Read quality, mapping rate, duplicates, insert size, coverage, contamination and sample identity.
Variant-call QC: Number of variants, Ti/Tv ratio, heterozygosity, depth distribution, allele balance and filter status.
Somatic QC: Tumour purity, normal contamination, panel of normals, artefacts, strand bias and matched-normal quality.
CNV/SV QC: Coverage uniformity, insert size, GC bias, batch effects and read-depth noise.

Basic VCF QC commands
mkdir -p results/qc/variants

bcftools stats variants/sample.vcf.gz \
  > results/qc/variants/sample.bcftools.stats.txt

bcftools query -l variants/sample.vcf.gz \
  > results/qc/variants/sample_names.txt

bcftools view -H variants/sample.vcf.gz | wc -l \
  > results/qc/variants/sample.variant_count.txt

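The Ti/Tv ratio mentioned above is the number of transitions (A↔G, C↔T) divided by the number of transversions (all other single-base changes). bcftools stats already reports it; the sketch below only illustrates the calculation on uncompressed VCF records, counting biallelic single-nucleotide variants and skipping everything else. The function name titv_ratio is hypothetical.

```shell
# titv_ratio: read VCF text on stdin and print transition and transversion
# counts plus their ratio. Only records whose REF (column 4) and ALT
# (column 5) are single A/C/G/T bases are counted; indels and
# multi-allelic sites are ignored.
titv_ratio() {
  awk -F'\t' '
    /^#/ { next }
    length($4) == 1 && length($5) == 1 {
      pair = toupper($4 $5)
      if (pair ~ /^(AG|GA|CT|TC)$/) ti++
      else if (pair ~ /^[ACGT][ACGT]$/) tv++
    }
    END {
      if (tv == 0) { print "NA"; exit }
      printf "ti=%d tv=%d ratio=%.2f\n", ti, tv, ti / tv
    }'
}

# Demo with made-up records: three transitions, two transversions, one indel.
printf 'chr1\t10\t.\tA\tG\nchr1\t20\t.\tC\tT\nchr1\t30\t.\tG\tA\nchr1\t40\t.\tA\tC\nchr1\t50\t.\tG\tT\nchr1\t60\t.\tAT\tA\n' | titv_ratio
```

On real data the records could be streamed with bcftools view -H variants/sample.vcf.gz | titv_ratio. The expected ratio depends on the organism and on whether the data are whole-genome or exome.
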
12. Targeted-panel and exome QC

Targeted sequencing requires additional metrics describing how well reads are concentrated in the intended target regions.

Metric | Meaning | Why it matters
On-target rate | Fraction of reads overlapping target regions. | Low values waste sequencing and reduce effective depth.
Target coverage | Depth across panel or exome regions. | Determines variant-detection sensitivity.
Uniformity | Evenness of coverage across targets. | Poor uniformity causes under-covered regions.
Fold enrichment | Enrichment over random genomic sequencing. | Evaluates capture or amplification performance.
Covered bases at threshold | Fraction of target bases at ≥ defined depth. | Often critical for reporting regions with insufficient coverage.
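
On-target rate and fold enrichment are simple ratios. In the sketch below, fold enrichment is taken as the on-target fraction divided by the fraction of the genome that the targets occupy, which is the value expected from random sequencing. This definition, the function name fold_enrichment and the choice of reads rather than bases as the unit are assumptions; individual tools define these metrics slightly differently.

```shell
# fold_enrichment: arguments are on-target reads, total mapped reads,
# total target size (bp) and genome size (bp).
# Prints the on-target percentage and the enrichment over the fraction
# expected if reads were spread evenly across the genome.
fold_enrichment() {
  local on_target="$1" total="$2" target_bp="$3" genome_bp="$4"
  awk -v on="$on_target" -v tot="$total" -v t="$target_bp" -v g="$genome_bp" 'BEGIN {
    if (tot <= 0 || t <= 0 || g <= 0) { print "NA"; exit }
    on_rate = on / tot
    printf "on_target_pct=%.2f fold_enrichment=%.1f\n", 100 * on_rate, on_rate / (t / g)
  }'
}

# Demo with made-up numbers: 60% of reads on a target covering 1% of the genome.
fold_enrichment 600 1000 1000000 100000000
```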

13. ChIP-seq and ATAC-seq QC

ChIP-seq and ATAC-seq QC focus on enrichment, library complexity and signal quality rather than only read quality.

Assay | Key QC metrics | Interpretation
ChIP-seq | Mapping rate, duplicate rate, library complexity, peak count, FRiP (fraction of reads in peaks), input/control quality, replicate concordance. | Good data show enrichment at expected genomic regions and reproducible peaks between replicates.
ATAC-seq | Mapping rate, mitochondrial fraction, insert-size periodicity, TSS (transcription start site) enrichment, FRiP, duplicate rate, peak count. | Good ATAC-seq data often show nucleosome patterning and strong TSS enrichment.

ChIP-seq QC depends strongly on antibody quality, target biology and control design. ATAC-seq QC depends strongly on nuclei preparation and mitochondrial contamination.

14. Single-cell NGS QC

Single-cell QC evaluates both sequencing quality and cell-level properties. It is usually performed after demultiplexing and count-matrix generation.

Cell-level metrics: UMIs per cell, genes per cell, mitochondrial fraction, ribosomal fraction and cell barcode quality.
Sample-level metrics: Total cells recovered, reads per cell, sequencing saturation and cell-calling consistency.
Technical artefacts: Empty droplets, ambient RNA, doublets, multiplets and barcode swapping.
Biological checks: Expected cell types, marker genes, batch effects and sample-specific outliers.
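
Cell-level filtering can be illustrated on a per-cell metrics table. The sketch below assumes a hypothetical tab-separated export with a header and four columns (barcode, UMIs, genes, mitochondrial percentage); the function name filter_cells and the cut-offs of 200 genes and 20% mitochondrial reads are placeholders that should be chosen per tissue, protocol and dataset.

```shell
# filter_cells: keep the header row plus cells with at least min_genes
# detected genes (column 3) and at most max_mito percent mitochondrial
# reads (column 4). Column layout and default cut-offs are illustrative.
filter_cells() {
  local min_genes="${1:-200}" max_mito="${2:-20}"
  awk -F'\t' -v min_genes="$min_genes" -v max_mito="$max_mito" '
    NR == 1 { print; next }
    $3 >= min_genes && $4 <= max_mito { print }
  '
}

# Demo with made-up cells: one passes, one has too few genes,
# one has a high mitochondrial fraction.
printf 'barcode\tumis\tgenes\tmito_pct\nAAAC\t5000\t1800\t4.2\nAAAG\t300\t120\t2.0\nAAAT\t4000\t1500\t35.0\n' | filter_cells 200 20
```

In practice, dedicated single-cell toolkits handle this step together with doublet and ambient-RNA checks; the point here is only that each filter should be an explicit, recorded threshold.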

15. Long-read sequencing QC

Long-read QC differs from short-read QC because read length, molecule quality and platform-specific processing are central.

Metric | PacBio-style HiFi relevance | Nanopore relevance
Total yield | Total HiFi bases and reads generated. | Total bases generated over time and per flow cell.
Read length distribution | HiFi read N50 and insert-size distribution. | Read N50, ultra-long read fraction and length distribution.
Read quality | HiFi read accuracy and pass filters. | Basecalled read quality and model version effects.
Platform metadata | Movie time, polymerase read quality and yield per cell. | Pore/channel activity, read speed, pore occupancy and basecalling model.
Downstream QC | Alignment, coverage, phasing, assembly or SV metrics. | Alignment, methylation tags, assembly, SV calling and contamination checks.
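
Read N50 is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of the calculation, assuming one read length per line on standard input (the function name read_n50 is hypothetical):

```shell
# read_n50: read one read length per line on stdin and print the N50.
# Lengths are sorted from longest to shortest, then accumulated until the
# running total reaches half of all bases.
read_n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      if (NR == 0) { print "NA"; exit }
      half = total / 2
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running >= half) { print len[i]; exit }
      }
    }'
}

# Demo: total is 1000 bases; 400 + 300 reaches the halfway point, so N50 = 300.
printf '100\n200\n300\n400\n' | read_n50
```

Read lengths can be extracted from a FASTQ stream by printing the length of every second line in each four-line record. Dedicated long-read QC tools report N50 alongside yield and quality.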

16. Sample identity, swaps and batch effects

Technically good reads can still belong to the wrong sample, wrong condition or wrong batch. Sample-level QC is essential for larger projects.

Sample swaps: Detect using known genotypes, sex markers, SNP concordance, expression signatures or metadata consistency.
Contamination: Estimate using genotype-aware tools, allele balance, taxonomic classification or unexpected mapping patterns.
Batch effects: Detect using PCA, clustering, sample metadata, sequencing run, library-preparation batch and processing date.
Outliers: Investigate samples with unusual read counts, mapping, coverage, expression profiles or QC summaries.
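
SNP concordance between two datasets that should come from the same individual can be approximated by comparing genotypes site by site. The sketch below assumes a hypothetical three-column, tab-separated table (site, genotype in sample A, genotype in sample B) with unphased genotypes written as 0/0, 0/1 or 1/1 and missing calls as ./.; the function name genotype_concordance is illustrative, and dedicated identity-checking tools are more robust.

```shell
# genotype_concordance: read "site<TAB>gt_a<TAB>gt_b" rows on stdin and print
# the percentage of sites with identical genotypes, ignoring sites where
# either genotype is missing ("./."). Assumes unphased genotype strings.
genotype_concordance() {
  awk -F'\t' '
    $2 != "./." && $3 != "./." {
      compared++
      if ($2 == $3) same++
    }
    END {
      if (compared == 0) { print "NA"; exit }
      printf "%.1f\n", 100 * same / compared
    }'
}

# Demo with made-up genotypes: 2 of 3 comparable sites agree.
printf 's1\t0/1\t0/1\ns2\t1/1\t0/1\ns3\t./.\t0/0\ns4\t0/0\t0/0\n' | genotype_concordance
```

Samples from the same individual are expected to show very high concordance at well-covered sites; a markedly lower value is a reason to check for a swap before looking for biological explanations.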

17. QC thresholds and decision making

QC thresholds should be defined in the context of the assay, organism, platform, sample type and analysis endpoint. Universal thresholds are risky.

Decision | Questions to ask | Possible outcome
Keep sample | Are metrics acceptable for the intended analysis? | Proceed and document QC status.
Trim or filter | Are adapters, low-quality tails or technical sequences affecting analysis? | Preprocess reads and rerun QC.
Flag with limitation | Is the sample usable but weaker than others? | Keep but document limitations and interpret carefully.
Exclude sample | Does it fail core QC or behave as an outlier? | Remove from analysis with documented reason.
Resequence or repeat | Is the failure caused by insufficient depth, library preparation or a sequencing problem? | Request additional data or repeat library preparation.

18. Example QC command-line workflow

The following commands illustrate a simple QC workflow. Adapt file names, reference genomes, parameters and tools to the project.

Raw read QC
mkdir -p results/qc/fastqc_raw results/qc/multiqc_raw

fastqc data/fastq/*.fastq.gz \
  --outdir results/qc/fastqc_raw \
  --threads 8

multiqc results/qc/fastqc_raw \
  --outdir results/qc/multiqc_raw

Post-trimming QC
mkdir -p results/qc/fastqc_trimmed results/qc/multiqc_trimmed

fastqc results/trimmed/*.fastq.gz \
  --outdir results/qc/fastqc_trimmed \
  --threads 8

multiqc results/qc/fastqc_trimmed \
  --outdir results/qc/multiqc_trimmed

Alignment QC summary
mkdir -p results/qc/bam

for bam in results/alignments/*.sorted.bam; do
  sample="$(basename "$bam" .sorted.bam)"
  samtools flagstat "$bam" > "results/qc/bam/${sample}.flagstat.txt"
  samtools stats "$bam" > "results/qc/bam/${sample}.samtools.stats.txt"
  samtools idxstats "$bam" > "results/qc/bam/${sample}.idxstats.txt"
done

multiqc results/qc/bam \
  --outdir results/qc/multiqc_bam

Simple FASTQ read count check
mkdir -p results/qc

for fq in data/fastq/*.fastq.gz; do
  # A FASTQ record spans four lines, so reads = lines / 4.
  reads=$(( $(gzip -dc "$fq" | wc -l) / 4 ))
  printf '%s\t%s\n' "$(basename "$fq")" "$reads"
done > results/qc/fastq_read_counts.tsv
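
For paired-end data, the resulting table can also be used to confirm that each R1 file has an R2 partner with the same number of reads. The sketch below assumes file names that differ only by _R1 and _R2, as in the examples in this tutorial; the function name check_pairs is hypothetical, and it only reports problems seen from the R1 side.

```shell
# check_pairs: read "file<TAB>reads" rows on stdin, pair files whose names
# differ only by _R1/_R2, and report missing mates or unequal read counts.
# Exits with status 1 if any problem is found, 0 otherwise.
check_pairs() {
  awk -F'\t' '
    {
      key = $1
      if (sub(/_R1/, "", key)) r1[key] = $2
      else if (sub(/_R2/, "", key)) r2[key] = $2
    }
    END {
      status = 0
      for (k in r1) {
        if (!(k in r2)) { print k "\tMISSING_R2"; status = 1 }
        else if (r1[k] != r2[k]) { print k "\tMISMATCH\t" r1[k] "\t" r2[k]; status = 1 }
      }
      exit status
    }'
}

# Demo with made-up counts: R1 and R2 disagree, so a MISMATCH row is printed.
printf 'sample_R1.fastq.gz\t100\nsample_R2.fastq.gz\t98\n' | check_pairs \
  || echo "pairing problems found"
```

Unequal counts usually mean a truncated transfer or that the two files were filtered separately, and most aligners will reject such input.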

19. QC reporting and deliverables

QC results should be summarized in a clear report that allows collaborators, reviewers or future analysts to understand which data were used and why.

Recommended QC report contents

  • Project name, date, analyst and software versions.
  • Input files, sample sheet and reference genome details.
  • Run-level and demultiplexing metrics if available.
  • Raw FASTQ QC summary and sample outliers.
  • Trimming/filtering parameters and retained read counts.
  • Alignment, coverage, duplication and contamination metrics.
  • Assay-specific QC results and interpretation.
  • Samples excluded or flagged, with reasons.
  • Final usable data summary and limitations.
A good QC report is not only a collection of plots. It should explain what the plots mean and what decisions were made.

20. NGS QC cheat sheet

Question | Metric or tool | What to look for
Are raw reads technically good? | FastQC, fastp, MultiQC | Quality, adapters, GC, duplication, overrepresented sequences.
Did trimming help? | fastp/Cutadapt reports, FastQC after trimming | Adapter reduction, retained bases, read length, quality improvement.
Do reads map as expected? | SAMtools flagstat/stats/idxstats | Mapping rate, proper pairing, chromosome distribution.
Is coverage sufficient? | mosdepth, Picard, GATK-style metrics | Mean depth, breadth, uniformity and low-coverage regions.
Is the RNA-seq library valid? | RSeQC, RNA-SeQC, featureCounts, MultiQC | Strandedness, assigned reads, rRNA, gene body coverage and PCA outliers.
Are variants reliable? | bcftools stats, GATK/Picard metrics | Ti/Tv, depth, genotype quality, allele balance and filters.
Are samples correct? | Genotype concordance, sex check, contamination tools, PCA | Sample swaps, contamination, outliers and batch effects.
Are long-read data suitable? | Platform reports, NanoPlot/pycoQC-style summaries, alignment QC | Yield, read length, N50, read quality, coverage and platform-specific metrics.

Frequently asked questions

When should NGS quality control be performed?

Quality control should be performed at several stages: after sequencing run completion, on raw FASTQ files, after trimming or filtering, after alignment or quantification, and after final statistical analysis. Each stage detects different problems.

Is FastQC enough for NGS quality control?

FastQC is a useful first check for raw FASTQ files, but it is not enough for a complete project. A robust QC workflow also includes MultiQC summaries, trimming reports, alignment metrics, duplication, coverage, insert size, contamination, sample identity, batch effects and assay-specific metrics.

Should all reads with low-quality bases be removed?

Not automatically. Trimming and filtering should be guided by the data type and QC reports. Aggressive trimming can reduce usable read length, mappability and downstream sensitivity.

What is a good mapping rate?

There is no universal value. Expected mapping rate depends on organism, library type, contamination level, reference quality and assay. Human WGS and RNA-seq often have high mapping rates, while metagenomics, degraded samples and non-model organisms may behave differently.

What does high duplication mean?

High duplication can indicate PCR over-amplification, low library complexity or over-sequencing. However, it can also be expected in targeted panels, amplicon sequencing, small RNA-seq, low-input samples or RNA-seq libraries dominated by a few highly expressed genes.

What QC metrics are most important for RNA-seq?

Important RNA-seq QC metrics include read quality, adapter content, mapping rate, rRNA fraction, strand specificity, gene body coverage, 5′/3′ bias, duplication, assigned reads, library complexity, PCA clustering and sample outliers.

What QC metrics are most important for DNA-seq?

Important DNA-seq QC metrics include read quality, mapping rate, duplicate rate, insert size, coverage depth and uniformity, target enrichment metrics, contamination, sex checks, Ti/Tv ratio, heterozygosity and variant-call quality.

Can AI help with NGS QC?

AI can help summarize QC reports, detect unusual patterns, triage failed samples and generate human-readable reports. However, QC decisions should remain traceable, based on explicit metrics and reviewed by experienced analysts.