Quality control of NGS data: from raw reads to reliable results.
A practical tutorial for evaluating next-generation sequencing data quality across FASTQ files, trimming, alignment, coverage, RNA-seq, DNA-seq, targeted panels, ChIP-seq, ATAC-seq, single-cell and long-read workflows. The goal is to identify technical problems early and document which data are reliable for interpretation.
NGS quality control is the process of evaluating whether sequencing data are technically suitable for downstream analysis and biological interpretation. QC is not limited to checking raw reads. It also includes metadata validation, run-level metrics, preprocessing results, alignment performance, coverage, sample identity, contamination, batch effects and assay-specific metrics.
Assay-specific tables, figures and pass/fail notes.
3. Sequencing run-level QC
Run-level QC evaluates whether the sequencing instrument and demultiplexing performed as expected. These metrics often come from instrument software or the sequencing provider.
Total yield: Total bases or reads generated by the run. Compare with the expected output for the flow cell, chemistry and read length.
Q30 or quality distribution: Fraction of bases at or above a quality-score threshold (Q30 corresponds to an estimated error probability of 1 in 1,000). Interpretation depends on platform and read position.
Index balance: Uneven index representation can indicate pooling problems or sample-index issues.
Undetermined reads: A high fraction of undetermined reads may indicate an incorrect sample sheet, index mismatches, poor index-read quality or index hopping.
If run-level metrics fail, downstream FASTQ QC may reveal symptoms, but the root cause may be library loading, reagent quality, sample sheet errors or instrument issues.
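As a minimal sketch, index balance and the undetermined fraction can be summarised from per-sample demultiplexing counts. The function name, the input dictionary and the 0.5 cut-off are illustrative assumptions, not values taken from any instrument software.

```python
from statistics import median

def run_level_summary(reads_per_sample, undetermined_reads):
    """Summarise index balance and the undetermined fraction for one run."""
    if not reads_per_sample:
        raise ValueError("no samples provided")
    assigned = sum(reads_per_sample.values())
    total = assigned + undetermined_reads
    med = median(reads_per_sample.values())
    if total == 0 or med == 0:
        raise ValueError("read counts are zero; cannot compute ratios")
    # Ratio of each sample's read count to the median sample (1.0 = balanced).
    balance = {sample: n / med for sample, n in reads_per_sample.items()}
    return {
        "undetermined_fraction": undetermined_reads / total,
        "balance": balance,
    }

# Hypothetical counts for three pooled libraries.
counts = {"S1": 1000, "S2": 1100, "S3": 200}
summary = run_level_summary(counts, undetermined_reads=100)

# Flag samples with less than half the median yield (illustrative cut-off).
flagged = [s for s, ratio in summary["balance"].items() if ratio < 0.5]
```

A flagged sample is a prompt to check pooling and the sample sheet, not an automatic failure.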
4. Raw FASTQ QC metrics
FASTQ QC is usually the first analysis step after receiving data. It checks read-level properties before alignment or quantification.
| Metric | What it means | Typical interpretation |
| --- | --- | --- |
| Per-base quality | Quality score along read positions. | Quality often decreases toward read ends; severe decline may require trimming or resequencing. |
| Adapter content | Presence of adapter sequences in reads. | Common in short inserts, small RNA-seq and over-sequenced libraries; trimming is usually needed. |
| GC content | Distribution of GC percentage across reads. | Strong deviation from expected distribution can suggest contamination or library bias. |
| Duplication | Repeated identical sequences. | May indicate low complexity, PCR duplication or expected biology depending on assay. |
| Overrepresented sequences | Sequences occurring more often than expected. | May represent adapters, primers, ribosomal reads, highly abundant transcripts or contamination. |
| Sequence length | Read-length distribution. | Uniform in many short-read datasets; variable after trimming or in long-read data. |
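To make two of these metrics concrete, the sketch below computes the mean quality per read position and the GC fraction per read from a small in-memory FASTQ. It assumes four-line records and Phred+33 quality encoding; the function names are illustrative, and this is not how FastQC itself is implemented.

```python
import io

def read_fastq(handle):
    """Yield (sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual

def per_base_mean_quality(records, offset=33):
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    sums, counts = [], []
    for _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

def gc_fraction(seq):
    """Fraction of G and C bases in one read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

# Two toy reads: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nGGC\n+\nI5+\n"
records = list(read_fastq(io.StringIO(fastq_text)))
mean_quality = per_base_mean_quality(records)
gc_values = [gc_fraction(seq) for seq, _ in records]
```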
5. FastQC and MultiQC
FastQC provides per-sample reports for FASTQ files. MultiQC aggregates many tool outputs into a project-level report, making it easier to compare samples and detect outliers.
Look first for sample outliers, not only pass/fail icons.
Compare read counts across samples and libraries.
Check whether R1 and R2 behave differently in paired-end data.
Review adapter content before deciding trimming settings.
Check whether unusual GC content is present in many samples or only one sample.
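One way to look for sample outliers rather than relying on pass/fail icons is a robust score based on the median and the median absolute deviation (MAD). The sketch below applies it to read counts. The sample names and counts are hypothetical, and the 3.5 cut-off is a common convention for this score, not a rule from FastQC or MultiQC.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the names whose modified z-score exceeds the threshold."""
    names = list(values)
    xs = [values[name] for name in names]
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    if mad == 0:
        return []  # no spread to scale by; nothing can be flagged this way
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # under a normal distribution.
    return [n for n in names if abs(0.6745 * (values[n] - med) / mad) > threshold]

# Hypothetical read counts in millions for five libraries.
read_counts = {"A": 20.1, "B": 19.8, "C": 20.5, "D": 21.0, "E": 4.2}
outliers = mad_outliers(read_counts)
```

The same function can be applied to any per-sample metric, such as mean GC content or adapter fraction, to see whether a deviation affects one sample or many.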
6. Trimming and filtering QC
Trimming removes adapters, primers, low-quality ends or reads below a minimum length. Filtering removes reads that fail defined criteria. QC should be repeated after trimming to confirm that preprocessing improved the data.
Trim when: adapter sequences are present, read ends are of poor quality, primer sequences must be removed or the insert size is shorter than the read length.
Avoid over-trimming when: quality decline is mild, the aligner handles soft clipping well or trimming would create very short reads.
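The trade-off between trimming and read length can be illustrated with a deliberately simple 3' quality trimmer. It removes trailing bases below a quality threshold and discards reads that end up too short. This is an illustrative sketch that assumes Phred+33 encoding; dedicated trimmers use more refined algorithms, and the thresholds here are placeholders.

```python
def trim_3prime(seq, qual, min_quality=20, min_length=20, offset=33):
    """Trim low-quality bases from the 3' end; return None if the read is too short."""
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    if end < min_length:
        return None  # read fails the minimum-length filter
    return seq[:end], qual[:end]

# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
seq, qual = "ACGTACGTAC", "IIIIIIII##"
kept = trim_3prime(seq, qual, min_length=5)
dropped = trim_3prime(seq, qual, min_length=9)
```

Raising `min_quality` or `min_length` removes more data, which is why QC should be repeated after trimming.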
Contamination can come from other species, adapters, primers, rRNA, microbial reads, index misassignment, PhiX, sample swaps or laboratory handling. The relevant checks depend on the organism and assay.
RNA-seq QC includes both sequencing quality and transcriptome-specific properties. Good RNA-seq QC checks whether libraries reflect the intended RNA population and experimental design.
| Metric | What it detects | Possible action |
| --- | --- | --- |
| Mapping rate | Read alignment to genome or transcriptome. | Check reference, contamination, library type and read quality. |
| Assigned reads | Reads assigned to genes or transcripts. | Check annotation version, strandedness and feature-counting settings. |
| rRNA fraction | Residual ribosomal RNA. | Assess depletion/enrichment success and usable depth. |
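These three metrics are simple ratios once the counts are available from the aligner and the counting tool. The sketch below assumes that assigned and rRNA reads are expressed relative to mapped reads. Tools differ in which denominator they report, so the choice here is an assumption, and the counts are hypothetical.

```python
def rnaseq_fractions(total_reads, mapped_reads, assigned_reads, rrna_reads):
    """Compute basic RNA-seq QC ratios from read counts."""
    if total_reads <= 0 or mapped_reads <= 0:
        raise ValueError("total_reads and mapped_reads must be positive")
    return {
        "mapping_rate": mapped_reads / total_reads,
        # Assumption: assigned and rRNA fractions use mapped reads as denominator.
        "assigned_fraction": assigned_reads / mapped_reads,
        "rrna_fraction": rrna_reads / mapped_reads,
    }

# Hypothetical counts for one library.
metrics = rnaseq_fractions(
    total_reads=1_000_000,
    mapped_reads=900_000,
    assigned_reads=720_000,
    rrna_reads=45_000,
)
```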
A good ATAC-seq library often shows nucleosome patterning in the fragment-size distribution and strong enrichment at transcription start sites (TSS).
ChIP-seq QC depends strongly on antibody quality, target biology and control design. ATAC-seq QC depends strongly on nuclei preparation and mitochondrial contamination.
14. Single-cell NGS QC
Single-cell QC evaluates both sequencing quality and cell-level properties. It is usually performed after demultiplexing and count-matrix generation.
Cell-level metrics: UMIs per cell, genes per cell, mitochondrial fraction, ribosomal fraction and cell barcode quality.
Sample-level metrics: Total cells recovered, reads per cell, sequencing saturation and cell-calling consistency.
Technical artefacts: Empty droplets, ambient RNA, doublets, multiplets and barcode swapping.
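Cell-level filtering on these metrics can be sketched as follows. The thresholds, barcodes and dictionary keys are hypothetical; suitable cut-offs depend on tissue, chemistry and protocol and are usually chosen after inspecting the metric distributions.

```python
def filter_cells(cells, min_umis=500, min_genes=200, max_mito_fraction=0.2):
    """Return barcodes that pass simple per-cell QC thresholds."""
    kept = []
    for barcode, m in cells.items():
        # Treat barcodes with no UMIs as failing the mitochondrial check.
        mito_fraction = m["mito_umis"] / m["umis"] if m["umis"] > 0 else 1.0
        if (
            m["umis"] >= min_umis
            and m["genes"] >= min_genes
            and mito_fraction <= max_mito_fraction
        ):
            kept.append(barcode)
    return kept

# Hypothetical per-barcode summaries from a count matrix.
cells = {
    "AAAC": {"umis": 5000, "genes": 1800, "mito_umis": 250},   # 5% mitochondrial
    "AAAG": {"umis": 120, "genes": 90, "mito_umis": 6},        # very low counts
    "AACT": {"umis": 3000, "genes": 900, "mito_umis": 1200},   # 40% mitochondrial
}
kept = filter_cells(cells)
```

Fixed thresholds like these do not address ambient RNA or doublets, which need dedicated methods.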
Quality control should be performed at several stages: after sequencing run completion, on raw FASTQ files, after trimming or filtering, after alignment or quantification, and after final statistical analysis. Each stage detects different problems.
Is FastQC enough for NGS quality control?
FastQC is a useful first check for raw FASTQ files, but it is not enough for a complete project. A robust QC workflow also includes MultiQC summaries, trimming reports and alignment metrics, along with checks of duplication, coverage, insert size, contamination, sample identity, batch effects and assay-specific metrics.
Should all reads with low-quality bases be removed?
Not automatically. Trimming and filtering should be guided by the data type and QC reports. Aggressive trimming can reduce usable read length, mappability and downstream sensitivity.
What is a good mapping rate?
There is no universal value. Expected mapping rate depends on organism, library type, contamination level, reference quality and assay. Human WGS and RNA-seq often have high mapping rates, while metagenomics, degraded samples and non-model organisms may behave differently.
What does high duplication mean?
High duplication can indicate PCR over-amplification, low library complexity or over-sequencing. However, it can also be expected in targeted panels, amplicon sequencing, small RNA-seq, low-input samples or RNA-seq libraries dominated by a few highly expressed transcripts.
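As a rough illustration of what a duplication figure measures, the sketch below counts exact sequence duplicates among raw reads. This is a simplification: FastQC estimates duplication from a subset of reads, and alignment-based tools define duplicates by mapping position, so the numbers will not match those tools.

```python
from collections import Counter

def duplication_fraction(sequences):
    """Fraction of reads that are extra copies of an already seen sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Distinct sequences are counted once; everything beyond that is a duplicate.
    return 1 - len(counts) / total

# Toy example: 3 distinct sequences among 6 reads.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAG", "CCAG"]
dup = duplication_fraction(reads)
```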
What QC metrics are most important for RNA-seq?
Important RNA-seq QC metrics include read quality, adapter content, mapping rate, rRNA fraction, strand specificity, gene body coverage, 5′/3′ bias, duplication, assigned reads, library complexity, PCA clustering and sample outliers.
What QC metrics are most important for DNA-seq?
Important DNA-seq QC metrics include read quality, mapping rate, duplicate rate, insert size, coverage depth and uniformity, target enrichment metrics, contamination, sex checks, Ti/Tv ratio, heterozygosity and variant-call quality.
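The Ti/Tv ratio mentioned above is the number of transitions (A↔G, C↔T) divided by the number of transversions among single-nucleotide variants. A minimal sketch, assuming variants are already available as (ref, alt) pairs; the input list is hypothetical and no VCF parsing is shown.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")

def titv_ratio(snvs):
    """Transition/transversion ratio for an iterable of (ref, alt) alleles."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, multi-base alleles and non-variants
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")

# Hypothetical calls: four transitions, two transversions, one indel (ignored).
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"),
         ("T", "C"), ("G", "T"), ("AT", "A")]
ratio = titv_ratio(calls)
```

Expected values differ between whole-genome and exome data, so the ratio is best compared against runs of the same assay.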
Can AI help with NGS QC?
AI can help summarize QC reports, detect unusual patterns, triage failed samples and generate human-readable reports. However, QC decisions should remain traceable, based on explicit metrics and reviewed by experienced analysts.