NGS FAQ

When PCR duplicates should be removed.

PCR duplicates can bias some sequencing analyses, but removing them blindly can also remove real biological signal. The correct decision depends on the assay type, library complexity, read depth, library preparation and whether unique molecular identifiers (UMIs) were used.

Quick answer

PCR duplicates should be removed, marked, retained or collapsed depending on the experimental design. There is no universal rule that applies to all NGS data.

Practical rule: for DNA-seq variant calling, duplicates are usually marked and excluded from variant evidence. For standard bulk RNA-seq without UMIs, coordinate-based duplicate removal is usually avoided. For UMI-based assays, reads should be collapsed using UMI-aware logic. For ChIP-seq, ATAC-seq and similar enrichment assays, duplicate handling depends on library complexity, enrichment and downstream analysis.

What are PCR duplicates?

PCR duplicates are multiple sequencing reads that appear to come from the same original library molecule after amplification. In coordinate-based duplicate marking, read pairs are typically considered duplicates when both mates share the same 5' mapping positions and orientation.

Duplicate reads can also arise from sequencing or optical artefacts. These are not identical to PCR duplicates, but many duplicate-marking tools can report both types if the read names and parameters support optical-duplicate detection.

PCR duplicate: Multiple reads produced from the same original molecule during PCR amplification. They can overrepresent a molecule and bias variant allele fractions, peak heights or coverage estimates.
Natural duplicate: Independent molecules that happen to produce reads with the same coordinates. These are common in high-depth, low-complexity, targeted, RNA or enriched libraries and should not always be removed.
Optical duplicate: Multiple read records caused by a sequencing instrument detecting one cluster as more than one cluster. These are a technical artefact.
UMI duplicate: Reads sharing the same molecule barcode and mapping position. These can be collapsed to represent the original molecule more accurately.
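
The coordinate-based definition above can be illustrated with a minimal Python sketch. It is not the algorithm of any particular tool: the `Pair` record, its field names and the "keep the highest summed base quality" rule are assumptions made for illustration, and the read pairs are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class Pair(NamedTuple):
    """Hypothetical, simplified read-pair record (not a real BAM record)."""
    name: str
    chrom: str
    start1: int            # 5' mapping position of mate 1
    start2: int            # 5' mapping position of mate 2
    orientation: str       # e.g. "FR"
    base_quality_sum: int  # used only to choose which pair to keep


def mark_coordinate_duplicates(pairs):
    """Return the names of read pairs that would be flagged as duplicates."""
    groups = defaultdict(list)
    for p in pairs:
        groups[(p.chrom, p.start1, p.start2, p.orientation)].append(p)

    duplicates = set()
    for members in groups.values():
        # Keep one representative per coordinate group; flag the rest.
        best = max(members, key=lambda p: p.base_quality_sum)
        duplicates.update(p.name for p in members if p.name != best.name)
    return duplicates


pairs = [
    Pair("r1", "chr1", 1000, 1250, "FR", 2400),
    Pair("r2", "chr1", 1000, 1250, "FR", 2550),
    Pair("r3", "chr1", 1000, 1300, "FR", 2500),  # different mate position
    Pair("r4", "chr2", 1000, 1250, "FR", 2300),  # different chromosome
]
print(sorted(mark_coordinate_duplicates(pairs)))  # ['r1']
```

Note that the grouping key contains only coordinates and orientation, so this logic cannot tell a PCR duplicate from a natural duplicate. That limitation is why the assay context matters.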

When to remove, mark or retain duplicates

For each assay type: the recommended duplicate strategy, followed by the reason.

WGS / WES / targeted DNA-seq without UMIs: Mark and usually exclude
  • Strategy: Mark duplicates after alignment and sorting. Exclude duplicate-flagged reads from variant calling unless the protocol or caller specifies otherwise.
  • Reason: Repeated evidence from the same DNA fragment can inflate confidence, distort allele fractions and bias depth-based interpretation.

DNA-seq with UMIs: Collapse by UMI
  • Strategy: Use UMI-aware grouping or consensus generation rather than coordinate-only duplicate removal.
  • Reason: UMIs distinguish original molecules before amplification and allow more accurate correction of PCR amplification bias.

Bulk RNA-seq without UMIs: Usually retain
  • Strategy: Do not remove coordinate duplicates by default. Use duplication metrics as QC, not as an automatic filtering rule.
  • Reason: Highly expressed transcripts naturally generate many reads with similar coordinates. Removing them can distort expression estimates.

RNA-seq with UMIs: Collapse by UMI
  • Strategy: Deduplicate at the molecule level using gene or transcript assignment, UMI and positional information depending on protocol.
  • Reason: UMI-based RNA-seq aims to count original molecules rather than PCR-amplified reads.

Single-cell RNA-seq: UMI collapse
  • Strategy: Use the technology-specific pipeline or UMI-aware tools.
  • Reason: Single-cell expression matrices are usually based on cell barcodes and UMIs, not raw read counts alone.

ChIP-seq / CUT&RUN / CUT&Tag: Context-dependent
  • Strategy: Often remove or cap duplicates before peak calling, but inspect library complexity and enrichment.
  • Reason: Duplicates can represent PCR artefacts, but strong enrichment can also create real biological pileups at binding sites.

ATAC-seq: Usually remove or mark for peak calling
  • Strategy: Remove duplicates for many peak-calling workflows, while also evaluating library complexity and Tn5-specific QC.
  • Reason: Duplicate reads can inflate accessibility peaks, but high-quality libraries and deep sequencing need careful interpretation.

WGBS / bisulfite-seq: Usually mark/remove after alignment
  • Strategy: Use bisulfite-aware duplicate handling. UMIs are preferred for very low-input or high-sensitivity assays.
  • Reason: PCR duplicates can bias methylation estimates if the same converted molecule is counted repeatedly.

Amplicon sequencing: Do not coordinate-deduplicate blindly
  • Strategy: Use protocol-specific logic, UMIs or consensus methods if molecule counting is needed.
  • Reason: All true molecules may share similar coordinates because they come from the same amplified target region.

Metagenomics: Evaluate carefully
  • Strategy: Deduplication depends on library type, input amount, complexity, host depletion and analysis goal.
  • Reason: Duplicate-like reads can reflect PCR artefacts, real abundant organisms or low-complexity regions.

Marking duplicates is usually safer than deleting them

In many production workflows, duplicates are first marked in the BAM, CRAM or SAM file. Marked reads remain in the file, but they carry a duplicate flag. Downstream tools can then decide whether to ignore or include them.

Why mark?
  • Preserves the original alignment evidence.
  • Allows duplicate metrics to be reviewed later.
  • Supports reanalysis with different assumptions.
  • Prevents irreversible loss of potentially useful data.
When to remove?
  • When a downstream tool requires a deduplicated file.
  • When producing a clean analysis-specific peak-calling file.
  • When storage reduction is necessary and raw/marked files are retained elsewhere.
  • When a validated protocol explicitly specifies removal.
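
The difference between marking and removing can be sketched in a few lines of Python. In the SAM specification the duplicate flag is bit 0x400 (decimal 1024) of the FLAG field. The records below are hypothetical (read name, FLAG) tuples rather than real alignments.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit 1024: "PCR or optical duplicate"


def mark(flag: int) -> int:
    """Set the duplicate bit and leave all other FLAG bits unchanged."""
    return flag | DUPLICATE_FLAG


def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit is set."""
    return bool(flag & DUPLICATE_FLAG)


# Hypothetical records: (read name, SAM FLAG)
records = [("r1", 99), ("r2", mark(99)), ("r3", 147)]

marked_file = records  # marking keeps every record in the file
removed_file = [(name, flag) for name, flag in records if not is_duplicate(flag)]

print(mark(99))                              # 1123
print(len(marked_file), len(removed_file))   # 3 2
```

A downstream tool reading the marked file can apply the same `is_duplicate` test on the fly, which is why marking loses no information while removal does.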

UMIs: the best way to identify true molecules

Unique Molecular Identifiers are short random or semi-random sequences added to molecules before amplification. After sequencing, reads sharing the same UMI and compatible mapping information can be interpreted as originating from the same starting molecule.

UMI-aware processing is especially important for low-input libraries, single-cell sequencing, cell-free DNA, targeted panels, error correction, rare-variant detection and molecular counting.

With UMIs, the question is not simply “do these reads have the same coordinates?” but “do these reads represent the same original molecule?”
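
As a rough illustration of that shift, the Python sketch below counts the same five hypothetical reads twice: once with a coordinate-only key and once with a key that also includes the UMI. It assumes exact UMI matching, which is a simplification; production tools typically also tolerate sequencing errors within the UMI.

```python
from collections import Counter


def count_molecules(reads):
    """Count reads per (chrom, pos, strand, umi) key.

    Each distinct key is treated as one original molecule (exact UMI match).
    """
    return Counter(reads)


# Hypothetical reads: (chrom, 5' position, strand, UMI)
reads = [
    ("chr1", 500, "+", "ACGTAC"),
    ("chr1", 500, "+", "ACGTAC"),
    ("chr1", 500, "+", "ACGTAC"),
    ("chr1", 500, "+", "TTGCAA"),
    ("chr1", 500, "+", "GGATCC"),
]

coordinate_keys = {(chrom, pos, strand) for chrom, pos, strand, _ in reads}
molecules = count_molecules(reads)

# reads, coordinate-only groups, UMI-aware molecules
print(len(reads), len(coordinate_keys), len(molecules))  # 5 1 3
```

Coordinate-only logic would keep one read here, while UMI-aware logic recovers three distinct molecules, one of which was amplified three times.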

Assay-specific guidance

DNA-seq variant calling

Duplicate reads can make a variant appear better supported than it truly is. For most short-read WGS, WES and targeted DNA-seq workflows without UMIs, duplicate marking is a standard step before variant calling. The final BAM or CRAM is often retained with duplicate flags instead of physically deleting reads.

RNA-seq expression analysis

Standard bulk RNA-seq measures transcript abundance. If a gene is highly expressed, many independent RNA molecules may produce reads at the same coordinates. Coordinate-based duplicate removal can therefore undercount highly expressed transcripts and distort differential expression. In UMI-based RNA-seq, molecule-aware deduplication is appropriate.
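
The undercounting effect can be made concrete with a small calculation. The sketch assumes single-end reads whose start positions are drawn uniformly and independently from a fixed number of possible positions on a transcript; under that assumption the expected number of distinct start positions among n reads over S positions is S × (1 − (1 − 1/S)^n), which is the most that coordinate-based deduplication can keep. The transcript size and read counts are hypothetical.

```python
def expected_reads_after_dedup(n_reads: int, n_positions: int) -> float:
    """Expected number of distinct start positions among n_reads reads.

    Assumes uniform, independent start positions (a simplification).
    """
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)


# Hypothetical transcript with 2,000 possible read start positions.
for n in (100, 1_000, 10_000, 100_000):
    kept = expected_reads_after_dedup(n, 2_000)
    print(f"{n:>7} reads -> {kept:8.1f} kept ({kept / n:.1%})")
```

Under these assumptions a lowly expressed gene keeps almost all of its reads, while a highly expressed gene saturates near the number of available positions, so the ratio between the two is compressed even though no PCR duplicate was involved. Paired-end keys enlarge the position space but do not change the saturation behaviour.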

ChIP-seq, CUT&RUN, CUT&Tag and ATAC-seq

Enrichment assays produce pileups at biologically meaningful sites. Duplicate reads may reflect PCR artefacts, but they may also reflect genuine enrichment or limited local fragmentation possibilities. Duplicate removal is common before peak calling, but the decision should be paired with library-complexity metrics, read depth, control samples and protocol expectations.
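
Capping is a middle ground between keeping and removing all duplicates: at most a fixed number of reads is kept per position and strand. The Python sketch below shows the idea; the function name, the `max_per_position` parameter and the reads are hypothetical, and real peak callers may implement their own variant of this rule.

```python
from collections import defaultdict


def cap_duplicates(reads, max_per_position=1):
    """Keep at most max_per_position reads per (chrom, pos, strand).

    reads: iterable of (name, chrom, pos, strand); input order is preserved.
    """
    seen = defaultdict(int)
    kept = []
    for read in reads:
        _, chrom, pos, strand = read
        key = (chrom, pos, strand)
        if seen[key] < max_per_position:
            seen[key] += 1
            kept.append(read)
    return kept


# Hypothetical pileup: five reads at one position, one read elsewhere.
reads = [
    ("a", "chr1", 100, "+"),
    ("b", "chr1", 100, "+"),
    ("c", "chr1", 100, "+"),
    ("d", "chr1", 100, "+"),
    ("e", "chr1", 100, "+"),
    ("f", "chr1", 250, "-"),
]
print(len(cap_duplicates(reads, 1)), len(cap_duplicates(reads, 3)))  # 2 4
```

A cap of 1 is equivalent to full coordinate deduplication; a higher cap retains part of a strong pileup while still limiting runaway amplification.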

Amplicon and targeted low-complexity libraries

In amplicon sequencing, many real reads start and end at the same coordinates because the assay deliberately amplifies defined regions. Coordinate-only deduplication can be misleading. UMIs or consensus approaches are usually needed if original-molecule counting or error suppression is required.

Recommended duplicate-handling workflow

1. Start with design: Identify assay type, input amount, read depth, UMIs and intended downstream analysis.
2. QC first: Review FastQC/MultiQC, mapping rate, insert size, coverage, library complexity and duplication metrics.
3. Mark or collapse: Use coordinate-based marking for many DNA-seq assays or UMI-aware grouping for UMI libraries.
4. Document decision: Report whether duplicates were marked, removed, retained, capped or collapsed by UMI.

Example: short-read DNA-seq without UMIs

  1. Run FASTQ quality control.
  2. Align reads to the reference genome.
  3. Sort alignments.
  4. Mark duplicates and collect duplication metrics.
  5. Run downstream variant calling with duplicate-flagged reads excluded according to the validated workflow.
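
Steps 1 to 4 above could be assembled as in the following Python sketch, which only builds shell command strings and does not run them. The tool choices (FastQC, BWA-MEM, samtools and GATK MarkDuplicates) and all file names are assumptions for illustration, not a prescribed pipeline. Read-group tagging and the variant-calling step are left out because they depend on the validated workflow.

```python
import shlex


def dnaseq_commands(sample, ref, fq1, fq2):
    """Return shell command strings for QC, align+sort and duplicate marking.

    Nothing is executed; file names are derived from the sample name.
    """
    q = shlex.quote
    sorted_bam = f"{sample}.sorted.bam"
    marked_bam = f"{sample}.marked.bam"
    metrics = f"{sample}.dup_metrics.txt"
    return [
        # Step 1: FASTQ quality control
        f"fastqc {q(fq1)} {q(fq2)}",
        # Steps 2-3: align, then coordinate-sort the alignments
        f"bwa mem {q(ref)} {q(fq1)} {q(fq2)} | samtools sort -o {q(sorted_bam)} -",
        # Step 4: mark (not remove) duplicates and write duplication metrics
        f"gatk MarkDuplicates -I {q(sorted_bam)} -O {q(marked_bam)} -M {q(metrics)}",
    ]


commands = dnaseq_commands(
    "sampleA", "ref.fa", "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"
)
for cmd in commands:
    print(cmd)
```

Recording the exact strings produced here, together with the tool versions, is one simple way to document the duplicate-handling decision.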

Example: UMI-based targeted sequencing

  1. Extract or annotate UMI information from reads.
  2. Align reads to the target reference.
  3. Group reads by UMI and compatible coordinates.
  4. Collapse reads or generate consensus sequences.
  5. Call variants or quantify molecules from consensus or collapsed evidence.
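
Steps 3 and 4 can be sketched as follows. This is a deliberately simple majority-vote consensus over equal-length, already aligned read sequences, assuming exact UMI matching and ignoring base qualities; dedicated consensus callers are considerably more careful. The reads are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Per-position majority base across equal-length sequences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def collapse_by_umi(reads):
    """Group reads by (umi, pos) and return one consensus per family.

    reads: iterable of (umi, pos, sequence).
    """
    families = defaultdict(list)
    for umi, pos, seq in reads:
        families[(umi, pos)].append(seq)
    return {key: consensus(seqs) for key, seqs in families.items()}


# Hypothetical reads: (UMI, mapping position, sequence)
reads = [
    ("AACGT", 100, "ACGTTGCA"),
    ("AACGT", 100, "ACGTTGCA"),
    ("AACGT", 100, "ACGATGCA"),  # one read differs at a single base
    ("GGTCA", 100, "ACGTTGCA"),
]
collapsed = collapse_by_umi(reads)
print(len(collapsed))                # 2
print(collapsed[("AACGT", 100)])     # ACGTTGCA
```

Four reads collapse to two molecules, and the single discordant base is outvoted within its UMI family, which is the basis of UMI-driven error suppression.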

Common mistakes to avoid

Removing duplicates from FASTQ by identical sequence: This can remove real biological reads and should not be used as a general replacement for alignment- or UMI-aware duplicate handling.
Applying DNA-seq rules to RNA-seq: Bulk RNA-seq expression estimates can be distorted if coordinate duplicates are removed without considering transcript abundance and library design.
Ignoring UMIs: If a library contains UMIs, duplicate handling should use UMI-aware methods. Coordinate-only deduplication wastes molecule-level information.
Confusing high duplication with failure: High duplication can indicate low complexity or over-amplification, but it can also occur in enriched, targeted or deeply sequenced libraries.

How SciBerg reports duplicate handling

For NGS projects, SciBerg can provide transparent duplicate-handling documentation as part of the final report.

  • Software and version used for duplicate marking or UMI collapsing.
  • Exact parameters and command lines.
  • Duplicate-rate metrics and library-complexity summaries.
  • Whether duplicate reads were retained, marked, removed, capped or collapsed.
  • Rationale for assay-specific duplicate decisions.
  • Impact on downstream outputs such as coverage, counts, peak calls or variant calls.

Frequently asked questions

Should PCR duplicates always be removed?

No. PCR duplicates should not be removed automatically in every NGS project. The correct action depends on assay type, library complexity, read depth, whether UMIs were used, and the downstream analysis. In many DNA-seq workflows duplicates are marked and excluded from variant calling, while in non-UMI bulk RNA-seq workflows coordinate-based duplicate removal is usually avoided.

Is it better to mark duplicates or physically remove them?

For most projects, marking duplicates is safer than deleting them. Marking keeps the evidence available for QC, troubleshooting and reanalysis, while downstream tools can ignore duplicate-flagged reads where appropriate.

Should duplicates be removed from RNA-seq?

Usually not for standard bulk RNA-seq without UMIs. Highly expressed transcripts can naturally produce reads with the same mapping coordinates, and coordinate-based deduplication can distort expression estimates. UMI-based RNA-seq is different: reads are collapsed by UMI and molecule identity.

Should duplicates be removed from ChIP-seq or ATAC-seq?

Duplicates are often removed or capped before peak calling, but the decision depends on library complexity, enrichment, read depth and assay design. Strong biological enrichment can create real pileups, so duplicate metrics should be interpreted together with other QC measures.

What are UMIs and how do they change duplicate removal?

Unique Molecular Identifiers are short molecular barcodes added before amplification. They allow reads derived from the same original molecule to be grouped and collapsed more accurately than coordinate-only deduplication.

Can I remove duplicates directly from FASTQ files?

Usually no. Standard PCR duplicate detection requires alignment coordinates or UMI-aware molecule grouping. Removing reads only because they have identical sequences in FASTQ can remove real biological signal, especially in RNA-seq, amplicon sequencing or low-complexity libraries.