Quick answer
PCR duplicates should be removed, marked, retained or collapsed depending on the experimental design. There is no universal rule that applies to all NGS data.
PCR duplicates can bias some sequencing analyses, but removing them blindly can also remove real biological signal. The correct decision depends on the assay type, library complexity, read depth, library preparation and whether unique molecular identifiers (UMIs) were used.
PCR duplicates are multiple sequencing reads that appear to come from the same original library molecule after amplification. In coordinate-based duplicate marking, paired reads are often considered duplicates when they have the same or very similar mapping coordinates and orientation.
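The coordinate-based idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any particular tool: the `ReadPair` fields, the grouping key and the rule of keeping the pair with the highest base-quality sum are assumptions chosen for clarity, and production duplicate markers apply more detailed rules (for example around clipped alignments).

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class ReadPair(NamedTuple):
    # Hypothetical, simplified record for one aligned read pair.
    name: str
    chrom: str
    start: int             # leftmost position of the fragment
    end: int               # rightmost position of the fragment
    strand: str            # orientation of read 1: "+" or "-"
    base_quality_sum: int  # used only to pick a representative


def mark_coordinate_duplicates(pairs: Iterable[ReadPair]) -> Set[str]:
    """Return the names of pairs that would be flagged as duplicates."""
    groups = defaultdict(list)
    for pair in pairs:
        # Pairs with identical coordinates and orientation share a group.
        groups[(pair.chrom, pair.start, pair.end, pair.strand)].append(pair)

    duplicates = set()
    for members in groups.values():
        # Keep one representative per group (assumed rule: best quality sum;
        # max() returns the first one seen on ties) and flag the rest.
        best = max(members, key=lambda p: p.base_quality_sum)
        duplicates.update(p.name for p in members if p is not best)
    return duplicates


pairs = [
    ReadPair("a", "chr1", 100, 350, "+", 900),
    ReadPair("b", "chr1", 100, 350, "+", 950),  # same fragment as "a"
    ReadPair("c", "chr1", 100, 350, "-", 800),  # different orientation
    ReadPair("d", "chr2", 100, 350, "+", 700),  # different chromosome
]
flagged = mark_coordinate_duplicates(pairs)
```

Here only pair `a` is flagged, because it shares coordinates and orientation with the higher-quality pair `b`.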
Duplicate reads can also arise from sequencing or optical artefacts. These are distinct from PCR duplicates, but many duplicate-marking tools can report both types when the read names carry flow-cell position information and the tool's parameters enable optical-duplicate detection.
| Assay type | Recommended duplicate strategy | Reason |
|---|---|---|
| WGS / WES / targeted DNA-seq without UMIs | Mark and usually exclude: mark duplicates after alignment and sorting. Exclude duplicate-flagged reads from variant calling unless the protocol or caller specifies otherwise. | Repeated evidence from the same DNA fragment can inflate confidence, distort allele fractions and bias depth-based interpretation. |
| DNA-seq with UMIs | Collapse by UMI: use UMI-aware grouping or consensus generation rather than coordinate-only duplicate removal. | UMIs distinguish original molecules before amplification and allow more accurate correction of PCR amplification bias. |
| Bulk RNA-seq without UMIs | Usually retain: do not remove coordinate duplicates by default. Use duplication metrics as QC, not as an automatic filtering rule. | Highly expressed transcripts naturally generate many reads with similar coordinates. Removing them can distort expression estimates. |
| RNA-seq with UMIs | Collapse by UMI: deduplicate at the molecule level using gene or transcript assignment, UMI and positional information, depending on protocol. | UMI-based RNA-seq aims to count original molecules rather than PCR-amplified reads. |
| Single-cell RNA-seq | UMI collapse: use the technology-specific pipeline or UMI-aware tools. | Single-cell expression matrices are usually based on cell barcodes and UMIs, not raw read counts alone. |
| ChIP-seq / CUT&RUN / CUT&Tag | Context-dependent: often remove or cap duplicates before peak calling, but inspect library complexity and enrichment. | Duplicates can represent PCR artefacts, but strong enrichment can also create real biological pileups at binding sites. |
| ATAC-seq | Usually remove or mark for peak calling: remove duplicates for many peak-calling workflows, while also evaluating library complexity and Tn5-specific QC. | Duplicate reads can inflate accessibility peaks, but high-quality libraries and deep sequencing need careful interpretation. |
| WGBS / bisulfite-seq | Usually mark/remove after alignment: use bisulfite-aware duplicate handling. UMIs are preferred for very low-input or high-sensitivity assays. | PCR duplicates can bias methylation estimates if the same converted molecule is counted repeatedly. |
| Amplicon sequencing | Do not coordinate-deduplicate blindly: use protocol-specific logic, UMIs or consensus methods if molecule counting is needed. | All true molecules may share similar coordinates because they come from the same amplified target region. |
| Metagenomics | Evaluate carefully: deduplication depends on library type, input amount, complexity, host depletion and analysis goal. | Duplicate-like reads can reflect PCR artefacts, real abundant organisms or low-complexity regions. |
In many production workflows, duplicates are first marked in the BAM, CRAM or SAM file. Marked reads remain in the file, but they carry a duplicate flag. Downstream tools can then decide whether to ignore or include them.
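In the SAM format, the duplicate flag is bit 0x400 (decimal 1024) of the FLAG field. The sketch below shows how marking differs from deleting: the read stays in the data and a downstream step decides whether to use it. The helper names are illustrative, not part of any library.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: read is a PCR or optical duplicate


def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def set_duplicate(flag: int) -> int:
    """Return the FLAG value with the duplicate bit switched on."""
    return flag | DUPLICATE_FLAG


def usable_flags(flags, ignore_duplicates=True):
    """Mimic a downstream tool choosing whether to skip marked reads."""
    if not ignore_duplicates:
        return list(flags)
    return [f for f in flags if not is_duplicate(f)]


# 99 is a typical FLAG for a properly paired first read on the forward strand.
marked = set_duplicate(99)                  # 99 + 1024 = 1123
kept = usable_flags([99, marked, 147])      # the marked read is skipped
everything = usable_flags([99, marked, 147], ignore_duplicates=False)
```

Because the flagged read is still present, the same file supports both a duplicate-excluding analysis and a later reanalysis that includes every read.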
Unique Molecular Identifiers are short random or semi-random sequences added to molecules before amplification. After sequencing, reads sharing the same UMI and compatible mapping information can be interpreted as originating from the same starting molecule.
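A minimal sketch of the difference between coordinate-only grouping and UMI-aware grouping follows. It assumes exact UMI matching and a hypothetical read representation of `(chrom, pos, strand, umi)` tuples; real UMI-aware tools typically also tolerate sequencing errors within the UMI, which this example deliberately ignores.

```python
def count_coordinate_groups(reads):
    """Number of distinct (chrom, pos, strand) groups, ignoring UMIs."""
    return len({(chrom, pos, strand) for chrom, pos, strand, _umi in reads})


def count_umi_molecules(reads):
    """Number of distinct (chrom, pos, strand, UMI) groups.

    Reads sharing position AND UMI are treated as one starting molecule.
    Assumption: UMIs are compared exactly, with no error correction.
    """
    return len(set(reads))


reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # PCR copy of the read above
    ("chr1", 100, "+", "TTAG"),  # same position, different molecule
    ("chr1", 200, "-", "ACGT"),  # same UMI, different position
]
coordinate_groups = count_coordinate_groups(reads)
umi_molecules = count_umi_molecules(reads)
```

In this toy input, coordinate-only grouping sees two fragments, while UMI-aware grouping recovers three starting molecules, because the `TTAG` read at position 100 is a distinct molecule rather than a PCR copy.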
UMI-aware processing is especially important for low-input libraries, single-cell sequencing, cell-free DNA, targeted panels, error correction, rare-variant detection and molecular counting.
Duplicate reads can make a variant appear better supported than it truly is. For most short-read WGS, WES and targeted DNA-seq workflows without UMIs, duplicate marking is a standard step before variant calling. The final BAM or CRAM is often retained with duplicate flags instead of physically deleting reads.
Standard bulk RNA-seq measures transcript abundance. If a gene is highly expressed, many independent RNA molecules may produce reads at the same coordinates. Coordinate-based duplicate removal can therefore undercount highly expressed transcripts and distort differential expression. In UMI-based RNA-seq, molecule-aware deduplication is appropriate.
Enrichment assays produce pileups at biologically meaningful sites. Duplicate reads may reflect PCR artefacts, but they may also reflect genuine enrichment or limited local fragmentation possibilities. Duplicate removal is common before peak calling, but the decision should be paired with library-complexity metrics, read depth, control samples and protocol expectations.
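One simple library-complexity indicator is the fraction of mapped primary reads that carry the duplicate flag. The sketch below computes it from SAM FLAG values; it is a simplified definition chosen for illustration, and the metrics reported by duplicate-marking tools may be defined differently (for example by counting pairs rather than individual reads).

```python
UNMAPPED = 0x4         # SAM FLAG bit: read is unmapped
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
DUPLICATE = 0x400      # SAM FLAG bit: PCR or optical duplicate
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment


def duplication_rate(flags):
    """Fraction of mapped primary alignments flagged as duplicates."""
    skip = UNMAPPED | SECONDARY | SUPPLEMENTARY
    primary = [f for f in flags if not f & skip]
    if not primary:
        return 0.0  # avoid division by zero on empty input
    duplicates = sum(1 for f in primary if f & DUPLICATE)
    return duplicates / len(primary)


# Two unmarked reads (99, 147), two marked reads (1123, 1171),
# plus one unmapped (4), one secondary (355) and one supplementary (2147)
# record that are excluded from the calculation.
rate = duplication_rate([99, 147, 1123, 1171, 4, 355, 2147])
```

A high value on its own does not show that duplicates should be removed; as the paragraph above notes, it needs to be read alongside read depth, controls and the expected enrichment of the assay.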
In amplicon sequencing, many real reads start and end at the same coordinates because the assay deliberately amplifies defined regions. Coordinate-only deduplication can be misleading. UMIs or consensus approaches are usually needed if original-molecule counting or error suppression is required.
For NGS projects, SciBerg can provide transparent duplicate-handling documentation as part of the final report.
No. PCR duplicates should not be removed automatically in every NGS project. The correct action depends on assay type, library complexity, read depth, whether UMIs were used, and the downstream analysis. In many DNA-seq workflows duplicates are marked and excluded from variant calling, whereas in non-UMI bulk RNA-seq workflows coordinate-based duplicate removal is usually avoided.
For most projects, marking duplicates is safer than deleting them. Marking keeps the evidence available for QC, troubleshooting and reanalysis, while downstream tools can ignore duplicate-flagged reads where appropriate.
Usually not for standard bulk RNA-seq without UMIs. Highly expressed transcripts can naturally produce reads with the same mapping coordinates, and coordinate-based deduplication can distort expression estimates. UMI-based RNA-seq is different: reads are collapsed by UMI and molecule identity.
Often duplicates are removed or limited before peak calling, but the decision depends on library complexity, enrichment, read depth and assay design. Strong biological enrichment can create real pileups, so duplicate metrics should be interpreted together with other QC measures.
Unique Molecular Identifiers are short molecular barcodes added before amplification. They allow reads derived from the same original molecule to be grouped and collapsed more accurately than coordinate-only deduplication.
Usually no. Standard PCR duplicate detection requires alignment coordinates or UMI-aware molecule grouping. Removing reads only because they have identical sequences in FASTQ can remove real biological signal, especially in RNA-seq, amplicon sequencing or low-complexity libraries.
The following resources are useful for understanding duplicate marking, optical duplicates and UMI-aware deduplication.