DNA-seq data analysis transforms sequencing reads into interpretable genomic information. Depending on the assay, this may include identifying single-nucleotide variants, insertions and deletions, copy-number changes, structural variants, microsatellite instability, mutational signatures, tumour purity, contamination or coverage gaps.
Preprocessing: Validate metadata, run FASTQ QC, trim if needed and align reads to a reference.
Variant discovery: Call germline, somatic, copy-number or structural variants using assay-appropriate tools.
Interpretation: Filter, annotate, prioritize, review coverage and report findings with limitations.
Core principle: a DNA-seq result is only as reliable as the combination of sample metadata, sequencing quality, reference choice, coverage, variant-calling strategy and interpretation framework.
2. DNA-seq assay types
DNA-seq workflows share common steps, but the analysis strategy differs by assay type.
Matched-normal comparison, tumour purity, contamination, artefact filtering and cancer annotation.
Cell-free DNA / liquid biopsy | Low-frequency variant detection. | UMIs, error suppression, depth, fragment size, background noise and highly controlled reporting.
Low-pass WGS | Genome-wide copy-number or ancestry-style analysis at low coverage. | Read-depth normalization, large CNVs, ploidy and broad genomic patterns.
3. Project design before analysis
DNA-seq analysis should be planned before sequencing. The required coverage, read length, reference, variant types and reporting scope depend on the biological question.
Questions to answer early
Is the project germline, somatic, microbial, environmental, animal, plant or custom reference analysis?
Is the assay WGS, WES, panel, amplicon, cfDNA, low-pass WGS or hybrid capture?
Which variant types are in scope: SNVs, indels, CNVs, SVs, repeat expansions, mitochondrial variants or fusions?
Which reference genome and annotation versions will be used?
What minimum coverage or variant allele fraction is required?
Are UMIs present, and should UMI consensus reads be generated?
What databases and interpretation rules are acceptable for reporting?
Is the workflow for research, industrial screening or clinical-style interpretation?
Clinical or diagnostic reporting requires validated workflows, appropriate consent, data protection, expert review and regulatory compliance. A research workflow should not be presented as diagnostic unless it is validated for that purpose.
4. Input data and reference files
A reproducible DNA-seq project starts with well-documented input files.
Input | Typical format | Use
Raw reads | FASTQ or FASTQ.GZ | Primary sequencing reads before alignment.
Reference genome | FASTA | Coordinate system for alignment and variant calling.
Reference indexes | Aligner-specific files, FASTA index, sequence dictionary. | Required by aligners and variant callers.
Target regions | BED or interval list | Coverage and variant filtering for exomes or panels.
Known sites | VCF/VCF.GZ | Used by some workflows for calibration, annotation or filtering.
Sample metadata | TSV/CSV/YAML/JSON | Links file names to samples, groups, batches and analysis parameters.
5. Sample metadata for DNA-seq
Metadata errors can be more damaging than software errors. Confirm sample identity, tumour-normal pairing and assay labels before running the workflow.
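The checks described above can be partly automated before any alignment starts. The following is a minimal sketch, assuming a tab-separated sample sheet with the hypothetical columns `sample_id`, `patient_id`, `sample_type` and `assay`; the column names and sample values are illustrative and not taken from any specific pipeline.

```python
import csv
import io
from collections import Counter

# Hypothetical sample sheet; column names and values are assumptions for this sketch.
SAMPLE_SHEET = (
    "sample_id\tpatient_id\tsample_type\tassay\n"
    "P01_T\tP01\ttumour\tWES\n"
    "P01_N\tP01\tnormal\tWES\n"
    "P02_T\tP02\ttumour\tWES\n"
)


def check_sample_sheet(text):
    """Return a list of human-readable problems found in a TSV sample sheet."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    problems = []

    # Sample identifiers must be unique.
    counts = Counter(row["sample_id"] for row in rows)
    for sample_id, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate sample_id: {sample_id}")

    # Collect sample types and assay labels per patient.
    types_by_patient = {}
    assays_by_patient = {}
    for row in rows:
        types_by_patient.setdefault(row["patient_id"], set()).add(row["sample_type"])
        assays_by_patient.setdefault(row["patient_id"], set()).add(row["assay"])

    for patient in sorted(types_by_patient):
        # Flag tumours that have no normal from the same patient.
        if "tumour" in types_by_patient[patient] and "normal" not in types_by_patient[patient]:
            problems.append(f"{patient}: tumour without matched normal")
        # Flag patients whose samples carry different assay labels.
        if len(assays_by_patient[patient]) > 1:
            problems.append(f"{patient}: inconsistent assay labels")

    return problems


print(check_sample_sheet(SAMPLE_SHEET))
```

A tumour without a matched normal is not necessarily an error, since tumour-only designs exist, so a report like this is best treated as a prompt for manual review rather than a hard failure.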
DNA-seq reads are commonly aligned to a reference genome using a genomic aligner such as BWA-MEM, BWA-MEM2, Bowtie2 or minimap2. The result is usually a sorted and indexed BAM or CRAM file.
Read groups are important for many variant-calling workflows because they preserve sample, library, run and platform information inside the alignment file.
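As an illustration of what a read group carries, the sketch below builds an `@RG` header line from the standard SAM tags `ID`, `SM`, `LB`, `PL` and `PU` and passes it to `bwa mem` through its `-R` option. The helper name, the sample, library and flowcell values, and the file names are all hypothetical; the way `ID` and `PU` are composed is one possible convention, not a requirement.

```python
def read_group_line(sample, library, flowcell, lane, platform="ILLUMINA"):
    """Build an @RG header line using the standard SAM tags ID, SM, LB, PL and PU."""
    fields = [
        "@RG",
        f"ID:{sample}.{flowcell}.{lane}",  # unique per sample and lane (assumed convention)
        f"SM:{sample}",                    # sample name used by variant callers
        f"LB:{library}",                   # library, relevant for duplicate marking
        f"PL:{platform}",                  # sequencing platform
        f"PU:{flowcell}.{lane}",           # platform unit (assumed convention)
    ]
    return "\t".join(fields)


# All identifiers and file names below are placeholders.
rg = read_group_line("P01_T", "P01_T_lib1", "FLOWCELL1", 1)

# bwa mem accepts the separators written as a literal backslash-t in the -R string
# and converts them to tabs in the output header.
bwa_cmd = [
    "bwa", "mem",
    "-R", rg.replace("\t", "\\t"),
    "reference.fa",
    "P01_T_R1.fastq.gz",
    "P01_T_R2.fastq.gz",
]

print(rg)
print(bwa_cmd)
```

The command is only constructed here, not executed; in a real run the SAM output of `bwa mem` would typically be piped into sorting.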
9. BAM/CRAM processing
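After alignment, BAM or CRAM files are typically sorted, indexed and checked. CRAM can be useful for storage-efficient archiving, but reference management becomes important because decoding a CRAM file normally requires the same reference sequence that was used to compress it.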
Step | Command example | Purpose
Sort BAM | samtools sort | Order alignments by genomic coordinate.
Index BAM | samtools index | Enable regional access and visualization.
Inspect header | samtools view -H | Check reference contigs, read groups and software records.
Convert to CRAM | samtools view -C -T reference.fa | Compress alignments using reference-based compression.
Merge lanes | samtools merge | Combine lane-level BAMs for the same sample when appropriate.
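The steps in the table can be chained in a small driver script. This is a minimal sketch that only builds the command lines; the function name, file-naming scheme and thread count are assumptions, and nothing is executed unless the commands are passed to something like `subprocess.run(cmd, check=True)` on a machine with samtools installed.

```python
import shlex


def bam_processing_commands(raw_bam, sample, reference="reference.fa", threads=4):
    """Return samtools command lines for sorting, indexing, header inspection and CRAM conversion."""
    sorted_bam = f"{sample}.sorted.bam"  # hypothetical naming scheme
    cram = f"{sample}.cram"
    return [
        # Order alignments by genomic coordinate.
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, raw_bam],
        # Index the sorted BAM for regional access.
        ["samtools", "index", sorted_bam],
        # Print the header to check contigs, read groups and program records.
        ["samtools", "view", "-H", sorted_bam],
        # Convert to CRAM using reference-based compression.
        ["samtools", "view", "-C", "-T", reference, "-o", cram, sorted_bam],
        # Index the CRAM as well.
        ["samtools", "index", cram],
    ]


for cmd in bam_processing_commands("P01_T.bam", "P01_T"):
    print(shlex.join(cmd))
```

Keeping commands as argument lists rather than shell strings avoids quoting problems with unusual file names and makes the exact invocation easy to log.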
10. Duplicate marking and UMI-aware processing
Duplicate reads can arise from PCR amplification, optical or clustering artefacts on the flow cell, or repeated sequencing of the same original molecules. In many DNA-seq workflows, duplicates are marked rather than physically removed.
Context | Recommended thinking | Notes
WGS | Mark duplicates and evaluate duplicate rate. | High duplication reduces effective coverage.
WES | Mark duplicates and review capture complexity. | Duplicate rates are often higher than WGS.
Targeted panels | Use assay-specific duplicate strategy. | Very high depth can make simple duplicate interpretation misleading.
UMI panels | Use UMI-aware consensus or deduplication. | Coordinate-only duplicate marking may be inappropriate.
cfDNA | Use UMI and error-suppression workflows when available. |
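To see why coordinate-only duplicate marking can be inappropriate for UMI assays, the toy sketch below counts distinct molecules in two ways: by alignment coordinates alone, and by coordinates plus UMI. The read tuples are invented for illustration, and the grouping is deliberately simplified; real UMI tools also consider mate positions and tolerate sequencing errors within the UMI.

```python
from collections import defaultdict

# Toy reads: (chromosome, start, strand, UMI). All values are illustrative.
READS = [
    ("chr1", 1000, "+", "ACGT"),
    ("chr1", 1000, "+", "ACGT"),  # same coordinates and UMI: likely a PCR copy
    ("chr1", 1000, "+", "TTAG"),  # same coordinates, different UMI: different molecule
    ("chr1", 1000, "+", "GGCA"),  # same coordinates, different UMI: different molecule
    ("chr1", 2500, "-", "ACGT"),
]


def count_unique(reads, use_umi):
    """Count groups of reads that would be treated as one original molecule."""
    groups = defaultdict(int)
    for chrom, start, strand, umi in reads:
        key = (chrom, start, strand, umi) if use_umi else (chrom, start, strand)
        groups[key] += 1
    return len(groups)


coordinate_only = count_unique(READS, use_umi=False)
umi_aware = count_unique(READS, use_umi=True)

print(coordinate_only)  # 2
print(umi_aware)        # 4
```

In this toy case, coordinate-only grouping would collapse four distinct molecules into two, which at panel or cfDNA depth would discard real evidence for low-frequency variants.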
Coverage determines whether variants can be detected with sufficient confidence. For exomes and panels, target coverage and uniformity are central deliverables.
Mean depth: Average sequencing depth across genome or targets.
Breadth of coverage: Fraction of bases covered at thresholds such as ≥10×, ≥20×, ≥30×, ≥100× or assay-specific thresholds.
Uniformity: How evenly reads cover targets. Poor uniformity creates weak regions even when mean depth is high.
On-target rate: Fraction of reads overlapping intended target regions in WES or panel sequencing.
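The four metrics above can be computed from per-base depths and read counts. The sketch below is a simplified illustration with invented depth values; defining uniformity as the fraction of bases at or above 0.2× the mean depth is one convention among several and is an assumption here, as are the function names.

```python
def coverage_summary(depths, thresholds=(10, 20, 30, 100), uniformity_fraction=0.2):
    """Summarise per-base depths over the target (or genome) positions."""
    if not depths:
        raise ValueError("no positions supplied")
    n = len(depths)
    mean_depth = sum(depths) / n
    # Breadth: fraction of positions at or above each depth threshold.
    breadth = {t: sum(d >= t for d in depths) / n for t in thresholds}
    # Uniformity (assumed definition): fraction of positions >= 0.2 x mean depth.
    uniformity = sum(d >= uniformity_fraction * mean_depth for d in depths) / n
    return {"mean_depth": mean_depth, "breadth": breadth, "uniformity": uniformity}


def on_target_rate(reads_on_target, total_reads):
    """Fraction of reads overlapping the intended target regions."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads_on_target / total_reads


# Invented per-base depths for ten target positions.
depths = [0, 5, 12, 25, 40, 40, 60, 120, 150, 48]
summary = coverage_summary(depths)

print(summary["mean_depth"])   # 50.0
print(summary["breadth"][20])  # 0.7
```

In this toy example the mean depth is 50×, yet 30% of positions fall below 20×, which is the kind of weak region that a mean-depth figure alone would hide.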
12. Contamination, relatedness and sample identity
DNA-seq analysis should confirm that samples are what they are expected to be. This is especially important for tumour-normal pairs, family studies and large cohorts.
Check | Why it matters | Possible method
Sex check | Detects metadata inconsistencies or sample swaps. | Chromosome X/Y coverage, heterozygosity or genotype-based checks.
Contamination estimate | Mixed DNA can create false variants and distort allele fractions. | Genotype-aware contamination tools or allele-balance checks.
Tumour-normal concordance | Confirms matched samples belong to the same individual. | SNP concordance and relatedness checks.
Cross-sample swaps | Sample mix-ups can invalidate interpretation. | Genotype fingerprints or known SNP panels.
Batch effects | Can affect coverage, CNV and variant calling. | QC clustering by run, lane, capture batch or library-prep batch.
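A SNP concordance check of the kind listed in the table can be sketched as follows. Genotypes are encoded as alternate-allele counts (0, 1 or 2) at a small set of fingerprint sites; the site names, genotype values and function name are invented for illustration, and no particular pass/fail threshold is implied.

```python
def genotype_concordance(genotypes_a, genotypes_b):
    """Fraction of shared, non-missing sites where two samples have the same genotype."""
    shared = [
        site
        for site in genotypes_a
        if genotypes_a[site] is not None and genotypes_b.get(site) is not None
    ]
    if not shared:
        raise ValueError("no shared genotyped sites")
    matches = sum(genotypes_a[site] == genotypes_b[site] for site in shared)
    return matches / len(shared)


# Invented fingerprints: alternate-allele counts at hypothetical SNP sites.
normal = {"snp1": 0, "snp2": 1, "snp3": 2, "snp4": 1, "snp5": 0}
tumour = {"snp1": 0, "snp2": 1, "snp3": 2, "snp4": 2, "snp5": 0}     # one site differs
unrelated = {"snp1": 2, "snp2": 0, "snp3": 0, "snp4": 1, "snp5": 1}

print(genotype_concordance(normal, tumour))     # 0.8
print(genotype_concordance(normal, unrelated))  # 0.2
```

A matched tumour can legitimately differ from its normal at some heterozygous sites, for example through loss of heterozygosity, so concordance is usually judged over many sites and against project-specific expectations rather than requiring a perfect match.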
13. Germline variant calling
Germline variant calling identifies inherited variants in an individual or cohort. Common outputs include VCF or gVCF files containing SNVs and indels.
Prepare BAM: Align, sort, index, mark duplicates and collect QC.
Call variants: Use a germline caller on single samples, families or cohorts.
Filter and annotate: Apply caller filters, frequency data, consequence annotation and phenotype context.
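To give a feel for what the calling step does at a single site, here is a toy diploid genotyper. It is a sketch under strong assumptions (independent reads, a single fixed error rate, no base or mapping qualities, no priors) and is not how any particular production caller works; the function names and the default error rate are illustrative.

```python
import math


def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log-likelihood of the observed allele counts under each diploid genotype."""
    # Probability that a single read shows the alternate allele under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    # The binomial coefficient is identical for all genotypes, so it is omitted.
    return {
        gt: ref_count * math.log(1.0 - p) + alt_count * math.log(p)
        for gt, p in p_alt.items()
    }


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(30, 0))   # 0/0
print(call_genotype(14, 16))  # 0/1
print(call_genotype(0, 30))   # 1/1
```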
The simple example above is for tutorial purposes. Production germline workflows usually require carefully validated parameters, appropriate callers, quality calibration or filtering, reference resources and project-specific QC.
14. Somatic variant calling
Somatic variant calling identifies acquired variants, often in tumour DNA. A matched normal sample is preferred because it helps distinguish somatic variants from inherited variants and technical artefacts.
Design | Advantages | Challenges
Tumour-normal | Best for distinguishing somatic and germline variants. | Requires matched normal DNA and careful sample concordance checks.
Tumour-only | Can be used when normal DNA is unavailable. | More difficult germline filtering and higher interpretation uncertainty.
Panel-based somatic | High depth and focused clinical/research targets. | Requires assay-specific artefact handling and coverage review.
cfDNA / liquid biopsy | Non-invasive and useful for low-frequency variants. | Requires high depth, UMI/error suppression and strict noise modelling.
Somatic QC considerations
Tumour purity and ploidy can strongly affect variant allele fractions.
Matched-normal contamination or tumour-in-normal can reduce sensitivity.
Oxidative artefacts, FFPE damage and sequencing context biases require filtering.
A panel of normals can help remove recurrent technical artefacts, and germline population databases can help filter common inherited variants.
Clinical-style interpretation should include evidence levels and review by qualified experts.
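The effect of purity and copy number on variant allele fraction can be made concrete with a standard back-of-the-envelope formula: expected VAF = (purity × mutated copies) / (purity × tumour copy number + (1 − purity) × normal copy number). The sketch below assumes a clonal mutation and ignores sampling noise and subclonality; the function and parameter names are illustrative.

```python
def expected_vaf(purity, tumour_copy_number=2, mutated_copies=1, normal_copy_number=2):
    """Expected variant allele fraction of a clonal somatic mutation."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if mutated_copies > tumour_copy_number:
        raise ValueError("mutated_copies cannot exceed tumour_copy_number")
    # Mutant alleles contributed by tumour cells.
    mutant = purity * mutated_copies
    # All alleles at the locus, from tumour and admixed normal cells.
    total = purity * tumour_copy_number + (1.0 - purity) * normal_copy_number
    if total == 0:
        raise ValueError("no alleles present at this locus")
    return mutant / total


# Heterozygous mutation in a diploid region at 40% purity.
print(expected_vaf(0.4))
# Same mutation after loss of the wild-type copy in the tumour.
print(expected_vaf(0.4, tumour_copy_number=1, mutated_copies=1))
```

Under these assumptions, a heterozygous clonal mutation in a 40% pure tumour is expected at a VAF of about 0.2 rather than 0.5, which is why fixed VAF cut-offs need to be interpreted alongside purity estimates.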
15. Copy-number and structural-variant analysis
DNA-seq can be used to detect copy-number variants and structural variants, but performance depends strongly on assay type, coverage, insert size and sample quality.
Variant type | Common evidence | Assay notes
Copy-number variants | Read depth, allele balance and segmentation. | Works well in WGS; exome/panel CNV needs normalization and careful validation.
Deletions | Read depth loss, split reads and discordant pairs. | Resolution depends on read length, coverage and repetitive sequence context.
Duplications | Read depth gain and discordant pairs. | Tandem duplications can be difficult in repetitive regions.
Inversions | Discordant read orientation and split reads. | Short reads can struggle near repeats or segmental duplications.
Translocations | Discordant pairs, split reads and breakpoint evidence. | Requires careful artefact filtering and visual review.
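The read-depth evidence mentioned in the table is often expressed as a log2 ratio between a test sample and a reference sample per genomic bin. The sketch below shows only that core calculation on invented bin counts, with a simple library-size normalization and a pseudocount; real CNV tools additionally correct for GC content and mappability and segment the signal. The function name and pseudocount value are assumptions.

```python
import math


def log2_ratios(test_counts, reference_counts, pseudocount=0.5):
    """Per-bin log2 ratio of library-size-normalized read counts."""
    if len(test_counts) != len(reference_counts):
        raise ValueError("bin counts must have the same length")
    test_total = sum(test_counts)
    reference_total = sum(reference_counts)
    ratios = []
    for t, r in zip(test_counts, reference_counts):
        # Normalize each bin by library size; the pseudocount avoids log(0).
        t_frac = (t + pseudocount) / test_total
        r_frac = (r + pseudocount) / reference_total
        ratios.append(math.log2(t_frac / r_frac))
    return ratios


# Invented counts for five bins: bin 3 looks like a loss, bin 5 like a gain.
reference = [1000, 1000, 1000, 1000, 1000]
test = [1000, 1000, 500, 1000, 1500]

for value in log2_ratios(test, reference):
    print(round(value, 2))
```

In a pure diploid sample, a single-copy loss corresponds to a log2 ratio near -1 and a single-copy gain to roughly 0.58; lower tumour purity pulls both values towards zero.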
Long-read sequencing can be more informative for complex structural variants, phasing and repetitive regions, but short-read DNA-seq remains useful for many CNV and SV applications.
16. Variant annotation
Annotation adds context to raw variant calls. It helps identify affected genes, predicted consequences, transcript effects, population frequency, known clinical significance and cancer relevance.
Gene consequence: Missense, nonsense, frameshift, splice, synonymous, intronic or regulatory consequences.
Population frequency: gnomAD or other population resources help distinguish rare variants from common polymorphisms.
Clinical databases: ClinVar and similar resources provide submitted interpretations and evidence context.
Cancer resources: COSMIC, CIViC, OncoKB-style resources and cBioPortal-style cohort data can support cancer interpretation.
Hard thresholds can remove real variants in difficult regions.
Frequency filters | Population allele frequency from gnomAD or cohort controls. | Frequency interpretation depends on ancestry, disease model and dataset coverage.
Consequence filters | Protein-altering, splice, loss-of-function or regulatory categories. | Noncoding and synonymous variants can still matter in some contexts.
Inheritance filters | Dominant, recessive, de novo, compound heterozygous or family-based models. | Requires accurate pedigree and sample identity.
Cancer filters | Variant allele fraction, tumour-normal status, panel of normals, hotspot lists. | Tumour purity and clonality must be considered.
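Frequency and consequence filters of this kind can be combined in a few lines once variants are annotated. The sketch below works on invented records with hypothetical field names; the allele-frequency cut-off and the set of retained consequence terms are illustrative assumptions that would need to be set per project.

```python
# Invented annotated variants; field names and values are assumptions for this sketch.
VARIANTS = [
    {"id": "var1", "consequence": "missense_variant", "population_af": 0.00002, "filter": "PASS"},
    {"id": "var2", "consequence": "synonymous_variant", "population_af": 0.15, "filter": "PASS"},
    {"id": "var3", "consequence": "frameshift_variant", "population_af": None, "filter": "PASS"},
    {"id": "var4", "consequence": "stop_gained", "population_af": 0.00001, "filter": "LowQual"},
]

# Consequence terms retained in this example (Sequence Ontology style names).
KEPT_CONSEQUENCES = {
    "missense_variant",
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def prioritize(variants, max_af=0.001, consequences=KEPT_CONSEQUENCES):
    """Keep PASS variants that are rare (or absent) in the population and protein-altering."""
    kept = []
    for variant in variants:
        if variant["filter"] != "PASS":
            continue
        af = variant["population_af"]
        # A missing frequency means the variant was not observed in the resource.
        if af is not None and af > max_af:
            continue
        if variant["consequence"] not in consequences:
            continue
        kept.append(variant)
    return kept


print([v["id"] for v in prioritize(VARIANTS)])  # ['var1', 'var3']
```

Because synonymous and noncoding variants can still matter in some contexts, it is usually safer to record why each variant was excluded than to discard it silently.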
18. Biological and clinical interpretation
Interpretation connects variant data to the scientific or clinical question. It should include review of evidence, assay limitations, coverage gaps and potential artefacts.
Research interpretation: Focuses on mechanisms, candidate genes, pathways, cohort patterns and testable hypotheses.
Industrial interpretation: Focuses on reproducible deliverables, scalable pipelines, QC dashboards and decision-support outputs.
DNA-seq deliverables should include both data files and interpretive summaries. The exact deliverable set depends on project type.
Recommended deliverables
FASTQ QC and MultiQC reports.
Sorted and indexed BAM or CRAM files.
Alignment, duplication, coverage and target-performance metrics.
VCF/BCF files for raw and filtered variants.
Annotated variant tables with database versions.
Coverage-gap reports for target regions, if applicable.
CNV/SV outputs when in scope.
Somatic tumour-normal summary, if applicable.
Final report with methods, QC interpretation, limitations and prioritized findings.
Workflow logs, commands, environment files and software versions.
21. Reproducibility and workflow automation
Reproducible DNA-seq analysis requires more than a list of tools. It requires consistent references, fixed parameters, versioned software and documented decisions.
Version references: Record genome assembly, FASTA checksum, annotation version and target BED version.
Use environments: Use mamba/conda, containers or workflow-managed software environments.
Automate workflows: Use Nextflow or Snakemake when projects include many samples or repeated analyses.
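Recording reference versions can be as simple as writing a small manifest next to the results. The sketch below computes SHA-256 checksums for a reference FASTA and a target BED and stores them with the assembly name and tool versions; the manifest layout, function names, toy files and placeholder version string are all assumptions made for illustration.

```python
import hashlib
import json
import os
import platform
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file without loading it fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(reference_fasta, target_bed, assembly, tool_versions):
    """Collect reference checksums and software versions in one dictionary."""
    return {
        "assembly": assembly,
        "reference_fasta": {
            "file": os.path.basename(reference_fasta),
            "sha256": file_sha256(reference_fasta),
        },
        "target_bed": {
            "file": os.path.basename(target_bed),
            "sha256": file_sha256(target_bed),
        },
        "tool_versions": dict(tool_versions),
        "python_version": platform.python_version(),
    }


# Demonstration on tiny temporary files; names and contents are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "reference.fa")
    bed = os.path.join(tmp, "targets.bed")
    with open(fasta, "wb") as handle:
        handle.write(b">chr1\nACGTACGT\n")
    with open(bed, "wb") as handle:
        handle.write(b"chr1\t0\t8\n")
    manifest = build_manifest(fasta, bed, "toy-assembly", {"aligner": "version-placeholder"})

print(json.dumps(manifest, indent=2))
```

Workflow managers such as Nextflow or Snakemake can produce similar provenance records themselves, so a manual manifest like this is mainly useful for small projects or as an extra check.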
DNA-seq data analysis is the computational processing of DNA sequencing reads to evaluate sequence variation, coverage, copy number, structural variation, sample identity, contamination, target performance and biologically or clinically relevant genomic findings.
What is the difference between WGS, WES and targeted panel analysis?
Whole-genome sequencing evaluates the entire genome, whole-exome sequencing enriches protein-coding regions, and targeted panels focus on selected genes or regions. The analysis principles overlap, but coverage, sensitivity, variant filtering and reporting differ.
Which files are needed for DNA-seq analysis?
Typical inputs include FASTQ files, a sample sheet, reference genome FASTA, known sites resources when required, target BED files for exomes or panels, and metadata describing sample type, organism, library preparation and sequencing design.
Which aligner is commonly used for short-read DNA-seq?
BWA-MEM, BWA-MEM2, Bowtie2 and minimap2 are common options for short-read DNA alignment. Many germline and somatic workflows use BWA-MEM or BWA-MEM2 with sorted, indexed BAM or CRAM outputs.
Should duplicates be removed in DNA-seq analysis?
Duplicates are commonly marked in WGS, WES and many targeted DNA-seq workflows, but whether they should be removed or ignored depends on the assay. UMI-based assays should use UMI-aware consensus or deduplication instead of simple coordinate-based duplicate removal.
What is the difference between germline and somatic variant calling?
Germline variant calling identifies inherited variants present in the individual, often using one sample or a family/cohort design. Somatic variant calling identifies acquired variants, often by comparing tumour DNA with matched normal DNA or using tumour-only methods with additional filtering.
What is variant annotation?
Variant annotation adds biological and clinical context to variants, such as gene name, transcript consequence, amino acid change, population frequency, known clinical significance, cancer relevance and predicted functional effect.
Can AI help with DNA-seq analysis?
AI can assist with QC summarization, variant prioritization, literature triage, report drafting and interpretation support, but the analysis must remain reproducible, auditable and based on validated tools, explicit thresholds and expert review.