DNA-seq Analysis Tutorial

DNA-seq data analysis: from FASTQ files to annotated genomic variants.

A practical tutorial for whole-genome, whole-exome and targeted DNA sequencing analysis. It covers project design, FASTQ quality control, short-read alignment, BAM/CRAM processing, coverage metrics, germline and somatic variant calling, CNV/SV analysis, annotation, interpretation and reproducible reporting.

1. Overview: what is DNA-seq data analysis?

DNA-seq data analysis transforms sequencing reads into interpretable genomic information. Depending on the assay, this may include identifying single-nucleotide variants, insertions and deletions, copy-number changes, structural variants, microsatellite instability, mutational signatures, tumour purity, contamination or coverage gaps.

  • Preprocessing: validate metadata, run FASTQ QC, trim if needed and align reads to a reference.
  • Variant discovery: call germline, somatic, copy-number or structural variants using assay-appropriate tools.
  • Interpretation: filter, annotate, prioritize, review coverage and report findings with limitations.

Core principle: a DNA-seq result is only as reliable as the combination of sample metadata, sequencing quality, reference choice, coverage, variant-calling strategy and interpretation framework.

2. DNA-seq assay types

DNA-seq workflows share common steps, but the analysis strategy differs by assay type.

Assay | Typical objective | Analysis focus
Whole-genome sequencing | Genome-wide variant discovery. | SNVs, indels, CNVs, SVs, coverage, repeats, noncoding variants and genome-wide QC.
Whole-exome sequencing | Protein-coding variant discovery. | Target coverage, exonic variants, splice-site variants, off-target reads and capture performance.
Targeted panels | High-depth analysis of selected genes or regions. | Coverage thresholds, low-frequency variants, UMI handling, panel-specific reporting and hotspot review.
Tumour-normal DNA-seq | Somatic variant discovery. | Matched-normal comparison, tumour purity, contamination, artefact filtering and cancer annotation.
Cell-free DNA / liquid biopsy | Low-frequency variant detection. | UMIs, error suppression, depth, fragment size, background noise and highly controlled reporting.
Low-pass WGS | Genome-wide copy-number or ancestry-style analysis at low coverage. | Read-depth normalization, large CNVs, ploidy and broad genomic patterns.

3. Project design before analysis

DNA-seq analysis should be planned before sequencing. The required coverage, read length, reference, variant types and reporting scope depend on the biological question.

Questions to answer early

  • Is the project germline, somatic, microbial, environmental, animal, plant or custom reference analysis?
  • Is the assay WGS, WES, panel, amplicon, cfDNA, low-pass WGS or hybrid capture?
  • Which variant types are in scope: SNVs, indels, CNVs, SVs, repeat expansions, mitochondrial variants or fusions?
  • Which reference genome and annotation versions will be used?
  • What minimum coverage or variant allele fraction is required?
  • Are UMIs present, and should UMI consensus reads be generated?
  • What databases and interpretation rules are acceptable for reporting?
  • Is the workflow for research, industrial screening or clinical-style interpretation?

Clinical or diagnostic reporting requires validated workflows, appropriate consent, data protection, expert review and regulatory compliance. A research workflow should not be presented as diagnostic unless it is validated for that purpose.
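
For the minimum-coverage question in the list above, a rough planning figure is the expected mean depth: total sequenced bases divided by the size of the genome or target. The sketch below is only a back-of-envelope estimate. The function name expected_depth is hypothetical, and the calculation ignores duplicates, unmapped reads and off-target reads, so real usable depth will be lower.

```shell
# Hypothetical helper: expected mean depth for paired-end sequencing.
# Arguments: number of read pairs, read length (bp), genome or target size (bp).
# Assumes every base maps uniquely and on target, which is optimistic.
expected_depth() {
  awk -v pairs="$1" -v len="$2" -v genome="$3" 'BEGIN {
    printf "%.1f\n", (pairs * 2 * len) / genome
  }'
}

# Example: 400 million pairs of 150 bp reads on a 3.1 Gb genome
# expected_depth 400000000 150 3100000000   # prints 38.7
```

For capture-based assays, multiplying the result by an assumed on-target fraction gives a more realistic estimate.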

4. Input data and reference files

A reproducible DNA-seq project starts with well-documented input files.

Input | Typical format | Use
Raw reads | FASTQ or FASTQ.GZ | Primary sequencing reads before alignment.
Reference genome | FASTA | Coordinate system for alignment and variant calling.
Reference indexes | Aligner-specific files, FASTA index, sequence dictionary. | Required by aligners and variant callers.
Target regions | BED or interval list | Coverage and variant filtering for exomes or panels.
Known sites | VCF/VCF.GZ | Used by some workflows for calibration, annotation or filtering.
Sample metadata | TSV/CSV/YAML/JSON | Links file names to samples, groups, batches and analysis parameters.

5. Sample metadata for DNA-seq

Metadata errors can be more damaging than software errors. Confirm sample identity, tumour-normal pairing and assay labels before running the workflow.

Example DNA-seq sample sheet
sample_id	patient_id	sample_type	assay	fastq_1	fastq_2	target_bed
N1	P001	normal	WES	N1_R1.fastq.gz	N1_R2.fastq.gz	exome_targets.bed
T1	P001	tumour	WES	T1_R1.fastq.gz	T1_R2.fastq.gz	exome_targets.bed
N2	P002	normal	panel	N2_R1.fastq.gz	N2_R2.fastq.gz	panel_targets.bed
T2	P002	tumour	panel	T2_R1.fastq.gz	T2_R2.fastq.gz	panel_targets.bed

Recommended metadata fields

  • sample_id, patient_id and sample_type.
  • FASTQ paths and read layout.
  • Assay type, library kit, UMI status and target region file.
  • Sequencing run, lane, batch and index/barcode information.
  • Reference genome, target BED and annotation versions.
  • For tumour samples: tumour purity estimate, matched normal ID and tissue/source notes if available.
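
Some of these checks can be automated before any alignment starts. The sketch below assumes a tab-separated sheet with exactly the seven columns shown in the example sample sheet; the function name check_sample_sheet is hypothetical. It reports rows with missing columns, duplicate sample IDs and tumour samples whose patient has no normal sample. A tumour-only design would need the last check relaxed.

```shell
# Hypothetical helper: basic consistency checks on a tab-separated sample sheet
# with columns sample_id, patient_id, sample_type, assay, fastq_1, fastq_2, target_bed.
# Prints one message per problem and returns non-zero if any problem was found.
check_sample_sheet() {
  awk -F'\t' '
    NR == 1 { next }                               # skip the header row
    NF < 7  { printf "line %d: expected 7 columns, found %d\n", NR, NF; bad = 1; next }
    seen[$1]++ { printf "line %d: duplicate sample_id %s\n", NR, $1; bad = 1 }
    $3 == "normal" { normal[$2] = 1 }              # patients with a normal sample
    $3 == "tumour" { tumour[$2] = 1 }              # patients with a tumour sample
    END {
      for (p in tumour)
        if (!(p in normal)) {
          printf "patient %s: tumour without matched normal\n", p
          bad = 1
        }
      exit bad + 0
    }
  ' "$1"
}

# Example call (the path is an assumption):
# check_sample_sheet metadata/samples.tsv || echo "fix the sample sheet first"
```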

6. FASTQ quality control

Raw read QC identifies read-quality issues, adapters, GC bias, overrepresented sequences, duplication and sample outliers before alignment.

FASTQ QC with FastQC and MultiQC
mkdir -p results/qc/fastqc results/qc/multiqc

fastqc data/fastq/*.fastq.gz \
  --outdir results/qc/fastqc \
  --threads 8

multiqc results/qc \
  --outdir results/qc/multiqc

Key raw-read checks

  • Per-base quality and read-length distribution.
  • Adapter or primer contamination.
  • Unexpected GC distribution.
  • Read counts per sample and R1/R2 consistency.
  • Overrepresented sequences.
  • Sequence duplication patterns in the context of assay type.
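
The R1/R2 consistency check can be done with standard tools alone. The sketch below assumes gzip-compressed FASTQ files with four lines per record; the function names count_reads and check_pair are hypothetical. A mismatch usually points to a truncated transfer or a mispaired file.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ (four lines per record).
count_reads() {
  local lines
  lines=$(gzip -dc "$1" | wc -l)
  if [ $(( lines % 4 )) -ne 0 ]; then
    echo "line count of $1 is not a multiple of 4" >&2
    return 1
  fi
  echo $(( lines / 4 ))
}

# Hypothetical helper: compare read counts of the two mates of one sample.
check_pair() {
  local r1="$1" r2="$2"
  local n1 n2
  n1=$(count_reads "$r1") || return 1
  n2=$(count_reads "$r2") || return 1
  if [ "$n1" = "$n2" ]; then
    echo "OK: $n1 read pairs"
  else
    echo "MISMATCH: R1=$n1 R2=$n2"
    return 1
  fi
}

# Example call:
# check_pair data/fastq/S1_R1.fastq.gz data/fastq/S1_R2.fastq.gz
```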

7. Adapter trimming and read filtering

Trimming is not automatically required for every DNA-seq project. It should be based on QC evidence and library design.

  • Trim when needed: adapter sequences, primers or low-quality ends can reduce mapping quality and variant sensitivity.
  • Do not over-trim: aggressive trimming can shorten reads and reduce mappability, especially in repetitive regions.

Example paired-end trimming with fastp
mkdir -p results/trimmed results/qc/fastp

fastp \
  --in1 data/fastq/S1_R1.fastq.gz \
  --in2 data/fastq/S1_R2.fastq.gz \
  --out1 results/trimmed/S1_R1.trimmed.fastq.gz \
  --out2 results/trimmed/S1_R2.trimmed.fastq.gz \
  --html results/qc/fastp/S1.fastp.html \
  --json results/qc/fastp/S1.fastp.json \
  --thread 8

8. Short-read alignment

DNA-seq reads are commonly aligned to a reference genome using a genomic aligner such as BWA-MEM, BWA-MEM2, Bowtie2 or minimap2. The result is usually a sorted and indexed BAM or CRAM file.

BWA-MEM alignment with read group
mkdir -p results/alignments results/logs

sample="S1"

bwa mem -t 16 \
  -R "@RG\tID:${sample}\tSM:${sample}\tLB:${sample}_lib1\tPL:ILLUMINA" \
  reference/genome.fa \
  "data/fastq/${sample}_R1.fastq.gz" \
  "data/fastq/${sample}_R2.fastq.gz" \
  2> "results/logs/${sample}.bwa.log" | \
  samtools sort -@ 8 -o "results/alignments/${sample}.sorted.bam"

samtools index "results/alignments/${sample}.sorted.bam"

Read groups are important for many variant-calling workflows because they preserve sample, library, run and platform information inside the alignment file.
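
When a sample is sequenced on several lanes, the read-group ID should differ per lane. The sketch below builds a read-group string from the first FASTQ header, assuming the standard Illumina header layout (instrument:run:flowcell:lane:...). The function name make_read_group and the flowcell.lane naming for ID and PU are one possible convention, not a requirement of any tool.

```shell
# Hypothetical helper: build a bwa-style @RG string from the first FASTQ header.
# Assumes Illumina headers such as "@A00123:45:HABC:2:1101:1000:2000 1:N:0:ACGT",
# where field 3 is the flowcell ID and field 4 is the lane.
make_read_group() {
  local sample="$1" fastq="$2"
  local header flowcell lane
  header=$(gzip -dc "$fastq" | head -n 1)
  flowcell=$(printf '%s\n' "$header" | cut -d: -f3)
  lane=$(printf '%s\n' "$header" | cut -d: -f4)
  # The literal two-character sequence \t is kept; bwa mem -R expands it to tabs.
  printf '@RG\\tID:%s.%s\\tSM:%s\\tLB:%s_lib1\\tPL:ILLUMINA\\tPU:%s.%s\n' \
    "$flowcell" "$lane" "$sample" "$sample" "$flowcell" "$lane"
}

# Example use with the alignment command above:
# rg=$(make_read_group S1 data/fastq/S1_R1.fastq.gz)
# bwa mem -t 16 -R "$rg" reference/genome.fa ...
```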

9. BAM/CRAM processing

After alignment, BAM or CRAM files are typically sorted, indexed and checked. For storage-efficient archiving, CRAM may be useful, but reference management becomes important.

Step | Command example | Purpose
Sort BAM | samtools sort | Order alignments by genomic coordinate.
Index BAM | samtools index | Enable regional access and visualization.
Inspect header | samtools view -H | Check reference contigs, read groups and software records.
Convert to CRAM | samtools view -C -T reference.fa | Compress alignments using reference-based compression.
Merge lanes | samtools merge | Combine lane-level BAMs for the same sample when appropriate.

10. Duplicate marking and UMI-aware processing

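Duplicate reads can arise from PCR amplification during library preparation or from optical and clustering artefacts on the flow cell. At very high depth, independent molecules can also share the same alignment coordinates and look like duplicates. In many DNA-seq workflows, duplicates are marked rather than physically removed.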

Context | Recommended approach | Notes
WGS | Mark duplicates and evaluate duplicate rate. | High duplication reduces effective coverage.
WES | Mark duplicates and review capture complexity. | Duplicate rates are often higher than WGS.
Targeted panels | Use assay-specific duplicate strategy. | Very high depth can make simple duplicate interpretation misleading.
UMI panels | Use UMI-aware consensus or deduplication. | Coordinate-only duplicate marking may be inappropriate.
cfDNA | Use UMI and error-suppression workflows when available. | Low-frequency detection requires careful noise control.

Duplicate marking with samtools markdup
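# Name-sort so that fixmate sees both mates of each pair together
samtools sort -n -o results/alignments/S1.name_sorted.bam results/alignments/S1.sorted.bam
# Add the mate-score tags that markdup uses to choose which read to keep
samtools fixmate -m results/alignments/S1.name_sorted.bam results/alignments/S1.fixmate.bam
# Return to coordinate order, then flag duplicates (reads are marked, not removed)
samtools sort -o results/alignments/S1.position_sorted.bam results/alignments/S1.fixmate.bam
samtools markdup results/alignments/S1.position_sorted.bam results/alignments/S1.markdup.bam
samtools index results/alignments/S1.markdup.bam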
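
To evaluate the duplicate rate after marking, one option is to parse the text output of samtools flagstat run on the duplicate-marked BAM. The sketch below assumes the default flagstat text format and uses only the QC-passed counts; the function name duplicate_rate is hypothetical. The denominator includes secondary and supplementary alignments, so the value is an approximation.

```shell
# Hypothetical helper: duplicate fraction from a saved `samtools flagstat` report.
# Matches the "N + M in total ..." and "N + M duplicates" lines; the separate
# "primary duplicates" line printed by newer samtools versions is not matched.
duplicate_rate() {
  awk '
    $4 == "in" && $5 == "total" { total = $1 }
    $4 == "duplicates"          { dup = $1 }
    END {
      if (total > 0) printf "%.4f\n", dup / total
      else { print "NA"; exit 1 }
    }
  ' "$1"
}

# Example use:
# samtools flagstat results/alignments/S1.markdup.bam > S1.flagstat.txt
# duplicate_rate S1.flagstat.txt
```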

11. Coverage and target-region QC

Coverage determines whether variants can be detected with sufficient confidence. For exomes and panels, target coverage and uniformity are central deliverables.

  • Mean depth: average sequencing depth across genome or targets.
  • Breadth of coverage: fraction of bases covered at thresholds such as ≥10×, ≥20×, ≥30×, ≥100× or assay-specific thresholds.
  • Uniformity: how evenly reads cover targets. Poor uniformity creates weak regions even when mean depth is high.
  • On-target rate: fraction of reads overlapping intended target regions in WES or panel sequencing.

Coverage analysis with mosdepth
mkdir -p results/qc/coverage

# Genome-wide coverage
mosdepth --threads 8 results/qc/coverage/S1 results/alignments/S1.markdup.bam

# Targeted coverage with BED file
mosdepth --threads 8 \
  --by targets/panel_targets.bed \
  results/qc/coverage/S1.panel \
  results/alignments/S1.markdup.bam
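
To make the definitions of mean depth and breadth of coverage concrete, the sketch below computes both from a per-base depth table. It assumes a three-column, tab-separated input (contig, position, depth) that lists every position of interest, including zero-depth positions, such as the output of samtools depth -a. The function name coverage_summary and the output labels are hypothetical.

```shell
# Hypothetical helper: mean depth and breadth of coverage from per-base depths.
# Usage: coverage_summary <depth.tsv> <threshold> [<threshold> ...]
coverage_summary() {
  local depth_file="$1"
  shift
  awk -v thresholds="$*" '
    BEGIN { n = split(thresholds, t, " ") }
    {
      sum += $3                                   # total depth over all positions
      bases++                                     # number of positions seen
      for (i = 1; i <= n; i++) if ($3 >= t[i]) hit[i]++
    }
    END {
      if (bases == 0) { print "no positions"; exit 1 }
      printf "mean_depth\t%.2f\n", sum / bases
      for (i = 1; i <= n; i++)
        printf "pct_ge_%dx\t%.1f\n", t[i], 100 * hit[i] / bases
    }
  ' "$depth_file"
}

# Example use:
# samtools depth -a -b targets/panel_targets.bed results/alignments/S1.markdup.bam > S1.depth.tsv
# coverage_summary S1.depth.tsv 10 20 30 100
```

For whole genomes the per-base table is large, so the summary and threshold outputs of mosdepth are usually the more practical source.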

12. Contamination, relatedness and sample identity

DNA-seq analysis should confirm that samples are what they are expected to be. This is especially important for tumour-normal pairs, family studies and large cohorts.

Check | Why it matters | Possible method
Sex check | Detects metadata inconsistencies or sample swaps. | Chromosome X/Y coverage, heterozygosity or genotype-based checks.
Contamination estimate | Mixed DNA can create false variants and distort allele fractions. | Genotype-aware contamination tools or allele-balance checks.
Tumour-normal concordance | Confirms matched samples belong to the same individual. | SNP concordance and relatedness checks.
Cross-sample swaps | Sample mix-ups can invalidate interpretation. | Genotype fingerprints or known SNP panels.
Batch effects | Can affect coverage, CNV and variant calling. | QC clustering by run, lane, capture batch or library-prep batch.
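
As an illustration of a coverage-based sex check, the sketch below compares mean chrX depth with mean autosomal depth using a mosdepth summary file (columns chrom, length, bases, mean, min, max). It assumes human contigs named chr1 to chr22 and chrX; the function name x_to_autosome_ratio is hypothetical. As a rough guide only, a ratio near 1.0 suggests two X copies and a ratio near 0.5 suggests one, but tumour copy-number changes, sex-chromosome aneuploidy and capture bias can all shift the value.

```shell
# Hypothetical helper: ratio of chrX mean depth to autosomal mean depth,
# read from a *.mosdepth.summary.txt file (assumes chr-prefixed contig names).
x_to_autosome_ratio() {
  awk -F'\t' '
    $1 ~ /^chr([1-9]|1[0-9]|2[0-2])$/ { len += $2; bases += $3 }   # autosomes
    $1 == "chrX" { x_mean = $4 }
    END {
      if (len == 0 || bases == 0 || x_mean == "") { print "NA"; exit 1 }
      printf "%.2f\n", x_mean / (bases / len)
    }
  ' "$1"
}

# Example use:
# x_to_autosome_ratio results/qc/coverage/S1.mosdepth.summary.txt
```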

13. Germline variant calling

Germline variant calling identifies inherited variants in an individual or cohort. Common outputs include VCF or gVCF files containing SNVs and indels.

  • Prepare BAM: align, sort, index, mark duplicates and collect QC.
  • Call variants: use a germline caller on single samples, families or cohorts.
  • Filter and annotate: apply caller filters, frequency data, consequence annotation and phenotype context.

Example germline variant calling with bcftools
mkdir -p results/variants

# Pile up reads against the reference and stream uncompressed BCF to the caller
bcftools mpileup -Ou \
  -f reference/genome.fa \
  results/alignments/S1.markdup.bam | \
  bcftools call -mv -Oz -o results/variants/S1.raw.vcf.gz

bcftools index results/variants/S1.raw.vcf.gz

# Hard filter: sites matching the -e expression are dropped from the output
bcftools filter \
  -e 'QUAL<20 || INFO/DP<10' \
  -Oz -o results/variants/S1.filtered.vcf.gz \
  results/variants/S1.raw.vcf.gz

bcftools index results/variants/S1.filtered.vcf.gz

The simple example above is for tutorial purposes. Production germline workflows usually require carefully validated parameters, appropriate callers, quality calibration or filtering, reference resources and project-specific QC.

14. Somatic variant calling

Somatic variant calling identifies acquired variants, often in tumour DNA. A matched normal sample is preferred because it helps distinguish somatic variants from inherited variants and technical artefacts.

Design | Advantages | Challenges
Tumour-normal | Best for distinguishing somatic and germline variants. | Requires matched normal DNA and careful sample concordance checks.
Tumour-only | Can be used when normal DNA is unavailable. | More difficult germline filtering and higher interpretation uncertainty.
Panel-based somatic | High depth and focused clinical/research targets. | Requires assay-specific artefact handling and coverage review.
cfDNA / liquid biopsy | Non-invasive and useful for low-frequency variants. | Requires high depth, UMI/error suppression and strict noise modelling.

Somatic QC considerations

  • Tumour purity and ploidy can strongly affect variant allele fractions.
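  • Tumour DNA in the matched normal (tumour-in-normal contamination) can reduce sensitivity, because true somatic variants then also appear in the normal.
  • Oxidative artefacts, FFPE damage and sequencing context biases require filtering.
  • A panel of normals can help remove recurrent technical artefacts, and germline population databases can help remove common inherited variants.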
  • Clinical-style interpretation should include evidence levels and review by qualified experts.
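
The effect of purity on allele fraction can be illustrated with the usual mixture formula: expected VAF = (purity × mutated copies) / (purity × tumour copy number + (1 − purity) × normal copy number). The sketch below is a simplified calculation under stated assumptions: the variant is clonal, the copy numbers are known, and sequencing is unbiased. The function name expected_vaf is hypothetical.

```shell
# Hypothetical helper: expected variant allele fraction of a clonal somatic variant.
# Arguments: purity (0-1), mutated copies in tumour cells, total tumour copy number,
#            and optionally the normal copy number (default 2).
expected_vaf() {
  awk -v p="$1" -v m="$2" -v ct="$3" -v cn="${4:-2}" 'BEGIN {
    printf "%.3f\n", (p * m) / (p * ct + (1 - p) * cn)
  }'
}

# Heterozygous variant in a diploid region at 60% purity
# expected_vaf 0.6 1 2   # prints 0.300
# The same variant at 30% purity
# expected_vaf 0.3 1 2   # prints 0.150
```

A clonal heterozygous variant in a 30% pure tumour is therefore expected near 15% VAF, which is why depth and caller sensitivity need to be judged against purity.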

15. Copy-number and structural-variant analysis

DNA-seq can be used to detect copy-number variants and structural variants, but performance depends strongly on assay type, coverage, insert size and sample quality.

Variant type | Common evidence | Assay notes
Copy-number variants | Read depth, allele balance and segmentation. | Works well in WGS; exome/panel CNV needs normalization and careful validation.
Deletions | Read depth loss, split reads and discordant pairs. | Resolution depends on read length, coverage and repetitive sequence context.
Duplications | Read depth gain and discordant pairs. | Tandem duplications can be difficult in repetitive regions.
Inversions | Discordant read orientation and split reads. | Short reads can struggle near repeats or segmental duplications.
Translocations | Discordant pairs, split reads and breakpoint evidence. | Requires careful artefact filtering and visual review.

Long-read sequencing can be more informative for complex structural variants, phasing and repetitive regions, but short-read DNA-seq remains useful for many CNV and SV applications.

16. Variant annotation

Annotation adds context to raw variant calls. It helps identify affected genes, predicted consequences, transcript effects, population frequency, known clinical significance and cancer relevance.

  • Gene consequence: missense, nonsense, frameshift, splice, synonymous, intronic or regulatory consequences.
  • Population frequency: gnomAD or other population resources help distinguish rare variants from common polymorphisms.
  • Clinical databases: ClinVar and similar resources provide submitted interpretations and evidence context.
  • Cancer resources: COSMIC, CIViC, OncoKB-style resources and cBioPortal-style cohort data can support cancer interpretation.

Example annotation command with SnpEff
mkdir -p results/annotation

snpEff -v GRCh38.99 \
  results/variants/S1.filtered.vcf.gz \
  > results/annotation/S1.snpeff.vcf

Annotation database versions change. Always report annotation tool versions, transcript set, genome assembly and database release.

17. Variant filtering and prioritization

Variant filtering turns a large call set into a smaller list of candidates for review. Filtering should be transparent and reproducible.

Filter type | Common criteria | Caution
Quality filters | Depth, genotype quality, mapping quality, strand bias, allele balance. | Hard thresholds can remove real variants in difficult regions.
Frequency filters | Population allele frequency from gnomAD or cohort controls. | Frequency interpretation depends on ancestry, disease model and dataset coverage.
Consequence filters | Protein-altering, splice, loss-of-function or regulatory categories. | Noncoding and synonymous variants can still matter in some contexts.
Inheritance filters | Dominant, recessive, de novo, compound heterozygous or family-based models. | Requires accurate pedigree and sample identity.
Cancer filters | Variant allele fraction, tumour-normal status, panel of normals, hotspot lists. | Tumour purity and clonality must be considered.
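
One simple way to keep filtering transparent is to report how many records carry each FILTER value. The sketch below counts the seventh VCF column in a gzip- or bgzip-compressed VCF. It is most informative when filters are applied as labels (soft filters) rather than by dropping records. The function name filter_tally is hypothetical.

```shell
# Hypothetical helper: count variant records per FILTER value in a compressed VCF.
# bgzip output is gzip-compatible, so gzip -dc can read it.
filter_tally() {
  gzip -dc "$1" |
    awk -F'\t' '
      !/^#/ { n[$7]++ }                            # column 7 is FILTER
      END   { for (f in n) printf "%s\t%d\n", f, n[f] }
    ' |
    LC_ALL=C sort
}

# Example use:
# filter_tally results/variants/S1.filtered.vcf.gz
```

Keeping this table alongside the filter expression in the report lets a reviewer see how many calls each rule affected.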

18. Biological and clinical interpretation

Interpretation connects variant data to the scientific or clinical question. It should include review of evidence, assay limitations, coverage gaps and potential artefacts.

  • Research interpretation: focuses on mechanisms, candidate genes, pathways, cohort patterns and testable hypotheses.
  • Industrial interpretation: focuses on reproducible deliverables, scalable pipelines, QC dashboards and decision-support outputs.
  • Clinical-style interpretation: requires validated methods, evidence review, clear classification rules and appropriate expert oversight.
  • AI-assisted interpretation: can summarize evidence and prioritize variants, but final decisions must remain traceable and reviewable.

19. Example DNA-seq analysis workflow

The following example illustrates a simplified short-read DNA-seq workflow. It should be adapted and validated for real projects.

Project folders
mkdir -p reference data/fastq targets \
  results/qc/{fastqc,multiqc,alignment,coverage} \
  results/alignments results/variants results/annotation results/logs

Reference indexes
samtools faidx reference/genome.fa
bwa index reference/genome.fa

FASTQ QC
fastqc data/fastq/*.fastq.gz \
  --outdir results/qc/fastqc \
  --threads 8

multiqc results/qc \
  --outdir results/qc/multiqc

Align one sample
sample="S1"

bwa mem -t 16 \
  -R "@RG\tID:${sample}\tSM:${sample}\tLB:${sample}_lib1\tPL:ILLUMINA" \
  reference/genome.fa \
  "data/fastq/${sample}_R1.fastq.gz" \
  "data/fastq/${sample}_R2.fastq.gz" \
  2> "results/logs/${sample}.bwa.log" | \
  samtools sort -@ 8 -o "results/alignments/${sample}.sorted.bam"

samtools index "results/alignments/${sample}.sorted.bam"

Alignment and coverage QC
samtools flagstat "results/alignments/${sample}.sorted.bam" \
  > "results/qc/alignment/${sample}.flagstat.txt"

samtools stats "results/alignments/${sample}.sorted.bam" \
  > "results/qc/alignment/${sample}.samtools.stats.txt"

mosdepth --threads 8 \
  "results/qc/coverage/${sample}" \
  "results/alignments/${sample}.sorted.bam"

Simple variant calling with bcftools
# Stream uncompressed BCF from mpileup to the caller to avoid needless compression
bcftools mpileup -Ou \
  -f reference/genome.fa \
  "results/alignments/${sample}.sorted.bam" | \
  bcftools call -mv -Oz -o "results/variants/${sample}.raw.vcf.gz"

bcftools index "results/variants/${sample}.raw.vcf.gz"

20. DNA-seq deliverables and reports

DNA-seq deliverables should include both data files and interpretive summaries. The exact deliverable set depends on project type.

Recommended deliverables

  • FASTQ QC and MultiQC reports.
  • Sorted and indexed BAM or CRAM files.
  • Alignment, duplication, coverage and target-performance metrics.
  • VCF/BCF files for raw and filtered variants.
  • Annotated variant tables with database versions.
  • Coverage-gap reports for target regions, if applicable.
  • CNV/SV outputs when in scope.
  • Somatic tumour-normal summary, if applicable.
  • Final report with methods, QC interpretation, limitations and prioritized findings.
  • Workflow logs, commands, environment files and software versions.

21. Reproducibility and workflow automation

Reproducible DNA-seq analysis requires more than a list of tools. It requires consistent references, fixed parameters, versioned software and documented decisions.

  • Version references: record genome assembly, FASTA checksum, annotation version and target BED version.
  • Use environments: use mamba/conda, containers or workflow-managed software environments.
  • Automate workflows: use Nextflow or Snakemake when projects include many samples or repeated analyses.
  • Keep audit trails: preserve logs, MultiQC reports, variant filters and manual review decisions.
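
A small run manifest is one way to capture the reference checksum and tool versions next to the results. The sketch below assumes GNU coreutils (sha256sum) is available; the function name write_manifest and the manifest layout are hypothetical. Only samtools and bcftools are queried, because both support a --version option; other tools would need their own version commands.

```shell
# Hypothetical helper: write a tab-separated manifest with the reference checksum
# and the versions of samtools and bcftools (if they are on PATH).
write_manifest() {
  local fasta="$1" out="$2"
  local tool
  {
    printf 'date_utc\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'reference_fasta\t%s\n' "$fasta"
    printf 'reference_sha256\t%s\n' "$(sha256sum "$fasta" | cut -d' ' -f1)"
    for tool in samtools bcftools; do
      if command -v "$tool" >/dev/null 2>&1; then
        printf '%s\t%s\n' "$tool" "$("$tool" --version 2>/dev/null | head -n 1)"
      else
        printf '%s\tnot found\n' "$tool"
      fi
    done
  } > "$out"
}

# Example use:
# write_manifest reference/genome.fa results/logs/manifest.tsv
```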

22. DNA-seq data analysis cheat sheet

Step | Common tools | Main outputs
FASTQ QC | FastQC, fastp, MultiQC | Read-quality reports and sample-level QC summary.
Trimming | fastp, Cutadapt, Trimmomatic | Trimmed FASTQ files and preprocessing reports.
Alignment | BWA-MEM, BWA-MEM2, Bowtie2, minimap2 | SAM/BAM/CRAM alignments.
BAM processing | SAMtools, Picard, GATK tools | Sorted, indexed, duplicate-marked BAM/CRAM files.
Coverage QC | mosdepth, bedtools, Picard, Qualimap | Coverage tables, target metrics and coverage gaps.
Germline variants | GATK HaplotypeCaller, DeepVariant, bcftools, FreeBayes | VCF/gVCF variant files.
Somatic variants | GATK Mutect2, Strelka-style workflows, VarScan, LoFreq-style tools | Somatic VCF files and filtering summaries.
CNV/SV analysis | CNVkit, Delly, Manta-style tools, read-depth and split-read tools | CNV segments, structural variants and supporting metrics.
Annotation | VEP, SnpEff, SnpSift, ANNOVAR-style workflows, bcftools plugins | Annotated VCFs and variant tables.
Reporting | MultiQC, R/Python, workflow reports, AI-assisted summaries | QC report, methods, prioritized findings and limitations.

Frequently asked questions

What is DNA-seq data analysis?

DNA-seq data analysis is the computational processing of DNA sequencing reads to evaluate sequence variation, coverage, copy number, structural variation, sample identity, contamination, target performance and biologically or clinically relevant genomic findings.

What is the difference between WGS, WES and targeted panel analysis?

Whole-genome sequencing evaluates the entire genome, whole-exome sequencing enriches protein-coding regions, and targeted panels focus on selected genes or regions. The analysis principles overlap, but coverage, sensitivity, variant filtering and reporting differ.

Which files are needed for DNA-seq analysis?

Typical inputs include FASTQ files, a sample sheet, reference genome FASTA, known sites resources when required, target BED files for exomes or panels, and metadata describing sample type, organism, library preparation and sequencing design.

Which aligner is commonly used for short-read DNA-seq?

BWA-MEM, BWA-MEM2, Bowtie2 and minimap2 are common options for short-read DNA alignment. Many germline and somatic workflows use BWA-MEM or BWA-MEM2 with sorted, indexed BAM or CRAM outputs.

Should duplicates be removed in DNA-seq analysis?

Duplicates are commonly marked in WGS, WES and many targeted DNA-seq workflows, but whether they should be removed or ignored depends on the assay. UMI-based assays should use UMI-aware consensus or deduplication instead of simple coordinate-based duplicate removal.

What is the difference between germline and somatic variant calling?

Germline variant calling identifies inherited variants present in the individual, often using one sample or a family/cohort design. Somatic variant calling identifies acquired variants, often by comparing tumour DNA with matched normal DNA or using tumour-only methods with additional filtering.

What is variant annotation?

Variant annotation adds biological and clinical context to variants, such as gene name, transcript consequence, amino acid change, population frequency, known clinical significance, cancer relevance and predicted functional effect.

Can AI help with DNA-seq analysis?

AI can assist with QC summarization, variant prioritization, literature triage, report drafting and interpretation support, but the analysis must remain reproducible, auditable and based on validated tools, explicit thresholds and expert review.