Epigenomics Analysis Tutorial

Bisulfite-seq data analysis: from converted reads to methylation maps.

A practical tutorial for whole-genome bisulfite sequencing, RRBS and targeted bisulfite sequencing. It covers experimental design, FASTQ QC, bisulfite-specific trimming, bisulfite-aware alignment, methylation extraction, conversion efficiency, coverage QC, differential methylation, annotation, visualization and reproducible reporting.

1. Overview

Bisulfite sequencing estimates DNA methylation by comparing converted and unconverted cytosines. After bisulfite treatment, unmethylated cytosines are read as thymines, while methylated cytosines remain cytosines. The ratio of methylated reads to total informative reads is used as the methylation estimate.

PreprocessQC, adapter removal and protocol-specific trimming.
Map and callBisulfite-aware alignment and methylation extraction.
InterpretCoverage filtering, DMP/DMR testing, annotation and reporting.
Core principle: bisulfite-seq is not ordinary DNA-seq. The conversion chemistry changes sequence complexity, strand interpretation and quality-control expectations.

2. Bisulfite-seq assay types

AssayObjectiveAnalysis focus
WGBSGenome-wide methylation profiling.Large data volume, genome-wide coverage, conversion efficiency and DMR discovery.
RRBSCpG-rich reduced representation.Restriction-enzyme bias, M-bias, RRBS-specific trimming and region-specific coverage.
Targeted bisulfite-seqSelected loci or panels.Target coverage, primer handling and locus-level interpretation.
Amplicon bisulfite-seqDeep methylation profiling of specific regions.Per-amplicon QC, overlap handling and high-depth interpretation.

3. Experimental design

Before analysis, define the biological comparison, replicate structure, batch variables, methylation context and reporting scope.

  • Clarify whether the analysis targets CpG, CHG, CHH or all cytosines.
  • Use balanced biological replicates across conditions and batches.
  • Record library type: directional, non-directional, PBAT-like, RRBS or targeted.
  • Define minimum coverage and whether single-site or region-level testing is preferred.
  • Record whether spike-in controls are available for conversion-efficiency estimation.

4. Inputs and metadata

InputFormatUse
Raw readsFASTQ.GZSequencing reads after bisulfite conversion.
Reference genomeFASTAUsed to build bisulfite-converted indexes.
Sample sheetTSV/CSVDefines groups, batches, library type and FASTQ paths.
Gene annotationGTF/GFF3Promoter, gene body and transcript annotation.
Feature regionsBEDCpG islands, promoters, enhancers, targets or custom regions.
Example sample sheet
sample_id	group	batch	assay	library_type	fastq_1	fastq_2
S1	control	A	WGBS	directional	S1_R1.fastq.gz	S1_R2.fastq.gz
S2	control	A	WGBS	directional	S2_R1.fastq.gz	S2_R2.fastq.gz
S3	treated	B	WGBS	directional	S3_R1.fastq.gz	S3_R2.fastq.gz
S4	treated	B	WGBS	directional	S4_R1.fastq.gz	S4_R2.fastq.gz

5. FASTQ quality control

Bisulfite-seq base composition can look unusual because conversion changes C/T balance. Interpret FastQC warnings in the context of the assay.

FastQC and MultiQC
mkdir -p results/qc/fastqc results/qc/multiqc

fastqc data/fastq/*.fastq.gz --outdir results/qc/fastqc --threads 8

multiqc results/qc --outdir results/qc/multiqc
  • Inspect read quality, adapter content and overrepresented sequences.
  • Check duplication, but interpret it differently for WGBS, RRBS and amplicons.
  • Compare read counts and quality between samples and between R1/R2.
  • Look for signs of short inserts or protocol-specific sequence bias.

6. Bisulfite-specific trimming

Trimming removes adapters and biased bases. RRBS often needs special treatment because restriction-enzyme and fill-in steps can create biased methylation at read ends.

Trim Galore paired-end example
mkdir -p results/trimmed results/qc/trim_galore

trim_galore --paired --quality 20 --illumina \
  --output_dir results/trimmed \
  data/fastq/S1_R1.fastq.gz data/fastq/S1_R2.fastq.gz \
  2> results/qc/trim_galore/S1.trim_galore.log
RRBS mode example
trim_galore --paired --rrbs --quality 20 \
  --output_dir results/trimmed \
  data/fastq/S1_R1.fastq.gz data/fastq/S1_R2.fastq.gz

7. Reference genome and bisulfite indexes

Bisulfite-aware aligners use converted reference indexes. Keep these separate from ordinary DNA-seq indexes and record reference versions.

Build Bismark index
mkdir -p reference/bismark_genome
cp reference/genome.fa reference/bismark_genome/

bismark_genome_preparation \
  --bowtie2 \
  --parallel 8 \
  reference/bismark_genome

8. Bisulfite-aware alignment

Ordinary DNA alignment is not suitable for bisulfite reads. Use Bismark, BS-Seeker2, bwa-meth or another bisulfite-aware method.

Bismark paired-end alignment
mkdir -p results/bismark results/logs

bismark \
  --genome reference/bismark_genome \
  --bowtie2 \
  --parallel 4 \
  -1 results/trimmed/S1_R1_val_1.fq.gz \
  -2 results/trimmed/S1_R2_val_2.fq.gz \
  --output_dir results/bismark \
  2> results/logs/S1.bismark.log
The directional or non-directional setting must match the library protocol. Wrong strand settings can cause low mapping or distorted methylation calls.

9. Deduplication

Duplicate removal is common in WGBS, but RRBS and amplicon bisulfite-seq require assay-aware interpretation because many reads may start at the same coordinate by design.

Deduplicate Bismark BAM
mkdir -p results/bismark/deduplicated

deduplicate_bismark \
  --bam \
  --paired \
  --output_dir results/bismark/deduplicated \
  results/bismark/S1_R1_val_1_bismark_bt2_pe.bam

10. Methylation extraction

Methylation extraction generates cytosine-level counts, CpG coverage files and browser-track-friendly outputs.

Bismark methylation extractor
mkdir -p results/methylation

bismark_methylation_extractor \
  --paired-end \
  --comprehensive \
  --merge_non_CpG \
  --bedGraph \
  --cytosine_report \
  --genome_folder reference/bismark_genome \
  --output results/methylation \
  results/bismark/deduplicated/S1_R1_val_1_bismark_bt2_pe.deduplicated.bam

11. Methylation-specific QC

MetricWhat it detectsInterpretation
Mapping rateHow many reads align to the converted genome.Low rate can indicate wrong reference, contamination, poor quality or incorrect strand setting.
Conversion efficiencyCompleteness of bisulfite conversion.Poor conversion can create false methylation.
M-biasPosition-specific methylation bias.May require additional end trimming.
CpG coverageDepth at methylation sites.Low coverage increases uncertainty; extreme coverage may reflect bias.
Global methylationSample-level methylation distribution.Outliers may be biological or technical and should be investigated.

12. Differential methylation analysis

Differential methylation can be tested at single cytosines or across regions. Region-level analysis is often more robust and easier to interpret biologically.

DMPsSingle cytosines with methylation differences.
DMRsClusters of CpGs with coordinated methylation changes.
RegionsPromoters, CpG islands, enhancers, gene bodies or custom regions.

Common R/Bioconductor approaches include methylKit, DSS, bsseq and DMRcate. Selection depends on coverage, design complexity, sample size and whether smoothing is appropriate.

13. DMR annotation

Annotation connects methylation changes to genomic features, but proximity does not prove regulatory causality. Interpret DMRs together with biological context and, where possible, expression or chromatin data.

PromotersMethylation near transcription start sites can be linked to gene regulation.
CpG islandsImportant regulatory features often evaluated with shores and shelves.
EnhancersIntegration with ATAC-seq or ChIP-seq strengthens interpretation.
Gene bodiesInterpretation is context-dependent and should be handled carefully.
Annotate DMRs by overlap
bedtools intersect \
  -a results/dmr/dmrs.bed \
  -b annotation/promoters.bed \
  -wa -wb \
  > results/dmr/dmrs_overlapping_promoters.tsv

14. Visualization

Common visualization outputs include bedGraph and bigWig methylation tracks, DMR heatmaps, global methylation distributions, PCA/clustering and browser snapshots of key loci.

Convert bedGraph to bigWig
sort -k1,1 -k2,2n S1.methylation.bedGraph > S1.methylation.sorted.bedGraph

bedGraphToBigWig \
  S1.methylation.sorted.bedGraph \
  reference/genome.chrom.sizes \
  S1.methylation.bw

15. Example workflow summary

Minimal paired-end Bismark workflow
# 1. QC
fastqc data/fastq/*.fastq.gz --outdir results/qc/fastqc --threads 8

# 2. Trim
trim_galore --paired --quality 20 --output_dir results/trimmed \
  data/fastq/S1_R1.fastq.gz data/fastq/S1_R2.fastq.gz

# 3. Index
bismark_genome_preparation --bowtie2 reference/bismark_genome

# 4. Align
bismark --genome reference/bismark_genome --bowtie2 \
  -1 results/trimmed/S1_R1_val_1.fq.gz \
  -2 results/trimmed/S1_R2_val_2.fq.gz \
  --output_dir results/bismark

# 5. Deduplicate
deduplicate_bismark --bam --paired results/bismark/S1_R1_val_1_bismark_bt2_pe.bam

# 6. Extract methylation
bismark_methylation_extractor --paired-end --comprehensive --bedGraph \
  --cytosine_report --genome_folder reference/bismark_genome \
  results/bismark/S1_R1_val_1_bismark_bt2_pe.deduplicated.bam

# 7. Summarize
multiqc results --outdir results/qc/multiqc_final

16. Deliverables and reporting

  • FASTQ QC, trimming, alignment, deduplication and methylation-extraction reports.
  • Final MultiQC report with sample-level summaries.
  • Sorted BAM files where required, plus methylation cytosine reports and CpG coverage files.
  • bedGraph or bigWig tracks for genome-browser visualization.
  • Conversion-efficiency assessment when controls are available.
  • DMP/DMR tables with effect sizes, adjusted p-values and coverage metrics.
  • DMR annotation against promoters, genes, CpG islands, enhancers and custom features.
  • Methods section with software versions, reference assembly, parameters and limitations.

17. Bisulfite-seq data analysis cheat sheet

StepCommon toolsMain outputs
FASTQ QCFastQC, MultiQCRaw-read QC report.
TrimmingTrim Galore, Cutadapt, fastpTrimmed FASTQ and trimming logs.
Bisulfite indexingBismark, BS-Seeker2, bwa-methConverted genome indexes.
AlignmentBismark, BS-Seeker2, bwa-methBisulfite-aware BAM and alignment report.
Deduplicationdeduplicate_bismark, SAMtools/Picard in selected workflowsDeduplicated BAM and duplicate metrics.
Methylation callingBismark methylation extractor, MethylDackelCytosine reports, CpG coverage and bedGraph files.
Differential methylationmethylKit, DSS, bsseq, DMRcateDMP and DMR tables.
AnnotationBEDTools, genomation, ChIPseeker-style toolsAnnotated regions and gene-feature overlaps.

Frequently asked questions

What is bisulfite-seq data analysis?

It is the computational analysis of bisulfite-converted DNA sequencing reads to estimate DNA methylation levels at cytosines, most often CpG sites, and to identify methylation differences between samples or groups.

Why does bisulfite-seq need special alignment?

Bisulfite treatment converts unmethylated cytosines to uracils, which are read as thymines. This changes read composition and requires bisulfite-aware genome indexing, alignment and methylation calling.

What is the difference between WGBS and RRBS?

WGBS profiles methylation genome-wide, while RRBS enriches CpG-rich regions using restriction digestion and size selection. RRBS is cheaper but covers a reduced and protocol-dependent part of the genome.

Which tools are commonly used?

Common tools include FastQC, MultiQC, Trim Galore, Cutadapt, Bismark, BS-Seeker2, bwa-meth, MethylDackel, methylKit, DSS, bsseq, DMRcate, BEDTools and genome-browser utilities.

What is conversion efficiency?

Conversion efficiency measures how completely unmethylated cytosines were converted. Poor conversion can cause false methylation calls, so spike-ins or other conversion checks are important where available.

Should duplicates be removed?

For WGBS, duplicate removal is commonly performed. For RRBS, targeted or amplicon bisulfite sequencing, duplicate handling must consider protocol design because identical genomic starts may be expected.

What are DMPs and DMRs?

DMPs are differentially methylated positions, usually single cytosines. DMRs are regions containing multiple CpGs with coordinated methylation differences between groups.

Can AI help with bisulfite-seq?

AI can help summarize QC, flag outliers, draft reports and interpret DMR-associated genes or pathways, while the core methylation calling and statistics should remain reproducible and auditable.