Microbiome and Metagenomics Tutorial

Metagenomics data analysis: from mixed-community reads to biological insight.

A practical tutorial for 16S/amplicon and shotgun metagenomics projects. It covers experimental design, metadata, controls, FASTQ quality control, trimming, host-read removal, taxonomic profiling, functional profiling, diversity analysis, assembly, metagenome-assembled genomes, contamination control, visualization and reproducible reporting.

1. Overview: what is metagenomics?

Metagenomics analyzes DNA recovered directly from mixed microbial communities without culturing individual organisms. It is used in human microbiome studies, environmental microbiology, pathogen surveillance, biotechnology, agriculture, wastewater monitoring, food safety and industrial process monitoring.

  • Profile organisms: estimate bacterial, archaeal, viral, fungal or parasite composition.
  • Profile function: identify genes, pathways, enzymes, resistance markers or virulence factors.
  • Compare communities: use diversity, abundance and statistical models to interpret differences.
Core principle: metagenomics is highly sensitive to sampling, extraction, controls, database choice and compositional statistics. Good metadata and contamination control are as important as the computational pipeline.

2. Main types of metagenomics studies

Study type | Typical question | Analysis focus
16S rRNA amplicon sequencing | Which bacteria and archaea are present? | ASVs/OTUs, taxonomy, alpha diversity, beta diversity and group comparisons.
ITS amplicon sequencing | Which fungi are present? | Fungal community composition and diversity.
Shotgun metagenomics | Which organisms and genes are present? | Taxonomy, functional profiling, strain-level analysis and assembly.
Viromics | Which viruses are present? | Viral detection, contig assembly, host prediction and low-biomass contamination control.
Metatranscriptomics | Which microbial genes are expressed? | RNA-based functional activity with additional rRNA depletion and expression-analysis concerns.
Genome-resolved metagenomics | Can genomes be reconstructed from mixed communities? | Assembly, binning, MAG quality, taxonomy and metabolic reconstruction.

3. Experimental design

Metagenomics studies should be designed around sampling, controls, sequencing depth and expected biomass. Bias can enter at every stage: sampling, DNA extraction, library preparation, sequencing, database classification and statistical analysis.

Questions to answer early

  • Is the project 16S/ITS amplicon, shotgun metagenomics, viromics or genome-resolved metagenomics?
  • What environment is being sampled: gut, soil, water, skin, oral, respiratory, wastewater, food, bioreactor or another setting?
  • Are samples high-biomass or low-biomass?
  • Which controls are included: extraction blanks, PCR blanks, mock communities and positive controls?
  • Which confounders must be recorded: batch, extraction kit, storage, diet, medication, age, geography, time point or sequencing run?
  • Is the primary endpoint taxonomy, function, diversity, pathogen detection, AMR profiling or MAG reconstruction?
  • What database versions and taxonomic naming conventions will be used?
Low-biomass microbiome samples are especially vulnerable to reagent and laboratory contamination. Controls should be planned from the start, not added only during analysis.

4. Metadata and controls

Metagenomics interpretation depends heavily on metadata. Technical variables can strongly influence observed community composition.

Example metagenomics sample sheet
sample_id	group	body_site	batch	control_type	fastq_1	fastq_2
S1	control	gut	A	none	S1_R1.fastq.gz	S1_R2.fastq.gz
S2	control	gut	A	none	S2_R1.fastq.gz	S2_R2.fastq.gz
S3	treated	gut	B	none	S3_R1.fastq.gz	S3_R2.fastq.gz
S4	treated	gut	B	none	S4_R1.fastq.gz	S4_R2.fastq.gz
Blank1	control_blank	NA	A	extraction_blank	Blank1_R1.fastq.gz	Blank1_R2.fastq.gz
Mock1	mock	NA	A	mock_community	Mock1_R1.fastq.gz	Mock1_R2.fastq.gz
  • Extraction blanks: detect reagent and laboratory background contamination.
  • PCR blanks: detect amplification and library-preparation contamination.
  • Mock communities: evaluate classification accuracy and pipeline bias.
  • Positive controls: confirm that expected organisms can be detected.
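
The sample sheet and control types described above can be checked automatically before any pipeline run. The following is a minimal sketch, assuming a tab-separated sheet with the same columns as the example; the helper name check_sample_sheet and its rules (consistent column count, unique sample IDs, at least one extraction blank and one mock community) are illustrative assumptions, not part of any tool.

```shell
# Sketch: sanity-check a tab-separated sample sheet before running a pipeline.
# check_sample_sheet is a hypothetical helper; column 5 is assumed to be control_type.
check_sample_sheet() {
  awk -F'\t' '
    NR == 1 { ncol = NF; next }        # the header defines the expected column count
    NF != ncol { printf "line %d: expected %d columns, found %d\n", NR, ncol, NF; bad = 1 }
    seen[$1]++ { printf "line %d: duplicate sample_id %s\n", NR, $1; bad = 1 }
    $5 == "extraction_blank" { blanks++ }
    $5 == "mock_community"   { mocks++ }
    END {
      if (!blanks) { print "no extraction blank listed"; bad = 1 }
      if (!mocks)  { print "no mock community listed"; bad = 1 }
      exit bad + 0                     # non-zero exit status when any check failed
    }
  ' "$1"
}

# Demo on a three-row sheet with hypothetical file names.
sheet=$(mktemp)
{
  printf 'sample_id\tgroup\tbody_site\tbatch\tcontrol_type\tfastq_1\tfastq_2\n'
  printf 'S1\tcontrol\tgut\tA\tnone\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n'
  printf 'Blank1\tcontrol_blank\tNA\tA\textraction_blank\tBlank1_R1.fastq.gz\tBlank1_R2.fastq.gz\n'
  printf 'Mock1\tmock\tNA\tA\tmock_community\tMock1_R1.fastq.gz\tMock1_R2.fastq.gz\n'
} > "$sheet"
check_sample_sheet "$sheet" && echo "sample sheet OK"
```

Because the function returns a non-zero status on any failed check, it can gate a workflow script. Projects that also use PCR blanks or positive controls would extend the rules accordingly.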

5. Input files and reference resources

Input | Typical format | Use
Raw reads | FASTQ.GZ | Sequencing reads from amplicon or shotgun libraries.
Sample metadata | TSV/CSV | Groups, batches, sites, controls and covariates.
Host reference | FASTA/index files | Removal of human, animal, plant or other host reads.
Taxonomic database | Tool-specific index | Species, genus or higher-level classification.
Functional database | Tool-specific database | Genes, pathways, enzymes, AMR or virulence markers.
Primer information | Sequences or metadata | Required for amplicon primer trimming and expected region.

6. FASTQ quality control

FASTQ QC checks sequencing quality before downstream profiling. In metagenomics, read quality, adapter contamination, sample read depth and control samples are all important.

FastQC and MultiQC
mkdir -p results/qc/fastqc results/qc/multiqc

fastqc data/fastq/*.fastq.gz \
  --outdir results/qc/fastqc \
  --threads 8

multiqc results/qc \
  --outdir results/qc/multiqc
  • Check read quality, adapter content and per-sample read counts.
  • Compare control samples against true biological samples.
  • Review GC-content distribution, which may be multimodal in mixed communities.
  • Check whether paired-end reads are suitable for merging in amplicon workflows.
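
Per-sample read counts, mentioned in the checklist above, can be tabulated directly from the FASTQ files. This is a minimal sketch, assuming gzip-compressed FASTQ with the usual four lines per record; count_reads is a hypothetical helper and the demo file is generated on the fly.

```shell
# Sketch: count reads in a gzip-compressed FASTQ file.
# Assumes four lines per record (no wrapped sequence lines).
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Demo with a two-read FASTQ written to a temporary directory.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/demo_R1.fastq.gz"

# One row per file: file name, read count.
for fq in "$tmp"/*.fastq.gz; do
  printf '%s\t%s\n' "$(basename "$fq")" "$(count_reads "$fq")"
done
```

Running the same loop over data/fastq/*.fastq.gz gives a quick table for spotting low-depth samples and for comparing blanks against biological samples.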

7. Adapter, primer and quality trimming

Trimming strategies differ for amplicon and shotgun workflows. Amplicon analysis often requires primer removal, while shotgun analysis usually focuses on adapters and low-quality tails.

Shotgun trimming with fastp
mkdir -p results/trimmed results/qc/fastp

fastp \
  --in1 data/fastq/S1_R1.fastq.gz \
  --in2 data/fastq/S1_R2.fastq.gz \
  --out1 results/trimmed/S1_R1.trimmed.fastq.gz \
  --out2 results/trimmed/S1_R2.trimmed.fastq.gz \
  --html results/qc/fastp/S1.fastp.html \
  --json results/qc/fastp/S1.fastp.json \
  --thread 8
Amplicon primer trimming concept with Cutadapt
cutadapt \
  -g FORWARD_PRIMER_SEQUENCE \
  -G REVERSE_PRIMER_SEQUENCE \
  -o results/trimmed/S1_R1.primertrim.fastq.gz \
  -p results/trimmed/S1_R2.primertrim.fastq.gz \
  data/fastq/S1_R1.fastq.gz \
  data/fastq/S1_R2.fastq.gz

8. Host read removal

Metagenomes from human-, animal- or plant-associated samples often contain host DNA. Host filtering reduces privacy risk, storage burden and false microbial assignments.

Host depletion concept with Bowtie2
mkdir -p results/host_filtered results/logs

# Read pairs that do not align concordantly to the host are kept.
# Bowtie2 replaces "%" with 1 or 2, so this writes S1_R1.fastq.gz and S1_R2.fastq.gz.
bowtie2 \
  -x reference/host/bowtie2/host_genome \
  -1 results/trimmed/S1_R1.trimmed.fastq.gz \
  -2 results/trimmed/S1_R2.trimmed.fastq.gz \
  --very-sensitive \
  -p 16 \
  --un-conc-gz results/host_filtered/S1_R%.fastq.gz \
  -S /dev/null \
  2> results/logs/S1.host_filtering.log
Document the host reference, aligner settings and the fraction of reads removed. Host filtering can affect downstream profiles if microbial reads are accidentally removed or host reads remain.
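
The fraction of reads removed can be derived from read-pair counts before and after filtering. A minimal sketch follows; host_fraction is a hypothetical helper and the counts in the example are made up for illustration.

```shell
# Sketch: percentage of read pairs removed by host filtering.
# Usage: host_fraction PAIRS_BEFORE PAIRS_AFTER
host_fraction() {
  awk -v raw="$1" -v kept="$2" 'BEGIN {
    if (raw <= 0) { print "NA"; exit }           # avoid division by zero
    printf "%.2f\n", 100 * (raw - kept) / raw
  }'
}

# Hypothetical sample: 1,000,000 pairs before filtering, 250,000 retained.
host_fraction 1000000 250000   # prints 75.00
```

Recording this value per sample makes unusually high or low host fractions visible alongside the other QC metrics.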

9. 16S, ITS and amplicon analysis

Amplicon sequencing targets marker genes or regions. The analysis usually produces ASV or OTU tables, taxonomy assignments and diversity metrics.

  • Primer removal: remove primer sequences and filter low-quality reads.
  • ASV inference: denoise reads to high-resolution sequence variants.
  • Taxonomy: assign ASVs against a marker-gene reference database.

Step | Common tools | Main outputs
Denoising | DADA2, Deblur, QIIME 2 workflows | ASV table and representative sequences.
Taxonomy | QIIME 2 classifiers, DADA2 classifiers | Taxonomic assignments for ASVs.
Diversity | QIIME 2, vegan, phyloseq | Alpha and beta diversity metrics.
Visualization | QIIME 2, R, Python | Bar plots, ordination, heatmaps and differential-abundance plots.

10. Shotgun metagenomics

Shotgun metagenomics sequences all DNA in the community. It supports taxonomic profiling, functional profiling, strain analysis, assembly, MAG recovery and pathogen or AMR screening.

  • Read-based profiling: classify reads directly against taxonomic or functional databases.
  • Marker-gene profiling: use clade-specific marker genes for taxonomic abundance estimates.
  • Assembly-based analysis: assemble contigs, predict genes and bin genomes.
  • Hybrid strategy: combine read-based speed with assembly-based genome recovery.

11. Taxonomic profiling

Taxonomic profiling estimates which organisms are present and their relative abundance. Different tools use different database structures and classification strategies.

Approach | Common tools | Strengths
K-mer classification | Kraken2, Bracken, Centrifuge | Fast classification of large datasets; database choice is critical.
Marker-gene profiling | MetaPhlAn | Often robust for species-level profiling using clade-specific markers.
Alignment-based profiling | DIAMOND/MEGAN-style workflows | Useful for protein-level or broader functional/taxonomic interpretation.
Custom pathogen detection | Kraken2, minimap2, BLAST-style confirmation | Requires strict controls, thresholds and confirmatory logic.
Kraken2 and Bracken example
mkdir -p results/taxonomy/kraken2 results/taxonomy/bracken

kraken2 \
  --db databases/kraken2_standard \
  --paired results/host_filtered/S1_R1.fastq.gz results/host_filtered/S1_R2.fastq.gz \
  --threads 16 \
  --report results/taxonomy/kraken2/S1.report \
  --output results/taxonomy/kraken2/S1.kraken

bracken \
  -d databases/kraken2_standard \
  -i results/taxonomy/kraken2/S1.report \
  -o results/taxonomy/bracken/S1.species.tsv \
  -r 150 \
  -l S
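
A Kraken2 report is a tab-separated table whose columns are: percentage of fragments in the clade, fragments in the clade, fragments assigned directly, rank code, NCBI taxonomy ID and an indented name. The sketch below pulls species-level rows above a minimum read count. The helper name species_from_report, the threshold of 10 reads and the abridged demo report are assumptions for illustration only.

```shell
# Sketch: list species-level rows of a Kraken2 report above a read threshold.
# Usage: species_from_report REPORT MIN_CLADE_READS
species_from_report() {
  awk -F'\t' -v min="$2" '
    $4 == "S" && $2 + 0 >= min + 0 {
      name = $6
      sub(/^ +/, "", name)      # names are indented by taxonomic depth
      printf "%s\t%d\t%s\n", name, $2, $5
    }
  ' "$1"
}

# Abridged, hypothetical report used only for this demo.
report=$(mktemp)
{
  printf ' 50.00\t500\t500\tU\t0\tunclassified\n'
  printf ' 50.00\t500\t0\tR\t1\troot\n'
  printf ' 45.00\t450\t50\tG\t816\t    Bacteroides\n'
  printf ' 40.00\t400\t400\tS\t817\t      Bacteroides fragilis\n'
  printf '  0.50\t5\t5\tS\t562\t      Escherichia coli\n'
} > "$report"

# Prints name, clade reads and taxonomy ID for species with at least 10 reads.
species_from_report "$report" 10
```

A read-count floor like this is a crude filter. Appropriate thresholds depend on sequencing depth, database and what the negative controls show.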

12. Functional profiling

Functional profiling estimates genes, pathways, enzymes or biological functions encoded by the community. It can be read-based, assembly-based or gene-catalog-based.

Goal | Common outputs | Interpretation
Pathway profiling | Pathway abundance and coverage tables. | Useful for comparing metabolic potential across samples.
Gene family profiling | Ortholog, enzyme or protein-family abundance. | Requires careful normalization and database version tracking.
AMR profiling | Resistance gene tables and class summaries. | Requires thresholds, coverage checks and contamination-aware interpretation.
Virulence profiling | Virulence factor hits or pathway summaries. | Presence of genes does not always imply pathogenicity or expression.
Functional profiles describe genetic potential. They do not automatically prove activity unless supported by metatranscriptomics, proteomics, metabolomics or functional validation.

13. Alpha and beta diversity

Diversity metrics summarize within-sample richness and between-sample community differences. They are commonly used in 16S and shotgun metagenomics.

Metric type | Examples | What it describes
Alpha diversity | Observed taxa, Shannon, Simpson, Faith PD | Diversity within each sample.
Beta diversity | Bray-Curtis, Jaccard, UniFrac, Aitchison | Differences between samples.
Ordination | PCoA, PCA, NMDS, UMAP | Visualizes sample relationships based on community profiles.
Group testing | PERMANOVA-style tests | Tests whether group labels explain community distances.
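Microbiome data are compositional: sequencing reports proportions, not absolute counts. A taxon's relative abundance can therefore change even when its absolute abundance stays constant, and differences in total microbial load are not captured by sequencing proportions alone.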
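
To make the metrics concrete, the Shannon index and Bray-Curtis dissimilarity can be computed by hand from small count vectors. This is a minimal sketch for intuition only. The helper names shannon and bray_curtis and the example counts are assumptions, and real analyses would normally use an established package such as vegan, phyloseq or scikit-bio.

```shell
# Sketch: Shannon index H = -sum(p_i * ln p_i) over non-zero counts.
# Usage: shannon COUNT [COUNT ...]
shannon() {
  printf '%s\n' "$@" | awk '
    $1 > 0 { n[++k] = $1; total += $1 }
    END {
      for (i = 1; i <= k; i++) { p = n[i] / total; h -= p * log(p) }
      printf "%.4f\n", h
    }'
}

# Sketch: Bray-Curtis dissimilarity = 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)).
# Usage: bray_curtis "a1 a2 ..." "b1 b2 ..."  (same taxa, same order)
bray_curtis() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); m = split(b, y, " ")
    if (n != m) { print "NA"; exit }
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0            # force numeric comparison
      shared += (xi < yi) ? xi : yi
      total += xi + yi
    }
    if (total == 0) { print "NA"; exit }
    printf "%.4f\n", 1 - 2 * shared / total
  }'
}

shannon 10 10 10 10             # four equally abundant taxa: ln(4), prints 1.3863
bray_curtis "10 0 5" "5 5 5"    # prints 0.3333
```

Bray-Curtis on raw counts is sensitive to sequencing depth, so it is usually computed on relative abundances or on depth-normalized counts.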

14. Metagenome assembly

Assembly combines shotgun reads into contigs. It can reveal genes, operons, plasmids, phages and partial or complete microbial genomes.

MEGAHIT assembly example
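# MEGAHIT creates the output directory itself and stops if it already exists,
# so only the parent directory is created here.
mkdir -p results/assembly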

megahit \
  -1 results/host_filtered/S1_R1.fastq.gz \
  -2 results/host_filtered/S1_R2.fastq.gz \
  -o results/assembly/S1 \
  --num-cpu-threads 16
  • Single-sample assembly: useful when samples are very different or high depth.
  • Co-assembly: combines related samples to improve contig recovery.
  • Assembly QC: check total length, N50, contig count, coverage and contamination.
  • Gene prediction: predict coding sequences before functional annotation.
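
N50, listed under assembly QC above, is the contig length at which the cumulative length of contigs, sorted longest first, reaches half of the total assembly length. A minimal sketch, assuming a plain FASTA of contigs; contig_lengths and n50 are hypothetical helper names and the four-contig FASTA is a toy example.

```shell
# Sketch: print one length per contig from a FASTA file (wrapped lines are summed).
contig_lengths() {
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1"
}

# Sketch: read contig lengths on stdin and print the N50.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= total / 2) { print len[i]; exit }
      }
    }'
}

# Toy assembly with contig lengths 4, 3, 2 and 1 (total 10).
fasta=$(mktemp)
printf '>c1\nAA\nAA\n>c2\nCCC\n>c3\nGG\n>c4\nT\n' > "$fasta"
contig_lengths "$fasta" | n50   # prints 3
```

For a MEGAHIT run, the same pipeline can be pointed at the final contig FASTA in the output directory. N50 alone says nothing about misassemblies, so it should be read together with coverage and contamination checks.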

15. Metagenome-assembled genomes

MAG analysis bins contigs into draft genomes based on coverage, composition and co-abundance patterns. MAGs can support taxonomy, metabolic reconstruction and genome-resolved ecology.

Step | Purpose | Common tools
Contig binning | Group contigs into draft genomes. | MetaBAT2, MaxBin2, CONCOCT, VAMB-style tools.
Bin refinement | Combine and improve bins. | DAS Tool, manual curation and refinement tools.
Quality assessment | Estimate completeness and contamination. | CheckM, CheckM2-style tools.
Taxonomic classification | Assign MAG taxonomy. | GTDB-Tk-style workflows.
Functional annotation | Reconstruct metabolic potential. | Prokka, DRAM-style workflows, eggNOG-style annotation.
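
Completeness and contamination estimates are often turned into quality tiers. The sketch below applies the commonly cited MIMAG completeness and contamination cut-offs (high: over 90% complete and under 5% contaminated; medium: at least 50% complete and under 10% contaminated). The helper name mag_tier and the three-column input table are assumptions; this is not the native output format of CheckM or CheckM2, and the full MIMAG high-quality definition also requires rRNA and tRNA genes, which this sketch ignores.

```shell
# Sketch: assign a quality tier to each bin.
# Input: tab-separated table with a header and columns bin_id, completeness, contamination.
mag_tier() {
  awk -F'\t' 'NR > 1 {
    tier = "low"
    if ($2 > 90 && $3 < 5)        tier = "high"
    else if ($2 >= 50 && $3 < 10) tier = "medium"
    printf "%s\t%s\n", $1, tier
  }' "$1"
}

# Hypothetical bins used only for this demo.
bins=$(mktemp)
{
  printf 'bin_id\tcompleteness\tcontamination\n'
  printf 'bin.1\t95.2\t1.3\n'
  printf 'bin.2\t72.0\t4.0\n'
  printf 'bin.3\t40.0\t2.0\n'
  printf 'bin.4\t96.0\t12.0\n'
} > "$bins"
mag_tier "$bins"
```

In this demo bin.1 is high, bin.2 is medium, and bin.3 and bin.4 are low: one is too incomplete and the other too contaminated.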

16. Antimicrobial resistance and virulence analysis

Metagenomics can screen for antimicrobial resistance genes, virulence markers and mobile genetic elements. Interpretation requires caution because gene detection does not necessarily prove expression, viability or clinical resistance.

  • AMR genes: screen reads or contigs against curated resistance databases.
  • Virulence factors: detect genes associated with pathogenic potential.
  • Mobile elements: plasmids, phages and integrons can influence gene transfer.
  • Thresholds: use explicit identity, coverage and abundance thresholds.

17. Contamination and low-biomass interpretation

Contamination can come from reagents, kits, laboratory environment, barcode leakage, sample carryover or database misclassification. It is especially important in low-biomass samples.

Signal | Possible cause | What to check
Taxa present in blanks | Reagent or laboratory background. | Compare blanks, samples and extraction batches.
Unexpected high-abundance organism | Contamination, sample swap or true biology. | Review controls, metadata and confirm with targeted methods if needed.
Low total microbial reads | Low biomass or high host DNA. | Check host fraction, extraction yield and controls.
Implausible taxonomy | Database or classifier artefact. | Confirm with alternative methods or alignment to reference sequences.
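
The first row of the table, taxa present in blanks, can be screened with a simple comparison of relative abundance in a sample and in its extraction blank. The sketch below is a rough heuristic under stated assumptions, not a replacement for dedicated statistical decontamination methods: it flags a taxon when it is present in the blank and its relative abundance there is at least as high as in the sample. The helper name flag_blank_taxa, the input layout and the counts are hypothetical.

```shell
# Sketch: flag taxa whose relative abundance in the blank is >= that in the sample.
# Input: tab-separated table with a header and columns taxon, sample_reads, blank_reads.
flag_blank_taxa() {
  awk -F'\t' '
    NR == FNR {                       # first pass: column totals
      if (FNR > 1) { s_total += $2; b_total += $3 }
      next
    }
    FNR > 1 {                         # second pass: compare proportions
      s = (s_total > 0) ? $2 / s_total : 0
      b = (b_total > 0) ? $3 / b_total : 0
      if ($3 > 0 && b >= s) print $1
    }
  ' "$1" "$1"
}

# Hypothetical counts for one gut sample and one extraction blank.
counts=$(mktemp)
{
  printf 'taxon\tsample_reads\tblank_reads\n'
  printf 'Bacteroides\t9000\t2\n'
  printf 'Ralstonia\t50\t60\n'
  printf 'Faecalibacterium\t950\t0\n'
} > "$counts"
flag_blank_taxa "$counts"   # prints Ralstonia
```

Flagged taxa are candidates for review, not automatic removal. A few Bacteroides reads in the blank, as in this demo, are more consistent with carryover from the sample than with reagent background.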

18. Statistical analysis and differential abundance

Metagenomics data are sparse, compositional and often strongly affected by covariates. Statistical testing should match the data type and study design.

  • Alpha diversity tests: compare within-sample diversity between groups.
  • Beta diversity tests: test whether community composition differs by group or covariate.
  • Differential abundance: identify taxa, genes or pathways associated with conditions.
  • Covariate models: account for batch, host factors, environment or repeated measures.
Avoid interpreting relative abundance changes as absolute microbial load changes unless absolute quantification, spike-ins, qPCR or biomass measurements support that conclusion.
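
One common way to respect compositionality is to work with log-ratios, for example the centered log-ratio (CLR) transform that underlies the Aitchison distance. The sketch below computes CLR values for a single sample. The helper name clr, the choice to add the same pseudocount to every count, and the example counts are assumptions; how zeros are handled is a modelling decision that should be reported.

```shell
# Sketch: centered log-ratio transform of one sample.
# Usage: clr PSEUDOCOUNT COUNT [COUNT ...]
# Each value is ln(count + pseudocount) minus the mean of those logs.
clr() {
  pc=$1; shift
  printf '%s\n' "$@" | awk -v pc="$pc" '
    { v[NR] = log($1 + pc); sum += v[NR] }
    END {
      mean = sum / NR
      for (i = 1; i <= NR; i++) printf "%.4f\n", v[i] - mean
    }'
}

# Hypothetical counts for four taxa; the pseudocount of 0.5 keeps the zero finite.
clr 0.5 0 12 480 8
```

The transformed values of a sample sum to zero up to rounding, so each value expresses a taxon's abundance relative to the geometric mean of that sample, not an absolute load.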

19. Visualization

Good metagenomics visualization should show both community structure and data quality. Avoid overcrowded stacked bar plots with too many rare taxa.

Plot | Purpose | What to inspect
Stacked bar plot | Taxonomic composition. | Dominant taxa and sample-level outliers.
Alpha diversity boxplot | Within-sample diversity. | Group differences and outliers.
PCoA/NMDS | Between-sample distances. | Group separation, batch effects and control clustering.
Heatmap | Selected taxa, genes or pathways. | Patterns across groups and controls.
Volcano-style plot | Differential abundance results. | Effect size and significance together.
Contig/bin plots | Assembly and MAG QC. | Coverage, GC content, bin quality and contamination.

20. Example shotgun metagenomics workflow

The following simplified workflow illustrates a common shotgun metagenomics route. Real projects should adapt parameters to environment, host, database, controls and research question.

Minimal shotgun workflow
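# 0. Create output directories (MEGAHIT creates results/assembly/S1 itself)
mkdir -p results/qc/fastqc results/qc/fastp results/trimmed \
  results/host_filtered results/taxonomy results/assembly

# 1. Raw-read QC
fastqc data/fastq/*.fastq.gz --outdir results/qc/fastqc --threads 8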

# 2. Trim reads
fastp \
  --in1 data/fastq/S1_R1.fastq.gz \
  --in2 data/fastq/S1_R2.fastq.gz \
  --out1 results/trimmed/S1_R1.fastq.gz \
  --out2 results/trimmed/S1_R2.fastq.gz \
  --html results/qc/fastp/S1.html \
  --json results/qc/fastp/S1.json \
  --thread 8

# 3. Remove host reads
bowtie2 -x reference/host/bowtie2/host_genome \
  -1 results/trimmed/S1_R1.fastq.gz \
  -2 results/trimmed/S1_R2.fastq.gz \
  --very-sensitive -p 16 \
  --un-conc-gz results/host_filtered/S1_microbiome.%.fastq.gz \
  -S /dev/null

# 4. Taxonomic profiling
kraken2 --db databases/kraken2_standard \
  --paired results/host_filtered/S1_microbiome.1.fastq.gz results/host_filtered/S1_microbiome.2.fastq.gz \
  --threads 16 \
  --report results/taxonomy/S1.kraken2.report \
  --output results/taxonomy/S1.kraken2.output

# 5. Optional assembly
megahit \
  -1 results/host_filtered/S1_microbiome.1.fastq.gz \
  -2 results/host_filtered/S1_microbiome.2.fastq.gz \
  -o results/assembly/S1 \
  --num-cpu-threads 16

# 6. Summarize QC
multiqc results --outdir results/qc/multiqc_final
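
The workflow above hard-codes sample S1. One way to scale it is to generate per-sample commands from the sample sheet. The sketch below is a dry run that only prints fastp commands. The helper name emit_fastp_commands is hypothetical, the column positions follow the example sample sheet, and FASTQ files are assumed to live in data/fastq.

```shell
# Sketch: print one fastp command per row of a tab-separated sample sheet.
# Columns used: 1 = sample_id, 6 = fastq_1, 7 = fastq_2.
emit_fastp_commands() {
  awk -F'\t' 'NR > 1 {
    printf "fastp --in1 data/fastq/%s --in2 data/fastq/%s", $6, $7
    printf " --out1 results/trimmed/%s_R1.fastq.gz --out2 results/trimmed/%s_R2.fastq.gz", $1, $1
    printf " --html results/qc/fastp/%s.html --json results/qc/fastp/%s.json --thread 8\n", $1, $1
  }' "$1"
}

# Demo with a two-sample sheet.
sheet=$(mktemp)
{
  printf 'sample_id\tgroup\tbody_site\tbatch\tcontrol_type\tfastq_1\tfastq_2\n'
  printf 'S1\tcontrol\tgut\tA\tnone\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n'
  printf 'S2\tcontrol\tgut\tA\tnone\tS2_R1.fastq.gz\tS2_R2.fastq.gz\n'
} > "$sheet"
emit_fastp_commands "$sheet"
```

Printing the commands first makes them easy to review and to keep with the project records. Larger projects typically move this logic into a workflow manager so that every step is tracked per sample.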

21. Deliverables and reporting

  • FASTQ QC, trimming and host-filtering reports.
  • Final MultiQC report with sample-level metrics and outliers.
  • Host-read fraction and microbial-read retention summary.
  • Taxonomic abundance tables at phylum, genus, species or strain level where appropriate.
  • Functional abundance tables for genes, pathways, enzymes, AMR or virulence markers when requested.
  • Alpha and beta diversity tables and plots.
  • Control-sample contamination assessment.
  • Assembly statistics, contigs and MAG reports when assembly is performed.
  • Statistical analysis tables with effect sizes, adjusted p-values and model notes.
  • Methods section with database names, versions, parameters and limitations.

22. Metagenomics analysis cheat sheet

Step | Common tools | Main outputs
FASTQ QC | FastQC, MultiQC, fastp | Raw-read QC and project summary.
Trimming | fastp, Cutadapt, Trimmomatic | Trimmed FASTQ files and preprocessing reports.
Host filtering | Bowtie2, BWA, minimap2, KneadData-style workflows | Host-depleted microbial reads.
Amplicon denoising | DADA2, Deblur, QIIME 2 | ASV table and representative sequences.
Taxonomy | Kraken2, Bracken, MetaPhlAn, Centrifuge | Taxonomic abundance profiles.
Function | HUMAnN-style workflows, DIAMOND, eggNOG-style annotation | Gene family and pathway abundance tables.
Diversity | QIIME 2, phyloseq, vegan, scikit-bio | Alpha diversity, beta diversity and ordination plots.
Assembly | MEGAHIT, metaSPAdes | Contigs and assembly statistics.
MAGs | MetaBAT2, MaxBin2, CONCOCT, CheckM, GTDB-Tk | Binned genomes, QC and taxonomy.
Reporting | MultiQC, R/Python, workflow reports, AI-assisted summaries | QC, taxonomy, function, statistics and interpretation.

Frequently asked questions

What is metagenomics?

Metagenomics is the study of genetic material recovered directly from mixed biological or environmental communities, such as gut microbiome, soil, water, skin, oral, respiratory or industrial samples.

What is the difference between 16S sequencing and shotgun metagenomics?

16S sequencing profiles bacteria and archaea using marker-gene amplicons, while shotgun metagenomics sequences all DNA in a sample and can support taxonomic, functional, strain-level and genome-resolved analyses.

What are the main analysis goals in metagenomics?

Common goals include taxonomic profiling, diversity analysis, functional profiling, pathogen detection, antimicrobial-resistance screening, genome assembly, metagenome-assembled genomes and comparison between groups.

Which files are needed for metagenomics analysis?

Typical inputs include FASTQ files, sample metadata, negative and positive controls, reference databases, host reference genomes for depletion, and information about library preparation, read length and sequencing platform.

Why are controls important in metagenomics?

Metagenomics is sensitive to contamination, especially in low-biomass samples. Extraction blanks, PCR blanks, mock communities and positive controls help identify background taxa, reagent contaminants and technical bias.

Should host reads be removed?

For human or animal-associated samples, host read removal is commonly performed for privacy, storage and downstream focus. The host-filtering strategy should be documented and evaluated carefully to avoid unintended bias.

What is the difference between taxonomic and functional profiling?

Taxonomic profiling estimates which organisms are present and their relative abundance. Functional profiling estimates genes, pathways, enzymes or resistance markers that may be present in the community.

Can AI help with metagenomics?

AI can help summarize QC, flag contaminants, compare community profiles, prioritize organisms or pathways, draft reports and integrate metagenomics with metadata, while the primary analysis should remain reproducible, transparent and database-versioned.