Metagenomics analyzes DNA recovered directly from mixed microbial communities without culturing individual organisms. It is used in human microbiome studies, environmental microbiology, pathogen surveillance, biotechnology, agriculture, wastewater monitoring, food safety and industrial process monitoring.
Profile organisms: Estimate bacterial, archaeal, viral, fungal or parasite composition.
Profile function: Identify genes, pathways, enzymes, resistance markers or virulence factors.
Compare communities: Use diversity, abundance and statistical models to interpret differences.
Core principle: metagenomics is highly sensitive to sampling, extraction, controls, database choice and compositional statistics. Good metadata and contamination control are as important as the computational pipeline.
2. Main types of metagenomics studies
Study type | Typical question | Analysis focus
16S rRNA amplicon sequencing | Which bacteria and archaea are present? | ASVs/OTUs, taxonomy, alpha diversity, beta diversity and group comparisons.
ITS amplicon sequencing | Which fungi are present? | Fungal community composition and diversity.
Shotgun metagenomics | Which organisms and genes are present? | Taxonomy, functional profiling, strain-level analysis and assembly.
Viromics | Which viruses are present? | Viral detection, contig assembly, host prediction and low-biomass contamination control.
Metatranscriptomics | Which microbial genes are expressed? | RNA-based functional activity with additional rRNA depletion and expression-analysis concerns.
Genome-resolved metagenomics | Can genomes be reconstructed from mixed communities? | Assembly, binning, MAG quality, taxonomy and metabolic reconstruction.
3. Experimental design
Metagenomics studies should be designed around sampling, controls, sequencing depth and expected biomass. Bias can enter at every stage: sample collection and storage, DNA extraction, library preparation, database classification and statistical analysis.
Questions to answer early
Is the project 16S/ITS amplicon, shotgun metagenomics, viromics or genome-resolved metagenomics?
What environment is being sampled: gut, soil, water, skin, oral, respiratory, wastewater, food, bioreactor or another setting?
Are samples high-biomass or low-biomass?
Which controls are included: extraction blanks, PCR blanks, mock communities and positive controls?
Which confounders must be recorded: batch, extraction kit, storage, diet, medication, age, geography, time point or sequencing run?
Is the primary endpoint taxonomy, function, diversity, pathogen detection, AMR profiling or MAG reconstruction?
What database versions and taxonomic naming conventions will be used?
Low-biomass microbiome samples are especially vulnerable to reagent and laboratory contamination. Controls should be planned from the start, not added only during analysis.
4. Metadata and controls
Metagenomics interpretation depends heavily on metadata. Technical variables can strongly influence observed community composition.
Example metagenomics sample sheet
sample_id group body_site batch control_type fastq_1 fastq_2
S1 control gut A none S1_R1.fastq.gz S1_R2.fastq.gz
S2 control gut B none S2_R1.fastq.gz S2_R2.fastq.gz
S3 treated gut A none S3_R1.fastq.gz S3_R2.fastq.gz
S4 treated gut B none S4_R1.fastq.gz S4_R2.fastq.gz
Blank1 control_blank NA A extraction_blank Blank1_R1.fastq.gz Blank1_R2.fastq.gz
Mock1 mock NA A mock_community Mock1_R1.fastq.gz Mock1_R2.fastq.gz
Extraction blanks: Detect reagent and laboratory background contamination.
PCR blanks: Detect amplification and library-preparation contamination.
Mock communities: Evaluate classification accuracy and pipeline bias.
Positive controls: Confirm that expected organisms can be detected.
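A sample sheet can be checked automatically before any sequencing data are processed. The sketch below is one possible check, not a standard tool: the function name `check_sample_sheet`, the list of required controls and the inline sheet are assumptions made for illustration. The inline sheet uses the same columns as the example sheet and deliberately places each group in a single batch so that the confounding warning fires.

```python
import csv
import io
from collections import defaultdict

# Hypothetical inline sheet; a real project would read the TSV from disk.
SHEET = (
    "sample_id\tgroup\tbody_site\tbatch\tcontrol_type\n"
    "S1\tcontrol\tgut\tA\tnone\n"
    "S2\tcontrol\tgut\tA\tnone\n"
    "S3\ttreated\tgut\tB\tnone\n"
    "S4\ttreated\tgut\tB\tnone\n"
    "Blank1\tcontrol_blank\tNA\tA\textraction_blank\n"
    "Mock1\tmock\tNA\tA\tmock_community\n"
)


def check_sample_sheet(text, required_controls=("extraction_blank", "mock_community")):
    """Return a list of human-readable warnings about the design."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    warnings = []

    # 1. Are the required control types present at all?
    present = {row["control_type"] for row in rows}
    for control in required_controls:
        if control not in present:
            warnings.append(f"missing control: {control}")

    # 2. Do biological groups share batches? If two groups never share a
    #    batch, group and batch effects cannot be separated.
    batches_by_group = defaultdict(set)
    for row in rows:
        if row["control_type"] == "none":
            batches_by_group[row["group"]].add(row["batch"])
    groups = sorted(batches_by_group)
    for i, first in enumerate(groups):
        for second in groups[i + 1:]:
            if not batches_by_group[first] & batches_by_group[second]:
                warnings.append(f"groups {first} and {second} share no batch")

    # 3. Does every batch with biological samples have an extraction blank?
    sample_batches = set().union(*batches_by_group.values())
    blank_batches = {r["batch"] for r in rows if r["control_type"] == "extraction_blank"}
    for batch in sorted(sample_batches - blank_batches):
        warnings.append(f"batch {batch} has no extraction blank")

    return warnings


for warning in check_sample_sheet(SHEET):
    print(warning)
```

Checks like these are cheap to run at the design stage, when a missing blank or a confounded batch can still be fixed.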
5. Input files and reference resources
Input | Typical format | Use
Raw reads | FASTQ.GZ | Sequencing reads from amplicon or shotgun libraries.
Sample metadata | TSV/CSV | Groups, batches, sites, controls and covariates.
Host reference | FASTA/index files | Removal of human, animal, plant or other host reads.
Taxonomic database | Tool-specific index | Species, genus or higher-level classification.
Functional database | Tool-specific database | Genes, pathways, enzymes, AMR or virulence markers.
Primer information | Sequences or metadata | Required for amplicon primer trimming and expected region.
6. FASTQ quality control
FASTQ QC checks sequencing quality before downstream profiling. In metagenomics, read quality, adapter contamination, sample read depth and control samples are all important.
Check read quality, adapter content and per-sample read counts.
Compare control samples against true biological samples.
Review GC-content distribution, which may be multimodal in mixed communities.
Check whether paired-end reads are suitable for merging in amplicon workflows.
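Per-sample read counts and a rough quality summary can be computed with the standard library alone. The following is a minimal sketch, assuming four-line FASTQ records and Phred+33 quality encoding; the function names are illustrative, and a dedicated QC tool reports far more than this.

```python
import gzip
import io


def fastq_stats(handle, phred_offset=33):
    """Count reads and bases and average the per-base Phred scores.

    Assumes strict four-line FASTQ records (no wrapped sequences).
    """
    n_reads = 0
    n_bases = 0
    quality_sum = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        n_reads += 1
        n_bases += len(sequence)
        quality_sum += sum(ord(char) - phred_offset for char in quality)
    mean_quality = quality_sum / n_bases if n_bases else 0.0
    return {"reads": n_reads, "bases": n_bases, "mean_quality": mean_quality}


def fastq_stats_from_path(path):
    """Open a plain or gzip-compressed FASTQ file and summarize it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return fastq_stats(handle)


# Tiny in-memory example: one high-quality read and one very poor read.
EXAMPLE = "@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n"
stats = fastq_stats(io.StringIO(EXAMPLE))
print(stats["reads"], stats["bases"], round(stats["mean_quality"], 2))
```

Running the same summary on blanks and biological samples side by side makes it easy to spot blanks with unexpectedly high read counts.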
7. Adapter, primer and quality trimming
Trimming strategies differ for amplicon and shotgun workflows. Amplicon analysis often requires primer removal, while shotgun analysis usually focuses on adapters and low-quality tails.
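Primer removal for amplicon reads amounts to matching a possibly degenerate primer at the start of each read and cutting it off. The sketch below illustrates the idea only: the function name `trim_primer`, the mismatch limit and the short primer in the usage lines are hypothetical, and dedicated trimming tools also handle indels, reverse complements and paired reads.

```python
# IUPAC nucleotide codes mapped to the bases each code allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def trim_primer(read, primer, max_mismatches=1):
    """Remove a 5' primer from a read.

    Returns the read without the primer, or None when the read does not
    start with the primer (within the allowed number of mismatches).
    """
    if len(read) < len(primer):
        return None
    mismatches = sum(
        1 for base, code in zip(read, primer) if base not in IUPAC[code]
    )
    if mismatches > max_mismatches:
        return None
    return read[len(primer):]


# Hypothetical degenerate primer: Y matches either C or T.
PRIMER = "GTGYCAGC"
print(trim_primer("GTGCCAGCAAAA", PRIMER))  # primer found and removed
print(trim_primer("AAAAAAAAAAAA", PRIMER))  # no primer, read is discarded
```

Discarding reads that lack the expected primer is a common choice in amplicon workflows, because such reads are often off-target products.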
8. Host read removal
Human, animal or plant-associated metagenomics often includes host DNA. Host filtering reduces privacy risk, storage burden and false microbial assignments.
Document the host reference, aligner settings and the fraction of reads removed. Host filtering can affect downstream profiles if microbial reads are accidentally removed or host reads remain.
9. 16S, ITS and amplicon analysis
Amplicon sequencing targets marker genes or regions. The analysis usually produces ASV or OTU tables, taxonomy assignments and diversity metrics.
Primer removal: Remove primer sequences and filter low-quality reads.
ASV inference: Denoise reads to high-resolution sequence variants.
Taxonomy: Assign ASVs against a marker-gene reference database.
Step | Common tools | Main outputs
Denoising | DADA2, Deblur, QIIME 2 workflows | ASV table and representative sequences.
Taxonomy | QIIME 2 classifiers, DADA2 classifiers | Taxonomic assignments for ASVs.
Diversity | QIIME 2, vegan, phyloseq | Alpha and beta diversity metrics.
Visualization | QIIME 2, R, Python | Bar plots, ordination, heatmaps and differential-abundance plots.
10. Shotgun metagenomics
Shotgun metagenomics sequences all DNA in the community. It supports taxonomic profiling, functional profiling, strain analysis, assembly, MAG recovery and pathogen or AMR screening.
Read-based profiling: Classify reads directly against taxonomic or functional databases.
Marker-gene profiling: Use clade-specific marker genes for taxonomic abundance estimates.
Assembly-based analysis: Assemble contigs, predict genes and bin genomes.
Hybrid strategy: Combine read-based speed with assembly-based genome recovery.
11. Taxonomic profiling
Taxonomic profiling estimates which organisms are present and their relative abundance. Different tools use different database structures and classification strategies.
Approach | Common tools | Strengths
K-mer classification | Kraken2, Bracken, Centrifuge | Fast classification of large datasets; database choice is critical.
Marker-gene profiling | MetaPhlAn | Often robust for species-level profiling using clade-specific markers.
Alignment-based profiling | DIAMOND/MEGAN-style workflows | Useful for protein-level or broader functional/taxonomic interpretation.
Custom pathogen detection | Kraken2, minimap2, BLAST-style confirmation | Requires strict controls, thresholds and confirmatory logic.
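Classifier output usually needs filtering before interpretation. The sketch below parses text in the standard six-column Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name) and keeps species above a read threshold. The function name, the threshold of 10 reads and the read counts in the example report are assumptions made up for illustration.

```python
def parse_kraken_report(text, rank="S", min_reads=10):
    """Extract taxa of one rank from a Kraken2-style report.

    Each line holds six tab-separated fields: percent of reads, reads in
    the clade, reads assigned directly, rank code, taxonomy ID and name.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, clade_reads, _direct, rank_code, taxid, name = fields[:6]
        if rank_code == rank and int(clade_reads) >= min_reads:
            hits.append({
                "name": name.strip(),  # names are indented by tree depth
                "taxid": int(taxid),
                "reads": int(clade_reads),
                "percent": float(percent),
            })
    return sorted(hits, key=lambda hit: hit["reads"], reverse=True)


# Made-up report for illustration; the counts are not real data.
REPORT = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t0\tR\t1\troot",
    " 60.00\t600\t600\tS\t817\t        Bacteroides fragilis",
    " 29.50\t295\t295\tS\t562\t        Escherichia coli",
    "  0.50\t5\t5\tS\t1280\t        Staphylococcus aureus",
])

for hit in parse_kraken_report(REPORT):
    print(hit["name"], hit["reads"])
```

A fixed read threshold is a blunt filter. For pathogen detection in particular, low-count hits should be compared against blanks and confirmed by alignment rather than simply dropped or trusted.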
12. Functional profiling
Functional profiling estimates genes, pathways, enzymes or biological functions encoded by the community. It can be read-based, assembly-based or gene-catalog-based.
Goal | Common outputs | Interpretation
Pathway profiling | Pathway abundance and coverage tables. | Useful for comparing metabolic potential across samples.
Gene family profiling | Ortholog, enzyme or protein-family abundance. | Requires careful normalization and database version tracking.
AMR profiling | Resistance gene tables and class summaries. | Requires thresholds, coverage checks and contamination-aware interpretation.
Virulence profiling | Virulence factor hits or pathway summaries. | Presence of genes does not always imply pathogenicity or expression.
Functional profiles describe genetic potential. They do not automatically prove activity unless supported by metatranscriptomics, proteomics, metabolomics or functional validation.
13. Alpha and beta diversity
Diversity metrics summarize within-sample richness and between-sample community differences. They are commonly used in both 16S amplicon and shotgun studies.
Metric type | Examples | What it describes
Alpha diversity | Observed taxa, Shannon, Simpson, Faith PD | Diversity within each sample.
Beta diversity | Bray-Curtis, Jaccard, UniFrac, Aitchison | Differences between samples.
Ordination | PCoA, PCA, NMDS, UMAP | Visualizes sample relationships based on community profiles.
Group testing | PERMANOVA-style tests | Tests whether group labels explain community distances.
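Microbiome data are compositional: sequencing reports proportions, not absolute counts. A taxon's relative abundance can rise or fall simply because other taxa changed, and differences in total microbial load are invisible in sequencing proportions alone.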
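The most common metrics are short formulas over a vector of taxon counts. The sketch below implements Shannon entropy (natural log), the Gini-Simpson index and Bray-Curtis dissimilarity in plain Python; the function names are illustrative. Packages such as vegan, phyloseq or QIIME 2 provide these metrics together with rarefaction and phylogenetic variants.

```python
import math


def shannon(counts):
    """Shannon diversity H' = -sum(p * ln p) over taxa with p > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2): the chance two reads differ in taxon."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def bray_curtis(first, second):
    """Bray-Curtis dissimilarity between two samples with the same taxon order."""
    numerator = sum(abs(x - y) for x, y in zip(first, second))
    denominator = sum(first) + sum(second)
    return numerator / denominator if denominator else 0.0


even = [10, 10, 10, 10]   # four equally abundant taxa
skewed = [37, 1, 1, 1]    # one dominant taxon

print(round(shannon(even), 3), round(shannon(skewed), 3))
print(simpson(even))
print(bray_curtis([10, 0], [0, 10]))  # no shared taxa
```

Bray-Curtis on raw counts is sensitive to sequencing depth, so counts are usually converted to proportions or rarefied before distances are computed.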
14. Metagenome assembly
Assembly combines shotgun reads into contigs. It can reveal genes, operons, plasmids, phages and partial or complete microbial genomes.
Single-sample assembly: Useful when samples are very different or high depth.
Co-assembly: Combines related samples to improve contig recovery.
Assembly QC: Check total length, N50, contig count, coverage and contamination.
Gene prediction: Predict coding sequences before functional annotation.
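N50 is the contig length at which the longest contigs, taken in descending order, first cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function names and the contig lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def assembly_summary(lengths):
    """Collect the basic assembly QC numbers in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Illustrative contig lengths (total 54, so half is 27).
contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(assembly_summary(contigs))
```

N50 says nothing about correctness. A chimeric assembly can have a high N50, which is why coverage and contamination checks belong in the same QC step.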
15. Metagenome-assembled genomes
MAG analysis bins contigs into draft genomes based on coverage, composition and co-abundance patterns. MAGs can support taxonomy, metabolic reconstruction and genome-resolved ecology.
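MAG quality is usually summarized by estimated completeness and contamination. The sketch below sorts bins into tiers using the completeness and contamination cut-offs of the MIMAG reporting standard; this choice of thresholds is an assumption, since the text above does not name one. The full MIMAG high-quality tier additionally requires rRNA and tRNA genes, which this sketch does not check, and the bin values are made up.

```python
def mag_quality(completeness, contamination):
    """Classify a bin from completeness and contamination (both in percent).

    Thresholds follow the MIMAG tiers (an assumption of this sketch):
    high   : completeness > 90 and contamination < 5
    medium : completeness >= 50 and contamination < 10
    low    : everything else, including heavily contaminated bins
    """
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


# Made-up bins for illustration: (name, completeness %, contamination %).
bins = [
    ("bin.1", 96.2, 1.1),
    ("bin.2", 71.5, 6.0),
    ("bin.3", 42.0, 0.8),
    ("bin.4", 88.0, 15.3),
]

for name, completeness, contamination in bins:
    print(name, mag_quality(completeness, contamination))
```

Completeness and contamination are themselves estimates from marker-gene sets, so bins near a threshold deserve manual review rather than a hard cut.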
16. Antimicrobial resistance and virulence analysis
Metagenomics can screen for antimicrobial resistance genes, virulence markers and mobile genetic elements. Interpretation requires caution because gene detection does not necessarily prove expression, viability or clinical resistance.
AMR genes: Screen reads or contigs against curated resistance databases.
Virulence factors: Detect genes associated with pathogenic potential.
Mobile elements: Plasmids, phages and integrons can influence gene transfer.
Thresholds: Use explicit identity, coverage and abundance thresholds.
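Explicit thresholds are easiest to audit when they live in one place in the code. The following sketch filters a list of resistance-gene hits by identity, reference coverage and supporting reads. The field names, the default cut-offs (90% identity, 80% coverage, 5 reads) and the example hits are assumptions chosen for illustration, not recommended values.

```python
def filter_amr_hits(hits, min_identity=90.0, min_coverage=80.0, min_reads=5):
    """Keep hits that pass all three thresholds.

    identity : percent identity to the reference gene
    coverage : percent of the reference gene covered
    reads    : number of reads supporting the hit
    """
    return [
        hit for hit in hits
        if hit["identity"] >= min_identity
        and hit["coverage"] >= min_coverage
        and hit["reads"] >= min_reads
    ]


# Made-up hits for illustration only.
hits = [
    {"gene": "geneA", "identity": 99.1, "coverage": 100.0, "reads": 42},
    {"gene": "geneB", "identity": 99.5, "coverage": 35.0, "reads": 30},  # partial gene
    {"gene": "geneC", "identity": 82.0, "coverage": 95.0, "reads": 18},  # distant homolog
    {"gene": "geneD", "identity": 98.0, "coverage": 92.0, "reads": 2},   # weak support
]

passing = filter_amr_hits(hits)
print([hit["gene"] for hit in passing])
```

Recording the thresholds alongside the database version makes it possible to reproduce a resistance table later, and hits that pass should still be compared against blanks before they are reported.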
17. Contamination and low-biomass interpretation
Contamination can come from reagents, kits, the laboratory environment, barcode leakage, sample carryover or database misclassification. It matters most in low-biomass samples, where contaminant reads can make up a large share of the data.
Signal | Possible cause | What to check
Taxa present in blanks | Reagent or laboratory background. | Compare blanks, samples and extraction batches.
Unexpected high-abundance organism | Contamination, sample swap or true biology. | Review controls, metadata and confirm with targeted methods if needed.
Low total microbial reads | Low biomass or high host DNA. | Check host fraction, extraction yield and controls.
Implausible taxonomy | Database or classifier artefact. | Confirm with alternative methods or alignment to reference sequences.
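Comparing blanks with biological samples can be partly automated. The sketch below flags taxa whose mean relative abundance in samples is not clearly higher than in blanks. It is a deliberately crude heuristic: the function name `flag_contaminants`, the ratio of 10 and the counts are assumptions made for illustration, and dedicated statistical methods that use prevalence or DNA concentration are preferable for real studies.

```python
def mean_relative_abundance(profiles, taxon):
    """Average a taxon's within-profile proportion over a list of count dicts."""
    values = []
    for counts in profiles:
        total = sum(counts.values())
        values.append(counts.get(taxon, 0) / total if total else 0.0)
    return sum(values) / len(values) if values else 0.0


def flag_contaminants(samples, blanks, min_ratio=10.0):
    """Flag taxa seen in blanks that are not at least min_ratio times more
    abundant (relative abundance) in samples than in blanks."""
    taxa = set().union(*samples, *blanks)
    flagged = []
    for taxon in sorted(taxa):
        in_blanks = mean_relative_abundance(blanks, taxon)
        in_samples = mean_relative_abundance(samples, taxon)
        if in_blanks > 0 and in_samples < min_ratio * in_blanks:
            flagged.append(taxon)
    return flagged


# Made-up read counts per taxon; not real data.
samples = [
    {"Bacteroides": 900, "Ralstonia": 100},
    {"Bacteroides": 950, "Ralstonia": 50},
]
blanks = [{"Ralstonia": 40, "Bacteroides": 1}]

print(flag_contaminants(samples, blanks))
```

Flagged taxa are candidates for review, not automatic removal: a genuine community member can appear in a blank through cross-contamination from a high-biomass sample.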
18. Statistical analysis and differential abundance
Metagenomics data are sparse, compositional and often strongly affected by covariates. Statistical testing should match the data type and study design.
Alpha diversity tests: Compare within-sample diversity between groups.
Beta diversity tests: Test whether community composition differs by group or covariate.
Differential abundance: Identify taxa, genes or pathways associated with conditions.
Covariate models: Account for batch, host factors, environment or repeated measures.
Avoid interpreting relative abundance changes as absolute microbial load changes unless absolute quantification, spike-ins, qPCR or biomass measurements support that conclusion.
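One common way to respect compositionality is the centered log-ratio (CLR) transform, which expresses each taxon relative to the geometric mean of its sample and underlies the Aitchison distance. The sketch below is a minimal version; the function name and the pseudocount of 0.5 used to handle zeros are assumptions, and the choice of zero handling can affect results.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    Each value becomes log(count) minus the mean of all log counts, so the
    transformed values sum to zero and do not depend on sequencing depth.
    A pseudocount is added first because log(0) is undefined.
    """
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


shallow = [1, 2, 4]
deep = [10, 20, 40]  # same proportions, ten times the depth

# Without a pseudocount the two samples give identical CLR values.
print([round(v, 3) for v in clr(shallow, pseudocount=0)])
print([round(v, 3) for v in clr(deep, pseudocount=0)])
```

The identical output for the two samples shows why CLR values describe ratios between taxa rather than absolute load. Statements about absolute load still need the external measurements listed above.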
19. Visualization
Good metagenomics visualization should show both community structure and data quality. Avoid overcrowded stacked bar plots with too many rare taxa.
Plot | Purpose | What to inspect
Stacked bar plot | Taxonomic composition. | Dominant taxa and sample-level outliers.
Alpha diversity boxplot | Within-sample diversity. | Group differences and outliers.
PCoA/NMDS | Between-sample distances. | Group separation, batch effects and control clustering.
Heatmap | Selected taxa, genes or pathways. | Patterns across groups and controls.
Volcano-style plot | Differential abundance results. | Effect size and significance together.
Contig/bin plots | Assembly and MAG QC. | Coverage, GC content, bin quality and contamination.
20. Example shotgun metagenomics workflow
The following simplified workflow illustrates a common shotgun metagenomics route. Real projects should adapt parameters to environment, host, database, controls and research question.
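One possible route is sketched below as a Python function that builds, but does not run, the command line for each stage: read trimming with fastp, host removal with Bowtie 2, classification with Kraken2 and abundance re-estimation with Bracken. The choice of tools, the file-naming scheme, the paths and the read length of 150 are assumptions for illustration. The Bracken database must have been built for the read length given.

```python
import shlex


def build_shotgun_commands(sample, r1, r2, host_index, kraken_db,
                           outdir="results", threads=8, read_length=150):
    """Return the per-sample commands as lists of arguments, in run order."""
    prefix = f"{outdir}/{sample}"
    trimmed_1, trimmed_2 = f"{prefix}.trim_R1.fastq.gz", f"{prefix}.trim_R2.fastq.gz"
    # Bowtie 2 replaces '%' with 1 or 2 when writing unaligned pairs.
    nonhost_pattern = f"{prefix}.nonhost_R%.fastq.gz"
    nonhost_1, nonhost_2 = (nonhost_pattern.replace("%", n) for n in ("1", "2"))
    report = f"{prefix}.kraken2.report"

    return [
        # 1. Adapter and quality trimming, with a QC report per sample.
        ["fastp", "-i", r1, "-I", r2, "-o", trimmed_1, "-O", trimmed_2,
         "--json", f"{prefix}.fastp.json", "--html", f"{prefix}.fastp.html",
         "--thread", str(threads)],
        # 2. Host removal: keep only pairs that do not align to the host.
        ["bowtie2", "-p", str(threads), "-x", host_index,
         "-1", trimmed_1, "-2", trimmed_2,
         "--un-conc-gz", nonhost_pattern, "-S", "/dev/null"],
        # 3. Taxonomic classification of the non-host reads.
        ["kraken2", "--db", kraken_db, "--threads", str(threads),
         "--paired", "--gzip-compressed",
         "--report", report, "--output", f"{prefix}.kraken2.out",
         nonhost_1, nonhost_2],
        # 4. Species-level abundance re-estimation from the Kraken2 report.
        ["bracken", "-d", kraken_db, "-i", report,
         "-o", f"{prefix}.bracken.species.tsv",
         "-r", str(read_length), "-l", "S"],
    ]


# Hypothetical paths; nothing is executed here.
commands = build_shotgun_commands(
    "S1", "S1_R1.fastq.gz", "S1_R2.fastq.gz",
    host_index="refs/host_index", kraken_db="refs/kraken2_db",
)
for command in commands:
    print(shlex.join(command))
```

Building commands as argument lists keeps the parameters inspectable and loggable before anything runs. The same loop should process blanks and mock communities with identical settings so that controls remain comparable with samples.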
21. Frequently asked questions
What is metagenomics?
Metagenomics is the study of genetic material recovered directly from mixed biological or environmental communities, such as gut microbiome, soil, water, skin, oral, respiratory or industrial samples.
What is the difference between 16S sequencing and shotgun metagenomics?
16S sequencing profiles bacteria and archaea using marker-gene amplicons, while shotgun metagenomics sequences all DNA in a sample and can support taxonomic, functional, strain-level and genome-resolved analyses.
What are the main analysis goals in metagenomics?
Common goals include taxonomic profiling, diversity analysis, functional profiling, pathogen detection, antimicrobial-resistance screening, genome assembly, metagenome-assembled genomes and comparison between groups.
Which files are needed for metagenomics analysis?
Typical inputs include FASTQ files, sample metadata, negative and positive controls, reference databases, host reference genomes for depletion, and information about library preparation, read length and sequencing platform.
Why are controls important in metagenomics?
Metagenomics is sensitive to contamination, especially in low-biomass samples. Extraction blanks, PCR blanks, mock communities and positive controls help identify background taxa, reagent contaminants and technical bias.
Should host reads be removed?
For human or animal-associated samples, host read removal is commonly performed for privacy, storage and downstream focus. The host-filtering strategy should be documented and evaluated carefully to avoid unintended bias.
What is the difference between taxonomic and functional profiling?
Taxonomic profiling estimates which organisms are present and their relative abundance. Functional profiling estimates genes, pathways, enzymes or resistance markers that may be present in the community.
Can AI help with metagenomics?
AI can help summarize QC, flag potential contaminants, compare community profiles, prioritize organisms or pathways, draft reports and integrate metagenomics results with metadata. The primary analysis should still remain reproducible, transparent and tied to recorded database versions.