Practical NGS Software Tutorial

Installing software for NGS data analysis.

A practical tutorial for setting up a Linux workstation for next-generation sequencing analysis. It covers system packages, mamba/conda and bioconda, containers, workflow managers, quality-control tools, aligners, RNA-seq tools, variant-calling utilities, R/Bioconductor and reproducible project environments.

1. Overview

NGS data analysis depends on many command-line tools, libraries, reference files and statistical packages. Installing everything directly into the operating system can quickly become difficult to maintain. A better strategy is to separate software into environments and containers that can be documented, shared and recreated.

System layer: Linux packages such as shell utilities, compilers, Java, Git, curl, wget and basic monitoring tools.
Environment layer: mamba/conda environments for FastQC, MultiQC, SAMtools, aligners, quantifiers and variant tools.
Workflow layer: Nextflow, Snakemake, containers and versioned configuration files for reproducible projects.

Core principle: install tools in a way that lets you reproduce the analysis later. Save commands, versions, environment YAML files, container names and workflow versions.

2. Recommended installation strategy

Different installation methods are useful for different tasks. A robust workstation can use several methods together.

Method | Best use | Recommendation
apt | Base operating-system utilities and stable system packages. | Use for build tools, Git, curl, wget, Java, htop, tree and general utilities.
mamba / conda | Most everyday bioinformatics tools and reproducible analysis environments. | Use separate environments for QC, alignment, RNA-seq, DNA-seq and reporting.
Docker | Containerized workflows on local workstations, servers and cloud platforms. | Use for reproducible pipelines where Docker is allowed.
Apptainer / Singularity | Containers on HPC and shared multi-user systems. | Use when Docker is not permitted on a cluster.
Source installation | Special cases, development versions or custom compilation. | Avoid unless required; document compiler, dependencies and commit hash.
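As a rough way to see which of the methods listed above are usable on a given machine, the sketch below checks whether each installer or runtime is on the PATH. The helper name available_methods is illustrative and not part of any tool.

```shell
#!/usr/bin/env bash
# Report which installation methods are usable on this machine.
# available_methods is an illustrative helper name, not a standard command.
available_methods() {
  local tool
  for tool in apt mamba conda docker apptainer singularity; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%-12s found\n' "$tool"
    else
      printf '%-12s missing\n' "$tool"
    fi
  done
}

available_methods
```

A tool reported as missing is not necessarily unavailable: on clusters it may only appear after loading a module.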

3. Prepare Ubuntu/Linux

Start with an updated system and install general utilities needed by many bioinformatics tools and installers.

Update the system and install base packages
sudo apt update
sudo apt upgrade -y

sudo apt install -y \
  build-essential cmake make gcc g++ \
  curl wget git git-lfs unzip zip tar gzip bzip2 xz-utils \
  htop tree pigz parallel rsync screen tmux \
  ca-certificates gnupg lsb-release software-properties-common \
  openjdk-17-jre

Create standard folders

Recommended software and data folders
mkdir -p ~/software ~/projects ~/datasets ~/references ~/scratch
mkdir -p ~/bin ~/.local/bin

echo 'export PATH="$HOME/bin:$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
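Running the echo command above more than once adds the same line to ~/.bashrc each time. One way to avoid that, sketched here with a hypothetical helper called append_once, is to append the line only when it is not already present. The demonstration writes to a temporary file, not the real ~/.bashrc.

```shell
#!/usr/bin/env bash
# Append a line to a shell startup file only if it is not already there.
# append_once is an illustrative helper name.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -F: fixed string, -x: match the whole line, -q: quiet
  if ! grep -Fxq -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Demonstration on a temporary file: the second call changes nothing.
demo_rc="$(mktemp)"
append_once 'export PATH="$HOME/bin:$HOME/.local/bin:$PATH"' "$demo_rc"
append_once 'export PATH="$HOME/bin:$HOME/.local/bin:$PATH"' "$demo_rc"
wc -l < "$demo_rc"
```

To use it for real, pass ~/.bashrc as the second argument.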

4. Install mamba / conda environment management

A conda-compatible package manager is one of the most convenient ways to install bioinformatics software. Miniforge is a minimal installer that uses the conda-forge channel by default and includes both conda and mamba, which gives a clean starting point.

Install Miniforge on Linux
cd ~/software

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh

bash Miniforge3-Linux-x86_64.sh

# Close and reopen the terminal, or run:
source ~/.bashrc

conda --version
mamba --version

Avoid installing bioinformatics tools into the base environment. Create named environments instead.
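Before running a downloaded installer such as the Miniforge script above, it can be worth comparing its SHA-256 digest with a value published by a source you trust. The sketch below assumes such a digest is available. verify_sha256 is a hypothetical helper, and the demonstration uses a throwaway file so it runs anywhere.

```shell
#!/usr/bin/env bash
# Compare a file's SHA-256 digest with an expected value before running it.
# verify_sha256 is an illustrative helper; the expected digest must come from
# a source you trust, such as the installer's release page.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}

# Self-contained demonstration with a throwaway file.
demo_file="$(mktemp)"
printf 'hello\n' > "$demo_file"
demo_digest="$(sha256sum "$demo_file" | awk '{print $1}')"
verify_sha256 "$demo_file" "$demo_digest"
```

For the installer itself, the call would take the downloaded script as the first argument and the published digest as the second.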

5. Configure conda-forge and bioconda channels

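Many bioinformatics packages are distributed through the bioconda and conda-forge channels. Channel order matters: with strict priority, a package is taken from the highest-priority channel that provides it. Each conda config --add channels call puts the channel at the top of the list, so the channel added last gets the highest priority. Bioconda recommends that conda-forge rank above bioconda.

Configure channels
# conda-forge is added last so that it ends up with the highest priority.
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict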

conda config --show channels
conda config --show channel_priority
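The channel list printed by conda config --show channels is ordered from highest to lowest priority. As a small sketch, the hypothetical helper list_channels below numbers the entries so the order is easy to read. The demonstration feeds it sample text in the format conda prints, so it runs without conda installed.

```shell
#!/usr/bin/env bash
# Number the channels from `conda config --show channels` output,
# highest priority first. list_channels is an illustrative helper name;
# it reads the YAML-style list from standard input.
list_channels() {
  awk '/^[[:space:]]*-[[:space:]]/ {
    n++
    sub(/^[[:space:]]*-[[:space:]]*/, "")
    print n ": " $0
  }'
}

# Sample input; on a configured machine use:
#   conda config --show channels | list_channels
printf 'channels:\n  - conda-forge\n  - bioconda\n' | list_channels
```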

Export and recreate environments

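Save an environment for reproducibility
# Run from the project directory once the environment exists (ngs-qc is created in section 6).
mkdir -p envs
mamba activate ngs-qc
mamba env export --no-builds > envs/ngs-qc.yml

# Recreate later:
mamba env create -f envs/ngs-qc.yml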

6. Quick install checklist

The following environments cover many common NGS data-analysis tasks. For production projects, pin tool versions and keep the YAML files.

QC environment
mamba create -n ngs-qc -y \
  fastqc multiqc fastp cutadapt trimmomatic seqkit pigz

mamba activate ngs-qc
fastqc --version
multiqc --version
fastp --version

Alignment and file-processing environment
mamba create -n ngs-align -y \
  samtools bcftools htslib bedtools bwa bowtie2 hisat2 star minimap2 subread

mamba activate ngs-align
samtools --version
bwa 2>&1 | head
STAR --version

RNA-seq environment
mamba create -n rnaseq -y \
  salmon kallisto star hisat2 subread samtools multiqc fastqc fastp

mamba activate rnaseq
salmon --version
featureCounts -v

Variant-analysis environment
mamba create -n variant-tools -y \
  samtools bcftools htslib bedtools bwa gatk4 freebayes vcftools vcflib snpeff ensembl-vep

mamba activate variant-tools
gatk --version
bcftools --version
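To pin versions for a production project, one option is to start from what is actually installed. conda list --export prints one name=version=build line per package. The sketch below, using a hypothetical helper named pin_specs, reduces those lines to name=version specs. The sample input uses made-up package names and versions.

```shell
#!/usr/bin/env bash
# Turn `conda list --export` lines (name=version=build) into pinned
# name=version specs. pin_specs is an illustrative helper name.
pin_specs() {
  # Skip comment lines and drop the build string (third field).
  awk -F= '!/^#/ && NF >= 2 { print $1 "=" $2 }'
}

# Sample input with made-up package names and versions; in a real
# environment use:  conda list --export | pin_specs
printf '%s\n' \
  '# platform: linux-64' \
  'exampletool=1.2.3=h0000000_0' \
  'otherpkg=0.9=py_1' | pin_specs
```

The resulting specs can be passed to mamba create or listed under dependencies in an environment YAML file.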

7. Core NGS tools and what they are used for

Tool | Purpose | Typical install route
FastQC | Quality control of FASTQ files. | conda/bioconda or apt
MultiQC | Aggregates QC reports across samples and tools. | conda/bioconda or pip
fastp / Cutadapt | Adapter trimming and read filtering. | conda/bioconda
SAMtools / BCFtools / HTSlib | BAM/CRAM/VCF processing, indexing and statistics. | conda/bioconda or apt
BEDTools | Genomic interval operations such as overlaps and coverage. | conda/bioconda or apt
SeqKit | FASTQ/FASTA manipulation and statistics. | conda/bioconda
BWA / Bowtie2 / HISAT2 / STAR | Short-read alignment for DNA-seq or RNA-seq workflows. | conda/bioconda
Minimap2 | Long-read and short-read alignment for many use cases. | conda/bioconda

8. Installing RNA-seq software

RNA-seq workflows typically require FASTQ QC, trimming, transcript quantification or splice-aware alignment, gene counting and downstream statistical analysis in R or Python.

Transcript quantification: Salmon and Kallisto are widely used for fast transcript-level quantification.
Splice-aware alignment: STAR and HISAT2 are common choices when alignments, splice junctions or genome-browser tracks are needed.
Gene counts: featureCounts from Subread is commonly used to generate gene-level count matrices from alignments.
Statistics: DESeq2, edgeR, limma, tximport and clusterProfiler are common R/Bioconductor packages for interpretation.

R packages for RNA-seq analysis
mamba create -n r-rnaseq -y \
  r-base r-essentials bioconductor-deseq2 bioconductor-edger \
  bioconductor-limma bioconductor-tximport bioconductor-clusterprofiler \
  r-pheatmap r-ggplot2 r-tidyverse r-rmarkdown

mamba activate r-rnaseq
R

9. Installing DNA-seq and variant-calling software

DNA-seq workflows often require alignment, duplicate marking, base-level processing, variant calling, filtering and annotation. For tumour-normal or clinical-genomics-style analysis, the exact workflow should be validated for the project.

Task | Common tools | Notes
Alignment | BWA, BWA-MEM2, Bowtie2, minimap2 | Choice depends on read type, reference and assay.
BAM processing | SAMtools, Picard, GATK | Sorting, indexing, duplicate marking and metrics.
Germline variants | GATK HaplotypeCaller, DeepVariant, FreeBayes, BCFtools | Use an appropriate validated workflow for the data type.
Somatic variants | GATK Mutect2, VarScan, Strelka-style workflows | Matched normal DNA is preferred where available.
Annotation | VEP, SnpEff, ANNOVAR-style workflows, bcftools plugins | Record annotation database versions.

Install additional variant-analysis tools
mamba create -n variant-annotation -y \
  ensembl-vep snpeff snpsift bcftools vcftools tabix htslib

mamba activate variant-annotation
vep --help | head
snpEff -version

10. Installing long-read analysis tools

Long-read sequencing analysis often requires different tools from short-read workflows. Tool choice depends on whether the data are PacBio HiFi, Oxford Nanopore, long-read RNA, genome assembly or structural-variant analysis.

Long-read starter environment
mamba create -n longread -y \
  minimap2 samtools bcftools bedtools seqkit \
  flye raven filtlong porechop nanofilt

mamba activate longread
minimap2 --version

Nanopore basecalling and PacBio HiFi workflows often include vendor-specific software or GPU-aware tools. Install those according to the instrument provider’s current documentation.

11. Installing R, Python and notebooks

Statistical analysis, visualization and reporting are often done in R, Python or Jupyter notebooks. Keep project environments separate.

Python analysis environment
mamba create -n py-analysis -y \
  python=3.11 jupyterlab notebook pandas numpy scipy matplotlib seaborn \
  scikit-learn statsmodels biopython scanpy anndata plotly

mamba activate py-analysis
jupyter lab

R analysis and reporting environment
mamba create -n r-analysis -y \
  r-base r-essentials r-tidyverse r-data.table r-ggplot2 \
  r-rmarkdown r-knitr r-devtools r-pheatmap

mamba activate r-analysis
R

12. Installing Docker and Apptainer

Containers make software more reproducible by bundling tools and dependencies. Docker is common on workstations and cloud environments. Apptainer is common on HPC clusters.

Docker on a local workstation

Install Docker from Ubuntu packages
sudo apt update
sudo apt install -y docker.io

sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"

# Log out and log back in before testing:
docker --version
docker run hello-world

Apptainer

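Install Apptainer if available from your distribution
# If apt cannot find the package, see the Apptainer installation guide;
# for Ubuntu it describes a PPA (ppa:apptainer/ppa) that must be added first.
sudo apt update
sudo apt install -y apptainer

apptainer --version

On institutional servers or HPC systems, do not install Docker without administrator approval. Many clusters use Apptainer/Singularity instead.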
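Scripts that have to run on both a workstation and a cluster can detect which container runtime is present and not hard-code one. The sketch below uses a hypothetical helper, pick_runtime. The preference order (Apptainer, then Singularity, then Docker) is an assumption chosen for shared systems and can be reordered.

```shell
#!/usr/bin/env bash
# Print the first container runtime found on PATH, or "none".
# pick_runtime is an illustrative helper; the preference order is an assumption.
pick_runtime() {
  local rt
  for rt in apptainer singularity docker; do
    if command -v "$rt" >/dev/null 2>&1; then
      echo "$rt"
      return 0
    fi
  done
  echo "none"
}

echo "Container runtime: $(pick_runtime)"
```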

13. Installing workflow engines

Workflow engines make NGS pipelines easier to run, scale and reproduce.

Install Nextflow
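# Nextflow needs Java; this assumes the OpenJDK package from section 3 is installed.
curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p "$HOME/.local/bin"
mv nextflow "$HOME/.local/bin/"
# Add ~/.local/bin to PATH only if ~/.bashrc does not already mention it (section 3 adds it).
grep -q '\.local/bin' ~/.bashrc || echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

nextflow -version

Install Snakemake in its own environment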
mamba create -n snakemake -y snakemake
mamba activate snakemake
snakemake --version

For established pipelines, consider community workflows such as nf-core pipelines. They can save time and improve reproducibility, especially when paired with containers.

14. Shared servers and HPC systems

Shared systems often use modules, Apptainer, SLURM and centrally managed filesystems. Do not assume you can install system packages or Docker on a cluster.

Use modules first: check available software with commands such as module avail or module spider.
Install in home or project space: use mamba environments in user-owned locations when allowed by the system policy.
Avoid heavy I/O in home: use scratch or project filesystems for FASTQ, BAM and intermediate data.
Document modules: record loaded modules, versions, SLURM settings and container images in each analysis report.
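The points above can be combined in a job script that activates its own environment and records what it used. The following is a minimal sketch, not a site-specific template: write_job_script is a hypothetical helper, and the resource values, paths and FastQC command are placeholder assumptions to adapt to the cluster's policy.

```shell
#!/usr/bin/env bash
# Write a minimal SLURM job script that activates a conda environment inside
# the job. write_job_script, the resource values and the FastQC command are
# illustrative assumptions; adjust them to the cluster's policy.
write_job_script() {
  local env_name="$1" out_file="$2"
  cat > "$out_file" <<EOF
#!/usr/bin/env bash
#SBATCH --job-name=${env_name}-job
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x_%j.log

set -eo pipefail

# Make "conda activate" available in the non-interactive job shell.
source "\$(conda info --base)/etc/profile.d/conda.sh"
conda activate ${env_name}

# Record what was used, then run the analysis step.
conda list --export > "logs/\${SLURM_JOB_ID}_packages.txt"
fastqc --threads "\${SLURM_CPUS_PER_TASK}" --outdir results data/fastq/*.fastq.gz
EOF
}

job_file="$(mktemp)"
write_job_script ngs-qc "$job_file"
cat "$job_file"
```

The generated file would be submitted with sbatch from the project directory. The logs and results directories need to exist before submission.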

15. Project template for reproducible software setup

Store environment files and installation notes together with project scripts. This makes it much easier to rerun the project later.

Recommended project structure
project_name/
├── data/
│   ├── fastq/
│   └── metadata/
├── references/
├── envs/
│   ├── ngs-qc.yml
│   ├── rnaseq.yml
│   └── variant-tools.yml
├── workflows/
├── scripts/
├── results/
├── logs/
└── reports/
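The layout above can be created in one step. This is a small sketch with a hypothetical helper named make_project. The demonstration builds the tree in a temporary location so it does not touch existing projects.

```shell
#!/usr/bin/env bash
# Create the recommended project directory layout under a given root.
# make_project is an illustrative helper name.
make_project() {
  local root="$1"
  mkdir -p \
    "$root/data/fastq" "$root/data/metadata" \
    "$root/references" "$root/envs" "$root/workflows" \
    "$root/scripts" "$root/results" "$root/logs" "$root/reports"
}

# Demonstration in a temporary location.
demo_root="$(mktemp -d)/project_name"
make_project "$demo_root"
find "$demo_root" -type d | sort
```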

Record software versions
mkdir -p reports/software_versions

fastqc --version > reports/software_versions/fastqc.txt
multiqc --version > reports/software_versions/multiqc.txt
samtools --version > reports/software_versions/samtools.txt
bcftools --version > reports/software_versions/bcftools.txt
bwa 2> reports/software_versions/bwa.txt
STAR --version > reports/software_versions/star.txt
salmon --version > reports/software_versions/salmon.txt
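The commands above assume every tool is installed in the active environment. A slightly more defensive variant is sketched below: the hypothetical helper record_version captures both standard output and standard error (some tools print their version to the latter) and writes a note when a tool is missing, so the loop does not stop.

```shell
#!/usr/bin/env bash
# Record a tool's version output, or note that the tool is not installed.
# record_version is an illustrative helper; stderr is captured as well because
# some tools print their version there.
record_version() {
  local out_dir="$1" tool="$2"
  shift 2
  mkdir -p "$out_dir"
  if command -v "$tool" >/dev/null 2>&1; then
    "$tool" "$@" > "$out_dir/$tool.txt" 2>&1 || true
  else
    echo "$tool: not found on PATH" > "$out_dir/$tool.txt"
  fi
}

# Demonstration with bash itself plus a tool that may be absent.
demo_dir="$(mktemp -d)"
record_version "$demo_dir" bash --version
record_version "$demo_dir" samtools --version
head -n 1 "$demo_dir/bash.txt"
```

In a project, the first argument would be reports/software_versions, with one call per tool.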

Example environment YAML

envs/ngs-qc.yml
name: ngs-qc
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc
  - multiqc
  - fastp
  - cutadapt
  - seqkit
  - samtools
  - pigz

16. Troubleshooting common installation problems

Problem | Likely cause | What to try
Conda environment solving is slow | Large dependency search or conflicting packages. | Use mamba, create smaller environments and avoid mixing many workflows in one environment.
Package not found | Missing bioconda or conda-forge channels, wrong channel order or unsupported platform. | Check channel configuration and search the package on bioconda or conda-forge.
Tool version mismatch | Package manager selected a different version than expected. | Pin versions in the environment YAML and record the final installed versions.
Docker permission denied | User not added to docker group or session not refreshed. | Run sudo usermod -aG docker "$USER", then log out and log in again.
Java-based tools fail | Java missing or incompatible Java version. | Install a suitable OpenJDK version and check java -version.
HPC tool works interactively but not in jobs | Environment not activated or modules not loaded in the job script. | Activate conda environments or load modules inside the SLURM/job script itself.
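For the Java row, the major version is usually what matters when a tool refuses to start. As a sketch, the hypothetical helper java_major below extracts it from the first line of java -version output. It handles both the older 1.8-style and the newer 17-style numbering. java -version writes to standard error, which is why the real call redirects it.

```shell
#!/usr/bin/env bash
# Extract the major Java version from the first line of `java -version`.
# java_major is an illustrative helper; it reads the version text on stdin and
# handles both the legacy "1.8.0_x" and the modern "17.0.x" schemes.
java_major() {
  awk -F'"' 'NR == 1 { split($2, v, "."); print (v[1] == "1" ? v[2] : v[1]) }'
}

# Sample lines in the format Java prints.
printf 'openjdk version "17.0.9" 2023-10-17\n' | java_major
printf 'java version "1.8.0_292"\n' | java_major

# On a machine with Java installed:
if command -v java >/dev/null 2>&1; then
  echo "Installed Java major version: $(java -version 2>&1 | java_major)"
fi
```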

Frequently asked questions

What is the best way to install NGS bioinformatics software?

For most users, the best approach is to combine system packages for basic tools, mamba or conda for bioinformatics environments, containers for reproducible pipelines, and workflow managers such as Nextflow or Snakemake for larger projects.

Should I install bioinformatics tools with apt, conda, Docker or from source?

Use apt for general system utilities, mamba/conda for most command-line bioinformatics tools, containers for reproducible pipelines, and source installation only when a tool is unavailable or requires custom compilation.

Why is mamba often preferred over conda?

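Mamba uses the conda package ecosystem but resolves environments faster in many cases. This is helpful when installing bioinformatics tools from channels such as bioconda and conda-forge. Recent conda releases use the same libmamba solver by default, so the speed difference is smaller than it used to be.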

Do I need Docker for NGS analysis?

Docker is very useful for reproducible local and cloud workflows. On shared HPC systems, Apptainer or Singularity is often preferred because containers run as the invoking user without a root-owned daemon, which suits multi-user environments.

Can all NGS tools be installed in one environment?

It is usually better to create separate environments for different workflows, such as QC, RNA-seq, variant calling, long-read analysis or single-cell analysis. This reduces package conflicts and makes projects easier to reproduce.

How should I document installed software?

Record tool names, versions, installation commands, conda environment YAML files, container image names, workflow versions and reference genome versions. This is essential for reproducibility and project reporting.