Installing Software for NGS Data Analysis
Practical NGS Software Tutorial
A practical tutorial for setting up a Linux workstation for next-generation sequencing analysis. It covers system packages, mamba/conda and bioconda, containers, workflow managers, quality-control tools, aligners, RNA-seq tools, variant-calling utilities, R/Bioconductor and reproducible project environments.
NGS data analysis depends on many command-line tools, libraries, reference files and statistical packages. Installing everything directly into the operating system can quickly become difficult to maintain. A better strategy is to separate software into environments and containers that can be documented, shared and recreated.
Environment layer: mamba/conda environments for FastQC, MultiQC, SAMtools, aligners, quantifiers and variant tools.
Workflow layer: Nextflow, Snakemake, containers and versioned configuration files for reproducible projects.
Core principle: install tools in a way that lets you reproduce the analysis later. Save commands, versions, environment YAML files, container names and workflow versions.
2. Recommended installation strategy
Different installation methods are useful for different tasks. A robust workstation can use several methods together.
| Method | Best use | Recommendation |
| --- | --- | --- |
| apt | Base operating-system utilities and stable system packages. | Use for build tools, Git, curl, wget, Java, htop, tree and general utilities. |
| mamba / conda | Most everyday bioinformatics tools and reproducible analysis environments. | Use separate environments for QC, alignment, RNA-seq, DNA-seq and reporting. |
| Docker | Containerized workflows on local workstations, servers and cloud platforms. | Use for reproducible pipelines where Docker is allowed. |
| Apptainer / Singularity | Containers on HPC and shared multi-user systems. | Use when Docker is not permitted on a cluster. |
| Source installation | Special cases, development versions or custom compilation. | Avoid unless required; document compiler, dependencies and commit hash. |
3. Prepare Ubuntu/Linux
Start with an updated system and install general utilities needed by many bioinformatics tools and installers.
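The preparation step can be sketched as a short script. The package list is an assumption built from the utilities listed for apt above (for example, `default-jre` is one possible choice for Java), and the `run` helper and `DRY_RUN` variable are illustrative names. By default the script only prints the commands so they can be reviewed before anything is installed.

```shell
#!/bin/sh
set -eu

# Assumed package list, based on the utilities named for apt earlier in this
# tutorial; adjust the names for your Ubuntu release.
PKGS="build-essential git curl wget unzip default-jre htop tree"

# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them for real (requires sudo).
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run sudo apt update
run sudo apt upgrade -y
# $PKGS is left unquoted on purpose so each name becomes its own argument.
run sudo apt install -y $PKGS
```

Running the script once in dry-run mode and pasting its output into the project notes also gives a record of what was installed at the system level.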
A conda-compatible package manager is one of the most convenient ways to install bioinformatics software. Miniforge is a lightweight installer that defaults to the community-maintained conda-forge channel and ships both conda and mamba, which gives a clean starting point.
Install Miniforge on Linux
# Create the download directory if it does not exist yet
mkdir -p ~/software && cd ~/software
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
# Close and reopen the terminal, or run:
source ~/.bashrc
conda --version
mamba --version
Avoid installing bioinformatics tools into the base environment. Create named environments instead.
5. Configure conda-forge and bioconda channels
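Many bioinformatics packages are distributed through the bioconda and conda-forge channels. Channel order matters: with strict channel priority, conda takes each package from the highest-ranked channel that provides it, and Bioconda recommends ranking conda-forge above bioconda.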
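A minimal sketch of the channel setup, assuming conda is already on the PATH. The function name `configure_channels` is illustrative. Because `conda config --add` prepends to the channel list, the channel added last ends up with the highest priority.

```shell
#!/bin/sh
set -eu

# `conda config --add` prepends, so the channel added last gets top priority.
# Resulting order: conda-forge first, then bioconda.
configure_channels() {
  conda config --add channels bioconda
  conda config --add channels conda-forge
  conda config --set channel_priority strict
}

if command -v conda >/dev/null 2>&1; then
  configure_channels
  conda config --show channels   # confirm the resulting order
else
  echo "conda not found; install Miniforge first" >&2
fi
```

These commands write to the user-level `~/.condarc`, so the setting applies to every environment created afterwards.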
| Tool | Purpose | Installation |
| --- | --- | --- |
| | Genomic interval operations such as overlaps and coverage. | conda/bioconda or apt |
| SeqKit | FASTQ/FASTA manipulation and statistics. | conda/bioconda |
| BWA / Bowtie2 / HISAT2 / STAR | Short-read alignment for DNA-seq or RNA-seq workflows. | conda/bioconda |
| Minimap2 | Long-read and short-read alignment for many use cases. | conda/bioconda |
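Installing these tools into named environments might look like the sketch below. The helper `create_env` and the environment names `ngs-utils` and `ngs-align` are hypothetical; the package names are the lowercase recipe names these tools use on Bioconda. Versions are left unpinned here, so the solver picks whatever is current.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: create a named environment from conda-forge and bioconda.
# Usage: create_env NAME PACKAGE...
create_env() {
  env_name="$1"
  shift
  mamba create -y -n "$env_name" -c conda-forge -c bioconda "$@"
}

if command -v mamba >/dev/null 2>&1; then
  # Environment names are illustrative; split tools however suits the project.
  create_env ngs-utils seqkit samtools
  create_env ngs-align bwa bowtie2 hisat2 star minimap2
else
  echo "mamba not found; install Miniforge first" >&2
fi
```

After creation, `mamba activate ngs-align` switches into the environment, and `mamba list` shows the exact versions that were resolved.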
8. Installing RNA-seq software
RNA-seq workflows typically require FASTQ QC, trimming, transcript quantification or splice-aware alignment, gene counting and downstream statistical analysis in R or Python.
Transcript quantification: Salmon and Kallisto are widely used for fast transcript-level quantification.
Splice-aware alignment: STAR and HISAT2 are common choices when alignments, splice junctions or genome-browser tracks are needed.
Gene counts: featureCounts from Subread is commonly used to generate gene-level count matrices from alignments.
Statistics: DESeq2, edgeR, limma, tximport and clusterProfiler are common R/Bioconductor packages for interpretation.
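One way to capture the command-line part of such a workflow is an environment file that lives with the project. The sketch below writes one and creates the environment if mamba is available. The file name `rnaseq-env.yml` and the environment name `rnaseq` are assumptions, and no versions are pinned; for a real project, pin the versions that were validated.

```shell
#!/bin/sh
set -eu

# File name and environment name are illustrative.
ENV_FILE="${ENV_FILE:-rnaseq-env.yml}"

cat > "$ENV_FILE" <<'EOF'
name: rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc
  - multiqc
  - salmon
  - kallisto
  - star
  - hisat2
  - subread   # provides featureCounts
  - samtools
EOF

# Create the environment only if mamba is available.
if command -v mamba >/dev/null 2>&1; then
  mamba env create -f "$ENV_FILE"
else
  echo "mamba not found; wrote $ENV_FILE only" >&2
fi
```

Committing the YAML file to version control lets collaborators recreate the same environment with a single command.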
9. Installing DNA-seq and variant-calling software
DNA-seq workflows often require alignment, duplicate marking, base-level processing, variant calling, filtering and annotation. For tumour-normal or clinical-genomics-style analysis, the exact workflow should be validated for the project.
Long-read sequencing analysis often requires different tools from short-read workflows. Tool choice depends on whether the data are PacBio HiFi, Oxford Nanopore, long-read RNA, genome assembly or structural-variant analysis.
Nanopore basecalling and PacBio HiFi workflows often include vendor-specific software or GPU-aware tools. Install those according to the instrument provider’s current documentation.
11. Installing R, Python and notebooks
Statistical analysis, visualization and reporting are often done in R, Python or Jupyter notebooks. Keep project environments separate.
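To keep notebook work tied to a specific environment, each environment can be registered as its own Jupyter kernel. The sketch below is one possible approach: the environment name `py-report`, its package list and the helper `register_kernel` are all illustrative. It relies on `conda run` and on the `ipykernel install` command.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: expose an existing environment as a Jupyter kernel.
# Usage: register_kernel ENV_NAME
register_kernel() {
  env_name="$1"
  conda run -n "$env_name" python -m ipykernel install --user \
    --name "$env_name" --display-name "Python ($env_name)"
}

if command -v mamba >/dev/null 2>&1 && command -v conda >/dev/null 2>&1; then
  # Package selection is an assumption; add what the project needs.
  mamba create -y -n py-report -c conda-forge python jupyterlab ipykernel pandas
  register_kernel py-report
else
  echo "conda/mamba not found; skipping" >&2
fi
```

In JupyterLab the kernel then appears under its display name, which makes it harder to run a notebook against the wrong environment by accident.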
Containers make software more reproducible by bundling tools and dependencies. Docker is common on workstations and cloud environments. Apptainer is common on HPC clusters.
Docker on a local workstation
Install Docker from Ubuntu packages
sudo apt update
sudo apt install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"
# Log out and log back in before testing:
docker --version
docker run hello-world
Apptainer
Install Apptainer if available from your distribution
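A hedged sketch for checking whether a container runtime is already present before installing anything. The helper name `find_container_runtime` is illustrative. The PPA shown in the fallback message is the route described in the Apptainer installation documentation for Ubuntu, to the best of my knowledge; verify it against the current documentation before use.

```shell
#!/bin/sh
set -eu

# Report the first of Apptainer or the older Singularity found on PATH.
find_container_runtime() {
  for candidate in apptainer singularity; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

if runtime="$(find_container_runtime)"; then
  "$runtime" --version
  # Smoke test: run a command inside a small public image (needs network access).
  "$runtime" exec docker://alpine cat /etc/os-release
else
  echo "No Apptainer/Singularity found." >&2
  echo "On Ubuntu, one option is the Apptainer PPA:" >&2
  echo "  sudo add-apt-repository -y ppa:apptainer/ppa" >&2
  echo "  sudo apt update && sudo apt install -y apptainer" >&2
fi
```

On a cluster, the runtime is usually provided by the administrators, so the fallback branch is mainly relevant for personal workstations.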
For established pipelines, consider community workflows such as nf-core pipelines. They can save time and improve reproducibility, especially when paired with containers.
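As an illustration, an nf-core pipeline can be exercised with its bundled test profile before it is pointed at real data. The helper `run_nfcore_test` and the output directory naming are hypothetical. `-profile test,docker` combines the pipeline's small test dataset with a container engine profile.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: run an nf-core pipeline's bundled test profile.
# Usage: run_nfcore_test PIPELINE ENGINE   (ENGINE: docker, apptainer or singularity)
run_nfcore_test() {
  pipeline="$1"
  engine="$2"
  nextflow run "nf-core/$pipeline" -profile "test,$engine" --outdir "results-$pipeline"
}

if command -v nextflow >/dev/null 2>&1; then
  nextflow -version
  run_nfcore_test rnaseq docker
else
  echo "nextflow not found; one option is:" >&2
  echo "  mamba create -n nextflow -c conda-forge -c bioconda nextflow" >&2
fi
```

For real analyses, add `-r` with a specific pipeline release so the workflow version is fixed and can be reported.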
14. Shared servers and HPC systems
Shared systems often use modules, Apptainer, SLURM and centrally managed filesystems. Do not assume you can install system packages or Docker on a cluster.
Use modules first: Check available software with commands such as module avail or module spider.
Install in home or project space: Use mamba environments in user-owned locations when allowed by the system policy.
Avoid heavy I/O in home: Use scratch or project filesystems for FASTQ, BAM and intermediate data.
Document modules: Record loaded modules, versions, SLURM settings and container images in each analysis report.
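The points above can be combined in a small SLURM job script. This is a sketch under several assumptions: the module name `apptainer`, the `SCRATCH` variable, the image `fastqc.sif` and the FASTQ path are all site-specific placeholders, and the resource requests are arbitrary. The outer script only writes the job file and submits it if `sbatch` exists.

```shell
#!/bin/sh
set -eu

JOB_SCRIPT="${JOB_SCRIPT:-fastqc_job.sh}"

cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x_%j.log

set -euo pipefail

# Module name is site-specific; check with `module avail`.
module load apptainer

# Record the software context in the job log.
module list 2>&1
apptainer --version

# SCRATCH, fastqc.sif and the FASTQ path are placeholders for this sketch.
# Depending on the site configuration, scratch may need an explicit --bind.
mkdir -p "$SCRATCH/fastqc_out"
apptainer exec fastqc.sif fastqc --threads "$SLURM_CPUS_PER_TASK" \
  --outdir "$SCRATCH/fastqc_out" "$SCRATCH/reads/sample_R1.fastq.gz"
EOF

# SLURM does not create the log directory, so make it before submitting.
mkdir -p logs

if command -v sbatch >/dev/null 2>&1; then
  sbatch "$JOB_SCRIPT"
else
  echo "sbatch not found; wrote $JOB_SCRIPT only" >&2
fi
```

Keeping the job script in the project repository documents the SLURM settings and module versions alongside the analysis code.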
15. Project template for reproducible software setup
Store environment files and installation notes together with project scripts. This makes it much easier to rerun the project later.
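A possible starting layout is sketched below. The directory names, the notes file and the default project name `my-ngs-project` are assumptions rather than a fixed convention. If a conda environment is active, its specification is exported into the project.

```shell
#!/bin/sh
set -eu

# Project name and layout are illustrative.
PROJECT_DIR="${PROJECT_DIR:-my-ngs-project}"

mkdir -p "$PROJECT_DIR/envs" "$PROJECT_DIR/scripts" "$PROJECT_DIR/config" \
         "$PROJECT_DIR/docs" "$PROJECT_DIR/results"

# Start an installation log if none exists yet.
NOTES="$PROJECT_DIR/docs/INSTALL_NOTES.md"
if [ ! -f "$NOTES" ]; then
  printf '# Installation notes\n\nCreated: %s\n' "$(date +%Y-%m-%d)" > "$NOTES"
fi

# Export the currently active environment, if there is one.
if command -v conda >/dev/null 2>&1 && [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
  conda env export --no-builds > "$PROJECT_DIR/envs/$CONDA_DEFAULT_ENV.yml"
fi

ls "$PROJECT_DIR"
```

The `--no-builds` option drops platform-specific build strings, which tends to make the exported file easier to recreate on another machine.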
What is the best way to install NGS bioinformatics software?
For most users, the best approach is to combine system packages for basic tools, mamba or conda for bioinformatics environments, containers for reproducible pipelines, and workflow managers such as Nextflow or Snakemake for larger projects.
Should I install bioinformatics tools with apt, conda, Docker or from source?
Use apt for general system utilities, mamba/conda for most command-line bioinformatics tools, containers for reproducible pipelines, and source installation only when a tool is unavailable or requires custom compilation.
Why is mamba often preferred over conda?
Mamba uses the conda package ecosystem but resolves environments faster in many cases. This is helpful when installing bioinformatics tools from channels such as bioconda and conda-forge. Recent conda releases use the same libmamba solver by default, which narrows the speed difference.
Do I need Docker for NGS analysis?
Docker is very useful for reproducible local and cloud workflows, but it is not required. On shared HPC systems, Apptainer or Singularity is often preferred because containers run as the invoking user, without a root-owned daemon, which suits multi-user environments.
Can all NGS tools be installed in one environment?
It is usually better to create separate environments for different workflows, such as QC, RNA-seq, variant calling, long-read analysis or single-cell analysis. This reduces package conflicts and makes projects easier to reproduce.
How should I document installed software?
Record tool names, versions, installation commands, conda environment YAML files, container image names, workflow versions and reference genome versions. This is essential for reproducibility and project reporting.
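Tool versions can be collected with a short loop. This is a sketch: the tool list and the output file name `software_versions.txt` are assumptions, and it only covers tools that accept a `--version` flag. Tools that are not on the PATH are recorded as such instead of causing a failure.

```shell
#!/bin/sh
set -eu

VERSIONS_FILE="${VERSIONS_FILE:-software_versions.txt}"
# Illustrative list; every tool here accepts --version.
TOOLS="fastqc multiqc samtools salmon"

: > "$VERSIONS_FILE"   # start with an empty file
for tool in $TOOLS; do
  if command -v "$tool" >/dev/null 2>&1; then
    # Keep only the first line; some tools print multi-line version banners.
    version="$("$tool" --version 2>&1 | head -n 1)"
  else
    version="not installed"
  fi
  printf '%s\t%s\n' "$tool" "$version" >> "$VERSIONS_FILE"
done

cat "$VERSIONS_FILE"
```

Running this inside each activated environment and saving the file with the results gives a simple per-analysis record to cite in the methods section.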