Bioinformatics Scripts

Calculate raw and estimated read-count matrices with eXpress and RSEM.

This tutorial shows how to derive transcript-level count outputs from aligned BAM files using eXpress and RSEM. It covers installation, reference preparation, strand-specific counting, combining sample-level outputs into matrices, and practical quality-control steps before differential expression analysis.

Overview

After reads are aligned to a transcriptome or genome reference, the next step is often to calculate transcript-level or gene-level read-count tables. These tables can be used for quality control, sample comparison, normalization, clustering, and downstream differential expression analysis.

eXpress and RSEM are probabilistic quantification tools that estimate transcript abundance from sequencing reads. Their outputs are not always simple integer read counts, especially when reads map to multiple transcripts, but the resulting count-like values are widely used to construct expression matrices.

Input Aligned BAM files, transcriptome FASTA references, genome FASTA references, and optionally GTF annotations.
Processing Estimate transcript or gene abundance using eXpress or RSEM with correct strand and reference settings.
Output Sample-level quantification files and combined count matrices for downstream analysis.

Terminology note

The current SciBerg resource uses the phrase “raw read counts.” For eXpress and RSEM, some output columns represent estimated or expected counts rather than literal per-feature integer counts. Choose the output column according to your downstream method.

Required inputs and assumptions

Before running quantification, confirm that the BAM file, reference FASTA, annotation file, and strand settings all match the alignment workflow.

Aligned BAM file Generated by Bowtie, Bowtie2, STAR, HISAT2, BWA, or another compatible aligner.
Reference FASTA The same transcriptome or genome reference used to generate or interpret the alignments.
Annotation GTF Required for genome-reference preparation in RSEM and useful for gene-level summarization.
Library strandedness Forward-stranded, reverse-stranded, or unstranded settings must match the experimental protocol.
# Basic BAM validation examples
samtools quickcheck INPUT_over_reference.bam
samtools flagstat INPUT_over_reference.bam > INPUT_over_reference.flagstat.txt
samtools idxstats INPUT_over_reference.bam > INPUT_over_reference.idxstats.txt

Install eXpress

The original SciBerg page downloads the eXpress 1.5.1 Linux binary and copies the executable into the system path.

# Download and unpack eXpress
wget https://pachterlab.github.io/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz
tar -xvzf express-1.5.1-linux_x86_64.tgz

# Copy the binary to a location in your PATH
cd express-1.5.1-linux_x86_64
sudo cp express /usr/local/bin/

# Check installation
express --version

Shared-server tip

On institutional or HPC systems, avoid sudo unless your administrator recommends it. You can copy express into ~/bin and add that folder to your PATH.

Calculate counts with eXpress

eXpress can quantify reads from BAM alignments against a corresponding transcriptome reference FASTA file. The current SciBerg page provides strand-specific examples for forward and reverse single-end alignments.

Forward-stranded library

# Read 1 of a stranded library: accept forward single-end alignments
express \
  --f-stranded \
  --no-bias-correct \
  reference.fa \
  INPUT_over_reference.bam

mv results.xprs INPUT_over_reference.xprs

Reverse-stranded library

# Read 2 of a stranded library: accept reverse single-end alignments
express \
  --r-stranded \
  --no-bias-correct \
  reference.fa \
  INPUT_over_reference.bam

mv results.xprs INPUT_over_reference.xprs

Unstranded example

# Example unstranded run
express \
  --no-bias-correct \
  reference.fa \
  INPUT_over_reference.bam

mv results.xprs INPUT_over_reference.xprs

Important eXpress notes from the current SciBerg page

eXpress works with sorted BAM files generated by most aligners. The current SciBerg page notes that eXpress will not work on sorted BAM files generated by Bowtie or Bowtie2 when the -k option was used during mapping.

Combine eXpress outputs into a count matrix

The original SciBerg workflow sorts each .xprs file by transcript ID, extracts selected columns, pastes sample files together, and keeps one identifier block plus count columns.

Sort sample-level XPRS files

sort -k2,2 INPUT1_over_reference.xprs > INPUT1_over_reference_sorted.xprs
sort -k2,2 INPUT2_over_reference.xprs > INPUT2_over_reference_sorted.xprs
sort -k2,2 INPUT3_over_reference.xprs > INPUT3_over_reference_sorted.xprs
sort -k2,2 INPUT4_over_reference.xprs > INPUT4_over_reference_sorted.xprs
sort -k2,2 INPUTN_over_reference.xprs > INPUTN_over_reference_sorted.xprs

Extract selected columns

cut -f2,3,5 INPUT1_over_reference_sorted.xprs > INPUT1_over_reference_sorted_ss.xprs
cut -f2,3,5 INPUT2_over_reference_sorted.xprs > INPUT2_over_reference_sorted_ss.xprs
cut -f2,3,5 INPUT3_over_reference_sorted.xprs > INPUT3_over_reference_sorted_ss.xprs
cut -f2,3,5 INPUT4_over_reference_sorted.xprs > INPUT4_over_reference_sorted_ss.xprs
cut -f2,3,5 INPUTN_over_reference_sorted.xprs > INPUTN_over_reference_sorted_ss.xprs

Paste and select matrix columns

paste \
  INPUT1_over_reference_sorted_ss.xprs \
  INPUT2_over_reference_sorted_ss.xprs \
  INPUT3_over_reference_sorted_ss.xprs \
  INPUT4_over_reference_sorted_ss.xprs \
  INPUTN_over_reference_sorted_ss.xprs \
  > INPUT1234N_over_reference_temp.xprs

cut -f1,2,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48,51,N+3 \
  INPUT1234N_over_reference_temp.xprs \
  > INPUT1234N_over_reference.xprs

More scalable AWK-based example

For many samples, it is often easier to script matrix generation rather than manually specifying every column.

# Example: extract target ID and estimated counts from each .xprs file
# Adjust column selection if your downstream method needs a different column.
for xprs in *_over_reference.xprs; do
  sample="${xprs%_over_reference.xprs}"
  sort -k2,2 "$xprs" | awk -v s="$sample" 'BEGIN{OFS="\t"} NR==1{next} {print $2,$5}' \
    > "${sample}.counts.tsv"
done

# Use a dedicated R/Python script for robust joining across many samples.

Install RSEM

The current SciBerg page downloads RSEM source code, compiles it, and installs the executables.

Source-install example from the current SciBerg workflow

wget https://github.com/deweylab/RSEM/archive/v1.3.1.tar.gz

tar -xvzf v1.3.1.tar.gz
cd RSEM-1.3.1

make
sudo make install

rsem-calculate-expression --version

Conda/Mamba installation option

mamba create -n rsem-env -c conda-forge -c bioconda rsem bowtie2 star samtools
mamba activate rsem-env

rsem-calculate-expression --version

Alignment backend

RSEM can work with BAM input or perform alignment through supported aligners depending on the workflow. This page focuses on the current SciBerg BAM-input examples.

Prepare RSEM reference sequences

RSEM requires prepared reference files before quantification. The current SciBerg page provides examples for both transcriptome and genome references.

Transcriptome reference

rsem-prepare-reference reference.fa reference.fa

Genome reference with GTF annotation

rsem-prepare-reference \
  --gtf path_to_gtf_file/genome_reference.gtf \
  path_to_fasta_file/genome_reference.fa \
  genome_reference.fa

Clearer project-folder example

mkdir -p RSEM_refs

rsem-prepare-reference \
  --gtf annotation.gtf \
  genome.fa \
  RSEM_refs/genome_reference

Calculate expression with RSEM

The current SciBerg page provides examples for strand-specific reads mapped to transcriptome and genome references using BAM input.

Strand-specific reads mapped to a transcriptome reference

rsem-calculate-expression \
  --bam \
  --no-bam-output \
  -p [insert number of threads] \
  --strand-specific \
  INPUT_over_reference.bam \
  path_to_RSEM_refs/RSEM_refs

Strand-specific reads mapped to a genome reference

rsem-calculate-expression \
  --bam \
  --no-bam-output \
  -p [insert number of threads] \
  --strand-specific \
  INPUT_over_reference.bam \
  path_to_RSEM_refs/RSEM_refs \
  outputfilename

Concrete example with 8 threads

rsem-calculate-expression \
  --bam \
  --no-bam-output \
  -p 8 \
  --strand-specific \
  sample.sorted.bam \
  RSEM_refs/genome_reference \
  sample

RSEM commonly writes output files such as sample.genes.results and sample.isoforms.results, which can be used to construct gene-level or transcript-level matrices.

Reusable RSEM matrix workflow skeleton

The script below runs RSEM on multiple BAM files and extracts a simple gene-level expected-count matrix. Adjust filenames and reference paths to your project.

#!/usr/bin/env bash
set -euo pipefail

THREADS=8
RSEM_REF="RSEM_refs/genome_reference"
BAM_DIR="bam"
OUTDIR="rsem_results"
MATRIX="rsem_gene_expected_counts.tsv"

mkdir -p "$OUTDIR"

# Run RSEM for each sorted BAM file
for bam in "$BAM_DIR"/*.sorted.bam; do
  sample=$(basename "$bam" .sorted.bam)

  rsem-calculate-expression \
    --bam \
    --no-bam-output \
    -p "$THREADS" \
    --strand-specific \
    "$bam" \
    "$RSEM_REF" \
    "$OUTDIR/$sample"
done

# Build a simple gene-level expected-count matrix
# Column names are derived from sample filenames.
first_file=$(ls "$OUTDIR"/*.genes.results | head -1)

awk 'NR>1 {print $1}' "$first_file" > "$OUTDIR/gene_ids.tmp"

for result in "$OUTDIR"/*.genes.results; do
  sample=$(basename "$result" .genes.results)
  awk -v s="$sample" 'BEGIN{print s} NR>1 {print $5}' "$result" \
    > "$OUTDIR/${sample}.expected_count.tmp"
done

paste "$OUTDIR/gene_ids.tmp" "$OUTDIR"/*.expected_count.tmp > "$MATRIX"

rm "$OUTDIR"/*.tmp

echo "Done. Gene expected-count matrix: $MATRIX"

Matrix-building caution

Matrix construction should preserve consistent gene or transcript order across samples. For production workflows, use a validated R, Python, or workflow-manager script that joins rows by gene or transcript ID.

Next steps after count generation

After generating count matrices, continue with quality control, normalization, exploratory analysis, and statistical testing.

  1. Check total assigned counts per sample and compare with mapping statistics.
  2. Verify that strand-specific settings match the library protocol.
  3. Confirm that the reference FASTA, annotation GTF, and BAM files are from compatible genome or transcriptome versions.
  4. Generate gene-level or transcript-level matrices and inspect sample clustering.
  5. Use suitable downstream tools for normalization, differential expression, visualization, and biological interpretation.