Calculate raw and estimated read-count matrices with eXpress and RSEM.
This tutorial shows how to derive transcript-level count outputs from aligned BAM files using eXpress and RSEM. It covers installation, reference preparation, strand-specific counting, combining sample-level outputs into matrices, and practical quality-control steps before differential expression analysis.
After reads are aligned to a transcriptome or genome reference, the next step is often to calculate transcript-level or gene-level read-count tables. These tables can be used for quality control, sample comparison, normalization, clustering, and downstream differential expression analysis.
eXpress and RSEM are probabilistic quantification tools that estimate transcript abundance from sequencing reads. Their outputs are not always simple integer read counts, especially when reads map to multiple transcripts, but the resulting count-like values are widely used to construct expression matrices.
InputAligned BAM files, transcriptome FASTA references, genome FASTA references, and optionally GTF annotations.
ProcessingEstimate transcript or gene abundance using eXpress or RSEM with correct strand and reference settings.
OutputSample-level quantification files and combined count matrices for downstream analysis.
Terminology note
The current SciBerg resource uses the phrase “raw read counts.” For eXpress and RSEM, some output columns represent estimated or expected counts rather than literal per-feature integer counts. Choose the output column according to your downstream method.
Required inputs and assumptions
Before running quantification, confirm that the BAM file, reference FASTA, annotation file, and strand settings all match the alignment workflow.
Aligned BAM fileGenerated by Bowtie, Bowtie2, STAR, HISAT2, BWA, or another compatible aligner.
Reference FASTAThe same transcriptome or genome reference used to generate or interpret the alignments.
Annotation GTFRequired for genome-reference preparation in RSEM and useful for gene-level summarization.
Library strandednessForward-stranded, reverse-stranded, or unstranded settings must match the experimental protocol.
The original SciBerg page downloads the eXpress 1.5.1 Linux binary and copies the executable into the system path.
# Download and unpack eXpress
wget https://pachterlab.github.io/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz
tar -xvzf express-1.5.1-linux_x86_64.tgz
# Copy the binary to a location in your PATH
cd express-1.5.1-linux_x86_64
sudo cp express /usr/local/bin/
# Check installation
express --version
Shared-server tip
On institutional or HPC systems, avoid sudo unless your administrator recommends it. You can copy express into ~/bin and add that folder to your PATH.
Calculate counts with eXpress
eXpress can quantify reads from BAM alignments against a corresponding transcriptome reference FASTA file. The current SciBerg page provides strand-specific examples for forward and reverse single-end alignments.
# Example unstranded run
express \
--no-bias-correct \
reference.fa \
INPUT_over_reference.bam
mv results.xprs INPUT_over_reference.xprs
Important eXpress notes from the current SciBerg page
eXpress works with sorted BAM files generated by most aligners. The current SciBerg page notes that eXpress will not work on sorted BAM files generated by Bowtie or Bowtie2 when the -k option was used during mapping.
Combine eXpress outputs into a count matrix
The original SciBerg workflow sorts each .xprs file by transcript ID, extracts selected columns, pastes sample files together, and keeps one identifier block plus count columns.
For many samples, it is often easier to script matrix generation rather than manually specifying every column.
# Example: extract target ID and estimated counts from each .xprs file
# Adjust column selection if your downstream method needs a different column.
for xprs in *_over_reference.xprs; do
sample="${xprs%_over_reference.xprs}"
sort -k2,2 "$xprs" | awk -v s="$sample" 'BEGIN{OFS="\t"} NR==1{next} {print $2,$5}' \
> "${sample}.counts.tsv"
done
# Use a dedicated R/Python script for robust joining across many samples.
Install RSEM
The current SciBerg page downloads RSEM source code, compiles it, and installs the executables.
Source-install example from the current SciBerg workflow
wget https://github.com/deweylab/RSEM/archive/v1.3.1.tar.gz
tar -xvzf v1.3.1.tar.gz
cd RSEM-1.3.1
make
sudo make install
rsem-calculate-expression --version
RSEM can work with BAM input or perform alignment through supported aligners depending on the workflow. This page focuses on the current SciBerg BAM-input examples.
Prepare RSEM reference sequences
RSEM requires prepared reference files before quantification. The current SciBerg page provides examples for both transcriptome and genome references.
RSEM commonly writes output files such as sample.genes.results and sample.isoforms.results, which can be used to construct gene-level or transcript-level matrices.
Reusable RSEM matrix workflow skeleton
The script below runs RSEM on multiple BAM files and extracts a simple gene-level expected-count matrix. Adjust filenames and reference paths to your project.
#!/usr/bin/env bash
set -euo pipefail
THREADS=8
RSEM_REF="RSEM_refs/genome_reference"
BAM_DIR="bam"
OUTDIR="rsem_results"
MATRIX="rsem_gene_expected_counts.tsv"
mkdir -p "$OUTDIR"
# Run RSEM for each sorted BAM file
for bam in "$BAM_DIR"/*.sorted.bam; do
sample=$(basename "$bam" .sorted.bam)
rsem-calculate-expression \
--bam \
--no-bam-output \
-p "$THREADS" \
--strand-specific \
"$bam" \
"$RSEM_REF" \
"$OUTDIR/$sample"
done
# Build a simple gene-level expected-count matrix
# Column names are derived from sample filenames.
first_file=$(ls "$OUTDIR"/*.genes.results | head -1)
awk 'NR>1 {print $1}' "$first_file" > "$OUTDIR/gene_ids.tmp"
for result in "$OUTDIR"/*.genes.results; do
sample=$(basename "$result" .genes.results)
awk -v s="$sample" 'BEGIN{print s} NR>1 {print $5}' "$result" \
> "$OUTDIR/${sample}.expected_count.tmp"
done
paste "$OUTDIR/gene_ids.tmp" "$OUTDIR"/*.expected_count.tmp > "$MATRIX"
rm "$OUTDIR"/*.tmp
echo "Done. Gene expected-count matrix: $MATRIX"
Matrix-building caution
Matrix construction should preserve consistent gene or transcript order across samples. For production workflows, use a validated R, Python, or workflow-manager script that joins rows by gene or transcript ID.
Next steps after count generation
After generating count matrices, continue with quality control, normalization, exploratory analysis, and statistical testing.
Check total assigned counts per sample and compare with mapping statistics.
Verify that strand-specific settings match the library protocol.
Confirm that the reference FASTA, annotation GTF, and BAM files are from compatible genome or transcriptome versions.
Generate gene-level or transcript-level matrices and inspect sample clustering.
Use suitable downstream tools for normalization, differential expression, visualization, and biological interpretation.
Privacy noticeWe process contact-form data only to respond to your enquiry. Please review our Privacy Policy for details.