Bioinformatics Scripts

Extracting promoter and TSS sequences from genome FASTA files.

This tutorial shows how to extract promoter and transcription start site regions from a reference genome using a genome FASTA file, a gene BED file, samtools faidx, and bedtools. The examples generate 2 kb upstream promoter FASTAs and ±1 kb TSS-region FASTAs.

Overview

Promoter and TSS-region sequence extraction is useful for motif analysis, transcription-factor binding studies, regulatory genomics, machine-learning feature generation, and sequence-based interpretation of gene regulation.

The workflow requires a genome FASTA file, a corresponding gene BED file, a chromosome-size file, and interval operations from bedtools. A FASTA index generated by samtools faidx is used to derive chromosome lengths.

Input Genome FASTA and a matching gene BED file created from the same genome assembly.
Intervals Use strand-aware bedtools flank and bedtools slop to define promoter and TSS windows.
Output BED and FASTA files containing promoter or TSS-region sequences.

Install required tools

The original SciBerg page installs bedtools from source. For modern reproducible workflows, Conda or Mamba installation is usually easier, but a source-install example is also included.

Recommended: install with Mamba or Conda

mamba create -n genome-intervals -c conda-forge -c bioconda bedtools samtools
mamba activate genome-intervals

bedtools --version
samtools --version

Source-install example for bedtools

wget https://github.com/arq5x/bedtools2/archive/v2.28.0.zip
unzip bedtools2-2.28.0.zip
cd bedtools2-2.28.0
make

# Copy bedtools binaries to your PATH, if permitted
cd bin
sudo cp bedtools* /usr/local/bin/

Tip for shared servers

Avoid sudo on institutional or HPC systems unless your administrator recommends it. A personal Conda/Mamba environment is usually safer and easier to reproduce.

Prepare genome FASTA and chromosome-size file

The genome FASTA and gene BED file must use the same genome assembly and compatible chromosome names, for example GRCh38 chromosome names without chr prefixes or UCSC-style names with chr prefixes.

Create a FASTA index

samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa

Create a chromosome-size file

cut -f 1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > chrom.sizes

The chrom.sizes file is required by bedtools flank and bedtools slop so intervals do not extend beyond chromosome boundaries.

samtools faidx Indexes the genome FASTA and creates a .fai file with sequence names and lengths.
chrom.sizes Two-column file containing chromosome or contig names and sequence lengths.
gene.bed BED file containing gene intervals and strand information.
bedtools getfasta Extracts DNA sequences from the genome FASTA using BED intervals.

Extract 2 kb upstream promoter sequences

The original SciBerg command generates a FASTA file with 2,000 nucleotides upstream of the transcription start site for every gene using a corresponding gene BED file.

# Index the genome FASTA
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa

# Create chromosome-size file
cut -f 1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > chrom.sizes

# Generate 2 kb strand-aware upstream intervals
bedtools flank \
  -i Homo_sapiens.GRCh38.93.gene.bed \
  -g chrom.sizes \
  -l 2000 \
  -r 0 \
  -s \
  > Homo_sapiens.GRCh38.93.gene_2000up.bed

# Extract promoter sequences
bedtools getfasta \
  -fi Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  -bed Homo_sapiens.GRCh38.93.gene_2000up.bed \
  -fo Homo_sapiens.GRCh38.93.gene_2000up.fa

Why use -s?

The -s option makes upstream and downstream definitions strand-aware. For genes on the negative strand, upstream sequence lies in the opposite genomic direction compared with positive-strand genes.

Extract ±1 kb TSS-region sequences

The original SciBerg command generates a FASTA file with 1,000 nucleotides upstream and 1,000 nucleotides downstream of the transcription start site for every gene.

# Index the genome FASTA
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa

# Create chromosome-size file
cut -f 1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > chrom.sizes

# First generate 1 kb upstream regions
bedtools flank \
  -i Homo_sapiens.GRCh38.93.gene.bed \
  -g chrom.sizes \
  -l 1000 \
  -r 0 \
  -s \
  > temp.bed

# Extend each interval by 1 kb downstream to create ±1 kb around TSS
bedtools slop \
  -i temp.bed \
  -g chrom.sizes \
  -l 0 \
  -r 1000 \
  -s \
  > Homo_sapiens.GRCh38.93.gene_TSSs.bed

# Extract TSS-region sequences
bedtools getfasta \
  -fi Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  -bed Homo_sapiens.GRCh38.93.gene_TSSs.bed \
  -fo Homo_sapiens.GRCh38.93.gene_TSSs.fa

This approach first creates upstream windows and then extends them to include downstream sequence, preserving strand-aware interpretation.

Validate promoter and TSS outputs

After interval generation and FASTA extraction, validate that the output files contain the expected number of records and interval lengths.

# Compare number of genes with generated promoter intervals
wc -l Homo_sapiens.GRCh38.93.gene.bed
wc -l Homo_sapiens.GRCh38.93.gene_2000up.bed

# Count FASTA records
grep -c "^>" Homo_sapiens.GRCh38.93.gene_2000up.fa
grep -c "^>" Homo_sapiens.GRCh38.93.gene_TSSs.fa

# Inspect interval lengths
awk '{print $3-$2}' Homo_sapiens.GRCh38.93.gene_2000up.bed | sort -n | uniq -c | head
awk '{print $3-$2}' Homo_sapiens.GRCh38.93.gene_TSSs.bed | sort -n | uniq -c | head

# Inspect FASTA headers
grep "^>" Homo_sapiens.GRCh38.93.gene_2000up.fa | head

Boundary effects

Genes near chromosome starts or ends may produce shorter regions because bedtools prevents intervals from extending outside chromosome boundaries when a genome-size file is supplied.

Reusable extraction script

The script below creates both 2 kb upstream promoter sequences and ±1 kb TSS-region sequences from a genome FASTA and gene BED file.

#!/usr/bin/env bash
set -euo pipefail

GENOME="Homo_sapiens.GRCh38.dna.primary_assembly.fa"
GENE_BED="Homo_sapiens.GRCh38.93.gene.bed"
PREFIX="Homo_sapiens.GRCh38.93"
OUTDIR="promoter_tss_sequences"
CHROMSIZES="chrom.sizes"

mkdir -p "$OUTDIR"

# Create genome FASTA index and chromosome-size file
samtools faidx "$GENOME"
cut -f 1,2 "${GENOME}.fai" > "$CHROMSIZES"

# 2 kb upstream promoters
bedtools flank \
  -i "$GENE_BED" \
  -g "$CHROMSIZES" \
  -l 2000 \
  -r 0 \
  -s \
  > "$OUTDIR/${PREFIX}.gene_2000up.bed"

bedtools getfasta \
  -fi "$GENOME" \
  -bed "$OUTDIR/${PREFIX}.gene_2000up.bed" \
  -fo "$OUTDIR/${PREFIX}.gene_2000up.fa"

# ±1 kb TSS regions
bedtools flank \
  -i "$GENE_BED" \
  -g "$CHROMSIZES" \
  -l 1000 \
  -r 0 \
  -s \
  > "$OUTDIR/${PREFIX}.gene_1000up.temp.bed"

bedtools slop \
  -i "$OUTDIR/${PREFIX}.gene_1000up.temp.bed" \
  -g "$CHROMSIZES" \
  -l 0 \
  -r 1000 \
  -s \
  > "$OUTDIR/${PREFIX}.gene_TSSs.bed"

bedtools getfasta \
  -fi "$GENOME" \
  -bed "$OUTDIR/${PREFIX}.gene_TSSs.bed" \
  -fo "$OUTDIR/${PREFIX}.gene_TSSs.fa"

# Summary
{
  echo "Genome FASTA: $GENOME"
  echo "Gene BED: $GENE_BED"
  echo "Promoter BED records: $(wc -l < "$OUTDIR/${PREFIX}.gene_2000up.bed")"
  echo "Promoter FASTA records: $(grep -c '^>' "$OUTDIR/${PREFIX}.gene_2000up.fa")"
  echo "TSS BED records: $(wc -l < "$OUTDIR/${PREFIX}.gene_TSSs.bed")"
  echo "TSS FASTA records: $(grep -c '^>' "$OUTDIR/${PREFIX}.gene_TSSs.fa")"
} > "$OUTDIR/${PREFIX}.promoter_tss_summary.txt"

echo "Done. Output written to $OUTDIR"

Next steps after extracting promoter and TSS sequences

After promoter or TSS sequences are extracted, they can be used for motif discovery, machine learning, regulatory annotation, transcription-factor binding analysis, or custom reference generation.

  1. Validate interval counts and FASTA record counts.
  2. Check coordinate and strand conventions carefully.
  3. Ensure that chromosome names match between genome FASTA, BED files, and other annotations.
  4. Run motif discovery, sequence composition analysis, regulatory annotation, or machine-learning workflows.
  5. Document genome assembly, annotation source, BED-generation command, promoter/TSS window sizes, and tool versions.