Extracting promoter and TSS sequences from genome FASTA files.
This tutorial shows how to extract promoter and transcription start site (TSS) regions from a reference genome using a genome FASTA file, a gene BED file, samtools faidx, and bedtools. The examples generate 2 kb upstream promoter FASTAs and ±1 kb TSS-region FASTAs.
Promoter and TSS-region sequence extraction is useful for motif analysis, transcription-factor binding studies, regulatory genomics, machine-learning feature generation, and sequence-based interpretation of gene regulation.
The workflow requires a genome FASTA file, a corresponding gene BED file, a chromosome-size file, and interval operations from bedtools. A FASTA index generated by samtools faidx is used to derive chromosome lengths.
Input: Genome FASTA and a matching gene BED file created from the same genome assembly.
Intervals: Use strand-aware bedtools flank and bedtools slop to define promoter and TSS windows.
Output: BED and FASTA files containing promoter or TSS-region sequences.
Install required tools
The original SciBerg page installs bedtools from source, and that example is reproduced below. For modern reproducible workflows, installing bedtools and samtools through Conda or Mamba is usually easier.
wget https://github.com/arq5x/bedtools2/archive/v2.28.0.zip
unzip v2.28.0.zip
cd bedtools2-2.28.0
make
# Copy bedtools binaries to your PATH, if permitted
cd bin
sudo cp bedtools* /usr/local/bin/
Tip for shared servers
Avoid sudo on institutional or HPC systems unless your administrator recommends it. A personal Conda/Mamba environment is usually safer and easier to reproduce.
Prepare genome FASTA and chromosome-size file
The genome FASTA and gene BED file must come from the same genome assembly and use the same chromosome naming scheme: either Ensembl-style GRCh38 names without a chr prefix (1, 2, X) or UCSC-style names with the prefix (chr1, chr2, chrX), but not a mixture of the two.
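A naming mismatch is easy to catch before any interval operation is run. The sketch below lists chromosome names that occur in the BED file but have no matching FASTA header. The function name and the example file name genome.fa are assumptions made for illustration; only awk and sort are used.

```shell
# List chromosome names used in a BED file that have no matching FASTA header.
# Function name and file names are illustrative, not part of the original tutorial.
bed_names_missing_from_fasta() {
    fasta=$1
    bed=$2
    awk '
        # First file (FASTA): remember the first word of each header line.
        NR == FNR {
            if ($0 ~ /^>/) { name = $1; sub(/^>/, "", name); known[name] = 1 }
            next
        }
        # Second file (BED): report column 1 when it was not seen in the FASTA.
        NF && !($1 in known) { print $1 }
    ' "$fasta" "$bed" | sort -u
}

# Example call (genome.fa is a placeholder name):
#   bed_names_missing_from_fasta genome.fa Homo_sapiens.GRCh38.93.gene.bed
```

No output means every chromosome name in the BED file was found in the FASTA. Any name that is printed points to a prefix or assembly mismatch that should be resolved first.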
The chrom.sizes file is required by bedtools flank and bedtools slop so intervals do not extend beyond chromosome boundaries.
samtools faidx: Indexes the genome FASTA and creates a .fai file with sequence names and lengths.
chrom.sizes: Two-column file containing chromosome or contig names and sequence lengths.
gene.bed: BED file containing gene intervals and strand information.
bedtools getfasta: Extracts DNA sequences from the genome FASTA using BED intervals.
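The chrom.sizes file can be derived directly from the FASTA index, because the first two columns of a .fai file are the sequence name and its length. The following is a minimal sketch, assuming samtools is on the PATH; the function name and the file names in the example call are placeholders.

```shell
# Build a two-column chrom.sizes file from a genome FASTA.
# Function and file names are illustrative; requires samtools on PATH.
make_chrom_sizes() {
    genome=$1
    sizes=$2
    # samtools faidx writes the index next to the FASTA as <genome>.fai
    samtools faidx "$genome" || return 1
    # Columns 1 and 2 of the .fai are the sequence name and its length.
    cut -f1,2 "$genome.fai" > "$sizes"
}

# Example call (placeholder file names):
#   make_chrom_sizes genome.fa chrom.sizes
```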
Extract 2 kb upstream promoter sequences
The original SciBerg command generates a FASTA file with 2,000 nucleotides upstream of the transcription start site for every gene using a corresponding gene BED file.
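The step can be sketched with bedtools flank followed by bedtools getfasta. This is a reconstruction under stated assumptions rather than the verbatim SciBerg command: the function name, argument order, and output prefix are hypothetical, and the gene BED file is assumed to have at least six columns so that the strand is available.

```shell
# Create strand-aware 2 kb upstream windows and extract their sequences.
# Function name and argument order are illustrative; requires bedtools on PATH.
extract_upstream() {
    genome=$1; genes=$2; sizes=$3; prefix=$4
    # -l 2000 -r 0 with -s: 2,000 bp on the upstream side of each gene, by strand.
    bedtools flank -i "$genes" -g "$sizes" -l 2000 -r 0 -s \
        > "${prefix}_2000up.bed" || return 1
    # -s reverse-complements minus-strand intervals.
    # -name puts the BED name field in the FASTA header
    # (the exact header format differs between bedtools versions).
    bedtools getfasta -fi "$genome" -bed "${prefix}_2000up.bed" \
        -s -name -fo "${prefix}_2000up.fa"
}

# Example call (genome.fa and chrom.sizes are placeholder names):
#   extract_upstream genome.fa Homo_sapiens.GRCh38.93.gene.bed chrom.sizes \
#       Homo_sapiens.GRCh38.93.gene
```

With the prefix shown in the example call, the outputs are named Homo_sapiens.GRCh38.93.gene_2000up.bed and Homo_sapiens.GRCh38.93.gene_2000up.fa, matching the file names used in the validation commands.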
The -s option makes upstream and downstream definitions strand-aware. For genes on the negative strand, upstream sequence lies in the opposite genomic direction compared with positive-strand genes.
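The coordinate arithmetic behind a strand-aware upstream window can be illustrated with a few lines of awk. This is a teaching sketch only, not a replacement for bedtools flank: the function name is hypothetical, the input is assumed to be a tab-separated BED file with strand in column 6, and clipping at chromosome ends is deliberately left out.

```shell
# Illustration of strand-aware upstream arithmetic on BED6 input.
# Not a substitute for bedtools flank: chromosome-end clipping is omitted.
upstream_window() {
    awk -v w="${2:-2000}" 'BEGIN { FS = OFS = "\t" }
        # Plus strand: upstream lies before the start coordinate (clipped at 0).
        $6 == "+" { s = $2 - w; if (s < 0) s = 0; print $1, s, $2, $4, $5, $6 }
        # Minus strand: upstream lies after the end coordinate.
        $6 == "-" { print $1, $3, $3 + w, $4, $5, $6 }
    ' "$1"
}

# Example: a plus-strand gene at chr1:5000-6000 yields chr1 3000 5000,
# while a minus-strand gene at the same coordinates yields chr1 6000 8000.
#   upstream_window toy.bed 2000
```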
Extract ±1 kb TSS-region sequences
The original SciBerg command generates a FASTA file with 1,000 nucleotides upstream and 1,000 nucleotides downstream of the transcription start site for every gene.
This approach first creates 1 kb upstream windows with bedtools flank and then extends them by 1 kb on the downstream side with bedtools slop. Using -s in both steps keeps each window oriented by the strand of its gene.
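The two-step construction can be sketched as a flank piped into a slop, followed by sequence extraction. As before, this is a reconstruction under assumptions and not the verbatim SciBerg command; the function name, argument order, and prefix are hypothetical.

```shell
# Build +/-1 kb windows around each TSS and extract their sequences.
# Function name and argument order are illustrative; requires bedtools on PATH.
tss_region() {
    genome=$1; genes=$2; sizes=$3; prefix=$4
    # Step 1: 1 kb upstream flank (strand-aware).
    # Step 2: extend that window 1 kb on its downstream side (-r with -s).
    bedtools flank -i "$genes" -g "$sizes" -l 1000 -r 0 -s |
        bedtools slop -i stdin -g "$sizes" -l 0 -r 1000 -s \
        > "${prefix}_TSSs.bed" || return 1
    # Extract strand-aware sequences for the combined window.
    bedtools getfasta -fi "$genome" -bed "${prefix}_TSSs.bed" \
        -s -name -fo "${prefix}_TSSs.fa"
}

# Example call (genome.fa and chrom.sizes are placeholder names):
#   tss_region genome.fa Homo_sapiens.GRCh38.93.gene.bed chrom.sizes \
#       Homo_sapiens.GRCh38.93.gene
```

Away from chromosome ends, each resulting interval should be 2,000 bp long: 1,000 bp upstream of the TSS plus 1,000 bp downstream.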
Validate promoter and TSS outputs
After interval generation and FASTA extraction, validate that the output files contain the expected number of records and interval lengths.
# Compare number of genes with generated promoter intervals
wc -l Homo_sapiens.GRCh38.93.gene.bed
wc -l Homo_sapiens.GRCh38.93.gene_2000up.bed
# Count FASTA records
grep -c "^>" Homo_sapiens.GRCh38.93.gene_2000up.fa
grep -c "^>" Homo_sapiens.GRCh38.93.gene_TSSs.fa
# Inspect interval lengths
awk '{print $3-$2}' Homo_sapiens.GRCh38.93.gene_2000up.bed | sort -n | uniq -c | head
awk '{print $3-$2}' Homo_sapiens.GRCh38.93.gene_TSSs.bed | sort -n | uniq -c | head
# Inspect FASTA headers
grep "^>" Homo_sapiens.GRCh38.93.gene_2000up.fa | head
Boundary effects
Genes near chromosome starts or ends may produce shorter regions because bedtools prevents intervals from extending outside chromosome boundaries when a genome-size file is supplied.
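Truncated windows can be listed explicitly instead of being read off a length histogram. The sketch below prints every interval whose length differs from the expected window size and appends the observed length as an extra column. The function name is hypothetical.

```shell
# Report BED intervals whose length differs from the expected window size.
# The observed length is appended as a final tab-separated column.
short_intervals() {
    awk -v expected="$2" 'BEGIN { FS = "\t" }
        ($3 - $2) != expected { print $0 "\t" ($3 - $2) }
    ' "$1"
}

# Example calls:
#   short_intervals Homo_sapiens.GRCh38.93.gene_2000up.bed 2000
#   short_intervals Homo_sapiens.GRCh38.93.gene_TSSs.bed 2000
```

Intervals reported here are usually genes close to a chromosome or contig boundary. Whether to keep, pad, or drop them depends on the downstream analysis.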
Reusable extraction script
The script below creates both 2 kb upstream promoter sequences and ±1 kb TSS-region sequences from a genome FASTA and gene BED file.
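A minimal sketch of such a script follows. It is an illustration assembled from the steps described in this tutorial, not the original SciBerg script: the script name, function name, argument order, and the intermediate file name are assumptions. It requires samtools and bedtools on the PATH, and it writes the 1 kb flank to a temporary file instead of piping it so that a failure in either step stops the run.

```shell
#!/bin/sh
# Sketch of a reusable wrapper; names and argument order are assumptions.
# Usage: sh extract_promoter_tss.sh genome.fa gene.bed out_prefix
extract_promoter_tss() {
    genome=$1; genes=$2; prefix=$3
    sizes="${prefix}.chrom.sizes"

    # Chromosome sizes from the FASTA index (.fai columns 1 and 2).
    samtools faidx "$genome" || return 1
    cut -f1,2 "$genome.fai" > "$sizes" || return 1

    # 2 kb upstream promoter windows and their sequences.
    bedtools flank -i "$genes" -g "$sizes" -l 2000 -r 0 -s \
        > "${prefix}_2000up.bed" || return 1
    bedtools getfasta -fi "$genome" -bed "${prefix}_2000up.bed" \
        -s -name -fo "${prefix}_2000up.fa" || return 1

    # +/-1 kb TSS windows: 1 kb upstream flank, then 1 kb downstream extension.
    bedtools flank -i "$genes" -g "$sizes" -l 1000 -r 0 -s \
        > "${prefix}_1000up.tmp.bed" || return 1
    bedtools slop -i "${prefix}_1000up.tmp.bed" -g "$sizes" -l 0 -r 1000 -s \
        > "${prefix}_TSSs.bed" || return 1
    rm -f "${prefix}_1000up.tmp.bed"
    bedtools getfasta -fi "$genome" -bed "${prefix}_TSSs.bed" \
        -s -name -fo "${prefix}_TSSs.fa" || return 1
}

# Run only when the three expected arguments are given.
if [ "$#" -eq 3 ]; then
    extract_promoter_tss "$1" "$2" "$3"
fi
```

Called with the prefix Homo_sapiens.GRCh38.93.gene, the script produces the _2000up and _TSSs BED and FASTA files referenced in the validation section.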
Next steps after extracting promoter and TSS sequences
After promoter or TSS sequences are extracted, they can be used for motif discovery, machine learning, regulatory annotation, transcription-factor binding analysis, or custom reference generation.
Validate interval counts and FASTA record counts.
Check coordinate and strand conventions carefully.
Ensure that chromosome names match between genome FASTA, BED files, and other annotations.
Run motif discovery, sequence composition analysis, regulatory annotation, or machine-learning workflows.