GTF and GFF3 files from Ensembl, GENCODE, and related annotation databases provide genome coordinates for genes, transcripts, exons, and other features. Many downstream tools, however, work most conveniently with BED files. BED is especially useful for bedtools, genome-browser tracks, interval intersections, promoter and transcription start site (TSS) extraction, and custom genomic-region workflows.
This tutorial modernizes the original SciBerg GTF-to-BED commands and adds context around coordinate conventions, validation, and reproducibility.
Input: GTF annotation file, for example Homo_sapiens.GRCh38.93.gtf.
Conversion: Select gene or transcript features, extract coordinates, parse attributes, and convert start positions to BED convention.
Output: Sorted BED files for genes, transcripts, promoters, TSS regions, or other genome intervals.
GTF versus BED coordinate logic
GTF and BED files both describe genomic intervals, but their coordinate systems differ. This is the most important detail when converting between them.
GTF start and end: GTF coordinates are 1-based and inclusive at both ends.
BED start and end: BED uses a 0-based start and a half-open interval, so the end coordinate itself is not part of the feature.
Conversion rule: Subtract 1 from the GTF start coordinate to get the BED start coordinate. The end coordinate stays unchanged.
Sorting: Sort BED output by chromosome and numeric start coordinate before downstream use.
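The conversion rule can be checked on a single record. The sketch below is only an illustration: the file names example.gtf and example.bed are made up for this example, and the record is a one-line toy input rather than a full annotation.

```bash
# One illustrative GTF gene record (tab-delimited, 1-based inclusive coordinates).
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n' > example.gtf

# BED start = GTF start - 1; BED end = GTF end (unchanged).
awk 'BEGIN { FS = OFS = "\t" } $3 == "gene" { print $1, $4 - 1, $5 }' example.gtf > example.bed
cat example.bed

# The feature length is identical under both conventions:
#   GTF: end - start + 1      BED: end - start
awk 'BEGIN { FS = "\t" } { print $3 - $2 }' example.bed
```

The GTF interval 11869-14409 becomes the BED interval 11868-14409, and both describe the same 2541 bases.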
Minimal BED fields
The first three BED columns are chromosome, start, and end. Additional columns are often used for name, score, strand, gene ID, transcript ID, biotype, or gene symbol.
chrom chromStart chromEnd name score strand
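Some tools accept only the standard columns, so it can help to derive a strict six-column file from an extended one. The following is a minimal sketch; genes.extended.bed and genes.bed6.bed are hypothetical file names, and the record is a toy input.

```bash
# Toy extended BED record with three extra annotation columns (illustrative values).
printf '1\t11868\t14409\tENSG00000223972\t.\t+\tDDX11L1\thavana\ttranscribed_unprocessed_pseudogene\n' > genes.extended.bed

# Keep only the six standard columns: chrom, chromStart, chromEnd, name, score, strand.
cut -f1-6 genes.extended.bed > genes.bed6.bed

# Report the distinct column counts found in the file.
# A consistent six-column file prints a single line containing 6.
awk 'BEGIN { FS = "\t" } { print NF }' genes.bed6.bed | sort -u
```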
Generate a gene-level BED file
The original SciBerg command extracts gene rows from an Ensembl GTF file, keeps chromosome, start, end, strand, and attribute fields, then formats them into a sorted BED-like table.
# Gene-level BED from an Ensembl-style GTF file
grep -P "\tgene\t" Homo_sapiens.GRCh38.93.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$6,".",$4,$10,$12,$14 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.93.gene.bed
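With the Ensembl release 93 attribute order (gene_id, gene_version, gene_name, gene_source, gene_biotype), the nine output columns are chromosome, 0-based start, end, gene ID, a placeholder score, strand, gene name, gene source, and gene biotype.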
Attribute order warning
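This compact command assumes a particular order of attributes in the GTF file. It is useful for the original Ensembl v93-style example, but attribute positions can change across databases, releases, or custom annotations. Any record that lacks an optional attribute such as gene_name also shifts every later field two positions to the left, so the wrong values end up in the output columns.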
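Before relying on positional parsing, it may be worth checking how many distinct attribute layouts the gene rows actually use. The sketch below builds a two-record toy file (toy.gtf, with made-up gene IDs) to show the idea; on a real annotation the same awk program would be pointed at the downloaded GTF instead.

```bash
# Two toy gene records; the second one has no gene_name attribute.
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1"; gene_version "5"; gene_name "A"; gene_source "havana"; gene_biotype "x";\n' > toy.gtf
printf '1\thavana\tgene\t20000\t21000\t.\t-\t.\tgene_id "G2"; gene_version "1"; gene_source "havana"; gene_biotype "y";\n' >> toy.gtf

# Print the ordered list of attribute keys for each gene row,
# then count how often each layout occurs.
awk 'BEGIN { FS = "\t" }
$3 == "gene" {
    n = split($9, attrs, ";")
    keys = ""
    for (i = 1; i <= n; i++) {
        gsub(/^ +| +$/, "", attrs[i])
        if (attrs[i] == "") continue
        split(attrs[i], kv, " ")
        keys = (keys == "" ? kv[1] : keys "," kv[1])
    }
    print keys
}' toy.gtf | sort | uniq -c | sort -k1,1nr > attribute_layouts.txt
cat attribute_layouts.txt
```

If more than one layout is reported, the positional command will misplace values for some records, and name-based parsing is the safer choice.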
Generate a transcript-level BED file
The original SciBerg command extracts transcript rows and creates a transcript-level BED-like file.
# Transcript-level BED from an Ensembl-style GTF file
grep -P "\ttranscript\t" Homo_sapiens.GRCh38.93.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$10,".",$4,$14,$16,$18 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.93.transcript.bed
Transcript-level BED files are useful for transcript interval analyses, overlap checks, and extracting transcript-associated genomic regions.
A more robust AWK parser for GTF attributes
For production workflows, it is safer to parse attributes by name rather than by their position after splitting the attribute field. The example below extracts gene intervals and named attributes such as gene_id, gene_name, and gene_biotype.
awk 'BEGIN { FS = OFS = "\t" }  # split on tabs only, so column 9 holds the whole attribute string
$3 == "gene" {
gene_id = gene_name = gene_biotype = "NA"
n = split($9, attrs, ";")
for (i = 1; i <= n; i++) {
gsub(/^ +| +$/, "", attrs[i])
split(attrs[i], kv, " ")
key = kv[1]
value = kv[2]
gsub(/"/, "", value)
if (key == "gene_id") gene_id = value
if (key == "gene_name") gene_name = value
if (key == "gene_biotype" || key == "gene_type") gene_biotype = value
}
print $1, $4 - 1, $5, gene_id, ".", $7, gene_name, gene_biotype
}' Homo_sapiens.GRCh38.93.gtf | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.93.gene.robust.bed
Robust transcript-level parser
awk 'BEGIN { FS = OFS = "\t" }  # split on tabs only, so column 9 holds the whole attribute string
$3 == "transcript" {
gene_id = transcript_id = gene_name = transcript_biotype = "NA"
n = split($9, attrs, ";")
for (i = 1; i <= n; i++) {
gsub(/^ +| +$/, "", attrs[i])
split(attrs[i], kv, " ")
key = kv[1]
value = kv[2]
gsub(/"/, "", value)
if (key == "gene_id") gene_id = value
if (key == "transcript_id") transcript_id = value
if (key == "gene_name") gene_name = value
if (key == "transcript_biotype" || key == "transcript_type") transcript_biotype = value
}
print $1, $4 - 1, $5, transcript_id, ".", $7, gene_id, gene_name, transcript_biotype
}' Homo_sapiens.GRCh38.93.gtf | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.93.transcript.robust.bed
Validate the BED files
After conversion, verify record counts, inspect the first lines, and check whether start coordinates are valid.
# Count genes and transcripts in the original GTF
grep -P "\tgene\t" Homo_sapiens.GRCh38.93.gtf | wc -l
grep -P "\ttranscript\t" Homo_sapiens.GRCh38.93.gtf | wc -l
# Count output BED records
wc -l Homo_sapiens.GRCh38.93.gene.bed
wc -l Homo_sapiens.GRCh38.93.transcript.bed
# Inspect output
head Homo_sapiens.GRCh38.93.gene.bed
head Homo_sapiens.GRCh38.93.transcript.bed
# Check for negative starts, which should not occur in valid BED files
awk '$2 < 0' Homo_sapiens.GRCh38.93.gene.bed | head
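Beyond negative starts, a few more structural checks are cheap to run. The sketch below assumes a six-column BED file and uses a toy input (check.bed, a hypothetical name) with one deliberately broken record; for features converted from GTF, the start should always be strictly smaller than the end.

```bash
# Toy BED with one deliberately broken record (start >= end) for illustration.
printf '1\t11868\t14409\tG1\t.\t+\n1\t500\t400\tG2\t.\t-\n' > check.bed

# Flag records where the start is negative, start >= end,
# or the strand is not one of "+", "-", or ".".
awk 'BEGIN { FS = OFS = "\t" }
$2 < 0 || $2 >= $3 || ($6 != "+" && $6 != "-" && $6 != ".") {
    print "line " NR, $0
    bad++
}
END { print "invalid records: " (bad + 0) }' check.bed > check.report.txt
cat check.report.txt
```

A clean file produces only the final summary line with a count of 0.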
Sort order and chromosome naming
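Some tools require consistent chromosome naming, for example 1 versus chr1. Make sure your BED files, genome FASTA, BAM files, and other interval files use compatible chromosome names. Sort order matters as well: sort -k1,1 -k2,2n orders chromosome names as text (1, 10, 11, ..., 2), and the exact order depends on the locale, so setting LC_ALL=C before sorting keeps the result reproducible across machines.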
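One common case is converting Ensembl-style names to UCSC-style names. The sketch below is an assumption-laden example for human primary chromosomes only: it maps 1-22, X, and Y to chr-prefixed names and MT to chrM, and it sets aside everything else because scaffold names differ between the two conventions and need a lookup table. All file names are hypothetical.

```bash
# Toy Ensembl-style BED: a primary chromosome, the mitochondrion, and a scaffold.
printf '1\t100\t200\tG1\t.\t+\nMT\t10\t50\tG2\t.\t+\nKI270728.1\t5\t9\tG3\t.\t-\n' > ensembl_style.bed

awk 'BEGIN { FS = OFS = "\t" }
# Primary chromosomes: add the chr prefix.
$1 ~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/ { $1 = "chr" $1; print; next }
# Mitochondrial genome: MT becomes chrM.
$1 == "MT" { $1 = "chrM"; print; next }
# Anything else is written to a side file instead of being renamed by guesswork.
{ print > "skipped_contigs.bed" }' ensembl_style.bed |
    LC_ALL=C sort -k1,1 -k2,2n > ucsc_style.bed
cat ucsc_style.bed
```

Whichever direction the renaming goes, the same convention has to be applied to every file that takes part in an intersection.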
Reusable conversion script
The script below creates both gene-level and transcript-level BED files using robust attribute parsing and writes a small summary file.
#!/usr/bin/env bash
set -euo pipefail
GTF="Homo_sapiens.GRCh38.93.gtf"
PREFIX="Homo_sapiens.GRCh38.93"
OUTDIR="bed"
LOG="gtf_to_bed_summary.txt"
mkdir -p "$OUTDIR"
awk 'BEGIN { FS = OFS = "\t" }  # split on tabs only, so column 9 holds the whole attribute string
$3 == "gene" {
gene_id = gene_name = gene_biotype = "NA"
n = split($9, attrs, ";")
for (i = 1; i <= n; i++) {
gsub(/^ +| +$/, "", attrs[i])
split(attrs[i], kv, " ")
key = kv[1]
value = kv[2]
gsub(/"/, "", value)
if (key == "gene_id") gene_id = value
if (key == "gene_name") gene_name = value
if (key == "gene_biotype" || key == "gene_type") gene_biotype = value
}
print $1, $4 - 1, $5, gene_id, ".", $7, gene_name, gene_biotype
}' "$GTF" | sort -k1,1 -k2,2n > "$OUTDIR/${PREFIX}.genes.bed"
awk 'BEGIN { FS = OFS = "\t" }  # split on tabs only, so column 9 holds the whole attribute string
$3 == "transcript" {
gene_id = transcript_id = gene_name = transcript_biotype = "NA"
n = split($9, attrs, ";")
for (i = 1; i <= n; i++) {
gsub(/^ +| +$/, "", attrs[i])
split(attrs[i], kv, " ")
key = kv[1]
value = kv[2]
gsub(/"/, "", value)
if (key == "gene_id") gene_id = value
if (key == "transcript_id") transcript_id = value
if (key == "gene_name") gene_name = value
if (key == "transcript_biotype" || key == "transcript_type") transcript_biotype = value
}
print $1, $4 - 1, $5, transcript_id, ".", $7, gene_id, gene_name, transcript_biotype
}' "$GTF" | sort -k1,1 -k2,2n > "$OUTDIR/${PREFIX}.transcripts.bed"
{
echo "Input GTF: $GTF"
echo "Gene BED: $OUTDIR/${PREFIX}.genes.bed"
echo "Transcript BED: $OUTDIR/${PREFIX}.transcripts.bed"
echo "Gene records: $(wc -l < "$OUTDIR/${PREFIX}.genes.bed")"
echo "Transcript records: $(wc -l < "$OUTDIR/${PREFIX}.transcripts.bed")"
} > "$LOG"
echo "Done. Summary written to $LOG"
Next steps after GTF-to-BED conversion
Once the BED files are created, they can be used for genome interval operations, promoter extraction, TSS extraction, overlap analysis, and browser visualization.
Validate the first few lines and total record counts.
Confirm coordinate conventions and chromosome naming compatibility.
Sort BED files before using tools such as bedtools.
Use the gene or transcript BED files to derive promoters, TSS windows, gene bodies, exons, or custom regions.
Document the annotation source, release number, command, and output BED schema.
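As one example of deriving regions, the sketch below computes strand-aware 1-bp TSS intervals and promoter windows from a six-column gene BED file. The input is a two-record toy file with made-up gene IDs, the file names are hypothetical, and the window sizes UP=1000 and DOWN=100 are arbitrary illustrative values rather than a recommendation. Window ends are not clipped to chromosome lengths here, which would need a chromosome-sizes file.

```bash
# Toy gene BED: one plus-strand and one minus-strand gene.
printf '1\t11868\t14409\tG1\t.\t+\n1\t14403\t29570\tG2\t.\t-\n' > genes.bed

# 1-bp TSS intervals: first base of the gene on "+", last base on "-".
awk 'BEGIN { FS = OFS = "\t" }
$6 == "+" { print $1, $2, $2 + 1, $4, $5, $6 }
$6 == "-" { print $1, $3 - 1, $3, $4, $5, $6 }' genes.bed |
    sort -k1,1 -k2,2n > genes.tss.bed

# Promoter windows: UP bases upstream and DOWN bases downstream of the TSS,
# measured in the direction of transcription. Starts are clamped at 0.
awk -v UP=1000 -v DOWN=100 'BEGIN { FS = OFS = "\t" }
$6 == "+" { s = $2 - UP;   e = $2 + DOWN }
$6 == "-" { s = $3 - DOWN; e = $3 + UP }
$6 == "+" || $6 == "-" { if (s < 0) s = 0; print $1, s, e, $4, $5, $6 }' genes.bed |
    sort -k1,1 -k2,2n > genes.promoters.bed

cat genes.tss.bed genes.promoters.bed
```

Records without a defined strand are skipped, since upstream and downstream are not meaningful for them.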