Preparing FASTA references for main human RNA classes.
This tutorial shows how to prepare class-specific FASTA reference files for mRNA, rRNA, snoRNA, scaRNA, snRNA, RN7S, miRNA, Y RNA, vault RNA, lincRNA, immunoglobulin, T-cell receptor, tRNA, and other RNA classes using GENCODE, Ensembl, UCSC RefSeq, and miRBase sources.
Class-specific FASTA references are useful when reads need to be mapped against separate RNA categories rather than one combined transcriptome. This is common in small RNA-seq, extracellular RNA analysis, total RNA-seq, contaminant screening, and workflows that quantify reads by RNA biotype.
The examples below modernize the original SciBerg commands while preserving the core workflows: download reference FASTA files, normalize FASTA headers so they can be matched more easily, extract RNA classes with faFilter, and document the resulting reference files.
GENCODE: Transcriptome FASTA with pipe-delimited headers that can be filtered by biotype and gene symbol patterns.
Ensembl: Separate cDNA and ncRNA FASTA files with descriptive headers and biotype fields.
UCSC / miRBase: RefSeq transcripts, tRNA genes, and mature or hairpin miRNA references for specialized analysis.
Version note
The original SciBerg page uses historical examples such as GENCODE release 28 and Ensembl release 93. For new projects, update URLs and file names to the latest release, then keep the release number in your methods and project documentation.
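One way to keep the release number in a single place is sketched below. It assumes that newer GENCODE releases follow the same directory and file naming as the release 28 URL used in this tutorial, which should be checked against the GENCODE download page. The names `GENCODE_RELEASE`, `gencode_url`, and `reference_versions.tsv` are hypothetical.

```shell
# Assumption: set the release once and reuse it in every command.
GENCODE_RELEASE="28"

# Assumption: newer releases keep the URL layout of the release 28 example.
gencode_url() {
    printf 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_%s/gencode.v%s.transcripts.fa.gz\n' "$1" "$1"
}

URL="$(gencode_url "$GENCODE_RELEASE")"
echo "GENCODE URL: $URL"

# Record source, release, URL, and date for the methods section.
# VERSION_LOG is a hypothetical file name.
VERSION_LOG="reference_versions.tsv"
printf 'source\trelease\turl\tdate\n' > "$VERSION_LOG"
printf 'GENCODE\t%s\t%s\t%s\n' "$GENCODE_RELEASE" "$URL" "$(date +%Y-%m-%d)" >> "$VERSION_LOG"
```

Passing `"$URL"` to wget in place of a hard-coded address means a release change only touches one line, and the log row documents which release was actually used.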
Setup and required tools
The examples use standard Linux tools plus UCSC faFilter. Install faFilter once and make sure it is available in your PATH.
# Create a local tools folder
mkdir -p ~/bin
# Download UCSC faFilter
wget -O ~/bin/faFilter \
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFilter
# Make the tool executable
chmod +x ~/bin/faFilter
# Add ~/bin to PATH if needed
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Check that faFilter is available
faFilter 2>&1 | head
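The same availability check can be applied to several tools at once. The helper below is a minimal sketch, not part of the original page; `require_tools` is a hypothetical function name, and the tool list is only an example.

```shell
# Return non-zero if any named tool is missing from PATH.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check standard tools used in this tutorial.
# Add faFilter, wget, and gunzip to the list as needed.
if require_tools sed awk grep; then
    echo "All required tools found"
fi
```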
Create a project folder for reference preparation and keep all intermediate files, ID lists, scripts, and logs together.
mkdir -p RNA_references/{raw,processed,logs,scripts}
cd RNA_references
Prepare RNA-class FASTAs from GENCODE
The original workflow downloads GENCODE human transcript reference sequences, replaces pipe symbols in FASTA headers with #, and filters records into RNA classes using faFilter.
Download and normalize FASTA headers
# Historical example from the original page: GENCODE v28, GRCh38.p12
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.transcripts.fa.gz
gunzip gencode.v28.transcripts.fa.gz
# Replace "|" symbols with "#" to simplify pattern matching
sed 's/|/#/g' gencode.v28.transcripts.fa > gencode.v28.transcripts_#.fa
mv gencode.v28.transcripts_#.fa gencode.v28.transcripts.fa
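Once headers are normalized, records can also be selected by an exact field value rather than a wildcard. The sketch below uses awk instead of faFilter and assumes that the transcript biotype is the eighth "#"-delimited header field, as in GENCODE transcript FASTA headers; confirm this with grep "^>" on your release. The function name `extract_by_biotype` and the toy records are hypothetical.

```shell
# Print FASTA records whose 8th "#"-delimited header field equals the biotype.
# Usage: extract_by_biotype BIOTYPE in.fa > out.fa
extract_by_biotype() {
    awk -F'#' -v biotype="$1" '
        /^>/ { keep = ($8 == biotype) }   # decide once per header line
        keep                              # print header and its sequence lines
    ' "$2"
}

# Toy input in the normalized header layout (illustrative records only).
GENCODE_DEMO="$(mktemp -d)"
cat > "$GENCODE_DEMO/toy.fa" <<'EOF'
>ENST0001.1#ENSG0001.1#-#-#GENEA-201#GENEA#12#snoRNA#
ACGTACGTACGT
>ENST0002.1#ENSG0002.1#-#-#GENEB-201#GENEB#8#protein_coding#
ACGTACGT
>ENST0003.1#ENSG0003.1#-#-#GENEC-201#GENEC#10#snoRNA#
ACGTA
CGTAC
EOF

extract_by_biotype snoRNA "$GENCODE_DEMO/toy.fa" > "$GENCODE_DEMO/snoRNA.fa"
grep -c '^>' "$GENCODE_DEMO/snoRNA.fa"
```

Matching a whole field avoids accidental hits when one biotype name is a substring of another.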
Prepare RNA-class FASTAs from Ensembl
The original Ensembl workflow downloads separate cDNA and ncRNA FASTA files, replaces spaces in headers with #, and extracts classes by gene_biotype, transcript_biotype, or gene symbol patterns.
Download and normalize Ensembl FASTA files
# Historical example from the original page: Ensembl release 93
wget ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
wget ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh38.ncrna.fa.gz
gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
# Replace spaces in FASTA headers with "#"
sed 's/ /#/g' Homo_sapiens.GRCh38.ncrna.fa > Homo_sapiens.GRCh38.ncrna_#.fa
sed 's/ /#/g' Homo_sapiens.GRCh38.cdna.all.fa > Homo_sapiens.GRCh38.cdna.all_#.fa
mv Homo_sapiens.GRCh38.ncrna_#.fa Homo_sapiens.GRCh38.ncrna.fa
mv Homo_sapiens.GRCh38.cdna.all_#.fa Homo_sapiens.GRCh38.cdna.all.fa
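Extraction by gene_biotype, transcript_biotype, or gene symbol can be sketched with awk as follows. This is an illustration rather than the original faFilter commands; it assumes the normalized headers carry "key:value" fields separated by "#". The function name `extract_by_field` and the toy records are hypothetical.

```shell
# Print FASTA records whose header contains the exact field "KEY:VALUE".
# Usage: extract_by_field KEY VALUE in.fa > out.fa
extract_by_field() {
    awk -v tag="#$1:$2#" '
        /^>/ { keep = (index($0 "#", tag) > 0) }  # pad so a final field also matches
        keep
    ' "$3"
}

# Toy input in the space-to-"#" normalized layout (illustrative records only).
ENSEMBL_DEMO="$(mktemp -d)"
cat > "$ENSEMBL_DEMO/toy.fa" <<'EOF'
>ENST0001.1#ncrna#gene:ENSG0001.1#gene_biotype:rRNA#transcript_biotype:rRNA#gene_symbol:GENEA
ACGTACGT
>ENST0002.1#ncrna#gene:ENSG0002.1#gene_biotype:rRNA_pseudogene#transcript_biotype:rRNA_pseudogene#gene_symbol:GENEB
ACGT
>ENST0003.1#ncrna#gene:ENSG0003.1#gene_biotype:Mt_rRNA#transcript_biotype:Mt_rRNA#gene_symbol:GENEC
ACGTAC
EOF

extract_by_field gene_biotype rRNA "$ENSEMBL_DEMO/toy.fa" > "$ENSEMBL_DEMO/rRNA.fa"
extract_by_field gene_symbol GENEC "$ENSEMBL_DEMO/toy.fa" > "$ENSEMBL_DEMO/GENEC.fa"
```

Because the match includes both delimiters, selecting rRNA does not pull in rRNA_pseudogene or Mt_rRNA records.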
Ensembl, GENCODE, RefSeq, and miRBase may define or name classes differently. When building a reference for a publication or production workflow, inspect headers and document the exact selection rules.
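A quick way to inspect headers before fixing selection rules is to tally the biotype values actually present. The sketch below assumes a "#"-normalized Ensembl FASTA with gene_biotype fields; `tally_gene_biotypes` is a hypothetical helper name and the input is toy data.

```shell
# Count records per gene_biotype in a "#"-normalized Ensembl FASTA.
# Output: one "biotype<TAB>count" line per class, sorted by name.
tally_gene_biotypes() {
    grep '^>' "$1" \
        | tr '#' '\n' \
        | sed -n 's/^gene_biotype://p' \
        | sort \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}

# Toy input (illustrative records only).
TALLY_DEMO="$(mktemp -d)"
cat > "$TALLY_DEMO/toy.fa" <<'EOF'
>T1#ncrna#gene:G1#gene_biotype:snoRNA#transcript_biotype:snoRNA#gene_symbol:A
ACGT
>T2#ncrna#gene:G2#gene_biotype:miRNA#transcript_biotype:miRNA#gene_symbol:B
ACGT
>T3#ncrna#gene:G3#gene_biotype:snoRNA#transcript_biotype:snoRNA#gene_symbol:C
ACGT
EOF

tally_gene_biotypes "$TALLY_DEMO/toy.fa"
```

Saving this table next to the reference files gives a record of which class names existed in the release that was used.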
Prepare transcript references from UCSC RefSeq
The original page describes downloading human transcript sequences and transcript names using the UCSC Table Browser, deduplicating the FASTA and name tables, cleaning headers, and splitting protein-coding and non-coding RefSeq transcripts.
Download from UCSC Table Browser
Open the UCSC Table Browser for the human hg38 assembly.
Use group Genes and Gene Predictions.
Select track NCBI RefSeq and table UCSC RefSeq (refGene).
For transcript sequences, choose output format sequence, then genomic sequence output, and save as UCSC_RefSeq_transcripts.fa.
For transcript names, choose selected fields and export name and name2 as UCSC_RefSeq_transcripts_names.
Deduplicate and clean headers
# Remove records with duplicated sequence IDs (faFilter -uniq keeps the first occurrence)
faFilter -uniq UCSC_RefSeq_transcripts.fa UCSC_RefSeq_transcripts_noDup.fa
mv UCSC_RefSeq_transcripts_noDup.fa UCSC_RefSeq_transcripts.fa
# Remove duplicated rows in the transcript-name table
awk '!seen[$0]++' UCSC_RefSeq_transcripts_names > UCSC_RefSeq_transcripts_names_noDup
mv UCSC_RefSeq_transcripts_names_noDup UCSC_RefSeq_transcripts_names
# Clean headers in UCSC_RefSeq_transcripts.fa
sed -i -e 's/hg38_refGene_N/N/g' UCSC_RefSeq_transcripts.fa
Split protein-coding and non-coding RefSeq transcripts
# All protein-coding transcripts (pattern quoted so the shell does not expand it)
faFilter -name='*NM*' UCSC_RefSeq_transcripts.fa mRNA_ucsc.fa
# All non-coding transcripts
faFilter -name='*NR*' UCSC_RefSeq_transcripts.fa ncRNA_ucsc.fa
Extract non-coding RNA classes using prepared ID lists
Each ID-list file should contain a single column with transcript IDs belonging to one RNA class.
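The extraction step itself can be sketched with awk, which matches IDs exactly instead of by wildcard. This is a minimal illustration, not the original page's command; it assumes the transcript ID is the first word after ">" in each header. The function name `extract_by_id_list`, the file names, and the IDs are hypothetical.

```shell
# Print FASTA records whose ID (first word after ">") is listed in the ID file.
# Usage: extract_by_id_list ids.txt in.fa > out.fa
extract_by_id_list() {
    awk '
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }    # first file: load IDs
        /^>/ { id = substr($1, 2); keep = (id in wanted) }  # FASTA header line
        keep
    ' "$1" "$2"
}

# Toy ID list and FASTA (illustrative IDs only).
IDLIST_DEMO="$(mktemp -d)"
printf 'NR_000001\nNR_000003\n' > "$IDLIST_DEMO/snoRNA_ids.txt"
cat > "$IDLIST_DEMO/ncRNA_toy.fa" <<'EOF'
>NR_000001 range=chr1:100-120
ACGTACGTAC
>NR_000002 range=chr2:200-220
GGGGCCCC
>NR_000003 range=chr3:300-320
TTTTAAAA
EOF

extract_by_id_list "$IDLIST_DEMO/snoRNA_ids.txt" "$IDLIST_DEMO/ncRNA_toy.fa" \
    > "$IDLIST_DEMO/snoRNA_toy.fa"
```

Running the function once per ID list yields one FASTA file per RNA class.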
tRNA and miRNA references
The original workflow recommends using the UCSC Table Browser for tRNA references and miRBase for mature miRNA and pre-miRNA references.
tRNA references from UCSC Table Browser
Open the UCSC Table Browser for the human hg38 assembly.
Use group Genes and Gene Predictions.
Select track tRNA genes and table tRNAs.
Choose output format sequence and save the result as UCSC_tRNA.fa.
miRNA and pre-miRNA references from miRBase
# Download and unzip mature miRNA and hairpin/pre-miRNA references
wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz
wget ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz
gunzip mature.fa.gz
gunzip hairpin.fa.gz
# Extract human miRNA and pre-miRNA records (pattern quoted so the shell does not expand it)
faFilter -name='hsa-*' mature.fa hsa_miRNA_temp.fa
faFilter -name='hsa-*' hairpin.fa hsa_premiRNA_temp.fa
# Convert RNA alphabet to DNA alphabet if needed for DNA-style aligners
# (sequence lines only, so header text is left unchanged)
sed '/^>/!s/U/T/g' hsa_miRNA_temp.fa > hsa_miRNA.fa
sed '/^>/!s/U/T/g' hsa_premiRNA_temp.fa > hsa_premiRNA.fa
RNA vs DNA alphabet
Some aligners and indexing tools expect DNA-style references using T rather than RNA-style U. Convert only when appropriate for the downstream tool.
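To decide whether a conversion is needed at all, the alphabet of a FASTA file can be checked first. The sketch below is an illustration with hypothetical names (`fasta_alphabet`, the demo file, and its record); it reports RNA when any sequence line contains U and converts only in that case.

```shell
# Report "RNA" if any sequence line contains U/u, otherwise "DNA".
fasta_alphabet() {
    awk '!/^>/ && /[Uu]/ { found = 1; exit } END { print (found ? "RNA" : "DNA") }' "$1"
}

# Toy RNA-alphabet record (illustrative only).
ALPHA_DEMO="$(mktemp -d)"
cat > "$ALPHA_DEMO/rna.fa" <<'EOF'
>hsa-demo-1 MIMAT_DEMO1 U-rich demo record
UGAGGUAGUAGGUUGUAUAGUU
EOF

# Convert only when the file really uses the RNA alphabet.
if [ "$(fasta_alphabet "$ALPHA_DEMO/rna.fa")" = "RNA" ]; then
    awk '/^>/ { print; next } { gsub(/U/, "T"); gsub(/u/, "t"); print }' \
        "$ALPHA_DEMO/rna.fa" > "$ALPHA_DEMO/dna.fa"
else
    cp "$ALPHA_DEMO/rna.fa" "$ALPHA_DEMO/dna.fa"
fi
```

Header lines pass through untouched, so record names stay identical between the RNA and DNA versions.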
Reusable workflow skeleton
The script below shows a simple pattern for generating multiple class-specific FASTA references and counting records for documentation.
#!/usr/bin/env bash
set -euo pipefail
INPUT_FASTA="gencode.v28.transcripts.fa"
OUTDIR="processed/gencode_rna_classes"
LOG="logs/gencode_rna_class_counts.tsv"
mkdir -p "$OUTDIR" "$(dirname "$LOG")"
# Normalize headers for easier pattern matching
sed 's/|/#/g' "$INPUT_FASTA" > "$OUTDIR/gencode.headers_normalized.fa"
REF="$OUTDIR/gencode.headers_normalized.fa"
declare -A PATTERNS=(
    ["mRNA"]="*protein_coding*"
    ["rRNA"]="*#rRNA#*"
    ["snoRNA"]="*#snoRNA#*"
    ["scaRNA"]="*#scaRNA#*"
    ["snRNA"]="*#snRNA#*"
    ["RN7S"]="*RN7S*"
    ["pre_miRNA"]="*miRNA*"
    ["RNY"]="*RNY*"
    ["VTRNA"]="*VTRNA*"
    ["lincRNA"]="*lincRNA*"
)
echo -e "class\tfile\trecords" > "$LOG"
for CLASS in "${!PATTERNS[@]}"; do
    OUTFILE="$OUTDIR/${CLASS}.fa"
    faFilter -name="${PATTERNS[$CLASS]}" "$REF" "$OUTFILE"
    COUNT=$(grep -c '^>' "$OUTFILE" || true)
    echo -e "${CLASS}\t${OUTFILE}\t${COUNT}" >> "$LOG"
done
echo "Reference preparation complete: $OUTDIR"
echo "Counts written to: $LOG"
Next steps after preparing RNA-class FASTAs
Once class-specific references have been prepared, validate them before using them in an alignment or quantification workflow.
Count records in every FASTA file and check whether any class is unexpectedly empty.
Inspect a few headers from each class with grep "^>" file.fa | head.
Remove duplicated sequences if multiple reference sources were merged.
Build indexes for Bowtie, Bowtie2, BWA, HISAT2, STAR, BLAST, Salmon, Kallisto, or the tool required by your workflow.
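The first checks in this list can be scripted. The following is a minimal sketch with hypothetical names (`summarize_fasta` and the demo files); it reports the record count, the number of IDs that occur more than once, and an EMPTY flag for each file.

```shell
# Print "file<TAB>records<TAB>duplicate_ids<TAB>status" for each FASTA file.
summarize_fasta() {
    for fasta in "$@"; do
        records="$(grep -c '^>' "$fasta" || true)"
        # Count IDs (first word of the header) that occur more than once.
        dup_ids="$(awk '/^>/ { if (seen[$1]++ == 1) n++ } END { print n + 0 }' "$fasta")"
        status="ok"
        if [ "$records" -eq 0 ]; then
            status="EMPTY"
        fi
        printf '%s\t%s\t%s\t%s\n' "$fasta" "$records" "$dup_ids" "$status"
    done
}

# Toy files (illustrative records only).
CHECK_DEMO="$(mktemp -d)"
printf '>a\nACGT\n>b\nGGCC\n' > "$CHECK_DEMO/ok.fa"
printf '>a\nACGT\n>a\nACGT\n>c\nTTAA\n' > "$CHECK_DEMO/dups.fa"
: > "$CHECK_DEMO/empty.fa"

summarize_fasta "$CHECK_DEMO/ok.fa" "$CHECK_DEMO/dups.fa" "$CHECK_DEMO/empty.fa"
```

Redirecting the output to a file in logs/ keeps the validation result alongside the other project documentation.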