Bioinformatics Scripts

Extracting specific sequences from FASTA files by sequence ID.

This tutorial shows how to extract selected records from FASTA files using UCSC faFilter. You can select sequences by header pattern, by an ID list, or remove unwanted records when preparing custom references for RNA classes, transcriptomes, organisms, contaminants, or targeted analysis.

Overview

FASTA files often contain many sequence records, but downstream analysis may require only a subset. For example, you may want to create separate references for RNA classes such as rRNA, tRNA, snRNA, snoRNA, or miRNA from a total transcriptome FASTA file.

faFilter provides a convenient way to extract FASTA records based on header IDs. It can select records whose names match a pattern, select records listed in an external ID file, or exclude unwanted records.

Input: A FASTA file containing multiple sequence records with informative headers.
Selection: Records are chosen by wildcard name patterns or by a file containing IDs.
Output: A smaller FASTA file containing only the selected sequence records.

Install faFilter

faFilter is part of the UCSC utilities collection. The current SciBerg resource downloads the Linux binary from UCSC and creates a command-line link.

Download the Linux binary

# Create a local tools folder
mkdir -p ~/bin

# Download faFilter from UCSC
wget -O ~/bin/faFilter \
  http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFilter

# Make it executable
chmod +x ~/bin/faFilter

# Add ~/bin to PATH if needed
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Check that the command is available
faFilter 2>&1 | head

Alternative: create a system-wide link

# Example system-wide link
sudo ln -s /path/to/faFilter/faFilter /usr/local/bin/faFilter

Tip

On shared servers and HPC systems, install command-line tools in a personal folder such as ~/bin unless your system administrator recommends a central installation.

Extract sequences by header pattern

Use -name=PATTERN when the desired sequences have headers that follow a predictable pattern. In the pattern, * matches any run of characters (including none) and ? matches exactly one character. Quote the pattern so that the shell passes it to faFilter unchanged. The current SciBerg page gives the example of extracting only sequence IDs that start with hg38.

# Extract only sequences with IDs starting with "hg38"
faFilter \
  -name='hg38*' \
  original_fasta.fa \
  fasta_containing_only_sequences_having_ID_started_with_hg38.fa

More pattern examples

# Extract mitochondrial records if headers start with chrM
faFilter -name='chrM*' genome.fa mitochondrial.fa

# Extract records whose IDs start with rRNA
faFilter -name='rRNA*' transcriptome.fa rRNA.fa

# Extract records whose IDs start with miRNA
faFilter -name='miRNA*' transcriptome.fa miRNA.fa

Pattern          Matches
hg38*            Headers beginning with hg38.
chrM*            Mitochondrial records, if the reference names them chrM.
rRNA*            RNA-class-style headers that begin with rRNA.
*contaminant*    Headers containing the word contaminant anywhere in the ID.
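
Patterns containing * or ? are safest in single quotes, because the shell may otherwise expand them against file names before faFilter sees them. The sketch below illustrates this without calling faFilter: show_args is a hypothetical helper that prints the arguments it receives, and the demo file name is contrived so that the unquoted pattern matches it.

```shell
# Print each argument a command would receive, one per line
show_args() { printf '%s\n' "$@"; }

# Demo directory with a file whose name happens to match the unquoted pattern
glob_dir="$(mktemp -d)"
touch "$glob_dir/-name=hg38_old.fa"

# Quoted: the pattern reaches the program unchanged
quoted="$(cd "$glob_dir" && show_args -name='hg38*')"

# Unquoted: the shell replaces the pattern with the matching file name
unquoted="$(cd "$glob_dir" && show_args -name=hg38*)"

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

A collision like this is unlikely in practice, but quoting removes the dependence on the directory contents and on shell settings that change how unmatched patterns are handled.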

Inspect headers first

Always inspect the FASTA headers before choosing the pattern. Different databases use different naming systems, and a pattern that works for one reference may not work for another.

# Inspect the first FASTA headers
grep "^>" original_fasta.fa | head -20
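
To choose a pattern, it can help to tabulate how the header IDs begin. The sketch below builds a tiny demo FASTA (the IDs are hypothetical and for illustration only) and counts the prefix before the first underscore. It assumes the ID is the first whitespace-delimited word of each header; adjust the delimiter to your own naming scheme.

```shell
# Build a tiny demo FASTA with hypothetical IDs (illustration only)
demo_dir="$(mktemp -d)"
cat > "$demo_dir/demo.fa" <<'EOF'
>hg38_chr1 example record
ACGTACGT
>hg38_chrM example record
GGCCAATT
>rRNA_18S example record
TTGGCCAA
EOF

# Strip ">", keep the ID (first word of the header),
# then count how many IDs share each prefix before the first "_"
prefix_counts="$(grep '^>' "$demo_dir/demo.fa" \
  | sed 's/^>//' \
  | awk '{print $1}' \
  | cut -d_ -f1 \
  | sort | uniq -c | sort -k1,1nr)"

echo "$prefix_counts"
```

On a real reference, replace the demo file with your own FASTA; prefixes with large counts are usually good candidates for a -name pattern.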

Extract sequences using an ID list

Use -namePatList when you have a text file containing the IDs or patterns you want to keep. This is useful for custom gene lists, RNA-class lists, contaminant panels, and selected transcript IDs.

Create an ID list

# Example ID list
cat > ID_list.txt <<'EOF'
hg38_chr1*
hg38_chr2*
hg38_chrM*
EOF

Extract records in the list

faFilter \
  -namePatList=ID_list.txt \
  original_fasta.fa \
  fasta_containing_only_sequences_with_IDs_from_ID_list.fa
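
Before running the extraction, it can be useful to check that every pattern in the list matches at least one header. The sketch below approximates this with the shell's own case pattern matching, which treats * and ? the same way for simple patterns but is not faFilter's matcher (for example, the shell also interprets square brackets). The demo IDs and the file names header_ids.txt and pattern_hits.tsv are hypothetical.

```shell
pat_dir="$(mktemp -d)"

# Demo header IDs and patterns (hypothetical)
printf 'hg38_chr1\nhg38_chr10\nhg38_chrM\n' > "$pat_dir/header_ids.txt"
printf 'hg38_chr1*\nhg38_chrY*\n' > "$pat_dir/ID_list.txt"

# For each pattern, count the header IDs it matches
while IFS= read -r pat; do
  [ -n "$pat" ] || continue          # skip empty lines
  n=0
  while IFS= read -r id; do
    case "$id" in
      $pat) n=$((n + 1)) ;;          # unquoted: treated as a glob pattern
    esac
  done < "$pat_dir/header_ids.txt"
  printf '%s\t%s\n' "$pat" "$n"
done < "$pat_dir/ID_list.txt" > "$pat_dir/pattern_hits.tsv"

cat "$pat_dir/pattern_hits.tsv"
```

For a real file, header_ids.txt could be produced with grep '^>' original_fasta.fa | sed 's/^>//' | awk '{print $1}'. A pattern with a count of 0 usually points to a typo or a naming-convention mismatch.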

Extract exact IDs from a header table

If you maintain ID lists in spreadsheets, export the desired ID column as a plain text file with one ID or pattern per line.

# Example: one ID or pattern per line
head ID_list.txt

# Then use the ID list in faFilter
faFilter -namePatList=ID_list.txt original_fasta.fa selected_sequences.fa
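
Spreadsheet exports often carry Windows line endings, a leading >, stray whitespace, blank lines, or duplicates, any of which can stop an ID from matching. A minimal cleanup sketch, assuming one ID or pattern per line; the file name raw_ids.txt and the demo content are hypothetical.

```shell
work_dir="$(mktemp -d)"

# Demo export with CRLF endings, a leading ">", a blank line,
# a duplicate, and padded whitespace
printf '>hg38_chr1\r\nhg38_chr2\r\n\r\nhg38_chr2\r\n  hg38_chrM  \r\n' \
  > "$work_dir/raw_ids.txt"

# 1) drop carriage returns
# 2) strip leading whitespace, an optional ">", and trailing whitespace
# 3) skip empty lines and keep only the first occurrence of each ID
tr -d '\r' < "$work_dir/raw_ids.txt" \
  | sed -e 's/^[[:space:]]*>\{0,1\}//' -e 's/[[:space:]]*$//' \
  | awk 'NF && !seen[$0]++' \
  > "$work_dir/ID_list.txt"

cat "$work_dir/ID_list.txt"
```

The cleaned file keeps the original order of the IDs, which makes it easier to compare against the spreadsheet.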

Remove unwanted sequence IDs

In some workflows, it is easier to define what should be removed rather than what should be kept. The faFilter usage message lists -v for this purpose: it inverts the match, so only records that do not match are written. Combine -v with -name or -namePatList, and confirm the option in your installed version's help output before relying on it.

# Remove records matching a single unwanted pattern
faFilter \
  -v \
  -name='contaminant*' \
  original_fasta.fa \
  fasta_without_contaminants.fa

# Remove records matching patterns in a list
faFilter \
  -v \
  -namePatList=remove_IDs.txt \
  original_fasta.fa \
  filtered_reference.fa
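
As an independent cross-check of an exclusion run, the same idea can be sketched without faFilter. The awk filter below is a swapped-in technique, not faFilter's implementation: it drops every record whose ID starts with a given prefix, assuming the ID is the first word of the header. The demo file and the prefix are hypothetical.

```shell
rm_dir="$(mktemp -d)"

# Demo FASTA with two unwanted records (hypothetical IDs)
cat > "$rm_dir/demo.fa" <<'EOF'
>contaminant_1 demo
AAAA
CCCC
>hg38_chr1 demo
GGGG
>contaminant_2 demo
TTTT
EOF

# On each header line, decide whether to keep the record;
# sequence lines inherit the decision made for their header
awk -v prefix="contaminant" '
  /^>/ { id = substr($1, 2); keep = (index(id, prefix) != 1) }
  keep
' "$rm_dir/demo.fa" > "$rm_dir/filtered.fa"

cat "$rm_dir/filtered.fa"
```

Comparing the header list of this output with the header list of the faFilter output is a quick way to confirm that the exclusion behaved as intended.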

Check your installed faFilter help

Option names and behavior can vary across utility versions. Run faFilter 2>&1 | head -50 or the local help command to confirm available options before building production workflows.

Reusable workflow script

The script below extracts selected FASTA records from an input file using an ID pattern list and records the number of sequences before and after extraction.

#!/usr/bin/env bash
set -euo pipefail

INPUT_FASTA="original_fasta.fa"
ID_LIST="ID_list.txt"
OUTPUT_FASTA="selected_sequences.fa"
LOG_FILE="fasta_extraction_summary.txt"

if ! command -v faFilter >/dev/null 2>&1; then
  echo "ERROR: faFilter not found in PATH" >&2
  exit 1
fi

# -s is true only for files that exist and are not empty
if [[ ! -s "$INPUT_FASTA" ]]; then
  echo "ERROR: input FASTA missing or empty: $INPUT_FASTA" >&2
  exit 1
fi

if [[ ! -s "$ID_LIST" ]]; then
  echo "ERROR: ID list missing or empty: $ID_LIST" >&2
  exit 1
fi

faFilter -namePatList="$ID_LIST" "$INPUT_FASTA" "$OUTPUT_FASTA"

{
  echo "Input FASTA: $INPUT_FASTA"
  echo "ID list: $ID_LIST"
  echo "Output FASTA: $OUTPUT_FASTA"
  echo "Input records: $(grep -c '^>' "$INPUT_FASTA")"
  echo "Selected records: $(grep -c '^>' "$OUTPUT_FASTA")"
  echo "Command: faFilter -namePatList=$ID_LIST $INPUT_FASTA $OUTPUT_FASTA"
} > "$LOG_FILE"

echo "Done. Summary written to $LOG_FILE"

Next steps after extraction

After extracting a FASTA subset, treat the output as a new reference file and validate that it contains exactly the intended records.

  1. Inspect output headers with grep "^>" selected_sequences.fa | head.
  2. Count records in the input and output FASTA files.
  3. Check whether duplicate sequence IDs or duplicated sequence content should also be removed.
  4. Build a new reference index for the aligner, quantifier, or classification tool used downstream.
  5. Document the source FASTA, ID list, commands, tool version, and date.
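
Steps 1 to 3 can be sketched with standard tools. The example assumes the ID is the first whitespace-delimited word of each header and uses a small hypothetical demo file as a stand-in for selected_sequences.fa.

```shell
check_dir="$(mktemp -d)"
fasta="$check_dir/selected_sequences.fa"   # demo stand-in for your real output

cat > "$fasta" <<'EOF'
>hg38_chr1 first copy
ACGT
>hg38_chr2
GGCC
>hg38_chr1 second copy
ACGT
EOF

# Count records (one header line per record)
n_records="$(grep -c '^>' "$fasta")"

# List IDs that occur more than once
dup_ids="$(grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d)"

echo "Records: $n_records"
if [ -n "$dup_ids" ]; then
  echo "Duplicate IDs found:"
  echo "$dup_ids"
fi
```

If duplicates are reported, the faFilter usage message also lists a -uniq option that keeps the first record for each ID; check your local help output before relying on it.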