Bioinformatics Scripts

Removing duplicated sequences from FASTA reference files.

Duplicate reference sequences can reduce the number of uniquely mapped reads and interfere with downstream steps such as read quantification, pseudoalignment, and custom reference workflows. This tutorial shows how to clean FASTA files with UCSC faFilter and how to document the result.

Overview

FASTA reference files downloaded from public repositories or assembled from multiple sources can contain identical sequences under different headers. This is common when combining transcript databases, RNA classes, isoforms, repetitive elements, microbial references, or custom contaminant sequences.

Duplicate reference sequences can be problematic because a read matching two identical records may no longer be considered uniquely mapped. This can reduce usable alignments and complicate downstream quantification or sequence classification.

Input: FASTA files containing reference sequences from UCSC, NCBI, Ensembl, custom assemblies, or merged references.
Processing: Identify and keep one representative copy of duplicated sequences using faFilter -uniq.
Output: A non-redundant FASTA file ready for indexing, alignment, quantification, or classification.

Install faFilter

faFilter is part of the UCSC utilities collection. The current SciBerg resource uses the UCSC Linux binary and creates a link in the system path.

Download the Linux binary

# Create a local tools folder
mkdir -p ~/bin

# Download faFilter from UCSC
wget -O ~/bin/faFilter \
  http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFilter

# Make it executable
chmod +x ~/bin/faFilter

# Add ~/bin to PATH only if it is not already included
case ":$PATH:" in
  *":$HOME/bin:"*) ;;
  *) echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc ;;
esac
source ~/.bashrc

# Check that the command is available
faFilter 2>&1 | head

Alternative: create a system-wide link

On a workstation where you have administrator permissions, you can place or link the binary into a system-wide location.

# Example system-wide link
sudo ln -s /path/to/faFilter /usr/local/bin/faFilter

Tip for shared servers

On HPC systems or institutional servers, avoid sudo unless permitted. A personal ~/bin installation is usually safer and easier to reproduce.

Remove duplicated sequences

The core command is simple: provide an input FASTA file and the desired output FASTA file.

faFilter -uniq reference.fa reference_no_duplicates.fa

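This writes a filtered set of records to the output FASTA file. faFilter's built-in usage text describes -uniq as removing records with duplicate sequence IDs and keeping the first occurrence, so run faFilter without arguments to confirm the behavior of your installed version. Check the result before building aligner indexes or running downstream quantification.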
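
To see how many records share identical sequence content, before or after filtering, a small awk helper can be used. The sketch below is not part of faFilter; the function name count_duplicate_sequences is hypothetical. It assumes the file fits in memory, joins multi-line sequences, and compares them case-insensitively.

```shell
# Hypothetical helper (not part of faFilter): count FASTA records whose
# sequence content repeats that of an earlier record in the same file.
count_duplicate_sequences() {
  awk '
    # Close the current record and count it if its sequence was seen before
    function record_done() {
      if (have_record && seen[toupper(seq)]++) dups++
      seq = ""
    }
    /^>/ { record_done(); have_record = 1; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0 }   # join wrapped sequence lines
    END { record_done(); print dups + 0 }
  ' "$1"
}

# Example usage (file names are placeholders):
#   count_duplicate_sequences reference.fa
#   count_duplicate_sequences reference_no_duplicates.fa
```

A result of 0 means no two records carry the same sequence. Because every sequence is held in memory, this approach suits transcript or contaminant collections better than large genome assemblies.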

Example with compressed input

If the input reference is compressed, decompress it first or stream it into a temporary file.

# Decompress a gzipped FASTA file
gunzip -c reference.fa.gz > reference.fa

# Remove duplicated sequences
faFilter -uniq reference.fa reference_no_duplicates.fa
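
If you prefer not to leave a decompressed copy next to the original, the temporary-file route mentioned above could look like the sketch below. The function name decompress_to_temp and the reference.XXXXXX template are hypothetical choices, and the caller is expected to delete the temporary file afterwards.

```shell
# Hypothetical helper: decompress a gzipped FASTA into a temporary file and
# print the path of that file. The caller is responsible for deleting it.
decompress_to_temp() {
  local gz_file="$1"
  local tmp_fasta
  tmp_fasta="$(mktemp "${TMPDIR:-/tmp}/reference.XXXXXX")" || return 1
  if ! gunzip -c "$gz_file" > "$tmp_fasta"; then
    # Do not leave a partial file behind when decompression fails
    rm -f "$tmp_fasta"
    return 1
  fi
  printf '%s\n' "$tmp_fasta"
}

# Example usage (file names are placeholders):
#   tmp_fasta="$(decompress_to_temp reference.fa.gz)"
#   faFilter -uniq "$tmp_fasta" reference_no_duplicates.fa
#   rm -f "$tmp_fasta"
```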

Example for a custom reference folder

mkdir -p references/cleaned

faFilter \
  -uniq \
  references/raw/custom_reference.fa \
  references/cleaned/custom_reference.no_duplicates.fa

Verify the output

After deduplication, it is useful to compare the number of FASTA records before and after filtering and preserve a small log for reproducibility.

Record count before filtering: Count FASTA headers in the original reference file.
Record count after filtering: Count FASTA headers in the deduplicated output file.
Reference index rebuild: Rebuild aligner indexes after changing the FASTA file.
Document software and commands: Store command lines, versions, input files, and output filenames.

# Count FASTA records before and after filtering
grep -c "^>" reference.fa
grep -c "^>" reference_no_duplicates.fa

# Store counts in a small log file
{
  echo "Input records:"
  grep -c "^>" reference.fa
  echo "Deduplicated records:"
  grep -c "^>" reference_no_duplicates.fa
} > faFilter_deduplication_summary.txt
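
The two counts can also be reduced to a single number of removed records, with a basic sanity check. This is a minimal sketch; the function names count_records and report_removed are hypothetical, and the check only flags an empty output or an output with more records than the input.

```shell
# Hypothetical helpers for comparing record counts before and after filtering.
count_records() {
  # grep -c exits non-zero when the count is 0, so ignore its exit status
  grep -c '^>' "$1" || true
}

report_removed() {
  local before after
  before="$(count_records "$1")"
  after="$(count_records "$2")"
  echo "Removed records: $((before - after))"
  # An empty output, or an output larger than the input, suggests a problem
  if [ "$after" -eq 0 ] || [ "$after" -gt "$before" ]; then
    echo "WARNING: unexpected record counts ($before before, $after after)" >&2
    return 1
  fi
}

# Example usage (file names are placeholders):
#   report_removed reference.fa reference_no_duplicates.fa
```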

Header duplicates vs sequence duplicates

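The motivation for this tutorial is duplicated sequence content, but duplicate FASTA headers are a separate problem. Check both kinds of duplication before building reference indexes: a filter that matches on sequence IDs will not detect identical sequences stored under different names, and a filter that matches on sequence content will not detect different sequences that share a name.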
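
A quick header check is to list sequence IDs that occur more than once. The sketch below treats the first whitespace-delimited word after ">" as the ID, which is an assumption about the header layout; the function name list_duplicate_ids is hypothetical.

```shell
# Hypothetical helper: print sequence IDs (first word of each header line,
# without the leading ">") that occur more than once in a FASTA file.
list_duplicate_ids() {
  awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}

# Example usage (file name is a placeholder):
#   list_duplicate_ids reference.fa
```

Empty output means every ID in the file is unique.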

Reusable workflow script

The script below performs a simple reproducible deduplication step and records input/output sequence counts.

#!/usr/bin/env bash
set -euo pipefail

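# Input, output, and log paths; override the defaults with positional arguments
INPUT_FASTA="${1:-reference.fa}"
OUTPUT_FASTA="${2:-reference_no_duplicates.fa}"
LOG_FILE="${3:-faFilter_deduplication_summary.txt}"

# Stop early if faFilter is missing or the input file cannot be read
if ! command -v faFilter >/dev/null 2>&1; then
  echo "ERROR: faFilter not found in PATH" >&2
  exit 1
fi

if [[ ! -r "$INPUT_FASTA" ]]; then
  echo "ERROR: cannot read input FASTA: $INPUT_FASTA" >&2
  exit 1
fi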

echo "Input FASTA: $INPUT_FASTA"
echo "Output FASTA: $OUTPUT_FASTA"

faFilter -uniq "$INPUT_FASTA" "$OUTPUT_FASTA"

{
  echo "Input FASTA: $INPUT_FASTA"
  echo "Output FASTA: $OUTPUT_FASTA"
  echo "Input records: $(grep -c '^>' "$INPUT_FASTA")"
  echo "Output records: $(grep -c '^>' "$OUTPUT_FASTA")"
  echo "Command: faFilter -uniq $INPUT_FASTA $OUTPUT_FASTA"
} > "$LOG_FILE"

echo "Done. Summary written to $LOG_FILE"

Next steps after FASTA deduplication

Once a FASTA reference has been cleaned, the next step is usually to build a new index for the aligner or quantification tool you plan to use.

  1. Check that duplicate records were removed as expected.
  2. Validate that sequence headers remain compatible with annotation files, metadata, or downstream scripts.
  3. Build a new index for Bowtie, Bowtie2, BWA, HISAT2, STAR, BLAST, Salmon, Kallisto, or another tool as appropriate.
  4. Document the original FASTA source, deduplication command, and date of reference creation.
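
The documentation step in the list above could be captured in a small provenance file. This is a sketch under stated assumptions: the function name write_provenance and the log layout are hypothetical, and sha256sum from GNU coreutils is assumed to be installed.

```shell
# Hypothetical helper: record the source, command, date, and checksums for a
# deduplicated reference. Assumes sha256sum (GNU coreutils) is available.
write_provenance() {
  local source_desc="$1" input_fasta="$2" output_fasta="$3" log_file="$4"
  {
    echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "Source: $source_desc"
    echo "Command: faFilter -uniq $input_fasta $output_fasta"
    echo "Input sha256: $(sha256sum "$input_fasta" | awk '{ print $1 }')"
    echo "Output sha256: $(sha256sum "$output_fasta" | awk '{ print $1 }')"
  } > "$log_file"
}

# Example usage (source description and file names are placeholders):
#   write_provenance "merged custom reference" reference.fa \
#     reference_no_duplicates.fa reference_provenance.txt
```

Storing checksums alongside the command makes it possible to confirm later that an index was built from exactly this FASTA file.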