Removing duplicated sequences from FASTA reference files.
Duplicate reference sequences can reduce the number of uniquely mapped reads and interfere with downstream steps such as read quantification, pseudoalignment, and custom reference workflows. This tutorial shows how to clean FASTA files with UCSC faFilter and how to document the result.
FASTA reference files downloaded from public repositories or assembled from multiple sources can contain identical sequences under different headers. This is common when combining transcript databases, RNA classes, isoforms, repetitive elements, microbial references, or custom contaminant sequences.
Duplicate reference sequences can be problematic because a read matching two identical records may no longer be considered uniquely mapped. This can reduce usable alignments and complicate downstream quantification or sequence classification.
Input: FASTA files containing reference sequences from UCSC, NCBI, Ensembl, custom assemblies, or merged references.
Processing: Identify and keep one representative copy of duplicated sequences using faFilter -uniq.
Output: A non-redundant FASTA file ready for indexing, alignment, quantification, or classification.
Install faFilter
faFilter is part of the UCSC utilities collection. The current SciBerg resource uses the UCSC Linux binary and creates a link in the system path.
Download the Linux binary
# Create a local tools folder
mkdir -p ~/bin
# Download faFilter from UCSC
wget -O ~/bin/faFilter \
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faFilter
# Make it executable
chmod +x ~/bin/faFilter
# Add ~/bin to PATH if it is not already included
grep -qxF 'export PATH="$HOME/bin:$PATH"' ~/.bashrc 2>/dev/null \
|| echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Check that the command is available
faFilter 2>&1 | head
Alternative: create a system-wide link
On a workstation where you have administrator permissions, you can place or link the binary into a system-wide location.
# Example system-wide link
sudo ln -s /path/to/faFilter /usr/local/bin/faFilter
Tip for shared servers
On HPC systems or institutional servers, avoid sudo unless permitted. A personal ~/bin installation is usually safer and easier to reproduce.
Remove duplicated sequences
The core command is simple: provide an input FASTA file and the desired output FASTA file.
# Remove duplicated sequences, keeping one copy of each
faFilter -uniq input.fa output.fa
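This writes a unique set of records to the output FASTA file, keeping the first copy of each duplicate. The faFilter usage message (printed when the command is run without arguments) describes -uniq as removing duplicate sequence IDs, so confirm on a small file that it handles the kind of duplication present in your reference. Cleaning the reference at this stage is useful before building aligner indexes or running downstream quantification.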
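Reference files can contain two kinds of duplication, repeated IDs and identical sequences under different IDs, so it may help to run the command on a toy file before applying it to a real reference. The sketch below is a minimal example under stated assumptions: the scratch directory, the file names, and the record names (seqA, seqB, seqC) are made up for illustration, and the faFilter step only runs if the binary is on PATH.

```shell
set -eu

# Hypothetical scratch location and toy records, for illustration only
TOY_DIR="$(mktemp -d)"
TOY_FASTA="$TOY_DIR/toy.fa"

# seqA appears twice under the same ID;
# seqB and seqC carry identical sequence content under different IDs
cat > "$TOY_FASTA" <<'EOF'
>seqA
ACGTACGTAC
>seqA
ACGTACGTAC
>seqB
GGGTTTCCCA
>seqC
GGGTTTCCCA
EOF

echo "Toy input records: $(grep -c '^>' "$TOY_FASTA")"

# Run faFilter only when it is installed, then show which headers survive
if command -v faFilter >/dev/null 2>&1; then
    faFilter -uniq "$TOY_FASTA" "$TOY_DIR/toy_uniq.fa"
    echo "Headers kept by faFilter -uniq:"
    grep '^>' "$TOY_DIR/toy_uniq.fa"
else
    echo "faFilter not found in PATH; skipping the filter step" >&2
fi
```

If both seqB and seqC survive, the filter is matching on IDs rather than on sequence content, and content-level duplicates need a separate check.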
Example with compressed input
If the input reference is compressed, decompress it first or stream it into a temporary file.
# Decompress a gzipped FASTA file
gunzip -c reference.fa.gz > reference.fa
# Remove duplicated sequences
faFilter -uniq reference.fa reference_no_duplicates.fa
After deduplication, it is useful to compare the number of FASTA records before and after filtering and preserve a small log for reproducibility.
Record count before filtering: Count FASTA headers in the original reference file.
Record count after filtering: Count FASTA headers in the deduplicated output file.
Reference index rebuild: Rebuild aligner indexes after changing the FASTA file.
Document software and commands: Store command lines, versions, input files, and output filenames.
# Count FASTA records before and after filtering
grep -c "^>" reference.fa
grep -c "^>" reference_no_duplicates.fa
# Store counts in a small log file
{
echo "Input records:"
grep -c "^>" reference.fa
echo "Deduplicated records:"
grep -c "^>" reference_no_duplicates.fa
} > faFilter_deduplication_summary.txt
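Record counts alone do not identify which files were used. One optional addition to the log is a checksum per file, sketched below. The helper name log_checksums is hypothetical, and the sketch assumes the GNU coreutils sha256sum command is available (on macOS, shasum -a 256 is the usual substitute).

```shell
# Append SHA-256 checksums of the given files to a log file
# Usage: log_checksums LOG_FILE FILE [FILE ...]
log_checksums() {
    log_file="$1"
    shift
    {
        echo "SHA-256 checksums:"
        sha256sum "$@"
    } >> "$log_file"
}
```

For the files in this tutorial the call would be: log_checksums faFilter_deduplication_summary.txt reference.fa reference_no_duplicates.fa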
Header duplicates vs sequence duplicates
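Duplicated sequence content and duplicated FASTA headers are separate problems. This tutorial focuses on duplicated sequence content, but both should be checked before building reference indexes, because a filter that matches on sequence IDs will not catch identical sequences stored under different headers.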
Reusable workflow script
The script below performs a simple reproducible deduplication step and records input/output sequence counts.
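#!/usr/bin/env bash
# Deduplicate a FASTA file with faFilter -uniq and log the record counts.
# Optional arguments: [INPUT_FASTA] [OUTPUT_FASTA]
set -euo pipefail
# Defaults match the filenames used in this tutorial
INPUT_FASTA="${1:-reference.fa}"
OUTPUT_FASTA="${2:-reference_no_duplicates.fa}"
LOG_FILE="faFilter_deduplication_summary.txt"
# Stop early if faFilter is not installed
if ! command -v faFilter >/dev/null 2>&1; then
    echo "ERROR: faFilter not found in PATH" >&2
    exit 1
fi
# Stop early if the input file is missing
if [[ ! -f "$INPUT_FASTA" ]]; then
    echo "ERROR: input FASTA not found: $INPUT_FASTA" >&2
    exit 1
fi
echo "Input FASTA: $INPUT_FASTA"
echo "Output FASTA: $OUTPUT_FASTA"
# Run the filter
faFilter -uniq "$INPUT_FASTA" "$OUTPUT_FASTA"
# Write a summary with file names, record counts, the command, and the date
{
    echo "Input FASTA: $INPUT_FASTA"
    echo "Output FASTA: $OUTPUT_FASTA"
    echo "Input records: $(grep -c '^>' "$INPUT_FASTA")"
    echo "Output records: $(grep -c '^>' "$OUTPUT_FASTA")"
    echo "Command: faFilter -uniq $INPUT_FASTA $OUTPUT_FASTA"
    echo "Date (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)"
} > "$LOG_FILE"
echo "Done. Summary written to $LOG_FILE"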
Next steps after FASTA deduplication
Once a FASTA reference has been cleaned, the next step is usually to build a new index for the aligner or quantification tool you plan to use.
Check that duplicate records were removed as expected.
Validate that sequence headers remain compatible with annotation files, metadata, or downstream scripts.
Build a new index for Bowtie, Bowtie2, BWA, HISAT2, STAR, BLAST, Salmon, Kallisto, or another tool as appropriate.
Document the original FASTA source, deduplication command, and date of reference creation.
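To check the first two points in the list above, it can help to see exactly which records were dropped. The sketch below compares the header lines of the original and filtered files with comm. The function name list_removed_headers is hypothetical, and the sketch assumes headers were not rewritten during filtering.

```shell
# Print headers present in the original FASTA but missing from the filtered FASTA.
# A header that occurred twice and was kept once is reported once.
# Usage: list_removed_headers ORIGINAL.fa FILTERED.fa
list_removed_headers() {
    tmp_dir="$(mktemp -d)"
    # comm needs both inputs sorted under the same locale
    awk '/^>/' "$1" | LC_ALL=C sort > "$tmp_dir/original.txt"
    awk '/^>/' "$2" | LC_ALL=C sort > "$tmp_dir/filtered.txt"
    # Column 1 of comm holds lines found only in the first file
    LC_ALL=C comm -23 "$tmp_dir/original.txt" "$tmp_dir/filtered.txt"
    rm -r "$tmp_dir"
}
```

The output of list_removed_headers reference.fa reference_no_duplicates.fa can then be searched against annotation files or metadata tables to find entries that point at records no longer in the reference.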