Bioinformatics Scripts

Manipulating FASTA and FASTQ files in the Linux Bash shell.

This tutorial collects practical one-line commands and reusable shell snippets for common FASTA and FASTQ operations: counting reads, inspecting file ranges, converting FASTQ to FASTA, extracting motif-containing reads, counting FASTA records, cleaning headers, and converting multi-line FASTA files into single-line format.

Overview

Linux command-line tools are extremely useful for quick inspection and manipulation of sequencing files. Many everyday operations can be performed with standard tools such as cat, zcat, wc, sed, awk, grep, perl, and bc.

The examples below modernize the original SciBerg command list and add context, validation steps, and reusable snippets for more reproducible work.

FASTQ: Count reads, inspect records, convert to FASTA, and extract reads containing motifs.
FASTA: Count records, extract headers, clean descriptions, and convert multi-line records to one-line format.
Workflow hygiene: Always document filenames, commands, tool versions, and output files for reproducibility.

Important file-format reminder

A standard FASTQ record has four lines: header, sequence, separator, and quality string. That is why line counts are divided by four when estimating the number of reads in a valid FASTQ file.
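Because dividing by four is only meaningful for well-formed files, it can help to check the structure before trusting a read count. The following is a minimal sketch, assuming unwrapped four-line records; check_fastq is a hypothetical helper name, not a standard tool.

```shell
# check_fastq (hypothetical helper): read FASTQ text on stdin and report
# the first structural problem, or the record count if none is found.
check_fastq() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" {
      printf "line %d: header does not start with @\n", NR; bad = 1; exit
    }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" {
      printf "line %d: separator does not start with +\n", NR; bad = 1; exit
    }
    NR % 4 == 0 && length($0) != seqlen {
      printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
    }
    END {
      if (bad) exit 1
      if (NR % 4 != 0) { printf "line count %d is not a multiple of 4\n", NR; exit 1 }
      printf "OK: %d records\n", NR / 4
    }
  '
}
```

Typical use would be check_fastq < fastq_file.fastq, or zcat fastq_file.fastq.gz | check_fastq for compressed input. The function returns a non-zero exit status when a problem is found, so it can gate later steps in a script.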

FASTQ operations

Count reads in an uncompressed FASTQ file

echo $(cat fastq_file.fastq | wc -l)/4 | bc

Count reads in a gzip-compressed FASTQ file

echo $(zcat fastq_file.fastq.gz | wc -l)/4 | bc

A slightly more explicit version stores the line count first and then divides by four:

LINES=$(zcat fastq_file.fastq.gz | wc -l)
echo "$LINES / 4" | bc

Inspect selected lines in a FASTQ file

Use sed or awk to print a range of lines from a large FASTQ file without opening the entire file in a text editor.

# Print lines 530 to 640
sed -n '530,640p;641q' fastq_file.fastq

# Print lines 530 to 540
awk 'FNR>=530 && FNR<=540' fastq_file.fastq

wc -l: Counts the number of lines in a file or stream.
zcat: Streams a gzip-compressed file to standard output without creating an uncompressed copy.
sed -n: Suppresses automatic printing, so only lines selected with p are printed; the q command stops reading once the range has ended.
awk: Processes records and line ranges using programmable conditions.
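Because each record occupies four lines, record N spans lines 4N-3 to 4N. The sketch below uses that arithmetic to print a single record by its index; fastq_record is a hypothetical helper name, and it assumes an uncompressed file with unwrapped records.

```shell
# fastq_record N FILE (hypothetical helper): print the Nth four-line record
# of an uncompressed FASTQ file, assuming unwrapped records.
fastq_record() (
  first=$(( 4 * $1 - 3 ))   # header line of record N
  last=$(( 4 * $1 ))        # quality line of record N
  # Print the range, then quit so the rest of the file is not read.
  sed -n "${first},${last}p;${last}q" "$2"
)
```

For example, fastq_record 133 fastq_file.fastq prints lines 529 to 532. The function body runs in a subshell, so the temporary variables do not leak into the calling shell.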

Convert FASTQ to FASTA

FASTQ-to-FASTA conversion removes quality-score lines and outputs only sequence headers and sequences. This is useful for some motif-search, BLAST, reference-building, or quick inspection tasks.

FASTQ to FASTA with AWK

awk 'NR%4==1{a=substr($0,2);}NR%4==2{print ">"a"\n"$0}' input.fastq > output.fa

FASTQ to FASTA with sed

# GNU sed; selects lines by position because quality strings may also start with @
sed -n '1~4s/^@/>/p;2~4p' input.fastq > output.fa

Compression note

For compressed FASTQ files, stream with zcat and pipe into the conversion command, for example:

zcat input.fastq.gz | \
awk 'NR%4==1{a=substr($0,2);}NR%4==2{print ">"a"\n"$0}' \
> output.fa
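After a conversion, a quick sanity check is to confirm that the FASTA output contains as many records as the FASTQ input contained reads. The following is a minimal sketch for uncompressed input; check_conversion is a hypothetical helper name.

```shell
# check_conversion FASTQ FASTA (hypothetical helper): compare the number of
# reads in an uncompressed FASTQ file with the number of FASTA headers.
check_conversion() (
  reads=$(( $(wc -l < "$1") / 4 ))
  # grep -c exits non-zero when the count is 0, so tolerate that case.
  records=$(grep -c '^>' "$2" || true)
  if [ "$reads" -eq "$records" ]; then
    echo "OK: $reads reads, $records FASTA records"
  else
    echo "MISMATCH: $reads reads, $records FASTA records"
    exit 1
  fi
)
```

Usage would look like check_conversion input.fastq output.fa. For gzip-compressed input, the read count would have to come from zcat input.fastq.gz | wc -l instead.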

Extract reads containing a motif or restriction site

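The original SciBerg example extracts reads containing the XbaI recognition site TCTAGA and outputs matching reads in FASTA format. Because TCTAGA is palindromic, the same search also detects the site in reads sequenced from the opposite strand.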

awk 'NR%4==1{a=substr($0,2);}NR%4==2 && $1~/TCTAGA/ {print ">"a"\n"$0}' fastq_file.fastq

To save the output into a FASTA file:

awk 'NR%4==1{a=substr($0,2);}NR%4==2 && $1~/TCTAGA/ {print ">"a"\n"$0}' \
  fastq_file.fastq \
  > reads_containing_XbaI.fa

For compressed FASTQ input:

zcat fastq_file.fastq.gz | \
awk 'NR%4==1{a=substr($0,2);}NR%4==2 && $1~/TCTAGA/ {print ">"a"\n"$0}' \
> reads_containing_XbaI.fa
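Before extracting reads, it is often useful to know how many contain the motif at all. The sketch below passes the motif in as a parameter instead of hard-coding it and reports matching and total read counts. The name motif_counts is hypothetical, and the motif is treated as a literal, case-sensitive string (not a regular expression).

```shell
# motif_counts MOTIF (hypothetical helper): read FASTQ on stdin and print
# "<reads containing MOTIF><TAB><total reads>". Only sequence lines
# (line 2 of each record) are searched, using a literal string match.
motif_counts() {
  awk -v motif="$1" '
    NR % 4 == 2 { total++; if (index($0, motif) > 0) hits++ }
    END { printf "%d\t%d\n", hits, total }
  '
}
```

For example, zcat fastq_file.fastq.gz | motif_counts TCTAGA prints the two counts separated by a tab. Using index() instead of a regular-expression match avoids surprises if a motif ever contains characters that are special in regex syntax.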

FASTA operations

Count sequences in a FASTA file

grep -c "^>" fasta_file.fa

Extract FASTA headers

Extracting headers is useful when creating annotation tables, inspecting reference content, or preparing ID lists for downstream filtering.

grep "^>" fasta.fa > fasta_header

Keep only the first column of FASTA headers

awk '{print $1}' fasta_file_input.fa > fasta_file_output.fa

Clean FASTA headers in place with Perl

# -i.bak keeps the original as fasta_file.fa.bak; the ID is kept up to the first whitespace
perl -p -i.bak -e 's/^>(\S+).*/>$1/' fasta_file.fa

Convert multi-line FASTA to single-line FASTA

Some tools or downstream scripts require each FASTA record to contain a single sequence line after the header.

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' \
  multi_line.fa \
  > single_line.fa
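To confirm that a file really is in single-line format, each record can be checked for having no more than one sequence line. The following is a minimal sketch; check_single_line is a hypothetical helper name, and blank lines are ignored.

```shell
# check_single_line (hypothetical helper): read FASTA on stdin and report
# the first record that spans more than one non-empty sequence line.
check_single_line() {
  awk '
    /^>/ { records++; seq_lines = 0; next }
    NF && (++seq_lines > 1) {
      printf "record %d has more than one sequence line\n", records
      bad = 1
      exit
    }
    END {
      if (bad) exit 1
      printf "OK: %d records, at most one sequence line each\n", records
    }
  '
}
```

Running check_single_line < single_line.fa after the conversion should report OK, while the original multi-line file should name the first wrapped record.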

Header-cleaning caution

Cleaning headers can break compatibility with annotation files if IDs are expected to match exactly. Preserve a copy of the original reference and document any header transformations.
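One specific risk is that two headers share the same first field, so truncating the descriptions produces duplicate IDs. A minimal sketch of a check to run before cleaning is shown here; duplicate_fasta_ids is a hypothetical helper name.

```shell
# duplicate_fasta_ids (hypothetical helper): read FASTA on stdin and print
# every ID (first whitespace-delimited header field, without the ">")
# that occurs more than once.
duplicate_fasta_ids() {
  awk '/^>/ { print substr($1, 2) }' | sort | uniq -d
}
```

An empty result from duplicate_fasta_ids < fasta_file.fa suggests the first field is unique and safe to use as the sole identifier.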

Reusable workflow snippets

Summarize FASTQ and FASTA files in a project folder

#!/usr/bin/env bash
set -euo pipefail

echo -e "file\ttype\trecords"

for fq in *.fastq *.fastq.gz; do
  [[ -e "$fq" ]] || continue

  if [[ "$fq" == *.gz ]]; then
    READS=$(echo "$(zcat "$fq" | wc -l) / 4" | bc)
  else
    READS=$(echo "$(cat "$fq" | wc -l) / 4" | bc)
  fi

  echo -e "${fq}\tFASTQ\t${READS}"
done

for fa in *.fa *.fasta; do
  [[ -e "$fa" ]] || continue
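  # grep -c exits with status 1 when the count is 0, which would abort the script under set -e
  RECORDS=$(grep -c '^>' "$fa" || true)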
  echo -e "${fa}\tFASTA\t${RECORDS}"
done

Convert all FASTQ files in a folder to FASTA

#!/usr/bin/env bash
set -euo pipefail

mkdir -p fasta_output

for fq in *.fastq; do
  [[ -e "$fq" ]] || continue
  sample="${fq%.fastq}"

  awk 'NR%4==1{a=substr($0,2);}NR%4==2{print ">"a"\n"$0}' \
    "$fq" \
    > "fasta_output/${sample}.fa"
done

Next steps

After basic FASTA/FASTQ manipulation, continue with quality control, trimming, reference preparation, read alignment, or quantification depending on the analysis goal.

  1. Run FastQC before major downstream analysis steps.
  2. Trim adapters or low-quality bases when required.
  3. Prepare clean FASTA references and remove duplicates where appropriate.
  4. Build aligner indexes after any reference modification.
  5. Document every command and preserve raw data unchanged.