Practical Scripting Tutorial

Introduction to custom scripting with Shell/Bash, AWK and Python.

A beginner-friendly tutorial for scientists and bioinformatics users who want to automate repetitive tasks, process text files, validate metadata, run NGS tools across many samples and build reproducible scripts using Shell/Bash, AWK and Python.

1. Overview

Custom scripting is the bridge between interactive command-line work and reproducible data analysis. In bioinformatics, scripts are used to rename files, validate sample sheets, loop through FASTQ files, run tools, summarize QC reports, filter tables, generate reports and prepare data for downstream interpretation.

Shell/Bash: Connect command-line tools, manage files and automate pipelines.
AWK: Process structured text files line by line with compact commands.
Python: Build maintainable scripts with validation, functions, modules and reusable logic.

Practical rule: use Bash to orchestrate tools, AWK for quick text-table transformations and Python when the logic becomes complex enough to deserve a real program.

2. When should you write a script?

A single command is fine for exploration. A script is better when work should be repeated, shared, checked, logged or run on many files.

Write a script when...
  • You run the same commands more than once.
  • You process many samples.
  • You need a record of exact commands.
  • You want to reduce manual mistakes.
  • You need logs, reports or version tracking.

Keep it interactive when...
  • You are exploring an unknown file.
  • You are testing one command.
  • The command will not be reused.
  • You are checking quick summary information.

3. Recommended project setup

Keep scripts, logs, input data and results separated. This structure makes projects easier to review and rerun.

Create a scripting practice project
mkdir -p scripting_tutorial/{data,results,scripts,logs,reports}
cd scripting_tutorial

cat > data/samples.tsv << 'EOF'
sample_id	group	batch	fastq_1	fastq_2
S1	control	A	S1_R1.fastq.gz	S1_R2.fastq.gz
S2	control	A	S2_R1.fastq.gz	S2_R2.fastq.gz
S3	treated	B	S3_R1.fastq.gz	S3_R2.fastq.gz
S4	treated	B	S4_R1.fastq.gz	S4_R2.fastq.gz
EOF

Keep raw data in data/, scripts in scripts/, logs in logs/, and generated outputs in results/ or reports/.

4. Bash script basics

Bash scripts are text files containing shell commands. A typical script starts with a shebang line, safety settings and a clear description.

Minimal Bash script
cat > scripts/hello_bash.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

echo "Hello from Bash"
echo "Current directory: $(pwd)"
echo "Current date: $(date)"
EOF

chmod +x scripts/hello_bash.sh
./scripts/hello_bash.sh

Line | Meaning | Why it helps
#!/usr/bin/env bash | Run the script with the first bash found on PATH. | Works on systems where Bash is not installed at /bin/bash.
set -e | Exit when a command fails. | Prevents continuing after errors.
set -u | Treat unset variables as errors. | Catches typos in variable names.
set -o pipefail | Make a pipeline fail if any command in it fails. | Makes pipelines safer.

5. Bash variables and arguments

Variables store values such as input files, output folders, sample names and parameters. Quote variables to protect spaces and special characters.

Variables and arguments
cat > scripts/count_lines.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

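# ${1:?message} stops the script with the message if the argument is missing.
input_file="${1:?Usage: count_lines.sh INPUT_FILE OUTPUT_FILE}"
output_file="${2:?Usage: count_lines.sh INPUT_FILE OUTPUT_FILE}"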

echo "Input: ${input_file}"
echo "Output: ${output_file}"

wc -l "${input_file}" > "${output_file}"
EOF

chmod +x scripts/count_lines.sh
./scripts/count_lines.sh data/samples.tsv results/sample_line_count.txt
cat results/sample_line_count.txt

Use "${variable}" rather than $variable in most scripts. Quoting prevents many mistakes when paths contain spaces or special characters.
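
The effect of quoting can be imitated in Python with the standard-library shlex module, which splits a command line roughly the way a POSIX shell does. This is only an illustrative sketch; the path data/my samples.tsv is a made-up example and not a file from this tutorial.

```python
import shlex

# Hypothetical path containing a space (not part of the tutorial data).
path = "data/my samples.tsv"

# Unquoted: the path is split into two separate arguments.
unquoted = shlex.split(f"wc -l {path}")

# Quoted: the path survives as a single argument.
quoted = shlex.split(f"wc -l {shlex.quote(path)}")

print(unquoted)  # ['wc', '-l', 'data/my', 'samples.tsv']
print(quoted)    # ['wc', '-l', 'data/my samples.tsv']
```

In the unquoted case wc would look for two files, data/my and samples.tsv, and neither exists.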

6. Bash loops over files and samples

Loops are useful for running the same command on multiple files or samples.

Loop over FASTQ files
mkdir -p results/fastq_names

for fq in data/*.fastq.gz; do
  # If no file matches, the pattern stays literal; skip it.
  [[ -e "$fq" ]] || continue
  sample="$(basename "$fq" .fastq.gz)"
  echo "Found FASTQ: $sample"
done

Loop over sample sheet rows
tail -n +2 data/samples.tsv | while IFS=$'\t' read -r sample_id group batch fastq_1 fastq_2; do
  echo "Sample: $sample_id | group: $group | batch: $batch"
  echo "R1: $fastq_1"
  echo "R2: $fastq_2"
done
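
For comparison, the basename "$fq" .fastq.gz step from the first loop could be written in Python roughly as follows. The helper name sample_name is hypothetical. The sketch also shows a common surprise: Path.stem removes only the last extension.

```python
from pathlib import Path


def sample_name(fastq_path, suffix=".fastq.gz"):
    """Return the file name without its directory and without `suffix`."""
    name = Path(fastq_path).name
    if not name.endswith(suffix):
        raise ValueError(f"{name} does not end with {suffix}")
    return name[: -len(suffix)]


# Path.stem strips only ".gz", so ".fastq" is left behind.
print(Path("data/S1_R1.fastq.gz").stem)    # S1_R1.fastq
print(sample_name("data/S1_R1.fastq.gz"))  # S1_R1
```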

7. Bash conditionals and file checks

Conditionals let scripts check whether files exist, whether folders are empty and whether required inputs are available.

Check files before running
cat > scripts/check_sample_sheet.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

sample_sheet="${1:-data/samples.tsv}"

if [[ ! -f "$sample_sheet" ]]; then
  echo "ERROR: sample sheet not found: $sample_sheet" >&2
  exit 1
fi

echo "Sample sheet exists: $sample_sheet"
echo "Number of lines:"
wc -l "$sample_sheet"
EOF

chmod +x scripts/check_sample_sheet.sh
./scripts/check_sample_sheet.sh data/samples.tsv
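
The same idea extends to the files named inside the sample sheet. The sketch below is one possible Python version; the function name missing_fastqs is hypothetical, and it assumes that the fastq_1 and fastq_2 columns hold file names relative to a single FASTQ folder.

```python
import csv
from pathlib import Path


def missing_fastqs(sample_sheet, fastq_dir):
    """Return (sample_id, file_name) pairs for FASTQ files that do not exist."""
    missing = []
    with Path(sample_sheet).open(newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in ("fastq_1", "fastq_2"):
                # Assumption: file names are relative to fastq_dir.
                if not (Path(fastq_dir) / row[column]).is_file():
                    missing.append((row["sample_id"], row[column]))
    return missing
```

Calling missing_fastqs("data/samples.tsv", "data") in the practice project would report every FASTQ file as missing, because the tutorial creates only the sample sheet and no reads.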

8. Bash functions

Functions help avoid repeating the same code. They are useful for logging, checking files and running repeated command patterns.

Bash functions for logging and checks
cat > scripts/functions_example.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

require_file() {
  local file="$1"
  if [[ ! -f "$file" ]]; then
    log "ERROR: missing file: $file" >&2
    exit 1
  fi
}

sample_sheet="data/samples.tsv"
require_file "$sample_sheet"
log "Processing $sample_sheet"
wc -l "$sample_sheet"
EOF

chmod +x scripts/functions_example.sh
./scripts/functions_example.sh

9. Bash safety habits

Bash can run thousands of operations quickly, including destructive operations. Safe habits matter.

Always check location: Run pwd before deleting or moving files.
Preview wildcards: Use ls pattern* before rm pattern*.
Quote variables: Use "${file}" to avoid word-splitting bugs.
Write logs: Redirect command output and errors into a logs/ folder.
Keep raw data read-only: Never write intermediate files into raw data folders.
Test on small files: Validate scripts before running on full NGS datasets.
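
The "preview wildcards" habit carries over to Python as a dry-run switch. The sketch below is one way to do it; the function name remove_matching and its dry_run parameter are hypothetical, not part of any library.

```python
from pathlib import Path


def remove_matching(directory, pattern, dry_run=True):
    """List files matching `pattern`; delete them only when dry_run is False."""
    matches = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    for path in matches:
        if dry_run:
            print(f"Would remove: {path}")
        else:
            path.unlink()
            print(f"Removed: {path}")
    return matches
```

Because dry_run defaults to True, a call such as remove_matching("results", "*.tmp") only prints what would be deleted. Files are removed only after passing dry_run=False explicitly.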

10. AWK basics

AWK is designed for line-by-line processing of structured text. It is especially useful for tabular files such as TSV metadata, BED intervals, GTF annotations and simple result tables.

Basic AWK examples
# Print all lines
awk '{print}' data/samples.tsv

# Print first column
awk 'BEGIN{FS="\t"} {print $1}' data/samples.tsv

# Skip header and print sample/group
awk 'BEGIN{FS="\t"} NR>1 {print $1, $2}' data/samples.tsv

# Select treated samples
awk 'BEGIN{FS="\t"} NR>1 && $2=="treated" {print $1}' data/samples.tsv

AWK element | Meaning | Example
FS | Input field separator. | BEGIN{FS="\t"}
OFS | Output field separator. | BEGIN{OFS="\t"}
NR | Current line number. | NR>1
$1, $2 | Column 1, column 2. | print $1, $2
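
For readers who know Python better than AWK, the elements above map onto plain Python as sketched below. The function select_column is hypothetical and the three rows are a shortened copy of the tutorial sample sheet.

```python
def select_column(lines, column, group=None, sep="\t"):
    """Return one column from tab-separated lines, skipping the header.

    Roughly what an AWK command with NR>1 and an optional test on the
    second column would print.
    """
    selected = []
    for nr, line in enumerate(lines, start=1):   # nr plays the role of NR
        fields = line.rstrip("\n").split(sep)    # sep plays the role of FS
        if nr == 1:                              # NR>1 skips the header
            continue
        if group is None or fields[1] == group:  # fields[1] is AWK's $2
            selected.append(fields[column - 1])  # AWK's $1 is fields[0]
    return selected


rows = [
    "sample_id\tgroup\tbatch\n",
    "S1\tcontrol\tA\n",
    "S3\ttreated\tB\n",
]
print(select_column(rows, 1, group="treated"))  # ['S3']
```

The main difference to remember is that AWK columns start at $1, while Python list indices start at 0.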

11. AWK patterns for bioinformatics-style tables

AWK is excellent for quick summaries and column transformations.

Count samples per group
awk 'BEGIN{FS="\t"} NR>1 {count[$2]++} END{for (group in count) print group, count[group]}' data/samples.tsv

Create a simple two-column manifest
awk 'BEGIN{FS=OFS="\t"} NR>1 {print $1, $4}' data/samples.tsv > results/r1_manifest.tsv
cat results/r1_manifest.tsv

Filter BED-like intervals by length
# Example assumes columns: chrom, start, end, name
awk 'BEGIN{FS=OFS="\t"} ($3-$2) >= 100 {print}' regions.bed > regions_min100bp.bed

Use AWK for compact transformations. When commands become long and hard to read, move the logic into a Python script.
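
As an illustration of moving such logic into Python, the BED length filter could look like the sketch below. The function name filter_bed_by_length and the intervals are made up for the example. Unlike the AWK one-liner, this version explicitly skips blank lines and comment, track and browser lines.

```python
def filter_bed_by_length(lines, min_length=100):
    """Yield BED lines whose interval length (end - start) is >= min_length."""
    for line in lines:
        # Skip blank lines and BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[1]), int(fields[2])
        if end - start >= min_length:
            yield line


# Made-up intervals: chrom, start, end, name.
bed = [
    "track name=demo\n",
    "chr1\t0\t50\ta\n",
    "chr1\t100\t200\tb\n",
    "chr2\t10\t500\tc\n",
]
for kept in filter_bed_by_length(bed):
    print(kept, end="")  # prints the lines named b and c
```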

12. Python scripting basics

Python is a good choice for scripts that need clear structure, validation, functions, error handling and reuse. It is often easier to maintain than long Bash or AWK one-liners.

Minimal Python script
cat > scripts/hello_python.py << 'EOF'
#!/usr/bin/env python3

from pathlib import Path

project_dir = Path.cwd()
print("Hello from Python")
print(f"Current directory: {project_dir}")
print("Files:")
for path in sorted(project_dir.iterdir()):
    print(path.name)
EOF

chmod +x scripts/hello_python.py
./scripts/hello_python.py

Python module | Use | Example task
pathlib | Path and file handling. | Find FASTQ files in a folder.
csv | Read and write CSV/TSV files. | Validate sample sheets.
gzip | Read compressed text files. | Inspect FASTQ files.
argparse | Command-line arguments. | Reusable scripts with input/output options.
logging | Structured logs. | Record script progress and errors.
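
As a small example of the logging module, the sketch below writes timestamped messages to a file, similar in spirit to the Bash log function from section 8. The helper name make_logger and the logger name "tutorial" are hypothetical choices for this example.

```python
import logging
from pathlib import Path


def make_logger(log_file, name="tutorial"):
    """Return a logger that writes timestamped messages to `log_file`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        Path(log_file).parent.mkdir(parents=True, exist_ok=True)
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter(
                "[%(asctime)s] %(levelname)s: %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        logger.addHandler(handler)
    return logger
```

A script could then call log = make_logger("logs/validate_samples.log") once and use log.info("Processing %s", sample_sheet) or log.error(...) wherever it would otherwise print.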

13. Python command-line arguments with argparse

argparse turns a Python file into a reusable command-line tool.

Validate a sample sheet with Python
cat > scripts/validate_samples.py << 'EOF'
#!/usr/bin/env python3

import argparse
import csv
from pathlib import Path

REQUIRED_COLUMNS = ["sample_id", "group", "batch", "fastq_1", "fastq_2"]

def parse_args():
    parser = argparse.ArgumentParser(description="Validate a simple NGS sample sheet.")
    parser.add_argument("--samples", required=True, help="Input sample sheet in TSV format")
    return parser.parse_args()

def main():
    args = parse_args()
    sample_sheet = Path(args.samples)

    if not sample_sheet.exists():
        raise FileNotFoundError(f"Sample sheet not found: {sample_sheet}")

    with sample_sheet.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        columns = reader.fieldnames or []

        missing = [col for col in REQUIRED_COLUMNS if col not in columns]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")

        seen = set()
        count = 0
        for row in reader:
            count += 1
            sample_id = row["sample_id"]
            if sample_id in seen:
                raise ValueError(f"Duplicate sample_id: {sample_id}")
            seen.add(sample_id)

    print(f"OK: {count} samples validated in {sample_sheet}")

if __name__ == "__main__":
    main()
EOF

chmod +x scripts/validate_samples.py
./scripts/validate_samples.py --samples data/samples.tsv

14. Python file processing

Python is useful for reading compressed files, checking FASTQ structure and writing clean output tables.

Count FASTQ reads with Python
cat > scripts/count_fastq_reads.py << 'EOF'
#!/usr/bin/env python3

import argparse
import gzip
from pathlib import Path

def parse_args():
    parser = argparse.ArgumentParser(description="Count reads in a FASTQ or FASTQ.GZ file.")
    parser.add_argument("fastq", help="Input FASTQ or FASTQ.GZ file")
    return parser.parse_args()

def open_text(path: Path):
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return path.open("rt")

def main():
    args = parse_args()
    path = Path(args.fastq)

    line_count = 0
    with open_text(path) as handle:
        for line_count, _ in enumerate(handle, start=1):
            pass

    if line_count % 4 != 0:
        raise ValueError(f"FASTQ line count is not divisible by 4: {line_count}")

    print(f"{path}\t{line_count // 4}")

if __name__ == "__main__":
    main()
EOF

chmod +x scripts/count_fastq_reads.py
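
Counting lines confirms only that the total is a multiple of four. A slightly stricter structure check could look like the sketch below. The function name check_fastq is hypothetical, and the code assumes the common four-line FASTQ layout in which sequences are not wrapped over several lines.

```python
import gzip
from itertools import zip_longest
from pathlib import Path


def check_fastq(path):
    """Return the number of records, raising ValueError on malformed input."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    records = 0
    with opener(path, "rt") as handle:
        # Read the same file iterator four lines at a time.
        for header, seq, plus, qual in zip_longest(*[handle] * 4):
            if qual is None:
                raise ValueError(f"Truncated record after {records} reads")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"Bad record markers at read {records + 1}")
            if len(seq.rstrip("\n")) != len(qual.rstrip("\n")):
                raise ValueError(f"Length mismatch at read {records + 1}")
            records += 1
    return records
```

The check is deliberately strict: a trailing blank line or a missing quality line is reported as a truncated record.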

15. Running command-line tools from Python

Python can call command-line programs, but avoid building unsafe shell strings. Prefer passing a list of arguments to subprocess.run.

Run a command safely from Python
cat > scripts/run_fastqc_example.py << 'EOF'
#!/usr/bin/env python3

import argparse
import subprocess
from pathlib import Path

def parse_args():
    parser = argparse.ArgumentParser(description="Run FastQC on FASTQ files.")
    parser.add_argument("--fastq-dir", required=True)
    parser.add_argument("--outdir", required=True)
    return parser.parse_args()

def main():
    args = parse_args()
    fastq_dir = Path(args.fastq_dir)
    outdir = Path(args.outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    fastq_files = sorted(fastq_dir.glob("*.fastq.gz"))
    if not fastq_files:
        raise ValueError(f"No FASTQ files found in {fastq_dir}")

    command = ["fastqc", "--outdir", str(outdir), *map(str, fastq_files)]
    print("Running:", " ".join(command))
    subprocess.run(command, check=True)

if __name__ == "__main__":
    main()
EOF

Avoid subprocess.run("command " + user_input, shell=True) unless you have a specific reason and fully control the input. Passing arguments as a list is safer.
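
The difference can be demonstrated without any bioinformatics tool installed. In the sketch below the hostile file name is invented, and the child process is simply the current Python interpreter printing the arguments it receives.

```python
import subprocess
import sys

# Hypothetical hostile "file name" containing shell metacharacters.
file_name = "reads.fastq.gz; echo INJECTED"

# A tiny child program that prints the arguments it received, one per line.
child = [sys.executable, "-c", "import sys; print('\\n'.join(sys.argv[1:]))"]

# List form: no shell is involved, so the whole string is one argument.
result = subprocess.run(
    child + [file_name], check=True, capture_output=True, text=True
)
received = result.stdout.splitlines()
print(received)  # ['reads.fastq.gz; echo INJECTED']
```

With shell=True and string concatenation, a shell would instead treat the semicolon as a command separator and run echo INJECTED as a second command.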

16. Combining Bash, AWK and Python

Strong bioinformatics workflows often combine tools rather than forcing one language to do everything.

Task | Good tool | Reason
Run FastQC on many FASTQ files | Bash or workflow engine | Mostly command orchestration.
Extract sample IDs from a TSV | AWK | Simple table column extraction.
Validate a complex sample sheet | Python | Clear error handling and structured logic.
Summarize many QC reports | MultiQC, Python or R | Structured parsing and reporting.
Run a production RNA-seq pipeline | Nextflow or Snakemake | Scalability, dependency tracking and reproducibility.

17. Bioinformatics scripting examples

Create a FASTQ manifest from a sample sheet

AWK manifest generation
awk 'BEGIN{FS=OFS="\t"} NR==1 {print "sample_id","r1","r2"; next} {print $1,$4,$5}' \
  data/samples.tsv > results/fastq_manifest.tsv

Prepare per-sample output folders

Bash folder creation
tail -n +2 data/samples.tsv | while IFS=$'\t' read -r sample_id group batch fastq_1 fastq_2; do
  mkdir -p "results/${sample_id}"/{qc,alignment,counts}
done

Generate a simple run report with Python

Python Markdown report
cat > scripts/make_report.py << 'EOF'
#!/usr/bin/env python3

import csv
from pathlib import Path
from collections import Counter

sample_sheet = Path("data/samples.tsv")
report = Path("reports/project_summary.md")
report.parent.mkdir(parents=True, exist_ok=True)

groups = Counter()
batches = Counter()

with sample_sheet.open(newline="") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        groups[row["group"]] += 1
        batches[row["batch"]] += 1

lines = ["# Project summary", ""]
lines.append("## Samples per group")
for group, count in sorted(groups.items()):
    lines.append(f"- {group}: {count}")

lines.append("")
lines.append("## Samples per batch")
for batch, count in sorted(batches.items()):
    lines.append(f"- {batch}: {count}")

report.write_text("\n".join(lines) + "\n", encoding="utf-8")
print(f"Wrote {report}")
EOF

chmod +x scripts/make_report.py
./scripts/make_report.py

18. Testing, logging and reproducibility

Scripts used for real analysis should be tested and documented. Even small scripts benefit from clear input checks and logs.

Test small: Run scripts on two or three small files before full datasets.
Log commands: Capture stdout and stderr using >, 2> or tee.
Save versions: Record software versions in reports/software_versions/.
Use version control: Track scripts with Git when projects become important or collaborative.

Logging a Bash script run
mkdir -p logs

./scripts/check_sample_sheet.sh data/samples.tsv \
  > logs/check_sample_sheet.out.log \
  2> logs/check_sample_sheet.err.log
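
Saving software versions can be scripted in the same way. The sketch below is one possible approach; the function name record_version is hypothetical, and the version flag must be looked up for each tool because not every program uses --version.

```python
import subprocess
from pathlib import Path


def record_version(name, command, outdir="reports/software_versions"):
    """Run `command`, save its output to <outdir>/<name>.txt and return it."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    # Some tools print their version to stderr, so keep both streams.
    text = (result.stdout + result.stderr).strip()
    out_path = Path(outdir) / f"{name}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text + "\n", encoding="utf-8")
    return text
```

For example, record_version("python", ["python3", "--version"]) would write the interpreter version to reports/software_versions/python.txt.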

19. When scripts should become workflows

Scripts are excellent for small and medium tasks. When a project involves many samples, multiple processing stages, large files, parallel execution or complex dependencies, consider a workflow manager.

Project sign | Risk with simple scripts | Better option
Many dependent steps | Hard to rerun only failed steps. | Nextflow or Snakemake.
Dozens or hundreds of samples | Manual loops become fragile. | Workflow with sample sheet.
HPC or cloud execution | Job management becomes complex. | Workflow profiles and containers.
Publication or regulated project | Insufficient provenance. | Versioned workflow and documented environments.

20. Mini exercises

Practice on the example files from this tutorial before applying scripts to real data.

  1. Create a Bash script that prints the number of samples in data/samples.tsv.
  2. Use AWK to print only treated samples into results/treated_samples.tsv.
  3. Create output folders for each sample_id.
  4. Write a Python script that checks whether sample IDs are unique.
  5. Add command-line arguments to the Python script.
  6. Write logs for every script run into the logs/ folder.
  7. Save a Markdown report summarizing groups and batches.

21. Scripting cheat sheet

Need | Bash / AWK / Python pattern | Example
Exit on errors | set -euo pipefail | Place near the top of Bash scripts.
Use first argument | "${1}" | input="${1}"
Loop over files | for file in *.txt; do ...; done | Run one command per file.
Read TSV rows | while IFS=$'\t' read -r ... | Loop over sample sheet rows.
AWK tab-separated input | BEGIN{FS=OFS="\t"} | Process TSV tables.
AWK skip header | NR>1 | Ignore column names.
Python paths | from pathlib import Path | Reliable file handling.
Python arguments | argparse.ArgumentParser() | Reusable command-line tools.
Run external tool | subprocess.run([...], check=True) | Safer than shell strings.

Frequently asked questions

When should I write a Bash script instead of typing commands manually?

Write a Bash script when a command sequence must be repeated, documented, shared, run on many files, or used as part of a larger workflow. Manual commands are fine for exploration, but scripts are better for reproducibility.

When should I use AWK instead of Python?

Use AWK for quick line-by-line processing of text tables, logs, BED files, GTF files, sample sheets and simple summaries. Use Python when the logic becomes complex, needs data structures, modules, plotting, APIs, validation, or maintainable software design.

Is Bash enough for bioinformatics scripting?

Bash is excellent for connecting command-line tools, looping over files and managing pipelines. For complex parsing, statistics, metadata validation or reusable programs, it is usually better to combine Bash with AWK, Python, R or a workflow engine.

Should scripts include software versions?

Yes. For NGS and clinical-bioinformatics projects, scripts should capture command versions, parameters, input files, reference versions and logs. This makes analysis auditable and reproducible.

How do I avoid accidentally deleting files in Bash?

Check the current directory with pwd, preview wildcards with ls, avoid rm -rf until you are certain, quote variables, use set -euo pipefail, write output to new folders and keep raw data read-only.

What is the best first Python package for bioinformatics scripting?

For general scripting, start with the Python standard library: pathlib, argparse, csv, gzip, subprocess and logging. For tables, pandas is useful. For sequence files, Biopython is widely used.