Custom Scripting with Shell/Bash, AWK and Python
Practical Scripting Tutorial
Introduction to custom scripting with Shell/Bash, AWK and Python.
A beginner-friendly tutorial for scientists and bioinformatics users who want to automate repetitive tasks, process text files, validate metadata, run NGS tools across many samples and build reproducible scripts using Shell/Bash, AWK and Python.
Custom scripting is the bridge between interactive command-line work and reproducible data analysis. In bioinformatics, scripts are used to rename files, validate sample sheets, loop through FASTQ files, run tools, summarize QC reports, filter tables, generate reports and prepare data for downstream interpretation.
Shell/Bash: Connect command-line tools, manage files and automate pipelines.
AWK: Process structured text files line by line with compact commands.
Python: Build maintainable scripts with validation, functions, modules and reusable logic.
Practical rule: use Bash to orchestrate tools, AWK for quick text-table transformations and Python when the logic becomes complex enough to deserve a real program.
2. When should you write a script?
A single command is fine for exploration. A script is better when work should be repeated, shared, checked, logged or run on many files.
Write a script when...
You run the same commands more than once.
You process many samples.
You need a record of exact commands.
You want to reduce manual mistakes.
You need logs, reports or version tracking.
Keep it interactive when...
You are exploring an unknown file.
You are testing one command.
The command will not be reused.
You are checking quick summary information.
3. Recommended project setup
Keep scripts, logs, input data and results separated. This structure makes projects easier to review and rerun.
Create a scripting practice project
mkdir -p scripting_tutorial/{data,results,scripts,logs,reports}
cd scripting_tutorial
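# Columns must be separated by real tab characters; printf makes each tab explicit.
{
  printf 'sample_id\tgroup\tbatch\tfastq_1\tfastq_2\n'
  printf 'S1\tcontrol\tA\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n'
  printf 'S2\tcontrol\tA\tS2_R1.fastq.gz\tS2_R2.fastq.gz\n'
  printf 'S3\ttreated\tB\tS3_R1.fastq.gz\tS3_R2.fastq.gz\n'
  printf 'S4\ttreated\tB\tS4_R1.fastq.gz\tS4_R2.fastq.gz\n'
} > data/samples.tsv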
Keep raw data in data/, scripts in scripts/, logs in logs/, and generated outputs in results/ or reports/.
4. Bash script basics
Bash scripts are text files containing shell commands. A typical script starts with a shebang line, safety settings and a clear description.
Bash can run thousands of operations quickly, including destructive operations. Safe habits matter.
Always check location: Run pwd before deleting or moving files.
Preview wildcards: Use ls pattern* before rm pattern*.
Quote variables: Use "${file}" to avoid word-splitting bugs.
Write logs: Redirect command output and errors into a logs/ folder.
Keep raw data read-only: Never write intermediate files into raw data folders.
Test on small files: Validate scripts before running on full NGS datasets.
10. AWK basics
AWK is designed for line-by-line processing of structured text. It is especially useful for tabular files such as TSV metadata, BED intervals, GTF annotations and simple result tables.
# Example assumes columns: chrom, start, end, name
awk 'BEGIN{FS=OFS="\t"} ($3-$2) >= 100 {print}' regions.bed > regions_min100bp.bed
Use AWK for compact transformations. When commands become long and hard to read, move the logic into a Python script.
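As a rough illustration of that hand-over, the BED length filter above could be written in Python as follows. This is a minimal sketch, not a drop-in replacement: the function name filter_bed_by_length is hypothetical, and unlike the AWK one-liner it skips blank lines and comment, track and browser lines instead of evaluating them.

```python
#!/usr/bin/env python3
"""Keep BED intervals of at least a minimum length (sketch of the AWK filter in Python)."""
from pathlib import Path


def filter_bed_by_length(in_path: Path, out_path: Path, min_length: int = 100) -> int:
    """Write rows whose (end - start) >= min_length and return how many were kept."""
    kept = 0
    with in_path.open() as src, out_path.open("w") as dst:
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            # Assumption: blank lines and BED header lines are skipped, not filtered.
            if fields == [""] or fields[0].startswith(("#", "track", "browser")):
                continue
            # Columns 2 and 3 hold start and end, as in the AWK example.
            start, end = int(fields[1]), int(fields[2])
            if end - start >= min_length:
                dst.write("\t".join(fields) + "\n")
                kept += 1
    return kept
```

Calling filter_bed_by_length(Path("regions.bed"), Path("regions_min100bp.bed")) would mirror the AWK command. The Python version is longer, but each rule has a name and a place where further checks can be added.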
12. Python scripting basics
Python is a good choice for scripts that need clear structure, validation, functions, error handling and reuse. It is often easier to maintain than long Bash or AWK one-liners.
Minimal Python script
cat > scripts/hello_python.py << 'EOF'
#!/usr/bin/env python3
from pathlib import Path

# Path.cwd() is the directory the script is run from.
project_dir = Path.cwd()
print("Hello from Python")
print(f"Current directory: {project_dir}")
print("Files:")
for path in sorted(project_dir.iterdir()):
    print(path.name)
EOF
chmod +x scripts/hello_python.py
./scripts/hello_python.py
| Python module | Use | Example task |
| --- | --- | --- |
| pathlib | Path and file handling. | Find FASTQ files in a folder. |
| csv | Read and write CSV/TSV files. | Validate sample sheets. |
| gzip | Read compressed text files. | Inspect FASTQ files. |
| argparse | Command-line arguments. | Reusable scripts with input/output options. |
| logging | Structured logs. | Record script progress and errors. |
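Two of these modules, pathlib and logging, can be combined in a few lines. The sketch below is only illustrative: the function names, the logger name find_fastq and the log format are assumptions rather than part of the tutorial, and it looks only at files directly inside the given folder.

```python
#!/usr/bin/env python3
"""List FASTQ files in a folder and log progress (illustrative sketch)."""
import logging
from pathlib import Path

logger = logging.getLogger("find_fastq")  # hypothetical logger name


def setup_logging(log_file: Path) -> None:
    """Send INFO-level messages to both the terminal and a log file."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[logging.StreamHandler(), logging.FileHandler(log_file)],
    )


def find_fastq_files(folder: Path) -> list[Path]:
    """Return sorted paths ending in .fastq or .fastq.gz directly inside folder."""
    matches = [
        path for path in folder.iterdir()
        if path.is_file() and path.name.endswith((".fastq", ".fastq.gz"))
    ]
    logger.info("Found %d FASTQ file(s) in %s", len(matches), folder)
    return sorted(matches)
```

A script in this project might call setup_logging(Path("logs/find_fastq.log")) once at startup and then find_fastq_files(Path("data")), so that every run leaves a timestamped line in logs/.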
13. Python command-line arguments with argparse
argparse turns a Python file into a reusable command-line tool.
Validate a sample sheet with Python
cat > scripts/validate_samples.py << 'EOF'
#!/usr/bin/env python3
import argparse
import csv
from pathlib import Path

REQUIRED_COLUMNS = ["sample_id", "group", "batch", "fastq_1", "fastq_2"]


def parse_args():
    parser = argparse.ArgumentParser(description="Validate a simple NGS sample sheet.")
    parser.add_argument("--samples", required=True, help="Input sample sheet in TSV format")
    return parser.parse_args()


def main():
    args = parse_args()
    sample_sheet = Path(args.samples)
    if not sample_sheet.exists():
        raise FileNotFoundError(f"Sample sheet not found: {sample_sheet}")
    with sample_sheet.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Check the header before reading any rows.
        columns = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in columns]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        # Track sample IDs already seen so duplicates are reported immediately.
        seen = set()
        count = 0
        for row in reader:
            count += 1
            sample_id = row["sample_id"]
            if sample_id in seen:
                raise ValueError(f"Duplicate sample_id: {sample_id}")
            seen.add(sample_id)
    print(f"OK: {count} samples validated in {sample_sheet}")


if __name__ == "__main__":
    main()
EOF
chmod +x scripts/validate_samples.py
./scripts/validate_samples.py --samples data/samples.tsv
14. Python file processing
Python is useful for reading compressed files, checking FASTQ structure and writing clean output tables.
Count FASTQ reads with Python
cat > scripts/count_fastq_reads.py << 'EOF'
#!/usr/bin/env python3
import argparse
import gzip
from pathlib import Path


def parse_args():
    parser = argparse.ArgumentParser(description="Count reads in a FASTQ or FASTQ.GZ file.")
    parser.add_argument("fastq", help="Input FASTQ or FASTQ.GZ file")
    return parser.parse_args()


def open_text(path: Path):
    # Choose the opener from the file suffix; "rt" reads text rather than bytes.
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return path.open("rt")


def main():
    args = parse_args()
    path = Path(args.fastq)
    line_count = 0
    with open_text(path) as handle:
        # After the loop, line_count holds the number of the last line read.
        for line_count, _ in enumerate(handle, start=1):
            pass
    if line_count % 4 != 0:
        raise ValueError(f"FASTQ line count is not divisible by 4: {line_count}")
    print(f"{path}\t{line_count // 4}")


if __name__ == "__main__":
    main()
EOF
chmod +x scripts/count_fastq_reads.py
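Counting lines confirms only that the total is a multiple of four. A slightly stricter check of FASTQ structure might look like the sketch below. It assumes the common four-line record layout (no wrapped sequences), and the function name check_fastq is hypothetical.

```python
#!/usr/bin/env python3
"""Check the four-line record structure of a FASTQ or FASTQ.GZ file (sketch)."""
import gzip
from itertools import zip_longest
from pathlib import Path


def open_text(path: Path):
    """Open plain or gzip-compressed text based on the file suffix."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return path.open("rt")


def check_fastq(path: Path) -> int:
    """Return the number of records, raising ValueError on the first malformed one."""
    records = 0
    with open_text(path) as handle:
        # Passing the same iterator four times yields one 4-line record per step;
        # missing lines in a truncated final record are filled with None.
        for header, seq, plus, qual in zip_longest(*[handle] * 4):
            records += 1
            if qual is None:
                raise ValueError(f"Record {records} is truncated")
            if not header.startswith("@"):
                raise ValueError(f"Record {records}: header does not start with '@'")
            if not plus.startswith("+"):
                raise ValueError(f"Record {records}: separator does not start with '+'")
            if len(seq.rstrip("\r\n")) != len(qual.rstrip("\r\n")):
                raise ValueError(f"Record {records}: sequence and quality lengths differ")
    return records
```

Stopping at the first malformed record keeps the error message specific. For large files it may be enough to run such a check on a small subset before starting a full analysis.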
15. Running command-line tools from Python
Python can call command-line programs, but avoid building unsafe shell strings. Prefer passing a list of arguments to subprocess.run.
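In particular, avoid subprocess.run("command " + user_input, shell=True) unless you have a specific reason and fully control the input. With shell=True, the shell interprets spaces, semicolons and other metacharacters in the string, so unexpected input can run unintended commands.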
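The list form can be sketched as follows. To keep the example runnable anywhere, it calls the Python interpreter itself (sys.executable) as a stand-in for a real tool; the helper name run_tool is hypothetical.

```python
#!/usr/bin/env python3
"""Run an external command by passing arguments as a list (sketch)."""
import subprocess
import sys


def run_tool(args: list[str]) -> str:
    """Run a command without a shell and return its standard output as text."""
    result = subprocess.run(
        args,
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
    )
    return result.stdout


if __name__ == "__main__":
    # A name containing spaces and a semicolon is passed through as one argument,
    # because no shell ever parses it.
    tricky_name = "my sample; echo oops"
    echo_code = "import sys; print(sys.argv[1])"
    print(run_tool([sys.executable, "-c", echo_code, tricky_name]).strip())
```

With check=True, a failing tool stops the script with an exception instead of letting later steps run on missing or partial output.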
16. Combining Bash, AWK and Python
Strong bioinformatics workflows often combine tools rather than forcing one language to do everything.
| Task | Good tool | Reason |
| --- | --- | --- |
| Run FastQC on many FASTQ files | Bash or workflow engine | Mostly command orchestration. |
| Extract sample IDs from a TSV | AWK | Simple table column extraction. |
| Validate a complex sample sheet | Python | Clear error handling and structured logic. |
| Summarize many QC reports | MultiQC, Python or R | Structured parsing and reporting. |
| Run a production RNA-seq pipeline | Nextflow or Snakemake | Scalability, dependency tracking and reproducibility. |
Scripts are excellent for small and medium tasks. When a project includes many samples, multiple processing stages, large files, parallel execution and complex dependencies, consider a workflow manager.
| Project sign | Risk with simple scripts | Better option |
| --- | --- | --- |
| Many dependent steps | Hard to rerun only failed steps. | Nextflow or Snakemake. |
| Dozens or hundreds of samples | Manual loops become fragile. | Workflow with sample sheet. |
| HPC or cloud execution | Job management becomes complex. | Workflow profiles and containers. |
| Publication or regulated project | Insufficient provenance. | Versioned workflow and documented environments. |
20. Mini exercises
Practice on the example files from this tutorial before applying scripts to real data.
Create a Bash script that prints the number of samples in data/samples.tsv.
Use AWK to print only treated samples into results/treated_samples.tsv.
Create output folders for each sample_id.
Write a Python script that checks whether sample IDs are unique.
Add command-line arguments to the Python script.
Write logs for every script run into the logs/ folder.
Save a Markdown report summarizing groups and batches.
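As a starting point for the last exercise, one possible approach is sketched below. It assumes the tab-separated sample sheet from this tutorial, with group and batch columns; the function names and the report layout are illustrative choices, not a reference solution.

```python
#!/usr/bin/env python3
"""Summarize groups and batches from a sample sheet as a Markdown report (sketch)."""
import csv
from collections import Counter
from pathlib import Path


def summarize(sample_sheet: Path) -> str:
    """Return Markdown text with sample counts per group and per batch."""
    with sample_sheet.open(newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    lines = ["# Sample summary", "", f"Total samples: {len(rows)}", ""]
    for column in ("group", "batch"):
        # Counter tallies how often each value occurs in the column.
        counts = Counter(row[column] for row in rows)
        lines += [f"## Samples per {column}", ""]
        lines += [f"- {value}: {n}" for value, n in sorted(counts.items())]
        lines.append("")
    return "\n".join(lines)


def write_report(sample_sheet: Path, report: Path) -> None:
    """Write the summary to a Markdown file, creating the folder if needed."""
    report.parent.mkdir(parents=True, exist_ok=True)
    report.write_text(summarize(sample_sheet))
```

For example, write_report(Path("data/samples.tsv"), Path("reports/sample_summary.md")) would save the report into the reports/ folder created earlier.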
21. Scripting cheat sheet
| Need | Bash / AWK / Python pattern | Example |
| --- | --- | --- |
| Exit on errors | set -euo pipefail | Place near the top of Bash scripts. |
| Use first argument | "${1}" | input="${1}" |
| Loop over files | for file in *.txt; do ...; done | Run one command per file. |
| Read TSV rows | while IFS=$'\t' read -r ... | Loop over sample sheet rows. |
| AWK tab-separated input | BEGIN{FS=OFS="\t"} | Process TSV tables. |
| AWK skip header | NR>1 | Ignore column names. |
| Python paths | from pathlib import Path | Reliable file handling. |
| Python arguments | argparse.ArgumentParser() | Reusable command-line tools. |
| Run external tool | subprocess.run([...], check=True) | Safer than shell strings. |
When should I write a Bash script instead of typing commands manually?
Write a Bash script when a command sequence must be repeated, documented, shared, run on many files, or used as part of a larger workflow. Manual commands are fine for exploration, but scripts are better for reproducibility.
When should I use AWK instead of Python?
Use AWK for quick line-by-line processing of text tables, logs, BED files, GTF files, sample sheets and simple summaries. Use Python when the logic becomes complex, needs data structures, modules, plotting, APIs, validation, or maintainable software design.
Is Bash enough for bioinformatics scripting?
Bash is excellent for connecting command-line tools, looping over files and managing pipelines. For complex parsing, statistics, metadata validation or reusable programs, it is usually better to combine Bash with AWK, Python, R or a workflow engine.
Should scripts include software versions?
Yes. For NGS and clinical-bioinformatics projects, scripts should capture command versions, parameters, input files, reference versions and logs. This makes analysis auditable and reproducible.
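A minimal sketch of such a record, written from Python, is shown below. The function names and the JSON layout are assumptions, and the input checksums are an addition beyond what the answer above lists. Versions of external tools could be recorded in the same dictionary, but the flag that prints a version differs between tools, so that step is left out here.

```python
#!/usr/bin/env python3
"""Record run provenance (versions, parameters, inputs) as JSON (illustrative sketch)."""
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(inputs: list[Path], parameters: dict) -> dict:
    """Collect what was run, with which settings, on which files."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "command": sys.argv,
        "parameters": parameters,
        # Assumption: checksums are stored so the exact input contents can be verified.
        "inputs": {str(path): sha256sum(path) for path in inputs},
    }


def write_run_record(record: dict, out_path: Path) -> None:
    """Save the record as indented JSON, creating the folder if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2) + "\n")
```

Writing one such file per run into logs/ gives a simple audit trail that can be compared between runs.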
How do I avoid accidentally deleting files in Bash?
Check the current directory with pwd, preview wildcards with ls, avoid rm -rf until you are certain, quote variables, use set -euo pipefail, write output to new folders and keep raw data read-only.
What is the best first Python package for bioinformatics scripting?
For general scripting, start with the Python standard library: pathlib, argparse, csv, gzip, subprocess and logging. For tables, pandas is useful. For sequence files, Biopython is widely used.