Bioinformatics Tutorial

Data sources and data formats for NGS and bioinformatics.

A practical tutorial for understanding where bioinformatics data come from, how public repositories and reference databases are used, and how common file formats such as FASTQ, FASTA, BAM, CRAM, VCF, GTF, BED, bigWig and count matrices fit into reproducible NGS workflows.

1. Overview

Bioinformatics projects usually combine several data sources and file types. A typical NGS analysis may start with raw FASTQ files, align them to a reference genome, use a gene annotation file to count reads, compare the results across metadata-defined sample groups, and interpret findings using variant, pathway or disease databases.

  • Input data: FASTQ files, public datasets, reference genomes, annotations and sample metadata.
  • Analysis files: BAM/CRAM alignments, VCF variants, count matrices, peak files and signal tracks.
  • Interpretation: population databases, disease databases, cancer resources, pathways and literature.

Core principle: keep raw data, reference data and metadata clearly separated from derived results. Record where each file came from, which version it represents and how it was processed.

2. Main types of bioinformatics data sources

Different resources serve different purposes. Some store raw sequencing reads, some provide reference genomes, some curate variants or diseases, and others provide processed expression or clinical datasets.

Source type | Examples | Typical use
Raw read archives | NCBI SRA, EMBL-EBI ENA, DDBJ | Download public sequencing reads for reanalysis, benchmarking or meta-analysis.
Functional genomics repositories | NCBI GEO, ArrayExpress, BioStudies | Find expression, epigenomics and multi-omics studies with metadata and processed results.
Reference genome resources | Ensembl, UCSC, NCBI RefSeq, GENCODE | Download genome FASTA files, gene annotations and genome indexes.
Variation databases | dbSNP, gnomAD, ClinVar, dbVar | Annotate variants, filter population polymorphisms and interpret clinical relevance.
Cancer genomics resources | COSMIC, OncoKB, CIViC, cBioPortal, NCI GDC | Interpret tumour variants, driver mutations, gene alterations and cancer datasets.
Pathway and function databases | Gene Ontology, Reactome, KEGG, UniProt, STRING | Interpret gene lists, pathways, networks and protein function.

3. Raw read repositories

Raw read repositories store sequencing data submitted by researchers and consortia. These repositories are important for reanalysis, methods development, benchmarking and meta-analysis.

  • NCBI SRA: large archive of raw sequencing reads. Data can often be downloaded with the SRA Toolkit, for example with prefetch followed by fasterq-dump.
  • EMBL-EBI ENA: European archive for sequence reads, assemblies and related metadata, often offering FTP/Aspera download routes.
  • DDBJ: Japanese sequence data archive and part of the International Nucleotide Sequence Database Collaboration (INSDC).
  • GEO and BioStudies: often provide study-level metadata, processed matrices and links to raw reads stored in SRA or ENA.

Public dataset identifiers matter. Record accessions such as SRR, ERR, DRR, PRJNA, PRJEB, GSE or E-MTAB identifiers in project reports.
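
Because each accession prefix points to a specific archive and record type, identifiers can be labelled automatically when a report is assembled. The following is a minimal sketch, not an official tool: classify_accession is a hypothetical helper, and it recognises only the prefixes named above.

```shell
# Label a public accession by its prefix (illustrative helper; covers only
# the prefixes mentioned in this section).
classify_accession() {
  case "$1" in
    SRR*)     echo "SRA run" ;;
    ERR*)     echo "ENA run" ;;
    DRR*)     echo "DDBJ run" ;;
    PRJNA*)   echo "BioProject registered at NCBI" ;;
    PRJEB*)   echo "BioProject registered at ENA" ;;
    GSE*)     echo "GEO series" ;;
    E-MTAB-*) echo "ArrayExpress experiment" ;;
    *)        echo "unknown" ;;
  esac
}
```

A call such as classify_accession with a run accession starting with ERR prints "ENA run"; anything outside the listed prefixes is reported as "unknown" rather than guessed.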

4. Reference genomes

A reference genome provides the coordinate system for alignment, variant calling, gene annotation and visualization. Every downstream file (alignments, annotations, variant calls and interval files) must use the same genome assembly.

Resource | Typical files | Notes
Ensembl | Genome FASTA, GTF/GFF3, cDNA, CDS and protein FASTA files. | Widely used for gene annotation and comparative genomics.
GENCODE | High-quality human and mouse gene annotations. | Common choice for RNA-seq and gene-expression analysis.
NCBI RefSeq | Reference sequences, genomes, transcripts and protein annotations. | Curated NCBI reference resource used in many annotation pipelines.
UCSC Genome Browser | Genome FASTA, chain files, liftOver files, tracks and annotation tables. | Useful for coordinate conversion and browser-based visualization.

Do not mix genome assemblies accidentally. Human GRCh37/hg19 and GRCh38/hg38 coordinates are not interchangeable without coordinate conversion.
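
A quick, partial safeguard is to compare contig names between the reference FASTA and a downstream file. The sketch below assumes an uncompressed, tab-separated file with the contig name in column 1 (for example BED or VCF); missing_contigs is a hypothetical helper. Matching names do not prove that assemblies match, because GRCh37 and GRCh38 share chromosome names, so an empty result is necessary rather than sufficient.

```shell
# List contig names that occur in a BED/VCF-like file but not in the reference FASTA.
missing_contigs() {
  # $1 = reference FASTA, $2 = tab-separated file with contig names in column 1
  awk -F '\t' '
    FNR == NR {
      # First file: collect FASTA header names (text after ">" up to the first whitespace)
      if (substr($0, 1, 1) == ">") {
        name = substr($0, 2)
        sub(/[ \t].*$/, "", name)
        ref[name] = 1
      }
      next
    }
    /^#/ || /^track/ || /^browser/ { next }   # skip VCF and BED header lines
    !($1 in ref) && !seen[$1]++ { print $1 }  # report each unknown contig once
  ' "$1" "$2"
}
```

Typical usage would be missing_contigs reference.fa targets.bed, where both file names are placeholders. Output such as "1" or "MT" against a reference that uses "chr1" and "chrM" usually indicates a naming-style mismatch between resources.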

5. Genome annotations

Genome annotation files define genes, transcripts, exons, promoters, repeats, regulatory regions or target intervals. They are essential for read counting, variant annotation and genomic-interval analysis.

  • Gene annotations: GTF or GFF3 files define gene, transcript, exon and coding-sequence structures.
  • Target regions: BED or interval-list files define capture panels, exomes, amplicons or regions of interest.
  • Regulatory tracks: BED, bigBed, bigWig and bedGraph files can represent enhancers, promoters, peaks or signal intensity.
  • Coordinate conversion: chain files and liftOver tools convert genomic coordinates between assemblies when appropriate.

6. Variant and population databases

Variant databases support filtering and interpretation. Population databases help identify common variants. Clinical and disease databases help interpret relevance, but they should be used with context and version tracking.

Database | Use | Typical caution
dbSNP | Known short variants and identifiers. | Presence in dbSNP says nothing about pathogenicity: a listed variant may be benign, pathogenic or of unknown significance.
gnomAD | Population allele frequencies for filtering rare or common variants. | Interpret frequency in population and disease context.
ClinVar | Clinical assertions for human variants. | Review assertion criteria, submitter confidence and update date.
dbVar | Structural variation records. | Structural variant representation differs between datasets and tools.
GWAS Catalog | Trait-associated variants from genome-wide association studies. | Association is not the same as causality or diagnostic relevance.

7. Clinical and cancer-genomics resources

Cancer and clinical-genomics interpretation often combines variant calls with curated databases, cohort resources and literature. These resources differ in scope, licensing and evidence model.

  • COSMIC: catalogue of somatic mutations in cancer and cancer-gene information.
  • OncoKB: precision-oncology knowledgebase for variant oncogenicity and therapeutic evidence.
  • CIViC: open resource for evidence-based clinical interpretation of cancer variants.
  • NCI GDC and cBioPortal: resources for exploring cancer cohorts, mutations, copy number, expression and clinical associations.

Clinical interpretation requires validated workflows, appropriate consent, data protection and expert review. Research analysis should not be presented as a diagnostic report unless the workflow is validated for that purpose.

8. Omics and functional interpretation resources

Processed omics resources can help validate findings, compare studies and interpret gene lists.

Resource type | Examples | Use
Expression atlases | GEO, Expression Atlas, GTEx-style resources | Compare expression across tissues, disease states or public studies.
Pathway databases | Reactome, KEGG, Gene Ontology | Interpret gene sets and biological processes.
Protein resources | UniProt, STRING | Protein function, domains, interaction networks and functional associations.
Single-cell resources | Cell atlases, public h5ad/matrix repositories | Reference annotation, cell-type signatures and comparative analysis.

9. Metadata and sample sheets

Metadata describes what each file and sample represents. In many projects, metadata quality is as important as sequencing quality.

Recommended sample-sheet columns

  • sample_id: stable, unique identifier.
  • fastq_1 and fastq_2: file names or paths for paired-end reads.
  • group: biological condition or class.
  • replicate: biological replicate identifier.
  • batch: sequencing run, library-preparation batch or processing batch.
  • organism: species or strain.
  • reference: genome assembly and annotation version.
  • library_type: RNA-seq strandedness, single-cell chemistry, capture kit or panel information.

Example sample sheet
sample_id	group	batch	fastq_1	fastq_2
S1	control	A	S1_R1.fastq.gz	S1_R2.fastq.gz
S2	control	B	S2_R1.fastq.gz	S2_R2.fastq.gz
S3	treated	A	S3_R1.fastq.gz	S3_R2.fastq.gz
S4	treated	B	S4_R1.fastq.gz	S4_R2.fastq.gz
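
The column recommendations above can be checked mechanically before a workflow starts. The following sketch assumes a tab-separated sheet with a header row and treats sample_id and group as required columns; validate_sample_sheet is a hypothetical helper, and the required columns should be adapted to the project.

```shell
# Report structural problems in a tab-separated sample sheet.
# Exits non-zero if a required column is missing, a row has the wrong
# number of fields, or any field is empty.
validate_sample_sheet() {
  # $1 = sample sheet (TSV with a header row)
  awk -F '\t' '
    NR == 1 {
      ncol = NF
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("sample_id" in col)) { print "missing column: sample_id"; bad = 1 }
      if (!("group" in col))     { print "missing column: group"; bad = 1 }
      next
    }
    NF != ncol {
      printf "line %d: expected %d fields, found %d\n", NR, ncol, NF
      bad = 1
    }
    {
      for (i = 1; i <= NF; i++)
        if ($i == "") { printf "line %d: empty value in column %d\n", NR, i; bad = 1 }
    }
    END { exit bad + 0 }
  ' "$1"
}
```

Running validate_sample_sheet samples.tsv prints nothing and returns success for a well-formed sheet, so it can gate a pipeline start. It checks structure only; whether group labels are biologically correct still needs a human review.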

10. Raw sequence formats: FASTQ and FASTA

FASTQ and FASTA are foundational sequence formats.

Format | What it stores | Common use
FASTQ | Read identifier, sequence, separator and per-base quality scores. | Raw sequencing reads from Illumina, MGI, Ion Torrent, PacBio or Oxford Nanopore workflows.
FASTA | Sequence identifier and sequence without per-base quality scores. | Reference genomes, transcriptomes, proteins, contigs and extracted sequences.

FASTQ record
@read_001
ACGTACGTACGT
+
FFFFFFFFFFFF

FASTA record
>chr1_example
ACGTACGTACGTACGTACGT
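
The fourth line of a FASTQ record encodes one quality score per base. Assuming the Phred+33 encoding used by current Illumina output, the quality Q of a base is the ASCII code of its character minus 33, and the estimated error probability is 10^(-Q/10). The sketch below decodes a quality string; phred_scores is a hypothetical helper.

```shell
# Decode a FASTQ quality string into Phred scores, assuming Phred+33 encoding.
phred_scores() {
  # $1 = quality string, e.g. the fourth line of a FASTQ record
  printf '%s' "$1" |
    od -An -v -tu1 |   # print each byte as an unsigned decimal ASCII code
    awk '{
      for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " }
    } END { print "" }'
}
```

For the record shown above, every quality character is F (ASCII 70), so each base has Q = 70 - 33 = 37, which corresponds to an error probability of about 0.0002.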

11. Alignment formats: SAM, BAM and CRAM

Alignment files store where sequencing reads map to a reference genome or transcriptome.

Format | Description | Notes
SAM | Text alignment format. | Human-readable but large. Useful for inspection, not ideal for storage.
BAM | Binary compressed alignment format. | Common storage and analysis format. Usually indexed with .bai.
CRAM | Reference-based compressed alignment format. | Can save storage, but requires the correct reference sequence for full interpretation.

Alignment files should usually be coordinate-sorted and indexed before visualization or many downstream analyses.

12. Variant formats: VCF and BCF

Variant files store genomic differences relative to a reference. They can represent single-nucleotide variants, indels, genotypes, allele depths, filters, annotations and sometimes structural variants.

  • VCF: text-based Variant Call Format. Human-readable and widely supported.
  • BCF: binary compressed equivalent of VCF. Efficient for large-scale processing with BCFtools.

Simplified VCF example
##fileformat=VCFv4.3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	12345	rsExample	A	G	99	PASS	DP=42	GT:DP	0/1:42
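
To see how the FORMAT column maps onto a sample column, the sketch below pulls the GT value for the first sample out of each record with awk. It assumes an uncompressed, tab-separated VCF with at least one sample column; vcf_first_genotype is a hypothetical helper for illustration, and bcftools query is the more robust route for real data.

```shell
# Print CHROM, POS, REF, ALT and the first sample's genotype for each record.
vcf_first_genotype() {
  # $1 = uncompressed VCF file
  awk -F '\t' -v OFS='\t' '
    /^#/ { next }                      # skip meta-information and header lines
    {
      n = split($9, keys, ":")         # FORMAT keys, e.g. GT:DP
      split($10, vals, ":")            # values for the first sample
      gt = "."
      for (i = 1; i <= n; i++) if (keys[i] == "GT") gt = vals[i]
      print $1, $2, $4, $5, gt
    }
  ' "$1"
}
```

Applied to the simplified example above, this prints chr1, 12345, A, G and 0/1 as one tab-separated line, showing that the sample is heterozygous for the alternate allele.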

13. Annotation and interval formats: GTF, GFF, BED and interval lists

Annotation and interval files describe genomic features and regions.

Format | Stores | Common use
GTF | Gene, transcript and exon annotations with structured attributes. | RNA-seq gene counting and transcriptome annotation.
GFF3 | General feature annotations with hierarchical relationships. | Genome annotation for many organisms.
BED | Genomic intervals with 0-based, half-open coordinates. | Target regions, peaks, promoters, blacklists and custom intervals.
Interval list | Genome intervals with sequence dictionary metadata. | Often used in Picard/GATK-style workflows.

Coordinate conventions matter. BED uses 0-based, half-open intervals, while GTF, GFF3, VCF and SAM use 1-based coordinates. Mixing conventions can shift regions by one base.
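
The conversion itself is a single subtraction: a 1-based, closed interval [start, end] becomes the 0-based, half-open interval [start - 1, end). The sketch below applies this to GTF features; gtf_to_bed is a hypothetical helper that writes the feature type into the BED name column and a placeholder score of 0.

```shell
# Convert GTF features (1-based, closed) to BED lines (0-based, half-open).
gtf_to_bed() {
  # $1 = GTF file; output columns: chrom, start, end, name, score, strand
  awk -F '\t' -v OFS='\t' '
    /^#/ { next }                          # skip comment lines
    { print $1, $4 - 1, $5, $3, 0, $7 }    # only the start coordinate shifts
  ' "$1"
}
```

An exon at GTF positions 101 to 200 becomes BED start 100 and end 200. The length is 100 bases under both conventions (200 - 101 + 1 and 200 - 100), which is a convenient sanity check after any conversion.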

14. Signal tracks, matrices and tabular formats

Many downstream results are stored as matrices, tracks or tables.

Format | What it stores | Common use
bedGraph | Genome-wide signal values over intervals. | Coverage, methylation or enrichment tracks.
bigWig | Indexed binary signal track. | Efficient genome-browser visualization.
Count matrix | Features by samples or cells. | RNA-seq, single-cell RNA-seq and other quantified omics data.
HDF5 / h5ad / h5 | Hierarchical binary data structures. | Single-cell, spatial, machine-learning and large matrix workflows.
CSV / TSV | Plain-text tables. | Metadata, result tables, annotation exports and reports.
JSON / YAML | Structured configuration and metadata. | Workflow configuration, software environments and pipeline parameters.
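
A count matrix is only interpretable if each column can be matched to a row of sample metadata. The sketch below lists matrix columns that have no entry in the sample sheet. It assumes a tab-separated matrix with features as rows and sample IDs in the header from column 2 onwards, and a sample sheet whose first column is sample_id; unmatched_samples is a hypothetical helper.

```shell
# List count-matrix columns that are absent from the sample sheet.
unmatched_samples() {
  # $1 = count matrix (TSV; header holds sample IDs from column 2 onwards)
  # $2 = sample sheet (TSV with header; sample_id in column 1)
  awk -F '\t' '
    FNR == NR { if (FNR > 1) sheet[$1] = 1; next }   # sample sheet: remember IDs
    FNR == 1 {                                       # matrix header: check each column
      for (i = 2; i <= NF; i++) if (!($i in sheet)) print $i
      exit
    }
  ' "$2" "$1"
}
```

With placeholder file names, unmatched_samples counts.tsv samples.tsv prints nothing when every matrix column is described in the sheet. Only the header line of the matrix is read, so the check stays fast for large files.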

15. Compression, indexing and checksums

Bioinformatics files are often large. Compression and indexing make storage and access more efficient.

  • gzip: common compression for FASTQ, FASTA, VCF and tabular files. Files often end with .gz.
  • bgzip and tabix: block compression and indexing used for coordinate-sorted VCF, BED and GFF-like files.
  • BAM/CRAM indexes: .bai, .crai and related indexes allow fast access to genomic regions.
  • Checksums: MD5 or SHA checksums verify that files were transferred without corruption.

Checksum examples
md5sum sample_R1.fastq.gz > sample_R1.fastq.gz.md5
sha256sum reference.fa.gz > reference.fa.gz.sha256

md5sum -c sample_R1.fastq.gz.md5

16. Recommended project organization

A consistent folder structure helps prevent confusion between raw data, reference files and derived results.

Example project structure
project_name/
├── data/
│   ├── raw_fastq/
│   ├── external/
│   └── metadata/
├── references/
│   ├── genome/
│   ├── annotation/
│   └── indexes/
├── results/
│   ├── qc/
│   ├── alignments/
│   ├── variants_or_counts/
│   └── figures/
├── scripts/
├── envs/
├── logs/
└── reports/

Recommended rules

  • Keep raw files read-only when possible.
  • Use clear sample IDs that match the metadata table.
  • Record database names, versions, genome assemblies and download dates.
  • Separate reference files from project-specific results.
  • Keep scripts, logs and environment files with the project.
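
The layout and the read-only rule above can be set up with two small functions. This is a sketch under the assumption that the directory names from the example structure are used unchanged; create_project and lock_raw_data are hypothetical helpers.

```shell
# Create the example directory layout under a project root.
create_project() {
  # $1 = project root (created if it does not exist)
  root=$1
  mkdir -p \
    "$root/data/raw_fastq" "$root/data/external" "$root/data/metadata" \
    "$root/references/genome" "$root/references/annotation" "$root/references/indexes" \
    "$root/results/qc" "$root/results/alignments" \
    "$root/results/variants_or_counts" "$root/results/figures" \
    "$root/scripts" "$root/envs" "$root/logs" "$root/reports"
}

# Remove write permission from every file already stored in data/raw_fastq.
lock_raw_data() {
  # $1 = project root
  find "$1/data/raw_fastq" -type f -exec chmod a-w {} +
}
```

Calling create_project project_name once and lock_raw_data project_name after each data delivery keeps accidental edits away from raw files. Permissions are a guard against mistakes, not a backup, so raw data should still be archived elsewhere.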

17. Basic quality checks before analysis

Before running a full workflow, verify that files are present, readable, correctly compressed and consistent with metadata.

  • File existence: are all files listed in the sample sheet present? Use ls, a custom script or workflow validation.
  • Compression: are gzip files readable? Run gzip -t file.fastq.gz
  • FASTQ structure: is the line count divisible by four? Run zcat reads.fastq.gz | wc -l
  • Genome consistency: do BAM/VCF chromosome names match the reference? Compare sequence dictionaries and contig names.
  • Metadata consistency: are sample IDs unique and group labels correct? Inspect the sample sheet and run validation scripts.
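
The compression and FASTQ structure checks can be combined into one function that fails loudly. The sketch below uses only gzip and wc; check_fastq is a hypothetical helper. A line count divisible by four is a necessary condition for a valid FASTQ file, not a full format validation.

```shell
# Verify that a gzip-compressed FASTQ file is readable and structurally plausible.
check_fastq() {
  # $1 = gzip-compressed FASTQ file
  if ! gzip -t "$1" 2>/dev/null; then
    echo "unreadable or corrupt gzip file: $1"
    return 1
  fi
  lines=$(( $(gzip -dc "$1" | wc -l) ))
  if [ $(( lines % 4 )) -ne 0 ]; then
    echo "$1: $lines lines, not a multiple of four"
    return 1
  fi
  echo "$1: $(( lines / 4 )) reads"
}
```

Looping over the files in a raw-data directory and calling check_fastq on each gives a per-file read count, which can also be compared between the R1 and R2 files of a paired-end sample.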

18. Example command-line checks

The following examples show common format inspection commands. Adapt file names and paths to your project.

Inspect FASTQ and FASTA files
# Show one FASTQ record (on macOS, use gzcat or gzip -dc instead of zcat)
zcat sample_R1.fastq.gz | head -n 4

# Count reads in FASTQ (line count divided by four)
echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))

# Show FASTA headers
grep '^>' reference.fa | head

# Count FASTA sequences
grep -c '^>' reference.fa

Inspect BAM/CRAM files
# Alignment summary
samtools flagstat sample.bam

# Header and reference names
samtools view -H sample.bam | head

# Index BAM (requires a coordinate-sorted file)
samtools index sample.bam

# Reads per reference sequence (uses the index created above)
samtools idxstats sample.bam | head

Inspect VCF files
# View VCF header
bcftools view -h variants.vcf.gz | head

# Count variants
bcftools view -H variants.vcf.gz | wc -l

# List samples
bcftools query -l variants.vcf.gz

# Index compressed VCF (the file must be bgzip-compressed, not plain gzip)
bcftools index variants.vcf.gz

Inspect sample metadata
# Show sample sheet columns
head -n 1 samples.tsv

# Count samples by group
tail -n +2 samples.tsv | cut -f 2 | sort | uniq -c

# Check duplicate sample IDs
tail -n +2 samples.tsv | cut -f 1 | sort | uniq -d

19. Data format cheat sheet

Extension | Format | Typical content
.fastq.gz | Compressed FASTQ | Raw sequencing reads and base qualities.
.fa, .fasta | FASTA | Reference genomes, transcripts, proteins or contigs.
.sam | SAM | Text alignment file.
.bam | BAM | Compressed binary alignment file.
.cram | CRAM | Reference-compressed alignment file.
.vcf, .vcf.gz | VCF | Genomic variants and genotypes.
.bcf | BCF | Binary variant format.
.gtf, .gff3 | Annotation formats | Genes, transcripts, exons and other features.
.bed | BED | Genomic intervals such as targets, peaks or regions.
.bw, .bigWig | bigWig | Indexed genome-wide signal tracks.
.tsv, .csv | Tables | Metadata, counts, results and reports.
.h5, .h5ad | HDF5-based formats | Large matrices, single-cell data and structured results.
.json, .yaml | Structured text | Configuration, metadata and workflow parameters.

Frequently asked questions

What are the most important public data sources for NGS analysis?

Common public sources include NCBI SRA, EMBL-EBI ENA, DDBJ, NCBI GEO, ArrayExpress/BioStudies, Ensembl, UCSC Genome Browser, GENCODE, NCBI RefSeq, gnomAD, ClinVar, dbSNP, COSMIC, cBioPortal and the NCI Genomic Data Commons. The best source depends on whether you need raw reads, reference genomes, annotations, variants, expression matrices or clinical metadata.

What is the difference between FASTQ and FASTA?

FASTA stores sequence records without per-base quality scores. FASTQ stores sequence records together with quality scores and is the standard format for raw sequencing reads.

What is the difference between SAM, BAM and CRAM?

SAM is a human-readable text alignment format. BAM is a compressed binary version of SAM. CRAM is a reference-based compressed alignment format that can reduce storage requirements but requires access to the reference sequence for full interpretation.

What is a VCF file?

VCF stands for Variant Call Format. It stores genomic variants such as SNVs, indels and sometimes structural variants, together with sample genotypes, filters, annotations and quality fields.

Why are metadata files important?

Metadata files connect raw data to biological meaning. They define sample names, groups, batches, organism, library type, read layout, experimental design and comparisons. Without good metadata, analysis can be delayed or misinterpreted.

Should I keep raw data files after analysis?

Yes. Raw FASTQ files and original metadata should be preserved whenever possible. They are the starting point for reanalysis, troubleshooting, publication review and reproducibility.