Data sources and data formats for NGS and bioinformatics.
A practical tutorial for understanding where bioinformatics data come from, how public repositories and reference databases are used, and how common file formats such as FASTQ, FASTA, BAM, CRAM, VCF, GTF, BED, bigWig and count matrices fit into reproducible NGS workflows.
Bioinformatics projects usually combine several data sources and file types. A typical NGS analysis may start with raw FASTQ files, align them to a reference genome, use a gene annotation file to count reads, compare the results across metadata-defined sample groups, and interpret findings using variant, pathway or disease databases.
Input data: FASTQ files, public datasets, reference genomes, annotations and sample metadata.
Analysis files: BAM/CRAM alignments, VCF variants, count matrices, peak files and signal tracks.
Interpretation: Population databases, disease databases, cancer resources, pathways and literature.
Core principle: keep raw data, reference data and metadata clearly separated from derived results. Record where each file came from, which version it represents and how it was processed.
2. Main types of bioinformatics data sources
Different resources serve different purposes. Some store raw sequencing reads, some provide reference genomes, some curate variants or diseases, and others provide processed expression or clinical datasets.
| Source type | Examples | Typical use |
| --- | --- | --- |
| Raw read archives | NCBI SRA, EMBL-EBI ENA, DDBJ | Download public sequencing reads for reanalysis, benchmarking or meta-analysis. |
| Functional genomics repositories | NCBI GEO, ArrayExpress, BioStudies | Find expression, epigenomics and multi-omics studies with metadata and processed results. |
| Reference genome resources | Ensembl, UCSC, NCBI RefSeq, GENCODE | Download genome FASTA files, gene annotations and genome indexes. |
| Variation databases | dbSNP, gnomAD, ClinVar, dbVar | Annotate variants, filter population polymorphisms and interpret clinical relevance. |
| Cancer genomics resources | COSMIC, OncoKB, CIViC, cBioPortal, NCI GDC | Interpret tumour variants, driver mutations, gene alterations and cancer datasets. |
| Pathway and function databases | Gene Ontology, Reactome, KEGG, UniProt, STRING | Interpret gene lists, pathways, networks and protein function. |
3. Raw read repositories
Raw read repositories store sequencing data submitted by researchers and consortia. These repositories are important for reanalysis, methods development, benchmarking and meta-analysis.
NCBI SRA: Large archive of raw sequencing reads. Data can often be downloaded with the SRA Toolkit, for example with prefetch followed by fasterq-dump.
EMBL-EBI ENA: European archive for sequence reads, assemblies and related metadata, often offering FTP/Aspera download routes.
DDBJ: Japanese sequence data archive and part of the international nucleotide sequence database collaboration.
GEO and BioStudies: Often provide study-level metadata, processed matrices and links to raw reads stored in SRA or ENA.
Public dataset identifiers matter. Record accessions such as SRR, ERR, DRR, PRJNA, PRJEB, GSE or E-MTAB identifiers in project reports.
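As a minimal sketch of how these prefixes can be recognised when building a project report, the hypothetical helper below labels an accession by its prefix. The function name and the output labels are illustrative assumptions, and anything outside the prefixes listed above is reported as unknown.

```shell
# Label a public accession by its prefix (hypothetical helper, illustrative labels).
classify_accession() {
  case "$1" in
    SRR*)     echo "run (NCBI SRA)" ;;
    ERR*)     echo "run (EMBL-EBI ENA)" ;;
    DRR*)     echo "run (DDBJ)" ;;
    PRJNA*)   echo "project (NCBI)" ;;
    PRJEB*)   echo "project (EMBL-EBI ENA)" ;;
    GSE*)     echo "series (NCBI GEO)" ;;
    E-MTAB-*) echo "experiment (ArrayExpress)" ;;
    *)        echo "unknown" ;;
  esac
}
```

Looping over a list of accessions and writing the label next to each one gives a quick check that a report mixes run-level and project-level identifiers as intended.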
4. Reference genomes
A reference genome provides the coordinate system for alignment, variant calling, gene annotation and visualization. The chosen genome assembly must match all downstream files.
| Resource | Typical files | Notes |
| --- | --- | --- |
| Ensembl | Genome FASTA, GTF/GFF3, cDNA, CDS and protein FASTA files. | Widely used for gene annotation and comparative genomics. |
| GENCODE | High-quality human and mouse gene annotations. | Common choice for RNA-seq and gene-expression analysis. |
| NCBI RefSeq | Reference sequences, genomes, transcripts and protein annotations. | Curated NCBI reference resource used in many annotation pipelines. |
| UCSC Genome Browser | Genome FASTA, chain files, liftOver files, tracks and annotation tables. | Useful for coordinate conversion and browser-based visualization. |
Do not mix genome assemblies accidentally. Human GRCh37/hg19 and GRCh38/hg38 coordinates are not interchangeable without coordinate conversion.
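One symptom of mixed references is inconsistent contig naming, such as chr1 in one file and 1 in another. The sketch below, using a hypothetical helper name, lists contig names that appear in the first column of an interval file but not in the headers of a reference FASTA. It only compares names, so it cannot detect two assemblies that share the same naming style.

```shell
# Print contig names used in an interval file (column 1) that are absent from a FASTA.
# FASTA names are taken as the first word after '>' on each header line.
contigs_missing_from_fasta() {
  fasta=$1
  intervals=$2
  awk '
    NR == FNR {
      if (/^>/) { sub(/^>/, "", $1); ref[$1] = 1 }
      next
    }
    /^(#|track|browser)/ { next }
    !($1 in ref) && !seen[$1]++ { print $1 }
  ' "$fasta" "$intervals"
}
```

An empty result means every contig name in the interval file exists in the reference. Any printed name deserves a closer look before alignment or variant calling.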
5. Genome annotations
Genome annotation files define genes, transcripts, exons, promoters, repeats, regulatory regions or target intervals. They are essential for read counting, variant annotation and genomic-interval analysis.
Gene annotations: GTF or GFF3 files define gene, transcript, exon and coding-sequence structures.
Target regions: BED or interval-list files define capture panels, exomes, amplicons or regions of interest.
Regulatory tracks: BED, bigBed, bigWig and bedGraph files can represent enhancers, promoters, peaks or signal intensity.
Coordinate conversion: Chain files and liftOver tools convert genomic coordinates between assemblies when appropriate.
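A quick way to see which structures a gene annotation file actually defines is to tally its feature types. This is a minimal sketch with a hypothetical function name. It assumes a tab-delimited GTF or GFF3 file with the feature type in column 3 and comment lines starting with '#'.

```shell
# Count records per feature type (column 3) in a GTF/GFF3 file, skipping comments.
count_feature_types() {
  awk -F '\t' '
    !/^#/ && NF >= 9 { n[$3]++ }
    END { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}
```

If a file intended for RNA-seq read counting reports no exon records, it is probably not the annotation the counting tool expects.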
6. Variant and population databases
Variant databases support filtering and interpretation. Population databases help identify common variants. Clinical and disease databases help interpret relevance, but they should be used with context and version tracking.
| Database | Use | Typical caution |
| --- | --- | --- |
| dbSNP | Known short variants and identifiers. | Presence in dbSNP does not automatically mean benign or clinically relevant. |
| gnomAD | Population allele frequencies for filtering rare or common variants. | Interpret frequency in population and disease context. |
| ClinVar | Clinical assertions for human variants. | Review assertion criteria, submitter confidence and update date. |
| dbVar | Structural variation records. | Structural variant representation differs between datasets and tools. |
| GWAS Catalog | Trait-associated variants from genome-wide association studies. | Association is not the same as causality or diagnostic relevance. |
7. Clinical and cancer-genomics resources
Cancer and clinical-genomics interpretation often combines variant calls with curated databases, cohort resources and literature. These resources differ in scope, licensing and evidence model.
COSMIC: Catalogue of somatic mutations in cancer and cancer-gene information.
OncoKB: Precision-oncology knowledgebase for variant oncogenicity and therapeutic evidence.
CIViC: Open resource for evidence-based clinical interpretation of cancer variants.
NCI GDC and cBioPortal: Resources for exploring cancer cohorts, mutations, copy number, expression and clinical associations.
Clinical interpretation requires validated workflows, appropriate consent, data protection and expert review. Research analysis should not be presented as a diagnostic report unless the workflow is validated for that purpose.
8. Omics and functional interpretation resources
Processed omics resources can help validate findings, compare studies and interpret gene lists.
| Resource type | Examples | Use |
| --- | --- | --- |
| Expression atlases | GEO, Expression Atlas, GTEx-style resources | Compare expression across tissues, disease states or public studies. |
| Pathway databases | Reactome, KEGG, Gene Ontology | Interpret gene sets and biological processes. |
| Protein resources | UniProt, STRING | Protein function, domains, interaction networks and functional associations. |
| Single-cell resources | Cell atlases, public h5ad/matrix repositories | Reference annotation, cell-type signatures and comparative analysis. |
9. Metadata and sample sheets
Metadata describes what each file and sample represents. In many projects, metadata quality is as important as sequencing quality.
Recommended sample-sheet columns
sample_id: stable, unique identifier.
fastq_1 and fastq_2: file names or paths for paired-end reads.
group: biological condition or class.
replicate: biological replicate identifier.
batch: sequencing run, library-preparation batch or processing batch.
organism: species or strain.
reference: genome assembly and annotation version.
library_type: RNA-seq strandedness, single-cell chemistry, capture kit or panel information.
Example sample sheet
sample_id group batch fastq_1 fastq_2
S1 control A S1_R1.fastq.gz S1_R2.fastq.gz
S2 control A S2_R1.fastq.gz S2_R2.fastq.gz
S3 treated B S3_R1.fastq.gz S3_R2.fastq.gz
S4 treated B S4_R1.fastq.gz S4_R2.fastq.gz
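A sample sheet like the one above can be checked before any analysis starts. The following sketch uses a hypothetical function name and assumes a whitespace-delimited file with a header row and no spaces inside fields. It reports a missing sample_id column, rows with the wrong number of fields and duplicated sample IDs.

```shell
# Validate a whitespace-delimited sample sheet with a header row.
# Prints one message per problem and returns non-zero if any problem was found.
validate_sample_sheet() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "sample_id") idcol = i
      if (!idcol) { print "missing column: sample_id"; bad = 1; exit 1 }
      ncol = NF
      next
    }
    NF == 0 { next }
    NF != ncol { print "line " NR ": expected " ncol " fields, found " NF; bad = 1 }
    seen[$idcol]++ { print "line " NR ": duplicate sample_id " $idcol; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

Because the function returns a non-zero status on failure, it can gate a workflow, for example validate_sample_sheet samples.txt && echo "sample sheet OK", where samples.txt is a placeholder file name.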
10. Raw sequence formats: FASTQ and FASTA
FASTQ and FASTA are foundational sequence formats.
| Format | What it stores | Common use |
| --- | --- | --- |
| FASTQ | Read identifier, sequence, separator and per-base quality scores. | Raw sequencing reads from Illumina, MGI, Ion Torrent, PacBio or Oxford Nanopore workflows. |
| FASTA | Sequence identifier and sequence without per-base quality scores. | Reference genomes, transcriptomes, proteins, contigs and extracted sequences. |
FASTQ record
@read_001
ACGTACGTACGT
+
FFFFFFFFFFFF
FASTA record
>chr1_example
ACGTACGTACGTACGTACGT
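The four-line FASTQ layout shown above can be verified mechanically. This is a minimal sketch with a hypothetical function name. It assumes a gzip-compressed file in which every record occupies exactly four lines, which is the common layout but not the only one the format permits.

```shell
# Check basic FASTQ structure in a gzip-compressed file: '@' header, '+' separator,
# equal sequence and quality lengths, and a line count that is a multiple of four.
check_fastq() {
  gzip -cd "$1" | awk '
    NR % 4 == 1 && !/^@/ { print "line " NR ": header does not start with @"; bad = 1 }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && !/^\+/ { print "line " NR ": separator does not start with +"; bad = 1 }
    NR % 4 == 0 && length($0) != seqlen {
      print "line " NR ": quality length differs from sequence length"; bad = 1
    }
    END {
      if (NR % 4 != 0) { print "line count " NR " is not a multiple of 4"; bad = 1 }
      print "records: " int(NR / 4)
      exit bad + 0
    }
  '
}
```

The function reads the whole file, so for very large inputs it may be more practical to run it on the first few million lines only.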
11. Alignment formats: SAM, BAM and CRAM
Alignment files store where sequencing reads map to a reference genome or transcriptome.
| Format | Description | Notes |
| --- | --- | --- |
| SAM | Text alignment format. | Human-readable but large. Useful for inspection, not ideal for storage. |
| BAM | Binary compressed alignment format. | Common storage and analysis format. Usually indexed with .bai. |
| CRAM | Reference-based compressed alignment format. | Can save storage, but requires the correct reference sequence for full interpretation. |
Alignment files should usually be sorted and indexed before visualization or many downstream analyses.
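The sort, index and CRAM conversion steps might be chained as in the sketch below. The helper names, the DRY_RUN switch and the output naming scheme are assumptions made for illustration. The samtools subcommands and options used (sort -o, index, view -C -T) are standard, but the block should be adapted and tested against the samtools version in use.

```shell
# Run a command, or only print it when DRY_RUN=1 (useful for reviewing the plan first).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Coordinate-sort and index a BAM, then write an indexed CRAM copy
# compressed against the given reference FASTA.
bam_to_sorted_cram() {
  bam=$1
  ref=$2
  prefix=${bam%.bam}
  run samtools sort -o "${prefix}.sorted.bam" "$bam" &&
    run samtools index "${prefix}.sorted.bam" &&
    run samtools view -C -T "$ref" -o "${prefix}.sorted.cram" "${prefix}.sorted.bam" &&
    run samtools index "${prefix}.sorted.cram"
}
```

Setting DRY_RUN=1 before calling bam_to_sorted_cram prints the four samtools commands without executing them. The reference passed here must be the same FASTA the reads were aligned to, otherwise the CRAM cannot be decoded correctly later.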
12. Variant formats: VCF and BCF
Variant files store genomic differences relative to a reference. They can represent single-nucleotide variants, indels, genotypes, allele depths, filters, annotations and sometimes structural variants.
VCF: Text-based Variant Call Format. Human-readable and widely supported.
BCF: Binary compressed equivalent of VCF. Efficient for large-scale processing with BCFtools.
Simplified VCF example
##fileformat=VCFv4.3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1
chr1 12345 rsExample A G 99 PASS DP=42 GT:DP 0/1:42
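To illustrate how the columns in the record above relate to each other, the sketch below reads an uncompressed, tab-delimited VCF and prints the position, alleles and first-sample genotype of records whose FILTER is PASS. The function name is hypothetical. The GT value is looked up through the FORMAT column rather than assumed to be in a fixed position.

```shell
# Print CHROM, POS, REF, ALT and the first sample's GT for PASS records.
vcf_pass_genotypes() {
  awk -F '\t' -v OFS='\t' '
    /^#/ { next }
    $7 == "PASS" {
      gt = "NA"
      n = split($9, keys, ":")
      split($10, vals, ":")
      for (i = 1; i <= n; i++) if (keys[i] == "GT") gt = vals[i]
      print $1, $2, $4, $5, gt
    }
  ' "$1"
}
```

This is intended for inspection and teaching. For real datasets, bcftools query with a format string is usually the more robust choice because it handles compressed input, multiple samples and header validation.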
13. Annotation and interval formats: GTF, GFF, BED and interval lists
Annotation and interval files describe genomic features and regions.
| Format | Stores | Common use |
| --- | --- | --- |
| GTF | Gene, transcript and exon annotations with structured attributes. | RNA-seq gene counting and transcriptome annotation. |
| GFF3 | General feature annotations with hierarchical relationships. | Genome annotation for many organisms. |
| BED | Genomic intervals with 0-based start coordinates. | Target regions, peaks, promoters, blacklists and custom intervals. |
| Interval list | Genome intervals with sequence dictionary metadata. | Often used in Picard/GATK-style workflows. |
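Coordinate conventions matter. BED uses 0-based, half-open intervals, while GTF, GFF3, VCF and SAM use 1-based coordinates. For example, the first 100 bases of chr1 are written with start 0 and end 100 in BED, but with start 1 and end 100 in GTF. Mixing conventions can shift regions by one base.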
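The conversion between the two conventions is a one-base adjustment of the start coordinate, while the end coordinate stays the same. The sketch below shows both directions for simple three-column, tab-delimited files. The function names are hypothetical.

```shell
# BED (0-based, half-open) to 1-based, closed coordinates: start + 1, end unchanged.
bed_to_one_based() {
  awk -F '\t' -v OFS='\t' '!/^(#|track|browser)/ { print $1, $2 + 1, $3 }' "$1"
}

# 1-based, closed intervals (chrom, start, end) to BED: start - 1, end unchanged.
one_based_to_bed() {
  awk -F '\t' -v OFS='\t' '!/^#/ { print $1, $2 - 1, $3 }' "$1"
}
```

In both conventions the interval length is preserved: end minus start in BED equals end minus start plus one in 1-based, closed coordinates.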
14. Signal tracks, matrices and tabular formats
Many downstream results are stored as matrices, tracks or tables.
| Format | What it stores | Common use |
| --- | --- | --- |
| bedGraph | Genome-wide signal values over intervals. | Coverage, methylation or enrichment tracks. |
| bigWig | Indexed binary signal track. | Efficient genome-browser visualization. |
| Count matrix | Features by samples or cells. | RNA-seq, single-cell RNA-seq and other quantified omics data. |
| HDF5 / h5ad / h5 | Hierarchical binary data structures. | Single-cell, spatial, machine-learning and large matrix workflows. |
| CSV / TSV | Plain-text tables. | Metadata, result tables, annotation exports and reports. |
| JSON / YAML | Structured configuration and metadata. | Workflow configuration, software environments and pipeline parameters. |
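For a plain-text count matrix, one useful sanity check is that its columns line up with the sample sheet. The sketch below rests on several assumptions: both files are whitespace-delimited, the sample sheet has a header row with sample IDs in its first column, and the matrix has a header row whose first field names the feature column and whose remaining fields are sample IDs. The function name is hypothetical.

```shell
# Report sample IDs from the sample sheet that have no column in the count matrix,
# and matrix rows whose field count differs from the matrix header.
check_count_matrix() {
  sheet=$1
  matrix=$2
  awk '
    NR == FNR {                      # first file: sample sheet
      if (FNR > 1 && $1 != "") want[$1] = 1
      next
    }
    FNR == 1 {                       # second file: matrix header
      ncol = NF
      for (i = 2; i <= NF; i++) have[$i] = 1
      next
    }
    NF != ncol { print "matrix line " FNR ": expected " ncol " fields, found " NF; bad = 1 }
    END {
      for (s in want) if (!(s in have)) { print "no matrix column for sample " s; bad = 1 }
      exit bad + 0
    }
  ' "$sheet" "$matrix"
}
```

The check is one-directional: extra matrix columns that are absent from the sample sheet are not reported.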
15. Compression, indexing and checksums
Bioinformatics files are often large. Compression and indexing make storage and access more efficient.
gzip: Common compression for FASTQ, FASTA, VCF and tabular files. Files often end with .gz.
bgzip and tabix: Block compression and indexing used for coordinate-sorted VCF, BED and GFF-like files.
BAM/CRAM indexes: .bai, .crai and related indexes allow fast access to genomic regions.
Checksums: MD5 or SHA checksums verify that files were transferred without corruption.
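A checksum workflow has two halves: record the checksum where the file is known to be good, then verify it after every transfer. The sketch below uses hypothetical function names and assumes GNU coreutils sha256sum is available (on macOS, shasum -a 256 is the usual substitute). For .gz files it also runs gzip's built-in integrity test.

```shell
# Write a SHA-256 checksum file next to each given file.
write_checksums() {
  for f in "$@"; do
    sha256sum "$f" > "$f.sha256" || return 1
  done
}

# Verify gzip integrity (for .gz files) and the recorded checksum of each file.
verify_files() {
  for f in "$@"; do
    case "$f" in
      *.gz) gzip -t "$f" 2>/dev/null || { echo "corrupt gzip: $f"; return 1; } ;;
    esac
    sha256sum -c "$f.sha256" > /dev/null 2>&1 || { echo "checksum mismatch: $f"; return 1; }
  done
  echo "all files OK"
}
```

The checksum file stores the path exactly as it was given, so verification should be run with the same relative or absolute paths that were used when the checksums were written.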
Use clear sample IDs that match the metadata table.
Record database names, versions, genome assemblies and download dates.
Separate reference files from project-specific results.
Keep scripts, logs and environment files with the project.
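One lightweight way to record names, versions, assemblies and dates is a tab-separated provenance log kept with the project. The sketch below is an assumption about how such a log could look, not an established standard. The function name and column names are hypothetical, and the date written is the day the record is created, which is assumed to be the download day.

```shell
# Append one provenance record for a downloaded reference or database file.
# Columns: file, source, version, assembly, download_date
record_provenance() {
  log=$1
  file=$2
  origin=$3
  version=$4
  assembly=$5
  [ -r "$file" ] || { echo "cannot read: $file"; return 1; }
  # Write the header only when the log is new or empty.
  [ -s "$log" ] || printf 'file\tsource\tversion\tassembly\tdownload_date\n' > "$log"
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$file" "$origin" "$version" "$assembly" "$(date +%Y-%m-%d)" >> "$log"
}
```

A call takes the log path first and then the file, source, version and assembly, for example record_provenance provenance.tsv reference/genome.fa Ensembl RELEASE GRCh38, where RELEASE stands for the release actually downloaded.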
17. Basic quality checks before analysis
Before running a full workflow, verify that files are present, readable, correctly compressed and consistent with metadata.
| Check | Question | Typical command or action |
| --- | --- | --- |
| File existence | Are all files listed in the sample sheet present? | ls, custom script or workflow validation. |
| Compression | Are gzip files readable? | gzip -t file.fastq.gz |
| FASTQ structure | Does the number of lines divide by four? | zcat reads.fastq.gz \| wc -l |
| Genome consistency | Do BAM/VCF chromosome names match the reference? | Compare sequence dictionaries and contig names. |
| Metadata consistency | Are sample IDs unique and group labels correct? | Inspect sample sheet and run validation scripts. |
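The first two checks in the table can be combined into a single pre-flight step. The sketch below uses a hypothetical function name and assumes a whitespace-delimited sample sheet with fastq_1 and fastq_2 columns holding file names without spaces, relative to one data directory.

```shell
# Confirm that every FASTQ file named in the fastq_1 and fastq_2 columns exists
# under the given directory and passes gzip's integrity test.
check_sample_files() {
  sheet=$1
  dir=$2
  rc=0
  for f in $(awk '
      NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /^fastq_[12]$/) keep[i] = 1; next }
      { for (i = 1; i <= NF; i++) if ((i in keep) && $i != "") print $i }
    ' "$sheet"); do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      rc=1
    elif ! gzip -t "$dir/$f" 2>/dev/null; then
      echo "unreadable gzip: $f"
      rc=1
    fi
  done
  return $rc
}
```

The function keeps going after the first problem so that a single run lists every missing or unreadable file.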
18. Example command-line checks
The following examples show common format inspection commands. Adapt file names and paths to your project.
Inspect FASTQ and FASTA files
# Show one FASTQ record
zcat sample_R1.fastq.gz | head -n 4
# Count reads in FASTQ
echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))
# Show FASTA headers
grep '^>' reference.fa | head
# Count FASTA sequences
grep -c '^>' reference.fa
Inspect BAM/CRAM files
# Alignment summary
samtools flagstat sample.bam
# Header and reference names
samtools view -H sample.bam | head
samtools idxstats sample.bam | head
# Index BAM (the file must be coordinate-sorted first)
samtools index sample.bam
Inspect VCF files
# View VCF header
bcftools view -h variants.vcf.gz | head
# Count variants
bcftools view -H variants.vcf.gz | wc -l
# List samples
bcftools query -l variants.vcf.gz
# Index compressed VCF (requires bgzip compression; plain gzip will not work)
bcftools index variants.vcf.gz
What are the most important public data sources for NGS analysis?
Common public sources include NCBI SRA, EMBL-EBI ENA, DDBJ, NCBI GEO, ArrayExpress/BioStudies, Ensembl, UCSC Genome Browser, GENCODE, NCBI RefSeq, gnomAD, ClinVar, dbSNP, COSMIC, cBioPortal and the NCI Genomic Data Commons. The best source depends on whether you need raw reads, reference genomes, annotations, variants, expression matrices or clinical metadata.
What is the difference between FASTQ and FASTA?
FASTA stores sequence records without per-base quality scores. FASTQ stores sequence records together with quality scores and is the standard format for raw sequencing reads.
What is the difference between SAM, BAM and CRAM?
SAM is a human-readable text alignment format. BAM is a compressed binary version of SAM. CRAM is a reference-based compressed alignment format that can reduce storage requirements but requires access to the reference sequence for full interpretation.
What is a VCF file?
VCF stands for Variant Call Format. It stores genomic variants such as SNVs, indels and sometimes structural variants, together with sample genotypes, filters, annotations and quality fields.
Why are metadata files important?
Metadata files connect raw data to biological meaning. They define sample names, groups, batches, organism, library type, read layout, experimental design and comparisons. Without good metadata, analysis can be delayed or misinterpreted.
Should I keep raw data files after analysis?
Yes. Raw FASTQ files and original metadata should be preserved whenever possible. They are the starting point for reanalysis, troubleshooting, publication review and reproducibility.