Data-Oriented Virology
GLUE’s Philosophy in Plain Language
Modern virology increasingly depends on large-scale sequence data and on analytical workflows that must be reproducible, traceable, and adaptable across research questions. The GLUE (Genes Linked by Underlying Evolution) framework was developed to meet these needs. Although originally created as part of a series of research programmes, GLUE has matured into a general-purpose platform for constructing reproducible bioinformatics resources focused on viruses, viral elements, and other sequence-based systems.
This article describes the data-oriented principles that underpin GLUE and explains them in plain language, without assuming prior familiarity with the project.
1. Data as the Primary Object
GLUE is designed around a simple proposition: the data are more important than the software.
In many bioinformatics systems, the codebase determines how data are structured, so changing the code means changing the data. GLUE reverses this relationship. The central objects are sequences, alignments, features, metadata, and reference structures, all stored in declarative form inside a project database. The analytical logic interacts with these data, but the data themselves remain stable and independent of particular tools or workflows.
In practice this means:
the underlying data model is explicit and transparent
raw and derived data are stored in a way that preserves provenance
the same dataset can support multiple analytical uses without reformatting
project maintainers can extend or modify analyses without altering the core data
This approach supports reproducibility and long-term maintainability. A GLUE project can be archived, shared, cloned, or embedded in other applications with minimal dependency on workflow-specific code.
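The separation between stable data and replaceable analysis can be illustrated with a minimal Python sketch. It is not GLUE's schema: the `SequenceRecord` class, its field names, and the two analysis functions are hypothetical, chosen only to show one immutable dataset serving two unrelated analyses while a derived record keeps a pointer to its raw source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SequenceRecord:
    """A stored sequence plus minimal provenance (hypothetical fields, not GLUE's schema)."""
    source: str                         # name of the import batch the record came from
    sequence_id: str
    nucleotides: str
    derived_from: Optional[str] = None  # ID of the raw record, if this one is derived


def gc_fraction(record: SequenceRecord) -> float:
    """First analysis: compositional summary of a single record."""
    seq = record.nucleotides.upper()
    return sum(seq.count(base) for base in "GC") / len(seq) if seq else 0.0


def longer_than(records: List[SequenceRecord], min_length: int) -> List[str]:
    """Second analysis: a quality filter over the same, unmodified records."""
    return [r.sequence_id for r in records if len(r.nucleotides) >= min_length]


# Toy data: a raw record and a derived record that remembers where it came from.
raw = SequenceRecord("example-import", "seq1", "ATGCGC")
trimmed = SequenceRecord("example-import", "seq1-trimmed", "ATGC", derived_from="seq1")
dataset = [raw, trimmed]
```

Because the records are frozen, neither function can alter them. Adding a third analysis means adding a function, not reshaping the dataset.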
2. Representing Viral Biology Explicitly
Viruses have idiosyncratic genome structures: overlapping coding regions, frameshifts, transcriptional complexities, segmented genomes, and (in some groups) circularity. GLUE encodes these biological characteristics directly.
A GLUE project contains:
reference sequences with annotated features
alignment trees that formalise homology relationships
constrained alignments that preserve coding context
metadata describing isolates, sampling events, or external properties
taxonomic relationships defined through explicit data objects
Because these elements are stored as data rather than baked into software, the system can accommodate arbitrary viral groups. The same framework supports lentiviruses, deltaretroviruses, influenza viruses, parvoviruses, AAVs, and endogenous retroviruses, even though their genomes differ substantially.
The model is intentionally neutral: GLUE does not assume a particular evolutionary history, clustering method, or genus-level classification. Instead, it lets users encode their preferred phylogenies, feature definitions, and nomenclature within the project structure.
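As a rough illustration of storing genome structure as data, the sketch below describes features as lists of coordinate segments on a reference sequence. The class names, the toy sequence, and the feature names are all hypothetical and do not reproduce GLUE's own object model. Under this assumption, an overlapping or frameshifted coding region is simply a feature whose segments revisit a nucleotide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Segment:
    start: int  # 1-based, inclusive reference coordinate
    end: int    # 1-based, inclusive reference coordinate


@dataclass(frozen=True)
class Feature:
    name: str
    segments: List[Segment]


@dataclass
class ReferenceSequence:
    name: str
    nucleotides: str
    features: Dict[str, Feature]

    def feature_nucleotides(self, feature_name: str) -> str:
        """Join the feature's segments in order to recover its nucleotide sequence."""
        feature = self.features[feature_name]
        return "".join(self.nucleotides[s.start - 1:s.end] for s in feature.segments)


# Toy reference (not a real virus). "orfAB" models a -1 frameshift by
# re-reading nucleotide 9, so its two segments overlap by one position.
ref = ReferenceSequence(
    name="toy-reference",
    nucleotides="ATGAAATTTGGGCCCTAA",
    features={
        "orfA": Feature("orfA", [Segment(1, 9)]),
        "orfAB": Feature("orfAB", [Segment(1, 9), Segment(9, 17)]),
    },
)
```

Nothing in `feature_nucleotides` knows about frameshifts. That knowledge lives entirely in the feature definition, so a different viral group needs different data rather than different code.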
3. Alignments as the Core Integrating Structure
GLUE treats the multiple sequence alignment (MSA) as the key structure linking all data types.
Two principles motivate this design:
Amino acid-level homology can be mapped into nucleotide space via a reference-constrained alignment.
All higher-order analyses (phylogenetics, genotyping, constraint detection) depend on consistent alignment context.
By formalising alignments in a tree structure (alignment nodes with parent–child relationships), GLUE captures evolutionary assumptions explicitly. This avoids common problems where different tools silently apply incompatible assumptions, producing inconsistent results.
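A tree of alignment nodes can be sketched in a few lines of Python. The `AlignmentNode` class, its methods, and the node and member names below are illustrative assumptions, not GLUE's API. The point is only that parent-child links make the assumed clade structure an inspectable piece of data.

```python
from typing import List, Optional


class AlignmentNode:
    """One node in a hypothetical alignment tree."""

    def __init__(self, name: str, parent: Optional["AlignmentNode"] = None):
        self.name = name
        self.parent = parent
        self.children: List["AlignmentNode"] = []
        self.members: List[str] = []  # IDs of sequences aligned at this node

    def add_child(self, name: str) -> "AlignmentNode":
        child = AlignmentNode(name, parent=self)
        self.children.append(child)
        return child

    def lineage(self) -> List[str]:
        """Node names from the root down to this node."""
        path = []
        node: Optional["AlignmentNode"] = self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))

    def all_members(self) -> List[str]:
        """Members of this node and of every descendant node."""
        result = list(self.members)
        for child in self.children:
            result.extend(child.all_members())
        return result


# Toy tree: a root alignment, one clade, and one sub-clade.
root = AlignmentNode("AL_ROOT")
clade_a = root.add_child("AL_CLADE_A")
subclade_a1 = clade_a.add_child("AL_SUBCLADE_A1")
clade_a.members.append("seqA")
subclade_a1.members.extend(["seqA1", "seqA2"])
```

A second tree built from the same sequences would express a different evolutionary hypothesis without touching the first.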
The alignment framework also enables analyses such as:
mutation frequency scanning
genotyping through MLCA (Maximum Likelihood Clade Assignment)
reference-based coordinate systems
slicing and extraction of genomic regions
compositional and codon-bias analyses
Because alignments and phylogenies are data objects, alternative evolutionary hypotheses can co-exist in the same project without conflict.
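The idea of a reference-based coordinate system can be made concrete with a small Python sketch. It assumes, purely for illustration, that each aligned member is stored as ungapped segments pairing reference coordinates with member coordinates. The `AlignedSegment` class and both functions are hypothetical names. A codon number in a feature is first converted to reference nucleotide coordinates, and those coordinates are then looked up in the member sequence.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def codon_to_reference_range(feature_start: int, codon_number: int) -> Tuple[int, int]:
    """Map a 1-based codon number to 1-based inclusive reference coordinates."""
    start = feature_start + 3 * (codon_number - 1)
    return start, start + 2


@dataclass(frozen=True)
class AlignedSegment:
    """An ungapped block pairing reference and member coordinates (1-based, inclusive)."""
    ref_start: int
    ref_end: int
    member_start: int
    member_end: int


def member_nucleotides_at(segments: List[AlignedSegment], member_seq: str,
                          ref_start: int, ref_end: int) -> Optional[str]:
    """Return the member's nucleotides at a reference range, or None if not covered."""
    for seg in segments:
        if seg.ref_start <= ref_start and ref_end <= seg.ref_end:
            begin = seg.member_start + (ref_start - seg.ref_start)
            length = ref_end - ref_start + 1
            return member_seq[begin - 1:begin - 1 + length]
    return None


# Toy data: reference positions 4..12 are homologous to member positions 3..11.
member_seq = "CCATGAAATTT"
segments = [AlignedSegment(ref_start=4, ref_end=12, member_start=3, member_end=11)]

# Codon 2 of a feature that starts at reference position 4.
codon_range = codon_to_reference_range(feature_start=4, codon_number=2)
codon_in_member = member_nucleotides_at(segments, member_seq, *codon_range)
```

Slicing a genomic region or scanning for a mutation then reduces to the same lookup repeated over many members.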
4. Modular and Extensible Analysis Logic
While the data model is central, GLUE’s analytical behaviour is provided by modules. These are configuration-driven components that perform specific tasks:
sequence import (NCBI Entrez, BLAST)
metadata parsing (GenBankImporter, JSON/XML processors)
alignment import/export (FASTA, protein-constrained importers)
reconciliation of user-defined phylogenies
RAxML EPA genotyping
mutation scanning and reporting
batch workflows for filtering or annotation
Modules are declarative. Configuration files specify which module instances exist in a project and how they operate. As a result, nearly all project-specific behaviour can be expressed as configuration rather than custom software.
This enables:
reproducibility (workflow logic is stored with the data)
auditability (module configs can be versioned in Git)
extensibility (new module types can be added without modifying existing ones)
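The configuration-driven pattern can be sketched as follows. This is a generic Python illustration, not GLUE's module system or its configuration format: the registry, the two module types, and the configuration keys (`minLength`, `maxN`) are invented for the example. The project-specific behaviour sits in `PROJECT_CONFIG`, which is plain data that could be stored and versioned alongside the sequences.

```python
from typing import Callable, Dict, List

# Registry of module types; adding a new type leaves existing ones untouched.
MODULE_TYPES: Dict[str, Callable[[dict, List[str]], List[str]]] = {}


def module_type(name: str):
    """Decorator that registers a function as a named module type."""
    def register(func):
        MODULE_TYPES[name] = func
        return func
    return register


@module_type("lengthFilter")
def length_filter(config: dict, sequences: List[str]) -> List[str]:
    return [s for s in sequences if len(s) >= config["minLength"]]


@module_type("ambiguityFilter")
def ambiguity_filter(config: dict, sequences: List[str]) -> List[str]:
    return [s for s in sequences if s.upper().count("N") <= config["maxN"]]


# Declarative project configuration: which module instances exist and how they behave.
PROJECT_CONFIG = [
    {"name": "exampleLengthFilter", "type": "lengthFilter", "minLength": 6},
    {"name": "exampleAmbiguityFilter", "type": "ambiguityFilter", "maxN": 1},
]


def run_modules(config: List[dict], sequences: List[str]) -> List[str]:
    """Apply each configured module instance in order."""
    for module in config:
        sequences = MODULE_TYPES[module["type"]](module, sequences)
    return sequences


kept = run_modules(PROJECT_CONFIG, ["ATGAAATTT", "ATG", "ATGNNNTTT"])
```

Tightening a threshold or reordering the steps is an edit to the configuration, which is what makes the workflow auditable through ordinary version control.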
For more specialised tasks, GLUE includes a JavaScript-based scripting layer. Many historic resources depend on this layer, and GLUE maintains compatibility for this reason. Scripts can access project data, invoke module commands, and orchestrate multi-step workflows.
5. Sharing, Re-use, and Collaborative Development
GLUE reflects an assumption that many virology questions reuse the same underlying components:
reference genomes
curated alignments
feature annotations
phylogenetic trees
nomenclature and taxonomies
quality-control filters
metadata extraction logic
By packaging these components into a structured project that lives in a Git repository, GLUE enables:
direct re-use of components between projects
divergence of projects for new purposes without corrupting the originals
collaborative curation
stable, public releases of curated viral data
reproducible workflows for research and public health
For example, projects such as HCV-GLUE, AAV-Atlas, Flu-GLUE, SRLV-GLUE, and ERVdb all use similar internal structures but serve different scientific and operational contexts. Each can incorporate improvements from the others without rewriting code.
6. Interoperability and Environment Independence
GLUE can run:
as a command-line tool
as a server integrated into a web application
inside containerised workflows
in automated update pipelines (e.g., Jenkins)
on a local laptop with no external dependencies beyond MySQL and Java
Because GLUE projects store both the data and the logic, they can be deployed in nearly any environment. This helps address a common problem in infectious disease analytics: laboratory groups and public health agencies often lack access to specialised computational infrastructure, but need reliable tools that can run locally or in constrained environments.
Containers (e.g., Docker GLUE) further support reproducibility by eliminating system-level variability.
7. A Platform for Reproducible Virology
The overarching goal is to enable reproducible virology: the ability to pick up a dataset, load it into GLUE, and independently verify all claims made from it. This principle extends across many of the analyses carried out in GLUE-based projects:
phylogenetic reconstruction
recombination analysis
genotype assignment
ERV classification
nucleic acid constraint detection
viral diversity curation
genome annotation and ORF inference
Because GLUE preserves raw and processed data, and because workflow steps are encoded into project modules, others can repeat analyses, adjust assumptions, or incorporate additional data while retaining provenance.
This approach supports open science, long-term resource maintenance, and the generation of community standards for viral genomics.
8. Why “Plain Language”?
Although GLUE implements a sophisticated schema and supports advanced computational workflows, its philosophy is simple:
Keep the data stable.
Encode biological assumptions explicitly.
Use configuration to express workflow logic.
Make everything reproducible.
Allow others to critique, extend, or repurpose the work.
Keep the system general enough to support any viral group.
In practice, this makes GLUE both technically robust and accessible to new users who want to build dependable resources for specific viruses, viral groups, transposable elements, or endogenous viral elements.



