A Layered Architecture for Virus Genomics
Separating canonical biological structure from context-specific analysis
The GLUE framework supports a layered project architecture built around a stable, shared core project, extended through one or more composable project layers. Genomics resources developed in the Gifford Lab have made extensive use of this approach.
The rationale for layering reflects a basic property of virus genomics: analytical contexts vary widely, while the foundational elements of analysis—reference sequences, genome feature definitions, and multiple sequence alignments—tend to be comparatively stable. Treating these foundations as shared, field-specific standards is essential if analyses are to remain interoperable and reproducible across projects and over time.
GLUE formalises this separation by allowing projects to be constructed as stacks of dependent layers. Each layer inherits and reuses upstream definitions, while introducing additional data, logic, or presentation appropriate to a specific analytical or operational context. Although this structure is often described as a “core-plus-extension” model, in practice GLUE supports arbitrarily deep layering, with projects branching and stacking as required.
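As a rough illustration of how a stack of dependent layers resolves what it inherits, the following Python sketch models each layer as a node that points to the layer it builds on. This is a conceptual model only: the `Layer` class, its `lineage` and `providers` methods, and the layer and definition names are all hypothetical and are not part of GLUE's own command language or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    """One project layer; `parent` is the layer it builds on (None for a core)."""
    name: str
    parent: Optional["Layer"] = None
    definitions: tuple[str, ...] = ()

    def lineage(self) -> list[str]:
        """Layer names from this layer up to the core, nearest first."""
        names = []
        layer: Optional[Layer] = self
        while layer is not None:
            names.append(layer.name)
            layer = layer.parent
        return names

    def providers(self) -> dict[str, str]:
        """Map every definition visible from this layer to the layer supplying it."""
        found: dict[str, str] = {}
        layer: Optional[Layer] = self
        while layer is not None:
            for key in layer.definitions:
                found.setdefault(key, layer.name)  # the nearest layer wins
            layer = layer.parent
        return found


# Hypothetical three-layer stack: a core, a data extension, and a reporting layer.
core = Layer("core", definitions=("reference_sequences", "genome_features"))
data = Layer("public_data", parent=core, definitions=("genbank_sequences",))
reports = Layer("reports", parent=data)

print(reports.lineage())
print(reports.providers())
```

The reporting layer defines nothing itself, yet it can see everything declared upstream, which is the sense in which layers "inherit and reuse" definitions here.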
This post outlines the rationale for this architecture, describes what typically belongs in different layers, and illustrates how alternative structuring choices support different forms of virus genomics work.
1. Why a Layered Architecture?
Virus genomics spans a wide range of activities: evolutionary analysis, genotyping, drug resistance interpretation, surveillance, metadata curation, exploratory research, and public-facing reporting. Despite their diversity, these activities rely on a shared conceptual foundation:
genome feature definitions
reference sequences
multiple sequence alignments
standardised metadata schemas
taxonomy and clade definitions
In most real-world settings, multiple resources need access to these shared foundations, while also needing freedom to innovate locally. Analytical logic evolves faster than reference definitions; private datasets coexist with public ones; exploratory analyses should not destabilise production resources.
GLUE’s layered architecture institutionalises this separation of concerns. Stable, canonical representations are defined once and reused, while downstream layers evolve independently. The result is a system that supports reuse without rigidity and innovation without fragmentation.
2. What Belongs in a Core Project?
A core project in GLUE provides a canonical representation of a virus (or virus group) at a chosen biological scope. Its defining characteristic is not size or taxonomic rank, but stability: downstream layers depend on it and treat it as authoritative.
Typical components include:
2.1 Reference Sequences and Feature Definitions
Core projects define one or more reference sequences (e.g. virus genomes, genes of interest) and a hierarchical set of genome features (ORFs, proteins, regulatory regions). Reference sequences are annotated with the locations of these features, which establishes a shared coordinate system for downstream analyses.
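To make the idea of a feature-based coordinate system concrete, here is a minimal Python sketch that converts a nucleotide position on a reference sequence into a codon number within a feature. It assumes a single contiguous coding feature with 1-based inclusive coordinates and no splicing or frameshifts. The `FeatureLocation` class and the coordinates are invented for illustration and do not come from any real reference genome or from GLUE itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureLocation:
    """A feature placed on a reference sequence (1-based, inclusive coordinates)."""
    feature: str
    start: int
    end: int

    def contains(self, ref_pos: int) -> bool:
        """True if the reference position falls inside this feature."""
        return self.start <= ref_pos <= self.end

    def codon_number(self, ref_pos: int) -> int:
        """1-based codon index of a reference nucleotide within this feature."""
        if not self.contains(ref_pos):
            raise ValueError(f"{ref_pos} lies outside {self.feature}")
        return (ref_pos - self.start) // 3 + 1


# Hypothetical feature spanning 300 nucleotides (100 codons) on a reference.
gene_x = FeatureLocation("geneX", start=101, end=400)

print(gene_x.codon_number(101))
print(gene_x.codon_number(104))
```

Because every downstream layer resolves positions against the same annotated reference, a variant reported as "codon 2 of geneX" means the same thing wherever it appears.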
2.2 Constrained Alignments
Reference-constrained multiple sequence alignments form a central part of the core. These ensure that variation, phylogenies, and annotations are all expressed in a consistent coordinate framework.
2.3 Canonical Metadata Schemas
Core projects define canonical metadata schemas by extending GLUE’s core data model with tables and fields that represent biologically meaningful, stable attributes of sequences, isolates, or taxa. These schema extensions encode assumptions about what constitutes comparable data within the chosen biological scope, and therefore shape how sequences are interpreted, grouped, and analysed across all downstream layers.
By placing such schema definitions in the core rather than in extensions, GLUE ensures that essential metadata semantics are shared consistently across projects. Downstream layers can populate, refine, or derive values from these fields, but they inherit a common structural vocabulary that supports interoperability, reproducibility, and long-term stability.
2.4 Baseline Module Configurations
Core projects typically include module configurations that establish default behaviours for sequence import/export, alignment handling, BLAST-based sequence recognition, and maximum likelihood-based phylogenetic placement. Downstream layers can reuse or override these as needed.
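The "reuse or override" relationship can be pictured with Python's standard `collections.ChainMap`, which looks up a key in the first mapping that contains it. This is only an analogy: the setting names and values below are hypothetical, and GLUE's actual module configuration format is not shown here.

```python
from collections import ChainMap

# Hypothetical setting names and values, used purely for illustration.
core_defaults = {"aligner": "reference_constrained", "min_coverage": 0.9}
extension_overrides = {"min_coverage": 0.5}

# Lookups consult the extension first, then fall back to the core defaults.
effective = ChainMap(extension_overrides, core_defaults)

print(effective["aligner"])       # inherited from the core
print(effective["min_coverage"])  # overridden by the extension
print(core_defaults["min_coverage"])  # the core itself is unchanged
```

The point of the analogy is that the override lives entirely in the extension, so the core's defaults stay intact for every other layer that depends on them.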
3. What Belongs in Extension Layers?
Extension layers build on a core by adding data, logic, or interfaces that are specific to a particular analytical or operational context. Importantly, extensions do not modify the core; they depend on it.
Common roles for extension layers include:
3.1 Data Expansion
Incorporating large public datasets (e.g. GenBank), private institutional data, or region-specific collections, without altering canonical definitions.
3.2 Context-Specific Analysis Logic
Adding genotyping pipelines, drug resistance interpretation, epitope mapping, or lineage-specific workflows that are not universally applicable.
3.3 Alternative Analytical Views
Exploring different partitioning schemes, phylogenetic methods, or rooting strategies in parallel with canonical analyses.
3.4 Presentation and Interface Layers
Defining web user interfaces, APIs, or reporting pipelines that expose core data and logic to different audiences.
Because extension layers are independent of one another and depend only on what lies upstream, multiple extensions can coexist, branch, or be combined, all while sharing the same foundation.
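One way to think about coexisting and combined extensions is as a dependency graph in which each layer must be built after the layers it depends on. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to derive a valid build order. The layer names and the graph are illustrative assumptions, and this is not a description of how GLUE itself loads projects.

```python
from graphlib import TopologicalSorter

# Each layer maps to the set of layers it depends on. Names are illustrative only.
depends_on = {
    "core": set(),
    "public_data": {"core"},
    "drug_resistance": {"core"},
    "web_reports": {"public_data", "drug_resistance"},
}

# static_order() yields every layer after all of the layers it depends on.
load_order = list(TopologicalSorter(depends_on).static_order())

print(load_order)
```

The two sibling extensions may appear in either order, since neither depends on the other, but the core always comes first and the combined layer always comes last.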
4. Versioning, Stability, and Interoperability
A major strength of this architecture is that stability cascades downstream. Core projects can adopt conservative versioning strategies—minimising schema changes and evolving reference definitions cautiously—while extension layers can iterate rapidly.
This supports:
long-term reproducibility
decentralised development across institutions
interoperability between independently developed layers
selective updating without full reprocessing
In practice, core projects may be updated periodically via controlled pipelines (e.g. scheduled data refreshes), while extensions evolve continuously in response to local needs.
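Selective updating can be sketched as a reachability question: when one layer changes, only that layer and the layers downstream of it need attention. The following Python example is a minimal sketch under that assumption. The `affected_by` helper and the layer names are hypothetical, and real update pipelines will involve more than dependency tracking.

```python
from __future__ import annotations

from collections import deque

# Each layer maps to the set of layers it depends on. Names are illustrative only.
depends_on = {
    "core": set(),
    "public_data": {"core"},
    "drug_resistance": {"core"},
    "web_reports": {"public_data", "drug_resistance"},
}


def affected_by(changed: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return `changed` plus every layer that depends on it, directly or indirectly."""
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for layer, parents in depends_on.items():
            if current in parents and layer not in affected:
                affected.add(layer)
                queue.append(layer)
    return affected


print(sorted(affected_by("drug_resistance", depends_on)))
print(sorted(affected_by("core", depends_on)))
```

A change to a single extension touches only its own branch, whereas a change to the core reaches every layer, which is one reason a conservative versioning strategy for core projects pays off.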
5. Three (and a Half) Structuring Patterns in Practice
Different biological questions motivate different choices about what constitutes a “core” in a GLUE project. The following examples illustrate several common structuring patterns that have emerged in practice.
5.1 Dengue-GLUE: Minimal Core + Data Extension
Dengue-GLUE exemplifies the simplest useful layered structure. A stable core project defines reference sequences, genome features, constrained alignments, and metadata schemas for dengue virus. A single extension layer (NCBI-Dengue-GLUE) incorporates all publicly available dengue virus sequences from GenBank.
This pattern cleanly separates canonical definitions from population-scale data expansion, allowing the dataset to grow without destabilising the core.
5.2 HCV-GLUE: Multiple Orthogonal Extensions
HCV-GLUE builds directly on the same core-plus-data model, but extends it further. A canonical HCV core is reused by multiple independent extension layers, including:
large-scale sequence ingestion from GenBank (NCBI-HCV-GLUE)
drug resistance interpretation pipelines (PHE-HCV-DRUG-RESISTANCE)
neutralising antibody binding site analysis (HCV-NABS)
These layers are orthogonal rather than sequential: each addresses a different analytical or operational context, yet all depend on the same shared foundation. This structure supports clinical, research, and public-facing use cases without fragmenting the underlying resource.
In HCV-GLUE, the web interface logic is included within the core project, reflecting the tight coupling between sequence analysis, interpretation, and user-facing functionality in early deployments.
5.3 HBV-GLUE: Abstracted Web Layer with Shared Analytical Core
HBV-GLUE follows a structure similar to HCV-GLUE in terms of analytical scope, but makes a different architectural choice regarding presentation. Here, the web layer is fully abstracted out of the core project and implemented as a dedicated extension (HBV-GLUE-WEB).
The HBV-GLUE core defines:
phylogenetically structured reference sequences and alignments,
automated genotyping using maximum likelihood clade assignment (MLCA),
rich genome feature annotation and comparative analysis logic.
Separate extension layers then add:
population-scale sequence data (NCBI-HBV-GLUE),
drug resistance interpretation (HBV-DRUG-RESISTANCE),
web-based browsing and sequence analysis (HBV-GLUE-WEB).
This separation allows the same analytical core to be used unchanged in command-line, batch, and programmatic contexts, while supporting multiple independently deployable web interfaces. It reflects a maturing architectural pattern in which presentation concerns are cleanly decoupled from analytical logic, improving reusability and long-term maintainability.
5.4 Lentivirus-GLUE: Comparative Core with Embedded Species Resources
Lentivirus-GLUE illustrates a different structuring choice, motivated by comparative and evolutionary analysis. Here, the core project operates at the level of an entire virus genus, defining shared genomic structure and deep phylogenetic relationships across lentiviruses.
Extension layers correspond to major lentiviral lineages, such as primate lentiviruses, small ruminant lentiviruses (SRLVs), equine infectious anaemia virus (EIAV), feline immunodeficiency virus (FIV), and endogenous lentiviruses. Each layer adds dense sequence data, curated metadata, and lineage-specific logic.
In this model, extension projects can themselves function as canonical resources at a narrower scope. For example, a primate lentivirus layer could serve as the foundation for HIV-specific development, or one might instead choose to define an HIV-1-specific core directly. GLUE accommodates both approaches, leaving the choice of canonical scope to the project designer.
6. Conclusion
GLUE’s layered, core-centric architecture is a response to the structural demands of virus genomics, where stable biological definitions must coexist with rapidly evolving analytical, clinical, and surveillance contexts. Central to this design is a clear separation between canonical representations of genomes, features, alignments, and phylogenetic structure, and the context-specific extensions that build upon them. This allows resources to evolve without compromising their underlying biological foundations.
In practical terms, this separation supports a model in which reference-driven core projects remain stable over time, while innovation occurs in modular extensions that can evolve independently. Because these extensions are anchored to a shared canonical foundation, they can be developed, deployed, and maintained by different groups without fragmenting the resource or duplicating core definitions.
This structure, in turn, enables interoperability across institutions, accommodates diverse analytical goals, and supports update strategies that minimise unnecessary reprocessing and reduce long-term technical debt. The same architectural principles make it possible to support a wide spectrum of applications, ranging from single-virus surveillance resources to multi-virus comparative frameworks, and from exploratory research projects to operational public health pipelines.
Because it enables decentralised collaboration without requiring centralised control, GLUE can operate effectively within the fragmented institutional, geographical, and analytical conditions that characterise modern genomics research.
As virus genomics continues to expand across research, clinical, and surveillance domains, the ability to compose layered, reusable resources at appropriate biological and analytical scales is likely to become increasingly important for sustainable genomic infrastructure.



