A Long-Lived, Data-Oriented Knowledge Base for Comparative Genomics
From Viral Genomes to Rapidly Evolving Host Genes
This perspective is aimed at researchers and developers responsible for maintaining complex comparative genomics resources beyond the lifespan of a single project, grant, or deployment.
Introduction
In many comparative genomics workflows the most important scientific artefacts—reference sequences, feature definitions, alignments, taxonomies, and metadata conventions—exist only implicitly. They are encoded across scripts, pipeline stages, directory structures, and ad hoc configuration files. Over time, this implicit structure becomes difficult to reconstruct, reuse, or validate.
As datasets grow and analytical contexts multiply, this model does not scale. Pipelines drift, assumptions become embedded in code, and the operational knowledge required to interpret results is lost as tools and personnel change. Reproducibility becomes fragile, and long-term maintenance increasingly depends on institutional memory rather than explicit structure.
GLUE was designed to address this problem by reversing the usual relationship between code and data. At its core, a GLUE project is a version-controlled database that stores sequences, alignments, feature maps, taxonomies, and analysis logic together, rather than reconstructing them on demand from scripts.
Although GLUE emerged from viral genomics, the architectural choices it embodies were driven by properties that are not uniquely viral:
extreme sequence divergence
ambiguous homology
evolving annotation models
the need to revise interpretations without destabilising the underlying data.
Instead of treating data as transient inputs to analysis pipelines, GLUE treats curated data and their structural relationships as the primary scientific artefact. Analytical code is secondary: modular, replaceable, and explicitly constrained by a persistent data model. Analyses operate against the project, rather than defining it.
This design aligns closely with principles often described as data-oriented programming (DOP), particularly the separation of data from behaviour and the treatment of data as a first-class, inspectable entity. While GLUE does not enforce strict immutability in the functional programming sense, it is designed to preserve raw data, provenance, and structural context, allowing interpretations and analyses to evolve without erasing prior states.
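The separation of data from behaviour can be illustrated with a minimal Python sketch. This is not GLUE's API: the SequenceRecord class, its fields, and the gc_content and annotate functions are hypothetical names introduced only for illustration. Records are plain immutable values, and interpretation is applied by free functions that return new values rather than modifying the originals.

```python
from dataclasses import dataclass, replace
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class SequenceRecord:
    """Plain data with no analysis methods attached (hypothetical structure)."""
    source: str                    # provenance, e.g. the originating database
    sequence_id: str
    nucleotides: str
    metadata: Mapping[str, str]    # read-only view of curated annotations


def gc_content(record: SequenceRecord) -> float:
    """Behaviour lives in a free function that only reads the data."""
    seq = record.nucleotides.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def annotate(record: SequenceRecord, key: str, value: str) -> SequenceRecord:
    """Return a new record with extra metadata; the original is left untouched."""
    merged = dict(record.metadata)
    merged[key] = value
    return replace(record, metadata=MappingProxyType(merged))


# The raw record is preserved; the curated record is a separate value.
raw = SequenceRecord("example-source", "SEQ001", "ATGCGC", MappingProxyType({}))
curated = annotate(raw, "genotype", "1a")
```

Here raw keeps its empty metadata after annotate is called, which mirrors the idea that interpretations can be layered on without erasing the prior state.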
In practice, this produces projects that function as long-lived, evolvable knowledge bases rather than disposable analysis outputs. The goal is not simply to run analyses, but to preserve the scientific context required to understand, reproduce, and extend them over time.
Projects as Persistent Scientific Objects
In GLUE, a project is not simply a container for sequence data. It is a structured representation of scientific knowledge about a virus, a gene, or a genome feature, encoded in a form that is both machine-interpretable and durable over time.
A project typically includes:
curated primary sequences and associated metadata
reference sequences with defined genomic feature maps
reference-constrained multiple sequence alignments
taxonomic and classification structures
analysis configurations, including genotyping rules and phylogenetic context
custom modules and scripts operating within the project model
These components are stored together and defined relative to a shared coordinate system. The result is a self-consistent representation of both data and context.
This design allows a project to function as a persistent scientific object. It captures not only observations, but also the assumptions and conventions required to interpret them. Alignments are not regenerated implicitly; reference frames are not inferred from code; classification logic is not buried in scripts. Each element is explicit, named, and queryable.
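To make the idea of a shared coordinate system concrete, the following Python sketch shows one way a feature defined on a reference sequence could be located in an aligned member sequence. It is an illustration under stated assumptions, not GLUE's implementation: the AlignedSegment and Feature classes, the 1-based inclusive coordinates, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlignedSegment:
    """An ungapped block: reference positions ref_start..ref_end (1-based,
    inclusive) are homologous to member positions beginning at member_start."""
    ref_start: int
    ref_end: int
    member_start: int


@dataclass(frozen=True)
class Feature:
    """A named region defined once, on reference coordinates."""
    name: str
    ref_start: int
    ref_end: int


def feature_in_member(feature: Feature,
                      segments: List[AlignedSegment],
                      member_seq: str) -> str:
    """Collect the member nucleotides homologous to the feature.
    Reference positions not covered by any segment are skipped."""
    pieces = []
    for seg in sorted(segments, key=lambda s: s.ref_start):
        lo = max(seg.ref_start, feature.ref_start)
        hi = min(seg.ref_end, feature.ref_end)
        if lo > hi:
            continue  # this segment does not overlap the feature
        offset = seg.member_start - seg.ref_start
        # Convert 1-based inclusive coordinates to a Python slice.
        pieces.append(member_seq[lo + offset - 1: hi + offset])
    return "".join(pieces)


# Hypothetical example: two aligned blocks and a feature spanning both.
segments = [AlignedSegment(1, 5, 3), AlignedSegment(8, 10, 8)]
feature = Feature("example_feature", 4, 9)
region = feature_in_member(feature, segments, "AAATGCCCTTT")
```

Because the feature is stated once against the reference, any member sequence with stored segments can be queried in the same way, without re-deriving the alignment or the feature boundaries from code.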
See the GLUE User Guide for a more concrete description of how a project is composed.
There are several beneficial consequences of treating projects this way:
First, projects are interpretable over time. A project restored from version control or archival storage can be understood without reconstructing the original computational environment or pipeline. The schema and stored objects define the meaning of the data.
Second, projects are extensible without re-engineering. New sequences, annotations, or analytical methods can be introduced incrementally. Existing alignments and classifications remain valid, and alternative interpretations can coexist rather than overwrite one another.
Third, projects are composable. Related projects can share reference sequences, feature definitions, or taxonomic structures, or extend one another through layered designs. For example, a project focused on contemporary viral diversity can be extended by another that incorporates endogenous or ancient sequences using the same coordinate framework.
In effect, a GLUE project occupies a middle ground between a database and an analysis pipeline. Like a database, it preserves structured knowledge with strong internal consistency. Like a pipeline, it supports active analysis. Unlike either, it is portable, versionable, and designed to evolve without losing its historical context.
This perspective shifts the role of computational infrastructure in virus genomics. Rather than serving only to produce results, the project itself becomes the enduring scientific artefact.
Minimal Dependencies and Deployment Portability
When long-lived scientific systems fail, it is often due to infrastructure churn rather than analytical obsolescence. Operating systems change, libraries are deprecated, services are retired, and institutional hosting arrangements dissolve. Systems that depend on complex or rapidly evolving software stacks are difficult to preserve over time.
GLUE is designed to minimise this risk by constraining its runtime dependencies to a small set of mature, widely supported technologies:
a Java runtime environment
a relational database accessed via standard JDBC interfaces
simple, stable data formats such as FASTA, Newick, and tabular metadata
These choices are not about performance or fashion. They are about choosing technologies whose behaviour is stable, well-understood, and unlikely to be reinterpreted or abandoned over decades. In this context, predictability is a feature, not a limitation.
Analytical capabilities beyond this core—such as sequence similarity search, phylogenetic inference, or multiple sequence alignment—are accessed through optional modules that interface with external tools. These tools are not embedded in the core runtime and can be replaced or omitted without affecting the integrity of the project data.
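The relationship between stored data and optional tool modules might be sketched as follows. This is a hypothetical Python illustration rather than GLUE's module system: the Aligner interface, the IdentityAligner stand-in, and the Project class are assumed names, and a real module would typically wrap an external program behind the same method.

```python
from typing import Dict, Optional, Protocol


class Aligner(Protocol):
    """Hypothetical interface: any tool that aligns a query to a reference."""

    def align(self, reference: str, query: str) -> Dict[str, str]: ...


class IdentityAligner:
    """Stand-in implementation that only pads to equal length.
    A real module would invoke an external alignment tool here."""

    def align(self, reference: str, query: str) -> Dict[str, str]:
        width = max(len(reference), len(query))
        return {"reference": reference.ljust(width, "-"),
                "query": query.ljust(width, "-")}


class Project:
    """Holds stored data; the aligner is optional and replaceable."""

    def __init__(self, sequences: Dict[str, str],
                 aligner: Optional[Aligner] = None) -> None:
        self.sequences = dict(sequences)
        self.aligner = aligner

    def align_to(self, ref_id: str, query_id: str) -> Dict[str, str]:
        if self.aligner is None:
            raise RuntimeError("no aligner module configured")
        return self.aligner.align(self.sequences[ref_id],
                                  self.sequences[query_id])


data = {"ref": "ATGC", "q": "AT"}
with_tool = Project(data, IdentityAligner())
without_tool = Project(data)   # data remain fully accessible without the tool
result = with_tool.align_to("ref", "q")
```

Omitting the aligner disables one operation but leaves the stored sequences intact, which is the property the paragraph above describes.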
This separation has several practical effects.
First, projects are portable. A GLUE project can be moved between machines, institutions, and operating environments with minimal modification. Restoring a project requires only the database and a compatible runtime, not a reconstruction of the original execution environment.
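As a rough illustration of this kind of portability, the sketch below serialises a small relational database to plain SQL text and rebuilds it elsewhere. SQLite from Python's standard library is used purely as a stand-in for whatever relational database a deployment uses, and the table and column names are hypothetical.

```python
import sqlite3


def dump_project(conn: sqlite3.Connection) -> str:
    """Serialise schema and rows as plain SQL text."""
    return "\n".join(conn.iterdump())


def restore_project(sql_text: str) -> sqlite3.Connection:
    """Rebuild the database from the SQL text alone."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn


# Build a tiny example database (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sequences (source TEXT, sequence_id TEXT, nucleotides TEXT)"
)
source.execute(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    ("example-source", "SEQ001", "ATGC"),
)
source.commit()

# Move it: only the dump text crosses the machine boundary.
restored = restore_project(dump_project(source))
rows = restored.execute(
    "SELECT sequence_id, nucleotides FROM sequences"
).fetchall()
```

Nothing about the original execution environment travels with the dump; the schema and rows are sufficient to reconstitute the stored state.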
Second, projects are deployable in multiple modes. The same project can be:
used interactively on a local workstation
embedded in automated analysis workflows
exposed via web services
packaged and distributed using container-based technologies
No changes to the underlying data model are required to support these different deployment contexts.
Third, projects are insulated from infrastructure volatility. Because analytical tools are invoked through well-defined interfaces, changes in external software ecosystems do not invalidate stored data or project structure. Analytical methods can evolve independently of the knowledge base they operate on.
By imposing strict limits on dependencies, GLUE trades short-term convenience for long-term stability. This constraint is intentional. It ensures that projects remain usable and interpretable across technological transitions, supporting service lifetimes measured in decades rather than development cycles.
Schema and Version Control as Reproducibility Mechanisms
In many genomics projects — particularly those involving complex comparative analysis — reproducibility is treated as an operational concern. It depends on preserving a collection of scripts, reference files, intermediate outputs, and informal documentation. Over time, this approach becomes fragile: files are lost, assumptions are forgotten, and the relationship between data and analysis logic degrades.
GLUE approaches reproducibility as a structural property of the system. A project is defined by a small number of explicitly versioned components:
the database schema
the curated data stored within it
alignment structures and reference coordinate systems
configuration documents encoding analytical assumptions
scripts and modules that operate against the schema
Taken together, these elements define a complete and inspectable representation of a scientific state.
Because all project components are stored in structured form, they can be version-controlled as a single unit. Changes to data, schema, or analytical configuration are explicit, reviewable, and attributable. Differences between project versions reflect changes in scientific interpretation rather than undocumented pipeline behaviour.
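The idea that differences between project versions are explicit can be sketched in a few lines of Python. In practice a version control system performs this role; the example below only illustrates the principle, and the function names, component keys, and values are hypothetical.

```python
import hashlib
import json


def fingerprint(project_state: dict) -> str:
    """SHA-256 digest of a project state serialised as canonical JSON,
    so that key order does not affect the result."""
    canonical = json.dumps(project_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_components(old: dict, new: dict) -> dict:
    """Report which top-level components were added, removed, or modified."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            k for k in old.keys() & new.keys() if old[k] != new[k]
        ),
    }


# Two hypothetical project states differing only in a genotyping rule.
v1 = {
    "schema_version": "1",
    "genotyping_rules": {"min_coverage": 0.8},
    "sequences": {"SEQ001": "ATGC"},
}
v2 = {
    "schema_version": "1",
    "genotyping_rules": {"min_coverage": 0.9},
    "sequences": {"SEQ001": "ATGC"},
}

diff = changed_components(v1, v2)
```

In this example only the genotyping_rules component is reported as modified, so the change in interpretation is visible as a change in structured configuration rather than as a side effect of altered code.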
This has several important consequences.
First, analyses become reproducible by construction. Re-running an analysis does not require reconstructing the original pipeline or inferring hidden assumptions. The project itself encodes the context in which results were produced.
Second, scientific hypotheses become versionable objects. Decisions about feature boundaries, taxonomic groupings, or genotyping rules are represented as structured configuration rather than transient parameters. Alternative hypotheses can coexist, be compared, or be revised without overwriting prior interpretations.
Third, collaboration becomes synchronisable. Projects can be cloned, forked, and updated across sites while preserving a shared reference frame. Divergence between versions is explicit and traceable, supporting distributed development without loss of coherence.
In this model, a versioned GLUE project functions as a “frozen hypothesis”: a precise statement of how viral diversity was represented and interpreted at a particular point in time. Updating a project produces a new hypothesis rather than erasing the old one.
By combining a stable schema with version-controlled data and configuration, GLUE shifts reproducibility from a best-effort practice to a structural property of the system: it emerges from how projects are represented rather than being added as an afterthought.
Supporting Distributed Curation Without Central Services
Comparative genomics is increasingly conducted in a distributed environment. Sequencing capacity, computational expertise, and domain knowledge are spread across institutions, regions, and use cases. Centralised analysis platforms struggle to accommodate this diversity over long time horizons, particularly when sustained funding or institutional stewardship cannot be guaranteed.
GLUE is designed to operate without reliance on a central service. Projects are self-contained and can be created, extended, and maintained independently. At the same time, they are structured to remain interoperable through shared schemas, reference coordinate systems, and data conventions.
This architecture supports a distributed model of curation and analysis:
individual sites retain local control over data, annotations, and extensions
shared reference frames enable consistent comparison across projects
compatible schemas allow data and analyses to be exchanged without translation
Coordination occurs through shared structure rather than shared infrastructure.
Because projects are portable and versionable, they can be synchronised across sites using standard software development practices. Institutions can maintain local instances adapted to their analytical needs while tracking upstream changes or contributing downstream extensions. Divergence between projects is explicit and manageable, rather than accidental.
For example, one group may maintain a public reference project for a virus species, while another maintains a private extension incorporating unpublished sequences or alternative classification rules. Both can evolve independently while remaining interoperable.
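A layered arrangement of this kind could be sketched as follows. This is a minimal Python illustration, not a description of how GLUE project extensions are implemented: the dictionaries, identifiers, coordinates, and the layered_view function are all hypothetical.

```python
from collections import ChainMap

# Hypothetical public reference project.
public_project = {
    "reference": "REF_MASTER",
    "features": {"geneA": (1, 300)},
    "sequences": {"PUB001": "ATGC"},
}

# Hypothetical private extension holding unpublished sequences only.
private_overlay = {
    "sequences": {"UNPUB001": "ATGA"},
}


def layered_view(base: dict, overlay: dict) -> dict:
    """Combine two projects component by component. Overlay entries take
    precedence, and the base project is never modified."""
    view = {}
    for key in base.keys() | overlay.keys():
        b, o = base.get(key), overlay.get(key)
        if isinstance(b, dict) and isinstance(o, dict):
            view[key] = ChainMap(o, b)   # lookups fall through to the base
        else:
            view[key] = o if key in overlay else b
    return view


view = layered_view(public_project, private_overlay)
```

The private group sees both published and unpublished sequences through the combined view, while the public project remains unchanged and continues to share the same reference and feature definitions.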
This model scales across different modes of use. A project may exist solely as a local research resource, be embedded within an institutional analysis pipeline, or be exposed as a public web service. None of these deployment choices alters the underlying project representation.
Importantly, this approach reduces the long-term fragility of virus genomics resources. When a central service is retired, the projects it hosted do not disappear. They remain usable, auditable scientific objects that can be redeployed elsewhere with minimal modification.
By decoupling knowledge preservation from service operation, GLUE enables distributed knowledge production without imposing central coordination or long-term infrastructure commitments. The result is an ecosystem of compatible, locally maintained resources rather than a dependency on monolithic platforms.
Design Summary: A Durable Substrate for Genomics
GLUE is designed around a small set of architectural commitments intended to support long-lived, reproducible virus genomics.
First, data are the primary abstraction. Sequences, alignments, annotations, taxonomies, and analytical assumptions are represented as structured, persistent objects. Code operates against this structure rather than redefining it.
Second, projects are treated as persistent scientific objects. A project captures not only data, but also the contextual knowledge required to interpret them. Alignments, reference frames, and classification logic are explicit and durable, allowing scientific state to be preserved over time.
Third, reproducibility is structural rather than procedural. By combining a stable schema with version-controlled data and configuration, projects encode complete and inspectable scientific hypotheses. Re-analysis produces new states without erasing prior interpretations.
Fourth, dependencies are constrained to support longevity. By relying on a minimal, mature runtime and isolating external tools behind well-defined interfaces, projects remain portable across platforms and resilient to infrastructure change.
Finally, the architecture supports distributed curation without central services. Projects can be maintained locally while remaining interoperable through shared schemas and reference coordinate systems. Knowledge persists independently of any single deployment or hosting environment.
Taken together, these properties define GLUE as a substrate for durable scientific knowledge. The intent is to ensure that genomic data — and the assumptions that give them meaning — remain interpretable, reproducible, and extensible across analytical contexts and over time.









