Data Immutability and Longevity in GLUE
A Design for Durable Host-Virus Genomics
Introduction
Host–virus genomics generates data at a scale and rate that exceed the capacity of many analysis systems to preserve scientific context over time. As datasets grow and analytical practices evolve, core resources often become transient: pipelines drift, assumptions are encoded implicitly, and the structure required to interpret results is lost as software, infrastructure, and personnel change.
GLUE was designed to address this problem directly. A central principle in the framework is data immutability—not in the strict functional-programming sense, but in a scientific one. Data and the context required to interpret them must remain stable, inspectable, and recoverable over time.
This principle allows GLUE projects to function as long-lived, evolving knowledge bases rather than disposable analysis artefacts. Although originally developed for virus genomics, the same design applies to host gene families—particularly immune genes—that exhibit rapid evolution, complex homology, and continually revised annotation.
Scientific Data Rot as a Structural Problem
Genomic resources degrade for predictable reasons:
Data drift: sequence sets change without explicit versioning or audit trails.
Pipeline drift: tools, parameters, and assumptions change silently.
Context loss: alignments, annotations, and classification decisions are not preserved as structured objects.
Personnel turnover: local operational knowledge disappears.
Infrastructure churn: operating systems, dependencies, and services become obsolete.
These factors produce a reproducibility failure that is structural rather than procedural. Even well-maintained resources lose coherence when scientific context exists only implicitly in code or workflows.
Durable Scientific Objects
GLUE treats the components of host–virus genomics—sequences, metadata, alignments, phylogenies, feature definitions, and classification rules—as first-class scientific objects stored in a relational schema. This creates a stable, queryable substrate for analysis.
The framework adopts a layered model:
Primary data (sequences, metadata) are preserved as received.
Derived objects (alignments, trees, annotations) are stored explicitly and versioned.
Hypotheses (clade definitions, recombination boundaries, feature maps) are represented as structured configuration rather than transient pipeline parameters.
This has concrete consequences. A phylogeny used for classification is not embedded in a script; it is stored, versioned, and referenceable. An alignment is not regenerated ad hoc; it exists as a durable object with a defined coordinate system. Re-analysis produces new objects rather than overwriting previous interpretations.
This preserves scientific context while still allowing progressive refinement.
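The idea that re-analysis adds new objects instead of overwriting old ones can be illustrated with a small append-only table. This is a minimal sketch, not GLUE's actual schema: the table name `derived_object`, its columns, and the helpers `store_new_version` and `load` are all hypothetical, and SQLite stands in for whatever relational database a project uses.

```python
import sqlite3

# Hypothetical table for illustration only; not GLUE's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE derived_object (
        name    TEXT NOT NULL,     -- identifier of an alignment, tree, etc.
        version INTEGER NOT NULL,  -- increases with each re-analysis
        payload TEXT NOT NULL,     -- serialized object, e.g. a Newick string
        PRIMARY KEY (name, version)
    )
    """
)


def store_new_version(conn, name, payload):
    """Insert payload as the next version of `name`; nothing is updated in place."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM derived_object WHERE name = ?",
        (name,),
    ).fetchone()
    conn.execute(
        "INSERT INTO derived_object (name, version, payload) VALUES (?, ?, ?)",
        (name, latest + 1, payload),
    )
    return latest + 1


def load(conn, name, version):
    """Return the payload stored for (name, version), or None if absent."""
    row = conn.execute(
        "SELECT payload FROM derived_object WHERE name = ? AND version = ?",
        (name, version),
    ).fetchone()
    return None if row is None else row[0]


# Two analyses of the same tree: the second does not replace the first.
v1 = store_new_version(conn, "ref_tree", "((A,B),C);")
v2 = store_new_version(conn, "ref_tree", "((A,C),B);")
```

Because each version is a separate row, an earlier classification can still be traced back to the exact tree it was based on, even after the tree has been revised.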
Database-Centric Persistence
Each GLUE project is anchored in a relational database. This provides:
strong consistency guarantees
automatic referential integrity
explicit separation of data and analytical logic
structured evolution through schema extension and version control
A project captured at a given point in time can be made runnable again by restoring its database state. Interpreting it does not depend on reconstructing an execution environment or inferring undocumented workflow assumptions.
This contrasts with pipeline-based approaches that rely on mutable codebases, external file hierarchies, and environment-specific configurations. GLUE’s minimal dependency model further isolates projects from platform fragility.
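To make the referential-integrity point concrete, the sketch below shows a relational database refusing an annotation that points at a sequence which does not exist. It assumes SQLite as a stand-in database, and the `sequence` and `annotation` tables and the `try_orphan_annotation` helper are hypothetical names chosen for illustration, not GLUE's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE sequence (
        sequence_id TEXT PRIMARY KEY,
        nucleotides TEXT NOT NULL
    );
    CREATE TABLE annotation (
        annotation_id INTEGER PRIMARY KEY,
        sequence_id   TEXT NOT NULL REFERENCES sequence(sequence_id),
        feature       TEXT NOT NULL
    );
    """
)

conn.execute("INSERT INTO sequence VALUES ('seq1', 'ATGGCC')")
conn.execute("INSERT INTO annotation (sequence_id, feature) VALUES ('seq1', 'env')")


def try_orphan_annotation(conn):
    """Return True if the database rejects an annotation with no parent sequence."""
    try:
        conn.execute(
            "INSERT INTO annotation (sequence_id, feature) VALUES ('missing', 'gag')"
        )
    except sqlite3.IntegrityError:
        return True
    return False


rejected = try_orphan_annotation(conn)
annotation_count = conn.execute("SELECT COUNT(*) FROM annotation").fetchone()[0]
```

The constraint lives in the database, not in analysis scripts, so it continues to hold regardless of which code writes to the project.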
Schema Stability and Long-Term Usability
GLUE’s core schema is compact, stable, and domain-specific. It is extensible through standard mechanisms that allow new fields, tables, and modules to be introduced without invalidating existing analyses.
The schema functions as a contract: any compatible GLUE instance can interpret the stored data correctly. This supports:
long-lived public resources
reproducible cross-institutional collaboration
local analytical workflows
service-oriented or microservice deployments
Schema stability is central to the longevity of host–virus genomic resources.
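The claim that extensions need not invalidate existing analyses can be shown with a small example: a query written before a new field is added returns the same result afterwards. This is a sketch under assumptions. SQLite stands in for the project database, and the `sequence` table and the `host_species` field are hypothetical, not taken from GLUE's core schema or its extension mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence (sequence_id TEXT PRIMARY KEY, source TEXT NOT NULL)"
)
conn.execute("INSERT INTO sequence VALUES ('seq1', 'ncbi')")

# A query written against the original schema.
EXISTING_QUERY = "SELECT sequence_id, source FROM sequence ORDER BY sequence_id"
before = conn.execute(EXISTING_QUERY).fetchall()

# Extension: add a project-specific field. Existing rows start with NULL here.
conn.execute("ALTER TABLE sequence ADD COLUMN host_species TEXT")
conn.execute(
    "UPDATE sequence SET host_species = 'Homo sapiens' WHERE sequence_id = 'seq1'"
)

# The original query names its columns explicitly, so it is unaffected.
after = conn.execute(EXISTING_QUERY).fetchall()
new_field = conn.execute(
    "SELECT host_species FROM sequence WHERE sequence_id = 'seq1'"
).fetchone()[0]
```

Additive changes of this kind are what let a schema act as a contract: older consumers keep working, and newer ones can use the added fields.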
Alignments as Persistent Artefacts
In many genomics workflows, multiple sequence alignments are repeatedly regenerated, producing different results as methods or parameters change. In GLUE:
alignments are stored as persistent database objects
alignment rows link sequences to explicit reference coordinate systems
unresolvable regions are recorded explicitly
alignments are reversible rather than destructive
This transforms the alignment from a disposable intermediate into a durable scientific artefact. Alignments can be curated incrementally, compared across versions, and shared between projects using a common reference frame.
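One way to picture an alignment row tied to a reference coordinate system is as a list of segments, each mapping a block of the member sequence onto a block of the reference. The sketch below is illustrative only. The `AlignedSegment` class, its field names, and both helper functions are assumptions made for this example, and treating uncovered reference ranges as the explicitly recorded regions is one possible interpretation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AlignedSegment:
    """Maps a block of a member sequence onto reference coordinates.

    All coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    ref_start: int
    ref_end: int
    member_start: int
    member_end: int


def member_to_ref(segments: List[AlignedSegment], member_pos: int) -> Optional[int]:
    """Translate a member position to a reference position, or None if unaligned."""
    for seg in segments:
        if seg.member_start <= member_pos <= seg.member_end:
            return seg.ref_start + (member_pos - seg.member_start)
    return None


def uncovered_ref_regions(
    segments: List[AlignedSegment], ref_length: int
) -> List[Tuple[int, int]]:
    """List reference ranges not covered by any segment, so gaps are explicit."""
    gaps = []
    next_pos = 1
    for seg in sorted(segments, key=lambda s: s.ref_start):
        if seg.ref_start > next_pos:
            gaps.append((next_pos, seg.ref_start - 1))
        next_pos = max(next_pos, seg.ref_end + 1)
    if next_pos <= ref_length:
        gaps.append((next_pos, ref_length))
    return gaps


# A member sequence aligned to the reference in two blocks.
segments = [
    AlignedSegment(ref_start=1, ref_end=10, member_start=5, member_end=14),
    AlignedSegment(ref_start=21, ref_end=30, member_start=15, member_end=24),
]
```

Storing segments in reference coordinates, instead of a gapped string, means two projects that share a reference can compare positions directly without realigning.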
Longevity Through Minimal Dependencies
GLUE’s runtime architecture is intentionally constrained:
a Java-based execution engine
a JDBC-accessible relational database
optional interfaces to external tools for alignment, similarity search, and phylogenetic inference
External tools are invoked through well-defined interfaces and are not embedded in the core system. This allows analytical methods to evolve independently of stored data.
By limiting dependencies to mature, widely supported technologies, GLUE is designed to support service lifetimes measured in decades rather than development cycles.
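The separation between the core system and external tools can be sketched as a narrow interface that any aligner must satisfy. GLUE itself is Java-based; the Python below only illustrates the pattern. The `Aligner` protocol, the toy `PadRightAligner`, and `build_alignment` are hypothetical names, and the padding "aligner" is a stand-in, not a real alignment method.

```python
from typing import Dict, Protocol


class Aligner(Protocol):
    """The only contract the core relies on: sequences in, equal-length rows out."""

    def align(self, sequences: Dict[str, str]) -> Dict[str, str]:
        ...


class PadRightAligner:
    """Toy stand-in for an external tool: pads with '-' to a common length."""

    def align(self, sequences: Dict[str, str]) -> Dict[str, str]:
        width = max((len(s) for s in sequences.values()), default=0)
        return {name: seq.ljust(width, "-") for name, seq in sequences.items()}


def build_alignment(aligner: Aligner, sequences: Dict[str, str]) -> Dict[str, str]:
    """Run any conforming aligner and validate its output before it is stored."""
    aligned = aligner.align(sequences)
    lengths = {len(row) for row in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("aligner returned rows of unequal length")
    return aligned


result = build_alignment(PadRightAligner(), {"a": "ATG", "b": "ATGCC"})
```

Swapping in a different tool means writing another class with the same `align` method; the stored data and the code that validates it do not change.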
Scientific Continuity as an Architectural Goal
Many genomics resources are fragile because they treat analytical outputs as ephemeral. GLUE reverses this by embedding scientific context at the data layer:
hypotheses are stored as structured knowledge
alignments are persistent artefacts
metadata relationships are queryable
reference sequences define stable coordinate systems
As a result, projects can survive changes in staff, infrastructure, and analytical practice. New data can be integrated without erasing prior interpretations, and knowledge accumulates rather than being periodically reconstructed.
This is architectural support for scientific continuity.
Summary
Data immutability in GLUE is a response to the structural challenges of host–virus genomics: rapidly expanding datasets, complex evolutionary histories, heterogeneous metadata, and the need for long-lived scientific resources.
By embedding data, hypotheses, and analytical context within a stable relational model, GLUE provides a foundation for durable, reproducible genomic knowledge.
The result is simple but consequential: host–virus genomic knowledge that persists.



