Historical Postgres data

If we can store historical data in Postgres rather than drop and rebuild tables, then we can simplify the pipeline, reduce reliance on SQLite and make historical data available in any API we put on top of it, à la Ensembl releases.

To do this, we'll need to namespace tables somehow. Either:

Index tables

Versioning table:

release_name	dataset_name	table
20210804_v1	wp1_some_dataset	20210804_v1_wp1_some_dataset
20210804_v1	wp2_some_other_dataset	20210804_v1_wp2_some_other_dataset
20210805_v2	wp1_some_dataset	20210805_v2_wp1_some_dataset

SELECT * FROM isaric.20210805_v2_wp1_some_dataset;

I don't like this solution much, since it will be more complicated to implement and use than the second option:

Multiple schemas

Add new tables to a new schema named after the release name, and use the isaric schema as the live version:

isaric
- wp1_some_dataset
- wp2_some_other_dataset
20210804_v1
- wp1_some_dataset
- wp2_some_other_dataset
20210805_v2
- wp1_some_dataset
- wp2_some_other_dataset

SELECT * FROM isaric.wp1_some_dataset;  -- live, most recent data
SELECT * FROM 20210805_v2.wp1_some_dataset;  -- latest release, so should be identical to the above
SELECT * FROM 20210804_v1.wp1_some_dataset;  -- historical data

For this solution, the user that the pipeline connects as would need permission to create new schemas. I'm also not sure if this is how schemas are meant to be used.

Either solution would require sufficient storage on the Postgres server, as stored data will start growing. The largest table in production, isaric.clinical_data, is ~35MB.

Edited Aug 04, 2021 by mwham