Historical Postgres data
If we can store historical data in Postgres rather than drop and rebuild tables, then we can simplify the pipeline, reduce reliance on SQLite and make historical data available in any API we put on top of it, à la Ensembl releases.
To do this, we'll need to namespace tables somehow. Either:
Index tables
Versioning table:
release_name | dataset_name | table |
---|---|---|
20210804_v1 | wp1_some_dataset | 20210804_v1_wp1_some_dataset |
20210804_v1 | wp2_some_other_dataset | 20210804_v1_wp2_some_other_dataset |
20210805_v2 | wp1_some_dataset | 20210805_v2_wp1_some_dataset |
SELECT * FROM isaric.20210805_v2_wp1_some_dataset;
I don't like this solution much, since it will be more complicated to implement and use than the second option:
Multiple schemas
Add new tables to a new schema named after the release name, and use the isaric
schema as the live version:
- isaric
- wp1_some_dataset
- wp2_some_other_dataset
- 20210804_v1
- wp1_some_dataset
- wp2_some_other_dataset
- 20210805_v2
- wp1_some_dataset
- wp2_some_other_dataset
SELECT * FROM isaric.wp1_some_dataset; -- live, most recent data
SELECT * FROM 20210805_v2.wp1_some_dataset; -- latest release, so should be identical to the above
SELECT * FROM 20210804_v1.wp1_some_dataset; -- historical data
For this solution, the user that the pipeline connects as would need permission to create new schemas. I'm also not sure if this is how schemas are meant to be used.
Either solution would require sufficient storage on the Postgres server, as stored data will start growing. The largest table in production, isaric.clinical_data, is ~35MB.