Configurable columns?
The columns in the input data are highly variable, and it often doesn't make much sense to have hard-coded Column objects in each datasource. Maybe we could have a separate config file per dataset:
# some_dataset_2020-12-06.csv
sample_id this that other
CCP0001 0.1 4 some description
...
# some_dataset.yaml
---
description: 'Some dataset'
columns:
  - name: patient_id
    description: Clean ISARIC patient ID
    type: string
    patient_id: true
  - name: sample_id
    description: ISARIC sample ID
    type: string
    dirty_sample_id: true
  - name: clean_kit_id
    description: Clean ISARIC kit ID
    type: string
    clean_kit_id: true
  - name: this
    type: float
  - name: that
    type: integer
  - name: other
    type: string
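As a rough illustration of how a datasource could consume such a file, here is a minimal sketch. It assumes the YAML has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`), that the data file is comma-delimited, and that the three `type` values map onto `str`, `float` and `int`. The names `CASTS`, `DATASET_CONFIG` and `read_rows` are hypothetical and not part of the existing code. The config dict is abridged to the columns present in the sample file.

```python
import csv
import io

# Hypothetical mapping from the config's `type` values to Python casts.
CASTS = {"string": str, "float": float, "integer": int}

# Roughly what parsing some_dataset.yaml would produce (abridged).
DATASET_CONFIG = {
    "description": "Some dataset",
    "columns": [
        {"name": "sample_id", "type": "string", "dirty_sample_id": True},
        {"name": "this", "type": "float"},
        {"name": "that", "type": "integer"},
        {"name": "other", "type": "string"},
    ],
}


def read_rows(handle, dataset_config, delimiter=","):
    """Yield one dict per data row, with values cast per the column config."""
    casts = {col["name"]: CASTS[col["type"]] for col in dataset_config["columns"]}
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Assumption: columns missing from the config are kept as strings.
        yield {name: casts.get(name, str)(value) for name, value in row.items()}


# Stand-in for an open data file; the comma delimiter is an assumption.
sample = io.StringIO("sample_id,this,that,other\nCCP0001,0.1,4,some description\n")
rows = list(read_rows(sample, DATASET_CONFIG))
```

Flags such as `dirty_sample_id` are carried along in the config but ignored here; how they would drive ID cleaning is left open.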
Then linkage_config.yaml would have:
---
data_sources:
  some_dataset:
    input_data: path/to/some_dataset_????-??-??.csv
    config: path/to/some_dataset.yaml
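The `input_data` value is a glob pattern, so something has to turn it into a concrete file. One possible approach, sketched below, is to pick the most recent match. This is an assumption (the pattern could equally mean "load every match"), and it relies on the date in the filename being in ISO `YYYY-MM-DD` form so that string order equals date order. `latest_match` and `resolve_input` are hypothetical helper names.

```python
import fnmatch
import glob


def latest_match(pattern, paths):
    """Return the lexicographically greatest path that matches the pattern."""
    matches = fnmatch.filter(paths, pattern)
    if not matches:
        raise FileNotFoundError(f"no input file matches {pattern!r}")
    # ISO dates sort correctly as plain strings, so max() is the newest file.
    return max(matches)


def resolve_input(pattern):
    """Resolve an `input_data` pattern against the filesystem."""
    return latest_match(pattern, glob.glob(pattern))


candidates = [
    "path/to/some_dataset_2020-12-06.csv",
    "path/to/some_dataset_2021-01-15.csv",
    "path/to/some_dataset_final.csv",  # does not fit ????-??-??, so it is ignored
]
newest = latest_match("path/to/some_dataset_????-??-??.csv", candidates)
```

Keeping the matching logic in `latest_match` separate from the filesystem call makes it easy to check against a plain list of names.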
This way, adding new datasets would just be a config change. We'd need some way of defining custom behaviour, though.
Further down the line, with automatic file syncing, we could potentially let users drop their data into SharePoint and supply their own config file alongside it, and we would just connect it up to the pipeline.