Pre-cleaned SQL tables
Rewrites the data cleaning so that data is pre-cleaned before it is stored in the SQL tables.
-
Refactors elements of cleaning.py and schema.py into the DataSource
- DataSource now handles all column definitions, custom cleaning, patient/timepoint/kit columns, table names
- Adds patient ID and timepoint as column types
- Fixes 'nan' values in timepoints and miscasting of clinical columns (e.g. sitename)
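
As a rough illustration of the idea above, here is a minimal sketch of a data source that owns its table name and column types and cleans a data frame accordingly. The class layout, the `ColumnType` enum, the `clean` method and the column names are assumptions made for illustration, not the actual DataSource API in this change.

```python
from dataclasses import dataclass, field
from enum import Enum

import pandas as pd


class ColumnType(Enum):
    # Hypothetical column types; the real set is defined by the DataSource
    PATIENT_ID = "patient_id"
    TIMEPOINT = "timepoint"
    TEXT = "text"


@dataclass
class DataSource:
    # Assumed shape: one object holds the table name and the column typing
    table_name: str
    columns: dict = field(default_factory=dict)  # column name -> ColumnType

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        for name, col_type in self.columns.items():
            if col_type is ColumnType.TIMEPOINT:
                # Coerce to a nullable integer so missing values become <NA>
                # instead of surviving as the string 'nan'
                numeric = pd.to_numeric(out[name], errors="coerce")
                out[name] = numeric.astype("Int64")
            else:
                # Nullable string dtype avoids casting text columns to float
                out[name] = out[name].astype("string").str.strip()
        return out
```

Keeping the typing next to the table definition means the same object can drive both the cleaning and the SQL column definitions.
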
-
Decouples data frame creation from cleaning/SQL handling
-
Uses Pandas data frames to pre-clean data
- Assays now define columns for clean patient, timepoint and kit ID
- Adds a Linker object to handle data frame joins
- Loads tables in prescribed order
- Loads linkage tables
- Loads and cleans LIMS data
- Loads and cleans assays
- All sample IDs are now cleaned into a kit ID; removes lims_data.isaric.id
- Removes data cleaning previously done on the fly in SQL views
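
The linking step described above could look roughly like the sketch below, which cleans sample IDs into a kit ID and joins patient and timepoint on from a linkage table. The `Linker` interface, the `clean_kit_id` rule (trim and upper-case) and all column names are hypothetical; the real cleaning rules live in the DataSource definitions.

```python
import pandas as pd


def clean_kit_id(raw: pd.Series) -> pd.Series:
    # Assumed rule for illustration only: trim whitespace and upper-case
    return raw.astype("string").str.strip().str.upper()


class Linker:
    """Hypothetical helper that attaches patient and timepoint by kit ID."""

    def __init__(self, linkage: pd.DataFrame):
        linkage = linkage[["kit_id", "patient_id", "timepoint"]].copy()
        linkage["kit_id"] = clean_kit_id(linkage["kit_id"])
        # One row per kit so the join below cannot duplicate assay rows
        self.linkage = linkage.drop_duplicates("kit_id")

    def link(self, df: pd.DataFrame, sample_col: str) -> pd.DataFrame:
        out = df.copy()
        out["kit_id"] = clean_kit_id(out[sample_col])
        # Left join keeps rows whose kit ID has no linkage entry
        return out.merge(
            self.linkage, on="kit_id", how="left", validate="many_to_one"
        )
```

Doing the join once in Pandas before loading is what lets the SQL views drop their per-query joins and case statements.
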
-
Performance
- Fewer joins and case statements for obtaining clean data
- Release process now takes ~3 mins instead of 2-3 hours
-
Documentation
- Adds docstrings throughout
- Rearranges deployed documentation
- Data map now part of the release (changes over time)
- Linkage documentation for how data has been cleaned
- Top-level README with user notes on contributing data
- Automates doc deployment and latest release link
-
Configuration
- Removes file_names. Each DataSource now points to a full-path wildcard at raw_data_sources -> datasource.name
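
A sketch of how such a wildcard might be resolved, assuming the config is a nested mapping keyed by data source name. The `resolve_files` helper, the `lims_data` key and the example path are hypothetical and only illustrate the lookup.

```python
import glob

# Hypothetical config layout: raw_data_sources -> <datasource name> -> wildcard
config = {
    "raw_data_sources": {
        "lims_data": "/path/to/raw/lims/*.csv",
    }
}


def resolve_files(config: dict, datasource_name: str) -> list:
    # Look up the full-path wildcard for this data source and expand it
    pattern = config["raw_data_sources"][datasource_name]
    # Sorted so that files are always loaded in a stable order
    return sorted(glob.glob(pattern))
```
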
-
Tests
- Adds test coverage for data cleaning and linkage
Edited by mwham