Pre-cleaned SQL tables (!5) · Merge requests · ISARIC-4C / WP5 Data Integration

mwham requested to merge refactor into master May 28, 2021

Rewrite of the data cleaning to pre-clean the data before storing it.

Refactors elements of cleaning.py and schema.py into the DataSource
- DataSource now handles all column definitions, custom cleaning, patient/timepoint/kit columns, table names
- Adds patient ID and timepoint as column types
- Fixes 'nan' values in timepoints and miscasting of clinical columns, e.g. sitename
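A minimal sketch of column-type-driven cleaning of the kind listed above, assuming pandas. The names `COLUMN_TYPES`, `CLEANERS`, `clean_text` and `clean_columns`, and the cleaning rules themselves, are hypothetical and not taken from the repository; the actual DataSource may be organised quite differently.

```python
import pandas as pd

# Hypothetical mapping of column name -> column type (illustrative only).
COLUMN_TYPES = {"subjid": "patient_id", "timepoint": "timepoint", "sitename": "text"}


def clean_text(series: pd.Series) -> pd.Series:
    """Cast to string dtype, trim whitespace, and turn '' or 'nan' into missing values."""
    cleaned = series.astype("string").str.strip()
    return cleaned.mask(cleaned.str.lower().isin(["", "nan"]))


# Hypothetical cleaner per column type; patient IDs are additionally upper-cased.
CLEANERS = {
    "patient_id": lambda s: clean_text(s).str.upper(),
    "timepoint": clean_text,
    "text": clean_text,
}


def clean_columns(df: pd.DataFrame, column_types: dict) -> pd.DataFrame:
    """Return a copy of df with every declared column cleaned according to its type."""
    out = df.copy()
    for name, kind in column_types.items():
        if name in out.columns:
            out[name] = CLEANERS[kind](out[name])
    return out
```

Casting every declared column to a string dtype first is one way to avoid a text column such as sitename being inferred as numeric; whether the project does it this way is an assumption.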
Decouples data frame creation from cleaning/SQL handling
Uses Pandas data frames to pre-clean data
- Assays now define columns for clean patient, timepoint and kit ID
- Linker object to handle data frame joins
- Loads tables in prescribed order
  - Loads linkage tables
  - Loads and cleans LIMS data
  - Loads and cleans assays
- All sample IDs are now cleaned into a kit ID, which removes lims_data.isaric.id
- Removes data cleaning done on-the-fly in SQL views
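The kit ID cleaning and linkage steps above could be sketched roughly as follows, assuming pandas. `clean_kit_id` and `link_assay` are hypothetical names, and the normalisation rule (upper-case, strip non-alphanumerics) is an assumption made for illustration, not the rule used by the Linker object.

```python
import pandas as pd


def clean_kit_id(raw: pd.Series) -> pd.Series:
    """Assumed rule, for illustration: upper-case and drop every non-alphanumeric character."""
    return raw.astype("string").str.upper().str.replace(r"[^A-Z0-9]", "", regex=True)


def link_assay(assay: pd.DataFrame, linkage: pd.DataFrame, sample_col: str) -> pd.DataFrame:
    """Attach patient and timepoint columns to each assay row via a cleaned kit ID."""
    assay = assay.copy()
    linkage = linkage.copy()
    # Normalise both sides so the join key has the same form and dtype.
    assay["kit_id"] = clean_kit_id(assay[sample_col])
    linkage["kit_id"] = clean_kit_id(linkage["kit_id"])
    # A left join keeps assay rows with no match; validate raises if a kit
    # appears more than once in the linkage table.
    return assay.merge(linkage, on="kit_id", how="left", validate="many_to_one")
```

Because the join happens once on an already-clean key, the stored table would need no further ID handling at query time.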
Performance
- Fewer joins and case statements for obtaining clean data
- Release process now takes ~3 mins instead of 2-3 hours
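To illustrate the difference between cleaning in a view and storing pre-cleaned rows, here is a small sketch using Python's built-in `sqlite3`. The table names, columns and cleaning rule are all hypothetical, and SQLite is used only to keep the example self-contained; it says nothing about the project's actual database or timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_assay (sample_id TEXT, value REAL)")
conn.executemany(
    "INSERT INTO raw_assay VALUES (?, ?)",
    [("ab-001", 1.0), ("nan", 2.0), ("AB 002", 3.0)],
)

# Cleaning in a view: the CASE expression is re-evaluated on every query.
conn.execute("""
    CREATE VIEW assay_view AS
    SELECT CASE WHEN sample_id = 'nan' THEN NULL
                ELSE upper(replace(replace(sample_id, '-', ''), ' ', ''))
           END AS kit_id,
           value
    FROM raw_assay
""")


def clean(sample_id):
    """Same hypothetical rule as the view, applied once in Python."""
    if sample_id == "nan":
        return None
    return sample_id.replace("-", "").replace(" ", "").upper()


# Pre-cleaning: clean once, then store a plain table that needs no CASE logic.
rows = [(clean(s), v) for s, v in conn.execute("SELECT sample_id, value FROM raw_assay")]
conn.execute("CREATE TABLE assay (kit_id TEXT, value REAL)")
conn.executemany("INSERT INTO assay VALUES (?, ?)", rows)

from_view = conn.execute("SELECT kit_id, value FROM assay_view ORDER BY value").fetchall()
from_table = conn.execute("SELECT kit_id, value FROM assay ORDER BY value").fetchall()
```

Both queries return the same rows; the pre-cleaned table simply moves the work from query time to load time.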
Documentation
- Adds docstrings throughout
- Rearranges deployed documentation
  - Data map now part of the release (changes over time)
  - Linkage documentation for how data has been cleaned
  - Top-level Readme for user notes on contributing data
- Automates doc deployment and latest release link
Configuration
- Removes file_names. Each DataSource now points to a full-path wildcard at raw_data_sources -> datasource.name
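One way the wildcard lookup could work, as a sketch only: it assumes the config is a nested mapping in which `raw_data_sources` holds one full-path wildcard per data source name. `resolve_files` is a hypothetical helper, not a function from the repository.

```python
import glob


def resolve_files(config: dict, source_name: str) -> list:
    """Expand the full-path wildcard configured for one data source into a sorted list of paths."""
    pattern = config["raw_data_sources"][source_name]
    return sorted(glob.glob(pattern))
```

For example, a config entry such as `{"raw_data_sources": {"lims_data": "/path/to/raw/lims_*.csv"}}` (an invented path) would resolve to every matching file in that directory.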
Tests
- Adds test coverage for data cleaning and linkage

Edited May 28, 2021 by mwham
