Pre-cleaned SQL tables

mwham requested to merge refactor into master

Rewrites the data cleaning so that data is pre-cleaned before it is stored.

  • Refactors elements of cleaning.py and schema.py into the DataSource

    • DataSource now handles all column definitions, custom cleaning, patient/timepoint/kit columns, table names
    • Adds patient ID and timepoint as column types
    • Fixes 'nan's in timepoints, fixes miscasting of clinical columns, e.g. sitename
  • Decouples data frame creation from cleaning/SQL handling

  • Uses Pandas data frames to pre-clean data

    • Assays now define columns for clean patient, timepoint and kit ID
    • Linker object to handle data frame joins
    • Loads tables in prescribed order
      • Loads linkage tables
      • Loads and cleans LIMS data
      • Loads and cleans assays
    • All sample IDs now cleaned into a kit ID - removes lims_data.isaric.id
    • Removes data cleaning previously done on the fly in SQL views
  • Performance

    • Fewer joins and case statements for obtaining clean data
    • Release process now takes ~3 mins instead of 2-3 hours
  • Documentation

    • Adds docstrings throughout
    • Rearranges deployed documentation
      • Data map now part of the release (changes over time)
      • Linkage documentation for how data has been cleaned
      • Top-level Readme for user notes on contributing data
    • Automates doc deployment and latest release link
  • Configuration

    • Removes file_names. DataSource now points to a full-path wildcard at raw_data_sources -> datasource.name
  • Tests

    • Adds test coverage for data cleaning and linkage
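
As a rough illustration of the pre-cleaning idea described above, the sketch below cleans rows before they are written, so that a later query joins on an already-clean kit ID with no case statements. It is not the code in this merge request: the `DataSource` class here is a simplified stand-in, the cleaning rules in `clean_kit_id` and `clean_timepoint` are hypothetical, and plain Python with an in-memory `sqlite3` database stands in for the Pandas data frames and the project's real database to keep the example dependency-free.

```python
import re
import sqlite3
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


def clean_kit_id(raw: Optional[str]) -> Optional[str]:
    """Hypothetical rule: drop whitespace/separators, uppercase; blank -> None."""
    if raw is None:
        return None
    cleaned = re.sub(r"[\s_-]+", "", raw).upper()
    return cleaned or None


def clean_timepoint(raw) -> Optional[int]:
    """Hypothetical rule: 'nan', blanks and non-numeric values become NULL."""
    text = str(raw).strip().lower()
    if text in ("", "nan", "none"):
        return None
    try:
        return int(float(text))
    except ValueError:
        return None


@dataclass
class DataSource:
    """Illustrative stand-in for the DataSource, not its real interface."""
    table_name: str
    cleaners: Dict[str, Callable] = field(default_factory=dict)

    def clean(self, rows: List[dict]) -> List[dict]:
        # Apply each column's cleaner; columns without one pass through unchanged.
        return [
            {col: self.cleaners.get(col, lambda v: v)(val) for col, val in row.items()}
            for row in rows
        ]


def load(conn: sqlite3.Connection, source: DataSource, rows: List[dict]) -> None:
    """Clean the rows first, then store them in a table named after the source."""
    cleaned = source.clean(rows)
    columns = list(cleaned[0])
    conn.execute(f"CREATE TABLE {source.table_name} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {source.table_name} VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in cleaned],
    )


conn = sqlite3.connect(":memory:")
lims = DataSource("lims_data", {"kit_id": clean_kit_id, "timepoint": clean_timepoint})
assay = DataSource("assay", {"kit_id": clean_kit_id})

# Prescribed order: LIMS data first, then assays (sample rows are made up).
load(conn, lims, [
    {"kit_id": " ab-001 ", "timepoint": "3.0"},
    {"kit_id": "AB_002", "timepoint": "nan"},
])
load(conn, assay, [{"kit_id": "ab 001", "result": 1.5}])

# Because both tables hold clean kit IDs, the join needs no on-the-fly cleaning.
joined = conn.execute(
    "SELECT a.kit_id, l.timepoint, a.result "
    "FROM assay a JOIN lims_data l ON a.kit_id = l.kit_id"
).fetchall()
print(joined)
```

The point of the design is that differently formatted sample IDs collapse to the same kit ID at load time, so downstream queries stay simple equality joins.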
Edited by mwham
