Overview
nuMoM2b-preprocessing: Preprocessing scripts to create reproducible partitions of the nuMoM2b data set.
- Source Code: https://github.com/hayesall/nuMoM2b_preprocessing
Motivation
I was heavily involved in the Precision Health Initiative, the nuMoM2b project, and the follow-up Hoosier Moms Cohort study.
I immediately ran into a problem: the data had tens of thousands of variables, and it wasn’t obvious whether colleagues were using the same variables or the same preprocessing steps.
What I wanted was a simple domain-specific language (DSL) to describe where CSV files were stored, what variables I wanted out of them, and how to merge them into a single design matrix. The DSL never evolved beyond a JSON representation, but even that turned out to be extremely helpful.
[Figure: A JSON file showing a path to CSV files, a set of target variables, and a set of files where other variables are defined.]
[Figure: High-level architecture of the preprocessing stages. Data and a config file are turned into a final dataframe, and information about each step gets logged to a file.]
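As a rough illustration of the config-driven idea, here is a minimal Python sketch of how a JSON file could name a CSV directory, a target file, and a set of feature files, and how those could be merged into one table while each step is logged. Everything in it is an assumption made for illustration: the key names (`csv_path`, `id_column`, `target`, `features`), the file and variable names, and the helper functions `read_columns` and `build_design_matrix` are hypothetical and are not taken from the project's actual schema or code. It uses only the standard library; a dataframe merge would do the same job.

```python
import csv
import json
import logging
from pathlib import Path

logger = logging.getLogger("preprocess_sketch")

# Hypothetical config: key, file, and variable names are illustrative only,
# not the project's real format.
EXAMPLE_CONFIG = """
{
  "csv_path": "data/",
  "id_column": "PublicID",
  "target": {"file": "outcomes.csv", "variables": ["outcome"]},
  "features": [
    {"file": "visit1.csv", "variables": ["age", "bmi"]},
    {"file": "visit2.csv", "variables": ["glucose"]}
  ]
}
"""


def read_columns(path, id_column, variables):
    """Return {id: {variable: value}} for the requested columns of one CSV."""
    rows = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            rows[row[id_column]] = {name: row[name] for name in variables}
    logger.info("read %d rows, %d variables from %s", len(rows), len(variables), path)
    return rows


def build_design_matrix(config):
    """Inner-join the target and feature files on the ID column.

    Values are left as the strings read from the CSV files.
    """
    base = Path(config["csv_path"])
    id_column = config["id_column"]
    sources = [config["target"], *config["features"]]

    merged = None
    for source in sources:
        table = read_columns(base / source["file"], id_column, source["variables"])
        if merged is None:
            merged = table
        else:
            # Keep only the IDs that appear in every file seen so far.
            merged = {
                key: {**merged[key], **table[key]}
                for key in merged
                if key in table
            }
        logger.info("after %s: %d rows remain", source["file"], len(merged))

    return [{id_column: key, **values} for key, values in sorted(merged.items())]


config = json.loads(EXAMPLE_CONFIG)
```

Calling `build_design_matrix(config)` with `csv_path` pointing at a real directory returns one dictionary per ID found in every listed file. Because the config is plain JSON, two people can compare or share exactly which variables and files went into a given table.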