Data Set Comparator

CodeDay Labs 2020

Mentor: Matthew Ludewig, Quality Engineering Lead at LexisNexis Risk Solutions

Team members: Sam Bai, Khushi Wadhwa, An Nguyen

This is an analytics-based project. We need help creating Python-based executable code that can compare two datasets with various data types (integer/string) and calculate basic statistical differences between the two: for example, % of records different, % of records increased, % of records decreased, and change in cardinality. The challenge arises when trying to compare changes in string values (a list of states, for example). Both files would have a column "Residency State", and the data within that column is a list: (CA,CO,MN) for file A, but for file B the list could be (CA,CO,MN,TX). While the original values exist in both, we would want the change to be represented as an increase for the column, since the string adds one state value.
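One possible way to treat the list-valued string case is sketched below. It is an illustration only, not the team's actual code: the function names `parse_list` and `classify_list_change` are hypothetical, and the rule used (compare the number of distinct values in each cell) is an assumption about what "increase" should mean.

```python
def parse_list(cell):
    """Split a cell such as "(CA,CO,MN)" into a set of distinct values."""
    if cell is None:
        return set()
    # Drop surrounding whitespace and optional parentheses, then split on commas.
    text = cell.strip().strip("()")
    return {item.strip() for item in text.split(",") if item.strip()}


def classify_list_change(old_cell, new_cell):
    """Label the change between two list-valued cells.

    Assumed rule: more distinct values is an "increase", fewer is a
    "decrease", the same count with different members is "changed".
    """
    old_values = parse_list(old_cell)
    new_values = parse_list(new_cell)
    if old_values == new_values:
        return "same"
    if len(new_values) > len(old_values):
        return "increase"
    if len(new_values) < len(old_values):
        return "decrease"
    return "changed"


# The example from the description: file B adds TX to the list.
label = classify_list_change("(CA,CO,MN)", "(CA,CO,MN,TX)")
```

Parsing each cell into a set makes the comparison independent of the order in which the states are written, so "CA,CO" and "CO,CA" count as unchanged.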

We will provide 4 to 8 datasets with approximately 50K rows and anywhere from 50-1000 columns to aid in creating the code.

The deliverable should be executable code that can read in 2 datasets of the same structure and output a summarized table of the differences between the 2 files.
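A minimal sketch of such a summary follows, using only the Python standard library. It rests on several assumptions that the description does not state: rows are already loaded as dictionaries (for example via `csv.DictReader`), the two files are aligned row by row, and increase/decrease is only counted where both values parse as numbers. The names `summarize` and `_as_number` are hypothetical.

```python
def _as_number(value):
    """Return the value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarize(rows_a, rows_b):
    """Build a per-column summary of differences between two aligned datasets."""
    if len(rows_a) != len(rows_b):
        raise ValueError("this sketch assumes both datasets have the same row count")
    total = len(rows_a)
    columns = list(rows_a[0].keys()) if rows_a else []
    summary = {}
    for col in columns:
        different = increase = decrease = 0
        for row_a, row_b in zip(rows_a, rows_b):
            value_a, value_b = row_a[col], row_b[col]
            if value_a == value_b:
                continue
            different += 1
            num_a, num_b = _as_number(value_a), _as_number(value_b)
            # Direction is only defined here when both values are numeric.
            if num_a is not None and num_b is not None:
                if num_b > num_a:
                    increase += 1
                elif num_b < num_a:
                    decrease += 1
        summary[col] = {
            "pct_different": 100.0 * different / total,
            "pct_increase": 100.0 * increase / total,
            "pct_decrease": 100.0 * decrease / total,
            # Change in the number of distinct values from file A to file B.
            "cardinality_change": len({r[col] for r in rows_b})
            - len({r[col] for r in rows_a}),
        }
    return summary


# Tiny made-up example data, for illustration only.
rows_a = [
    {"age": "30", "state": "CA"},
    {"age": "40", "state": "CO"},
    {"age": "50", "state": "MN"},
    {"age": "60", "state": "CA"},
]
rows_b = [
    {"age": "31", "state": "CA"},
    {"age": "40", "state": "TX"},
    {"age": "45", "state": "MN"},
    {"age": "60", "state": "NY"},
]
result = summarize(rows_a, rows_b)
```

With 50K rows and up to 1000 columns, a tabular library would likely be faster than nested Python loops; the plain-loop version is kept here only to make the definitions of each statistic explicit.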

We would integrate this code into our team processes for analytical monitoring and research.

How much experience does your group have? Does the project use anything (art, music, starter kits) you didn't create?

CodeDay Labs advanced-track team

Participation Certificate

Members

An Nguyen