Data Harmonization Across Large Data Sets
Fully harmonized database that incorporates data selected from more than 130 million patients
Name of the customer: Mass General Brigham (Patient-Centered Outcomes Research Institute study: Comparing Second-Line Medicines to Treat Type 2 Diabetes -- The BESTMED Study)
Details
Fully harmonized database that incorporates nationwide real-world data selected from more than 130 million patients across twelve partners, which together form a coalition of healthcare plans and integrated healthcare systems.
· We emulated a pragmatic clinical trial comparing the effects of multiple second-line agents in individuals ≥ 30 years of age at moderate cardiovascular risk (a cohort-selection sketch follows this list).
· The BESTMED coalition contains 33.4 million HbA1c measurements and 216.7 million diabetes medication records.
· Encounters: ~1B+; lab results: ~500M+.
· Key features: cloud-hosted; privacy-preserving linkage for de-duplication (see the linkage sketch after this list); automated transformation; data analytics using SAS, SQL, and R.
· Data includes Medicare Advantage, Medicaid, Fully Insured Commercial, Administrative Services Only (ASO), Military (TRICARE), Vision, and Dental claims from two large players with a combined patient population of ~100M+.
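For illustration, cohort selection for the emulated trial can be expressed as a query over the harmonized data; the table names, the cv_risk_category flag, and the exact criteria shown here are assumptions for the sketch, not the actual BESTMED protocol or schema.

    -- Illustrative eligibility query: adults >= 30 at moderate cardiovascular
    -- risk who initiated a second-line agent (hypothetical tables and columns).
    SELECT c.patient_key,
           m.drug_class,
           m.initiation_date
    FROM   harmonized_patient_cohort  c
    JOIN   second_line_initiations    m
           ON m.patient_key = c.patient_key
    WHERE  c.age_at_initiation >= 30
    AND    c.cv_risk_category = 'moderate';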
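One minimal sketch of privacy-preserving linkage for de-duplication: each partner derives a salted hash token from normalized identifying fields, so only tokens, never raw identifiers, reach the central database. The schema, the shared-salt placeholder, and the Spark/MySQL-style SHA2 function are assumptions for illustration; production tokenization typically uses a dedicated linkage service.

    -- Illustrative only: a partner-side linkage token derived from normalized
    -- identifying fields plus a shared salt (raw identifiers stay at the source).
    CREATE TABLE partner_linkage_tokens AS
    SELECT patient_id AS local_patient_id,
           SHA2(CONCAT(UPPER(first_name), '|', UPPER(last_name), '|',
                       birth_date, '|', '<shared_salt>'), 256) AS linkage_token
    FROM   partner_patient_roster;

    -- Centrally, identical tokens from different partners collapse duplicate
    -- patients into a single record.
    SELECT linkage_token,
           COUNT(DISTINCT source_partner) AS n_partners
    FROM   all_partner_linkage_tokens
    GROUP  BY linkage_token
    HAVING COUNT(DISTINCT source_partner) > 1;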
The study relies on newer statistical methods such as causal inference, and on informatics techniques for building longitudinal patient records, to realize its ambitious goal of emulating a trial with a population in the tens of thousands.
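As a sketch of the informatics side, a longitudinal patient record can be assembled by stacking harmonized medication and lab events into one chronologically ordered per-patient timeline; the table and column names below are hypothetical stand-ins for the common data model tables.

    -- Hypothetical union of harmonized medication and HbA1c tables into a
    -- single ordered event stream per patient.
    SELECT patient_key,
           fill_date         AS event_date,
           'medication_fill' AS event_type,
           rxnorm_code       AS event_code
    FROM   harmonized_medication_records
    UNION ALL
    SELECT patient_key,
           result_date,
           'hba1c_result',
           loinc_code
    FROM   harmonized_lab_results
    ORDER  BY patient_key, event_date;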
Know-how and subject-matter expertise in bringing together health claims and clinical data, harmonizing it for analytics and AI/ML techniques, and visualizing the outcomes were key to the success of the project. We have a team of healthcare domain experts who understand the data well enough to map it (often to standards) and who build automation for parsing regular intake and adding value to streaming feeds.
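For illustration, a typical standards-mapping step joins each partner's raw lab feed against a curated crosswalk to LOINC; raw_lab_results and loinc_crosswalk are hypothetical table names, and 4548-4 / 17856-6 are common HbA1c LOINC codes used here only as an example.

    -- Hypothetical crosswalk join: local lab codes are normalized to LOINC so
    -- HbA1c results from all partners land in one harmonized table.
    INSERT INTO harmonized_lab_results
           (patient_key, loinc_code, result_value, result_unit,
            result_date, source_partner)
    SELECT r.patient_key,
           x.loinc_code,
           r.result_value,
           x.standard_unit,
           r.result_date,
           r.source_partner
    FROM   raw_lab_results r
    JOIN   loinc_crosswalk x
           ON  x.source_partner = r.source_partner
           AND x.local_lab_code = r.local_lab_code
    WHERE  x.loinc_code IN ('4548-4', '17856-6');   -- example HbA1c codes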
The challenges were mostly around bringing together data from multiple partners spanning more than 10 years of historical records. The data need to conform to current clinical standards such as RxNorm, LOINC, ICD, and CPT; be mapped to a common data model for better analytics; be backed by a responsive system that can onboard data loads of hundreds of GB securely; and follow the same baseline query so that no data is missed from any source. Our team used its experience to create comprehensive guidelines, worked within the system to build quality checks in a staging environment, revamped queries, and built a feedback loop with the sources to ensure data completeness.
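A minimal sketch of the kind of completeness check run in the staging environment, assuming a hypothetical staging table and Postgres-style DATE_TRUNC: each partner's monthly record volume is compared against its own average, and months that fall well below it are flagged back to the source through the feedback loop before the load is promoted.

    -- Illustrative staging check: flag partner-months whose medication record
    -- counts drop below half of that partner's average volume.
    WITH monthly_counts AS (
        SELECT source_partner,
               DATE_TRUNC('month', fill_date) AS load_month,
               COUNT(*)                       AS n_records
        FROM   staging_medication_records
        GROUP  BY source_partner, DATE_TRUNC('month', fill_date)
    ),
    flagged AS (
        SELECT source_partner,
               load_month,
               n_records,
               AVG(n_records) OVER (PARTITION BY source_partner) AS partner_avg
        FROM   monthly_counts
    )
    SELECT source_partner, load_month, n_records, partner_avg
    FROM   flagged
    WHERE  n_records < 0.5 * partner_avg;   -- 0.5 is an illustrative threshold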