A Semi-Automated Approach to Data Harmonization Across Environmental Health Studies

The NIEHS-supported Human Health Exposure Analysis Resource (HHEAR) Data Center maintains a public-use data repository to promote reuse of environmental health data generated by the HHEAR program. Creating and maintaining this repository requires integrating information from a wide variety of epidemiologic studies. We have developed the Human-Aware Data Acquisition Framework to enable this complex integration, supporting harmonization across multiple studies and enabling meaningful search of, and access to, the data deposited in the HHEAR Data Repository. To integrate data from a new study, investigators engage in an initial, time-consuming effort to link study data to the HHEAR ontology, a controlled vocabulary of environmental and public health terms. This is accomplished by generating a semantic data dictionary (SDD) from the data dictionaries and codebooks provided by HHEAR study investigators. Originally, this was done manually by an expert in both epidemiological terminology and ontological modeling. To make these tools more accessible to environmental health scientists who lack formal ontology training, we have developed an SDD-Editor that simplifies the ontology modeling process. The SDD-Editor reuses elements common to epidemiologic data dictionaries and spreadsheet software, while integrating features needed to form semantic links between public health concepts and existing ontologies. Using natural language processing to capture the semantic similarity between data dictionary descriptions and ontology class descriptions, the SDD-Editor suggests potential concept matches for study variables within the SDD. If no suitable suggestion exists, investigators can search for ontology terms using a search engine powered by BioPortal. Once the SDD is finished, a validator checks that it has the correct format and that all classes are valid.
By automating parts of the ontology modeling process, the SDD-Editor greatly facilitates the dynamic integration of HHEAR environmental health studies into a single repository, benefiting the scientific community.
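
The suggestion and validation steps described above can be illustrated with a minimal Python sketch. It is not the SDD-Editor's implementation: the abstract does not specify the NLP method, so a simple bag-of-words cosine similarity stands in for it here. The ontology excerpt, the `ex:` class identifiers, the field names `variable` and `ontology_class`, and the functions `suggest` and `validate` are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical ontology excerpt: class identifier -> class description.
# Identifiers and descriptions are invented for this sketch.
ONTOLOGY = {
    "ex:BodyMassIndex": "body mass index computed from weight and height",
    "ex:MaternalAge": "age of the mother at delivery in years",
    "ex:UrinaryArsenic": "concentration of arsenic measured in urine",
}

# Hypothetical required fields for each row of a semantic data dictionary.
REQUIRED_FIELDS = ("variable", "ontology_class")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def suggest(description, ontology=ONTOLOGY, top_k=2):
    """Return up to top_k class identifiers ranked by similarity to the description."""
    scored = [(cosine(description, text), cid) for cid, text in ontology.items()]
    scored = [(score, cid) for score, cid in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [cid for _, cid in scored[:top_k]]


def validate(rows, ontology=ONTOLOGY):
    """Check that every row has the required fields and a known ontology class."""
    errors = []
    for i, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                errors.append(f"row {i}: missing {field}")
        cls = row.get("ontology_class")
        if cls and cls not in ontology:
            errors.append(f"row {i}: unknown class {cls}")
    return errors


if __name__ == "__main__":
    print(suggest("mother's age at delivery"))
    print(validate([{"variable": "bmi", "ontology_class": "ex:Nope"}]))
```

A production tool would likely replace the token-count vectors with a richer text representation and check classes against the live ontology rather than a local dictionary; the overall shape (rank candidate classes, let the investigator choose, then validate the finished SDD) would stay the same.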

Associated Projects

In 2019, NIEHS established the Human Health Exposure Analysis Resource (HHEAR) Data Center as a continuation of the CHEAR Data Center, expanding its scope to include health outcomes at all ages. The goal is to provide approved HHEAR investigators with their laboratory analysis results and to incorporate those results into statistical analyses of their study data. We then make those data publicly available to improve knowledge of the comprehensive effects of environmental exposures on human health throughout the life course.