The Semantic Data Dictionary: An Approach for Describing and Annotating Data

It is common practice for data providers to include text descriptions for each column when publishing datasets in the form of data dictionaries. While these documents help an end user properly interpret the meaning of a column in a dataset, existing data dictionaries are typically not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation to data, enabling standardization and harmonization across diverse datasets. In this paper, we present the Semantic Data Dictionary in the context of our work with biomedical data; however, the approach can be, and has been, used in a wide range of domains. Rendering data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey dataset, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large NIH-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.


Associated Projects

The aim of the Semantic Data Dictionary (SDD) approach is to annotate datasets so that they are machine-readable, use best-practice ontologies, and follow the FAIR Guiding Principles. The project was developed to address the difficulty machines have in interpreting data dictionaries, a standard method of describing datasets through tables that identify the content, description, and format of each data variable. The SDD supports extending and integrating data from multiple domains under a common metadata standard.
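The contrast between a traditional data dictionary row and a machine-readable annotation can be sketched roughly as follows. This is a minimal illustration, not the SDD specification itself: the class names, the field names (`attribute`, `attribute_of`, `unit`), the `http://example.org/vocab#` namespace, and the `age_years` column are all hypothetical, chosen only to show how a free-text description might be replaced by references to shared vocabulary terms.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical namespace; a real annotation would point at terms from
# established ontologies rather than example.org.
EX = "http://example.org/vocab#"


@dataclass(frozen=True)
class PlainEntry:
    """One row of a traditional data dictionary: readable by people only."""
    column: str
    description: str
    data_format: str


@dataclass(frozen=True)
class AnnotatedEntry:
    """The same column described with vocabulary terms (illustrative fields)."""
    column: str
    attribute: str               # what is measured
    attribute_of: str            # the entity the measurement describes
    unit: Optional[str] = None   # unit of measure, if any


def annotate(entry: PlainEntry, attribute: str, attribute_of: str,
             unit: Optional[str] = None) -> AnnotatedEntry:
    """Attach vocabulary terms to a plain entry, expanding local names to IRIs."""
    return AnnotatedEntry(
        column=entry.column,
        attribute=EX + attribute,
        attribute_of=EX + attribute_of,
        unit=None if unit is None else EX + unit,
    )


def to_triples(entry: AnnotatedEntry) -> List[Tuple[str, str, str]]:
    """Flatten an annotated entry into subject-predicate-object triples."""
    subject = EX + "column/" + entry.column
    triples = [
        (subject, EX + "attribute", entry.attribute),
        (subject, EX + "attributeOf", entry.attribute_of),
    ]
    if entry.unit is not None:
        triples.append((subject, EX + "unit", entry.unit))
    return triples


plain = PlainEntry("age_years", "Age of the participant in years", "integer")
annotated = annotate(plain, "Age", "Participant", unit="Year")
for triple in to_triples(annotated):
    print(triple)
```

Because every value in the annotated form is an identifier rather than prose, two datasets that label their columns differently but point to the same terms could in principle be aligned automatically, which is the kind of harmonization the approach is aiming for.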

In 2019, NIEHS established the Human Health Exposure Analysis Resource (HHEAR) Data Center as a continuation of the CHEAR Data Center, expanding its scope to include health outcomes at all ages. The goal is to provide approved HHEAR investigators with their laboratory analysis results and to incorporate those results into statistical analyses of their study data. We then make the data publicly available to improve our knowledge of the comprehensive effects of environmental exposures on human health throughout the life course.