Three principles for building a government dataset catalog vocabulary
There is ongoing interest in vocabularies for government dataset publishing, and a number of proposals exist, such as DERI's dcat, Sunlight Labs' guidelines, and RPI's Data-gov Vocabulary proposal. Based on our experience with data.gov catalog data, we found the following principles useful for consolidating the vocabulary-building process and potentially reaching consensus:
- 1. modular vocabulary with a minimal core
- keep the core vocabulary small and stable, including only a small set of frequently used (or required) terms
- allow extensions contributed by anyone. Extensions should connect to the core ontology and should be eligible for promotion to core status later.
- 2. choice of terms
- make it easy for curators to produce metadata using a term, e.g. do they need to specify data quality?
- make the expected range of a term clear, e.g. should curators use "New York" or dbpedia:New_York for spatial coverage? Does the term require a controlled vocabulary? A validator would be very helpful
- make the expected use of a term clear, e.g. can it be displayed in a rich snippet? Can it be used in a SPARQL query, in search, or in faceted browsing?
- try to reuse terms from existing popular vocabularies
- identify the required, recommended, and optional terms
- 3. best practices for actual usage
- we certainly want the metadata to be part of linked data, but that is not the end goal. We would like to see the linked data actually used by people who don't know much about the Semantic Web.
- we should consider making the vocabulary available in different formats for a wider range of users, e.g. RDFa, microformats, Atom, JSON, XML Schema, OData
- we should build use cases, tools, and demos that exhibit the vocabulary in use, to promote adoption
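To make principle 2 concrete, here is a minimal sketch of the kind of validator suggested above: it checks a candidate catalog record against required/recommended term lists and flags a plain label where a URI (e.g. a DBpedia resource) is expected for spatial coverage. The term names and the REQUIRED/RECOMMENDED sets are illustrative assumptions, not part of any published vocabulary.

```python
# Illustrative term sets; a real vocabulary would define these normatively.
REQUIRED = {"title", "publisher"}
RECOMMENDED = {"spatial", "temporal"}
URI_VALUED = {"spatial"}  # terms whose values should be URIs, not labels


def validate(record):
    """Return a list of human-readable problems with a metadata record."""
    problems = []
    for term in sorted(REQUIRED):
        if term not in record:
            problems.append("missing required term: %s" % term)
    for term in sorted(RECOMMENDED):
        if term not in record:
            problems.append("missing recommended term: %s" % term)
    for term in sorted(URI_VALUED):
        value = record.get(term)
        if value is not None and not value.startswith("http"):
            problems.append(
                "term %r expects a URI (e.g. a DBpedia resource), got %r"
                % (term, value))
    return problems


record = {
    "title": "NY state budget 2010",
    "publisher": "New York State",
    "spatial": "New York",  # a label where a URI was expected
}
for p in validate(record):
    print(p)
```

Running this flags the missing recommended term and the label-valued spatial coverage; a curator could fix the record to use dbpedia:New_York before publishing.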
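And for principle 3, a minimal sketch of serving the same catalog record in more than one format, built with the Python standard library. The record, its field names, and the `http://example.org/vocab#` namespace are illustrative assumptions; only JSON and a small Atom-style XML entry are shown here.

```python
import json
import xml.etree.ElementTree as ET

# A single hypothetical catalog record, rendered two ways below.
record = {
    "title": "NY state budget 2010",
    "publisher": "New York State",
    "spatial": "http://dbpedia.org/resource/New_York",
}


def to_json(record):
    """Render the record as a JSON object for web developers."""
    return json.dumps(record, sort_keys=True)


def to_atom_entry(record):
    """Render the record as a minimal Atom-style entry for feed readers."""
    entry = ET.Element("entry", xmlns="http://www.w3.org/2005/Atom")
    ET.SubElement(entry, "title").text = record["title"]
    author = ET.SubElement(entry, "author")
    ET.SubElement(author, "name").text = record["publisher"]
    # vocabulary-specific terms can ride along in their own namespace
    spatial = ET.SubElement(entry, "{http://example.org/vocab#}spatial")
    spatial.text = record["spatial"]
    return ET.tostring(entry, encoding="unicode")


print(to_json(record))
print(to_atom_entry(record))
```

The point is not these two formats in particular, but that one canonical record can feed RDFa, Atom, JSON, and other views so users outside the Semantic Web community can still consume the catalog.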
Comments are welcome.
Li Ding @ RPI