Demo for resource manager extension to the water quality portal



Overview


Water pollution can both directly harm animals' health and disrupt their habitats. This project therefore extends the SemantEco portal to help resource managers identify water pollution that threatens wildlife, especially endangered species. To enable this extension, the portal needs to incorporate knowledge from three different fields: species observation data, water quality criteria for aquatic life, and health effects of contaminants on species. We first use ontologies to model these fields and then integrate data according to the ontologies. We capture and leverage provenance while integrating data to keep the portal transparent.

Background Knowledge


HUC8 is an 8-digit hydrologic unit code identifying a subbasin covering roughly 700 square miles. See http://en.wikipedia.org/wiki/Hydrological_code

ReachCode: A reach is a continuous piece of surface water with similar hydrologic characteristics. Some unconnected (isolated) features are also reaches, for example, isolated lakes and single, unconnected streams. In the National Hydrography Dataset (NHD), each reach is assigned a reach code. See http://nhd.usgs.gov/nhd_faq.html

GeoJSON is a format for encoding a variety of geographic data structures. See http://www.geojson.org/geojson-spec.html
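As a minimal illustration (the coordinates and property name here are made up), a GeoJSON feature for a point on a water body looks like this:

{ "type": "Feature",
  "geometry": { "type": "Point", "coordinates": [-122.33, 47.61] },
  "properties": { "name": "Example water body" } }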

Technical Foundations


Architecture


The system architecture of SemantAqua is illustrated below. The system comprises six major components: (1) ontology, (2) data conversion, (3) storage, (4) reasoning, (5) provenance, and (6) visualization.
Ontology Component: The ontology component captures the domain knowledge. The portal uses the following ontologies: an upper ontology that defines the basic terms for environmental monitoring, a water ontology for the domain of water quality monitoring, a wildlife ontology for modeling wildlife observations, and a health ontology for capturing the health effects of water pollution on different species.
Data Conversion Component: The water quality data and wildlife observation data are converted into RDF triples via csv2rdf4lod. Water quality regulations are converted into OWL classes by our regulation converter. Health data are encoded manually.
Storage Component: The RDF data are stored in OpenLink's Virtuoso 6 open source community edition triple store, which includes a web-accessible endpoint that answers SPARQL queries from web clients. The OWL files are hosted on our web server and available online.
Reasoning Component: We use the Pellet OWL reasoner together with the Jena Semantic Web framework to reason over the data and ontologies in order to identify water pollution.
Provenance Component: The provenance component comprises the provenance capture support that the system gets from the converters and the provenance-aware applications.
Visualization Component: This component is responsible for mashing up and presenting the data collected from various sources. The existing portal supports two types of visualizations: (1) a map visualization that displays the sources of water pollution in the context of geographic regions, and (2) a time series visualization that depicts pollution levels over time for a particular water source or facility.
This project adds three more types of visualizations: (1) a map visualization that displays species observations at the water-body level, (2) a line graph that depicts the count of observed species over time, and (3) a heat map that presents the density of observed species in the counties of a selected state.
Architecture

Ontology Overview


The portal builds on four ontologies, introduced above in the architecture: an upper ontology for basic environmental monitoring terms, a water ontology for water quality monitoring, a wildlife ontology for modeling wildlife observations, and a health ontology for the health effects of water pollution on different species. The wildlife and health ontologies, together with the OWL encoding of water quality criteria for aquatic life, are the additions made by this project.

Provenance Approach


Provenance is captured and encoded in PML 2 during the data integration stages via csv2rdf4lod and our regulation converter. In each data integration stage, we capture provenance such as the data source, time, method, and protocol used.
In this project, we expand our provenance support by incorporating one more type of provenance: decisions about why we chose the data sets we did. With such provenance, the user can gain a more in-depth understanding of the data integration stages.
In our work, we build three provenance-aware services as follows.
1) Exposing data lineage: We display data lineage in a pop-up window when the user clicks on the question mark next to a pollution record.
2) Dynamic data source listing: The portal queries the provenance to find the source organizations for the data and generates the data source facet. With this facet, the user can select the organizations they trust, and the portal will use only data from the selected organizations (a query sketch follows this list).
3) Cross validation: The service is at http://aquarius.tw.rpi.edu/projects/semantaqua/val/validate.html
Environmental data depend on geographic location and measurement time, so only data measured at nearby locations and times can meaningfully validate each other. In our portal, we compare and cross-validate water quality data originating from different source agencies while taking provenance such as location and measurement time into account.
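As a sketch of how the data source facet can be populated, the query below lists the organizations recorded in the provenance graph. It assumes PML 2's pmlp:Organization class and pmlp:hasName property; the actual triples produced by csv2rdf4lod may differ.

PREFIX pmlp: <http://inference-web.org/2.0/pml-provenance.owl#>
# List every organization named in the captured provenance; the portal can
# render these as checkboxes in the data source facet.
SELECT DISTINCT ?orgName
WHERE {
  ?org a pmlp:Organization ;
       pmlp:hasName ?orgName .
}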

Use Case


The resource manager chooses a geographic region of interest by entering a zip code, and selects the species of concern in the species facet. The portal identifies polluted water sources and polluting facilities and visualizes the results on a map using different icons. Meanwhile, the portal displays the distribution of the species in the region at the water-body level. The resource manager then inspects the map to see whether the selected species might be endangered by water pollution in the region, and can click on polluting facilities or polluted sites to investigate the pollution further, e.g., the health effects of the pollution on the species.


 

Demonstrations


Species Facet


The species facet determines which sets of water quality regulations are displayed in the regulation facet. If the user chooses "Human", then the regulation facet displays multiple sets of water quality regulations for human health. "Human" is the default setting.

Human

If the user chooses "Aquatic life", then the regulation facet displays multiple sets of water quality regulations for the protection of aquatic life.

Aqua

If the user chooses "All species", then the regulation facet displays all sets of water quality regulations.

All Species

Water Quality Regulations for the Protection of Aquatic Life


We extend the portal to wildlife by incorporating water quality regulations for the protection of aquatic life, as shown in the figures below.
When the user chooses to apply the "EPA regulation for aquatic life", the portal identifies polluted sites according to this set of regulations.

Overview of EPA Regulation for Aquatic Life

The user clicks on a polluted site, and a pop-up window shows more details about the pollution: the names of contaminants, measured values, limit values, times of measurement, and health effects.

Site Details from Applying EPA Regulation for Aquatic Life

When the user changes the species in the "Species" drop-down list, the habitats of that species within the current viewport are highlighted. When the user then clicks one of the highlighted areas, information about the water body and the provenance of that information are shown in the "Water Body Properties" tab.

Canada Goose Distribution near Seattle, Washington

In the "Bird Count" tab, a plot derived from data at http://www.avianknowledge.net/ is shown.

Bird Count for Canada Geese in Washington State in 2007

Heatmap


We also support visualizing bird observations as a heat map.
The user can select the species to investigate, and the state, year, and month of interest.
Canada Goose Distribution in Washington

Industry Facet


The user can focus on facilities from particular industries via the industry facet.
If the user chooses "All Data", the portal retrieves all the facilities and reasons over their water measurements. "All Data" is the default setting.

EPA Facilities from All Industries near Seattle

If the user chooses "Manufacturing", the portal retrieves only manufacturing facilities and reasons over their water measurements. We can see that the industry facet narrows down the set of facilities.

EPA Manufacturing Facilities near Seattle

Cross Validation


In this gadget, the user specifies the state and county of interest, and a threshold value that decides whether two sites are close enough for their measurements to be meaningfully compared.
The gadget first retrieves sites that are located close to each other.
Next, it retrieves the characteristics that are measured at both sites.
The user needs to select a test type for measurements from the EPA, since characteristics are measured with multiple test types in the EPA datasets.
Lastly, the gadget fetches the measurements of the selected characteristic and test type and visualizes them as time series, with which the user can check for data inconsistencies. (A sketch of the underlying query follows the figure below.)
Cross Validation Example
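Behind the last step, the gadget needs a query of roughly the following shape: it fetches dated values of one characteristic from two nearby sites so they can be plotted side by side. The pol: prefix, property names, and site IRIs are illustrative assumptions, not the portal's actual schema.

PREFIX pol: <http://example.org/pollution#>
# Fetch arsenic measurements from two candidate sites for a time series plot;
# names and IRIs below are placeholders for illustration.
SELECT ?site ?date ?value
WHERE {
  VALUES ?site { <http://example.org/siteA> <http://example.org/siteB> }
  ?m pol:hasSite ?site ;
     pol:hasCharacteristic "Arsenic" ;
     pol:hasDate ?date ;
     pol:hasValue ?value .
}
ORDER BY ?site ?date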

Live Demo


Discussions


Learnings from Our Project


GIS tools such as FWTools and QGIS are indispensable for building map-based portals. In our project, FWTools is used to convert the shapefiles of water bodies downloaded from the National Hydrography Dataset into GeoJSON objects that are consumable by JavaScript.

We studied four JavaScript map libraries: OpenLayers, Polymaps, Google Maps JavaScript API version 2, and Google Maps JavaScript API version 3. The system is currently based on Google Maps JavaScript API version 2, because the original SemantAqua is built on this library, but we favor OpenLayers over the other three libraries because of its flexibility. Neither Google Maps API supports adding GeoJSON layers directly to the map, and Polymaps cannot use the Google Maps service. OpenLayers can use the Google Maps service and conveniently adds GeoJSON layers to the map.

D3 (Data-Driven Documents) is a JavaScript library that allows you to bind arbitrary data to Document Object Model (DOM) elements. It is an ideal library for building data-oriented web applications. In our project, D3 is used to generate the bird count visualizations in the form of heat maps and timeline plots.


Value of Semantics


The value of semantics by Han

By extending ontologies, we provide a controlled vocabulary across the domains of water quality and human and wildlife health. Users, from resource managers to citizen scientists, can all use terms from the same vocabulary, which greatly increases interoperability between users with different backgrounds. The portal we built is based on this controlled vocabulary, and we use semantic technologies to facilitate this interoperability. One value of this interoperability is the ability to work with datasets that are collected by different agencies using different data schemas. We integrated a large amount of data describing water quality regulations, wildlife distribution, and diseases of humans and aquatic life by encoding them with our ontologies. This cross-domain integration also makes it possible to explore problems that were previously difficult to address. For example, we can present a potential linkage between water pollution and wildlife health by examining water pollutant measurements together with the causes of certain wildlife diseases. We can then draw conclusions such as: an arsenic concentration that violates a regulation in a certain water body might explain arsenic poisoning observed in Canada Geese at that water body.

The value of semantics by Ping

  • Ontology and SPARQL for data aggregation

As stated in [1], data aggregation is often used to get an overall understanding of a dataset, and it can only be performed when it is sensible to aggregate the data objects. For example, it does not make sense to aggregate the count of observers into the count of species observed.
Once the data have been converted into triples, we can use SPARQL to perform appropriate data aggregation. Not only does SPARQL let us specify the constraints of the aggregation, it also supports aggregation functions such as COUNT, SUM, AVG, MIN, and MAX.
In our example, we obtain the total counts of "Canada Goose" in the counties of Washington state in June 2007 with the SPARQL query below.
While we can use SPARQL queries to aggregate data in semantic formats, it can be challenging to aggregate data in other formats. For instance, the Washington Department of Fish and Wildlife provides species distribution data in a spreadsheet (http://wdfw.wa.gov/publications/00165/2012_distribution_by_county.xls). To retrieve the data to be aggregated and then perform the aggregation, a resource manager has three options: do it manually, write ad hoc programs, or write complex Excel macros. All three options require considerable time and effort from the resource manager.

PREFIX geospecies: <http://rdf.geospecies.org/ont/geospecies#>
PREFIX wildlife: <> # wildlife ontology IRI not preserved in the original page
SELECT ?ctyName (SUM(?count) AS ?total)
WHERE { GRAPH <> { # graph IRI not preserved in the original page
  ?obv wildlife:hasState "Washington" ;
       wildlife:hasCounty ?ctyName ;
       geospecies:hasCommonName "Canada Goose" ;
       wildlife:hasYearCollected "2007" ;
       wildlife:hasMonthCollected "06" ;
       wildlife:hasObservationCount ?count .
}}
GROUP BY ?ctyName

  • Link data from various sources

Example: We adopt terms from GeoSpecies when designing our wildlife ontology, e.g. geospecies:hasCommonName, so the species data that we integrate from AKN or other sources can be linked to GeoSpecies through the common vocabulary.
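For instance, a query along these lines joins our observation data with GeoSpecies species concepts through the shared common-name term (the graph structure is simplified for illustration):

PREFIX geospecies: <http://rdf.geospecies.org/ont/geospecies#>
# Match observations to GeoSpecies concepts that share a common name.
SELECT ?obs ?speciesConcept
WHERE {
  ?obs geospecies:hasCommonName ?name .             # our integrated data
  ?speciesConcept geospecies:hasCommonName ?name .  # GeoSpecies entries
  FILTER (?obs != ?speciesConcept)
}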

The value of semantics by Linyun

The beauty of semantic technology comes from the separation of logic from source code. In our demonstration, the regulatory logic that decides whether a water body (facility) is polluted (polluting) is kept separate from the source code.

For example, without semantic technology, in order to decide whether a water body is polluted in terms of arsenic concentration according to the EPA Drinking Water Regulations from http://water.epa.gov/drink/contaminants/index.cfm on 2012-02-11, which set a threshold value of 10.0 ug/L, we would need to embed the regulatory logic somewhere in the source code as an if-else statement:

if arsenicConcentration >= 10.0 then isPolluted := true;
else isPolluted := false

If the regulation later becomes stricter and the threshold is changed to 9.0 ug/L, we need to change every such if-else statement to keep the website up to date. Of course we can use named constants instead of raw numbers, but changing the source code is still inevitable. Note that this is an intentionally simplistic example to showcase the idea of separating logic from source code; in reality, the logic for detecting polluting sites is usually not that simple.

With semantic technology, we encode the regulatory logic in a separate OWL ontology file. The source code executes a SPARQL query to retrieve the polluted water bodies and polluting facilities, relying on the reasoner (Pellet, accessed via Jena) to populate the OWL constraint class corresponding to violations. This makes the website much easier to maintain as regulations change: we only need to update the ontology file and point the source code to the new ontology to apply the new regulation; the source code itself remains unchanged. Note also that the ontology file contains only the regulatory logic, whereas logic embedded in source code easily ends up scattered among unrelated statements.
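As a minimal sketch (the pol: prefix and class name are illustrative, not the portal's exact terms), the source code only ever needs a query of this shape; which sites fall into the violation class is decided entirely by the OWL restrictions that the reasoner evaluates:

PREFIX pol: <http://example.org/pollution#>
# No thresholds appear here: the reasoner classifies sites into
# pol:PollutedSite according to whichever regulation ontology is in force.
SELECT ?site
WHERE {
  ?site a pol:PollutedSite .
}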

Value of Provenance


The value of provenance by Han

  • Provenance tracking

Provenance is captured and maintained during the data retrieval process. For example, the data conversion tool csv2rdf4lod is used in the species distribution data integration to capture PML-encoded provenance such as the data source URL, the retrieval time, and the method and protocol used. In the health effect ontology, each health effect instance has a reference webpage as its information source; for instance, the ArsenicPoisoning instance is sourced from a Wildpro webpage. By including provenance, we let users examine the data's origins and help them decide how useful the data is for their application.

Value of provenance by Ping

Explanation generation

As the portal incorporates more ontologies and integrates more data, it becomes able to offer high-level hypotheses such as "H1: facility X has affected species Y". We need to support such a hypothesis with an explanation, or else the user is unlikely to take it seriously. We can use provenance to generate explanations for the portal's hypotheses at different levels of granularity.
For this example, the explanations at the first level of granularity are:
E1. Facility X caused pollution at location l1 during time period t1.
E2. Health problems of species Y have been identified at location l2 during time period t2.
Provenance data such as location and time describe the context of the events, and E1 and E2 can be used to derive H1 only when l1/t1 and l2/t2 are sufficiently close.
The user might then ask for explanations at the next level of granularity:
E11. The measurements of the water released by the facility, e.g. the concentration of arsenic.
E12. The regulation rule, e.g. the threshold on the concentration of arsenic.
E21. Health records, e.g. the population of the species is declining.
Providing provenance data, e.g. the data source, the data processing time, and who processed the data, improves transparency and thus potentially increases users' trust.
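A sketch of the proximity check that licenses deriving H1 from E1 and E2 (the ex: prefix, all property names, and the one-year window are illustrative assumptions):

PREFIX ex: <http://example.org/demo#>
# Pair pollution events with species health events that share a hydrological
# unit and occur within one year of each other; assumes years are integers.
SELECT ?facility ?species
WHERE {
  ?pollutionEvent ex:byFacility ?facility ;
                  ex:hasHUC8 ?huc ;
                  ex:hasYear ?y1 .
  ?healthEvent    ex:ofSpecies ?species ;
                  ex:hasHUC8 ?huc ;
                  ex:hasYear ?y2 .
  FILTER (ABS(?y1 - ?y2) <= 1)
}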

Provenance-aware cross validation

Cross validation means comparing data from multiple sources and checking for inconsistencies. When cross validation finds no inconsistency, the user can be more confident about the data. On the other hand, if cross validation reveals an inconsistency, the user can investigate it, check whether it is a real inconsistency (in some cases, apparent inconsistencies arise when we compare incomparable data), and look for its cause.
In the portal, we developed a cross validation gadget that conducts validation semi-automatically. Without such a gadget, it would be very time consuming to check large amounts of data from different source agencies.

The value of provenance by Linyun

Provenance brings transparency, and transparency fosters trust. Provenance information tells users how the data they see on the website has been collected, manipulated, integrated, and visualized, so that the data become verifiable. For example, in addition to showing that the measured arsenic concentration at a certain place on a certain date is 18.0 ug/L, we also show that this measurement was taken by the USGS and that it was converted into RDF with the csv2rdf4lod-automation tool on 2011-11-03 by a certain student. This information does not by itself guarantee trust in the website, but it gives users enough information to study the data independently and decide whether to believe it.

Related Work


[1] proposes the Extensible Observation Ontology (OBOE), a formal ontology for capturing the semantics of generic scientific observation and measurement. OBOE can be extended with specialized domain vocabularies and serves as a convenient basis for adding detailed semantic annotations to scientific data. Although the SemantEco ontologies are not as general as OBOE, they are more lightweight and more straightforward to understand and deploy.
In our example, we model one water measurement in both OBOE and our pollution ontology, as shown below.
OBOE introduces the additional class "Observation", and measurements are tied to their corresponding observations.
In OBOE, both the measured value and the measurement time are measurements, and the measurement time is connected to the measured value via the predicate "hasContext".
The data encoded in our pollution ontology form a flat structure, while the data encoded in OBOE form a hierarchy; the two shapes are sketched after the figures below.
Measurement in pollution example
Measurement in OBOE example
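To make the structural difference concrete, the two encodings correspond to graph patterns roughly like the following. Our pol: names are abbreviated, and the OBOE terms follow the description above; both should be checked against the actual ontologies.

# Flat structure (our pollution ontology, names abbreviated):
?site pol:hasMeasurement ?m .
?m    pol:hasCharacteristic pol:Arsenic ;
      pol:hasValue ?v ;
      pol:hasTime ?t .

# Hierarchical structure (OBOE): an Observation sits between the site and its
# measurements, and the time measurement points at the value via hasContext.
?obs  a oboe:Observation ;
      oboe:ofEntity ?site ;
      oboe:hasMeasurement ?m , ?tm .
?m    oboe:ofCharacteristic pol:Arsenic ;
      oboe:hasValue ?v .
?tm   oboe:hasValue ?t ;
      oboe:hasContext ?m .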

If the water measurements were encoded in OBOE, we would need to change the way the regulation rules are encoded. Some changes are minor, such as renaming the predicate pol:hasCharacteristic to oboe:ofCharacteristic. One major change concerns the modeling of a polluted site.
In our current ontology, a polluted site is modeled as something that is both a measurement site and a polluted thing, where a polluted thing is something that has at least one measurement that violates a regulation.
With OBOE, a polluted site would be modeled as something that is both a measurement site and a polluted thing, where a polluted thing is something that has at least one observation that violates a regulation.
A violation observation is in turn modeled as an observation that has at least one measurement that violates a regulation.
The additional observation layer might slow down the reasoning process.

Polluted thing in OBOE example
Arsenic violation in OBOE example

BioCaster [2] is a web monitoring system for the early detection of infectious disease events. The system uses an ontology to describe the terms and relations necessary to detect and assess the risk of public health events. Our work could learn from BioCaster's encoding of the health domain and possibly link to their system in the future.

The GeoSpecies Knowledge Base [3] is an effort to enable species data to be linked together as part of the Linked Data network of distributed data. The bridge used for data linking is a set of unique, resolvable identifiers for species concepts that remain stable despite changes in taxonomy. The knowledge base is linked to other knowledge bases, e.g. DBpedia, Freebase, Bio2RDF, UniProt, and uBio. However, the GeoSpecies Knowledge Base currently contains approximately 6,500 species observations, which is relatively sparse given the huge number of species observations available on the web.


Data Sources Used


Get bird distribution from http://www.avianknowledge.net/
State: WA
Time: 2007
Species: Branta canadensis (Canada goose)

Get fish distribution for WA from http://ecosystems.usgs.gov/fisheriesdata/querybystate.aspx

The water body shape data come from ftp://www.ecy.wa.gov/gis_a/hydro/nhd/NHDmajor.zip

NHD Rhode Island data come from http://open-market.weogeo.com/; you need to create an account and order the dataset (for free), and the download link is then sent via email.

NHD data downloading entry point: http://nhd.usgs.gov/data.html

NHD terminology explanation comes from http://nhd.usgs.gov/NHDDataDictionary_model2.0.pdf

Regulations:
Nebraska Department of Environmental Quality (NDEQ). 2002. Title 117 Nebraska Water Quality Standards. http://www.deq.state.ne.us/
National Recommended Water Quality Criteria. http://water.epa.gov/scitech/swguidance/standards/current/index.cfm

Tools and Services Used


FWTools is used to convert the water body shape data into GeoJSON objects. The tool is available at http://fwtools.maptools.org/

Quantum GIS (QGIS) is a user friendly Open Source Geographic Information System, available at http://www.qgis.org/

The USGS National Hydrography Dataset (NHD) services are used to get the hydrological unit codes given locations on the map. See http://services.nationalmap.gov/ArcGIS/rest/services/nhd/MapServer

The specific service used to get the HUC8 code given a specific location is http://services.nationalmap.gov/ArcGIS/rest/services/nhd/MapServer/4/query
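For example, a request of the following form asks which HUC8 polygon contains a given point (the coordinate values are illustrative, and the outFields field name is an assumption about the layer's schema):

http://services.nationalmap.gov/ArcGIS/rest/services/nhd/MapServer/4/query?geometry=-122.33,47.61&geometryType=esriGeometryPoint&inSR=4326&spatialRel=esriSpatialRelIntersects&outFields=HUC8&returnGeometry=false&f=json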

jQuery is used to get water bodies (encoded as GeoJSON objects) from the water body shape data given a specific HUC8 code. See http://jquery.com/

OpenLayers is used to visualize water bodies encoded as GeoJSON objects. See http://openlayers.org/

csv2rdf4lod is used to convert the text data into RDF. See https://github.com/timrdf/csv2rdf4lod-automation/wiki/Installing-csv2rdf4lod-automation

Future Work


One direction for future work is to improve the modeling of water quality criteria. Water quality criteria usually provide multiple types of thresholds: for acute pollution in freshwater, for chronic pollution in freshwater, for acute pollution in saltwater, and for chronic pollution in saltwater. We currently incorporate thresholds for acute pollution in freshwater for two reasons: 1) we mainly focus on inland water bodies, and 2) acute pollution can affect both species that live near the polluting water source and species that pass by it occasionally. To support thresholds for chronic pollution, we would need to consider additional factors, e.g. how long a species stays near the polluting water source. We would rely on species distribution models from wildlife experts to model chronic pollution.

References


[1] Joshua Madin, Shawn Bowers, Mark Schildhauer, Serguei Krivov, Deana Pennington, and Ferdinando Villa. An ontology for describing and synthesizing ecological observation data. Ecological Informatics, 2(3):279--296, October 2007.
[2] N. Collier, R. Matsuda Goodwin, J. McCrae, S. Doan, A. Kawazoe, M. Conway, A. Kawtrakul, K. Takeuchi, and D. Dien. An ontology-driven system for detecting global health events. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, China, August 2010, pp. 215-222.

[3] http://glri.wher.org/
[4] GeoSpecies Knowledge Base. Available from http://lod.geospecies.org. Accessed 15 Jan 2009.

Project Page