Semantic eScience Meeting March 11, 2011

Meeting Information

Agenda

  • Scribe: Eric
  • Help with ESG
  • VSTO update, and help with VSTO
  • Help with BCO-DMO
  • SeSF update
  • data.rpi.edu
  • Discussion of Semantic Discovery (if Rajashree can make it)
  • Report from GeoData (if Peter can make it)

Attendance

  • Eric
  • Linyun (PA linju:n) (Bob)
  • Arun
  • Rajashree
  • Patrick
  • Sumitra
  • Peter
  • Andy M.

Previous Action Items

  • PW: Add Events to drupal (DONE)
    • EGU 2011 (DONE)
    • OWLED 2011 (DONE)
    • SemTech 2011 (DONE)
  • PW: Erase ts evaluation virtual server (DONE)
  • Arun: Talk to Josh/Jesse/Shangguan about AllegroGraph licensed version (partly done)
  • SZ: Pick up FedEx envelopes and poster tubes (DONE)
  • SZ: Bring your printer to GeoData (and yourself!) (DONE)
  • SZ: Register for EGU and arrange travel (DONE)
  • SZ: Contact Peter about connections at EGU (partly done)
  • GeoData attendants: writeup blog on meeting (Eric/Arun DONE)
  • HW: Talk to Greg about getting access to Grind Stone
  • PW: Reschedule eScience meeting (DONE)

Action Items

  • PW: Find the AllegroGraph installation for VSTO
  • SZ,KD: GeoData blogs
  • HW: Talk to Greg about getting access to Grind Stone
  • PF: Funding regarding getting Nathan out
  • PW: Narrative quarterly report for ESG work by 15th
  • ER,HW: parallel time queries
  • PW,EP: data.rpi.edu, get on escience.rpi.edu, clone the interface
    • Work with Lindsay to get this to happen

Notes

AllegroGraph licensed version

  • We have a free version in the lab.
  • Shangguan installed it

ESG

  • Patrick needs help / has questions on Ferret, etc.
  • Will talk to Linyun after meeting
  • Peter to determine funding for Nathan Potter (OPeNDAP) to visit
  • Sumitra to work with Ferret and visual responses to requests (GIFs, movies, etc.)

VSTO Update

  • Shrikant has been working (reading docs, updating code)
  • VSTO updates page on SeSF:

http://tw.rpi.edu/web/project/SeSF/workinggroups/VstoUpdates

  • response times regarding temporal instance querying are not nearly as good as RESTful service calls to HAO server (cedarweb.hao.ucar.edu)
    • took 60 seconds
  • get this running on cluster machine(s) and see how it scales
  • taking advantage of all 4 cores on aquarius
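
One way to approach the multi-core idea above is to split the slow "months with data" question into one query per year and run those concurrently. The following is a minimal sketch, not the VSTO implementation: `run_query` is a hypothetical callable standing in for whatever client talks to the SPARQL endpoint, the PREFIX declarations are omitted because the namespace IRIs are not recorded in these notes, and the worker count of 4 simply mirrors the four cores mentioned for aquarius.

```python
from concurrent.futures import ThreadPoolExecutor

# Query body taken from the time scalability email, with the year left
# as a placeholder. PREFIX declarations are intentionally omitted: the
# namespace IRIs are not recorded in these notes.
MONTHS_QUERY = """SELECT DISTINCT ?m WHERE {
  ?dataset vsto:hasDateTimeCoverage ?c .
  ?c ?p ?i .
  ?i a time:Instant .
  ?i time:inDateTime ?dt .
  ?dt time:year "%(year)d"^^xsd:gYear .
  ?dt time:month ?m .
}"""


def months_query(year):
    """Return the months-with-data query text for a single year."""
    return MONTHS_QUERY % {"year": year}


def months_by_year(years, run_query, workers=4):
    """Run one query per year on a thread pool.

    run_query is a hypothetical callable: it takes query text and
    returns an iterable of month values.
    """
    years = list(years)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda y: sorted(run_query(months_query(y))), years)
        return dict(zip(years, results))
```

Threads fit here because each worker mostly waits on the network. Whether this actually helps depends on the triple store executing queries concurrently, which has not been tested.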

BCO-DMO

  • Conversation with Cyndy regarding Roles
    • in context of program, project, dataset, etc...
    • Eric has been researching roles and contexts.
    • Ping Wang has been working on this with Health Informatics
    • As has Peter Ragone
  • bad characters still in descriptive content in the relational database
    • had gotten it down to just a few errors
    • Now there are many more errors
    • Our scripts translate many of the characters, but not all
    • Try to get the RDF on Virtuoso (Aquarius)
  • S2S2
    • Everything except Bounding Box, problem with Google Maps
    • Try OpenLayers instead to get this to work
    • hard coding Google Maps into the HTML would work
    • SEAVoX, not hierarchical, but works in tw2 version
      • would be nice to have the hierarchical capability, a hierarchical widget
      • Evan has the person reference in the publication instance creation form, which is hierarchical
      • http://vsto.org/data/byInstrument.htm, filter on the right has type filter. Also, instruments displayed based on the instrument type/hierarchy
    • http://aquarius.tw.rpi.edu/s2s/
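
For the "bad characters" item above, a first diagnostic could be to list which characters in a description fall outside what XML 1.0 permits, since those would break an RDF/XML serialization. This is only a sketch under an assumption: the notes do not say which characters cause the errors, so the XML 1.0 `Char` ranges used here may not be the relevant rule, and `find_bad_chars` is a name invented for illustration.

```python
def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_bad_chars(text):
    """Return (offset, 'U+XXXX') for each character XML 1.0 disallows."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]
```

Running this over each descriptive field would show which code points the existing translation scripts still miss, assuming the errors are of this kind.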

SeSF

  • Requirements Documents
  • TW Website, BCO-DMO, VSTO, ESG...

data.rpi.edu

  • Institution data (spreadsheets or semantic data)
  • Need a landing point
    • put on escience
    • has static IP address
  • data release service
    • data identifiers, citations

Data Citation

  • Organizations have different reasons for wanting data citation
  • credit
  • identify their dataset

Semantic Discovery

  • discovery search engine (in a particular domain, geoscience, or even more limited BCO-DMO)
  • VIVO: life sciences
  • most other discovery services for libraries/documents
  • diff. from Google: an integrated view of
  • data ingestion: OpenSearch, service casting, data casting
  • use cases: BCO-DMO data
  • ISO 23950 / Z39.50 (hierarchical discovery search protocol)
    • Successive retrieval, page results
    • Focused on Dublin Core
    • Oldie but Goodie
  • OAI?? push interface
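
The "successive retrieval, page results" idea applies to OpenSearch as well, where a client pages by filling the `startIndex` and `count` parameters of a URL template. The sketch below fills such a template. The template URL and the helper names are hypothetical; only the parameter names (`searchTerms`, `startIndex`, `count`) and the `?` marker for optional parameters come from the OpenSearch 1.1 specification.

```python
import re
from urllib.parse import quote

# Hypothetical template of the kind found in an OpenSearch description
# document; a trailing "?" marks a parameter as optional.
TEMPLATE = "http://example.org/search?q={searchTerms}&start={startIndex?}&n={count?}"


def fill_template(template, **params):
    """Substitute {name} / {name?} placeholders with URL-encoded values."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError("missing required parameter: " + name)
    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)


def page_urls(template, terms, page_size, pages):
    """Yield one URL per result page; startIndex is 1-based by default."""
    for page in range(pages):
        yield fill_template(
            template,
            searchTerms=terms,
            startIndex=1 + page * page_size,
            count=page_size,
        )
```

For example, `list(page_urls(TEMPLATE, "sea ice", 10, 2))` produces two URLs whose `start` values are 1 and 11.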

GeoData

  • Eric's Blog entry: http://bit.ly/eTtPIf
  • Arun's Blog entry: http://tw.rpi.edu/weblog/2011/03/07/geodata-2011-experiences/

Eric's Time Scalability Email

Thanks again Arun for getting this loaded.

Han, I got some preliminary results for this SPARQL endpoint.

I decided to run a quick experiment. In VSTO, I assume one of the most difficult queries is to get any year for all datasets. This is like entering the VSTO Portal and choosing the Start By Dates Workflow (and waiting for the available year choice). This is the SPARQL query for that question...

PREFIX vsto:
PREFIX time:
PREFIX xsd:
SELECT DISTINCT ?y WHERE {

  ?dataset vsto:hasDateTimeCoverage ?c .
  ?c ?p ?i .
  ?i a time:Instant .
  ?i time:inDateTime ?dt .
  ?dt time:year ?y .

}

Unfortunately, this query times out.

However, an unexpected result was that I could actually answer the query: What months have data in the year 2000? The query for this was:

PREFIX vsto:
PREFIX time:
PREFIX xsd:
SELECT DISTINCT ?m WHERE {

  ?dataset vsto:hasDateTimeCoverage ?c .
  ?c ?p ?i .
  ?i a time:Instant .
  ?i time:inDateTime ?dt .
  ?dt time:year "2000"^^xsd:gYear .
  ?dt time:month ?m .

}

Unfortunately, it took more than 60 seconds to answer this query.

It's pretty clear that this is not going to be an acceptable solution. Also, working with date time strings only (instead of the granular OWL-Time instances) is even worse. I can't get a query like:

PREFIX vsto:
PREFIX time:
PREFIX xsd:
SELECT DISTINCT ?dt WHERE {

  ?dataset vsto:hasDateTimeCoverage ?c .
  ?c ?p ?i .
  ?i a time:Instant .
  ?i time:inXSDDateTime ?dt .
  FILTER regex(?dt, "^2000")

}

to return at all (needs to perform a regex over every date time string in the dataset).
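
To make timings like the ones above repeatable, a small harness can send each query with a hard timeout and record the elapsed time. This is a sketch only: the endpoint URL is a placeholder, `build_request` and `timed_select` are hypothetical helpers, and it assumes the endpoint accepts the standard SPARQL protocol `query` parameter over HTTP GET.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://example.org/sparql"  # placeholder, not the real endpoint


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def timed_select(endpoint, query, timeout=60.0, opener=urlopen):
    """Return (seconds elapsed, response body or None on timeout/error)."""
    start = time.monotonic()
    try:
        with opener(build_request(endpoint, query), timeout=timeout) as resp:
            body = resp.read()
    except OSError:  # covers URLError and socket timeouts
        body = None
    return time.monotonic() - start, body
```

The `opener` argument exists so the harness can be exercised without a live endpoint; by default it uses `urllib.request.urlopen`.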

Also, as a quick way to look at the potential performances for the other workflows (Start by Instruments and Start by Parameters) I did some simple analysis on the number of datasets that each instrument and parameter occurs in with the following queries:

PREFIX vsto:
PREFIX cedar:
PREFIX time:
PREFIX xsd:
SELECT DISTINCT ?p count(?datasets) WHERE {

  ?datasets vsto:hasContainedParameter ?p .

}

PREFIX vsto:
PREFIX cedar:
PREFIX time:
PREFIX xsd:
SELECT DISTINCT ?i count(?datasets) WHERE {

  ?datasets vsto:isFromInstrument ?i .

}

In the parameter analysis, more than 1/6 of all parameters are associated with more than 100 datasets. There are also many cases of parameters being associated with 600 or more datasets, and one case (CEDAR parameter 110) that is associated with 1200 datasets.

The same goes for instruments: many are associated with 100 or more datasets.
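
The counts described above can also be derived client-side from plain (dataset, parameter) or (dataset, instrument) pairs, which avoids relying on endpoint-specific aggregate syntax. A minimal sketch follows; the function names are made up for illustration, and any pairs fed to it here would be toy data, not VSTO results.

```python
from collections import defaultdict


def datasets_per_key(pairs):
    """Map each key (parameter or instrument) to its number of distinct datasets."""
    seen = defaultdict(set)
    for dataset, key in pairs:
        seen[key].add(dataset)
    return {key: len(datasets) for key, datasets in seen.items()}


def fraction_above(counts, threshold):
    """Fraction of keys associated with more than `threshold` datasets."""
    if not counts:
        return 0.0
    heavy = sum(1 for n in counts.values() if n > threshold)
    return heavy / len(counts)
```

With the real pairs, `fraction_above(counts, 100)` would give the share of parameters or instruments linked to more than 100 datasets.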

This is just something to start thinking about. We're going to need an alternative approach here, or find some value that this semantic data adds beyond the RESTful metadata service.

Sorry for the long email, thought it might be the best way to get this out.

-Eric