Archive for the ‘Semantic Web’ Category

Data.gov – it’s useful, but also could be better.

April 5th, 2011

The “Nerd Collider” Web site invited me to be a “power nerd” and respond to the question “What would you change about Data.gov to get more people to care?”  The whole discussion including my response can be found here.  However, I hope people won’t mind my reprinting my response here, as the TWC blog gets aggregated to some important Linked Data/Semantic Web sites.

My response:

I was puzzling over how I wanted to respond until I saw the blog in the Guardian – http://www.guardian.co.uk/news/datablog/2011/apr/05/data-gov-crisis-obama – which also reflects this flat line as a failure, and poses, by contrast, the number of hits the Guardian.com website gets. This is such a massive apples vs. oranges error that I figure I should start there.

So, primarily, let’s think about what visits to a web page are about. For the Guardian, they are lots of people coming to read the different articles each day. However, for data.gov, there isn’t a lot of repeat traffic – the data feeds are updated on a relatively slow basis, and once you’ve downloaded some, you don’t have to go back for weeks or months until the next update. Further, for some of the rapidly changing data, like the earthquake data, there are RSS feeds, so once these are set up, one doesn’t return to the site. So my question is, are we looking at the right number?

In fact, the answer is no. If you want to see the real use of data.gov, take a look at the chart at http://www.data.gov/metric/visitorstats/monthlyredirecttrend: the total number of dataset downloads since 2009 is well over 1,000,000, and in February of this year (the most recent data available) there were over 100,000 downloads. So the 10k number appears to be tracking the wrong thing – the data is being downloaded, and that implies it is being used!!

Could we do better? Yes, very much so. Here are things I’m interested in seeing (and working with the data.gov team to make available):

1 – Searching for data on the site is tough. Keyword search is not a good way to look for data (for lots of reasons), and thus we need better ways. Doing this really well is a research task I’ve got some PhD students working on, but doing better than what is there now requires better metadata and a better approach. There is already work afoot at data.gov (assuming funding continues) to improve this significantly.

2 – Tools for using the data, and particularly for mashing it up, need to be more easily used and more widely available. My group makes a lot of info and tools available at http://logd.tw.rpi.edu – but a lot more is needed. This is where the developer community could really help.

3 – Tools to support community efforts (see the comment by Danielle Gould to this effect) are crucial – she says it better than I can so go read that.

4 – There are efforts by data.gov to create communities – these are hard to get going, but could be of great value in the long run. I suggest people look at these on the data.gov communities site, and think about how they could be improved to bring more use – I know the data.gov leadership team would love to get some good comments about that.

5 – We need to find ways to turn the data release into a “conversation” between government and users. I have discussed this with Vivek Kundra numerous times and he is a strong proponent (and we have thought about writing a paper on the subject if time ever allows). The British data.gov.uk site has some interesting ideas along this line, based on OpenStreetMap and similar projects, but I think one could do better. This is the real opportunity for “government 2.0” – a chance for citizens not just to comment on legislation, but to help make sure the data that informs the policy decisions is the best it can be.

So, to summarize, there are things we can do to improve things, many of which are getting done. However, the numbers in the graph above are misleading, and don’t really reflect the true usage of data.gov per se, let alone the other sites powered by data.gov, like the LOGD site I mention above.

Budget Cuts Threatening Data.gov

March 31st, 2011

You may have heard that the Data.gov website is going to be shut down.  I wish I could say this is completely false, but I can at least say that it is a bit premature. If Congress cuts the budgets to the threatened level, a number of sites, including Data.gov, will have trouble continuing to grow, and some may have to be shut down. Right now, though, the budget cuts are not final, and the plans are still in the works.  Data.gov, luckily, is less expensive than some of the other sites to maintain, so the discussion right now is more about cutting plans for expansion than shutting down completely, but even that would be a major blow to open government data. However, sites like USAspending and others will be harder to maintain, and even Data.gov could end up shut down if the full cuts go through unchanged (but at least I’m personally hoping the Senate and White House will resist this).

What you can do is to get involved!  Let your politicians hear from you — the Sunlight Foundation has a great site about this at http://sunlightfoundation.com/savethedata/ which will let you sign a petition and has some suggestions for other actions.  It also has up to date information on the situation — please go look there.

There are also a lot of articles out there, and much to follow in Twitter space. Here are some starting points:

In the past day, there have been a lot of articles in the news about Data.gov:  http://www.google.com/search?q=%22data+gov%22&hl=en&prmdo=1&tbm=mbl&num=10&lr=&ft=i&cr=&safe=images&tbs=qdr:w#q=data.gov&hl=en&lr=&prmdo=1&tbm=nws&ei=ehuVTfz9GY3msQOCiJ3MBQ&start=0&sa=N&bav=on.2,or.r_gc.r_pw.&fp=83f1e1e6450f219c

A good article by Beth Noveck (I’m the president of her fan club :-) ): Huffington Post: “Why Cutting E-Gov Funding Threatens American Jobs”
http://www.huffingtonpost.com/beth-simone-noveck/why-cutting-egov-funding-_b_840430.html

The hashtag for following this on twitter is #savethedata

So please, join us in saving these important government transparency efforts!!

-Jim Hendler

p.s. For some irony, Hong Kong’s open data site went live today: http://www.gov.hk/en/theme/psi/welcome/

Here are some more articles and things for those interested:

Federal News Radio, Daniel Schuman, Sunlight Foundation, “Budget cuts may end transparency programs”: http://www.federalnewsradio.com/index.php?nid=17&sid=232614
Federal News Radio, Executive Editor, Jason Miller, “OMB prepares for open gov sites to go dark in May”: http://www.federalnewsradio.com/?nid=35&sid=2327798
Sunlight Foundation, Daniel Schuman, “Budget Technopocalypse Deepens: Transparency Sites will go dark in a few months”: http://sunlightfoundation.com/blog/2011/03/31/budget-technopocalypse-deepens-transparency-sites-will-go-dark-in-a-few-months/
Washington Examiner, Mark Tapscott, “Transparency advocates appeal to Congress to avoid budget cuts”: http://washingtonexaminer.com/blogs/beltway-confidential/2011/03/transparent-advocates-appeal-congress-avoid-budget-cuts
PCWorld, Grant Gross, “Group Protests Proposed Cuts to e-Government Transparency Efforts”: http://www.pcworld.com/businesscenter/article/223618/group_protests_proposed_cuts_in_egovt_transparency_efforts.html
“Data.gov and 7 other sites to shut down after budget cuts”: http://www.readwriteweb.com/archives/datagov_7_other_sites_to_shut_down_after_budgets_c.php

Fall 2010 TWC Undergraduate Research Summary

December 20th, 2010

The Fall 2010 semester marked the beginning of the Tetherless World Constellation’s undergraduate research program at Rensselaer Polytechnic Institute (RPI). Although TWC has enjoyed significant contributions from RPI undergrads since its inception, this term we stepped up our game by more “formally” incorporating a group of undergrads into TWC’s research programs, establishing regular meetings for the group, and, with input from the students, beginning to outfit their own space in RPI’s Winslow Building.

Patrick West, my fellow TWC undergrad research coordinator, and I asked the students to blog about their work throughout the semester; with the end of term, we asked them to post summary descriptions of their work and their thoughts about the fledgling TWC undergrad research program itself. We’ve provided short summaries and links to those blogs below…

  • Cameron Helm began the term coming up to speed on SPARQL and RDF, experimented with several of the public TWC endpoints, and then worked with Phillip on basic visualizations. He then slashed his way through the tutorials on TWC’s LOGD Portal, eventually creating impressive visualizations such as this earthquake map. Cameron is very interested in the subject of data visualization and looks to do more work in this area in the future.
  • After a short TWC learning period, Dan Souza began helping doctoral candidate Evan Patton create an Android version of the Mobile Wine Agent application, with all the amazing visualization and data integration required, including Twitter and Facebook integration. Mid-semester Dan also responded to the call to help with the “crash” development of the Android/iPhone TalkTracker app, in time for ISWC 2010 in early November. Dan continues to work with Evan and others for early 2011 releases of Android, iPhone/iPod Touch and iPad versions of the Mobile Wine Agent.
  • David Molik reports that he learned web coding skills, ontology creation, and server installation and administration. David contributed to the development and operation of a test site for the new, semantic web savvy website for the Biological and Chemical Oceanography Data Management Office (BCO-DMO) of the Woods Hole Oceanographic Institution.
  • Jay Chamberlin spent much of his time working on the OPeNDAP Project, an open source server to distribute scientific data that is stored in various formats. His involvement included everything from learning his way around the OPeNDAP server, to working with infrastructure such as TWC’s LDAP services, to helping migrate documentation from the previous Wiki to the new Drupal site, to actually implementing required changes to the OPeNDAP code base.
  • Phillip Ng worked on a wide variety of projects this fall, starting with basic visualizations, helping with ISWC applications, and including iPad development for the Mobile Wine Agent. Phillip’s blog is fascinating to read as he works his way through the challenges of creating applications, including his multi-part series on implementing the social media features.
  • Alexei Bulazel began working with Dominic DiFranzo on a health-related mashup using Data.gov datasets and is now working on a research paper with David on “human flesh search engine” techniques, a topic that top thinkers including Tetherless World Senior Constellation Professor Jim Hendler have explored in recent talks. Note: For more background on this phenomenon, see e.g. China’s Cyberposse, NY Times (03 Mar 2010)

Many of these students will be continuing on with these or other projects at TWC in 2011; we also expect several new students to be joining the group. The entire team at the Tetherless World Constellation thanks them for their efforts and many important contributions this fall, and looks forward to being amazed by their continued great work in the coming year!

John S. Erickson, Ph.D.

Food+Tech Hackathon

December 9th, 2010

On December 4th, developers, designers, entrepreneurs, and general food enthusiasts came together at the Food+Tech Hackathon to develop and explore applications to help evolve the food and information technology community. The event, which was part of the International Open Data Hackathon, was in New York City and was organized by Danielle Gould from Food + Tech Connect, Marc Alt from Open Source Cities, and Tian He from Gojee.

Evan Patton and I had a chance to come down and help out with the day’s hacking. I kicked off the event with a lecture on Open Data and the Semantic Web. I gave some background on the Open Data movement in the last few years, discussed some of the current challenges in open data, and talked about how Semantic Web technologies can help address these challenges.

Evan helped explain some of his work on publishing USDA nutrition data on semanticdiet.com and discussed the Wine Agent’s food ontology and recommendation with participants. Semantic Diet uses semantic web technologies to bring together nutrition data, recipes contributed by users and crawled off the web, and personal dietary needs. Having these data organized and encoded using semantic technologies allowed groups to query and reason about food data, and even link it into their own hackathon ideas.

During the hackathon, thirteen groups worked on everything from applications to help people eat more sustainably to projects that allowed people to understand price fluctuations in food products over time. We were thrilled to see some of the groups using some semantic data and technologies provided by Semantic Diet and TWC’s Linked Open Government Data project. Evan and I spent most of our time educating and assisting teams on using semantic technologies and data. It was great to see so many people enthusiastic about semantics and thinking about how they could use open data to start a project or improve existing projects.

All in all I feel the hackathon was a huge success. At the end of the day we had many applications and projects that have potential to really move forward and make a real impact in the community. Evan and I would like to thank the sponsors and organizers of the first ever Food+Tech Hackathon and hope to help and participate in many more.

Links to other great blog posts on the Food+Tech Hackathon:

Suggestions to the Supercomputing Community

December 4th, 2010

As mentioned in my last blog post, I recently participated in a birds-of-a-feather (BOF) on semantic graph/database processing at Supercomputing 2010 (SC10).  My general research interest is in high-performance computing (HPC) for the semantic web, so this BOF was a great fit.  At the BOF, I very briefly made three suggestions to HPC researchers; in this blog post, I expand on and explain these suggestions.  I welcome feedback, particularly from those in the semantic web community who have something to share with the supercomputing community.

1. There is a need for good benchmarks from an HPC perspective.

By “good,” I primarily mean that the datasets and queries need to be realistic.  In other words, the data should reflect data that occurs in the real world, and queries should reflect queries that would be posed by actual users or systems.  By “HPC perspective,” I mean that it needs to test strong scaling (change in time for fixed total dataset size and varying number of processors) and weak scaling (change in time for fixed dataset size per processor and varying number of processors).
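To make the two scaling notions concrete, here is a minimal sketch of how one might compute them from measured run times. The function names and the sample timings are made up for illustration; they are not taken from any particular benchmark.

```python
def strong_scaling(times):
    """times maps processor count -> run time for a FIXED total dataset size.

    Returns {p: (speedup, efficiency)} relative to the smallest processor
    count measured. Ideal strong scaling gives efficiency 1.0 everywhere.
    """
    base_p = min(times)
    base_t = times[base_p]
    result = {}
    for p in sorted(times):
        speedup = base_t / times[p]
        efficiency = speedup * base_p / p
        result[p] = (speedup, efficiency)
    return result


def weak_scaling(times):
    """times maps processor count -> run time for a FIXED dataset size PER processor.

    Ideally the run time stays constant as processors (and data) grow,
    so efficiency is simply base time divided by observed time.
    """
    base_t = times[min(times)]
    return {p: base_t / times[p] for p in sorted(times)}


# Hypothetical measurements (seconds), purely illustrative.
strong = strong_scaling({1: 100.0, 2: 50.0, 4: 25.0})
weak = weak_scaling({1: 10.0, 2: 10.0, 4: 12.5})
```

A benchmark that is useful from this perspective would need a data generator that can produce both a fixed total dataset (for the first function) and a dataset that grows with the processor count (for the second).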

The Lehigh University Benchmark (LUBM) [1] has arguably been the most widely used benchmark, likely because it is one of the earliest benchmarks that provides a data generator and a standard set of queries.  It is targeted towards inferencing. However, LUBM datasets are not only synthetic but also quite unrealistic.  In addition to a uniform distribution of data, they suffer from other inadequacies like few links between universities and the use of a single, nonsensical phone number for every person (“xxx-xxx-xxxx”).  Therefore, LUBM datasets do not provide a realistic data distribution and thus cannot test the ability of systems to handle realistic selectivity and skew.

There is also the Berlin SPARQL Benchmark (BSBM) [2], but it is “built around an e-commerce use case” and “illustrates the search and navigation pattern of a consumer looking for a product” [3].  From an HPC perspective, we will likely be more concerned with overall run-time of queries or reasoning processes (or whatever other interesting processes) rather than handling interaction with users.

Finally, there is SP2Bench [4].  This is perhaps the most useful benchmark for SPARQL benchmarking.  It provides a data generator that mimics statistical properties of DBLP data, and it provides a set of sensible queries.  Therefore, the dataset is more realistic than LUBM, and it is focused on SPARQL querying (whereas LUBM focuses on reasoning).

However, there is still a need for a good reasoning benchmark from an HPC perspective.  It’s difficult to be more specific than that because providing such a benchmark is still very much an open research topic.  Clearly there needs to be an ontology that uses features from various reasoning standards (e.g., RDFS, OWL) and a corresponding data generator.  There should also be some way to verify validity of inferences based on certain entailments.  Again, this is very much an open research topic, which is why I made the suggestion but have few answers myself.

2. Consider existing reasoning standards as starting points.

This may be the more controversial of my suggestions, but there is good reason for it.  Recent history indicates that the reasoning standards continue to iteratively evolve based on the needs of the community.

Consider RDFS (by which I mean RDFS entailment as defined in RDF Semantics).  First of all, it is technically undecidable [5], but in a way that is trivial and easily overcome.  Secondly, few systems (in my experience) completely support inferences based on literal generalization, XML literals, and container-membership properties.  Other rules, like “everything is a resource,” are generally trivial and uninteresting.  More commonly, implementations align with a fragment of RDFS that I call RDFS Muñoz [6] (originally termed the ρdf fragment), which essentially boils down to domains, ranges, subclasses, and subproperties. Perhaps Muñoz said it best:

“Efficient processing of any kind of data relies on a compromise between the size of the data and the expressiveness of the language describing it. As we already pointed out, in the RDF case the size of the data to be processed will be enormous, as current developments show …. Hence, a program to make RDF processing scalable has to consider necessarily the compromise between complexity and expressiveness. Such a program amounts essentially to look for fragments of RDF with good behavior with respect to complexity of processing.” [6]
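To give a feel for how small this fragment is, here is a simplified sketch of a naive fixpoint computation over domains, ranges, subclasses, and subproperties. It is only an illustration under my own assumptions: it is not the full deductive system from [6] (reflexivity rules and blank-node details are left out), the quadratic loop makes no attempt at efficiency, and the `ex:` names are invented.

```python
SP = "rdfs:subPropertyOf"
SC = "rdfs:subClassOf"
DOM = "rdfs:domain"
RNG = "rdfs:range"
TYPE = "rdf:type"


def rho_df_closure(triples):
    """Apply the core rules repeatedly until no new triple is produced."""
    closed = set(triples)
    while True:
        new = set()
        for s, p, o in closed:
            for s2, p2, o2 in closed:
                # Transitivity of subPropertyOf and of subClassOf.
                if p in (SP, SC) and p2 == p and o == s2:
                    new.add((s, p, o2))
                # (s sp o), (x s y) => (x o y)
                if p == SP and p2 == s:
                    new.add((s2, o, o2))
                # (s sc o), (x type s) => (x type o)
                if p == SC and p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
                # (s dom o), (x s y) => (x type o)
                if p == DOM and p2 == s:
                    new.add((s2, TYPE, o))
                # (s range o), (x s y) => (y type o)
                if p == RNG and p2 == s:
                    new.add((o2, TYPE, o))
        if new <= closed:
            return closed
        closed |= new


# Hypothetical toy data.
example = {
    ("ex:advises", DOM, "ex:Professor"),
    ("ex:advises", RNG, "ex:Student"),
    ("ex:Professor", SC, "ex:Person"),
    ("ex:Student", SC, "ex:Person"),
    ("ex:phdAdvises", SP, "ex:advises"),
    ("ex:alice", "ex:phdAdvises", "ex:bob"),
}
inferred = rho_df_closure(example)
```

On this toy input, the closure adds that ex:alice advises ex:bob, that ex:alice is a Professor and ex:bob a Student, and that both are Persons. The interesting HPC question is how to partition this kind of rule application across processors, which the sketch deliberately ignores.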

Consider also OWL 1.  How many scalable systems completely support one of the OWL 1 fragments (Lite, DL, Full)?  I cannot say for sure, but my impression from experience and feedback from others is that the cost of higher expressivity is often too high in terms of performance, especially as you scale dataset size.  Perhaps it is for this reason that OWL Horst [7] (originally termed the pD* fragment) has gained popularity as (arguably) the most widely supported OWL fragment.

Now there is OWL 2.  OWL 2 RL (a fragment of OWL 2) is “inspired by description logic programs and pD* [OWL Horst]” [8]. The SAOR paper from ISWC 2010 [9] has already shown a subset of OWL 2 RL rules for which closure can be efficiently produced in parallel.

So my point is this. Reasoning standards capture well-defined and understood fragments, but research and practice continue to explore subfragments that are suitable for certain problems, and as the subfragments become stable and gain popularity, they inspire future standards. It is an iterative process, so it is not necessary to become obsessed with fully complying with existing standards (unless that is actually necessary to meet your use case). It is probably more interesting to search for fragments of the standards that fit certain HPC paradigms.

3. Review the literature to reconsider approaches that were once considered less viable.

This suggestion seems obvious.  As an example, I recently did a literature review of parallel join processing, and one thing I noticed is that a majority of the literature is focused on shared-nothing architectures.  In 1992, DeWitt and Gray stated:

“A consensus on parallel and distributed database system architecture has emerged.  This architecture is based on a shared-nothing hardware design ….” [10]

However, in 1996, Norman, Zurek, and Thanisch directly opposed (or reversed) the claim of DeWitt and Gray saying:

“We argue that shared-nothingness is no longer the consensus hardware architecture and that hardware resource sharing is a poor basis for categorising parallel DBMS software architectures if one wishes to compare the performance characteristics of parallel DBMS products.” [11]

The popularity of the shared-nothing paradigm was probably further fueled by the advent of inexpensive supercomputing by way of Beowulf clusters and Networks of Workstations (around the mid-90s).  However, many modern supercomputers provide shared-disk and shared-memory paradigms.  The Blue Gene/L in our Computational Center for Nanotechnology Innovations (CCNI) is networked with a General Parallel File System (GPFS).  Making use of GPFS, the Blue Gene/L could be considered shared-disk in a programmatic sense.  The Cray XMT uses large shared memory. Rahm points out that a major advantage of shared-disk is its potential for truly dynamic load-balancing [12], so let’s look back at some of the shared-disk and shared-memory research that has been done [12-15].
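As a toy illustration of why dynamic load balancing matters under skew, the sketch below compares two ways of handing out tasks of uneven cost. It is a deliberately crude model under my own assumptions: the task costs are invented, and it ignores I/O contention and every other real cost of sharing a disk or memory.

```python
import heapq


def static_makespan(costs, workers):
    """Task i is bound to worker i % workers up front, as when data is
    pre-partitioned across nodes that cannot reach each other's data."""
    loads = [0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)


def dynamic_makespan(costs, workers):
    """Whichever worker becomes idle first takes the next task from a
    shared queue, which is only possible if any worker can reach any data."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical skewed workload: expensive and cheap tasks alternate.
costs = [8, 1, 8, 1, 8, 1, 8, 1]
static_result = static_makespan(costs, 2)
dynamic_result = dynamic_makespan(costs, 2)
```

With these particular costs and two workers, the static assignment puts every expensive task on the same worker and finishes at time 32, while the shared queue finishes at time 18. Greedy dynamic assignment is not guaranteed to win on every workload, but it cannot be trapped by an unlucky up-front partitioning in the way the static scheme can.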

All of that just to say, a review of literature is in order. Potential sources of inspiration include parallel databases, parallel graph algorithms, deductive databases, and graph databases.

Jesse Weaver
Ph.D. Student, Patroon Fellow
Tetherless World Constellation
Rensselaer Polytechnic Institute

[1] Guo, Pan, Heflin.  LUBM: A benchmark for OWL knowledge base systems.  JWS 2005.
[2] Bizer, Schultz.  The Berlin SPARQL Benchmark.  IJSWIS 2009.
[3] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
[4] Schmidt, Hornung, Lausen, Pinkel.  SP2Bench: A SPARQL Performance Benchmark.  ICDE 2009.
[5] Weaver.  Redefining the RDFS Closure to be Decidable.  RDF Next Steps 2010.
[6] Muñoz, Pérez, Gutierrez.  Simple and Efficient Minimal RDFS.  JWS 2009.
[7] ter Horst.  Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary.  JWS 2005.
[8] http://www.w3.org/TR/owl2-profiles/#OWL_2_RL
[9] Hogan, Pan, Polleres, Decker. SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. ISWC 2010.
[10] DeWitt, Gray.  Parallel Database Systems: The Future of High Performance Database Systems.  Communications of the ACM 1992.
[11] Norman, Zurek, Thanisch.  Much Ado About Shared-Nothing.  SIGMOD Record 1996.
[12] Rahm.  Parallel Query Processing in Shared Disk Database Systems.  SIGMOD Record 1993.
[13] Lu, Tan.  Dynamic and Load-balanced Task-Oriented Database Query Processing in Parallel Systems.  EDBT 1992.
[14] Märtens.  Skew-Insensitive Join Processing in Shared-Disk Database Systems.  IADT 1998.
[15] Moon, On, Cho.  Performance of Dynamic Load Balanced Join Algorithms in Shared Disk Parallel Database Systems.  Workshop on Future Trends of Distributed Computing Systems 1999.
