Archive for the ‘tetherless world’ Category

Is Data Publication the right metaphor?

December 15th, 2011

http://mp-datamatters.blogspot.com/2011/12/seeking-open-review-of-provocative-data.html


TWC Undergrads Visualize Linked Open Corporate Data

December 1st, 2011

Two undergraduate members of the Tetherless World team, Alexei Bulazel and Bharath Santosh, recently wrote great summaries of their work creating visualizations based on linked open corporate data aggregated through the ORGPedia project. In this post I’ll include snippets from their posts; I encourage you to check out their full posts and the demos they link to!

First, a bit of context (from the ORGPedia site):

ORGPedia: The Open Organizational Data Project, led by NYLS Professor and former United States Deputy CTO Beth Noveck (Project Lead) and TWC Senior Constellation Professor Jim Hendler (Tech Lead), explores how to create the legal, policy and technology framework for a data exchange to facilitate efficient comparison of organizational data across regulatory schemes as well as public reuse and annotation of that data. By designing a universal exchange rather than a new numbering scheme, OrgPedia aims to achieve goals like improving corporate transparency and efficiency, organizational performance, risk management, and data-driven regulatory policy, without having to wait until legislation is enacted for a single, legal entity identifier.

To date, TWC’s contribution to ORGPedia has been to aggregate data from a variety of sources, develop an experimental site to serve as a platform for integrating the data and prototyping ORGPedia concepts, and develop data visualizations and mashups that demonstrate the potential of an open system of canonical identifiers for corporate entities. Led by TWC Ph.D. student Xian Li, undergrads Alexei Bulazel and Bharath Santosh teamed up to create interesting visualizations based on the aggregated data.

Bharath first describes a visualization he created that allows users to analyze various financial properties of the financial sectors in the US using our aggregated data:

The visualization itself uses Google Motion Charts, which is part of Google’s Visualization API. It is an interactive multidimensional graph of a dataset of sectors and the mean of various financial properties across each sector’s companies. The data shown above is represented in millions of USD. The Motion Chart allows for really neat temporal analysis of data in various forms. Clicking the play button shows the change in properties from 2008 to 2011. There are also three different styles in which you can view the data: bubbles (shown above), bar charts, and line graphs. These can be switched in the top corner.

The dataset behind the visualization was created in R. I wrote a SPARQL query that would access Orgpedia’s datasets and pull out the sectors of the US along with the companies in each sector and their stock tickers. Then I took these companies, pulled in their income statements from Google Finance, and went through each sector, averaging various properties from the financial statements of the sector’s companies. The data manipulation in R took some getting used to, but now it’s very easy for me to transform data frames, matrices, and other objects in R. After the dataset was created and cleaned of non-existent values, it’s just a matter of defining the properties of the Motion Chart and running it. It generates an HTML file with the graph and data represented in JavaScript. All the data processing and manipulation takes around 15 minutes, mostly due to the large amount of data to be downloaded.
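The averaging step Bharath describes can be sketched roughly as follows. This is only an illustration in Python, not his R code: the record layout, the tickers, and the figures are all made up, and `sector_means` is a hypothetical helper name. It groups values by sector and year, skips missing values, and takes the mean.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records: (sector, ticker, year, revenue in millions USD).
# None marks a value that was missing from a downloaded statement.
RECORDS = [
    ("Technology", "AAA", 2008, 120.0),
    ("Technology", "BBB", 2008, 80.0),
    ("Technology", "CCC", 2008, None),
    ("Energy", "DDD", 2008, 300.0),
    ("Energy", "EEE", 2008, 500.0),
    ("Technology", "AAA", 2009, 150.0),
]


def sector_means(records):
    """Average a property per (sector, year), ignoring missing values."""
    grouped = defaultdict(list)
    for sector, _ticker, year, value in records:
        if value is not None:
            grouped[(sector, year)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


if __name__ == "__main__":
    for (sector, year), avg in sorted(sector_means(RECORDS).items()):
        print(f"{sector} {year}: {avg:.1f}")
```

One row per sector and year is roughly the shape a motion chart needs: an entity, a time value, and one or more numeric columns.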

Bharath then goes on to describe the compelling visualization he and Alexei created of the “social network” of corporate board members:

…The visualization utilizes data from LittleSis.org and gathers data about board members of various companies in the US and shows the members in a force graph that shows which board members are on multiple boards (Board Members Network):

The graph visualization is done using the D3 visualization toolkit’s force-directed graph. Each node represents a board member. The clustered colored nodes are a group of members on the same board. The multicolored nodes represent board members that are on multiple boards. Mousing over a node shows you their name and the companies they work for. Clicking a node takes you to their LittleSis.org page. The graph shows many interesting relationships between various companies and board members, most notably Steven S Reinemund, who sits on 5 different boards.

On his blog, Alexei provides additional detail about the work they did to prepare the data for the visualization:

The project involved creating an interactive graph visualization of connections between members of corporate boards (the final product can be found here). Given a list of a few hundred stock tickers and access to the LittleSis API, the goal was to ultimately produce a JSON file of board members that could be used by the D3.js force-directed graph framework. I started by looking up each ticker symbol, yielding a JSON file with a unique ID number for each company. My script then queried the API for the actual company page associated with that ID and stored the names, company associations, and URIs of each board member. Finally, a JSON file for the D3.js graph was output describing the ~2800 board members and the links between each of them.
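The last step, turning board memberships into a nodes-and-links JSON file, might look something like the sketch below. It is not Alexei’s script: the input dictionary, the names, and the `build_graph` helper are hypothetical, and it assumes that two people are linked whenever they sit on the same board, which the post does not spell out. The output uses integer indices into the node list for `source` and `target`, a shape D3 force-directed graphs have commonly accepted.

```python
import json
from itertools import combinations

# Hypothetical input: company name -> list of board member names.
BOARDS = {
    "Company A": ["Alice", "Bob"],
    "Company B": ["Bob", "Carol", "Dave"],
}


def build_graph(boards):
    """Return {"nodes": [...], "links": [...]} with index-based links."""
    # Collect the companies each person is associated with.
    companies_of = {}
    for company, members in boards.items():
        for name in members:
            companies_of.setdefault(name, []).append(company)

    names = sorted(companies_of)
    index = {name: i for i, name in enumerate(names)}
    nodes = [{"name": n, "companies": companies_of[n]} for n in names]

    # Assumption: link every pair of people who share a board, once per pair.
    pairs = set()
    for members in boards.values():
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((index[a], index[b]))
    links = [{"source": s, "target": t} for s, t in sorted(pairs)]
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    print(json.dumps(build_graph(BOARDS), indent=2))
```

A person with more than one entry in `companies` corresponds to the multicolored nodes described above: a member who sits on several boards.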

While I had used Python a bit for command line scripting, I hadn’t really dug into it before this project. The work gave me a better taste for the language and its capabilities. I made extensive use of the “urllib” library for accessing web content, and worked with opening up the data in JSON files. Bharath helped me with the syntax of the program and some of the graph construction. While I was aware of Python’s reputation for ease of use and high-level abstraction, working with it let me experience this abstraction firsthand, and I was very impressed. The ease with which complex multistep operations could be completed let me focus more on the flow of the data through the process rather than the specifics of handling it. The project also gave me a bit more hands-on experience with JSON.

The reader is encouraged to read both Alexei’s and Bharath’s blogs for more details on these great contributions by a couple of our TWC undergrads!


Two Misconceptions about the Semantic Web

November 18th, 2011

I recently presented at the Semantic Graph Database Processing BOF at SC2011, and I had the opportunity to discuss with others the needs for high-performance computing in web-scale computation and the benefits of Linked Data and ontologies on the World Wide Web. There was one participant there who was adamantly opposed to the semantic web.  (I think his exact quotes outside of the presentation were something like “I do not believe in the semantic web” and “only the semantic web cares about the semantic web”).  As I tried to make my case with him, it became increasingly clear to me that this person had a few misconceptions about the semantic web. I want to address those misconceptions here.

Before I continue, though, allow me to disclaim a bit. I am not a representative of the entire semantic web community, although I do consider myself a member of it. Additionally, I am not officially associated with the W3C. I write this blog entry simply in the capacity of a semantic web enthusiast (henceforth, semwebber), and not even as a member of the Tetherless World Constellation. I invite, nay, urge other semwebbers to contribute comments to this blog post in any capacity (agree, disagree, amend, etc.).

1. “One ontology to rule them all”

To my knowledge, nobody has ever claimed that there should be “one ontology to rule them all.” Instead, what is regularly promoted is ontology reuse and/or integration. For example, the FOAF ontology is widely used in the semantic web to describe persons; why create your own ontology when you can reuse a well-established one? Integration of ontologies allows for conciliation of perspectives, causing data that use these ontologies to become meaningfully related. Admittedly, there are some rather large, comprehensive ontologies out there, and there are some very popular and pervasive ones, too. However, there is no standard or recommendation that requires publishers of RDF data to comply with any particular ontology. You could even ignore the RDF vocabulary if you so please (yes, even rdf:type).
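As a small illustration of what reuse looks like in practice, the sketch below describes a person using terms from the FOAF vocabulary instead of newly minted ones, and writes the result as N-Triples lines. The person URI and the helper functions are hypothetical, and the literal escaping is deliberately minimal (backslashes and double quotes only), so treat it as a sketch and not a complete serializer.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical URI for the person being described.
PERSON = "http://example.org/people/alice"


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal (minimal escaping only)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def person_triples(person_uri, name, homepage):
    """Describe a person with existing FOAF terms."""
    return [
        f"{iri(person_uri)} {iri(RDF_TYPE)} {iri(FOAF + 'Person')} .",
        f"{iri(person_uri)} {iri(FOAF + 'name')} {literal(name)} .",
        f"{iri(person_uri)} {iri(FOAF + 'homepage')} {iri(homepage)} .",
    ]


if __name__ == "__main__":
    for line in person_triples(PERSON, "Alice Example", "http://example.org/"):
        print(line)
```

Nothing forces a publisher to use foaf:Person here; the benefit is simply that other data using the same terms becomes easier to relate to yours.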

The primary purpose of an ontology (in my view) is to attach explicit semantics to your data. Just as the participant had stated (although he meant it in contrast to the semantic web), there are many ontologies. They compete in the ecosystem of the World Wide Web and evolve accordingly (or become extinct).

2. “Triples all the way down”

(First, let me say, this is not an affront to Planet RDF.)

This is a bit of a pet peeve of mine, and perhaps what I say here will offend some semwebbers (I hope not). The semantic web (in my view) is not about “triples all the way down.” What do I mean by that? Let me explain.

RDF brings primarily two things to the table when it comes to publishing and integrating data on the web: names in the form of URIs, and a simple data model that is flexible enough for (arguably) nearly any kind of data. (I would like to add a third, meaningful links, but I will avoid that for now.) So when data is published to the web, publishing it as RDF allows you: (1) to identify the things in your data across the World Wide Web, and (2) to structurally (and possibly semantically) integrate your data with other data on the World Wide Web. (I emphasize “World Wide” here to bring to attention the vast scope of publication, identification, and integration that is being achieved.) Fantastic.

Does this mean that everything can be efficiently (or rather, ideally) represented in RDF? No. Then why would you ever want to handle triples? You probably don’t. Let me explain.

RDF is meant to solve the problem of meaningfully publishing data (not just documents) on the World Wide Web. Beyond that, do what you want. More specifically, when you crawl and/or aggregate data from the World Wide Web, you don’t have to keep the RDF data as triples in your system. It is no longer on the global stage of the World Wide Web; rather, it is now in your system where you are king. So optimize away! Store it or process it however you like! Relational databases? Sure! Rewrite URIs as shorter terms? Whatever floats your boat! Ignore the explicit semantics and treat it like an unlabeled graph? I wouldn’t recommend it, but you’re the king! Do whatever it takes to meet your use case, and if your use case has something to do with RDF data, then fine, leave it as triples if you want. My point is, it’s not necessarily “RDF all the way down,” but it is “RDF at the top” where “top” is the place of publication, the World Wide Web. The universal naming mechanism of URIs and the generic data model enables data publishers to get data out there in a way that can be explicitly understood by machines (for example, when I say “Beast is furry,” am I talking about Mark Zuckerberg’s dog or the fictional X-Man Dr. Henry Philip “Hank” McCoy?), but as the creator of that machine, it’s up to you how to utilize those explicit semantics.

[Images: Beast, Mark Zuckerberg's dog; Beast, the fictional X-Man.] (They both look furry to me.)
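To make the “you are king” point concrete, here is one way crawled triples could be kept in a relational store with URIs rewritten as short integer ids. It is only a sketch under assumptions: the example.org URIs, the table layout, and the `load` and `objects_of` helpers are all hypothetical, and a real system would choose its own schema and indexes.

```python
import sqlite3

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical triples as they might arrive from a crawl.
TRIPLES = [
    ("http://example.org/beast", RDF_TYPE, "http://example.org/Dog"),
    ("http://example.org/beast", "http://example.org/owner", "http://example.org/mark"),
    ("http://example.org/mark", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
]


def load(triples):
    """Store triples in SQLite, replacing each URI with a small integer id."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE term (id INTEGER PRIMARY KEY, uri TEXT UNIQUE)")
    db.execute("CREATE TABLE triple (s INTEGER, p INTEGER, o INTEGER)")

    def term_id(uri):
        # Insert the URI if it is new, then return its id either way.
        db.execute("INSERT OR IGNORE INTO term (uri) VALUES (?)", (uri,))
        return db.execute("SELECT id FROM term WHERE uri = ?", (uri,)).fetchone()[0]

    for s, p, o in triples:
        db.execute("INSERT INTO triple VALUES (?, ?, ?)",
                   (term_id(s), term_id(p), term_id(o)))
    db.commit()
    return db


def objects_of(db, subject, predicate):
    """Find objects by joining the compact ids back to their URIs."""
    rows = db.execute(
        """SELECT o.uri FROM triple t
           JOIN term s ON s.id = t.s
           JOIN term p ON p.id = t.p
           JOIN term o ON o.id = t.o
           WHERE s.uri = ? AND p.uri = ?""",
        (subject, predicate),
    )
    return [row[0] for row in rows]
```

The URIs still identify which Beast is meant; they are just stored once in the `term` table while the `triple` table holds integers. That is the sense in which it is “RDF at the top” but not necessarily triples all the way down.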

To be clear, though, I am promoting RDF as a way to publish structured, semantic data as opposed to not publishing structured, semantic data.  In the future, it is conceivable that there may exist other good ways to publish structured, semantic data, but RDF exists today and is widely used.

So I will leave it at that. Again, I invite comments, rebuttals, accolades, disparagements, etc.

Jesse Weaver


Biomedical Semantics and the Cloud

November 18th, 2011

I’ve been asked to give a 30 minute talk on biomedical semantics in the cloud at the Molecular Med Tri Con in the symposium on cloud computing. Here’s what I know about what’s going on in this area at the moment:

So that’s on the “semantics using the cloud” side, but I really think that there’s a lot of potential going the other way: using semantics to discover data and services in the cloud. SADI has the ability to discover and link services through ontologies. It’s similar to SAWSDL (in fact, they wrap SAWSDL services), but they don’t bother with the extra layer, and just let the service process RDF directly. When SADI services are deployed to the cloud, it’ll solve a big problem for people who want others to use their services/algorithms without the overhead of maintaining those servers themselves. In fact, with the Amazon DevPay structure, it’s possible for small labs to release datasets, databases, and algorithms to the world and not have to pay to support it.

I say when, not if, because my implementation of SADI in Python is almost ready for deployment through Google App Engine (which can be deployed in AWS or other systems using AppScale), and from what I hear, it won’t take much work to do the same with the Java implementation. Between this and the extreme portability of Python SADI services (it’s just a script), use in the cloud and redeployment to private clouds is going to be trivial.

So I’m asking folks, am I full of it? Also, what else is there out there? Please help me out so that we all get some good exposure!


My report on Open Government Data camp 2011

November 2nd, 2011

A few days ago I (Alvaro Graves) participated in the Open Government Data Camp 2011 in Warsaw, Poland, where people from different groups, organizations and governments met to discuss issues related to Open Data at the government level. Here are, in my opinion, some of the most important issues raised in these talks.

The current state of OGD

David Eaves, an activist who advises the city of Vancouver, Canada on Open Data issues, gave a keynote in which he described his views on the current state of the Open Data movement. First, it is striking that the success stories are no longer just a few (such as Data.gov or Data.gov.uk); there are now dozens (perhaps hundreds) at national, regional and local levels. Similarly, the term Open Government Data is becoming increasingly popular, which is good because it makes it easier to stop explaining the ‘what’ and start focusing on the ‘how’.

Another interesting point is that the Open Government Data movement has already passed an inflection point: its members are no longer seen as people making demands from the outside, but are increasingly being invited to help work on these initiatives from within the government. For many, this change in perspective can be confusing and may raise concerns that Open Data will be absorbed into a bureaucratic system that makes it impossible to implement Open Data initiatives. However, it is clear that in order for these changes to occur, the movement cannot refuse to collaborate with governments.

Local initiatives, by locals

A talk that I really liked was by Ton Zijlstra, who lives in the city of Enschede, the Netherlands. This city has only 150,000 inhabitants. He wanted an Open Data initiative there, but it was difficult to convince the authorities, so he and a group of people decided to start working on their own. Inviting a handful of hackers to a bar, they created their first application, which used data from Twitter, Foursquare, and the venues of a local festival. Eventually they convinced the municipal government that the default option for local data ought to be open.

From this experience, Ton drew several important lessons. You have to create something concrete, no matter how small: something that requires little funding (the first beers at the bar were free) and a short time frame (no more than a couple of weeks). It also does not matter whether it is original; there are some great ideas out there that deserve to be copied and are very useful for the local community.

How the Open Data died

Another very interesting keynote was by Chris Taggart, founder of OpenCorporates, who warned of the risks that the Open Data movement is facing today. His main concern is how little impact Open Data has had on society. For example, he mentioned that so far no one’s business depends on Open Data (this is not strictly true, since there are a few out there, but I have to concede they are rare examples). In general, making data available is not enough; it has to be used, whether in applications, by data journalists, or otherwise. Also, it is fundamental to link different Open Data sites to each other (something quite uncommon in the movement), so that people can find out more information. Finally, I liked his idea that if Open Data does not cause problems for incumbents, then it is not working.

Redefining what is public

Finally, another talk that I found interesting presented the idea of Andrew Rasiej, founder of Personal Democracy, and Nigel Shadbolt, professor at the University of Southampton, to redefine “public” data as data that “is available on the Web in machine-processable formats.” That is, uploading a bunch of PDFs with scanned tables does not make that information public, because it is not easily accessible. This initiative raises the bar for what counts as public data, especially when compared to the FOIA (Freedom of Information Act), which allows you to request information from the government. Note that this applies to all information, as Rasiej so vehemently described it.

So… what did you talk about at OGDCamp?

In my case, I presented a system for publishing Linked Data called LODSPeaKr, which can be used for the rapid publication of government data and to create applications based on Linked Data. In the near future I will be writing more about this framework, but for now you can see my presentation here.
