TWC Undergrads Visualize Linked Open Corporate Data
Two undergraduate members of the Tetherless World team, Alexei Bulazel and Bharath Santosh, recently wrote great summaries of their work creating visualizations based on linked open corporate data aggregated through the ORGPedia project. In this post I’ll include snippets from their posts; I encourage you to check out their full posts and the demos they link to!
First, a bit of context (from the ORGPedia site):
ORGPedia: The Open Organizational Data Project, led by NYLS Professor and former United States Deputy CTO Beth Noveck (Project Lead) and TWC Senior Constellation Professor Jim Hendler (Tech Lead), explores how to create the legal, policy and technology framework for a data exchange to facilitate efficient comparison of organizational data across regulatory schemes as well as public reuse and annotation of that data. By designing a universal exchange rather than a new numbering scheme, ORGPedia aims to achieve goals like improving corporate transparency and efficiency, organizational performance, risk management, and data-driven regulatory policy, without having to wait until legislation is enacted for a single legal entity identifier.
To date, TWC’s contribution to ORGPedia has been to aggregate data from a variety of sources, develop an experimental site to serve as a platform for integrating the data and prototyping ORGPedia concepts, and develop data visualizations and mashups that demonstrate the potential of an open system of canonical identifiers for corporate entities. Led by TWC Ph.D. student Xian Li, undergrads Alexei Bulazel and Bharath Santosh teamed up to create interesting visualizations based on the aggregated data.
Bharath first describes a visualization he created that allows users to analyze various financial properties of the financial sectors in the US using our aggregated data:
The visualization itself uses Google Motion Charts, which is part of Google’s Visualization API. It is an interactive multidimensional graph of a dataset of sectors and the mean of various financial properties across each sector’s companies. The data shown above is represented in millions of USD. The Motion Chart allows for really neat temporal analysis of data in various forms. Clicking the play button shows the change in properties from 2008 to 2011. There are also three different styles in which you can view the data: bubbles (shown above), bar charts, and line graphs. These can be switched in the top corner.
The dataset behind the visualization was created in R. I wrote a SPARQL query that would access ORGPedia’s datasets and pull out the US sectors, along with the companies within each sector and their stock tickers. Then I took these companies and pulled in their income statements from Google Finance, went through each sector, and averaged various properties from the financial statements of the sector’s companies. The data manipulation in R took some getting used to, but now it’s very easy for me to transform data frames, matrices, and other objects in R. After the dataset was created and cleaned of missing values, it’s just a matter of defining the properties of the Motion Chart and running it. It generates an HTML file with the graph and the data represented in JavaScript. All the data processing and manipulation takes around 15 minutes, mostly due to the large amount of data to be downloaded.
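The averaging step Bharath describes can be sketched in a few lines. His pipeline was written in R; the version below is only a Python illustration of the same idea, not his code. The function name, the record layout, the tickers, and the figures are all made up for the example, and missing values are assumed to arrive as None.

```python
from collections import defaultdict
from statistics import mean


def sector_means(rows, properties):
    """Average each property over the companies in a (sector, year) group.

    Missing values (None) are skipped, which stands in for the
    "cleaned of missing values" step described above.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = (row["sector"], row["year"])
        for prop in properties:
            value = row.get(prop)
            if value is not None:
                grouped[key][prop].append(value)
    return {
        key: {prop: mean(values) for prop, values in props.items()}
        for key, props in grouped.items()
    }


# Hypothetical tickers and figures (millions USD), purely for illustration.
rows = [
    {"sector": "Energy", "ticker": "AAA", "year": 2008,
     "revenue": 100.0, "net_income": 10.0},
    {"sector": "Energy", "ticker": "BBB", "year": 2008,
     "revenue": 300.0, "net_income": None},
    {"sector": "Technology", "ticker": "CCC", "year": 2008,
     "revenue": 50.0, "net_income": 5.0},
]

means = sector_means(rows, ["revenue", "net_income"])
print(means[("Energy", 2008)])
```

Keying the result by (sector, year) gives one row per sector per year, which is roughly the shape a motion chart needs in order to animate values over time.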
Bharath then goes on to describe the compelling visualization he and Alexei created of the “social network” of corporate board members:
…The visualization uses data from LittleSis.org about the board members of various companies in the US and displays the members in a force graph that shows which board members are on multiple boards (Board Members Network):
The graph visualization is done using the D3 visualization toolkit’s force-directed graph layout. Each node represents a board member. Each cluster of same-colored nodes is a group of members on the same board. The multicolored nodes represent board members who are on multiple boards. Mousing over a node shows you the member’s name and the companies they work for. Clicking a node takes you to their LittleSis.org page. The graph shows many interesting relationships between various companies and board members. One notable example is Steven S Reinemund, who sits on 5 different boards.
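To make the "multiple boards" idea concrete, here is a minimal Python sketch of how such members could be picked out of board rosters. It is an illustration rather than the students' code: the function names and the placeholder companies and people are hypothetical, and it assumes a person can be identified by name alone, which real data would not guarantee.

```python
from collections import defaultdict


def boards_by_member(boards):
    """Invert {company: [member, ...]} into {member: {company, ...}}."""
    member_boards = defaultdict(set)
    for company, members in boards.items():
        for member in members:
            member_boards[member].add(company)
    return member_boards


def multi_board_members(boards):
    """Return (member, companies) pairs for members on more than one board,
    ordered by number of boards (descending), then by name."""
    shared = [
        (member, sorted(companies))
        for member, companies in boards_by_member(boards).items()
        if len(companies) > 1
    ]
    return sorted(shared, key=lambda item: (-len(item[1]), item[0]))


# Placeholder rosters, purely for illustration.
boards = {
    "Company A": ["Person 1", "Person 2"],
    "Company B": ["Person 2", "Person 3"],
    "Company C": ["Person 2", "Person 3"],
}

for member, companies in multi_board_members(boards):
    print(member, len(companies), companies)
```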
On his blog, Alexei provides additional detail about the work they did to prepare the data for the visualization:
The project involved creating an interactive graph visualization of connections between members of corporate boards (the final product can be found here). Given a list of a few hundred stock tickers and access to the LittleSis API, the goal was to ultimately produce a JSON file of board members that could be used by the D3.js force-directed graph framework. I started by looking up each ticker symbol, yielding a JSON file with a unique ID number for each company. My script then queried the API for the actual company page associated with that ID and stored the names, company associations, and URIs of each board member. Finally, a JSON file for the D3.js graph was output describing the ~2800 board members and the links between them.
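The final step, turning board rosters into a file for a force-directed graph, might look roughly like the sketch below. It does not call the LittleSis API and is not Alexei's script. It assumes the common D3 force-layout input of a "nodes" array plus "links" whose "source" and "target" are indices into that array, and it assumes every pair of people sharing a board gets a link; the original post does not say how links were chosen. The field names, IDs, and example.org URLs are hypothetical.

```python
import json
from itertools import combinations


def build_force_graph(boards):
    """Build {"nodes": [...], "links": [...]} from {company: [member, ...]}.

    Each member is a dict with "id", "name" and "url" (assumed layout).
    """
    nodes, index = [], {}
    for group, (company, members) in enumerate(sorted(boards.items())):
        for member in members:
            if member["id"] not in index:
                # First sighting: the node takes this board's group number.
                index[member["id"]] = len(nodes)
                nodes.append({"name": member["name"], "url": member["url"],
                              "companies": [], "group": group})
            nodes[index[member["id"]]]["companies"].append(company)

    # Assumption: link every pair of members who share a board.
    links = set()
    for members in boards.values():
        ids = sorted({index[m["id"]] for m in members})
        links.update(combinations(ids, 2))

    return {"nodes": nodes,
            "links": [{"source": s, "target": t} for s, t in sorted(links)]}


def person(n):
    """Placeholder record with a made-up ID and URL."""
    return {"id": n, "name": f"Person {n}",
            "url": f"https://example.org/person/{n}"}


boards = {
    "Company A": [person(1), person(2)],
    "Company B": [person(2), person(3)],
}

graph = build_force_graph(boards)
print(json.dumps(graph, indent=2))
```

Linking every pair on a board grows quadratically with board size, so for a few thousand people a sparser rule may be preferable; the pairwise rule is used here only because it is the simplest to read.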
While I had used Python a bit for command line scripting, I hadn’t really dug into it before this project. The work gave me a better taste for the language and its capabilities. I made extensive use of the “urllib” library for accessing web content, and worked with opening up the data in JSON files. Bharath helped me with the syntax of the program and some of the graph construction. While I was aware of Python’s reputation for ease of use and high-level abstraction, working with it let me experience this abstraction firsthand, and I was very impressed. The ease with which complex multistep operations could be completed let me focus more on the flow of the data through the process rather than the specifics of handling it. The project also gave me a bit more hands-on experience with JSON.
The reader is encouraged to read both Alexei’s and Bharath’s blogs for more details on these great contributions by a couple of our TWC undergrads!