Archive for March, 2011

Budget Cuts Threatening Data.gov

March 31st, 2011

You may have heard that the Data.gov website is going to be shut down. I wish I could say this is completely false, but I can at least say that it is premature: if Congress cuts the budgets to the threatened level, a number of sites, including Data.gov, will have trouble continuing to grow, and some may have to be shut down. Right now, though, the budget cuts are not final, and the plans are still in the works. Data.gov, luckily, is less expensive to maintain than some of the other sites, so the current discussion is more about cutting plans for expansion than about shutting down completely, although even that would be a major blow to open government data. Sites like USAspending will be harder to maintain, however, and even Data.gov could end up shut down if the full cuts go through unchanged (though I'm personally hoping the Senate and White House will resist this).

What you can do is get involved! Let your politicians hear from you. The Sunlight Foundation has a great site about this at http://sunlightfoundation.com/savethedata/ which will let you sign a petition and suggests other actions you can take. It also has up-to-date information on the situation, so please go look there.

There are also a lot of articles out there, and much to follow on Twitter; here are some starting points.

In the past day, there have been a lot of articles in the news about Data.gov:  http://www.google.com/search?q=%22data+gov%22&hl=en&prmdo=1&tbm=mbl&num=10&lr=&ft=i&cr=&safe=images&tbs=qdr:w#q=data.gov&hl=en&lr=&prmdo=1&tbm=nws&ei=ehuVTfz9GY3msQOCiJ3MBQ&start=0&sa=N&bav=on.2,or.r_gc.r_pw.&fp=83f1e1e6450f219c

A good article by Beth Noveck (I'm the president of her fan club :-) ) in the Huffington Post: "Why Cutting E-Gov Funding Threatens American Jobs":
http://www.huffingtonpost.com/beth-simone-noveck/why-cutting-egov-funding-_b_840430.html

The hashtag for following this on Twitter is #savethedata.

So please, join us in saving these important government transparency efforts!!

-Jim Hendler

p.s. For some irony, Hong Kong’s open data site went live today: http://www.gov.hk/en/theme/psi/welcome/

Here are some more articles and links for those interested:

Federal News Radio, Daniel Shuman, Sunlight Foundation, "Budget cuts may end transparency programs": http://www.federalnewsradio.com/index.php?nid=17&sid=232614
Federal News Radio, Executive Editor, Jason Miller, “OMB prepares for open gov sites to go dark in May”: http://www.federalnewsradio.com/?nid=35&sid=2327798
Sunlight Foundation, Daniel Shuman, “Budget Technopocalypse Deepens: Transparency Sites will go dark in a few months”: http://sunlightfoundation.com/blog/2011/03/31/budget-technopocalypse-deepens-transparency-sites-will-go-dark-in-a-few-months/
Washington Examiner, Mark Tapscott, “Transparency advocates appeal to Congress to avoid budget cuts”: http://washingtonexaminer.com/blogs/beltway-confidential/2011/03/transparent-advocates-appeal-congress-avoid-budget-cuts
PCWorld, Grant Gross, “Group Protests Proposed Cuts to e-Government Transparency Efforts”: http://www.pcworld.com/businesscenter/article/223618/group_protests_proposed_cuts_in_egovt_transparency_efforts.html
ReadWriteWeb, "Data.gov and 7 other sites to shut down after budget cuts": http://www.readwriteweb.com/archives/datagov_7_other_sites_to_shut_down_after_budgets_c.php


Will no CO2 emission come together with no global warming?

March 30th, 2011

OK, you have been fooled by the title. This post will not talk about environmental policy, as I have neither the courage nor the knowledge to take sides in the global warming debate.

As part of my recent work on "semantic information theory", I'm reading Compression Without a Common Prior: An Information-theoretic Justification for Ambiguity in Language by Brendan Juba of Harvard [2]. I had some nice conversations with Brendan about Universal Semantic Communication when he was at MIT, and it's nice to read another paper from him.

In his paper, Brendan uses the following example:

For an English example, consider the example sentence, You may step forward when your number is called. The implication is that you may not step forward before your number is called, for if that was not the intention, the sentence You may step forward at any time could have been used.

Logically, this raises the question: if we know p → q, does ¬p → ¬q follow?

We know this is not a valid inference (it is the fallacy of denying the antecedent). But why do people fall for fallacies of this kind so often?

I tried to come up with a reasonable explanation using semantic information theory (SIT). First introduced by Carnap and Bar-Hillel [1], SIT studies the meaning carried by messages. If a sentence is less likely to be true, then it is more surprising. So "Today is hot, and tomorrow is also hot" means more than "Today is hot". On the other hand, if we say "Today is hot, or today is not hot", we give very little information.

In classical information theory, the entropy of a message is determined by the statistical probability of the symbols appearing in it. In SIT, the entropy of a statement is determined by its logical probability, i.e., the likelihood of observing a possible world (model) in which the statement is true. To see the difference, consider another example: the message "Rex is not a tyrannosaurus" (M1) is less "surprising" than "Rex is not a dog" (M2), not because the word "tyrannosaurus" is more common than "dog", but because the individuals denoted by "tyrannosaurus" (now considered extinct) are less common than the individuals denoted by "dog". Thus, M1 carries less semantic information than M2, even though it may carry more Shannon information based on the statistical distribution of English words.
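For reference, Carnap and Bar-Hillel [1] quantify this with two measures built on the logical probability m(s) of a statement s (the same m used in the calculations below); their standard definitions are

cont(s) = 1 − m(s)   [content: how much of the space of possible worlds s rules out]

inf(s) = −log₂ m(s)   [information: how surprising s is]

so a tautology like "Today is hot, or today is not hot" has m(s) = 1 and carries no semantic information.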

Now back to (p → q) → (¬p → ¬q). We have the truth table:

p   q   p → q   ¬p → ¬q
T   T     T        T
T   F     F        T
F   T     T        F
F   F     T        T

As we are ignorant of the likelihood of p and q, let's suppose all four situations in the truth table are equally likely. The logical probability of ¬p → ¬q is then

m(¬p → ¬q) = 3/4

Now we learn that p → q is true, so the second row of the table is ruled out. The conditional logical probability becomes

m(¬p → ¬q | p → q) = 2/3 [belief decreases]

Thus, upon hearing "You may step forward when your number is called", it is rational to revise downwards one's belief in "You may not step forward before your number is called". The first sentence, while not a logically sufficient condition for the second, carries some semantic mutual information about it.

Wait, isn't that the reverse of what we want to justify?

Maybe the real implication of "You may step forward when your number is called" is "No number called, no stepping forward"; that is, instead of the implication ¬p → ¬q (causation), we mean the conjunction ¬p ∧ ¬q (correlation, or co-occurrence). If that is true, it is reasonable not to move before your number is called:

m(¬p ∧ ¬q) = 1/4

m(¬p ∧ ¬q | p → q) = 1/3 [belief increases!]
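To make the computation concrete, here is a minimal Python sketch (my own illustration, not from Juba's paper) that enumerates the four possible worlds and reproduces the numbers above:

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    # Material implication: a -> b is false only when a is true and b is false.
    return (not a) or b

# All four possible worlds: truth assignments to (p, q).
WORLDS = list(product([True, False], repeat=2))

def m(stmt, given=lambda p, q: True):
    """Logical probability: fraction of admissible worlds where stmt holds."""
    admissible = [w for w in WORLDS if given(*w)]
    return sum(stmt(*w) for w in admissible) / len(admissible)

p_implies_q = lambda p, q: implies(p, q)
notp_implies_notq = lambda p, q: implies(not p, not q)
notp_and_notq = lambda p, q: (not p) and (not q)

print(m(notp_implies_notq))               # 0.75     = 3/4
print(m(notp_implies_notq, p_implies_q))  # 0.666... = 2/3
print(m(notp_and_notq))                   # 0.25     = 1/4
print(m(notp_and_notq, p_implies_q))      # 0.333... = 1/3
```

Conditioning on p → q lowers the logical probability of the implication reading but raises that of the conjunction reading, which is exactly the asymmetry the argument turns on.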

Now return to the title. Take p to be "there is CO2 emission" and q to be "there is global warming", and assume that the causation p → q stands. Will ¬p ∧ ¬q, i.e., no CO2 emission happening together with no global warming, make more sense? Based on the analysis above, it will. Logicians may disagree, but polar bears will certainly appreciate the argument.

References

[1] R. Carnap and Y. Bar-Hillel. An Outline of a Theory of Semantic Information. RLE Technical Report 247, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, October 1952.

[2] B. Juba, A. Kalai, S. Khanna, and M. Sudan. Compression Without a Common Prior: An Information-theoretic Justification for Ambiguity in Language. In 2nd Symposium on Innovations in Computer Science, Beijing, P.R. China, 2011.


GeoData 2011 – Experiences

March 7th, 2011

GeoData 2011 was a great platform for me to learn from and become familiar with data scientists in academia and various other organizations. The workshop focused on current practices and future directions in the data life cycle, data integration, and data citation. It was a very important resource for my research on data citation and a perfect follow-up to the data science course I took last term. The workshop's highlights were the three breakout sessions on the data life cycle, integration, and citation, each preceded by thought-provoking talks by experts in the respective areas.

The first day focused on aspects of the data life cycle. Prof. Peter Fox gave a talk on the various stages of the data life cycle and presented a couple of data life cycle models. The talk was well received: some of his slides, especially the "data-knowledge-information" diagram and the "pyramid of data and its audiences", were referenced at various points in the breakout session I attended. While most members reached a consensus on a life cycle model similar to the one Prof. Fox suggested, some suggested adding a "disposal" stage and a "data definition" stage, and others proposed two separate value streams for data. Gaps in the life cycle were also investigated, and many participants highlighted the need for incentives for better data management.

The second day started with reports from the data life cycle breakouts. Following the reports, Jim Barrett gave a talk on "GeoSpatial Integration"; he suggested a systematic collaborative effort towards data integration and proposed building and publishing a national supply chain plan for data. Rich Signell of the USGS chaired our breakout. Picking up Prof. Fox's metaphor of "dead fruit lying on the ground", our team highlighted successful data integration efforts at OGC and Unidata and discussed the importance of communicating those standards to the community. Rich Signell also pointed out the huge demand for people trained in producing quality data.

The data citation breakout was my personal favorite. Mark Parsons from NSIDC presented his hypothesis that ~80% of citation scenarios for geospatial data can be addressed with basic citations, and gave us a homework exercise to come up with citations for three use cases. In my breakout group, we split into three-person subgroups; I teamed up with Rich Signell and Ben Lewis. I presented a use case featuring an EPA data set (how do we cite data from http://www.epa.gov/cgi-bin/htmSQL/mxplorer/query_daily.hsql?poll=42101&msaorcountyName=1&msaorcountyValue=1 ?). Rich Signell showed one of his use cases, a data set in a THREDDS server that had the same scientific content available in different file formats, and raised questions about the granularity of citation. Our group also discussed the possibility of using SHA-1 hash values as identifiers, which would avoid having a central authority control identifiers. Personally, I feel a DOI- or Handle-like identifier would be the best option, as it would act as both an identifier and a locator, with the benefits of persistence. Bruce Barkstrom asked questions about the lineage of data citations: as presented in one of the breakout reports, if we use information in a map built from 100 datasets, do we cite the map or the 100 datasets? Many research questions were raised in the breakout, and the session was an excellent opportunity for me to come up with additional use cases and get an idea of the issues around data citation.
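As a minimal sketch of the hash-as-identifier idea (my own illustration, not something presented at the workshop), a content-based identifier can be computed directly from a dataset's bytes, so anyone can recompute and verify it without a central naming authority:

```python
import hashlib

def content_identifier(path: str, chunk_size: int = 1 << 20) -> str:
    """Derive a content-based identifier for a dataset file via SHA-1."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large geoscience files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    # The "hash://sha1/" URI scheme here is illustrative, not a standard.
    return "hash://sha1/" + sha1.hexdigest()

# Hypothetical usage with a downloaded EPA extract (file name made up):
# print(content_identifier("epa_daily_poll42101.csv"))
```

One caveat, which connects to Signell's granularity question: the same scientific content serialized in different file formats (as in the THREDDS use case) yields different hashes, so a hash identifies a particular byte stream rather than the abstract dataset. That is part of why a DOI- or Handle-style identifier, which names the dataset at a chosen level of granularity, still appeals to me.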

Apart from the workshop's central theme, I also had various Semantic Web and provenance-related discussions with participants. I met people who were very interested in provenance concepts; provenance was a hot topic in both the data life cycle and data citation breakouts. I discussed PML and the Inference Web with folks from ORNL and Harvard, who were very excited about them and showed a lot of interest. There seems to be a huge demand for tools facilitating provenance collection and integration, and people showed a lot of interest in tools like csv2rdf4lod, which has built-in provenance support. There was also a lot of interest in using Semantic Web technologies for GeoInformatics.

The workshop has given me a lot of insight into data citation, the data life cycle, and data integration. I hope to make the best use of this experience in my efforts to come up with proper data citation methods. People need incentives to produce quality data, and data citation could be the answer.


GeoData2011 Takeaways

March 6th, 2011

The GeoData2011 workshop was a tremendous opportunity to hear about the state of the art in the data lifecycle, data integration, and data citation, and to participate in dialogs that will define the path forward in each of these areas. It was humbling to be surrounded by the combined centuries of experience in geoscience and the data pipeline: there were members from virtually every community in geoscience and organizations specializing in every stage of the data lifecycle. With that said, here are some key takeaways I collected from each of the workshop foci (lifecycle, integration, and citation).

The data lifecycle is a hard thing to define. Our own Prof. Peter Fox gave the workshop a starting point with a simple, three-stage model involving acquisition, curation, and preservation. Of course, the data lifecycle is not by any means a simple entity, and it's likely that there is no one-size-fits-all framework or abstraction for every instantiation. Some participants thought there needed to be a distinction between the "original intent" data cycle and further cycles. Others viewed the data lifecycle as an endless spiral of acquisition, curation, and preservation. One of the breakout sessions divided the simple lifecycle further to include more granular stages such as collection planning, processing, and migration. Even with the varied viewpoints on how to define a data lifecycle, the breakout sessions all pointed to metadata as the primary target for improvement. The need to identify points where metadata must be captured, to build better tools that automate the capture of metadata within data collection instruments, and to educate scientists on the importance of metadata emerged as critical paths to improving the data lifecycle. As a Semantic Web group, we can proudly say that we are good at dealing with metadata. Still, I think we can improve in certain areas. To start, I think we can nail down a data lifecycle abstraction for acquiring, curating, and preserving Semantic Web data more readily than the geoscience community can for its data. We can also do a better job of capturing metadata throughout our data pipeline, and tools like csv2rdf4lod should be celebrated for their efforts in doing this.

For me, the data integration sessions may have been the most interesting part of the workshop. In our group at TWC, data integration is a task many of us perform daily: transforming data from different formats to RDF to enable interoperability, and applying community vocabularies and constructing vocabulary mappings to enable a consistent view over data. However, most of the data integration tasks we perform are in service of a certain goal or a specific use case; the goals that most of the GeoData participants had in mind were much more ambitious. There was no single use case or specific domain; rather, the workshop focused on enabling data integration across the broad, multidisciplinary domain of geoscience. The participants were primed with a talk by Jim Barrett of Enterprise Planning Solutions, who mentioned the need to move data integration up the value chain, from the use side to the supply side of data. I think there were mixed feelings on the extent to which data integration can be moved up the value chain; most recognized that there is generally a tradeoff between the ability to integrate data and the ability to capture everything in the original data acquisition. The breakout session I participated in for this topic produced a few interesting suggestions, namely that each role in the data lifecycle (e.g., producer, archiver, user) needs to maximize "integratability", distinct from the callout to move integration up the pipeline. It was also mentioned that identifying limitations of data transformations (i.e., what has enabled integration) and constraints on data transformations (i.e., what can enable integration) is important, and that there are constructs in the ISO 19115 standard for recording this (MD_Usage and MD_Constraints). There is tremendous potential to apply semantics in this area, through vocabularies and reasoning capabilities, to notify users of the limitations of the products they are using and to provide warnings before constraints are exceeded.

The last major focus of the workshop was data citation. Before attending GeoData2011, I appreciated the significance of data citation, but only after the workshop did I realize that it is truly within the grasp of the scientific community. Mark Parsons from the National Snow and Ice Data Center presented some ongoing work in data citation, such as DataCite and the Dataverse Network project, as well as his own theories on data citation. He hypothesized that 80% of available data can be cited as is, without the need for any special data citation platforms, and set the breakout groups on the task of writing data citations and identifying gaps. Some particularly tricky datasets were identified that might need alternative approaches, including taxonomic data (which changes frequently) and hydrographic data (which is often compiled from many individual cruises into a homogeneous database). What I found most interesting was Parsons's suggestion that we cite data in exactly the same way we make other citations in our publications, and that we treat data citations as equally important to journal and in-proceedings references. Data citation is critical to the work we are doing at TWC: almost all of us work with someone else's data in our lab. As such, when we publish on what we've done, or even when we post visualizations and mashups of Data.gov datasets, we need to include references to the original data, just like the references we'd put in any of our publications. In our LOGD site, we should be making appropriate data citations on the pages we create for converted datasets. Making these simple changes to the way we do science, and educating students, scientists, and even publishers, is the only way to make progress in data citation.
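To make that concrete, a basic "cite it like a paper" data citation (an illustrative sketch following the general creator, year, title, version, publisher, identifier pattern promoted by DataCite; the authors, dataset, and DOI below are made up) might look like:

Smith, J. and K. Lee (2010): Daily Sea Ice Extent Index, Version 2. Boulder, CO: Example Data Archive. doi:10.9999/example-sea-ice. Accessed 5 March 2011.

The only elements beyond a normal bibliographic reference are the version, the persistent identifier, and the access date, which is part of what makes Parsons's "80% with basic citations" hypothesis plausible.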

So those are my takeaways. The GeoData2011 workshop was an excellent opportunity to learn about the state of the art and the path forward in the data lifecycle, data integration, and data citation. In short: let's identify the data lifecycle for Semantic Web data, keep building tools that automatically capture metadata, and add appropriate citations to the integrated datasets, visualizations, and mashups that we create. I look forward to applying the information I absorbed from the many interesting dialogs that occurred. In fact, I will be looking into the ESIP Discovery Cluster in the coming weeks to see where my work on S2S and Semantic Web services can be applied to improve their discovery services (especially their OpenSearch conventions).
