
Big Data for the Cloud and the Crowd

April 1st, 2010

Researchers have long been hungry for big data to improve the quality of their research. Big data is no longer a dream but something real on the Web: an increasing amount of data is becoming publicly available from research communities, individuals, government agencies, and others. So what does such big data mean to Web users, and how can we best use it? The following are some potential benefits of big data.

“Make sense of what is already known”. Scientific research grows progressively, and scientific discoveries are founded on the knowledge we have gained in the past. To avoid reinventing the wheel, we should preserve what we already know as part of big data and make it available to ongoing research. Currently, keyword search services such as Google Scholar successfully help researchers retrieve previous research work. Beyond that, well-organized knowledge about past research is needed to give users a systematic and accurate way to access past work. With better knowledge of what has been done, users can better identify promising research directions and approach new discoveries.

“Support hypothesis generation and testing”. With big data in hand (or publicly accessible), not only scientists but also the general public can start thinking more about hypotheses, including theoretical models and pop-science questions. A modest use of big data would be an interactive application that lets users conveniently aggregate distributed big data and then invent or evaluate their hypotheses against it. One step further would be to apply powerful AI technology (especially statistical methods) to big data to help users identify similar or unique data and hypotheses, prioritize potentially interesting candidate hypotheses, and even come up with new hypotheses.

“Support persistence and accountability”. If big data is going to be the foundation for massive scientific research and public use, all applications that depend on the data need it to be reliably available. Meanwhile, without effective accountability mechanisms over distributed and shared big data, conclusions derived from it may not be trusted.

To realize these benefits, the emerging field of Web Science seems very promising, as it brings many interesting opportunities for dealing with big data:

“Linked Data” [1]. Big data is not merely a massive collection of information islands bounded by their physical locations; its value can be greatly increased if those islands are effectively linked (or networked). As with hyperlinks on the Web, it is very important to turn implicit connections between data into declarative ones and to make the links available as part of big data: a person’s medical records can be linked across different clinics and hospitals, state-level demographic statistics (e.g. livestock and gross income tax) can be linked across different government agencies [2], and information about a disease can be linked to entries at GenBank.
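
As a rough illustration of what a declarative link buys, here is a minimal sketch in plain Python. It is not an RDF implementation: the record identifiers, the field names, the `sameAs` predicate name, and the `linked_view` helper are all assumptions invented for this example. The idea is only that once the connection between two records is stated explicitly as data, a program can follow it and assemble a combined view.

```python
from collections import defaultdict

# Hypothetical records held by two independent sources (illustrative data only).
clinic_records = {"clinic:p17": {"name": "A. Example", "diagnosis": "influenza"}}
hospital_records = {"hospital:4421": {"name": "A. Example", "admitted": "2010-03-02"}}

# Declarative links stored as (subject, predicate, object) triples.
# The identifiers and the "sameAs" predicate name are assumptions for this sketch.
links = [("clinic:p17", "sameAs", "hospital:4421")]


def linked_view(start, sources, links):
    """Merge every record reachable from `start` by following sameAs links."""
    # Treat sameAs as symmetric so the view is the same from either end.
    neighbors = defaultdict(set)
    for subject, predicate, obj in links:
        if predicate == "sameAs":
            neighbors[subject].add(obj)
            neighbors[obj].add(subject)

    seen, stack, merged = set(), [start], {}
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Collect whatever each source knows about this identifier.
        for source in sources:
            merged.update(source.get(node, {}))
        stack.extend(neighbors[node] - seen)
    return merged


patient = linked_view("clinic:p17", [clinic_records, hospital_records], links)
print(patient)
```

Without the entry in `links`, the two records stay isolated islands even though they describe the same person; with it, the merged view contains the fields from both sources.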

“Social Machine” [3]. Big data should also interact with human society. Crowdsourcing, as in Wikipedia and Web rating systems, has been seen to add huge value to the knowledge on the Web. However, that is not yet the ultimate vision. We can combine the power of machines and humans to build the social machine: cloud computing services, such as Google search and Microsoft’s recently announced Web n-gram service, offer great computing power for processing massive data, while crowdsourcing, as in Wikipedia, can distribute the cost of solving hard problems across the massive human intelligence on the Web and supply high-quality results. The social machine also supports interactive problem solving: there is a feedback loop between the cloud and the crowd, and consumers can feed comments and enhancements back to the publisher.

“Knowledge Provenance” [4,5]. Big data is often integrated when it is used. Declarative knowledge provenance (e.g. an audit trail) is the foundation for transparency in distributed data processing. Computations over provenance data are the key to accountability, e.g. a policy framework to assure proper use of digital information and trust mechanisms to assure the credibility of reused data.
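
To make "computations over provenance data" a little more concrete, the following is a minimal sketch in plain Python. It is not PML [4] or any other existing provenance format; the `Derived` class, the helper functions, and the source names are hypothetical and exist only for this illustration. Each value carries the set of sources it was derived from, so a simple trust policy can be checked after the data has been integrated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derived:
    """A value together with the identifiers of the sources it came from."""
    value: float
    sources: frozenset


def load(value, source_id):
    # A freshly loaded value has exactly one source.
    return Derived(value, frozenset([source_id]))


def add(a, b):
    # An integrated value inherits the provenance of both inputs.
    return Derived(a.value + b.value, a.sources | b.sources)


def is_trusted(item, trusted_sources):
    # Assumed policy for this sketch: every contributing source must be trusted.
    return item.sources <= set(trusted_sources)


# Hypothetical figures from two hypothetical agencies.
total = add(load(120.0, "agency-a"), load(80.0, "agency-b"))
print(total.value, sorted(total.sources))
print(is_trusted(total, {"agency-a", "agency-b"}))
print(is_trusted(total, {"agency-a"}))
```

The policy here is deliberately simple. A realistic accountability framework would also record which operation produced each value and who performed it, but the same principle applies: the trail is data that programs can inspect.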


[1] Tim Berners-Lee, Linked Data, 2007

[2] Li Ding, Dominic DiFranzo, Alvaro Graves, James Michaelis, Xian Li, Deborah L. McGuinness, Jim Hendler, Data-gov Wiki: Towards Linking Government Data, in Proceedings of the AAAI Spring Symposium on Linked Data Meets Artificial Intelligence, 2010

[3] J. Hendler, T. Berners-Lee, From the semantic web to social machines: A research challenge for AI on the World Wide Web, Artificial Intelligence, 2009

[4] Deborah L. McGuinness, Li Ding, Paulo Pinheiro da Silva, Cynthia Chang, PML 2: A Modular Explanation Interlingua, in Proceedings of the AAAI’07 Workshop on Explanation-Aware Computing, 2007

[5] Li Ding, Provenance and Search Issues in RDF Data Warehouse, in Proceedings of SemGrail Workshop, 2007

Li Ding,  April 1, 2010

Categories: linked data, Web Science