Provenance

From Semantic Portal Wiki


Overview

The process that led to some data is called the provenance of that data. A provenance architecture is the software architecture for a system that will provide the necessary functionality to record, store and use process documentation to determine the provenance of data items.

"The motivation for understanding the provenance of works of art is also applicable to data we see on the Web. With the proliferation of data on the Web, questions such as Where did this data come from?, Who else is using this data?, and Why is this piece of data here? are becoming increasingly common" (Tan 2004).

"Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. It is a moot point on where the boundary between provenance information and generic metadata lies. In some cases, there is little to distinguish the two and provenance is subsumed into the general metadata infrastructure." (Simmhan et al. 2005)



Research Themes

Workflow Provenance

Workflow provenance has emerged as an important consideration in e-science (Lanter 1990; Frew and Bose 2001) and the grid community (Foster et al. 2002; Muniswamy-Reddy et al. 2006; Moreau and Ibbotson 2006). It focuses on the history of dataset derivation at a coarse level of granularity. It is particularly important in the e-science domain, where a number of requirements are emerging (Miles et al. 2007). Growing interest in provenance metadata from different domains using different technologies has led to several provenance dialects. Notably, all 14 teams in the second provenance challenge used their own distinct provenance representations, and issues arose during translation. Useful surveys include Simmhan et al. (2005) and Bose and Frew (2005).
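
As a rough illustration of derivation history at a coarse level of granularity, the sketch below records only which process produced each dataset from which whole input datasets, then walks that record backwards. The file names, the process names and the `lineage` function are hypothetical and do not come from any of the systems cited here.

```python
from collections import deque

# Hypothetical workflow record: each entry names the process that produced a
# dataset and the whole datasets it consumed (no per-record detail is kept).
DERIVATIONS = {
    "cleaned.csv": {"process": "clean", "inputs": ["raw.csv"]},
    "stats.csv": {"process": "summarize", "inputs": ["cleaned.csv", "regions.csv"]},
    "report.pdf": {"process": "render", "inputs": ["stats.csv"]},
}


def lineage(dataset, derivations):
    """Return every dataset that `dataset` was directly or indirectly derived from."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        step = derivations.get(current)
        if step is None:  # an original source: nothing is recorded upstream
            continue
        for parent in step["inputs"]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(lineage("report.pdf", DERIVATIONS)))
```

A dataset with no entry in the record is treated as an original source, so its lineage is empty.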


researchers

resources

References

  1. David P. Lanter. Lineage in GIS: The Problem and a Solution, NCGIA (90-6), 1990
  2. James Frew, Rajendra Bose. Earth System Science Workbench: A Data Management Infrastructure for Earth Science Products, SSDBM pp.180-189, 2001
  3. Ian T. Foster, Jens-S. Vockler, Michael Wilde, Yong Zhao. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation, SSDBM pp.37-46, 2002
  4. Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, Margo I. Seltzer. Provenance-Aware Storage Systems, USENIX Annual Technical Conference, General Track pp.43-56, 2006
  5. Luc Moreau, John Ibbotson. The EU Provenance Project: Enabling and Supporting Provenance in Grids for Complex Problems (Final Report), The EU Provenance Consortium, 2006
  6. Simon Miles, Paul T. Groth, Miguel Branco, Luc Moreau. The Requirements of Using Provenance in e-Science Experiments, J. Grid Comput. 5 (1) pp.1-25, 2007
  7. Yogesh Simmhan, Beth Plale, Dennis Gannon. A survey of data provenance in e-science, SIGMOD Record 34 (3) pp.31-36, 2005
  8. Rajendra Bose, James Frew. Lineage retrieval for scientific data processing: a survey, ACM Comput. Surv. 37 (1) pp.1-28, 2005
  9. Yolanda Gil, Ewa Deelman, Mark H. Ellisman, Thomas Fahringer, Geoffrey Fox, Dennis Gannon, Carole A. Goble, Miron Livny, Luc Moreau, Jim Myers. Examining the Challenges of Scientific Workflows, IEEE Computer 40 (12) pp.24-32, 2007

Protocol for Bioinformatics

Provenance for bioinformatics processes can be considered a specific branch of workflow provenance.

References

  1. Lance Feagan, Justin Rohrer, Alexander Garrett, Heather Amthauer, Ed Komp, David Johnson, Adam Hock, Terry Clark, Gerald Lushington, Gary Minden, Victor Frost. Bioinformatics process management: information flow via a computational journal, Source Code for Biology and Medicine 2 (9), 2007
  2. Joan C. Bartlett, Elaine G. Toms. Developing a protocol for bioinformatics analysis: An integrated information behavior and task analysis approach, Journal of the American Society for Information Science 56 (5) pp.469-482, 2005
  3. Shawn Hoon, Kiran Kumar Ratnapu, Jer-ming Chia, Balamurugan Kumarasamy, Xiao Juguang, Michele Clamp, Arne Stabenau, Simon Potter, Laura Clarke, Elia Stupka. Biopipe: A Flexible Framework for Protocol-Based Bioinformatics Analysis, Genome Research 13 pp.1904-1915, 2003

Data Provenance (database)

Data provenance was pioneered within the database community (Buneman et al. 2001; Cui et al. 2000; Woodruff and Stonebraker 1997). Data provenance research focuses on issues of importance in database settings and has been inspired by computational methods suitable for, and facilitated by, databases. For example, why-provenance finds the source tuples that explain why a tuple was derived, and where-provenance finds the portion of the sources that was copied to a portion of the derived tuple. This kind of provenance can be represented as a specialized workflow step whose action is described by a declarative query and a declarative inverse function. Useful surveys include Glavic and Dittrich (2007) and Tan (2007). Notably, some data provenance work has been generalized to workflow provenance, e.g. in e-science, while "data provenance" in the narrow sense remains in the database domain.
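
The distinction between why-provenance and where-provenance can be made concrete with a small sketch. It assumes two hypothetical relations, `EMPLOYEES` and `DEPARTMENTS`, and a hand-written join; it is a simplification for illustration (each output row gets a single set of witness tuples) and not the formal definition given by Buneman et al.

```python
# Hypothetical source relations; each tuple is keyed by an identifier.
EMPLOYEES = {
    "e1": {"name": "Ada", "dept": "d1"},
    "e2": {"name": "Bob", "dept": "d2"},
}
DEPARTMENTS = {
    "d1": {"dept": "d1", "city": "Paris"},
    "d2": {"dept": "d2", "city": "Oslo"},
}


def name_city_view():
    """Join employees with departments, project (name, city), and keep provenance."""
    rows = []
    for eid, emp in EMPLOYEES.items():
        for did, dep in DEPARTMENTS.items():
            if emp["dept"] == dep["dept"]:
                rows.append({
                    "value": (emp["name"], dep["city"]),
                    # why-provenance: the source tuples that justify this row
                    "why": {("EMPLOYEES", eid), ("DEPARTMENTS", did)},
                    # where-provenance: the source field each output field was copied from
                    "where": {
                        "name": ("EMPLOYEES", eid, "name"),
                        "city": ("DEPARTMENTS", did, "city"),
                    },
                })
    return rows


for row in name_city_view():
    print(row["value"], sorted(row["why"]), row["where"]["city"])
```

Here the provenance is computed eagerly alongside the query result. An approach based on an inverse function would instead recompute the contributing sources on demand from the derived tuple.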

researchers

resources

References

  1. Peter Buneman, Sanjeev Khanna, Wang Chiew Tan. Why and Where: A Characterization of Data Provenance, ICDT pp.316-330, 2001
  2. Yingwei Cui, Jennifer Widom, Janet L. Wiener. Tracing the lineage of view data in a warehousing environment, ACM Trans. Database Syst. 25 (2) pp.179-227, 2000
  3. Allison Woodruff, Michael Stonebraker. Supporting Fine-grained Data Lineage in a Database Visualization Environment, ICDE pp.91-102, 1997
  4. Wang Chiew Tan. Research Problems in Data Provenance, IEEE Data Eng. Bull. 27 (4) pp.45-52, 2004
  5. Boris Glavic, Klaus R. Dittrich. Data Provenance: A Categorization of Existing Approaches, BTW pp.227-241, 2007
  6. Wang Chiew Tan. Provenance in Databases: Past, Current, and Future, IEEE Data Eng. Bull. 30 (4) pp.3-12, 2007

Knowledge Provenance (AI)

Knowledge provenance (McGuinness and Pinheiro da Silva 2004; Fox and Huang 2004) focuses on issues of importance in knowledge base settings, which typically include those of importance in database settings as well as concerns arising from reasoning (potentially hybrid reasoning). For example, applications may need provenance for the results of text analytics programs that are integrated into knowledge bases and processed by first-order reasoners (Murdock et al. 2006). Provenance in distributed information systems (Weitzner et al. 2006) is an interesting direction in provenance research. Unlike many e-science workflows that simply compose services into a sequence, the workflow in such systems also involves many interactive communication protocols.

References

  1. Deborah L. McGuinness, Paulo Pinheiro da Silva. Explaining answers from the Semantic Web: the Inference Web approach, Journal of Web Semantics 1 (4) pp.397-413, 2004
  2. Mark S. Fox, Jingwei Huang. Knowledge Provenance, Canadian Conference on AI pp.517-523, 2004
  3. J. William Murdock, Deborah L. McGuinness, Paulo Pinheiro da Silva, Christopher A. Welty, David A. Ferrucci. Explaining Conclusions from Diverse Knowledge Sources, Proceedings of the 5th International Semantic Web Conference (ISWC2006) pp.861-872, 2006
  4. Daniel J. Weitzner, Harold Abelson, Tim Berners-Lee, Chris Hanson, James A. Hendler, Lalana Kagal, Deborah L. McGuinness, Gerald J. Sussman, K. Krasnow Waterman. Transparent Accountable Data Mining: New Strategies for Privacy Protection, Proceedings of AAAI Spring Symposium on The Semantic Web meets eGovernment, 2006

Research Directions

Provenance Metadata

  • reference information (aka digital object, statements)
  • reference and classify entities involved in information manipulation
  • annotate provenance attributes
  • represent information manipulation process in terms of plan and log
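
One way to picture these four items is as a small data model, sketched below. The class names, the `kind` values and the `unexecuted` helper are all hypothetical; they are not taken from the Open Provenance Model or PML, which define their own vocabularies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Something involved in information manipulation, referenced by identifier."""
    identifier: str
    kind: str  # hypothetical classification: "information", "agent" or "process"


@dataclass
class PlanStep:
    """What was intended to happen."""
    process: Entity
    inputs: list
    outputs: list


@dataclass
class LogEntry:
    """What actually happened when a plan step ran."""
    step: PlanStep
    agent: Entity
    attributes: dict = field(default_factory=dict)  # provenance annotations


def unexecuted(plan, log):
    """Return the plan steps that have no corresponding log entry."""
    executed = [entry.step for entry in log]
    return [step for step in plan if step not in executed]


raw = Entity("doc:raw", "information")
clean = Entity("doc:clean", "information")
report = Entity("doc:report", "information")
cleaner = Entity("proc:clean", "process")
renderer = Entity("proc:render", "process")
alice = Entity("agent:alice", "agent")

plan = [PlanStep(cleaner, [raw], [clean]), PlanStep(renderer, [clean], [report])]
log = [LogEntry(plan[0], alice, {"note": "example annotation"})]

print([step.process.identifier for step in unexecuted(plan, log)])
```

Keeping the plan separate from the log makes it possible to compare what was intended with what was recorded, as the `unexecuted` helper does.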


resources


References

  1. Luc Moreau, Juliana Freire, Joe Futrelle, Robert E. McGrath, Jim Myers, Patrick Paulson. The Open Provenance Model, University of Southampton, 2007
  2. Deborah L. McGuinness, Li Ding, Paulo Pinheiro da Silva, Cynthia Chang. PML 2: A Modular Explanation Interlingua, Proceedings of the 2007 Workshop on Explanation-aware Computing (ExaCt-2007), 2007

Provenance Computation

  • classify the computation on provenance metadata
  • list application domains and scenarios for provenance
  • provenance metadata management (storage, access, query)
  • provenance-aware user interaction


Provenance Systems

System  | Pvd                             | Representation scheme | Application domain  | Discussed by
CMCS    | Data-oriented                   | Annotation            | Chemical Sciences   |
Chimera | Process-oriented                | Annotation            | Physics, Astronomy  | Simmhan2005survey, Foster2002chimera, Foster2003virtual
ESSW    | Process-oriented, Data-oriented | Annotation            | Earth Sciences      |
LIP     | Data-oriented                   | Annotation            | GIS                 |
MyGrid  | Process-oriented                | Annotation            | Biology             | Simmhan2005survey, Greenwood2003provenance, Zhao2004using, Goble2002position, Zhao2003annotating, Zhao2004semantically, Stevens2003mygrid
PASOA   | Process-oriented                | Annotation            | Biology             | Simmhan2005survey, Miles2005requirements, Brase2004using, Groth2005recording, Groth2004protocol
Tioga   | Data-oriented                   | Inversion             | Atmospheric Science |
Trio    | Data-oriented                   | Inversion             | Generic             |


Literature Survey
