Jewett Meeting at MBL
| Jewett Meeting at MBL (Workshop) | |
| description | Data Provenance and Attribution for Published Datasets |
| location | Jonsson Center, Woods Hole, MA |
| tag | provenance; dataset; library |
- start date: April 9, 2009
- end date: April 10, 2009
- sponsor: MBLWHOI Library, Jewett Foundation
Login
location: http://tw.rpi.edu/portal/Jewett_Meeting_at_MBL
Shared wiki login account: Jewett (password: please contact baojie@cs.rpi.edu and dingl@cs.rpi.edu)
Create your own account:
- Please go to http://tw.rpi.edu/proj/portal.wiki/index.php?title=Special:UserLogin&type=signup
- Fill out the form and set "Your domain" to "local"
- Go back to the Jewett Meeting at MBL page
For a brief wiki editing tutorial, see [1] (YouTube, 4 minutes)
Attendees
| name | email | affiliation |
|---|---|---|
| Alice Orton | aorton@usgs.gov | usgs |
| Andy Maffei | amaffei@whoi.edu | whoi |
| Anthony Goddard | agoddard@mbl.edu | mbl |
| Arcot Rajasekar | rajaseka@email.unc.edu | unc |
| Art Gaylord | agaylord@whoi.edu | whoi |
| Arthur Newhall | anewhall@whoi.edu | whoi |
| Bob Groman | rgroman@whoi.edu | whoi |
| Cathy Norton | cnorton@mbl.edu | mbl |
| Cyndy Chandler | cchandler@whoi.edu | whoi |
| Deborah McGuinness | dlm@cs.rpi.edu | rpi |
| Diane Rielinger | drielinger@mbl.edu | mbl |
| Ed Urban | ed.urban@scor-int.org | scor |
| Gary Miller | gmiller@usgs.gov | usgs |
| Holly Miller | hmiller@mbl.edu | mbl |
| Jennifer Schopf | jms@nsf.gov | whoi/nsf |
| Kerstin Lehnert | lehnert@ldeo.columbia.edu | columbia |
| Li Ding | dingl@cs.rpi.edu | rpi |
| Lisa Raymond | lraymond@whoi.edu | whoi |
| Patrick West | westp@rpi.edu | rpi |
| Peter Fox | pfox@cs.rpi.edu | rpi |
| Peter Wiebe | pwiebe@whoi.edu | whoi |
| Ryan Schenk | rschenk@mbl.edu | mbl |
| Stephen Miller | spmiller@ucsd.edu | ucsd |
| Stephan Zednik | zednis@rpi.edu | rpi |
| Tom Moritz | moritz@archive.org | internet archive |
| Vicki Ferrini | ferrini@ldeo.columbia.edu | columbia |
Agenda
Thursday, April 9
Campfire chat room: https://mblwhoilibrary.campfirenow.com/37e51
Raw Campfire transcripts: April 9th Chat, April 10th Chat
2:00 pm Pre-Conference: Practicum Team meets and goes over their experience / Carriage House (team members only)
3:30 pm Shuttle Service from Inn on the Square to Jonsson Center
3:45 pm Shuttle Service from Inn on the Square to Jonsson Center
Conference Starts
4:00 pm Coffee/Tea, Workshop begins: Carriage House
4:00-5:00 pm Keynote by Deborah McGuinness (RPI)
5:00-5:30 pm Challenges by Cyndy Chandler (WHOI)
5:30-6:00 pm Goals - discussion Andy Maffei (WHOI)/Cathy Norton (MBLWHOI Library)
- focus is only on data behind a published journal article
- examine the attribution stream for this data: how is it cited?
- where do you store the metadata about this data?
- where do you store the data?
- what metadata is required around the metadata?
6:00 pm Cocktails and Dinner-- Main House
Friday, April 10th
7:30 am Shuttle Service from Inn on the Square to Jonsson Center
7:45 am Shuttle Service from Inn on the Square to Jonsson Center
7:30-8:30 am Breakfast at Jonsson Center / Main House
Jonsson Center / Carriage House
8:30-9:00 am Data Library by Lisa Raymond (MBLWHOI Library)
9:00-9:30 am Persistent Archives: Long Term Sustainability of data based on policy and data virtualization by Arcot Rajasekar (UNC)
9:30-10:00 am NSF Office of Cyberinfrastructure: What Are We Thinking About Data by Jennifer Schopf (NSF)
10:00-10:30 am Break
10:30-Noon Practicum - Use Cases
Noon Lunch - Jonsson Center/ Main House
1:00-1:30 pm Data Standards, Better Practices: US and others by Peter Fox (RPI)
1:30-3:00 pm - Use cases continued - followed by breakouts if necessary
3:00-3:30 pm Break
3:30-6:00 pm Consensus on Best Practices, and work on the white paper resulting from discussions
6:00 pm CLAMBAKE at Jonsson Center / Main House
Meeting Notes and Slides
Data Library by Lisa Raymond (MBLWHOI Library)
Post-meeting Documents
Transcribed Easel Sheets from Best Practices discussion
Draft World Data Center Certification Criteria
Use Cases
- UseCase #1 for Group Discussion: A scientist wants to find all tables and figures in papers published from the SW06 dataset that have sound speed profiles in them.
- UseCase #2 for Practicum Exercise: A scientist wants to publish the data associated with the article he is submitting to a journal on Acoustic Properties of Salpa thompsoni. What steps does he need to take, and what information does he have to collect about this data in order to submit it to the publisher?
NOTE: This is a real use case. We have an example of the steps Peter Wiebe took to do this, and the products will be available for workshop participants: Peter Wiebe's article, Acoustic properties of Salpa thompsoni, for which Neil Sarkar and Holly created Dublin Core metadata, with separate metadata for the text and each figure and table.
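The idea of separate Dublin Core metadata for the article text and for each figure and table can be sketched as follows. This is only an illustration, not the metadata that was actually created for the workshop: the `record` wrapper element, the `example:` identifiers, and the choice of DCMI types ("Dataset" for tables, "StillImage" for figures) are assumptions made for the example. The element names (`title`, `creator`, `type`, `identifier`, `relation`) come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical identifier; the real article identifier is not given on this page.
ARTICLE_ID = "example:wiebe-salpa-article"
ARTICLE_TITLE = "Acoustic properties of Salpa thompsoni"


def dc_record(fields):
    """Build one simple Dublin Core record from (element, value) pairs."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


# One record for the article text itself.
article = dc_record([
    ("title", ARTICLE_TITLE),
    ("creator", "Peter Wiebe"),
    ("type", "Text"),
    ("identifier", ARTICLE_ID),
])

# One record per figure or table, each pointing back at the article
# through dc:relation. Labels are the ones listed under "data" below.
PARTS = [
    ("Table 2", "Dataset"),
    ("Table 6", "Dataset"),
    ("Figure 3", "StillImage"),
    ("Figure 7", "StillImage"),
]

parts = [
    dc_record([
        ("title", f"{label}, {ARTICLE_TITLE}"),
        ("type", dc_type),
        ("identifier", f"{ARTICLE_ID}#{label.lower().replace(' ', '-')}"),
        ("relation", ARTICLE_ID),
    ])
    for label, dc_type in PARTS
]


def to_xml(record):
    """Serialize a record to an XML string."""
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(article))
    for part in parts:
        print(to_xml(part))
```

Keeping one record per figure or table, linked to the article by `dc:relation`, is what would let a search over the metadata return an individual table or figure rather than only the paper as a whole.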
template
Generic Data Pipeline
An example of a general data pipeline
data
A link to backbone data for Table 2
A link to backbone data for Table 2
A link to backbone data for Table 6
A link to backbone data for Figure 3
A link to backbone data for Figure 7
summary
| Name | Category | Description | Download |
|---|---|---|---|
| File:CTD085.txt | Category:DataFile Category:Thing | Table 2 backbone data for CTD cast 85 (one of four) used to compute the mean and max/min water properties where salps were collected for experimental work as presented in Table 2. | |
| File:CTD087.txt | Category:DataFile Category:Thing | Table 2 backbone data for CTD cast 87 (one of four) used to compute the mean and max/min water properties where salps were collected for experimental work as presented in Table 2. | |
| | Category:ImageFile Category:Thing | Figure 3. Backbone data consist of a series of JPEG images that were used in the analysis. | |
| JMBL20090410 Example Figure 3 | Category:Dataset Category:Thing | the dataset represented by Figure 3 | |
| JMBL20090410 Example Step 21 | Category:Step | backbone data used as input to compute the mean and max/min water properties where salps were collected for experimental work as presented in Table 2, which is the output | |
| JMBL20090410 Example Step 31 | Category:Step | part of Figure 3 | |
| JMBL20090410 Example Table 2 | Category:Dataset Category:Thing | the dataset represented by Table 2 | |
| File:Salp200-1 selection inner part 2.xls | Category:DataFile Category:Thing | Figure 7: TS-distributions for targets ascribed to salps. Plot for 200 kHz from data in this file. | |
| File:Salp38 1-selection inner part 2.xls | Category:DataFile Category:Thing | Table 6: Summary statistics of mean TS, confidence interval for the mean, 25 and 75% quartiles and Q75 – Q25 as a measure of spread derived for 38 kHz upper data portion from this file. | |
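The summary above describes a small provenance chain: data files feed a step, and the step produces a published dataset. A minimal sketch of how such a chain could be recorded and walked backwards from a table to its backbone files is given below. The triple layout and the function names are assumptions for illustration, not the wiki's actual data model, and only the two CTD casts listed on this page are included, although the descriptions say four casts were used.

```python
from collections import defaultdict

# (input, step, output) triples taken from the summary table.
# Only two of the four CTD casts are listed on this page.
DERIVATIONS = [
    ("File:CTD085.txt", "JMBL20090410 Example Step 21", "JMBL20090410 Example Table 2"),
    ("File:CTD087.txt", "JMBL20090410 Example Step 21", "JMBL20090410 Example Table 2"),
]


def build_index(derivations):
    """Map each output to the (step, input) pairs that produced it."""
    inputs_of = defaultdict(list)
    for source, step, output in derivations:
        inputs_of[output].append((step, source))
    return inputs_of


def trace_sources(item, inputs_of):
    """Return the items with no recorded inputs that `item` was derived from."""
    roots, seen, stack = set(), set(), [item]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in a badly formed record
        seen.add(current)
        upstream = inputs_of.get(current, [])
        if not upstream:
            roots.add(current)
        stack.extend(source for _step, source in upstream)
    return roots


if __name__ == "__main__":
    index = build_index(DERIVATIONS)
    for name in sorted(trace_sources("JMBL20090410 Example Table 2", index)):
        print(name)
```

An item that has no recorded inputs is treated as original data, so tracing a raw file simply returns the file itself.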
Supplementary Use Cases
UseCase A
A paper is to be published in DSR II and the author needs to know how to reference the data that are available online. As the data manager, I need to know whether I need to do anything differently in how the source data are documented and served (additional metadata?, persistent identifiers?).
The paper (published 2008 in DSR II): Qian P. Li, Dennis A. Hansell, Nutrient distributions in baroclinic eddies of the oligotrophic North Atlantic and inferred impacts on biology, Deep Sea Research Part II: Topical Studies in Oceanography, Volume 55, Issues 10-13, Mesoscale Physical-Biological-Biogeochemical Linkages in the Open Ocean: Results from the E-FLUX and EDDIES Programs, May-June 2008, Pages 1291-1299, ISSN 0967-0645, DOI: http://dx.doi.org/10.1016/j.dsr2.2008.01.009 URL: http://www.sciencedirect.com/science/article/B6VGC-4SFR7MF-5/2/b08137059737fef3a654b2fd7897d4fb
that references data that are available online from BCO-DMO: http://osprey.bco-dmo.org/project.cfm?flag=viewd&id=13
the likely source data objects for the paper are listed below: http://ocb.whoi.edu/jg/serv/OCB/EDDIES/INVENTORY.html1
| Measurement | PI_name | Data object URL |
|---|---|---|
| OC404-1 | | |
| bottle (merged) | OCB_DMO | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S1/bottle_OC404_S1.html0 |
| bottle oxygen | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S1/oxygen.html1 |
| nM NO3/PO4 | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S1/nuts_low.html0 |
| DOC; DON; DOP | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S1/organic_matter.html0 |
| del15N (PON) | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S1/del15N-PON.html0 |
| WB0409 | | |
| Niskin bottle samples | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0409/bottle.html0 |
| bottle oxygen | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0409/bottle.html0 |
| DOC; DON; DOP | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0409/organic_matter.html0 |
| del15N (PON) | Hansell | data not contributed |
| OC404-4 | | |
| bottle file (base) | McGillicuddy | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S2/bottle.html0 |
| bottle oxygen | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S2/oxygen.html1 |
| nM NO3/PO4 | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S2/nuts_low.html0 |
| DOC; DON; DOP | Hansell | data not contributed |
| del15N (PON) | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC404_S2/del15N-PON.html0 |
| WB0413 | | |
| Niskin bottle samples | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0413/bottle.html0 |
| bottle oxygen | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0413/bottle.html0 |
| DOC; DON; DOP | Hansell | data not contributed |
| del15N (PON) | Hansell | data not contributed |
| OC415-1 | | |
| bottle file (base) | McGillicuddy | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC415_S1/bottle.html1 |
| nM NO3/PO4 | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC415_S1/nanoNutrients.html0 |
| DOC; DON; DOP | Hansell | data not contributed |
| del15N (PON) | Hansell | data not contributed |
| WB0506 | | |
| Niskin bottle samples | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0506/bottle.html0 |
| bottle oxygen | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0506/bottle.html0 |
| DOC; DON; DOP | Hansell | data not contributed |
| del15N (PON) | Hansell | data not contributed |
| OC415-2 | | |
| bottle file (base) | Ledwell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC415_T1/bottle.html1 |
| OC415-3 | | |
| bottle file (base) | McGillicuddy | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC415_S2/bottle.html1 |
| nM NO3/PO4 | Hansell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC415_S2/nanoNutrients.html0 |
| DOC; DON; DOP | Hansell | data not contributed |
| del15N (PON) | Hansell | data not contributed |
| WB0508 | | |
| Niskin bottle samples | Bates | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/WB0508/bottle.html0 |
| DOC; DON; DOP | Hansell | data not contributed |
| del15N (PON) | Hansell | data not contributed |
| OC415-4 | | |
| bottle file (base) | Ledwell | http://ocb.whoi.edu/jg/serv/OCB/EDDIES/OC415_T2/bottle.html1 |
UseCase B
A scientist has found a sound profile represented as a graph in a paper that he feels justifies a hypothesis he has put forward. He wants to get access to the original sensor data related to that sound profile. How does he do this?
UseCase C
A scientist has 10,000 images on slides sitting on his shelf, representing 10 years of work, that he wants to digitize. How can he obtain metadata for data collected before best practices for metadata were considered? Is it even worth the effort?
UseCase D
A scientist has written a paper with data that s/he would like to publish but access to the data is restricted or the use of the data is restricted, for some period of time. The publisher has requested that all data represented as figures or tables in this journal be "properly cited" with repository access.
UseCase E
A scientist wants to find all the data associated with a specific harmful algal bloom. He is interested in both original data and derived data that have been published in articles and deposited. He wants to be able to determine who collected the original data and who analyzed and processed the data. He will then publish a review article that will contain a synthesis of this information. How will he find everything (what metadata, connections, organization will be needed in an 'ideal world')? How will he know who should receive attribution? How will he publish and maintain attribution on his data synthesis product once he publishes it?
Suggested Preparation Materials for the Meeting
- National Science and Technology Council Releases Strategy for Digital Scientific Data.
The National Science and Technology Council (NSTC) released a report describing a strategy to promote preservation and access to digital scientific data. The report, Harnessing the Power of Digital Data for Science and Society, was produced by the NSTC's Committee on Science under the auspices of the Office of Science and Technology Policy (OSTP) in the Executive Office of the President.
- The open and timely publication of digital scientific data called for in the report will ... More at http://www.nsf.gov/news/news_summ.jsp?cntn_id=114448&govDel=USNSF_51
- Survey of data provenance techniques. Technical Report IUB-CS-TR618
http://www.cs.usask.ca/faculty/sal426/Provenance/docs/Literature%20Review/TR618.pdf
- ICSU Ad Hoc Strategic Committee on Information and Data
http://www.icsu.org/Gestion/img/ICSU_DOC_DOWNLOAD/2123_DD_FILE_SCID_Report.pdf
- Sudha Ram, Jun Liu. Understanding the Semantics of Data Provenance to Support Active Conceptual Modeling
http://en.scientificcommons.org/41046974
- Fox, McGuinness, Pinheiro da Silva. Knowledge Provenance in Virtual Observatories: Applications to Image Data Pipelines, 2008.
http://data.semanticweb.org/conference/iswc/2008/paper/poster_demo/70/html
- Pinheiro da Silva, McGuinness, McCool. Knowledge Provenance Infrastructure.
http://en.scientificcommons.org/685801
- Clifford Lynch. “The Shape of the Scientific Article in the Developing Cyberinfrastructure,” CTWatch Quarterly (August 2007)
- Skills, Role & Career Structure of Data Scientists & Curators: Assessment of Current Practices & Future Needs. JISC Report 2008.
http://www.jisc.ac.uk/publications/publications/dataskillscareersfinalreport.aspx
- Baker, Barton, Peterson, Fox. Informatics and the 2007-2008 Electronic Geophysical Year. EOS, Transactions, American Geophysical Union 89(48) 2008.
http://www.agu.org/pubs/crossref/2008/2008EO480001.shtml (subscription)
- Gomes, Graybeal and O'Reilly. Data Management Issues in Operational Ocean Observatories.
Sea Technology 48(5) p.17-20, 2007 http://www.highbeam.com/doc/1P3-1284688471.html (subscription)
- Altman and King. A Proposed Standard for the Scholarly Citation of Quantitative Data
http://gking.harvard.edu/files/cite.pdf
- Trustworthy Repositories Audit & Certification: Criteria and Checklist
http://www.crl.edu/PDF/trac.pdf
- SCOR/IODE Workshop on Data Publishing, Oostende, Belgium, 17-19 June 2008. UNESCO, 2008. IOC Workshop Report No. 207.
http://www.iode.org/index.php?option=com_oe&task=viewDocumentRecord&docID=2457
- Standards for DATA
A Proposed Standard for the Scholarly Citation of Quantitative Data http://gking.harvard.edu/files/cite.pdf
- ISO 8000 under development
http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50801
ISO 8000 - A Standard for Data Quality by Grantner, Emily
Solving Data Quality Problems Using Data Standards by de Jager, Salomon
- ISO 19115
http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=26020
- ISO 19115:2003 defines the conceptual model required for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data.
- ISO/IEC 11179
http://metadata-standards.org/11179/
This standard addresses the semantics and representation of data, and the registration of descriptions of that data. The standard has strong international backing and is freely available.
ISO/IEC 11179 specifies the kind and quality of metadata necessary to describe data, and it specifies the management and administration of that metadata in a metadata registry (MDR). It applies to the formulation of data representations, concepts, meanings, and relationships between them to be shared among people and machines, independent of the organization that produces the data. It does not apply to the physical representation of data as bits and bytes at the machine level.
In ISO/IEC 11179, metadata refers to descriptions of data. ISO/IEC 11179 does not contain a general treatment of metadata. ISO/IEC 11179-1:2004 provides the means for understanding and associating the individual parts of ISO/IEC 11179 and is the foundation for a conceptual understanding of metadata and metadata registries.

