Persistent Archives: Long Term Sustainability of data based on policy and data virtualization by Arcot Rajasekar (UNC)
From Semantic Portal Wiki
Slide 1 - Persistent Archives
Vicki F. R2K Data Compliance Plan http://www.ridge2000.org/science/downloads/R2KStCom.pdf
Slide 2 - Topics
Data grids for preservation and sharing
Two examples
-DIGARCH
-TPAP
Slide 3 - Data Preservation Challenges
Massive data, millions of files
Cannot look at data by filename, cruise; need metadata
person responsible can remember for 2 weeks at the most
Long term preservation is more than maintaining the data
need migration across media types and application types
Tom M. "Conservators face issues in preserving video" [Getty Research Institute] - Los Angeles Times msl1.mit.edu/furdlog/docs/latimes/2008-04-30_latimes_tech_art_aging.pdf
ex: old Word documents no longer readable
Slide 4 - What is a data grid?
To get data, you need to contact the scientist, need to do a lot to get to it
We would like to share data more easily, and how do we do it?
It is a paradigm shift
Tom M. : “A mishmash of non-standardized databases of raw results and unevenly reported study designs is not a strong foundation for clinical research data sharing.” Sim, et al “Keeping Raw Data in Context” (letter to) Science VOL 323 6 FEBRUARY 2009 www.sciencemag.org
Slide 5 - What is a data grid?
Data grid is an important aspect of how to do that
a network of data that is presented as a single accessible collection of data
you search for data by its properties, not where it is
it enables discovery and access
Now can do server side processing
iRODS - https://www.irods.org/pubs/DICE_irods-desc.pdf
to make data in a more accessible/usable form
Ex: convert tif to jpg so browser can display it
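A minimal sketch of "search for data by its properties, not where it is", assuming a toy in-memory catalog. The class names, logical paths, storage locations, and metadata keys are all hypothetical and are not the iRODS API.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_name: str        # name users see in the shared collection
    physical_location: str   # where the bytes live; never used for searching
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy metadata catalog: one logical collection over many storage sites."""

    def __init__(self):
        self._objects = []

    def register(self, obj):
        self._objects.append(obj)

    def find(self, **properties):
        """Return logical names of objects whose metadata matches every property."""
        return [
            o.logical_name
            for o in self._objects
            if all(o.metadata.get(k) == v for k, v in properties.items())
        ]


catalog = Catalog()
catalog.register(DataObject("/grid/cruise42/ctd_001.dat", "siteA:/disk3/f81.dat",
                            {"instrument": "CTD", "cruise": "42"}))
catalog.register(DataObject("/grid/cruise42/img_007.tif", "siteB:/tape/x19.tif",
                            {"instrument": "camera", "cruise": "42"}))
catalog.register(DataObject("/grid/cruise43/ctd_002.dat", "siteB:/tape/x20.dat",
                            {"instrument": "CTD", "cruise": "43"}))

# The query names properties only; the caller never sees siteA or siteB.
ctd_files = catalog.find(instrument="CTD")
```

The point of the sketch is that the physical location is stored but plays no part in discovery, so data can move between sites without breaking searches.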
Policy virtualization - can set policies as a data manager
can automatically check for metadata and attach
Slide 6 - Why Data Grids?
Data virtualization - shared collection concept
Common abstract name spaces - physical independence
global user name spaces, shared resources and uniform access
ex: make a virtual collection/view based on geolocation
Need to be able to do it in a generic fashion
Slide 7 - Why data grids?
Technology independence
-platform and vendor independence
Common typing conventions for objects and actions, use an ontology
Is setting a goal of geo-referencing all datasets at WHOI an effective catalyst to establishing citations and other metadata?
Provide technology independence
middleware that hides the different filetypes
scalability - should be able to keep on adding to it
Need discovery metadata
providing a system where data and metadata are together is important
the data grid attaches data and metadata together
then when you access data, already have metadata
the grid is not just spatial but also temporal, all data in a given time
convert metadata formats automatically
Slide 8 - Why Data Grids - Policy
Policy-Virtualization: Automate Operations
Policy of integrating metadata and data policy is important
Andrew M. Arcot worked on Storage Resource Broker (SRB)
bottleneck is not just integration
putting the rules in the datagrid
System-centric Policies
-manage retention, chain of custody, integrity
?? How much of this is in iRODS right now?
Arcot Rajasekar: much of it is in there now
Another example is medical data, many steps to transfer data from one doctor to the next
assuring the descriptive data is up to standards
ex: automatically checking metadata file format or structure
Slide 9 - Why data grids - Policy
audit trails
-keep track of everything that happens to the data while it is in the system: reads, changes, etc.
Domain Specific data
-ex: extraction of metadata, astronomy data from telescope, data in image
a single file that contains data and metadata
Ingestion control for provenance attribution
processing of data on ingestion
post processing of data on access
redaction, copy of file not original file
ex: show only a set of files for elementary school student with language level they can understand
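A minimal sketch of the audit trail idea noted above (recording every read and change), assuming a toy in-memory store. The class, method, and field names are hypothetical and only illustrate the bookkeeping, not how iRODS implements it.

```python
from datetime import datetime, timezone


class AuditedStore:
    """Toy store that logs every create, update, and read."""

    def __init__(self):
        self._data = {}
        self.audit_log = []

    def _record(self, user, action, name):
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "object": name,
        })

    def put(self, user, name, content):
        # Distinguish a first ingest from a later change.
        action = "update" if name in self._data else "create"
        self._data[name] = content
        self._record(user, action, name)

    def get(self, user, name):
        content = self._data[name]      # raises KeyError if the object is unknown
        self._record(user, "read", name)
        return content


store = AuditedStore()
store.put("alice", "ctd_001.dat", b"raw")
store.get("bob", "ctd_001.dat")
store.put("alice", "ctd_001.dat", b"calibrated")
```

Because the log is written inside `put` and `get`, users cannot touch the data without leaving a record, which is the property a chain of custody depends on.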
??Ingestion control, categories
Arcot Rajasekar: MRI data, certain processing needs to be done, i.e. remove the skull before it can be shared
Arcot Rajasekar: Mentioned that they have ingested MRI data in the past.
Different type of data processing per discipline or institution
Cathy Norton: Hospital blood test - 297 have access to data but they don't get access to all parts of the 'file'
Some people look at some modules, some look at others
Ex: chain of custody etc has mostly been done on paper. It is difficult to convert that to policies/rules
DM: are either of you aware of standards
Tom Moritz: FBI had detailed specifications. don't know if they are adequate for our needs
ex: standards that are required for an image to be considered 'evidence'
data is evidence
Arcot Rajasekar: digital signature
DM: we want to follow up on this and make a recommendation for our community for this
Tom Moritz: data as evidence is an important thing to remember
Slide 10 - Preservation is an Integral Part of the Data Life Cycle
-organize
-publish
-enable data discovery
-preserve reference collection
-associate new collections against prior data
-define and enforce policies for long term management and curation
give different viewpoints of the same information
need a LOT of metadata to make this possible
Slide 11 - Exemplar Data Grid
iRODS
-Integrated Rule Oriented Data System
-data grid -> data virtualization
-add subscriptions based on metadata query
-policy virtualization
Andrew M. REMEMBER - TM: Followup on "data as evidence" and associated specifications in other disciplines (law, medicine, journalism, ...)
-domain centric rules applied automatically
Slide 12 - Policy/Rule Examples
examples of policy/rule
ex: check file type.. it is an MRI... so do a certain type of processing
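A minimal sketch of the "check file type, then do a certain type of processing" rule, written in plain Python rather than the iRODS rule language. It assumes the file type can be read from the extension; the `.mri` extension, the rule registry, and the function names are hypothetical.

```python
import os

# Registry mapping a file type to the processing rule for that type.
RULES = {}


def rule_for(file_type):
    """Decorator that registers a function as the rule for one file type."""
    def decorator(func):
        RULES[file_type] = func
        return func
    return decorator


@rule_for("mri")
def process_mri(name):
    return f"{name}: queued for MRI-specific processing"


@rule_for("tif")
def process_tif(name):
    return f"{name}: queued for conversion to jpg"


def on_ingest(name):
    """Pick the rule by file extension; store unchanged if no rule applies."""
    file_type = os.path.splitext(name)[1].lstrip(".").lower()
    action = RULES.get(file_type)
    if action is None:
        return f"{name}: no rule for this file type, stored as-is"
    return action(name)
```

Each discipline or institution would add its own entries to the registry, so the ingest path itself stays generic.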
Tom M. "Making clinical data widely available: granting public access to drug trial results and sharing patient data among researchers..." http://www.sciencemag.org SCIENCE VOL 322 10 OCTOBER 2008 pp 217-ff.
CN: All our data files and the instruments that collect them are unique
Andrew M. This is unlike DICOM files for MRI (and associated metadata) where things are more standardized.
Arcot Rajasekar: they may be unique but you can make a policy/rule for that type of file
Notify owner if metadata is missing from the file after ingestion
Peter Wiebe: how do you know metadata is missing
Arcot Rajasekar: domain specific microservices
Peter Wiebe: so you have a template of what md is needed and then check that
Arcot Rajasekar: there could be optional and required metadata
Tom M. Forensic Science Communications http://www.fbi.gov/hq/lab/fsc/backissu/oct2004/research/2004_10_research01.htm [03 04 09] Forensic Science Communications October 2004 – Volume 6 – Number 4 Research and Technology “Information Assurance Applied to Authentication of Digital Evidence”
Peter Wiebe: so for each unique instrument, need to create a template for required md
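A minimal sketch of the template idea from this exchange: one template per instrument listing required and optional metadata, checked after ingestion, with the owner notified if required fields are missing. The instrument name, field names, and message format are hypothetical; in iRODS this kind of check would live in a domain specific microservice.

```python
# One template per instrument type (all names here are illustrative).
TEMPLATES = {
    "ctd": {
        "required": {"cruise", "latitude", "longitude", "time"},
        "optional": {"operator"},
    },
}


def check_metadata(instrument, metadata):
    """Return (missing required fields, missing optional fields), both sorted."""
    template = TEMPLATES[instrument]
    present = set(metadata)
    missing_required = sorted(template["required"] - present)
    missing_optional = sorted(template["optional"] - present)
    return missing_required, missing_optional


def notify_owner(owner, name, missing_required):
    """Build the notification text; returns None when nothing required is missing."""
    if not missing_required:
        return None
    return f"To {owner}: {name} is missing required metadata: {', '.join(missing_required)}"


missing_req, missing_opt = check_metadata(
    "ctd", {"cruise": "42", "latitude": 41.5, "time": "2009-03-04T10:00Z"}
)
message = notify_owner("pwiebe", "ctd_001.dat", missing_req)
```

Only missing required fields trigger a notification; missing optional fields are reported by `check_metadata` but do not block anything.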
Cathy Norton + others: USGS uses FC?? but will switch to ISO????
Slide 13 - Overview of iRODS Data System
track the data, track the policies, track the metadata
Virtualization is important, that is where we take out the physical and make the rules
Data grids are distributed, many different sites
Slide 14 - iRODS Applications
Institutional repositories that use iRODS
ex: Duke Medical Archive
ex: Carolina Digital Repository
National data grids
NARA Transcontinental Persistent Archive Prototype (TPAP)
also examples of international data grids
FGDC is going to change to ISO 19115
http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=53798
French National Library
ISO MOIMS repository assessment criteria
examples of rules
150 rules that implement the assessment criteria
write rules into cases and then convert to iRODS rules
ex: Verify descriptive metadata against an ontology
Slide ?? - DIGARCH case study
Slide ?? - Our Proposal
To develop a prototype for preserving digital video collections
reels stored in cupboard
read and digitize
used hps
preservation life-cycle meshing with production
Slide ?? - Exemplar Collection
new video still being created
not just video but also emails, images of book cover, databases, text transcripts, hd video, web quality video etc
Slide ?? - (Block Diagram) Different steps for TV production lifecycle
do metadata analysis, schema generation
archival process automatic
ex: gather emails, blog posts as generated and put into archive with metadata
Slide ?? - Preservation Process
Slide ?? - Utility of Data Grids
enforce metadata requirement policies
Tom M. Moore, R., “Building Preservation Environments with Data Grid Technology”, American Archivist, vol. 69, no. 1, pp. 139-158, July 2006
??: how did you do the automatic archiving of email
Arcot Rajasekar: only looked at one guy's collection, based on who it was sent to

