SPCDIS Working Group - COMP Parallel Pipeline

Documented here is the parallel pipeline that TWC created from HAO's processing pipeline for CoMP data. We took their (mostly IDL) code and added provenance capture statements. These statements generate a log file that can then be parsed, converted into RDF, ingested into a triple store, and displayed to users.

Where do we capture provenance?

IDL Component

Pipeline software developed at HAO is written in IDL (the Interactive Data Language). During the summer of 2011, James developed an API for logging information about data products generated by both the CoMP and CHIP pipelines. In its current version, the logging API is designed to support a mixture of (i) high-level provenance recording and (ii) annotation of high-level processes with the kinds of processing activities they carry out. This approach greatly simplifies any maintenance of the logging routines by HAO staff, compared with maintaining routines that generate a low-level provenance trace.

I've defined the following components in the current IDL-based logging API (a usage sketch follows the list):

log_artifact.pro - logs files generated by CoMP
log_process.pro - logs pipeline processes executed by CoMP, at the level of IDL scripts (e.g., an execution of Demod.pro)
log_activity.pro - logs processing activities carried out by pipeline processes
log_observations.pro - logs observations of the solar corona, made by CoMP
log_entity.pro - logs data entities (i.e., the results of CoMP observations)
log_dataset.pro - logs a set of data entities, gathered over the course of one processing day
log_qualityassertion.pro - logs assertions of a quality metric (and corresponding score), applied to a data entity
log_qualityevidence.pro - logs evidence used by a quality assertion
log_fitsheader.pro - logs header entries for FITS files
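
To illustrate, here is a minimal sketch of how these routines might be called from within a pipeline script. The argument names and signatures below are illustrative assumptions, not the actual API.

    ; Hedged sketch -- routine signatures and arguments are assumed.
    log_process, 'Demod'                          ; an execution of Demod.pro
    log_activity, 'Demod', 'demodulation'         ; a processing activity within it
    log_artifact, 'example_output.fts'            ; a file generated by the process
    log_qualityassertion, 'example_output.fts', $
        'background', 5.0                         ; quality metric and score (assumed)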

Each logged statement includes varying kinds of information; however, they all have two things in common: a line number and an entry label (name). Additionally, each of the functions above depends on its own IDL common block, each of which has the following variables (a sketch of one such block follows the list):
counter - current ID count (incremented for each statement logged)
NameRegistry - hash map for looking up an artifact's name by its ID
IDRegistry - hash map for looking up an artifact's ID by its name
LUN - Logical Unit Number for the output file
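
A minimal sketch of what one of these common blocks could look like, assuming IDL 8 hash objects; the block name and log file name are illustrative:

    ; Hedged sketch of a per-function common block; names are assumptions.
    common log_artifact_block, counter, NameRegistry, IDRegistry, LUN

    counter      = 0L       ; running ID count, incremented per logged statement
    NameRegistry = hash()   ; ID -> name lookups (IDL 8 hash object)
    IDRegistry   = hash()   ; name -> ID lookups
    openw, LUN, 'artifact.log', /get_lun   ; LUN for this function's output file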

Using the NameRegistry and IDRegistry hash maps, both ID and name lookups can be performed across IDL script executions, which helps log entries reference one another.
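
For example, a logging routine could resolve a reference to a previously logged entry roughly as follows (again a sketch, not the actual implementation):

    ; Hedged sketch: reuse an existing ID if the name has been seen before.
    if IDRegistry.HasKey(name) then begin
       id = IDRegistry[name]         ; name -> ID lookup
    endif else begin
       counter = counter + 1         ; mint a new ID for this entry
       id = counter
       IDRegistry[name] = id
       NameRegistry[id] = name       ; ID -> name lookup for later reference
    endelse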

Finally, the following additional scripts are used to support the logging API (a sketch of where they fit in a processing run follows the list):
comp_initialize.pro - a modified version of HAO's original comp_initialize.pro, extended to open each of the log LUNs
init_logblocks.pro - initializes each log block for the logging functions
stop_logger.pro - closes each of the log LUNs
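
Putting these together, a day's processing run might be wrapped roughly like this; the call order and the date_dir argument are assumptions:

    ; Hedged sketch of a processing run with logging enabled.
    comp_initialize, date_dir   ; modified initializer; also opens the log LUNs
    init_logblocks              ; initialize counters and registries in each log block
    ; ... run the day's pipeline scripts (e.g., Demod.pro) ...
    stop_logger                 ; close each of the log LUNs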

Format of Logs