Second Provenance Challenge: PASS

Notes

The PASS entry for the Second Provenance Challenge is based on PASSv2, not PASSv1. The real kernel-based PASSv2 is not ready yet, so the data we've posted comes from a tool based on system-call traces instead. (Details are posted with the data.)

This toolset is semantically compatible with PASSv2, but because it works differently (and runs on a different OS) there will likely be minor differences from "real" data generated by the in-kernel system.

Workload Stages

Because PASS is always on, we don't have workloads as such. The most difficult part of preparing the data was separating it into stages in a way that would make each stage complete and self-consistent and also allow stitching the stages back together -- that is, making sure that details such as the internal identifiers for objects are consistent across stages. The technique we settled on was to load ten databases, each containing a prefix of the complete execution: starting from the very beginning and continuing to the end of one of the stages. We then dump all ten databases and -- very carefully -- extract the differences between them.
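The differencing step described above can be sketched roughly as follows. This is a minimal illustration, not the tooling we actually used: it assumes each dump can be reduced to a set of hashable records, and the record shape (pnode, version, key, value) and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record shape: (pnode, version, key, value).
Record = Tuple[int, int, str, str]


def split_into_stages(cumulative_dumps: Iterable[Set[Record]]) -> List[Set[Record]]:
    """Turn cumulative dumps into per-stage record sets.

    Dump i is assumed to cover the execution from the very beginning
    through the end of stage i, so stage i is whatever dump i adds
    on top of dump i-1.
    """
    stages: List[Set[Record]] = []
    previous: Set[Record] = set()
    for dump in cumulative_dumps:
        current = set(dump)
        # Each dump must contain everything the previous one did;
        # otherwise identifiers drifted and the stages will not stitch.
        if not previous <= current:
            raise ValueError("dump does not extend the previous dump")
        stages.append(current - previous)
        previous = current
    return stages


def stitch(stages: Iterable[Set[Record]]) -> Set[Record]:
    """Paste stages back together by taking the union of their records."""
    merged: Set[Record] = set()
    for stage in stages:
        merged |= stage
    return merged
```

Under these assumptions, stitching all stages back together reproduces the last cumulative dump, which is a handy sanity check when importing the data.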

Six of the ten stages are the two executions of the challenge workload; two represent compiles (of AIR and our fake slicer) and the remaining two reflect setting up the challenge workload runs. The compile and setup stages are included to mirror the data we generated for the first challenge.

You should be able to paste either or both compile stages together with the workload executions the same way you paste together the stages of the workload executions themselves. We encourage all of you to take a shot at processing this data in your system.

You will notice that the provenance information for the AIR compile is vastly larger than that for the rest of the workload. AIR is, however, a very small software package. The provenance for the build of GNU awk, which is itself a small package, is twice the size of that for AIR; the provenance for a build of a large package is... vastly larger. If time permits we may post XML dumps for some of the large traces we've been working with, for your amusement; if you're interested in finding out how your system scales you may find them of some use...

Data model

The PASSv2 data model is extended from the PASSv1 model, reflecting our experience working with PASSv1. (We hope to have corrected the most severe shortcomings of PASSv1...) The most notable changes are:

The basic unit of provenance is a provenance record, which is a key-value pair. The value can either be a plain ("flat") value, which for external purposes is always a string, or it can be a cross-reference.

Provenance describes things; we call the things described provenanced objects or just objects. Objects are organized into a series of versions as they are modified over time. (When exactly are new versions created? That turns out to be a subtle and vexing problem; a discussion may be found in our IPAW 2006 paper.)

Many kinds of objects can be recycled -- e.g. a file can be truncated to zero length, a process can call execve, and so forth. We sometimes call the sequence of versions between recycle operations a generation. (In fact, we try to avoid using this word very much, as there are already too many overloaded uses of the term "generation" or "generation number" in file systems; we often refer to "objects" instead and pretend generations don't really exist.)

Provenance that is recorded is thus stored in records that are each attached to some object (or technically, some generation), or to some version of that object. We call the records that attach to the object itself identity information, because they record information about the object itself. We call records that attach to specific versions ancestry information, because they record the flow of data through the system as the objects are modified. That is, the object's identity doesn't change as it's modified, but the history of its contents does. (This is a somewhat oversimplified view; it doesn't, for example, take file renames into full account. We're still working on how best to account for renames.)

Note that identity information is generally flat records and ancestry information is generally cross-references; however, the opposite combinations are both possible and meaningful.

A pnode is a storage-level construct that holds the provenance for a series of versions of a single generation of an object. Every provenanced object for which a representation is required on a particular volume has a pnode with a unique (for that volume) pnode number.

A cross-reference thus contains a pnode number and version number, and therefore points to the (generation of the) object named by the pnode number and specifically the named version number within that (generation of the) object.

The XML dump is organized into <provenance> blocks, each of which contains all the records associated with a particular pnode and version. Versions start with 1. Version 0 is reserved as a slot for holding the records that describe the whole (generation of the) object.
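As a rough sketch of how the model described so far might be held in memory once loaded: the class names, field names, and the example keys NAME and INPUT below are assumptions made for illustration and are not taken from the dump format itself.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple, Union


@dataclass(frozen=True)
class CrossRef:
    """Points at one version of the object held by a pnode."""
    pnode: int
    version: int


@dataclass(frozen=True)
class Record:
    """A provenance record: a key-value pair attached to (pnode, version)."""
    pnode: int
    version: int                 # 0 = whole object, 1 and up = a specific version
    key: str
    value: Union[str, CrossRef]  # flat string or cross-reference


def is_identity(record: Record) -> bool:
    # Records in the version 0 slot describe the object itself.
    return record.version == 0


def is_cross_reference(record: Record) -> bool:
    return isinstance(record.value, CrossRef)


def group_blocks(records: Iterable[Record]) -> Dict[Tuple[int, int], List[Record]]:
    """Group records by (pnode, version), one group per <provenance> block."""
    blocks: Dict[Tuple[int, int], List[Record]] = defaultdict(list)
    for record in records:
        blocks[(record.pnode, record.version)].append(record)
    return dict(blocks)
```

For example, Record(12, 0, "NAME", "out.img") would be a flat identity record, and Record(12, 2, "INPUT", CrossRef(9, 1)) would be an ancestry record saying that version 2 of pnode 12 depends on version 1 of pnode 9.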

Note that the existence of a version 2 does not imply the existence of a version 1. Also note that if a cross-reference points to an object that was created outside the current workload stage, there may be no records available for that particular version. (However, the version 0 identity records should always be present.)
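A consumer of the data therefore needs to tolerate missing version blocks when following cross-references. One way to do this, sketched under the assumption that the dump has been loaded into a dictionary keyed by (pnode, version), is shown below; the store layout and the function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory store: (pnode, version) -> list of (key, value) pairs.
Store = Dict[Tuple[int, int], List[Tuple[str, str]]]


def resolve(store: Store, pnode: int, version: int
            ) -> Tuple[List[Tuple[str, str]], Optional[List[Tuple[str, str]]]]:
    """Follow a cross-reference to (pnode, version).

    Returns (identity_records, version_records). version_records is None
    when the stage holds no block for that version, for instance because
    the object was created outside the current workload stage.
    """
    # Version 0 holds the identity records; these should always be present,
    # but fall back to an empty list rather than failing if they are not.
    identity = store.get((pnode, 0), [])
    version_records = store.get((pnode, version))
    return identity, version_records
```

With this shape, a caller can still name the referenced object from its identity records even when nothing is known about the particular version pointed to.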

The attributes (record types) found in the data are slightly different from PASSv1 and are as follows:

Notes: