About PQL

top | about (long) | download | documentation | source

 
   

PQL, or Path Query Language or "pickle", is a database query language designed for working with graph-structured data. Specifically, it is for handling ordinary scalar data contained in a database whose structure is a graph of objects and relationships instead of a system of tabular relations. (Other tools and languages exist for other graph-related database problems, such as storing and retrieving graphs as data or computing transformations on graphs.) The syntax of PQL is similar to SQL, with additional bits drawn from familiar sources such as Unix regular expressions. This makes it familiar and reasonably legible.

Data Model

PQL works on semi-structured data, that is, data with no formal schema. It uses a simple object model: values in the database can be atoms (integers, strings, etc.) or references to structures containing near-arbitrary collections of member names and values. (Member names are, for the time being at least, restricted to integers and strings.) Structure members connect objects to one another, forming a directed graph; each member is an edge in the database graph and each structure or atom is a node. Member names in structures are not necessarily unique: an unordered collection is represented as a structure with multiple members all having the same name. Nor, in the absence of any schema, is there any requirement that multiple members of the same name necessarily all lead to values of the same type. Many of the properties of the language are chosen to allow easy and concise operation in this environment.

Query Model

Queries in PQL are based on the concept of path: a route from a database entry point, crossing some sequence of graph edges (that is, structure member fields), and ultimately arriving at a value. A path expression is a a piece of language syntax that, when evaluated and applied to a database, matches one or more (or zero) physical paths in the database and retrieves one or more (or zero) values. A path expression is like a series of structure member references in a language like C or Java, except that in PQL it has been generalized to allow matching.

Path expressions in PQL allow two kinds of matching: first, edges (that is, structure members) to be followed can be selected with wildcards as well as exact names. Second, and much more importantly, sequences of edges to be followed can be matched with full regular expressions. For example, the query

   select M.name
   from Genealogy.people as P,
        P.mother+ as M
   where P.name = "John Harvard";
retrieves the names of all female-line ancestors of John Harvard from a hypothetical genealogy database. The path expression P.mother+ follows one or more mother edges starting at the object P. While some other languages allow simple transitive closures as in this query, more complex examples that use the full matching power of regular expressions, such as
   select F.name
   from Genealogy.people as P,
        P(.mother|.stepmother)+(.father|.stepfather) as F
   where P.name = "John Harvard";
(which follows either mothers or stepfathers and then takes a final step of father or stepfather) are not to our knowledge supported by any other comparable query language.

Additionally, PQL allows inspecting and constraining intermediate values found while following a path, and for cases where this is not enough, paths themselves can be extracted and handled in arbitrary ways. Paths thus discovered are first-class values that can be compared, inspected, or returned as query results. This allows querying the database structure as well as the database contents. It also allows easy expression of various sorts of queries that might otherwise be awkward, difficult, or even impossible without external programmatic support. This query, for example, accesses a hypothetical PQL interface to IMDB to evaluate Bacon numbers, but customizes the computation by excluding a couple well-known movies from the search.

   select A2.name, min(length(P))/2 + 1
   from
      IMDB.actors as A1,
      A1(.actor-of{M}.actor)*@P as A2
   where
      A1.name = "Kevin Bacon"
      and for all N in M.name: N <> "Ishtar" and N <> "Waterworld"
   group by A2
The syntax {M} binds the variable M to each intermediate result at that point in the path (in this case, a movie between two actors) and the syntax @P binds the variable P to the whole path matched by the regular expression repetition operator *.

Query Syntax

The general syntax (as seen in the examples above) is similar to SQL. The primary difference is that the from-clause contains paths rather than relations. There are some other superficial differences, such as support for proper string constants.

The path syntax is essentially comparable to the structure or record lookup syntax of ordinary programming languages, except that it has been extended with regular expression operators. The syntax .actor follows a single edge/field called "actor"; the syntax .actor* crosses zero or more such edges. As in Unix regular expressions, the + and ? operators match one or more, or zero or one, edge respectively, and the | operator handles alternatives. The syntax .actor|.director crosses either an "actor" or a "director" edge.

For more information, see the PQL Language Reference. For the really gory details of the syntax, see the grammar in the PQLBRL source.

History

PQL is derived from Lorel, the query language developed for the Lore semistructured database project back in the 90s. In the course of developing PASS, we decided we needed a query language for provenance. Because provenance is graph-structured, existing relational or XML-based tools proved inadequate, so we went looking for alternatives. PQL grew out of an attempt at reimplementing the published description of Lorel. PQL extends Lorel in several ways, most notably to full regular expressions including complex repeating sections.

Implementation

PQLBRL, the PQL Base Reference Library ("picklebarrel") is the implementation of the PQL engine. PQLBRL is not a DBMS or even a database engine; it is just a query engine. This allows it to be plugged into arbitrary storage backends and arbitrary query submission frontends. Currently PQLBRL is used as part of the custom provenance querying architecture in the PASS project.

The current state of the implementation: as of January 2010 PQLBRL 0.4.1 has been released, containing mostly bug fixes and some preliminary query optimization over 0.4. It pretty much works and supports the read-only language subset.

0.4 (the first release) was issued in August 2009. This was well behind the original schedule, which reflected some nontrivial semantic problems in the query canonicalization step. These problems have finally been resolved: the hardest part has been punted, and the rest is pretty much settled. (I believe I know how to handle the hard part soundly now, finally, but the implementation would be nontrivial, it's not clear from a broader point of view if it's important or even desirable, and as a result overall it is not worth spending any more time on.)

Support for the update language (based on Lorel's update language) is tentatively planned for the spring semester. A formal semantics is expected soon as well.