RDF and the Time Dimension – Part 2

In Part 1 I claimed that you run into problems when you try to model dimensional data in RDF: basic RDF offers no way to properly model any form of dimension, and with Named Graphs we can only model discrete dimensions. I received some feedback saying that the way I described the problem (in particular the distinction between discrete and continuous dimensions) was a bit unfortunate, and I agree. So before I describe my solutions, I’d like to rephrase the problem.

RDF and Dimensions – The Problem

What we want to model in OxPoints is the development of the University of Oxford over time. A very simple example is the name of the Oxford University Computing Services (OUCS). OUCS has existed since 1957 and was originally named “Computing Laboratory”. In 1969 OUCS was split in two, one branch becoming “Oxford University Computing Services”. Let’s say we simply want to model that a resource identified by “OUCS” has existed since 1957 and was named “Computing Laboratory” from 1957 to 1969, after which it changed its name to “Oxford University Computing Services”.

Now, why can’t we encode this in RDF? In Part 1 I outlined a proof sketch of why the basic form of RDF does not support this kind of conditional data: the problem is RDF’s notion of entailment. With the extension of “named graphs”, however, encoding this data should be an easy exercise: create one graph named “1957-1969” that describes OUCS as “Computing Laboratory” and one graph named “1969-” that describes it as “Oxford University Computing Services”.

As was pointed out to me (and I have to agree), not having enough names for all possible values does not stop us from talking and reasoning about a dimension (or call it a set of values). Take the real numbers as an example. There are many more real numbers than we could ever make up names for, yet this does not stop us from happily talking about them and proving things about them. So what is the difference with RDF? If I asked you whether 2 is in the set $]1,2[ \subset \mathbb{R}$, you would probably tell me that it is not. However, you can only give me what I regard as the correct answer if you have the same understanding of the name $]1,2[$ as I have, namely that $]a,b[$ denotes the open interval $\{x \in \mathbb{R} \mid a < x < b\}$. If we share this understanding, we can deduce information from a given name.

However, for a computer the name “1957-1969” is only a bunch of characters with no further meaning attached to it. A computer therefore could not tell me whether or not 1960 is part of “1957-1969”. The name of a named graph is just a name (or more precisely a URI), and nothing that you could (or should) deduce information from. So how are we to encode our data in a way that allows us to query for the following?

  • What is (was) the name of OUCS today (in 1960)?
  • What names did OUCS have from 1960 to 1980?
  • When is the following true: OUCS is called “Oxford University Computing Services”?

I have come up with two ideas on how to solve this problem: one using reification and a relaxed notion of entailment, and one using named graphs.

RDF and Dimensions – a Solution?

As we have seen, the problem with dimensions in RDF is that there is no built-in way to describe the context in which a given fact is valid. I have created a simple ontology (called the context graph ontology) that I will use for exactly that purpose. The idea in both proposed solutions is more or less the same: we use the context graph ontology to describe in which contexts a reified statement, or the statements in a named graph, are valid. Furthermore, we want to describe in which contexts resources exist. For both proposed solutions I will provide examples that demonstrate how one could work with such RDF graphs using SPARQL (the official W3C RDF query language). I have started to look into Jena’s inferencing engine, and with the support of such inference languages we should be able to overcome some of the problems that we are going to encounter with SPARQL. For this post, however, I will concentrate on using SPARQL to query our RDF data. I will use the Turtle syntax for my examples, since I believe it is the most readable RDF syntax, and I will make use of one abbreviated form, since again I think it makes the RDF more readable:

_:stmt1 rdf:type rdf:Statement ;
   rdf:subject <http://oxpoints/data#OUCS> .

is the same as

_:stmt1 rdf:type rdf:Statement .
_:stmt1 rdf:subject <http://oxpoints/data#OUCS> .

To describe temporal information, I will use the owl-time ontology.

RDF and Dimensions – using Reification

A reified statement in RDF does not say anything about its validity. The following, for example, simply describes the statement OUCS hasTitle “Computing Laboratory”, but it does not say whether or not this statement has to be true in a model of this RDF Graph.

_:stmt1 rdf:type rdf:Statement ;
   rdf:subject <http://oxpoints/data#OUCS> ;
   rdf:predicate dc:title ;
   rdf:object "Computing Laboratory"  .

What I propose is to allow an extra property (rdfcon:hasContext) on reified statements that describes the context in which the statement is to be seen as a fact. This could look something like this:

_:stmt1 rdf:type rdf:Statement ;
   rdf:subject <http://oxpoints/data#OUCS> ;
   rdf:predicate dc:title ;
   rdf:object "Computing Laboratory" ;
   rdfcon:hasContext _:con1 .

The context (_:con1) can now store one or more dimensions, thereby defining when the described statement is to be regarded as a valid fact. Let’s define the context as: from 1957 until 1969.

_:con1 a rdfcon:ContextGraph ;
   rdfcon:hasTemporalDimension _:temp1 .

_:temp1 a owl-time:Interval ;
   owl-time:hasBeginning _:temp1Begin ;
   owl-time:hasDurationDescription _:temp1Dur .

_:temp1Begin a owl-time:Instant ;
   owl-time:inDateTimeDescription _:temp1BeginInst .

_:temp1BeginInst a owl-time:inDateTime ;
   owl-time:year "1957"^^<http://.../XMLSchema#integer> .

_:temp1Dur a owl-time:DurationDescription ;
   owl-time:years "12"^^<http://.../XMLSchema#integer> .

Admittedly, this looks a bit horrible at first. We have to specify quite a number of facts to say that the context is from 1957 to 1969, but if we look a bit closer there is really not that much overhead. With the first two lines we add a new temporal dimension to our context. We then say that our dimension is of type owl-time:Interval and that it has a beginning and a duration. Next, we say that the beginning is of type owl-time:Instant and define it as 1957. With the last two lines we describe the interval’s duration as 12 years, which takes us from 1957 to 1969. If we now interpret this as “the triple OUCS hasTitle ‘Computing Laboratory’ is valid in the defined context” and relax our notion of RDF entailment to exclude context definitions from it, we are done: we have encoded that OUCS was named “Computing Laboratory” from 1957 to 1969. But getting the data in is only half the job. We now have to assess whether we can properly query our data.

Simple Queries

I have created a simple example RDF Graph that I will use to demonstrate how to query data that was encoded using the described technique. The example graph consists of two reified statements: one says that OUCS was named “Computing Laboratory” from 1957 to 1969, and the other says that from 1969 on OUCS is called “Oxford University Computing Services”. We will construct the following three queries:

  • What was OUCS’ name in 2008/1960/1940?
  • List all names that OUCS had during 1960-1980
  • Define the context in which the following holds: OUCS’ name is “Computing Laboratory”

We are going to use SPARQL to query our RDF Graph. If you have never seen SPARQL you might want to take a look at the SPARQL Primer, since an introduction is far beyond the scope of this post. Anyway, the following idea might help. In SPARQL you make assertions on triples and variables (beginning with ?), thereby filtering out the set of possible answers.

Let’s query our RDF Graph for the name OUCS had in 1960. All our SPARQL queries will look more or less the same. The first thing we have to do is grab all rdf:Statements that talk about OUCS’ name (subject: <http://oxpoints/data#OUCS> and predicate: dc:title) and that have a defined context (the first 5 lines in the WHERE clause). After we have identified the context description (stored in the variable ?context), we have to look for the dimension that we are interested in: in this case, a temporal dimension. I assume that if the dimension that is queried for is not specified, then the triple defined by the statement is generally valid with respect to that dimension, that is, valid in any temporal context. This is why we have to put the following statements in an OPTIONAL clause. If there is a temporal dimension, we store the beginning in ?beginDescYear, and if a duration is specified we store it in ?durationYears (the OPTIONAL clauses: lines 20-43). In the final FILTER clause (lines 45-53) we define the context that we are interested in, using the variables set in the OPTIONAL clause. In this example we are interested in the year 1960.

We get the following results:

obj
“Computing Laboratory”

To search for the name in a different year, we simply have to change the two occurrences of the year in lines 50 and 51.
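The query itself is not reproduced in this post. As a rough sketch of what the description above amounts to (it is not the original query, so the line numbers mentioned in the text do not apply to it), one could write something like the following. Several things here are assumptions: the rdfcon namespace URI is a placeholder, the dc: and owl-time: namespace URIs are assumed to be the commonly used ones, the variable names ?stmt, ?begin, ?beginDesc and ?duration are chosen for illustration, and the interval is treated as half-open (the end year is excluded), which the post does not specify.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:       <http://purl.org/dc/elements/1.1/>
PREFIX owl-time: <http://www.w3.org/2006/time#>
# Placeholder namespace (assumption): the post does not give the ontology's URI.
PREFIX rdfcon:   <http://example.org/rdfcon#>

SELECT ?obj
WHERE {
  # Reified statements about OUCS' title that carry a context.
  ?stmt rdf:type rdf:Statement ;
        rdf:subject <http://oxpoints/data#OUCS> ;
        rdf:predicate dc:title ;
        rdf:object ?obj ;
        rdfcon:hasContext ?context .

  # A context without a temporal dimension counts as "valid at any time",
  # so everything about the dimension is optional.
  OPTIONAL {
    ?context rdfcon:hasTemporalDimension ?dim .
    OPTIONAL {
      ?dim owl-time:hasBeginning ?begin .
      ?begin owl-time:inDateTimeDescription ?beginDesc .
      ?beginDesc owl-time:year ?beginDescYear .
    }
    OPTIONAL {
      ?dim owl-time:hasDurationDescription ?duration .
      ?duration owl-time:years ?durationYears .
    }
  }

  # Keep the statement if it has no temporal dimension, or if its
  # interval started by 1960 and has not yet ended in 1960.
  FILTER (
    ! bound(?dim)
    ||
    ( bound(?beginDescYear) && ?beginDescYear <= 1960 &&
      ( ! bound(?durationYears) || ?beginDescYear + ?durationYears > 1960 ) )
  )
}
```

In this sketch the year of interest also appears exactly twice, both times in the FILTER, so asking about 2008 or 1940 only means changing those two numbers.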

How about the next query? Again, we only have to change the FILTER expression so that the search uses a time interval instead of a time instant:

   FILTER
   (
     (! bound( ?dim ) )
     ||
        ( bound( ?beginDescYear ) &&
          ( ?beginDescYear >= 1960 && ?beginDescYear <= 1980 )
          ||
          ( ?beginDescYear <= 1960 && bound(?duration) &&
            ?beginDescYear + ?durationYears >= 1960 )
        )
   )

We get the following results:

obj
“Computing Laboratory”
“Oxford University Computing Services”

It is the same for any other query of the type: “What was [x] in [some time]”. The FILTER helps us to define the context that we want to perform our search in, while the longish OPTIONAL clause always stays the same.
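One thing to watch with the interval FILTER shown above: a statement that began before 1960 only passes if a duration is bound, so an open-ended context such as “from 1950 on” would be dropped. If that is not intended, a possible variant treats a missing duration as “still valid” and collapses the two branches into a single overlap test. This is a sketch rather than the original query; the rdfcon namespace URI is a placeholder, the dc: and owl-time: namespace URIs are assumed, and ?stmt, ?begin, ?beginDesc and ?duration are illustrative variable names.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:       <http://purl.org/dc/elements/1.1/>
PREFIX owl-time: <http://www.w3.org/2006/time#>
# Placeholder namespace (assumption): the post does not give the ontology's URI.
PREFIX rdfcon:   <http://example.org/rdfcon#>

SELECT DISTINCT ?obj
WHERE {
  ?stmt rdf:type rdf:Statement ;
        rdf:subject <http://oxpoints/data#OUCS> ;
        rdf:predicate dc:title ;
        rdf:object ?obj ;
        rdfcon:hasContext ?context .

  OPTIONAL {
    ?context rdfcon:hasTemporalDimension ?dim .
    OPTIONAL {
      ?dim owl-time:hasBeginning ?begin .
      ?begin owl-time:inDateTimeDescription ?beginDesc .
      ?beginDesc owl-time:year ?beginDescYear .
    }
    OPTIONAL {
      ?dim owl-time:hasDurationDescription ?duration .
      ?duration owl-time:years ?durationYears .
    }
  }

  # Overlap with 1960-1980: the statement must start no later than 1980
  # and must either be open-ended or end no earlier than 1960.
  FILTER (
    ! bound(?dim)
    ||
    ( bound(?beginDescYear) && ?beginDescYear <= 1980 &&
      ( ! bound(?durationYears) || ?beginDescYear + ?durationYears >= 1960 ) )
  )
}
```

For statements that do have a duration, this test accepts the same cases as the two-branch FILTER; it only differs for open-ended intervals that started before the queried range.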

For our last query (define the context in which the following holds: OUCS’ name is “Computing Laboratory”), we do not need a FILTER expression, since we do not want to specify a context. But we have to replace the former ?obj variable with our search string “Computing Laboratory”. Have a look at the updated query: you will find that we really did not have to change much, and that the query actually became simpler.

We get the following results:

beginDescYear durationYears
1957 12
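A context lookup of this kind could be sketched as follows. Again this is an illustration, not the original query: the rdfcon namespace URI is a placeholder, the dc: and owl-time: namespace URIs are assumed, and ?stmt, ?begin, ?beginDesc and ?duration are names picked for the example.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:       <http://purl.org/dc/elements/1.1/>
PREFIX owl-time: <http://www.w3.org/2006/time#>
# Placeholder namespace (assumption): the post does not give the ontology's URI.
PREFIX rdfcon:   <http://example.org/rdfcon#>

SELECT ?beginDescYear ?durationYears
WHERE {
  # The object is now fixed; the context is what we ask for.
  ?stmt rdf:type rdf:Statement ;
        rdf:subject <http://oxpoints/data#OUCS> ;
        rdf:predicate dc:title ;
        rdf:object "Computing Laboratory" ;
        rdfcon:hasContext ?context .

  ?context rdfcon:hasTemporalDimension ?dim .

  # Beginning and duration stay optional so that open-ended or
  # beginning-less intervals still produce a row.
  OPTIONAL {
    ?dim owl-time:hasBeginning ?begin .
    ?begin owl-time:inDateTimeDescription ?beginDesc .
    ?beginDesc owl-time:year ?beginDescYear .
  }
  OPTIONAL {
    ?dim owl-time:hasDurationDescription ?duration .
    ?duration owl-time:years ?durationYears .
  }
}
```

Note that the plain literal "Computing Laboratory" only matches if the title in the data is also a plain literal without a language tag or datatype, as it is in the Turtle examples above.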

It seems that we are able to query our data, but the SPARQL queries become a bit more complex (and we have only looked at years in this example). However, once we have our dimension set, large parts of the query (the OPTIONAL clause) will stay the same for all queries.

Constructing Snapshots

So far we have queried our graph directly. However, it would be nice if we could transform the graph into a simpler one that contains only the facts that are valid in the specified context and that encodes facts directly rather than via reification. Due to some shortcomings in SPARQL (it is not possible to specify recursive queries), we will not get a very clean result. However, it should be suitable for our purposes. In the following examples we are going to use SPARQL’s CONSTRUCT form, which allows us to create new RDF Graphs dynamically.

Let’s say we want to create a snapshot of everything that is valid in 1960. We have already seen how we can query for a name that was valid in 1960, so we simply have to generalize that query. We get the following RDF Graph:

<http://oxpoints/data#OUCS> dc:title "Computing Laboratory" .

What our query has done is look through all reified statements with a context description. It selected all those that were valid in 1960 and reformulated each of them as a simple triple.
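The generalization can be sketched by replacing the fixed subject and predicate with variables and switching from SELECT to CONSTRUCT. The following is one possible shape, not the original query: the rdfcon namespace URI is a placeholder, the owl-time: namespace URI is assumed, the variable names are illustrative, and the interval is treated as half-open (the end year is excluded), which is an assumption.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl-time: <http://www.w3.org/2006/time#>
# Placeholder namespace (assumption): the post does not give the ontology's URI.
PREFIX rdfcon:   <http://example.org/rdfcon#>

# Emit every reified statement that is valid in 1960 as a plain triple.
CONSTRUCT { ?s ?p ?o }
WHERE {
  ?stmt rdf:type rdf:Statement ;
        rdf:subject ?s ;
        rdf:predicate ?p ;
        rdf:object ?o ;
        rdfcon:hasContext ?context .

  OPTIONAL {
    ?context rdfcon:hasTemporalDimension ?dim .
    OPTIONAL {
      ?dim owl-time:hasBeginning ?begin .
      ?begin owl-time:inDateTimeDescription ?beginDesc .
      ?beginDesc owl-time:year ?beginDescYear .
    }
    OPTIONAL {
      ?dim owl-time:hasDurationDescription ?duration .
      ?duration owl-time:years ?durationYears .
    }
  }

  FILTER (
    ! bound(?dim)
    ||
    ( bound(?beginDescYear) && ?beginDescYear <= 1960 &&
      ( ! bound(?durationYears) || ?beginDescYear + ?durationYears > 1960 ) )
  )
}
```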

Next, we have to add all statements that do not have a context description and all triples that are universally valid. This is done by this query. It results in the following RDF Graph:

_:node13oaqngn5x43 a owl-time:Interval ;
   owl-time:hasBeginning _:node13oaqngn5x44 ;
   owl-time:hasDurationDescription _:node13oaqngn5x45 .

_:node13oaqngn5x44 a owl-time:Instant ;
   owl-time:inDateTimeDescription _:node13oaqngn5x46 .

_:node13oaqngn5x45 a owl-time:DurationDescription ;
   owl-time:years "12"^^<http://.../XMLSchema#integer> .

_:node13oaqngn5x46 a owl-time:inDateTime ;
   owl-time:year "1957"^^<http://.../XMLSchema#integer> .

_:node13oaqngn5x49 a owl-time:Interval ;
   owl-time:hasBeginning _:node13oaqngn5x50 .

_:node13oaqngn5x50 a owl-time:Instant ;
   owl-time:inDateTimeDescription _:node13oaqngn5x51 .

_:node13oaqngn5x51 a owl-time:inDateTime ;
   owl-time:year "1969"^^<http://.../XMLSchema#integer> .

In our example we only get a bunch of blank nodes, since we did not have any universally valid triples defined in our original RDF Graph. The blank nodes that we get were originally used to describe the context of our statements, and in a clean snapshot these should not be present. However, since SPARQL does not allow for recursive queries, it is difficult to filter them out properly, and having them shouldn’t really be a big problem. If we were constructing the snapshot using an inference engine, we should be able to filter them out as well and get a cleaner result.
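To make the behaviour described above concrete, here is one way such a “everything without a context” query might be written. It is a guess at the shape of the query, not the original: it uses the OPTIONAL plus !bound idiom for negation, the rdfcon namespace URI is a placeholder, and the variable names ?stmt, ?context and ?kind are illustrative. It copies reified statements that have no context as plain triples, and it copies ordinary triples unless their subject is a reified statement or a context node, which is why the owl-time blank nodes survive.

```sparql
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# Placeholder namespace (assumption): the post does not give the ontology's URI.
PREFIX rdfcon: <http://example.org/rdfcon#>

CONSTRUCT { ?s ?p ?o }
WHERE {
  {
    # Reified statements that carry no context: emit them as plain triples.
    ?stmt rdf:type rdf:Statement ;
          rdf:subject ?s ;
          rdf:predicate ?p ;
          rdf:object ?o .
    OPTIONAL { ?stmt rdfcon:hasContext ?context }
    FILTER ( ! bound(?context) )
  }
  UNION
  {
    # Ordinary triples, skipping those whose subject is a reified
    # statement or a context node.
    ?s ?p ?o .
    OPTIONAL {
      ?s rdf:type ?kind .
      FILTER ( ?kind = rdf:Statement || ?kind = rdfcon:ContextGraph )
    }
    FILTER ( ! bound(?kind) )
  }
}
```

Removing the interval, instant and duration nodes as well would require following the chain from each context node to its descriptions, which is the part that is awkward to express without recursion.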

Anyway, if we now combine our two queries we get the following RDF Graph (here in Turtle):

[Figure: Resulting RDF Graph]

Summary

We have constructed an RDF Graph with a temporal dimension using reification. With the use of SPARQL we were able to query our data and even construct snapshots that only contain the valid triples at a certain point in time, not containing any notion of a temporal dimension and thereby relieving us from the complex reification. The additional blank nodes we get are isolated and should not interfere with our queries.

We can do all this with any tool that talks RDF and SPARQL, and we did not have to “program” any extension in order to use a temporal dimension in RDF. However, our interpretation of our RDF Graph is officially not correct, since it violates RDF’s entailment property. But if we accept that, this looks like a workable solution to the problem.

RDF and Dimensions – using Named Graphs

Using reification we have been able to write temporal RDF, but this comes at a price: to reify a triple we have to write four triples instead of one, and we are violating RDF’s entailment property. In the following sections I will present a solution using Named Graphs. Named Graphs are not part of the RDF specification but are widely supported. With Named Graphs we will be able to write temporal RDF without having to violate RDF’s entailment property.

In the problem description I said that if we have a graph named “1957-1969”, we should not try to deduce any information from its name. But nothing stops us from putting additional properties on our graphs; after all, they are identified by URIs. However, in order for our RDF to be understood by others, we should do this in a “standardized” way.

The solution I will propose is rather similar to the one using reified triples. We will use a special graph (called the context description graph) to define all our context graphs (e.g. “1957-1969”). An additional graph (the global knowledge graph) will hold all the triples that do not have any context associated with them.

So, let’s redo our old example using Named Graphs. Since we did not have any “global knowledge” we only need two context graphs and the context description graph:

The two context graphs are rather straightforward RDF Graphs. They simply store the name of OUCS:

<http://oxpoints/data#OUCS> dc:title "Computing Laboratory" .

The interesting bit comes in the context description graph. However, if we analyze it, it should look quite familiar. The first triple stores the information that this graph is the context description graph:

<http://.../cdg.n3> a rdfcon:ContextDescriptionGraph .

The following four lines add a temporal dimension to our two graphs, while the other lines further describe them:

<http://.../1957-1969.n3> a rdfg:Graph ;
	rdfcon:hasTemporalDimension _:node13oddd9rox1 .

<http://.../1969-.n3> a rdfg:Graph ;
	rdfcon:hasTemporalDimension _:node13oddd9rox5 .

We now have encoded the information that OUCS was named “Computing Laboratory” from 1957 to 1969 and from then on “Oxford University Computing Services” using three named graphs. The next step will be to get that data out again.

Simple Queries

Let’s try to construct the same queries as in the reification solution:

  • What was OUCS’ name in 2008/1960/1940?
  • List all names that OUCS had during 1960-1980
  • Define the context in which the following holds: OUCS’ name is “Computing Laboratory”

We will find that the queries are quite similar to the ones in the reification solution. We have to make use of the SPARQL keyword “GRAPH”, which lets us talk about a specific Named Graph, but other than that the queries are much the same. Here is the query to answer the first question. The first four lines after the WHERE clause correspond to the statement selection in the reification solution: here we select a Named Graph (storing its name in ?graph) that talks about the name of http://oxpoints/data#OUCS. With the next GRAPH clause we identify our context description graph. In the final GRAPH clause we query the context description graph to see whether the graphs that we identified earlier satisfy the given context (in this case the year 1960). The OPTIONAL and FILTER clauses are completely identical to the ones in our previous example.

If you run the query, these are the results:

title
“Computing Laboratory”
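The structure described above, with its three GRAPH clauses, could look roughly like the following sketch. It is not the original query: the rdfcon namespace URI is a placeholder, the dc: and owl-time: namespace URIs are assumed, the variable names ?cdg, ?begin, ?beginDesc and ?duration are illustrative, and the interval is treated as half-open (the end year is excluded), which is an assumption.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:       <http://purl.org/dc/elements/1.1/>
PREFIX owl-time: <http://www.w3.org/2006/time#>
# Placeholder namespace (assumption): the post does not give the ontology's URI.
PREFIX rdfcon:   <http://example.org/rdfcon#>

SELECT ?title
WHERE {
  # Any named graph that states a title for OUCS.
  GRAPH ?graph {
    <http://oxpoints/data#OUCS> dc:title ?title .
  }

  # Identify the context description graph.
  GRAPH ?cdg {
    ?cdg rdf:type rdfcon:ContextDescriptionGraph .
  }

  # Look up the temporal dimension of ?graph in the context description
  # graph; a graph without one counts as valid at any time.
  OPTIONAL {
    GRAPH ?cdg {
      ?graph rdfcon:hasTemporalDimension ?dim .
      OPTIONAL {
        ?dim owl-time:hasBeginning ?begin .
        ?begin owl-time:inDateTimeDescription ?beginDesc .
        ?beginDesc owl-time:year ?beginDescYear .
      }
      OPTIONAL {
        ?dim owl-time:hasDurationDescription ?duration .
        ?duration owl-time:years ?durationYears .
      }
    }
  }

  FILTER (
    ! bound(?dim)
    ||
    ( bound(?beginDescYear) && ?beginDescYear <= 1960 &&
      ( ! bound(?durationYears) || ?beginDescYear + ?durationYears > 1960 ) )
  )
}
```

The query has to be run against a dataset in which the context graphs and the context description graph are loaded as named graphs; how that is configured depends on the SPARQL engine.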

To change our query’s context to look for a different year, we again only have to change the two occurrences of the year in the FILTER clause. If we want to query for an interval, we can use the same FILTER clause as in the reification example:

   FILTER
   (
     (! bound( ?dim ) )
     ||
        ( bound( ?beginDescYear ) &&
          ( ?beginDescYear >= 1960 && ?beginDescYear <= 1980 )
          ||
          ( ?beginDescYear <= 1960 && bound(?duration) &&
            ?beginDescYear + ?durationYears >= 1960 )
        )
   )

With this FILTER clause we get the following answer:

title
“Computing Laboratory”
“Oxford University Computing Services”

And finally, searching for a context is just as easy. We simply have to drop the FILTER clause and specify the state that we are interested in.

The resulting query yields the following results:

beginDescYear durationYears
1957 12

Constructing Snapshots

Let us again construct a snapshot for the situation in 1960. Our first step is, again, to generalize the query looking for OUCS’ name and to use SPARQL’s CONSTRUCT keyword to dynamically construct a new RDF Graph.

This generalized query generates the following RDF Graph:

<http://oxpoints/data#OUCS> dc:title "Computing Laboratory" ;
    a <http://.../ox#Unit> .
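A generalized CONSTRUCT of this kind might be sketched as follows. This is an illustration under assumptions, not the original query: the rdfcon namespace URI is a placeholder, the owl-time: namespace URI is assumed, the variable names are illustrative, and the interval is treated as half-open (the end year is excluded). Unlike the SELECT query, this sketch requires a temporal dimension with a beginning, so that the context description graph and any context-free graph are left to the separate global-knowledge step.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl-time: <http://www.w3.org/2006/time#>
# Placeholder namespace (assumption): the post does not give the ontology's URI.
PREFIX rdfcon:   <http://example.org/rdfcon#>

# Copy every triple from each context graph that is valid in 1960.
CONSTRUCT { ?s ?p ?o }
WHERE {
  GRAPH ?cdg {
    ?cdg rdf:type rdfcon:ContextDescriptionGraph .

    # Only graphs with a described temporal dimension are considered here.
    ?graph rdfcon:hasTemporalDimension ?dim .
    ?dim owl-time:hasBeginning ?begin .
    ?begin owl-time:inDateTimeDescription ?beginDesc .
    ?beginDesc owl-time:year ?beginDescYear .

    # A missing duration means the interval is still open.
    OPTIONAL {
      ?dim owl-time:hasDurationDescription ?duration .
      ?duration owl-time:years ?durationYears .
    }
  }

  GRAPH ?graph {
    ?s ?p ?o .
  }

  FILTER (
    ?beginDescYear <= 1960 &&
    ( ! bound(?durationYears) || ?beginDescYear + ?durationYears > 1960 )
  )
}
```

Because whole graphs are selected rather than individual reified statements, no context bookkeeping leaks into the output, which is where the cleaner snapshot comes from.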

The only thing left to do is to add all the triples from the global knowledge graph. Since we did not specify any, this will result in an empty RDF Graph in this example. In any case, here is the query that does the job.

Finally, we have to combine the previous two queries using SPARQL’s UNION operator. With the combined query we are able to generate the snapshot (results in N3):

[Figure: Snapshot of the data valid in 1960]

As we can see directly, our snapshot is much cleaner than the one we generated using the reification approach. This is because with Named Graphs we were able to cleanly separate the different contexts and the context description.

Conclusion

I have introduced two possible ways of encoding dimensional data in RDF. The first uses reification and works with basic RDF, which allows us to use any of the existing tools that are able to process RDF and SPARQL. However, with this solution we are violating RDF’s entailment property. The second builds on the RDF extension known as Named Graphs and can therefore only be used with tools that know about Named Graphs. With this solution we are able to construct cleaner snapshots, and it feels like it might scale better for large projects.

In any case, I think that some solution for encoding dimensional data should find its way into the RDF specification, because only then will we be able to write tools that understand one another’s RDF. As it is now, it feels like everyone who uses RDF has to (to a certain extent) reinvent RDF to suit his or her purposes.

For the new OxPoints system I am currently favouring the second solution, building on Named Graphs, since it feels like the cleaner solution, but I am happy to listen to any of your comments.
