Sometimes it is useful to link your data with entities from external knowledge graphs. For example, Wikidata contains background information about almost any topic in the world, such as the current population of each country. To avoid duplication and manual data entry, your local repository may want to fetch the latest population count from such an external source. Wikidata also defines unique identifiers for each resource, so any vocabulary that talks about Australia could link to the Wikidata resource for Australia as a way of building links into a central knowledge hub.

TopBraid EDG 6.4 introduces a new capability that supports such links to Wikidata, and any other external knowledge graph that provides a SPARQL endpoint, including endpoints managed by your own organization. This article illustrates these new capabilities by establishing a link from a local People database to matching resources in Wikidata, reusing values such as date of birth, height and images from Wikidata.

We distinguish between local resources (aka assets) and remote resources. Local resources are maintained in TopBraid EDG by dedicated staff, while selected data from remote resources is periodically copied over but continues to be maintained at the remote source.

Dedicated link properties (such as “wikidata person” in the diagram above) are used to point from local resources to remote resources. On-the-fly inferences using SHACL property value rules are employed to copy (transforming if necessary) selected remote values into properties of the local resources, so that they can be queried just like any other local values:

wikidata-demo-25-arnold-all-inferences

Let’s now walk through the steps to make this happen, with TopBraid EDG 6.4.

Creating a Link Property

In this example we have an Ontology (schema) as an EDG asset collection. It is using a SHACL version of the schema.org namespace as a starting point. The class schema:Person already declares various SHACL property shapes for values such as schema:givenName and schema:height. In order to store links from schema:Person instances to corresponding resources on Wikidata, we introduce a new property called wikidataPerson:

Next we navigate into the SHACL property declaration by selecting it from the class form or from the Property Group panel. We are displaying the form and the source code (Turtle) so that it becomes evident what changes are made once you declare this property to be a link to an external graph.

Then we pick “Make this property a Wikidata link” from the Modify menu of the property shape:

This adds a value for the property dash:detailsEndpoint, linking it to the SPARQL endpoint of the Wikidata server: https://query.wikidata.org/sparql

Technical background: Whenever a property shape carries a value for dash:detailsEndpoint, TopBraid EDG will understand that the property values are URIs and that more RDF statements for these URIs can be queried from the given SPARQL endpoint. If the endpoint happens to be exactly the URL above then additional features for Wikidata get activated.

That’s all. The local schema:Person class now carries a link to the Wikidata SPARQL endpoint.
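For readers who prefer to see the result as source code, the following is a minimal sketch of what such a link property shape might look like in Turtle. The people_schema namespace, the shape IRI people_schema:Person-wikidataPerson and the sh:nodeKind constraint are assumptions made for illustration; only dash:detailsEndpoint and the endpoint URL are taken from the text above, and the exact output of the wizard may differ.

```turtle
@prefix dash: <http://datashapes.org/dash#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
# Hypothetical namespace for the example ontology
@prefix people_schema: <http://example.org/people_schema#> .

# Attach the link property shape to the existing Person class
schema:Person
  sh:property people_schema:Person-wikidataPerson .

# Hypothetical shape IRI; the wizard may generate a different one
people_schema:Person-wikidataPerson
  rdf:type sh:PropertyShape ;
  sh:path people_schema:wikidataPerson ;
  sh:name "wikidata person" ;
  # Assumption: values are IRIs of remote resources
  sh:nodeKind sh:IRI ;
  # Marks the property as a link into the Wikidata SPARQL endpoint
  dash:detailsEndpoint "https://query.wikidata.org/sparql" .
```

Pointing dash:detailsEndpoint at a different URL would, in the same way, link the property to any other SPARQL endpoint.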

Linking Local Resources to Wikidata Resources

We have created an EDG Data Graph with instances of the schema:Person class representing the usual suspects from the Kennedy family:

wikidata-demo-05-instances

Not much detailed data is captured in EDG for these people. We have their names, their gender, the family relationships between individuals and the schools they went to. Since these are well-known people, much more data about them is available on Wikidata.

The names of the people will be sufficient to automatically generate links from our local resources to the corresponding Wikidata resources. Since we have a Wikidata link property, running Problems and Suggestions will suggest suitable Wikidata resources based on (approximate) similarity of the labels. To generate suggested matches, TopBraid EDG will run a sequence of queries to a web service kindly provided by Wikidata. This may take a while but can be interrupted at any time:

wikidata-demo-07-suggest-progress

The resulting page can be used to review the suggestions and accept those that seem plausible:

wikidata-demo-08-suggestions

As an alternative to the batch process for all local resources, you can use “Suggest matching Wikidata entities…” from the Modify menu for each individual local resource. This will bring up a dialog such as the following:

wikidata-demo-09-single-suggestion

If we accept some of the suggestions, our local resources will now have outgoing links to remote ones – such as the Q2685 link for Arnie:

wikidata-demo-10-arnold

You can follow the link to explore whatever Wikidata knows about this individual:

wikidata-demo-11-arnold-wikidata

Now that our local instances have references to corresponding Wikidata resources, we can start using the property values of the remote resources.
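In the data graph, an accepted suggestion is simply a triple from the local resource to the Wikidata entity. A minimal sketch in Turtle, assuming hypothetical people_data and people_schema namespaces and a hypothetical local IRI for Arnold:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix wd: <http://www.wikidata.org/entity/> .
# Hypothetical namespaces for the example ontology and data graph
@prefix people_schema: <http://example.org/people_schema#> .
@prefix people_data: <http://example.org/people_data#> .

# Hypothetical local IRI; the link target wd:Q2685 is the ID used in this article
people_data:ArnoldSchwarzenegger
  rdf:type schema:Person ;
  rdfs:label "Arnold Schwarzenegger" ;
  people_schema:wikidataPerson wd:Q2685 .
```

Because the link is an ordinary property value, it can be edited, removed or queried like any other local statement.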

Defining the Remote Data of Interest

Our schema doesn’t know anything about the remote resources yet. We need to tell TopBraid EDG which properties we are interested in, and what format they have. The W3C Shapes Constraint Language (SHACL) is well suited for that job. To identify remote properties we are interested in, we will define a SHACL node shape that carries property shapes for these properties. This acts like a “view” on the remote data and informs the system what kinds of SPARQL queries it needs to use to fetch the actual values.

Back in our example schema, we define a node shape called “Wikidata Person”. We are using a node shape that is not a class because these are remote resources with data we only hold in TopBraid EDG as a “refreshable cache”. Alternatively, we could define a class as well.

Click on the Node Shapes panel and press the “Create Node Shape” button:

TopBraid now offers another wizard that greatly simplifies the linkage with Wikidata. From the context menu of the new node shape, select “Add property shapes from Wikidata sample…”:

The resulting dialog asks you for the ID of any example instance that may hold typical values. In our example, we pick Arnold’s Wikidata ID Q2685 and click on “Load”:

This dialog fetches all properties of this sample instance and allows you to browse the values by expanding each property you are interested in. You can then select the properties of interest and (optionally) set cardinality and datatype constraints for them. Above, we have selected the “height” property with a maximum cardinality (sh:maxCount) of 1, and datatype xsd:decimal.

We can repeat this process for other sample instances in case they have values for other properties that we may want, for example to pick “date of death”, which wasn’t available for Arnold. TopBraid EDG will generate suitable SHACL property shape declarations for all the selected properties, and attach them to our Wikidata Person shape:

For experts, here is the Turtle source code of this node shape:

people_schema:Wikidata_Person
  rdf:type sh:NodeShape ;
  rdfs:label "Wikidata Person" ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path schema:description ;
    sh:name "description" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path rdfs:label ;
    graphql:name "rdfs_label" ;
    sh:name "label" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P18 ;
    sh:name "image" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P2048 ;
    sh:datatype xsd:decimal ;
    sh:maxCount 1 ;
    sh:name "height" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P569 ;
    sh:datatype xsd:dateTime ;
    sh:maxCount 1 ;
    sh:name "date of birth" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P570 ;
    sh:datatype xsd:dateTime ;
    sh:maxCount 1 ;
    sh:name "date of death" ;
  ] .

Now that we have described the data from the remote resources we are interested in, we tell our link property (wikidata person) about it, using sh:node. (If we used a class to describe Wikidata Person, we could also use sh:class.)
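In Turtle, this step amounts to a single additional statement on the link property shape. The sketch below assumes a hypothetical people_schema namespace and a hypothetical shape IRI people_schema:Person-wikidataPerson for the link property; people_schema:Wikidata_Person is the node shape listed above.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
# Hypothetical namespace for the example ontology
@prefix people_schema: <http://example.org/people_schema#> .

# Values of the link property are described by the Wikidata Person node shape
people_schema:Person-wikidataPerson
  sh:node people_schema:Wikidata_Person .
```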

This is enough to instruct TopBraid EDG about the values we want to fetch from the endpoint. However, it does not yet tell EDG into what properties of our local resources to copy these values.

Defining Property Value Rules

Let’s say we want the values of the local property schema:height to hold the same values as the property wdt:P2048 (aka “height”) of the remote resources from Wikidata. SHACL property value rules can be used to instruct the system that certain property values shall be computed on the fly, whenever they are queried. The resulting values are called “inferences” and are not editable in TopBraid EDG. A simple form of property value rule can be employed to walk from the locally stored instance of the schema:Person class to the associated wikidata person, and from the remote person to its height value. More complex rules can also be defined to perform additional transformations when needed.
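As a rough illustration, such a rule could be written with sh:values and a path expression from the SHACL Advanced Features draft. This is a sketch only: the people_schema namespace and the shape IRI people_schema:Person-height are hypothetical, and the rule actually produced by TopBraid EDG may be structured differently.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
# Hypothetical namespace for the example ontology
@prefix people_schema: <http://example.org/people_schema#> .

# Hypothetical IRI of the local height property shape
people_schema:Person-height
  rdf:type sh:PropertyShape ;
  sh:path schema:height ;
  sh:values [
    # Step 2: from each linked remote person, take the values of wdt:P2048 (height)
    sh:path wdt:P2048 ;
    # Step 1: start at the values of wikidataPerson of the local person
    sh:nodes [ sh:path people_schema:wikidataPerson ] ;
  ] .
```

The nested sh:nodes expression is evaluated first, which is why the rule reads "inside out": first follow the link, then fetch the remote property.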

You can either enter such rules by hand, or use the new wizard in TopBraid EDG. Start by navigating into the person’s property shape that defines the local height property:

Once there, select “Create property value rule from template…” from the Modify menu:

This wizard offers a growing number of templates, including the one that just copies a value from a related (in our case, linked remote) resource:

Once finished, the property shape of schema:height carries a SHACL property value rule:

wikidata-demo-20-height-form

To confirm that this is all now working, we can visit the local Arnold instance, and use “Refresh details of remote values” to fetch the remote values from the Wikidata SPARQL endpoint:

Once this has completed, we can see that our local Arnold instance has a schema:height property, which is inferred straight out of the Wikidata knowledge graph:

wikidata-demo-22-arnold-inference

We can repeat the same steps for the other properties. In some cases, the property value rules may need to be post-processed to include extra transformations. Here, we have modified the rule for schema:deathDate so that the xsd:dateTime value from Wikidata is automatically turned into an xsd:date literal:

wikidata-demo-23-type-cast

If you are not familiar with the syntax, check the SHACL Advanced Features 1.1 draft. The above roughly means “query the values of wikidataPerson, then query the values of P570 of those, and finally convert those to xsd:date using the SPARQL xsd:date(v) function”. Similarly, we can use the function sparql:iri to convert the image URL strings delivered by Wikidata into IRI resources. (To see the sparql: functions, include the “SPARQL vocabulary for SHACL” in your Ontology.)
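For readers without access to the screenshot, the two rules described here could look roughly as follows in Turtle. This is a hedged sketch rather than the exact source from the example: the people_schema namespace, the shape IRIs, and the choice of schema:image as the local image property are assumptions. The function expressions follow the SHACL Advanced Features pattern of a blank node whose single predicate is the function and whose object is the argument list.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix sparql: <http://datashapes.org/sparql#> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Hypothetical namespace for the example ontology
@prefix people_schema: <http://example.org/people_schema#> .

# Hypothetical shape IRI: cast the remote xsd:dateTime into an xsd:date
people_schema:Person-deathDate
  rdf:type sh:PropertyShape ;
  sh:path schema:deathDate ;
  sh:values [
    xsd:date (
      [
        sh:path wdt:P570 ;
        sh:nodes [ sh:path people_schema:wikidataPerson ] ;
      ]
    ) ;
  ] .

# Hypothetical shape IRI: turn the remote image URL string into an IRI
people_schema:Person-image
  rdf:type sh:PropertyShape ;
  sh:path schema:image ;
  sh:values [
    sparql:iri (
      [
        sh:path wdt:P18 ;
        sh:nodes [ sh:path people_schema:wikidataPerson ] ;
      ]
    ) ;
  ] .
```

Both rules reuse the same inner path expression and differ only in the function wrapped around it.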

Refreshing and Querying Remote Values

Now that all shapes have been set up, we can use batch processes to periodically refresh the remote values, e.g. once a night. In TopBraid EDG, this can be automated using scheduled jobs. The batch process can be triggered from the Transform tab:

wikidata-demo-26-refresh-all

Alternatively, individual resources can be refreshed as shown earlier.

We can now see that all local person resources that have links to Wikidata entities carry values for height, birth date, death date and image:

wikidata-demo-27-table

You can also query these values, consistently with locally defined values, using GraphQL:

wikidata-demo-28-graphql

Since TopBraid’s GraphQL support is based on shape declarations, we can even query the values of the remote resources, as follows. Note that this requires the shape to be marked with graphql:protectedShape in the Ontology.

wikidata-demo-29-graphql

Oh, and since we have used SHACL node shapes to declare the structure of the Wikidata entities, we can also perform constraint validation on that data. Combined with TopBraid EDG workflows, this means that data can be pulled from the remote service and then validated before it is accepted into production.