Working with Wikidata and other external Knowledge Graphs using SHACL in TopBraid

Sometimes it is useful to link your data with entities from external knowledge graphs. For example, Wikidata contains background information about almost any topic in the world, such as the current population of each country. To avoid duplication and manual data entry, your local repository can fetch the latest population count from the external source. Wikidata also defines unique identifiers for each entity, so any vocabulary that talks about Australia could link to the Wikidata entity for Australia as a way of building links into a central knowledge hub.

TopBraid EDG 6.2 introduces a new capability that supports such links to Wikidata and to any other external knowledge graph that provides a SPARQL endpoint, including endpoints managed by your own organization. This article illustrates the capability by establishing a link from a local People database to similar entities in Wikidata, reusing values such as date of birth, height and images from Wikidata.

wikidata-demo-00-overview

We distinguish between local entities (aka resources or assets) and remote entities. Local entities are maintained “manually” by dedicated staff, while selected data from remote entities are periodically copied over and thus maintained automatically. Dedicated link properties (such as “Wikidata Person” in the diagram above) are used to point from local entities to remote entities. On-the-fly inferences using SHACL property value rules are employed to transform selected remote values into properties of the local entities, so that they can be queried just like any other local values:

wikidata-demo-25-arnold-all-inferences

Let’s now walk through the steps to make this happen, with TopBraid EDG 6.2.

Creating a Link Property

In this example we have an Ontology (schema) as an EDG asset collection. It uses a SHACL version of the schema.org namespace as a starting point. The class schema:Person already declares various SHACL property shapes for values such as schema:givenName and schema:height. In order to store links from schema:Person instances to corresponding entities on Wikidata, we introduce a new property called wikidataPerson:

wikidata-demo-01-create-link-property

Next we navigate into the SHACL property declaration by clicking on the grey box below:

wikidata-demo-02-navigate-to-property-shape

Then we pick “Make this property a Wikidata link” from the context menu of the property shape:

This adds a value for the property dash:detailsEndpoint, linking it to the SPARQL endpoint of the Wikidata server: https://query.wikidata.org/sparql

wikidata-demo-04-link-property

Technical background: Whenever a property shape carries a value for dash:detailsEndpoint, TopBraid understands that the values are URIs and that more RDF statements for these URIs can be queried from the given SPARQL endpoint. If the endpoint is exactly the URL above, additional Wikidata-specific features are activated.
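In Turtle, the result of this step might look roughly like the sketch below. The people_schema namespace IRI and the shape name people_schema:Person-wikidataPerson are placeholders chosen for illustration, and writing the endpoint as a plain string literal is an assumption; the declaration that EDG actually generates may differ in detail.

```turtle
@prefix dash: <http://datashapes.org/dash#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
# Placeholder namespace for the example Ontology (not stated in this article).
@prefix people_schema: <http://example.org/people_schema#> .

# Attach the link property to the existing Person class.
schema:Person
  sh:property people_schema:Person-wikidataPerson .

# Hypothetical name for the property shape of the link property.
people_schema:Person-wikidataPerson
  a sh:PropertyShape ;
  sh:path people_schema:wikidataPerson ;
  sh:name "Wikidata Person" ;
  # The values are URIs of remote entities.
  sh:nodeKind sh:IRI ;
  # Further statements about those URIs can be queried from this endpoint.
  dash:detailsEndpoint "https://query.wikidata.org/sparql" .
```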

That’s all. The local schema:Person class is now linked to Wikidata.

Linking Local Instances to Wikidata Entities

Assume we have an EDG Data Graph with people instances, pre-populated with the usual suspects from the Kennedy family:

wikidata-demo-05-instances

Not much is known locally, except for the names of the people and their gender. However, the names of the people are sufficient to establish crosslinks into Wikidata. Assuming that some Wikidata link properties exist, the Transform tab offers a wizard-like feature that suggests suitable Wikidata entities based on (approximate) similarity of the labels:

wikidata-demo-06-suggest-tab

Clicking on “Suggest Mapping to Wikidata – For all Assets” runs a sequence of queries to a web service kindly provided by Wikidata. This may take a while but can be interrupted at any time.

wikidata-demo-07-suggest-progress

The resulting page can be used to review the suggestions and accept those that seem plausible:

wikidata-demo-08-suggestions

As an alternative to the batch process, you can use “Suggest matching Wikidata entities…” from the context menu of each individual local entity. This brings up a dialog such as the following:

wikidata-demo-09-single-suggestion

Selected entities now have outgoing links to remote entities such as Q2685 for Arnie:

wikidata-demo-10-arnold

You can follow the link to explore whatever Wikidata knows about this individual:

wikidata-demo-11-arnold-wikidata

Now that our local instances have references to corresponding Wikidata entities, we can start using the property values of the remote entities.
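At the data level, an accepted suggestion boils down to a single triple from the local instance to the Wikidata entity. The sketch below assumes placeholder namespaces for the local data graph and Ontology, and a hypothetical local identifier for Arnold; only the wd: namespace and the ID Q2685 are taken from Wikidata itself.

```turtle
@prefix wd: <http://www.wikidata.org/entity/> .
# Placeholder namespaces (not stated in this article).
@prefix people: <http://example.org/people#> .
@prefix people_schema: <http://example.org/people_schema#> .

# Hypothetical local instance, linked to its Wikidata counterpart.
people:ArnoldSchwarzenegger
  people_schema:wikidataPerson wd:Q2685 .
```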

Defining the Shape of Remote Entities

Our schema doesn’t know anything about the remote entities yet. We need to tell the system which properties we are interested in, and what format they have. The W3C Shapes Constraint Language (SHACL) is well suited for that job. We define a SHACL node shape that carries property shapes for the relevant properties. This acts like a “view” on the remote data and informs the system what kinds of SPARQL queries it needs to use to fetch the actual values.

Back in our example schema, we define a node shape called “Wikidata Person”. (Alternatively, we could make it a class too, but here a shape is arguably cleaner.) To get to the following screen in TopBraid EDG 6.2, make the node shapes visible using the small hollow circle button above the class tree and press “New” in the Instance panel:

wikidata-demo-12-create-wikidata-shape

TopBraid now offers another wizard that greatly simplifies the linkage with Wikidata. From the context menu of the new node shape, select “Add property shapes from Wikidata sample…”:

wikidata-demo-13-menu

The resulting dialog asks you for the ID of any example instance that may hold typical values. In our example, we pick Arnold’s Wikidata ID Q2685 and click on “Load”:

wikidata-demo-14-add-wikidata-properties

The dialog fetches all properties of this sample instance and allows you to browse the values. You can then select which properties you are interested in and (optionally) set cardinality and datatype constraints too. Above, we have selected the “height” property with a maximum cardinality (sh:maxCount) of 1, and datatype xsd:decimal.

We can repeat this process for other sample instances, for example to pick “death date”, which wasn’t available for Arnold. Eventually, the system has generated suitable SHACL property shape declarations for all the selected properties, and attached them to our Wikidata Person shape:

wikidata-demo-15-properties-form

For experts, here is the Turtle source code of this node shape:

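@prefix graphql: <http://datashapes.org/graphql#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Placeholder namespace; substitute the namespace of your own Ontology.
@prefix people_schema: <http://example.org/people_schema#> .

people_schema:Wikidata_Person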
  rdf:type sh:NodeShape ;
  rdfs:label "Wikidata Person" ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path schema:description ;
    sh:name "description" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path rdfs:label ;
    graphql:name "rdfs_label" ;
    sh:name "label" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P18 ;
    sh:name "image" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P2048 ;
    sh:datatype xsd:decimal ;
    sh:maxCount 1 ;
    sh:name "height" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P569 ;
    sh:datatype xsd:dateTime ;
    sh:maxCount 1 ;
    sh:name "date of birth" ;
  ] ;
  sh:property [
    rdf:type sh:PropertyShape ;
    sh:path wdt:P570 ;
    sh:datatype xsd:dateTime ;
    sh:maxCount 1 ;
    sh:name "date of death" ;
  ] .

Now that we have described the shape of the remote entity, we tell our link property about it, using sh:node (or, alternatively, sh:class if the node shape is also a class):

wikidata-demo-16-set-node

That is enough to instruct the system about which values we want to fetch from the endpoint. However, it does not yet establish the relationship of these remote values with our local schema.
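Expressed in Turtle, the sh:node statement mentioned above might look like the following sketch. The namespace IRI and the name of the link property shape are placeholders for illustration; only people_schema:Wikidata_Person appears in the source of the node shape.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
# Placeholder namespace for the example Ontology.
@prefix people_schema: <http://example.org/people_schema#> .

# Hypothetical name for the property shape of the wikidataPerson link property.
people_schema:Person-wikidataPerson
  # Values of the link property are described by the "Wikidata Person" node shape.
  sh:node people_schema:Wikidata_Person .
```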

Defining Property Value Rules

Here we want the local property schema:height to hold the same values as the property wdt:P2048 (aka “height”) of the remote entities from Wikidata. SHACL property value rules can be used to instruct the system that certain property values shall be computed on the fly, whenever they are queried. The resulting values are called “inferences” and are not editable in TopBraid EDG. A simple form of property value rule can be employed to walk from the local schema:Person instance into the associated Wikidata person, and from there retrieve the height value. More complex rules can be defined to perform additional transformations, when needed.

You can either enter such rules by hand, or use the new wizard in TopBraid EDG. Start by navigating into the property shape that defines the local height property (click the grey box):

wikidata-demo-17-select-height

Once there, select “Create property value rule from template…” from the context menu:

wikidata-demo-18-height-menu

This wizard offers a growing number of templates, including the one that just copies a value from a linked entity:

wikidata-demo-19-height-wizard

Once finished, the property shape of schema:height carries a SHACL property value rule:

wikidata-demo-20-height-form
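For readers who prefer to enter the rule by hand, a property value rule of this kind can be written with sh:values from the SHACL Advanced Features draft. The following is a minimal sketch, not necessarily the exact form the wizard generates: the namespace IRI and the shape name are placeholders, and the rule uses a sequence path to walk from the person to the linked entity and on to its height.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
# Placeholder namespace for the example Ontology.
@prefix people_schema: <http://example.org/people_schema#> .

# Hypothetical name for the property shape of schema:height at schema:Person.
people_schema:Person-height
  a sh:PropertyShape ;
  sh:path schema:height ;
  # Follow wikidataPerson to the remote entity, then read its height (P2048).
  sh:values [
    sh:path ( people_schema:wikidataPerson wdt:P2048 ) ;
  ] .
```

An equivalent formulation nests two path expressions using sh:nodes instead of a sequence path; both are covered by the same draft.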

To confirm that this is all now working, we can visit the local Arnold instance, and use “Refresh details of remote values” to fetch the remote values from the Wikidata SPARQL endpoint:

wikidata-demo-21-arnold-refresh

Once this has completed, we can see that our local Arnold instance has a schema:height property, which is inferred straight out of the Wikidata knowledge graph:

wikidata-demo-22-arnold-inference

We can repeat the same steps for the other properties. In some cases, the property value rules may need to be post-processed to include extra transformations. Here, we have modified the rule for schema:deathDate so that the xsd:dateTime value from Wikidata is automatically turned into an xsd:date literal:

wikidata-demo-23-type-cast

If you are not familiar with the syntax, check the SHACL Advanced Features 1.1 draft. The above roughly means “query the values of wikidataPerson and then query the values of P570 of those, and finally convert those to xsd:date using the SPARQL xsd:date(v) function”. Similarly, we can use the function sparql:iri to convert the image URL strings delivered by Wikidata into IRI resources. (To see the sparql: functions, include the “SPARQL vocabulary for SHACL” in your Ontology.)
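Following that reading, a hand-written version of the rule could look like the sketch below. It is an approximation under stated assumptions: the namespace IRI and shape name are placeholders, and whether the wizard emits exactly this nesting is not confirmed here. The inner expression walks wikidataPerson and then P570, and the outer function expression applies the xsd:date cast.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Placeholder namespace for the example Ontology.
@prefix people_schema: <http://example.org/people_schema#> .

# Hypothetical name for the property shape of schema:deathDate at schema:Person.
people_schema:Person-deathDate
  a sh:PropertyShape ;
  sh:path schema:deathDate ;
  sh:values [
    # Function expression: cast each xsd:dateTime result to xsd:date.
    xsd:date (
      [
        # Values of P570 (date of death) ...
        sh:path wdt:P570 ;
        # ... starting from the values of wikidataPerson.
        sh:nodes [ sh:path people_schema:wikidataPerson ] ;
      ]
    ) ;
  ] .
```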

Refreshing and Querying Remote Values

Now that all shapes have been set up, we can use batch processes to periodically refresh the remote values, e.g. once a night. In TopBraid EDG, this can be automated using scheduled jobs. The batch process can be triggered from the Transform tab:

wikidata-demo-26-refresh-all

Alternatively, individual resources can be refreshed as shown earlier.

We can now see that all local person entities that have links to Wikidata entities carry values for height, birth date, death date and image:

wikidata-demo-27-table

You can also query these values, in the same way as locally defined values, using GraphQL:

wikidata-demo-28-graphql

Since TopBraid’s GraphQL support is based on shape declarations, we can even query the values of the remote entities, as follows. Note that this requires the shape to be marked with graphql:protectedShape in the Ontology.

wikidata-demo-29-graphql

Oh, and since we have used SHACL node shapes to declare the structure of the Wikidata entities, we can also perform constraint validation on that data. Combined with TopBraid EDG workflows, this means that data can be pulled from the remote service and then validated before it is accepted into production.