Comparing SPARQL with SQL » TopQuadrant, Inc

+1 919.300.7945     Blog     Downloads     Purchase     Contact

Comparing SPARQL with SQL

Posted by Irene Polikoff on May 5, 2014

TopBraid platform heavily uses SPARQL – for queries, for rules, for data transformations and templated data services. From time to time, we get asked to describe advantages of SPARQL over SQL (the data query language people are most familiar with).

The question may be phrased as:
“What can be done with SPARQL that can't be done with SQL?”

or, as asked at the recent webinar, Connecting the Dots with Dynamically Linked Data from Diverse Sources where we demonstrated TopBraid Insight:

“Can you show an example of SPARQL query that is impossible or very hard to do with SQL?”

In considering the differences, it is important to keep in mind that SPARQL is designed to query RDF data while SQL is designed to query relational data. As such, both languages and their respective advantages closely reflect the data models they work with.

Some Key Differences

One key difference with SPARQL and RDF data is that relationships in the data are explicitly described and directly accessible. As a result, SPARQL queries are considerably better aligned with users’ mental models of a domain. For example, a SPARQL query for medications that may be prescribed to treat ‘TB of vertebra’ can look as follows:

SELECT ?medication
WHERE {
?diagnosis a example:Diagnosis .
?diagnosis example:name “TB of vertebra” .
?medication example:canTreat ?diagnosis .
}

In SQL, such a query may look as shown below.

SELECT DRUG.medID
FROM DIAGNOSIS, DRUG, DRUG_DIAGNOSIS
WHERE DIAGNOSIS.diagnosisID=DRUG_DIAGNOSIS.diagnosisID
AND DRUG.medID=DRUG_DIAGNOSIS.medID
AND DIAGNOSIS.name=”TB of vertebra”

In relational databases, and correspondingly in SQL, relationships such as the one between a diagnosis and the drugs that treat it are left unstated. They get derived from matching the relevant primary and foreign keys of tables. So, unlike SPARQL, the SQL query reflects the specific structure of a database and how the data is stored in tables within it, not the user’s understanding of the domain.

With RDF, schema level information is stored and queried the same way as data. It means that the conceptual data model could be fully explored through queries. It also makes it easy to create flexible, retrospective, data model-aware and driven queries. For example, let's say a patient could have many different types of phones – work phone, home phone, cell phone, etc., and we need to retrieve all available ones. With RDF, one can group such specific properties under a more general parent property 'phone' by making statements such as: example:workPhone rdfs:subPropertyOf example:phone. The query to get all relevant phones can then combine schema and data criteria to get the result:

SELECT ?patient ?phoneType ?phoneNumber
WHERE {
?phoneType rdfs:subPropertyOf example:phone .
?patient ?phoneType ?phoneNumber .
}

Or, if you prefer to return a human readable label of each respective phone property:

SELECT ?patient ?phoneName ?phoneNumber
WHERE {
?phoneType rdfs:subPropertyOf example:phone .
?phoneType rdfs:label ?phoneName .
?patient ?phoneType ?phoneNumber .
}

When we decide to capture an additional type of phone, we will not need to change any queries. Instead, we simply add rdfs:subPropertyOf triple for the new type of phone. It is possible to achieve something like this with relational databases by using a more generic relationship tables and storing names of the relationships as data. However, doing so is considerably more complex and also more limited – with RDF all the rich model information including everything available with OWL can be queried. Furthermore, with RDF all concepts and their properties have globally unique identifiers (URIs) and are not treated as strings or local IDs.

Queries like the one above represent generic design patterns such as 'retrieve all subproperties of a given property'. It is easy to see how it would be useful to wrap such patterns as generic service templates that accept a parameter instead of, as in our example, statically using 'example:phone'. Or, alternatively, they could become functions. SPIN templates and functions implement these capabilities.

Queries that have to traverse a chain of connections are particularly complex in SQL while very simple in SPARQL. Let’s say a user wants to retrieve all patients diagnosed with infectious or parasitic diseases. Diagnoses are organized in a hierarchy with many levels. For example, bacterial diseases, viral diseases, and tuberculosis are all part of this category and are further divided into many nested hierarchical subgroups. The diagnosis for a particular patient can be described at different levels of granularity.

In SPARQL, the query may look as follows:

SELECT DISTINCT ?patient
WHERE {
?diagnosisGroup example:name “1. INFECTIOUS AND PARASITIC DISEASES (001-139)”.
?diagnosis (example:subCode)* ?diagnosisGroup .
?patient example:hasDagnosis ?diagnosis .
}

“*” symbol directs the query engine to traverse the link to the end, getting all diagnoses that are hierarchically sub-codes of the selected group.

Until fairly recent and underused introduction of recursive common table expressions (CTEs) SQL had no equivalent of a “*” symbol. Even with CTEs, the syntax is harder to understand and write. Furthermore, relational databases don't support well storage of hierarchies without a pre-set depth or number of levels.

Similarities

Both languages are mature and rich with many similar capabilities. For example:

  • SQL and SPARQL have a similar set of aggregate functions used with a GROUP BY operator.
  • To select data where some values don’t exist, SQL uses NULLs such as DIAGNOSIS.name = NULL. To select data with values not matching some specified value, SQL uses NOT EQUAL. SPARQL uses the NOT EXISTS filter for similar purposes.
  • SQL uses the LEFT OUTER JOIN operator to select data across multiple tables even if join constraints are not met. SPARQL uses the OPTIONAL key word to indicate that solutions should not be eliminated even if some patterns are not present in the data.

Additional Unique SPARQL Capabilities

In addition to SELECT, INSERT and DELETE, SPARQL supports ASK queries.

These queries simply return true or false depending on whether conditions in the WHERE clause are met. TopBraid uses ASK queries to implement data quality constraints. For example, in TopBraid Enterprise Vocabulary Net if users enter data that violates a constraint such as—all names must be capitalized—a violation will be flagged.

SPARQL can be extended by creating functions for use with the BIND operator.

These functions can be implemented in SPARQL itself or by using languages such as JavaScript or Java. Another way to extend SPARQL is through the use of computed properties (also known as property functions or magic properties). These are predicates that are used in a SPARQL query pattern to do something different from a simple subgraph matching. One common example use case is to use property function to encapsulate a complex subquery that can be re-used within multiple queries. Another common example is providing a computed property that relates string literal subjects to free-text (e.g., Lucene) search strings. TopBraid provides tooling to make it easy to create and manage necessary SPARQL extensions.

A key advantage of RDF is that it was designed specifically to readily merge disparate sources of data. Correspondingly, SPARQL includes syntax to call two or more data sources within a single query. This is done using a SERVICE key word such as:

SELECT ?medication
WHERE {
SERVICE
{?diagnosis a example:Diagnosis .
?diagnosis example:name “TB of vertebra” .}
SERVICE
{?medication example:canTreat ?diagnosis .}
}

Such a query would, of course, require that identity of data in the medication data source was aligned with the identity of data in the diagnosis data source, which is often not the case with diverse data sources. TopBraid Insight is a query federation server that makes it possible to have distributed queries in the presence of diverse data with differing identities. It also incorporates planning algorithms to ensure efficient performance of distributed queries.

What is your experience with using SPARQL? Where do you see the key differences with SQL? What features of SPARQL do you find particularly useful? Have you discovered any shortcomings?
We would like to hear from you.

Archives

About TopQuadrant | Glossary | Privacy Statement | Legal | Terms of Use
© Copyright 2001-2017 TopQuadrant, Inc. All Rights Reserved.