TopBraid Data Platform

The TopBraid Data Platform is a high availability solution for TopBraid EDG servers. It enables continuous operation of business functions by replicating data across a cluster of EDG servers. Each EDG server is kept up to date, so any server can respond to a client or application request. Together with a load balancer to direct requests, user and application access to the EDG data remains available even if some of the servers are offline.

This availability is achieved by having a cluster of EDG servers communicating with a data coordinator (DC) server that receives changes from any clustered server whenever it has updates. The DC server propagates the changes to all other EDG servers, keeping them all up-to-date. Additionally, whenever a new EDG server joins the cluster, it is first brought up-to-date before it starts servicing user/application requests. Each EDG server keeps a local, persistent cache of application data so that start-up only involves applying changes missed by that server.

TopBraid EDG Data Coordinator Server Block Diagram

Each EDG or EDG Explorer server has its own copy of a database of all RDF graphs managed by the Data Platform. When changes to the data are made, the changes are recorded locally as RDF Patches.

At the end of the operation (the end of the HTTP request), any changes are sent to the Data Coordinator and made safe. When the patches are confirmed to be safe, the EDG server triple store declares the request has committed and responds to the HTTP request. It is the state of the RDF Patch Log that determines the state of the EDG server for Data Platform backed graphs.

When an operation arrives at an EDG server, the server checks that its local database is up-to-date. If it is not, the server fetches the missing patches from the Data Coordinator and applies them so that it is at the latest version.
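The catch-up step can be pictured with a small model. The following Python sketch is illustrative only: the `Patch` and `LocalReplica` classes and the `sync` function are hypothetical names, the patch log is modelled as an in-memory list rather than a network service, and this is not the EDG or RDF Delta implementation. It assumes only that each patch names its predecessor, so that a replica can tell which patches it has missed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Patch:
    id: str
    prev: Optional[str]                      # id of the preceding patch, None for the first
    changes: Tuple[Tuple[str, Triple], ...]  # ("A", triple) adds, ("D", triple) deletes


@dataclass
class LocalReplica:
    triples: Set[Triple] = field(default_factory=set)
    head: Optional[str] = None               # id of the last patch applied locally

    def apply(self, patch: Patch) -> None:
        # Refuse a patch that does not follow directly on the local state.
        if patch.prev != self.head:
            raise ValueError(f"patch {patch.id} does not follow {self.head}")
        for op, triple in patch.changes:
            if op == "A":
                self.triples.add(triple)
            elif op == "D":
                self.triples.discard(triple)
        self.head = patch.id


def sync(replica: LocalReplica, log: List[Patch]) -> int:
    """Apply every patch in `log` (oldest first) that the replica has not seen."""
    ids = [p.id for p in log]
    start = 0 if replica.head is None else ids.index(replica.head) + 1
    pending = log[start:]
    for patch in pending:
        replica.apply(patch)
    return len(pending)
```

In this model a replica that is already current applies nothing, which mirrors the point that start-up only involves applying the changes a server has missed.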

The Data Coordinator can be a single machine with file-backed persistence, or a cluster of servers with storage in a cloud object store. There are tradeoffs in complexity of operation, as discussed below.

The TopBraid Data Platform is based on the open source (Apache License) RDF Delta system.

Obtaining the Software

The Data Coordinator (the patch log server) can be downloaded from:

https://download.topquadrant.com/dp/

EDG server and EDG Explorer already include the Data Platform client software, so no extra software is required for EDG. You must purchase a license from TopQuadrant to run Data Platform. You will then be sent the appropriate number of licenses for your EDG installations.

Note

The version corresponding to the EDG version as given in the table on the download page must be used.

Data Coordinator Requirements

The DC server is a Java webserver process that coordinates changes across a cluster of EDG servers. For production use, it should be run on a machine separated from all of the clustered EDG machines. For development and experimentation, it can be run on a shared machine.

The DC server stores patches on-disk.

The disk storage must be backed-up, which can be done by file backup on a live server.

The robustness of the system is determined by the robustness of the file storage, so relying on a local disk in the DC server limits it.

A minimum of 8GB of RAM is required.

It’s imperative that monitoring for low memory and low disk space is implemented.
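As one possible starting point for the disk-space part of that monitoring, the Python sketch below reports whether the filesystem holding a given directory is running low. The function names and the 10% threshold are assumptions chosen for illustration, not TopQuadrant recommendations, and memory monitoring is not covered here because it depends on the platform and on how the Java process is run.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_space_low(path: str, min_free_fraction: float = 0.10) -> bool:
    """True when free space has dropped below the chosen threshold."""
    return free_fraction(path) < min_free_fraction
```

A scheduled job could call `disk_space_low` on the directory where the DC server stores its patches and raise an alert when it returns `True`.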

EDG Failover Configuration

In a high-availability configuration, two or more EDG servers provide the service. Each has a complete copy of the replicated graphs.

Typically, a deployment will put a load balancer in front of the two servers so that the clients (web browsers or other applications) use the same URL to access either server. It also allows a deployment to move EDG servers between physical hardware for maintenance and system upgrade.

There are two choices for Patch Log Server: either run with a file-backed single server or run with replicated servers for continuous operation.

The file-based patch log server can use any OS-supported filesystem – local storage or remote disk-array and the reliability and durability of files written are determined by the choice of filesystem. If the single server style is used, then operation is interrupted if the patch log server is unavailable. The EDG servers keep running but will be unable to update data until the patch log server restarts. Patch log server startup is very fast. There is no need to restart EDG servers.

The advantage of the single-server style is simplicity of operation, so small deployments face a tradeoff between simplicity and continuous operation.

For continuous operation, three or more patch log servers run, on separate hardware; the system uses Apache Zookeeper to manage a system wide database of the patch state of the deployment. Storage of patches is in an object store, Amazon S3. There are publicly available adapters to provide the Amazon S3 API over other storage choices such as Apache Cassandra.

Example RDF Patch

H id <uuid:6e0288f5-ed79-4ab7-b4cc-8386ee32fadf> .
H prev <uuid:3e346531-777e-4169-abcd-284a7a5b22ac> .
TX .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/shacl#NodeShape> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.w3.org/2002/07/owl#Thing> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#label> "SubClass" .
TC .

Example Data Platform connector file.

# baseURI: urn:x-evn-master:ontconfig
# imports: http://topbraid.org/sparqlmotionlib
@prefix dp: <http://org.topbraid/dataplatform#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sm: <http://topbraid.org/sparqlmotion#> .
@prefix sml: <http://topbraid.org/sparqlmotionlib#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:x-evn-master:ont> a dp:Graph ;
    dp:coordinator "http://localhost:1066/" ;
    dp:datasource "ont" ;
    dp:localData "ont/data" ;
    dp:localDataLocation "/home/afs/TQ/EDG/TBS/build/workspaces/TBCME/Zone/ont/data" ;
    sml:baseURI "urn:x-evn-master:ont" ;
    sml:graphName "urn:x-evn-master:ont" ;
    sml:imports () ;
    .

<urn:x-evn-master:ontconfig> a owl:Ontology ;
    owl:imports <http://topbraid.org/sparqlmotionlib> ;
    .

Patches in EDG

Once a collection is created in EDG, a set of patch files will be created in the Data Coordinator.

Changes to the collection will be recorded in the patch log. Patches are handled by the system and not displayed to the user but here is an example patch creating a new subclass in an ontology:

H id <uuid:6e0288f5-ed79-4ab7-b4cc-8386ee32fadf> .
H prev <uuid:3e346531-777e-4169-abcd-284a7a5b22ac> .
TX .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/shacl#NodeShape> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.w3.org/2002/07/owl#Thing> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#label> "SubClass" .
TC .
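For readers who want to inspect a patch such as the one above, the following Python sketch splits the text into header fields and change rows. It is a deliberately minimal reader written for this example, not the RDF Delta parser: the function name `parse_patch` is hypothetical, it assumes one operation per line with triples only (no graph term), and it keeps each RDF term as the raw string found in the file.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, str, str, str]  # (operation, subject, predicate, object)


def parse_patch(text: str) -> Tuple[Dict[str, str], List[Change]]:
    """Split patch text into header fields and add/delete rows."""
    headers: Dict[str, str] = {}
    changes: List[Change] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Every row ends with a terminating " ."; drop it.
        if line.endswith("."):
            line = line[:-1].rstrip()
        code, _, rest = line.partition(" ")
        if code == "H":
            name, _, value = rest.partition(" ")
            headers[name] = value
        elif code in ("A", "D"):
            # Subject and predicate contain no spaces; the object is the remainder.
            subject, predicate, obj = rest.split(" ", 2)
            changes.append((code, subject, predicate, obj))
        elif code in ("TX", "TC"):
            continue  # transaction begin / commit markers
        else:
            raise ValueError(f"unrecognised patch row: {raw!r}")
    return headers, changes
```

Applied to the example, the headers would contain the `id` and `prev` values and the change list would hold four `A` rows for the new subclass.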

Verify Log Operation

To illustrate that the EDG server is indeed running with the Data Platform, stop TopBraid Composer, restart with a clean workspace, and restart EDG “personal edition”. If running EDG in Tomcat, stop Tomcat, delete the following directories and files, or completely delete the EDG workspace as if reinstalled:

<Workspace>/Zone/ont/
<Workspace>/Zone/ont.tch/
<Workspace>/Repositories/ont.dpc
<Workspace>/Repositories/ont.tch.dpc

When the EDG server starts up, the log will show the ontology (called “ont” below) being restored from the patch log server.

[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: New graph: [id:969bb7, ont.tch, <urn:x-evn-master:ont.tch>]
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Create DP Graph [id:969bb7] : datasource=ont.tch, baseURI=urn:x-evn-master:ont.tch, coordinator=http://localhost:1066/
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Attach graph ont.tch in 0.166s
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: New graph: [id:84d7ff, ont, <urn:x-evn-master:ont>]
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Create DP Graph [id:84d7ff] : datasource=ont, baseURI=urn:x-evn-master:ont, coordinator=http://localhost:1066/
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Attach graph ont in 0.261s

The graph has been created and rebuilt from the patch log.
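If this check is repeated often, the log can be scanned by a script. The Python sketch below pulls the graph names and attach times out of lines shaped like the sample output above. The regular expression is based only on that sample, so the exact wording may differ in other EDG versions, and the function name `attached_graphs` is made up for this example.

```python
import re
from typing import Dict

# Matches lines such as "DataPlatform: Attach graph ont in 0.261s".
ATTACH = re.compile(r"DataPlatform: Attach graph (\S+) in ([0-9.]+)s")


def attached_graphs(log_text: str) -> Dict[str, float]:
    """Map each attached graph name to its reported attach time in seconds."""
    return {m.group(1): float(m.group(2)) for m in ATTACH.finditer(log_text)}
```

For the sample log this would report two graphs, `ont.tch` and `ont`.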

Data Platform Setup

Data Platform can be used as the default data store for all of the collections in the workspace or as the storage for nominated asset collections, with an option selected when the asset collection is created.

Any projects uploaded to EDG that were created in TBCME or given to you by TopQuadrant professional services for customizations or sample data will not be synced between servers. These are uploaded individually to each server. EDG configuration files will also not be stored via Data Platform. Changes to configurations will need to be replicated on each EDG node.

This section gives a short walkthrough for setting up an EDG system with the Data Platform. It covers the use of the TopBraid Data Platform with file-backed storage of patches. The Data Coordinator is the RDF Delta patch log server.

A new workspace with Data Platform as the active data storage must be used. You cannot convert an existing workspace by changing the database storage option.

File-based persistent patch storage

The simplest configuration of a patch log server uses the server's file system for patch storage. This can be any OS-supported filesystem, either local storage or a remote disk array; the reliability and durability of the files written are determined by the choice of filesystem.

Data Platform default data store steps

Be sure to follow the steps in the order given. Details are provided below. If you would like a sample docker image, please contact TopQuadrant support.

  1. Download and run the patch log server.

  2. Enable logging for the patch log server.

  3. Install the first instance of EDG using the database options below. We recommend the file method of installation so that the file can be copied to new nodes. This can be done using the interactive setup or a setup.properties file (see EDG Server Installation).

    1. Database: Data Platform

    2. DP server URL: the URL of the patch log server, for example http://localhost:1066/. Use the DNS name or IP address of the patch log server machine, which should be a separate machine from the one running the TopBraid server.

    3. Is primary node: true

    4. The “Zone” is the place within the server workspace where the Data Platform software keeps the local state information. This does not usually need to be changed.

  4. You can verify that your setup was successful by creating any collection. If this succeeds, you can check the Base URI Management page in the Server Administration area. Your Repositories folder collections should have an extension of .dpc.

  5. Now you can clone this EDG instance to set up multiple EDG nodes. Before doing so, you may want to change any other configuration settings that you will want on all the EDG servers (see EDG Administration and Configuration). Otherwise you will have to repeat the configuration on each node, because EDG configuration files are not synced via Data Platform. Be sure the EDG workspaces are in different locations. You cannot use the same workspace on multiple EDG nodes. You will need to clone the entire directory containing the workspace, vault, and setup file, and replace the license file. TopQuadrant support will provide a unique license for each EDG node.

  6. Update the additional nodes to have Is primary node: false.

  7. As the last and most important step, set up a backup mechanism and, at minimum, server monitoring for memory and disk space. You will want to save regular backups of the EDG workspace as well as the Data Coordinator. Optionally, set up EDG Failover Configuration (instructions above for Zookeeper).
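The check in step 4 can also be made from a script. The Python sketch below lists the collections in a workspace whose repository files carry the .dpc extension. It assumes the Repositories folder sits directly under the workspace directory, and the function name `data_platform_collections` is hypothetical.

```python
from pathlib import Path
from typing import List


def data_platform_collections(workspace: str) -> List[str]:
    """Return the names of repository files ending in .dpc, without the extension."""
    repositories = Path(workspace) / "Repositories"
    return sorted(p.stem for p in repositories.glob("*.dpc"))
```

An empty result after creating a collection would suggest that the collection was not stored via Data Platform.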

Run the patch log server

In production, the patch log server should be run on a separate machine to the server running the TopBraid server, not just a separate virtual machine.

After obtaining the software, run the patch log server with:

java -jar rdf-delta-server-VERSION.jar --file --base DIRECTORY

where VERSION is the version number of the RDF Delta software corresponding to the version of the EDG server in use and DIRECTORY is an initially empty, existing directory where the patch log server stores the patch logs in files.

Example patch server output:

[2019-05-08 10:46:59] Delta INFO Provider: file
[2019-05-08 10:46:59] Config INFO Delta Server port=1066
[2019-05-08 10:46:59] Delta INFO RDF Delta 0.7.0 2019-03-12T14:54:40+0000
[2019-05-08 10:46:59] Delta INFO No data sources

Be sure to start the Data Platform server with the storage option that you intend to use or, in the case of an upgrade, the option that was used previously (e.g. --file).

If no option is specified, Data Platform defaults to RocksDB, which is reported as ‘Local’.

- Delta INFO Provider: Local

vs

- Delta     INFO Provider: file

It is vital to ensure you don’t specify the wrong storage type.

Data Coordinator Outages

If the Data Coordinator is not running then:

  • shared collections can be read (viewed, queried)

  • updates to shared collections are not possible

  • new, shared collections cannot be created.

Restarting the Data Coordinator re-enables the update and sharing features. Restarting EDG is not necessary.

If the default storage is set to be “data platform” then the Data Coordinator server must be running and contactable when an EDG server is initially configured.
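A simple reachability probe can help tell a Data Coordinator outage apart from other problems. The Python sketch below only tests whether a TCP connection can be opened; it does not confirm that the patch log service itself is healthy. The function name is hypothetical, and the default port 1066 is taken from the sample server output, so adjust it if your deployment differs.

```python
import socket


def coordinator_reachable(host: str, port: int = 1066, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running this from an EDG machine before initial configuration is one way to confirm that the Data Coordinator is contactable from that host.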

Backup & Restore

Overview

There are two areas to consider when backing up a Data Platform enabled EDG environment:

  • Data Coordinator

    • Filesystem snapshot of this server, taken while the Data Coordinator is shut down

  • EDG

    • The workspace, any files modified in Tomcat such as the setup.properties file, the Vault, context.xml, Tomcat users, logging configuration.

Backup

Option 1: Use the Backup and Restore utility in EDG: EDG Backup and Restore.

Option 2: Manual Backup

The first step in backing up a Data Platform environment is to shut down the Data Coordinator. Once it is shut down, the EDG instance(s) will enter read-only mode, preventing any changes to the system.

Backup of the Data Coordinator can be done by snapshot of the system, or rsync of the files serving the patch store. These should only be performed while the Data Coordinator is shut down.

Backup of the EDG server can be done the same way: while the Data Coordinator is shut down and EDG is in read-only mode, take a snapshot of the system or rsync the workspace, web.xml and secure storage files.
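Where a snapshot or rsync is not convenient, a directory can also be captured as a timestamped archive. The Python sketch below is one hedged way to do that; the function name `archive_directory` and the archive naming scheme are assumptions made for this example. The script does not check that the Data Coordinator is stopped, so that remains the operator's responsibility.

```python
import tarfile
import time
from pathlib import Path


def archive_directory(source_dir: str, dest_dir: str, label: str) -> Path:
    """Write source_dir into dest_dir as <label>-<timestamp>.tar.gz and return the archive path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(dest_dir) / f"{label}-{stamp}.tar.gz"
    with tarfile.open(str(target), "w:gz") as tar:
        # Store the contents under a top-level folder named after the label.
        tar.add(str(source_dir), arcname=label)
    return target
```

The same function could be called once for the Data Coordinator's patch directory and once for the EDG workspace, with the archives written to storage outside both servers.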

Manual Restore

To restore a Data Platform enabled EDG environment, first restore the Data Coordinator files and bring the server online.

Next, restore the backed up files of EDG and start that container. If EDG can communicate with the Data Coordinator server, the system should resume operation.