The Data Fabric, Containers, Kubernetes, Knowledge-Graphs, and more

Read all the parts here: Part 1, Part-1b, Part 2.

In the last article we talked about the building blocks of a knowledge-graph. Now we will go a step further and learn the basic concepts, technologies, and languages we need to actually build one.

Introduction

https://towardsdatascience.com/the-data-fabric-for-machine-learning-part-1-2c558b7035d7

In the last article in the series I talked about the concept of the knowledge-graph and its relation to the data fabric:

The fabric in the data fabric is built from a knowledge-graph. To create a knowledge-graph you need semantics and ontologies to find a useful way of linking your data that uniquely identifies and connects data with common business terms.

I also talked about the concept of triples: subject, predicate, and object (or entity-attribute-value), and the Resource Description Framework (RDF). After all of that discussion we should be ready to create our data fabric, right? Well, no.

In this article I’ll detail the last pieces we need to understand to actually build a knowledge-graph and deploy a data fabric for our company. I’ll be using the stand-alone AnzoGraph graph database by Cambridge Semantics, and I will deploy it with Docker and Kubernetes.

As you can imagine, we have to know more about containers, Docker, Kubernetes, etc., but there’s something else. We need a way to talk to the data. Normally we do that with SQL, but when you have graph data, one of the best languages for the job is SPARQL, so we will be learning that too.

This article is a tutorial with theoretical information on how to deploy AnzoGraph on Kubernetes, based on the tutorial by Cambridge Semantics:

Disclaimer for the tutorial: I’ll be testing everything on macOS, but you can find more information about other platforms in the above link.

Objective

At the end of the article you’ll understand the basics of containers and Kubernetes, how to build a platform on top of them, and the query language for graph databases we’ll be using, SPARQL, all in relation to the stand-alone AnzoGraph graph database.

Introduction to containers and docker

This won’t be a complete overview of containers, just enough information so you can know why we need them. Check more in the resources below.

What is a container?

On Docker’s webpage, there’s a great one-liner for what a container is:

A standardized unit of software

The formal definition of a container is:

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

In simple words, a container is an isolated environment in which you can set up the dependencies that you need in order to perform a task. They’re similar to virtual machines, but function differently because containers virtualize the operating system instead of hardware.

We can visualize a container like this:

https://www.docker.com/resources/what-container

Each container gets its own isolated user space to allow multiple containers to run on a single host machine.

Why do we need containers?

https://xkcd.com/1629/

Containers make our life easier. With them we can test programming environments, software, and tools without harming our main OS, and they are also very useful for distributing our products.

Great people like Ben Weber, Will Koehrsen, Mark Nagelberg and Sachin Abeywardana wrote amazing pieces on the need for containers in the data science space; you can find them in the resources below.

And the biggest use cases are:

  • Reproducibility
  • Improved collaboration
  • Saving time in DevOps and DataOps
  • Easier distribution

There are of course more, but in the data space these are the most important ones. We will be using containers to create a whole isolated system in which to deploy and run the AnzoGraph graph database.

Docker

Docker is the most widely used tool for creating containers. It is an open-source project based on Linux containers.

A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.

Docker images are more like blueprints. Images are the immutable master template that is used to pump out containers that are all exactly alike.

This is the basic workflow of using docker:

https://towardsdatascience.com/digging-into-data-science-tools-docker-bbb9f3579c87

We will be using docker and kubernetes (see below) to deploy our AnzoGraph instance, so the first thing we need to do is install docker. Before doing that, you need to know that there are some requirements for running AnzoGraph with docker:

  • Docker version: Docker Community Edition version 18 or later is required.
  • Operating Systems: MacOS, Linux, Windows 10 Professional or Enterprise edition. Docker uses a hypervisor with a VM, and the host server must support virtualization. Since older Windows versions and Windows 10 Home edition do not support Hyper-V, Windows 10 Professional or Enterprise is required for Docker on Windows. In addition, when using Docker CE on Windows, configure Docker to use Linux containers. Using Microsoft Windows Containers is not supported as it provides Windows API support to Windows container service instances.
  • Available RAM: Minimum: 12 GB; Recommended: 32 GB. AnzoGraph needs enough RAM to store data, intermediate query results, and run the server processes. Cambridge Semantics recommends that you allocate 3 to 4 times as much RAM as the planned data size.
  • Available disk space: AnzoGraph requires 10 GB for internal requirements. The amount of additional disk space required for load file staging, persistence, or backups depends on the size of the data to be loaded. For persistence, Cambridge Semantics recommends that you have twice as much disk space available as RAM on the server.
  • CPU count: Minimum: 2; Recommended 4+.
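To make the sizing rules in the list above a bit more concrete, here is a minimal Python sketch. The helper names (estimate_sizing, unmet_requirements) are hypothetical and not part of Docker or AnzoGraph, and adding the 10 GB of internal disk space on top of the persistence figure is only one possible reading of the recommendation, so treat the numbers as a rough guide.

```python
from dataclasses import dataclass

# Figures quoted in the requirements list above.
MIN_RAM_GB = 12
MIN_CPUS = 2
INTERNAL_DISK_GB = 10


@dataclass
class Sizing:
    ram_gb_low: float   # 3x the planned data size, never below the 12 GB minimum
    ram_gb_high: float  # 4x the planned data size, never below the 12 GB minimum
    disk_gb: float      # assumption: 10 GB internal + 2x the upper RAM figure


def estimate_sizing(data_gb):
    """Hypothetical helper: apply the rules of thumb to a planned data size in GB."""
    low = max(MIN_RAM_GB, 3 * data_gb)
    high = max(MIN_RAM_GB, 4 * data_gb)
    return Sizing(low, high, INTERNAL_DISK_GB + 2 * high)


def unmet_requirements(ram_gb, disk_gb, cpus):
    """Hypothetical helper: list which of the stated minimums a machine misses."""
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb} GB is below the {MIN_RAM_GB} GB minimum")
    if disk_gb < INTERNAL_DISK_GB:
        problems.append(f"Disk: {disk_gb} GB is below the {INTERNAL_DISK_GB} GB needed internally")
    if cpus < MIN_CPUS:
        problems.append(f"CPUs: {cpus} is below the minimum of {MIN_CPUS}")
    return problems


print(estimate_sizing(5))                                  # sizing for 5 GB of data
print(unmet_requirements(ram_gb=16, disk_gb=80, cpus=4))   # an empty list means all minimums are met
```

For a planned 5 GB of data this suggests 15 to 20 GB of RAM, which is why a laptop with 16 GB is fine for the sample data used later but not for much more.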

Let’s get started by installing docker. For Mac go here:

And click on Get Docker:

And follow the instructions in the Docker documentation to install and run Docker.

The AnzoGraph image requires at least 10 GB of available disk space and 7 GiB of available RAM to start the database.

To adjust the settings, right-click the Docker icon and select Settings. Click Disk to view the available disk image size. For example:

Adjust the disk size as needed and then click Apply to save the change. Click Advanced to view the CPU and memory settings. For example:

Adjust the values as needed and then click Apply and Restart to apply the changes and restart Docker.

Note: For the record, I used 80 GB of disk and 16 GB of RAM.

Kubernetes

https://svitla.com/blog/kubernetes-vs-docker

First of all, how do we pronounce Kubernetes? According to this GitHub thread the correct way of saying it is:

koo-ber-nay’-tace

This is me saying it:

And it means “sailing master”. Pretty cool name, isn’t it?

So, if this is the sailor, where’s the ship? And what is the ship?

According to the official documentation, Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. So the “ships” are our containers, and our apps inside of them. Kubernetes is all about abstracting away complexity.

For more information go here:

and the resource section below.

Deploying AnzoGraph in Docker and Kubernetes

Here, we are going to focus on Kitematic, because that’s how we are going to use Kubernetes and Docker with AnzoGraph. Kitematic is a simple application for managing Docker containers on Mac, Linux and Windows.

So, the first thing we have to do is start Docker Kitematic by clicking the Docker icon in the Apple Menu and selecting Kitematic.

That will lead you to download the software:

Then, when everything is installed, just open Kitematic, type anzograph in the search field, and find the AnzoGraph repository:

Then click Create to start the AnzoGraph deployment, and you’ll see this:

Wait for some seconds and when you see the message Press [CTRL+C] to stop, press CTRL+C to close the boot log. AnzoGraph is now running in your Docker instance.

Now you can click on the web preview to see where AnzoGraph is running:

In my case it was running on http://localhost:32871/ but it can change for you.

On the login screen, type admin as the user name and Passw0rd1 as the password, and then click Sign In.

And you are in!!! You should be seeing this:

So, the query section looks like SQL, but it is not. And what was AnzoGraph? I hope you remember. If not, let me give you an idea.

Basics of Anzo and AnzoGraph

https://towardsdatascience.com/deep-learning-for-the-masses-and-the-semantic-layer-f1db5e3ab94b

You can build something called “The Enterprise Knowledge Graph” with Anzo.

AnzoGraph is a stand-alone, native, massively parallel processing (MPP) graph OLAP database built to deliver very fast advanced analytics at big data scale. A version of AnzoGraph also comes integrated with Anzo.

Graph OLAP databases are becoming very important as Machine Learning and AI grow, since a number of Machine Learning algorithms are inherently graph algorithms and run more efficiently on a graph OLAP database than on an RDBMS.

The nodes and edges of the graph flexibly capture a high-resolution twin of every data source — structured or unstructured. The graph can help users answer any question quickly and interactively, allowing users to converse with the data to uncover insights.

The goal of all of this is building a data fabric:

The Data Fabric is the platform that supports all the data in the company: how it’s managed, described, combined and universally accessed. This platform is formed from an Enterprise Knowledge Graph to create a uniform and unified data environment.

If you want to know more, check my articles on the series:

Basics of SPARQL

SPARQL, pronounced “sparkle”, is the query language for the Resource Description Framework (RDF). If you want to know more about RDF check this:

But basically RDF is a directed, labeled graph data format for representing information in the Web. RDF is often used to represent, among other things, personal information, social networks, metadata about digital artifacts, as well as to provide a means of integration over disparate sources of information.
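To see what “directed, labeled graph” means in practice, here is a small Python sketch that turns a handful of triples into nodes with labeled outgoing edges. The triples and the to_graph helper are invented for illustration; real RDF uses URIs and a proper parser, which this sketch leaves out.

```python
from collections import defaultdict

# Hypothetical triples, invented for illustration: (subject, predicate, object).
triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
]


def to_graph(triples):
    """Group triples into {subject: [(predicate, object), ...]}.

    Each subject is a node, each predicate is the label of an edge,
    and the edge points from the subject to the object (that is the "directed" part).
    """
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = to_graph(triples)
for node, edges in graph.items():
    for label, target in edges:
        print(f"{node} --{label}--> {target}")
```

Notice that acme never appears as a key: nothing points out of it, it is only the target of edges.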

We will understand the very basics of SPARQL here, and the goal is to at least understand this query:

SELECT ?g (COUNT(*) AS ?count)
WHERE {
  GRAPH ?g {
    ?s ?p ?o
  }
}
GROUP BY ?g
ORDER BY DESC(?count)
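If the query above looks cryptic, one rough way to read it is: take every triple, note which graph it lives in, count the triples per graph, and list the biggest graphs first. The following Python sketch does the same thing over a hand-made list. The data and the triples_per_graph helper are hypothetical, and this is only an analogy for what the database does, not how AnzoGraph executes the query.

```python
from collections import Counter

# Hypothetical quads (graph, subject, predicate, object), invented for illustration.
quads = [
    ("tickit", "person2", "lastname", "Humphrey"),
    ("tickit", "person2", "city", "Murfreesboro"),
    ("tickit", "person2", "like", "musicals"),
    ("other", "thing1", "label", "example"),
]


def triples_per_graph(quads):
    """Rough analogue of GROUP BY ?g with COUNT(*) and ORDER BY DESC(?count)."""
    counts = Counter(graph for graph, _s, _p, _o in quads)
    return counts.most_common()  # sorted by count, largest first


for graph, count in triples_per_graph(quads):
    print(graph, count)
```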

If you want to know more about SPARQL check the official docs:

Let’s start with some sample data to run some queries. For that on the Query console, replace the default query with the following statement:

LOAD WITH 'global' <s3://csi-sdl-data-tickit/tickit.ttl.gz> INTO GRAPH <tickit>

This statement loads the sample Tickit data from the tickit.ttl.gz directory on the csi-sdl-data-tickit S3 bucket. By the way, the Tickit data set captures sales activity for the fictional Tickit website where people buy and sell tickets for sporting events, shows, and concerts.

The data consists of person, venue, category, date, event, listing, and sales information. By identifying ticket movement over time, success rates for sellers, the best-selling events and venues, and the most profitable times of the year, analysts can use this data to determine what incentives to offer, how to attract new people, and how to drive advertising and promotions.

The following diagram shows the model or ontology for the tickit graph. Circles represent subjects or classes of data and rectangles represent properties:

https://docs.cambridgesemantics.com/anzograph/userdoc/tickit.htm

Then click Run and after a while you should see the message “Update Successful”

Just to check everything is correct, run this query:

SELECT (count(*) as ?number_of_triples)
FROM <tickit>
WHERE { ?s ?p ?o }

This query counts the number of triples in the Tickit data set. You should get:

SPARQL conventions:

  • CAPS: Though SPARQL is case-insensitive, SPARQL keywords in this section are written in uppercase for readability.
  • Italics: Terms in italics are placeholder values that you replace in the query.
  • [ ]: Indicates an optional clause.
  • |: Means OR. Indicates that you can use one or more of the specified options.
  • ^: Means exclusive OR (XOR). Indicates that you can only choose one of the specified options.

The first thing to learn in a database language is how to select and retrieve data. For that, SPARQL, like SQL, provides a SELECT query form for selecting or finding data.

The following simple SELECT statement queries the sample Tickit data set to return all of the predicates and objects for event100:

SELECT ?predicate ?object
FROM <tickit>
WHERE {
<event100> ?predicate ?object
}
ORDER BY ?predicate

You should see:

You may be wondering what the ?predicate and ?object parts are, because the other things (FROM, WHERE and ORDER BY) seem to be the same as in SQL.

In a past article, I explained it with an example:

To have a triple we need a subject and object, and a predicate linking the two.

The concepts of predicate and object come from the concept of triples. As you can see, the subject <Geoffrey Hinton> is related to the object <Researcher> by the predicate <is a>. This may sound easy for us humans, but it takes a very comprehensive framework to do this with machines.

You will find these concepts as entity-attribute-value in other places.

And that’s it, they are just common names and concepts we have in graph databases that form a knowledge-graph.
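To get a feel for what a SELECT with variables does, here is a minimal Python sketch of triple pattern matching. Terms that start with a question mark are variables, everything else has to match exactly. The match function and the extra triples are invented for illustration; a real SPARQL engine does far more (joins, filters, indexes), so read this as a toy model only.

```python
# Illustrative triples: the first one is the example from the text, the rest are invented.
triples = [
    ("Geoffrey Hinton", "is a", "Researcher"),
    ("Geoffrey Hinton", "works on", "Deep Learning"),
    ("Yann LeCun", "is a", "Researcher"),
]


def match(pattern, triples):
    """Return one {variable: value} binding per triple that fits the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break  # a fixed term did not match this triple
        else:
            results.append(binding)
    return results


# Roughly: SELECT ?predicate ?object WHERE { <Geoffrey Hinton> ?predicate ?object }
for row in match(("Geoffrey Hinton", "?predicate", "?object"), triples):
    print(row)
```

Fixing the subject and leaving the other two positions as variables is exactly the shape of the event100 query above.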

Let’s do another example. We will list each distinct event name in the Tickit data set:

SELECT DISTINCT ?name
FROM <tickit>
WHERE {
?event <eventname> ?name.
}

And you should get:

Another important part of SPARQL is the concept of CONSTRUCT.

We use the CONSTRUCT query form to create new data from your existing data. CONSTRUCT queries take each solution and substitute it for the variables in the graph or triple template.

The following example query specifies a triple template that constructs a new age predicate and approximate age value for the person triples in the sample Tickit data set:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT { ?person <age> ?age . }
WHERE { GRAPH <tickit> {
{ SELECT ?person ((YEAR(?date))-(YEAR(xsd:dateTime(?birthdate))) AS ?age)
WHERE {
?person <birthday> ?birthdate .
BIND(xsd:dateTime(NOW()) AS ?date)
}
}
}
}
ORDER BY ?person
LIMIT 50

You should get:

Here the PREFIX clause declares abbreviations for URIs that you want to reference in a query. BIND assigns the result of an expression to a new variable. The GRAPH part in the WHERE clause restricts matching to <tickit>. The rest is very similar to SQL, so you should be able to follow it.
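The idea behind CONSTRUCT, building new triples out of the ones you already have, can be sketched in a few lines of Python. The construct_ages function and the person3 triple are hypothetical; like the query above, the sketch only subtracts years, so the age is approximate.

```python
from datetime import date

triples = [
    ("person2", "birthday", "1995-01-03"),
    ("person2", "lastname", "Humphrey"),
    ("person3", "birthday", "1980-06-15"),  # hypothetical, invented for illustration
]


def construct_ages(triples, today=None):
    """Build new (person, "age", years) triples from existing birthday triples."""
    current_year = (today or date.today()).year
    return [
        (subject, "age", current_year - date.fromisoformat(obj).year)
        for subject, predicate, obj in triples
        if predicate == "birthday"
    ]


# A fixed reference date keeps the example reproducible; omit it to use today's date.
for triple in construct_ages(triples, today=date(2020, 1, 1)):
    print(triple)
```

The original triples are left untouched, which mirrors how a CONSTRUCT query returns new triples without modifying the graph it reads from.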

We can ask our graph yes-or-no questions with ASK. We use the ASK query form to determine whether a particular triple pattern exists in the specified data set. ASK returns true or false, depending on whether a solution or match exists.

For example:

ASK FROM <tickit> { ?s <eventname> "Wicked" . }

For this one you’ll get true 🙂
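In plain Python, ASK boils down to “does at least one triple fit this pattern?”. Here is a minimal sketch; the ask function and the sample triples are invented for illustration, and terms starting with a question mark are simply treated as wildcards.

```python
# Hypothetical triples, invented for illustration.
triples = [
    ("event1", "eventname", "Wicked"),
    ("event2", "eventname", "Mamma Mia!"),
]


def ask(pattern, triples):
    """True if at least one triple fits the pattern, False otherwise.

    Simplification: every "?name" term matches anything, so a variable
    repeated within one pattern is not forced to take the same value.
    """
    return any(
        all(term.startswith("?") or term == value for term, value in zip(pattern, triple))
        for triple in triples
    )


print(ask(("?s", "eventname", "Wicked"), triples))
print(ask(("?s", "eventname", "Hamilton"), triples))
```

Because any() stops at the first hit, the check does not need to visit every triple, which is the same reason an ASK can be cheaper than a full SELECT.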

The last query form we will use is DESCRIBE. We use the DESCRIBE query form to return all triples that are associated with a specified resource, not just the triples that are bound to any variables that you specify.

For example, the following simple DESCRIBE query describes all of the resources that are associated with person2 in the sample Tickit data set:

DESCRIBE <person2>
FROM <tickit>

And you’ll get:

 s            | p                                               | o
--------------+-------------------------------------------------+------------------
 person6048   | friend                                          | person2
 person9251   | friend                                          | person2
 listing30988 | sellerid                                        | person2
 sales28393   | sellerid                                        | person2
 sales28394   | sellerid                                        | person2
 person2      | lastname                                        | Humphrey
 person2      | like                                            | musicals
 person2      | birthday                                        | 1995-01-03
 person2      | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | person
 person2      | card                                            | 9955429152637722
 person2      | city                                            | Murfreesboro
 person2      | friend                                          | person48892
 person2      | friend                                          | person15323
...
99 rows
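Looking at the output above, person2 shows up both as a subject and as an object. A minimal Python sketch of that behaviour could look like this. The describe function and the last sample triple are invented for illustration, and what exactly DESCRIBE returns is up to each database, so this only mimics the result shown here.

```python
triples = [
    ("person6048", "friend", "person2"),
    ("listing30988", "sellerid", "person2"),
    ("person2", "lastname", "Humphrey"),
    ("person2", "city", "Murfreesboro"),
    ("person48892", "city", "Somewhere"),  # hypothetical, unrelated to person2
]


def describe(resource, triples):
    """Return every triple in which the resource is the subject or the object."""
    return [t for t in triples if t[0] == resource or t[2] == resource]


for s, p, o in describe("person2", triples):
    print(s, "|", p, "|", o)
```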

There’s much more to learn about SPARQL, and another great source of information is the SPARQL reference from Cambridge Semantics itself:

Zeppelin in AnzoGraph

The last piece we are going to cover is how to use Zeppelin and connect it to AnzoGraph, the stand-alone graph analytics database. Zeppelin is a web-based notebook similar to Jupyter, and we will use a deployment that comes with an integrated SPARQL interpreter, which lets us make a secure, authenticated connection to AnzoGraph using the gRPC protocol.

The first step is to open the Docker CLI and choose a folder to store Zeppelin. This is what I did:

  1. Open Kitematic and click on Docker CLI.

  2. Choose a folder and then run:

[ -d "$PWD/logs" ] || mkdir -p "$PWD/logs"
[ -d "$PWD/notebook" ] || mkdir -p "$PWD/notebook"

Finally run the following command to download and run the Zeppelin image on port 8080:

docker run -p 8080:8080 --name=zeppelin -v $PWD/logs:/logs -v $PWD/notebook:/notebook \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_WEBSOCKET_MAX_TEXT_MESSAGE_SIZE=10240000 \
-d cambridgesemantics/contrib-zeppelin:latest \
/zeppelin/bin/zeppelin.sh
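If you prefer to launch the container from a script, the same command can be assembled in Python, which is also a handy place to annotate what each flag does. This is only a sketch: zeppelin_run_args is a hypothetical helper, the flags are copied from the command above, and the host_port parameter is an addition in case port 8080 is already taken on your machine.

```python
def zeppelin_run_args(workdir, host_port=8080):
    """Hypothetical helper: build the docker run command shown above as an argument list."""
    return [
        "docker", "run",
        "-p", f"{host_port}:8080",             # host port -> port 8080 inside the container
        "--name=zeppelin",                     # container name, so you can stop or remove it later
        "-v", f"{workdir}/logs:/logs",         # mount the local logs folder into the container
        "-v", f"{workdir}/notebook:/notebook", # mount the local notebook folder into the container
        "-e", "ZEPPELIN_NOTEBOOK_DIR=/notebook",
        "-e", "ZEPPELIN_LOG_DIR=/logs",
        "-e", "ZEPPELIN_WEBSOCKET_MAX_TEXT_MESSAGE_SIZE=10240000",
        "-d", "cambridgesemantics/contrib-zeppelin:latest",  # run detached from this image
        "/zeppelin/bin/zeppelin.sh",           # command executed inside the container
    ]


# Print the command instead of running it, so nothing is started by accident.
print(" ".join(zeppelin_run_args("/path/to/your/folder")))
```

To actually start the container you could pass the list to subprocess.run(args, check=True) with Docker running. Passing a list rather than one long string avoids shell quoting problems when the folder path contains spaces.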

When the deployment is complete, open Zeppelin by going to the following URL in your browser:

http://localhost:8080/#/ # this may be different

You should be seeing this screen at the moment:

To connect to AnzoGraph, click the anonymous drop-down list and select Interpreter:

Now type “sparql” and find the SPARQL interpreter:

Click the edit button and modify the interpreter to enter your AnzoGraph deployment details and make a secure connection to the database:

  • anzo.graph.host: The IP address of the AnzoGraph host.
  • anzo.graph.password: The password for the user in the anzo.graph.user field.
  • anzo.graph.port: The gRPC port for AnzoGraph. The default value is 5700. Do not change this value.
  • anzo.graph.trust.all: Instructs Zeppelin to trust the AnzoGraph SSL certificates. Accept the default value of true.
  • anzo.graph.user: The username to use to log in to AnzoGraph.
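Before saving the settings in the list above, it can save some head-scratching to check that the gRPC port is reachable from where Zeppelin runs. Here is a minimal Python sketch using only the standard library. The port_is_open helper is hypothetical, and it only tests that a TCP connection can be opened; it says nothing about whether the user name and password are correct.

```python
import socket


def port_is_open(host, port=5700, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout.

    5700 is the default AnzoGraph gRPC port mentioned in the list above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

For example, port_is_open("192.168.1.20") would check a machine at that (made-up) address. If it returns False, look at the IP address in anzo.graph.host and at the port mappings of the AnzoGraph container before suspecting the credentials.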

When you finish adding the connection details, click Save at the bottom of the screen. Zeppelin displays a dialog box that asks if you want to restart the interpreter with the new settings. Click OK to configure the connection.

When the interpreter restart is complete, click the Zeppelin logo at the top of the screen to return to the Welcome screen.

Now, to use the platform, download the AnzoGraph Tutorial Zeppelin Notebook and extract the downloaded ZIP file on your computer. The ZIP file contains AnzoGraph-Tutorial-Notebook.json.

On the Zeppelin Welcome screen, click Import note:

On the import screen, click Select JSON File, and then select the AnzoGraph-Tutorial-Notebook.json file to import. To run a query in the notebook, click the run button for its paragraph. Take a look at the previous section, where we discussed SPARQL, to understand the queries 🙂

You should be seeing this:

Conclusion

This is just the beginning of our process to create a data fabric. So far we’ve set up the theoretical framework, and now we have taken the first practical steps by launching AnzoGraph locally.

We learned about containers, Docker, Kubernetes, SPARQL, RDF and Zeppelin, and how to query graph data. If you already know SQL the change is not drastic, but you do need to learn a new language.

In the future, we will transform our “common datasets” into graphs to then start building our knowledge-graph and data fabric, run queries on it, and finally do machine learning to predict the future.

Thanks for reading and if you want to know more please follow me here, and on LinkedIn and Twitter:

read original article at https://towardsdatascience.com/https-towardsdatascience-com-the-data-fabric-containers-kubernetes-309674527d16?source=rss——artificial_intelligence-5