Graph Database Basics: Graphs

For anyone who is interested in learning about Dgraph technology, we have a Dgraph paper available for download. The paper describes in fair detail the internals of Dgraph. However, I was surprised to find that we didn’t have any blogs or articles around the basics of Dgraph, providing much-needed context to the dense academic language. Today, that ends! This is the start of a new series of blog posts that will use the Dgraph paper as a guide, providing background, context, and basic principles for the topics in each section of the paper.

Let’s dive into the fundamentals of the technology behind Dgraph.

What is a Graph?

A graph is made up of vertices and edges. To make things a bit more concrete, you may think of a vertex (or node) as a dot drawn on paper. When you have multiple nodes on paper, you may connect any two nodes with a line. That line is called an edge. Sometimes there can be multiple edges from one node to another.

Edges may have arrows (directed edges) or have no arrows (undirected edges). A directed edge represents a one-way relationship. An undirected edge simply represents the existence of a relationship. No, you did not read that wrong. An undirected edge DOES NOT represent a two-way relationship.

A graph is an abstract notion. A graph can represent many things.

For example, the same graph above can represent a social network. Each node is a person, and each edge is a friendship. Here I use an undirected edge representing the existence of a friendship.

The same graph can also represent molecular bonds (first person to figure out the chemical wins admiration from me).

As you can see, graphs allow for very flexible representations of things. Many things and the relationship between them can be represented as graphs.

How are Graphs Represented

In the section above, a graph is represented as a drawing. Unfortunately, computers do not reason well with drawings. So we have to represent them in ways that a computer could understand them. There are many ways graphs may be represented. In this post, I'll start with an abstract representation that is closest to the way Dgraph represents them.

First, we'll have to label the nodes. The most convenient labeling scheme is to use numbers. Edges are directed edges, so there is a "from" and a "to".

Now let’s consider how the nodes are connected, and we’ll list them out:

Node From	Node To
1	2
1	16
2	3
2	7
3	4
3	17
4	5
4	18
5	6
5	19
6	1
6	20
7	8
7	22
7	21
8	9
8	10
9	11
9	12
9	13
10	14
10	15

Now we have a list of pairs of nodes. This is commonly known as an edge list. As you can see, it’s simply a list of edges. If we have multiple edges, then we will need a third column, indicating the kind of edge. For now, the edges are unlabelled.

What exactly then are the nodes? The nodes are simply numbers¹.

Data and Graphs

You may notice that in the graphs described in the sections above, neither the node nor edges carry any data. All we had was a labeling of the nodes with a number. Sometimes, this is all the information that is required. For example, if we are building a map application like Google Maps or Apple Maps, all we need are the labeled nodes representing areas of interests and the edges representing traversable paths (e.g. roads if by car, sidewalk if by walking).

However, Dgraph is a graph database. We want to store data in graphs. What does it mean to store data in graphs? From now on, I will use a simpler version of the graph above, as we will be drawing a lot more things.

Storing Data in the Node

One way of storing data is to store data in the node itself. For example, let's say a node represents a person. A person has a name and age. So we may store the data in the node. So instead of a number, we turn our node into a data structure - a list of key:value pairs that hold the name and edge.

The downside to this sort of representation is that it is very rigid. If you want to add more fields - for example, we also want to note what language each person speaks - we’d have to regenerate the whole graph.

Storing Data Elsewhere

The alternate way is to reuse the numbers as an identifier, and then we can look up the data elsewhere. Let’s take a look at the edgelist (I’ve truncated it for readability):

Node From	Node To
1	2
2	3
2	7
3	4

Now let’s give these nodes names and ages and put them in a lookup table:

Node	Name	Age
1	Charles	20
2	Chloe	20
3	Chris	20
4	Claire	20
7	Calvin	20

The downside to this sort of representation is that when you want to query the data, you would need to find the data elsewhere. This extra step slows things down. However, this is still a viable approach. Some popular graph databases store their data this way².

Storing Data in Edges

If it’s not a good idea to store data in nodes, we might then look at storing data in edges. However, upon inspection of the edgelist, it would seem very silly.

Looking at the table once more:

Node From	Node To	Data
1	2	???
2	3	???
2	7	???
3	4	???

We see that it can be quite confusing as to what kind of data could be stored there. One may presumably put a pair of key:value data structures (think hash maps in Java, dictionaries in Python, objects in JavaScript). The first would correspond to the data of the “from” node. The second of which would correspond to the data of “to” node. The table will look something like this:

Node From	Node To	Data
1	2	`{Name: Charles, Age: 20}`, `{Name: Chloe, Age: 20}`
2	3	`{Name: Chloe, Age: 20}`, `{Name: Chris, Age:20}`
2	7	`{Name: Chloe, Age: 20}`, `{Name: Calvin, Age: 20}`
3	4	`{Name: Chris, Age: 20}`, `{Name: Claire, Age:20}`

Observe that in the data, there are many repeats. There are three copies of “Chloe” and two copies of “Chris”.

The downside to this sort of representation is that you would need a lot of memory to store very little data. Furthermore, updating the data would require updating in many places.

What Dgraph chose to do is to store data as edges. But in order to fully explain this, I’d first need to explain predicates.

Predicates: The Type of Edge

I had alluded to this earlier - a graph may have multiple edges between two nodes. Thus we would have to give names to each edge. Take the social network example. Previously we had determined the edges to be a friendship. Now, let’s change what an edge represents to “knows” - so if there is an edge between Alice and Bob, then we can say “Alice knows Bob”. We’ll also change the edge to be a directed edge.

Now human relationships are subtle and many-layered. There are many types of relationships³. Imagine the social network example above represents a social network of authors⁴. Authors influence each others’ works all the time. So another type of relationship could be “influences”. In fact let’s use that:

In the graph above, black edges represent “knows” while “influences” edges represents “influences”. So Chris knows Claire. And Claire influences Chris.

Most importantly, this graph has two TYPES of edges:

“knows”
“influences”

In Dgraph, we call the type of edges a Predicate⁵.

Storing Data as Edges

Having introduced the notion of a predicate, we can now press on to see how Dgraph stores data.

Recall that we may store data in the table that defines a graph. But storing data in an edge is wasteful. However, if we make the properties of a node into an edge, then things would change quite dramatically.

Still using the example of the social network, now let's turn the properties "Name" and "Age" into edges. The tables are gone because we're no longer storing the nodes with their properties. So we replace them with dots:

Green edges represent the "Name" predicate. Purple edges represent the "Age" predicate. Black edges represent the "knows" predicate. Red edges represent the "influences" predicate.

This looks VERY complicated. However, it’s simpler when viewed as a table (I’ve truncated the table to ensure that it’s readable):

Node From	Node To	Data
1	2
1	23	20
1	24	Charles
2	23	20
2	25	Chloe
2	3
3	23	20
3	26	Chris
…	…	…

Here we see there are some edges that carry data and some don’t. Those edges that carry data only carry the data of the “To” node.

Congratulations. You now have a rough understanding of the primary data structure that powers Dgraph: the Posting List. An in-depth analysis of the Posting List will be done later in this blog series. The next blog post will talk about types and why we need them.

Footnotes

Here’s a fun fact: the footnotes form a graph. See if you can figure it out!

The numbers are for all intents and purposes, for labeling only. They are not data. How is this distinction made? If I change the numberings of the nodes, the graph structure remains the same. The edge list will have to change, of course, but the structure remains the same. You may think of it this way: if Peter and Jane are friends with each other, and one day Peter changes his name to Spiderman, the two people are still friends⁶! ↩︎
The look-up tables are usually called record files. Each column in a record file is called a property record. Property records are linked together as linked lists. Each node would simply link to the first property record, forming a “row”. ↩︎
The different types of relationships themselves form a graph. For example, “friendship” subsumes “acquaintances”; and “acquaintances” subsumes “knowledge” - that is, you can’t be friends without also becoming acquaintances; you can’t be acquaintances without first knowing the other person⁷. Build up a list of dependencies and you’ll find that human relationship types also form a graph. ↩︎
Obviously, they’re very young writers. Perhaps this is a authors’ social network taken from the databases of NaNoWriMo whi5ch skews young. ↩︎
Strictly speaking, a predicate is a grammatical component. The job of a predicate is to make a claim. So in the sentence “Chris knows Claire”, the word “knows” performs the job of making the claim that Chris knows Claire. This is also linked to the notion of a predicate in logic - where a predicate refers to a sentence that can be judged to be true or false⁸. ↩︎
I will acknowledge that changing one’s identity often causes strifes in friendships in real life. Changing identities in computer systems also cause some issues. I did mention that I don’t want to get into philosophical conundrums didn’t I⁹? Look what you made me do! ↩︎
No, not “knowing” someone in the Biblical way. Get your head out of the guttter! ↩︎
A sentence like Chomsky’s famous “Colourless green ideas sleep furiously” is not a logical predicate, while “the cat is black” is a logical predicate. I also used the word “judged” instead of “evaluated”, because strictly speaking, only expressions may be evaluated. Statements are judged. ↩︎
I use “label” instead of “references”, “names”, or “identity” because those words have specific connotation in philosophy. I don’t want to get into philosophical conundrums about names and references, which will take more than a book to disentangle. Labels are not data. ↩︎

Thu, Mar 11, 2021