For anyone who is interested in learning about Dgraph technology, we have a Dgraph paper available for download. The paper describes in fair detail the internals of Dgraph. However, I was surprised to find that we didn’t have any blogs or articles around the basics of Dgraph, providing much-needed context to the dense academic language. Today, that ends! This is the start of a new series of blog posts that will use the Dgraph paper as a guide, providing background, context, and basic principles for the topics in each section of the paper.
Let’s dive into the fundamentals of the technology behind Dgraph.
Edges may have arrows (directed edges) or have no arrows (undirected edges). A directed edge represents a one-way relationship. An undirected edge simply represents the existence of a relationship. No, you did not read that wrong. An undirected edge DOES NOT represent a two-way relationship.
A graph is an abstract notion. A graph can represent many things.
For example, the same graph above can represent a social network. Each node is a person, and each edge is a friendship. Here I use an undirected edge representing the existence of a friendship.
The same graph can also represent molecular bonds (first person to figure out the chemical wins admiration from me).
As you can see, graphs allow for very flexible representations of things. Many things and the relationship between them can be represented as graphs.
In the section above, a graph is represented as a drawing. Unfortunately, computers do not reason well with drawings. So we have to represent them in ways that a computer could understand them. There are many ways graphs may be represented. In this post, I'll start with an abstract representation that is closest to the way Dgraph represents them.
First, we'll have to label the nodes. The most convenient labeling scheme is to use numbers. Edges are directed edges, so there is a "from" and a "to".
Now let’s consider how the nodes are connected, and we’ll list them out:
|Node From||Node To|
Now we have a list of pairs of nodes. This is commonly known as an edge list. As you can see, it’s simply a list of edges. If we have multiple edges, then we will need a third column, indicating the kind of edge. For now, the edges are unlabelled.
What exactly then are the nodes? The nodes are simply numbers1.
You may notice that in the graphs described in the sections above, neither the node nor edges carry any data. All we had was a labeling of the nodes with a number. Sometimes, this is all the information that is required. For example, if we are building a map application like Google Maps or Apple Maps, all we need are the labeled nodes representing areas of interests and the edges representing traversable paths (e.g. roads if by car, sidewalk if by walking).
However, Dgraph is a graph database. We want to store data in graphs. What does it mean to store data in graphs? From now on, I will use a simpler version of the graph above, as we will be drawing a lot more things.
The downside to this sort of representation is that it is very rigid. If you want to add more fields - for example, we also want to note what language each person speaks - we’d have to regenerate the whole graph.
The alternate way is to reuse the numbers as an identifier, and then we can look up the data elsewhere. Let’s take a look at the edgelist (I’ve truncated it for readability):
|Node From||Node To|
Now let’s give these nodes names and ages and put them in a lookup table:
The downside to this sort of representation is that when you want to query the data, you would need to find the data elsewhere. This extra step slows things down. However, this is still a viable approach. Some popular graph databases store their data this way2.
If it’s not a good idea to store data in nodes, we might then look at storing data in edges. However, upon inspection of the edgelist, it would seem very silly.
Looking at the table once more:
|Node From||Node To||Data|
|Node From||Node To||Data|
Observe that in the data, there are many repeats. There are three copies of “Chloe” and two copies of “Chris”.
The downside to this sort of representation is that you would need a lot of memory to store very little data. Furthermore, updating the data would require updating in many places.
What Dgraph chose to do is to store data as edges. But in order to fully explain this, I’d first need to explain predicates.
I had alluded to this earlier - a graph may have multiple edges between two nodes. Thus we would have to give names to each edge. Take the social network example. Previously we had determined the edges to be a friendship. Now, let’s change what an edge represents to “knows” - so if there is an edge between Alice and Bob, then we can say “Alice knows Bob”. We’ll also change the edge to be a directed edge.
Now human relationships are subtle and many-layered. There are many types of relationships3. Imagine the social network example above represents a social network of authors4. Authors influence each others’ works all the time. So another type of relationship could be “influences”. In fact let’s use that:
In the graph above, black edges represent “knows” while “influences” edges represents “influences”. So Chris knows Claire. And Claire influences Chris.
Most importantly, this graph has two TYPES of edges:
In Dgraph, we call the type of edges a Predicate5.
Having introduced the notion of a predicate, we can now press on to see how Dgraph stores data.
Recall that we may store data in the table that defines a graph. But storing data in an edge is wasteful. However, if we make the properties of a node into an edge, then things would change quite dramatically.
Still using the example of the social network, now let's turn the properties "Name" and "Age" into edges. The tables are gone because we're no longer storing the nodes with their properties. So we replace them with dots:
Green edges represent the "Name" predicate. Purple edges represent the "Age" predicate. Black edges represent the "knows" predicate. Red edges represent the "influences" predicate.
This looks VERY complicated. However, it’s simpler when viewed as a table (I’ve truncated the table to ensure that it’s readable):
|Node From||Node To||Data|
Here we see there are some edges that carry data and some don’t. Those edges that carry data only carry the data of the “To” node.
Congratulations. You now have a rough understanding of the primary data structure that powers Dgraph: the Posting List. An in-depth analysis of the Posting List will be done later in this blog series. The next blog post will talk about types and why we need them.
Here’s a fun fact: the footnotes form a graph. See if you can figure it out!
The numbers are for all intents and purposes, for labeling only. They are not data. How is this distinction made? If I change the numberings of the nodes, the graph structure remains the same. The edge list will have to change, of course, but the structure remains the same. You may think of it this way: if Peter and Jane are friends with each other, and one day Peter changes his name to Spiderman, the two people are still friends6! ↩︎
The look-up tables are usually called record files. Each column in a record file is called a property record. Property records are linked together as linked lists. Each node would simply link to the first property record, forming a “row”. ↩︎
The different types of relationships themselves form a graph. For example, “friendship” subsumes “acquaintances”; and “acquaintances” subsumes “knowledge” - that is, you can’t be friends without also becoming acquaintances; you can’t be acquaintances without first knowing the other person7. Build up a list of dependencies and you’ll find that human relationship types also form a graph. ↩︎
Obviously, they’re very young writers. Perhaps this is a authors’ social network taken from the databases of NaNoWriMo whi5ch skews young. ↩︎
Strictly speaking, a predicate is a grammatical component. The job of a predicate is to make a claim. So in the sentence “Chris knows Claire”, the word “knows” performs the job of making the claim that Chris knows Claire. This is also linked to the notion of a predicate in logic - where a predicate refers to a sentence that can be judged to be true or false8. ↩︎
I will acknowledge that changing one’s identity often causes strifes in friendships in real life. Changing identities in computer systems also cause some issues. I did mention that I don’t want to get into philosophical conundrums didn’t I9? Look what you made me do! ↩︎
No, not “knowing” someone in the Biblical way. Get your head out of the guttter! ↩︎
A sentence like Chomsky’s famous “Colourless green ideas sleep furiously” is not a logical predicate, while “the cat is black” is a logical predicate. I also used the word “judged” instead of “evaluated”, because strictly speaking, only expressions may be evaluated. Statements are judged. ↩︎
I use “label” instead of “references”, “names”, or “identity” because those words have specific connotation in philosophy. I don’t want to get into philosophical conundrums about names and references, which will take more than a book to disentangle. Labels are not data. ↩︎