I’m very excited to use this first post to talk about Dgraph: what it is and why it was created.
Before I explain what Dgraph is, let’s start with a basic understanding of graphs. A graph is a mathematical structure used to model pairwise relationships between entities. A graph is thus composed of nodes connected by edges. Each node represents an entity (a person, place, thing, etc.), and each edge represents the relationship between two nodes. Some popular graphs that we all know about are the Facebook Social Graph and the Google Knowledge Graph.
A graph database is a database that uses graph structures with nodes and edges to represent, store and serve data.
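To make the definition concrete, here is a minimal sketch of a graph in Python. The class, the entities, and the relationship labels are all invented for illustration; real graph databases use far more sophisticated storage and indexing:

```python
# A minimal graph model: nodes are entities, edges are labeled relationships.
# All names and relationships here are illustrative, not from any real dataset.

class Graph:
    def __init__(self):
        self.nodes = set()
        self.edges = []  # list of (source, label, target) tuples

    def add_edge(self, source, label, target):
        self.nodes.update([source, target])
        self.edges.append((source, label, target))

    def neighbors(self, node, label):
        """Return all targets connected to `node` via an edge with `label`."""
        return [t for s, l, t in self.edges if s == node and l == label]

g = Graph()
g.add_edge("Alice", "friend_of", "Bob")
g.add_edge("Alice", "lives_in", "San Francisco")
g.add_edge("Bob", "lives_in", "Sydney")

print(g.neighbors("Alice", "friend_of"))  # ['Bob']
```

A graph database serves queries like `neighbors` (and far richer traversals) directly over this node-and-edge structure, rather than reconstructing relationships through joins.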
But who really uses graph databases? More teams and companies than you’d think. Google, Facebook, Twitter, eBay, LinkedIn, Amazon, Dropbox, Pinterest – pick a company you are familiar with. If they’re doing something smart, chances are they’re using a graph database. Even very simple web apps have much to gain from graph databases. In the past, I built a graph-based REST framework, and using it cut my code in half.
So now that we understand graphs, let’s talk about Dgraph.
Dgraph is an open-source, low-latency, high-throughput, native and distributed graph database.
To understand why it was created, let’s rewind a few years back to 2011.
I had been at Google for over four years with the Web Search Infrastructure Group. Google had acquired Metaweb a year earlier, in 2010. I’d been wrapping my head around the newly acquired Knowledge Graph, trying to find ways to integrate it with Google Search. This is when I found a problem.
At Google, we had multiple knowledge-bearing feeds called One Boxes. You know, the boxed snippets that sometimes show up at the top of the search results, for instance when you search for Tesla stock, weather in Sydney, or events in San Francisco.
There were multiple custom-built backends, each serving a One Box. A search query hitting www.google.com would be sent iteratively through each of these One Box backends to check whether any of them had a response. When one of the backends responded, the One Box data was retrieved and rendered at the top of the search results page. This is how that well-formatted box with just the right information shows up below the search bar, saving you a few clicks.
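That iterative routing can be sketched in a few lines. The backends, their matching logic, and their responses below are entirely hypothetical; the real One Box infrastructure was of course far more involved:

```python
# Illustrative sketch of iterative One Box routing: each custom backend is
# asked in turn whether it can answer the query; the first non-empty
# response wins. Backend names and matching logic are made up.

def stock_onebox(query):
    return "TSLA: (latest quote)" if "stock" in query.lower() else None

def weather_onebox(query):
    return "Sydney: 22°C, sunny" if "weather" in query.lower() else None

ONEBOX_BACKENDS = [stock_onebox, weather_onebox]

def route_query(query):
    for backend in ONEBOX_BACKENDS:  # one probe per backend, in sequence
        response = backend(query)
        if response is not None:
            return response  # rendered at the top of the results page
    return None  # no One Box matched; show plain search results

print(route_query("weather in Sydney"))  # Sydney: 22°C, sunny
```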
As good as it sounded, One Boxes had several inefficiencies and missed opportunities.
For starters, each One Box was custom built by a separate team that was responsible for running and maintaining it. As a result, there was little sharing of the frameworks used to build the One Boxes.
This also meant that there was no single standard for the data format used by the One Boxes. Each One Box kept its data in its very own data structure, and there was no common query language. Thus, there was no opportunity to share data among the boxes and respond to more interesting queries that required intersecting diverse data feeds.
A good example of this would be the ability to recommend events based on the weather to a tourist exploring NYC – that couldn’t easily be done with the existing system.
This motivated me to start a project to standardize the data structures and eventually serve them all from a single backend. Drawing on the vast expertise of the Metaweb team, we chose RDF triples, the data normalization format also used by the Knowledge Graph. By reconciling the various entities across the different data feeds, we could start to reuse the data.
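An RDF triple is simply a (subject, predicate, object) statement. The sketch below, with entity IDs and facts invented for illustration, shows why reconciling entities matters: once two feeds refer to the same entity by the same identifier, a single query can cross them (for instance, events plus weather, as in the NYC example above):

```python
# RDF-style triples: each fact is a (subject, predicate, object) statement.
# All entity IDs and facts below are made up for illustration.

weather_feed = [
    ("/city/nyc", "condition", "sunny"),
    ("/city/nyc", "temperature_c", 24),
]
events_feed = [
    ("/event/concert_in_park", "located_in", "/city/nyc"),
    ("/event/concert_in_park", "venue_type", "outdoor"),
]

# Because both feeds use the same reconciled ID "/city/nyc", their
# facts compose into one queryable graph.
triples = weather_feed + events_feed

def objects(subject, predicate):
    return [o for s, p, o in triples if s == subject and p == predicate]

# Cross-feed query: outdoor events located in cities where it's sunny.
sunny_cities = {s for s, p, o in triples if p == "condition" and o == "sunny"}
outdoor_events = [
    s for s, p, o in triples
    if p == "located_in" and o in sunny_cities
    and "outdoor" in objects(s, "venue_type")
]
print(outdoor_events)  # ['/event/concert_in_park']
```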
But, there was a second and more challenging part to the problem.
It was to build a system that could serve structured queries over data updating in real time. The system had to run behind Web Search, which meant that if it didn’t respond within the allocated milliseconds, Search would time out and move on. This system also had to handle a major chunk of the query load to Web Search, which amounted to thousands of queries per second.
We basically had to build a low-latency, high-throughput system to serve graph queries.
It was certainly an exciting project and held much promise. But the harsh realities of the business environment and the attendant politics resulted in the cancellation of the project. Shortly thereafter, I left Google in 2013 and didn’t give the project much more thought.
Fast forward two years: I was hanging out on the Go language’s Slack channel and Stack Overflow, and saw quite a few people complaining about a popular graph database’s performance and stability.
That’s when I realized that graph databases were being used more widely than it appeared on the surface. But a bit more digging revealed a deeper problem.
Existing native graph databases weren’t designed to be performant or distributed.
The ones that sharded the data and distributed it across a cluster weren’t actually native graph databases; they largely served as a graph layer over another database. This meant making many network calls whenever the number of intermediate results was large, which degrades performance.
For example, say you wanted to find [People living in SF who eat Sushi]. Assuming you have this data (hey, Facebook!) and keeping things simple, this requires two steps.
First, you find all the people living in SF; then you intersect that list with all the people who eat Sushi.
As you can imagine, the intermediate step here has a large fan-out, i.e., over a million results. If you were to shard the data by entities (people), you’d end up broadcasting to every server in the cluster. Thus, this query would be affected by even a single slow machine.
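A toy simulation of that fan-out problem, with made-up people and an arbitrary shard count: because people are hashed across shards by entity, the "lives in SF" step has no choice but to query every shard:

```python
# Simulating entity-based sharding: people are hashed across shards, so
# finding everyone who lives in SF requires broadcasting to ALL shards.
# The data and shard count are invented for illustration.

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # person -> attributes

def add_person(name, city, foods):
    shard = shards[hash(name) % NUM_SHARDS]
    shard[name] = {"city": city, "foods": foods}

add_person("alice", "SF", {"sushi"})
add_person("bob", "SF", {"pizza"})
add_person("carol", "NYC", {"sushi"})

def people_in(city):
    network_calls = 0
    results = []
    for shard in shards:  # one RPC per shard: a full broadcast
        network_calls += 1
        results += [p for p, a in shard.items() if a["city"] == city]
    return results, network_calls

people, calls = people_in("SF")
print(sorted(people), calls)  # ['alice', 'bob'] 4
```

The call count grows with the cluster, and the slowest of those broadcast calls sets the query's latency floor.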
Do that for every query, and it would spike the 95th-percentile latency numbers dramatically.
Dgraph takes a different approach: it shards the data by predicates (relationships) rather than by entities. This allows us to shard and relocate the data better, to minimize the number of network calls required per query. In fact, the above query would run in two network calls, irrespective of the cluster size.
The number of network calls is directly proportional to the complexity of the query, not the number of intermediate or final results.
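To illustrate predicate-based sharding, here is a minimal sketch with invented UIDs and posting lists. Each predicate's data lives whole on one server as sorted lists of UIDs, so the [People living in SF who eat Sushi] query costs exactly one call per predicate:

```python
# Predicate-based sharding sketch: each predicate's full posting lists live
# on one server, so this query takes 2 network calls (one per predicate),
# regardless of cluster size. All UIDs and data are illustrative.

lives_in_server = {"SF": [1, 2, 5, 7], "NYC": [3, 4]}   # "lives_in" predicate
eats_server = {"sushi": [2, 3, 7, 9], "pizza": [1, 4]}  # "eats" predicate

def intersect_sorted(a, b):
    """Merge-intersect two sorted uid lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

sf = lives_in_server["SF"]       # network call 1
sushi = eats_server["sushi"]     # network call 2
print(intersect_sorted(sf, sushi))  # [2, 7]
```

However large the intermediate lists get, the call count stays tied to the number of predicates in the query, which is the property the paragraph above describes.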
Dgraph is designed to easily scale from meeting the needs of a small startup to that of Dropbox, or even Facebook. This means being able to run on a laptop as well as on a big cluster of hundreds of machines serving thousands of queries per second.
Additionally, it has to survive machine failures and partial data center collapses. The data stored has to be automatically replicated with no single point of failure, and be able to move around the cluster to better distribute traffic.
Apart from serving diverse social and knowledge graphs, Dgraph can also be used for a range of other applications.
We’ll be reporting some performance numbers for Dgraph in our next few posts, to give you an idea of what you can expect from the system.