In this post, we’ll look at how Dgraph performs as we vary the number of nodes in the cluster, the machine specs, and the load on the server, to answer the ultimate question: can it really scale?
Freebase is an online collection of structured data drawn from many sources, including individual, user-submitted contributions. It currently holds 1.9 billion RDF N-Triples, amounting to 250GB of uncompressed data. On top of that, the dataset is over 95% accurate and has a complex, rich, real-world schema, making it an ideal dataset to test the performance of Dgraph. We decided not to use the entire dataset, as it wasn’t necessary for our goal here.
Given our love for movies, we narrowed it down to the film data. We ran some scripts to filter in only the movie data; all the data and scripts are present in our benchmarks repository. The filtered set has two million nodes, representing directors, actors, films, and all the other objects in the database, and 21 million edges (including 4M edges for names) representing the relationships between them.
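To give an idea of what that filtering looks like, here is a minimal sketch in Go that keeps only triples whose predicate belongs to the film domain. The input file name and the assumption that predicates appear in the shortened `<film.…>` form are for illustration only; the actual scripts live in the benchmarks repository.

```go
// filterfilm.go — a rough sketch of the kind of filtering our scripts do:
// keep only triples whose predicate belongs to the film domain.
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical input dump; predicates are assumed to be in the
	// shortened form, e.g. <film.film.starring>.
	f, err := os.Open("freebase-rdf.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	scanner := bufio.NewScanner(gz)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)

	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	for scanner.Scan() {
		line := scanner.Text()
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		// Keep triples whose predicate is in the film domain.
		if strings.HasPrefix(fields[1], "<film.") {
			fmt.Fprintln(out, line)
		}
	}
}
```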
Some interesting information about this data:
# film.film --{film.film.starring}--> [mediator] --{film.performance.actor}--> film.actor
# Film --> Mediator
$ zgrep "<film.film.starring>" rdf-films.gz | wc -l
1397647
# Mediator --> Actor
$ zgrep "<film.performance.actor>" rdf-films.gz | wc -l
1396420
# Film --> Director
$ zgrep "<film.film.directed_by>" rdf-films.gz | wc -l
242212
# Director --> Film
$ zgrep "<film.director.film>" rdf-films.gz | wc -l
245274
# Film --> Initial Release Date
$ zgrep "<film.film.initial_release_date>" rdf-films.gz | wc -l
240858
# Film --> Genre
$ zgrep "<film.film.genre>" rdf-films.gz | wc -l
548152
# Genre --> Film
$ zgrep "<film.film_genre.films_in_this_genre>" rdf-films.gz | wc -l
546698
# Generated language names from names freebase rdf data.
$ zcat langnames.gz | awk '{print $1}' | uniq | sort | uniq | wc -l
55
# Total number of countries.
$ zgrep "<film.film.country>" rdf-films.gz | awk '{print $3}' | uniq | sort | uniq | wc -l
304
This dataset contains information about ~480K actors, ~100K directors and ~240K films. Some examples of entries in the dataset are:
<m.0102j2vq> <film.actor.film> <m.011kyqsq> .
<m.0102xz6t> <film.performance.film> <m.0kv00q> .
<m.050llt> <type.object.name> "Aishwarya Rai Bachchan"@hr .
<m.0bxtg> <type.object.name> "Tom Hanks"@es .
All the testing was done on GCE instances. Each machine had 30GB of SSD and at least 7.5 GB of RAM. The number of cores varied depending on the experiments performed.
The tests were run in 1-minute intervals, during which all the parallel connections made requests to the database. This was repeated ten times, and the throughput, mean latency, 50th percentile latency, and 95th percentile latency were measured. Note that for user-facing systems, measuring percentile latency is better than mean latency, as the average can be skewed by outliers.
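As a rough sketch of this measurement loop (not the actual benchmark harness), the following Go program fires queries from a fixed number of parallel connections for one minute and then reports throughput and latency percentiles. The connection count and the `runQuery` stand-in are assumptions for illustration.

```go
// loadtest.go — sketch of the measurement loop: N parallel connections fire
// queries for one minute, then throughput and latency percentiles are computed.
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// runQuery is a stand-in for an actual query against a Dgraph node.
func runQuery() { time.Sleep(5 * time.Millisecond) }

func main() {
	connections := 32
	duration := time.Minute

	var mu sync.Mutex
	var latencies []time.Duration

	deadline := time.Now().Add(duration)
	var wg sync.WaitGroup
	for i := 0; i < connections; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				start := time.Now()
				runQuery()
				elapsed := time.Since(start)
				mu.Lock()
				latencies = append(latencies, elapsed)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	// Sort latencies and report percentiles along with overall throughput.
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	pct := func(p float64) time.Duration {
		return latencies[int(p*float64(len(latencies)-1))]
	}
	fmt.Printf("throughput: %.0f q/s\n", float64(len(latencies))/duration.Seconds())
	fmt.Printf("p50: %v  p95: %v\n", pct(0.50), pct(0.95))
}
```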
In a multi-node cluster setup, the queries were distributed among the nodes in a round-robin fashion. Note that in a multi-node cluster, no single machine contains all the data needed to answer these queries; the nodes still have to communicate with each other to respond.
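A minimal sketch of such round-robin distribution is shown below; the node addresses are hypothetical placeholders, not the actual cluster configuration.

```go
// roundrobin.go — sketch of distributing queries across cluster nodes in turn.
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical query endpoints of the nodes in the cluster.
var nodes = []string{
	"http://node-1:8080/query",
	"http://node-2:8080/query",
	"http://node-3:8080/query",
}

var next uint64

// pickNode returns the endpoint that should receive the next query,
// cycling through the nodes in round-robin order.
func pickNode() string {
	n := atomic.AddUint64(&next, 1)
	return nodes[n%uint64(len(nodes))]
}

func main() {
	for i := 0; i < 6; i++ {
		fmt.Println(pickNode())
	}
}
```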
The parameters that were varied were the number of nodes in the cluster, the number of cores per machine, and the load on the server (the number of parallel connections). This gave us an idea of what to expect from the system and would help in predicting the configuration required to handle a given load.
We broadly ran two categories of queries, templated below: one starting from an actor and fetching the films they starred in, and one starting from a director and fetching the genres of their films.
{
  me(_xid_: XID) {
    type.object.name.en
    film.actor.film {
      film.performance.film {
        type.object.name.en
      }
    }
  }
}
{
  me(_xid_: XID) {
    type.object.name.en
    film.director.film {
      film.film.genre {
        type.object.name.en
      }
    }
  }
}
During each iteration, either the actor or the director category was chosen at random. Then, within that category, an actor or a director was chosen at random, and their XID was filled into the corresponding query template.
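Putting the pieces together, here is a hedged sketch of how a single request could be constructed and sent: pick a category at random, pick a random XID, substitute it into the template, and POST the query to a node. The XID lists, node address, and port are placeholders for illustration, not the actual benchmark configuration.

```go
// querygen.go — sketch of building and sending one randomized query.
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"strings"
)

const actorTemplate = `{
  me(_xid_: %s) {
    type.object.name.en
    film.actor.film {
      film.performance.film {
        type.object.name.en
      }
    }
  }
}`

const directorTemplate = `{
  me(_xid_: %s) {
    type.object.name.en
    film.director.film {
      film.film.genre {
        type.object.name.en
      }
    }
  }
}`

// Placeholder XID lists; in the real benchmark these come from the dataset.
var actorXIDs = []string{"m.0bxtg", "m.050llt"}
var directorXIDs = []string{"m.0example"}

// randomQuery picks a category and an XID at random and fills the template.
func randomQuery() string {
	if rand.Intn(2) == 0 {
		return fmt.Sprintf(actorTemplate, actorXIDs[rand.Intn(len(actorXIDs))])
	}
	return fmt.Sprintf(directorTemplate, directorXIDs[rand.Intn(len(directorXIDs))])
}

func main() {
	q := randomQuery()
	// Placeholder endpoint for a Dgraph node's query port.
	resp, err := http.Post("http://node-1:8080/query", "text/plain", strings.NewReader(q))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```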
Let us look at some graphs obtained by varying the machine specs and the number of nodes in the cluster under different loads.
From the above experiments, we can see a relationship between throughput, latency, and the overall computational power of the cluster. The graphs show that throughput increases as the computational power increases, which can be achieved either by adding cores to each server or by adding nodes to the cluster.
The latency increases as the load on the database increases. However, the rate of increase differs based on how much computational power is available.
This experiment also shows that there is a limit to how much computational power a single node can have, and once we reach that limit, scaling horizontally is the right option. Not only that, it also shows that scaling horizontally improves performance. Hence, having more replicas and distributing the dataset optimally across machines are some of the factors that help improve throughput and reduce the latency that users face.
Based on this experiment, our recommendation for running Dgraph would be to use machines with more cores, and, once a single node's limit is reached, to scale horizontally by adding more nodes and replicas to the cluster.
These might seem pretty obvious recommendations for a distributed system, but this experiment proves that the underlying design of Dgraph is scalable.
Hope this helps you get a sense of what sort of performance you could expect out of Dgraph!
This post is derived from the report for my B.Tech project, “A Distributed Implementation of the Graph Database System, Dgraph”. The full report is available for download here.