Ever wondered how a knowledge graph with ten billion nodes is able to achieve a millisecond query? What is the best way to select the right graph database among multiple options that is suitable for the production environment requirement? How is KE Holdings’s 48 billion ordered triple datasets stored in the database? This article will explore and discover answers to these questions.
We will use this case study to explore the following five topics:
Let’s look at the first question: KE Holdings largest graph database is at 48 billion ordered triple datasets as of the moment. How should such a large amount of map data be stored and retrieved with concurrent queries within milliseconds? How would this bring growth and scalability for the company? Keep these questions in mind as we get started.
KE Holdings’s graph database contains a lot of information: property listings, clients, brokers, developers, communities, subways, hospitals, schools, supermarkets, cinemas, and so on.
Here’s some context for our example. Let’s say that the developer is XXX; the greening rate of the community should be greater than 30%, within 200 meters periphery should have a large supermarket, within 500 meters there is an access to subway, within a kilometer there is a top hospital, within two kilometers there is an senior high educational institution of which passing rate is more than 60%. The price point of the unit should be within eight million and is also the most frequently visited unit by the brokers.
This may be an ideal unit that clients want, but can anyone think of a software product that would be able to support the business’s needs? Let’s say, we are to use the traditional relational databases, such as MySQL or Oracle, would it be feasible? Does that go to say we need to join multiple tables at the same time, such as the unit listing table, client info table, broker table, developer table, etc. in order to get the desired result? This is obviously an unrealistic solution. How about using ES (Elasticsearch)? ES is a very popular choice in the search domain. Would it be able to satisfy our business’s needs? It would not be able to, because for ES to be able to search such a listing, it would need to have an extremely big listing dataset. How would it be able to simultaneously search for the constraint if the unit listing has a supermarket within 200 meters? Does this mean that we must create a variable called “supermarket relative distancing”? It is a clear indication that this option is not practical. Likewise, this is also true for HBase. Therefore, we can only use a Graph Database, like Neo4j, to be able to support our business applications.
Allow me to introduce Graph Database. What is a Graph Database?
The applications for Graph Database are wide. It is not only limited to Company Matrix Graph, Knowledge Graph, but it also includes the following:
This is the popularity trend chart of various types of databases on DB-Engines. It can be clearly seen that graph databases have become more and more popular in the past two years and have attracted a lot of attention from many users.
Among the graph database management systems, Neo4j ranks first. There are so many graph database types to choose from, such as open source, closed source, stand alone, distributed, and more.
The applications of graph databases are wide, from search, recommendation, relational graphs, knowledge graphs, etc. KE Holdings uses graph databases in multiple different instances, but the platform selection process varies from department to department. Some use JanusGraph, while others use Neo4j, and each department that needs it has to build a set from scratch. So, we are wondering, is there a universal graph database that can support all scenarios that need to use graph databases? For the data analysts, this means an application that would allow them to focus on the high-level strategic planning and decision policies, rather than focusing on the programming per se. The answer to this is, we need a universal graph database. With multiple graph databases out there, how do we decide which one to use? This brings us to our second topic: graph database selection process.
As we proceed with the selection of a graph database, which indicators should we take note of when making the decision? We mainly focus on the following aspects: open source, platform maturity, scalability, library abundance, stability, performance, maintenance cost, and ease of use. The maintenance cost component is frequently overlooked; it is one of prime considerations we should keep in mind to know if it would be a worthwhile investment. We should not forget to take this as one of the prime considerations to know is if it will be worth it for the investment.
While there are many graph databases to choose from, such as Neo4j, OrientDB and ArangoDB, JanusGraph and Dgraph are the only open source ones. In respect to Neo4j, OrientDB and ArangoDB are mature graph databases that were developed in 2010 and 2012, respectively. JanusGraph and Dgraph are the more recent graph databases that launched in 2017 and 2016, respectively.
What are the main differences between these graph databases? Let’s dive into a brief comparison and analysis of each.
Neo4j has a long history and has been a leading graph database, so why should we not consider it? The reason is pretty simple; its open source community edition only supports stand-alone databases and not distributed databases.
OrientDB and ArangoDB started relatively early. From their earlier stages, likewise, they were only able to support stand-alone databases. However, as the amount of user data continued to increase, the later versions added distributed model databases. Maybe because the distributed model database was an added function to an existing program, the support and synergy are not ideal.
Because JanusGraph and Dgraph were developed relatively later and they considered distributed databases and scalability from the beginning of the design, their support for distributed databases is very good. They are also both designed specifically for graph databases. Each however has a great distinction. The storage for JanusGraph relies heavily on other systems, which leads to the aforementioned maintenance cost problem, while Dgraph has a self-serving storage system. For example, JanusGraph mostly uses HBase as the underlying storage system, and HBase relies on Zookeeper and HDFS. Simultaneously, the indices for JanusGraph also rely heavily on ES. Therefore, if you want to build a complete JanusGraph, you need to build and maintain several systems at the same time, which has a high implied maintenance cost. Since Dgraph is supported natively, the cost is relatively much more affordable.
Let’s compare the two graph databases’ architecture in more detail.
As previously mentioned, the storage system of JanusGraph relies heavily on external storage systems like Cassandra, HBase, BerkeleyDB and so on, while its search engine heavily relies on Elasticsearch, Slr, Lucene, etc. Because of these reasons, JanusGraph is very well integrated with the big data ecosystem and can be well combined with Spark to do large graph computations. The only trade-off is its high maintenance cost, due to its dependency on many external systems. When building a JanusGraph system, several sets of dependent systems need to be built simultaneously; another point of consideration is its stability. According to past experience, the more dependence a system has, the greater the risk of lower controllability.
The Dgraph Database architecture is straightforward. All of its functions are supported natively, so Dgraph does not depend on any third-party systems. (View diagram bottom-up)
Dgraph has only one executable file. By specifying the parameters on different local machines, it can automatically form clusters. This is an advantage of Dgraph, as it does not depend on third-party systems. Can we just choose Dgraph due to it being easy to use and maintain? The answer is no, because we need to do performance comparisons as well. For example, if the performance of JanusGraph is way better than Dgraph, then its higher maintenance cost is more reasonable. So for this purpose, we did a detailed performance comparison test between these two graph databases.
The performance comparison test for a 30GB dataset was carried out under the environment of three machines with 48 cores, 128GB memory, and SATA hard drive, with 48 million data nodes, 63 million data edges, and 450 million triple datasets.
Given this query performance test, we can see Dgraph’s superiority over JanusGraph.
Summary of the differences between JanusGraph and Dgraph:
Based on the comparison done above, we chose Dgraph to build our graph database.
The selection of a graph database marks the start of the graph database modeling.
As mentioned earlier, building a Dgraph cluster is actually quite simple. We use Docker and Kubernetes to implement a unified containerized deployment and management system for Dgraph. As shown in the figure above, three servers are used, and four containers are started on each server. Three of these containers are Alpha nodes, which are used for storing data, indexing, and querying; the Zero nodes is used as the controller. One thing to note is, the three Alphas of each group are used to store three copies of the same data; different Alphas of the same group cannot be on the same machine. Based on the number of copies, Zero determines which Alpha goes to which group. For example, when Dgraph Zero has replicas set to 3, then the system will then initialize the Alpha group accordingly across three servers. Alpha 1 gets allocated to Server 1; Alpha 2 gets allocated to Server 2, and Alpha 3 naturally gets allocated to Server 3. The first initialization (Alpha 1, 2, and 3) forms the first group. Followed by the natural order, Alpha 4, 5, 6, forms the second group, and Alpha 7, 8, 9 forms the third group. Note that Alpha 1, 2, and 3 should not be in the same group as such a group would not be guaranteed highly available, because if local machine server 1 were to fail, then all three replicas of the data would be lost.
The left side of the panel is the control center of Dgraph. Upon the activation of Zero, the replica parameter needs to be provided. When Alpha is activated as well, specify the address of Zero. Now that you have a whole Dgraph cluster, you can see that the investment cost is indeed low.
After the implementation of the clusters, the next focus is the data stream. When creating a graph database, there are multiple data writing modes to consider, such as real-time, batch, and initialized data stream.
Real-Time Data Stream: There is a Data-Accepter piece, which users may use to update the data in real-time. Then, Kafka is used to employ an asynchronous peak cut and write into its queue. Lastly, Graph-Import is used to transfer data from Kafka into the Dgraph cluster.
Batch Data Stream: For example, if there is a massive batch data update, most of the data for the KE Holdings graph database that are either stored in Hive or HDFS would have a Hive2Kafka spark task. The task is to export the data from Hive or HDFS to collect all the data. In the same manner of Kafka data importing, the new data would be imported through Graph-Import of Kafka, then written to the Dgraph cluster.
Initialized Data Stream: Another difference of Dgraph is that its support of the initial data import is extremely quick. For a 48 billion dataset to be imported for the first time, written line by line, would take a long time. Hence, use Dgraph’s initialized data stream and Bulk Loader, and pre-generate its data and index files through MapReduce. Then, Dgraph’s Alpha node is activated to load those files quickly, which we will go into more detail later. We stand behind this type of data stream because of its efficiency to load the data in just a single click. The Kubernetes are then called to activate a Dgraph cluster, followed by loading the initialized data and finally returning an API to the Dgraph user.
After the data finishes loading to the server, the next step is to query the data. The image above shows Dgraph’s Ratel visualization interface. After entering a query in the console on the left, the corresponding visualization appears on the right. In this example, we are querying the name “Xiu Yuan” for all kindergartens with a greening rate of at least 30% and are within a kilometer range. As you can see in the figure above, it took Dgraph only 24 milliseconds to return the query, and the total round-trip time was 91 milliseconds, which is extremely fast.
As you can see above, the Dgraph query statement is not that simple; it requires at least a fundamental understanding on the subject matter. As previously mentioned, it is important to consider ease of use and simplifying the learning curve for the user as much as possible. Thus, we need to consider if there is simpler query syntax.
Let’s take a look at how this query is written in Gremlin. As shown in the figure, multiple has and select attributes are used to query the same result, but it is also not simple.
Considering SQL is the language most programmers are the most familiar with, even non-programmers and most data analysts know how to use SQL, can SQL be used to query graph databases? We then designed a language that uses SQL to query graph databases, called Graph SQL. As shown in the first line of the Graph SQL query statement, it selects the name of the community, greening rate, and the name of the kindergarten from the mapping relationship between the community and kindergarten, where the name of the community is “Xiu Yuan”, the greening rate is greater than 30%, and the distance to the kindergarten is less than 1000 meters. The difference in the SQL from statement is that the relationship between the community and kindergarten is not from a table, but rather it is a mapping relationship.
The figure above shows a complete Graph SQL query statement, including some noteworthy graph database keywords. Here are a couple of key terms: shortest path – searching for the shortest distance from a to b and degree – searching on the first degree, second degree, third degree, and so on. Graph SQL also supports node query, link query, attribute query, etc. The last part of the query statement includes GROUP BY, HAVING, ORDER BY, LIMIT, etc. To be more specific, LIMIT supports node LIMIT and link LIMIT, which only supports the more basic commands as of the moment. The more complex queries are to be improved down the line. The bottom line is, Graph SQL would help effectively lower the learning curve and operating cost for the implementation of graph databases.
Lastly, the figure above is the result of the previous query statement that was run. As you can see, sending a simple HTTP request with an SQL query statement returns the graph database result quickly.
Summarizing the previously created Dgraph cluster, the overall graph database architecture consists of data writing, data query, and unification of the search query infrastructure.
There is a unified data flow and query module within the gateway:
The completion of building the graph database infrastructure is just the first step to having production-ready infrastructure. Steps 1 to N then requires the assurance of infrastructure stability and increased usability, performance and user experience. This second part requires one to do further learning and have an in-depth understanding of the implementation principles, Dgraph optimization, as well as its advantages and disadvantages.
The storage engine, Badger, is developed by Dgraph in Golang. The first storage engine Dgraph employs is RocksDB, but through some adjustments made in Go, data overflow problems occurred. The Dgraph team then decided to use Golang to implement an efficient and persistent LSM-based key-value database, which is 3.5 times faster than RocksDB.
(Predicate, Subject) --> [sorted list of ValueId], Key is a combination of a predicate and subject, and Value is a list of the dataID.
For an example:
(friend, me) --> [person1, person2, person3, person4, person5]
The keys from this example are friend and me. The mapping relationship is friend, while me is the subject, which forms a key. Value on the other hand, is an ordered list. All the “friend” of “me” form an ordered array of IDs that go from person1, person2, until person5.
Based on the design of the infrastructure, all data stored under the same predicate in Dgraph is stored in the same data node or even on the same data block. This is done so that when data is queried from a predicate, it will only run the RPC once to get all predicate data. This is a great performance upgrade for first degree, second degree, and multiple degree mapping search queries – the core advantage of this infrastructure design.
According to predicate sharding, the data in the same predicate stored in the same node decreases RPC and improves query performance. Different predicates may be stored on different nodes.
Regular data balance (rebalance_interval), the zero node will periodically check if the data nodes are balanced. If certain data nodes are too large or too small, the search query performance would drop. Hence, the role of zero node is to ensure data balance for each node.
The Raft group is determined according to the startup sequence of replicas and Alpha. Because Dgraph's replica consistency relies on the Raft protocol, it is necessary for there to be at least three nodes to ensure strong data consistency.
Each group has at least three Alphas, which are replicas of each other. Raft ensures data consistency, so if one of the alpha nodes fails, data can be restored through the other two replicas.
Write-ahead logs are a WAL mechanism in distributed servers used to improve write performance. It must first be written in the cache before flushing to disk. Because it is not being written directly to the disk, its efficiency is extremely low. If the data is written first to the memory before the disk, it would impose a potential problem; if the local machine hangs before the saving is complete, the data would not be saved to the disk, and a portion of the data would be lost. Therefore, most distributed systems like HBase, Elasticsearch, and Dgraph write the write-ahead logs in advance before saving the data to the memory. The logs will be flushed to the disk in real-time, and then the data will be written to the memory. So if any instances of data loss happen, it can use the write-ahead logs to retrieve the data, which ensures the performance and availability is high.
Our research on Dgraph is just a span of a few months, so we have only made some small optimizations so far: How can the 48 billion ordered triple datasets graph database be uploaded as efficiently as possible to the cluster?
Java was used first to write, and it was found to be extremely slow; it would take a full one week to write into the database.
Then we used Dgraph’s Bulk Loader to load data, first generating the index data, then loading it through the alpha node, and finally starting the cluster to provide services. This only requires 48 hours to finish loading. While the speed has improved, it still took quite some time. Can this be optimized even further to improve speed?
Hence, we researched on the Bulk Loader source code. We found that when the Map Reduce process is done on a single machine, we just need to assign a unique UID for the local machine to ensure that the UID would satisfy two things – UID uniqueness and ordered list upon execution. Use single machine multithreading to start multiple Map and Reduce threads. Next, generate shard files for each thread, followed by loading data through Dgraph’s Alpha. Based on our understanding of the source code, we may be able to optimize it even further. Dgraph was designed as a distributed system, and all kinds of mutations can be expanded linearly, but it cannot be said that the initial batch import can be a stand-alone module.
So, we then optimized the source code to an extent where we changed the single-machine multi-threading model to a multi-machine multi-threading model. For the partitioning process, a unique UID is assigned to each data row, which is still executed on a single machine. The data is divided into a data block, which is then distributed to different machines, where each machine activates a map-reduce process to generate data files that are loaded by alphas as each alpha is started, until the entire cluster successfully comes online. With this kind of setup, the 48 billion ordered triple dataset import time decreased from 48 hours to 15 hours, improving the lead time by three times.
After importing the 48 billion ordered triple datasets into the server, we can now address the initial question we posted: Can it truly support millisecond-level queries of tens of billions of graph data? Therefore, we conducted a performance stress test. As shown in the graph above, we can truly see that Dgraph’s performance is indeed good. The x-axis is the concurrent threads; the left y-axis is client and server response time, and the right y-axis is QPS output. When we performed a stress test on a thousand concurrent queries, Dgraph was still able to maintain a response time of 50 ms, while simultaneously achieving 15000/s QPS; the performance is indeed good.
If the performance, usability, and maintenance cost of Dgraph are that good, can we then say that Dgraph is a perfect graph database? Can we use Dgraph to support all kinds of applications? Obviously not, as there is no perfect system, but rather a system that is most suitable for a business’s needs. In the same token, there is no perfect human being, just the most suitable one for you.
Does not support multigraph
For any pair of vertices, only one edge with the same label type is allowed. For JanusGraph, multiple edges are allowed after two vertices are determined. For example, in Dgraph, if you and I are classmates, there can only be one edge called classmate relationship. But in JanusGraph, you and I can be grade school classmates, high school classmates, and college classmates, as there are three classmate relationship edges.
One cluster only supports one graph
A Dgraph cluster may only support one graph. Supporting multiple graphs is currently under development and will be supported in the future. The absence of this feature does not have a huge impact on KE Holdings. For example, the graph of 48 billion ordered triple datasets itself requires a separate cluster and will not be shared with other graphs. This is not too big of a problem at present. Naturally, we hope that there can be official support for this feature as soon as possible.
Big Data Incompatibility
JanusGraph is compatible with the Big Data ecosystem because JanusGraph uses HBase storage. Dgraph on the other hand is not compatible because it is developed in Golang, and when Spark is used to write to it concurrently, there will be an overload state.
Dgraph started in 2016, so it is not a very mature platform and has many minor issues. However, the developers are quick to release updates, fixing a lot of issues quickly.
To summarize, there is no perfect system, there is only the most suitable system. When we make a technical selection, we mainly see whether its advantages are what we need and whether its defects are acceptable to us. Ultimately, we chose Dgraph as our premier graph database, rather than building our own graph database platform based on it, that can be used universally across all departments to meet their respective needs.
Finally, we will talk briefly about future plans. I am mainly responsible for the implementation of the overall search engine platform for KE Holdings. The Dgraph implementation is just a portion of the whole search engine system. At present, we already have a text retrieval engine based on Elasticsearch and a graph data retrieval engine based on Dgraph, and there will be a vector retrieval engine based on Faiss in the future.
The Cloud Search Platform is a business access platform that integrates all the underlying layers: effects platform, algorithm platform, the three retrieval engines (ES, Dgraph, Faiss), and the container platform. At the same time, it integrates unified service management capabilities to form a search platform. In the future, the business side does not need to care about the underlying database storage, writing and querying any further. The search platform will unify the relevant capabilities, and then provide a unified entrance and exit, while ensuring the overall performance and stability, so as to quickly empower the business. This will allow the business side to focus more on the high-level strategic planning and business logic.
This is the overall action plan for the graph database:
That is all I have to share for today, thank you everyone.