Powering 48 Billion Triples In Production with Dgraph at KE Holdings
This is the English translation of the original KE Holdings Case Study. KE Holdings Inc. is China’s leading integrated online and offline platform for housing transactions and services.
With ten billion nodes, KE Holding needed to select the right graph database that could achieve millisecond query response. This is how they did it.
This case study answers three key questions.
- How do you choose the best graph database (one that is ideal for the unique production environment’s needs) from multiple options?
- How can a knowledge graph with ten billion nodes achieve a millisecond query?
- How are KE Holdings’s 48 billion ordered triple datasets stored in the database?
To answer these questions, the case study is divided as follows:
- Why KE Holdings Needed a Graph Database
- Choosing the Right Graph Database
- How to Deploy the Graph Database
- Principles, Optimizations, and Trade-offs
- Future Plans
1 KE Holdings’ journey with graph databases: how to store 48 billion ordered triple datasets?
KE Holdings had a challenge: its largest property knowledge graph was at 48 billion ordered triple datasets at the time of writing. How should such a large amount of map data be stored and retrieved with concurrent queries within milliseconds, to support the exponential growth of the business? How would this enable growth and scalability for the company?
Why does KE Holdings need a graph database?
KE Holdings’s knowledge graph contains a lot of information: property listings, clients, brokers, developers, communities, subways, hospitals, schools, supermarkets, cinemas, and so on.
Even a standard search query can be complex. For example, someone might search for a property with the following features:
- The developer is XXX
- The greening rate of the community should be greater than 30%
- Should have a large supermarket within 200 meters
- Access to a subway within 500 meters
- A top hospital within 1 kilometer
- A top high school with a passing rate higher than 60% within 2 kilometers
- The price point of the unit should be within 8 million and is also the most frequently visited unit by the brokers.
This may be an ideal unit that clients want, but can anyone think of a software product that would be able to support such a search query? Using traditional relational databases, such as MySQL or Oracle, wouldn’t be feasible – the KE Holdings team would need to join multiple tables at the same time, such as the unit listing table, client info table, broker table, developer table, etc. in order to get the desired result. This is clearly an unrealistic solution.
How about using Elasticsearch (ES)? ES is a very popular choice in the search domain. But it would not satisfy the business’ needs because for ES to be able to search such a listing, it would need to have an extremely big listing dataset. How would it be able to simultaneously search for the constraint if the unit listing has a supermarket within 200 meters? Does this mean that we must create a variable called “supermarket relative distance”? It is a clear indication that this option is not practical. Likewise, this is also true for HBase. Therefore, graph databases are necessary for KE Holdings.
What is a Graph Database?
- It is not used to store images (Editor’s note: in Chinese the word for graph and photograph is the same.)
- Rather, it is used to store nodes and mapping relationships
The applications for graph databases are vast. It is not limited to only real-estate industry knowledge graphs or other knowledge graphs. Other industries that can benefit from graph databases include:
- Social Media, Computing Network, Road Network, Telecoms Network
- Related Queries, Search Recommendations
- Risk Predictions, Risk Control Management
- Business Process, Logistics Supply Chain
- Organizational or Market Hierarchical Structure
- Event-Driven Causal Clusters and Relationships
Among the graph database management systems, Neo4j ranks first. There are so many graph database types to choose from, such as open source, closed source, stand alone, distributed, and more.
The Graph Database Selection Process
The applications of graph databases are vast, from search, recommendation, relational graphs, knowledge graphs, etc. KE Holdings uses graph databases in multiple different instances, but the platform selection process varies from department to department. Some use JanusGraph, while others use Neo4j, and each department that needs it has to build the dataset from scratch.
KE Holdings needed a universal graph database that could support all of these scenarios. This would enable departments that use knowledge graphs and relationship graphs to focus on the high-level strategic planning and decision policies, rather than focusing on the underlying technical problems, such as storage, distribution, high performance, high availability, etc. But which graph database?
How to select the right graph database?
When it came to the selection process, there were multiple criteria and elements to consider. The KE Holdings team mainly focused on: open source, platform maturity, scalability, library abundance, stability, performance, maintenance cost, and ease of use. The maintenance cost aspect is frequently overlooked during the selection process. But it was one of the prime considerations.
While there are many graph databases to choose from, such as Neo4j, OrientDB and ArangoDB, JanusGraph and Dgraph are the only open source ones. In respect to Neo4j, OrientDB and ArangoDB are mature graph databases that were developed in 2010 and 2012, respectively. JanusGraph and Dgraph are the more recent graph databases that launched in 2017 and 2016, respectively.
Comparison of the Main Graph Databases
Neo4j has a long history and has been a leading graph database. So why didn’t KE Holdings choose it? Neo4j’s open source edition only supports stand-alone databases but it’s not distributed.
OrientDB and ArangoDB have also been around for years. From their earlier stages, likewise, they were only able to support stand-alone databases. However, as their users continued to increase, the later versions added the distributed model. Maybe because the distributed model database was an added function to the existing program, the support and integration were not ideal.
The choice was then between JanusGraph and Dgraph. Both were started relatively recently and both considered distributed databases and scalability from the beginning of the design. Their support for distributed databases is very good. They are also both designed as native graph databases.
However, there is an important distinction between JanusGraph and Dgraph. The storage for JanusGraph relies heavily on other systems, which leads to the aforementioned maintenance cost problem, while Dgraph has an embedded storage system, BadgerDB.
For example, JanusGraph mostly uses HBase as the underlying storage system, and HBase relies on Zookeeper and HDFS. Simultaneously, the indices for JanusGraph also rely heavily on ES. Therefore, if you want to build a complete JanusGraph, you need to build and maintain several systems at the same time, which has a high implied maintenance cost. But Dgraph has a native storage system, the maintenance cost is much more affordable since you only need to support one system.
JanusGraph Database Architecture
As previously mentioned, the storage system of JanusGraph relies heavily on external storage systems like Cassandra, HBase, BerkeleyDB and so on, while its search engine heavily relies on ES, Slr, Lucene, etc. As a result, JanusGraph is very well integrated with the big data ecosystem and can be well combined with Spark to do large graph computations.
The only trade-off is its high maintenance cost, due to its dependency on many external systems. When building a JanusGraph system, several sets of dependent systems need to be built simultaneously; another point of consideration is its stability. According to past experience, the more dependence a system has, the greater the risk of lower controllability.
Dgraph Database Architecture
The Dgraph Database architecture is straightforward. All of its functions are supported natively, so Dgraph does not depend on any third-party systems.
- Zero: It controls the Dgraph cluster, assigns servers to a raft group, then rebalances the data between server groups. Lastly, it will elect a leader using Raft, similar to NameNode in Hadoop or the master node in ElasticSearch.
- Alpha: It stores data, processes queries, host predicates and indexes; essentially this is the data node.
- Group: Multiple alphas form a Raft group, and the data is stored in different Raft groups in shards. The data stored in each Raft group is guaranteed to be strongly consistent through the Raft protocol.
- Ratel: A user interface which users can run queries and update or modify schemas
- At the same time, Dgraph supports gRPC and HTTP to connect to alpha for mutations and queries.
Dgraph has only one executable file. By specifying the parameters on different local machines, it can automatically form clusters. This is an advantage of Dgraph, as it does not depend on third-party systems.
But KE Holdings couldn’t choose Dgraph purely because it’s easy to use and maintain. The team needed to do performance comparisons as well. For example, if the performance of JanusGraph was way better than Dgraph, then its higher maintenance cost would be more reasonable.
Performance Comparison Between JanusGraph and Dgraph
The performance comparison test for a 30GB dataset was carried out under the environment of three machines with 48 cores, 128GB memory, and SATA hard drive, with 48 million data nodes, 63 million data edges, and 450 million triple datasets.
- The write performance test was divided into real-time and pre-defined writes. For the nodal write performance, JanusGraph reached 15000/s, while Dgraph reached 35000/s. As for the edge write performance, JanusGraph reached 9000/s, while Dgraph reached 10000/s.
- The difference in performance was much more apparent in the query performance test. We tested the following frequent applications of graph databases: node attributes, first-degree, second-degree, and third-degree mapping relationships, including shortest path, etc. As you can see from the table, the performance difference in simple queries such as node, first-degree, and second-degree attributes are insignificant. However, as queries become more complex, JanusGraph returns queries much slower. It takes JanusGraph 700 milliseconds to query third-degree node vertices and attributes, whereas it only takes a few milliseconds for Dgraph to do the same.
Given this query performance test, Dgraph’s superiority over JanusGraph is clear.
JanusGraph VS Dgraph
Summary of the differences between JanusGraph and Dgraph:
- Infrastructure: Dgraph is distributed, while JanusGraph is built on other distributed databases.
- Data Storage: Dgraph is strong and persistent, while JanusGraph relies on underlying storage DB.
- Data Rebalancing: Dgraph supports automatic rebalancing, while JanusGraph once again relies on the underlying storage DB.
- Language: JanusGraph uses the more commonly used Gremlin, while Dgraph uses DQL (the improved version of GraphQL).
- Full-text search, regular expression search, geographical location search: Dgraph is supported natively, while JanusGraph relies on external systems.
- Visualization: Dgraph has its own built-in dashboard, whereas JanusGraph relies on external systems.
- Maintenance Cost: Since Dgraph does not rely on other systems, its maintenance cost is much lower than that of JanusGraph.
- Write Performance: Dgraph is higher by a margin.
- Query Performance: After the intensive query test, Dgraph is way better than JanusGraph for complex queries.
Based on the comparison done above, we chose Dgraph to build our graph database.
Graph Database Architectural Implementation
With a graph database selected, the database modeling can begin.
Cluster Modeling Set-up
Building a Dgraph cluster is quite simple. The team uses Docker and Kubernetes to implement a unified containerized deployment and management system for Dgraph. As shown in the figure above, three servers are used, and four containers are started on each server. Three of these containers are Alpha nodes, which are used for storing data, indexing, and querying; the Zero nodes is used as the controller. One thing to note is, the three Alphas of each group are used to store three copies of the same data; different Alphas of the same group cannot be on the same machine.
Based on the number of copies, Zero determines which Alpha goes to which group. For example, when Dgraph Zero has replicas set to 3, then the system will then initialize the Alpha group accordingly across three servers. Alpha 1 gets allocated to Server 1; Alpha 2 gets allocated to Server 2, and Alpha 3 naturally gets allocated to Server 3. The first initialization (Alpha 1, 2, and 3) forms the first group. Followed by the natural order, Alpha 4, 5, 6, forms the second group, and Alpha 7, 8, 9 forms the third group. Note that Alpha 1, 2, and 3 should not be in the same group as such a group would not be guaranteed highly available, because if local machine server 1 were to fail, then all three replicas of the data would be lost.
The left side of the panel is the control center of Dgraph. Upon the activation of Zero, the replica parameter needs to be provided. When Alpha is activated as well, specify the address of Zero. Now that you have a whole Dgraph cluster, you can see that the investment cost is indeed low.
After the implementation of the clusters, the next focus was the data stream. When creating a graph database, there are multiple data writing modes to consider, such as real-time, batch, and initialized data stream.
Real-Time Data Stream: There is a Data-Accepter piece, which users may use to update the data in real-time. Then, Kafka is used to employ an asynchronous peak cut and write into its queue. Lastly, Graph-Import is used to transfer data from Kafka into the Dgraph cluster.
Batch Data Stream: For example, if there is a massive batch data update, most of the data for the KE Holdings graph database that are either stored in Hive or HDFS would have a Hive2Kafka spark task. The task is to export the data from Hive or HDFS to collect all the data. In the same manner of Kafka data importing, the new data would be imported through Graph-Import of Kafka, then written to the Dgraph cluster.
Initialized Data Stream: Another difference of Dgraph is that its support of the initial data import is extremely quick. For a 48 billion dataset to be imported for the first time, written line by line, would take a long time. Hence, use Dgraph’s initialized data stream and Bulk Loader, and pre-generate its data and index files through MapReduce. Then, Dgraph’s Alpha node is activated to load those files quickly, which we will go into more detail later. We stand behind this type of data stream because of its efficiency to load the data in just a single click. The Kubernetes are then called to activate a Dgraph cluster, followed by loading the initialized data and finally returning an API to the Dgraph user.
After the data finishes loading to the server, the next step is to query the data. The image above shows Dgraph’s Ratel visualization interface. After entering a query in the console on the left, the corresponding visualization appears on the right.
As you can see above, the Dgraph query statement is not that simple; it requires at least a fundamental understanding of the subject matter. That’s why it was important to consider ease of use and simplify the learning curve for the user as much as possible. Thus, KE Holdings needed to consider if there is a simpler query syntax.
Look at how this query is written in Gremlin. As shown in the figure, multiple has and select attributes are used to query the same result, but it is also not simple.
Considering SQL is the language most programmers are the most familiar with – even non-programmers and most data analysts know how to use SQL – can SQL be used to query graph databases? The team designed a language that uses SQL to query graph databases, called Graph SQL.
As shown in the first line of the Graph SQL query statement, it selects the name of the community, greening rate, and the name of the kindergarten from the mapping relationship between the community and kindergarten, where the name of the community is “Xiu Yuan”, the greening rate is greater than 30%, and the distance to the kindergarten is less than 1000 meters. The difference in the SQL from statement is that the relationship between the community and kindergarten is not from a table, but rather it is a mapping relationship.
The figure above shows a complete Graph SQL query statement, including some noteworthy graph database keywords. Here are a couple of key terms: shortest path – searching for the shortest distance from a to b and degree – searching on the first degree, second degree, third degree, and so on.
Graph SQL also supports node query, link query, attribute query, etc. The last part of the query statement includes GROUP BY, HAVING, ORDER BY, LIMIT, etc. To be more specific, LIMIT supports node LIMIT and link LIMIT, which only supports the more basic commands as of the moment. The more complex queries are to be improved down the line.
The bottom line is, Graph SQL would help effectively lower the learning curve and operating cost for the implementation of graph databases.
Finally, the figure above is the result of the previous query statement that was run. Sending a simple HTTP request with an SQL query statement returns the graph database result quickly.
- The data flow module includes data source, data reception, incremental, full volume, Kafka, and data import.
- The query module supports Graph SQL; if there are any queries that cannot be run by Graph SQL, then DQL would be the alternative to run more complex queries. Afterwards, the underlying Dgraph cluster is connected through Graph-Client to query the results.
- The entire Dgraph Cluster is deployed from local physical machines through Docker and Kubernetes containerization technology.
- The right panel in the figure above shows the reuse of the overall governance service capabilities of KE Holdings’s query platform. All microservices are scheduled, managed, and monitored through the registry, centralized config server, load balancer, message bus, fuse degradation, link tracer, and monitoring alarms.
Principles, Optimizations, and Trade-offs
Building the graph database infrastructure is just the first step to having a production-ready infrastructure. Steps 1 to N then require the assurance of infrastructure stability and increased usability, performance, and user experience. This second part requires one to do further learning and have an in-depth understanding of the implementation principles, Dgraph optimization, as well as its advantages and disadvantages.
Brief introduction of the Dgraph principles:
1. Storage Engine
The storage engine, Badger, is developed by Dgraph in Golang. The first storage engine Dgraph employs is RocksDB, but through some adjustments made in Go, data overflow problems occurred. The Dgraph team then decided to use Golang to implement an efficient and persistent LSM-based key-value database, which is 3.5 times faster than RocksDB.
2. Storage Infrastructure (As storage engine is KV, storage system is also KV)
(Predicate, Subject) –> [sorted list of ValueId], Key is a combination of a predicate and subject, and Value is a list of the dataID.For an example:
(friend, me) –> [person1, person2, person3, person4, person5]
The keys from this example are friend and me. The mapping relationship is friend, while me is the subject, which forms a key. Value on the other hand, is an ordered list. All the “friend” of “me” form an ordered array of IDs that go from person1, person2, until person5.Based on the design of the infrastructure, all data stored under the same predicate in Dgraph is stored in the same data node or even on the same data block. This is done so that when data is queried from a predicate, it will only run the RPC once to get all predicate data. This is a great performance upgrade for first degree, second degree, and multiple degree mapping search queries – the core advantage of this infrastructure design.
3. Data Sharding (For a distributed system to scale, it needs to support data sharding)
According to predicate sharding, the data in the same predicate stored in the same node decreases RPC and improves query performance. Different predicates may be stored on different nodes.Regular data balance (rebalance_interval), the zero node will periodically check if the data nodes are balanced. If certain data nodes are too large or too small, the search query performance would drop. Hence, the role of zero node is to ensure data balance for each node.The Raft group is determined according to the startup sequence of replicas and Alpha. Because Dgraph’s replica consistency relies on the Raft protocol, it is necessary for there to be at least three nodes to ensure strong data consistency.
4. High Availability
Each group has at least three Alphas, which are replicas of each other. Raft ensures data consistency, so if one of the alpha nodes fails, data can be restored through the other two replicas.Write-ahead logs are a WAL mechanism in distributed servers used to improve write performance. It must first be written in the cache before flushing to disk. Because it is not being written directly to the disk, its efficiency is extremely low. If the data is written first to the memory before the disk, it would impose a potential problem; if the local machine hangs before the saving is complete, the data would not be saved to the disk, and a portion of the data would be lost.Therefore, most distributed systems like HBase, ES, and Dgraph write the write-ahead logs in advance before saving the data to the memory. The logs will be flushed to the disk in real-time, and then the data will be written to the memory. So if any instances of data loss happen, it can use the write-ahead logs to retrieve the data, which ensures the performance and availability is high.
Bulk Loader Optimization
The KE Holdings team has only been researching Dgraph for a few months, so the team has only made some incremental optimizations thus far. The main focus has been how can the 48 billion ordered triple datasets graph database be uploaded as efficiently as possible to the cluster?
Java was used first to write, and it was found to be extremely slow; it would take a full week to write into the database.
Then the team used Dgraph’s Bulk Loader to load data, first generating the index data, then loading it through the alpha node, and finally starting the cluster to provide services. This only required 48 hours to finish loading. While the speed improved, it still took quite some time. Could this be optimized even further to improve speed?
That’s when the team researched the Bulk Loader source code. When the Map Reduce process was done on a single machine, the team just needed to assign a unique UID for the local machine to ensure that the UID would satisfy two things – UID uniqueness and ordered list upon execution. They used single machine multithreading to start multiple Map and Reduce threads. Next, they generated shard files for each thread, followed by loading data through Dgraph’s Alpha.
KE Holdings may be able to optimize it even further. Dgraph was designed as a distributed system, and all kinds of mutations can be expanded linearly, but it cannot be said that the initial batch import can be a stand-alone module.
The team then optimized the source code to an extent where they changed the single-machine multi-threading model to a multi-machine multi-threading model. For the partitioning process, a unique UID was assigned to each data row, which was still executed on a single machine. The data was divided into a data block, which was then distributed to different machines, where each machine activated a map-reduce process to generate data files that were loaded by alphas as each alpha was started, until the entire cluster successfully came online. With this kind of setup, the 48 billion ordered triple dataset import time decreased from 48 hours to 15 hours, cutting the lead time to a third.
Performance Stress Test
After importing the 48 billion ordered triple datasets into the server, KE Holdings could now address the crucial question: could it truly support millisecond-level queries of tens of billions of graph data?
The team conducted a performance stress test. As shown in the graph above, you can see that Dgraph’s performance is indeed good. The x-axis is the concurrent threads; the left y-axis is client and server response time, and the right y-axis is QPS output.
Limitations of Dgraph
If the performance, usability, and maintenance cost of Dgraph are that good, is Dgraph a perfect graph database? Does Dgraph support all applications? Obviously not, as there is no perfect system, but rather a system that is most suitable for a business’ needs. By the same token, there is no perfect human being, just the most suitable one for you.
Dgraph also has some limitations and inefficiencies:
1. Does not support multigraph
For any pair of vertices, only one edge with the same label type is allowed. For JanusGraph, multiple edges are allowed after two vertices are determined. For example, in Dgraph, if you and I are classmates, there can only be one edge called< classmate relationship. But in JanusGraph, you and I can be grade school classmates, high school classmates, and college classmates, as there are three classmate relationship edges.
2. One cluster only supports one graph
A Dgraph cluster may only support one graph. Supporting multiple graphs is currently under development and will be supported in the future. The absence of this feature does not have a huge impact on KE Holdings. For example, the graph of 48 billion ordered triple datasets itself requires a separate cluster and will not be shared with other graphs. This is not too big of a problem at present. Naturally, we hope that there can be official support for this feature as soon as possible.
3. Big Data Incompatibility
JanusGraph is compatible with the Big Data ecosystem because JanusGraph uses HBase storage. Dgraph on the other hand is not compatible because it is developed in Golang, and when Spark is used to write to it concurrently, there will be an overload state.
4. Infrastructure Immaturity
Dgraph started in 2016, so it is not a mature platform and has some minor issues. However, the developers are continually releasing updates, fixing a lot of issues quickly.To summarize, there is no perfect system, there is only the most suitable system. When the KE Holdings team made a technical selection, the goal was to find the required benefits without any deal-breaking defects. Ultimately, the team chose Dgraph as its premier graph database, rather than building their own graph database platform based on it. Dgraph can be used universally across all departments to meet their respective needs.
Future Action Plans
Pan Gao is primarily responsible for the implementation of the overall search engine platform for KE Holdings. The Dgraph implementation was just a portion of the whole search engine system. At present, they already have a text retrieval engine based on ES and a graph data retrieval engine based on Dgraph, and there will be a vector retrieval engine based on Faiss in the future.
The Cloud Search Platform is a business access platform that integrates all the underlying layers: effects platform, algorithm platform, the three retrieval engines (ES, Dgraph, Faiss), and the container platform. At the same time, it integrates unified service management capabilities to form a search platform. In the future, the business side does not need to care about the underlying database storage, writing and querying any further. The search platform will unify the relevant capabilities, and then provide a unified entrance and exit, while ensuring the overall performance and stability, so as to quickly empower the business. This will allow the business side to focus more on the high-level strategic planning and business logic.
This is the overall action plan for the graph database:
- Analysis of performance, stability optimization, and source code improvements
- A basic search engine that supports various graph retrieval requirements
- Access to the cloud search platform, optimize user interface, simplify operation and maintenance
- Strengthen search engine capability, improve search results
- Support industry graph, relationship graph, knowledge graph, risk management, etc.