Although graph databases seem complicated, they are much simpler to understand than relational databases. At its core, a graph database is built on the concept of a graph. A graph is composed of two elements: nodes and edges.
Each node represents an entity, and each edge represents the relationship between two nodes. For example, in the social graph below, people are represented by nodes, and the relationships between people are represented by edges.
One thing to note is that most graph databases use a directed graph, which means that edges have a direction. As such, not all relationships are two-way. For example, in the graph above, Jack can have a “Friend” edge to Annie, but Annie may or may not have a “Friend” edge back to Jack.
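To make the idea of direction concrete, here is a minimal in-memory sketch in Python. It is not any particular graph database's API: the `graph` structure and the `add_edge` and `has_edge` helpers are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory directed graph:
# node -> {edge label -> set of target nodes}
graph = defaultdict(lambda: defaultdict(set))

def add_edge(source, label, target):
    """Add a directed, labeled edge from source to target."""
    graph[source][label].add(target)

def has_edge(source, label, target):
    """Return True if the directed edge source -[label]-> target exists."""
    return target in graph.get(source, {}).get(label, set())

# Jack considers Annie a friend; nothing is stored in the other direction.
add_edge("Jack", "Friend", "Annie")

print(has_edge("Jack", "Friend", "Annie"))  # True
print(has_edge("Annie", "Friend", "Jack"))  # False
```

A two-way friendship would simply be two edges, one in each direction.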
You would think they would, but that’s not the case. Let’s take a look at the most common type of database, relational databases.
The relational database model, used by databases such as PostgreSQL, MySQL, or SQL Server, uses a table format to store data (as seen above). Each column represents a table, with the arrows between representing relationships.
Relational databases use a table format, consisting of rows and columns to represent data. Each table (usually) represents a specific entity. For example, the Employees table represents employees, with columns holding data such as ID, name, department, etc. So how would you represent relationships between entities (tables)?
There are two ways to create relationships between tables. The first way is for one entity to refer directly to another by storing that entity’s primary key (a foreign key). For example, Alice has a nice office on the second floor. The room’s ID is 812, and we can store that ID in Alice’s record.
For many-to-many relationships, the above method won’t cut it. We need to use a special table commonly referred to as a “JOIN table” or a Lookup table. We would create a new table that only holds IDs, and use that table to store relationships between entities.
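Both approaches can be sketched with Python's built-in `sqlite3` module. Alice and room 812 come from the example above. Everything else (Bob, the `projects` table, the `employee_projects` join table, and the `projects_for` helper) is hypothetical data added only to show a many-to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rooms (id INTEGER PRIMARY KEY, floor INTEGER);

    -- Direct reference: the employee row stores the room's primary key.
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        room_id INTEGER REFERENCES rooms(id)
    );

    CREATE TABLE projects (id INTEGER PRIMARY KEY, title TEXT);

    -- Join table: holds only IDs, one row per employee/project pair.
    CREATE TABLE employee_projects (
        employee_id INTEGER REFERENCES employees(id),
        project_id INTEGER REFERENCES projects(id),
        PRIMARY KEY (employee_id, project_id)
    );

    INSERT INTO rooms VALUES (812, 2);
    INSERT INTO employees VALUES (1, 'Alice', 812), (2, 'Bob', NULL);
    INSERT INTO projects VALUES (10, 'Website'), (11, 'Billing');
    INSERT INTO employee_projects VALUES (1, 10), (1, 11), (2, 10);
""")

def projects_for(name):
    """Return the project titles for an employee, resolved through the join table."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM employees e
        JOIN employee_projects ep ON ep.employee_id = e.id
        JOIN projects p ON p.id = ep.project_id
        WHERE e.name = ?
        ORDER BY p.title
        """,
        (name,),
    ).fetchall()
    return [title for (title,) in rows]

alice_room = conn.execute(
    "SELECT room_id FROM employees WHERE name = 'Alice'"
).fetchone()[0]

print(alice_room)             # 812
print(projects_for("Alice"))  # ['Billing', 'Website']
```

Note that answering "which projects does Alice work on?" already takes two JOINs, even in this tiny schema.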
As you can see from the example above, graph databases allow us to model relationships in a much more natural way. In addition, JOIN operations in relational databases become costly as the number of tables and rows involved grows. This is usually addressed by denormalizing data, at the expense of data integrity.
In a graph system, relationships are stored with data, which means graph databases are much more performant when querying highly connected datasets. Considering that highly connected data is increasing across industries at a rapid rate, it makes sense why you’re reading about graph databases right now. In short, if relationships between data are more important than the data itself, graph databases are the undisputed winners.
So, if graph databases are more intuitive and more performant, why are they not the go-to databases? To understand that, we need to learn a bit about the history of databases.
Back in the 1960s, the first general-purpose database systems started appearing. Charles W. Bachman created the “Integrated Data Store” (IDS), widely regarded as the first DBMS (Database Management System). Around the same time, IBM also created its own DBMS, known as IMS. By the mid-60s, many other general-purpose database systems had started appearing.
The lack of standardization led many customers of these systems to demand a standard, which led to the formation of the “Database Task Group” (very creative name ;) ). In 1971, the Database Task Group published the “CODASYL” approach, which wasn’t great and left many people unhappy.
One such person was Edgar Codd, who wrote A Relational Model of Data for Large Shared Data Banks, which described a method for storing data in a “table with fixed-length records”. Sadly, IBM, where Codd was working, was heavily invested in IMS and wasn’t interested in his approach.
Luckily, two individuals at UC Berkeley decided to research relational database systems and created INGRES, which proved that a relational model could be efficient and practical. This pressured IBM to improve on QUEL (INGRES’ query language), and in 1974, IBM launched SQL, which soon overtook QUEL as a more functional query language. By the 1990s, SQL had become the de facto language of data and the SQL database reigned supreme. What if you didn’t want to use SQL? Too bad, you didn’t have a choice.
Relational databases up to this point were architected on the assumption that they would be run on a single machine. However, as the internet exploded, some workloads grew so heavy that no single computer could handle the load. Enter the NoSQL database.
NoSQL, at least in the beginning, was developed as a solution to this scaling problem, as well as to the need to store unstructured data. One of the pioneers of NoSQL was Google, which, as one could imagine, ran into this scaling problem rather quickly. In 2006, Google published a paper describing Bigtable, a storage system built on top of its distributed file system. Bigtable could run distributed across many physical nodes (servers). Other organizations, such as Facebook and the Apache Software Foundation, later followed suit and created their own distributed NoSQL systems.
NoSQL provided a great alternative to companies that needed to store massive amounts of data. Great. Now, let’s say you want to look through that data, process it, and provide an answer to your team or your customers? Suddenly NoSQL solutions are less appealing.
If you want to do anything interesting with your data, like recommend purchases based on what friends or friends-of-friends buy, you will be processing many records - think terabytes of data. With a graph database? Let me get back to you in a millisecond (or less).
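The friends-of-friends recommendation can be sketched as a short graph traversal. This is a toy in-memory illustration, not a benchmark and not how any particular graph database executes the query. The `friends` and `purchases` data and the `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical sample data: who is friends with whom, and who bought what.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"bob"},
}
purchases = {
    "ann": {"tent"},
    "bob": {"tent", "stove"},
    "cat": {"stove", "lamp"},
    "dan": {"lamp"},
}

def recommend(user, max_hops=2):
    """Rank products bought by people within max_hops of user.

    Products the user already owns are skipped. Results are sorted by
    how many people in the user's network bought them, then by name.
    """
    seen = {user}
    frontier = {user}
    for _ in range(max_hops):
        # Follow friend edges one hop further, ignoring people already visited.
        frontier = {f for person in frontier for f in friends.get(person, set())} - seen
        seen |= frontier

    owned = purchases.get(user, set())
    counts = Counter()
    for person in seen - {user}:
        for product in purchases.get(person, set()):
            if product not in owned:
                counts[product] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(recommend("ann"))              # [('lamp', 2), ('stove', 2)]
print(recommend("ann", max_hops=1))  # [('stove', 2), ('lamp', 1)]
```

Each hop only follows edges from the nodes already reached, so the work depends on the size of the user's neighborhood rather than on the size of the whole dataset.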
If graph databases are so great, why haven’t they caught on?
Neo Technology, creator of the Neo4j database and the first graph database company, was founded in 2000. Graph databases themselves only started gaining commercial acceptance in the early 2010s.
Even innovative companies like Google have fallen into this trap. Funny enough, Dgraph was founded because of Google’s resistance to migrating to a graph database. Read more about why Google needed a graph database and the founding of Dgraph.
Almost all relational databases use some variant of SQL for querying. Graph databases, on the other hand, don’t have a standard language.
Yeah, it has graph in the name. But there’s more to it than that. A knowledge graph organizes the relationships between entities to get a more human-centered data understanding. By capturing real-world context in your data’s knowledge graph, you’ll connect internal data silos.
For example, purchase data may be useless on its own, but once you connect it to browsing data and a social network, you can now start to see what users are buying, what their friends are buying, and most importantly, what products they are likely to buy in the future.
Interested in an example? Take a look at how KE Holdings used Dgraph to create a searchable knowledge graph from 48 billion tuples.
Social networks are inherently graphs. Whether declared or implied, you (an entity) are friends with entities, coworkers with other entities, possibly married to an entity, etc. You are interested in specific entities (interests), live in an entity (city), and post entities (images, posts, statuses), which are liked, commented on, and shared by other entities (friends). You go out to eat in entities (restaurants) with other entities (people), and you enjoy eating specific entities (food).
Fraudsters have become more sophisticated, and traditional fraud measures that focus on discrete data points, once state of the art, are now much more easily fooled. Cutting-edge fraud detection systems look beyond individual data points to patterns of relationships that are difficult to notice without a graph database.
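One relationship pattern of this kind is a group of accounts linked through shared identifiers. The sketch below is an illustrative assumption, not a description of any production fraud system: the `accounts` data, the identifier format, and the `find_rings` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical data: each account and the identifiers it has used.
accounts = {
    "acct1": {"phone:555-0100", "device:A"},
    "acct2": {"device:A", "card:1111"},
    "acct3": {"card:1111"},
    "acct4": {"phone:555-0199"},
}

def find_rings(accounts, min_size=3):
    """Group accounts that are connected through shared identifiers.

    Returns the connected groups containing at least min_size accounts,
    each as a sorted list.
    """
    # Index: identifier -> accounts that use it.
    by_identifier = defaultdict(set)
    for account, identifiers in accounts.items():
        for identifier in identifiers:
            by_identifier[identifier].add(account)

    visited = set()
    rings = []
    for start in accounts:
        if start in visited:
            continue
        # Depth-first traversal over account -> identifier -> account links.
        component = set()
        stack = [start]
        while stack:
            account = stack.pop()
            if account in component:
                continue
            component.add(account)
            for identifier in accounts[account]:
                stack.extend(by_identifier[identifier] - component)
        visited |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings

print(find_rings(accounts))  # [['acct1', 'acct2', 'acct3']]
```

Here acct1 and acct3 share no identifier directly. The link between them only shows up when the traversal passes through acct2, which is exactly the kind of connection that checks on individual data points miss.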
Real-time recommendations are at the heart of any online store. Making relevant real-time recommendations means taking into account product, customer, inventory, delivery, and sentiment data - and that’s without including any new interests in the customer’s current session. Using a graph database you can provide real-time content, service, and product recommendations by uniting all user data for a truly personalized and engaging experience that increases revenue.
Read how you could build an Amazon-like recommendation engine for yourself.
Many companies are now dabbling in AI and machine learning to achieve their goals. Like the use cases above, machine learning depends heavily on uncovering patterns in data. You can make better predictions about data by using relationships in the data than you can from just the data alone. For example, the most powerful predictor of whether someone will start smoking is whether they have friends who smoke. And what kind of database helps unlock relationship data? You guessed it, a graph database.
That’s a tough question. If you answer yes to one or more of the following questions, it might be time to start using a graph database for at least a portion of your data.
At a smaller scale, relational databases can handle your needs, but as with everything, migrating to a graph database earlier rather than later will save you headaches when your current database begins to struggle.
Do you have many many-to-many relationships? Are relationships as important as, or more important than, the data itself? If so, seriously consider using a graph database. As your dataset grows larger, relationships will become more and more cumbersome to maintain, and much harder to query and understand.
If you think that querying and analyzing data fast is much more important to you than optimizing writing and storing data, then a graph database is a good fit for you. As your dataset and relationships between data grow, a graph database will become far more performant for complex queries and data analytics.
Just use Dgraph. It’s that easy!
You’re still here? OK, let’s get serious. Two main points differentiate some graph databases from others: performance and scalability. Other features are nice to have, but a lack of performance or scalability is most often the dealbreaker.
Every graph database claims that it is “blazing-fast”. Due to a relative lack of consistent and unbiased benchmarks in the space, it’s tough to crown a specific database the winner in the performance category. Additionally, every database is built differently and excels in slightly different areas, which makes things more confusing.
That said, KE Holdings ran its own benchmarks and assessments for performance and scalability. The team found that Dgraph was able to load and query a dataset of 48 billion tuples, and still maintain 15,000 Q/S (queries per second)! (Source: https://www.techrepublic.com/article/chinas-zillow-alternative-goes-open-source-to-scale-a-10-billion-node-graph-database/)
It’s also worth looking at how Dgraph has previously run benchmarks against Neo4j. We encourage others to follow similar steps when benchmarking competitors against each other.
While unbiased benchmarking is scarce, there are still third-party sources of information that can reduce bias and help in your assessment.
Forrester evaluates the “Graph Data Platform” category. While the methodology may not be applicable for all companies in the category, it can be a helpful starting point.
G2 collects honest reviews from verified product users - people have to share screenshots of the product to prove that they’re actual users and not just bots.
Jepsen is a framework designed for testing whether distributed systems live up to their consistency guarantees. Dgraph is the first and only graph database to have been Jepsen tested.
Dgraph is the most popular graph database on GitHub, with over 15k stars and more than 1000 forks. Dgraph also has over five million Docker pulls!