The ultimate guide to graph databases

Graph databases vs relational databases:
Why do graph databases exist?

You would think relational databases, given the name, would be ideal for handling relationships, but that’s not the case. Let’s take a look at the most common type of database, the relational database.

Associative Entity, JOIN table or Lookup table

The relational database model, used by databases such as PostgreSQL, MySQL, or SQL Server, uses a table format to store data (as seen above). Each column represents a table, with the arrows between representing relationships.

Relational databases use a table format, consisting of rows and columns to represent data. Each table (usually) represents a specific entity. For example, the Employees table represents employees, with columns holding data such as ID, name, department, etc. So how would you represent relationships between entities (tables)?

There are two ways to create relationships between tables. The first way is for one entity to directly refer to another via its primary key. For example, Alice has a nice office on the second floor. The room’s ID is 812, and we can store that ID in Alice’s record.

For many-to-many relationships, the above method won’t cut it. We need to use a special table commonly referred to as a “JOIN table” or a Lookup table. We would create a new table that only holds IDs, and use that table to store relationships between entities.
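
Both approaches can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a recommended schema: the table names, column names, and the projects data are assumptions made up for the example; only Alice and room 812 on the second floor come from the text above.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rooms (id INTEGER PRIMARY KEY, floor INTEGER);
    -- Direct reference: the employee row stores the room's primary key.
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        room_id INTEGER REFERENCES rooms(id)
    );
    CREATE TABLE projects (id INTEGER PRIMARY KEY, title TEXT);
    -- Many-to-many: a JOIN table that holds nothing but pairs of IDs.
    CREATE TABLE employee_projects (
        employee_id INTEGER REFERENCES employees(id),
        project_id INTEGER REFERENCES projects(id),
        PRIMARY KEY (employee_id, project_id)
    );
    INSERT INTO rooms VALUES (812, 2);
    INSERT INTO employees VALUES (1, 'Alice', 812), (2, 'Bob', NULL);
    INSERT INTO projects VALUES (10, 'Search'), (11, 'Billing');
    INSERT INTO employee_projects VALUES (1, 10), (1, 11), (2, 10);
""")

# One JOIN resolves the direct reference from employee to room.
alice_room = conn.execute(
    "SELECT r.id, r.floor FROM employees e "
    "JOIN rooms r ON r.id = e.room_id WHERE e.name = 'Alice'"
).fetchone()

# Two JOINs are needed to cross the many-to-many relationship.
alice_projects = [row[0] for row in conn.execute(
    "SELECT p.title FROM employees e "
    "JOIN employee_projects ep ON ep.employee_id = e.id "
    "JOIN projects p ON p.id = ep.project_id "
    "WHERE e.name = 'Alice' ORDER BY p.title"
)]

print(alice_room, alice_projects)
```

Note that every extra hop through a many-to-many relationship adds another pair of JOINs to the query.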

As you can see from the example above, graph databases allow us to model relationships in a much more natural way. In addition, JOIN operations in relational databases are very costly. The usual workaround is to denormalize the data, which sacrifices data integrity.

In a graph system, relationships are stored with data, which means graph databases are much more performant when querying highly connected datasets. Considering that highly connected data is increasing across industries at a rapid rate, it makes sense why you’re reading about graph databases right now. In short, if relationships between data are more important than the data itself, graph databases are the undisputed winners.
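
To make "relationships are stored with data" concrete, here is a small in-memory sketch in Python. It is not how Dgraph or any other graph database is implemented; the Node class, the edge labels, and the sample data are all hypothetical. The point is only that each node carries its own edges, so a query follows references instead of consulting a lookup table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Edges live on the node itself, grouped by relationship label.
    edges: dict = field(default_factory=dict)

    def connect(self, label, other):
        self.edges.setdefault(label, []).append(other)

    def neighbors(self, label):
        return self.edges.get(label, [])


# Hypothetical sample data.
alice, bob = Node("Alice"), Node("Bob")
room, search = Node("Room 812"), Node("Search")

alice.connect("sits_in", room)
alice.connect("works_on", search)
bob.connect("works_on", search)
search.connect("staffed_by", alice)
search.connect("staffed_by", bob)

# Who works on the same projects as Alice? Follow edges outward from Alice;
# no separate table of ID pairs is consulted.
colleagues = sorted(
    {
        peer.name
        for project in alice.neighbors("works_on")
        for peer in project.neighbors("staffed_by")
        if peer is not alice
    }
)
print(colleagues)
```

The cost of the query depends on how many edges are followed from the starting node, not on the total size of the dataset.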

So, if graph databases are more intuitive and more performant, why are they not the go-to databases? To understand that, we need to learn a bit about the history of databases.

A very brief history of databases

The Beginning

Back in the 1960s, the first general-purpose database systems started appearing. Charles W. Bachman created the “Integrated Data Store” (IDS), widely regarded as the first DBMS (Database Management System). Around the same time, IBM created its own DBMS, known as IMS. By the mid-60s, many other general-purpose database systems had started to appear.

The lack of standardization led many customers of these systems to demand a standard, which led to the formation of the “Database Task Group” (very creative name 😉). In 1971, the Database Task Group published the “CODASYL” approach, which wasn’t great and left many people unhappy.

SQLing with delight

One such person was Edgar Codd, who wrote A Relational Model of Data for Large Shared Data Banks, which described a method for storing data in a “table with fixed-length records”. Sadly, IBM, where Codd was working, was heavily invested in IMS and wasn’t interested in it.

Luckily, two individuals at UC Berkeley decided to research relational database systems and created INGRES, which proved that a relational model could be efficient and practical. This pressured IBM to improve on QUEL (INGRES’ query language), and in 1974, IBM launched SQL, which soon overtook QUEL as a more functional query language. By the 1990s, SQL had become the de facto language of data and the SQL database reigned supreme. What if you didn’t want to use SQL? Too bad, you didn’t have a choice.

Say NO to SQL

Relational databases up to this point were architected on the assumption that they would be run on a single machine. However, as the internet exploded, some workloads grew so heavy that no single computer could handle the load. Enter the NoSQL database.

NoSQL, at least in the beginning, was developed as a solution to this scaling problem, as well as to the need to store unstructured data. One of the pioneers of NoSQL was Google, which, as one could imagine, ran into this scaling problem rather quickly. In 2006, Google published a paper describing a storage system built on top of its distributed file system. This storage system, known as BigTable, could run distributed across many physical nodes (servers). Other organizations, such as Facebook and Apache, later followed suit and created their own distributed NoSQL systems.

The “NoSQL” term might mislead you into thinking that NoSQL databases are all similar in design, but that is not the case. Today, there are hundreds, if not thousands of NoSQL databases, which can be split into one (or more) of the following categories:

  • Key/Value Stores: Simplest NoSQL model. They store pairs of keys and values and retrieve a value for a given key. Very high performance compared to SQL. Some keep these values entirely in memory (e.g. Redis), while others, such as Dgraph’s Badger, persist them to disk.
  • Document Stores: Handle semi-structured data (XML, JSON, etc). They store key/document pairs, but internal document data can be processed and indexed as well. Can contain nested structures such as arrays. Very flexible. The most popular example is MongoDB.
  • Graph Databases: Data represented as a graph, composed of nodes and edges. Optimized for complex data models. Some, like Dgraph, are distributed.
  • Wide Column Stores: Inspired by BigTable. Cells can be individually accessed. Lots of optimization techniques for splitting data across files and the network. Very high performance, very scalable.

Graphs to the Rescue?

NoSQL provided a great alternative to companies that needed to store massive amounts of data. Great. Now, let’s say you want to look through that data, process it, and provide an answer to your team or your customers? Suddenly NoSQL solutions are less appealing.

If you want to do anything interesting with your data, like recommend purchases based on what friends or friends-of-friends buy, you will be processing many records – think terabytes of data. With a graph database? Let me get back to you in a millisecond (or less).
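
The friends-of-friends recommendation above can be sketched as a two-hop traversal. The snippet below is a plain-Python illustration under stated assumptions: the names, purchases, and the recommend function are all made up for the example, and a real graph database would run this kind of traversal inside its query engine rather than in application code.

```python
from collections import Counter

# Hypothetical social graph and purchase history.
friends = {
    "ana": {"ben", "cho"},
    "ben": {"ana", "dev"},
    "cho": {"ana", "dev"},
    "dev": {"ben", "cho", "eli"},
    "eli": {"dev"},
}
purchases = {
    "ana": {"kettle"},
    "ben": {"kettle", "teapot"},
    "cho": {"teapot", "mug"},
    "dev": {"mug", "scale"},
    "eli": {"grinder"},
}


def recommend(user, max_hops=2):
    """Rank items bought by people within max_hops of user."""
    seen = {user}
    frontier = {user}
    # Breadth-first expansion: one loop iteration per hop.
    for _ in range(max_hops):
        frontier = {f for person in frontier for f in friends[person]} - seen
        seen |= frontier
    network = seen - {user}
    # Count how many people in the network bought each item the user lacks.
    counts = Counter(
        item
        for person in network
        for item in purchases[person]
        if item not in purchases[user]
    )
    # Most popular first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda item: (-counts[item], item))


print(recommend("ana"))
```

Only the people reachable within two hops are ever touched; eli, who is three hops from ana, never enters the computation.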

Graph Technology Drawbacks

If graph databases are so great, why haven’t they caught on?

Graph databases are relatively new technology.

Neo4j, Inc., creator of the Neo4j database and the first graph database company, was founded in 2000. Graph databases themselves only started gaining commercial acceptance in the early 2010s.

Learning a new query language sounds like a hassle.

Even innovative companies like Google have fallen into this trap. Funny enough, Dgraph was founded because of Google’s resistance to migrating to a graph database. Read more about why Google needed a graph database and the founding of Dgraph.

New query languages and a different paradigm.

Almost all relational databases use some variant of SQL for querying. Graph databases, on the other hand, don’t have a standard language.

But the main thing to notice is that, in fact, they are catching on! This is a graph of “Popularity Changes per Database Categories” from DB-Engines. Since 2013, the popularity of graph databases has risen by almost 1200%!

Who needs a graph database?
What are some graph database use cases?

Knowledge Graphs

Yeah, it has graph in the name. But there’s more to it than that. A knowledge graph organizes the relationships between entities to get a more human-centered data understanding. By capturing real-world context in your data’s knowledge graph, you’ll connect internal data silos.

For example, purchase data may be useless on its own, but once you connect it to browsing data and a social network, you can now start to see what users are buying, what their friends are buying, and most importantly, what products they are likely to buy in the future.

Interested in an example? Take a look at how KE Holdings used Dgraph to create a searchable knowledge graph from 48 billion tuples.

Social Networks

Social networks are inherently graphs. Whether declared or implied, you (an entity) are friends with entities, coworkers with other entities, possibly married to an entity, etc. You are interested in specific entities (interests), live in an entity (city), and post entities (images, posts, statuses), which are liked, commented on, and shared by other entities (friends). You go out to eat in entities (restaurants) with other entities (people), and you enjoy eating specific entities (food).

Facebook created its own graph system and graph storage named TAO.

Fraud Detection

Fraudsters have become more sophisticated, and traditional fraud measures that focus on discrete data points, once state of the art, are now much more easily fooled. Cutting-edge fraud detection systems look beyond individual data points to patterns of relationships that are difficult to notice without a graph database.
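
As a rough illustration of what "patterns of relationships" can mean, the sketch below groups accounts that are linked through shared identifiers such as a device or a card. It is a simplified example, not a description of any production fraud system: the account data, identifier format, and linked_groups function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical accounts and the identifiers each one has used.
accounts = {
    "acct_1": {"phone:555-0101", "device:A"},
    "acct_2": {"device:A", "card:9921"},
    "acct_3": {"card:9921"},
    "acct_4": {"phone:555-0199"},
}


def linked_groups(accounts):
    """Group accounts connected, directly or indirectly, by shared identifiers."""
    # Reverse index: identifier -> accounts that used it.
    owners = defaultdict(set)
    for account, identifiers in accounts.items():
        for identifier in identifiers:
            owners[identifier].add(account)

    groups, visited = [], set()
    for start in accounts:
        if start in visited:
            continue
        # Depth-first walk over account -> identifier -> account edges.
        group, stack = set(), [start]
        while stack:
            account = stack.pop()
            if account in group:
                continue
            group.add(account)
            for identifier in accounts[account]:
                stack.extend(owners[identifier] - group)
        visited |= group
        groups.append(sorted(group))
    return groups


print(linked_groups(accounts))
```

Here acct_1 and acct_3 share no identifier directly, so a check on individual data points would treat them as unrelated; the link only appears when the relationships through acct_2 are followed.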

See how Feedzai, which scores trillions of dollars a year for fraud, built their graph database to improve their fraud detection efforts.

Recommendation Engines

Real-time recommendations are at the heart of any online store. Making relevant real-time recommendations means taking into account product, customer, inventory, delivery, and sentiment data – and that’s without including any new interests in the customer’s current session. Using a graph database you can provide real-time content, service, and product recommendations by uniting all user data for a truly personalized and engaging experience that increases revenue.

Read how you could build an Amazon-like recommendation engine for yourself.

Machine Learning/AI

Many companies are now dabbling in AI and machine learning to achieve their goals. Like the use cases above, machine learning depends heavily on uncovering patterns in data. You can make better predictions by using the relationships in the data than you can from the data alone. For example, the most powerful predictor of whether someone will start smoking is whether they have friends who smoke. And what kind of database helps unlock the relationships in your data? You guessed it, a graph database.

When do I need to switch to a graph database?

That’s a tough question. If you answer yes to one or more of the following questions, it might be time to start using a graph database for at least a portion of your data.

Is your data heavily connected?

Do you have a lot of many-to-many relationships? Are relationships as important as, or more important than, the data itself? If so, strongly consider using a graph database. As your dataset grows larger, relationships will become more and more cumbersome to maintain, and much harder to query and understand.

Is query speed more important than write speed?

If you think that querying and analyzing data fast is much more important to you than optimizing writing and storing data, then a graph database is a good fit for you. As your dataset and relationships between data grow, a graph database will become far more performant for complex queries and data analytics.

How do I choose a graph database?
Which graph database is best?

Just use Dgraph. It’s that easy!

You’re still here? OK, let’s get serious. Two main points differentiate some graph databases from others: performance and scalability. Other features are nice to have, but a lack of performance or scalability is most often the dealbreaker.

Every graph database claims that it is “blazing-fast”. Due to a relative lack of consistent and unbiased benchmarks in the space, it’s tough to crown a specific database the winner in the performance category. Additionally, every database is built differently and excels in slightly different areas, which makes things more confusing.

That said, KE Holdings ran its own benchmarks and assessments for performance and scalability. The team found that Dgraph was able to load and query a dataset of 48 billion tuples and still maintain 15,000 queries per second!

It’s also worth looking at how Dgraph has previously run benchmarks against Neo4j. We encourage others to follow similar steps when benchmarking competitors against each other.

Third-Party Benchmarks

While unbiased benchmarking is scarce, there are still third-party sources of information that can reduce bias and help in your assessment.

Forrester evaluates the “Graph Data Platform” category. While the methodology may not be applicable for all companies in the category, it can be a helpful starting point.

Jepsen Test

Jepsen is a framework designed for testing whether distributed systems live up to their consistency guarantees. Dgraph is the first and only graph database to have been Jepsen tested.

G2

G2 collects honest reviews from verified product users – people have to share screenshots of the product to prove that they’re actual users and not just bots.

GitHub + Docker

Dgraph is the most popular graph database on GitHub, with over 15k stars and more than 1,000 forks. Dgraph also has over five million Docker pulls!