Dgraph is rapidly gaining reputation as an easy to use database to build apps upon. Many new users of Dgraph have existing relational databases that they want to migrate from. In particular, we get asked a lot about how to migrate data from MySQL to Dgraph. In this article, we present a tool that makes this migration really easy: all a user needs to do is write a small 3 lines configuration file and type in 2 commands. In essence, this tool bridges one of the best technologies of the 20th century with one of the best ones of the 21st (if you ask us).
Before we dive into the details, let’s quickly glance over the data models of SQL databases and Dgraph. Real life objects have certain attributes: a person has a name, age, and date of birth; a movie has a name, release date, and genre. SQL databases represent all the attributes of an object by storing them into a single row in a table. Dgraph, on the other hand, represents these attributes as edges around a node.
Both systems work equally well if we stop here and consider attributes of objects as the only type of relationships in the universe. However, the real universe is more complex than that. The 2nd type of relationship is between objects.
As a concrete example, an actor is a person that has starred in a number of movies; a movie has one or more directors, one or more writers, and a generally larger number of actors. In order to logically link a concept (e.g. a movie) with another (e.g. an actor), a SQL database would normally have separate tables to represent the relationships. Thus, finding the names of the people that starred in a given movie requires:
movie_id
by searching for its name in the movie
table,person_id
from a second acting
table, and finallypeople
table with the people_id
s above and extracting the values in the name
column.If that seems already cumbersome, you might now want to consider how much more difficult it can get
for recursive relationships. Imagine there is a people
table, and a children
table mapping each
person_id
to the person_id
s of their children. The logic to find all the descendants of a person
with SQL is not trivial and requires a bunch of back and forth interactions between the application and the
database.
Graph Databases come to rescue in situations where SQL would have needed many nested join
operations: in Dgraph, inter-object relationships are represented by edges and edge traversal from
one node to another is supported as a first-class operation. This seemingly simple difference
makes an entirely new class of operations possible. Finding all people who are related to a movie
regardless of how they are related (using the expand(_all_)
operation), finding all of a person’s
descendants regardless of how many generations they are apart (using the @recurse
directive), finding the
shortest connection path between two people in a social network (using the shortest
function),
etc. The possibilities are limitless.
If you feel Dgraph is the right fit for your use case, let’s dive a little deeper and see how the migration tool works by studying an example. Below, you can see a simplified version of the database schema supporting StackOverflow:
create table users (
Id int,
DisplayName varchar(200),
PRIMARY KEY (Id)
);
create table posts (
Id int,
AcceptedAnswerId int NULL,
Body text,
OwnerUserId int NULL,
LastEditorUserId int NULL,
Title varchar(1000),
PRIMARY KEY (Id),
FOREIGN KEY (OwnerUserId) REFERENCES users (Id),
FOREIGN KEY (LastEditorUserId) REFERENCES users (Id),
FOREIGN KEY (AcceptedAnswerId) REFERENCES posts (Id)
);
create table comments (
Id int,
PostId int,
Text varchar(1000),
UserId int,
PRIMARY KEY (Id),
FOREIGN KEY (PostId) REFERENCES posts (Id),
FOREIGN KEY (UserId) REFERENCES users (Id)
);
create table votes(
Id int,
PostId int,
CreationDate varchar(100),
PRIMARY KEY (Id),
FOREIGN KEY (PostId) references posts (Id)
);
Each of the four tables stores objects that have their own attribute(s):
users
has a DisplayName
;posts
has a Title
and a Body
;comments
has some Text
;votes
has its CreationDate
.Besides the basic attributes, the objects are also interconnected:
posts
has an owner and hence a column OwnerUserId
referencing a row in the users
table.comments
table has a PostId
column referencing
a row in the posts
table.To illustrate the migration process, we’ve selected some sample rows from these tables. We will walk through each table, and explain how they can be converted into N-Quad entries. The RDF N-Quad format is one of the formats that can be used for importing data into Dgraph (the other accepted format is JSON.)
# users
| Id | DisplayName |
|-------|-------------------|
| -1 | NULL |
| 14 | Jimmy Hoffa |
| 15 | Mooseman |
| 2089 | MrPhooky |
| 12370 | Paul Wesselkamper |
# posts
| Id | AcceptedAnswerId | Body | OwnerUserId | LastEditorUserId | Title |
|-------|------------------|-----------|-------------|------------------|-----------|
| 9 | 11075 | <p>I try | 14 | 2089 | How can.. |
| 11075 | NULL | <p>When I | 12370 | -1 | |
# comments
| Id | PostId | Text | UserId |
|----|--------|---------------------------------------|--------|
| 4 | 9 | Of what fabric are the blankets made? | 15 |
# votes
| Id | PostId | CreationDate |
|----|--------|-------------------------|
| 10 | 9 | 2014-12-09T00:00:00.000 |
N-Quad entries have the format <subject> <predicate> <object> .
, so the single row in the comments
table is converted into the following N-Quad entries:
_:comments.4 <comments.Id> "4" .
_:comments.4 <comments.Text> "Of what fabric are the blankets made?" .
JSON equivalent (for reference):
{
"uid": "_:comments.4",
"comments.Id": 4,
"comments.Text": "Of what fabric are the blankets made?",
}
Take the first entry _:comments.4 <comments.Id> "4" .
as an example:
_:comments.4
is the subject,comments.Id
is the predicate,"4"
is the object, and finally.
is the terminal symbol of each N-Quad statement.In Dgraph, a subject that begins with _:
is called a blank node label, and it represents an object that is still to be created.
Thus the two N-Quad entries tell Dgraph that
comments.Id
and comments.Text
, whose values are "4"
, and "Of what..."
respectively.Notice that the blank node label for a given object must be unique across all the generated N-Quad
entries. To ensure uniqueness, we leverage the fact that the values in the primary key columns of a table
are unique so we construct the blank node labels by appending the primary key values after the table
name. The row in the comments
table whose primary key is 4 has the blank node label
_:comments_4
. However some SQL tables do not have any primary keys. To solve the problem, we would
maintain a counter to generate unique labels for each row in such tables.
Another interesting design decision is that we dropped the values in the PostId
and UserId
columns, because they should be represented
using edges in Dgraph. We will explain this in more details a little later.
Going back to the four N-Quad entries, it demonstrates the first and second rule for the conversion:
(1) To generate unique blank node labels, the values in the primary key columns are appended to table names if the table has a primary key; otherwise an increasing counter will be used.
(2) Each SQL table cell except those in foreign key columns is stored under a predicate name, which is constructed by concatenating the table name and the corresponding column name.
Similarly the 3 other tables “posts”, “users”, and “votes” can be converted by following the two rules:
_:posts.9 <posts.Body> "<p>I try to use..." .
_:posts.9 <posts.Id> "9" .
_:posts.9 <posts.Title> "How can I keep 2 blankets together on a bed?" .
_:posts.11075 <posts.Body> "<p>When I was in..." .
_:posts.11075 <posts.Id> "11075" .
_:posts.11075 <posts.Title> "" .
_:users.-1 <users.DisplayName> "" .
_:users.-1 <users.Id> "-1" .
_:users.14 <users.DisplayName> "Jimmy Hoffa" .
_:users.14 <users.Id> "14" .
_:users.15 <users.DisplayName> "Mooseman" .
_:users.15 <users.Id> "15" .
_:users.2089 <users.DisplayName> "MrPhooky" .
_:users.2089 <users.Id> "2089" .
_:users.12370 <users.DisplayName> "Paul Wesselkamper" .
_:users.12370 <users.Id> "12370" .
_:votes.10 <votes.CreationDate> "2014-12-09T00:00:00.000" .
_:votes.10 <votes.Id> "10" .
Note: the column formatting was added for your reading, the migration tool would not add the extra whitespaces.
Besides the basic attributes, we also need N-Quad entries to represent the inter-object relationships.
Consider how we converted the PostId
and UserId
foreign keys in the comments
table:
_:comments.4 <comments.PostId> _:posts.9 .
_:comments.4 <comments.UserId> _:users.15 .
Taking the first N-Quad entry _:comments.4 <comments.PostId> _:posts.9 .
as an example. The
subject is a blank node _:comments.4
. The predicate comments.PostId
is again the table name concatenated with the
column name. The object _:posts.9
is different from a plain attribute value – it is the
blank node label of a row in the posts
table, and derived by looking up the value 9
in the posts
table.
From these entries come our 3rd rule of conversion:
Following this rule, we can establish relationships between posts and users, between posts and posts, and between votes and posts:
_:posts.9 <posts.AcceptedAnswerId> _:posts.11075 .
_:posts.9 <posts.OwnerUserId> _:users.14 .
_:posts.9 <posts.LastEditorUserId> _:users.2089 .
_:votes.10 <votes.PostId> _:posts.9 .
Besides converting the data, the migration tool also needs to derive the schema of each predicate. We have two simple rules for converting the schema:
Body
column in the posts
table is of type text
,
and hence, the predicate posts.Id
is of type string
:
posts.Body: string .
posts.OwnerUserId.
, simply have the type
[uid]
, meaning following the predicate leads us to a set of other objects.These two rules give us the following schema when converting the SQL tables above:
comments.Id: int .
comments.Text: string .
comments.PostId: [uid] .
comments.UserId: [uid] .
posts.Body: string .
posts.Id: int .
posts.Title: string .
posts.AcceptedAnswerId: [uid] .
posts.OwnerUserId: [uid] .
posts.LastEditorUserId: [uid] .
users.DisplayName: string .
users.Id: int .
votes.CreationDate: string .
votes.Id: int .
votes.PostId: [uid] .
Once the migration tool finishes, two files will have been created:
The two files can then be imported into Dgraph via the Dgraph Live Loader
or Bulk Loader. It’s worth pointing out that sometimes you
may want to customize the schema, e.g. adding an index to a predicate, or by
changing an inter-object predicate from unidirectional to bidirectional links by adding the
@reverse
directive. If you would like such customizations, you should do it by editing
the schema file generated by the migration tool before feeding the files to the Live Loader or Bulk Loader.
It’s time to sit back, relax, grab something to drink and enjoy a demo with the migration tool in action:
[demo]: https://github.com/dgraph-io/dgraph/tree/master/dgraph/cmd/migrateThe step-by-step instructions for running the demo can be found [here][demo].
Besides the demo above, we’ve tested the tool using a larger data set from the lifehacks site on StackExchange, which showed us that for a dataset with around 10,000 rows the migration takes around 12 seconds (might vary on your machine).
To recap, we presented the design of a migration tool that exports SQL tables into NQuad entries. To make this blog easier to understand, we have simplified the problem, and left out some tricky details such as multi-column foreign key constraints. As a result, the gist of the design is straightforward and can be summarized into 3 rules:
From there, the generated NQuad entries can be imported into Dgraph with the Live Loader or Bulk Loader.
If you find this tool useful or have any suggestions for improvements, please drop us a note at our discussion forum.