I recently started working at Dgraph Labs in Bengaluru as a Software Engineer. One of the first issues that I worked on was related to the Dgraph’s datetime datatype. This article covers how I discovered and resolved the issue. The article assumes a basic understanding of Dgraph, but you can learn everything you need in docs.dgraph.io.
While working on Flock, I observed that timestamp comparisons were not working correctly in Dgraph when the data or query had timezone information in it. To see the issue, let’s first quickly set up Dgraph using the commands below.
The easiest way is to use Docker Compose.
Create a docker-compose.yml
file with the contents below.
version: "3.2"
services:
zero:
image: dgraph/dgraph:v1.0.16
restart: on-failure
command: dgraph zero --my=zero:5080
server:
image: dgraph/dgraph:v1.0.16
ports:
- 8080:8080
restart: on-failure
command: dgraph alpha --my=server:7080 --lru_mb=2048 --zero=zero:5080
ratel:
image: dgraph/dgraph:v1.0.16
ports:
- 8000:8000
command: dgraph-ratel
Next, from the folder containing the newly created file, run:
docker-compose up
We will use Ratel to showcase the issue. Ratel allows you to run queries, mutations and manage the Dgraph cluster. To open Ratel, you can go to the URL http://localhost:8000/?latest in your browser after running the commands above.
Now, let’s set up the schema. To set up the schema, you have to go to the Schema
tab on the left navigation bar in Ratel, click on Bulk Edit
and paste the following schema:
created_at : datetime @index(hour) .
Then, click on Apply Schema
. This will create a new predicate created_at
of type datetime
.
Now you can perform the following mutation to add data to Dgraph in the Console
tab
after choosing Mutate
option. This will store two tweets with a creation time and author name.
{
"set": [
{
"uid" :"_:user1",
"created_at": "2019-03-28T14:00:00-06:00",
"author_name": "Rahul"
},
{
"uid" :"_:user2",
"created_at": "2019-03-28T18:00:00+01:00",
"author_name": "Ashish"
}
]
}
Finally, run the following query to get all the tweets that were created after
UTC March 28th, 3 PM
. You can run queries in the Console
after selecting
Query
option:
{
tweets(func: gt(created_at, "2019-03-28T15:00:00+00:00")) {
created_at
uid
}
}
This is the response we get:
{
"extensions": { ... },
"data": {
"tweets": [
{
"created_at": "2019-03-28T18:00:00+01:00",
"uid": "0x2"
}
]
}
}
The created_at
timestamp for the both the users is as follows.
For brevity, we will remove the date (2019-03-28
) from timestamps
in the rest of the article.
Created At | In UTC | Greater than query ts (15h +00:00) | |
---|---|---|---|
user1 | 14h -06:00 | 20h +00:00 | True |
user2 | 18h +01:00 | 17h +00:00 | True |
Even when the created_at
timestamps for both the users are larger than the timestamp
in the query, only user2
shows up in the response of the query. Before I explain
how we found and fixed the root cause of the issue, let me first explain how indexes
work in Dgraph.
Along with JSON, Dgraph takes RDF data format as input which looks like
<Subject> <Predicate> <Object>
(SPO Triples).
The Subject
is generally a UID
value representing a node in the graph whereas
the Predicate
represents a relationship of the node with the Object
value.
Indexes are created on a given Predicate
to quickly find nodes in the graph for
the provided Object
value.
Dgraph indexes are stored in BadgerDB, an embedded
key-value store written purely in Go. For each object value, one or more tokens are
generated based on the provided tokenizer. For example, in the case of a hash
tokenizer,
the tokenizer computes the hash of the Object
value and returns it as a token.
Once the tokens are generated, data is stored in badgerDB in the following format.
Each key in badgerDB, is a combination of predicate and generated token, and the value
is the list of UIDs that match the predicate and the token (the key). An example is shown
below for created_at
predicate:
key | value |
---|---|
< created_at, Tokenizer-Tok1 > | List(uid1, uid3) |
< created_at, Tokenizer-Tok2 > | List(uid2) |
We prepend a unique Tokenizer
identifier in the key (whether it’s a hash tokenizer,
int tokenizer, hour tokenizer etc.) to the token generated, as shown above, to be able
to traverse through all the tokens belonging to only that particular tokenizer (index).
For a datetime index, Dgraph provides tokenizers for hour
, day
, month
and year
granularity. Each key generated by the tokenizer can be treated as a time bucket of
specified granularity, storing the UIDs with a date falling within that bucket.
The granularity of a datetime index should be carefully chosen to keep the size of
each bucket reasonable. Choosing a lower granularity time index would wipe out the
performance benefits of UID consolidation into time buckets (think one entry per bucket).
On the other hand, choosing very high granularity would end up putting too many UIDs
within the same bucket, which could also affect performance (millions of entries per bucket).
In order to generate tokens for datetime datatype, we use the Hour()
, Day()
,
Month()
and Year()
methods of the time
package as you can see in the
(source code).
These methods return the information accounting for the timezone information in the
timestamp. For example, for timestamp 14h +06:00
, the Hour()
method
returns the value 14
and for timestamp 14h +00:00
, the Hour()
method still returns the same value 14
. Due to this behavior, the index is not
correctly built.
In our example above, the data has two timestamps 1) 14h -06:00
and
2) 18h +01:00
and we have created an hour
index on the created_at
predicate. The hour
index incorrectly stores roughly the following information:
<created_at, HourTokenizer-2019-03-28T14> -> List(0x01)
<created_at, HourTokenizer-2019-03-28T18> -> List(0x02)
The query is trying to find out all users having created_at
timestamp bigger than
15h +00:00
. While processing the query, Dgraph computes the same
hour
tokenizer for this timestamp to be 2019-03-28T15
. Then, it scans through
all the users in the index having the token bigger than 2019-03-28T15
and only
returns the user with uid 0x02
.
Once the issue was clear, the fix was pretty straight forward. Instead of building tokens
directly on the given timestamp, we first convert the timestamp into UTC()
timezone and
then compute various tokens. This normalizes all timestamps into single timezone and the
Hour()
, Day()
, Month()
and Year()
information can be correctly compared. The PR
for this fix is available here.
Once this change is included, the index looks like this instead:
<created_at, HourTokenizer-2019-03-28T17> -> List(0x01)
<created_at, HourTokenizer-2019-03-28T20> -> List(0x02)
The timestamp in the query now generates the token as 2019-03-28T15
and we see
both the users while scanning through the index. The correct response, therefore,
looks like as follows:
{
"extensions": { ... },
"data": {
"tweets": [
{
"created_at": "2019-03-28T14:00:00-06:00",
"uid": "0x1"
},
{
"created_at": "2019-03-28T18:00:00+01:00",
"uid": "0x2"
}
]
}
}
You can see that both the users are included in this response.
In this article, we discussed how indexes are implemented in Dgraph. I also showed how I discovered and resolved the issue with datetime indexes. While backporting it to v1.0 series would require change in data format on disk, the fix will be available in v1.1 which we plan to release end of this month. Stay tuned!