The recent release of Dgraph is packed with new features and improvements. Many of them are related to strings - full text search (with support for 15 languages!) and regular expression matching have been added, and handling of string values in multiple languages was greatly improved. All of these changes make Dgraph an excellent tool for working with multilingual applications.
We’re working hard to keep the query language easy to use and clean. Dgraph, in v0.7.5, adopted and extended the language tag syntax from the RDF N-Quads standard. It is intuitive, well-known, and was partially supported in previous versions (during data loading).
Let’s start from the beginning - the data.
Dgraph uses RDF N-Quads for data loading and backup.
String literals in N-Quads may be followed by the @ sign and a language tag, e.g. "badger"@en or "Dachs"@de.
Multiple such literals may be used as a value for a single entity/attribute pair.
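As a rough illustration of the syntax above, the sketch below formats language-tagged literals as N-Quad lines. The subject badger, the predicate name, and the helper nquad are made up for this example, and the escaping covers only backslashes and double quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LangTaggedNQuads {
    // Hypothetical helper: formats one N-Quad whose object is a language-tagged string literal.
    static String nquad(String subject, String predicate, String value, String lang) {
        // Escape backslashes first, then double quotes.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "<" + subject + "> <" + predicate + "> \"" + escaped + "\"@" + lang + " .";
    }

    public static void main(String[] args) {
        // Two values for the same entity/attribute pair, one per language.
        Map<String, String> names = new LinkedHashMap<>();
        names.put("en", "badger");
        names.put("de", "Dachs");
        for (Map.Entry<String, String> e : names.entrySet()) {
            System.out.println(nquad("badger", "name", e.getValue(), e.getKey()));
        }
    }
}
```

Each printed line carries the same subject and predicate, so both values end up attached to a single entity/attribute pair.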
When querying for a predicate with multiple values, the user is able to use the @lang notation known from RDF N-Quads.
Multiple languages can be specified as a list in order of preference, e.g. @en:de denotes that the preferred language is English, but if such a value is not present, a value in German should be returned.
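The fallback rule can be pictured with a small sketch. This is only a model of the behavior described above, not Dgraph's implementation; the class LangPreference and the method pick are hypothetical names, and the preference list is passed without the leading @.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class LangPreference {
    // Returns the value for the first language in the preference list that has one.
    static Optional<String> pick(Map<String, String> valuesByLang, String preference) {
        for (String lang : preference.split(":")) {
            String value = valuesByLang.get(lang);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> name = new HashMap<>();
        name.put("de", "Dachs");
        // No English value is stored, so the German one is used.
        System.out.println(pick(name, "en:de").orElse("(no value)"));
    }
}
```

With only a German value stored, en:de falls through to German; once an English value is added, it wins because it comes first in the list.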
Language can also be specified in functions, which is important especially for full text search.
The dataset used in all examples is the Freebase film data.
As this post is string-oriented, queries are focused on movie titles in multiple languages, and no other information is retrieved.
As we don’t have information about the type of entity a name belongs to, we use filtering to select only the movie titles, and to limit the number of results a bit: @filter(gt(count(genre), 1)).
The schema for the name field is very simple - it defines 3 types of indexes:
curl localhost:8080/query -XPOST -d $'
mutation {
schema {
name: string @index(term, fulltext, exact) .
}
}' | python -m json.tool | less
term index is used for term matching with the allofterms and anyofterms functions. Note that it was the only string index available in previous releases of Dgraph.
fulltext index uses matching with language-specific stemming and stop words. One thing worth noting is that values indexed with fulltext are processed according to their language (if they are tagged). If values are untagged, English is used as the default language.
exact index is used for regular expression matching.
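To make the difference between the two term matching functions concrete, here is a minimal sketch of their semantics. It assumes that terms are lowercased words split on non-alphanumeric characters; the class and method names are hypothetical and the code does not reflect how Dgraph's index is built.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TermMatching {
    // Splits text into lowercase terms on anything that is not a letter or digit.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Models allofterms: every query term must occur in the value.
    static boolean allOfTerms(String value, String query) {
        return terms(value).containsAll(terms(query));
    }

    // Models anyofterms: at least one query term must occur in the value.
    static boolean anyOfTerms(String value, String query) {
        Set<String> common = terms(value);
        common.retainAll(terms(query));
        return !common.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(allOfTerms("Black and White", "white black"));
        System.out.println(anyOfTerms("Black Swan", "white black"));
    }
}
```

Under this model every query word counts, including common words such as "or", which is why term matching can be too strict for free-form user input.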
By definition (from Wikipedia):
In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user).
This may sound trivial, but it’s not. Searching for the exact form of a word is not always satisfying for the user. For example, nouns can be singular or plural, verbs have grammatical tenses, etc., and the user may be interested in all values related to the word in any inflected or derived form.
The simple but powerful idea is to find a method that can transform all the forms of a word to some common base. This process is called stemming. For many natural languages (including English), stemmers may be implemented using a set of well-known grammatical rules. There are also languages (like Polish) where a dictionary-based approach is required (i.e. inflected form -> stem mapping).
Stemmers are widely available only for languages with well-known grammatical rules.
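The idea of rule-based stemming can be shown with a deliberately tiny example. The suffix list below is a toy chosen for illustration; it is not the stemmer Dgraph uses, and real stemmers apply many more rules.

```java
import java.util.Locale;

final class ToyStemmer {
    // Strips the first matching suffix, as long as at least three characters remain.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"walk", "walks", "walked", "walking"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

All four forms reduce to the same base, so indexing the base lets a search for any one form find the others.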
Another problem with search is common words, like "the", "is", or "at".
In most cases, searching for them returns an enormous number of useless results.
Those words are called stop words.
Again, stop words are language specific.
The common method of handling those words is just to remove them from the search.
The following steps are applied to both data (while indexing), and the query pattern:
Since stop words contain inflected forms, they are removed before stemming.
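The ordering described above (stop words removed first, stemming afterwards) can be sketched as follows. The tokenization rule, the four-word stop list, and the toy stemmer are all assumptions made for this example; they stand in for the language-specific components and are not Dgraph's actual pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class FullTextTokens {
    // Tiny illustrative stop word list; real lists are longer and language-specific.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "is", "at", "or"));

    // Toy stemmer: strips one suffix if at least three characters remain.
    static String stem(String w) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Lowercase, split into words, drop stop words, then stem what is left.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (t.isEmpty() || STOP_WORDS.contains(t)) {
                continue; // stop words are removed before stemming
            }
            out.add(stem(t));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("The Cats at the Gates"));
        System.out.println(tokens("cat gate"));
    }
}
```

Because the same function is applied to stored values and to the query pattern, both sides are compared in the same normalized form.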
There are two new functions that provide basic support for full text search:
alloftext - searches for values that contain all the specified words (using NLP).
anyoftext - searches for values that contain one or more of the specified tokens (using NLP).
Let’s query for "white or maybe black", using the term matching function allofterms.
{
movie(func:allofterms(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
name@en
name@de
name@it
}
}
curl localhost:8080/query -XPOST -d '
{
movie(func:allofterms(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
name@en
name@de
name@it
}
}
' | python -m json.tool | less
import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;
public class DgraphMain {
    public static void main(final String[] args) {
        final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
        // A Java string literal cannot span lines, so the query is concatenated
        // line by line and its inner double quotes are escaped.
        final String query = "{\n"
            + "  movie(func:allofterms(name@en, \"white or maybe black\")) @filter(gt(count(genre), 1)) {\n"
            + "    name@en\n"
            + "    name@de\n"
            + "    name@it\n"
            + "  }\n"
            + "}\n";
        final DgraphResult result = dgraphClient.query(query);
        System.out.println(result.toJsonObject().toString());
    }
}
The term matching query returns no results. Now let’s try the alloftext function:
{
movie(func:alloftext(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
name@en
name@de
name@it
}
}
curl localhost:8080/query -XPOST -d '
{
movie(func:alloftext(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
name@en
name@de
name@it
}
}
' | python -m json.tool | less
import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;
public class DgraphMain {
    public static void main(final String[] args) {
        final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
        // A Java string literal cannot span lines, so the query is concatenated
        // line by line and its inner double quotes are escaped.
        final String query = "{\n"
            + "  movie(func:alloftext(name@en, \"white or maybe black\")) @filter(gt(count(genre), 1)) {\n"
            + "    name@en\n"
            + "    name@de\n"
            + "    name@it\n"
            + "  }\n"
            + "}\n";
        final DgraphResult result = dgraphClient.query(query);
        System.out.println(result.toJsonObject().toString());
    }
}
The query returns 59 results. This example shows that removing a stop word may help in some cases.
In the context of NLP, English is quite easy - there are no diacritics, and inflection is rather simple. So let’s try a similar query in German:
{
movie(func:allofterms(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
name@de
name@en
name@it
}
}
curl localhost:8080/query -XPOST -d '
{
movie(func:allofterms(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
name@de
name@en
name@it
}
}
' | python -m json.tool | less
import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;
public class DgraphMain {
    public static void main(final String[] args) {
        final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
        // A Java string literal cannot span lines, so the query is concatenated
        // line by line and its inner double quotes are escaped.
        final String query = "{\n"
            + "  movie(func:allofterms(name@de, \"weiss oder vielleicht schwarz\")) @filter(gt(count(genre), 1)) {\n"
            + "    name@de\n"
            + "    name@en\n"
            + "    name@it\n"
            + "  }\n"
            + "}\n";
        final DgraphResult result = dgraphClient.query(query);
        System.out.println(result.toJsonObject().toString());
    }
}
Again, the query doesn’t return any results.
Now let’s try the NLP-enabled version of this query:
{
movie(func:alloftext(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
name@de
name@en
name@it
}
}
curl localhost:8080/query -XPOST -d '
{
movie(func:alloftext(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
name@de
name@en
name@it
}
}
' | python -m json.tool | less
import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;
public class DgraphMain {
    public static void main(final String[] args) {
        final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
        // A Java string literal cannot span lines, so the query is concatenated
        // line by line and its inner double quotes are escaped.
        final String query = "{\n"
            + "  movie(func:alloftext(name@de, \"weiss oder vielleicht schwarz\")) @filter(gt(count(genre), 1)) {\n"
            + "    name@de\n"
            + "    name@en\n"
            + "    name@it\n"
            + "  }\n"
            + "}\n";
        final DgraphResult result = dgraphClient.query(query);
        System.out.println(result.toJsonObject().toString());
    }
}
This returns 4 results.
It’s worth noting the inflected forms of schwarz - schwarzes and Schwartze.
Also, Wei\u00dfer is interesting - \u00df is the escaped Unicode value of the grapheme ß.
weiss matched Weißer - the form is inflected, and grapheme equivalency is preserved.
Like in the English example, the stop word (oder) is ignored.
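One way to picture why weiss and Weißer end up equivalent is to fold ß to ss and strip a few adjective endings before comparing. The sketch below is an assumption-laden toy (the endings list and the class GermanFolding are made up for this example) and is not how Dgraph's German analysis is implemented.

```java
import java.util.Locale;

final class GermanFolding {
    // Lowercases and folds the grapheme \u00df (sharp s) to "ss".
    static String fold(String s) {
        return s.toLowerCase(Locale.GERMAN).replace("\u00df", "ss");
    }

    // Toy stemmer: strips one common adjective ending if at least four characters remain.
    static String stem(String w) {
        for (String suffix : new String[] {"er", "es", "e"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 4) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    static String normalize(String s) {
        return stem(fold(s));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Wei\u00dfer"));
        System.out.println(normalize("weiss"));
        System.out.println(normalize("schwarzes"));
    }
}
```

After normalization the query word and the stored word compare equal, which is the effect the results above show.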
In some cases, natural language processing can lead to surprising results.
Let’s search for the answer to the famous question, "To be, or not to be?":
{
movie(func:alloftext(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
name@en
}
}
curl localhost:8080/query -XPOST -d '
{
movie(func:alloftext(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
name@en
}
}
' | python -m json.tool | less
import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;
public class DgraphMain {
    public static void main(final String[] args) {
        final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
        // A Java string literal cannot span lines, so the query is concatenated
        // line by line and its inner double quotes are escaped.
        final String query = "{\n"
            + "  movie(func:alloftext(name@en, \"To be, or not to be?\")) @filter(gt(count(genre), 1)) {\n"
            + "    name@en\n"
            + "  }\n"
            + "}\n";
        final DgraphResult result = dgraphClient.query(query);
        System.out.println(result.toJsonObject().toString());
    }
}
The query gives no results, while the term matching query:
{
movie(func:allofterms(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
name@en
}
}
curl localhost:8080/query -XPOST -d '
{
movie(func:allofterms(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
name@en
}
}
' | python -m json.tool | less
import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;
public class DgraphMain {
    public static void main(final String[] args) {
        final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
        // A Java string literal cannot span lines, so the query is concatenated
        // line by line and its inner double quotes are escaped.
        final String query = "{\n"
            + "  movie(func:allofterms(name@en, \"To be, or not to be?\")) @filter(gt(count(genre), 1)) {\n"
            + "    name@en\n"
            + "  }\n"
            + "}\n";
        final DgraphResult result = dgraphClient.query(query);
        System.out.println(result.toJsonObject().toString());
    }
}
gives two results:
What happened? "To be, or not to be?" consists of stop words only.
After FTS/NLP processing, nothing is left of the query pattern, so no movies match it.
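The effect is easy to reproduce in a small model. The stop word list below is an assumption (a handful of words that appear in common English stop lists), and treating an empty token list as "no match" mirrors the behavior described above rather than any documented rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordsOnly {
    // Assumed stop words for this example only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("to", "be", "or", "not", "the", "a"));

    // Lowercase, split into words, drop stop words.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Models alloftext: with no tokens left in the query there is nothing to look up.
    static boolean matchesAll(String value, String query) {
        List<String> queryTokens = tokens(query);
        if (queryTokens.isEmpty()) {
            return false;
        }
        return tokens(value).containsAll(queryTokens);
    }

    public static void main(String[] args) {
        System.out.println(tokens("To be, or not to be?"));
        System.out.println(matchesAll("To Be or Not to Be", "To be, or not to be?"));
    }
}
```

Even a title made of exactly the same words does not match in this model, because both sides are reduced to nothing before they are compared.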
Regular expressions are extremely useful for creating sophisticated matchers.
For example, all titles starting with a word containing night but not knight may be matched using the following query:
{
movie(func:regexp(name@en, /^[a-zA-z]*[^Kk ]?[Nn]ight/)) @filter(gt(count(genre), 1)) {
name@en
name@de
name@it
}
}
curl localhost:8080/query -XPOST -d '
{
movie(func:regexp(name@en, /^[a-zA-z]*[^Kk ]?[Nn]ight/)) @filter(gt(count(genre), 1)) {
name@en
name@de
name@it
}
}
' | python -m json.tool | less
import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;
public class DgraphMain {
    public static void main(final String[] args) {
        final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
        // A Java string literal cannot span lines, so the query is concatenated
        // line by line.
        final String query = "{\n"
            + "  movie(func:regexp(name@en, /^[a-zA-z]*[^Kk ]?[Nn]ight/)) @filter(gt(count(genre), 1)) {\n"
            + "    name@en\n"
            + "    name@de\n"
            + "    name@it\n"
            + "  }\n"
            + "}\n";
        final DgraphResult result = dgraphClient.query(query);
        System.out.println(result.toJsonObject().toString());
    }
}
There are 502 results in the test dataset.
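To get a feel for what the pattern accepts, it can be tried locally against a few sample titles. The sketch uses java.util.regex with unanchored find() semantics, which is an assumption about how the server applies the expression; the sample titles are chosen for illustration and are not taken from the query results.

```java
import java.util.regex.Pattern;

final class NightRegex {
    // The pattern from the query above, without the surrounding slashes.
    static final Pattern NIGHT = Pattern.compile("^[a-zA-z]*[^Kk ]?[Nn]ight");

    static boolean matches(String title) {
        return NIGHT.matcher(title).find();
    }

    public static void main(String[] args) {
        String[] titles = {
            "Night at the Museum", "Midnight Cowboy", "The Dark Knight", "Knight Rider"
        };
        for (String title : titles) {
            System.out.println(title + " -> " + matches(title));
        }
    }
}
```

In this local check the first two titles match and "The Dark Knight" does not, since night is not part of its first word. "Knight Rider" also matches, because the optional [^Kk ]? can match nothing while the leading character class absorbs the K, so the pattern is best treated as an approximation of "night but not knight".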
Dgraph supports extensive and useful methods of string matching.
Natural language processing, employed for full text search, may be the best choice for lookup based on user input. If stricter matching is required, term matching should give good results. And to get the most precise results for complicated text searches, regular expressions can be used.