String matching in Dgraph v0.7.5

The recent release of Dgraph is packed with new features and improvements. Many of them relate to strings: full text search (with support for 15 languages!) and regular expression matching have been added, and the handling of string values in multiple languages has been greatly improved. Together, these changes make Dgraph an excellent tool for building multilingual applications.

Values in Many Languages

We’re working hard to keep the query language easy to use and clean. Dgraph, in v0.7.5, adopted and extended the language tag syntax from the RDF N-Quads standard. It is intuitive, well-known, and was partially supported in previous versions (during data loading).

Let’s start from the beginning - the data. Dgraph uses RDF N-Quads for data loading and backup. String literals in N-Quads may be followed by the @ sign and language tag, e.g. "badger"@en or "Dachs"@de. Multiple such literals may be used as a value for a single entity/attribute pair.

When querying a predicate with values in multiple languages, the user can use the @lang notation known from RDF N-Quads. Several languages can be listed in order of preference, e.g. @en:de means that English is preferred, but if no English value is present, the German value should be returned.
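The fallback behind a preference list such as @en:de can be sketched in a few lines of Java. This only illustrates the lookup order and is not Dgraph's implementation: the class name LanguagePreference, the pick method and the in-memory map are assumptions made for the example, and the sketch simply returns an empty result when none of the listed languages has a value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class LanguagePreference {
  // Returns the value of the first language in the colon-separated
  // preference list (e.g. "en:de") that is present in the map.
  static Optional<String> pick(Map<String, String> valuesByLang, String preference) {
    for (String lang : preference.split(":")) {
      String value = valuesByLang.get(lang);
      if (value != null) {
        return Optional.of(value);
      }
    }
    return Optional.empty(); // no listed language has a value
  }

  public static void main(String[] args) {
    // Hypothetical values for a single entity/attribute pair.
    Map<String, String> name = new HashMap<>();
    name.put("en", "badger");
    name.put("de", "Dachs");

    System.out.println(pick(name, "en:de").orElse("(none)")); // badger
    System.out.println(pick(name, "it:de").orElse("(none)")); // Dachs
    System.out.println(pick(name, "it:fr").orElse("(none)")); // (none)
  }
}
```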

Language can also be specified in functions, which is important especially for full text search.

Example Data

The dataset used in all examples is the Freebase film data. As this post is string-oriented, the queries focus on movie titles in multiple languages, and no other information is retrieved. As we don’t have information about the type of each name, we use a filter, @filter(gt(count(genre), 1)), to select only movie titles and to limit the number of results a bit.

The schema for the name field is very simple. It defines three types of indexes:

curl localhost:8080/query -XPOST -d $'
mutation {
  schema {
    name: string @index(term, fulltext, exact) .
  }
}' | python -m json.tool | less

term index is used for term matching with the allofterms and anyofterms functions. Note that it was the only string index available in previous releases of Dgraph.

fulltext index uses matching with language-specific stemming and stop words. Note that values indexed with fulltext are processed according to their language (if they are tagged). If values are untagged, English is used as the default language.

exact index is used for regular expression matching.

Full Text Search (FTS)

Very Short Introduction to Natural Language Processing (NLP)

By definition (from Wikipedia):

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user).

This may sound trivial, but it’s not. Searching for the exact form of a word does not always satisfy the user. For example, nouns can be singular or plural, verbs have grammatical tenses, etc., and the user may be interested in all values related to the word in any inflected or derived form.

The simple but powerful idea is to find a method that can transform all the forms of a word to some common base. This process is called stemming. For many natural languages (including English), stemmers may be implemented using a set of well-known grammatical rules. There are also languages (like Polish) where a dictionary-based approach is required (i.e. an inflected form -> stem mapping).

Stemmers are widely available only for languages with well-known grammatical rules.

Another problem with search is common words, like the, is, or at. In most cases, searching for them returns an enormous number of useless results. These words are called stop words. Again, stop words are language-specific. The common way of handling them is simply to remove them from the search.

Dgraph FTS/NLP Processing

The following steps are applied to both the data (while indexing) and the query pattern:

Tokenization - text is divided into words.
Normalization - all letters are transformed to lowercase. Unicode Normalization is applied.
Stop words are removed.
Stemming is applied.

Since stop word lists contain inflected forms, stop words are removed before stemming.
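The steps above can be sketched as a small Java pipeline. It is a toy under stated assumptions, not Dgraph's code: the stop word list is a hypothetical handful of English words, the Unicode normalization form (NFKC) is a guess, and the stemmer only strips a trailing s where a real stemmer would apply full language rules.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class FtsPipelineSketch {
  // Hypothetical, tiny English stop word list; real lists are far longer.
  static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("the", "is", "at", "or", "to", "be", "not"));

  // Toy stemmer: strips one trailing "s" from words longer than three letters.
  static String stem(String token) {
    return token.length() > 3 && token.endsWith("s")
        ? token.substring(0, token.length() - 1)
        : token;
  }

  static List<String> analyze(String text) {
    List<String> tokens = new ArrayList<>();
    // 1. Tokenization: split on anything that is not a letter or a digit.
    for (String word : text.split("[^\\p{L}\\p{N}]+")) {
      if (word.isEmpty()) {
        continue; // split() can yield a leading empty string
      }
      // 2. Normalization: lowercase, then Unicode normalization (NFKC assumed).
      String token =
          Normalizer.normalize(word.toLowerCase(Locale.ROOT), Normalizer.Form.NFKC);
      // 3. Stop words are removed before stemming.
      if (STOP_WORDS.contains(token)) {
        continue;
      }
      // 4. Stemming.
      tokens.add(stem(token));
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(analyze("White or Black Nights")); // [white, black, night]
    System.out.println(analyze("To be, or not to be?"));  // [] (stop words only)
  }
}
```

Because the same analysis runs on the indexed values and on the query pattern, a query for nights and a title containing Night end up with the same token in this sketch.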

Full Text Search Functions

There are two new functions that provide basic support for full text search:

alloftext - searches for values that contain all the specified words (using NLP).
anyoftext - searches for values that contain one or more of the specified words (using NLP).
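The difference between the two functions comes down to an all versus any test over token sets. The following Java sketch shows that test on already-processed tokens; the class and method names are made up for the example, and treating a query with no tokens as matching nothing is an assumption (it is consistent with the empty result this post reports for a query made up only of stop words).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class TextMatchSketch {
  // True when every query token occurs among the value's tokens.
  static boolean allOf(Set<String> valueTokens, Set<String> queryTokens) {
    // Assumption: an empty query matches nothing
    // (containsAll alone would return true for an empty set).
    return !queryTokens.isEmpty() && valueTokens.containsAll(queryTokens);
  }

  // True when at least one query token occurs among the value's tokens.
  static boolean anyOf(Set<String> valueTokens, Set<String> queryTokens) {
    return !Collections.disjoint(valueTokens, queryTokens);
  }

  public static void main(String[] args) {
    Set<String> title = new HashSet<>(Arrays.asList("black", "night"));
    Set<String> query = new HashSet<>(Arrays.asList("white", "black"));

    System.out.println(allOf(title, query)); // false: "white" is missing
    System.out.println(anyOf(title, query)); // true: "black" is present
  }
}
```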

Examples

Let’s query for white or maybe black, using the term matching function allofterms.

{
  movie(func:allofterms(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
    name@en
    name@de
    name@it
  }
}

curl localhost:8080/query -XPOST -d '
{
  movie(func:allofterms(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
    name@en
    name@de
    name@it
  }
}
' | python -m json.tool | less

import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;

public class DgraphMain {
  public static void main(final String[] args) {
    final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
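    // Java string literals cannot span lines, so the query is concatenated and its inner quotes escaped.
    final DgraphResult result = dgraphClient.query("{\n"
        + "  movie(func:allofterms(name@en, \"white or maybe black\")) @filter(gt(count(genre), 1)) {\n"
        + "    name@en\n"
        + "    name@de\n"
        + "    name@it\n"
        + "  }\n"
        + "}\n");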
    System.out.println(result.toJsonObject().toString());
  }
}

The query gives no results. It may be worth trying a less strict match with the alloftext function:

{
  movie(func:alloftext(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
    name@en
    name@de
    name@it
  }
}

curl localhost:8080/query -XPOST -d '
{
  movie(func:alloftext(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
    name@en
    name@de
    name@it
  }
}
' | python -m json.tool | less

import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;

public class DgraphMain {
  public static void main(final String[] args) {
    final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
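    // Java string literals cannot span lines, so the query is concatenated and its inner quotes escaped.
    final DgraphResult result = dgraphClient.query("{\n"
        + "  movie(func:alloftext(name@en, \"white or maybe black\")) @filter(gt(count(genre), 1)) {\n"
        + "    name@en\n"
        + "    name@de\n"
        + "    name@it\n"
        + "  }\n"
        + "}\n");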
    System.out.println(result.toJsonObject().toString());
  }
}

The query returns 59 results. This example shows that removing a stop word may help in some cases.

In the context of NLP, English is quite easy: there are no diacritics, and inflection is rather simple. So let’s try a similar query in German:

{
  movie(func:allofterms(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
	name@de
	name@en
	name@it
  }
}

curl localhost:8080/query -XPOST -d '
{
  movie(func:allofterms(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
	name@de
	name@en
	name@it
  }
}
' | python -m json.tool | less

import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;

public class DgraphMain {
  public static void main(final String[] args) {
    final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
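    // Java string literals cannot span lines, so the query is concatenated and its inner quotes escaped.
    final DgraphResult result = dgraphClient.query("{\n"
        + "  movie(func:allofterms(name@de, \"weiss oder vielleicht schwarz\")) @filter(gt(count(genre), 1)) {\n"
        + "    name@de\n"
        + "    name@en\n"
        + "    name@it\n"
        + "  }\n"
        + "}\n");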
    System.out.println(result.toJsonObject().toString());
  }
}

Again, the query doesn’t return any results.

Now let’s try the NLP-enabled version of this query:

{
  movie(func:alloftext(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
	name@de
	name@en
	name@it
  }
}

curl localhost:8080/query -XPOST -d '
{
  movie(func:alloftext(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
	name@de
	name@en
	name@it
  }
}
' | python -m json.tool | less

import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;

public class DgraphMain {
  public static void main(final String[] args) {
    final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
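    // Java string literals cannot span lines, so the query is concatenated and its inner quotes escaped.
    final DgraphResult result = dgraphClient.query("{\n"
        + "  movie(func:alloftext(name@de, \"weiss oder vielleicht schwarz\")) @filter(gt(count(genre), 1)) {\n"
        + "    name@de\n"
        + "    name@en\n"
        + "    name@it\n"
        + "  }\n"
        + "}\n");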
    System.out.println(result.toJsonObject().toString());
  }
}

This returns 4 results.

Note the inflected forms of schwarz in the results: schwarzes and Schwartze. Wei\u00dfer is also interesting: \u00df is the escaped Unicode form of the letter ß. The query term weiss matched Weißer, so both the inflected form and the equivalence of ss and ß are handled. As in the English example, the stop word (oder) is ignored.

Gotchas

In some cases, natural language processing can lead to surprising results. Let’s search for the answer to the famous question: To be, or not to be?:

{
  movie(func:alloftext(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
	name@en
  }
}

curl localhost:8080/query -XPOST -d '
{
  movie(func:alloftext(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
	name@en
  }
}
' | python -m json.tool | less

import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;

public class DgraphMain {
  public static void main(final String[] args) {
    final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
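    // Java string literals cannot span lines, so the query is concatenated and its inner quotes escaped.
    final DgraphResult result = dgraphClient.query("{\n"
        + "  movie(func:alloftext(name@en, \"To be, or not to be?\")) @filter(gt(count(genre), 1)) {\n"
        + "    name@en\n"
        + "  }\n"
        + "}\n");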
    System.out.println(result.toJsonObject().toString());
  }
}

The query gives no results, while the term matching query:

{
  movie(func:allofterms(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
	name@en
  }
}

curl localhost:8080/query -XPOST -d '
{
  movie(func:allofterms(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
	name@en
  }
}
' | python -m json.tool | less

import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;

public class DgraphMain {
  public static void main(final String[] args) {
    final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
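    // Java string literals cannot span lines, so the query is concatenated and its inner quotes escaped.
    final DgraphResult result = dgraphClient.query("{\n"
        + "  movie(func:allofterms(name@en, \"To be, or not to be?\")) @filter(gt(count(genre), 1)) {\n"
        + "    name@en\n"
        + "  }\n"
        + "}\n");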
    System.out.println(result.toJsonObject().toString());
  }
}

gives two results.

What happened? To be, or not to be? consists of stop words only. After FTS/NLP processing, no words are left in the query, so no movies match it.

Regular Expressions (regexp)

Regular expressions are extremely useful for creating sophisticated matchers.

For example, all titles starting with a word containing night but not knight may be matched using the following query:

{
  movie(func:regexp(name@en, /^[a-zA-z]*[^Kk ]?[Nn]ight/)) @filter(gt(count(genre), 1)) {
	name@en
	name@de
	name@it
  }
}

curl localhost:8080/query -XPOST -d '
{
  movie(func:regexp(name@en, /^[a-zA-z]*[^Kk ]?[Nn]ight/)) @filter(gt(count(genre), 1)) {
	name@en
	name@de
	name@it
  }
}
' | python -m json.tool | less

import io.dgraph.client.DgraphClient;
import io.dgraph.client.GrpcDgraphClient;
import io.dgraph.client.DgraphResult;

public class DgraphMain {
  public static void main(final String[] args) {
    final DgraphClient dgraphClient = GrpcDgraphClient.newInstance("localhost", 8080);
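    // Java string literals cannot span lines, so the query is concatenated line by line.
    final DgraphResult result = dgraphClient.query("{\n"
        + "  movie(func:regexp(name@en, /^[a-zA-z]*[^Kk ]?[Nn]ight/)) @filter(gt(count(genre), 1)) {\n"
        + "    name@en\n"
        + "    name@de\n"
        + "    name@it\n"
        + "  }\n"
        + "}\n");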
    System.out.println(result.toJsonObject().toString());
  }
}

There are 502 results in the test dataset.
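Before running a pattern against the whole dataset, it can help to try it on a few titles locally. The sketch below does that with Java's java.util.regex, which stands in for Dgraph's regular expression engine here; that substitution is an assumption, and the two engines may differ in details. The sample titles are picked for illustration and are not taken from the query results.

```java
import java.util.regex.Pattern;

class TitleRegexCheck {
  // The pattern from the query above, without the surrounding slashes.
  static final Pattern NIGHT = Pattern.compile("^[a-zA-z]*[^Kk ]?[Nn]ight");

  static boolean matches(String title) {
    // With the leading ^, find() can only succeed at the start of the title.
    return NIGHT.matcher(title).find();
  }

  public static void main(String[] args) {
    String[] samples = {"Nightmare", "Midnight Express", "The Night Porter", "Knight Rider"};
    for (String title : samples) {
      System.out.println(title + " -> " + matches(title));
    }
  }
}
```

In this sketch Nightmare and Midnight Express match, and The Night Porter does not, because the first word contains no night. Knight Rider matches as well: [a-zA-z]* is free to consume the leading K, so the pattern excludes knight less strictly than its description suggests. The range A-z also covers the punctuation characters that sit between Z and a in ASCII, where A-Z was probably intended.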

Summary

Dgraph supports an extensive and useful set of string matching methods.

Natural language processing, employed for full text search, may be the best choice for lookups based on user input. If stricter matching is required, term matching should give good results. And to get the most precise results for complicated text searches, regular expressions can be used.

Mon, Apr 10, 2017