Using Vector similarity search in DQL

Similarity search in DQL

Dgraph v24 introduces a vector data type and similarity search to the DQL query language.

This post shows a simple example of using vector embeddings and similarity search.

This example uses Ratel for the schema update, mutations, and queries, but you can use any approach.

Set up and install Dgraph and Ratel

Get the Dgraph Docker image for the v24 alpha version

docker pull dgraph/standalone:v24.0.0-alpha2 

Run a docker container, storing data on your local machine

mkdir ~/dgraph

docker run -d --name dgraph-v24alpha2 -p "8080:8080" -p "9080:9080"  -v ~/dgraph:/dgraph dgraph/standalone:v24.0.0-alpha2

Then get and start the Ratel tool

docker pull dgraph/ratel

docker run -d --name ratel -p "8000:8000"  dgraph/ratel:latest

Ratel will now be running on localhost:8000

Add a schema, data and test queries

Define a DQL Schema. You can set this via the Ratel schema tab using the bulk edit option.

<Issue.description>: string .

<Issue.vector_embedding>: float32vector @index(hnsw(metric:"euclidean")) .
type <Issue> {
      Issue.description
      Issue.vector_embedding
}

Notice that the new float32vector type is used, with a new index type of hnsw. The hnsw index can use different distance metrics: cosine, euclidean, or dotproduct. Here we use euclidean distance for the index.

At this point, the database will accept and index float vectors for the predicate Issue.vector_embedding.

Insert some data containing short, test-only embeddings using this DQL Mutation

You can paste this into Ratel as a mutation, or use curl, pydgraph or similar:

{
   "set": 
    [
      {
         "dgraph.type": "Issue",
         "Issue.vector_embedding": "[0.25, 0.47, 0.8,  0.27]",
         "Issue.description":"Intermittent timeouts. Logs show no such host error."
      },
      {  "dgraph.type": "Issue",
         "Issue.vector_embedding": "[0.57, 0.23, 0.68, 0.41]",
         "Issue.description":"Bug when user adds record with blank surName. Field is required so should be checked in web page."
      },
      {
         "dgraph.type": "Issue",
         "Issue.vector_embedding": "[0.26, 0.12, 0.77, 0.57]",
         "Issue.description":"Delays on responses every 30 minutes with high network latency in backplane"
      },
      {
         "dgraph.type": "Issue",
         "Issue.vector_embedding": "[0.45, 0.49, 0.72, 0.2]",
         "Issue.description":"Slow queries intermittently. The host is not found according to logs."
      },
      {  "dgraph.type": "Issue",
         "Issue.vector_embedding": "[0.52, 0.05, 0.22, 0.82]",
         "Issue.description":"Some timeouts. It seems to be a DNS host lookup issue. Seeing No Such Host message."
      },
      {
         "dgraph.type": "Issue",
         "Issue.vector_embedding": "[0.33, 0.64, 0.16, 0.68]",
         "Issue.description":"Host and DNS issues are causing timeouts in the User Details web page"
      }
    ]
}
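
If you would rather script the mutation than paste it into Ratel, the sketch below builds the same kind of JSON payload in Python and prepares a request for Dgraph's HTTP /mutate endpoint. It assumes the standalone container started earlier is listening on localhost:8080. The helper names (vector_to_string, build_mutation, build_request, send) are made up for this example and are not part of any Dgraph client.

```python
import json
import urllib.request

# Assumption: the standalone container from the setup step exposes HTTP on port 8080.
DGRAPH_MUTATE_URL = "http://localhost:8080/mutate?commitNow=true"


def vector_to_string(vec):
    """Serialize floats as a bracketed string, like the embeddings in the mutation above."""
    return "[" + ", ".join(str(x) for x in vec) + "]"


def build_mutation(issues):
    """Build a {"set": [...]} payload from (description, embedding) pairs."""
    return {
        "set": [
            {
                "dgraph.type": "Issue",
                "Issue.vector_embedding": vector_to_string(vec),
                "Issue.description": description,
            }
            for description, vec in issues
        ]
    }


def build_request(payload, url=DGRAPH_MUTATE_URL):
    """Prepare (but do not send) a JSON POST request carrying the mutation."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response. Needs a running Dgraph."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_mutation([
    ("Intermittent timeouts. Logs show no such host error.", [0.25, 0.47, 0.8, 0.27]),
])
request = build_request(payload)
```

Calling send(request) performs the mutation; everything before that point can be inspected or logged without touching the database.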

A simple query that finds similar issues

You are now ready to run similarity queries that find Issues based on semantic similarity to a new Issue description. For simplicity, we are not computing large vectors from an LLM. Instead, each of the four dimensions in the embeddings above represents one concept, in this order:

  • Slowness or delays

  • Logging or messages

  • Networks

  • GUIs or web pages


Use case and query

Let’s say a new issue comes in, and you want to use the text description to find other, similar issues you have seen in the past. Use the similarity query below.

If the new issue description is “Slow response and delay in my network!”, we represent this new issue as the vector [0.28 0.75 0.35 0.48]. Note that the first parameter to similar_to is the name of the DQL predicate that holds the vectors, the second parameter is the number of results to return, and the third parameter is the vector to look for.

query slownessWithLogs() {
    simVec(func: similar_to(Issue.vector_embedding, 3, "[0.28 0.75 0.35 0.48]"))
    {
        uid
        Issue.description
    }
}
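
To see what this query is asking for, here is a brute-force sketch in Python that ranks the six test embeddings by euclidean distance to the query vector. This is not how Dgraph answers the query: the hnsw index is an approximate nearest-neighbour structure, so its results can in principle differ from an exhaustive scan. The function name brute_force_similar_to is invented for this illustration.

```python
import math

# The six test embeddings from the mutation above, in insertion order.
EMBEDDINGS = [
    [0.25, 0.47, 0.8, 0.27],
    [0.57, 0.23, 0.68, 0.41],
    [0.26, 0.12, 0.77, 0.57],
    [0.45, 0.49, 0.72, 0.2],
    [0.52, 0.05, 0.22, 0.82],
    [0.33, 0.64, 0.16, 0.68],
]


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_similar_to(embeddings, k, query):
    """Return the positions of the k embeddings closest to `query`, nearest first."""
    ranked = sorted(range(len(embeddings)), key=lambda i: euclidean(embeddings[i], query))
    return ranked[:k]


query_vector = [0.28, 0.75, 0.35, 0.48]
nearest = brute_force_similar_to(EMBEDDINGS, 3, query_vector)
print(nearest)
```

With this data the exhaustive scan picks positions 5, 3, and 0, that is, the sixth, fourth, and first issues in the mutation.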

If you want to pass the vector in as a query variable, rewrite this as

query test($vec: float32vector) {
    simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) 
    {
        uid
        Issue.description
    }
}

Then make a request (again using Ratel) with a variable named “vec” set to a JSON array of floats:

vec: [0.28, 0.75, 0.35, 0.48]
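
Outside Ratel, the same parameterized query can be sent to Dgraph's HTTP /query endpoint as a JSON body with "query" and "variables" keys. The Python sketch below only prepares the request. It assumes Dgraph is reachable on localhost:8080 and that the vector variable is passed as a string, as other DQL variables are over HTTP; build_query_request is a name invented for this example.

```python
import json
import urllib.request

# Assumption: the standalone container from the setup step exposes HTTP on port 8080.
DGRAPH_QUERY_URL = "http://localhost:8080/query"

QUERY = """
query test($vec: float32vector) {
    simVec(func: similar_to(Issue.vector_embedding, 3, $vec))
    {
        uid
        Issue.description
    }
}
"""


def build_query_request(vec, url=DGRAPH_QUERY_URL):
    """Prepare (but do not send) a JSON POST request with the query and its $vec variable."""
    body = {
        "query": QUERY,
        # Assumption: the vector is serialized to a string such as "[0.28, 0.75, 0.35, 0.48]".
        "variables": {"$vec": json.dumps(vec)},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request([0.28, 0.75, 0.35, 0.48])
```

Passing the request to urllib.request.urlopen sends it; the matching nodes would then be found under the simVec key of the response's data object.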

Computing vector distances and similarity scores

The previous query finds the 3 most similar results for the input vector $vec. It uses the hnsw index declared in the schema, together with that index's distance metric (in our case, euclidean distance).

In some cases you also want to obtain this distance or compute a similarity score. Keep in mind that a lower distance means more similar, while a higher similarity score means more similar.

Dgraph v24 introduces a new DQL math function, dot, that computes the dot product of two vectors. Using the dot function, you can compute the similarity score of your choice in your query.

Given two vectors $$A=[a_1,a_2,…,a_n] \space\space and \space\space B=[b_1,b_2,…,b_n]$$

The euclidean distance is the L2 norm of A - B:

$$D = \sqrt{(a_1 - b_1)^2+\dots+(a_n - b_n)^2}$$

It is easily expressed as a dot product:

$$D = \sqrt{(A-B) \cdot (A-B)}$$

Another way to measure how close the two vectors are is cosine similarity, which measures the angle between them:

$$cosine(A,B) = {A \cdot B \over \lVert A \rVert \, \lVert B \rVert}$$

Cosine similarity lies between -1 and 1, where 1 means the two vectors point in the same direction. So we usually turn it into a cosine distance measure:
$$cosine\_distance(A,B) = 1 - cosine(A,B)$$

When the vectors are normalized (||A|| = 1 and ||B|| = 1), which is usually the case with vector embeddings produced by ML models, the cosine computation simplifies to a dot product:

$$dotproduct\_distance = 1 - A \cdot B$$

A common use case is to compute a similarity score or confidence. With normalized vectors we can use

$$similarity = {1 + A \cdot B \over 2}$$

This metric has the nice property of lying between 0 and 1, with 1 being as similar as possible, which gives a simple score to apply thresholds to.
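
The formulas above can be checked outside the database with a few lines of plain Python. This is only a sketch for experimenting with the definitions; the function names are chosen for this example and are not Dgraph functions. The two sample vectors are the query vector and one of the test embeddings, normalized first so that the dot-product shortcuts apply.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """L2 norm of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


def euclidean_distance(a, b):
    """sqrt((A - B) . (A - B))"""
    diff = [x - y for x, y in zip(a, b)]
    return math.sqrt(dot(diff, diff))


def cosine_distance(a, b):
    """1 - (A . B) / (||A|| ||B||)"""
    return 1.0 - dot(a, b) / (norm(a) * norm(b))


def dotproduct_distance(a, b):
    """1 - A . B, equal to the cosine distance only for unit-length vectors."""
    return 1.0 - dot(a, b)


def similarity_score(a, b):
    """(1 + A . B) / 2, in [0, 1] for unit-length vectors."""
    return (1.0 + dot(a, b)) / 2.0


a = normalize([0.28, 0.75, 0.35, 0.48])
b = normalize([0.33, 0.64, 0.16, 0.68])
print(euclidean_distance(a, b), cosine_distance(a, b), dotproduct_distance(a, b), similarity_score(a, b))
```

For unit-length inputs the cosine and dot-product distances printed here agree up to floating-point rounding, which is the simplification described above.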

Compute distances in DQL using the dot function

Here is an example that computes the euclidean, cosine, and dot product distances for our previous query.

We just need to add a variable to the query, vemb as Issue.vector_embedding, to capture the vector embedding of each similar Issue and use it in math functions.

The query also illustrates the use of an intermediary variable to compute cosine_distance.

query slownessWithLogs($vec: float32vector) {
    simVec(func: similar_to(Issue.vector_embedding, 3, $vec))
    {
        uid
        Issue.description
        vemb as Issue.vector_embedding

        euclidean_distance: Math (sqrt( ($vec - vemb) dot ($vec - vemb)) )

        dotproduct_distance: Math (1.0 - (($vec) dot vemb))

        cosine as Math( ( ($vec) dot vemb) / sqrt( (($vec) dot ($vec)) * (vemb dot vemb) ) )
        cosine_distance: Math(1.0 - cosine)

        similarity_score: Math ((1.0 + (($vec) dot vemb)) / 2.0)
    }
}

In practice, you usually compute only the distance that matches the metric defined in the index, or the similarity score.

The next query shows how to compute the similarity score in a variable and use it to get the 3 closest nodes ordered by similarity:

query slownessWithLogs($vec: float32vector) {
    var(func: similar_to(Issue.vector_embedding, 3, $vec))
    {
        vemb as Issue.vector_embedding
        score as Math ((1.0 + (($vec) dot vemb)) / 2.0)
    }
    # score is now a map of uid -> similarity_score

    simVec(func: uid(score), orderdesc: val(score)) {
        uid
        Issue.description
        score: val(score)
    }
}

Summing it up

This end-to-end example shows how you can insert data with vector embeddings, corresponding to a schema that specifies a vector index, and do a semantic search via the new similar_to() function in Dgraph. It also shows how to compute various metrics using the new dot function.