Dgraph v24 introduces a vector data type and similarity search to the DQL query language.
This post shows a simple example of using vector embeddings and similarity search.
This example uses Ratel for the schema update, mutations, and queries, but you can use any client.
Pull the Dgraph standalone image:
docker pull dgraph/standalone:v24.0.0-alpha2
Then run a Docker container, storing data on your local machine:
mkdir ~/dgraph
docker run -d --name dgraph-v24alpha2 -p "8080:8080" -p "9080:9080" -v ~/dgraph:/dgraph dgraph/standalone:v24.0.0-alpha2
Then pull and start the Ratel UI tool:
docker pull dgraph/ratel
docker run -d --name ratel -p "8000:8000" dgraph/ratel:latest
Ratel will now be running on localhost:8000
Define a DQL schema. You can set it via the Ratel Schema tab using the bulk edit option.
<Issue.description>: string .
<Issue.vector_embedding>: float32vector @index(hnsw(metric:"euclidean")) .
type <Issue> {
Issue.description
Issue.vector_embedding
}
Notice that the new float32vector type is used, with a new index type of hnsw. The hnsw index can use different distance metrics: cosine, euclidean, or dotproduct. Here we use euclidean distance for the index.
At this point, the database will accept and index float vectors for the predicate Issue.vector_embedding.
Insert some data containing short, test-only embeddings using the DQL mutation below. You can paste it into Ratel as a mutation, or use curl, pydgraph, or similar:
{
"set":
[
{
"dgraph.type": "Issue",
"Issue.vector_embedding": "[0.25, 0.47, 0.8, 0.27]",
"Issue.description":"Intermittent timeouts. Logs show no such host error."
},
{
"dgraph.type": "Issue",
"Issue.vector_embedding": "[0.57, 0.23, 0.68, 0.41]",
"Issue.description":"Bug when user adds record with blank surName. Field is required so should be checked in web page."
},
{
"dgraph.type": "Issue",
"Issue.vector_embedding": "[0.26, 0.12, 0.77, 0.57]",
"Issue.description":"Delays on responses every 30 minutes with high network latency in backplane"
},
{
"dgraph.type": "Issue",
"Issue.vector_embedding": "[0.45, 0.49, 0.72, 0.2]",
"Issue.description":"Slow queries intermittently. The host is not found according to logs."
},
{
"dgraph.type": "Issue",
"Issue.vector_embedding": "[0.52, 0.05, 0.22, 0.82]",
"Issue.description":"Some timeouts. It seems to be a DNS host lookup issue. Seeing No Such Host message."
},
{
"dgraph.type": "Issue",
"Issue.vector_embedding": "[0.33, 0.64, 0.16, 0.68]",
"Issue.description":"Host and DNS issues are causing timeouts in the User Details web page"
}
]
}
A simple query that finds similar issues
You are ready to run similarity queries to find Issues based on semantic similarity to a new Issue description! For simplicity, we are not computing large vectors from an LLM. The embeddings above simply represent four concepts, one per vector dimension, which are, respectively:
Slowness or delays
Logging or messages
Networks
GUIs or web pages
Let’s say a new issue comes in, and you want to use the text description to find other, similar issues you have seen in the past. Use the similarity query below.
If the new issue description is “Slow response and delay in my network!”, we represent this new issue as the vector [0.28, 0.75, 0.35, 0.48]. Note that the first parameter to similar_to is the DQL predicate name, the second is the number of results to return, and the third is the vector to search with.
{
simVec(func: similar_to(Issue.vector_embedding, 3, "[0.28, 0.75, 0.35, 0.48]"))
{
uid
Issue.description
}
}
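Because the toy embeddings are so small, you can sanity-check what similar_to should return with a brute-force computation in plain Python. This is only a sketch of the ranking: Dgraph's hnsw index returns an approximation of this exact nearest-neighbour ordering, and the abbreviated descriptions below are stand-ins for the full ones in the mutation.

```python
import math

# Toy embeddings from the mutation above (descriptions abbreviated), in order
issues = [
    ("intermittent timeouts / no such host", [0.25, 0.47, 0.8, 0.27]),
    ("blank surName bug in web page",        [0.57, 0.23, 0.68, 0.41]),
    ("delays every 30 minutes / latency",    [0.26, 0.12, 0.77, 0.57]),
    ("slow queries / host not found",        [0.45, 0.49, 0.72, 0.2]),
    ("timeouts / DNS host lookup",           [0.52, 0.05, 0.22, 0.82]),
    ("host and DNS issues / web page",       [0.33, 0.64, 0.16, 0.68]),
]
query = [0.28, 0.75, 0.35, 0.48]  # "Slow response and delay in my network!"

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Exact brute-force ranking; the hnsw index approximates this
ranked = sorted(issues, key=lambda item: euclidean(item[1], query))
top3 = [desc for desc, _ in ranked[:3]]
print(top3)
```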
If you want to pass the vector in as a parameter, rewrite the query as:
query test($vec: float32vector) {
simVec(func: similar_to(Issue.vector_embedding, 3, $vec))
{
uid
Issue.description
}
}
And make a request (again using Ratel) with a variable named “vec” set to a JSON float array value:
vec: [0.28, 0.75, 0.35, 0.48]
The query finds the 3 most similar results based on the input $vec. It uses the hnsw index declared in the schema with its distance metric (in our case, the euclidean distance).
In some cases you want to obtain this distance or compute a similarity score. Keep in mind that for a distance, lower means more similar; for a similarity score, higher means more similar.
Dgraph v24 introduces a new DQL math function, dot, to compute the dot product of vectors. Using the dot function, you can compute the similarity score of your choice in your query.
Given two vectors $$A=[a_1,a_2,\dots,a_n] \quad \text{and} \quad B=[b_1,b_2,\dots,b_n]$$
The euclidean distance is the L2 norm of $A - B$:
$$D = \sqrt{(a_1 - b_1)^2+\dots+(a_n - b_n)^2}$$
It is easily expressed as a dot product:
$$D = \sqrt{(A-B) \cdot (A-B)}$$
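You can verify this identity numerically with plain Python (nothing Dgraph-specific; the two vectors here are the toy query vector and one of the stored embeddings from the mutation above):

```python
import math

A = [0.28, 0.75, 0.35, 0.48]
B = [0.33, 0.64, 0.16, 0.68]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# L2 norm of A - B, computed directly from the coordinates
direct = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

# The same distance expressed as a dot product, as in the formula above
diff = [a - b for a, b in zip(A, B)]
via_dot = math.sqrt(dot(diff, diff))

assert abs(direct - via_dot) < 1e-12
print(direct)
```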
Another option to measure how close two vectors are is cosine similarity, a measure of the angle between the two vectors: $$cosine(A,B) = {A \cdot B \over \|A\|\,\|B\|}$$
Cosine similarity ranges from -1 to 1 (1 for identical directions), so it is usually turned into a cosine distance measure:
$$cosine\_distance(A,B) = 1 - cosine(A,B)$$
When the vectors are normalized ($\|A\| = 1$ and $\|B\| = 1$), which is usually the case with vector embeddings produced by ML models, the cosine distance can be computed with only a dot product:
$$dotproduct\_distance(A,B) = 1 - A \cdot B$$
A common use case is to compute a similarity score or confidence. When using normalized vectors we can use $$similarity = {1 + A \cdot B \over 2}$$
This metric has the nice property of lying between 0 and 1, with 1 being as similar as possible, providing a simple score to apply thresholds to.
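The relationships between these metrics are easy to check in plain Python. In this sketch the two vectors are arbitrary and are normalized first, mimicking what most embedding models produce:

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(u):
    n = math.sqrt(dot(u, u))
    return [x / n for x in u]

# Arbitrary vectors, normalized to unit length as embedding models usually do
A = normalize([0.28, 0.75, 0.35, 0.48])
B = normalize([0.33, 0.64, 0.16, 0.68])

cosine = dot(A, B) / (math.sqrt(dot(A, A)) * math.sqrt(dot(B, B)))
cosine_distance = 1 - cosine
dotproduct_distance = 1 - dot(A, B)  # valid shortcut because A and B are unit vectors
similarity = (1 + dot(A, B)) / 2     # in [0, 1], 1 = most similar

# For unit vectors the two distances coincide
assert abs(cosine_distance - dotproduct_distance) < 1e-12
print(similarity)
```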
Here is an example computing the euclidean, cosine, and dot product distances from our previous query. We just need to declare a query variable, vemb as Issue.vector_embedding, to capture the vector embedding of each similar Issue and use it in math functions. The query also illustrates the use of an intermediate variable to compute cosine_distance.
query slownessWithLogs($vec: float32vector) {
simVec(func: similar_to(Issue.vector_embedding, 3, $vec))
{
uid
Issue.description
vemb as Issue.vector_embedding
euclidean_distance: math(sqrt(($vec - vemb) dot ($vec - vemb)))
dotproduct_distance: math(1.0 - (($vec) dot vemb))
cosine as math((($vec) dot vemb) / sqrt((($vec) dot ($vec)) * (vemb dot vemb)))
cosine_distance: math(1.0 - cosine)
similarity_score: math((1.0 + (($vec) dot vemb)) / 2.0)
}
}
In practice, you usually compute just the distance defined in the index, or the similarity score. The next query shows how to compute the similarity score in a variable and use it to get the 3 closest nodes ordered by similarity:
query slownessWithLogs($vec: float32vector) {
var(func: similar_to(Issue.vector_embedding, 3, $vec))
{
vemb as Issue.vector_embedding
score as math((1.0 + (($vec) dot vemb)) / 2.0)
}
# score is now a map of uid -> similarity_score
simVec(func:uid(score), orderdesc:val(score)){
uid
Issue.description
score:val(score)
}
}
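The same score-then-order logic can be sketched locally in plain Python, using the toy vectors from the mutation above (the dictionary keys are stand-ins for uids; these toy vectors are close to unit length, so the score formula stays in [0, 1] here):

```python
# Toy embeddings from the mutation above; keys stand in for uids
issues = {
    "issue1": [0.25, 0.47, 0.8, 0.27],
    "issue2": [0.57, 0.23, 0.68, 0.41],
    "issue3": [0.26, 0.12, 0.77, 0.57],
    "issue4": [0.45, 0.49, 0.72, 0.2],
    "issue5": [0.52, 0.05, 0.22, 0.82],
    "issue6": [0.33, 0.64, 0.16, 0.68],
}
vec = [0.28, 0.75, 0.35, 0.48]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Mirrors the DQL value variable: score is a map of uid -> similarity_score
score = {uid: (1.0 + dot(v, vec)) / 2.0 for uid, v in issues.items()}

# Mirrors simVec(func: uid(score), orderdesc: val(score)) limited to 3 results
top3 = sorted(score, key=score.get, reverse=True)[:3]
print(top3)
```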
This end-to-end example shows how you can insert data with vector embeddings, corresponding to a schema that specifies a vector index, and do a semantic search via the new similar_to() function in Dgraph. It also shows how to compute various metrics using the new dot function.