Dgraph v24 introduces the much-anticipated convergence of vector and graph databases: vector support, an HNSW index, and similarity search in both DQL and GraphQL. It is a major step toward supporting GenAI RAG, classification, entity resolution, semantic search, and many other AI and graph use cases.
Vector databases are vector-first: they let you find similar vectors, but what you actually need is the data. The data is stored either as a payload associated with the vector or as a reference ID. With payloads, creating multiple vectors for the same data introduces duplication and synchronization issues; with reference IDs, you need extra queries to fetch the data.
Dgraph is entity-first: you can add many vector predicates to the same entity type. For example, a Product may have one vector embedding built from the text description and another created from the product image. When searching for similarity, you get similar entities, i.e. the data, not only the vector, so you don't need extra queries to get the information you want. The entities found are part of the graph, so you can also traverse any relationships in the same graph request.
Dgraph is a database and does not have the limitations of in-memory solutions: vectors are treated like any other predicate, stored and indexed in the core database.
This blog shows how to get started with vector embeddings in Dgraph using OpenAI, Mistral, or Hugging Face embedding models, and walks through how Product embeddings were added to an existing Dgraph instance.
In our example, the following minimal GraphQL schema is deployed in Dgraph, and the database is populated with existing products:
type Product {
  id: String! @id
  description: String @search(by: [term])
  title: String @search(by: [term])
  imageUrl: String
}
With v24 we can declare a new vector predicate and specify an index using the @search directive. Vector predicates support the hnsw index with the euclidean, cosine, or dotproduct metric. A vector is a predicate of type [Float!] with the directive @embedding.
Here is the updated GraphQL Schema:
type Product {
  id: String! @id
  description: String @search(by: [term])
  title: String @search(by: [term])
  imageUrl: String
  characteristics_embedding: [Float!] @embedding @search(by: ["hnsw(metric: euclidean, exponent: 4)"])
}
For our test with a local instance of Dgraph, we simply deploy the schema using:
curl -X POST http://localhost:8080/admin/schema --silent --data-binary '@./schema.graphql'
Dgraph uses the deployed GraphQL schema to expose a GraphQL API with queries, mutations, and subscriptions for the declared types. For each entity type with at least one vector predicate, Dgraph v24 generates two new queries:

querySimilar<Entity>ByEmbedding returns the topK closest entities to a given vector. The typical use case is semantic or natural-language search: the client application computes the vector from a sentence, i.e. a request expressed in natural language, using the same model used for the entities' embeddings.

querySimilar<Entity>ById returns the topK closest entities to a given entity. The typical use case is a recommendation system based on similarity search.
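As a sketch of the semantic-search flow, the snippet below builds the JSON payload for a querySimilarProductByEmbedding request. The exact argument names (by, topK, vector) follow the generated API described above; the 384-dimension dummy vector stands in for a real embedding computed with the same model used for the entities (e.g. sentence-transformers/all-MiniLM-L6-v2):

```python
import json

def build_similarity_query(embedding, top_k=10):
    """Build the GraphQL payload for the generated querySimilarProductByEmbedding query."""
    query = """
    query QuerySimilarProductByEmbedding($vector: [Float!]!) {
      querySimilarProductByEmbedding(
        by: characteristics_embedding, topK: %d, vector: $vector) {
        id
        title
        vector_distance
      }
    }""" % top_k
    return {"query": query, "variables": {"vector": embedding}}

# Dummy 384-dimension vector; in practice, embed the user's sentence
# with the same model used for the Product embeddings.
payload = build_similarity_query([0.1] * 384, top_k=5)
# POST this payload as JSON to the /graphql endpoint,
# e.g. http://localhost:8080/graphql
print(json.dumps(payload)[:60])
```

The payload can then be sent with any HTTP client; the response rows carry the entity fields directly, so no second lookup query is needed.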
Before experimenting with those new queries in the GraphQL API, we need to populate our graph with embeddings.
We are using a Python script from the examples folder of the pydgraph repository.
The script is provided as-is, as an example. Adapt the logic to your needs.
The logic of the shared Python script is as follows: for each entry in a JSON configuration, it fetches the entity data with the given DQL query, builds the text to embed from a template (mustache notation, rendered with pybars), sends that text to the configured provider and model, and stores the resulting vector in the target attribute. For our Product type we defined the following embedding configuration:
{
  "embeddings": [
    {
      "entityType": "Product",
      "attribute": "characteristics_embedding",
      "index": "hnsw(metric: \"euclidean\")",
      "provider": "huggingface",
      "model": "sentence-transformers/all-MiniLM-L6-v2",
      "config": {
        "dqlQuery": "{ title:Product.title }",
        "template": "{{title}}"
      }
    }
  ]
}
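The core loop of such a script can be sketched as follows. This is an assumed simplification, not the pydgraph example verbatim: the mustache rendering is reduced to a regex substitution standing in for pybars, and fake_embed is a placeholder for the real provider call (OpenAI, Mistral, or a local sentence-transformers model):

```python
import json
import re

CONFIG = {
    "entityType": "Product",
    "attribute": "characteristics_embedding",
    "config": {"dqlQuery": "{ title:Product.title }", "template": "{{title}}"},
}

def render(template, data):
    # Stand-in for pybars: substitute {{name}} with the queried value.
    return re.sub(r"{{\s*(\w+)\s*}}", lambda m: str(data.get(m.group(1), "")), template)

def fake_embed(text):
    # Placeholder for the embedding provider; returns a dummy 4-dim vector.
    return [float(len(text)), 0.0, 0.0, 0.0]

def embedding_mutation(uid, row, cfg=CONFIG):
    """Build one JSON mutation setting the vector predicate for an entity."""
    text = render(cfg["config"]["template"], row)
    vector = fake_embed(text)
    # Vector predicates are written as a JSON float-array string.
    return {"uid": uid, cfg["attribute"]: json.dumps(vector)}

# 'row' would come from running cfg["config"]["dqlQuery"] against Dgraph.
mutation = embedding_mutation("0x1", {"title": "Wireless mouse"})
```

In the real script, the mutation objects are committed back to Dgraph in batches through a pydgraph transaction.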
Note that the script uses a DQL query on data generated from a GraphQL schema. You can learn more about this topic in the doc section GraphQL - DQL interoperability.
In a terminal window, declare the Dgraph gRPC endpoint. For example, for a local instance:
export DGRAPH_GRPC=localhost:9080
If needed, for cloud instances, also declare an admin client key:
export DGRAPH_ADMIN_KEY=<Dgraph cloud admin key>
and run the script:
python ./computeEmbeddings.py
We are using Python 3.11 with:

- openai 1.27.0
- mistralai 0.1.8
- pybars3 0.9.7
- sentence-transformers 2.2.2
Having vector predicates populated with your embeddings is all you need to perform similarity queries using the auto-generated queries in the GraphQL API.
In our example we have identified one of the products, with id 059446790X, and performed a similarity search:
query QuerySimilarProductById {
  querySimilarProductById(id: "059446790X", by: characteristics_embedding, topK: 10) {
    id
    title
    vector_distance
  }
}
Note that you specify in the query the predicate name (here characteristics_embedding) to be used for the similarity search. As previously mentioned, you may have more than one vector predicate attached to the Product entity, and you can perform different similarity queries (similar description, similar image, etc.).
vector_distance is a generated field providing the distance between the given vector and each entity's vector. It can be used to compute a similarity score or to apply thresholds.
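For example, distances can be post-processed client-side. The 1/(1+d) conversion below is one common convention for turning a euclidean distance into a score in (0, 1], not a Dgraph-defined formula, and the response rows are hypothetical:

```python
def similarity(distance):
    """Map a euclidean distance in [0, inf) to a score in (0, 1]."""
    return 1.0 / (1.0 + distance)

def filter_by_threshold(results, max_distance=0.5):
    """Keep only results whose vector_distance is within the threshold."""
    return [r for r in results if r["vector_distance"] <= max_distance]

# Hypothetical rows from a querySimilarProductById response:
hits = [
    {"id": "059446790X", "title": "A", "vector_distance": 0.0},
    {"id": "placeholder-id", "title": "B", "vector_distance": 0.8},
]
close = filter_by_threshold(hits)
```

A sensible threshold depends on the metric and the embedding model, so it is usually tuned empirically against known-similar pairs.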
Dgraph added vector support as a first-class citizen, with a fast HNSW index. Using vector predicates to store embeddings computed by ML models such as OpenAI, Mistral, Hugging Face, and others is a surprisingly powerful approach for many AI and NLP use cases.
In this blog, we showed how to quickly add embeddings to existing entities stored in Dgraph. Let us know what you are building by combining the power of Dgraph and ML models.
Photo by Tuur Tisseghem