In this blog post, we will use the Dgraph database to store retail product information (Amazon product data) and language models to reply to users asking for a product recommendation.
The blog post provides Python code snippets to explain the main steps. The complete Jupyter Notebook and associated data folder are available for you to play with.
The language models are used in three different ways: to compute vector embeddings for similarity search, to extract an intent from the user prompt, and to generate the final response from the retrieved data.
This is a case of Retrieval Augmented Generation (RAG) and NLP (Natural Language Processing) leveraging graph structures.
Dgraph is particularly suited for knowledge graph and AI applications due to several key features and capabilities:
Graph Database Structure: Dgraph is designed as a native graph database, which means it stores data in a graph structure consisting of nodes, edges, and properties. This is inherently aligned with the way knowledge graphs represent relationships and entities, making it easier to model complex interconnections.
Native vector support: Any node may have any number of vector predicates that are indexed using the HNSW algorithm for fast similarity retrieval.
Scalability: Dgraph is built to scale horizontally, handling large volumes of data and high query loads efficiently. This is crucial for AI applications that often require processing vast amounts of interconnected data.
High Performance: Dgraph provides fast query execution and low latency, which are essential for real-time AI applications. Its performance optimizations, such as parallel query execution and efficient data storage, make it capable of handling demanding workloads.
Flexible Schema: Dgraph supports flexible schema definitions, allowing for dynamic data models that can evolve. This is beneficial for AI applications where the data schema might need to adapt to new requirements or insights.
Rich Querying Capabilities: Dgraph’s query language, DQL (Dgraph Query Language), is declarative, which means that queries return a response in a similar shape to the query. DQL allows for complex graph traversals and pattern matching, which are essential for extracting insights and relationships in knowledge graphs. It also supports advanced features like recursive queries, aggregations, and, most importantly, vector similarity search.
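As a quick illustration of this declarative style, a query asking for product names (using predicates from the dataset used later in this post) returns JSON that mirrors the query shape:

{
    products(func: type(Product), first: 2) {
        name: Product.Name
    }
}

The response has the same structure (values illustrative):

{
    "products": [
        {"name": "Fall Pillow Covers"},
        {"name": "Flower Pot Stand"}
    ]
}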
The approach:
1. Create a knowledge graph consisting of product information: categories, brands, age_groups, colors, measurements, materials and characteristics.
2. On user request, extract an intent representing which part of the graph should be used to reply to the request.
3. Use the intent in a Dgraph DQL query and execute the query. Use similarity search for best filtering.

Create a file .env in the folder containing this Python notebook with one line for your OpenAI API key:

OPENAI_API_KEY=sk-....
We just need a few Python packages for Dgraph, OpenAI, Hugging Face, and some tools we are using.
# Optional script to install all the required packages
!pip3 install pydgraph
!pip3 install openai
!pip3 install sentence_transformers
!pip3 install pybars3
!pip3 install python-dotenv
import os
import json
import pydgraph
from pybars import Compiler
# Activate the provider you want to use for embeddings and LLM
# from openai import OpenAI
# from mistralai.client import MistralClient
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv
load_dotenv()
assert os.getenv("OPENAI_API_KEY") is not None, "Set OPENAI_API_KEY in your .env file"
Dgraph supports JSON and RDF formats. In this notebook we are using RDF. RDF is a powerful notation for knowledge graphs. It describes information in triples of the form Subject - Predicate - Object (S-P-O).
The original dataset is in JSON format and is 2.7 MB. We have generated an RDF file with the same information. The RDF file is only 361 KB!
See the generateRDF notebook for details.
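For illustration, a single product could be expressed with triples like the following (an illustrative excerpt using the same predicate vocabulary as the rest of this post; the values are hypothetical):

_:p1 <dgraph.type> "Product" .
_:p1 <Product.Name> "Fall Pillow Covers" .
_:p1 <Product.category> _:c1 .
_:c1 <dgraph.type> "category" .
_:c1 <category.Value> "home decor" .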
See Learning Environment to set up a Docker image with dgraph/standalone:latest, or use your on-prem or cloud instance.
dgraph_grpc = "localhost:9080"
client_stub = pydgraph.DgraphClientStub(dgraph_grpc)
client = pydgraph.DgraphClient(client_stub)
print(f"Connected to DGraph at {dgraph_grpc}")
The notebook provides a complete setup covering on-prem and cloud instances.
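For a hosted backend, pydgraph can build the stub from a cloud endpoint and an API key. A minimal sketch, assuming a Dgraph Cloud instance (the endpoint and the DGRAPH_API_KEY environment variable are placeholders):

# Sketch: connect to a Dgraph Cloud backend instead of a local instance
cloud_endpoint = "https://your-backend.grpc.cloud.dgraph.io"  # placeholder endpoint
client_stub = pydgraph.DgraphClientStub.from_cloud(cloud_endpoint, os.getenv("DGRAPH_API_KEY"))
client = pydgraph.DgraphClient(client_stub)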
First we clean the DB. You may want to skip this step.
# Drop all data including schema from the Dgraph instance.
# This is useful for small examples such as this since it puts Dgraph into a clean state.
confirm = input("drop schema and all data (y/n)?")
if confirm == "y":
    op = pydgraph.Operation(drop_all=True)
    client.alter(op)
    print("schema and data deleted")
In the Dgraph schema, we tell the system which indexes we want on the different predicates, declare node types, and specify relationship cardinalities.
# add predicates to Dgraph type schema
with open('data/products.schema', 'r') as file:
    dqlschema = file.read()
op = pydgraph.Operation(schema=dqlschema)
client.alter(op)
print("schema updated:")
print(dqlschema)
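The actual declarations live in data/products.schema. To give an idea of what such a schema contains, a few representative lines might look like this (an illustrative excerpt, not the exact file):

Product.Name: string @index(term) .
Product.category: [uid] @reverse .
Product.color: [uid] @reverse .
category.Value: string @index(term) .
category.embedding: float32vector @index(hnsw) .

The @reverse directive enables the ~Product.category traversal used later, and the hnsw index powers the similar_to searches.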
As the dataset is small we can load all the data in one mutation:
def mutate_rdf(nquads, client):
    ret = {}
    body = "\n".join(nquads)
    if len(nquads) > 0:
        txn = client.txn()
        try:
            res = txn.mutate(set_nquads=body)
            txn.commit()
            ret["nquads"] = len(nquads)
            ret["total_ns"] = res.latency.total_ns
        except pydgraph.errors.AbortedError as err:
            print("AbortedError %s" % err)
        except Exception as inst:
            print(inst)
        finally:
            txn.discard()
    return ret
with open('data/products.rdf') as f:
    data = f.readlines()
mutate_rdf(data, client)
For large datasets, refer to the Import data options.
As our data is now in a graph database, we can traverse the graph, search for nodes, count relationships, etc. To verify that we have data in the DB, let’s execute a simple query to find the top 3 categories and their number of products:
query = '''
{
    var(func:type(category)) {
        np as count(~Product.category)
    }
    productsPerCategory(func:uid(np), orderdesc:val(np), first:3) {
        category:category.Value
        number_of_products:val(np)
    }
}
'''
res = client.txn(read_only=True).query(query)
res = json.loads(res.json)
print("Top 3 categories with the most products:")
print(json.dumps(res, indent=4))
The expected result is:
Top 3 categories with the most products:
{
    "productsPerCategory": [
        {
            "category": "home decoration",
            "number_of_products": 20
        },
        {
            "category": "books",
            "number_of_products": 17
        }
    ]
}
We don’t want to constrain the question to only use terms present in the database. For example, the user may want “some clothes of dark color”. We need to search our graph by similarity and not only by terms. We will use the power of Dgraph vectors and language model vector embeddings.
Dgraph is a graph database with native vector support, HNSW index, and similarity search. For this use case, we will be using a Python script shared in the blog post Add OpenAI, Mistral or open-source embeddings to your knowledge graph to compute and add vector embeddings to all our entities.
For example, with an embedding on the color entities, we will be able to search for colors similar_to “dark color”.
Refer to the notebook to get the embedding logic details.
The embeddings are then computed using a simple configuration file:
embedding_config = [
    {
        "entityType": "Product",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ title: Product.Title}",
            "template": "{{title}} "
        }
    },
    {
        "entityType": "age_group",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ value: age_group.Value}",
            "template": "{{value}} "
        }
    },
    {
        "entityType": "brand",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ value: brand.Value}",
            "template": "{{value}} "
        }
    },
    ...
]
for embedding_def in embedding_config:
    buildEmbeddings(
        embedding_def,
        only_missing=True
    )
    print(f"Embeddings done for {embedding_def['entityType']}.{embedding_def['attribute']}")
In our example we used the Hugging Face Sentence Transformers model all-MiniLM-L6-v2 for all our embeddings.
The template is a handlebars template that generates the text to be embedded from the dqlQuery result. By using a DQL query, we can build complex embeddings (or graph embeddings) for any node type: the embedded text can include text from connected nodes at any level. In our use case, the embeddings are kept simple.
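To make the mechanism concrete, here is a minimal sketch of what happens for one Product node: the dqlQuery result is rendered through the handlebars template (via pybars), and the resulting text is encoded with the Sentence Transformers model. The node value is hypothetical:

from pybars import Compiler
from sentence_transformers import SentenceTransformer

compiler = Compiler()
template = compiler.compile("{{title}} ")   # the handlebars template from the config

node = {"title": "Fall Pillow Covers"}      # hypothetical dqlQuery result for one node
text = str(template(node))                  # "Fall Pillow Covers "

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vector = model.encode(text).tolist()        # the vector stored on the node
print(len(vector))                          # 384 dimensions for all-MiniLM-L6-v2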
Once the embeddings have been added to each node using mutations, we can use the similar_to function in DQL queries. For example:
sentence = "looking for something to make my home pretty"
# Get the sentence embedding with the same model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentence_embedding = model.encode(sentence).tolist()
# Use Dgraph similar_to function to find similar categories and use Graph relations to get the products for this category
txn = client.txn(read_only=True)
query = f'''
{{
    result(func: similar_to(category.embedding,3,"{sentence_embedding}")) {{
        category:category.Value
        products:~Product.category (first:3) {{
            name:Product.Name
        }}
    }}
}}'''
try:
    res = txn.query(query)
    data = json.loads(res.json)
    print(json.dumps(data, indent=4))
finally:
    txn.discard()
The query looks for the 3 category nodes closest to the provided prompt and gets at most 3 products for each category.
The response looks like the following:
{
    "result": [
        {
            "category": "wedding decor",
            "products": [
                {
                    "name": "Romantic LED Light Valentine's Day Sign"
                }
            ]
        },
        {
            "category": "home decor",
            "products": [
                {
                    "name": "Fall Pillow Covers"
                }
            ]
        },
        {
            "category": "home garden balcony decor",
            "products": [
                {
                    "name": "Flower Pot Stand"
                }
            ]
        }
    ]
}
In the previous query, we assumed that the question was about products found by category, so we could write the DQL query directly.
We can go further and use an LLM to analyze the user prompt and determine the correct criteria to use before querying the graph structure. In this example our dataset is small, but the approach must also work for large graphs: loading all the data into the LLM context may not be practical and may exceed the LLM token window. The whole idea is to extract a subset of the data needed to reply to the user question.
We will use OpenAI and a prompt built with our knowledge of the graph structure, i.e., the description of the entities and predicates that can be found in the graph (aka the ontology).
We define the ontology and a way to represent it as text:
entities = [
    {
        "entity_name": "Product",
        "description": "Item detailed type",
        "predicates": {
            "category": {"description": "Item category, for example 'home decoration', 'women clothing', 'office supply'"},
            "color": {"description": "color of the item"},
            "brand": {"description": "if present, brand of the item"},
            "characteristic": {"description": "if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'"},
            "measurement": {"description": "if present, dimensions of the item"},
            "age_group": {"description": "target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'."}
        }
    }
]
def ontologyPrompt(ontology):
    # Create a textual description of the ontology to help prompting the LLM
    entities = [f'\'{e["entity_name"]}\'' for e in ontology]
    list_entities = ", ".join(entities)
    s = f"Identify if the user question is about one of the entities {list_entities}."
    s += "\nIdentify criteria about predicates depending on the entity."
    for e in ontology:
        s += f'\nFor \'{e["entity_name"]}\' look for:'
        for p in e["predicates"]:
            s += f'\n- \'{p}\': {e["predicates"][p]["description"]}'
    return s
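Printing the generated description shows the text that will be injected into the system prompt:

print(ontologyPrompt(entities))

Identify if the user question is about one of the entities 'Product'.
Identify criteria about predicates depending on the entity.
For 'Product' look for:
- 'category': Item category, for example 'home decoration', 'women clothing', 'office supply'
- 'color': color of the item
- 'brand': if present, brand of the item
- 'characteristic': if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'
- 'measurement': if present, dimensions of the item
- 'age_group': target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'.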
Using meta-data in an ontology structure is an elegant and generic way to provide information to both the LLM (textual part) and the query builder (structured knowledge) about the domain we are dealing with.
We use a prompt including the ontology information to ask OpenAI to identify an intent from the user prompt:
system_prompt = f'''
You are analyzing a user prompt to fetch information from a knowledge graph.
{ontologyPrompt(entities)}
Return a json object following the example:
{{
    "entity": "product",
    "intent": "one of 'list', 'count'",
    "criteria": [
        {{ "predicate": "category", "value": "clothing"}},
        {{ "predicate": "color", "value": "blue"}},
        {{ "predicate": "age_group", "value": "adults"}}
    ]
}}
If there are no relevant entities in the user prompt, return an empty json object.
'''
from openai import OpenAI

llm = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Ask the LLM to extract a structured intent from the user prompt
def text_to_intent(prompt, model="gpt-4o-mini"):
    completion = llm.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    intent = json.loads(completion.choices[0].message.content)
    intent['prompt'] = prompt
    return intent
For example, the analysis of the request
"do you have clothes for teenagers in dark colors?"
results in the intent:
{
    "entity": "product",
    "intent": "list",
    "criteria": [
        {
            "predicate": "category",
            "value": "clothing"
        },
        {
            "predicate": "color",
            "value": "dark"
        },
        {
            "predicate": "age_group",
            "value": "teenagers"
        }
    ],
    "prompt": "do you have clothes for teenagers in dark colors?"
}
The intent structure is easily translated into a graph traversal with constraints and filters.
One key idea is to use a similarity search instead of a keyword or term search. In the above example, dark is not a color value: a keyword search would not find any result, while a similarity search should find “black” and “dark blue” as good matches.
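To see the difference, compare an exact-match lookup with the similarity search (a sketch; the eq variant assumes a hash or term index on color.Value):

{
    # exact match: returns nothing, since no color node has the literal value "dark"
    exact(func: eq(color.Value, "dark")) {
        color.Value
    }
}

The similar_to version used in the generated query below returns the closest color node instead.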
For example, the query generated from the previous intent looks like:
query test($categoryvect: float32vector, $colorvect: float32vector, $age_groupvect: float32vector) {
    category as var(func:similar_to(category.embedding,1,$categoryvect))
    color as var(func:similar_to(color.embedding,1,$colorvect))
    age_group as var(func:similar_to(age_group.embedding,1,$age_groupvect))
    products(func:type(Product)) @filter(
        uid_in(Product.category, uid(category))
        AND uid_in(Product.color, uid(color))
        AND uid_in(Product.age_group, uid(age_group)) ) {
        name:Product.Name
        title:Product.Title
        age_group:Product.age_group {
            value:age_group.Value
        }
        brand:Product.brand {
            value:brand.Value
        }
        color:Product.color {
            value:color.Value
        }
        category:Product.category {
            value:category.Value
        }
        characteristic:Product.characteristic {
            value:characteristic.Value
        }
        material:Product.material {
            value:material.Value
        }
        measurement:Product.measurement {
            value:measurement.Value
        }
    }
}
The DQL query is created from 4 parts: the query parameters (one float32vector per criterion), the var blocks using similar_to to find the best matching node for each criterion, the filter keeping only products connected to those matched nodes, and the selection set describing the product information to return.
The query parts are inferred from the intent and the ontology. In the example we have hardcoded the fact that we are dealing with the Product type, but this can easily be generated from the intent “entity” information.
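For instance, a small mapping from the intent entity to the Dgraph type could remove the hardcoding. A hypothetical extension, not part of the notebook:

# Hypothetical: derive the root type from the intent instead of hardcoding "Product"
type_by_entity = {"product": "Product"}  # extend as the ontology grows

def root_type(intent):
    return type_by_entity[intent["entity"].lower()]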
Here is the code used to build the DQL query:
# use the same embedding model for the user input and for the searched entities
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def create_embedding(text):
    # print(f"create embedding for {text}")
    return huggingfaceEmbeddings(model, [text])[0]
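The huggingfaceEmbeddings helper comes from the embeddings notebook mentioned earlier; a minimal stand-in consistent with how it is called here could be:

# Minimal stand-in: encode a batch of texts and return plain Python lists
def huggingfaceEmbeddings(model, texts):
    return [vector.tolist() for vector in model.encode(texts)]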
# For each criterion, compute an embedding of the criterion value.
# Build a sequence of var blocks to find the most similar node (category, characteristic, brand, etc.).
# Build a filter to keep only the Products with the corresponding category, characteristic, brand, etc.
def intent_to_dql(intent):
    vect = []
    vars = []
    filters = []
    variables = {}
    for criteria in intent['criteria']:
        variables[f"${criteria['predicate']}vect"] = f"{create_embedding(criteria['value'])}"
        vect.append(f"${criteria['predicate']}vect: float32vector")
        vars.append(f"{criteria['predicate']} as var(func:similar_to({criteria['predicate']}.embedding,1,${criteria['predicate']}vect))")
        filters.append(f"uid_in(Product.{criteria['predicate']}, uid({criteria['predicate']}))")
    all_filters = "\n AND ".join(filters)
    all_vars = "\n".join(vars)
    query = f"""
query test({','.join(vect)}){{
    {all_vars}
    products(func:type(Product)) @filter( {all_filters} ) {{
        name:Product.Name
        title:Product.Title
        age_group:Product.age_group {{
            value:age_group.Value
        }}
        brand:Product.brand {{
            value:brand.Value
        }}
        color:Product.color {{
            value:color.Value
        }}
        category:Product.category {{
            value:category.Value
        }}
        characteristic:Product.characteristic {{
            value:characteristic.Value
        }}
        material:Product.material {{
            value:material.Value
        }}
        measurement:Product.measurement {{
            value:measurement.Value
        }}
    }}
}}
"""
    return {"query": query, "variables": variables}
We simply instruct an LLM to reply to the user request using the data retrieved from the graph. This allows us to create a graph query that is “good enough” and may retrieve more data than needed; we then let the LLM use what is relevant for the request.
def rag(prompt, payload):
    model = "gpt-4o-mini"
    rag_prompt = f'''
You are suggesting products based on user input and available items.
Reply to the user with suggestions from the following data that match the criteria
{payload}
If possible explain why the items are suggested.
If there are no relevant items reply that we don't have any items that match the criteria.
'''
    completion = llm.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": rag_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content
The last function is a good summary of the approach in a few lines of code:
def reply(sentence):
    intent = text_to_intent(sentence)
    dql = intent_to_dql(intent)
    res = client.txn(read_only=True).query(dql["query"], variables=dql["variables"])
    payload = json.loads(res.json)
    return rag(sentence, payload)
Let’s test our RAG solution:
example_queries = [
    "Which pink items are suitable for children?",
    "Do you have a helmet with anti allergic padding?",
]
for q in example_queries:
    print()
    print(f"> {q}")
    print()
    r = reply(q)
    print(r)
Here are the results we got:
> Which pink items are suitable for children?
I have two great suggestions for pink items that are suitable for children:
Suitcase Music Box
Unicorn Curtains
Both items are not only visually appealing with their pink color but also serve functional purposes for children’s enjoyment and room decor.
The second question:
> Do you have a helmet with anti allergic padding?
Yes, we have a helmet that features anti-allergic interior padding. I recommend the Steelbird Hi-Gn SBH-11 HUNK Helmet.
Here are some details about it:
This helmet not only provides anti-allergic features but also has a variety of other comfort and safety attributes, making it a great choice for your motorcycle gear.
In this blog post, we demonstrated the integration of Dgraph database and Language Models to create an intelligent product recommendation system. By leveraging Dgraph’s graph database structure and native vector support, along with the powerful capabilities of language models, we achieved efficient storage, retrieval, and response generation for retail product information.
We explored the following key aspects: loading the product knowledge graph from RDF data, computing and storing vector embeddings for similarity search, extracting an intent from the user prompt with an LLM, generating a DQL query from the intent and the ontology, and generating the final response with an LLM (RAG).
The approach exemplifies the use of Retrieval Augmented Generation (RAG) and Natural Language Processing (NLP) within graph structures and provides a general workflow for the RAG-over-graph use case, which can be improved at various points.
Happy coding with Dgraph, embeddings, and language models!
Photo by SHVETS production from Pexels