Business applications nowadays have both prescriptive and predictive components, with the former being used to recommend products or services to customers and the latter helping to anticipate events in line with company goals. Adding recommendation capabilities to a business requires handling large datasets, analyzing often disconnected data, and being in real-time. Traditional approaches involve extracting meaningful information and computing similarity functions using math, which is challenging to apply in real-time.
However, by using Dgraph as storage, businesses can compute similarity functions using graph traversal and computation along the path, eliminating the need for complex matrix extraction and computation. The computed similarity can be stored back in the graph as additional knowledge, providing an efficient real-time recommendation engine for every business.
In this example, we walk you through how Dgraph approaches content-based filtering and collaborating filtering and the capabilities required to build a recommendation engine.
Graph Databases for Recommendation Engine
Graph databases are a great solution when the relationships between your data are just as important as the data itself — and a recommendation engine is a perfect example. Dgraph is nicely suited to address recommendations engine challenges;
- Can handle massive datasets thanks to its distributed nature (horizontal scalability)
- 'Connect the dots' in your data
- Has unparalleled performance to analyze connected data
- Do math on the graph traversal
- Let you add any valuable information, knowledge, or metadata in the graph itself.
‘Recommendation Systems, as described in the Stanford University MOOC Mining Massive Datasets, deals with users and items. The known information about the degree to which a user 'likes' or 'is connected to' an item can be represented as a utility matrix.Most entries are unknown, and the goal of recommending items to users is to predict the values of the unknown entries based on the values of the known entries.
The two standard approaches, content-based filtering, and collaborative filtering, boil down to finding 'distance' functions that serve as measures of similarity.
Building a recommendation engine requires the capability of
- Being able to easily extract subsets (matrices) based on criteria and for different dimensions
- Counting relations, aggregating ratings, doing basic math functions to compute distance functions
- Storing proximity information as a relationship back in data to speedup later predictions
- Computing predicted values
Dgraph offers all those capabilities in a single scalable solution as a distributed graph database with powerful traversal and query language.
The association of graph traversal and computation is a very effective way to compute the similarity function for given nodes on a large graph without having to build gigantic matrices as usually required in traditional approaches. Saving you the cost of building the matrix from the data and saving you computational cost too. Moreover, the graph is an elegant support to save intermediate results (such as the proximity between two users or two items) so the computation of the recommendation is even faster.
Collaborative filtering is about spotting similarities and differences in the behavior of users. If similar users (lots of similarities) enjoyed a particular item that you are not related to, or have not seen or bought (the small difference), that might be a good one to recommend to you. Collaborative filtering is the approach used wherever you have a notion of 'ranking' by peers.
In a "Stackoverflow-like" scenario, say we want to recommend what a user should read based on upvotes. This is a case of collaborative filtering. The recommendation logic is as follows:
- find similar users who have upvoted the same content.
- find the content similar users have upvoted but the given user has not interacted with.
- Use the count of similar users who have upvoted the content to order the result and take the top 10
The graph representation of the dataset is the following
We can easily implement this recommendation using graph traversal queries and some computation.In Dgraph you can follow the relationships backward using the ~ notation.The query to do collaborative filtering on this dataset is as follows:
Let’s look at what the numbered lines in the query represent:
- Start with an initial user found by user name
- Traverse the ~Author relationship which leads to the votes he has created
- Traverse the ~Upvote relationship which leads us to the posts he has upvoted
- Traverse the Upvote relationship to get all the upvotes on these posts
- Traverse the Author relationship to get the users who created these upvotes and store the number of paths between the starting user and this user in a variable sc
- Pick the top 50 users by the score we calculated
- Get the upvotes created by these users
- Get the posts that are of type 'question' not seen by the initial user and assign them a score which is the number of upvotes by these top (similar) users. This score is store in variable fsc
- Use the score we calculated in fsc to get the top 10 questions
The computation takes advantage of the graph traversal capabilities, the filtering functions, and some math using what we call "variable propagation".
The query is using a variable c with a value of 1. Deeper in the graph traversal it is using a variable "sc as math(c)"In Dgraph two important things happen: "sc" is variable means that it is in fact a map associating value to each node of this level. Second, the value of "c" has changed, at each nested level, the value of c is the sum of the "c" value for each parent. Let's see that in schema:
You see that, when reaching User A, B, and C, the value of "c" is different and reflects the number of paths reaching each of them, and in our case, it is the number of votes each one has in common with our initial user.We use the same capability with the variable "b" to count the number of votes from the selected set of users for each Post that the initial user has not seen yet.
To make this recommendation scalable, we simply store the "similar" relationship in the graph itself: every time a user makes an upvote, we recompute the similar users (steps 1 to 6) and update the graph with a "similar" relationship for this user.
Then, when a user logs in, we use these "pre-computed" relationships, to compute the recommendation output in a very efficient way.
Content-based filtering is about finding similarities among items based on the properties they exhibit. For example, if a user likes movies of a certain genre, with certain actors or directors, we can recommend other movies with the same properties. More than often 'properties' are in fact relationships between different data. The "Jaccard distance" and the "cosine distance" are commonly used.
Let's see how to use the cosine distance in a graph storing our movie database.
We are using data generated from ML-100k dataset of 100.000 ratings from users on movies. We converted the dataset to RDF (the dataset and script can be found here) and loaded it into Dgraph.
The model is super simple:
The rating is a number from 1 to 5 and is stored as a property of the "rated" edge.
Let's take the genres of each movie as our 'dimension'. Mathematically we can represent the genre information of a movie as a vector X of all the possible genres and with 1 for genre present and 0 for not present.
The cosine similarity between two movies A and B i.e two 'genre' vectors is computed by using a dot product and magnitude:
As Ai and Bi are either 0 or 1, the dot product is in fact the number of common genres i.e | M1.genres M2.genres |References also suggest to include the magnitude of the ratings in the distance. So our final distance is
This distance function factors in both the intersecting genres as well as the similarity of the average ratings.
It can be computed in a simple Dgraph query!
FIrst let's compute the average rating of each movie and add it as a new information in our graph :
The average rating of a movie is only changing when we have a new rating for this movie so that's a good idea to separate this computation from the recommendation and to store the value in the graph.
Now we can evaluate similar movies to a given movie when needed, using a simple Dgraph query:
The query works as follows.
- We find the graph node for movie M1, given its name.
- Starting at M1 node, we use a variable norm 1 that will aggregate and count paths.
- Use a variable M1_AVG_RATING for the average rating.
- Use a variable M1_COUNT_GENRES for the count of genres.
- Follow genre paths to explore M1 genres and ~genres to find all Movies which have at least one genre in common with M1.
- As M1_AVG_RATING and M1_COUNT_GENRES variables have been propagated along the path M1->genre->M2 we need to divide their value by the number of paths to get the initial value. It's a trick to counter the effect of variable propagation.
- We have all the variables to compute the similarity between M1 and M2 in a variable M2_SCORE (associated with M2 unique id).
- Return the name and score of the 5 films with the highest score
Running this query with the name ‘‘Toy Story (1995)’’ we get these results:
As expected the selected movie is found similar to itself with a score of 1 and we could identify similar other movies.
Have a look at those blogs, they also show how to compute a Jaccard distance and the Slope One Predictor.https://dgraph.io/blog/post/recommendation/#preface/