Thoughts on The Cagle Report: Dgraph Analysis

The Cagle Report: Dgraph is a Hybrid RDF + JSON Store

Industry analyst and all-around data guru Kurt Cagle published an analysis of Dgraph recently and it’s a great read.

The Cagle Report analysis is insightful and interesting, and provides a fresh perspective on how Dgraph fits into the Graph Database and Knowledge Graph worlds. I’m not going to summarize all points in his article, but I’d like to dig into and expand on the parts that surprised or particularly informed me.

Dgraph as a hybrid RDF + LPG system

I’ve worked with a lot of databases over the years (soon to be decades), including Resource Description Framework (RDF) stores and Labeled Property Graphs (LPGs), and I worked with inference systems and ontology modeling at Network Inference way back when OWL was hot. The idea of Dgraph as a hybrid JSON + RDF store was surprising to me, but makes a lot of sense the more I think about it.

Cagle’s claim is provocative enough that I’d like to dig into it, and clarify why I think it is accurate. It’s clearly true that Dgraph uses a “triple” storage model at core, supports RDF as a wire format, and also supports JSON input and output. So “hybrid” is accurate just based on that.

But hybrid how - and why? Dgraph’s underlying persistence model is triple-based (edge-based in LPG terminology and “predicate-based” in RDF terminology) to allow efficient indexing and lightweight updates. Yet Dgraph application interfaces are mostly JSON-based to make it easier for app developers and data engineers to use Dgraph. This is no accident: Dgraph differentiates itself by prioritizing the application developer and data consumer experiences, and support for JSON and GraphQL are part of that approach.

Dgraph does, however, also accept RDF inputs, Dgraph clients produce RDF outputs, and the internal storage model is based on RDF triple ideas - but Dgraph does not embrace what Cagle terms the full “RDF stack.” Dgraph avoids much of that technology in favor of simpler, more mainstream (JSON and GraphQL) tools.
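For example, the same fact can be sent to Dgraph either as RDF triples or as JSON, whichever suits the team. The following is a minimal sketch of the two mutation formats; the predicate names (name, dob) are illustrative rather than taken from any particular schema.

RDF N-Quad mutation:

{
  set {
    _:alice <dgraph.type> "Person" .
    _:alice <name> "Alice" .
    _:alice <dob> "1990-01-01" .
  }
}

Equivalent JSON mutation:

{
  "set": [
    {
      "dgraph.type": "Person",
      "name": "Alice",
      "dob": "1990-01-01"
    }
  ]
}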

Overall, then, data experts can think of Dgraph as something of an RDF store. But app developers and data consumers can and should think of Dgraph as a simpler, JSON-based data store where the magic of triples and related graph concepts may be hidden.

Dgraph Example

Consider two examples: schemas to define entities and edges in a graph data model, and queries to update and access the data in that graph.

Schemas and Data Models

Compared to borderline-cryptic SHACL constraints or OWL rules for defining the data model, industry-standard GraphQL is simpler to use.

SHACL:

@prefix : <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:PersonShape
    a sh:NodeShape ;
    sh:targetClass :Person ;
    sh:property [
        sh:path :dob ;
        sh:datatype xsd:date ;
    ] .

Dgraph GraphQL schema:

type Person {
  dateOfBirth: DateTime
}

Queries

Now consider how to query for a Person in a richer model with addresses and employers. Older RDF-based systems tend to use SPARQL, which is rooted in OWL-based inference requirements that few people use anymore.

GraphQL is simpler, and is more clearly declarative as the GraphQL query is a minimalist skeleton of the JSON response desired:

SPARQL query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT ?name ?dateOfBirth ?street ?city ?state ?postalCode 
       ?startDate ?endDate ?companyName 
       ?employmentStartDate ?employmentEndDate
WHERE {
    ?personId a foaf:Person ;
            foaf:name ?name ;
            ex:dateOfBirth ?dateOfBirth .
    OPTIONAL {
        ?personId ex:hasAddress ?address .
        ?address ex:street ?street ;
                ex:city ?city ;
                ex:state ?state ;
                ex:postalCode ?postalCode ;
                ex:startDate ?startDate ;
                ex:endDate ?endDate .
    }
    OPTIONAL {
        ?personId ex:hasEmployment ?employment .
        ?employment ex:companyName ?companyName ;
                    ex:startDate ?employmentStartDate ;
                    ex:endDate ?employmentEndDate .
    }
}

Dgraph GraphQL query:
{
    person(id: "personId") {
        name
        dateOfBirth
        addressHistory {
            street
            city
            state
            postalCode
            startDate
            endDate
        }
        employers {
            companyName
            startDate
            endDate
        }
    }
}
  

The GraphQL approach is notably simpler.
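To make the “skeleton of the response” point concrete, the result of the GraphQL query above comes back as JSON in exactly the same shape as the query itself. The values below are made up purely for illustration:

Hypothetical JSON response:

{
  "data": {
    "person": {
      "name": "Alice Example",
      "dateOfBirth": "1990-01-01",
      "addressHistory": [
        {
          "street": "123 Main St",
          "city": "Springfield",
          "state": "IL",
          "postalCode": "62701",
          "startDate": "2015-06-01",
          "endDate": "2020-05-31"
        }
      ],
      "employers": [
        {
          "companyName": "Acme Corp",
          "startDate": "2016-01-04",
          "endDate": null
        }
      ]
    }
  }
}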

Other graph query languages exist besides SPARQL, so let’s also look at the comparable Cypher syntax for this query, which is also rather complex, and less declarative in the sense that the query bears little resemblance to the desired output:

Cypher query:

MATCH (p:Person {id: "personId"})
OPTIONAL MATCH (p)-[:HAS_ADDRESS_HISTORY]->(a:Address)
OPTIONAL MATCH (p)-[:HAS_EMPLOYED_BY]->(e:Employer)
RETURN p.name AS name, p.dateOfBirth AS dateOfBirth, 
       collect(DISTINCT {
            street: a.street, city: a.city, state: a.state, postalCode: a.postalCode, 
            startDate: a.startDate, endDate: a.endDate}) 
            AS addressHistory,
       collect(DISTINCT {companyName: e.companyName, 
            startDate: e.startDate, endDate: e.endDate}) 
            AS employers

Will LLMs hasten the demise of OWL and RDF?

It does seem that way. As shown above, RDF can be verbose and difficult to work with. In years past, this may have been justified by RDF’s close association with OWL and other inferencing technology. But OWL is a symbolic AI approach (also known as Good Old-Fashioned AI, or GOFAI) in which sets of specific rules are used to power AI, and that approach is being rapidly eclipsed by LLMs and other statistical AI approaches. At this point, the raison d’être of RDF is difficult to identify, and its continued use is therefore harder to justify.

This table summarizes some of the changes we see in the technology landscape:

Old Approach | Emerging New Approach | Reasons for Change
Rule-Based Systems | Machine Learning & Generative AI | Machine-trained and robust rather than painstakingly curated and fragile
RDF Stores | LPGs | Ease of use, simpler schemas
Bespoke Graph Query Languages | GraphQL | Developer use and adoption
Academic graph orientation | DevX orientation | Developer velocity vs. theoretical concerns

At Dgraph we are working to be the right technology to adopt during this transition from RDF toward LPGs, and from rule-based systems to machine learning and generative AI systems. Dgraph will continue to support RDF as a wire format to make migration from RDF-based systems easy, but our focus is simplicity: an LPG- and JSON-based developer experience using GraphQL queries to empower application developers and data consumers, rather than symbolic AI practitioners.

Ideal Use Cases for Dgraph

The Cagle Report gets down to brass tacks by identifying a number of core use cases that Dgraph is particularly well suited for. Here are a few, with Cagle’s key benefits, and my clarifications or additions in [square brackets].

  • Semantic Publishing. Dgraph can represent taxonomies [and other SKOS-style or RDF concept relations] well. Also, Dgraph can [and often does] represent roles, permissions and other access metadata easily. [Dgraph’s speed is key here, as security and permission checks for authorization use cases can be complex, yet also need to be evaluated in real time].

  • Customer 360°, Equipment 360°, Product 360°, Patient 360°. All the 360°s. Cagle cites “diverse views of complex domains without specialized development”, which I interpret to mean that [JSON data, LPG modeling, and GraphQL queries are all quite simple, and the LPG data model is simplified relative to RDF by allowing facets on edges. Together with graph modeling’s flexibility and agility, this makes Dgraph ideal for knitting together data from many systems into a single, 360° view of anything.]

  • AI/LLM and Vector Integration. [Cagle received a preview of Dgraph’s AI direction prior to his analysis (which I won’t spill too much of here), showing how Dgraph is integrating fuzzy, AI-based similarity links directly into our graphs.]

Limitations

For completeness, The Cagle Report points out when not to use Dgraph. And I agree. In short, if a team is deep into “the RDF stack” including SPARQL, OWL, and “deep inference” then another technology is best. Dgraph can do light inference via query, but if the key criterion is supporting the RDF tool set rather than use cases and developer adoption, Dgraph will not be best.
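As a sketch of what “light inference via query” can look like, Dgraph’s DQL offers the @recurse directive for transitive traversals (for example, finding all direct and indirect reports) without an OWL reasoner. The name and manages predicates below are hypothetical, and name is assumed to be indexed:

{
  allReports(func: eq(name, "Alice")) @recurse(depth: 5, loop: false) {
    name
    manages
  }
}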

I’ll add that Dgraph supports a productive division of responsibilities within the tech team compared to using RDF + SPARQL. In this approach, graph and data gurus can prep, model, store, and expose the data in Dgraph using traditional graph concepts, while data consumers just see friendly JSON and GraphQL interfaces. There is no need for application developers to learn SPARQL or another recondite, non-standard graph query language such as Gremlin or Cypher. This makes work easier for everyone, and clarifies each team’s responsibilities for efficient collaboration.
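As a sketch of that division of labor: the data team can maintain a GraphQL schema that maps types onto underlying graph predicates (Dgraph’s GraphQL implementation provides @dgraph and @search directives for this), while application developers only ever touch the generated GraphQL API. The type, field, and predicate names below are hypothetical:

Data-team schema:

type Person @dgraph(type: "Person") {
  id: ID!
  name: String! @dgraph(pred: "name") @search(by: [term])
  dateOfBirth: DateTime @dgraph(pred: "dob")
}

App-developer query against the generated API:

query {
  queryPerson(filter: { name: { anyofterms: "Alice" } }) {
    name
    dateOfBirth
  }
}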

Best of both worlds

Dgraph covers many RDF use cases and motivations, particularly for the data gurus, but presents a standards-based, JSON-oriented interface for suppliers and consumers of data. “Hybrid” is a fitting term, as The Cagle Report’s analysis illustrates.