FactSet, a global provider of integrated financial information, analytical applications, and industry-leading services, refactored their Core Symbology database — consisting of all the financial symbols the company monitors — to a single distributed Dgraph cluster with a dataset that has 160 million nodes and 2 billion edges. Benefits included:
FactSet delivers superior analytics, service, content, and technology to help more than 126,000 users worldwide see and seize opportunities sooner. The company helps the financial services community address its most pressing challenges, such as optimizing technology costs, building integrated cross-organizational workflows, mitigating risk due to increased regulatory demand, and creating efficient data governance throughout an organization. The company has assembled one of the largest collections of financial data and their databases hold billions of financial objects spread across dozens of databases including SQL and NoSQL Key-Value Stores.
Because the financial world is continually evolving, FactSet is constantly seeking to add new information and categories to its databases. This creates some technical challenges:
FactSet was also implementing a broad internal shift away from an older proprietary query language as a means of interacting with data to a deployment of modern APIs that use more standardized queries. The shift would simplify application development with easier discovery and querying across multiple APIs. A primary candidate for reinvention was FactSet’s primary database for symbols — internally referred to as the “Core Symbology” database.
In the finance world, everything is associated with a symbol. Company stocks have ticker symbols, e.g., “FB” for Facebook. Stock exchanges, bonds, commodities, countries, and derivatives all have their symbols. There are also symbols for specific transactions and clearing processes such as CUSIPs and ISINs.
They all connect together with different relationships and they all have different symbols that map onto them. — Mark Boxall, Principal Software Engineer at FactSet
The single record of truth for all symbols at FactSet is the Core Symbology database. As Boxall explains,
This is a low-level, fundamental piece of infrastructure. — Mark Boxall, Principal Software Engineer at FactSet
All types of financial content — pricing, estimates, benchmarks, countries, ticker symbols — are tied together in a graph of relationships; the relationships have rules. For example, a company can only have one stock ticker on the NASDAQ stock exchange. The traits, values, and relationships of entities can change over time. A bond will mature and sunset. A company will merge with another company. A country may form as a result of political changes, like the Baltic States breaking off from the former Soviet Union. Moreover, numbers can be retroactively revised, from corporate earnings to unemployment, or inflation numbers to the number of shares outstanding.
The Symbology database acts as the primary repository for anything with a symbol and its associated information. It connects a web of proprietary SQL and NoSQL FactSet databases that carry more extensive datasets on specific topics. In other words, the Core Symbology functions as the glue holding other higher-level queries together.
If you are generating a report and pulling in all these different types of data and you want to show it in one view, Symbology is right in the middle of that. We line up the content from all these different sources and map them together as the report is being generated. — Mark Boxall, Principal Software Engineer at FactSet
The FactSet Core Symbology database and structure that Boxall has upgraded includes multiple SQL databases for different types of symbols and a wide array of tables describing different aspects of each symbol. This complex structure requires dizzying joins that span hundreds of tables. FactSet wanted to make it easier to query the Core Symbology.
People have always complained that it was hard to use Symbology APIs and that there were too many of them. — Mark Boxall, Principal Software Engineer at FactSet
Boxall felt the nature of the Core Symbology data made this collection of databases a good candidate for refactoring as a graph and using a graph query language to simplify queries.
We had this very highly connected dataset with multiple facets to each node. Some of the data is sparse so it's not an easy fit into an SQL database. For us, this seemed like a good fit for a graph database with nodes and edges rather than a traditional table structure. — Mark Boxall, Principal Software Engineer at FactSet
Because the Core Symbology database represented a critical piece of infrastructure for FactSet, Boxall had a set of specific and exacting design constraints and requirements.
FactSet wanted to democratize the ability to query the Core Symbology database to allow more people inside the company to build applications that create reports and pull information out of the financial symbol graph.
This was a key consideration for us and one of the main reasons we looked at graphs. Writing queries against graphs is far more natural than writing SQL queries. You don’t need to have a deep understanding of data schema or write across hundreds of joins. — Mark Boxall, Principal Software Engineer at FactSet
This was crucial to allow FactSet to build and modify data schemas easily, such as when new types of nodes were added or new rules were applied to nodes.
For example, we didn’t want to have a company table and a debt table and so on. This dispersed architecture makes changes difficult. If you want to introduce even one new attribute, it’s a schema change, and that’s a nightmare to deploy. With a graph, that’s not a problem and we can modify the schema when we need to in the future. — Mark Boxall, Principal Software Engineer at FactSet
In particular, FactSet wanted the capability for two distinct types of timeseries to be added to each node. One type of time was true business time — what happened in the world at any moment. Over time, FactSet continually corrects and improves its data. Boxall explains,
There is a versioning element to that and many users actually wanted access to different versions of our data. — Mark Boxall, Principal Software Engineer at FactSet
This required a second time series that functioned as a versioning engine. With Dgraph, FactSet could apply both types of time series easily and at scale across tens of millions of nodes in the Core Symbology Database.
FactSet has a demanding customer base who often incorporate its data into time-sensitive financial analysis, modeling, and trading applications. Some of these applications run lengthy, detailed queries requiring numerous joins (or in graph-speak, traversals), which are ill-suited to relational databases due the need for time-consuming query performance tuning with uncertain outcomes. For Core Symbology, queries needed to be fulfilled in less than 100 milliseconds for almost any complexity or size. Ideally, queries should round-trip in less than 20 milliseconds. This requires a database written with an efficient and distributed architecture that leverages an asynchronous programming language like Golang.
Because FactSet intended to create both a resilient and globally co-located database structure, Boxall wanted a truly distributed database. This meant having the ability to horizontally shard on the fly (rather than create full replica sets) and to create global consistency across multiple datacenters in Amazon Web Service’s compute and storage cloud. Many graph databases are not able to shard and carry heavy infrastructure footprints. FactSet wanted to avoid future problems on either front. While FactSet has not yet implemented sharding, it is confident that as the use of Symbology grows and requirements change, sharding can be deployed quickly and easily with Dgraph.
System reliability and support for high customer SLAs were paramount. In finance, system downtime could cost billions of dollars and puts funds at risk. Since the Core Symbology database is an essential engine powering FactSet reports, any refactored database must meet high reliability and uptime requirements. In addition, the database must be able to redeploy quickly in the event of an outage to minimize downtime.
Boxall and his team compared a wide variety of graph databases against FactSet’s design constraints. Some could not deliver on performance, requiring seconds to traverse a large graph and return results on complex queries. Others could not shard data and required full replicas of the database. Still, others could not deliver ACID consistency and reliability. Crucially, Dgraph’s DQL query structure removed the need to add an additional layer for distributing and executing queries. Lastly, Dgraph included the capability for each node to have not only edges but also facets — key value pairs on edges. This unique functionality would allow FactSet to add two dimensions of time series (timestamp and versioning) to meet the product requirement.
The easy combination of a graph database and the native DQL interface was attractive because it made it easy for FactSet users to write queries with minimal technical training. In the end, Boxall found that Dgraph was the only graph database that met all the design constraints.
The community and the company were also super friendly and helpful. That was a key part of our decision. — Mark Boxall, Principal Software Engineer at FactSet
As part of the roll out, Boxall and his team interviewed domain experts on all aspects of the different financial instruments and symbols they would be adding to their Core Symbology datastore. From these conversations, Boxall and his team constructed rules for each data type that matched expert-defined rules around real-world relationships between different financial symbols. The goal was to create a single, up-to-date comprehensive schema that captures all the graph relationship data and uses it as a guide to building the graph Core Symbology database. For the dual-time series functionality, FactSet leveraged Dgraph facets and edges functionality.
We were able to use facets to associate a start and end time, and a start and end version, with all of those edges. Whether it is an actual relationship like ‘is controlled by’ or time count like ‘what is the duration that it was controlled by’, all edges had a start and end time and a start and end version. — Mark Boxall, Principal Software Engineer at FactSet
When they completed their build, the Core Symbology database had 160 million nodes with two billion edges and many billions of facets. Boxall found that Dgraph traverses FactSet’s large graph quickly with most queries requiring less than 20 milliseconds. To accelerate query speed, Dgraph breaks each query into subqueries and runs them concurrently. Whereas the legacy Symbology databases ran in multiple dedicated servers, the Dgraph-backed Core Symbology database required only three eight-core Amazon Web Services cloud instances while providing faster performance and lower latency.
It’s been really good. We liked the concurrency. Dgraph does a good job of breaking down the queries into parallel units of work. And if we want our queries to get quicker, we can add more cores to our instance. — Mark Boxall, Principal Software Engineer at FactSet
Boxall plans to scale up the infrastructure as FactSet pushes the new database into production. During the entire trial and testing period of the POC phase, the FactSet database never went down. Restart tests of the database allowed Boxall to bring the entire system back up in a few minutes from a cold start.
Most importantly, Dgraph’s DQL query structure enabled even non-technical product developers to write queries against the Core Symbology API.
DQL is a lot more natural to use than SQL. With our older databases, non-technical users would quickly lose sight of the data model as their queries spanned hundreds of joins. With Dgraph’s DQL, it’s easier to understand how to write queries and what comes back. — Mark Boxall, Principal Software Engineer at FactSet
For example, the following concise query locates the Bond identified by CUSIP X65692AC3 and finds its issuer by following the isIssuerOf edge in reverse. It then recursively follows the isControlledBy edge to find the ultimate parent of the Bond as of April 15th, 2019.
Boxall’s team wrote a tutorial and created an additional UX layer to make it easier to write and visualize queries. Impressively, new users were able to write queries and start using the new Core Symbology DQL API in real applications right away.
That would never have happened with our old APIs. — Mark Boxall, Principal Software Engineer at FactSet
says Boxall, who plans to add the API with documentation to FactSet’s internal API developer portal. This will make it even easier to discover and use the new Core Symbology database either in beta or when it officially enters production. Boxall explains,
Our experience with Dgraph has been great, and the switch to DQL really opens up a lot of new possibilities to help us build products more quickly and efficiently. — Mark Boxall, Principal Software Engineer at FactSet
If the download is not working, please click here for the PDF file.