FactSet, a global provider of integrated financial information, analytical applications, and industry-leading services, refactored their Core Symbology database — consisting of all the financial symbols the company monitors — to a single distributed Dgraph cluster with a dataset that has 160 million nodes and 2 billion edges.
- Empowering non-technical product developers to query the database
- Reducing query processing time to under 20 milliseconds for most queries
- Reducing infrastructure footprint and time required to manage databases
- Improving database management by converting multiple SQL and NoSQL databases into one Graph database
- A flexible schema that is easy to change and will not disrupt existing front-end architectures
FactSet delivers superior analytics, service, content, and technology to help more than 126,000 users worldwide see and seize opportunities sooner. The company helps the financial services community address its most pressing challenges, such as optimizing technology costs, building integrated cross-organizational workflows, mitigating risk due to increased regulatory demand, and creating efficient data governance throughout an organization. The company has assembled one of the largest collections of financial data and their databases hold billions of financial objects spread across dozens of databases including SQL and NoSQL Key-Value Stores.
Because the financial world is continually evolving, FactSet is constantly seeking to add new information and categories to its databases. This creates some technical challenges:
- Schema flexibility: Legacy SQL and most NoSQL databases have relatively inflexible data schemas. Global and even partial changes to schemas can be difficult to roll out.
- Query latency: As schemas grow more complicated and more and more joins are required across tables and diverse databases, query latency becomes an issue. This is particularly challenging for the finance industry, which demands high performance and low latencies.
- Usability for application development: For internal application development, FactSet’s own product developers were slowed down by their inability to write the complex SQL queries required for even relatively basic requests. This was driven by the number of tables and databases and the requirements to join elements from many different databases with different schemas.
FactSet was also implementing a broad internal shift away from an older proprietary query language as a means of interacting with data to a deployment of modern APIs that use more standardized queries. The shift would simplify application development with easier discovery and querying across multiple APIs. A primary candidate for reinvention was FactSet’s primary database for symbols — internally referred to as the “Core Symbology” database.
Building a Better Financial Symbols Graph
In the finance world, everything is associated with a symbol. Company stocks have ticker symbols, e.g., “FB” for Facebook. Stock exchanges, bonds, commodities, countries, and derivatives all have their symbols. There are also symbols for specific transactions and clearing processes such as CUSIPs and ISINs.
The single record of truth for all symbols at FactSet is the Core Symbology database. As Boxall explains,
All types of financial content — pricing, estimates, benchmarks, countries, ticker symbols — are tied together in a graph of relationships; the relationships have rules. For example, a company can only have one stock ticker on the NASDAQ stock exchange. The traits, values, and relationships of entities can change over time. A bond will mature and sunset. A company will merge with another company. A country may form as a result of political changes, like the Baltic States breaking off from the former Soviet Union. Moreover, numbers can be retroactively revised, from corporate earnings to unemployment, or inflation numbers to the number of shares outstanding.
The Symbology database acts as the primary repository for anything with a symbol and its associated information. It connects a web of proprietary SQL and NoSQL FactSet databases that carry more extensive datasets on specific topics. In other words, the Core Symbology functions as the glue holding other higher-level queries together.
The FactSet Core Symbology database and structure that Boxall has upgraded includes multiple SQL databases for different types of symbols and a wide array of tables describing different aspects of each symbol. This complex structure requires dizzying joins that span hundreds of tables. FactSet wanted to make it easier to query the Core Symbology.
Boxall felt the nature of the Core Symbology data made this collection of databases a good candidate for refactoring as a graph and using a graph query language to simplify queries.
Design Constraints and Requirements for Refactoring Core Symbology
Because the Core Symbology database represented a critical piece of infrastructure for FactSet, Boxall had a set of specific and exacting design constraints and requirements.
Easy to Query / DQL
FactSet wanted to democratize the ability to query the Core Symbology database to allow more people inside the company to build applications that create reports and pull information out of the financial symbol graph.
Super-Flexible Data Schema
This was crucial to allow FactSet to build and modify data schemas easily, such as when new types of nodes were added or new rules were applied to nodes.
For example, we didn’t want to have a company table and a debt table and so on. This dispersed architecture makes changes difficult. If you want to introduce even one new attribute, it’s a schema change, and that’s a nightmare to deploy. With a graph, that’s not a problem and we can modify the schema when we need to in the future. — Mark Boxall, Principal Software Engineer at FactSet
In particular, FactSet wanted the capability for two distinct types of timeseries to be added to each node. One type of time was true business time — what happened in the world at any moment. Over time, FactSet continually corrects and improves its data. Boxall explains,
This required a second time series that functioned as a versioning engine. With Dgraph, FactSet could apply both types of time series easily and at scale across tens of millions of nodes in the Core Symbology Database.
FactSet has a demanding customer base who often incorporate its data into time-sensitive financial analysis, modeling, and trading applications. Some of these applications run lengthy, detailed queries requiring numerous joins (or in graph-speak, traversals), which are ill-suited to relational databases due the need for time-consuming query performance tuning with uncertain outcomes. For Core Symbology, queries needed to be fulfilled in less than 100 milliseconds for almost any complexity or size. Ideally, queries should round-trip in less than 20 milliseconds. This requires a database written with an efficient and distributed architecture that leverages an asynchronous programming language like Golang.
Fully Distributed (Really) But Economical
Because FactSet intended to create both a resilient and globally co-located database structure, Boxall wanted a truly distributed database. This meant having the ability to horizontally shard on the fly (rather than create full replica sets) and to create global consistency across multiple datacenters in Amazon Web Service’s compute and storage cloud. Many graph databases are not able to shard and carry heavy infrastructure footprints. FactSet wanted to avoid future problems on either front. While FactSet has not yet implemented sharding, it is confident that as the use of Symbology grows and requirements change, sharding can be deployed quickly and easily with Dgraph.
System reliability and support for high customer SLAs were paramount. In finance, system downtime could cost billions of dollars and puts funds at risk. Since the Core Symbology database is an essential engine powering FactSet reports, any refactored database must meet high reliability and uptime requirements. In addition, the database must be able to redeploy quickly in the event of an outage to minimize downtime.
Making the Decision on Dgraph and DQL
Boxall and his team compared a wide variety of graph databases against FactSet’s design constraints. Some could not deliver on performance, requiring seconds to traverse a large graph and return results on complex queries. Others could not shard data and required full replicas of the database. Still, others could not deliver ACID consistency and reliability. Crucially, Dgraph’s DQL query structure removed the need to add an additional layer for distributing and executing queries. Lastly, Dgraph included the capability for each node to have not only edges but also facets — key value pairs on edges. This unique functionality would allow FactSet to add two dimensions of time series (timestamp and versioning) to meet the product requirement.
The easy combination of a graph database and the native DQL interface was attractive because it made it easy for FactSet users to write queries with minimal technical training. In the end, Boxall found that Dgraph was the only graph database that met all the design constraints.
Schema Design and Roll Out
As part of the roll out, Boxall and his team interviewed domain experts on all aspects of the different financial instruments and symbols they would be adding to their Core Symbology datastore. From these conversations, Boxall and his team constructed rules for each data type that matched expert-defined rules around real-world relationships between different financial symbols. The goal was to create a single, up-to-date comprehensive schema that captures all the graph relationship data and uses it as a guide to building the graph Core Symbology database. For the dual-time series functionality, FactSet leveraged Dgraph facets and edges functionality.
DQL and Dgraph Make Life Easier
When they completed their build, the Core Symbology database had 160 million nodes with two billion edges and many billions of facets. Boxall found that Dgraph traverses FactSet’s large graph quickly with most queries requiring less than 20 milliseconds. To accelerate query speed, Dgraph breaks each query into subqueries and runs them concurrently. Whereas the legacy Symbology databases ran in multiple dedicated servers, the Dgraph-backed Core Symbology database required only three eight-core Amazon Web Services cloud instances while providing faster performance and lower latency.
Boxall plans to scale up the infrastructure as FactSet pushes the new database into production. During the entire trial and testing period of the POC phase, the FactSet database never went down. Restart tests of the database allowed Boxall to bring the entire system back up in a few minutes from a cold start.
Most importantly, Dgraph’s DQL query structure enabled even non-technical product developers to write queries against the Core Symbology API.
For example, the following concise query locates the Bond identified by CUSIP X65692AC3 and finds its issuer by following the isIssuerOf edge in reverse. It then recursively follows the isControlledBy edge to find the ultimate parent of the Bond as of April 15th, 2019.
Boxall’s team wrote a tutorial and created an additional UX layer to make it easier to write and visualize queries. Impressively, new users were able to write queries and start using the new Core Symbology DQL API in real applications right away.
says Boxall, who plans to add the API with documentation to FactSet’s internal API developer portal. This will make it even easier to discover and use the new Core Symbology database either in beta or when it officially enters production. Boxall explains,