In this article we explain our transition to GitHub Actions for our CI/CD needs at Dgraph Labs Inc. As a part of this effort we have built (in-house) & implemented a new architecture for “Dynamic AutoScaling of GitHub Runners” to power this setup.
In the past, our CI/CD was powered by a self-hosted on-prem
TeamCity setup. This turned out to be difficult to operate & manage in
a startup setting like ours. Transitioning to GitHub Actions & implementing our
new in-house built “Dynamic AutoScaling of GitHub Runners” has helped us reduce our
Maintenance Efforts &
Configuration Time across our repositories for our
CI/CD efforts (with improved security).
CI/CD DevOps Infinite Loop
CI/CD is a two-step process that dramatically streamlines code development and delivery using the power of automation. CI (Continuous Integration) makes developer tasks around source code integration, testing and version control more efficient - so the software can get built with higher quality. CD (Continuous Deployment) automates software testing, release & deployment. CI/CD is often referred to as the DevOps Infinity Loop (as illustrated in the image above).
At Dgraph Labs Inc, we use CI/CD to facilitate our SDLC (Software Development Life Cycle) for our Dgraph Database and our Dgraph DBaaS (Cloud Offering) components. Like any tech company we want to minimize our bugs & deliver high quality products. The testing standards become strict for a Database company like ours, as a database becomes the most critical component of a software stack. To facilitate this, we follow the Practical Test Pyramid model along with other kinds of measurements instrumented into our CI/CD. To summarize, CI/CD helps us with:
CI/CD at Dgraph
Our CI/CD use-cases mostly revolve around the following areas: Building for Multi-Architectures (amd64/arm64), Testing (Unit/Integration & Load), Deployments, Security Audits, Code Linting, Benchmarking & Code Coverage. We will cover some of these topics in our future blog posts.
This setup was quite difficult to manage and monitor, and keeping its uptime high was hard for a small team like ours. The work
involved infrastructure setup, ensuring the right security posture and instrumenting observability on these systems. There were 3
issues here for us: Compute Costs, Maintenance Efforts & Configuration Time. Firstly,
Compute Costs were additive because we needed not only Server & Agent Compute Machines but also an
Observability Stack (& Instrumentation) for these critical systems. Secondly, the
Maintenance Efforts on the issues we
encountered (like Security Patching, Upgrades, Disk-Issues, Inconsistent Test Results Reporting etc.) were taxing the
team and taking time away from our development cycles. Lastly, the
Configuration Time was also a problem for us
because the job configurations lived outside our codebase (in the Server), VCS configurations for new repos needed
instrumentation & we had to write our own install/cleanup steps for basic setup tasks in the job definitions.
As a result,
Compute Costs ⬆⬆, Maintenance Efforts ⬆⬆ &
Configuration Time ⬆⬆ were all high. This led us
to re-think how we can transition to a new system that solved these issues and offered Public & Private repositories
NOTE: TeamCity is a great product. As explained above, Compute Costs, Maintenance Efforts & Configuration Time
were the driving factors that made us transition out.
Our research led us to GitHub Actions. Given that we were already on GitHub for our VCS, this made us explore this further. We were quite content with what it had to offer, as it came with immediate benefits. Notable architecture differences were around how there was a fully managed Server (unlike the previous setup) & how GitHub semi-managed the Runners (a.k.a. Agents). Below we show an example CD run for multi-architecture release for Dgraph Labs Inc on GitHub Actions.
GitHub Actions CD steps for Dgraph
(everything well integrated to GitHub eco-system)
Compute Costs were lower because we only manage the
Self-Hosted Runners. We made
use of the free GitHub-hosted Runners
wherever possible. For jobs that required higher resource specs, we had the option to run them on higher resources using
Self-Hosted Runners. Secondly,
Maintenance Efforts reduced because we had fewer components to manage. Although GitHub had done a great job
in simplifying the Runner setup steps, the Runners still needed manual management, and this remained a concern. Lastly,
Configuration Time reduced drastically because of the Action Marketplace, which
provided pre-templated tasks to perform pre-setup and post-cleanup on the Runners. And with this transition, the code &
job definitions lived together (unlike our old setup).
As a result, our
Compute Costs ⬆ reduced,
Maintenance Efforts ⬆ reduced &
Configuration Time ⬇ was a
great win. There was still some room to improve here, because we were manually attaching & detaching Runners on a need
basis. There was another problem too: the “Idle Runners”, which led to wasted resource spend.
As described in the previous step, our problems were limited to
Compute Costs (idle runners) &
Maintenance Efforts (manual Runner attach/detach). We started exploring potential solutions for “Dynamic AutoScaling of GitHub
Runners”. We found 2 solutions: ARC &
the Philips-Scalable-Runner. ARC was specific to the container eco-system and did not apply to
us. The Philips-Scalable-Runner had
too many components
to facilitate this. So this led us to build our in-house solution.
Our design needed to solve these:
Dynamic AutoScaling of GitHub Runners
There are 3 logical pieces in this architecture, and they are:
VM Images: We bake custom AMIs with a specialized startup script in them. The startup script has logic to connect to the SSM Parameter Store & read its configuration at startup. When the AMI comes up as an EC2 instance, it reads its config from SSM Parameter Store & registers itself with GitHub to service our Jobs.
SSM: We use SSM Parameter Store to store Runner configurations. This is essentially a KV store for configs. We store the Runner configurations as Values with the EC2 Instance as the Key delimiter.
Orchestrator: This is essentially our Controller (written in Python). The Orchestrator monitors GitHub events. It has logic to Scale Up or Down dynamically based on the Job Queue & the available Runner count. It has hooks to GitHub & AWS (SSM Parameter Store & EC2) to facilitate this process. In the Scale Up phase, the Orchestrator creates an SSM Parameter Store entry and then creates an EC2 instance from the AMI (through the Launch-Templates). In the Scale Down phase, the Orchestrator deletes the SSM Parameter Store entry and then deletes the EC2 instance it created.
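To make the flow above concrete, here is a minimal sketch of the Scale Up / Scale Down ordering. It is not our actual implementation: `FakeCloud`, `runners_to_add`, `scale_up`, `scale_down`, `read_config_at_startup`, the `/runners/<name>` key layout and the `max_runners` cap are all hypothetical names and assumptions made for illustration, and an in-memory object stands in for SSM Parameter Store & EC2 so that the sketch runs without AWS credentials.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for SSM Parameter Store and EC2 (illustration only)."""
    parameters: dict = field(default_factory=dict)  # config key -> runner config
    instances: dict = field(default_factory=dict)   # runner name -> instance id
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def launch_instance(self, runner_name: str) -> str:
        # A real orchestrator would launch from the AMI via a Launch Template.
        instance_id = f"i-{next(self._ids):04d}"
        self.instances[runner_name] = instance_id
        return instance_id


def config_key(runner_name: str) -> str:
    # Assumed key layout; the real key format is not described in this post.
    return f"/runners/{runner_name}"


def runners_to_add(queued_jobs: int, idle_runners: int,
                   total_runners: int, max_runners: int) -> int:
    """Positive: scale up by that many. Negative: scale down. Zero: no change."""
    if queued_jobs > idle_runners:
        wanted = queued_jobs - idle_runners
        headroom = max(0, max_runners - total_runners)
        return min(wanted, headroom)
    # Idle runners beyond what the queue needs are surplus.
    return -(idle_runners - queued_jobs)


def scale_up(cloud: FakeCloud, runner_name: str, config: dict) -> str:
    # Config entry first, so it exists before the startup script looks for it.
    cloud.parameters[config_key(runner_name)] = config
    return cloud.launch_instance(runner_name)


def scale_down(cloud: FakeCloud, runner_name: str) -> None:
    # Same order as described above: delete the config entry, then the instance.
    cloud.parameters.pop(config_key(runner_name), None)
    cloud.instances.pop(runner_name, None)


def read_config_at_startup(cloud: FakeCloud, runner_name: str) -> dict:
    # What the baked-in startup script does conceptually: read its own config.
    return cloud.parameters[config_key(runner_name)]


if __name__ == "__main__":
    cloud = FakeCloud()
    delta = runners_to_add(queued_jobs=3, idle_runners=1,
                           total_runners=1, max_runners=4)
    for n in range(delta):
        scale_up(cloud, f"runner-{n}", {"labels": "arm64"})
    print(delta, sorted(cloud.instances))
```

Writing the config entry before launching the instance matters because the startup script reads its configuration as soon as the instance boots. In a real setup the in-memory calls would be replaced by calls to AWS and GitHub, which this sketch deliberately leaves out.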
Note: We are considering open-sourcing this project. For that reason we have only given an overview and skipped the full implementation details. If you are interested in discussing this further, do hit us up. We would love to partner.
CI/CD Cost Graph
We enabled “Dynamic AutoScaling of GitHub Runners” on
2023-Jan-06, and we have seen a drastic reduction in our spend
since then. Not only has it saved us money, it has also saved us Engineering time by eliminating the manual Runner attach/detach.
The “Idle Runner” problem was real, and it would have affected us in different ways had we not addressed it.
~$0 (as it’s a break day)
~$1.2/PR for dgraph-io/dgraph repository
NOTE: We will continue to see more savings over time (as the savings compound here).
To conclude, the table shows our OKRs and how we went about solving for these. The last column is where we are today in our journey.
On behalf of the project team, I would like to express our gratitude to all the internal contributors who have played a crucial role in bringing this project to fruition. Their dedication and hard work have been instrumental in making this idea a reality.
A special thanks goes to Kevin for his efforts in implementing and testing the initial concept. His efforts ensured that our vision could be transformed into a tangible execution.