Dynamic AutoScaling of GitHub Runners

In this article we explain our transition to GitHub Actions for our CI/CD needs at Dgraph Labs Inc. As part of this effort, we built and implemented a new in-house architecture for “Dynamic AutoScaling of GitHub Runners” to power this setup.

In the past, our CI/CD was powered by a self-hosted, on-prem TeamCity setup, which turned out to be difficult to operate & manage in a startup setting like ours. Transitioning to GitHub Actions & implementing our in-house “Dynamic AutoScaling of GitHub Runners” has helped us reduce Compute Costs, Maintenance Efforts & Configuration Time across our repositories (with improved security).

Background

Before we begin we would like to give you an overview of CI/CD & explain our needs for it at Dgraph Labs Inc.

What is CI/CD?

[Figure: CI/CD DevOps Infinite Loop (image source credits)]

CI/CD is a two-step process that dramatically streamlines code development and delivery through automation. CI (Continuous Integration) makes developer tasks around source-code integration, testing and version control more efficient, so software can be built with higher quality. CD (Continuous Deployment) automates software testing, release & deployment. CI/CD is often referred to as the DevOps Infinite Loop (as illustrated in the image above).

Why is CI/CD important to us?

At Dgraph Labs Inc, we use CI/CD to facilitate our SDLC (Software Development Life Cycle) for our Dgraph Database and our Dgraph DBaaS (Cloud Offering) components. Like any tech company, we want to minimize bugs & deliver high-quality products. Testing standards are especially strict for a database company like ours, since the database is often the most critical component of a software stack. To facilitate this, we follow the Practical Test Pyramid model, along with other measurements instrumented into our CI/CD. To summarize, CI/CD helps us with:

  • Ensuring higher code quality
  • Obtaining continuous feedback
  • Shipping efficiently (with higher confidence)

How do we use CI/CD?

[Figure: CI/CD at Dgraph]

Our CI/CD use-cases mostly revolve around the following areas: Building for Multiple Architectures (amd64/arm64), Testing (Unit/Integration & Load), Deployments, Security Audits, Code Linting, Benchmarking & Code Coverage. We will cover some of these topics in future blog posts.

Old Setup (TeamCity)

As described above, Dgraph Labs Inc ran a self-managed on-prem TeamCity setup for CI/CD in the past. The setup looked similar to the image below.

[Figure: TeamCity Architecture (image source credits)]

This setup was quite difficult for a small team like ours to manage, monitor & keep highly available. The work spanned infrastructure setup, ensuring the right security posture & instrumenting observability on these systems. There were 3 issues here for us: Compute Costs, Maintenance Efforts & Configuration Time.

Firstly, the Compute Costs were additive because we needed not only Server & Agent compute machines, but also an Observability Stack (& instrumentation) for these critical systems. Secondly, the Maintenance Effort spent on the issues we encountered (like security patching, upgrades, disk issues, inconsistent test-result reporting, etc.) was taxing the team and taking time away from our development cycles. Lastly, Configuration Time was also a problem because job configurations lived outside our codebase (on the Server), VCS configurations for new repos needed manual setup & we had to write custom install/cleanup scripts for basic setup tasks in the job definitions.

As a result, Compute Costs ⬆⬆, Maintenance Efforts ⬆⬆ & Configuration Time ⬆⬆ were all high. This led us to rethink how we could transition to a new system that solved these issues and supported both Public & Private repositories.

NOTE: TeamCity is a great product. As explained above, time & ease-of-use were the driving factors that made us transition away.

New Setup (GitHub Actions)

Our research led us to GitHub Actions. Since we were already on GitHub for our VCS, it was natural to explore it further. We were quite content with what it had to offer, as it came with immediate benefits. The notable architectural differences were that the Server is fully managed (unlike the previous setup) & the Runners (a.k.a. Agents) are semi-managed by GitHub. Below we show an example CD run for a multi-architecture release at Dgraph Labs Inc on GitHub Actions.

[Figure: GitHub Actions CD steps for Dgraph, everything well integrated into the GitHub ecosystem]

Firstly, the Compute Costs were lower because we only had to manage the Self-Hosted Runners. We made use of the free GitHub-hosted runners wherever possible; for jobs that required higher resource specs, we had the option to run them on Self-Hosted Runners. Secondly, the Maintenance Efforts reduced because we had fewer components to manage. Although GitHub had done a great job simplifying the Runner setup steps, the Runners still needed manual management, which remained a concern. Lastly, the Configuration Time reduced drastically because of the Actions Marketplace, which provides pre-templated tasks to perform pre-setup and post-cleanup on the Runners. And with this transition, the code & job definitions lived together (unlike our old setup).

As a result, our Compute Costs reduced, Maintenance Efforts reduced & Configuration Time was a great win. There was still room to improve, because we were manually attaching & detaching Runners on an as-needed basis. There was another problem too: “Idle Runners” were leading to wasted resource spend.

Dynamic AutoScaling Of GitHub Runners

As described in the previous section, our remaining problems were limited to Compute Costs (Idle Runners) & Maintenance Efforts (manual Runner attach/detach). We started exploring potential solutions for “Dynamic AutoScaling of GitHub Runners”. We found 2 existing solutions: ARC & the Philips-Scalable-Runner. ARC was specific to the container ecosystem and did not apply to us. The Philips-Scalable-Runner required too many components to facilitate this. So this led us to build our in-house solution.

Our design needed to solve for the following:

  • minimal AWS service use
  • support different labels (i.e. Runner types like arm64 / amd64)
  • support different repositories

This led us to the architecture below.

[Figure: Dynamic AutoScaling of GitHub Runners]

There are 3 logical pieces in this architecture, and they are:

  • VM Images: We bake custom AMIs with a specialized startup script in them. The startup script has logic to connect to the SSM Parameter Store & read its configuration at startup. When the AMI comes up as an EC2 instance, it reads its config from the SSM Parameter Store & self-configures itself as a GitHub Runner to service our Jobs.
  • SSM: We use the SSM Parameter Store to store Runner configurations. This is essentially a KV store for configs. We store the Runner configurations as Values, with the EC2 instance as the Key delimiter.
  • Orchestrator: This is essentially our Controller (written in Python). The Orchestrator monitors GitHub events and has logic to dynamically Scale Up or Down based on the Job Queue & the available Runner count. It has hooks into GitHub & AWS (SSM Parameter Store & EC2) to facilitate this process. In the Scale Up phase, the Orchestrator creates an SSM Parameter Store entry and follows it up by creating an EC2 instance from the AMI (through Launch Templates). In the Scale Down phase, the Orchestrator deletes the SSM Parameter Store entry and follows it up by terminating the created EC2 instance. A simplified sketch of this flow is shown right after this list.
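To make the flow more concrete, here is a minimal Python sketch (using boto3 & the GitHub REST API) of what the Scale Up & Scale Down phases could look like. The parameter naming scheme, repository slug, labels, tagging approach & helper names are illustrative assumptions, not our actual implementation, which handles label matching, Runner de-registration, retries & many more edge cases.

```python
import json
import uuid

import boto3      # AWS SDK for Python (SSM Parameter Store & EC2)
import requests   # plain HTTPS calls to the GitHub REST API

ssm = boto3.client("ssm")
ec2 = boto3.client("ec2")

GITHUB_API = "https://api.github.com"
REPO = "<org>/<repo>"                          # hypothetical repository slug
HEADERS = {
    "Authorization": "Bearer <github-token>",  # token with Actions read access
    "Accept": "application/vnd.github+json",
}


def queued_run_count() -> int:
    """Return the number of workflow runs currently waiting in the Job Queue."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{REPO}/actions/runs",
        params={"status": "queued"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]


def scale_up(labels: list[str], launch_template: str) -> str:
    """Create the Runner's SSM config entry, then launch an EC2 instance from the AMI."""
    runner_id = f"runner-{uuid.uuid4().hex[:8]}"

    # 1. Store the Runner configuration. The startup script baked into the AMI
    #    reads this entry at boot and self-configures the Runner with GitHub.
    ssm.put_parameter(
        Name=f"/gha-runners/{runner_id}",       # hypothetical key scheme
        Value=json.dumps({"repo": REPO, "labels": labels}),
        Type="SecureString",
        Overwrite=True,
    )

    # 2. Launch the instance via a Launch Template, tagging it so the instance
    #    (and the Orchestrator) can locate its configuration entry later.
    instance = ec2.run_instances(
        LaunchTemplate={"LaunchTemplateName": launch_template},
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "runner-id", "Value": runner_id}],
        }],
    )["Instances"][0]
    return instance["InstanceId"]


def scale_down(runner_id: str, instance_id: str) -> None:
    """Delete the SSM entry, then terminate the now-idle Runner instance."""
    ssm.delete_parameter(Name=f"/gha-runners/{runner_id}")
    ec2.terminate_instances(InstanceIds=[instance_id])
```

Keeping the Runner configuration in the SSM Parameter Store is what lets a single AMI serve different repositories & Runner labels: the image stays generic, and the per-instance parameter decides which repository & labels it registers with.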

Note: We are considering Open Sourcing this project. For that reason, we have only given an overview and skipped the full implementation details. If you are interested in discussing further, do hit us up. We would love to partner.

Financial Analysis

[Figure: CI/CD Cost Graph]

We enabled “Dynamic AutoScaling of GitHub Runners” on 2023-Jan-06 and have seen a drastic reduction in our spend since then. Not only has it saved us money, it has also saved us Engineering time by eliminating Maintenance Efforts. The “Idle Runner” problem was real, and it would have affected us in different ways had we not addressed it.

High Level Analysis

  • Before AutoScaling
    • our costs increased as we got closer to our release cycles because we attached more Runners
    • increase in costs was primarily because of beefy Idle Runners
  • After enabling AutoScaling
    • costs shrunk drastically
    • our weekend costs touched ~$0 (as those are break days)
    • we serviced almost triple the runs compared to our previous release
    • no manual attach/detach of Runners required by Engineers

Average Daily Runner Cost (dropped by ~87%)

  • Before AutoScaling ~$63.36/day (or $1,900.8/month) ⬆⬆
  • After enabling AutoScaling ~$8.12/day (or ~$243.6/month) ⬇⬇
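For reference, the ~87% figure follows directly from the daily averages above: (63.36 − 8.12) / 63.36 ≈ 0.87, and the monthly figures assume a 30-day month (63.36 × 30 = 1,900.80 and 8.12 × 30 = 243.60).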

Average Per PR cost (after AutoScaling)

NOTE: We will continue to see more savings over time (as the savings compound here).

Conclusion

To conclude, the table below shows our OKRs and how we went about solving for them. The last column is where we are today in our journey.

OKRs                | Old Setup (TeamCity) | New Setup (GitHub Actions)                  | New Setup (GitHub Actions w/ AutoScaling)
Compute Costs       | High ⬆⬆              | Lower                                       | Drastically lower (~87% drop)
Maintenance Efforts | High ⬆⬆              | Reduced (manual Runner management remained) | Minimal (no manual attach/detach)
Configuration Time  | High ⬆⬆              | Drastically reduced                         | Drastically reduced

Acknowledgements

On behalf of the project team, I would like to express our gratitude to all the internal contributors who have played a crucial role in bringing this project to fruition. Their dedication and hard work have been instrumental in making this idea a reality.

A special thanks goes to Kevin for his efforts in implementing and testing the initial concept. His efforts ensured that our vision could be transformed into a tangible execution.

We would also like to acknowledge the contributions of Aditya (co-author), Anurag, Dilip & Joshua. Their valuable insights, expertise, and collaboration have greatly enriched the project.