Open Sourcing our Drone CI/CD CloudWatch Auto Scaler

At AssemblyAI, we use Drone as our primary CI/CD tool. It's dead simple to set up and operate, which frees us up to build out our product.

Being an AI company, we have a fair amount of workloads that require a GPU. GPU instances are expensive, so in the past our Drone workflow for GPU-based tests looked like this:

1. aws ec2 start-instances --instance-ids some-hardcoded-instance-id-with-an-elastic-ip
2. ssh user@elastic-ip
3. run the test suite
4. aws ec2 stop-instances --instance-ids hardcoded-instance-id

Some of our services, like our API, don't need a GPU. For those services, we run a small fleet of T class spot instances to handle the CI/CD workload. We call these instances the "standard" workers.

Recently we decided to figure out how to build a cost-effective, easily scalable Drone worker fleet for our GPU instances. Our "standard" worker fleet is an EC2 autoscaling group that, prior to drone-queue-cloudwatch, didn't actually scale. The Launch Template's userdata for our "standard" workers tells new instances to retrieve a base64-encoded docker-compose file from Parameter Store, save it to the filesystem, then run docker-compose -f /path/to/file up -d. This has worked really well for us so far.
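As a rough sketch, the user data for that kind of bootstrap might look like the following. The parameter name and file path here are placeholders, not our actual values:

```bash
#!/bin/bash
# Illustrative user-data sketch: fetch the base64-encoded compose file
# from Parameter Store, decode it to disk, and start the worker containers.
aws ssm get-parameter \
    --name "/drone/standard-worker/docker-compose" \
    --with-decryption \
    --query 'Parameter.Value' \
    --output text | base64 -d > /opt/drone/docker-compose.yml

docker-compose -f /opt/drone/docker-compose.yml up -d
```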

First Steps

To move away from our SSH-based workflow, we needed to use Drone's Docker runners.

We first looked at the Drone autoscaler. It's a very solid piece of software but didn't quite fit our needs. We didn't want to run another container if we could avoid it. In our case, we'd have to run an autoscaler container per worker "type" (GPU, non-GPU, etc). The Drone autoscaler also SSHs into instances it launches in order to configure them, which creates another layer of complexity.

The Drone autoscaler runs in the background and polls the build API endpoint to see if it needs to add new instances. We decided we could do something similar, tailored to our use case.


When deciding which metrics to scale on, we realized that scaling on CPU and/or memory wouldn't be effective. What if we had 20 builds queued up behind a few long-running but low-intensity scripts? The only reliable metric is build queue depth. We also decided to publish metrics for running builds in order to control scale-in: we didn't want running builds to be interrupted by a scale-in event.
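A minimal sketch of that idea, assuming a simplified build shape (the real Drone API payload differs):

```python
from collections import Counter

def tally_builds(builds):
    """Count pending and running builds per worker type.

    `builds` is a list of dicts with a "status" field ("pending" or
    "running") and a "node_labels" dict -- a simplified stand-in for
    what the Drone API actually returns.
    """
    pending = Counter()
    running = Counter()
    for build in builds:
        worker = build["node_labels"].get("worker-type", "standard")
        if build["status"] == "pending":
            pending[worker] += 1
        elif build["status"] == "running":
            running[worker] += 1
    return pending, running

builds = [
    {"status": "pending", "node_labels": {"worker-type": "gpu"}},
    {"status": "pending", "node_labels": {"worker-type": "gpu"}},
    {"status": "running", "node_labels": {}},
]
pending, running = tally_builds(builds)
print(pending["gpu"], running["standard"])  # 2 1
```

Queue depth (the pending count) drives scale-out, while the running count guards against scaling in under active builds.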

One of our goals is to reduce our operational overhead. We already run our Drone workers in autoscaling groups, so we decided to send Drone build stats to CloudWatch and let CloudWatch handle scaling.

As for the polling process itself, Lambda was an easy choice. No managing instances or containers, no worrying about an instance becoming unhealthy, no OS patching, etc. All we had to do was set up an EventBridge cron to invoke our function every minute.
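A hedged sketch of that schedule in Terraform (resource names and the function reference are placeholders; the terraform/ directory in the repo has the real example):

```hcl
resource "aws_cloudwatch_event_rule" "poll_drone" {
  name                = "drone-queue-cloudwatch-poll"
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "poll_drone" {
  rule = aws_cloudwatch_event_rule.poll_drone.name
  arn  = aws_lambda_function.drone_queue_cloudwatch.arn
}

resource "aws_lambda_permission" "allow_eventbridge" {
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.drone_queue_cloudwatch.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.poll_drone.arn
}
```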

On the Drone side, we use Node Routing to define which worker group handles a given build. When drone-queue-cloudwatch polls the Drone API, it turns each build's node labels into CloudWatch metric dimensions. From there, it's pretty simple to set up scaling policies.
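For illustration, converting node labels into the dimension list that CloudWatch's PutMetricData expects could look like this (the label names are made up; our actual labels differ):

```python
def labels_to_dimensions(node_labels):
    """Convert a build's node labels into CloudWatch metric dimensions.

    CloudWatch expects dimensions as a list of {"Name", "Value"} dicts;
    sorting keeps the dimension set stable across polls so the metrics
    land on the same time series.
    """
    return [
        {"Name": name, "Value": value}
        for name, value in sorted(node_labels.items())
    ]

dims = labels_to_dimensions({"worker-type": "gpu", "region": "us-west-2"})
print(dims)
# [{'Name': 'region', 'Value': 'us-west-2'}, {'Name': 'worker-type', 'Value': 'gpu'}]
```

With boto3, a list like this slots straight into cloudwatch.put_metric_data(Namespace=..., MetricData=[{"MetricName": ..., "Dimensions": dims, "Value": ...}]); the namespace and metric names would be whatever your deployment uses.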

To reduce cost, we decided against publishing metrics for queues with 0 running or pending builds.

Using the Tool

You can find the code open sourced here:

Once you have cloned the repo, you can run bash build/ to build the binary and have it zipped up for you. Alternatively, you can access published artifacts in the aai-oss S3 bucket in us-west-2.

  • To use a specific version of the code, use the drone-queue-cloudwatch/<commit sha>.zip object key
  • To use the latest version, use the drone-queue-cloudwatch/ object key

There is example Terraform code in the terraform/ directory that you can reference for your own implementation.


GPUs and docker-compose

In order for Docker Compose to access GPUs, your Docker daemon config must specify the default-runtime as nvidia. The following config works for us:

  "default-runtime": "nvidia",
  "runtimes": {
      "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []

GitHub Actions

We like GitHub Actions for self-contained test suites. In fact, GitHub Actions performs the CI for drone-queue-cloudwatch. For tests that require authenticated access to resources, we prefer Drone. For example, any pipeline that must access AWS resources runs through Drone; this way, we can leverage IAM roles.


Thank you to my friend Pablo Clemente Perez for evangelizing Drone and getting me excited to work with it!
