June 23, 2024
Michael Bachand
The Airbnb Tech Blog

How Airbnb leverages AWS, Packer, and Terraform to update macOS on hundreds of CI machines in hours instead of days

A person leans over the edge of a balcony. In the background are trees.

By: Michael Bachand, Xianwen Chen

At Airbnb, we run a comprehensive suite of continuous integration (CI) jobs before each iOS code change is merged. These jobs ensure that the main branch remains stable by executing critical developer workflows like building the iOS application and running tests. We also schedule jobs that perform periodic tasks like reporting metrics and uploading artifacts.

Many of our iOS CI jobs execute on Macs, which enables running developer tools provided by Apple. CI jobs for all other platforms at Airbnb execute in containers on Amazon EC2 Linux instances. To satisfy the macOS requirement of iOS CI jobs, we have historically maintained alternate CI infrastructure outside of AWS specifically for iOS development. The introduction of Macs to AWS presented an opportunity for us to rethink our approach to iOS CI.

We designed the next iteration of our iOS CI system in late 2021, completed the migration to the new system in mid-2022, and polished the system through the end of 2022. CI for iOS and all other platforms at Airbnb already leveraged Buildkite for dispatching jobs. Now, we deploy iOS CI infrastructure to AWS using Terraform, which helps align CI for iOS with CI for other platforms at Airbnb.

In this article, we’re excited to share with you details of the flexible and easy-to-maintain iOS CI system that we’ve implemented with Amazon EC2 Mac instances.

Historically we ran Airbnb iOS CI on physical Macs. We enjoyed the speed of running CI without virtualization, but we paid a substantial maintenance cost to run CI jobs directly on physical hardware. An iOS infrastructure engineer individually logged into over 300 machines to perform administrative tasks like enrolling the Mac in our MDM (Mobile Device Management) tool and upgrading macOS. Manual maintenance requirements limited the scalability of the fleet and consumed engineer time that could be better spent on higher-value projects.

A screenshot of a macOS desktop with many open VNC sessions to remote Mac machines.
An engineer remotely updates multiple physical Macs to macOS Big Sur. EC2 macOS AMIs have eliminated this manual work.

Our old CI machines were rarely restarted and too often drifted into a bad state. When this happened, the best-case scenario was that an engineer could log into the machine, diagnose what configuration drift was causing issues, and manually bring the machine back to a good state. More commonly, we shut down the corrupted machine so that it could no longer accept new CI jobs. Periodically, we asked the vendor who managed our physical Macs to restore the corrupted machines to a clean installation of macOS. When the machines eventually came back online, we manually re-enrolled each machine in MDM to bring our fleet back to its full capacity.

Updating to a new version of Xcode was quite error-prone as well. We strive to roll out new Xcode versions regularly, since many iOS engineers at Airbnb follow Swift and Xcode releases closely and are eager to adopt new language features and IDE improvements. However, the fixed capacity of our Mac fleet made it difficult for us to verify iOS CI jobs thoroughly against new versions; any machine allocated to testing a new version of Xcode could no longer accept CI jobs from the previous Xcode version. The risk of tackling each Xcode update was increased by the fact that rolling back to a previous version of Xcode across our fleet was not practical.

When evaluating AWS, we were excited by the possibility of launching instances from Amazon Machine Images (AMIs). An AMI is a snapshot of an instance’s state, including its file system contents and other metadata. Amazon provides base AMIs for each macOS version and allows customers to create their own AMIs from running instances.

AMIs allow us to add new instances to our fleet without human intervention. An EC2 Mac bare-metal instance launched from a properly configured AMI is immediately ready to accept new work after initialization. When updating macOS, we no longer need to log into every machine in our fleet. Instead, we log into a single instance launched from the Amazon base AMI for the new macOS version. After performing a handful of manual configuration steps, like enabling automatic login, we create an Airbnb base AMI from that instance.

Initially, we powered our EC2 Mac fleet with manually created AMIs. An engineer would configure a single instance and create an AMI from that instance’s state. Then we could launch any number of additional instances from that AMI. This was a major improvement over managing physical machines, since we could spin up an entire fleet of identical instances after successfully configuring only a single instance.

Now, we build AMIs using Packer. Packer programmatically launches and configures an EC2 instance using a template defined in the HashiCorp Configuration Language (HCL). Packer then creates an AMI from the configured EC2 instance. A Ruby wrapper script invokes Packer consistently and performs helpful validations like checking that the user has assumed the correct AWS role. We check the HCL template code into source control, and all changes to our Packer template and companion scripts are made via GitHub pull requests.
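The wrapper described above is written in Ruby and is not reproduced in this article. Purely as an illustration, the Python sketch below shows the kind of pre-flight check such a wrapper might perform: it confirms that the caller’s identity ARN is an assumed-role ARN for an expected role, then assembles a `packer build` command. The role name `ios-ci-ami-builder`, the template file name, and the `xcode_version` variable are hypothetical, and fetching the caller ARN (for example with `aws sts get-caller-identity`) is left outside the sketch.

```python
import re
import subprocess
from typing import Dict, List, Optional

# Hypothetical role name; the article does not name the real role.
EXPECTED_ROLE = "ios-ci-ami-builder"

# STS reports assumed roles as arn:aws:sts::<account>:assumed-role/<role>/<session>.
_ASSUMED_ROLE_ARN = re.compile(
    r"^arn:aws[\w-]*:sts::\d{12}:assumed-role/([^/]+)/[^/]+$"
)


def assumed_role_name(caller_arn: str) -> Optional[str]:
    """Return the role name if caller_arn is an assumed-role ARN, else None."""
    match = _ASSUMED_ROLE_ARN.match(caller_arn)
    return match.group(1) if match else None


def packer_command(template: str, variables: Dict[str, str]) -> List[str]:
    """Build a `packer build` invocation with -var flags in a stable order."""
    command = ["packer", "build"]
    for key in sorted(variables):
        command += ["-var", f"{key}={variables[key]}"]
    return command + [template]


def build_ami(caller_arn: str, template: str, variables: Dict[str, str]) -> None:
    """Refuse to run Packer unless the expected role has been assumed."""
    role = assumed_role_name(caller_arn)
    if role != EXPECTED_ROLE:
        raise PermissionError(f"expected role {EXPECTED_ROLE!r}, got {role!r}")
    # Only reached after validation; requires the packer CLI to be installed.
    subprocess.run(packer_command(template, variables), check=True)
```

Failing before Packer starts is the point of the check: a build launched with the wrong credentials would otherwise fail partway through, after an EC2 Mac instance had already been requested.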

Timing statistics for creating a new Arm AMI with Packer. This command ran on an EC2 mac2.metal instance.

We initially ran Packer from developer laptops, but the laptop needed to be awake and online for the duration of the Packer build. Eventually, we created a dedicated pipeline to build AMIs in the cloud. A developer can trigger a new build on this pipeline with a few clicks. A successful build will produce freshly baked and verified AMIs for both the x86 and Arm (Apple Silicon) CPU architectures within a few hours.

Our new CI system leveraging these AMIs consists of many environments, each of which can be managed independently. The central AWS component of each CI environment is an Auto Scaling group, which is responsible for launching the EC2 Mac instances. The number of instances in the Auto Scaling group is determined by the desired capacity property on the group and is bounded by the group’s minimum and maximum size properties.

An Auto Scaling group creates new instances using a launch template. The launch template specifies the configuration of each instance, including the AMI, and allows a “user data” script to run when the instance is launched. Launch templates can be versioned, and each Auto Scaling group is configured to launch instances from a specific version of its launch template.

Although the introduction of environments has made our CI topology more complex, we find that complexity manageable when our infrastructure is defined in code. All of our AWS infrastructure for iOS CI is specified in Terraform code that we check into source control. Each time we merge a pull request related to iOS CI, Terraform Enterprise will automatically apply our changes to our AWS account. We have defined a Terraform module that we can call whenever we want to instantiate a new CI environment.

Calling a Terraform module to create a CI environment of Arm Mac minis with Xcode 14.2 installed.

An internal scaling service manages the desired capacity of each environment’s Auto Scaling group. This service, a modified fork of buildkite-agent-scaler, increases the desired capacity of an environment’s Auto Scaling group as CI job volume for that environment increases. We specify a maximum number of instances for each CI environment in part because On-Demand EC2 Mac Dedicated Hosts currently have a minimum host allocation and billing duration of 24 hours.
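The scaling behavior described above can be sketched as a small pure function. This is a minimal illustration, not the logic of buildkite-agent-scaler or of Airbnb’s fork: the function name, the `jobs_per_instance` parameter, and the scale-up-only rule are assumptions made for the example.

```python
import math


def next_desired_capacity(
    current: int,
    pending_jobs: int,
    jobs_per_instance: int,
    min_size: int,
    max_size: int,
) -> int:
    """Return the desired capacity a scale-up-only service would request."""
    if jobs_per_instance < 1:
        raise ValueError("jobs_per_instance must be at least 1")
    # Instances needed to cover the jobs currently waiting or running.
    needed = math.ceil(pending_jobs / jobs_per_instance)
    # Assumption: this service never lowers capacity on its own.
    target = max(current, needed)
    # Keep the request inside the group's minimum and maximum size.
    return max(min_size, min(target, max_size))
```

With a maximum size of 8, a burst of 10 jobs requests 8 instances rather than 10, which is how a per-environment maximum caps the number of Dedicated Hosts that can be allocated and billed.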

A diagram showing the relationship between CI environments, the scaling service, and Buildkite.
A sketch of Airbnb’s new iOS CI system.

Each CI environment has a unique Buildkite queue name. Individual CI jobs can target instances in a specific environment by specifying the corresponding queue name. Jobs will fall back to the default CI environment when no queue name is explicitly specified.
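As a rough model of this routing rule, the sketch below maps queue names to environments and falls back to a default when a job names no queue. The queue names, the `Environment` fields, and the choice of default are all hypothetical; in the real system the routing is performed by Buildkite, not by application code like this.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Environment:
    queue: str
    arch: str
    xcode: str


# Hypothetical environments: one per CPU architecture and Xcode version.
ENVIRONMENTS: Dict[str, Environment] = {
    env.queue: env
    for env in (
        Environment("ios-arm-xcode-14.2", "arm64", "14.2"),
        Environment("ios-x86-xcode-14.2", "x86_64", "14.2"),
        Environment("ios-arm-xcode-14.3-staging", "arm64", "14.3"),
    )
}

DEFAULT_QUEUE = "ios-arm-xcode-14.2"  # hypothetical default environment


def environment_for(job_queue: Optional[str]) -> Environment:
    """Resolve a job's queue name, using the default when none is given."""
    queue = job_queue or DEFAULT_QUEUE
    # This sketch raises KeyError for a queue name it does not know about.
    return ENVIRONMENTS[queue]
```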

CI Environments Are Highly Flexible

With this new Terraform setup, we are able to support an arbitrary number of CI environments with minimal overhead. We create a new CI environment per CPU architecture and version of Xcode. We can even duplicate these environments across multiple versions of macOS when performing an operating system update across our fleet. We use dedicated staging environments to test CI jobs on instances launched from a new AMI before we roll out that AMI broadly.

When we are no longer regularly using a CI environment, we can specify a minimum capacity of zero when calling the Terraform module, which will set the same value on the underlying Auto Scaling group. Then the Auto Scaling group will only launch instances when its desired capacity is increased by the scaling service. In practice, we tend to delete older environments from our Terraform code. However, even once an environment has been wound down, reinstating that environment is as simple as reverting a few commits in Git and redeploying the scaling service.

Rotation of Instances Increases CI Consistency

To minimize the opportunity for EC2 instances to drift, we terminate all instances each night and replace them daily. This way, we can be confident that our CI fleet is in a known good state at the start of each day.

When an instance is terminated, the underlying Dedicated Host is scrubbed before a new instance can be launched on that host. We terminate instances at a time when CI demand is low to allow the EC2 Mac scrubbing process to complete before we need to launch fresh instances on the same hosts. When an instance terminates itself overnight, it will decrement the desired capacity of the Auto Scaling group to which it belongs. As engineers start pushing commits the next day, the scaling service will increment the desired capacity on the appropriate Auto Scaling groups, causing new instances to be launched.
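The nightly cycle above can be illustrated with a toy in-memory model. This is a sketch for intuition only, not an AWS client: the class, its method names, and the instance IDs are invented for the example. On AWS, one API with the decrementing behavior is TerminateInstanceInAutoScalingGroup with ShouldDecrementDesiredCapacity set to true, though the article does not say which mechanism is used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AutoScalingGroup:
    """Toy model of an Auto Scaling group's capacity bookkeeping."""

    max_size: int
    desired_capacity: int = 0
    instances: List[str] = field(default_factory=list)
    _launched: int = 0

    def reconcile(self) -> None:
        """Launch instances until the running count matches desired capacity."""
        while len(self.instances) < self.desired_capacity:
            self._launched += 1
            self.instances.append(f"i-{self._launched:04d}")

    def terminate_self(self, instance_id: str) -> None:
        """An instance removes itself and lowers desired capacity by one,
        so the group does not immediately launch a replacement."""
        self.instances.remove(instance_id)
        self.desired_capacity -= 1

    def scale_up(self, by: int = 1) -> None:
        """The scaling service raises desired capacity, capped at max_size."""
        self.desired_capacity = min(self.max_size, self.desired_capacity + by)
        self.reconcile()


group = AutoScalingGroup(max_size=3)
group.scale_up(2)                      # daytime demand: two instances running
for instance_id in list(group.instances):
    group.terminate_self(instance_id)  # overnight: fleet drains to zero
group.scale_up(1)                      # next morning: a fresh instance launches
```

Because termination lowers desired capacity, the group stays empty overnight, which leaves time for host scrubbing, and every instance serving jobs the next day is newly launched.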

A chart showing CI capacity relative to job volume over more than one week.
Instances terminate themselves overnight. We reduce our maximum capacity over weekends. The spikes in job volume that increased capacity on the 2nd, 6th, and 7th have been hidden by smoothing in the chart.

When an instance does experience configuration drift, we can disconnect that instance from Buildkite with one click. The instance will remain running but will no longer accept new CI jobs. An engineer can log into the instance to investigate its state until the instance is eventually terminated at the end of the day. To keep overall CI capacity stable, we can manually add an additional instance to our fleet, or a replacement will be launched automatically if we terminate the instance early.

We Ship Xcode Versions More Quickly

We appreciate the new capabilities of our upgraded CI system. We can lease additional Dedicated Hosts from Amazon on demand to weather unexpected spikes in CI usage and to test software updates thoroughly. We roll out new AMIs gradually and can roll back painlessly if we encounter unexpected issues.

A chart showing CI capacity relative to job volume for two simultaneous versions of Xcode.
CI jobs shift from Xcode 14.1 to 14.2. On the 24th, we temporarily increased 14.2 capacity to accommodate a spike in jobs.

Together, these capabilities give Airbnb iOS developers access to Swift language features and Xcode IDE improvements more quickly. In fact, with the tailwind of our new CI system, we have seen the pace at which we update Xcode increase by over 20%. As of the time of writing, we have internally rolled out all available major and minor versions of Xcode 14 (14.0–14.3) as they have been released.

Our new CI system ran over 10 million minutes of CI jobs in the last three months of 2022. After upgrading to EC2, we spend meaningfully fewer hours on maintenance despite a growing codebase and consistently high job volume. Our newfound ability to scale CI to meet the evolving needs of the Airbnb iOS community justifies the increased complexity of the rebuilt system.

After the migration to AWS, iOS CI benefits more from shared infrastructure that is already being used successfully within Airbnb. For example, the new iOS CI architecture enabled us to avoid implementing an iOS-specific solution for automatically scaling capacity. Instead, we leverage the aforementioned fork of buildkite-agent-scaler that Airbnb engineers had already converted to an internal Airbnb service complete with a dedicated deployment pipeline. Additionally, we used existing Terraform modules that are maintained by other teams to integrate with IAM and SSM.

We have found that EC2 Mac instances launched from custom AMIs provide many of the benefits of virtualization without the performance penalty of executing inside a virtual machine. We consider AWS, Packer, and Terraform to be essential technologies for building a flexible CI system for large-scale iOS development in 2023.