As of a few weeks ago, Chris AtLee and his half of Mozilla Release engineering report to me.
We are working on drastically improving efficiency of our continuous integration infrastructure.
For 2014 our focus is on cost-efficiency; developer efficiency is also a priority. Mozilla has a few thousand (around 3000, including various ARM boards) real machines dedicated to building & testing our products. There are an additional ~500-1000 virtual machines on Amazon EC2. We have been spending around $115K per month on Amazon AWS infra for the past 4 months (e.g. Linux/B2G/Android builds & unit tests; Linux perf tests run on real hw). It’s harder to get numbers for our internal costs, but a conservative estimate would be to assume that Amazon AWS is 1/3rd of our spend, given that we test Windows, Mac, Android, etc. in-house.
The number of c-i hours goes up every month, so it is crucial to get a grip on our infrastructure efficiency in terms of cost per job.
My cloud target is to get our current Amazon EC2 workload to under $30K/month by the end of the year. This will give us room to move Mac builds, and Windows unit-tests + compiles into the cloud.
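As a rough check on the figures above, the arithmetic can be written out. This is only a sketch: the variable names are mine, and the one-third share is the conservative estimate stated above, not a measured number.

```python
# Figures taken from the text above; variable names are illustrative.
aws_monthly = 115_000        # current AWS spend, USD per month
aws_share_of_total = 1 / 3   # assumption: AWS is about 1/3 of total spend
target_monthly = 30_000      # year-end target for the current EC2 workload

# If AWS is one third of the total, the total is three times the AWS bill.
estimated_total = aws_monthly / aws_share_of_total
estimated_in_house = estimated_total - aws_monthly

# Fraction of the current EC2 bill that has to go away to hit the target.
required_reduction = 1 - target_monthly / aws_monthly

print(f"Estimated total spend:    ${estimated_total:,.0f}/month")
print(f"Estimated in-house spend: ${estimated_in_house:,.0f}/month")
print(f"Reduction needed on EC2:  {required_reduction:.0%}")
```

Under that assumption the total comes to roughly $345K per month, and the EC2 target implies cutting the current workload's bill by about 74%.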
Where to save money? One can break down the c-i process into 3+1 distinct optimization areas:
1. Building & Testing Workloads
2. Scheduling Logic
3. Provisioning Bare Metal or Virtual Machines
Items 1-3 are ordered in terms of dependencies: in theory, #1 does not care how it’s scheduled (#2) or what machines it runs on (#3). #3 is what we get billed for, and it is largely a result of #1 and #2.
#4 is mainly a human cost in that we don’t have a lot of insight into how efficiently our jobs run. Making our systems more transparent should make optimization easier.
In terms of work:
- We made great progress with reducing build times in 2013. Compilation is the biggest contributor to our EC2 bill, and we have more work to do there. We’ll focus on optimizing tests to run more efficiently once build costs are on par with test costs.
- The way we use Buildbot results in a lot of idle time on machines. There is a lot of room for improvement via containerization and running more jobs in parallel. We are looking at potentially replacing Buildbot with a simpler and more cloud-friendly framework (TaskCluster).
- We saved $30K in January by tweaking what kind of EC2 nodes we use. We switched to more cost-efficient on-demand nodes (m3.xlarge -> c3.xlarge, m1.medium -> m3.medium) and started running try jobs and tests on EC2 spot nodes.
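To make the idle-time point concrete, here is a minimal sketch of how utilization feeds into cost per job. The hourly price, job length and utilization values are made-up placeholders, not Mozilla numbers, and `cost_per_job` is a hypothetical helper rather than anything in our tooling.

```python
def cost_per_job(hourly_price: float, job_hours: float, utilization: float) -> float:
    """Effective cost of one job when the machine is busy only part of the time.

    utilization is the fraction of billed hours spent running jobs (0 < u <= 1).
    Idle hours are still billed, so their cost is spread over the jobs that ran.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price * job_hours / utilization


# Placeholder values for illustration only.
price = 0.20      # USD per machine-hour (hypothetical)
job_hours = 0.5   # a 30-minute job (hypothetical)

for utilization in (0.25, 0.5, 1.0):
    print(utilization, round(cost_per_job(price, job_hours, utilization), 4))
```

In this simple model, doubling utilization halves the cost per job, which is why packing more jobs onto each machine can matter as much as the price of the node.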
Moving to Public Clouds
We will be moving all of our Mac/Win/Linux (other than perf-testing) infrastructure into the cloud. Windows, Linux and most Mac builds and tests should run on hardware outside of Mozilla data centers by the end of 2014. Cloud infra like EC2 is expensive for inefficient workloads. We will be optimizing workloads locally until they can run on a public cloud in a cost-efficient manner (e.g. on VMs with little idle capacity and few redundant tasks). I expect cloud infra to save Mozilla both developer time and money.
Private Cloud Functionality
For obvious reasons, we are stuck doing performance tests and many Mac workloads on our bare metal. Unfortunately, suitable bare metal public clouds do not exist yet. We would like to approach EC2-like engineering efficiencies by reconfiguring machines in our data center as a private bare metal cloud based on OpenStack. We will be replacing Bugzilla + human intervention with APIs for reimaging, slave loans, etc. Ideally all of our PC, Mac and ARM hardware, onsite and offsite, would be controlled by the same API. I hope that in addition to making our hardware ops happier, opening up machine configuration via self-serve access control will greatly increase our ability to iterate on improving machine configurations, bringing new configurations online, etc.
Amazon’s spot infra (http://aws.amazon.com/ec2/purchasing-options/spot-instances/) is a great idea for minimizing idle time on machines. I would like to have spot workloads in public and private clouds. Try jobs and fuzzing cope well with being interrupted and are a good fit for soaking up spare capacity.
Developers with keys
The releng team is small. Unless you have a critical request, expect to hear from us a “no, we are too busy” followed by “but here are the keys to our test public/private clouds, go do what you want and we’ll help you deploy it”. I hope that Mozilla developers will appreciate being able to help themselves instead of blocking on busy intermediaries.
My strategy is largely inspired by this Forrester analysis. Infrastructure devop teams should strive to enable other teams to self-serve. Our role is to help developers do things right, not to be gatekeepers or do work that developers find unsatisfying.
How to Get Involved
In 2014 I’d like to move our c-i from a 90s-style architecture with a few modern cloud components to a modern cloud-oriented architecture. The following areas might be fun to help with:
- We need to be more data-driven. We need to process, visualize and correlate various logs to direct optimization decisions. We have build logs, buildbot logs, AWS usage logs, AWS Spot price logs, etc.
- Bare metal virtualization with OpenStack for Mac, Windows and ARM boards is an open problem.
- Containerizing builds, tests via Docker/LXC.
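For the data-driven item above, a first step could look something like the sketch below: join job records with per-instance prices and total the cost by job type. The records, field names and prices here are invented placeholders; real buildbot and AWS logs have their own formats and would need parsing first.

```python
from collections import defaultdict

# Hypothetical, hand-written records standing in for parsed logs.
jobs = [
    {"type": "build", "instance": "c3.xlarge", "hours": 1.0},
    {"type": "test", "instance": "m3.medium", "hours": 0.5},
    {"type": "build", "instance": "c3.xlarge", "hours": 2.0},
]
# Placeholder prices in USD per hour, not actual AWS rates.
price_per_hour = {"c3.xlarge": 0.25, "m3.medium": 0.10}


def cost_by_job_type(jobs, price_per_hour):
    """Sum the estimated cost of each job type from its hours and instance price."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job["type"]] += job["hours"] * price_per_hour[job["instance"]]
    return dict(totals)


totals = cost_by_job_type(jobs, price_per_hour)
# Most expensive job types first, since those are the ones worth optimizing.
for job_type, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{job_type}: ${cost:.2f}")
```

Even a crude table like this would show which job types dominate the bill and where optimization effort is likely to pay off first.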