I started getting familiar with Mozilla’s continuous integration infrastructure in September. I’ve been particularly focused on the cloud part of that.
The thinking was that Amazon can take care of buying and managing hardware so we can focus on higher-level workloads. In theory this allows us to move faster and possibly lets us do more with fewer people. We have a lot of improvements to make to our C-I process, and the API-driven nature of Amazon EC2 was the obvious way to iterate quickly.
Our September Amazon bill was $136K. About $84K of that was spent on 150K hours of m3.xlarge virtual machines used for compilation. This did not feel cheap. We paid on-demand rates due to a combination of the difficulty of forecasting the perfect mix of Amazon reserved instances and the pain of getting over 1 million dollars for upfront reservations approved. Then the following happened:
September: At $0.450/hour, m3.xlarge was the most cost-effective instance Amazon had to offer.
October: I’m pushing for us to start using spot instances, as they seem to be 5x cheaper. The downside is that spot nodes get terminated by Amazon if there is insufficient capacity at our bid price. Investigation into spot and some work to switch start. There is a lot of uncertainty about how often our jobs will get killed.
December: Amazon starts offering $0.300 c3.xlarge instances, but these require upgrading our AMIs. Work to switch to c3.xlarge starts.
February: We start switching to c3.xlarge, pocketing a 33% on-demand reduction vs m3.xlarge. We also start running m3.xlarge on spot. We end up paying around $0.15 for the mix of c3/m3 nodes we are running on spot. It turns out Amazon’s spot bidding API does not try to pick the cheapest Availability Zone for us. We also see about 3-4% of our spot jobs get killed.
March: We switch the majority of our workload to spot and come up with a trivial spot bidding algorithm. c3.xlarge now costs $0.068, but we can also bid on m3.xlarge, m3.2xlarge, or c3.2xlarge depending on market prices. Our kill rates drop closer to 0.3%.
April: Amazon spot prices drop by 50%; c3.xlarge/m3.xlarge now cost around $0.035. Amazon seems to have increased capacity, and spot instances are almost never killed. There are still inefficiencies to deal with (EBS costs us as much as EC2), but cost is no longer our primary concern.
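To make the idea of bidding across instance types and zones concrete, here is a minimal sketch of that kind of selection. It is an illustration, not the code releng runs: the `choose_bid` name, the `PERFORMANCE` weights, the price cap, and the sample prices and zone names are all assumptions. It takes a table of current spot prices and picks the cheapest (instance type, zone) pair per unit of build throughput.

```python
from typing import Dict, Optional, Tuple

# Hypothetical relative build throughput per instance type (xlarge = 1.0).
# These weights are assumptions for illustration, not measured values.
PERFORMANCE: Dict[str, float] = {
    "m3.xlarge": 1.0,
    "c3.xlarge": 1.0,
    "m3.2xlarge": 2.0,
    "c3.2xlarge": 2.0,
}

Offer = Tuple[str, str]  # (instance_type, availability_zone)


def choose_bid(
    spot_prices: Dict[Offer, float],
    max_price_per_unit: float,
) -> Optional[Offer]:
    """Return the offer with the lowest price per unit of throughput.

    Offers for unknown instance types, or costing more than
    max_price_per_unit after normalization, are skipped.
    Returns None when nothing qualifies.
    """
    best: Optional[Offer] = None
    best_cost: Optional[float] = None
    for (instance_type, zone), price in spot_prices.items():
        perf = PERFORMANCE.get(instance_type)
        if perf is None:
            continue
        cost = price / perf  # dollars per hour per unit of throughput
        if cost > max_price_per_unit:
            continue
        if best_cost is None or cost < best_cost:
            best, best_cost = (instance_type, zone), cost
    return best


# Made-up sample prices in dollars per hour.
sample_prices = {
    ("c3.xlarge", "us-east-1a"): 0.068,
    ("c3.xlarge", "us-east-1b"): 0.040,
    ("m3.2xlarge", "us-east-1a"): 0.070,
    ("m3.xlarge", "us-west-2a"): 0.150,
}
print(choose_bid(sample_prices, max_price_per_unit=0.10))
# ('m3.2xlarge', 'us-east-1a')
```

Normalizing by throughput is what lets a 2xlarge win when it costs less than twice the xlarge price. A real bidder would also need to fetch live prices and handle terminated nodes, which this sketch leaves out.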
In September, we paid for ~150,000 hours of m3.xlarge. In April we paid for 190,000 hours (even though our builds got about 2x faster) of a mix of compilation-friendly instance types (4-8 Xeon cores, 8-16GB of RAM). 190,000 hours / 730.5 hours per month = 260 machines. We have around 100-400 Amazon builders operating at any given time, depending on how many Mozilla developers are awake.
This is interesting because $0.035 x 730.5 hours = $25.57. According to the Dell website, that’s identical to a 36-month lease of a $900 quad-core Xeon server. At $0.035/hour we could still buy some types of hardware for less than we are paying Amazon, but we’d have no money left to provision and power the machines.
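The back-of-the-envelope numbers above are easy to reproduce. This is a small sketch with function names of my own choosing; the `straight_line_monthly` helper is a simplifying assumption that spreads the purchase price evenly over the term and ignores interest and fees, so it only approximates a real lease quote.

```python
# Average hours in a month: 365.25 days * 24 hours / 12 months.
HOURS_PER_MONTH = 730.5


def machine_equivalents(instance_hours: float) -> float:
    """Convert a month of billed instance-hours into always-on machines."""
    return instance_hours / HOURS_PER_MONTH


def monthly_cost(hourly_rate: float) -> float:
    """Cost of running one instance around the clock for a month."""
    return hourly_rate * HOURS_PER_MONTH


def straight_line_monthly(purchase_price: float, months: int) -> float:
    """Assumption: purchase price split evenly over the term, no interest."""
    return purchase_price / months


print(f"{machine_equivalents(190_000):.0f} machines")    # 260 machines
print(f"${monthly_cost(0.035):.4f} per month on spot")   # $25.5675 per month on spot
print(f"${straight_line_monthly(900, 36):.2f} per month for hardware")  # $25.00 per month for hardware
```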
Releng engineers did a great job retrofitting our infra to save Mozilla money, and it was nice of Amazon to help out by repeatedly dropping prices during the same period. I’m now confident that using Amazon spot instances is the most operationally efficient way to do C-I. I doubt Amazon will let us cut our price by 15x over an 8-month period again, but I hope they keep up the regular 30-50% price cuts :)
Amazon Cost Explorer provides a dramatic visualization: