All About Performance

and other stuff by Taras Glek

More & Faster C-I for Less on AWS

Amazon Pricing - Expensive or Cheap?

Amazon ondemand nodes are fantastic for rapid iteration, but using them in production is expensive naivety. It is expensive for Amazon to maintain spare capacity to allow customers to launch any of the wide variety of nodes they offer ondemand. Forecasting demand at Amazon scale can’t be easy. As a result, Amazon recommends that customers buy reserves with an upfront payment and then pay a discounted rate afterwards. This is brilliant as it shifts the capacity-planning burden to each customer. Reserves would net us a 60% discount if we could forecast our AWS usage perfectly.
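To make the "if we could forecast perfectly" caveat concrete, here is a minimal sketch of how the reserve discount erodes when a reserved node sits idle. It assumes a simplified model in which a reserve is paid for every hour at the discounted rate whether or not it is used; the function names and the model itself are illustrative assumptions, not Amazon's actual billing rules.

```python
def effective_hourly_cost(ondemand_rate, reserve_discount, utilization):
    """Cost per *used* hour of a reserved node in a simplified model.

    Assumes the reserved node is billed every hour at the discounted
    rate, whether or not it is doing useful work.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    reserved_rate = ondemand_rate * (1 - reserve_discount)
    return reserved_rate / utilization


def break_even_utilization(reserve_discount):
    """Utilization below which a reserve costs more than plain ondemand."""
    return 1 - reserve_discount
```

Under this model a 60% discount only pays off while the reserved node is busy more than 40% of the time; at exactly 40% utilization each used hour costs the same as ondemand.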

Fortunately Amazon also has a spot-pricing model. Spot prices can be 70-90% lower than ondemand (we’ve also seen them 50% higher). The downside is that Amazon can kill these nodes at any point, and node availability is limited compared to ondemand. Given that Amazon’s competition can’t match spot prices, Amazon might be selling its unused ondemand capacity at cost. I doubt that anyone smaller than Amazon can maintain their own hardware with salaried ops for less than Amazon’s spot prices.

Spot Everything

We spent 2014 retrofitting our c-i architecture to cope with failure so we can run more of our workload on spot nodes.
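As a rough illustration of what "coping with failure" means for a c-i job, the sketch below reruns a job when the node under it is killed. `NodeTerminated` and `run_with_retries` are hypothetical names invented for this example, not part of our tooling, and the approach assumes jobs are safe to rerun from scratch.

```python
class NodeTerminated(Exception):
    """Hypothetical signal that the spot node running a job was killed."""


def run_with_retries(job, max_attempts=3):
    """Run `job` (a callable), rescheduling it if its node disappears.

    Returns (result, attempts_used). Re-raises NodeTerminated once
    max_attempts is exhausted so the caller can fall back, for example
    to an ondemand node.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except NodeTerminated:
            if attempt == max_attempts:
                raise
```

The important design point is that a killed node is treated as a normal outcome rather than an error, which is what makes cheap but unreliable capacity usable.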

On our January AWS bill we were 30% more cost-efficient. This was accomplished late in the month, and we managed to keep the bill from going up despite a higher-than-ever load. For February we were aiming to drop the bill to under $80K. The following is a summary of where we are.

Provisioning

  • We now run the majority of our workload on Amazon spot nodes. Ondemand:spot ratio is between 2:1 and 7:1. Note that we still pay more for the ondemand portion of our bill because ondemand is a lot more expensive.
  • At $74,389.03, our Feb bill is 36% lower than Jan.
  • Our current AWS spending per job is approximately half of what we paid in December
  • We now bid on a range of AWS node types to maximize node availability and minimize price. This results in a >=50% lower spot bill. We now run a portion of our workload on 2x-faster VMs when cheaper spot machine types are not available.
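Bidding across several node types can be sketched as choosing the type with the lowest price per unit of work among those currently under our bid. This is a minimal illustration and not our actual bidding code; the type names, prices and the `relative_speed` table are hypothetical inputs.

```python
def pick_instance_type(spot_prices, relative_speed, max_bid):
    """Pick the node type with the lowest price per unit of work.

    spot_prices: {type_name: current spot price in $/hour}
    relative_speed: {type_name: throughput relative to a baseline node}
    max_bid: {type_name: the most we are willing to pay in $/hour}
    Returns the chosen type name, or None if every type is over its bid.
    """
    best_type, best_cost = None, None
    for name, price in spot_prices.items():
        # Skip types we have no speed estimate for or that exceed our bid.
        if name not in relative_speed or price > max_bid.get(name, 0.0):
            continue
        cost_per_work = price / relative_speed[name]
        if best_cost is None or cost_per_work < best_cost:
            best_type, best_cost = name, cost_per_work
    return best_type
```

With this rule a 2x-faster node wins whenever it costs less than twice the slower node's current price, which matches the observation that faster VMs are sometimes the cheaper choice.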

Scheduling

  • Our AWS scheduler now ramps up more slowly to avoid temporary overprovisioning. Note the improvement on the right side of the graph (tall & narrow spikes are bad).

catlee's graph
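One simple way to get a slower ramp-up is to cap how many nodes may be started per scheduling cycle. The sketch below shows that idea only; the function name and the `max_step` and `jobs_per_node` parameters are assumptions made for illustration, not a description of our scheduler.

```python
def nodes_to_launch(pending_jobs, running_nodes, max_step=10, jobs_per_node=1):
    """Return how many nodes to start in this scheduling cycle.

    Caps each cycle at `max_step` new nodes so that a short burst of
    jobs does not trigger a tall, narrow provisioning spike.
    """
    wanted = -(-pending_jobs // jobs_per_node)  # ceiling division
    shortfall = max(0, wanted - running_nodes)
    return min(shortfall, max_step)
```

If the burst is short-lived, the queue drains before the cap is lifted many times, so fewer nodes are started only to sit idle.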

Monitoring

  • We are evaluating hostedgraphite.com for monitoring our efficiency. It’s nice to have someone offer a well-supported, open-source-compatible solution that can cope with the 30K+ metrics that our thousands of VMs generate.
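For context, Graphite-compatible services accept metrics in a plaintext format of one "path value timestamp" line per data point. The helper below is a small sketch of formatting such a line; the metric path and the `prefix` argument (standing in for an account-specific namespace) are hypothetical.

```python
import time


def graphite_line(path, value, timestamp=None, prefix=""):
    """Format one metric in Graphite's plaintext protocol.

    Each metric is a single line of the form "<path> <value> <timestamp>"
    terminated by a newline. `prefix` is a hypothetical namespace that
    is prepended to the metric path when given.
    """
    if timestamp is None:
        timestamp = int(time.time())
    full_path = f"{prefix}.{path}" if prefix else path
    return f"{full_path} {value} {int(timestamp)}\n"
```

Because every data point is a single short line, emitting tens of thousands of metrics from thousands of VMs is mostly a matter of batching these lines onto a socket.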

Workload Improvements

Mozilla Data Center plans for March

Amazon S3 is cheap, fast and robust. EC2 is incredibly flexible. Both are great for quickly iterating on cool ideas. Unfortunately most of our infrastructure runs on physical machines. We need to improve our non-elastic inhouse capacity with what we learned in the cloud:

  • Use a shared object cache for Windows/Mac builds. This should more than double Windows build speed. The plan is to use Ceph for S3-compatible shared object storage.
  • Get OpenStack bare metal virtualization working so we can move as fast there as we do in EC2
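The shared object cache idea can be sketched as a content-addressed lookup: hash everything that determines the compiler output, and only compile on a miss. This is a minimal illustration rather than our build tooling; `store` is a plain dict standing in for an S3-compatible bucket, and `compile_fn` is a hypothetical callable that runs the real compiler.

```python
import hashlib


def cache_key(preprocessed_source, compiler_id, flags):
    """Hash everything that determines the compiler's output."""
    h = hashlib.sha256()
    for part in (compiler_id, "\0".join(flags), preprocessed_source):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent parts cannot run together
    return h.hexdigest()


def cached_compile(store, compile_fn, preprocessed_source, compiler_id, flags):
    """Return (object_bytes, was_cache_hit).

    `store` is any dict-like mapping; a real deployment would wrap an
    S3-compatible bucket. `compile_fn` is a hypothetical callable that
    runs the actual compiler.
    """
    key = cache_key(preprocessed_source, compiler_id, flags)
    if key in store:
        return store[key], True
    obj = compile_fn(preprocessed_source, flags)
    store[key] = obj
    return obj, False
```

Because the key depends only on inputs, any build machine that sees the same source, compiler and flags can reuse an object file produced elsewhere, which is where the expected speedup comes from.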

Cloud Plans for March

  • Eliminate EBS usage for faster builds, 10% lower EC2 bill. Amazon EBS is the antithesis of cost-effectiveness.
  • Deploy more jacuzzis for faster builds, fewer EC2 instances
  • Run more things on spot, switch to cheaper ondemand nodes, maybe buy some reserves
  • Bid on an even wider variety of spot nodes
  • Probably won’t hit another 30% reduction; focusing on technical debt, better metrics, etc.
  • Containerization of Linux builds

Conclusion

Cloud APIs make cost-oriented architectures fun. The batch nature of c-i is a great match for spot.

In general, spot is a brilliant design pattern, and I intend to implement spot workloads on our own infra. It’s too bad other cloud vendors do not offer anything comparable.
