All About Performance

and other stuff by Taras Glek

How Mozilla Amazon EC2 Usage Got 15X Cheaper in 8 Months

I started getting familiar with Mozilla's continuous integration infrastructure in September. I've been particularly focused on the cloud part of it.

The thinking was that Amazon can take care of buying and managing hardware so we can focus on higher-level workloads. In theory this allows us to move faster and possibly do more with fewer people. We have a lot of improvements to make to our C-I process, and the API-driven nature of Amazon EC2 was the obvious way to iterate quickly.

Our September Amazon bill was $136K. About $84K of that was spent on 150K hours of m3.xlarge virtual machines used for compilation. This did not feel cheap. We paid ondemand rates due to a combination of the difficulty of forecasting the perfect mix of Amazon reserved instances and the pain of getting over a million dollars of upfront reservations approved. Then the following happened:

September: At $0.450/hour, m3.xlarge was the most cost-effective instance Amazon had to offer.

October: I'm pushing for us to start using spot instances, as they seem to be 5x cheaper. The downside is that spot nodes get terminated by Amazon if there is insufficient capacity at our bid price. Investigation into spot and some work to switch begins. There is a lot of uncertainty about how often our jobs will get killed.

December: Amazon starts offering c3.xlarge instances at $0.300/hour, but these require upgrading our AMI images. Work to switch to c3.xlarge starts.

February: We start switching to c3.xlarge, pocketing a 33% ondemand reduction vs m3.xlarge. We also start running m3.xlarge on spot. We end up paying around $0.15/hour for the mix of c3/m3 nodes we run on spot. It turns out the Amazon spot bidding API does not pick the cheapest availability zone for us. We also see about 3-4% of our spot jobs get killed.

March: We switch the majority of our workload to spot and come up with a trivial spot bidding algorithm. c3.xlarge now costs $0.068/hour, but we can also bid on m3.xlarge, m3.2xlarge and c3.2xlarge depending on market prices. Our kill rates drop closer to 0.3%.
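For the curious, a bidding loop of this kind can be surprisingly small. The following is a hedged sketch using boto3 for illustration; the instance pool, bid multiplier, cap and AMI id are placeholders, not our actual production configuration.

```python
# Sketch of a "trivial spot bidding algo": query recent spot prices for a few
# interchangeable build-friendly instance types, pick the cheapest
# (type, availability zone) pair, and bid a bit above the current price.
from datetime import datetime, timedelta, timezone

import boto3

CANDIDATE_TYPES = ["c3.xlarge", "m3.xlarge", "c3.2xlarge", "m3.2xlarge"]  # assumed pool
BID_MULTIPLIER = 1.5   # bid above the market price to reduce kill rate (assumption)
MAX_BID = 0.30         # never bid above the old c3.xlarge ondemand rate

def cheapest_offer(ec2):
    """Return (price, instance_type, availability_zone) for the cheapest recent offer."""
    history = ec2.describe_spot_price_history(
        InstanceTypes=CANDIDATE_TYPES,
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )["SpotPriceHistory"]
    # Keep only the newest quote per (type, AZ), then take the global minimum.
    latest = {}
    for entry in sorted(history, key=lambda e: e["Timestamp"]):
        latest[(entry["InstanceType"], entry["AvailabilityZone"])] = float(entry["SpotPrice"])
    (itype, zone), price = min(latest.items(), key=lambda kv: kv[1])
    return price, itype, zone

def bid_for_builders(count=10):
    ec2 = boto3.client("ec2", region_name="us-east-1")
    price, itype, zone = cheapest_offer(ec2)
    bid = min(price * BID_MULTIPLIER, MAX_BID)
    print(f"bidding ${bid:.3f}/h for {count} x {itype} in {zone} (market ${price:.3f})")
    ec2.request_spot_instances(
        SpotPrice=str(round(bid, 3)),
        InstanceCount=count,
        LaunchSpecification={
            "ImageId": "ami-00000000",       # placeholder build AMI
            "InstanceType": itype,
            "Placement": {"AvailabilityZone": zone},
        },
    )

if __name__ == "__main__":
    bid_for_builders()
```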

April: Amazon spot prices drop by 50%; c3.xlarge/m3.xlarge now cost around $0.035/hour. Amazon seems to have increased capacity, and spot instances are almost never killed. There are still inefficiencies to deal with (EBS costs us as much as EC2), but cost is no longer our primary concern.

In September, we paid for ~150,000 hours of m3.xlarge. In April we paid for 190,000 hours (even though our builds got about 2x faster) of a mix of compilation-friendly instance types (4-8 Xeon cores, 8-16GB of RAM). 190,000 hours / 730.5 hours-per-month ≈ 260 machines. We have around 100-400 Amazon builders operating at any given time, depending on how many Mozilla developers are awake.

This is interesting because $0.035 x 730.5 hours = $25.57 per month. According to the Dell website, that's identical to a 36-month lease of a $900 quad-core Xeon server. At $0.035/hour we could still buy some types of hardware for less than we are paying Amazon, but we'd have no money left to provision & power the machines.

Releng engineers did a great job retrofitting our infra to save Mozilla money; it was nice of Amazon to help out by repeatedly dropping prices during the same period. I'm now confident that using Amazon spot instances is the most operationally efficient way to do C-I. I doubt Amazon will let us cut our price by 15x over an 8-month period again, but I hope they keep up the regular 30-50% price cuts :)

Amazon Cost Explorer provides a dramatic visualization: Our AWS bill

Demise of Special Snowflake Infrastructure

TLDR

I've been running one of our Releng (C-I) teams since January. My specialty is optimizing $stuff at Mozilla. My team is full of like-minded people.

Here is a short summary of various improvements done this year:

  • Got 4x better value for each dollar spent on Amazon AWS (so we can use it for more stuff)
  • Sped up build times: try, inbound (values in seconds)
  • Wrote a new pull-request review tool on top of ReviewBoard. See the mailing list. The Bugzilla team is standing it up for testing
  • Set up http://ask.mozilla.org for knowledge transfer

Long Story: Put up or Shut up

Our infrastructure is an awful artifact of the project's roots and of the people who have worked on it going back to the 90s. Buildbot made sense when running on a dozen computers; it makes less sense on 5000. Building our own issue tracker (sadly Bugzilla is still competitive) and our own version control hosting made sense when the alternatives were worse. However, the world moved on and we ended up with a dead evolutionary branch (rooted in the 90s!) of a lot of these technologies. Things are often slow because of either lack of ownership or wrong ownership (eg management fail).

It hurts to see new people join and ask "why are you guys such special snowflakes?", and it also hurts to see our elders reply "because Mozilla is different". Everyone working with me is sick of our infrastructure. We have our best people working days (and often weekends) to get us out of the mess we are in.

Instead of lamenting the situation we dug ourselves into, we are working hard on getting out of the hole, presenting PoCs, etc.

Here are some things being worked on atm:

  • Faster builds: the goal is to get builds as fast as possible. We'll halve our build times a few times this year. Two people are on this full-time, with part-time help from others. Most of these wins will happen on our infrastructure, but local builds will also be best in class.
  • Efficient, cloud-oriented, modern, flexible C-I job system: taskcluster.net, blog. This will cut latency in scheduling and test execution, and be mostly self-serve.
  • Moving everything we can into the public cloud, and running the rest as a private OpenStack bare-metal cloud. The plan is to offload the private parts to public clouds as soon as viable options (Mac, bare metal, etc) appear on the market.
  • Excellent review tools: in addition to setting up ReviewBoard, we'll support a proper GitHub pull-request flow without CVS-style everybody-can-commit-to-TRUNK artifacts
  • Everything is being designed with self-serve operation in mind. The idea is that if someone is sick of something, they have all the APIs they need to do something about it, or they can shut up. Historically, the obtuseness of a lot of our infra prevented this.
  • sccache (ie a multi-machine ccache). This now works as a ccache replacement on Windows in addition to being a big speedup on our infra.

Note that I spend a lot of time evaluating alternatives (ninja, Phabricator, GitHub, etc) before embarking on projects. Sometimes it's easier to fix existing infra; other times it's better to switch tools. We'll use third-party infrastructure where possible because domain experts are usually better at what they do. For example, it's unbelievably hard to stand something up on our infrastructure vs calling APIs on Amazon (eg hours of dev vs months of meetings).

It’s a lot of work to drag our infrastructure out of the late 1990s. Our mission is to eradicate/outsource/rewrite crappy infrastructure. We know what needs to happen and how to get there by the end of the year.

PS. Expect to see blog posts from the devs about their work.

More & Faster C-I for Less on AWS

Amazon Pricing - Expensive or Cheap?

Amazon ondemand nodes are fantastic for rapid iteration, but using them in production is expensive naivety. It is expensive for Amazon to maintain spare capacity so customers can launch any of the wide variety of node types they offer ondemand. Forecasting demand at Amazon's scale can't be easy. As a result, Amazon recommends that customers buy reserved instances with an upfront payment and then pay a discounted rate afterwards. This is brilliant, as it shifts the capacity-planning burden to each customer. It would net us a 60% discount if we could forecast our AWS usage perfectly.

Fortunately, Amazon also has a spot-pricing model. Spot prices can be 70-90% lower than ondemand (we've also seen them 50% higher). The downside is that Amazon can kill these nodes at any point, and node availability is limited compared to ondemand. Given that Amazon's competition can't match spot prices, Amazon might be selling their unused ondemand capacity at cost. I doubt that anyone smaller than Amazon can maintain their own hardware with salaried ops for less than Amazon's spot prices.

Spot Everything

We have spent 2014 retrofitting our C-I architecture to cope with failure so we can run more of our workload on spot nodes.

On our January AWS bill we were 30% more cost-efficient. This was accomplished late in the month; we managed to keep the bill from going up while coping with a higher-than-ever load. For February we were aiming to drop the bill to under $80K. The following is a summary of where we are.

Provisioning

  • We now run the majority of our workload on Amazon spot nodes. Our ondemand:spot ratio is between 2:1 and 7:1. Note we still pay more for the ondemand portion of our bill because ondemand is a lot more expensive
  • At $74,389.03, our February bill is 36% lower than January's.
  • Our current AWS spending per job is approximately half of what we paid in December
  • We now bid on a range of AWS node types to maximize node availability and minimize price. This results in a >=50% lower spot bill. We now run a portion of our workload on 2x-faster VMs when the cheaper spot machine types are not available (a rough sketch of this kind of price-per-compute bidding follows this list)
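Here is a minimal sketch of the price-per-compute idea mentioned above: normalize each spot quote by a rough relative-speed factor so that a 2x-faster instance at less than 2x the price still wins. The speed factors and prices below are illustrative assumptions, not measured numbers from our infra.

```python
# Pick the instance type with the lowest price per unit of build throughput.
RELATIVE_SPEED = {          # build throughput relative to c3.xlarge (assumed)
    "c3.xlarge": 1.0,
    "m3.xlarge": 0.9,
    "c3.2xlarge": 2.0,
    "m3.2xlarge": 1.8,
}

def best_value(quotes):
    """quotes: {instance_type: current spot price in $/hour}."""
    return min(quotes, key=lambda t: quotes[t] / RELATIVE_SPEED[t])

# Example: the 2xlarge is the better deal here despite the higher sticker price.
print(best_value({"c3.xlarge": 0.15, "c3.2xlarge": 0.20}))   # -> c3.2xlarge
```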

Scheduling

  • Our AWS scheduler now ramps up more slowly to avoid temporary overprovisioning. Note the improvement on the right side of the graph (tall & narrow spikes are bad)

catlee's graph

Monitoring

  • We are evaluating hostedgraphite.com for monitoring our efficiency. It's nice to have someone offer a well-supported, open-source-compatible solution that can cope with the 30K+ metrics our 1000s of VMs generate.
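For context, getting a metric into a Graphite-compatible service is pleasantly simple: you write lines of "path value timestamp" to carbon's plaintext port. A minimal sketch; the host, namespace and metric names are placeholders, not our real ones.

```python
# Report per-fleet efficiency metrics over the Graphite/carbon plaintext
# protocol ("<metric path> <value> <timestamp>\n" on TCP 2003).
import socket
import time

CARBON_HOST = "carbon.example.com"   # eg a hostedgraphite endpoint (placeholder)
CARBON_PORT = 2003
PREFIX = "releng.aws"                # hypothetical namespace

def send_metrics(metrics):
    """metrics: dict of metric-name -> numeric value."""
    now = int(time.time())
    lines = [f"{PREFIX}.{name} {value} {now}\n" for name, value in metrics.items()]
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))

if __name__ == "__main__":
    send_metrics({
        "builders.running": 212,
        "spot.kill_rate_percent": 0.3,
        "cost.dollars_per_job": 0.04,
    })
```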

Workload Improvements

Mozilla Data Center plans for March

Amazon S3 is cheap, fast and robust. EC2 is incredibly flexible. Both are great for quickly iterating on cool ideas. Unfortunately most of our infrastructure runs on physical machines. We need to improve our non-elastic in-house capacity with what we learned in the cloud:

  • Use a shared object cache for Windows/Mac builds. This should more than double Windows build speed. The plan is to use Ceph for S3-compatible shared object storage (a minimal sketch of the idea follows this list).
  • Get OpenStack bare-metal virtualization working so we can move as fast there as we do in EC2
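The shared object cache idea boils down to content-addressed get/put against any S3-compatible endpoint. A minimal sketch, assuming boto3 pointed at a hypothetical Ceph RadosGW endpoint; this is an illustration of the concept, not sccache's actual implementation.

```python
# Shared compile cache backed by an S3-compatible store (Amazon S3, or Ceph
# RadosGW in the data center). Keys are content hashes of the preprocessed
# source plus compiler flags; values are the resulting object files.
import hashlib

import boto3
from botocore.exceptions import ClientError

S3_ENDPOINT = "https://radosgw.example.internal"   # placeholder Ceph RadosGW endpoint
BUCKET = "build-object-cache"                       # placeholder bucket

s3 = boto3.client("s3", endpoint_url=S3_ENDPOINT)

def cache_key(preprocessed_source, compiler_flags):
    h = hashlib.sha256()
    h.update(preprocessed_source)
    h.update(compiler_flags.encode("utf-8"))
    return h.hexdigest()

def fetch(key):
    """Return a cached object file, or None on a cache miss."""
    try:
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except ClientError:
        return None

def store(key, object_file):
    s3.put_object(Bucket=BUCKET, Key=key, Body=object_file)
```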

Cloud Plans for March

  • Eliminate EBS usage for faster builds and a 10% lower EC2 bill. Amazon EBS is the antithesis of cost-effectiveness.
  • Deploy more jacuzzis for faster builds and fewer EC2 instances
  • Run more things on spot, switch to cheaper ondemand nodes, maybe buy some reserves
  • Bid on an even wider variety of spot nodes
  • Probably won't hit another 30% reduction; focusing on technical debt, better metrics, etc
  • Containerization of Linux builds

Conclusion

Cloud APIs make cost-oriented architectures fun. The batch nature of C-I is a great match for spot.

In general, spot is a brilliant design pattern, and I intend to implement spot-style workloads on our own infra. It's too bad other cloud vendors do not offer anything comparable.

Cost-Efficient Continuous Integration

As of a few weeks ago, Chris AtLee and his half of Mozilla Release engineering report to me.

We are working on drastically improving efficiency of our continuous integration infrastructure.

For 2014 our focus is on cost-efficiency; developer efficiency is also a priority. Mozilla has a few thousand (around 3000, including various ARM boards) physical machines dedicated to building & testing our products. There are an additional ~500-1000 virtual machines on Amazon EC2. We have been spending around ~$115K per month on Amazon AWS infra for the past 4 months (Linux/B2G/Android builds & unit tests; Linux perf tests run on real hardware). It's harder to get numbers for our internal costs, but a conservative estimate would be that Amazon AWS is 1/3rd of our spend, given that we test Windows, Mac, Android, etc in-house.

The number of C-I hours goes up every month, so it is crucial to get a grip on our infrastructure efficiency in terms of cost per job.

My cloud target is to get our current Amazon EC2 workload to under $30K/month by the end of the year. This will give us room to move Mac builds, and Windows unit tests + compiles, into the cloud.

Where can we save money? One can break the C-I process down into 3+1 distinct optimization areas:

  1. Building & Testing Workloads
  2. Scheduling Logic
  3. Provisioning Bare Metal or Virtual Machines
  4. Reporting

Items 1-3 are ordered in terms of dependencies: in theory #1 does not care how it's scheduled (#2) or what machines it runs on (#3). #3 is what we get billed for, and it is largely a result of #1 and #2.

#4 is mainly a human cost in that we don’t have a lot of insight into how efficiently our jobs run. Making our systems more transparent should make optimization easier.

In terms of work:

  1. We made great progress reducing build times in 2013. Compilation is the biggest contributor to our EC2 bill, and we have more work to do there. We'll focus on optimizing tests to run more efficiently once build costs are on par.
  2. The way we use Buildbot results in a lot of idle time on machines. There is a lot of room for improvement via containerization and running more jobs in parallel. We are looking at potentially replacing Buildbot with a simpler, more cloud-friendly framework (TaskCluster).
  3. We saved $30K in January by tweaking what kind of EC2 nodes we use. We switched to more cost-efficient ondemand nodes (m3.xlarge -> c3.xlarge, m1.medium -> m3.medium) and started running try jobs and tests on EC2 spot nodes.

Moving to Public Clouds

We will be moving all of our Mac/Win/Linux infrastructure (other than perf testing) into the cloud. Windows, Linux and most Mac builds and tests should run on hardware outside of Mozilla data centers by the end of 2014. Cloud infra like EC2 is expensive for inefficient workloads, so we will be optimizing workloads locally until they can run on a public cloud in a cost-efficient manner (eg VMs with little idle capacity or redundant work). I expect cloud infra to save Mozilla developer time and money.

Private Cloud Functionality

For obvious reasons, we are stuck doing performance tests and many Mac workloads on our own bare metal. Unfortunately, suitable bare-metal public clouds do not exist yet. We would like to approach EC2-like engineering efficiencies by reconfiguring the machines in our data center as a private bare-metal cloud based on OpenStack. We will be replacing bugzilla + human intervention with APIs for reimaging, slave loans, etc. Ideally all of our PC, Mac and ARM hardware, onsite and offsite, would be controlled by the same API. I hope that in addition to making our hardware ops happier, opening up machine configuration via self-serve access control will greatly increase our ability to iterate on machine configurations, bring new configurations online, etc.

Spot workloads

Amazon's spot infrastructure (http://aws.amazon.com/ec2/purchasing-options/spot-instances/) is a great idea for minimizing idle time on machines. I would like to have spot workloads in both public and private clouds. Try and fuzzing jobs deal well with being interrupted and soaking up spare capacity.
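To illustrate why interruption-tolerant jobs are such a good match: a worker only needs to watch for the spot termination notice and checkpoint or requeue its work. A minimal sketch, assuming the EC2 instance-metadata termination-notice endpoint and a hypothetical checkpoint hook; not our actual worker code.

```python
# Watch for a spot termination notice and wind down a fuzzing/try-style job.
import time
import urllib.request
from urllib.error import HTTPError, URLError

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def termination_pending():
    """True once Amazon has scheduled this spot node for termination."""
    try:
        with urllib.request.urlopen(TERMINATION_URL, timeout=1) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False   # 404 (or no metadata service) means no notice yet

def run_interruptible(job_step, checkpoint):
    """Run small units of work, checkpointing if the node is about to go away.

    job_step() does a small chunk of work and returns False when done;
    checkpoint() uploads partial results or requeues the job (hypothetical hooks).
    """
    while True:
        if termination_pending():
            checkpoint()
            return
        if not job_step():
            return
        time.sleep(5)
```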

Developers with keys

The releng team is small. Unless you have a critical request, expect to hear "no, we are too busy", followed by "but here are the keys to our test public/private clouds, go do what you want and we'll help you deploy it". I hope Mozilla developers will appreciate not having to block on busy intermediaries and will help themselves.

My strategy is largely inspired by this Forrester analysis. Infrastructure devops teams should strive to enable other teams to self-serve. Our role is to help developers do things right, not to be gatekeepers or do work that developers find unsatisfying.

How to Get Involved

In 2014 I'd like to move our C-I from a 90s-style architecture with a few modern cloud components to a modern, cloud-oriented architecture. The following areas might be fun to help with:

  • We need to be more data-driven: we need to process, visualize and correlate various logs to direct optimization decisions. We have build logs, buildbot logs, AWS usage logs, AWS spot price logs, etc.
  • Bare metal virtualization with OpenStack for Mac, Windows, ARM boards is an open problem
  • Containerizing builds and tests via Docker/LXC.

New Role: Developer Productivity

New Role

The performance part of my team is now led by Vladan. They are doing the usual performancy things with a focus on improving our 'internal' benchmarks such as Talos and TART, plus some upcoming power testing.

I recently got a new task: developer productivity. I'm prioritizing tasks based on developer unproductivity. I have the following new items on my radar:

  • slow build system for local builds
  • long build/test times on our release automation
  • bugzilla improvements (eg tracking review times, lack of pull reqs, etc)

Expect to hear less about Gecko and more about web-dev stuff from me for the foreseeable future. Note, this is a collaborative project: my team would not be able to accomplish any of the new goals on its own, as we are already above capacity. I'm mainly working on improving coordination and collaboration on existing projects. Let me know if you have ideas related to this dev-productivity stuff.

Cost-Oriented Architecture

My main project for the remainder of this year is to maximize throughput per dollar spent on our release automation infrastructure. I'm focusing on computation that happens on Amazon EC2 (because it's easier to price than our own infrastructure). My goal is to be able to say "making improvement X will pay for itself in Y time". This is mainly a management technique to make it easier to justify working on infrastructure.

John posted some initial costing figures. Rail is working on switching us over to Amazon Spot instances in bug 935533. I expect a 2-5x reduction in our AWS costs once Rail is done.

I’d like to have a receipt attached to every job that executes on our build infrastructure. I hope this encourages devs to look into build/test inefficiencies.
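A receipt could be as simple as a job's wall-clock time multiplied by the hourly rate of the node it ran on. A minimal sketch with an illustrative price table drawn from the rates mentioned elsewhere in these posts; real receipts would use the actual billed spot/ondemand rate for that hour and availability zone.

```python
# Attach a cost "receipt" to a build/test job.
from dataclasses import dataclass

# Illustrative hourly rates (USD).
HOURLY_RATE = {
    ("m3.xlarge", "ondemand"): 0.450,
    ("c3.xlarge", "ondemand"): 0.300,
    ("c3.xlarge", "spot"): 0.068,
}

@dataclass
class JobReceipt:
    job_name: str
    instance_type: str
    pricing: str          # "spot" or "ondemand"
    duration_seconds: int

    @property
    def cost(self):
        return HOURLY_RATE[(self.instance_type, self.pricing)] * self.duration_seconds / 3600.0

    def __str__(self):
        return (f"{self.job_name}: {self.duration_seconds}s on "
                f"{self.pricing} {self.instance_type} -> ${self.cost:.3f}")

print(JobReceipt("linux64 opt build", "c3.xlarge", "spot", 1800))      # ~$0.034
print(JobReceipt("linux64 opt build", "m3.xlarge", "ondemand", 3600))  # $0.450
```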

Keeping an eye on costs means we'll be able to reason better about moving more workloads into the cloud. Cloud stuff is attractive because it allows us to give developers keys to AWS: they can deploy (in a low-friction manner) whatever they need to make their job easier, while cost monitoring ensures that things don't get crazy.

Open Architecture

In theory, open source is about scratching your own itch. In practice, service owners (usually inadvertently) put up walls to contribution via licensing requirements, poor documentation, VPNs, obscure infrastructure, weird version control, etc. This then snowballs into something equivalent to closed source, where 'clients' have to ask for 'features' from 'owners'. This usually results in unhappy customers and overworked owners. This happened to Telemetry (see previous post) and is happening to a few of our tools. Expect improvements in this area.

Telemetry Reboot

A year ago I wrote 2 blog posts describing telemetry (post1, post2).

What sucks about telemetry?

The following aspects made us unhappy with telemetry as currently deployed: dashboard performance, the effort required to add new dashboards/visualizations, data-gathering latency, and slow ad-hoc queries on hadoop. Telemetry dashboards keep timing out and infrastructure keeps failing; this can't be the way to drive Firefox performance improvements.

Telemetry Reboot

After a few months of soul-searching we ended up with a small group of people working on a replacement for the current telemetry server side. The goals are to minimize latency in the following dimensions:

  • The server side should be fast and robust. Slow dashboards are a good way to discourage their use.
  • We should have extensive monitoring and alerting to notify us when data isn't coming in (eg due to a bug in Firefox client code).
  • Everything should be open source and easy to hack on. The outgoing telemetry infrastructure has an overly complicated software stack and the code isn't open source. This was a mistake.
  • Fast ad-hoc queries
  • Minimal submission->graphing latency. Currently it takes about 3 days between landing a piece of telemetry code and having that data show up in a dashboard, and about 7 days to gather a useful amount of data. My goal is to eventually do live histogram aggregation so the data can be graphed as soon as it hits the server. This means we should be able to gather a useful amount of telemetry data within 3 days of landing a probe. Should we choose to switch to shipping hourly nightlies, telemetry would come back even faster :) (A rough sketch of the aggregation idea follows this list.)
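Here is a minimal sketch of the live-aggregation idea: instead of waiting for a nightly hadoop job, fold each incoming submission's histograms into running per-probe totals and regenerate the static dashboard JSON continuously. Field names and the bucketing scheme below are illustrative, not the real telemetry ping format.

```python
# Fold incoming telemetry submissions into running histogram aggregates.
from collections import defaultdict
import json

# (channel, version, probe_name) -> {bucket -> count}
aggregates = defaultdict(lambda: defaultdict(int))

def ingest(submission):
    """Fold one telemetry submission into the running aggregates."""
    key_prefix = (submission["channel"], submission["version"])
    for probe, histogram in submission["histograms"].items():
        agg = aggregates[key_prefix + (probe,)]
        for bucket, count in histogram.items():
            agg[int(bucket)] += count

def dashboard_json():
    """Static JSON the dashboard can fetch directly."""
    return json.dumps(
        {"/".join(map(str, key)): dict(sorted(buckets.items()))
         for key, buckets in aggregates.items()},
        indent=2,
    )

ingest({"channel": "nightly", "version": "26",
        "histograms": {"GC_MS": {"0": 3, "50": 7, "100": 1}}})
print(dashboard_json())
```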

ETA

I expect to shut down old dashboards and cut over to the new server + dashboards on Oct 1. See the wiki for our next set of milestones.

We’ll be deploying the new telemetry backend on Amazon AWS. I’ll write a blog post about switching from a private cluster to AWS once we go live.

Mark is working on our server side. He started implementing the server in Python/Django, but Node.js turned out to be a much more potent platform in terms of the amount of work needed to squeeze out good performance. You can find the current node code on github. I suspect we'll switch to a C/C++ HTTP receiver in the future to get another 2-3x increase in req/s. In the past we've run into servers having difficulty coping with increases in telemetry volume (submissions went up 10-fold when we added saved-session telemetry).

At the moment we are looking at squeezing the maximum possible requests per second out of node. If you have a background in that sort of thing, feel free to try improving our code.

I spent a lot of time thinking about how to write the fastest (and most robust) possible telemetry dashboard. I ended up writing a dashboard (code) based on static JSON. This approach is fast and should make dashboards easy to contribute to. Chris is working on making my prototype useful and usable. The dashboard is going to go through a lot of refactoring in the near future. At the moment we use jydoop to generate the dashboard JSON from hadoop (it was easier to write a new query tool for hadoop than to do a complex map/reduce in Java).

Summary

We will be cutting over to brand new telemetry infrastructure on Oct 1. There will be pain. If you have any questions or are interested in helping out ping us on IRC in #telemetry.

New Performance People, Telemetry News

reddit/MozillaTech

I like our make-planet-techy subreddit. Articles seem to get rated as one would expect. I happily switched to reddit as my primary way to consume Planet Mozilla. Redditors, keep up the good work!

Telemetry Dashboard

Our new telemetry dashboard went live yesterday. It's missing features, data and UX polish. However, it is public, fast and hackable. It should evolve quickly.

New Blood

Mark Reid is our first server-side dev. His primary task is switching the telemetry backend from hadoop to a custom telemetry server. The new server infrastructure will enable a live dashboard (there is a 3-day delay at the moment) plus the ability to run queries in minutes rather than hours.

Dhaval Giani, "the intern", is our first kernel hacker. He's working on helping land volatile memory support in the kernel. This feature will let Firefox safely consume more memory when memory is plentiful and use less in memory-limited scenarios. Hopefully he'll also add read-only file compression to ext4.

The blogs linked above are in the please-add-to-planet queue. Expect them to spend a few weeks there. Please subscribe to their RSS feeds in the meantime.

Investigating SQLite Performance via Telemetry

Years ago I tweaked our SQLite clients to use a larger page size. In my testing this seemed to achieve a speedup of 0.2-2x; see bug 416330. Unfortunately, ~30% of our users are still on the old page size (bug 634374). We used a jydoop query to figure out whether it's worth developing a feature to convert these users over, using two different ways to measure the performance difference (time spent reading & time spent executing a query). According to both methods there is a 4x reduction in SQLite IO waits with the larger page size, so we'll be adding code to convert people over more aggressively.
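For the curious, the comparison boils down to splitting submissions by page size and comparing time spent waiting on SQLite reads. A minimal sketch with illustrative probe/field names; this is not the exact query we ran.

```python
# Compare SQLite read waits between page-size populations from telemetry data.
from collections import defaultdict

def summarize(submissions):
    """Return {page_size: (submission_count, total_ms)} for SQLite read waits."""
    totals = defaultdict(lambda: [0, 0])
    for sub in submissions:
        page_size = sub["sqlite_page_size"]                # eg 1024 or 32768
        hist = sub["histograms"]["MOZ_SQLITE_READ_MS"]     # bucket(ms) -> count (illustrative name)
        ms = sum(int(bucket) * count for bucket, count in hist.items())
        totals[page_size][0] += 1
        totals[page_size][1] += ms
    return {size: (n, total) for size, (n, total) in totals.items()}

def compare(submissions):
    for size, (n, total_ms) in sorted(summarize(submissions).items()):
        print(f"page size {size:>6}: {total_ms / n:.1f} ms of read waits per session ({n} sessions)")

compare([
    {"sqlite_page_size": 1024,
     "histograms": {"MOZ_SQLITE_READ_MS": {"10": 40, "50": 12}}},
    {"sqlite_page_size": 32768,
     "histograms": {"MOZ_SQLITE_READ_MS": {"10": 12, "50": 1}}},
])
```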

This was a cool investigation because it highlighted how much more confidence we have when making performance decisions now (due to telemetry) vs a few years ago.

Mozilla Tech on Reddit

I posted earlier about Planet Mozilla being obsolete. To move forward, we need to experiment with some alternatives. Mozilla contributor @djco set up a reddit feed so we can up-vote technical content and down-vote the rest.

I know a number of good people who stopped reading and posting on Planet because of irrelevant, offensive, etc. content. Please give the reddit feed a try. Maybe once we get enough people moderating, we'll encourage more quality content.

I am not suggesting that people stop posting pictures of their cats or dictionary entries on Planet. I only want to filter that out so I can have a satisfying technical feed.

Please try out the reddit feed and let me and Dirkjan know if it works for you. We are open to other suggestions too.

Snappy #55: Snappy Evolution

The Snappy name will be retired in favor of Performance: we will be expanding the meta-project's scope beyond desktop Firefox responsiveness.

I think that, as a result of Snappy, Firefox is in a much happier performance place now. There are big wins remaining, but we have tools and ideas on how to get there. We've come a long way from the "how do we make Firefox feel fast?" discussions in late 2011.

Snappy has been a tricky meta-project because the work has to happen across teams. While my Performance team has an exclusive commitment to performance, other teams have to context-switch between feature work, platform work, performance, etc. There are also team-culture differences in internal vs external communication, planning, etc.

Some of the projects are quite hard and require specialized expertise. This meant that we'd make a bunch of progress on a project (eg the breakpad profiler backend) only to realize that we were blocked on someone we didn't loop into the project right away (eg Ted). Ideally we will minimize the chance of surprises like this in the future.

Until recently my solution was to deal with this as a manager: I'd sync up with the relevant managers about Snappy needs and try to get some Snappy bugs onto the relevant team's todo list. This has worked out ok, but there were a few too many instances of unexpected delays due to shifting team priorities, miscommunication and general lag. My conclusion is that at a developer-driven organization like Mozilla, heavy manager involvement in Snappy-type projects is a sign that we are doing it wrong. Developers should be adding performance goals to their team's agenda.

Last week we came up with a more developer-centric way to do performance work. I'll still be around making sure things are moving forward, but from now on I'd like to see developers drive planning & coordination on a per-project basis. I'll introduce individual projects + their respective blogs as they start in the next couple of weeks.

Irving Reid (addon-manager startup performance lead) summed up key mechanics of the new approach in an email. Thanks, Irving.

Irving’s Summary

Coming out of the Snappy work week, we decided to try and make performance projects a little more structured than they have been in the past; see Lawrence’s post for a summary.

While not mentioned in Lawrence’s blog, one of the things Taras (if I recall correctly) suggested was to have a project kick-off meeting to get rough agreement on scope, responsibilities, and how we’re going to track progress. If everyone is happy with settling those questions over email, we can; I think we lose a little by not getting a chance to see and hear each other directly, but it is rather painful to get that scheduled in the current circumstances.

In any case I wasn’t planning on delaying work until after the meeting could happen; it’s more about getting things as clear as we can, as early as we can.

The current plan is for Felipe and I to do the patches, with me handling coordination and progress tracking as well. Everybody else is involved to make sure we’re going in the right direction, and help us over roadblocks.

The progress tracking technique the Snappy team settled on last week was a combination of daily one-line updates using the status bot in the #perf channel, and a blog post every two weeks summarizing overall progress. I’ll do the blog posts; depending on how things go I may do them weekly instead of biweekly.

Snappy #54: Snappy Discussion in Paris

Meeting In Paris

Last week Snappy people met in Paris at IRILL. I like small venues because they encourage conversations. IRILL is one of the best small work venues I’ve attended.

Excellent Parisian food did not hurt :)

Workweek Presentations

I was impressed by the number and quality of presentations this time. I suspect having a presentation-friendly venue helped.

Talks/discussions/notes (please leave a comment if I missed a talk):

Team Activity

For a team activity we walked to the Paris Catacombes.