All About Performance

and other stuff by Taras Glek

Demise of Special Snowflake Infrastructure

TLDR

I’ve been running one of our Releng (C-I) teams since January. My specialty is to optimize $stuff at Mozilla. My team is full of like-minded people.

Here is a short summary of various improvements done this year:

  • Got 4x better value for each dollar spent on Amazon AWS (so we can use it for more stuff)
  • Sped up build times on try and inbound (values in seconds)
  • Wrote a new pull-request review tool on top of ReviewBoard (see the mailing list); the Bugzilla team is standing it up for testing
  • Set up http://ask.mozilla.org for knowledge transfer

Long Story: Put up or Shut up

Our infrastructure is an awful artifact of the project’s roots and of the people who have worked on it since the 90s. Buildbot made sense when we ran on a dozen computers; it makes less sense on 5,000. Building our own issue tracker (sadly, Bugzilla is still competitive) and our own version control hosting made sense when the alternatives were worse. However, the world moved on and we ended up with a dead evolutionary branch (rooted in the 90s!) of a lot of these technologies. Things are often slow because of either a lack of ownership or the wrong ownership (e.g., a management fail).

It hurts to see new people join and ask, “Why are you guys such special snowflakes?” It hurts just as much to see our elders reply, “Because Mozilla is different.” Everyone working with me is sick of our infrastructure. We have our best people working days (and often weekends) to get us out of the mess we are in.

Instead of lamenting the situation we dug ourselves into, we are working hard on getting out of the hole, presenting PoCs, and so on.

Here are some of the things being worked on at the moment:

  • Faster builds: the goal is to get builds as fast as possible. We’ll halve our build times a few times this year. Two people are on this full time, with part-time help from others. Most of these wins will happen on our infrastructure, but our local builds will also be best in class.
  • An efficient, cloud-oriented, modern, flexible C-I job system: taskcluster.net (blog). This will cut latency in scheduling and test execution, and it will be mostly self-serve.
  • Moving everything we can into the public cloud, and running the rest as a private OpenStack bare-metal cloud. The plan is to offload the private parts to public clouds as soon as viable options (Mac, bare metal, etc.) appear on the market.
  • Excellent review tools: in addition to setting up ReviewBoard, we’ll support a proper GitHub pull-request flow without CVS-style everybody-can-commit-to-TRUNK artifacts.
  • Everything is being designed with self-serve operation in mind. The idea is that if someone is sick of something, they have all the APIs they need to do something about it, or they can shut up. Historically, the obtuseness of a lot of our infra prevented this.
  • sccache (i.e., a multi-machine ccache). This now works as a ccache replacement on Windows, in addition to being a big speedup on our infra; see the sketch after this list.
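
To make the multi-machine ccache idea concrete, here is a minimal Python sketch of the concept, not of sccache itself: key each compilation by a hash of its inputs and keep the resulting object files in a store that every build machine can reach. The function names and the dict-backed store are illustrative placeholders; a real tool like sccache hashes the preprocessed source and keeps objects in a shared service such as S3.

    # Conceptual sketch of a multi-machine compiler cache (not sccache's actual code).
    import hashlib
    import subprocess

    def cache_key(compiler, args, source_path):
        """Hash everything that determines the output: compiler, flags, and source."""
        h = hashlib.sha256()
        h.update(compiler.encode())
        h.update(" ".join(args).encode())
        with open(source_path, "rb") as f:
            h.update(f.read())
        return h.hexdigest()

    def compile_with_cache(store, compiler, args, source_path, obj_path):
        """Fetch the object on a cache hit; compile and publish it on a miss.

        `store` is any dict-like shared cache; in production it would be a
        network service (e.g., an S3 bucket) visible to all build machines.
        """
        key = cache_key(compiler, args, source_path)
        cached = store.get(key)
        if cached is not None:
            with open(obj_path, "wb") as f:
                f.write(cached)
            return "hit"
        subprocess.check_call([compiler] + list(args) + ["-c", source_path, "-o", obj_path])
        with open(obj_path, "rb") as f:
            store[key] = f.read()
        return "miss"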

Note that I spend a lot of time evaluating alternatives (Ninja, Phabricator, GitHub, etc.) before embarking on projects. Sometimes it’s easier to fix existing infra; other times it’s better to switch tools. We’ll try to use third-party infrastructure where possible, because domain experts are usually better at what they do. For example, it’s unbelievably hard to stand up something on our infrastructure versus calling APIs on Amazon (e.g., hours of dev vs. months of meetings), as the sketch below illustrates.
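
To give a sense of what “calling APIs on Amazon” amounts to in practice, here is a small sketch using the boto library to request a build worker from EC2. The AMI ID, key pair, and instance type are placeholders, and the example assumes AWS credentials are already configured; it is an illustration, not our actual provisioning code.

    # Requesting a throwaway build worker from EC2 via boto (placeholders throughout).
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")
    reservation = conn.run_instances(
        "ami-00000000",             # placeholder AMI with the build toolchain baked in
        instance_type="m3.xlarge",  # placeholder instance size
        key_name="releng-build",    # placeholder SSH key pair
    )
    print("requested instance", reservation.instances[0].id)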

It’s a lot of work to drag our infrastructure out of the late 1990s. Our mission is to eradicate/outsource/rewrite crappy infrastructure. We know what needs to happen and how to get there by the end of the year.

P.S. I expect to see blog posts from the devs about their work.
