What sucks about telemetry?
The following aspects of telemetry as currently deployed made us unhappy: dashboard performance, the effort required to add new dashboards/visualizations, data-gathering latency, and slow ad-hoc queries on Hadoop. Telemetry dashboards keep timing out and infrastructure keeps failing. This can’t be the way to drive Firefox performance improvements.
After a few months of soul-searching we ended up with a small group of people working on a replacement for the current telemetry serverside. The goals are to minimize latency in the following dimensions:
- Serverside should be fast and robust. Slow dashboards are a good way to discourage their use.
- We should have extensive monitoring and alerting to notify us when data isn’t coming in (e.g. due to a bug in Firefox client code).
- Everything should be open source and easy to hack on. The outgoing telemetry infrastructure has an overly complicated software stack and its code isn’t open source. This was a mistake.
- Fast ad-hoc queries
- Minimal submission->graphing latency. Currently it takes about 3 days between landing a piece of telemetry code and having that data show up in a dashboard, and about 7 days to gather a useful amount of data. My goal is to eventually do live-histogram-aggregation so the data can be graphed as soon as it hits the server. This means we should be able to gather a useful amount of telemetry data within 3 days of landing a probe. Should we choose to switch to shipping hourly nightlies, telemetry would come back even faster :)
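To make the live-histogram-aggregation idea a bit more concrete, here is a minimal sketch in Python. It assumes each submission carries a list of bucket counts per probe and that merging is just an element-wise sum. The class name, method names and the probe name are made up for illustration; this is not the code we run.

```python
class LiveHistogramAggregator:
    """Hypothetical in-memory aggregator: sums bucket counts as submissions arrive."""

    def __init__(self):
        self._totals = {}       # probe name -> list of summed bucket counts
        self._submissions = {}  # probe name -> number of submissions merged

    def add(self, probe, bucket_counts):
        """Merge one submission's bucket counts into the running total."""
        totals = self._totals.get(probe)
        if totals is None:
            # First submission for this probe: copy so the caller's list is untouched.
            self._totals[probe] = list(bucket_counts)
        else:
            if len(totals) != len(bucket_counts):
                # Assumption: a probe's bucket layout never changes between submissions.
                raise ValueError("bucket layout changed for %s" % probe)
            for i, count in enumerate(bucket_counts):
                totals[i] += count
        self._submissions[probe] = self._submissions.get(probe, 0) + 1

    def snapshot(self, probe):
        """Return the current aggregate, ready to be graphed."""
        return {
            "buckets": list(self._totals.get(probe, [])),
            "submissions": self._submissions.get(probe, 0),
        }


# Usage: two submissions for a made-up probe name.
agg = LiveHistogramAggregator()
agg.add("EXAMPLE_PROBE_MS", [1, 0, 2])
agg.add("EXAMPLE_PROBE_MS", [0, 3, 1])
print(agg.snapshot("EXAMPLE_PROBE_MS"))
```

Because each submission is folded into the total on arrival, the aggregate can be graphed at any moment without waiting for a batch job.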
I expect to shut down old dashboards and cut over to the new server + dashboards on Oct 1. See the wiki for our next set of milestones.
We’ll be deploying the new telemetry backend on Amazon AWS. I’ll write a blog post about switching from a private cluster to AWS once we go live.
Mark is working on our serverside. Mark started implementing the server in Python/Django, but Node.js turned out to be a much more potent platform: it takes less work to squeeze out good performance. You can find the current Node code on GitHub. I suspect we’ll switch to a C/C++ HTTP receiver in the future to get another 2-3x increase in requests per second. In the past we’ve run into servers having difficulty coping with increases in telemetry volume (submissions went up 10-fold when we added saved-session telemetry).
At the moment we are looking at squeezing the maximum possible requests per second out of Node. If you have a background in that sort of thing, feel free to try improving our code.
I spent a lot of time thinking about how to write the fastest(+robustest) possible telemetry dashboard. I ended up writing a dashboard (code) based on static JSON. This approach is fast and should make dashboards easy to contribute to. Chris is working on making my prototype useful and usable. The dashboard will go through a lot of refactoring in the near future. At the moment we use jydoop to generate the dashboard JSON from Hadoop (it was easier to write a new query tool for Hadoop than to do a complex map/reduce in Java).
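As a rough illustration of the static-JSON approach, the sketch below writes one JSON file per probe plus an index file into a directory that any plain web server could serve. The file layout, function name and input shape are assumptions made for this example; they are not the format our dashboard or jydoop jobs actually use.

```python
import json
import os


def write_dashboard_json(out_dir, aggregates):
    """Write static JSON for a dashboard (hypothetical layout).

    aggregates maps a probe name to its list of bucket counts.
    Assumption: probe names are safe to use as file names.
    """
    os.makedirs(out_dir, exist_ok=True)
    probes = sorted(aggregates)
    for probe in probes:
        path = os.path.join(out_dir, probe + ".json")
        with open(path, "w") as f:
            json.dump({"probe": probe, "buckets": aggregates[probe]}, f)
    # The index lets the dashboard discover which probe files exist.
    with open(os.path.join(out_dir, "index.json"), "w") as f:
        json.dump(probes, f)
    return probes
```

The dashboard then only has to fetch and draw pre-computed files, so there is no server-side query to time out.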
We will be cutting over to brand-new telemetry infrastructure on Oct 1. There will be pain. If you have any questions or are interested in helping out, ping us on IRC in