Magic of GCC PGO
On Friday I finally got gold to produce a prelinkable static binary(bug). I also got around to trying out GCC profile-guided-optimization with the debloatifying -freorder-blocks-and-partition option. This option breaks up every profiled function into cold and hot “functions”. It then lumps all of the hot functions together.
PGO performance is amazingly close to that of binaries produced by icegrind (within 10% based on page-counts).
Startup Time RSS (KB)
firefox.stock 2515ms 49452
firefox.ordered 1919ms 45344
firefox.static 2321ms 49616
firefox.static.ordered 1577ms 37072
firefox.static.pgo 1619ms 38436 In above table, ordered means application of icegrind, static means a fat firefox-bin. To generate the PGO profile I just started and closed Firefox. So it’s not too surprising that the results are similar to those of an explicitly ordered binary. RSS refers to how much memory is mmap()ed into the process(lower RSS usage means we are paging in less junk). I was not able to control the layout of PGO builds; will need some linker hackery to deal with split functions.
I think the numbers speak for themselves. Isn’t it scary how wasteful binaries are by default? It amazes me that Firefox can shrug off a significant amount of resource bloat without changing a single line of code. I think this is a good demonstration on why application developers should a) expect more out of their toolchain (Linux, GCC, Binutils) b) contribute to their toolchain.
I think my next step is to tackle PGO Firefox builds(bug). From there I would like to teach icegrind to play together with PGO.
Pieces of The Startup Puzzle
It took a long time since I first noticed the suckyness of library io. After much digging, help by smart people on #gcc + LKML discussion, I think I finally have a pretty clear list of the remaining/inprogress things needed for Linux applications to start faster.
- Wu Fengguang is making headway on smarter readahead that makes better use of available RAM + disk bandwidth. Firefox could be read in 512kb chunks instead of 128 (4x page-fault reduction). Distributions should be aggressively testing this patch.
- Better agreement on binary organization between the compile-time linker, run-time linker and the kernel (see LKML discussion). Can shave off a handful of unneeded page-faults per file this way.
- A linker flag to specify how much of a particular library should be read-in via madvise(). For example any xulrunner apps will know ahead of time that they need large parts of libxul.so - might as well let the OS know.
- Transparent read-only per-file ext4 compression (like OSX). Ted Tso indicated that this would be easy to add to ext4, but as far as I know nobody has jumped on this. I think with all of the above combined, we could load apps like Firefox at near-warm (0.5 second) speeds. Most of these are easy. #1 is hard, but it’s already being worked on. I’ll be happy to point someone at how to solve items 2-4 while I work on the Firefox side of things. The end result of all this should be a more instant gratification for all Linux users.