In an earlier post I described my Fiji hack: how to use some nasty instrumentation to spit out ld scripts to speed up cold startup. This week I tried to extract more data out of the binary to lay it out even better. Trouble is, even if one lays out functions perfectly, those functions load data for things like variable initializers, which causes more IO.
A very clever friend suggested that I write a valgrind plugin to detect data and function accesses and write the linker input files in one step. So, with much hand-holding, I hacked up a sample valgrind plugin to do what I want. Unfortunately, my binaries ended up not being significantly faster (if at all) than the Fiji ones. They also ended up 20% bigger.
Fortunately, the GCC devs were able to point out my linker mistakes and pointed me at a linker patch that does what I want without linker scripts (and has fewer binary-bloating side effects). Unfortunately, that just confirmed that the speedup I was looking for wasn’t hiding behind data symbols. So I am going to have to sit down with my IO-tracing script and study what the heck is going on.
Cool Things I Learned
In the process of helping me, GCC people namedropped some compiler flags that may prove very helpful:
- -freorder-blocks-and-partition: Apparently this breaks up functions into hot/cold parts and gives them different section names so they can be moved around at link time.
- -fno-common, -fno-zero-initialized-in-bss: These should go well with my favourites, -ffunction-sections and -fdata-sections.

Additionally, it may be possible to benefit from linking with large page support. I have some doubts about that.
I did learn about some cool GNU Gold flags:
- --compress-debug-sections=zlib: Most of the overhead of linking a development libxul.so is writing out nearly a gigabyte of debug data.
- --icf: Identical code folding; I think it matches the deduplication feature found in the MS linker. Saves 5% on my libxul.so.
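For reference, a hedged sketch of how these gold flags could be passed through gcc (this assumes ld.gold is installed and that your gcc accepts -fuse-ld=gold; hello.c is just a placeholder program):

```shell
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { puts("hello"); return 0; }
EOF

# -Wl, forwards flags to the linker. gold's --icf wants a mode:
# "safe" only folds functions whose addresses are never compared,
# "all" folds more aggressively.
gcc -g hello.c -o hello \
    -fuse-ld=gold \
    -Wl,--icf=safe \
    -Wl,--compress-debug-sections=zlib

./hello
```

On a toy program like this neither flag saves anything measurable; the wins show up on big debug-heavy libraries like libxul.so.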