Last week I learned how Windows handles page faults backed by files (specifically xul.dll). I already knew that Linux was suboptimal in this area; perhaps the clever people at Redmond did better.
Shaver pointed me at xperf, which is sort of like the new Linux perf tools. Xperf rocks in that it can capture the relevant data and it can export it as .csv.
Even with profile-guided optimization, Windows causes 3x as much IO on xul.dll as Linux does on libxul.so. That’s especially interesting given that xul.dll is one-third smaller than libxul.so. Here is the graph. PGO isn’t helping on Windows as much as it can help on Linux because MSVC PGO doesn’t do the equivalent of GCC’s -freorder-blocks-and-partition (unless I missed something in the docs).
With the Windows prefetcher, there were 4x fewer xul.dll IOs (graph here). Unfortunately, the prefetcher can’t figure out that the whole xul.dll should be paged in, and we still end up with an excess of random IO.
Why?
When a page fault occurs, Windows goes to read the page from the file and reads a little extra (just like any other sane OS), assuming that there will be more IO nearby. Unfortunately, the gods of virtual memory at Microsoft decided that for every page fault, only 7 extra pages should be read. So with 4K pages, reads occur in 32K chunks (vs 128K in Linux, which is still too small). To make matters worse, segments mapped in as data only read in 3 extra pages (i.e. 16K chunks).
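The chunk sizes fall straight out of the page counts. Here is a quick sketch of the arithmetic, assuming 4K pages (which is what the 32K figure implies). The function name is made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4K pages, as implied by 8 pages == 32K


def fault_read_size(extra_pages, page_size=PAGE_SIZE):
    """Bytes read per page fault: the faulting page plus the extra pages."""
    return (1 + extra_pages) * page_size


default_chunk = fault_read_size(7)  # Windows, the usual case
data_chunk = fault_read_size(3)     # Windows, segments mapped in as data
linux_chunk = 128 * 1024            # Linux figure quoted above

print(default_chunk // 1024, data_chunk // 1024)  # 32 16
print(linux_chunk // default_chunk)               # 4
```

The last line is only the ratio of the two chunk sizes, not a measurement.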
This is funny in a sad way. Reading in 32K chunks is supposed to minimize RAM usage (which makes no bloody sense when Win7 officially requires 1GB of RAM). As a result of being so thrifty on easy-to-evict file cache, Windows ends up doing 4x as much file IO as Linux. The 16K reads are particularly funny as one can see the result of that misoptimization in the string of puny reads on the top right of the graphs.
Surely There is an API like madvise()
On POSIX systems, madvise() can be used to influence the caching behavior of the OS. posix_fadvise() is another such call, for IO based on file descriptors. For example, Firefox fastload files are now madvise()ed such that they are read in a single 2MB chunk on startup. Unfortunately, it appears that Windows has no such APIs, so one is stuck with pathetically small reads.
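For the curious, here is roughly what those hints look like on the POSIX side. This is a minimal sketch using Python's wrappers (os.posix_fadvise() and mmap.madvise(), the latter needing Python 3.8+), not the actual Firefox code. The helper name and the 2MB temp file are hypothetical stand-ins. Both calls are only hints, so the kernel is free to ignore them.

```python
import mmap
import os
import tempfile


def map_with_readahead_hint(path):
    """Map a file read-only and ask the kernel to page it all in up front."""
    with open(path, "rb") as f:
        fd = f.fileno()
        # File-descriptor hint; length 0 means "through the end of the file".
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            except OSError:
                pass  # advisory only, carry on without it
        # mmap duplicates the descriptor, so the mapping outlives `f`.
        mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    # Mapping hint; only available where the OS supports madvise().
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        try:
            mm.madvise(mmap.MADV_WILLNEED)
        except OSError:
            pass  # advisory only
    return mm


# Hypothetical stand-in for a startup cache file: 2MB of zeros.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
    demo_path = tmp.name

mapping = map_with_readahead_hint(demo_path)
print(len(mapping))  # 2097152
mapping.close()
os.remove(demo_path)
```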
At first, I thought that passing FILE_FLAG_SEQUENTIAL_SCAN when opening the file handle would work like a crappy fadvise()-equivalent. Turns out that mmapping files completely bypasses the Windows Cache Manager, so that flag just gets ignored.
So as far as I can tell, the only way to convince Windows not to read stuff in stupidly small chunks is to mmap() everything we care about using large pages. Unfortunately, that comes with some significant costs.
We are going to try to order the binaries better, which should halve the number of page faults.
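To see why ordering matters when each fault only pulls in a handful of pages, here is a toy model. It assumes each fault brings in the faulting page plus the next 7 pages and that nothing is ever evicted. The access patterns are invented for illustration, so this does not reproduce real xul.dll numbers or the "halve" estimate above.

```python
def count_faults(accessed_pages, extra_pages=7):
    """Count page faults for an access sequence under a toy read-ahead model.

    Assumption: each fault brings in the faulting page plus the next
    `extra_pages` pages, and nothing is ever evicted.
    """
    resident = set()
    faults = 0
    for page in accessed_pages:
        if page not in resident:
            faults += 1
            resident.update(range(page, page + extra_pages + 1))
    return faults


# Hypothetical startup: 64 pages of code get touched either way.
scattered = [i * 16 for i in range(64)]  # hot pages spread across the binary
packed = list(range(64))                 # same number of hot pages, adjacent

print(count_faults(scattered))  # 64
print(count_faults(packed))     # 8
```

In the scattered layout every touched page costs its own fault, while the packed layout gets 8 useful pages out of each one.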
Can Windows Do Anything Better?
Yes. The “Unix way” of breaking up everything into a billion libraries and configuration files results in 10x more files being read on startup on Linux vs Windows. Just because Linux can read individual libraries 4x faster doesn’t mean that IO becomes free.
Presently, in ideal conditions, Firefox starts up 30-50% faster on Windows. The Windows Prefetcher hack sweeps a lot of Windows suck under the carpet, but Microsoft has a lot of room for improvement.
Update:
People seem to prefer to comment on reddit. If you want me to not miss your comment, make sure you comment here.