This post is about minimizing the amount of disk IO and CPU overhead when reading zip files.
I recently saw an article about a new format that was faster than zip.
This was quite surprising because, to my mind, zip is one of the most flexible and low-overhead formats I’ve encountered.
Some googling showed me that over the past 11 years people have noticed that Firefox uses optimized zip files. This inspired me to document the thinking behind the optimized zip format I implemented in Firefox in pre-pandemic 2010. I had a lot of fun writing this code and was surprised that I had failed to blog about it.
The following diagram is borrowed from a CodeProject article. Wikipedia and the ZIP specification are also helpful.
Zip files seem to be designed for cheap appends. For every file inside a zip, there is a Local Header followed by the optionally compressed file contents. These entries are followed by a Central Directory, which acts as an index for the zip contents.
To extract a file from a zip file, one must:
- Scan backwards through the zip file for the End of Central Directory marker.
- Read the offset to the beginning of the Central Directory.
- Find the relevant Local Header offset in the Central Directory.
- Read and optionally decompress the stored file.
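The four steps above can be sketched in a few lines of Python. This is only an illustration of the lookup path, not the Firefox reader: the function name `extract` is made up for this example, and it assumes the whole archive is already in memory, with no zip64, no encryption, and only stored or deflated entries.

```python
import struct
import zlib

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory marker
CD_SIG = b"PK\x01\x02"     # Central Directory entry
LOCAL_SIG = b"PK\x03\x04"  # Local Header


def extract(data: bytes, wanted: str) -> bytes:
    """Hypothetical helper: pull one member out of an in-memory zip."""
    target = wanted.encode("utf-8")
    # 1. Scan backwards for the End of Central Directory marker.
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("not a zip file")
    # 2. Read the entry count, size and offset of the Central Directory.
    total, cd_size, cd_offset = struct.unpack_from("<HII", data, eocd + 10)
    # 3. Walk the Central Directory looking for the Local Header offset.
    pos = cd_offset
    for _ in range(total):
        if data[pos:pos + 4] != CD_SIG:
            raise ValueError("corrupt Central Directory")
        method, = struct.unpack_from("<H", data, pos + 10)
        csize, = struct.unpack_from("<I", data, pos + 20)
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        local_offset, = struct.unpack_from("<I", data, pos + 42)
        if data[pos + 46:pos + 46 + name_len] == target:
            break
        pos += 46 + name_len + extra_len + comment_len
    else:
        raise KeyError(wanted)
    # 4. Skip the Local Header, then read and optionally decompress.
    if data[local_offset:local_offset + 4] != LOCAL_SIG:
        raise ValueError("corrupt Local Header")
    lname_len, lextra_len = struct.unpack_from("<HH", data, local_offset + 26)
    start = local_offset + 30 + lname_len + lextra_len
    raw = data[start:start + csize]
    if method == 0:   # stored
        return raw
    if method == 8:   # deflate, raw stream without a zlib header
        return zlib.decompress(raw, -15)
    raise NotImplementedError(f"compression method {method}")
```

Note that every lookup starts at the end of the file and then jumps backwards twice, which is exactly the access pattern that defeats readahead.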
Writing Optimized Zip Files
In order to optimize file IO on Firefox startup, I wanted to make use of OS readahead[1].
Unfortunately, reading a file starting from the end precludes readahead. It is also suboptimal to read files from a zip in random order.
The following creative interpretation of the Zip spec results in optimized zip files:
1. Since we already do PGO[2] for Firefox builds, I added a ZipArchiveLogger for logging zip entries being accessed to the Firefox profiling stage.
2. Then during the build phase, I added optimizejars.py[3] to move the Central Directory to the start of the file.
3. optimizejars.py would lay out zip entries in the order specified by ZipArchiveLogger.
4. Wrote down the length of the Central Directory + entries from step 3.
5. Zip file now looks like: | 4 bytes | Central Directory | Hot Files | Cold Files | End Of Central Directory Marker |.
Thus we have a sequential-read-friendly zip file that can still be read by zip tools that follow the spec.
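To make the layout above concrete, here is a minimal sketch that writes such a file from scratch. It is not optimizejars.py: the names `write_optimized`, `local_record` and `central_entry` are invented for this example, entries are stored uncompressed with ASCII names, and I am assuming the 4-byte prefix is a little-endian offset marking the end of the last hot entry.

```python
import struct
import zlib

DOS_DATE = 0x21  # 1980-01-01, the earliest date the format can express


def local_record(name: bytes, body: bytes) -> bytes:
    """Local Header followed by stored (uncompressed) contents."""
    header = struct.pack("<4sHHHHHIIIHH", b"PK\x03\x04", 20, 0, 0, 0, DOS_DATE,
                         zlib.crc32(body), len(body), len(body), len(name), 0)
    return header + name + body


def central_entry(name: bytes, body: bytes, offset: int) -> bytes:
    """Central Directory entry pointing at a Local Header offset."""
    return struct.pack("<4sHHHHHHIIIHHHHHII", b"PK\x01\x02", 20, 20, 0, 0, 0,
                       DOS_DATE, zlib.crc32(body), len(body), len(body),
                       len(name), 0, 0, 0, 0, 0, offset) + name


def write_optimized(files: dict, hot_order: list) -> bytes:
    """Hypothetical writer: | 4 bytes | Central Directory | hot | cold | EOCD |."""
    hot = list(dict.fromkeys(n for n in hot_order if n in files))
    cold = [n for n in files if n not in hot]
    names = [n.encode("ascii") for n in hot + cold]
    # The Central Directory size is known up front, so offsets can be assigned.
    cd_size = sum(46 + len(n) for n in names)
    offset = 4 + cd_size
    readahead = offset
    central, records = [], []
    for i, name in enumerate(names):
        body = files[name.decode("ascii")]
        record = local_record(name, body)
        central.append(central_entry(name, body, offset))
        records.append(record)
        offset += len(record)
        if i < len(hot):
            readahead = offset  # assumption: prefix covers CD + hot entries
    eocd = struct.pack("<4sHHHHIIH", b"PK\x05\x06", 0, 0,
                       len(names), len(names), cd_size, 4, 0)
    return (struct.pack("<I", readahead) + b"".join(central)
            + b"".join(records) + eocd)
```

The End of Central Directory record still sits at the very end and still points at the Central Directory, which now happens to live at offset 4 instead of just before it.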
Reading Optimized Zip Files
1. Speculatively check if we can find the Central Directory signature 4 bytes in.
2. If step 1 succeeded, assume the first 4 bytes are our read-ahead length. Issue a platform-specific read-ahead call.
So for all cases where the zip file access pattern matches the one recorded during the profiling phase, Firefox can read the relevant resources in a single IO!
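The reader side is small enough to sketch as well. This is an illustration rather than the Firefox code: `readahead_hint` is a made-up name, the little-endian interpretation of the first 4 bytes is my assumption, and `os.posix_fadvise` stands in for the platform-specific call (it only exists on some Unix platforms, so the sketch silently skips it elsewhere).

```python
import os
import struct

CD_SIG = b"PK\x01\x02"  # Central Directory entry signature


def readahead_hint(path: str) -> int:
    """Hypothetical helper: return the read-ahead length, or 0 for a plain zip."""
    fd = os.open(path, os.O_RDONLY)
    try:
        head = os.read(fd, 8)
        # Step 1: speculatively look for the Central Directory signature
        # 4 bytes into the file.
        if len(head) < 8 or head[4:8] != CD_SIG:
            return 0
        # Step 2: treat the first 4 bytes as the read-ahead length
        # (little-endian is an assumption of this sketch).
        length, = struct.unpack("<I", head[:4])
        # Ask the OS to start pulling that range into the page cache.
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(fd, 0, length, os.POSIX_FADV_WILLNEED)
            except OSError:
                pass  # the hint is advisory, so failure is harmless
        return length
    finally:
        os.close(fd)
```

A regular zip starts with a Local Header signature, so the check in step 1 fails and the reader falls back to the usual scan from the end of the file.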
Further Zip Trivia
- https://bugzilla.mozilla.org/show_bug.cgi?id=701875 renamed .jar files to .ja so Microsoft System Restore wouldn’t corrupt Firefox.
- At the time, the optimized jar change broke antivirus scanners, which further sped up Firefox startup :)
- The zip reader in Firefox uses mmap, thus zip files can be nested with 0 overhead (without compression). This also broke OS/2 support, as OS/2 did not have a concept of memory mapping.
- This work would not have worked out without massive amounts of help and follow-on work from Mike Hommey, Michael Wu and Robert Kaiser.