Firefox's Optimized Zip Format: Reading Zip Files Really Quickly

2021-11-22


This post is about minimizing the amount of disk IO and CPU overhead when reading Zip files.

I recently saw an article about a new format that was faster than zip.

This was quite surprising because, to my mind, zip is one of the most flexible and low-overhead formats I’ve encountered.

Some googling showed me that over the past 11 years people have noticed that Firefox uses optimized zip files. This inspired me to document the thinking behind the optimized zip format I implemented in Firefox back in pre-pandemic 2010. I had a lot of fun writing this code, and was surprised to find that I had never blogged about it.

Zip format

The following diagram is borrowed from a CodeProject article. Wikipedia and the ZIP specification are also helpful.

https://www.codeproject.com/KB/cs/remotezip/diagram1.png

Zip files seem to be designed for cheap appends. For every file inside a zip, there is a Local Header followed by the (optionally compressed) file contents.

This is followed by a Central Directory, which acts as an index for the zip contents.

To extract a file from a zip file, one must:

1. Scan backwards through the zip file for the End of Central Directory marker.
2. Read the offset to the beginning of the Central Directory.
3. Find the relevant Local Header offset in the Central Directory index.
4. Read and, if needed, decompress the stored file.
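
The four steps above can be sketched in a few lines of Python. This is only an illustration, not Firefox's reader: `read_entry` is a hypothetical helper, the whole archive is assumed to be in memory, and zip64, encryption and archive comments that happen to contain the marker are ignored. Field offsets follow the ZIP specification.

```python
import struct
import zlib

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory record
CDIR_SIG = b"PK\x01\x02"  # Central Directory file header


def read_entry(data: bytes, wanted: str) -> bytes:
    """Extract one member from an in-memory zip (no zip64, no encryption)."""
    # Step 1: scan backwards for the End of Central Directory marker.
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("End of Central Directory not found")

    # Step 2: read the entry count and the Central Directory offset.
    n_entries, _cd_size, cd_offset = struct.unpack_from("<H2L", data, eocd + 10)

    # Step 3: walk the Central Directory looking for the wanted name.
    pos = cd_offset
    for _ in range(n_entries):
        if data[pos:pos + 4] != CDIR_SIG:
            raise ValueError("corrupt Central Directory")
        method, = struct.unpack_from("<H", data, pos + 10)
        comp_size, = struct.unpack_from("<L", data, pos + 20)
        name_len, extra_len, comment_len = struct.unpack_from("<3H", data, pos + 28)
        local_offset, = struct.unpack_from("<L", data, pos + 42)
        name = data[pos + 46:pos + 46 + name_len].decode("utf-8")
        if name == wanted:
            break
        pos += 46 + name_len + extra_len + comment_len
    else:
        raise KeyError(wanted)

    # Step 4: skip the Local Header, then read and optionally decompress.
    lname_len, lextra_len = struct.unpack_from("<2H", data, local_offset + 26)
    start = local_offset + 30 + lname_len + lextra_len
    raw = data[start:start + comp_size]
    if method == 0:  # stored
        return raw
    if method == 8:  # deflate
        return zlib.decompress(raw, -15)  # raw deflate stream, no zlib header
    raise NotImplementedError(f"compression method {method}")
```

Note that the very first read lands at the end of the file and the next ones jump back towards the front, which is the access pattern the rest of this post tries to avoid.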

Writing Optimized Zip Files

In order to optimize file IO on Firefox startup, I wanted to make use of OS readahead¹.

Unfortunately, reading a file starting from its end precludes readahead. It is also suboptimal to read files from a zip in random order.

The following creative interpretation of the Zip spec results in optimized zip files:

1. Since we already do PGO² for Firefox builds, I added a ZipArchiveLogger to the Firefox profiling stage to log the zip entries being accessed.
2. Then, during the build phase, I added optimizejars.py³ to move the Central Directory to the front of the file.
3. Additionally, optimizejars.py would lay out zip entries in the order specified by the ZipArchiveLogger log.
4. Finally, it wrote down the combined length of the Central Directory and the entries from step 3 in the first 4 bytes of the file.

Thus we have a sequential-read-friendly zip file that can still be read by zip tools that follow the spec.
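
To make the layout concrete, here is a rough Python sketch of such a writer. It is not optimizejars.py: `write_optimized` is a hypothetical function, entries are stored uncompressed to keep it short, and both the little-endian encoding of the length and the choice to count it from the start of the file are assumptions made for illustration.

```python
import struct
import zlib


def write_optimized(files: dict, access_log: list) -> bytes:
    """Build a zip with the Central Directory up front (stored entries only)."""
    # Logged entries go first, in access order; the rest follow.
    logged = [n for n in dict.fromkeys(access_log) if n in files]
    order = logged + [n for n in files if n not in logged]
    raw_names = {n: n.encode("utf-8") for n in order}

    cd_offset = 4  # the Central Directory sits right after the 4-byte length
    cd_size = sum(46 + len(raw_names[n]) for n in order)

    offset = cd_offset + cd_size  # where the first Local Header will land
    readahead = offset
    cdir, locals_ = [], []
    for n in order:
        body, raw = files[n], raw_names[n]
        crc = zlib.crc32(body)
        # Shared fields: version needed, flags (UTF-8 names), method 0 = stored,
        # DOS time, DOS date (1980-01-01), CRC-32, compressed and real size.
        common = struct.pack("<5H3L", 20, 0x0800, 0, 0, 0x21,
                             crc, len(body), len(body))
        local = (b"PK\x03\x04" + common
                 + struct.pack("<2H", len(raw), 0) + raw + body)
        cdir.append(b"PK\x01\x02" + struct.pack("<H", 20) + common
                    + struct.pack("<5H2L", len(raw), 0, 0, 0, 0, 0, offset) + raw)
        locals_.append(local)
        offset += len(local)
        if n in logged:
            readahead = offset  # window ends after the last logged entry

    # The End of Central Directory record stays at the very end and points
    # back at the Central Directory near the front of the file.
    eocd = struct.pack("<4s4H2LH", b"PK\x05\x06", 0, 0,
                       len(order), len(order), cd_size, cd_offset, 0)
    return struct.pack("<L", readahead) + b"".join(cdir) + b"".join(locals_) + eocd
```

The Central Directory size depends only on the entry names, so every Local Header offset can be computed before any data is written. That is what makes it possible to emit the index first.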

Reading Optimized Zip Files

1. Speculatively check whether the Central Directory signature can be found 4 bytes in.
2. If step 1 succeeded, assume the first 4 bytes are our read-ahead length and issue a platform-specific read-ahead call.

So for all cases where the zip file access pattern matches the one recorded during the profiling phase, Firefox can read the relevant resources in a single IO!
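
The two reader steps are small enough to sketch in Python. The helper names `readahead_length` and `prefetch` are hypothetical, the length is assumed to be little-endian, and `os.posix_fadvise` stands in for the "platform-specific read-ahead call" (it exists only on some Unix platforms, so the sketch simply skips the hint elsewhere).

```python
import os
import struct

CDIR_SIG = b"PK\x01\x02"  # Central Directory file header signature


def readahead_length(header: bytes):
    """Return the read-ahead length if the header looks optimized, else None."""
    # Step 1: is there a Central Directory signature 4 bytes in?
    if len(header) >= 8 and header[4:8] == CDIR_SIG:
        # Step 2: treat the first 4 bytes as the read-ahead length.
        return struct.unpack_from("<L", header, 0)[0]
    return None


def prefetch(path):
    """Ask the OS to read ahead; returns the length used, or None."""
    with open(path, "rb") as f:
        length = readahead_length(f.read(8))
        if length is not None and hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(f.fileno(), 0, length, os.POSIX_FADV_WILLNEED)
            except OSError:
                pass  # read-ahead is only a hint, so a failure is harmless
        return length
```

An ordinary zip starts with a Local Header signature, so the check fails cheaply and the reader can fall back to the usual scan from the end of the file.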

Further Zip Trivia

https://bugzilla.mozilla.org/show_bug.cgi?id=701875 renamed .jar files to .ja so that Microsoft System Restore wouldn’t corrupt Firefox.
At the time, the optimized jar change broke antivirus scanners, which further sped up Firefox startup :)
The zip reader in Firefox uses mmap, so zip files can be nested with zero overhead (as long as the nested file is stored without compression). This also broke OS/2 support, as OS/2 did not have a concept of memory mapping.
This work would not have worked out without massive amounts of help and follow-on work from Mike Hommey, Michael Wu and Robert Kaiser.
