mmap Page Fault Tracing with bpftrace and ext4

Previous blog post on how to trace Firefox IO using bpftrace via official page_fault_user tracepoint left me a bit unsatisfied with how complicated it turned out. Complexity has potential to be error-prone and the syscall-tracing dependency makes it impossible to trace IO within the main executable. I decided to try reimplement the trace using my old approach of tracing ext4 functions that handle page-faults. This turned out to be much more robust. This is now documented on my github. It’s ugly in that it’s dependent on internal kernel structures, but it catches 100% of the IO and requires no post-trace syscall-correlation/fudging. ...

March 15, 2022 · 1 min · Taras Glek

EBPF for Tracing How Firefox Uses Page Faults to Load Libraries

Modern browsers are some of the most complicated programs ever written. For example, the main Firefox library on my system is over 130Mbytes. Doing 130MB of IO poorly can be quite a performance hit, even with SSDs! :). Few people seem to understand how memory-mapped IO works. There are no pre-canned tools to observe it on Linux, thus even even fewer know how to observe it. Years ago, when I was working on Firefox startup performance, I discovered that libraries were loaded backwards (blog1, blog2, paper, GCC bug) on Linux. Figuring this out was super-painful, involved learning SystemTap and a setting up a just-right kernel with headers and symbols. ...

February 22, 2022 · 4 min · Taras Glek

Reading NFS at >=25GB/s using FIO + libnfs

My current employer does a lot of really cool systems work that’s covered by NDAs. I recently did some work to integrate a cool open source tool into our workflow. Felt it deserved a blog post. NFS Testing Requires Parallelism. I work for Pure Storage. One of the products we make is a scale-out NFS1 (and S3-compatible) server called FlashBlade. I was asked to test FlashBlade2 performance scaling. I needed to generate NFS read workloads of 15-300 Gigabytes/second. Given that NICs in our lab max out at 100Gbit, (~10GB/s), this required a lab with fast servers and multiple NICs per server. I wanted to make maximum use of hardware to minimize lab space + cost. To make use of multiple NICs I needed multiple NFS connections per NIC. Linux does not make this easy3. ...

November 24, 2021 · 3 min · Taras Glek

Firefox's Optimized Zip Format: Reading Zip Files Really Quickly

This post is about minimizing amount of disk IO and CPU overhead when reading Zip files. I recently saw an article about a new format that was faster than zip. This is quite surprising as to my mind, zip is one of the most flexible and low-overhead formats I’ve encountered. Some googling showed me that over past 11 years people have noticed that Firefox uses optimized zip files. This inspired me to document thinking behind the optimized zip format I implemented in Firefox in the pre-pandemic 2010. I had a lot of fun writing this code, was surprised that I failed to blog about it. ...

November 22, 2021 · 3 min · Taras Glek

Why Google Pixel lags 10x more than Moto Z

In my previous post I made an argument that a modern phone is only as fast as the slowest component: ability of NAND to handle 4k writes. I decided to compare two Android flagships on the opposite ends of random-write-4k benchmark spectrum: Moto Z vs Google Pixel. I wrote a little fio benchmark driver to fill all available device storage with random 4k writes, print perf stats along the way. Idea is to run the benchmark on /data/ partition, then fill all available space by writing to /storage/emulated/0, then do another round of testing on /data. ...

December 11, 2016 · 5 min · Taras Glek