Laggy phones and misleading benchmarks
TLDR: You can predict degree of unresponsiveness of a phone via random-write-4k benchmarks. I wish review websites would fill phones to 80-90% prior to running the benchmark, especially on smaller-capacity phones where users are more likely to run out of space.
SQLite vs Phone NAND
I’ve long held a theory that Android lag is almost directly determined by slowness induced by SQLite transactions. This weekend, while researching phones for a family member, I found some supporting evidence.
I was employed to analyze Firefox performance at Mozilla. Most of the time I focused on IO performance as my niche. This was relatively easy because desktop OSes (especially Windows with XPerf and Linux being open source) are very open to developers. Unfortunately as a hobbyist I have less chances to figure out why my phones are slow. None of my phones have root, let alone an unlocked bootloader (eg no ability to recompile the kernel with IO tracing functionality).
In the past I verified that all of my phones that got super-laggy were exhibiting single-digit-per-second write-random-4k benchmarks. However until now I couldn’t point at SQLite is the main driver of IO on Android.
To trace IOs on Android one has to recompile the kernel or at-least have root to run something like fsmon to observe high level IO. I was able to run fsmon on my rooted Android TV box and overserve that most of the IO occurred in SQLite databases.
For some reason Android does not default to using WAL journaling mode for SQLite which would make it use 2x-less IOPS.
Nested Journalling Magnifies Cost of SQLite IO
In addition to fsmon stats, I found a great paper on how SQLite accounts for 90% of Android IO and how it amplifies every write transaction by ~4x by the time it hits the underlying storage (eg 1commit ~= 4 fsyncs). It also shows how a 100 bytes of SQL data translates into 64KB of block writes.
Basic premise of the paper is that SQLite journaling is amplified by ext4 filesystem journal resulting in extreme badness. One is tempted to assume that it is further amplified by the GC on the EMMC NAND controller :)
I actually think the paper is overly optimistic in focusing on length of time taken by a single SQLite transaction. In reality one is likely to wait on more than one transaction due to having to update multiple databases or poorly written code (common problem with ORMs).
Combine above data with the fact that Phone NAND is the only component that gets consistently slower as your phone ages. Memory cells wear out and NAND garbage collector slows as the phone fills up to 80-90% of storage capacity. Note one can briefly regain better system performance by doing a full reset.
Bad Android IO Patterns
SQLite is the default way of persisting structured data on Android. Android documentation seems to default to showing how to do SQLite IO on main thread (explanation). This means that Android apps are often waiting on reading and writing to NAND instead of responding to user input.
Even if most of the IO happens on a background thread, the mechanics of IO dispatch and low queue depths in consumer-grade environments mean that even if there is a large off-main-thread/background IO infront small IO on main thread, small IO. will take a long time to complete. If one is lucky and only runs apps without main thread IO on Android,there will still be the problem of waiting for long IOs.
Conclusion
A core principle of performance engineering is that a system is only as fast as the slowest bottleneck. In this particular case the bottleneck is hit very frequently, so seemingly users don’t get to benefit from fancy CPUs much.
Interestingly, unlike with CPU perf, there is no correlation between random writes and price of the phone. Random 4k writes on modern flagship hw are very slow compared to any other metric. IPhone 7 struggles to do over 2MB/s. Google Pixel struggles to get above 2MB/s too.
This means that irrespective of the graphics cores, CPU cores, your phone is going to suck as much as random write perf… This sort of barely-acceptable performance will quickly turn into a “My phone is too laggy, I need to upgrade” as NAND perf deteriorates.
Instead of burying random-write-4k performance (or not doing that test at all), reviews should expose that front-and-center. Ideally they would also fill up the phone to 80% to match a realistic usecase.
There is atleast one phone vendor who gets it. Motorola G4 is 7x better than the flagships. Surprisingly my $60 ZTE ZMax Pro phone is also 2-3x better than the flagships.
If you know people who run hardware review websites, please ask them to focus on random-write-4k performance as predictor of jank/lag/frustration.