Lyft’s Tips for Avoiding (Software) Crashes
Understanding what caused an app to crash is quite an undertaking. Bugs can happen anywhere in the codebase and vary in complexity and actionability (the engineering effort required for the first improvement). In some cases, finding the root cause of a crash requires deep knowledge of the underlying systems and frameworks.
If you don’t know where to start looking for the solution, the number of potential entryways can be paralyzing.
I’m guilty of opening an app and double tapping and swiping up to close it if it doesn’t work fast enough or if it even looks like it’s crashing. Guilty AF.
Up until now, I shied away from reading mobile performance articles. But a recent blog post by Wen Zhao, a staff Android engineer at ride-sharing service Lyft, caught my eye because it has a lot to do with how data persistence leads to application crashes.
TL;DR: Monitor those reads and writes and keep them as low as possible, or at the very least in a reasonable range. Don’t use synchronous interfaces for disk operations. And know those frameworks, people.
Note: The strategies outlined below are platform agnostic, though this article uses examples from Android to highlight their execution.
Zhao wrote that it’s “important to start with the most obvious low-hanging fruit.”
- Native crashes were not included in Lyft’s internal crash rate tracking. Native crashes occur in the native (C++) layer of the Android operating system. These are captured and reported differently. Lyft doesn’t pursue any additional crash reporting on them as they are not actionable.
- The top 10 crashes contributed to 53% of overall crashes. This was unexpected given how many types of crashes were possible. The chart below details the type and percentage.
- Top crashes were long-lasting and “not actionable.” They held their top positions for at least six months because fixing them would take an outsized amount of time. Some of them increased slowly over time, slipping under the radar of standard triage and on-call responsibilities.
The top crashes were then categorized into three buckets.
The third-party SDK bucket was not actionable since Lyft has no control over third-party SDKs. Lyft reported the crashes to Google Maps (the cause of the crashes) and both teams are working together to resolve them. We already know that native crashes are also not actionable. That left Lyft with Out of Memory (OOM) crashes as the lowest-hanging fruit. Instabug gives a good explanation of these.
Targeting OOM Crashes
The Investigation: Lyft engineers reviewed many OOM crash stack traces and found something they had in common: calls to an RxJava2 blocking API (e.g. blockingGet()) when reading values synchronously from disk.
Let’s look at Lyft’s internal storage solution. When it reads data from the disk, it always creates a new IO thread by subscribing on the IO scheduler, caches the data in a PublishRelay and then reads it synchronously with the blockingGet() function from RxJava2.
This approach is problematic for a few reasons in relation to OOM crashes. Per the RxJava docs, the IO scheduler can create an unbounded number of worker threads. The IO scheduler doesn’t remove idle threads immediately since it uses a CachedThreadPool. Rather, the scheduler keeps threads alive for about 60 seconds before clearing them.
And threads aren’t reused either. If there are 1,000 reads a minute, then there are 1,000 new threads, with each thread occupying approximately one to two MB of memory at a minimum, leading to OOM exceptions… that’s a lot of threads.
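To make the thread math concrete, here is a minimal sketch that uses the JDK's own cached thread pool as a stand-in for the IO scheduler. That substitution is an assumption for illustration only: it is not RxJava's implementation, and UnboundedPoolDemo and its methods are hypothetical names. A cached pool only reuses a thread that is idle, so when every read is still blocked, each new read gets a brand-new thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical demo class; a stand-in for an unbounded IO scheduler, not RxJava itself.
final class UnboundedPoolDemo {

    /** Submits tasks that all stay blocked and returns how many threads the pool had to create. */
    static int threadsCreatedFor(int blockedReads) {
        // newCachedThreadPool: no upper bound on threads, idle threads linger for 60 seconds.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockedReads; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // simulate a disk read that has not finished yet
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // No thread was ever idle, so none could be reused.
            return pool.getLargestPoolSize();
        } finally {
            release.countDown(); // let the simulated reads finish
            pool.shutdown();
        }
    }

    /** Rough memory estimate in MB, using the article's one to two MB per thread. */
    static long estimatedMegabytes(int threads, int megabytesPerThread) {
        return (long) threads * megabytesPerThread;
    }

    public static void main(String[] args) {
        System.out.println("threads created: " + threadsCreatedFor(50));
        System.out.println("1,000 threads at 1 MB each: " + estimatedMegabytes(1000, 1) + " MB");
    }
}
```

By the article's numbers, 1,000 threads at one to two MB each works out to roughly 1,000 to 2,000 MB, which is more memory than a mobile app can count on.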
The engineering team profiled the top disk read operations for Lyft’s apps and found the majority of disk reads came from two places in the codebase, where the number of reads was exceptionally high at more than 2,000 per minute. The root cause was located.
The Solution: The solution was straightforward since new threads were only created when data was read from the disk. When the app was launched via a cold start and data was read for the first time, the data was cached in local memory. This allowed all additional reads to come from the cache and prevented additional threads from being created.
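The caching idea can be sketched as a small read-through cache. This is only an illustration under assumptions, not Lyft's storage framework: ReadThroughCache and the diskReader function passed to it are hypothetical, and the "disk" is whatever function the caller supplies.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical read-through cache: the first read of a key hits the disk, later reads hit memory.
final class ReadThroughCache {
    private final Map<String, String> memory = new ConcurrentHashMap<>();
    private final Function<String, String> diskReader; // assumed stand-in for the real disk read
    private final AtomicInteger diskReads = new AtomicInteger();

    ReadThroughCache(Function<String, String> diskReader) {
        this.diskReader = diskReader;
    }

    /** Returns the cached value, reading from disk only when the key is not cached yet. */
    String get(String key) {
        // computeIfAbsent runs the loader at most once per key while the mapping exists.
        // If the loader returns null, nothing is cached and the next call reads the disk again.
        return memory.computeIfAbsent(key, k -> {
            diskReads.incrementAndGet();
            return diskReader.apply(k);
        });
    }

    /** Number of times the disk was actually read; useful for auditing read frequency. */
    int diskReadCount() {
        return diskReads.get();
    }
}
```

With a cache like this in front of the disk, 2,000 reads of the same key in a minute cost one disk read, and so at most one IO thread, instead of 2,000.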
The Results: OOM crashes were reduced as expected. Additionally, native crashes were reduced by 53%. Lyft engineers weren’t expecting such a large impact on native crashes but apparently, the cause of many native crashes was low application memory.
Targeting ANR Crashes
Application Not Responding (ANR) errors take place when the UI thread is blocked for longer than five seconds and (to put it gracefully) the operating system prompts the user to close the app. These aren’t as low-hanging as the OOMs but were still actionable.
The Investigation: Bugsnag’s stack trace reports, which also group ANRs with similar stack traces together, were necessary for finding the root cause of the ANRs. Lyft sorted the reports in descending order and found that their use of SharedPreferences was the source of most of the ANRs (also in the persistence layer).
Google recommends calling SharedPreferences.apply() to write and edit data asynchronously. But under the hood, SharedPreferences.apply() adds disk write operations to a queue rather than executing these operations immediately. During several lifecycle events, the main thread synchronously executes every write still pending in that queue. Many operations in the queue = a blocked main thread and an ANR.
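To see why a long queue hurts, here is a toy model in plain Java. It is a deliberate simplification and not Android's implementation: DeferredWriteQueue is a hypothetical class, it is single-threaded, and it leaves out the background thread that normally works through the queue. It only shows the shape of the problem: queuing a write is cheap, but whoever drains the queue pays for every pending write at once.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model only (hypothetical class, not thread-safe, not Android's SharedPreferences).
final class DeferredWriteQueue {
    private final Queue<Runnable> pending = new ArrayDeque<>();

    /** Models apply(): returns immediately, the disk write is only queued. */
    void apply(Runnable diskWrite) {
        pending.add(diskWrite);
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Models a lifecycle event: the calling thread (think: the main thread) runs every
     * pending write before it can continue. Returns how many writes it had to execute.
     */
    int drainOnCallingThread() {
        int executed = 0;
        Runnable write;
        while ((write = pending.poll()) != null) {
            write.run(); // the caller is blocked for the duration of each write
            executed++;
        }
        return executed;
    }
}
```

If 1,000 writes are pending and each takes even a few milliseconds, the drain can hold the calling thread for seconds, which is the territory where an ANR fires.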
To translate this new information to the Lyft codebase specifically, they profiled disk write operations and found that disk write frequency for Lyft’s applications was as high as 1,500 times per minute. They also found instances where the same value was written to the disk multiple times per second.
Eventually, the root cause boiled down to the fact that Lyft’s internal storage framework abstracted the underlying storage mechanism, meaning disk storage and memory storage used the same interface. Developers were inadvertently treating disk and memory storage as one and the same.
The Solution: The product teams worked to remove all unnecessary disk writes from their features. Logging was added to audit any additional disk writes. A memory cache was created at the feature level where the additional writes were added. Then the cache was synced with disk storage at frequencies depending on the use case. The disk storage interface was also separated from the memory storage interface.
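A feature-level cache that absorbs repeated writes and syncs to disk later could look roughly like the sketch below. It is written under assumptions and is not Lyft's code: DiskStore, WriteBehindCache and their method names are hypothetical, and the caller decides when flush() runs (for example from a scheduled background task).

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical disk-facing interface, kept separate so the cost of a call is visible in its name. */
interface DiskStore {
    void writeToDisk(String key, String value);
}

/** Hypothetical feature-level cache: writes land in memory and reach the disk only on flush(). */
final class WriteBehindCache {
    private final Map<String, String> memory = new HashMap<>();
    private final Map<String, String> dirty = new LinkedHashMap<>(); // changed since the last flush
    private final DiskStore disk;

    WriteBehindCache(DiskStore disk) {
        this.disk = disk;
    }

    synchronized void put(String key, String value) {
        String previous = memory.put(key, value);
        if (!value.equals(previous)) {
            dirty.put(key, value); // rewriting an unchanged value is skipped; the last value wins
        }
    }

    synchronized String get(String key) {
        return memory.get(key);
    }

    /** Writes each changed key to disk once and returns the number of disk writes performed. */
    synchronized int flush() {
        int writes = 0;
        for (Map.Entry<String, String> entry : dirty.entrySet()) {
            disk.writeToDisk(entry.getKey(), entry.getValue());
            writes++;
        }
        dirty.clear();
        return writes;
    }
}
```

How often to flush is the use-case-dependent part: a value that must survive a crash needs a short interval, while one that is cheap to recompute can wait longer. The tradeoff is that anything still in memory at the time of a crash is lost.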
The Results: There was a 21% reduction of ANRs after a few months of experimentation.
It was news to everyone that disk storage plays a much more critical role in application stability than previously known. With OOMs and ANRs reduced, a new long-term strategy was put in place that centered around what was learned throughout both investigations.
Lyft is going to continue working on its mobile performance. The next blog post promises to center around growing the actionability of issues in the performance space, by increasing investments in observability and debugging.