From blocked release to safe ship in 72 hours

Context

A consumer social app with more than 1M weekly active users was preparing a major release. Minutes before submission, monitoring showed a sharp increase in crashes on the latest build. Release risk was high, and the team needed a safe answer fast.

Symptoms

Crash-free users dropped. Launch time regressed on older devices. Internal reports looked noisy, but the user-facing impact was clear enough to stop the release.

EXC_BAD_ACCESS crashes clustered in the media-handling code.
Crash reports were more common on low-memory devices.
Launch regressions appeared after a recent image pipeline change.

Investigation

We correlated the crash spike with recent changes and narrowed the issue to an unsafe decode path introduced to improve feed performance.

Primary signals:

crash spike aligned with build 4.18.0
image decoding happened on a background queue with weak lifecycle guards
memory pressure caused the decode path to race the launch sequence
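
To make the second signal concrete, here is a minimal sketch of what a stronger lifecycle guard around a background decode could look like. It is an illustration under assumptions, not the app's code: DecodeToken and decodeGuarded are hypothetical names, and the owner (for example a feed cell or view model) is assumed to call cancel() when it goes away.

```swift
import Foundation

/// Hypothetical guard: the owner of a decode request cancels this token
/// when it is torn down, so late results are never delivered to it.
final class DecodeToken {
    private let lock = NSLock()
    private var cancelled = false

    func cancel() {
        lock.lock(); defer { lock.unlock() }
        cancelled = true
    }

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return cancelled
    }
}

/// Runs `decode` on a background queue and delivers the result only if the
/// token is still live, both before the work starts and after it finishes.
func decodeGuarded<T>(
    token: DecodeToken,
    on queue: DispatchQueue = .global(qos: .utility),
    deliverOn callbackQueue: DispatchQueue = .main,
    decode: @escaping () -> T?,
    completion: @escaping (T) -> Void
) {
    queue.async {
        // Owner already gone: skip the decode entirely.
        guard !token.isCancelled else { return }
        guard let result = decode() else { return }
        callbackQueue.async {
            // Owner went away while decoding: drop the result.
            guard !token.isCancelled else { return }
            completion(result)
        }
    }
}
```

The double check is the point of the sketch: checking only before the decode still lets a result land on an owner that disappeared while the work was running.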

Decision

We did not roll back the entire release. Instead, we scoped a smaller fix that removed the unstable path, restored safe defaults, and deferred the performance experiment to a later build.

Decision drivers:

high user impact and visible launch instability
root cause was isolated enough to fix without large code churn
release window was tight, so low-risk changes mattered more than broad cleanup

Fix

The hotfix focused on safety first:

replaced the unsafe decode path with incremental decoding
added memory warning handling and cache limits
moved non-critical media work out of the launch path
added breadcrumbs to improve future crash triage
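
As a rough illustration of the second item, memory warning handling and cache limits can be combined in one small wrapper. This is a sketch under assumptions rather than the shipped code: BoundedMediaCache and its parameters are hypothetical, and the notification name is injected so the example stays platform-neutral. On iOS you would pass UIApplication.didReceiveMemoryWarningNotification.

```swift
import Foundation

/// Hypothetical wrapper: a cache with hard limits that empties itself
/// when a memory warning notification arrives.
final class BoundedMediaCache {
    private let cache = NSCache<NSString, NSData>()
    private var observer: NSObjectProtocol?

    /// `memoryWarning` is an assumption of this sketch; on iOS pass
    /// `UIApplication.didReceiveMemoryWarningNotification`.
    init(maxItems: Int, maxBytes: Int, memoryWarning: Notification.Name) {
        cache.countLimit = maxItems        // upper bound on number of entries
        cache.totalCostLimit = maxBytes    // upper bound on summed entry cost
        observer = NotificationCenter.default.addObserver(
            forName: memoryWarning, object: nil, queue: nil
        ) { [weak self] _ in
            // queue: nil means this runs synchronously on the posting thread.
            self?.cache.removeAllObjects()
        }
    }

    deinit {
        if let observer = observer {
            NotificationCenter.default.removeObserver(observer)
        }
    }

    /// Stores bytes with their size as the cost, so `maxBytes` is meaningful.
    func store(_ data: Data, for key: String) {
        cache.setObject(NSData(data: data), forKey: NSString(string: key), cost: data.count)
    }

    /// Returns the cached bytes, or nil if never stored or already evicted.
    func data(for key: String) -> Data? {
        cache.object(forKey: NSString(string: key)).map { Data(referencing: $0) }
    }
}
```

NSCache treats both limits as guidance rather than strict guarantees, so the explicit purge on a memory warning is what gives a predictable floor under pressure.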

import ImageIO

// Do not keep decoded pixels cached, and do not decode at image creation time.
let options: [CFString: Any] = [
  kCGImageSourceShouldCache: false,
  kCGImageSourceShouldCacheImmediately: false
]
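
For readers who want to see how options like these fit into incremental decoding, the sketch below feeds image bytes to ImageIO in chunks. It is an assumption-laden illustration, not the hotfix itself: IncrementalImageDecoder is a hypothetical name, it requires an Apple platform, and it is meant to be used from a single queue.

```swift
import Foundation
import ImageIO

/// Hypothetical helper: accumulates image bytes and asks ImageIO for the
/// best image available so far. Not thread-safe; confine it to one queue.
final class IncrementalImageDecoder {
    private let source: CGImageSource
    private let decodeOptions: CFDictionary
    private var buffer = Data()

    init() {
        // Same options as in the snippet above: no cached decoded pixels,
        // no decode at creation time.
        let options: [CFString: Any] = [
            kCGImageSourceShouldCache: false,
            kCGImageSourceShouldCacheImmediately: false
        ]
        decodeOptions = options as CFDictionary
        source = CGImageSourceCreateIncremental(nil)
    }

    /// Appends a chunk and returns an image if one can be produced yet.
    /// The image may be partial until `isFinal` is true.
    func append(_ chunk: Data, isFinal: Bool) -> CGImage? {
        buffer.append(chunk)
        // ImageIO expects all bytes accumulated so far, not only the new chunk.
        CGImageSourceUpdateData(source, buffer as CFData, isFinal)
        guard CGImageSourceGetCount(source) > 0 else { return nil }
        return CGImageSourceCreateImageAtIndex(source, 0, decodeOptions)
    }
}
```

A caller would create one decoder per download, call append as data arrives, and pass isFinal: true with the last chunk. Unrecognized or truncated data simply yields nil instead of crashing the decode path.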

Outcome

Within 72 hours, the hotfix shipped. Stability recovered and performance improved enough to restore confidence.

crash-free users returned to 98.6%
launch time improved from 2.31s to 1.42s, about 39% lower
the team shipped without widening the change set

Lessons

The fastest fix is not always the safest one. Clear signals, tight scope, and business-aware prioritization gave the team a better result than a full rollback or a rushed rewrite.