Context

A consumer social app with more than 1M weekly active users was preparing a major release. Minutes before submission, monitoring showed a sharp increase in crashes on the latest build. Release risk was high, and the team needed a safe fix quickly.

Symptoms

The crash-free user rate dropped, and launch time regressed on older devices. Internal reports were noisy, but the user-facing impact was clear enough to halt the release.

  • EXC_BAD_ACCESS crashes clustered around media handling
  • crash reports were more common on low-memory devices
  • launch regressions appeared after a recent image pipeline change

Investigation

We correlated the crash spike with recent changes and narrowed the issue to an unsafe decode path introduced to improve feed performance.

Primary signals:

  • crash spike aligned with build 4.18.0
  • image decoding happened on a background queue with weak lifecycle guards
  • memory pressure caused the decode path to race the launch sequence
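
A minimal sketch of how a crash spike can be tied to a specific build, assuming per-build counts of active users and users who hit a crash are available from the crash reporter. The `BuildStats` type, its field names, the build labels, and the numbers are all hypothetical and are not the app's real telemetry.

```swift
import Foundation

/// Hypothetical per-build summary; the field names are assumptions.
struct BuildStats {
    let build: String
    let activeUsers: Int
    let usersWithCrash: Int

    /// Crash-free users as a percentage of active users.
    var crashFreeRate: Double {
        guard activeUsers > 0 else { return 100.0 }
        return Double(activeUsers - usersWithCrash) * 100.0 / Double(activeUsers)
    }
}

/// Returns the builds whose crash-free rate is below `threshold`, worst first.
func regressedBuilds(in stats: [BuildStats], below threshold: Double) -> [BuildStats] {
    stats
        .filter { $0.crashFreeRate < threshold }
        .sorted { $0.crashFreeRate < $1.crashFreeRate }
}

// Made-up numbers for illustration only.
let sample = [
    BuildStats(build: "previous", activeUsers: 1_000, usersWithCrash: 10),
    BuildStats(build: "candidate", activeUsers: 1_000, usersWithCrash: 60),
]
let flagged = regressedBuilds(in: sample, below: 98.0)
print(flagged.map { $0.build })
```

Grouping by build rather than by day is what separates a regression in one release from background noise across all versions.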

Decision

We did not roll back the entire release. Instead, we scoped a smaller fix that removed the unstable path, restored safe defaults, and deferred the performance experiment to a later build.

Decision drivers:

  • high user impact and visible launch instability
  • root cause was isolated enough to fix without large code churn
  • release window was tight, so low-risk changes mattered more than broad cleanup

Fix

The hotfix focused on safety first:

  • replaced unsafe decode with incremental decoding
  • added memory warning handling and cache limits
  • moved non-critical media work out of the launch path
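  • added breadcrumbs to improve future crash triage

The decode options turn off ImageIO's decoded-image caching, so pixels are not decoded eagerly when an image is created:

import ImageIO

// Keep decoded pixels out of ImageIO's cache and skip eager decoding at creation time.
let options: [CFString: Any] = [
  kCGImageSourceShouldCache: false,
  kCGImageSourceShouldCacheImmediately: false
]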
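
The first two bullets of the fix could look roughly like the sketch below. It is an illustration under stated assumptions, not the code that shipped: `IncrementalImageDecoder` and `BoundedImageCache` are hypothetical names, and the cache limits are placeholder values. The ImageIO and `NSCache` calls themselves are standard platform APIs.

```swift
import Foundation
import ImageIO

/// Hypothetical helper that feeds bytes to ImageIO as they arrive.
/// Not thread-safe: call it from a single serial queue.
final class IncrementalImageDecoder {
    private let options = [
        kCGImageSourceShouldCache: false,
        kCGImageSourceShouldCacheImmediately: false
    ] as CFDictionary
    private let source: CGImageSource
    private var received = Data()

    init() {
        source = CGImageSourceCreateIncremental(options)
    }

    /// Appends a chunk and returns the best image available so far, if any.
    func append(_ chunk: Data, isFinal: Bool) -> CGImage? {
        received.append(chunk)
        // ImageIO expects all bytes received so far, not only the new chunk.
        CGImageSourceUpdateData(source, received as CFData, isFinal)
        switch CGImageSourceGetStatus(source) {
        case .statusIncomplete, .statusComplete:
            return CGImageSourceCreateImageAtIndex(source, 0, options)
        default:
            return nil // header not parsed yet, or the data is not a valid image
        }
    }
}

/// Hypothetical bounded cache for encoded image bytes; the limits are illustrative.
final class BoundedImageCache {
    private let cache = NSCache<NSString, NSData>()

    init(maxBytes: Int = 32 * 1024 * 1024, maxItems: Int = 200) {
        cache.totalCostLimit = maxBytes
        cache.countLimit = maxItems
    }

    func store(_ data: Data, for key: String) {
        cache.setObject(data as NSData, forKey: key as NSString, cost: data.count)
    }

    func data(for key: String) -> Data? {
        cache.object(forKey: key as NSString).map { $0 as Data }
    }

    /// Call this from the app's memory-warning hook
    /// (on iOS, UIApplication.didReceiveMemoryWarningNotification).
    func handleMemoryWarning() {
        cache.removeAllObjects()
    }
}
```

Keeping the decoder on one serial queue and purging the cache on memory warnings addresses the two conditions named in the investigation: a decode path racing other work, and failures concentrated on low-memory devices.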

Outcome

The hotfix shipped within 72 hours. Stability recovered, and launch time improved enough to restore confidence in the release.

  • crash-free users returned to 98.6%
  • launch time improved from 2.31s to 1.42s (0.89s faster, about a 39% reduction)
  • the team shipped without widening the change set

Lessons

The fastest fix is not always the safest one. Clear signals, tight scope, and business-aware prioritization gave the team a better result than a full rollback or a rushed rewrite.