Most incidents in production happen within the first 48 hours of a deploy. Not because engineers are careless, but because production traffic exposes edge cases that staging never will. Good observability is your early-warning system.
Start with error rate by endpoint. A spike in 5xx errors on a specific route tells you where to start looking for the regression. Set an alert threshold at 2x that route's baseline error rate and page your on-call when it fires.
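As a rough sketch of that rule, the snippet below computes a 5xx rate per endpoint from (endpoint, status) pairs and flags routes above 2x their baseline. The function names, the input shape, and the `floor` parameter (a minimum threshold so that a zero baseline does not page on a single error) are illustrative assumptions, not part of any particular monitoring tool.

```python
from collections import Counter


def error_rates(requests):
    """Return {endpoint: fraction of responses with a 5xx status}.

    `requests` is an iterable of (endpoint, status_code) pairs; this input
    shape is an assumption made for the example.
    """
    totals, errors = Counter(), Counter()
    for endpoint, status in requests:
        totals[endpoint] += 1
        if 500 <= status <= 599:
            errors[endpoint] += 1
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


def endpoints_to_page(current, baseline, multiplier=2.0, floor=0.01):
    """Return the sorted endpoints whose error rate exceeds the alert threshold.

    The threshold is multiplier x baseline, but never below `floor`
    (a hypothetical safeguard for endpoints with a zero baseline).
    """
    firing = []
    for endpoint, rate in current.items():
        threshold = max(multiplier * baseline.get(endpoint, 0.0), floor)
        if rate > threshold:
            firing.append(endpoint)
    return sorted(firing)
```

In practice you would compute the baseline from a pre-deploy window of the same length, so the comparison is like for like.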
Second, track p50, p95, and p99 latency. p50 is the median: what a typical request experiences. p95 and p99 describe the slow tail: 1 request in 20 is slower than p95, and 1 in 100 is slower than p99. A widening gap between p50 and p99 often indicates a resource contention problem or an N+1 query.
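To make the percentiles concrete, here is a minimal sketch that uses the nearest-rank definition on a list of latency samples. Most metrics systems estimate percentiles from histograms instead, so treat this as an illustration of the idea; `latency_summary` and `tail_ratio` are names invented for the example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so integer inputs avoid float rounding surprises.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return p50, p95, p99 and the p99/p50 ratio for a window of samples."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    # A ratio that grows after a deploy suggests the tail is pulling away
    # from the median; guard against a zero median.
    tail_ratio = p99 / p50 if p50 else float("inf")
    return {"p50": p50, "p95": p95, "p99": p99, "tail_ratio": tail_ratio}
```

Comparing `tail_ratio` before and after a deploy is one simple way to notice the widening gap without picking absolute latency thresholds.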
Third, watch memory and CPU per instance. Gradual memory growth (a leak) and sustained high CPU (a hot loop) both manifest slowly and are missed by simple error-rate monitors.
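One way to catch gradual memory growth is to fit a straight line to an instance's memory samples and alert on the slope rather than on any single reading. The sketch below does an ordinary least-squares fit by hand; the sample format and the 5 MB per hour limit are assumptions chosen for illustration, and a real threshold depends on your workload.

```python
def slope_mb_per_hour(samples):
    """Least-squares slope of (hours_since_deploy, memory_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def looks_like_leak(samples, max_mb_per_hour=5.0):
    """True when memory grows faster than the (hypothetical) allowed rate."""
    return slope_mb_per_hour(samples) > max_mb_per_hour
```

A slope check like this needs several hours of data to be meaningful, which is why it complements, rather than replaces, fast-firing error-rate alerts.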
Fourth, track your downstream dependencies: database query times, external API p99s, cache hit rates. Your application may be healthy while a dependency is quietly degrading.
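A dependency check can be as simple as comparing each dependency's current p99 against its own baseline and tracking the cache hit rate alongside it. In this sketch the `DependencyStats` class, the field names, and the 1.5x slowdown factor are all hypothetical; substitute whatever your metrics pipeline actually exposes.

```python
from dataclasses import dataclass


@dataclass
class DependencyStats:
    name: str
    p99_ms: float           # p99 latency observed since the deploy
    baseline_p99_ms: float  # p99 latency from a pre-deploy window


def degraded_dependencies(stats, slowdown=1.5):
    """Names of dependencies whose p99 exceeds slowdown x their baseline."""
    return [s.name for s in stats if s.p99_ms > slowdown * s.baseline_p99_ms]


def cache_hit_rate(hits, misses):
    """Fraction of lookups served from cache, or None with no traffic."""
    total = hits + misses
    return hits / total if total else None
```

Returning `None` for an idle cache avoids reporting a misleading 0% hit rate when there were simply no lookups.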
Fifth — and most often skipped — monitor your business metrics. Order completion rate, sign-up funnel conversion, and checkout success are lagging indicators, but they catch regressions that pure infrastructure metrics miss.
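For a business metric, a reasonable starting point is to compare the completion rate in a window after the deploy with the same-length window before it. The sketch below assumes you can count started and completed orders per window; the 10% relative-drop limit is an illustrative default, not a recommendation.

```python
def completion_rate(started, completed):
    """Completed / started, or None when nothing was started."""
    return completed / started if started else None


def regression_suspected(before, after, max_relative_drop=0.10):
    """Compare two (started, completed) windows.

    Returns True when the completion rate fell by more than
    `max_relative_drop` relative to the pre-deploy window.
    """
    rate_before = completion_rate(*before)
    rate_after = completion_rate(*after)
    if not rate_before or rate_after is None:
        return False  # not enough data to compare
    return (rate_before - rate_after) / rate_before > max_relative_drop
```

Because these metrics lag, the check works best on windows long enough to smooth out normal hourly variation.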