---
id: "c9e5a3b6-2d4f-4a0c-b8e2-7f9d1c3a6b8e"
name: "Incident Runbook Memory"
type: memory
category: sysadmin
version: "1.0.0"
author: "markeddown"
license: MIT
min_context_tokens: 4096
target_frameworks:
- markeddown
- cursor
- claude
- generic
tags:
- incident-response
- sre
- runbook
- on-call
---
# Incident Runbook Memory
Persistent on-call context for fast incident triage. Load this memory into any AI agent that helps with production debugging. It carries forward what your team has already learned so the agent skips the diagnostic basics and goes straight to the root cause.
## Service Map
| Service | Owner | Repo | Health Check |
|---|---|---|---|
| API Gateway | Platform team (Maya, Ravi) | github.com/acme/gateway | GET /healthz — returns 200 with `{"db":"ok","cache":"ok"}` |
| Auth Service | Identity team (Jun) | github.com/acme/auth | GET /auth/health — returns 200; check `token_verify_latency_ms` in response |
| Job Queue (BullMQ) | Platform team (Maya) | github.com/acme/workers | Redis queue depth: `bull:email:waiting` count. Alert if > 500 |
| PostgreSQL 16 | Managed (Neon) | — | Connection count: `SELECT count(*) FROM pg_stat_activity` — alert if > 80% of pool |
| Redis 7 | Managed (Upstash) | — | `INFO memory` — alert if `used_memory` exceeds 200MB on free tier |
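The HTTP checks in the table can be swept in one pass. A minimal sketch, assuming the gateway and auth services share the `api.acme.dev` host used in Quick Commands; set `RUN_CHECKS=1` to actually hit the network:

```shell
#!/usr/bin/env bash
# Sweep the HTTP health checks from the service map.
# check NAME URL — prints the service name and the HTTP status code.
check() {
  local name="$1" url="$2" code
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
  printf '%-12s %s\n' "$name" "$code"
}

# Guarded so sourcing this file defines check() without touching the network.
if [ "${RUN_CHECKS:-0}" = "1" ]; then
  check gateway https://api.acme.dev/healthz
  check auth    https://api.acme.dev/auth/health
fi
```

The SQL and Redis checks in the table need `psql`/`redis-cli` and are better kept as the separate Quick Commands at the bottom of this file.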
## Monitoring and Dashboards
- **Primary dashboard:** https://vercel.com/acme/api-gateway/analytics (request volume, error rate, p95 latency)
- **Error tracking:** https://sentry.io/acme/api-gateway/ (alert rules fire on new issue or 10x spike)
- **Logs:** Vercel function logs for serverless; `railway logs` for the worker service
- **Uptime monitor:** BetterStack status page at https://status.acme.dev — checks /healthz every 60s from 3 regions
- **Alerting channel:** #incidents in Slack (BetterStack + Sentry integrations post here). PagerDuty for SEV1 only.
## Known Failure Modes
### 1. Database Connection Pool Exhaustion
- **Symptoms:** 500 errors across all API routes simultaneously. Neon dashboard shows active connections at pool ceiling. Logs contain `remaining connection slots are reserved` or `too many clients already`.
- **Root cause:** Usually a long-running analytics query run from a developer's local machine against production, or a code path that opens a transaction and never commits (often an error-handling branch that `throw`s without a `finally` to release the connection).
- **Diagnosis:** Run `SELECT pid, state, query, query_start FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start` to find the offending query. Check if `query_start` is more than 30 seconds old.
- **Fix:** `SELECT pg_terminate_backend(pid)` on the stuck connection. If the pool is fully jammed, restart the Vercel deployment (`vercel redeploy`). Then find and fix the leaked connection in code — look for `try` blocks that call the DB but have no `finally` to release.
- **Prevention:** Set `statement_timeout = '30s'` in the connection string. Add connection release in `finally` blocks for all transaction-based queries.
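The diagnosis and kill steps above can be wrapped in one helper. A sketch, not a drop-in tool: the 30-second cutoff mirrors the `statement_timeout` recommendation, and the live `psql` call assumes `$DATABASE_URL` is set:

```shell
#!/usr/bin/env bash
# Build the stuck-query SQL so the age threshold lives in one place.
stuck_sql() {
  printf "SELECT pid, state, query_start, left(query, 80) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '%s seconds' ORDER BY query_start;" "${1:-30}"
}

# Terminate one backend by pid (the fix step above).
kill_sql() {
  printf 'SELECT pg_terminate_backend(%s);' "$1"
}

# Live usage, guarded so the file can be sourced without a database:
if [ -n "${DATABASE_URL:-}" ]; then
  psql "$DATABASE_URL" -c "$(stuck_sql 30)"
fi
```

Run the `stuck_sql` output first, confirm the offending `pid`, then feed it to `kill_sql` — terminating the wrong backend drops a live user query.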
### 2. Auth Token Verification Failures
- **Symptoms:** Users get logged out randomly. 401 responses on authenticated endpoints. Sentry shows `JWT verification failed: invalid signature`.
- **Root cause:** The JWT signing secret was rotated in the environment but the auth service was not redeployed, so it still holds the old secret in memory. Alternatively, a clock skew between services causes `exp` validation to fail.
- **Diagnosis:** Check `AUTH_JWT_SECRET` in the running environment (`vercel env ls`). Compare the first 8 characters against what's in the secrets manager. If they match, check server time with `date -u` on the worker.
- **Fix:** If secret mismatch: redeploy the auth service so it picks up the new env var. If clock skew: the managed service provider needs to fix it (file a support ticket). As a stopgap, add 30 seconds of leeway to the `exp` check.
- **Prevention:** Deploy automatically after secret rotation. Add a health check that verifies a test token on startup.
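The prefix comparison from the diagnosis step can be scripted. A sketch with placeholder inputs; in practice the two values come from `vercel env` and your secrets manager:

```shell
#!/usr/bin/env bash
# Compare the first 8 characters of two secrets, per the diagnosis step.
prefix8() { printf '%s' "$1" | cut -c1-8; }

compare_secrets() {
  local deployed="$1" vault="$2"
  if [ "$(prefix8 "$deployed")" = "$(prefix8 "$vault")" ]; then
    echo "prefixes match: suspect clock skew, check 'date -u' on each host"
  else
    echo "prefix mismatch: redeploy so the service picks up the rotated secret"
  fi
}
```

Comparing only a prefix avoids pasting full secrets into a terminal or incident channel while still distinguishing the two root causes above.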
### 3. Job Queue Backup
- **Symptoms:** Emails, webhooks, or async tasks stop processing. Redis `bull:email:waiting` count climbs above 500. No errors in the worker logs — it has simply stopped pulling jobs.
- **Root cause:** The worker process crashed silently (OOM on a large payload) and the process manager did not restart it, or Redis hit its memory limit and started evicting keys.
- **Diagnosis:** `railway logs --service workers --tail 100` to check if the worker is alive. `redis-cli INFO memory` to check `used_memory_peak` vs `maxmemory`. Check for `OOMKilled` in container events.
- **Fix:** Restart the worker (`railway restart --service workers`). If Redis is full, flush completed job data: `redis-cli DEL bull:email:completed`. For OOM, find the oversized payload in the dead letter queue and fix the upstream sender.
- **Prevention:** Set `maxmemory-policy allkeys-lru` on Redis. Add a health check that alerts when queue depth > 200. Set worker memory limit to 512MB with an explicit restart policy.
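The queue-depth alert from the prevention bullet can be sketched as a small threshold check; the 200-job limit and the `bull:email:waiting` key are the values used above:

```shell
#!/usr/bin/env bash
# queue_status DEPTH [LIMIT] — alert when queue depth exceeds the limit.
queue_status() {
  local depth="$1" limit="${2:-200}"
  if [ "$depth" -gt "$limit" ]; then
    echo "ALERT: queue depth $depth exceeds $limit"
  else
    echo "ok: queue depth $depth"
  fi
}

# Live usage (requires redis-cli and $REDIS_URL):
#   queue_status "$(redis-cli -u "$REDIS_URL" LLEN bull:email:waiting)"
```

Wiring this into a cron job or uptime check catches the "worker died silently" case, since the symptom is rising depth with no errors in the logs.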
### 4. Deployment Broke Everything
- **Symptoms:** Error rate spikes within 5 minutes of a deploy. New errors in Sentry that did not exist before. Users report broken features.
- **Root cause:** A code change reached production that was not caught in CI. Common causes: database migration that ran in a different order than expected, a new env var that was not added to Vercel, or a dependency update with a breaking change.
- **Diagnosis:** `vercel ls --limit 5` to identify the deploy. `git log --oneline HEAD~3..HEAD` to see what changed. Check Vercel build logs for warnings. Check Sentry for the first new error and trace it to the diff.
- **Fix:** Roll back immediately: `vercel rollback`. Then investigate the diff locally. Do not try to fix-forward under time pressure unless the fix is a one-line env var addition.
- **Prevention:** Add a post-deploy smoke test that hits 3 critical endpoints and alerts if any return non-200. Never deploy on Friday afternoons.
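The post-deploy smoke test from the prevention bullet can be sketched as a loop over critical paths. The endpoint list in the usage line is illustrative, not the team's actual checklist:

```shell
#!/usr/bin/env bash
# smoke BASE_URL PATH... — hit each path, report per-path status,
# and return nonzero if any endpoint is non-200.
smoke() {
  local base="$1"; shift
  local failed=0 path code
  for path in "$@"; do
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$base$path")
    if [ "$code" != "200" ]; then
      echo "FAIL $path -> $code"
      failed=1
    else
      echo "ok   $path"
    fi
  done
  return "$failed"
}

# Usage: smoke https://api.acme.dev /healthz /auth/health /api/v1/status
```

Run it as the last CI step after `vercel deploy`; a nonzero exit can gate promotion or trigger an automatic `vercel rollback`.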
## Escalation Contacts
| Severity | Who to Contact | Channel | SLA |
|---|---|---|---|
| SEV1 (site down, data loss risk) | Maya Chen (platform lead) | PagerDuty — `acme-platform` service | Acknowledge in 5 min, respond in 15 min |
| SEV2 (degraded, partial outage) | On-call engineer (rotation) | #incidents in Slack, tag @oncall | Respond in 30 min |
| SEV3 (minor, single feature broken) | Feature team owner | #engineering in Slack | Next business day |
| SEV4 (cosmetic, non-blocking) | File a Linear issue | Linear project "BUGS" | Sprint triage |
## Past Incidents
| Date | Incident | Root Cause | Resolution | Duration | Postmortem |
|---|---|---|---|---|---|
| 2026-03-14 | API 500s for 22 min during peak traffic | Connection pool exhausted — analytics query from dev laptop held 40 connections | Killed the query, redeployed, added `statement_timeout = '30s'` | 22 min | Added pool monitoring alert at 80% capacity |
| 2026-03-02 | Auth failures for ~200 users over 45 min | JWT secret rotated via CI but auth service deployment was skipped | Redeployed auth service, added auto-deploy hook to secret rotation pipeline | 45 min | Secret rotation now triggers deploy automatically |
| 2026-02-18 | Email queue backed up to 3,400 jobs | Worker OOM on a 12MB webhook payload from a partner integration | Restarted worker, added 1MB payload size limit to the queue producer | 2 hr | Added dead letter queue monitoring and payload size validation |
| 2026-01-29 | Homepage blank white screen for 8 min | Frontend build succeeded but SSR hydration failed due to missing env var `NEXT_PUBLIC_API_URL` in Vercel | Added env var, redeployed | 8 min | Added CI check that validates all `NEXT_PUBLIC_*` vars are set before deploy |
## Quick Commands
```bash
# Check recent deploys
vercel ls --limit 5
# Tail production logs (API)
vercel logs --follow
# Tail worker logs
railway logs --service workers --tail 200
# Check database connections
psql $DATABASE_URL -c "SELECT pid, state, query_start, left(query, 80) FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;"
# Kill a stuck database query
psql $DATABASE_URL -c "SELECT pg_terminate_backend(<pid>);"
# Check Redis queue depth
redis-cli -u $REDIS_URL LLEN bull:email:waiting
# Roll back to previous deploy
vercel rollback
# Force redeploy current code
vercel redeploy --force
# Check if a specific endpoint is healthy
curl -s -o /dev/null -w "%{http_code}" https://api.acme.dev/healthz
```