
10:42 AM — Support: "Getting reports that AI recommendations are way off. 12 tickets in the last hour."
10:44 AM — PM: "Checking… model accuracy looks normal on our dashboard. Probably user error?"
11:03 AM — Support: "Now 28 tickets. Users are angry. Can we turn this off?"
11:05 AM — PM: "Turn what off? The feature is hardcoded. We'd need to deploy a code change."
11:07 AM — Eng Lead: "Deploy takes 45 minutes (build + test + rollout). We're in the middle of a freeze."
11:15 AM — CEO: "This is trending on Twitter. Turn. It. Off. Now."
11:17 AM — PM: "We can't. No kill switch."
The Damage:
The Feature: AI-powered email auto-responder (suggests replies to customer support tickets).
The Model: Fine-tuned LLM, retrained weekly on new support ticket data.
The Deployment: Model update deployed Friday 6 PM. QA passed. Went live.
The Incident: Saturday morning, model started suggesting inappropriate responses (sarcastic tone, off-topic, occasionally rude).
Root Cause Analysis:
Why It Got Expensive:

What: Boolean toggle in config file or admin panel.
How:
if (featureFlags.aiEmailSuggestions === true) {
  return getAISuggestion(ticket);
} else {
  return null; // Fall back to manual response
}
Who Can Use: PM, eng lead, on-call engineer
Response Time: Under 2 minutes
When to Use: Model degrades, user complaints spike, unexpected behavior
What: Adjustable threshold for AI confidence. Only show suggestions above threshold.
How:
const confidenceThreshold = config.aiConfidenceThreshold; // default: 0.7
if (aiConfidence > confidenceThreshold) {
  return aiSuggestion;
} else {
  return null; // Don't show low-confidence suggestions
}
Who Can Use: PM, data scientist
Response Time: 5-10 minutes
When to Use: Model is mostly correct but has higher-than-usual error rate; want to reduce false positives without full shutdown
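To see what turning the dial does in practice, here is an illustrative sketch: at a higher threshold, fewer (but more reliable) suggestions reach agents. The function name and the confidence numbers are made up for the example.

```javascript
// Illustrative: the fraction of tickets that would get an AI
// suggestion at a given confidence threshold. Raising the threshold
// trades coverage for precision without a full shutdown.
function shownRate(confidences, threshold) {
  const shown = confidences.filter((c) => c > threshold).length;
  return shown / confidences.length;
}
```

During an elevated-error-rate incident, a PM or data scientist might bump the threshold from 0.7 to 0.85 and watch whether complaint volume drops before deciding on a full disable.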
What: Switch from new model (v2.3) back to old model (v2.2).
How:
const modelVersion = config.aiModelVersion; // "v2.3" or "v2.2"
const model = loadModel(modelVersion);
Who Can Use: ML engineer, PM (with approval)
Response Time: 10-30 minutes (model swap, cache clear)
When to Use: New model version is fundamentally broken; previous version was stable
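The "model swap, cache clear" steps above can be sketched end to end. This is a hedged illustration, not the team's real API: `loadModel`, `modelCache`, and the config shape are assumptions.

```javascript
// Sketch of a config-driven model rollback. Changing
// config.aiModelVersion and clearing the cache swaps models without
// a code deploy. Names and loader are stand-ins, not a real API.
const modelCache = new Map();

function loadModel(version) {
  // Stand-in loader; a real one would fetch weights from storage.
  return { version, suggest: (ticket) => `[${version}] draft for #${ticket.id}` };
}

function getModel(config) {
  const version = config.aiModelVersion; // e.g. "v2.3" or "v2.2"
  if (!modelCache.has(version)) {
    modelCache.set(version, loadModel(version));
  }
  return modelCache.get(version);
}

function rollback(config, toVersion) {
  config.aiModelVersion = toVersion;
  modelCache.clear(); // the "cache clear" step: drop the bad model's entries
  return getModel(config);
}
```

The 10-30 minute response time mostly comes from loading the old model's weights and warming caches, not from the config change itself.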
Signals:
Action:
If Model Degradation Confirmed:
Communication:
Questions to Answer:
Data to Collect:
Temporary Fix (2-6 hours):
Permanent Fix (1-2 days):
Template:
INCIDENT: AI Email Suggestions Degraded (Aug 4, 2025)

IMPACT:
- Duration: 4 hours (10:42 AM - 2:47 PM)
- Users affected: ~2,000 (all users saw degraded suggestions)
- Support tickets: 200+
- Customer escalations: 3 (enterprise accounts)

ROOT CAUSE:
- Training data included spam tickets with sarcastic responses
- Model learned inappropriate tone patterns
- Offline eval set didn't include spam-like inputs
- No kill switch for rapid disable

TIMELINE:
- 10:42 AM: First support ticket
- 11:05 AM: PM confirms issue, realizes no kill switch
- 11:30 AM: Emergency rollback initiated (bypassed deploy freeze)
- 2:47 PM: Rollback complete, feature re-enabled with old model

WHAT WENT WELL:
- Team responded quickly once severity understood
- Support team communicated proactively with affected users

WHAT WENT POORLY:
- No kill switch → 4-hour response time (should be <10 minutes)
- Training data quality not monitored → spam leaked in
- Eval set didn't cover spam-like inputs → missed in QA

ACTION ITEMS:
- [PM] Add feature flag kill switch (due: Aug 6)
- [ML] Implement training data quality checks (due: Aug 10)
- [ML] Expand eval set to include edge cases (due: Aug 10)
- [Eng] Add confidence threshold dial (due: Aug 12)
- [PM] Update runbook: how to disable AI features (due: Aug 8)
Incident: Zillow Offers (AI-powered home buying) overvalued homes by hundreds of millions.
Impact: $881M inventory write-down, program shut down.
Root Cause: Model didn't adapt to rapidly changing housing market (2020-2021 COVID boom).
Missing: Kill switch to pause home purchases when model confidence dropped.
Lesson: If your AI controls high-value decisions (purchases, lending, hiring), you need real-time confidence monitoring + auto-shutoff.
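The "real-time confidence monitoring + auto-shutoff" idea can be sketched as a circuit breaker over a rolling window of model confidence scores. The window size and floor below are illustrative, not values any of these companies used.

```javascript
// Sketch: trip a pre-authorized kill switch when mean confidence
// over a rolling window falls below a floor. Thresholds are
// illustrative assumptions.
class ConfidenceBreaker {
  constructor({ windowSize = 100, floor = 0.6 } = {}) {
    this.windowSize = windowSize;
    this.floor = floor;
    this.scores = [];
    this.tripped = false;
  }

  record(confidence) {
    this.scores.push(confidence);
    if (this.scores.length > this.windowSize) this.scores.shift();
    const mean = this.scores.reduce((a, b) => a + b, 0) / this.scores.length;
    // Only trip on a full window, so a few early low scores don't halt everything
    if (this.scores.length === this.windowSize && mean < this.floor) {
      this.tripped = true; // pre-authorized: no human approval needed to halt
    }
    return this.tripped;
  }

  allow() {
    return !this.tripped; // once tripped, stay off until a human resets it
  }
}
```

The key design choice is asymmetry: the breaker trips automatically, but only a human can reset it after investigating.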
Incident: Microsoft's Tay chatbot learned offensive language from Twitter trolls within 24 hours.
Impact: Tay taken offline after 16 hours; PR disaster.
Root Cause: No content filtering on training data (learned from adversarial inputs).
Missing: Kill switch + human review layer for public-facing AI.
Lesson: If your AI learns from user inputs, you need toxicity filters + manual override.
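The "toxicity filters + manual override" lesson amounts to a routing gate: anything the filter flags goes to a human review queue instead of being shown or sent. The scoring function below is a toy heuristic standing in for a real moderation classifier, and the cutoff is an illustrative assumption.

```javascript
// Toy stand-in for a real toxicity classifier (e.g. a hosted
// moderation API); counts flagged phrases. Word list and cutoff
// are illustrative only.
function toxicityScore(text) {
  const flagged = ['idiot', 'stupid', 'shut up'];
  const hits = flagged.filter((w) => text.toLowerCase().includes(w)).length;
  return Math.min(1, hits / 2);
}

function routeSuggestion(suggestion, cutoff = 0.5) {
  if (toxicityScore(suggestion) >= cutoff) {
    return { action: 'human_review', suggestion }; // never auto-send flagged text
  }
  return { action: 'show_to_agent', suggestion };
}
```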
Incident: Trading algorithm malfunction lost $440M in 45 minutes.
Impact: Near-bankruptcy; the firm was saved by an emergency investment and later acquired by Getco.
Root Cause: Software deployment error activated old test code in production.
Missing: Kill switch for algorithmic trading (required manual approval to halt).
Lesson: High-frequency AI decisions need pre-authorized kill switches (not "find PM, get approval, then disable").

Acceptable Exceptions (rare):
Everything Else Needs a Kill Switch.

Incident Response Time:
Customer Impact:
Engineering Cost:
Reputational Cost:
The math is simple: 1 day of eng time to build a kill switch saves 100x that in incident response.
PM to Eng Lead: "Every AI feature needs a kill switch. It's not paranoia—it's incident response. Budget 1 day per feature."
Eng Lead: "We're already behind schedule."
PM: "When this breaks in production and we can't turn it off, we'll lose 4 hours of the team's time. Plus customer trust. Plus revenue. 1 day upfront or 100x that later. Your call."
Alex Welcing is a Senior AI Product Manager who learned the kill switch lesson the expensive way (once). His AI features now ship with feature flags, confidence dials, and model rollback plans—because 2-minute response time beats 4-hour scrambles.