When an ML team needs a formal incident severity matrix
Incident Classification Matrix AI Prompt
This prompt defines severity levels, SLAs, declaration rules, and communication templates for ML production incidents. It is helpful when teams need a shared language for incident response before building detailed runbooks.
Define an ML model incident classification matrix and response procedures for each severity level.

1. Severity levels and definitions:

P0 - Critical (page immediately, all hands):
- Model is returning errors for > 5% of requests (hard failures)
- Model is completely unresponsive (serving down)
- Model predictions are obviously wrong across the board (e.g. a classifier predicting the same class for every input)
- Downstream system failure caused by model output
Response SLA: acknowledge in 5 minutes, update in 15 minutes, resolve or mitigate in 60 minutes

P1 - High (page on-call engineer):
- Model latency p99 > 2× SLA for > 10 minutes
- Error rate > 1% and rising
- Significant prediction distribution shift detected (PSI > 0.5)
- Silent accuracy degradation confirmed (performance drop > 10% vs baseline)
Response SLA: acknowledge in 15 minutes, resolve or mitigate in 4 hours

P2 - Medium (notify ML team via Slack):
- Model latency p99 between 1× and 2× SLA
- Moderate drift detected (PSI 0.2–0.5)
- Performance drop 5–10% vs baseline
- Label rate dropped below the expected level (feedback loop issue)
Response SLA: acknowledge in 1 hour, resolve in 24 hours

P3 - Low (create ticket, handle next business day):
- Minor drift (PSI 0.1–0.2)
- Performance drop < 5%
- Monitoring data quality issues (missing logs, delayed metrics)
Response SLA: acknowledge in 4 hours, resolve in 1 week

2. Incident declaration criteria:
- Any automated alert at P0 or P1 automatically creates an incident
- P2 and P3: engineer uses judgment based on business context

3. Incident communication templates:
- Status page update: 'Investigating reports of [issue] affecting [model]. Engineers are engaged.'
- Internal Slack: 'P[X] incident declared for [model_name]. Owner: [name]. Bridge: [link]'

Return: classification matrix table, SLA definitions, alert-to-incident mapping, and communication templates.
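To make the numeric thresholds in this matrix concrete, here is a minimal Python sketch that maps a monitoring snapshot to a severity level and applies the rule that P0 and P1 alerts declare an incident automatically. `ModelHealth`, `classify`, and `should_auto_declare` are hypothetical names, not part of any library, and the default field values are placeholders for a healthy model. Only the measurable criteria are encoded; the judgment-based ones (obviously wrong predictions, downstream failures, label-rate drops, monitoring data quality issues) are left out. Two interpretations are assumptions: where the ranges share an endpoint (for example PSI exactly 0.5 or a drop of exactly 10%), the sketch assigns the lower severity, and a p99 above 2× SLA lasting under 10 minutes falls to P2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelHealth:
    """Hypothetical snapshot of monitoring signals; defaults describe a healthy model."""
    serving_up: bool = True              # False when the endpoint is unresponsive
    error_rate: float = 0.0              # fraction of failed requests (0.02 = 2%)
    error_rate_rising: bool = False      # True if the error-rate trend is upward
    p99_latency_ms: float = 100.0        # placeholder value
    latency_sla_ms: float = 200.0        # placeholder value
    latency_breach_minutes: float = 0.0  # how long p99 has stayed above 2x SLA
    psi: float = 0.0                     # PSI of the prediction distribution
    perf_drop: float = 0.0               # relative drop vs baseline (0.12 = 12%)


def classify(h: ModelHealth) -> Optional[str]:
    """Return "P0".."P3", or None if no numeric criterion is met.

    Levels are checked from most to least severe, so the first match wins.
    """
    latency_ratio = h.p99_latency_ms / h.latency_sla_ms

    # P0: hard failures on more than 5% of requests, or serving is down.
    if not h.serving_up or h.error_rate > 0.05:
        return "P0"

    # P1: sustained latency above 2x SLA, rising errors above 1%,
    # strong shift (PSI > 0.5), or a performance drop above 10%.
    if (
        (latency_ratio > 2 and h.latency_breach_minutes > 10)
        or (h.error_rate > 0.01 and h.error_rate_rising)
        or h.psi > 0.5
        or h.perf_drop > 0.10
    ):
        return "P1"

    # P2: p99 above SLA (assumed to include >2x breaches shorter than 10 minutes),
    # moderate drift (PSI 0.2-0.5), or a 5-10% performance drop.
    if latency_ratio > 1 or h.psi >= 0.2 or h.perf_drop >= 0.05:
        return "P2"

    # P3: minor drift (PSI 0.1-0.2) or any smaller performance drop.
    if h.psi >= 0.1 or h.perf_drop > 0:
        return "P3"

    return None


def should_auto_declare(severity: Optional[str]) -> bool:
    """P0 and P1 alerts open an incident automatically; P2/P3 need human judgment."""
    return severity in ("P0", "P1")
```

Checking the levels top-down means a snapshot that trips several criteria gets the most severe level. Writing the rules as code also exposes gaps: an error rate between 1% and 5% that is not rising matches no level as written, which is worth resolving in the matrix itself.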
When to use this prompt
when alerts must map consistently to incident declarations
when response SLAs should be explicit for model-related failures
when standard communication templates are needed for incidents
What the AI should return
An ML incident classification matrix with severity definitions, SLAs, declaration logic, and communication templates.
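The SLA definitions and communication templates can also be captured as data next to the matrix. This minimal Python sketch computes absolute deadlines from the moment an incident is declared and fills in the two message templates from the prompt. The names `SLAS`, `sla_deadlines`, `slack_message`, and `status_message` are hypothetical. Deadlines are computed in wall-clock time (the "next business day" handling for P3 is not modeled), and "resolve" stands for "resolve or mitigate" at P0 and P1.

```python
from datetime import datetime, timedelta

# Response SLAs from the matrix. For P0 and P1, "resolve" means resolve or mitigate.
SLAS: dict[str, dict[str, timedelta]] = {
    "P0": {
        "acknowledge": timedelta(minutes=5),
        "update": timedelta(minutes=15),
        "resolve": timedelta(minutes=60),
    },
    "P1": {"acknowledge": timedelta(minutes=15), "resolve": timedelta(hours=4)},
    "P2": {"acknowledge": timedelta(hours=1), "resolve": timedelta(hours=24)},
    "P3": {"acknowledge": timedelta(hours=4), "resolve": timedelta(weeks=1)},
}


def sla_deadlines(severity: str, declared_at: datetime) -> dict[str, datetime]:
    """Return the absolute deadline for each SLA step of an incident."""
    return {step: declared_at + delta for step, delta in SLAS[severity].items()}


def slack_message(severity: str, model_name: str, owner: str, bridge: str) -> str:
    """Fill in the internal Slack template for a declared incident."""
    return f"{severity} incident declared for {model_name}. Owner: {owner}. Bridge: {bridge}"


def status_message(issue: str, model: str) -> str:
    """Fill in the status page template."""
    return f"Investigating reports of {issue} affecting {model}. Engineers are engaged."
```

For example, a P1 declared at 09:00 gets an acknowledgement deadline of 09:15 and a resolve-or-mitigate deadline of 13:00.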
How to use this prompt
Open your data context
Load your dataset, notebook, or working environment so the AI can operate on the actual project context.
Copy the prompt text
Copy the prompt text above and paste it into your AI assistant's input area.
Review the output critically
Check whether the result matches your data, assumptions, and desired format before moving on.
Chain into the next prompt
Once you have the first result, go deeper with related prompts in Production Incident Response.
Frequently asked questions
What does the Incident Classification Matrix prompt do?
It gives you a structured starting point for production incident response in MLOps work (severity matrix, response SLAs, alert-to-incident mapping, and communication templates), so you don't start from a blank page.
Who is this prompt for?
It is designed for MLOps workflows and is marked as beginner level, so it works well as a guided starting point for practitioners at that level.
What type of prompt is this?
Incident Classification Matrix is a single prompt. You can copy it as-is, adapt it, or use it as one step inside a larger workflow.
Can I use this outside MLJAR Studio?
Yes. The prompt text works in other AI tools too, but MLJAR Studio is the best fit when you want local execution, visible Python code, and reusable notebooks.
What should I open next?
Natural next steps from here are Emergency Rollback Procedure, Incident Post-Mortem, and Incident Response Chain.