when on-call engineers need a step-by-step diagnostic guide
Triage Runbook AI Prompt
This prompt produces a practical triage runbook that any on-call engineer can execute quickly, even without deep knowledge of the model. It helps turn incident response from tribal knowledge into a repeatable first-response procedure.
Write a model incident triage runbook for on-call engineers who may not be deeply familiar with the specific model.
Model: {{model_name}}
Serving infrastructure: {{infrastructure}}
The runbook must be executable by any on-call engineer within 15 minutes of being paged.
1. Initial triage (first 5 minutes):
- Is this a model issue or an infrastructure issue?
Check 1: Are the Kubernetes pods healthy? (kubectl get pods -n {{namespace}})
Check 2: Is the serving endpoint returning any responses? (curl -i {{health_endpoint}})
Check 3: Are the upstream feature pipelines healthy? (link to pipeline dashboard)
Check 4: Was there a recent deployment? (check deployment log at {{deploy_log_link}})
2. Model-specific diagnostics (minutes 5-10):
- Check serving dashboard: error rate, latency, prediction distribution (link: {{dashboard_link}})
- Check feature drift dashboard: any features showing high PSI? (link: {{drift_dashboard_link}})
- Check the model version currently serving in production. Expected: {{expected_version}}
- Sample 10 recent predictions: do the inputs and outputs look sane? (query: {{sample_query}})
3. Common failure modes and immediate actions:
- High error rate → Check logs for error type. If OOM: restart pods. If model file missing: check object storage.
- High latency → Check GPU utilization. If GPU saturated: scale up pods. If CPU-bound: check for preprocessing bottleneck.
- Wrong predictions → Check model version. If unexpected version: trigger rollback. Check feature pipeline for data quality issues.
- All predictions same class → Model likely receiving all-null or all-default features. Check feature pipeline.
4. Rollback procedure:
- Command: {{rollback_command}}
- Expected output: {{expected_rollback_output}}
- Verification: wait 2 minutes, then confirm error rate has returned to baseline
5. Escalation:
- If unresolved in 30 minutes: page {{escalation_contact}}
- If data pipeline issue: page {{data_team_contact}}
- If model quality issue: page {{ml_team_contact}}
Return: complete triage runbook formatted as a step-by-step guide.
When to use this prompt
when model, infrastructure, and data-pipeline failures must be separated quickly
when rollback instructions and escalation paths need to be explicit
when first-response actions should fit into a 15-minute triage window
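The separation of model, infrastructure, and data-pipeline failures can be sketched as a small decision function over the four initial checks in the prompt. This is only an illustration: the `TriageChecks` class, its field names, the `classify_incident` function, and the order in which the checks are evaluated are assumptions, not part of the prompt or of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class TriageChecks:
    """Outcomes of the four initial checks (hypothetical field names)."""
    pods_healthy: bool          # Check 1: Kubernetes pods
    endpoint_responding: bool   # Check 2: serving endpoint
    pipelines_healthy: bool     # Check 3: upstream feature pipelines
    recent_deployment: bool     # Check 4: deployment log


def classify_incident(checks: TriageChecks) -> str:
    """Return a first guess at where the fault lies.

    Assumed ordering: infrastructure first, then data pipeline,
    then a recent deployment, and only then the model itself.
    """
    if not checks.pods_healthy or not checks.endpoint_responding:
        return "infrastructure"
    if not checks.pipelines_healthy:
        return "data-pipeline"
    if checks.recent_deployment:
        return "deployment"
    return "model"


if __name__ == "__main__":
    result = classify_incident(
        TriageChecks(
            pods_healthy=True,
            endpoint_responding=True,
            pipelines_healthy=False,
            recent_deployment=False,
        )
    )
    print(result)
```

The result is a starting hypothesis for the first five minutes, not a diagnosis; the model-specific diagnostics in the prompt still apply whichever branch is taken.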
What the AI should return
A step-by-step incident triage runbook with checks, diagnostics, immediate actions, rollback steps, and escalation guidance.
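The diagnostics step in the prompt asks whether any feature shows high PSI (population stability index). For readers unfamiliar with the metric, here is a minimal sketch of the usual binned formula, the sum over bins of (actual − expected) × ln(actual / expected). The function name `psi`, the `eps` floor for empty bins, and the example counts are assumptions for illustration; a real drift dashboard may bin and smooth differently.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population stability index between two binned distributions.

    `expected` and `actual` hold per-bin counts (or proportions) over the
    same bins, e.g. training data versus recent serving traffic.
    """
    if len(expected) != len(actual) or len(expected) == 0:
        raise ValueError("expected and actual must be non-empty and use the same bins")
    e_total = sum(expected)
    a_total = sum(actual)
    if e_total <= 0 or a_total <= 0:
        raise ValueError("each distribution must have a positive total")

    score = 0.0
    for e, a in zip(expected, actual):
        # Floor the proportions so an empty bin does not produce log(0).
        e_p = max(e / e_total, eps)
        a_p = max(a / a_total, eps)
        score += (a_p - e_p) * math.log(a_p / e_p)
    return score


if __name__ == "__main__":
    baseline = [50, 50]   # hypothetical training-time bin counts
    recent = [90, 10]     # hypothetical serving-time bin counts
    print(round(psi(baseline, recent), 3))
```

A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but the threshold that counts as "high" should come from the team that owns the drift dashboard.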
How to use this prompt
Open your data context
Load your dataset, notebook, or working environment so the AI can operate on the actual project context.
Copy the prompt text
Use the copy button above and paste the prompt into the AI assistant or prompt input area.
Review the output critically
Check whether the result matches your data, assumptions, and desired format before moving on.
Chain into the next prompt
Once you have the first result, continue deeper with related prompts in Production Incident Response.
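Before pasting, the `{{placeholder}}` fields in the prompt need real values. If you fill them programmatically, something like the sketch below may help catch fields left blank. The `fill_prompt` helper and the example values are hypothetical; only the `{{name}}` placeholder syntax comes from the prompt itself.

```python
import re

# Matches {{name}} placeholders as they appear in the prompt text.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_prompt(template: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Substitute placeholders and report any names left unfilled."""
    missing: list[str] = []

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in values:
            return values[name]
        if name not in missing:
            missing.append(name)
        return match.group(0)  # leave unknown placeholders visible in the text

    return PLACEHOLDER.sub(substitute, template), missing


if __name__ == "__main__":
    template = "Model: {{model_name}}\nIf unresolved in 30 minutes: page {{escalation_contact}}"
    text, missing = fill_prompt(template, {"model_name": "example-model"})
    print(text)
    print("Unfilled:", missing)
```

Leaving unfilled placeholders in place, rather than replacing them with empty strings, makes it harder for a runbook with a blank rollback command or escalation contact to slip through unnoticed.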
Frequently asked questions
What does the Triage Runbook prompt do?
It gives you a structured starting point for production incident response in MLOps work, so you do not have to start from a blank page.
Who is this prompt for?
It is designed for MLOps workflows and marked as intermediate, so it works well as a guided starting point for that level of experience.
What type of prompt is this?
Triage Runbook is a single prompt. You can copy it as-is, adapt it, or use it as one step inside a larger workflow.
Can I use this outside MLJAR Studio?
Yes. The prompt text works in other AI tools too, but MLJAR Studio is the best fit when you want local execution, visible Python code, and reusable notebooks.
What should I open next?
Natural next steps from here are Emergency Rollback Procedure, Incident Classification Matrix, and Incident Post-Mortem.