The scenario
An on-call engineer gets paged at 2am. A data pipeline has failed. Normally, they'd open a terminal, check logs, identify the broken step, re-run it with corrected parameters, monitor the recovery, update the incident ticket, and notify stakeholders.
With an AI-powered ops agent: the alert triggers an agent that reads the logs, identifies the root cause, attempts standard remediation, monitors recovery, updates the ticket automatically, and pages a human only if the automated fix fails or if the situation escalates.
This is ops workflow automation: agents taking over the toil while keeping humans accountable for the decisions that matter.
The spectrum of automation
Ops automation exists on a spectrum from fully manual (humans do everything) to fully autonomous:
Manual → Assisted → Semi-autonomous → Autonomous

- Manual: human reads logs, acts
- Assisted: AI suggests the next action
- Semi-autonomous: AI acts, human approves key steps
- Autonomous: AI acts end-to-end, notifies humans on completion
Most production deployments today sit at semi-autonomous: the agent handles the diagnosis and first-line response, but a human approves anything that touches production data or triggers a customer-facing change.
Architecture: agent loop + tools
The agent loop for ops automation:
Trigger (alert / schedule / webhook)
  ↓
[LLM: Understand context] → reads alert data, selects relevant tools
  ↓
[LLM: Diagnose] → queries logs, metrics, trace data
  ↓
[LLM: Decide] → match to known failure patterns, pick remediation
  ↓
[LLM: Act] → calls remediation tool (restart service / re-run job / scale resource)
  ↓
[Monitor] → polls health checks, reads new logs
  ↓
[LLM: Evaluate] → is the issue resolved? escalate or close?
  ↓
[Report] → updates ticket, notifies stakeholders
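A minimal sketch of this loop in Python. The `llm` client and the tool functions (`query_logs`, `restart_service`, and so on) are hypothetical placeholders for whatever model API, observability backend, and orchestration you actually use; this is an illustration of the shape of the loop, not a drop-in implementation.

```python
# Minimal ops-agent loop sketch. The `llm` client and every entry in `tools`
# are hypothetical placeholders for real integrations (Datadog, Kubernetes,
# JIRA, an LLM SDK, ...).
import time

def handle_alert(alert: dict, llm, tools: dict, max_wait_s: int = 900) -> str:
    # 1. Understand context: let the model pick which diagnostics are relevant.
    plan = llm.complete(f"Alert: {alert}. Which diagnostics should we run?")

    # 2. Diagnose: gather the logs / metrics / traces named in the plan.
    evidence = {name: tools[name](alert) for name in plan.tool_names}

    # 3. Decide: match against known failure patterns, pick a remediation.
    decision = llm.complete(f"Evidence: {evidence}. Pick a remediation or ESCALATE.")
    if decision.action == "ESCALATE":
        tools["notify"](f"Needs a human: {alert['title']}")
        return "escalated"

    # 4. Act: run the chosen remediation (restart / re-run / scale).
    tools[decision.action](**decision.params)

    # 5. Monitor: poll health checks until resolved or timeout.
    deadline = time.time() + max_wait_s
    while time.time() < deadline:
        if tools["health_check"](alert["service"]) == "healthy":
            # 6. Report: close the loop in ticketing and chat.
            tools["update_ticket"](alert["ticket_id"], "resolved", decision.summary)
            tools["notify"](f"Resolved automatically: {alert['title']}")
            return "resolved"
        time.sleep(30)

    tools["notify"](f"Remediation did not stick, paging on-call: {alert['title']}")
    return "escalated"
```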
Key tools for ops agents:
| Tool | What it does |
|---|---|
| Log query | Search structured logs (Elasticsearch, Datadog, CloudWatch) |
| Metrics query | Fetch time-series data (Prometheus, Grafana, Datadog) |
| Trace lookup | Inspect distributed traces (Jaeger, Zipkin, Datadog APM) |
| Service restart | Restart a containerized service (Kubernetes, ECS) |
| Job re-run | Re-trigger a failed pipeline step |
| Ticket update | Write to JIRA, PagerDuty, Linear |
| Notification | Send Slack message, email, SMS |
| Runbook lookup | Retrieve the relevant runbook from the knowledge base |
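For the model to call these tools, each one needs a declared schema. A sketch in the generic JSON-schema style most LLM tool-calling APIs accept; the names and parameters here are illustrative, not any vendor's exact format.

```python
# Illustrative tool schemas in the JSON-schema style most LLM tool-calling
# APIs accept. Names and parameters are examples, not a vendor's exact spec.
OPS_TOOLS = [
    {
        "name": "query_logs",
        "description": "Search structured logs for a service within a time window.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "query": {"type": "string"},
                "since_minutes": {"type": "integer", "default": 60},
            },
            "required": ["service", "query"],
        },
    },
    {
        "name": "restart_service",
        "description": "Restart a containerized service. Needs approval if outside the allowed scope.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "namespace": {"type": "string"},
            },
            "required": ["service"],
        },
    },
]
```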
Multi-agent for complex incidents
Simple incidents (single service down, obvious error) can be handled by a single agent loop. Complex incidents require a multi-agent setup:
- Orchestrator agent: understands the high-level incident, decomposes it into parallel investigation threads
- Diagnostic agents: each investigates one subsystem (database, API layer, network, external dependencies)
- Remediation agent: aggregates findings, executes the fix
- Communication agent: writes the incident summary, updates the status page
This is particularly valuable for cascading failures where multiple systems are affected simultaneously.
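A rough sketch of the orchestrator fan-out, assuming hypothetical `investigate` and `remediate` helpers that each wrap a single-agent loop:

```python
# Orchestrator fan-out sketch: one diagnostic agent per subsystem, run in
# parallel, findings aggregated for the remediation agent. `investigate` and
# `remediate` are hypothetical wrappers around single-agent loops.
import asyncio

SUBSYSTEMS = ["database", "api_layer", "network", "external_deps"]

async def investigate(subsystem: str, incident: dict) -> dict:
    ...  # run one diagnostic agent scoped to this subsystem

async def remediate(incident: dict, findings: list[dict]) -> dict:
    ...  # remediation agent: aggregate findings, execute the fix

async def handle_complex_incident(incident: dict) -> dict:
    findings = await asyncio.gather(
        *(investigate(s, incident) for s in SUBSYSTEMS)
    )
    # The remediation agent sees all findings at once, so a cascading failure
    # can be fixed in dependency order instead of subsystem by subsystem.
    return await remediate(incident, list(findings))
```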
The runbook as structured memory
The most effective ops agents are grounded in runbooks: the organization's accumulated operational knowledge.
A runbook describes: the symptom, the likely root causes, and the step-by-step remediation. When the agent encounters an alert, it first retrieves the relevant runbook (via RAG over the runbook corpus), then executes the steps, adapting them to the specific instance.
This bridges the gap between AI capability and domain-specific operational knowledge. The agent doesn't need to infer how to restart a database replica from first principles; the runbook tells it exactly what to do and what to check afterward.
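A sketch of the retrieval step, assuming the runbook corpus has already been chunked and embedded; `embed` and `vector_store` stand in for whatever embedding model and index you use.

```python
# Runbook retrieval sketch: embed the alert, pull the closest runbooks, hand
# them to the agent as grounding. `embed` and `vector_store` are placeholders
# for the embedding model and vector index backing your runbook corpus.
def retrieve_runbooks(alert: dict, embed, vector_store, k: int = 3) -> list[dict]:
    query = f"{alert['title']} {alert.get('error_message', '')}"
    hits = vector_store.search(embed(query), top_k=k)
    # Only keep confident matches: a weak match is worse than none, because
    # the agent will happily execute the wrong runbook.
    return [h.document for h in hits if h.score > 0.75]
```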
Trust, control, and blast radius
Ops automation has higher stakes than coding assistance. A bad command in production can:
- Drop database connections for thousands of users
- Wipe data if a cleanup script has a bug
- Trigger a billing or compliance violation
- Start a cascading failure instead of stopping one
Key control mechanisms:
Blast radius scoping. Define upfront which actions the agent is allowed to take: Can it restart services? Can it scale up? Can it delete anything? Anything outside this scope requires human approval.
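One way to encode that scope is an explicit allowlist the agent consults before every action. A sketch, with illustrative action names and a hypothetical `request_approval` hook:

```python
# Blast-radius sketch: an explicit allowlist of actions the agent may take on
# its own. Anything else is routed to a human. Action names are illustrative.
ALLOWED_ACTIONS = {
    "restart_service",
    "rerun_job",
    "scale_up",
    # deliberately absent: anything that deletes data, fails over a region,
    # or changes a schema
}

def authorize(action: str, request_approval) -> bool:
    if action in ALLOWED_ACTIONS:
        return True
    return request_approval(action)  # blocks until a human approves or rejects
```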
Dry-run mode. Before executing any remediation, the agent outputs what it would do. The human approves, then the agent executes.
Rollback checkpoints. Before every state-changing action, take a snapshot or note the reversal command. If the action makes things worse, the agent can undo it.
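The checkpoint pattern in miniature, with hypothetical `capture_reversal` and `apply` helpers standing in for real config or infrastructure operations:

```python
# Rollback checkpoint sketch: capture the reversal before the change, so a bad
# remediation can be undone. `capture_reversal` and `apply` are hypothetical
# helpers around your real config/infra operations.
def act_with_checkpoint(action: dict, capture_reversal, apply) -> dict:
    reversal = capture_reversal(action)  # e.g. the config value being replaced
    try:
        apply(action)
    except Exception:
        apply(reversal)                  # best-effort undo if the action itself fails
        raise
    return reversal                      # kept so a later evaluation step can still undo
```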
Rate limits on actions. Prevent runaway agents from hammering a service with restart loops. Limit: max N restarts per M minutes.
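A minimal sliding-window limiter for "max N restarts per M minutes"; this sketch keeps state in process memory, whereas a real deployment would use shared state so parallel agents respect the same limit.

```python
# Sliding-window rate limit sketch: at most `max_actions` restarts of a given
# service within `window_s` seconds. In-memory only for illustration.
import time
from collections import defaultdict, deque

_history: dict[str, deque] = defaultdict(deque)

def allow_action(key: str, max_actions: int = 3, window_s: int = 900) -> bool:
    now = time.time()
    recent = _history[key]
    while recent and now - recent[0] > window_s:
        recent.popleft()
    if len(recent) >= max_actions:
        return False  # over the limit: refuse and let the agent escalate
    recent.append(now)
    return True
```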
Escalation thresholds. Define conditions that immediately hand off to a human: customer impact detected, SLA breach imminent, action not in the playbook, or confidence below threshold.
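These conditions work best as hard predicates evaluated outside the model, so the hand-off never depends on the LLM choosing to escalate. A sketch with illustrative thresholds, reusing the `ALLOWED_ACTIONS` allowlist from the blast-radius sketch above:

```python
# Escalation sketch: hard predicates evaluated outside the LLM, so the hand-off
# to a human never depends on the model deciding to escalate. Threshold values
# are illustrative.
def must_escalate(incident: dict) -> bool:
    return any([
        incident.get("customer_impact", False),
        incident.get("minutes_to_sla_breach", float("inf")) < 15,
        incident.get("proposed_action") not in ALLOWED_ACTIONS,  # not in the playbook
        incident.get("confidence", 1.0) < 0.7,
    ])
```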
Practical example: failed ETL job
01:47 Alert fires: "ETL job 'daily_revenue_aggregation' failed (exit code 1)"
01:47 Agent reads: job logs → "Error: connection timeout to source DB replica-2"
01:47 Agent queries: replica-2 health → "replica-2 lag: 2.3 hours (threshold: 30 min)"
01:48 Agent retrieves runbook: "ETL source timeout"
      → runbook says: switch source to replica-1, re-run job
01:48 Agent checks: replica-1 health → "replica-1 lag: 4 minutes → OK"
01:49 Agent acts: updates ETL config (source → replica-1)
      → DRY RUN output shown in Slack: "Will update config and re-trigger job"
      → Auto-approved (within blast radius policy)
01:49 Agent acts: re-triggers ETL job
02:03 Agent monitors: job completes successfully
02:03 Agent reports:
- Updates JIRA ticket: "Resolved: source switched to replica-1 (replica-2 lagging)"
- Sends Slack: "ETL job recovered at 02:03. Replica-2 lag alert still open; DBAs notified."
- Opens separate ticket for replica-2 lag investigation
Total time: 16 minutes, zero human intervention, two tickets created (resolution + follow-up).
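For concreteness, the runbook the agent retrieved at 01:48 might look something like this as structured data. This exact entry is a made-up illustration, not taken from a real system.

```python
# Illustrative runbook entry, as the structured document the agent retrieves
# at 01:48. Entirely made up for this example.
RUNBOOK_ETL_SOURCE_TIMEOUT = {
    "id": "etl-source-timeout",
    "symptom": "ETL job fails with 'connection timeout to source DB replica'",
    "likely_causes": ["replica lag above threshold", "replica down", "network partition"],
    "steps": [
        "Check lag on the failing replica and on the alternates.",
        "If an alternate replica is healthy (lag < 30 min), switch the ETL source to it.",
        "Re-trigger the failed job and monitor to completion.",
        "Open a follow-up ticket for the lagging replica; do not close the lag alert.",
    ],
    "verify": "Job completes; downstream tables updated for the expected date.",
    "allowed_actions": ["update_etl_config", "rerun_job"],  # blast radius for this runbook
}
```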
What agents can't (yet) handle well
Novel failure modes. Agents work best on patterns they've seen before (via runbooks or training). An entirely new failure type, with no runbook and no similar precedent, requires human investigation.
Cross-team coordination. "This incident requires a change from the platform team and approval from security." Agents can draft the request, but the coordination itself requires humans.
Judgment calls under uncertainty. "Should we fail over to the DR region or wait to see if the primary recovers?" This involves business impact assessment that goes beyond log reading.
Explaining to customers. Status page updates and customer communications require nuanced judgment about what to disclose, when, and how.
Typical stack
| Layer | Examples |
|---|---|
| Trigger | PagerDuty, OpsGenie, custom webhook |
| Log/metrics | Datadog, Grafana + Prometheus, ELK, CloudWatch |
| Agent framework | LangGraph, CrewAI, custom |
| LLM | Claude 3.5/4 (reasoning), GPT-4o |
| Runbook store | Confluence + RAG, Notion + RAG, GitOps runbooks |
| Ticketing | JIRA, Linear, PagerDuty incidents |
| Notification | Slack, PagerDuty, email |