The scenario
An on-call engineer gets paged at 2am. A data pipeline has failed. Normally, they'd open a terminal, check logs, identify the broken step, re-run it with corrected parameters, monitor the recovery, update the incident ticket, and notify stakeholders.
With an AI-powered ops agent: the alert triggers an agent that reads the logs, identifies the root cause, attempts standard remediation, monitors recovery, updates the ticket automatically, and pages a human only if the automated fix fails or if the situation escalates.
This is ops workflow automation: agents taking over the toil while keeping humans accountable for the decisions that matter.
The spectrum of automation
Ops automation exists on a spectrum from fully manual (humans do everything) to fully autonomous:
Manual → Assisted → Semi-autonomous → Autonomous

- Manual: human reads logs, acts
- Assisted: AI suggests the next action
- Semi-autonomous: AI acts, human approves key steps
- Autonomous: AI acts end-to-end, notifies humans on completion
Most production deployments today sit at semi-autonomous: the agent handles the diagnosis and first-line response, but a human approves anything that touches production data or triggers a customer-facing change.
Architecture: agent loop + tools
The agent loop for ops automation:
Trigger (alert / schedule / webhook)
  ↓
[LLM: Understand context] → reads alert data, selects relevant tools
  ↓
[LLM: Diagnose] → queries logs, metrics, trace data
  ↓
[LLM: Decide] → match to known failure patterns, pick remediation
  ↓
[LLM: Act] → calls remediation tool (restart service / re-run job / scale resource)
  ↓
[Monitor] → polls health checks, reads new logs
  ↓
[LLM: Evaluate] → is the issue resolved? escalate or close?
  ↓
[Report] → updates ticket, notifies stakeholders
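A minimal sketch of this loop in Python. The `llm` client and the tool functions (`query_logs`, `restart_service`, and so on) are hypothetical placeholders for whatever model API, observability backend, and orchestration you actually use; this is an illustration of the shape of the loop, not a drop-in implementation.

```python
# Minimal ops-agent loop sketch. The `llm` client and every entry in `tools`
# are hypothetical placeholders for real integrations (Datadog, Kubernetes,
# JIRA, an LLM SDK, ...).
import time

def handle_alert(alert: dict, llm, tools: dict, max_wait_s: int = 900) -> str:
    # 1. Understand context: let the model pick which diagnostics are relevant.
    plan = llm.complete(f"Alert: {alert}. Which diagnostics should we run?")

    # 2. Diagnose: gather the logs / metrics / traces named in the plan.
    evidence = {name: tools[name](alert) for name in plan.tool_names}

    # 3. Decide: match against known failure patterns, pick a remediation.
    decision = llm.complete(f"Evidence: {evidence}. Pick a remediation or ESCALATE.")
    if decision.action == "ESCALATE":
        tools["notify"](f"Needs a human: {alert['title']}")
        return "escalated"

    # 4. Act: run the chosen remediation (restart / re-run / scale).
    tools[decision.action](**decision.params)

    # 5. Monitor: poll health checks until resolved or timeout.
    deadline = time.time() + max_wait_s
    while time.time() < deadline:
        if tools["health_check"](alert["service"]) == "healthy":
            # 6. Report: close the loop in ticketing and chat.
            tools["update_ticket"](alert["ticket_id"], "resolved", decision.summary)
            tools["notify"](f"Resolved automatically: {alert['title']}")
            return "resolved"
        time.sleep(30)

    tools["notify"](f"Remediation did not stick, paging on-call: {alert['title']}")
    return "escalated"
```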
Key tools for ops agents:
| Tool | What it does |
|---|---|
| Log query | Search structured logs (Elasticsearch, Datadog, CloudWatch) |
| Metrics query | Fetch time-series data (Prometheus, Grafana, Datadog) |
| Trace lookup | Inspect distributed traces (Jaeger, Zipkin, Datadog APM) |
| Service restart | Restart a containerized service (Kubernetes, ECS) |
| Job re-run | Re-trigger a failed pipeline step |
| Ticket update | Write to JIRA, PagerDuty, Linear |
| Notification | Send Slack message, email, SMS |
| Runbook lookup | Retrieve the relevant runbook from the knowledge base |
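For the model to call these tools, each one needs a declared schema. A sketch in the generic JSON-schema style most LLM tool-calling APIs accept; the names and parameters here are illustrative, not any vendor's exact format.

```python
# Illustrative tool schemas in the JSON-schema style most LLM tool-calling
# APIs accept. Names and parameters are examples, not a vendor's exact spec.
OPS_TOOLS = [
    {
        "name": "query_logs",
        "description": "Search structured logs for a service within a time window.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "query": {"type": "string"},
                "since_minutes": {"type": "integer", "default": 60},
            },
            "required": ["service", "query"],
        },
    },
    {
        "name": "restart_service",
        "description": "Restart a containerized service. Needs approval if outside the allowed scope.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "namespace": {"type": "string"},
            },
            "required": ["service"],
        },
    },
]
```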
Multi-agent for complex incidents
Simple incidents (single service down, obvious error) can be handled by a single agent loop. Complex incidents require a multi-agent setup:
- Orchestrator agent: understands the high-level incident, decomposes it into parallel investigation threads
- Diagnostic agents: each investigates one subsystem (database, API layer, network, external dependencies)
- Remediation agent: aggregates findings, executes the fix
- Communication agent: writes the incident summary, updates the status page
This is particularly valuable for cascading failures where multiple systems are affected simultaneously.
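A rough sketch of the orchestrator fan-out, assuming hypothetical `investigate` and `remediate` helpers that each wrap a single-agent loop:

```python
# Orchestrator fan-out sketch: one diagnostic agent per subsystem, run in
# parallel, findings aggregated for the remediation agent. `investigate` and
# `remediate` are hypothetical wrappers around single-agent loops.
import asyncio

SUBSYSTEMS = ["database", "api_layer", "network", "external_deps"]

async def investigate(subsystem: str, incident: dict) -> dict:
    ...  # run one diagnostic agent scoped to this subsystem

async def remediate(incident: dict, findings: list[dict]) -> dict:
    ...  # remediation agent: aggregate findings, execute the fix

async def handle_complex_incident(incident: dict) -> dict:
    findings = await asyncio.gather(
        *(investigate(s, incident) for s in SUBSYSTEMS)
    )
    # The remediation agent sees all findings at once, so a cascading failure
    # can be fixed in dependency order instead of subsystem by subsystem.
    return await remediate(incident, list(findings))
```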
The runbook as structured memory
The most effective ops agents are grounded in runbooks: the organization's accumulated operational knowledge.
A runbook describes: the symptom, the likely root causes, and the step-by-step remediation. When the agent encounters an alert, it first retrieves the relevant runbook (via RAG over the runbook corpus), then executes the steps, adapting them to the specific instance.
This bridges the gap between AI capability and domain-specific operational knowledge. The agent doesn't need to infer how to restart a database replica from first principles; the runbook tells it exactly what to do and what to check afterward.
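A sketch of the retrieval step, assuming the runbook corpus has already been chunked and embedded; `embed` and `vector_store` stand in for whatever embedding model and index you use.

```python
# Runbook retrieval sketch: embed the alert, pull the closest runbooks, hand
# them to the agent as grounding. `embed` and `vector_store` are placeholders
# for the embedding model and vector index backing your runbook corpus.
def retrieve_runbooks(alert: dict, embed, vector_store, k: int = 3) -> list[dict]:
    query = f"{alert['title']} {alert.get('error_message', '')}"
    hits = vector_store.search(embed(query), top_k=k)
    # Only keep confident matches: a weak match is worse than none, because
    # the agent will happily execute the wrong runbook.
    return [h.document for h in hits if h.score > 0.75]
```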
Trust, control, and blast radius
Ops automation has higher stakes than coding assistance. A bad command in production can:
- Drop database connections for thousands of users
- Wipe data if a cleanup script has a bug
- Trigger a billing or compliance violation
- Start a cascading failure instead of stopping one
Key control mechanisms:
Blast radius scoping. Define upfront which actions the agent is allowed to take: Can it restart services? Can it scale up? Can it delete anything? Anything outside this scope requires human approval.
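One way to encode that scope is an explicit allowlist the agent consults before every action. A sketch, with illustrative action names and a hypothetical `request_approval` hook:

```python
# Blast-radius sketch: an explicit allowlist of actions the agent may take on
# its own. Anything else is routed to a human. Action names are illustrative.
ALLOWED_ACTIONS = {
    "restart_service",
    "rerun_job",
    "scale_up",
    # deliberately absent: anything that deletes data, fails over a region,
    # or changes a schema
}

def authorize(action: str, request_approval) -> bool:
    if action in ALLOWED_ACTIONS:
        return True
    return request_approval(action)  # blocks until a human approves or rejects
```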
Dry-run mode. Before executing any remediation, the agent outputs what it would do. The human approves, then the agent executes.
Rollback checkpoints. Before every state-changing action, take a snapshot or note the reversal command. If the action makes things worse, the agent can undo it.
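The checkpoint pattern in miniature, with hypothetical `capture_reversal` and `apply` helpers standing in for real config or infrastructure operations:

```python
# Rollback checkpoint sketch: capture the reversal before the change, so a bad
# remediation can be undone. `capture_reversal` and `apply` are hypothetical
# helpers around your real config/infra operations.
def act_with_checkpoint(action: dict, capture_reversal, apply) -> dict:
    reversal = capture_reversal(action)  # e.g. the config value being replaced
    try:
        apply(action)
    except Exception:
        apply(reversal)                  # best-effort undo if the action itself fails
        raise
    return reversal                      # kept so a later evaluation step can still undo
```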
Rate limits on actions. Prevent runaway agents from hammering a service with restart loops. Limit: max N restarts per M minutes.
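A minimal sliding-window limiter for "max N restarts per M minutes"; this sketch keeps state in process memory, whereas a real deployment would use shared state so parallel agents respect the same limit.

```python
# Sliding-window rate limit sketch: at most `max_actions` restarts of a given
# service within `window_s` seconds. In-memory only for illustration.
import time
from collections import defaultdict, deque

_history: dict[str, deque] = defaultdict(deque)

def allow_action(key: str, max_actions: int = 3, window_s: int = 900) -> bool:
    now = time.time()
    recent = _history[key]
    while recent and now - recent[0] > window_s:
        recent.popleft()
    if len(recent) >= max_actions:
        return False  # over the limit: refuse and let the agent escalate
    recent.append(now)
    return True
```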
Escalation thresholds. Define conditions that immediately hand off to a human: customer impact detected, SLA breach imminent, action not in the playbook, or confidence below threshold.
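These conditions work best as hard predicates evaluated outside the model, so the hand-off never depends on the LLM choosing to escalate. A sketch with illustrative thresholds, reusing the `ALLOWED_ACTIONS` allowlist from the blast-radius sketch above:

```python
# Escalation sketch: hard predicates evaluated outside the LLM, so the hand-off
# to a human never depends on the model deciding to escalate. Threshold values
# are illustrative.
def must_escalate(incident: dict) -> bool:
    return any([
        incident.get("customer_impact", False),
        incident.get("minutes_to_sla_breach", float("inf")) < 15,
        incident.get("proposed_action") not in ALLOWED_ACTIONS,  # not in the playbook
        incident.get("confidence", 1.0) < 0.7,
    ])
```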
Practical example: failed ETL job
01:47 Alert fires: "ETL job 'daily_revenue_aggregation' failed (exit code 1)"
01:47 Agent reads: job logs → "Error: connection timeout to source DB replica-2"
01:47 Agent queries: replica-2 health → "replica-2 lag: 2.3 hours (threshold: 30 min)"
01:48 Agent retrieves runbook: "ETL source timeout"
      → runbook says: switch source to replica-1, re-run job
01:48 Agent checks: replica-1 health → "replica-1 lag: 4 minutes → OK"
01:49 Agent acts: updates ETL config (source → replica-1)
      → DRY RUN output shown in Slack: "Will update config and re-trigger job"
      → Auto-approved (within blast radius policy)
01:49 Agent acts: re-triggers ETL job
02:03 Agent monitors: job completes successfully
02:03 Agent reports:
- Updates JIRA ticket: "Resolved: source switched to replica-1 (replica-2 lagging)"
- Sends Slack: "ETL job recovered at 02:03. Replica-2 lag alert still open; DBAs notified."
- Opens separate ticket for replica-2 lag investigation
Total time: 16 minutes, zero human intervention, two tickets created (resolution + follow-up).
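For concreteness, the runbook the agent retrieved at 01:48 might look something like this as structured data. This exact entry is a made-up illustration, not taken from a real system.

```python
# Illustrative runbook entry, as the structured document the agent retrieves
# at 01:48. Entirely made up for this example.
RUNBOOK_ETL_SOURCE_TIMEOUT = {
    "id": "etl-source-timeout",
    "symptom": "ETL job fails with 'connection timeout to source DB replica'",
    "likely_causes": ["replica lag above threshold", "replica down", "network partition"],
    "steps": [
        "Check lag on the failing replica and on the alternates.",
        "If an alternate replica is healthy (lag < 30 min), switch the ETL source to it.",
        "Re-trigger the failed job and monitor to completion.",
        "Open a follow-up ticket for the lagging replica; do not close the lag alert.",
    ],
    "verify": "Job completes; downstream tables updated for the expected date.",
    "allowed_actions": ["update_etl_config", "rerun_job"],  # blast radius for this runbook
}
```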
What agents can't (yet) handle well
Novel failure modes. Agents work best on patterns they've seen before (via runbooks or training). An entirely new failure type, with no runbook and no similar precedent, requires human investigation.
Cross-team coordination. "This incident requires a change from the platform team and approval from security." Agents can draft the request, but the coordination itself requires humans.
Judgment calls under uncertainty. "Should we fail over to the DR region or wait to see if the primary recovers?" This involves business impact assessment that goes beyond log reading.
Explaining to customers. Status page updates and customer communications require nuanced judgment about what to disclose, when, and how.
Typical stack
| Layer | Examples |
|---|---|
| Trigger | PagerDuty, OpsGenie, custom webhook |
| Log/metrics | Datadog, Grafana + Prometheus, ELK, CloudWatch |
| Agent framework | LangGraph, CrewAI, custom |
| LLM | Claude 3.5/4 (reasoning), GPT-4o |
| Runbook store | Confluence + RAG, Notion + RAG, GitOps runbooks |
| Ticketing | JIRA, Linear, PagerDuty incidents |
| Notification | Slack, PagerDuty, email |