Agentic AI

SentientOps Control Center

Cut incident response from 42 minutes to under 7 minutes.

Next.js · FastAPI · Temporal · Vertex AI · LangGraph · Prometheus

The challenge

A global robotics team managing autonomous devices across multi-cloud fleets had a critical operational gap: incidents were detected by engineers already worn down by alert fatigue, triaged manually, and escalated through Slack threads. Mean time to acknowledge (MTTA) averaged 42 minutes. The team had tried conventional AIOps tools, but none could synthesise telemetry signals from multiple cloud providers into a coherent incident narrative. They needed a unified brain — one that could monitor, reason, and act across heterogeneous infrastructure without creating new compliance risks.

Architecture

The core of SentientOps is a supervisor-agent architecture built on LangGraph with Temporal for durable workflow execution. The supervisor agent ingests telemetry from Prometheus, CloudWatch, and Azure Monitor via a unified adapter layer, then routes incidents to specialised sub-agents: a root-cause analysis agent, a remediation agent, and an escalation agent. Each agent operates within a constrained tool registry — it can read metrics, query logs, run playbook steps, and create Jira tickets, but cannot make infrastructure changes without a human approval step. Vertex AI provides the foundation models; a fine-tuned routing model classifies incident severity and selects the appropriate agent path. Temporal handles the durability layer — if any agent step fails or times out, the workflow resumes from the last successful checkpoint rather than restarting from scratch.
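The severity routing and the constrained tool registry with a human approval gate can be sketched in plain Python. Everything below is illustrative: the `Severity` levels, tool names, and agent identifiers are assumptions, and the production system runs this logic inside LangGraph nodes backed by Temporal activities, not as bare functions.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"

# Hypothetical constrained tool registry: read-only tools run freely,
# mutating tools require an explicit human approval flag.
READ_ONLY_TOOLS = {"read_metrics", "query_logs"}
GATED_TOOLS = {"run_playbook_step", "create_jira_ticket"}

@dataclass
class Incident:
    id: str
    severity: Severity
    summary: str

def route(incident: Incident) -> str:
    """Stand-in for the fine-tuned routing model: maps severity to an agent path."""
    if incident.severity is Severity.CRITICAL:
        return "escalation_agent"
    if incident.severity is Severity.HIGH:
        return "root_cause_agent"
    return "remediation_agent"

def invoke_tool(agent: str, tool: str, approved: bool = False) -> str:
    """Enforce the registry: gated tools block until a human approves."""
    if tool in READ_ONLY_TOOLS:
        return f"{agent} ran {tool}"
    if tool in GATED_TOOLS:
        return f"{agent} ran {tool} (approved)" if approved else "pending_approval"
    raise PermissionError(f"{tool} is not in the registry for {agent}")
```

The key design point the sketch captures is that the gate lives in the tool layer, not in the agent's prompt: an agent cannot talk its way into a mutating action, because the registry refuses to execute it without the approval flag.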

How we shipped it

We started with a two-week discovery sprint that mapped every incident type, their typical resolution paths, and the failure modes most likely to cause customer impact. The highest-risk path — cascading memory pressure across GPU nodes — became the first pilot. We built and tested the root-cause analysis agent against 90 days of historical incident data before going live. The first production deployment ran in shadow mode for two weeks, comparing agent-generated incident summaries against engineer-written ones. Inter-rater agreement reached 87% before we flipped the switch to active mode. The remediation agent went live four weeks later, initially limited to low-risk playbook actions (cache flushes, service restarts). Human approval gates were removed incrementally as confidence thresholds were met.
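The shadow-mode comparison amounts to computing an agreement rate between agent-generated and engineer-written summaries. A minimal sketch, assuming a pluggable `judge` callable (in practice this was a human review step, not an automated string check):

```python
def shadow_agreement(agent_summaries, engineer_summaries, judge):
    """Fraction of incidents where the agent's summary matched the engineer's.

    `judge(a, e)` is a hypothetical callable returning True when the two
    summaries describe the same root cause.
    """
    matches = sum(judge(a, e) for a, e in zip(agent_summaries, engineer_summaries))
    return matches / len(engineer_summaries)

# Toy usage with exact string match standing in for the judge:
rate = shadow_agreement(
    ["oom on node-3", "cache stampede", "disk full"],
    ["oom on node-3", "network partition", "disk full"],
    judge=lambda a, e: a == e,
)
```

In this toy run two of three summaries agree; the production threshold for leaving shadow mode was the 87% figure quoted above.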

Results

After 60 days in production: mean time to acknowledge dropped from 42 minutes to 6.8 minutes. Mean time to resolve (MTTR) dropped 61%. The system handled 340 incidents in its first 60 days — 280 fully autonomously, 60 with human escalation. Usage analytics surfaced three revenue-impacting device behaviour patterns that the team had not previously measured, leading to a new paid analytics tier.
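The headline figures are internally consistent and easy to sanity-check; the numbers below are taken directly from the paragraph above:

```python
# 60-day production window: incident counts and MTTA figures.
total, autonomous, escalated = 340, 280, 60
assert autonomous + escalated == total

autonomy_rate = autonomous / total               # ~0.82 handled end-to-end
mtta_before, mtta_after = 42.0, 6.8              # minutes
mtta_reduction = 1 - mtta_after / mtta_before    # ~0.84 improvement
```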

What we would do differently

The human approval gate architecture was the right call — it let us build trust incrementally rather than asking the team to trust the system all at once. We underestimated the importance of incident narrative quality: agents that could explain *why* they took an action were trusted far more quickly than those that simply took the action. The fine-tuned routing model was worth the extra three weeks of data collection — the off-the-shelf model misclassified severity about 22% of the time, which would have created an unacceptable false-positive rate.

Written by Mudassir Khan

Agentic AI Consultant & AI Systems Architect · CEO of Cube A Cloud · Islamabad, Pakistan


Want to build something like this?

Book a 30-minute strategy call and let us map out what is possible for your situation.

Book a strategy call