Teams running AI in production

AI in production already? Know your gaps before they bite.

Your eval suite is partial. Your HITL gates are uneven. Your observability has holes. We map them in five days, hand you a 90-day fix plan, and stay on if you want us to execute it.

Book audit

See what you get

Price: £5K–£10K · fixed price
Duration: 5–10 days
Access: Read-only access

Isometric illustration: a multi-agent stack with red warning indicators, viewed through an audit lens that highlights gaps in the pipeline.

The problem

Evals and pray doesn't work in prod.

r/AI_Agents thread, 1,000+ upvotes: "evals and pray" was the most upvoted comment of April 2026. The reason it landed is everyone running agents in production already knows the feeling — write a few unit tests, ship, hope.

Two reference incidents. 11x.ai lost roughly 70% of clients in 90 days when silent agent failures compounded into broken outcomes. Sierra grew from £0 to $150M ARR in the same window by selling outcomes with HITL guardrails as the moat — same agents, opposite trust posture.

If you've shipped on CrewAI, LangGraph, AutoGen, or anything custom, your gaps fit a pattern. We've seen it. The audit names it.

Isometric illustration: an unmonitored AI pipeline running at night, with no eval gates between the agents.

The deliverable

What you get.

A 30-page diagnostic with six gap categories, prioritised, plus a 90-day remediation plan you can execute yourself or hand back to us.

Eval gap analysis

Promptfoo / DeepEval / Ragas baselines mapped against your existing harness — what's measured, what's missing, what false-passes.

HITL gap analysis

Where approval gates are missing, where they're rubber-stamped, where they should exist before customer-facing actions.

Observability gap

LangSmith / Logfire / Helicone tracing coverage — which agent steps are visible, which are dark, which fail silently.

Cost-runaway exposure

Per-agent token budget vs realised spend, top-10 unbounded loops, and the cost ceiling under your current concurrency.

Architectural debt log

Where the system breaks under load, where coupling will block remediation, what to refactor first.

30-page diagnostic + 90-day plan

Findings, prioritised. A 90-day remediation plan you can execute in-house — and an optional critical-fix sprint quote if you'd rather we ship it.

Remediation

And then we fix it.

The 90-day phasing is built so each milestone ships independently. Stop after Day 30 and you still have a working eval suite — no half-finished refactor in the way.

Day 0–30

Eval suite + observability backfill

Stand up Promptfoo / DeepEval / Ragas suites against the top-5 critical paths
Backfill LangSmith / Logfire / Helicone traces across every agent step
Wire pass/fail gates into CI so regressions block deploy, not customers

Day 31–60

HITL gates installed

Approval checkpoints in front of every irreversible action — refunds, deletes, sends, prod writes
Dual-key approval for high-blast-radius operations
Audit log so every human override is traceable to a person and a reason

Day 61–90

Cost dashboards + on-call runbook

Per-agent and per-tenant cost dashboards with hard caps and soft alerts
On-call runbook for the top-10 incident classes — page, triage, rollback, postmortem
Quarterly architecture review embedded so the audit doesn't quietly rot

Isometric illustration: the same multi-agent stack now fortified with eval gates, HITL checkpoints, and observability traces.

Reliability audit

£5K–£10K

Fixed price · per engagement

Duration: 5–10 days
Founder hours: 6 hrs
Margin: 92–94%

5–10 day on-site or remote engagement
Includes audit + 30-page report + 90-day remediation plan
30-minute handoff call with your engineering lead
Optional follow-on critical-fix sprint quote

Book audit

FAQ

Practical questions.

No. Read-only access to source plus prod log samples is enough for the diagnostic. Write access is only required if you opt into the follow-on critical-fix sprint, and even then it's scoped to a feature branch you review before merge.

Five days from now

You'll know exactly where the gaps are.

No fluff, no consultancy theatre, no slides about "AI maturity". A diagnostic document, a fix plan, and a 30-minute call. Then you decide.

Book audit

Meet the workforce