Teams running AI in production
AI in production already? Know your gaps before they bite.
Your eval suite is partial. Your HITL gates are uneven. Your observability has holes. We map them in five days, hand you a 90-day fix plan, and stay on if you want us to execute it.
- Price
- £5K–£10K · fixed price
- Duration
- 5–10 days
- Access
- Read-only access

The problem
Evals and pray doesn't work in prod.
r/AI_Agents thread, 1,000+ upvotes: "evals and pray" was the most upvoted comment of April 2026. The reason it landed is everyone running agents in production already knows the feeling — write a few unit tests, ship, hope.
Two reference incidents. 11x.ai lost roughly 70% of clients in 90 days when silent agent failures compounded into broken outcomes. Sierra grew from £0 to $150M ARR in the same window by selling outcomes with HITL guardrails as the moat — same agents, opposite trust posture.
If you've shipped on CrewAI, LangGraph, AutoGen, or anything custom, your gaps fit a pattern. We've seen it. The audit names it.

The deliverable
What you get.
A 30-page diagnostic with six gap categories, prioritised, plus a 90-day remediation plan you can execute yourself or hand back to us.
Eval gap analysis
Promptfoo / DeepEval / Ragas baselines mapped against your existing harness — what's measured, what's missing, what false-passes.
HITL gap analysis
Where approval gates are missing, where they're rubber-stamped, where they should exist before customer-facing actions.
Observability gap
LangSmith / Logfire / Helicone tracing coverage — which agent steps are visible, which are dark, which fail silently.
Cost-runaway exposure
Per-agent token budget vs realised spend, top-10 unbounded loops, and the cost ceiling under your current concurrency.
Architectural debt log
Where the system breaks under load, where coupling will block remediation, what to refactor first.
30-page diagnostic + 90-day plan
Findings, prioritised. A 90-day remediation plan you can execute in-house — and an optional critical-fix sprint quote if you'd rather we ship it.

Remediation
And then we fix it.
The 90-day phasing is built so each milestone ships independently. Stop after Day 30 and you still have a working eval suite — no half-finished refactor in the way.
Day 0–30
Eval suite + observability backfill
- Stand up Promptfoo / DeepEval / Ragas suites against the top-5 critical paths
- Backfill LangSmith / Logfire / Helicone traces across every agent step
- Wire pass/fail gates into CI so regressions block deploy, not customers
Day 31–60
HITL gates installed
- Approval checkpoints in front of every irreversible action — refunds, deletes, sends, prod writes
- Dual-key approval for high-blast-radius operations
- Audit log so every human override is traceable to a person and a reason
Day 61–90
Cost dashboards + on-call runbook
- Per-agent and per-tenant cost dashboards with hard caps and soft alerts
- On-call runbook for the top-10 incident classes — page, triage, rollback, postmortem
- Quarterly architecture review embedded so the audit doesn't quietly rot

Reliability audit
£5K–£10K
Fixed price · per engagement
- Duration
- 5–10 days
- Founder hours
- 6 hrs
- Margin
- 92–94%
- 5–10 day on-site or remote engagement
- Includes audit + 30-page report + 90-day remediation plan
- 30-minute handoff call with your engineering lead
- Optional follow-on critical-fix sprint quote
FAQ
Practical questions.
No. Read-only access to source plus prod log samples is enough for the diagnostic. Write access is only required if you opt into the follow-on critical-fix sprint, and even then it's scoped to a feature branch you review before merge.
Five days from now
You'll know exactly where the gaps are.
No fluff, no consultancy theatre, no slides about "AI maturity". A diagnostic document, a fix plan, and a 30-minute call. Then you decide.