OpsCopilot is a governed Kubernetes copilot. The Philips test-bench initiative needs a governed BDD/Cucumber triage system. Every architectural pattern I built for OpsCopilot transfers directly.
Deterministic where correctness matters. Agentic where judgment adds value. LLMs are called only for failed, flaky, or ambiguous tests, never for clean passes.
Azure DevOps ──build event──► Ingestion API
                                    │
                       ┌────────────┴────────────┐
                       ▼                         ▼
               Test Store (SQL)          OpenSearch (RAG)
               pass/fail · owners        Gherkins · feature docs
               deferral · tickets        step defs · prior tickets
                                    │
                                    ▼
                           Bounded Agent Graph
                                    │
                                    ├─ [1] classify          ← rules / small model
                                    ├─ [2] query metrics     ← exact DB, no LLM
                                    ├─ [3] hybrid retrieval  ← BM25 + vector
                                    ├─ [4] failure analysis  ← stronger model, failures only
                                    ├─ [5] policy check      ← deterministic safety gate
                                    └─ [6] act               ← notify · draft · ticket · approve
                                    │
                  Langfuse · Audit Store · Cost Ledger
Clean passing tests never touch the LLM. Only failed, flaky, or ambiguous tests get deep analysis. At a 5% failure rate, that cuts 22,000 monthly results to about 1,100 that need analysis.
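The gate itself can be a few lines of deterministic code. The sketch below is illustrative only: the status labels, `ResultRecord`, and the function names are assumptions, not the real ingestion schema. It also reproduces the 22,000 to 1,100 arithmetic.

```python
from dataclasses import dataclass

# Hypothetical status labels; the real ingestion schema may differ.
NEEDS_ANALYSIS = {"failed", "flaky", "ambiguous"}


@dataclass(frozen=True)
class ResultRecord:
    test_id: str
    status: str  # "passed", "failed", "flaky", or "ambiguous"


def needs_llm(result: ResultRecord) -> bool:
    """Deterministic gate: clean passes never reach a model."""
    return result.status in NEEDS_ANALYSIS


def split_results(results):
    """Partition results into (stored only, sent on for analysis)."""
    stored_only = [r for r in results if not needs_llm(r)]
    to_analyze = [r for r in results if needs_llm(r)]
    return stored_only, to_analyze


def expected_analysis_volume(monthly_results: int, failure_rate: float) -> int:
    """Projected number of results per month that pass the gate."""
    return round(monthly_results * failure_rate)


if __name__ == "__main__":
    print(expected_analysis_volume(22_000, 0.05))  # 1100
```

Because the gate is a set lookup rather than a model call, it costs nothing per passing test and behaves identically on every run.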
40 tests failing from one shared dependency → one incident. Group by signature hash, component, and semantic similarity before any ticket is created.
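A minimal sketch of the first two grouping keys, signature hash and component. The field names (`test_id`, `component`, `error`) and the normalization rules are assumptions for illustration. Semantic similarity is left out here, since it would need an embedding model.

```python
import hashlib
import re
from collections import defaultdict


def signature_hash(error_message: str) -> str:
    """Hash a failure message after masking volatile details (numbers, hex ids)."""
    normalized = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", error_message.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def group_into_incidents(failures):
    """Group failures by (component, signature hash).

    `failures` is an iterable of dicts with the hypothetical keys
    "test_id", "component", and "error".
    """
    incidents = defaultdict(list)
    for failure in failures:
        key = (failure["component"], signature_hash(failure["error"]))
        incidents[key].append(failure["test_id"])
    return dict(incidents)
```

With this keying, 40 timeouts against the same service that differ only in elapsed milliseconds collapse into one incident, so at most one ticket is drafted.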
Run the agent over historical builds before enabling any write action. Measure projected ticket volume, owner accuracy, and cost before production rollout.
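One way the replay could be structured, as a sketch: the build and proposal shapes, `ReplayReport`, and the `triage` callable are all hypothetical. The point is that the harness only counts what the agent would have done and never writes anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayReport:
    builds: int
    projected_tickets: int
    owner_accuracy: float  # fraction of proposals naming the correct owner
    total_cost_usd: float


def replay(historical_builds, triage):
    """Run `triage` over past builds in dry-run mode and tally the outcome.

    Each build is a dict with the hypothetical keys "id" and "true_owners"
    (incident key -> owner, taken from how the failure was actually resolved).
    `triage(build)` returns a list of proposals, each a dict with
    "incident", "owner", and "cost_usd". No write action is performed.
    """
    builds = tickets = correct = 0
    cost = 0.0
    for build in historical_builds:
        builds += 1
        for proposal in triage(build):
            tickets += 1
            cost += proposal["cost_usd"]
            true_owner = build["true_owners"].get(proposal["incident"])
            if true_owner is not None and true_owner == proposal["owner"]:
                correct += 1
    accuracy = correct / tickets if tickets else 0.0
    return ReplayReport(builds, tickets, accuracy, round(cost, 4))
```

The three numbers in the report map directly onto the rollout questions: how many tickets would have been filed, how often the owner was right, and what it would have cost.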
Philips already runs nightly builds on Azure DevOps and operates Nutanix-backed on-prem servers with substantial headroom. The ingestion API, OpenSearch index, Langfuse instance, and local embedding inference can therefore all run on existing infrastructure from day one: no new cloud spend, no procurement cycle, no data leaving the building.
| Action | Automatic | Requires approval |
|---|---|---|
| Audit log entry | Yes | No |
| Owner notification | If confident | Maybe |
| Draft ticket | Yes | No |
| Create ticket | After dedup | Maybe |
| Change deferral state | No | Yes |
| Merge generated Gherkin | No | Always |
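
The table can be encoded as a deterministic policy check. This is a sketch under stated assumptions: the action names and the 0.8 confidence threshold are hypothetical, "Maybe" is read as "falls back to human approval when the automatic condition is not met", and unknown actions are refused.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical threshold; the table only says "If confident".
CONFIDENCE_THRESHOLD = 0.8

ALWAYS_AUTO = {"audit_log", "draft_ticket"}
ALWAYS_APPROVAL = {"change_deferral", "merge_gherkin"}


def check_policy(action: str, confidence: float = 0.0,
                 deduplicated: bool = False) -> Decision:
    """Deterministic gate: model output cannot override these rules."""
    if action in ALWAYS_AUTO:
        return Decision.AUTO
    if action in ALWAYS_APPROVAL:
        return Decision.NEEDS_APPROVAL
    if action == "notify_owner":
        if confidence >= CONFIDENCE_THRESHOLD:
            return Decision.AUTO
        return Decision.NEEDS_APPROVAL
    if action == "create_ticket":
        return Decision.AUTO if deduplicated else Decision.NEEDS_APPROVAL
    # Assumption: fail closed on anything the table does not list.
    return Decision.DENY
```

For example, `check_policy("create_ticket", deduplicated=False)` returns `Decision.NEEDS_APPROVAL`, and `check_policy("merge_gherkin", confidence=1.0)` still requires approval regardless of confidence.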
Read-only first. Prove reliability through replay. Expand only as the system earns trust.
Read-only. No write authority yet.
After replay validates accuracy.
Compiler pipeline, not chatbot feature.