Same Architecture. Different Domain.

OpsCopilot is a governed Kubernetes copilot. The Philips test-bench initiative needs a governed BDD/Cucumber triage system. Every architectural pattern I built transfers directly.

Kubernetes Copilot (OpsCopilot)  →  BDD / Cucumber Test-Bench (Philips Initiative)

Bounded Agent Runtime → Triage & Action Graph  (same graph pattern)
  OpsCopilot: Scope → planner → tool executor → answer. Explicit graph, no loops.
  Philips:    Classify → retrieve → analyze → decide → act.

RAG over Runbooks → RAG over Gherkins & Feature Docs  (same retrieval stack)
  OpsCopilot: Hybrid vector + keyword search over operational docs.
  Philips:    Same OpenSearch index, different corpus.

Read-only Tool Server → Read + Write Tool Planes  (extends)
  OpsCopilot: Go service. Namespace allowlists. Zero write access by design.
  Philips:    Read plane + policy-gated write plane for tickets & notifications.

LLM Budget Gateway → Per-Build Cost Control  (same cost control)
  OpsCopilot: Per-run budget enforcement before any inference call.
  Philips:    Cheap model first. Strong model only for ambiguous failures.

OpenTelemetry + Langfuse → Trace & Eval Pipeline  (same observability)
  OpsCopilot: Every LLM call traced. Prompts versioned. Evals tracked.
  Philips:    Trace every agent decision. Eval prompts before promoting.

Scope Check + Guardrails → Policy-Gated Mutations  (same safety model)
  OpsCopilot: Structural safety, not policy documents, not hope.
  Philips:    Dedup → confidence → approval gate → idempotent write.

PostgreSQL Persistence → Test Store + Audit Log  (same data model)
  OpsCopilot: Sessions, runs, LLM calls, tool calls, cost events.
  Philips:    History, ownership, deferral, ticket IDs, evidence bundles.
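One pattern above, the per-run LLM budget gateway, can be sketched as a minimal Python gate. This is an illustrative sketch, not the OpsCopilot implementation: the class name, model labels, and per-1K-token prices are all hypothetical.

```python
class BudgetExceeded(Exception):
    pass

class RunBudget:
    """Per-run cost ceiling enforced before every inference call.

    Prices and limits below are illustrative, not real OpsCopilot values.
    """
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, model: str, input_tokens: int, output_tokens: int) -> None:
        # Hypothetical per-1K-token prices for a cheap and a strong model.
        prices = {"cheap": (0.0001, 0.0004), "strong": (0.003, 0.015)}
        p_in, p_out = prices[model]
        cost = input_tokens / 1000 * p_in + output_tokens / 1000 * p_out
        if self.spent_usd + cost > self.limit_usd:
            # Refuse *before* the call is made; no partial overspend.
            raise BudgetExceeded(f"run would exceed ${self.limit_usd:.2f} budget")
        self.spent_usd += cost

budget = RunBudget(limit_usd=0.05)
budget.charge("cheap", 2000, 500)        # within budget, allowed
try:
    budget.charge("strong", 20000, 4000)  # would blow the ceiling, blocked
except BudgetExceeded as e:
    print("blocked:", e)
```

The key design choice is that the gate runs before the inference call, so an over-budget run fails closed instead of overspending and reporting afterwards.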
Philips R&D Insight

Local Embeddings on Existing Philips Infrastructure

Philips already operates Nutanix-backed on-prem compute. The entire Gherkin corpus, feature documents, and historical ticket data can be embedded and re-indexed locally — no cloud API calls, no data egress, no recurring token cost.

  • $0 external embedding cost for the full corpus build
  • Qwen3 embedding models (0.6B · 4B · 8B) run on Nutanix nodes
  • Internal by design: test artifacts never leave Philips infra
  • Incremental: re-embed only changed files, keyed by content hash
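The incremental re-embedding step can be sketched as a content-hash manifest check. This is a minimal sketch under assumed conventions (a `.feature` corpus directory and a JSON manifest file; both names are illustrative):

```python
import hashlib
import json
from pathlib import Path

def content_hash(path: Path) -> str:
    """Stable digest of a file's bytes; the key for change detection."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(corpus_dir: Path, manifest_path: Path) -> list[Path]:
    """Return only the files whose content changed since the last run,
    and update the manifest so the next run skips them too."""
    try:
        manifest = json.loads(manifest_path.read_text())
    except FileNotFoundError:
        manifest = {}  # first run: everything counts as changed
    changed = []
    for path in sorted(corpus_dir.rglob("*.feature")):
        digest = content_hash(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return changed
```

Only the returned files would be sent to the local embedding model, so an unchanged nightly corpus costs zero embedding compute.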

Proposed System Design

Deterministic where correctness matters. Agentic where judgment adds value. LLMs are called only for failed or ambiguous tests, never for clean passes.

  Azure DevOps  ──build event──►  Ingestion API
                                       │
                          ┌────────────┴────────────┐
                          ▼                         ▼
                   Test Store (SQL)          OpenSearch (RAG)
                   pass/fail · owners        Gherkins · feature docs
                   deferral · tickets        step defs · prior tickets
                          │
                          ▼
                   Bounded Agent Graph
                   │
                   ├─ [1] classify          ← rules / small model
                   ├─ [2] query metrics     ← exact DB, no LLM
                   ├─ [3] hybrid retrieval  ← BM25 + vector
                   ├─ [4] failure analysis  ← stronger model, failures only
                   ├─ [5] policy check      ← deterministic safety gate
                    └─ [6] act               ← notify · draft · ticket · approve
                          │
                   Langfuse  ·  Audit Store  ·  Cost Ledger
      

Deterministic first

Clean passing tests never touch the LLM. Only failed, flaky, or ambiguous tests get deep analysis, reducing roughly 22,000 monthly results to ~1,100 LLM analyses at a 5% failure rate.
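The deterministic-first rule is a plain routing predicate, no model involved. A minimal sketch (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    status: str          # "passed" | "failed" | "skipped"
    flaky: bool = False  # e.g. pass/fail alternation over recent builds

def needs_llm_analysis(result: TestResult) -> bool:
    """Deterministic gate: only failed or flaky results reach the LLM."""
    return result.status == "failed" or result.flaky

results = [
    TestResult("login", "passed"),
    TestResult("export", "failed"),
    TestResult("sync", "passed", flaky=True),
]
to_analyze = [r for r in results if needs_llm_analysis(r)]
# Clean passes are filtered out before any model call is even considered.
```

Because the gate is a pure function over stored results, it is trivially testable and its pass-through rate is directly auditable from the test store.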

Cluster before ticketing

40 tests failing from one shared dependency → one incident. Group by signature hash, component, and semantic similarity before any ticket is created.
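Signature-hash grouping can be sketched as normalizing the volatile parts of an error message before hashing. A minimal sketch (the normalization pattern and sample errors are illustrative):

```python
import hashlib
import re
from collections import defaultdict

def failure_signature(error_text: str) -> str:
    """Replace volatile tokens (hex ids, numbers) so failures with the
    same root cause hash to the same signature."""
    normalized = re.sub(r"0x[0-9a-fA-F]+|\d+", "<N>", error_text)
    return hashlib.sha1(normalized.encode()).hexdigest()[:12]

def cluster_failures(failures: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (test_name, error_text) pairs into incident clusters."""
    clusters = defaultdict(list)
    for test_name, error_text in failures:
        clusters[failure_signature(error_text)].append(test_name)
    return dict(clusters)

failures = [
    ("checkout_eu", "ConnectionError: broker-7 timed out after 3000 ms"),
    ("checkout_us", "ConnectionError: broker-2 timed out after 3001 ms"),
    ("pdf_export", "AssertionError: expected 2 pages, got 1"),
]
clusters = cluster_failures(failures)
# Both checkout failures collapse into one cluster; the export failure
# stays separate. One shared-dependency outage → one candidate incident.
```

Component and semantic-similarity grouping would layer on top of this, but the cheap signature hash alone already collapses most mass failures.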

Replay before writing

Run the agent over historical builds before enabling any write action. Measure projected ticket volume, owner accuracy, and cost before production rollout.
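A replay harness is just the triage function run side-effect-free over history, with its would-be decisions tallied. A minimal sketch (the triage policy, build fields, and metrics are all illustrative):

```python
def replay(historical_builds, triage_fn):
    """Dry-run the triage agent over past builds; count what it *would* do."""
    stats = {"tickets": 0, "notifications": 0, "owner_correct": 0, "total": 0}
    for build in historical_builds:
        decision = triage_fn(build)  # pure function, no writes performed
        stats["total"] += 1
        if decision["action"] == "ticket":
            stats["tickets"] += 1
        elif decision["action"] == "notify":
            stats["notifications"] += 1
        if decision["owner"] == build["actual_owner"]:
            stats["owner_correct"] += 1
    return stats

# Toy triage policy and two historical builds, purely for illustration.
def toy_triage(build):
    action = "ticket" if build["failures"] > 5 else "notify"
    return {"action": action, "owner": build["suspected_owner"]}

history = [
    {"failures": 12, "suspected_owner": "team-a", "actual_owner": "team-a"},
    {"failures": 1,  "suspected_owner": "team-b", "actual_owner": "team-c"},
]
report = replay(history, toy_triage)
```

Projected ticket volume, owner accuracy, and cost per build fall straight out of `report` before any write action is ever enabled.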

Philips R&D Insight

Philips already runs nightly builds on Azure DevOps and operates Nutanix-backed on-prem servers with substantial headroom. This means the ingestion API, OpenSearch index, Langfuse instance, and local embedding inference can all run on existing infrastructure from day one — no new cloud spend, no procurement cycle, no data leaving the building.

Safety & Cost

Write-Action Policy

Action                     Auto           Approval
Audit log entry            Yes            No
Owner notification         If confident   Maybe
Draft ticket               Yes            No
Create ticket              After dedup    Maybe
Change deferral state      No             Yes
Merge generated Gherkin    No             Always
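The policy table above can be encoded as a deterministic gate so no write path depends on model judgment. A minimal sketch (action keys, gate names, and predicates are illustrative):

```python
from enum import Enum

class Gate(Enum):
    AUTO = "auto"                # execute immediately, audit-logged
    CONDITIONAL = "conditional"  # auto only if a predicate holds
    APPROVAL = "approval"        # always routed to a human approver

# Mirrors the write-action policy table.
POLICY = {
    "audit_log_entry":         Gate.AUTO,
    "owner_notification":      Gate.CONDITIONAL,  # auto only if confident
    "draft_ticket":            Gate.AUTO,
    "create_ticket":           Gate.CONDITIONAL,  # auto only after dedup
    "change_deferral_state":   Gate.APPROVAL,
    "merge_generated_gherkin": Gate.APPROVAL,
}

def allowed(action: str, *, confident: bool = False, deduped: bool = False) -> bool:
    """True if the action may execute without a human in the loop."""
    gate = POLICY[action]
    if gate is Gate.AUTO:
        return True
    if gate is Gate.APPROVAL:
        return False  # must go through the approval queue
    checks = {"owner_notification": confident, "create_ticket": deduped}
    return checks[action]
```

Because the gate is a lookup table plus explicit predicates, changing the policy is a reviewed code change, not a prompt tweak.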

Phased Delivery

Read-only first. Prove reliability through replay. Expand only as the system earns trust.

Phase 1

Intelligent Triage

  • Ingest nightly results
  • Compute flaky / failure metrics
  • Retrieve related context
  • Resolve owner
  • Draft ticket or low-risk notify

Read-only. No write authority yet.

Phase 2

Controlled Automation

  • Create & update tickets
  • Group duplicate failures
  • Notify with evidence bundles
  • Approval gate for sensitive actions
  • Full audit trail on every write

After replay validates accuracy.

Phase 3

Gherkin Generation

  • Retrieve feature docs + prior Gherkins
  • Generate schema-constrained candidates
  • Validate syntax + dedup + step mapping
  • LLM-as-judge scoring via Langfuse
  • Human review before any merge

A compiler pipeline, not a chatbot feature.
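The validation stage of that pipeline can be sketched as cheap structural checks run before any LLM-as-judge scoring or human review. A minimal sketch, assuming exact-match step binding against a known step-definition set (a real pipeline would match parameterized step expressions):

```python
import re

STEP_KEYWORD = re.compile(r"^\s*(Given|When|Then|And|But)\s+\S")

def validate_gherkin(candidate: str, known_steps: set[str]) -> list[str]:
    """Return a list of structural errors; empty means the candidate
    may proceed to judge scoring and human review."""
    errors = []
    lines = [line for line in candidate.splitlines() if line.strip()]
    if not any(line.strip().startswith("Feature:") for line in lines):
        errors.append("missing Feature: header")
    if not any(line.strip().startswith("Scenario") for line in lines):
        errors.append("missing Scenario")
    for line in lines:
        if STEP_KEYWORD.match(line):
            step = line.strip()
            # Step-mapping check: every step must bind to a step definition.
            if step not in known_steps:
                errors.append(f"unbound step: {step!r}")
    return errors
```

Syntax and step-binding failures are rejected deterministically, so the expensive judge model only ever scores candidates that could actually run.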