Same Architecture. Different Domain.

OpsCopilot is a governed Kubernetes copilot. The Philips test-bench initiative needs a governed BDD/Cucumber triage system. Every architectural pattern I built transfers directly; only the read-only tool server needs extending, to add a policy-gated write plane.

Kubernetes Copilot (OpsCopilot)

transfers as

BDD / Cucumber Test-Bench (Philips Initiative)

Bounded Agent Runtime

Scope → planner → tool executor → answer. Explicit graph, no loops.

same graph pattern

Triage & Action Graph

Classify → retrieve → analyze → decide → act.

RAG over Runbooks

Hybrid vector + keyword search over operational docs.

same retrieval stack

RAG over Gherkins & Feature Docs

Same OpenSearch index, different corpus.

Read-only Tool Server

Go service. Namespace allowlists. Zero write access by design.

extends to

Read + Write Tool Planes

Read plane + policy-gated write plane for tickets & notifications.

LLM Budget Gateway

Per-run budget enforcement before any inference call.

same cost control

Per-Build Cost Control

Cheap model first. Strong model only for ambiguous failures.

OpenTelemetry + Langfuse

Every LLM call traced. Prompts versioned. Evals tracked.

same observability

Trace & Eval Pipeline

Trace every agent decision. Eval prompts before promoting.

Scope Check + Guardrails

Structural safety — not policy documents, not hope.

same safety model

Policy-Gated Mutations

Dedup → confidence → approval gate → idempotent write.

PostgreSQL Persistence

Sessions, runs, LLM calls, tool calls, cost events.

same data model

Test Store + Audit Log

History, ownership, deferral, ticket IDs, evidence bundles.
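
The "LLM Budget Gateway" and "Per-Build Cost Control" rows above can be combined in one small sketch: reserve budget before each inference call, try the cheap model first, and escalate only when it is not confident. This is a minimal illustration, not the OpsCopilot implementation. The names `RunBudget`, `analyze_failure`, the `COST_USD` estimates, and the 0.8 confidence threshold are all assumptions, and the two models are passed in as plain callables returning `(label, confidence)`.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a call would push the run past its budget."""


@dataclass
class RunBudget:
    limit_usd: float        # hypothetical per-run (per-build) ceiling
    spent_usd: float = 0.0

    def reserve(self, estimated_cost_usd: float) -> None:
        # Checked BEFORE the inference call, so an over-budget call is never sent.
        if self.spent_usd + estimated_cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"would spend {self.spent_usd + estimated_cost_usd:.4f} "
                f"of {self.limit_usd:.4f} USD"
            )
        self.spent_usd += estimated_cost_usd


# Hypothetical per-call cost estimates; real values depend on the models used.
COST_USD = {"small": 0.001, "strong": 0.02}


def analyze_failure(failure_text, budget, small_model, strong_model,
                    min_confidence=0.8):
    """Cheap model first; strong model only when the cheap one is unsure."""
    budget.reserve(COST_USD["small"])
    label, confidence = small_model(failure_text)
    if confidence >= min_confidence:
        return label, "small"

    # Ambiguous result: escalate, but only if the budget still allows it.
    budget.reserve(COST_USD["strong"])
    label, _ = strong_model(failure_text)
    return label, "strong"
```

Raising an exception on an exhausted budget keeps the gateway fail-closed: the caller has to decide explicitly what an unanalyzed failure should become, for example a draft ticket flagged for manual review.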

Philips R&D Insight

Local Embeddings on Existing Philips Infrastructure

Philips already operates Nutanix-backed on-prem compute. The entire Gherkin corpus, feature documents, and historical ticket data can be embedded and re-indexed locally — no cloud API calls, no data egress, no recurring token cost.

external embedding cost
for full corpus build

Qwen3

0.6B · 4B · 8B models
runs on Nutanix nodes

Internal

test artifacts never
leave Philips infra

Incremental

re-embed only changed
files by content hash
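
The incremental re-embedding idea above can be sketched with nothing but the standard library: hash each file's bytes, compare against a stored manifest, and re-embed only what changed. This is an illustrative sketch. The function names, the `*.feature` glob, and the manifest shape (relative path mapped to SHA-256 digest) are assumptions, and the embedding call itself is left out.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """SHA-256 of the file's bytes; renames and touch-only changes keep the digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_reembed(corpus_dir: Path, manifest: dict, pattern: str = "*.feature"):
    """Return (changed_files, new_manifest).

    `manifest` maps relative path -> digest from the previous indexing run.
    A file is re-embedded when it is new or its digest differs.
    """
    changed = []
    new_manifest = {}
    for path in sorted(corpus_dir.rglob(pattern)):
        key = path.relative_to(corpus_dir).as_posix()
        digest = content_hash(path)
        new_manifest[key] = digest
        if manifest.get(key) != digest:
            changed.append(path)
    # Keys present in `manifest` but missing from `new_manifest` are deleted
    # files; their vectors would need removing from the index separately.
    return changed, new_manifest
```

On a first run with an empty manifest every file is returned, which corresponds to the full corpus build; later runs only touch files whose content actually changed.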

Proposed System Design

Deterministic where correctness matters. Agentic where judgment adds value. LLMs called only for failed or ambiguous tests — never for clean passes.

  Azure DevOps  ──build event──►  Ingestion API
                                       │
                          ┌────────────┴────────────┐
                          ▼                         ▼
                   Test Store (SQL)          OpenSearch (RAG)
                   pass/fail · owners        Gherkins · feature docs
                   deferral · tickets        step defs · prior tickets
                          │
                          ▼
                   Bounded Agent Graph
                   │
                   ├─ [1] classify          ← rules / small model
                   ├─ [2] query metrics     ← exact DB, no LLM
                   ├─ [3] hybrid retrieval  ← BM25 + vector
                   ├─ [4] failure analysis  ← stronger model, failures only
                   ├─ [5] policy check      ← deterministic safety gate
                   └─ [6] act              ← notify · draft · ticket · approve
                          │
                   Langfuse  ·  Audit Store  ·  Cost Ledger
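
The six-step graph in the diagram above can be read as a fixed, ordered list of nodes where each node runs at most once. The sketch below shows that shape only. Every step body is a placeholder stub, and the names `run_graph`, `GRAPH`, and the `done` flag are assumptions made for illustration, not the actual runtime.

```python
def classify(state):
    # Rules first: a clean pass ends the run before any model is involved.
    if state["status"] == "passed":
        return {**state, "category": "clean_pass", "done": True}
    return {**state, "category": "needs_analysis"}


def query_metrics(state):
    # Placeholder for an exact database query (no LLM).
    return {**state, "metrics": {"recent_failures": 0}}


def hybrid_retrieval(state):
    # Placeholder for BM25 + vector search.
    return {**state, "context": []}


def failure_analysis(state):
    # Placeholder for the stronger-model call, reached by failures only.
    return {**state, "analysis": "unanalyzed"}


def policy_check(state):
    # Placeholder for the deterministic safety gate.
    return {**state, "allowed_action": "draft"}


def act(state):
    return {**state, "action_taken": state["allowed_action"]}


GRAPH = [
    ("classify", classify),
    ("query_metrics", query_metrics),
    ("hybrid_retrieval", hybrid_retrieval),
    ("failure_analysis", failure_analysis),
    ("policy_check", policy_check),
    ("act", act),
]


def run_graph(state, steps=GRAPH):
    """Run each node once, in order; no node can re-enter the graph."""
    trace = []
    for name, step in steps:
        state = step(state)
        trace.append(name)
        if state.get("done"):
            break  # early exit is the only branch; there are no back-edges
    return {**state, "trace": trace}
```

Because the graph is a plain list, the worst-case number of model calls per test result is known up front, which is what makes per-build budgeting tractable.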

Deterministic first

Clean passing tests never touch the LLM. Only failed, flaky, or ambiguous tests get deep analysis. At a 5% failure rate, that reduces 22,000 monthly results to ~1,100 (22,000 × 0.05).
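
The volume reduction quoted above is simple arithmetic, and the routing rule behind it is a one-line filter. A minimal sketch, assuming test results carry a status string; the status labels and function names here are hypothetical.

```python
# Hypothetical status labels; a real store may encode these differently.
LLM_ELIGIBLE = {"failed", "flaky", "ambiguous"}


def needs_llm(status: str) -> bool:
    """Only non-clean results are routed to model-based analysis."""
    return status in LLM_ELIGIBLE


def projected_llm_volume(monthly_results: int, failure_rate: float) -> int:
    """Expected number of results per month that reach deep analysis."""
    return round(monthly_results * failure_rate)


print(projected_llm_volume(22_000, 0.05))  # 1100
```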

Cluster before ticketing

40 tests failing from one shared dependency → one incident. Group by signature hash, component, and semantic similarity before any ticket is created.
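
A signature hash of the kind described above can be sketched by stripping volatile tokens (numbers, hex addresses) from the error text before hashing, so that failures with the same root cause collapse to one key. This covers only the signature-and-component part of the grouping; semantic similarity would need an embedding step that is not shown. The field names `component`, `error`, and `test` are assumptions.

```python
import hashlib
import re
from collections import defaultdict


def failure_signature(component: str, error_message: str) -> str:
    """Stable key for 'the same failure', ignoring run-specific numbers."""
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "<n>", error_message.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    raw = f"{component}|{normalized}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def cluster_failures(failures):
    """Group test names by signature; one cluster maps to one incident."""
    clusters = defaultdict(list)
    for failure in failures:
        key = failure_signature(failure["component"], failure["error"])
        clusters[key].append(failure["test"])
    return dict(clusters)
```

With this normalization, 40 tests that all time out against the same dependency, each with a different elapsed-milliseconds value in the message, land in a single cluster and would produce one ticket.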

Replay before writing

Run the agent over historical builds before enabling any write action. Measure projected ticket volume, owner accuracy, and cost before production rollout.
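
A replay run of the kind described above amounts to calling the triage logic in dry-run mode over labelled history and summarizing what it would have done. The sketch below assumes each historical failure records the owner who actually handled it (`actual_owner`) and that the triage function returns a cluster key and a predicted owner. Those field names, the `replay` function, and the flat per-call cost are all hypothetical.

```python
def replay(historical_failures, triage, cost_per_call_usd):
    """Dry-run `triage` over past failures; nothing is written anywhere.

    Returns the three numbers worth checking before enabling writes:
    projected ticket volume, owner accuracy, and projected cost.
    """
    clusters = set()
    correct_owner = 0
    for failure in historical_failures:
        decision = triage(failure)
        clusters.add(decision["cluster"])  # one ticket per distinct cluster
        if decision["owner"] == failure["actual_owner"]:
            correct_owner += 1

    total = len(historical_failures)
    return {
        "projected_tickets": len(clusters),
        "owner_accuracy": correct_owner / total if total else 0.0,
        "projected_cost_usd": total * cost_per_call_usd,
    }
```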

Philips R&D Insight

Philips already runs nightly builds on Azure DevOps and operates Nutanix-backed on-prem servers with substantial headroom. This means the ingestion API, OpenSearch index, Langfuse instance, and local embedding inference can all run on existing infrastructure from day one — no new cloud spend, no procurement cycle, no data leaving the building.

Safety & Cost

Write-Action Policy

Action	Automatic	Requires approval
Audit log entry	Yes	No
Owner notification	If confident	Maybe
Draft ticket	Yes	No
Create ticket	After dedup	Maybe
Change deferral state	No	Yes
Merge generated Gherkin	No	Always
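
The policy table above can be expressed as a small deterministic function so that the gate is code rather than convention. This sketch encodes one possible reading of the table: "If confident" and "Maybe" are interpreted as "automatic above a confidence threshold, otherwise routed to approval", and "After dedup" as "skip when an equivalent ticket already exists". Those interpretations, the action identifiers, and the 0.9 threshold are assumptions.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"
    SKIP_DUPLICATE = "skip_duplicate"


ALWAYS_AUTO = {"audit_log_entry", "draft_ticket"}
ALWAYS_APPROVAL = {"change_deferral_state", "merge_generated_gherkin"}


def decide(action, confidence=0.0, is_duplicate=False, min_confidence=0.9):
    """Deterministic gate in front of every write action."""
    if action in ALWAYS_AUTO:
        return Decision.AUTO
    if action in ALWAYS_APPROVAL:
        return Decision.NEEDS_APPROVAL  # never automatic, however confident
    if action == "owner_notification":
        return Decision.AUTO if confidence >= min_confidence else Decision.NEEDS_APPROVAL
    if action == "create_ticket":
        if is_duplicate:
            return Decision.SKIP_DUPLICATE  # dedup runs before anything else
        return Decision.AUTO if confidence >= min_confidence else Decision.NEEDS_APPROVAL
    # Unknown actions are rejected rather than defaulted to AUTO.
    raise ValueError(f"no policy defined for action: {action}")
```

Rejecting unknown actions keeps the gate fail-closed: adding a new write capability requires adding a policy row first.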

Phased Delivery

Read-only first. Prove reliability through replay. Expand only as the system earns trust.

Phase 1

Intelligent Triage

Ingest nightly results
Compute flaky / failure metrics
Retrieve related context
Resolve owner
Draft ticket or low-risk notify

Read-only. No write authority yet.

Phase 2

Controlled Automation

Create & update tickets
Group duplicate failures
Notify with evidence bundles
Approval gate for sensitive actions
Full audit trail on every write

After replay validates accuracy.

Phase 3

Gherkin Generation

Retrieve feature docs + prior Gherkins
Generate schema-constrained candidates
Validate syntax + dedup + step mapping
LLM-as-judge scoring via Langfuse
Human review before any merge

Compiler pipeline, not chatbot feature.
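
The "validate syntax + dedup + step mapping" stage in Phase 3 can be sketched as a pure function that returns a list of errors for a generated candidate. This is deliberately not a full Gherkin parser: it checks only for a `Feature:` header, at least one `Scenario:`, that every step maps to a known step-definition pattern, and that the step sequence is not already in the corpus. The function names, the fingerprint scheme, and the sample step patterns in the usage note are hypothetical.

```python
import re

STEP_KEYWORDS = ("Given ", "When ", "Then ", "And ", "But ")


def step_lines(text):
    """All step lines in the candidate, with indentation removed."""
    return [ln.strip() for ln in text.splitlines()
            if ln.strip().startswith(STEP_KEYWORDS)]


def scenario_fingerprint(text):
    """Case-insensitive key built from the ordered step lines, used for dedup."""
    return " | ".join(step.lower() for step in step_lines(text))


def validate_candidate(text, step_patterns, known_fingerprints):
    """Return a list of problems; an empty list means the candidate may proceed."""
    errors = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Syntax (minimal): header first, at least one scenario.
    if not lines or not lines[0].startswith("Feature:"):
        errors.append("missing Feature: header")
    if not any(ln.startswith("Scenario:") for ln in lines):
        errors.append("no Scenario: found")

    # Step mapping: every step body must match an existing step definition.
    for step in step_lines(text):
        body = step.split(" ", 1)[1]
        if not any(re.fullmatch(pattern, body) for pattern in step_patterns):
            errors.append(f"unmapped step: {body}")

    # Dedup: reject a step sequence that already exists.
    if scenario_fingerprint(text) in known_fingerprints:
        errors.append("duplicate of an existing scenario")
    return errors
```

A candidate whose steps all match patterns such as `r"the charge drops below \d+ percent"` returns an empty list and moves on to scoring and human review; anything else is rejected before a reviewer sees it, which is the sense in which this stage behaves like a compiler pass.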