Same Architecture. Different Domain.

OpsCopilot is a governed Kubernetes copilot. The Philips initiative needs a governed BDD/Cucumber triage system — every architectural pattern transfers directly.

Kubernetes Copilot (OpsCopilot) ── transfers as ──► BDD / Cucumber Test-Bench (Philips Initiative)

Bounded Agent Runtime → Triage & Action Graph  (same graph pattern)
  OpsCopilot: Scope check → clarifier → planner → tool executor → answer. Explicit graph, no loops.
  Philips:    Classify → retrieve → analyze → decide → act.

RAG over Runbooks → RAG over Gherkins & Feature Docs  (same retrieval stack)
  OpsCopilot: Hybrid vector + keyword search over operational docs.
  Philips:    Same OpenSearch index, different corpus.

Read-only Tool Server → Read + Write Tool Planes  (extends to)
  OpsCopilot: Go service. Namespace allowlists. Zero write access by design.
  Philips:    Read plane + policy-gated write plane for tickets & notifications.

LLM Budget Gateway → Per-Build Cost Control  (same cost control)
  OpsCopilot: Per-run budget enforcement before any inference call.
  Philips:    Cheap model first. Strong model only for ambiguous failures.

OpenTelemetry + Langfuse → Trace & Eval Pipeline  (same observability)
  OpsCopilot: Every LLM call traced. Prompts versioned. Evals tracked.
  Philips:    Trace every agent decision. Eval prompts before promoting.

Scope Check + Guardrails → Policy-Gated Mutations  (same safety model)
  OpsCopilot: Structural safety — not policy documents, not hope.
  Philips:    Dedup → confidence → approval gate → idempotent write.

PostgreSQL Persistence → Test Store + Audit Log  (same data model)
  OpsCopilot: Sessions, runs, LLM calls, tool calls, cost events.
  Philips:    History, ownership, deferral, ticket IDs, evidence bundles.
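The policy-gated mutation row (dedup → confidence → approval gate → idempotent write) can be sketched as a minimal pipeline. This is an illustrative sketch, not the OpsCopilot API; the action names, threshold, and store are all assumptions:

```python
import hashlib

APPROVAL_REQUIRED = {"create_ticket"}   # assumption: ticket creation needs sign-off
CONFIDENCE_THRESHOLD = 0.8              # assumption: tunable per action type

class WritePlane:
    """Minimal sketch of a policy-gated, idempotent write plane."""

    def __init__(self):
        self.executed = {}  # idempotency key -> result of the write

    def idempotency_key(self, action, payload):
        raw = action + "|" + "|".join(f"{k}={payload[k]}" for k in sorted(payload))
        return hashlib.sha256(raw.encode()).hexdigest()

    def submit(self, action, payload, confidence, approved=False):
        # 1. dedup / idempotency: the same action+payload never writes twice
        key = self.idempotency_key(action, payload)
        if key in self.executed:
            return {"status": "duplicate", "result": self.executed[key]}
        # 2. confidence gate: low-confidence mutations are refused, not retried
        if confidence < CONFIDENCE_THRESHOLD:
            return {"status": "blocked", "reason": "low_confidence"}
        # 3. approval gate for sensitive actions
        if action in APPROVAL_REQUIRED and not approved:
            return {"status": "pending_approval"}
        # 4. the idempotent write itself (stubbed here)
        result = {"action": action, **payload}
        self.executed[key] = result
        return {"status": "written", "result": result}
```

Every refused or deferred mutation returns a status instead of raising, so the audit log can record why nothing was written.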
The Problem I Understood
Scale
1,000+ tests/night
Volume alone makes manual triage unsustainable.
Context
History makes failures meaningful
A test result without prior flakiness data is just noise.
Safety
Wrong mutations are worse than no automation
An agent that writes incorrectly erodes trust faster than none.
Cost
Naïve LLM routing is expensive
Sending every test through deep reasoning burns budget fast.
Reliability
Flaky tests poison the signal
Flakiness scores must be computed before any LLM sees the data.
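One common way to score flakiness deterministically, before any LLM sees the data, is a flip rate over the recent run history. A sketch; the window size and threshold here are assumptions, not tuned values:

```python
def flip_rate(history, window=30):
    """Fraction of adjacent runs where the verdict flipped (pass <-> fail).

    history: chronological list of verdicts, e.g. ["pass", "fail", ...]
    Returns 0.0 for a stable test, approaching 1.0 for a coin-flip test.
    """
    recent = history[-window:]
    if len(recent) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return flips / (len(recent) - 1)

def classify_stability(history, flaky_threshold=0.3):
    # deterministic pre-classification: plain arithmetic, zero LLM calls
    return "flaky" if flip_rate(history) >= flaky_threshold else "stable"
```

A test marked "flaky" by this rule can be routed away from deep analysis entirely, which is exactly what keeps the signal clean.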
Proposed System Design

Deterministic where correctness matters. Agentic only where judgment adds value. LLMs invoked only for failed or ambiguous tests.

Azure DevOps (cloud) ──── nightly pipeline ──── POST /ingest ──────────────►
  payload: { run_id · results[] · branch · commit_sha · duration_ms }
                                     │ HTTPS · mTLS · auth token
                                     ▼
┌─────────────────────── Nutanix Cluster (on-prem) ─────────────────────────┐
│                                                                            │
│  Ingestion API  ·  Go  ·  port 8080                                       │
│  validate schema  ·  dedup run_id  ·  normalize results                   │
│  enqueue Δ-files → Embedding Worker (Qwen3-8B, CPU)                       │
│          │                                    │                            │
│          ▼                                    ▼                            │
│  PostgreSQL  (Test Store)             OpenSearch  (RAG Index)              │
│  ─────────────────────────            ─────────────────────────────────   │
│  test_runs · results                  Gherkin features · step defs        │
│  owners · deferrals                   feature docs · prior tickets        │
│  tickets · cost_events                embeddings: Qwen3-8B vectors         │
│          │                                    │                            │
│          └─────────────────┬──────────────────┘                           │
│                            ▼                                               │
│  Bounded Agent Graph  ·  LangGraph  ·  Python                             │
│  ──────────────────────────────────────────────────────────────────────   │
│  [1] classify    rule engine            → flaky / env / logic / unknown   │
│  [2] query       SQL · zero LLM calls   → trend · owner · deferral state  │
│  [3] retrieve    BM25 + vector hybrid   → similar Gherkins · past issues  │
│  [4] analyze     Qwen3-8B · CPU · on-prem  → root cause · evidence bundle  │
│  [5] gate        policy check           → confidence ≥ threshold          │
│  [6] act         write plane · Go       → notify · draft ticket · create  │
│                            │                                               │
│  Langfuse (LLM traces)  ·  Audit Store  ·  Cost Ledger  ·  all on-prem   │
│                            │                                               │
│  step [4] only: LLM inference call ─────────────────────────────────────┼──►  AWS Bedrock (cloud)
│  no embeddings · no raw test data · inference tokens only                │    strong model · ambiguous cases
└────────────────────────────────────────────────────────────────────────────┘
LangGraph Langfuse Qwen3 PostgreSQL OpenSearch Go Python Azure DevOps Nutanix AWS Bedrock BM25 Vector Search
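The six-step graph above maps naturally onto LangGraph; this dependency-free sketch shows the same bounded shape: an explicit ordered pipeline with an early exit and no loops. Every node body is a stub, not the real implementation:

```python
def classify(state):
    # [1] rule engine, no LLM; real rules inspect logs and run history
    if state["verdict"] == "pass":
        state["category"] = "clean"
    elif state.get("flip_rate", 0.0) >= 0.3:
        state["category"] = "flaky"
    else:
        state["category"] = "unknown"
    return state

def query(state):
    # [2] SQL lookups: trend, owner, deferral state (stubbed)
    state["owner"] = "team-a"
    return state

def retrieve(state):
    # [3] hybrid BM25 + vector retrieval (stubbed)
    state["context"] = ["similar gherkin", "past ticket"]
    return state

def analyze(state):
    # [4] the ONLY step allowed to make an LLM inference call
    state["llm_calls"] = state.get("llm_calls", 0) + 1
    state["root_cause"] = "stub"
    return state

def gate(state):
    # [5] policy check: confidence must clear the threshold before any write
    state["approved"] = state.get("confidence", 0.0) >= 0.8
    return state

def act(state):
    # [6] write plane: notify, draft, or create (stubbed)
    state["action"] = "draft_ticket" if state["approved"] else "notify_only"
    return state

def run_graph(state):
    state = classify(state)
    state = query(state)
    if state["category"] == "clean":   # deterministic early exit:
        state["action"] = "none"       # clean passes never reach the LLM
        return state
    for step in (retrieve, analyze, gate, act):
        state = step(state)
    return state
```

The bounded shape is the point: the only path to a write runs through the rule engine, the retrieval context, and the policy gate, in that order.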

Deterministic first

Clean passes never touch the LLM. Only failed/ambiguous tests trigger steps 3–6.

Cluster before ticketing

40 tests from one root cause → one incident, not 40 noisy tickets.
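Clustering before ticketing can be as simple as grouping failures by a normalized error fingerprint before anything reaches the ticket writer. A sketch; the normalization rules are illustrative assumptions:

```python
import hashlib
import re

def fingerprint(error_text):
    """Collapse volatile details so failures with one root cause hash alike."""
    normalized = re.sub(r"0x[0-9a-fA-F]+", "<addr>", error_text)  # memory addresses
    normalized = re.sub(r"\d+", "<n>", normalized)                # ids, line numbers, timings
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

def cluster(failures):
    """failures: list of (test_name, error_text) -> {fingerprint: [test_name, ...]}"""
    clusters = {}
    for name, err in failures:
        clusters.setdefault(fingerprint(err), []).append(name)
    return clusters
```

One cluster then becomes one incident with an evidence bundle, instead of one ticket per failing test.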

Replay before writing

Validate accuracy and cost on historical builds before any write action is enabled.
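Replay means dry-running the pipeline against historical builds with known outcomes and enabling the write plane only once accuracy and cost clear a bar. A minimal harness sketch; the thresholds and the pipeline interface are assumptions:

```python
def replay(pipeline, historical_runs, min_accuracy=0.9, max_cost_usd=50.0):
    """Dry-run the triage pipeline over labeled history; gate write enablement.

    historical_runs: list of (input_state, human_label) pairs
    pipeline: callable returning a dict with "category" and "cost_usd"
    """
    correct, total_cost = 0, 0.0
    for state, label in historical_runs:
        result = pipeline(state)           # no write plane: classification only
        correct += (result["category"] == label)
        total_cost += result.get("cost_usd", 0.0)
    accuracy = correct / len(historical_runs)
    return {
        "accuracy": accuracy,
        "total_cost_usd": round(total_cost, 2),
        "enable_writes": accuracy >= min_accuracy and total_cost <= max_cost_usd,
    }
```

The same report doubles as a cost forecast: total replay spend over a month of builds projects the nightly budget before any real spend is committed.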

Local Embeddings · Philips R&D Insight

Nutanix headroom is already available. Qwen3 runs entirely on-prem — the full Gherkin corpus is embedded at $0 external cost. Azure DevOps, OpenSearch, and Langfuse deploy on existing infra. No procurement cycle, no data egress.

$0
external embedding cost
Qwen3
0.6B · 4B · 8B on Nutanix
Internal
artifacts never leave infra
Δ-only
re-embed changed files by hash
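The Δ-only card translates to a content-hash check in front of the embedding worker: a file is re-embedded only when its hash changes. A sketch, where the hash store and embed function are stand-ins (persisted state and the on-prem Qwen3 model in the actual design):

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

def embed_delta(files, hash_store, embed_fn):
    """files: {path: gherkin_text}. Re-embeds only new or changed files.

    hash_store: {path: last_seen_hash}, persisted between nightly runs
    embed_fn:   callable(text) -> vector, local model in the design
    """
    embedded = []
    for path, text in files.items():
        h = content_hash(text)
        if hash_store.get(path) == h:
            continue              # unchanged: skip, nothing spent
        embed_fn(text)            # changed or new: re-embed locally
        hash_store[path] = h
        embedded.append(path)
    return embedded
```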
Phased Delivery
Read-only first. Prove reliability. Expand as the system earns trust.
Phase 1

Intelligent Triage

  • Ingest nightly results
  • Compute flaky / failure metrics
  • Retrieve related context
  • Resolve owner
  • Draft ticket or low-risk notify

Read-only. No write authority yet.

Phase 2

Controlled Automation

  • Create & update tickets
  • Group duplicate failures
  • Notify with evidence bundles
  • Approval gate for sensitive actions
  • Full audit trail on every write

After replay validates accuracy.

Phase 3

Gherkin Generation

  • Retrieve feature docs + prior Gherkins
  • Generate schema-constrained candidates
  • Validate syntax + dedup + step mapping
  • LLM-as-judge scoring via Langfuse
  • Human review before any merge

Compiler pipeline, not chatbot feature.

Philips Work Story
Same team. Same infrastructure. Same problem space.
Deployment Automation
3 days → 1.5 hrs

Built a staged automation platform for PICiX performance-test environment setup — reducing a painful manual workflow to a repeatable, observable pipeline with improved reliability and configurability.

PICiX Scale
500+ Nutanix hosts · 2,550 beds

Worked in the R&D Systems Engineering and Integration team deploying and configuring the largest-scale PICiX setup in the organisation — the same infra stack proposed for this initiative.

Static MAC · Least Privilege
Zero NAT exceptions

When DNS/DHCP access was too risky to grant, worked with the networking team and solved the bottleneck through static MAC assignment at VM creation — the same least-privilege mindset behind the safe-mutation design here.

C# PICiX Config Module
−15 min per deployment

Added license assignment and pre-population functionality to the in-house configuration tool inside the actual PICiX codebase — raised the PR and handled review comments. Shipped inside the existing SDLC, not around it.

"I understand your problem well enough to avoid the first five bad versions."