Same Architecture. Different Domain.

OpsCopilot is a governed Kubernetes copilot. The Philips initiative needs a governed BDD/Cucumber triage system — every architectural pattern transfers directly.

Kubernetes Copilot (OpsCopilot) ── transfers as ──► BDD / Cucumber Test-Bench (Philips Initiative)

Bounded Agent Runtime → Triage & Action Graph  (same graph pattern)
  OpsCopilot: Scope check → clarifier → planner → tool executor → answer. Explicit graph, no loops.
  Philips:    Classify → retrieve → analyze → decide → act.

RAG over Runbooks → RAG over Gherkins & Feature Docs  (same retrieval stack)
  OpsCopilot: Hybrid vector + keyword search over operational docs.
  Philips:    Same OpenSearch index, different corpus.

Read-only Tool Server → Read + Write Tool Planes  (extends to)
  OpsCopilot: Go service. Namespace allowlists. Zero write access by design.
  Philips:    Read plane + policy-gated write plane for tickets & notifications.

LLM Budget Gateway → Per-Build Cost Control  (same cost control)
  OpsCopilot: Per-run budget enforcement before any inference call.
  Philips:    Cheap model first. Strong model only for ambiguous failures.

OpenTelemetry + Langfuse → Trace & Eval Pipeline  (same observability)
  OpsCopilot: Every LLM call traced. Prompts versioned. Evals tracked.
  Philips:    Trace every agent decision. Eval prompts before promoting.

Scope Check + Guardrails → Policy-Gated Mutations  (same safety model)
  OpsCopilot: Structural safety — not policy documents, not hope.
  Philips:    Dedup → confidence → approval gate → idempotent write.

PostgreSQL Persistence → Test Store + Audit Log  (same data model)
  OpsCopilot: Sessions, runs, LLM calls, tool calls, cost events.
  Philips:    History, ownership, deferral, ticket IDs, evidence bundles.
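The policy-gated mutation row (dedup → confidence → approval gate → idempotent write) can be sketched as a minimal pipeline. This is an illustrative sketch, not the OpsCopilot API; the action names, threshold, and store are all assumptions:

```python
import hashlib

APPROVAL_REQUIRED = {"create_ticket"}   # assumption: ticket creation needs sign-off
CONFIDENCE_THRESHOLD = 0.8              # assumption: tunable per action type

class WritePlane:
    """Minimal sketch of a policy-gated, idempotent write plane."""

    def __init__(self):
        self.executed = {}  # idempotency key -> result of the write

    def idempotency_key(self, action, payload):
        raw = action + "|" + "|".join(f"{k}={payload[k]}" for k in sorted(payload))
        return hashlib.sha256(raw.encode()).hexdigest()

    def submit(self, action, payload, confidence, approved=False):
        # 1. dedup / idempotency: the same action+payload never writes twice
        key = self.idempotency_key(action, payload)
        if key in self.executed:
            return {"status": "duplicate", "result": self.executed[key]}
        # 2. confidence gate: low-confidence mutations are refused, not retried
        if confidence < CONFIDENCE_THRESHOLD:
            return {"status": "blocked", "reason": "low_confidence"}
        # 3. approval gate for sensitive actions
        if action in APPROVAL_REQUIRED and not approved:
            return {"status": "pending_approval"}
        # 4. the idempotent write itself (stubbed here)
        result = {"action": action, **payload}
        self.executed[key] = result
        return {"status": "written", "result": result}
```

Every refused or deferred mutation returns a status instead of raising, so the audit log can record why nothing was written.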
The Problem I Understood
Scale
1,000+ tests/night
Volume alone makes manual triage unsustainable.
Context
History makes failures meaningful
A test result without prior flakiness data is just noise.
Safety
Wrong mutations are worse than no automation
An agent that writes incorrectly erodes trust faster than none.
Cost
Naïve LLM routing is expensive
Sending every test through deep reasoning burns budget fast.
Reliability
Flaky tests poison the signal
Flakiness scores must be computed before any LLM sees the data.
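One common way to score flakiness deterministically, before any LLM sees the data, is a flip rate over the recent run history. A sketch; the window size and threshold here are assumptions, not tuned values:

```python
def flip_rate(history, window=30):
    """Fraction of adjacent runs where the verdict flipped (pass <-> fail).

    history: chronological list of verdicts, e.g. ["pass", "fail", ...]
    Returns 0.0 for a stable test, approaching 1.0 for a coin-flip test.
    """
    recent = history[-window:]
    if len(recent) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return flips / (len(recent) - 1)

def classify_stability(history, flaky_threshold=0.3):
    # deterministic pre-classification: plain arithmetic, zero LLM calls
    return "flaky" if flip_rate(history) >= flaky_threshold else "stable"
```

A test marked "flaky" by this rule can be routed away from deep analysis entirely, which is exactly what keeps the signal clean.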
Proposed System Design

Deterministic where correctness matters. Agentic only where judgment adds value. LLMs invoked only for failed or ambiguous tests.

Azure DevOps (cloud) ──── nightly pipeline ──── POST /ingest ──────────────►
  payload: { run_id · results[] · branch · commit_sha · duration_ms }
                                     │ HTTPS · mTLS · auth token
                                     ▼
┌─────────────────────── Nutanix Cluster (on-prem) ─────────────────────────┐
│                                                                            │
│  Ingestion API  ·  Go  ·  port 8080                                       │
│  validate schema  ·  dedup run_id  ·  normalize results                   │
│  enqueue Δ-files → Embedding Worker (Qwen3-8B, CPU)                       │
│          │                                    │                            │
│          ▼                                    ▼                            │
│  PostgreSQL  (Test Store)             OpenSearch  (RAG Index)              │
│  ─────────────────────────            ─────────────────────────────────   │
│  test_runs · results                  Gherkin features · step defs        │
│  owners · deferrals                   feature docs · prior tickets        │
│  tickets · cost_events                embeddings: Qwen3-8B vectors         │
│          │                                    │                            │
│          └─────────────────┬──────────────────┘                           │
│                            ▼                                               │
│  Bounded Agent Graph  ·  LangGraph  ·  Python                             │
│  ──────────────────────────────────────────────────────────────────────   │
│  [1] classify    rule engine            → flaky / env / logic / unknown   │
│  [2] query       SQL · zero LLM calls   → trend · owner · deferral state  │
│  [3] retrieve    BM25 + vector hybrid   → similar Gherkins · past issues  │
│  [4] analyze     Qwen3-8B · CPU · on-prem  → root cause · evidence bundle  │
│  [5] gate        policy check           → confidence ≥ threshold          │
│  [6] act         write plane · Go       → notify · draft ticket · create  │
│                            │                                               │
│  Langfuse (LLM traces)  ·  Audit Store  ·  Cost Ledger  ·  all on-prem   │
│                            │                                               │
│  step [4] only: LLM inference call ─────────────────────────────────────┼──►  AWS Bedrock (cloud)
│  no embeddings · no raw test data · inference tokens only                │    strong model · ambiguous cases
└────────────────────────────────────────────────────────────────────────────┘
LangGraph Langfuse Qwen3 PostgreSQL OpenSearch Go Python Azure DevOps Nutanix AWS Bedrock BM25 Vector Search
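The six-step graph above maps naturally onto LangGraph; this dependency-free sketch shows the same bounded shape: an explicit ordered pipeline with an early exit and no loops. Every node body is a stub, not the real implementation:

```python
def classify(state):
    # [1] rule engine, no LLM; real rules inspect logs and run history
    if state["verdict"] == "pass":
        state["category"] = "clean"
    elif state.get("flip_rate", 0.0) >= 0.3:
        state["category"] = "flaky"
    else:
        state["category"] = "unknown"
    return state

def query(state):
    # [2] SQL lookups: trend, owner, deferral state (stubbed)
    state["owner"] = "team-a"
    return state

def retrieve(state):
    # [3] hybrid BM25 + vector retrieval (stubbed)
    state["context"] = ["similar gherkin", "past ticket"]
    return state

def analyze(state):
    # [4] the ONLY step allowed to make an LLM inference call
    state["llm_calls"] = state.get("llm_calls", 0) + 1
    state["root_cause"] = "stub"
    return state

def gate(state):
    # [5] policy check: confidence must clear the threshold before any write
    state["approved"] = state.get("confidence", 0.0) >= 0.8
    return state

def act(state):
    # [6] write plane: notify, draft, or create (stubbed)
    state["action"] = "draft_ticket" if state["approved"] else "notify_only"
    return state

def run_graph(state):
    state = classify(state)
    state = query(state)
    if state["category"] == "clean":   # deterministic early exit:
        state["action"] = "none"       # clean passes never reach the LLM
        return state
    for step in (retrieve, analyze, gate, act):
        state = step(state)
    return state
```

The bounded shape is the point: the only path to a write runs through the rule engine, the retrieval context, and the policy gate, in that order.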

Deterministic first

Clean passes never touch the LLM. Only failed/ambiguous tests trigger steps 3–6.

Cluster before ticketing

40 tests from one root cause → one incident, not 40 noisy tickets.
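Clustering before ticketing can be as simple as grouping failures by a normalized error fingerprint before anything reaches the ticket writer. A sketch; the normalization rules are illustrative assumptions:

```python
import hashlib
import re

def fingerprint(error_text):
    """Collapse volatile details so failures with one root cause hash alike."""
    normalized = re.sub(r"0x[0-9a-fA-F]+", "<addr>", error_text)  # memory addresses
    normalized = re.sub(r"\d+", "<n>", normalized)                # ids, line numbers, timings
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

def cluster(failures):
    """failures: list of (test_name, error_text) -> {fingerprint: [test_name, ...]}"""
    clusters = {}
    for name, err in failures:
        clusters.setdefault(fingerprint(err), []).append(name)
    return clusters
```

One cluster then becomes one incident with an evidence bundle, instead of one ticket per failing test.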

Replay before writing

Validate accuracy and cost on historical builds before any write action is enabled.
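Replay means dry-running the pipeline against historical builds with known outcomes and enabling the write plane only once accuracy and cost clear a bar. A minimal harness sketch; the thresholds and the pipeline interface are assumptions:

```python
def replay(pipeline, historical_runs, min_accuracy=0.9, max_cost_usd=50.0):
    """Dry-run the triage pipeline over labeled history; gate write enablement.

    historical_runs: list of (input_state, human_label) pairs
    pipeline: callable returning a dict with "category" and "cost_usd"
    """
    correct, total_cost = 0, 0.0
    for state, label in historical_runs:
        result = pipeline(state)           # no write plane: classification only
        correct += (result["category"] == label)
        total_cost += result.get("cost_usd", 0.0)
    accuracy = correct / len(historical_runs)
    return {
        "accuracy": accuracy,
        "total_cost_usd": round(total_cost, 2),
        "enable_writes": accuracy >= min_accuracy and total_cost <= max_cost_usd,
    }
```

The same report doubles as a cost forecast: total replay spend over a month of builds projects the nightly budget before any real spend is committed.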

Local Embeddings · Philips R&D Insight

Nutanix headroom is already available. Qwen3 runs entirely on-prem — the full Gherkin corpus is embedded at $0 external cost. Azure DevOps, OpenSearch, and Langfuse deploy on existing infra. No procurement cycle, no data egress.

$0
external embedding cost
Qwen3
0.6B · 4B · 8B on Nutanix
Internal
artifacts never leave infra
Δ-only
re-embed changed files by hash
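The Δ-only card translates to a content-hash check in front of the embedding worker: a file is re-embedded only when its hash changes. A sketch, where the hash store and embed function are stand-ins (persisted state and the on-prem Qwen3 model in the actual design):

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

def embed_delta(files, hash_store, embed_fn):
    """files: {path: gherkin_text}. Re-embeds only new or changed files.

    hash_store: {path: last_seen_hash}, persisted between nightly runs
    embed_fn:   callable(text) -> vector, local model in the design
    """
    embedded = []
    for path, text in files.items():
        h = content_hash(text)
        if hash_store.get(path) == h:
            continue              # unchanged: skip, nothing spent
        embed_fn(text)            # changed or new: re-embed locally
        hash_store[path] = h
        embedded.append(path)
    return embedded
```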
Phased Delivery
Read-only first. Prove reliability. Expand as the system earns trust.
Phase 1

Intelligent Triage

  • Ingest nightly results
  • Compute flaky / failure metrics
  • Retrieve related context
  • Resolve owner
  • Draft ticket or low-risk notify

Read-only. No write authority yet.

Phase 2

Controlled Automation

  • Create & update tickets
  • Group duplicate failures
  • Notify with evidence bundles
  • Approval gate for sensitive actions
  • Full audit trail on every write

After replay validates accuracy.

Phase 3

Gherkin Generation

  • Retrieve feature docs + prior Gherkins
  • Generate schema-constrained candidates
  • Validate syntax + dedup + step mapping
  • LLM-as-judge scoring via Langfuse
  • Human review before any merge

Compiler pipeline, not chatbot feature.

Philips Work Story
Same team. Same infrastructure. Same problem space.
Deployment Automation
3 days → 1.5 hrs

Built a staged automation platform for PICiX performance-test environment setup — reducing a painful manual workflow to a repeatable, observable pipeline with improved reliability and configurability.

PICiX Scale
500+ Nutanix hosts · 2,550 beds

Worked in the R&D Systems Engineering and Integration team deploying and configuring the largest-scale PICiX setup in the organisation — the same infra stack proposed for this initiative.

Static MAC · Least Privilege
Zero NAT exceptions

When DNS/DHCP access was too risky to grant, worked with the networking team and solved the bottleneck through static MAC assignment at VM creation — the same least-privilege mindset behind the safe-mutation design here.

C# PICiX Config Module
−15 min per deployment

Added license assignment and pre-population functionality to the in-house configuration tool inside the actual PICiX codebase — raised the PR and handled review comments. Shipped inside the existing SDLC, not around it.

"I understand your problem well enough to avoid the first five bad versions."