Vendor integration
Arize Phoenix
OTel auto-instrumentation · LLM-as-judge · annotation flywheel · environment-namespaced projects
Status
● Live
production · real data flowing
— traces (7d) · — F1 judge runs · — annotations · $— spent today
Capability matrix
What we use, what's gated, what's planned
Production-grade observability for an agent system that meters cost per token. Every Gemini call is captured as a span, an LLM-as-judge harness scores decisions against a golden dataset, and cost-per-trace plus drift signals surface before they reach P&L.
Coming next: the F1 evaluator runs on every commit through the Sprint 2 MR gate; per-tenant Phoenix projects isolate observability data once SaaS multi-tenancy lands.
| Capability | Status | Phase | Why / How / Note |
|---|---|---|---|
| OTel auto-instrumentation via OpenInference | LIVE | Phase 1 | Gemini + ADK shims wrap every LLM call automatically, with no manual start_span() calls anywhere. ✓ 12 tests |
| Traces + spans dataframe | LIVE | Phase 1 | Every agent run produces a complete span tree queryable via the Phoenix client (px.Client().spans). |
| /monitoring page widgets (5 native) | LIVE | Phase 1 | Latency, cost, error-rate, span-tree, and eval-table widgets all read from real spans via /api/arize/*. ✓ 8 tests |
| Environment-aware project + revision tagging | LIVE | Phase 1.5 | Spans carry deployment.environment (cloud-run vs local), service.instance.id (revision ID or git short SHA), and openinference.project.name. DEV traffic lands in `sentinelhub-dev`, PROD in `sentinelhub-prod`, filterable by revision inside each project. Wired in app/sentinel_hub/tracing.py; the operator can override either via PHOENIX_PROJECT_NAME or OTEL_RESOURCE_ATTRIBUTES at deploy time. |
| LLM-as-judge harness (F1) | LIVE | Phase 1.5 | 6-criterion rubric (groundedness, correctness, harm, format, risk-compliance, tool-use) with temperature=0 for reproducibility; per-criterion scores + weighted mean are written back to trade_decisions in Mongo and served via /arize/judge/{summary,recent}. ✓ 23 tests. Judge model: gemini-3.1-flash-lite-preview; `make eval-decisions` runs the harness; /flywheel surfaces the score distribution. |
| Phoenix Annotations API (S1 flywheel) | LIVE | Phase 1.5.S (S1) | Operator thumbs-up/down on /decisions writes to Mongo decision_annotations, and a 15-min sync worker pushes to Phoenix log_annotation(). REST + UI live in /decisions; PHOENIX_API_KEY confirmed against app.phoenix.arize.com. Phase-2 retrofit links annotation → trace_id + span_id so a 👎 surfaces the exact span that produced the bad reasoning. |
| Phoenix Datasets + Experiments substrate (APP-028) | CODE READY | Phase 1.5.S (S4) | Curated golden examples pinned as a versioned dataset; PromptStudio-style experiment comparison harness ready (`compare_experiments` regression-threshold gate). ADR APP-028 ACCEPTED; uploader (phoenix_dataset_uploader.py, 534 LOC) + judge-on-dataset path implemented; golden_decisions_v1 dataset seeded on Phoenix Cloud. The EvalOps epic (in flight) curates the 50-row corpus + attaches evaluators end-to-end. |
| EvalOps regression suite (golden corpus + promotion gate) (APP-042) | CODE READY | Phase 1.5.S (S4-S6) | Vendor-neutral EvalOps doctrine: every prompt-or-model change runs against the frozen golden corpus before merge; `compare_experiments` with a 0.5-point regression threshold serves the BLOCK_PROMOTION verdict. Phoenix is the first implementation backend. EvalOps skill at .claude/skills/evalops/SKILL.md; HLD at docs/integrations/evalops-integration-hld.md; epic roadmap in delivery/proposals/2026-05-24-evalops-epic.md. S3 (golden corpus + cross-model sweep + pairwise compares) shipped 2026-05-24/25; see gh issue #147 ledger. |
| Batch evaluator runs + context cache (S3) | CODE READY | Phase 1.5.S (S3) | POST /judge/run?mode=batch emits cache.hit_rate + cost.total_usd span attributes for unit economics. Code path + both attributes shipped; 66% cost cut validated in dev. Production cutover is gated on the APP-014 per-tenant cost-attribution dashboard so the savings surface in the operator UI, not just the spans. |
| Sessions (multi-turn coherence detection) | PLANNED | Phase 2.0 | session.id = chat_thread_id clusters all turns of a /chat conversation; surfaces 'agent contradicted turn 2 vs turn 5' as a first-class drift signal. Lands in EvalOps-S5 alongside the multi-turn evaluator. The tracing.py session-id wiring is small; the value is in the matching evaluator rubric. |
| Hallucination + QA evaluators (RAG-aware) | PLANNED | Phase 2.5 | Pre-built HallucinationEvaluator + QAEvaluator score retrieved-context-vs-generated-answer alignment. Lands in EvalOps-S6 once Vector Search retrieval is mainline in the chat agent (F2). Off-the-shelf Phoenix evaluators: we don't build, we wire. |
| Multi-evaluator parallel scoring | PLANNED | Phase 2.5 | Split the 6-criterion rubric into N independent evaluators per span to see which dimension regressed. EvalOps-S7. Surfaces 'risk-rule-adherence dropped 0.2 this week' instead of one collapsed score moving 0.1. |
| Embedding drift visualisation | PLANNED | Phase 2.5 | UMAP-projected embedding views + drift detection between two corpora (last-week vs this-week decisions). EvalOps-S8. F2 already produces 768-dim embeddings on every decision; these get piped into Phoenix for the drift UI. |
| tenant.id span attribute (multi-tenant cost attribution) (APP-014) | PLANNED | Phase 2.0 | Every span tagged with tenant.id → per-user cost rollups → SaaS billing source of truth. Foundational for the SaaS pricing model; the APP-014 LLM-budget circuit breaker reads tenant.id directly. |
| Cost tracking with tenant.id rollups (APP-014) | PLANNED | Phase 2.0 | Aggregate llm.cost per span by tenant.id → 'is this user profitable' chart that answers investor questions. Depends on the tenant.id span attribute landing first. |
| Arize AX migration (persistent + multi-tenant alerting) | PLANNED | Phase 2.5 | Phoenix OSS = local + ephemeral; AX = SaaS with monitor rules ('page me if eval drops 10% WoW') and SSO. Required past M2 for persistent storage + cross-tenant dashboards. |
| GDPR DSR cascade (project-per-tenant at enterprise tier) (APP-015) | PLANNED | Phase 2.5 | Article 17 right-to-erasure: enterprise-tier tenants get isolated Phoenix projects for surgical span deletion. Standard tier uses tenant.id-filtered deletion via the AX API; enterprise tier gets project-per-tenant for isolation. |
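
The environment-aware tagging row above can be sketched as a small resolver. This is a minimal illustration, not the contents of app/sentinel_hub/tracing.py: the `stage` argument and both function names are hypothetical, and the only platform fact relied on is that Cloud Run sets `K_REVISION`. The attribute names, project names, and override variables come from the table.

```python
from typing import Dict, Mapping


def resolve_project_name(env: Mapping[str, str], stage: str) -> str:
    """Pick the Phoenix project; PHOENIX_PROJECT_NAME overrides the default."""
    override = env.get("PHOENIX_PROJECT_NAME")
    if override:
        return override
    # Assumption: `stage` is "prod" for production; anything else counts as dev.
    return "sentinelhub-prod" if stage == "prod" else "sentinelhub-dev"


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Parse OTEL_RESOURCE_ATTRIBUTES: comma-separated key=value pairs."""
    attrs: Dict[str, str] = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def build_resource_attributes(
    env: Mapping[str, str], stage: str, git_short_sha: str
) -> Dict[str, str]:
    """Compute the span resource attributes, then apply operator overrides."""
    on_cloud_run = "K_REVISION" in env  # set by Cloud Run for each revision
    attrs = {
        "deployment.environment": "cloud-run" if on_cloud_run else "local",
        "service.instance.id": env["K_REVISION"] if on_cloud_run else git_short_sha,
        "openinference.project.name": resolve_project_name(env, stage),
    }
    # Operator-supplied attributes win over the computed defaults.
    attrs.update(parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", "")))
    return attrs
```

In a real deployment `env` would be `os.environ`; passing it in explicitly keeps the resolver easy to test.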
Live data
Real-time from the Arize Phoenix backend
Every tile below is a live read from the vendor backend via the FastAPI BFF. If a tile shows "—" the backend is unreachable or the metric is not yet wired (no hardcoded numbers — see anti-pattern #2).
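The "no hardcoded numbers" rule can be illustrated with a formatter that only ever shows a real value or the dash placeholder. This is a sketch under assumptions: the function names are hypothetical and the BFF is assumed to return `None` for an unreachable or unwired metric.

```python
from typing import Optional

PLACEHOLDER = "\u2014"  # the dash the tiles show while no value is available


def render_count(value: Optional[int]) -> str:
    """Format a count tile; None means the backend gave us nothing."""
    return PLACEHOLDER if value is None else f"{value:,}"


def render_usd(value: Optional[float]) -> str:
    """Format a spend tile; a missing value is never replaced by a default."""
    return f"${PLACEHOLDER}" if value is None else f"${value:,.2f}"
```

Note that a genuine zero renders as `0`, so an empty project stays distinguishable from an unreachable backend.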
Awaiting backend
—
Golden decisions v1
F1 regression set: 50 curated decisions covering high-confidence wins, false-positive low-confidence approvals, skipped decisions, and drawdown-period and macro-extreme samples. Each prompt iteration is scored against this set so regressions surface immediately.
Dataset rows
—
curated · 0
Baseline F1 mean
—
0 scored
Latest experiment
—
no runs yet
Regressions
—
vs prior run
Mean score · last 10 runs
Class distribution
No baseline yet — corpus has no scored decisions.
Roadmap commitments
Roadmap dependencies
Capabilities enabled by this integration — what is built, what is gated, and why.
Per-tenant LLM budget circuit breaker
Phoenix tenant.id span attribute is the source of truth for billing; APP-014 reads cost rollups by tenant.id and trips the circuit breaker when a plan's monthly cap is hit.
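A minimal sketch of the rollup-then-trip logic described here, assuming spans arrive as plain dictionaries carrying the `tenant.id` and `llm.cost` attributes. The function names and the "trip at or above the cap" rule are assumptions, not the APP-014 implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def rollup_cost_by_tenant(spans: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Sum llm.cost per tenant.id across a month's spans."""
    totals: Dict[str, float] = defaultdict(float)
    for span in spans:
        tenant = span.get("tenant.id")
        if tenant is None:
            continue  # untagged spans cannot be attributed to a tenant
        totals[str(tenant)] += float(span.get("llm.cost", 0.0))
    return dict(totals)


def breaker_tripped(totals: Mapping[str, float], tenant_id: str, monthly_cap_usd: float) -> bool:
    """Trip once the tenant's spend reaches the plan's monthly cap."""
    return totals.get(tenant_id, 0.0) >= monthly_cap_usd
```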
GDPR DSR cascade (Article 17 right-to-erasure)
Phoenix is one of 5 backends in the cascade. Standard tier uses tenant.id-filtered AX API deletion; enterprise tier gets a dedicated Phoenix project per tenant for surgical isolation.
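The tier split can be sketched as a routing function that decides how the Phoenix leg of the cascade erases a tenant. Everything here is hypothetical: the strategy labels, the `tenant-<id>` project naming, and the filter string are illustrative placeholders, not AX or Phoenix API syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErasurePlan:
    strategy: str  # "drop_project" or "filtered_delete" (hypothetical labels)
    target: str    # project name or span filter the strategy applies to


def plan_phoenix_erasure(tenant_id: str, tier: str) -> ErasurePlan:
    """Choose the erasure strategy for one tenant based on its plan tier."""
    if tier == "enterprise":
        # Dedicated project per tenant: delete the whole project.
        return ErasurePlan("drop_project", f"tenant-{tenant_id}")
    if tier == "standard":
        # Shared project: delete only spans tagged with this tenant.id.
        return ErasurePlan("filtered_delete", f"tenant.id == '{tenant_id}'")
    raise ValueError(f"unknown tier: {tier!r}")
```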
S1 Annotations + Phase-2 trace-id retrofit
S1 ships annotations keyed by decision_id today; Phase 2 retrofit adds trace_id + span_id linkage so a 👎 on /decisions surfaces the exact span that produced the bad reasoning, queryable in the Phoenix UI.
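One way to picture the retrofit is an annotation record keyed by `decision_id` with optional trace linkage, plus a sync pass that pushes anything not yet sent. This is a sketch under assumptions: the field names other than `decision_id`, `trace_id`, and `span_id` are hypothetical, and `push` stands in for whatever wrapper calls Phoenix's `log_annotation()`, whose signature is not shown in this document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class DecisionAnnotation:
    decision_id: str
    label: str                      # e.g. "thumbs_up" or "thumbs_down"
    trace_id: Optional[str] = None  # filled in by the Phase 2 retrofit
    span_id: Optional[str] = None
    synced: bool = False


def to_sync_payload(annotation: DecisionAnnotation) -> Dict[str, str]:
    """Build the payload; trace linkage is included only when both IDs exist."""
    payload = {"decision_id": annotation.decision_id, "label": annotation.label}
    if annotation.trace_id and annotation.span_id:
        payload["trace_id"] = annotation.trace_id
        payload["span_id"] = annotation.span_id
    return payload


def sync_pending(
    annotations: Iterable[DecisionAnnotation],
    push: Callable[[Dict[str, str]], None],
) -> int:
    """Push every unsynced annotation once and return how many were sent."""
    pushed = 0
    for annotation in annotations:
        if annotation.synced:
            continue
        push(to_sync_payload(annotation))
        annotation.synced = True
        pushed += 1
    return pushed
```

Marking records as synced only after `push` returns means a failed push is retried on the next 15-minute pass.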
Demo flow
End-to-end showcase journey
Five steps a judge or investor can replay live. Each step links to the page that demonstrates it.
1. Open /monitoring. Five native Phoenix widgets show real spans: latency, cost, error rate, span tree, eval table. All from /api/arize/* against the live sentinelhub-dev project. → Open /monitoring
2. Click any trace row → drill down into the span tree. See agent.name, llm.model_name, llm.token_count.prompt + .completion, and the per-call cost. → Open /monitoring
3. Open /decisions, click 👎 on a low-quality reasoning row. The annotation lands in Mongo decision_annotations, then the 15-min sync worker pushes it to Phoenix via log_annotation(), closing the flywheel. → Open /decisions
4. Open /insights → Self-Improvement Loop widget. Aggregate 👎 counts + top-5 lowest-approval strategies, all backed by Phoenix annotations + Mongo aggregations. → Open /insights
5. POST /api/arize/judge/run?mode=batch and watch the batch evaluator span land with cache.hit_rate + cost.total_usd attributes. 66% cost cut vs the per-span baseline (S3 unit-economics evidence). → Open /monitoring
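
The two span attributes from the batch step could be derived roughly as follows. This is a sketch under assumptions: the per-call record shape (`cache_hit`, `cost_usd`) and both function names are hypothetical, and the numbers in any usage are illustrative rather than the measured 66% result.

```python
from typing import Dict, Iterable, Mapping


def batch_span_attributes(calls: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Summarise one batch run into cache.hit_rate and cost.total_usd."""
    calls = list(calls)
    hits = sum(1 for call in calls if call["cache_hit"])
    return {
        "cache.hit_rate": hits / len(calls) if calls else 0.0,
        "cost.total_usd": sum((float(call["cost_usd"]) for call in calls), 0.0),
    }


def cost_reduction(baseline_usd: float, batch_usd: float) -> float:
    """Fractional saving of the batch run versus the per-span baseline."""
    if baseline_usd <= 0:
        raise ValueError("baseline cost must be positive")
    return 1.0 - batch_usd / baseline_usd
```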
What's next
Vendor-enabled capabilities coming soon
Sourced from the vendor's playbook. Each entry is mapped to its delivery phase and the value it unlocks.
EvalOps S4 — golden corpus + experiment promotion gate
Phase 1.5.S (S4)
50-row curated dataset on Phoenix Cloud + `compare_experiments` regression gate (0.5-point threshold). Every prompt/model change runs against the frozen corpus before merge.
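The promotion gate can be sketched as a comparison of mean scores against the 0.5-point threshold. This is not the project's `compare_experiments`; `regression_verdict`, the `ALLOW_PROMOTION` label, and the choice to block only when the drop strictly exceeds the threshold are all assumptions.

```python
from statistics import mean
from typing import Sequence

REGRESSION_THRESHOLD = 0.5  # points, as stated for the gate


def regression_verdict(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    threshold: float = REGRESSION_THRESHOLD,
) -> str:
    """Block promotion when the candidate's mean drops by more than the threshold."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both runs need at least one scored row")
    drop = mean(baseline_scores) - mean(candidate_scores)
    return "BLOCK_PROMOTION" if drop > threshold else "ALLOW_PROMOTION"
```

An improvement yields a negative drop, so it always passes the gate.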
EvalOps S5 — Sessions + multi-turn coherence evaluator
Phase 2.0
session.id wiring + chat-coherence rubric — flags agents that contradict themselves across turns of the same conversation.
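A rough sketch of how spans could be clustered into conversations before a coherence rubric compares turns. The `session.id` attribute comes from the roadmap; the `turn` field and both function names are hypothetical, and the contradiction check itself is left to the evaluator.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Mapping, Tuple


def group_turns_by_session(spans: Iterable[Mapping[str, object]]) -> Dict[str, List[Mapping[str, object]]]:
    """Cluster spans by session.id and order each conversation by turn number."""
    sessions: Dict[str, List[Mapping[str, object]]] = defaultdict(list)
    for span in spans:
        session_id = span.get("session.id")
        if session_id is not None:  # spans outside /chat carry no session
            sessions[str(session_id)].append(span)
    return {
        sid: sorted(turns, key=lambda s: s["turn"])
        for sid, turns in sessions.items()
    }


def turn_pairs(turns: List[Mapping[str, object]]) -> List[Tuple[object, object]]:
    """Every earlier/later turn pair a coherence evaluator would compare."""
    return list(combinations([t["turn"] for t in turns], 2))
```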
EvalOps S6 — Hallucination + QA evaluators on RAG output
Phase 2.5
Off-the-shelf Phoenix HallucinationEvaluator + QAEvaluator scoring F2 Vector Search retrieved chunks vs generated answer.
EvalOps S7 — multi-evaluator parallel scoring
Phase 2.5
Six rubric criteria split into independent evaluators per span — diagnose which dimension regressed (risk-rule-adherence vs format vs grounding).
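The diagnostic value of splitting the rubric can be shown with per-criterion means compared across two periods. This is a sketch under assumptions: the criterion names follow the six-criterion rubric listed in the capability matrix, the collapsed score uses equal weights (the real weights are not given here), and the 0.2 default drop is illustrative.

```python
from statistics import mean
from typing import Dict, Mapping, Sequence

CRITERIA = ("groundedness", "correctness", "harm", "format", "risk-compliance", "tool-use")


def criterion_means(rows: Sequence[Mapping[str, float]]) -> Dict[str, float]:
    """Mean score per rubric criterion across scored rows."""
    return {c: mean(row[c] for row in rows) for c in CRITERIA}


def regressed_criteria(
    prev_rows: Sequence[Mapping[str, float]],
    curr_rows: Sequence[Mapping[str, float]],
    min_drop: float = 0.2,
) -> Dict[str, float]:
    """Criteria whose mean fell by at least min_drop, with the size of the drop."""
    prev, curr = criterion_means(prev_rows), criterion_means(curr_rows)
    return {c: prev[c] - curr[c] for c in CRITERIA if prev[c] - curr[c] >= min_drop}


def collapsed_score(rows: Sequence[Mapping[str, float]]) -> float:
    """Single equal-weight score; hides which criterion actually moved."""
    return mean(criterion_means(rows).values())
```

A one-point drop in a single criterion moves the collapsed score by only a sixth of a point, which is the masking effect the split is meant to remove.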
EvalOps S8 — embedding drift visualisation
Phase 2.5
UMAP-projected embeddings on Phoenix + week-over-week drift detection on the decision corpus.
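As a much cruder stand-in for the UMAP drift view, week-over-week drift can be approximated by the cosine distance between the two corpora's centroids. This is explicitly not UMAP and not Phoenix's drift computation; it is a standard-library sketch with hypothetical function names.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length embeddings."""
    if not vectors:
        raise ValueError("corpus is empty")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-length vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)


def centroid_drift(last_week: Sequence[Sequence[float]], this_week: Sequence[Sequence[float]]) -> float:
    """Drift signal between two corpora of decision embeddings."""
    return cosine_distance(centroid(last_week), centroid(this_week))
```

A centroid comparison misses drift that changes the shape of the distribution without moving its mean, which is one reason a projected view is more informative.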
Arize AX migration
Phase 2.5
Persistent storage past M2 + cross-tenant dashboards + monitor-rule alerting + enterprise SSO. Lands after EvalOps S4–S8 stabilise on Phoenix OSS.
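The monitor-rule idea ("page me if eval drops 10% WoW") reduces to a relative-change check. The sketch below shows only the arithmetic; it is not AX monitor configuration, and the function names are hypothetical.

```python
def wow_drop(prev_week_mean: float, this_week_mean: float) -> float:
    """Relative week-over-week drop; negative when the score improved."""
    if prev_week_mean <= 0:
        raise ValueError("previous-week mean must be positive")
    return (prev_week_mean - this_week_mean) / prev_week_mean


def should_page(prev_week_mean: float, this_week_mean: float, threshold: float = 0.10) -> bool:
    """Alert when the eval mean fell by at least the threshold fraction."""
    return wow_drop(prev_week_mean, this_week_mean) >= threshold
```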