Vendor integration

Arize Phoenix

OTel auto-instrumentation · LLM-as-judge · annotation flywheel · environment-namespaced projects

Status

● Live

production · real data flowing

— traces (7d) · — F1 judge runs · — annotations · $— spent today

Capability matrix

What we use, what's gated, what's planned

Production-grade observability for an agent system that meters cost per token. Every Gemini call is captured as a span, an LLM-as-judge harness scores decisions against a golden dataset, and cost-per-trace plus drift signals surface before they reach P&L.

Coming next: the F1 evaluator runs on every commit through the Sprint 2 MR gate; per-tenant Phoenix projects isolate observability data once SaaS multi-tenancy lands.

Capability	Status	Phase	Why / How / Note
OTel auto-instrumentation via OpenInference	LIVE	Phase 1	Gemini + ADK shims wrap every LLM call automatically — no manual start_span() calls anywhere. ✓ 12 tests
Traces + spans dataframe	LIVE	Phase 1	Every agent run produces a complete span tree queryable via the Phoenix client (px.Client().spans).
/monitoring page widgets (5 native)	LIVE	Phase 1	Latency, cost, error-rate, span-tree, eval-table widgets all read from real spans via /api/arize/*. ✓ 8 tests
Environment-aware project + revision tagging	LIVE	Phase 1.5	Spans carry deployment.environment (cloud-run vs local), service.instance.id (revision-id or git short SHA) and openinference.project.name — DEV traffic lands in `sentinelhub-dev`, PROD in `sentinelhub-prod`, filterable by revision inside each project. Wired in app/sentinel_hub/tracing.py — operator can override either via PHOENIX_PROJECT_NAME or OTEL_RESOURCE_ATTRIBUTES at deploy time.
LLM-as-judge harness (F1)	LIVE	Phase 1.5	6-criterion rubric (groundedness, correctness, harm, format, risk-compliance, tool-use) with temperature=0 for reproducibility; per-criterion scores and the weighted mean are written back to trade_decisions in Mongo and served via /arize/judge/{summary,recent}. ✓ 23 tests. Judge model: gemini-3.1-flash-lite-preview; `make eval-decisions` runs the harness; /flywheel surfaces the score distribution.
Phoenix Annotations API (S1 flywheel)	LIVE	Phase 1.5.S (S1)	Operator thumbs-up/down on /decisions writes to Mongo decision_annotations + 15-min sync worker pushes to Phoenix log_annotation(). REST + UI live in /decisions; PHOENIX_API_KEY confirmed against app.phoenix.arize.com. Phase-2 retrofit links annotation → trace_id + span_id so a 👎 surfaces the exact span that produced the bad reasoning.
Phoenix Datasets + Experiments substrate APP-028	CODE READY	Phase 1.5.S (S4)	Curated golden examples pinned as a versioned dataset; PromptStudio-style experiment comparison harness ready (`compare_experiments` regression-threshold gate). ADR APP-028 ACCEPTED; uploader (phoenix_dataset_uploader.py, 534 LOC) + judge-on-dataset path implemented; golden_decisions_v1 dataset seeded on Phoenix Cloud. EvalOps epic (in flight) curates the 50-row corpus + attaches evaluators end-to-end.
EvalOps regression suite (golden corpus + promotion gate) APP-042	CODE READY	Phase 1.5.S (S4-S6)	Vendor-neutral EvalOps doctrine: every prompt-or-model change runs against the frozen golden corpus before merge; `compare_experiments` with a 0.5-point regression threshold serves the BLOCK_PROMOTION verdict. Phoenix is the first implementation backend. EvalOps skill at .claude/skills/evalops/SKILL.md; HLD at docs/integrations/evalops-integration-hld.md; epic roadmap in delivery/proposals/2026-05-24-evalops-epic.md. S3 (golden corpus + cross-model sweep + pairwise compares) shipped 2026-05-24/25 — see gh issue #147 ledger.
Batch evaluator runs + context cache (S3)	CODE READY	Phase 1.5.S (S3)	POST /judge/run?mode=batch emits cache.hit_rate + cost.total_usd span attributes for unit-economics. Code path + cache.hit_rate / cost.total_usd attributes shipped; 66% cost cut validated in dev. Production cutover gated on APP-014 per-tenant cost-attribution dashboard so the savings surface in the operator UI, not just the spans.
Sessions (multi-turn coherence detection)	PLANNED	Phase 2.0	session.id = chat_thread_id clusters all turns of a /chat conversation; surfaces 'agent contradicted turn 2 vs turn 5' as a first-class drift signal. Lands in EvalOps-S5 alongside the multi-turn evaluator. Tracing.py session-id wiring is small; the value is in the matching evaluator rubric.
Hallucination + QA evaluators (RAG-aware)	PLANNED	Phase 2.5	Pre-built HallucinationEvaluator + QAEvaluator score retrieved-context-vs-generated-answer alignment. Lands in EvalOps-S6 once Vector Search retrieval is mainline in the chat agent (F2). Off-the-shelf Phoenix evaluators — we don't build, we wire.
Multi-evaluator parallel scoring	PLANNED	Phase 2.5	Split the 6-criterion rubric into N independent evaluators per span — see which dimension regressed. EvalOps-S7. Surfaces 'risk-rule-adherence dropped 0.2 this week' instead of one collapsed score moving 0.1.
Embedding drift visualisation	PLANNED	Phase 2.5	UMAP-projected embedding views + drift detection between two corpora (last-week vs this-week decisions). EvalOps-S8. F2 already produces 768-dim embeddings on every decision; pipe into Phoenix for the drift UI.
tenant.id span attribute (multi-tenant cost attribution) APP-014	PLANNED	Phase 2.0	Every span tagged with tenant.id → per-user cost rollups → SaaS billing source of truth. Foundational for the SaaS pricing model; APP-014 LLM-budget circuit-breaker reads tenant.id directly.
Cost tracking with tenant.id rollups APP-014	PLANNED	Phase 2.0	Aggregate llm.cost per span by tenant.id → 'is this user profitable' chart that closes investor questions. Depends on the tenant.id span attribute landing first.
Arize AX migration (persistent + multi-tenant alerting)	PLANNED	Phase 2.5	Phoenix OSS = local + ephemeral; AX = SaaS with monitor rules ('page me if eval drops 10% WoW') and SSO. Required past M2 for persistent storage + cross-tenant dashboards.
GDPR DSR cascade (project-per-tenant at enterprise tier) APP-015	PLANNED	Phase 2.5	Article 17 right-to-erasure: enterprise-tier tenants get isolated Phoenix projects for surgical span deletion. Standard tier uses tenant.id-filtered deletion via AX API; enterprise tier gets project-per-tenant for isolation.
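
The environment-aware tagging row above can be illustrated with a minimal sketch. This is not the contents of app/sentinel_hub/tracing.py; it only shows one plausible resolution order. K_REVISION is the revision variable Cloud Run sets, while APP_ENV, GIT_SHORT_SHA, and the function names are hypothetical stand-ins chosen for illustration.

```python
def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse the OTEL_RESOURCE_ATTRIBUTES format: comma-separated key=value pairs."""
    attrs: dict[str, str] = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.split("=", 1)
        attrs[key.strip()] = value.strip()
    return attrs


def resolve_tracing_attributes(env: dict[str, str]) -> dict[str, str]:
    """Build span resource attributes from environment variables.

    Assumptions (not confirmed by the source): APP_ENV selects dev vs prod,
    GIT_SHORT_SHA carries the local commit, and operator overrides win last.
    """
    on_cloud_run = "K_REVISION" in env
    attrs = {
        "deployment.environment": "cloud-run" if on_cloud_run else "local",
        "service.instance.id": env.get("K_REVISION") or env.get("GIT_SHORT_SHA", "unknown"),
        "openinference.project.name": (
            "sentinelhub-prod" if env.get("APP_ENV") == "prod" else "sentinelhub-dev"
        ),
    }
    # Operator override 1: arbitrary resource attributes at deploy time.
    attrs.update(parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", "")))
    # Operator override 2: an explicit project name takes precedence.
    if env.get("PHOENIX_PROJECT_NAME"):
        attrs["openinference.project.name"] = env["PHOENIX_PROJECT_NAME"]
    return attrs
```

In a real process the argument would be `dict(os.environ)`; passing a plain dict keeps the sketch easy to exercise without touching the environment.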

Live data

Real-time from the Arize Phoenix backend

Every tile below is a live read from the vendor backend via the FastAPI BFF. If a tile shows "—" the backend is unreachable or the metric is not yet wired (no hardcoded numbers — see anti-pattern #2).

Awaiting backend

—

Golden decisions v1

F1 regression set: 50 curated decisions covering high-confidence wins, false-positive low-confidence approvals, skipped decisions, and drawdown-period and macro-extreme samples. Each prompt iteration is scored against this set so regressions surface immediately.

Dataset rows

—

curated · 0

Baseline F1 mean

—

0 scored

Latest experiment

—

no runs yet

Regressions

—

vs prior run

Mean score · last 10 runs

No experiments yet.

Class distribution

No baseline yet — corpus has no scored decisions.

Roadmap commitments

Roadmap dependencies

Capabilities enabled by this integration — what is built, what is gated, and why.

APP-014 · Phase 2.0

Per-tenant LLM budget circuit breaker

Phoenix tenant.id span attribute is the source of truth for billing; APP-014 reads cost rollups by tenant.id and trips the circuit breaker when a plan's monthly cap is hit.
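
A minimal sketch of the rollup-then-trip logic described above, assuming spans arrive as plain dicts carrying the tenant.id and llm.cost attributes. The function names and the shape of the caps mapping are hypothetical; the planned APP-014 implementation may differ.

```python
from collections import defaultdict


def rollup_cost_by_tenant(spans: list[dict]) -> dict[str, float]:
    """Sum llm.cost per tenant.id, ignoring spans without a tenant tag."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        tenant = span.get("tenant.id")
        if tenant is None:
            continue
        totals[tenant] += span.get("llm.cost", 0.0)
    return dict(totals)


def tripped_tenants(spans: list[dict], monthly_caps_usd: dict[str, float]) -> list[str]:
    """Return tenants whose month-to-date spend has reached their plan cap."""
    totals = rollup_cost_by_tenant(spans)
    return sorted(
        tenant
        for tenant, spent in totals.items()
        if tenant in monthly_caps_usd and spent >= monthly_caps_usd[tenant]
    )
```

Treating "cap reached" as inclusive (`>=`) is a design choice in this sketch, not something the source specifies.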

APP-015 · Phase 2.5

GDPR DSR cascade (Article 17 right-to-erasure)

Phoenix is one of 5 backends in the cascade. Standard tier uses tenant.id-filtered AX API deletion; enterprise tier gets a dedicated Phoenix project per tenant for surgical isolation.

APP-014 · ongoing platform release

S1 Annotations + Phase-2 trace-id retrofit

S1 ships annotations keyed by decision_id today; Phase 2 retrofit adds trace_id + span_id linkage so a 👎 on /decisions surfaces the exact span that produced the bad reasoning, queryable in the Phoenix UI.
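
The annotation flow above can be sketched as a record keyed by decision_id, with optional trace_id and span_id fields for the Phase 2 retrofit, plus a sync pass that pushes anything not yet sent. The `push` callable stands in for the Phoenix annotation call; its real signature is not shown in the source, so it is injected rather than guessed. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DecisionAnnotation:
    decision_id: str
    label: str                      # e.g. "thumbs_up" or "thumbs_down"
    trace_id: Optional[str] = None  # filled in by the Phase 2 retrofit
    span_id: Optional[str] = None   # filled in by the Phase 2 retrofit
    synced: bool = False


def sync_pending(
    annotations: list[DecisionAnnotation],
    push: Callable[[DecisionAnnotation], None],
) -> int:
    """Push every unsynced annotation once; return how many were pushed."""
    pushed = 0
    for ann in annotations:
        if ann.synced:
            continue
        try:
            push(ann)
        except Exception:
            continue  # leave it unsynced so the next scheduled run retries
        ann.synced = True
        pushed += 1
    return pushed
```

A scheduler would call `sync_pending` on the 15-minute cadence mentioned above; marking records only after a successful push keeps the pass safe to repeat.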

Demo flow

End-to-end showcase journey

Five steps a judge or investor can replay live. Each step links to the page that demonstrates it.

1
Open /monitoring. Five native Phoenix widgets show real spans — latency, cost, error rate, span tree, eval table. All from /api/arize/* against the live sentinelhub-dev project. → Open /monitoring
2
Click any trace row → drill down into the span tree. See agent.name, llm.model_name, llm.token_count.prompt + .completion, and the per-call cost. → Open /monitoring
3
Open /decisions, click 👎 on a low-quality reasoning row. Annotation lands in Mongo decision_annotations, then the 15-min sync worker pushes it to Phoenix via log_annotation() — flywheel closes. → Open /decisions
4
Open /insights → Self-Improvement Loop widget. Aggregate 👎 counts + top-5 lowest-approval strategies, all backed by Phoenix annotations + Mongo aggregations. → Open /insights
5
POST /api/arize/judge/run?mode=batch — see the batch evaluator span land with cache.hit_rate + cost.total_usd attributes. 66% cost cut vs the per-span baseline (S3 unit-economics evidence). → Open /monitoring
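
The per-call cost shown in step 2 can be derived from the token-count span attributes. The sketch below assumes spans are plain dicts and uses a made-up price table; the model name and per-million-token rates are placeholders, not real pricing.

```python
# Placeholder prices in USD per million tokens; illustrative values only.
EXAMPLE_PRICES = {"example-model": {"prompt": 1.0, "completion": 4.0}}


def span_cost_usd(span: dict, prices: dict) -> float:
    """Cost of one LLM span from its prompt and completion token counts."""
    rates = prices[span["llm.model_name"]]
    prompt_tokens = span.get("llm.token_count.prompt", 0)
    completion_tokens = span.get("llm.token_count.completion", 0)
    return (
        prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]
    ) / 1_000_000


def trace_cost_usd(spans: list[dict], prices: dict) -> float:
    """Cost per trace: sum over the LLM spans, skipping non-LLM spans."""
    return sum(span_cost_usd(s, prices) for s in spans if "llm.model_name" in s)
```

Spans without llm.model_name (tool calls, HTTP spans) contribute nothing, so the same function works on a full span tree.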

What's next

Vendor-enabled capabilities coming soon

Sourced from the vendor's playbook. Each entry is mapped to its delivery phase and the value it unlocks.

EvalOps S4 — golden corpus + experiment promotion gate

Phase 1.5.S (S4)

50-row curated dataset on Phoenix Cloud + `compare_experiments` regression gate (0.5-point threshold). Every prompt/model change runs against the frozen corpus before merge.
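
As a rough illustration of the gate, the sketch below compares mean scores of a baseline run and a candidate run on the same corpus and blocks promotion when the drop exceeds the threshold. It is not the project's `compare_experiments`; the function name `gate_promotion`, the ALLOW_PROMOTION label, and the strict comparison are assumptions.

```python
from statistics import mean


def gate_promotion(
    baseline_scores: list[float],
    candidate_scores: list[float],
    threshold: float = 0.5,
) -> tuple[str, float]:
    """Return (verdict, drop), where drop = baseline mean - candidate mean."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both runs need at least one scored row")
    drop = mean(baseline_scores) - mean(candidate_scores)
    # Assumption: only a drop strictly greater than the threshold blocks.
    verdict = "BLOCK_PROMOTION" if drop > threshold else "ALLOW_PROMOTION"
    return verdict, drop
```

A negative drop means the candidate improved on the baseline, which this sketch always allows.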

EvalOps S5 — Sessions + multi-turn coherence evaluator

Phase 2.0

session.id wiring + chat-coherence rubric — flags agents that contradict themselves across turns of the same conversation.

EvalOps S6 — Hallucination + QA evaluators on RAG output

Phase 2.5

Off-the-shelf Phoenix HallucinationEvaluator + QAEvaluator scoring F2 Vector Search retrieved chunks vs generated answer.

EvalOps S7 — multi-evaluator parallel scoring

Phase 2.5

Six rubric criteria split into independent evaluators per span — diagnose which dimension regressed (risk-rule-adherence vs format vs grounding).
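
To show why per-criterion scores help, the sketch below computes a weighted mean over the six rubric criteria and lists the criteria that dropped between two runs. The weights, the `min_drop` cutoff, and the function names are hypothetical; the criterion names follow the rubric listed in the capability matrix.

```python
CRITERIA = (
    "groundedness", "correctness", "harm",
    "format", "risk-compliance", "tool-use",
)


def weighted_mean(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Collapse per-criterion scores into one number (weights are assumed)."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


def regressed_criteria(
    previous: dict[str, float],
    current: dict[str, float],
    min_drop: float = 0.1,
) -> list[str]:
    """Criteria whose score fell by at least min_drop, largest drop first."""
    drops = {c: previous[c] - current[c] for c in CRITERIA}
    flagged = [c for c in CRITERIA if drops[c] >= min_drop]
    return sorted(flagged, key=lambda c: drops[c], reverse=True)
```

With equal weights, a 0.25 drop in a single criterion moves the collapsed mean by only about 0.04, which is the masking effect that independent evaluators are meant to expose.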

EvalOps S8 — embedding drift visualisation

Phase 2.5

UMAP-projected embeddings on Phoenix + week-over-week drift detection on the decision corpus.
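
Week-over-week drift can be approximated without any projection by comparing the centroids of two embedding sets. The sketch below uses cosine distance between centroids; this is a deliberately simple stand-in for illustration, not the drift metric Phoenix computes, and the function names are hypothetical.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one embedding")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector has no direction")
    return 1.0 - dot / norm


def weekly_drift(last_week: list[list[float]], this_week: list[list[float]]) -> float:
    """Distance between the two weeks' centroids as a single drift score."""
    return cosine_distance(centroid(last_week), centroid(this_week))
```

A centroid comparison misses changes in spread or cluster structure, which is where a projected view of the full embedding sets adds information.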

Arize AX migration

Phase 2.5

Persistent storage past M2 + cross-tenant dashboards + monitor-rule alerting + enterprise SSO. Lands after EvalOps S4–S8 stabilise on Phoenix OSS.
