Vendor integration

Arize Phoenix

OTel auto-instrumentation · LLM-as-judge · annotation flywheel · environment-namespaced projects

Status

● Live

production · real data flowing

— traces (7d) · — F1 judge runs · — annotations · $— spent today

Capability matrix

What we use, what's gated, what's planned

Production-grade observability for an agent system that meters cost per token. Every Gemini call is captured as a span, an LLM-as-judge harness scores decisions against a golden dataset, and cost-per-trace plus drift signals surface before they reach P&L.

Coming next: F1 evaluator runs on every commit through the Sprint 2 MR gate; per-tenant Phoenix projects isolate observability data once SaaS multi-tenancy lands.

Capability · Status · Phase · Why / How / Note

OTel auto-instrumentation via OpenInference

LIVE · Phase 1

Gemini + ADK shims wrap every LLM call automatically — no manual start_span() calls anywhere.

12 tests

Traces + spans dataframe

LIVE · Phase 1

Every agent run produces a complete span tree queryable via the Phoenix client (px.Client().spans).

/monitoring page widgets (5 native)

LIVE · Phase 1

Latency, cost, error-rate, span-tree, eval-table widgets all read from real spans via /api/arize/*.

8 tests

Environment-aware project + revision tagging

LIVE · Phase 1.5

Spans carry deployment.environment (cloud-run vs local), service.instance.id (revision-id or git short SHA) and openinference.project.name — DEV traffic lands in `sentinelhub-dev`, PROD in `sentinelhub-prod`, filterable by revision inside each project.

Wired in app/sentinel_hub/tracing.py — operator can override either via PHOENIX_PROJECT_NAME or OTEL_RESOURCE_ATTRIBUTES at deploy time.
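The override order described above can be sketched as follows. This is a minimal illustration, not the contents of app/sentinel_hub/tracing.py: the function names, the `GIT_SHORT_SHA` variable, the `sentinelhub-dev` default, and the use of Cloud Run's `K_REVISION` variable to detect the runtime are all assumptions.

```python
from typing import Mapping


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse OTEL_RESOURCE_ATTRIBUTES, a comma-separated list of key=value pairs."""
    attrs: dict[str, str] = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def resolve_resource(env: Mapping[str, str]) -> dict[str, str]:
    """Build span resource attributes, letting operator overrides win."""
    # Assumption: Cloud Run sets K_REVISION; its absence means a local run.
    revision = env.get("K_REVISION")
    attrs = {
        "deployment.environment": "cloud-run" if revision else "local",
        # Assumption: GIT_SHORT_SHA is a hypothetical variable set by local tooling.
        "service.instance.id": revision or env.get("GIT_SHORT_SHA", "unknown"),
        # Assumption: the dev project is the default when nothing is overridden.
        "openinference.project.name": "sentinelhub-dev",
    }
    # Operator overrides: generic resource attributes first, project name last.
    attrs.update(parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", "")))
    if env.get("PHOENIX_PROJECT_NAME"):
        attrs["openinference.project.name"] = env["PHOENIX_PROJECT_NAME"]
    return attrs
```

Calling `resolve_resource(os.environ)` at start-up would yield the attribute set to hand to the tracer provider; keeping the function pure makes the precedence rules easy to unit-test.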

LLM-as-judge harness (F1)

LIVE · Phase 1.5

6-criterion rubric (groundedness, correctness, harm, format, risk-compliance, tool-use) with temperature=0 reproducibility; per-criterion + weighted mean written back to trade_decisions in Mongo and served via /arize/judge/{summary,recent}.

23 tests

Judge model gemini-3.1-flash-lite-preview; `make eval-decisions` runs the harness; /flywheel surfaces the score distribution.
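As a rough illustration of the "per-criterion + weighted mean" step, the sketch below combines six criterion scores into one number. The weights are hypothetical (this page does not state the real ones), and the function is not the harness's actual code.

```python
CRITERIA = (
    "groundedness", "correctness", "harm",
    "format", "risk-compliance", "tool-use",
)

# Hypothetical weights for illustration only.
WEIGHTS = {
    "groundedness": 2.0, "correctness": 2.0, "harm": 1.5,
    "format": 0.5, "risk-compliance": 2.0, "tool-use": 1.0,
}


def weighted_mean(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weight-normalised mean of the six criterion scores."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight
```

Storing the per-criterion scores alongside the weighted mean keeps the collapsed score explainable when it moves.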

Phoenix Annotations API (S1 flywheel)

LIVE · Phase 1.5.S (S1)

Operator thumbs-up/down on /decisions writes to Mongo decision_annotations; a 15-min sync worker then pushes each annotation to Phoenix via log_annotation().

REST + UI live in /decisions; PHOENIX_API_KEY confirmed against app.phoenix.arize.com. Phase-2 retrofit links annotation → trace_id + span_id so a 👎 surfaces the exact span that produced the bad reasoning.
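The sync step can be pictured as a pure transformation from stored votes to annotation records, run before anything is sent to Phoenix. Everything here is an assumption made for illustration: the Mongo field names (`decision_id`, `vote`), the annotation name `operator_feedback`, and the record shape. The actual push to Phoenix is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AnnotationRecord:
    decision_id: str     # S1 keys annotations by decision_id
    name: str            # hypothetical annotation name
    label: str
    score: float
    annotator_kind: str  # a human operator, not an LLM judge


def to_annotation(doc: dict) -> AnnotationRecord:
    """Map one stored thumbs vote to an annotation record."""
    vote = doc["vote"]
    if vote not in ("up", "down"):
        raise ValueError(f"unknown vote: {vote!r}")
    return AnnotationRecord(
        decision_id=str(doc["decision_id"]),
        name="operator_feedback",
        label="thumbs_up" if vote == "up" else "thumbs_down",
        score=1.0 if vote == "up" else 0.0,
        annotator_kind="HUMAN",
    )


def pending_batch(docs: Iterable[dict], already_synced: set[str]) -> list[AnnotationRecord]:
    """Select votes the worker has not pushed yet and convert them."""
    return [to_annotation(d) for d in docs if str(d["decision_id"]) not in already_synced]
```

Tracking already-synced IDs keeps the 15-minute worker idempotent if a run is retried.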

Phoenix Datasets + Experiments substrate

APP-028

CODE READY · Phase 1.5.S (S4)

Curated golden examples pinned as a versioned dataset; PromptStudio-style experiment comparison harness ready (`compare_experiments` regression-threshold gate).

ADR APP-028 ACCEPTED; uploader (phoenix_dataset_uploader.py, 534 LOC) + judge-on-dataset path implemented; golden_decisions_v1 dataset seeded on Phoenix Cloud. EvalOps epic (in flight) curates the 50-row corpus + attaches evaluators end-to-end.

EvalOps regression suite (golden corpus + promotion gate)

APP-042

CODE READY · Phase 1.5.S (S4-S6)

Vendor-neutral EvalOps doctrine: every prompt-or-model change runs against the frozen golden corpus before merge; `compare_experiments` with a 0.5-point regression threshold serves the BLOCK_PROMOTION verdict. Phoenix is the first implementation backend.

EvalOps skill at .claude/skills/evalops/SKILL.md; HLD at docs/integrations/evalops-integration-hld.md; epic roadmap in delivery/proposals/2026-05-24-evalops-epic.md. S3 (golden corpus + cross-model sweep + pairwise compares) shipped 2026-05-24/25 — see gh issue #147 ledger.
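A regression-threshold gate of this kind might look like the sketch below. It is not the project's `compare_experiments`; the function name, the `ALLOW_PROMOTION` verdict, the use of the mean score, and the choice to block when the drop equals the threshold are all assumptions. Only the 0.5-point threshold and the `BLOCK_PROMOTION` verdict come from the text.

```python
from statistics import fmean
from typing import Sequence

REGRESSION_THRESHOLD = 0.5  # points, as stated in the text


def promotion_verdict(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    threshold: float = REGRESSION_THRESHOLD,
) -> str:
    """Compare mean scores on the frozen corpus and return a verdict."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both runs need at least one scored row")
    drop = fmean(baseline_scores) - fmean(candidate_scores)
    # Assumption: a drop at or beyond the threshold blocks the change.
    return "BLOCK_PROMOTION" if drop >= threshold else "ALLOW_PROMOTION"
```

Because both runs score the same frozen corpus, the difference in means reflects the prompt or model change rather than a change in the data.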

Batch evaluator runs + context cache (S3)

CODE READY · Phase 1.5.S (S3)

POST /judge/run?mode=batch emits cache.hit_rate + cost.total_usd span attributes for unit-economics.

Code path + cache.hit_rate / cost.total_usd attributes shipped; 66% cost cut validated in dev. Production cutover gated on APP-014 per-tenant cost-attribution dashboard so the savings surface in the operator UI, not just the spans.
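To make the two span attributes concrete, here is one way they could be derived from a batch of judge calls. The prices, the `JudgeCall` shape, and the definition of hit rate as the share of prompt tokens served from cache are assumptions, not the shipped implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class JudgeCall:
    prompt_tokens: int
    cached_tokens: int      # prompt tokens served from the context cache
    completion_tokens: int


# Hypothetical USD prices per one million tokens.
PRICE_INPUT = 0.10
PRICE_CACHED = 0.025
PRICE_OUTPUT = 0.40


def batch_span_attributes(calls: Iterable[JudgeCall]) -> dict[str, float]:
    """Compute cache.hit_rate and cost.total_usd for one batch run."""
    calls = list(calls)
    prompt = sum(c.prompt_tokens for c in calls)
    cached = sum(c.cached_tokens for c in calls)
    completion = sum(c.completion_tokens for c in calls)
    hit_rate = cached / prompt if prompt else 0.0
    cost = (
        (prompt - cached) * PRICE_INPUT
        + cached * PRICE_CACHED
        + completion * PRICE_OUTPUT
    ) / 1_000_000
    return {"cache.hit_rate": hit_rate, "cost.total_usd": cost}
```

With the assumed prices, a higher hit rate lowers the input share of the cost, which is the effect the batch mode is meant to expose.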

Sessions (multi-turn coherence detection)

PLANNED · Phase 2.0

session.id = chat_thread_id clusters all turns of a /chat conversation; surfaces 'agent contradicted turn 2 vs turn 5' as a first-class drift signal.

Lands in EvalOps-S5 alongside the multi-turn evaluator. The session-id wiring in tracing.py is small; the value is in the matching evaluator rubric.

Hallucination + QA evaluators (RAG-aware)

PLANNED · Phase 2.5

Pre-built HallucinationEvaluator + QAEvaluator score retrieved-context-vs-generated-answer alignment.

Lands in EvalOps-S6 once Vector Search retrieval is mainline in the chat agent (F2). Off-the-shelf Phoenix evaluators — we don't build, we wire.

Multi-evaluator parallel scoring

PLANNED · Phase 2.5

Split the 6-criterion rubric into N independent evaluators per span — see which dimension regressed.

EvalOps-S7. Surfaces 'risk-rule-adherence dropped 0.2 this week' instead of one collapsed score moving 0.1.
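The benefit of splitting the rubric can be shown with a small per-criterion comparison. This is a sketch under assumptions: scores arrive as one dict per judged span, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def mean_by_criterion(rows: list[dict[str, float]]) -> dict[str, float]:
    """Average each criterion across all judged spans."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        for criterion, score in row.items():
            buckets[criterion].append(score)
    return {criterion: fmean(values) for criterion, values in buckets.items()}


def criterion_deltas(previous: list[dict[str, float]], current: list[dict[str, float]]) -> dict[str, float]:
    """Return current-minus-previous mean for criteria present in both periods."""
    prev, cur = mean_by_criterion(previous), mean_by_criterion(current)
    return {c: cur[c] - prev[c] for c in sorted(prev.keys() & cur.keys())}


def worst_regression(previous: list[dict[str, float]], current: list[dict[str, float]]) -> tuple[str, float]:
    """Name the criterion whose mean fell the most."""
    deltas = criterion_deltas(previous, current)
    return min(deltas.items(), key=lambda item: item[1])
```

A single collapsed score would average the regression away; the per-criterion deltas point at the dimension that actually moved.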

Embedding drift visualisation

PLANNED · Phase 2.5

UMAP-projected embedding views + drift detection between two corpora (last-week vs this-week decisions).

EvalOps-S8. F2 already produces 768-dim embeddings on every decision; pipe into Phoenix for the drift UI.
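Before the Phoenix drift UI is wired, a crude week-over-week signal can be computed directly from the embeddings. The sketch below uses the cosine distance between corpus centroids; this is a simple stand-in chosen for illustration, not the UMAP projection or the drift metric Phoenix itself uses.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / (norm_a * norm_b)


def centroid_drift(last_week: Sequence[Vector], this_week: Sequence[Vector]) -> float:
    """Distance between the two corpora's centroids."""
    return cosine_distance(centroid(last_week), centroid(this_week))
```

A centroid comparison misses changes in spread or cluster structure, which is where a projected view of the full corpus adds information.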

tenant.id span attribute (multi-tenant cost attribution)

APP-014

PLANNED · Phase 2.0

Every span tagged with tenant.id → per-user cost rollups → SaaS billing source of truth.

Foundational for the SaaS pricing model; APP-014 LLM-budget circuit-breaker reads tenant.id directly.

Cost tracking with tenant.id rollups

APP-014

PLANNED · Phase 2.0

Aggregate llm.cost per span by tenant.id → 'is this user profitable' chart that answers investor questions.

Depends on the tenant.id span attribute landing first.
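Once spans carry tenant.id, the rollup itself is a simple group-by. The sketch below assumes spans are available as plain dicts with `tenant.id` and `llm.cost` keys; the `unattributed` bucket and the revenue input are hypothetical additions for illustration.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def cost_by_tenant(spans: Iterable[Mapping[str, object]]) -> dict[str, float]:
    """Sum llm.cost per tenant.id; untagged spans go to a catch-all bucket."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        tenant = str(span.get("tenant.id", "unattributed"))
        totals[tenant] += float(span.get("llm.cost", 0.0))
    return dict(totals)


def margin_by_tenant(costs: Mapping[str, float], revenue: Mapping[str, float]) -> dict[str, float]:
    """Revenue minus LLM cost per tenant; negative means unprofitable."""
    return {tenant: revenue.get(tenant, 0.0) - cost for tenant, cost in costs.items()}
```

Keeping an explicit bucket for untagged spans makes gaps in the tenant.id rollout visible instead of silently dropping their cost.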

Arize AX migration (persistent + multi-tenant alerting)

PLANNED · Phase 2.5

Phoenix OSS = local + ephemeral; AX = SaaS with monitor rules ('page me if eval drops 10% WoW') and SSO.

Required past M2 for persistent storage + cross-tenant dashboards.

GDPR DSR cascade (project-per-tenant at enterprise tier)

APP-015

PLANNED · Phase 2.5

Article 17 right-to-erasure: enterprise-tier tenants get isolated Phoenix projects for surgical span deletion.

Standard tier uses tenant.id-filtered deletion via AX API; enterprise tier gets project-per-tenant for isolation.

Live data

Real-time from the Arize Phoenix backend

Every tile below is a live read from the vendor backend via the FastAPI BFF. If a tile shows "—" the backend is unreachable or the metric is not yet wired (no hardcoded numbers — see anti-pattern #2).
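The "no hardcoded numbers" rule can be enforced at the formatting layer: a missing value renders as the placeholder and never as a default number. The helper below is a hypothetical sketch, not the page's actual rendering code.

```python
from typing import Optional

PLACEHOLDER = "\u2014"  # the dash shown when a metric is unavailable


def format_metric(
    value: Optional[float],
    prefix: str = "",
    suffix: str = "",
    decimals: int = 0,
) -> str:
    """Render a tile value, or the placeholder when the backend returned nothing."""
    if value is None:
        return f"{prefix}{PLACEHOLDER}{suffix}"
    return f"{prefix}{value:,.{decimals}f}{suffix}"
```

For example, `format_metric(None, prefix="$", suffix=" spent today")` produces the same "$— spent today" text the status strip shows while the backend is unreachable.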

Awaiting backend

Golden decisions v1

F1 regression set — curated 50 decisions covering high-confidence wins, false-positive low-confidence approvals, skipped decisions, drawdown-period and macro-extreme samples. Each prompt iteration is scored against this set so regressions surface instantly.

Dataset rows

curated · 0

Baseline F1 mean

0 scored

Latest experiment

no runs yet

Regressions

vs prior run

Mean score · last 10 runs

No experiments yet · no runs

Class distribution

No baseline yet — corpus has no scored decisions.

Roadmap commitments

Roadmap dependencies

Capabilities enabled by this integration — what is built, what is gated, and why.

APP-014 · Phase 2.0

Per-tenant LLM budget circuit breaker

Phoenix tenant.id span attribute is the source of truth for billing; APP-014 reads cost rollups by tenant.id and trips the circuit breaker when a plan's monthly cap is hit.
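A budget circuit breaker of this shape could be as small as the sketch below. The class, exception, and method names are hypothetical, and the spend figure is passed in directly rather than read from Phoenix cost rollups.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a tenant has used up its monthly LLM budget."""


@dataclass
class TenantBudgetBreaker:
    monthly_cap_usd: float
    spent_usd: float = 0.0

    @property
    def is_open(self) -> bool:
        # Open breaker: the cap has been reached and calls must stop.
        return self.spent_usd >= self.monthly_cap_usd

    def record(self, cost_usd: float) -> None:
        """Add the cost of a completed call to the running total."""
        self.spent_usd += cost_usd

    def guard(self) -> None:
        """Call before each LLM request; raises once the cap is hit."""
        if self.is_open:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} of {self.monthly_cap_usd:.2f} USD"
            )
```

A caller would invoke `guard()` before each request and `record()` after it, resetting `spent_usd` at the start of each billing month.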

APP-015 · Phase 2.5

GDPR DSR cascade (Article 17 right-to-erasure)

Phoenix is one of 5 backends in the cascade. Standard tier uses tenant.id-filtered AX API deletion; enterprise tier gets a dedicated Phoenix project per tenant for surgical isolation.

APP-014 · ongoing platform release

S1 Annotations + Phase-2 trace-id retrofit

S1 ships annotations keyed by decision_id today; Phase 2 retrofit adds trace_id + span_id linkage so a 👎 on /decisions surfaces the exact span that produced the bad reasoning, queryable in the Phoenix UI.

Demo flow

End-to-end showcase journey

Five steps a judge or investor can replay live. Each step links to the page that demonstrates it.

  1. Open /monitoring. Five native Phoenix widgets show real spans: latency, cost, error rate, span tree, eval table. All from /api/arize/* against the live sentinelhub-dev project. → Open /monitoring

  2. Click any trace row → drill down into the span tree. See agent.name, llm.model_name, llm.token_count.prompt + .completion, and the per-call cost. → Open /monitoring

  3. Open /decisions, click 👎 on a low-quality reasoning row. Annotation lands in Mongo decision_annotations, then the 15-min sync worker pushes it to Phoenix via log_annotation(), closing the flywheel. → Open /decisions

  4. Open /insights → Self-Improvement Loop widget. Aggregate 👎 counts + top-5 lowest-approval strategies, all backed by Phoenix annotations + Mongo aggregations. → Open /insights

  5. POST /api/arize/judge/run?mode=batch and watch the batch evaluator span land with cache.hit_rate + cost.total_usd attributes. 66% cost cut vs the per-span baseline (S3 unit-economics evidence). → Open /monitoring

What's next

Vendor-enabled capabilities coming soon

Sourced from the vendor's playbook. Each entry is mapped to its delivery phase and the value it unlocks.

EvalOps S4 — golden corpus + experiment promotion gate

Phase 1.5.S (S4)

50-row curated dataset on Phoenix Cloud + `compare_experiments` regression gate (0.5-point threshold). Every prompt/model change runs against the frozen corpus before merge.

EvalOps S5 — Sessions + multi-turn coherence evaluator

Phase 2.0

session.id wiring + chat-coherence rubric — flags agents that contradict themselves across turns of the same conversation.

EvalOps S6 — Hallucination + QA evaluators on RAG output

Phase 2.5

Off-the-shelf Phoenix HallucinationEvaluator + QAEvaluator scoring F2 Vector Search retrieved chunks vs generated answer.

EvalOps S7 — multi-evaluator parallel scoring

Phase 2.5

Six rubric criteria split into independent evaluators per span — diagnose which dimension regressed (risk-rule-adherence vs format vs grounding).

EvalOps S8 — embedding drift visualisation

Phase 2.5

UMAP-projected embeddings on Phoenix + week-over-week drift detection on the decision corpus.

Arize AX migration

Phase 2.5

Persistent storage past M2 + cross-tenant dashboards + monitor-rule alerting + enterprise SSO. Lands after EvalOps S4–S8 stabilise on Phoenix OSS.
