Arize Phoenix / OpenInference

Arize Phoenix and the broader OpenInference tracing standard are whatifd’s second supported trace source, shipped in v0.2. The adapter is tracer-neutral by construction: it consumes OpenInference-shaped span dictionaries from a caller-supplied provider, not a pinned Phoenix client. Anything that emits OpenInference spans — Phoenix proper, a custom OTLP collector, an offline JSONL dump — works with a ~5-line wiring callable.

What whatifd reads from Phoenix / OpenInference

For each selected trace:

  • input.value — the agent’s user-facing input (wrapped as Sensitive[str] at the boundary per cardinal #5).

  • output.value — the agent’s output (the original output the verdict is diffed against; also wrapped).

  • context.trace_id + parent_id — used to group spans into traces and identify the root span.

  • openinference.span.kind — used to identify LLM vs tool vs retrieval spans for the ToolCache.

  • All other span attributes — pass through to RawTrace.metadata unwrapped (cardinal #5: only intentional user content is Sensitive).

whatifd is read-only on Phoenix. It never writes spans back.
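To make the fields above concrete, here is a minimal sketch of grouping spans into traces and picking the root span. It is not whatifd's implementation. The flat key layout (for example `context.trace_id` as a top-level key), the `context.span_id` key, and the helper names `group_by_trace` and `find_root` are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical flat span dicts; the exact key layout is an assumption.
spans = [
    {"context.trace_id": "t1", "context.span_id": "a", "parent_id": None,
     "openinference.span.kind": "AGENT",
     "input.value": "What is the refund policy?",
     "output.value": "Refunds are accepted within 30 days."},
    {"context.trace_id": "t1", "context.span_id": "b", "parent_id": "a",
     "openinference.span.kind": "TOOL",
     "input.value": '{"query": "refund policy"}',
     "output.value": '{"days": 30}'},
]

def group_by_trace(spans):
    """Bucket spans by trace id, preserving arrival order within each trace."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["context.trace_id"]].append(span)
    return dict(traces)

def find_root(trace_spans):
    """Return the single span with no parent; raise if the trace is malformed."""
    roots = [s for s in trace_spans if not s.get("parent_id")]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    return roots[0]

traces = group_by_trace(spans)
root = find_root(traces["t1"])
print(root["input.value"])  # the trace-level input comes from the root span
```

In this sketch the root span supplies the trace-level input.value and output.value, while child spans such as the tool call stay available for the ToolCache.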

Install

uv pip install whatifd whatifd-langfuse whatifd-inspect-ai whatifd-phoenix

whatifd-phoenix itself is tracer-neutral and has no required SDK pin. If you want to read from a live Phoenix instance via arize-phoenix-client, install the optional extra:

uv pip install "whatifd-phoenix[live]"

The core whatifd package does not pull Phoenix — you can use Langfuse, your own adapter, or the synthetic whatifd.adapters.stub.StubTraceSource without ever installing this package.

Wiring a spans_provider

The adapter accepts a spans_provider: Callable[[], Iterable[dict]] that yields OpenInference-shaped span dictionaries. The shape decouples whatifd from any single Phoenix client SDK or transport layer.

From arize-phoenix-client (live Phoenix instance)

# The arize-phoenix-client package installs the phoenix.client module.
from phoenix.client import Client
from whatifd_phoenix import PhoenixTraceSource

def spans_provider():
    client = Client(base_url="https://your-phoenix-host")
    # Iterate however your Phoenix deployment exposes spans.
    # The API surface varies by version, so treat the call below
    # as illustrative and check your installed client. The adapter
    # cares only that each yielded item is a dict with OpenInference attrs.
    for span in client.get_spans(project="my-project"):
        yield span.to_dict()

source = PhoenixTraceSource(spans_provider=spans_provider)

From a JSONL dump (offline / CI fixture)

import json
from pathlib import Path
from whatifd_phoenix import PhoenixTraceSource

def spans_provider():
    # One JSON object (one span) per line; blank lines are skipped.
    with Path("spans.jsonl").open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

source = PhoenixTraceSource(spans_provider=spans_provider)

From an OpenTelemetry collector

If your spans land in any OpenInference-emitting OTLP destination, the same pattern applies — implement a spans_provider that pulls them out as dicts. The adapter never assumes Phoenix-specific transport.

OpenInference attribute mapping

The adapter reads the standard OpenInference span attribute conventions:

| OpenInference attribute | whatifd use | Sensitive[str]? |
| --- | --- | --- |
| context.trace_id | Group spans into traces | no |
| parent_id (or its absence) | Identify root span per trace | no |
| openinference.span.kind | Classify span (LLM / tool / retriever / agent) | no |
| input.value | Trace input (user message) | yes |
| output.value | Trace output (agent response) | yes |
| any other attribute | Passed through to RawTrace.metadata | no |

The Sensitive[str] wrapping is enforced at the adapter boundary; downstream code that needs the raw value must call .unwrap(reason=...) and the unwrap is audit-logged.
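The wrap-then-audit idea can be pictured with a small stand-in class. This is a sketch of the pattern only, not whatifd's actual Sensitive type: the class body, the keyword-only `reason` argument, the empty-reason check, and the logger name `sensitive.audit` are all assumptions.

```python
import logging
from typing import Generic, TypeVar

T = TypeVar("T")
audit_log = logging.getLogger("sensitive.audit")  # hypothetical logger name

class Sensitive(Generic[T]):
    """Illustrative wrapper: hides its value from repr/str and logs every unwrap."""

    __slots__ = ("_value",)

    def __init__(self, value: T) -> None:
        self._value = value

    def __repr__(self) -> str:
        # str() falls back to repr(), so printing never leaks the value.
        return "Sensitive(<redacted>)"

    def unwrap(self, *, reason: str) -> T:
        """Return the raw value, recording why it was needed."""
        if not reason.strip():
            raise ValueError("unwrap requires a non-empty reason")
        audit_log.info("sensitive value unwrapped: reason=%s", reason)
        return self._value

wrapped = Sensitive("What is my account balance?")
print(wrapped)  # prints the redacted placeholder, not the user text
raw = wrapped.unwrap(reason="diffing original output against replay")
```

Requiring a reason at every call site means each access to user content leaves a searchable trail, which is the property the adapter boundary relies on.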

Selectors

The Phoenix adapter supports the same selector grammar as Langfuse for cohort filtering — see the Langfuse selector grammar. Selectors that depend on tracer-specific concepts (e.g., Langfuse-style score-based filters) are interpreted by the adapter against whatever attribute Phoenix exposes for the equivalent signal.

If a selector references a concept with no direct Phoenix equivalent, the adapter rejects it as unsupported at config-load time (cardinal #1: structural failure, not silent skip).
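The fail-fast behaviour for unsupported selectors might look roughly like the sketch below. The selector keys (`tag`, `name`, `time_range`, `score_above`), the `UnsupportedSelectorError` class, and the `validate_selectors` function are hypothetical and do not reflect whatifd's real selector grammar. The point is only that validation happens once, before any trace is read.

```python
# Hypothetical set of selector keys this adapter can translate.
SUPPORTED_SELECTOR_KEYS = frozenset({"tag", "name", "time_range"})

class UnsupportedSelectorError(ValueError):
    """Raised at config-load time instead of silently ignoring a selector."""

def validate_selectors(selectors: dict) -> None:
    """Reject the whole config if any selector key cannot be honoured."""
    unsupported = sorted(set(selectors) - SUPPORTED_SELECTOR_KEYS)
    if unsupported:
        raise UnsupportedSelectorError(
            f"selectors not supported by this adapter: {', '.join(unsupported)}"
        )

validate_selectors({"tag": "prod"})  # passes silently

try:
    validate_selectors({"tag": "prod", "score_above": 0.8})
except UnsupportedSelectorError as exc:
    error_message = str(exc)
```

Failing the config outright keeps a cohort from being silently widened when a filter cannot be applied.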

Tool-call cache

OpenInference tool spans surface as cache entries under whatifd’s default use-original policy. A span qualifies when its openinference.span.kind is "tool"; the cache key is derived from that kind plus a hash of the span’s input.value, and the cached tool output is read from output.value on the same span.

If a trace has incomplete tool spans (output missing, mismatched parent/child structure), whatifd records the trace as a structured replay failure (cardinal #1) in the report’s Replay validity section.
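A minimal sketch of the keying scheme described above, assuming flat span dicts, SHA-256 as the hash, and a case-insensitive match on the span kind. None of these details are confirmed by whatifd, and `IncompleteToolSpan`, `tool_cache_key`, and `build_tool_cache` are hypothetical names.

```python
import hashlib

class IncompleteToolSpan(Exception):
    """Signals a tool span that cannot be replayed from the recorded trace."""

def tool_cache_key(span: dict) -> tuple:
    """Key a tool call by its span kind plus a digest of its input."""
    digest = hashlib.sha256(span["input.value"].encode("utf-8")).hexdigest()
    return ("tool", digest)

def build_tool_cache(trace_spans: list) -> dict:
    """Map each tool call in a trace to its originally recorded output."""
    cache = {}
    for span in trace_spans:
        kind = str(span.get("openinference.span.kind", "")).lower()
        if kind != "tool":
            continue  # LLM, retriever, and agent spans are not cached here
        if span.get("input.value") is None or span.get("output.value") is None:
            raise IncompleteToolSpan("tool span is missing input.value or output.value")
        cache[tool_cache_key(span)] = span["output.value"]
    return cache

trace_spans = [
    {"openinference.span.kind": "AGENT", "input.value": "hi", "output.value": "hello"},
    {"openinference.span.kind": "TOOL",
     "input.value": '{"city": "Oslo"}', "output.value": '{"temp_c": 4}'},
]
cache = build_tool_cache(trace_spans)
```

During replay, a tool call with the same input would hash to the same key and return the recorded output, which is what a use-original policy implies. A caller would catch IncompleteToolSpan and record the trace as a replay failure rather than dropping it.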

Scope of v0.2 support

| Feature | v0.2 | Notes |
| --- | --- | --- |
| Read OpenInference spans from any provider | ✅ | The spans_provider callable is the boundary. |
| Wrap user content as Sensitive[str] | ✅ | Per cardinal #5. |
| Conformance with TraceSource Protocol | ✅ | 14 conformance tests (5 inherited from the harness + 9 adapter-specific). |
| Live-Phoenix recorded-cassette smoke | ❌ (v0.3) | Parity with whatifd-langfuse’s cassette discipline. Today the adapter is verified against fixture spans only. |
| Phoenix-specific selector grammar | partial | Core selectors map cleanly; Phoenix-only signals (e.g., evaluation runs) are not yet first-class. |
| Multi-project Phoenix tenancy | ✅ | The spans_provider is per-instance; point it at whichever project you need. |
| cluster_key for cluster-paired bootstrap (cardinal #10) | declared empty | cluster_key_support() returns (); v0.3’s cluster-paired bootstrap will widen this. |

Limitations

  • The adapter doesn’t ship a Phoenix client wrapper. You supply the spans_provider. This is a deliberate design choice: pinning to a specific Phoenix SDK version would bind whatifd-phoenix to one transport, breaking neutrality. The trade-off is ~5 lines of wiring code at integration time.

  • OpenInference attribute conformance. If your spans are emitted by a non-standard tracer that almost follows OpenInference, attribute mapping may surface gaps. The adapter validates the attributes it reads at trace-build time; gaps appear as structured RawTrace build errors.

  • Live-Phoenix verification is not yet structurally pinned. v0.3 will land a pytest-recording-based cassette suite mirroring whatifd-langfuse’s discipline. Until then, validate your spans_provider against a fixture set before relying on it in CI.
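One way to validate a spans_provider against a fixture set is a small pre-flight check like the sketch below. The required key names, the flat dict layout, the exactly-one-root rule, and the `check_spans_provider` helper are assumptions for illustration. The adapter's own build-time validation remains the authority.

```python
from typing import Callable, Dict, Iterable, List

# Assumed key names, taken from the attribute list in this page.
REQUIRED_KEYS = ("context.trace_id", "openinference.span.kind")

def check_spans_provider(spans_provider: Callable[[], Iterable[dict]]) -> List[str]:
    """Return human-readable problems; an empty list means the fixture looks sane."""
    problems: List[str] = []
    roots_per_trace: Dict[str, int] = {}
    for index, span in enumerate(spans_provider()):
        if not isinstance(span, dict):
            problems.append(f"item {index}: expected dict, got {type(span).__name__}")
            continue
        for key in REQUIRED_KEYS:
            if key not in span:
                problems.append(f"item {index}: missing {key!r}")
        trace_id = span.get("context.trace_id")
        if trace_id is not None:
            roots_per_trace.setdefault(trace_id, 0)
            if not span.get("parent_id"):
                roots_per_trace[trace_id] += 1
    for trace_id, count in roots_per_trace.items():
        if count != 1:
            problems.append(f"trace {trace_id!r}: expected 1 root span, found {count}")
    return problems

def good_fixture():
    yield {"context.trace_id": "t1", "parent_id": None, "openinference.span.kind": "AGENT"}
    yield {"context.trace_id": "t1", "parent_id": "a", "openinference.span.kind": "TOOL"}

def bad_fixture():
    # No span kind, and no root span for trace t2.
    yield {"context.trace_id": "t2", "parent_id": "missing-root"}

good_problems = check_spans_provider(good_fixture)
bad_problems = check_spans_provider(bad_fixture)
```

Running a check like this in CI against a committed fixture file catches shape drift in the provider before it reaches the adapter.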