ADR-026: Provider Middleware Pipeline
- Status: Accepted
- Date: 2026-04
- Authors: Netresearch DTT GmbH
Context
Every provider call in the extension should be wrapped by the same cross-cutting concerns, but today those concerns are scattered:
- `FallbackChainExecutor` (`Classes/Service/FallbackChainExecutor.php`) is a try primary / catch / foreach fallbacks loop with two retryable exception types hardcoded. It has no pre/post hooks and no composition seam.
- It is applied only to database-backed configuration paths in `LlmServiceManager::runWithFallback()`. Direct calls (`chat()`, `complete()`, `embed()`, `vision()`) bypass it entirely, which silently splits retry semantics.
- `BudgetService::check()` (ADR-025) and `UsageTrackerService::trackUsage()` are primitives that no feature service actually calls. Budget enforcement and usage accounting must be remembered by every caller, which is a silent footgun.
- HTTP-level retry with back-off lives inside `AbstractProvider` (`sendRequest()`). That is the wrong layer: a rate-limited provider should be swapped, not retried in place.
- Cache lookup exists only inside `EmbeddingService` as ad-hoc branches. There is no way to plug it in for deterministic completion scenarios (seed / temperature 0) without duplicating the branch.

The end result is that every new cross-cutting requirement (PII redaction, prompt logging, trace correlation, per-provider rate limits, circuit breakers, a cost calculator) forces either a bespoke branch in every feature service or a subclass of one of the god classes.
Decision
Introduce a PSR-15-inspired middleware pipeline under `Classes/`:
interface ProviderMiddlewareInterface
{
public function handle(
ProviderCallContext $context,
LlmConfiguration $configuration,
callable $next, // callable(LlmConfiguration): mixed
): mixed;
}
Each middleware receives:
- an immutable `ProviderCallContext` (operation kind, correlation id, metadata map),
- the current `LlmConfiguration`,
- a `$next` callable that continues the pipeline,

and decides whether to pass through, short-circuit, swap the configuration, or wrap the call with before/after logic.
`MiddlewarePipeline` composes an ordered stack of middlewares
around a terminal callable in classic onion fashion: the
first-registered middleware is the outermost layer.
The payload — messages, embedding input, tool specs, vision content —
stays captured in the terminal callable. That keeps the existing typed
response objects (Completion, Embedding,
Vision) intact on the return side and avoids inventing a
generic ProviderRequest envelope that would then have to know about
every operation variant.
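As a hedged sketch (strings stand in for `LlmConfiguration` and the typed response objects, and the class is an illustrative simplification, not the shipped `MiddlewarePipeline`), the onion composition and the payload-carrying terminal callable look roughly like this:

```php
<?php
declare(strict_types=1);

// Illustrative sketch only: real middlewares implement ProviderMiddlewareInterface.
final class PipelineSketch
{
    /** @param list<callable(string, callable): mixed> $middlewares first = outermost */
    public function __construct(private array $middlewares) {}

    public function run(string $configuration, callable $terminal): mixed
    {
        // Fold right-to-left so the first-registered middleware wraps the rest.
        $next = $terminal;
        foreach (array_reverse($this->middlewares) as $middleware) {
            $next = fn (string $config): mixed => $middleware($config, $next);
        }

        return $next($configuration);
    }
}

$trace = [];
$pipeline = new PipelineSketch([
    // Outermost: wraps the call with before/after logic.
    function (string $config, callable $next) use (&$trace): mixed {
        $trace[] = 'outer:before';
        $result = $next($config);
        $trace[] = 'outer:after';
        return $result;
    },
    // Inner: swaps the configuration for everything beneath it.
    function (string $config, callable $next) use (&$trace): mixed {
        $trace[] = 'inner:swap';
        return $next('swapped-config');
    },
]);

// The payload ("hello") stays captured in the terminal callable,
// so the pipeline never needs a generic request envelope.
$result = $pipeline->run('default', fn (string $config): string => "completed(hello)@{$config}");
```

Note how the inner middleware's configuration swap is visible only to the layers beneath it; the outer middleware still saw `default`.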
Registration
Implementations are discovered via the `nr_llm.provider_middleware`
tag, which `Autoconfigure` applies automatically to every class
that implements the interface. The pipeline's constructor injects the
collected middleware via `Autowire`. Ordering follows tag
priority, which is an ordering hint only.
Contributors can add behaviour without touching the service configuration:
implement the interface, drop the class under `Classes/`, and you are done.
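Assuming the standard Symfony DI attributes available in recent TYPO3 versions (the tag name is from this ADR; the class names and priority are illustrative, and older Symfony versions would use `TaggedIterator` instead of `AutowireIterator`), the wiring might look like:

```php
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;
use Symfony\Component\DependencyInjection\Attribute\AutowireIterator;

// Implementors are tagged automatically; the explicit attribute form shown
// here is the equivalent with a priority hint.
#[AutoconfigureTag('nr_llm.provider_middleware', ['priority' => 50])]
final class TracingMiddleware implements ProviderMiddlewareInterface
{
    // ...
}

final class MiddlewarePipeline
{
    public function __construct(
        // Collected in tag-priority order, highest priority first (outermost).
        #[AutowireIterator('nr_llm.provider_middleware')]
        private readonly iterable $middlewares,
    ) {}
}
```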
Scope of this ADR
Infrastructure only. No behaviour change in this PR:
- `ProviderMiddlewareInterface`, `MiddlewarePipeline`, `ProviderCallContext`, `ProviderOperation` enum.
- Unit tests covering empty pipeline, single/multiple composition, short-circuit, configuration substitution, context propagation, generator-based iterables.
- This ADR.
`FallbackChainExecutor` stays untouched. Feature services continue
to work exactly as they do today. The pipeline is opt-in: consumers
have to build a terminal callable and invoke `MiddlewarePipeline`
to use it.
Follow-ups
Each item below is a separate PR that lands one behaviour at a time, so the test matrix keeps green end-to-end:
- FallbackMiddleware — port `FallbackChainExecutor` to the interface. `LlmServiceManager::runWithFallback()` stops instantiating the executor directly and runs the pipeline instead. Retry semantics become identical for every call path, not just database-backed ones. Deprecate the standalone executor.
- BudgetMiddleware — call `BudgetService::check()` before `$next`; throw a typed `BudgetExceededException` on denial so controllers can report which bucket tripped.
- UsageMiddleware — after `$next` returns, hand the response to `UsageTrackerService::trackUsage()`. Centralises cost/token accounting regardless of which feature called in.
- CacheMiddleware — opt-in per operation via `ProviderOperation`. Embedding lookups start going through it; the branch currently inside `EmbeddingService` comes out.
- Direct-method wiring (centralised) — every direct API method on `LlmServiceManager` (`chat`, `complete`, `embed`, `vision`, `chatWithTools`) builds its terminal callable and invokes the pipeline via a synthesised transient `LlmConfiguration`. Because every feature service (`CompletionService`, `EmbeddingService`, `TranslationService`, `VisionService`) delegates to these methods, feature-service traffic inherits the full middleware stack without each service owning its own pipeline glue.

  The transient configuration is unpersisted (no uid), carries an empty fallback chain (so `FallbackMiddleware` passes through verbatim), and uses a human-readable `ad-hoc:<operation>:<provider>` identifier so log / trace labels distinguish direct traffic from configuration-backed calls. Middleware that needs more context (`beUserUid` for `BudgetMiddleware`, cache keys for `CacheMiddleware`) reads it from the `ProviderCallContext` metadata, not from the configuration.

  Streaming (`streamChat` / `streamChatWithConfiguration`) deliberately stays out of the pipeline per the ADR's original scope: once the first chunk has been emitted, we cannot swap providers mid-stream, and most middleware assume a single terminal result.

  Why the centralised form rather than "every feature service owns glue": the ADR's problem statement explicitly identifies direct calls as the bug ("`chat()`, `complete()`, `embed()`, `vision()` — bypass [the fallback executor] entirely, which silently splits retry semantics"). Wiring feature services only would have left direct `LlmServiceManager` callers still bypassing the pipeline. Centralising on `LlmServiceManager` fixes both in one step and keeps feature services free of pipeline concerns.
Each follow-up is scoped to a single concern and keeps the codebase shippable after every step.
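To make the follow-up shape concrete, here is a hedged sketch of the BudgetMiddleware step. The check closure is a stand-in for the real `BudgetService::check()`, `BudgetExceededException` is declared locally for illustration, and the simplified `handle()` signature omits the `ProviderCallContext` / `LlmConfiguration` parameters of the real interface:

```php
<?php
declare(strict_types=1);

// Local stand-in for the typed exception the follow-up proposes.
final class BudgetExceededException extends \RuntimeException {}

// Hedged sketch: the closure stands in for BudgetService::check().
final class BudgetMiddlewareSketch
{
    public function __construct(private readonly \Closure $check) {}

    public function handle(string $beUserUid, callable $next): mixed
    {
        // Deny before $next ever reaches a provider; short-circuit by throwing.
        if (!($this->check)($beUserUid)) {
            throw new BudgetExceededException("Budget exhausted for backend user {$beUserUid}");
        }

        return $next();
    }
}

$middleware = new BudgetMiddlewareSketch(fn (string $uid): bool => $uid !== '42');

// Within budget: the call passes through to the terminal.
$ok = $middleware->handle('7', fn (): string => 'provider-response');

// Over budget: the provider call never happens.
try {
    $middleware->handle('42', fn (): string => 'never reached');
    $denied = false;
} catch (BudgetExceededException) {
    $denied = true;
}
```

The typed exception is what lets controllers report which bucket tripped instead of surfacing a generic provider error.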
Embedding cache migration — done
The inline cache branch that used to live in `EmbeddingService`
has been moved behind `CacheMiddleware`:
- `EmbeddingResponse` and `UsageStatistics` grew `toArray()` / `fromArray()` helpers so the typed response can round-trip through `CacheMiddleware` (which persists `array<string, mixed>` via the TYPO3 cache frontend).
- `LlmServiceManager::embed()` derives a stable cache key via `CacheManagerInterface::generateCacheKey()` (same hash shape the old inline branch produced, so existing cache entries stay valid) and places it on the `ProviderCallContext` metadata under `CacheMiddleware::METADATA_CACHE_KEY`.
- `cache_ttl == 0` (`EmbeddingOptions::noCache()`) omits the key so the middleware is a no-op, consistent with the old `cacheTtl` semantics.
- The terminal now returns `$response->toArray()`; the manager reconstructs the typed `EmbeddingResponse` via `EmbeddingResponse::fromArray()` before returning to the caller. Public method signature is unchanged.
- `UsageMiddleware` learned to also recognise the array-payload shape (`['usage' => [...], 'provider' => '...']`) so usage accounting stays consistent whether the pipeline produced a typed response (other operations) or an array (embeddings via `CacheMiddleware`).
- `EmbeddingService` no longer depends on `CacheManagerInterface`; it is a pure vector-math façade on top of `LlmServiceManager::embed()`.
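The round-trip contract can be sketched as follows (the class is a minimal stand-in for the real `EmbeddingResponse`, which carries more fields):

```php
<?php
declare(strict_types=1);

// Minimal stand-in for EmbeddingResponse: just enough fields to show the
// toArray()/fromArray() round-trip through an array<string, mixed> cache.
final class EmbeddingResponseSketch
{
    /** @param list<float> $vector */
    public function __construct(
        public readonly array $vector,
        public readonly string $provider,
    ) {}

    /** @return array<string, mixed> */
    public function toArray(): array
    {
        return ['vector' => $this->vector, 'provider' => $this->provider];
    }

    /** @param array<string, mixed> $data */
    public static function fromArray(array $data): self
    {
        return new self($data['vector'], $data['provider']);
    }
}

$original = new EmbeddingResponseSketch([0.12, -0.5], 'openai');
$cached   = $original->toArray();                        // what the cache frontend persists
$restored = EmbeddingResponseSketch::fromArray($cached); // what the manager rebuilds
```

Because only plain arrays cross the cache boundary, the TYPO3 cache frontend never has to serialise the typed object itself.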
Alternatives considered
- Per-operation pipelines (separate middleware stacks for chat / embed / vision / tools). Rejected: every middleware we can foresee (fallback, budget, usage, cache, retry, tracing) wants to run for multiple operations. Filtering inside a middleware via `ProviderCallContext::operation` is cheaper than maintaining N parallel stacks.
- Generic `ProviderRequest` envelope with a `mixed $payload`. Rejected: forces every provider / middleware / test to downcast payloads. Keeping the payload inside the terminal closure preserves the typed signatures already defined by `ProviderInterface` and the capability interfaces.
- PSR-15 directly (`ServerRequestInterface` / `ResponseInterface` shapes). Rejected: HTTP semantics do not fit an LLM call, mapping OpenAI's message array onto a `ServerRequestInterface` is lossy, and the extension already owns `LlmConfiguration` and typed response objects that are a better fit than a generic PSR-7 request.
- Event dispatcher (PSR-14) pre/post hooks. Rejected: events cannot short-circuit, cannot substitute the call target, and cannot return a response to the caller; all three are load-bearing for fallback and cache middleware.
References
- Audit (2026-04-23): claim #1 — "No middleware pipeline — cross-cutting concerns are scattered or absent". Locally stored under `claudedocs/.audit-2026-04-23-architecture.md`.
- ADR-021 — Provider Fallback Chain (the behaviour this pipeline will eventually subsume).
- ADR-025 — Per-User AI Budgets (budget primitive to be wired via BudgetMiddleware).