ADR-026: Provider Middleware Pipeline 

Status

Accepted

Date

2026-04

Authors

Netresearch DTT GmbH

Context 

Every provider call in the extension is wrapped by the same cross-cutting concerns — or rather, it should be, but today those concerns are scattered:

  • FallbackChainExecutor (Classes/Service/FallbackChainExecutor.php) is a try primary / catch / foreach fallbacks loop with two retryable exception types hardcoded. It has no pre/post hooks and no composition seam.
  • It is applied only to database-backed configuration paths in LlmServiceManager::runWithFallback(). Direct calls — chat(), complete(), embed(), vision() — bypass it entirely, which silently splits retry semantics.
  • BudgetService::check() (ADR-025) and UsageTrackerService::trackUsage() are primitives that no feature service actually calls. Budget enforcement and usage accounting must be remembered by every caller, which is a silent footgun.
  • HTTP-level retry with back-off lives inside AbstractProvider (sendRequest()). That is the wrong layer — a rate-limited provider should be swapped, not retried in-place.
  • Cache lookup exists only inside EmbeddingService as ad-hoc branches. There is no way to plug it in for deterministic completion scenarios (seed / temperature 0) without duplicating the branch.

The end result is that every new cross-cutting requirement — PII redaction, prompt logging, trace correlation, per-provider rate limits, circuit breakers, a cost calculator — forces either a bespoke branch in every feature service or a subclass of one of the god classes.

Decision 

Introduce a PSR-15-inspired middleware pipeline under Classes/Provider/Middleware/:

The contract:
interface ProviderMiddlewareInterface
{
    public function handle(
        ProviderCallContext $context,
        LlmConfiguration $configuration,
        callable $next,           // callable(LlmConfiguration): mixed
    ): mixed;
}

Each middleware receives

  1. an immutable ProviderCallContext (operation kind, correlation id, metadata map),
  2. the current LlmConfiguration,
  3. a $next callable that continues the pipeline.

and decides whether to pass through, short-circuit, swap the configuration, or wrap the call with before/after logic. MiddlewarePipeline::run() composes an ordered stack of them around a terminal callable in classic onion fashion — the first-registered middleware is the outermost layer.
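
The onion composition can be sketched as follows. This is a minimal illustration, not the extension's real `MiddlewarePipeline` (the class name `PipelineSketch` and the untyped `object` parameters are simplifications); it only shows the fold that makes the first-registered middleware the outermost layer:

```php
<?php
declare(strict_types=1);

// Illustrative sketch of the onion composition: fold the middleware stack in
// reverse so the first-registered middleware ends up as the outermost layer
// around the terminal callable.
final class PipelineSketch
{
    /**
     * @param iterable<callable> $middlewares each: callable(context, configuration, next): mixed
     * @param callable           $terminal    callable(configuration): mixed
     */
    public static function run(
        iterable $middlewares,
        object $context,
        object $configuration,
        callable $terminal,
    ): mixed {
        // Materialise generators so the stack can be reversed.
        $stack = is_array($middlewares) ? $middlewares : iterator_to_array($middlewares, false);

        $next = $terminal;
        foreach (array_reverse($stack) as $middleware) {
            $previous = $next;
            // Each layer wraps the remainder of the pipeline.
            $next = fn(object $config): mixed => $middleware($context, $config, $previous);
        }

        return $next($configuration);
    }
}
```

With two middleware A and B registered in that order, the call order is A-before, B-before, terminal, B-after, A-after.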

The payload — messages, embedding input, tool specs, vision content — stays captured in the terminal callable. That keeps the existing typed response objects (CompletionResponse, EmbeddingResponse, VisionResponse) intact on the return side and avoids inventing a generic ProviderRequest envelope that would then have to know about every operation variant.

Registration 

Implementations are discovered via the nr_llm.provider_middleware tag, which AutoconfigureTag applies automatically to every class implementing the interface. The pipeline's constructor injects the collected middleware via AutowireIterator. Middleware run in tag-priority order; treat the priority value as an ordering hint only.
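
Under the assumed Symfony DI attributes this wiring reduces to two annotations (the `handle()` signature is simplified to `object` parameters here; the attribute placement is illustrative):

```php
<?php
declare(strict_types=1);

// Sketch of the registration wiring. The Symfony attribute classes are only
// referenced, never reflected, so this parses and runs without Symfony
// installed; in the real container they do the tagging and collection.
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;
use Symfony\Component\DependencyInjection\Attribute\AutowireIterator;

// Tagging the interface tags every implementation automatically.
#[AutoconfigureTag('nr_llm.provider_middleware')]
interface ProviderMiddlewareInterface
{
    public function handle(object $context, object $configuration, callable $next): mixed;
}

final class MiddlewarePipeline
{
    /** @param iterable<ProviderMiddlewareInterface> $middlewares ordered by tag priority */
    public function __construct(
        #[AutowireIterator('nr_llm.provider_middleware')]
        private readonly iterable $middlewares,
    ) {}
}
```

A contributor's middleware therefore needs no Services.yaml entry: implementing the interface is what opts the class into the tag.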

Contributors can add behaviour without touching Services.yaml — implement the interface, drop the class under Classes/Provider/Middleware/, and you are done.

Scope of this ADR 

Infrastructure only. No behaviour change in this PR:

  • ProviderMiddlewareInterface, MiddlewarePipeline, ProviderCallContext, ProviderOperation enum.
  • Unit tests covering empty pipeline, single/multiple composition, short-circuit, configuration substitution, context propagation, generator-based iterables.
  • This ADR.

FallbackChainExecutor stays untouched. Feature services continue to work exactly as they do today. The pipeline is opt-in: consumers have to build a terminal callable and call MiddlewarePipeline::run() to use it.
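
The opt-in call shape looks roughly like this: the consumer captures the payload in a terminal closure and a middleware may short-circuit without the terminal ever running. Everything here is an illustrative stand-in (closures instead of the real classes; the `cacheKey` metadata key is hypothetical), not the extension's API:

```php
<?php
declare(strict_types=1);

// Sketch: payload stays in the terminal closure; a cache-style middleware
// short-circuits on a hit and passes through (filling the cache) on a miss.
$cache = ['vec:hello' => [0.1, 0.2]];
$providerCalls = 0;

// Terminal callable — callable(configuration): mixed — owns the payload.
$input = 'hello';
$terminal = function (object $configuration) use ($input, &$providerCalls): array {
    $providerCalls++;
    return [0.1, 0.2]; // stand-in for the provider embedding of $input
};

// Cache middleware: reads its key from context metadata, never the payload.
$cacheMiddleware = function (object $context, object $configuration, callable $next) use (&$cache) {
    $key = $context->metadata['cacheKey'] ?? null;
    if ($key !== null && isset($cache[$key])) {
        return $cache[$key];          // short-circuit: terminal never runs
    }
    $result = $next($configuration);  // pass through on a miss
    if ($key !== null) {
        $cache[$key] = $result;
    }
    return $result;
};

$context = new stdClass();
$context->metadata = ['cacheKey' => 'vec:hello'];
$result = $cacheMiddleware($context, new stdClass(), $terminal);
```

On the hit above the terminal is never invoked, which is exactly the behaviour an event dispatcher cannot express (see Alternatives considered).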

Follow-ups 

Each item below is a separate PR that lands one behaviour at a time, so the test matrix keeps green end-to-end:

  1. FallbackMiddleware — port FallbackChainExecutor to the interface. LlmServiceManager::runWithFallback() stops instantiating the executor directly and runs the pipeline instead. Retry semantics become identical for every call path, not just database-backed ones. Deprecate the standalone executor.
  2. BudgetMiddleware — call BudgetService::check() before $next; throw a typed BudgetExceededException on denial so controllers can report which bucket tripped.
  3. UsageMiddleware — after $next returns, hand the response to UsageTrackerService::trackUsage(). Centralises cost/token accounting regardless of which feature service made the call.
  4. CacheMiddleware — opt-in per operation via ProviderOperation. Embedding lookups start going through it; the branch currently inside EmbeddingService comes out.
  5. Direct-method wiring (centralised) — every direct API method on LlmServiceManager (chat, complete, embed, vision, chatWithTools) builds its terminal callable and invokes the pipeline via a synthesised transient LlmConfiguration. Because every feature service (CompletionService, EmbeddingService, TranslationService, VisionService) delegates to these methods, feature-service traffic inherits the full middleware stack without each service owning its own pipeline glue.

    The transient configuration is unpersisted (no uid), carries an empty fallback chain (so FallbackMiddleware passes through verbatim), and uses a human-readable ad-hoc:<operation>:<provider> identifier so log / trace labels distinguish direct traffic from configuration-backed calls. Middleware that needs more context (beUserUid for BudgetMiddleware, cache keys for CacheMiddleware) reads it from the ProviderCallContext metadata, not from the configuration.

    Streaming (streamChat / streamChatWithConfiguration) deliberately stays out of the pipeline per the ADR's original scope: once the first chunk has been emitted, we cannot swap providers mid-stream, and most middleware assume a single terminal result.

    Why the centralised form rather than "every feature service owns glue": the ADR's problem statement explicitly identifies direct calls as the bug ("chat(), complete(), embed(), vision() — bypass [the fallback executor] entirely, which silently splits retry semantics"). Wiring feature services only would have left direct LlmServiceManager callers still bypassing the pipeline. Centralising on LlmServiceManager fixes both in one step and keeps feature services free of pipeline concerns.

Each follow-up is scoped to a single concern and keeps the codebase shippable after every step.

Embedding cache migration — done 

The inline cache branch that used to live in EmbeddingService::embedFull() has been moved behind CacheMiddleware:

  • EmbeddingResponse and UsageStatistics grew toArray() / fromArray() helpers so the typed response can round-trip through CacheMiddleware (which persists array<string, mixed> via the TYPO3 cache frontend).
  • LlmServiceManager::embed() derives a stable cache key via CacheManagerInterface::generateCacheKey() (same hash shape the old inline branch produced, so existing cache entries stay valid) and places it on the ProviderCallContext metadata under CacheMiddleware::METADATA_CACHE_KEY. cache_ttl == 0 (EmbeddingOptions::noCache()) omits the key so the middleware is a no-op — consistent with the old cacheTtl semantics.
  • The terminal now returns $response->toArray(); the manager reconstructs the typed EmbeddingResponse via EmbeddingResponse::fromArray() before returning to the caller. Public method signature is unchanged.
  • UsageMiddleware learned to also recognise the array-payload shape (['usage' => [...], 'provider' => '...']) so usage accounting stays consistent whether the pipeline produced a typed response (other operations) or an array (embeddings via CacheMiddleware).
  • EmbeddingService no longer depends on CacheManagerInterface; it is a pure vector-math façade on top of LlmServiceManager::embed().
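
The toArray() / fromArray() round-trip can be sketched as follows. Field names here are illustrative (the real EmbeddingResponse carries more, including UsageStatistics); the point is only that the typed object serialises to the array<string, mixed> shape the TYPO3 cache frontend persists and reconstructs losslessly on the way out:

```php
<?php
declare(strict_types=1);

// Sketch of the cache round-trip: typed response -> plain array -> typed response.
final class EmbeddingResponseSketch
{
    /** @param list<float> $embedding */
    public function __construct(
        public readonly array $embedding,
        public readonly string $provider,
    ) {}

    /** @return array<string, mixed> cache-frontend-safe payload */
    public function toArray(): array
    {
        return [
            'embedding' => $this->embedding,
            'provider'  => $this->provider,
        ];
    }

    /** @param array<string, mixed> $data */
    public static function fromArray(array $data): self
    {
        return new self($data['embedding'], $data['provider']);
    }
}
```

CacheMiddleware only ever sees the array form; the manager is the single place that knows how to fold it back into the typed object.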

Alternatives considered 

  • Per-operation pipelines (separate middleware stacks for chat / embed / vision / tools). Rejected: every middleware we can foresee — fallback, budget, usage, cache, retry, tracing — wants to run for multiple operations. Filtering inside a middleware via ProviderCallContext::operation is cheaper than maintaining N parallel stacks.
  • Generic ProviderRequest envelope with a mixed $payload. Rejected: forces every provider / middleware / test to downcast payloads. Keeping the payload inside the terminal closure preserves the typed signatures already defined by ProviderInterface and the capability interfaces.
  • PSR-15 directly (ServerRequestInterface / ResponseInterface shapes). Rejected: HTTP semantics do not fit an LLM call, mapping OpenAI's message array onto a ServerRequestInterface is lossy, and the extension already owns LlmConfiguration and typed response objects that are a better fit than a generic PSR-7 request.
  • Event dispatcher (PSR-14) pre/post hooks. Rejected: events cannot short-circuit, cannot substitute the call target, and cannot return a response to the caller — all three are load-bearing for fallback and cache middleware.

References 

  • Audit (2026-04-23): claim #1 — "No middleware pipeline — cross-cutting concerns are scattered or absent". Locally stored under claudedocs/audit-2026-04-23-architecture.md.
  • ADR-021 — Provider Fallback Chain (the behaviour this pipeline will eventually subsume).
  • ADR-025 — Per-User AI Budgets (budget primitive to be wired via BudgetMiddleware).