ADR-026: Provider Middleware Pipeline 

Status

Accepted

Date

2026-04

Authors

Netresearch DTT GmbH

Context 

Every provider call in the extension is wrapped by the same cross-cutting concerns — or rather, it should be, but today those concerns are scattered:

  • FallbackChainExecutor (Classes/Service/FallbackChainExecutor.php) is a try primary / catch / foreach fallbacks loop with two retryable exception types hardcoded. It has no pre/post hooks and no composition seam.
  • It is applied only to database-backed configuration paths in LlmServiceManager::runWithFallback(). Direct calls — chat(), complete(), embed(), vision() — bypass it entirely, which silently splits retry semantics.
  • BudgetService::check() (ADR-025) and UsageTrackerService::trackUsage() are primitives that no feature service actually calls. Budget enforcement and usage accounting must be remembered by every caller, which is a silent footgun.
  • HTTP-level retry with back-off lives inside AbstractProvider (sendRequest()). That is the wrong layer — a rate-limited provider should be swapped, not retried in-place.
  • Cache lookup exists only inside EmbeddingService as ad-hoc branches. There is no way to plug it in for deterministic completion scenarios (seed / temperature 0) without duplicating the branch.

The end result is that every new cross-cutting requirement — PII redaction, prompt logging, trace correlation, per-provider rate limits, circuit breakers, a cost calculator — forces either a bespoke branch in every feature service or a subclass of one of the god classes.

Decision 

Introduce a PSR-15-inspired middleware pipeline under Classes/Provider/Middleware/:

The contract:
interface ProviderMiddlewareInterface
{
    public function handle(
        ProviderCallContext $context,
        LlmConfiguration $configuration,
        callable $next,           // callable(LlmConfiguration): mixed
    ): mixed;
}

Each middleware receives

  1. an immutable ProviderCallContext (operation kind, correlation id, metadata map),
  2. the current LlmConfiguration,
  3. a $next callable that continues the pipeline.

and decides whether to pass through, short-circuit, swap the configuration, or wrap the call with before/after logic. MiddlewarePipeline::run() composes an ordered stack of them around a terminal callable in classic onion fashion — the first-registered middleware is the outermost layer.
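
The onion composition can be sketched as follows. This is a minimal illustration, not the extension's real `MiddlewarePipeline` (the class name `PipelineSketch` and the untyped `object` parameters are simplifications); it only shows the fold that makes the first-registered middleware the outermost layer:

```php
<?php
declare(strict_types=1);

// Illustrative sketch of the onion composition: fold the middleware stack in
// reverse so the first-registered middleware ends up as the outermost layer
// around the terminal callable.
final class PipelineSketch
{
    /**
     * @param iterable<callable> $middlewares each: callable(context, configuration, next): mixed
     * @param callable           $terminal    callable(configuration): mixed
     */
    public static function run(
        iterable $middlewares,
        object $context,
        object $configuration,
        callable $terminal,
    ): mixed {
        // Materialise generators so the stack can be reversed.
        $stack = is_array($middlewares) ? $middlewares : iterator_to_array($middlewares, false);

        $next = $terminal;
        foreach (array_reverse($stack) as $middleware) {
            $previous = $next;
            // Each layer wraps the remainder of the pipeline.
            $next = fn(object $config): mixed => $middleware($context, $config, $previous);
        }

        return $next($configuration);
    }
}
```

With two middleware A and B registered in that order, the call order is A-before, B-before, terminal, B-after, A-after.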

The payload — messages, embedding input, tool specs, vision content — stays captured in the terminal callable. That keeps the existing typed response objects (CompletionResponse, EmbeddingResponse, VisionResponse) intact on the return side and avoids inventing a generic ProviderRequest envelope that would then have to know about every operation variant.

Registration 

Implementations are discovered via the nr_llm.provider_middleware tag, which AutoconfigureTag applies automatically to every class implementing the interface. The pipeline's constructor injects the collected middleware via AutowireIterator. Middleware run in tag-priority order; treat the priority value as an ordering hint only.
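
Under the assumed Symfony DI attributes this wiring reduces to two annotations (the `handle()` signature is simplified to `object` parameters here; the attribute placement is illustrative):

```php
<?php
declare(strict_types=1);

// Sketch of the registration wiring. The Symfony attribute classes are only
// referenced, never reflected, so this parses and runs without Symfony
// installed; in the real container they do the tagging and collection.
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;
use Symfony\Component\DependencyInjection\Attribute\AutowireIterator;

// Tagging the interface tags every implementation automatically.
#[AutoconfigureTag('nr_llm.provider_middleware')]
interface ProviderMiddlewareInterface
{
    public function handle(object $context, object $configuration, callable $next): mixed;
}

final class MiddlewarePipeline
{
    /** @param iterable<ProviderMiddlewareInterface> $middlewares ordered by tag priority */
    public function __construct(
        #[AutowireIterator('nr_llm.provider_middleware')]
        private readonly iterable $middlewares,
    ) {}
}
```

A contributor's middleware therefore needs no Services.yaml entry: implementing the interface is what opts the class into the tag.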

Contributors can add behaviour without touching Services.yaml — implement the interface, drop the class under Classes/Provider/Middleware/, and you are done.

Scope of this ADR 

Infrastructure only. No behaviour change in this PR:

  • ProviderMiddlewareInterface, MiddlewarePipeline, ProviderCallContext, ProviderOperation enum.
  • Unit tests covering empty pipeline, single/multiple composition, short-circuit, configuration substitution, context propagation, generator-based iterables.
  • This ADR.

FallbackChainExecutor stays untouched. Feature services continue to work exactly as they do today. The pipeline is opt-in: consumers have to build a terminal callable and call MiddlewarePipeline::run() to use it.
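
The opt-in call shape looks roughly like this: the consumer captures the payload in a terminal closure and a middleware may short-circuit without the terminal ever running. Everything here is an illustrative stand-in (closures instead of the real classes; the `cacheKey` metadata key is hypothetical), not the extension's API:

```php
<?php
declare(strict_types=1);

// Sketch: payload stays in the terminal closure; a cache-style middleware
// short-circuits on a hit and passes through (filling the cache) on a miss.
$cache = ['vec:hello' => [0.1, 0.2]];
$providerCalls = 0;

// Terminal callable — callable(configuration): mixed — owns the payload.
$input = 'hello';
$terminal = function (object $configuration) use ($input, &$providerCalls): array {
    $providerCalls++;
    return [0.1, 0.2]; // stand-in for the provider embedding of $input
};

// Cache middleware: reads its key from context metadata, never the payload.
$cacheMiddleware = function (object $context, object $configuration, callable $next) use (&$cache) {
    $key = $context->metadata['cacheKey'] ?? null;
    if ($key !== null && isset($cache[$key])) {
        return $cache[$key];          // short-circuit: terminal never runs
    }
    $result = $next($configuration);  // pass through on a miss
    if ($key !== null) {
        $cache[$key] = $result;
    }
    return $result;
};

$context = new stdClass();
$context->metadata = ['cacheKey' => 'vec:hello'];
$result = $cacheMiddleware($context, new stdClass(), $terminal);
```

On the hit above the terminal is never invoked, which is exactly the behaviour an event dispatcher cannot express (see Alternatives considered).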

Follow-ups 

Each item below is a separate PR that lands one behaviour at a time, so the test matrix keeps green end-to-end:

  1. FallbackMiddleware — port FallbackChainExecutor to the interface. LlmServiceManager::runWithFallback() stops instantiating the executor directly and runs the pipeline instead. Retry semantics become identical for every call path, not just database-backed ones. Deprecate the standalone executor.
  2. BudgetMiddleware — call BudgetService::check() before $next; throw a typed BudgetExceededException on denial so controllers can report which bucket tripped.
  3. UsageMiddleware — after $next returns, hand the response to UsageTrackerService::trackUsage(). Centralises cost/token accounting regardless of which feature service made the call.
  4. CacheMiddleware — opt-in per operation via ProviderOperation. Embedding lookups start going through it; the branch currently inside EmbeddingService comes out.
  5. Direct-method wiring (centralised) — every direct API method on LlmServiceManager (chat, complete, embed, vision, chatWithTools) builds its terminal callable and invokes the pipeline via a synthesised transient LlmConfiguration. Because every feature service (CompletionService, EmbeddingService, TranslationService, VisionService) delegates to these methods, feature-service traffic inherits the full middleware stack without each service owning its own pipeline glue.

    The transient configuration is unpersisted (no uid), carries an empty fallback chain (so FallbackMiddleware passes through verbatim), and uses a human-readable ad-hoc:<operation>:<provider> identifier so log / trace labels distinguish direct traffic from configuration-backed calls. Middleware that needs more context (beUserUid for BudgetMiddleware, cache keys for CacheMiddleware) reads it from the ProviderCallContext metadata, not from the configuration.

    Streaming (streamChat / streamChatWithConfiguration) deliberately stays out of the pipeline per the ADR's original scope: once the first chunk has been emitted, we cannot swap providers mid-stream, and most middleware assume a single terminal result.

    Why the centralised form rather than "every feature service owns glue": the ADR's problem statement explicitly identifies direct calls as the bug ("chat(), complete(), embed(), vision() — bypass [the fallback executor] entirely, which silently splits retry semantics"). Wiring feature services only would have left direct LlmServiceManager callers still bypassing the pipeline. Centralising on LlmServiceManager fixes both in one step and keeps feature services free of pipeline concerns.

Each follow-up is scoped to a single concern and keeps the codebase shippable after every step.

Embedding cache migration — done 

The inline cache branch that used to live in EmbeddingService::embedFull() has been moved behind CacheMiddleware:

  • EmbeddingResponse and UsageStatistics grew toArray() / fromArray() helpers so the typed response can round-trip through CacheMiddleware (which persists array<string, mixed> via the TYPO3 cache frontend).
  • LlmServiceManager::embed() derives a stable cache key via CacheManagerInterface::generateCacheKey() (same hash shape the old inline branch produced, so existing cache entries stay valid) and places it on the ProviderCallContext metadata under CacheMiddleware::METADATA_CACHE_KEY. cache_ttl == 0 (EmbeddingOptions::noCache()) omits the key so the middleware is a no-op — consistent with the old cacheTtl semantics.
  • The terminal now returns $response->toArray(); the manager reconstructs the typed EmbeddingResponse via EmbeddingResponse::fromArray() before returning to the caller. Public method signature is unchanged.
  • UsageMiddleware learned to also recognise the array-payload shape (['usage' => [...], 'provider' => '...']) so usage accounting stays consistent whether the pipeline produced a typed response (other operations) or an array (embeddings via CacheMiddleware).
  • EmbeddingService no longer depends on CacheManagerInterface; it is a pure vector-math façade on top of LlmServiceManager::embed().
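
The toArray() / fromArray() round-trip can be sketched as follows. Field names here are illustrative (the real EmbeddingResponse carries more, including UsageStatistics); the point is only that the typed object serialises to the array<string, mixed> shape the TYPO3 cache frontend persists and reconstructs losslessly on the way out:

```php
<?php
declare(strict_types=1);

// Sketch of the cache round-trip: typed response -> plain array -> typed response.
final class EmbeddingResponseSketch
{
    /** @param list<float> $embedding */
    public function __construct(
        public readonly array $embedding,
        public readonly string $provider,
    ) {}

    /** @return array<string, mixed> cache-frontend-safe payload */
    public function toArray(): array
    {
        return [
            'embedding' => $this->embedding,
            'provider'  => $this->provider,
        ];
    }

    /** @param array<string, mixed> $data */
    public static function fromArray(array $data): self
    {
        return new self($data['embedding'], $data['provider']);
    }
}
```

CacheMiddleware only ever sees the array form; the manager is the single place that knows how to fold it back into the typed object.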

Alternatives considered 

  • Per-operation pipelines (separate middleware stacks for chat / embed / vision / tools). Rejected: every middleware we can foresee — fallback, budget, usage, cache, retry, tracing — wants to run for multiple operations. Filtering inside a middleware via ProviderCallContext::operation is cheaper than maintaining N parallel stacks.
  • Generic ProviderRequest envelope with a mixed $payload. Rejected: forces every provider / middleware / test to downcast payloads. Keeping the payload inside the terminal closure preserves the typed signatures already defined by ProviderInterface and the capability interfaces.
  • PSR-15 directly (ServerRequestInterface / ResponseInterface shapes). Rejected: HTTP semantics do not fit an LLM call, mapping OpenAI's message array onto a ServerRequestInterface is lossy, and the extension already owns LlmConfiguration and typed response objects that are a better fit than a generic PSR-7 request.
  • Event dispatcher (PSR-14) pre/post hooks. Rejected: events cannot short-circuit, cannot substitute the call target, and cannot return a response to the caller — all three are load-bearing for fallback and cache middleware.

References 

  • Audit (2026-04-23): claim #1 — "No middleware pipeline — cross-cutting concerns are scattered or absent". Locally stored under claudedocs/audit-2026-04-23-architecture.md.
  • ADR-021 — Provider Fallback Chain (the behaviour this pipeline will eventually subsume).
  • ADR-025 — Per-User AI Budgets (budget primitive to be wired via BudgetMiddleware).