Epochly AI Inference: Architecture
How the inference accelerator works: framework detection, progressive enhancement levels, safety gates, and the optimization pipeline.
System Overview
The Epochly AI Inference Accelerator is an inference control layer for Python services. It observes, optimizes, gates, and proves impact on AI inference workloads without replacing the model runtime.
Four-Layer Architecture
```
Customer's Python Service (FastAPI / Ray Serve / custom)
        |
+=======+=============================================+  <-- Data Plane (in-process)
|                                                     |
| Serving Adapters        Request-level interception  |
|   ASGIMiddleware        ASGI middleware: timing, policy, admission
|   RayServeWrapper       Deployment wrapper (v1 preview)
|   LLMCompanion          vLLM/TGI: admission, caching, attribution
|   ABModelComparison     A/B testing: split/shadow traffic routing
|   GradualRollout        Staged traffic ramping with health checks
|
| Framework Adapters      Model-level interception
|   PyTorchAdapter        InferenceProxy(invoke=Module.__call__)
|   TransformersAdapter   InferenceProxy(invoke=pipeline.__call__)
|   OnnxAdapter           InferenceProxy(invoke=session.run)
|                                                     |
+=======+=============================================+
        |
  ModelRegistryClient   <-- Model loading, versioning, lineage
        |
  InferenceOptimizer    <-- Policy & orchestration
        |
  SafetyOrchestrator    <-- gate_optimization()
        |
  +-----+-----------+---------------+------------------+
  |                 |               |                  |
  Micro-Batching    Compilation     Cache              ValidatorRegistry
                    (torch.compile)                    (custom + built-in)
        |
  InferenceMetrics    --> Lens Dashboard
  CostEstimator
  PrometheusExporter  --> /metrics endpoint
  OTelExporter        --> OpenTelemetry SDK
```
Data Plane
In-process components that intercept and instrument inference traffic:
- Serving Adapters: ASGI middleware, Ray Serve wrappers, LLM companion
- A/B Testing: Split and shadow mode traffic routing between model variants, with Welch's t-test for statistical significance and gradual rollout
- Framework Adapters: Model-specific proxies for PyTorch, HuggingFace, ONNX
- Inference Proxies: Wrap individual model instances at the serving boundary
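To make the proxy idea concrete, here is a minimal sketch of what a framework adapter's interception layer could look like. The class name `InferenceProxy` is borrowed from the components above, but this implementation (latency timing and call counting only) is illustrative, not Epochly's actual code:

```python
import time
from typing import Any, Callable

class InferenceProxy:
    """Illustrative stand-in for a framework adapter's proxy: wraps a
    model's invoke callable (e.g. Module.__call__ or session.run) and
    records per-call latency."""

    def __init__(self, invoke: Callable[..., Any]) -> None:
        self._invoke = invoke
        self.call_count = 0
        self.total_latency_s = 0.0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return self._invoke(*args, **kwargs)
        finally:
            self.call_count += 1
            self.total_latency_s += time.perf_counter() - start

# Wrap any callable; real adapters would wrap the model's call path instead.
proxy = InferenceProxy(lambda x: x * 2)
result = proxy(21)
```

Because the wrapper only intercepts the call boundary, the model runtime itself is untouched, which is the core design point of the data plane.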
Model Registry
Version-tracked model loading from multiple backends (local, HuggingFace Hub, MLflow). Provides:
- Version tracking: SHA-256 content hashing with automatic cache invalidation signaling
- Lineage tracking: Provenance metadata (data version, code hash, hyperparameters, training metrics)
- Staged promotions: Lifecycle management (DRAFT -> STAGING -> PRODUCTION -> ARCHIVED) with optional approval callbacks and TOCTOU-safe atomic transitions
- Rollback: Revert to previous production version with full version history
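The interplay of staged promotion, the single-PRODUCTION invariant, and rollback history can be sketched as follows. This toy `Registry` is not the `ModelRegistryClient` API; it only illustrates the auto-archive and rollback behavior described above:

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict, Optional

class ModelStage(Enum):
    DRAFT = "draft"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"

class Registry:
    """Toy registry: one PRODUCTION version per model, with rollback
    history (approval callbacks and atomicity are omitted)."""

    def __init__(self) -> None:
        self.stages: Dict[str, ModelStage] = {}
        self.history: Deque[str] = deque(maxlen=100)  # prior production versions
        self.production: Optional[str] = None

    def promote(self, version: str, to: ModelStage) -> None:
        if to is ModelStage.PRODUCTION:
            if self.production is not None:
                # Auto-archive the current production version before replacing it.
                self.stages[self.production] = ModelStage.ARCHIVED
                self.history.append(self.production)
            self.production = version
        self.stages[version] = to

    def rollback(self) -> Optional[str]:
        if not self.history:
            return None
        previous = self.history.pop()
        if self.production is not None:
            self.stages[self.production] = ModelStage.ARCHIVED
        self.production = previous
        self.stages[previous] = ModelStage.PRODUCTION
        return previous
```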
Policy Engine
Per-endpoint configuration: batching window, precision mode, concurrency cap, cache TTL, fallback runtime, rollout percentage.
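A per-endpoint policy of this shape might look like the dataclass below. The field names mirror the knobs listed above, but the defaults and the class itself are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EndpointPolicy:
    """Hypothetical per-endpoint policy record; defaults are made up."""
    batching_window_ms: float = 5.0
    precision_mode: str = "fp16"       # e.g. "fp32", "fp16", "bf16"
    concurrency_cap: int = 64
    cache_ttl_s: int = 300
    fallback_runtime: str = "eager"
    rollout_percentage: float = 100.0

# A cautious rollout: tighter batching window, 10% of traffic.
policy = EndpointPolicy(batching_window_ms=2.0, rollout_percentage=10.0)
```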
Validation Harness
Safety controls: golden datasets, canary validation, circuit breakers, drift monitoring, fallback chains, validator registry, hysteresis control.
Control Plane
Lens dashboard for fleet-wide visibility and performance monitoring.
Module Structure
src/epochly/inference/__init__.py # Lazy init, zero cost if no ML frameworkdetector.py # Framework and model detectionprofiler.py # Inference-specific profiling + golden captureoptimizer.py # Optimization orchestratorcache.py # L1 in-memory cachecache_l2.py # L2 SQLite WAL + AES-256-GCM cachecache_l3.py # L3 Redis distributed cacheconfig.py # InferenceConfig dataclasscontext.py # RequestContext, BatchKeysecurity.py # Startup security validationprogression.py # L0->L1->L2 progression validation criteriabatch/dynamic_micro_batcher.py # Async micro-batching with keyed sub-queuesbatch_optimizer.pyrequest_queue.pycompilation/torch_compiler.py # torch.compile with pre-check and cachesafety_monitor.py # Graph break, memory, NaN monitoringframeworks/base_adapter.pyinference_proxy.pypytorch_adapter.pytransformers_adapter.pyonnx_adapter.pyregistry/model_registry.py # ModelRegistryClient, ModelStage, ModelLineageserving/ab_testing.py # ABModelComparison, MultiModelComparison, GradualRolloutfastapi_middleware.pyfastapi_dependency.pyllm_companion.py # vLLM/TGI control layerray_serve_wrapper.py # Ray Serve wrapper (v1 preview)metrics/inference_metrics.pycost_estimator.pyotel_exporter.py # OpenTelemetry instrument mappingprometheus_exporter.py # Prometheus exposition format exportsafety/canary_validator.pycircuit_breaker.pydrift_monitor.py # EWMA-based online drift detectionfallback_chain.pygolden_store.pyhysteresis.py # Anti-flapping state transition controlprivacy.py # InputRedactor, TenantIsolation, AuditLoggersafety_orchestrator.pyvalidator_registry.py # Custom workload validator plugin registry
Enhancement Levels
| Level | Name | Description |
|---|---|---|
| L0 | Profiling | Framework detection + model profiling |
| L1 | Pre/Post Optimization | CPU parallelization, tokenizer caching |
| L2a | Micro-Batching | Dynamic request batching |
| L2b | Compilation | torch.compile with golden validation |
| L3+ | Verified Optimize | Quantization, ONNX export (v2) |
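The level progression can be illustrated with a deliberately simplified gating rule. The level names come from the table; the `next_level` function and its boolean criteria argument are hypothetical (the actual criteria live in `progression.py`):

```python
ORDER = ["L0", "L1", "L2a", "L2b"]

def next_level(current: str, criteria_passed: bool) -> str:
    """Advance exactly one level, and only when the current level's
    validation criteria pass (a toy rule, not the real criteria)."""
    i = ORDER.index(current)
    if criteria_passed and i + 1 < len(ORDER):
        return ORDER[i + 1]
    return current
```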
Cache Architecture
Three-tier cache with progressive latency/capacity tradeoff:
| Tier | Access latency | Capacity | Properties |
|---|---|---|---|
| L1 (In-Memory LRU) | ~1 us | 10K entries | Process-local, thread-safe |
| L2 (SQLite WAL) | ~1 ms | 1 GB on disk | AES-256-GCM, TTL enforcement |
| L3 (Redis) | ~5 ms | Distributed | TLS + key prefix, TTL via EXPIRE |
Lookup order: L1 -> L2 -> L3. A hit in a slower tier promotes the entry into the faster tiers.
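The lookup-with-promotion behavior reduces to a few lines. In this sketch, plain dicts stand in for the real L1/L2/L3 backends (LRU map, SQLite, Redis); only the tiering logic is shown:

```python
from typing import Any, Dict, List, Optional

class TieredCache:
    """Toy three-tier lookup with promotion on slower-tier hits."""

    def __init__(self) -> None:
        self.tiers: List[Dict[str, Any]] = [{}, {}, {}]  # L1, L2, L3

    def get(self, key: str) -> Optional[Any]:
        for i, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                for faster in self.tiers[:i]:  # promote into faster tiers
                    faster[key] = value
                return value
        return None

    def put(self, key: str, value: Any, tier: int = 0) -> None:
        self.tiers[tier][key] = value

cache = TieredCache()
cache.put("k", "v", tier=2)  # present only in L3
hit = cache.get("k")         # L3 hit; "k" is promoted into L1 and L2
```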
Safety Architecture
Every optimization goes through:
- Golden output capture during L0 profiling (via InferenceProfiler golden callback)
- Pre-compile check via torch._dynamo.explain()
- Canary validation with workload-specific validators (built-in or custom via ValidatorRegistry)
- Circuit breaker per optimization
- Drift monitoring via EWMA-based DriftMonitor with shadow comparisons
- Hysteresis control to prevent state flapping (asymmetric enable/disable thresholds)
- Fallback chain for graceful degradation
- Alert hooks via ValidatorRegistry alert callbacks on FAIL reports
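At its core, a gate like the one above is a conjunction of safety checks: any single failure vetoes the optimization. The sketch below borrows the `gate_optimization` name from the text; the real `SafetyOrchestrator` is of course more involved than an `all()` over callables:

```python
from typing import Callable

def gate_optimization(*checks: Callable[[], bool]) -> bool:
    """Illustrative composite gate: apply an optimization only if every
    safety check passes; any failure vetoes it."""
    return all(check() for check in checks)

applied = gate_optimization(
    lambda: True,   # e.g. golden outputs matched
    lambda: True,   # e.g. pre-compile check found no graph breaks
    lambda: False,  # e.g. circuit breaker is open -> veto
)
```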
Validator Registry
The ValidatorRegistry provides a plugin interface for custom workload validators. Built-in validators cover standard workload types (embedding, classifier, reranker, generation, encoder). Enterprise users can register domain-specific validators (financial accuracy, medical terminology, etc.) that override built-in validators for the same workload type.
Key properties:
- Custom validators override built-in validators; unregistering restores the original
- Supports both synchronous and asynchronous validators
- Alert callbacks fire on FAIL results for integration with PagerDuty, Slack, etc.
- InputSchemaValidator provides pre-inference input data validation (dtype, shape, range)
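The override/restore behavior can be sketched with two layered lookup tables. This is only an illustration of the first property above; the real registry also handles async validators and alert callbacks:

```python
from typing import Callable, Dict, Optional

Validator = Callable[[object], str]  # returns e.g. "PASS" or "FAIL"

class ValidatorRegistry:
    """Sketch of custom-over-builtin resolution with restore-on-unregister."""

    def __init__(self, builtins: Dict[str, Validator]) -> None:
        self._builtins = dict(builtins)
        self._custom: Dict[str, Validator] = {}

    def register(self, workload: str, validator: Validator) -> None:
        self._custom[workload] = validator  # shadows the built-in

    def unregister(self, workload: str) -> None:
        self._custom.pop(workload, None)    # built-in becomes visible again

    def get(self, workload: str) -> Optional[Validator]:
        return self._custom.get(workload, self._builtins.get(workload))

builtin = lambda outputs: "PASS"
custom = lambda outputs: "FAIL"
registry = ValidatorRegistry(builtins={"embedding": builtin})
```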
Drift Monitoring
The DriftMonitor samples live traffic using shadow execution, computes workload-specific comparison metrics via the same validators as canary validation, and applies EWMA smoothing for streaming drift detection. When drift exceeds thresholds, the circuit breaker trips and the optimization is disabled.
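The EWMA smoothing step is the standard exponentially weighted moving average. A toy detector over per-sample divergence scores (the `alpha` and `threshold` defaults here are invented, not the product's):

```python
from typing import Optional

class EwmaDrift:
    """Toy EWMA drift detector: smooth a per-sample divergence score and
    report drift once the smoothed value crosses a threshold."""

    def __init__(self, alpha: float = 0.1, threshold: float = 0.2) -> None:
        self.alpha = alpha
        self.threshold = threshold
        self.ewma: Optional[float] = None

    def update(self, divergence: float) -> bool:
        if self.ewma is None:
            self.ewma = divergence  # seed with the first observation
        else:
            self.ewma = self.alpha * divergence + (1 - self.alpha) * self.ewma
        return self.ewma > self.threshold
```

Because the EWMA discounts old samples geometrically, a brief divergence spike decays away, while sustained divergence ratchets the smoothed value above the threshold.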
Hysteresis Control
The HysteresisController prevents optimization state oscillation:
- Disable: Immediate on circuit breaker trip
- Re-enable: Requires minimum off-duration (default 300s) AND N consecutive canary PASS results (not MARGINAL)
- Minimum on-duration: Prevents disabling too quickly before meaningful data is collected (default 60s)
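The asymmetric re-enable rule above can be sketched as a small state machine. The 300s default matches the text; `passes_required` and the method names are invented for illustration (minimum on-duration is omitted for brevity):

```python
class HysteresisController:
    """Sketch: disable immediately on a trip; re-enable only after a
    minimum off-duration AND a streak of PASS canary results
    (MARGINAL breaks the streak)."""

    def __init__(self, min_off_s: float = 300.0, passes_required: int = 3) -> None:
        self.min_off_s = min_off_s
        self.passes_required = passes_required
        self.enabled = True
        self._disabled_at = 0.0
        self._streak = 0

    def trip(self, now: float) -> None:
        self.enabled = False
        self._disabled_at = now
        self._streak = 0

    def record_canary(self, result: str, now: float) -> None:
        if self.enabled:
            return
        self._streak = self._streak + 1 if result == "PASS" else 0
        if (now - self._disabled_at >= self.min_off_s
                and self._streak >= self.passes_required):
            self.enabled = True
```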
Model Registry Architecture
```
ModelRegistryClient
|
+-- Backend Loaders
|   +-- local: File read + SHA-256 hash
|   +-- huggingface: transformers.AutoModel.from_pretrained
|   +-- mlflow: mlflow.pytorch.load_model
|
+-- Version Tracking
|   +-- SHA-256 content hashing
|   +-- on_version_change callback for cache invalidation
|
+-- Lineage Tracking (ModelLineage)
|   +-- data_version, code_hash, hyperparameters
|   +-- training_metrics, parent_model
|
+-- Staged Promotions (ModelStage)
|   +-- DRAFT -> STAGING -> PRODUCTION -> ARCHIVED
|   +-- Optional approval callbacks
|   +-- TOCTOU-safe atomic verify+apply
|   +-- Single PRODUCTION version per model (auto-archive)
|
+-- Rollback
    +-- Production version history (deque, max 100)
    +-- Archive current, restore previous
```
A/B Testing Architecture
```
ABModelComparison
|
+-- Split Mode: Route to A or B per request
|   +-- Random selection based on traffic_split probability
|   +-- Per-model latency and error metrics
|
+-- Shadow Mode: Run both, return A
|   +-- B runs after A (no latency impact on critical path)
|   +-- Both results recorded for comparison
|
+-- Statistical Analysis
    +-- Welch's t-test (scipy or manual fallback)
    +-- Confidence intervals
    +-- Sample size tracking

MultiModelComparison (A/B/n)
+-- Weighted traffic distribution across N models
+-- Cumulative weight routing for efficient selection

GradualRollout
+-- Stepped traffic ramping: 1% -> 5% -> 10% -> 25% -> 50% -> 100%
+-- Health check gates at each step
+-- Minimum dwell time enforcement per step
```
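For reference, a manual Welch's t computation of the kind usable as a scipy-free fallback looks like this (function name and signature are mine, not the module's API; `var_a`/`var_b` are sample variances):

```python
import math

def welch_t(mean_a: float, var_a: float, n_a: int,
            mean_b: float, var_b: float, n_b: int) -> tuple:
    """Welch's t statistic plus Welch-Satterthwaite degrees of freedom,
    computed from summary statistics of the two latency samples."""
    se_a, se_b = var_a / n_a, var_b / n_b          # squared standard errors
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df
```

Unlike Student's t-test, Welch's test does not assume equal variances between the A and B arms, which is why it suits latency comparisons between different model variants.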
Metrics Architecture
```
InferenceMetrics (core collector)
|
+-- PrometheusExporter.export() --> /metrics HTTP endpoint
|   +-- Counters, gauges, histograms
|   +-- Per-model labels (model_name, model_version, endpoint)
|   +-- Exemplars for trace correlation
|   +-- Cost metrics (baseline vs optimized)
|
+-- InferenceOTelExporter.export() --> OTel SDK
|   +-- Spec Section 8.1 instrument names
|   +-- Safety metrics (canary, circuit breaker, drift, hysteresis)
|
+-- CostEstimator
    +-- Per-1k request cost (baseline vs optimized)
    +-- Projected hourly/monthly savings
```
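The CostEstimator's per-1k projection is plain arithmetic; the sketch below assumes a 30-day month, and the function name and dollar figures are made up for illustration:

```python
def projected_savings(cost_per_1k_baseline: float,
                      cost_per_1k_optimized: float,
                      requests_per_hour: float) -> dict:
    """Scale the per-1k-request cost delta to hourly and monthly volume."""
    delta_per_1k = cost_per_1k_baseline - cost_per_1k_optimized
    hourly = delta_per_1k * requests_per_hour / 1000.0
    return {"hourly": hourly, "monthly": hourly * 24 * 30}

# $0.50 vs $0.25 per 1k requests at 100k requests/hour:
savings = projected_savings(0.50, 0.25, 100_000)
```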
Security Model
Startup validation checks:
- Cache tenant isolation
- Privacy control configuration (mode, redaction patterns, audit logging)
- Safety bypass resistance
- Input size limits
Privacy controls:
- InputRedactor: Regex PII scrubbing before any data is stored
- TenantIsolation: Namespace cache keys by tenant/deployment ID
- AuditLogger: Append-only structured log of safety decisions
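The InputRedactor pattern reduces to substituting placeholder tokens before anything touches storage. The two patterns below (email, US SSN) are illustrative stand-ins, not the product's actual rule set:

```python
import re

class InputRedactor:
    """Sketch of regex-based PII scrubbing applied before data is stored."""

    PATTERNS = [
        (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
        (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    ]

    def redact(self, text: str) -> str:
        for pattern, token in self.PATTERNS:
            text = pattern.sub(token, text)
        return text

clean = InputRedactor().redact("contact alice@example.com, SSN 123-45-6789")
```

Running redaction at the ingestion boundary, before caching or logging, keeps the rest of the pipeline (L2/L3 caches, audit logs) free of raw PII by construction.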