Epochly AI Inference: Architecture

How the inference accelerator works: framework detection, progressive enhancement levels, safety gates, and the optimization pipeline.

System Overview

The Epochly AI Inference Accelerator is an inference control layer for Python services. It observes, optimizes, gates, and proves impact on AI inference workloads without replacing the model runtime.

Four-Layer Architecture

Customer's Python Service
(FastAPI / Ray Serve / custom)
          |
+=========+==========+  <-- Data Plane (in-process)
|                    |
| Serving Adapters   |  Request-level interception
|  ASGIMiddleware    |   ASGI middleware: timing, policy, admission
|  RayServeWrapper   |   Deployment wrapper (v1 preview)
|  LLMCompanion      |   vLLM/TGI: admission, caching, attribution
|  ABModelComparison |   A/B testing: split/shadow traffic routing
|  GradualRollout    |   Staged traffic ramping with health checks
|                    |
| Framework Adapters |  Model-level interception
|  PyTorchAdapter    |   InferenceProxy(invoke=Module.__call__)
| TransformersAdapter|   InferenceProxy(invoke=pipeline.__call__)
|  OnnxAdapter       |   InferenceProxy(invoke=session.run)
|                    |
+=========+==========+
          |
  ModelRegistryClient    <-- Model loading, versioning, lineage
          |
  InferenceOptimizer     <-- Policy & orchestration
          |
  SafetyOrchestrator     <-- gate_optimization()
          |
+---------+---------+---------+------------------+
|                   |         |                  |
Micro-         Compilation  Cache       ValidatorRegistry
Batching    (torch.compile)          (custom + built-in)
          |
InferenceMetrics     --> Lens Dashboard
CostEstimator
PrometheusExporter   --> /metrics endpoint
OTelExporter         --> OpenTelemetry SDK

Data Plane

In-process components that intercept and instrument inference traffic:

  • Serving Adapters: ASGI middleware, Ray Serve wrappers, LLM companion
  • A/B Testing: Split and shadow mode traffic routing between model variants, with Welch's t-test statistical significance and gradual rollout
  • Framework Adapters: Model-specific proxies for PyTorch, HuggingFace, ONNX
  • Inference Proxies: Wrap individual model instances at the serving boundary
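
To make the request-level interception concrete, here is a minimal sketch of a timing ASGI middleware in the spirit of ASGIMiddleware. The class name, constructor, and callback are illustrative assumptions, not Epochly's actual API; the real adapter also applies policy and admission control.

```python
import time

class TimingMiddleware:
    """Illustrative ASGI middleware: record per-request latency.

    Hypothetical sketch; `on_complete(path, seconds)` stands in for
    the metrics pipeline described above.
    """

    def __init__(self, app, on_complete):
        self.app = app
        self.on_complete = on_complete

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through lifespan/websocket traffic untouched.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            # Record latency even when the wrapped app raises.
            self.on_complete(scope["path"], time.perf_counter() - start)
```

Because ASGI middleware wraps the app callable, it composes with FastAPI, Starlette, or any other ASGI framework without touching handler code.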

Model Registry

Version-tracked model loading from multiple backends (local, HuggingFace Hub, MLflow). Provides:

  • Version tracking: SHA-256 content hashing with automatic cache invalidation signaling
  • Lineage tracking: Provenance metadata (data version, code hash, hyperparameters, training metrics)
  • Staged promotions: Lifecycle management (DRAFT -> STAGING -> PRODUCTION -> ARCHIVED) with optional approval callbacks and TOCTOU-safe atomic transitions
  • Rollback: Revert to previous production version with full version history
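
The version-tracking behavior above can be sketched in a few lines: hash the model content with SHA-256 and fire a callback when the hash changes, so caches keyed on the old version can invalidate. Class and method names here are illustrative, not ModelRegistryClient's real interface.

```python
import hashlib

class VersionTracker:
    """Sketch of SHA-256 content hashing with a change callback."""

    def __init__(self, on_version_change=None):
        self._versions = {}  # model name -> last observed content hash
        self.on_version_change = on_version_change

    def observe(self, name, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        previous = self._versions.get(name)
        if previous is not None and previous != digest and self.on_version_change:
            # Content changed: signal downstream caches to invalidate.
            self.on_version_change(name, previous, digest)
        self._versions[name] = digest
        return digest
```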

Policy Engine

Per-endpoint configuration: batching window, precision mode, concurrency cap, cache TTL, fallback runtime, rollout percentage.
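
A per-endpoint policy record might look like the following dataclass. The field names mirror the knobs listed above, but the exact schema, defaults, and units are assumptions for illustration, not Epochly's InferenceConfig.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EndpointPolicy:
    """Hypothetical per-endpoint policy; values shown are placeholders."""
    batching_window_ms: float = 8.0         # max wait to fill a micro-batch
    precision_mode: str = "fp16"            # e.g. "fp32", "fp16", "bf16"
    concurrency_cap: int = 64               # max in-flight requests
    cache_ttl_s: int = 300                  # response cache TTL
    fallback_runtime: Optional[str] = None  # e.g. "eager" if compile fails
    rollout_percentage: float = 100.0       # share of traffic under this policy
```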

Validation Harness

Safety controls: golden datasets, canary validation, circuit breakers, drift monitoring, fallback chains, validator registry, hysteresis control.

Control Plane

Lens dashboard for fleet-wide visibility and performance monitoring.

Module Structure

src/epochly/inference/
  __init__.py                  # Lazy init, zero cost if no ML framework
  detector.py                  # Framework and model detection
  profiler.py                  # Inference-specific profiling + golden capture
  optimizer.py                 # Optimization orchestrator
  cache.py                     # L1 in-memory cache
  cache_l2.py                  # L2 SQLite WAL + AES-256-GCM cache
  cache_l3.py                  # L3 Redis distributed cache
  config.py                    # InferenceConfig dataclass
  context.py                   # RequestContext, BatchKey
  security.py                  # Startup security validation
  progression.py               # L0->L1->L2 progression validation criteria
  batch/
    dynamic_micro_batcher.py   # Async micro-batching with keyed sub-queues
    batch_optimizer.py
    request_queue.py
  compilation/
    torch_compiler.py          # torch.compile with pre-check and cache
    safety_monitor.py          # Graph break, memory, NaN monitoring
  frameworks/
    base_adapter.py
    inference_proxy.py
    pytorch_adapter.py
    transformers_adapter.py
    onnx_adapter.py
  registry/
    model_registry.py          # ModelRegistryClient, ModelStage, ModelLineage
  serving/
    ab_testing.py              # ABModelComparison, MultiModelComparison, GradualRollout
    fastapi_middleware.py
    fastapi_dependency.py
    llm_companion.py           # vLLM/TGI control layer
    ray_serve_wrapper.py       # Ray Serve wrapper (v1 preview)
  metrics/
    inference_metrics.py
    cost_estimator.py
    otel_exporter.py           # OpenTelemetry instrument mapping
    prometheus_exporter.py     # Prometheus exposition format export
  safety/
    canary_validator.py
    circuit_breaker.py
    drift_monitor.py           # EWMA-based online drift detection
    fallback_chain.py
    golden_store.py
    hysteresis.py              # Anti-flapping state transition control
    privacy.py                 # InputRedactor, TenantIsolation, AuditLogger
    safety_orchestrator.py
    validator_registry.py      # Custom workload validator plugin registry

Enhancement Levels

Level  Name                   Description
L0     Profiling              Framework detection + model profiling
L1     Pre/Post Optimization  CPU parallelization, tokenizer caching
L2a    Micro-Batching         Dynamic request batching
L2b    Compilation            torch.compile with golden validation
L3+    Verified Optimize      Quantization, ONNX export (v2)

Cache Architecture

Three-tier cache with progressive latency/capacity tradeoff:

L1 (In-Memory LRU)    L2 (SQLite WAL)       L3 (Redis)
- ~1us access         - ~1ms access         - ~5ms access
- 10K entries         - 1GB on disk         - Distributed
- Process-local       - AES-256-GCM         - TLS + key prefix
- Thread-safe         - TTL enforcement     - TTL via EXPIRE

Lookup order: L1 -> L2 -> L3. A hit in a slower tier is promoted back into the faster tiers.
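
The lookup-and-promote pattern can be sketched with plain dicts standing in for the real tiers (LRU memory, SQLite, Redis). This is a minimal illustration of the tiering logic only, not the shipped cache code.

```python
class TieredCache:
    """Minimal sketch of L1 -> L2 -> L3 lookup with promotion on hit."""

    def __init__(self):
        self.tiers = [{}, {}, {}]  # index 0 = L1 (fastest), 2 = L3 (slowest)

    def get(self, key):
        for i, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                # Promote the hit into every faster tier.
                for faster in self.tiers[:i]:
                    faster[key] = value
                return value
        return None  # miss in all tiers

    def put(self, key, value, tier=0):
        self.tiers[tier][key] = value
```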

Safety Architecture

Every optimization goes through:

  1. Golden output capture during L0 profiling (via InferenceProfiler golden callback)
  2. Pre-compile check via torch._dynamo.explain()
  3. Canary validation with workload-specific validators (built-in or custom via ValidatorRegistry)
  4. Circuit breaker per optimization
  5. Drift monitoring via EWMA-based DriftMonitor with shadow comparisons
  6. Hysteresis control to prevent state flapping (asymmetric enable/disable thresholds)
  7. Fallback chain for graceful degradation
  8. Alert hooks via ValidatorRegistry alert callbacks on FAIL reports
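
Of these gates, the per-optimization circuit breaker (step 4) is the simplest to sketch: open after a run of consecutive failures so the fallback chain takes over. The threshold and reset-on-success behavior here are illustrative assumptions, not the shipped implementation.

```python
class CircuitBreakerSketch:
    """Per-optimization circuit breaker sketch (threshold is assumed)."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False  # open == optimization disabled

    def record(self, ok: bool):
        if ok:
            self.failures = 0  # any success resets the streak
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # fall back to the baseline path
```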

Validator Registry

The ValidatorRegistry provides a plugin interface for custom workload validators. Built-in validators cover standard workload types (embedding, classifier, reranker, generation, encoder). Enterprise users can register domain-specific validators (financial accuracy, medical terminology, etc.) that override built-in validators for the same workload type.

Key properties:

  • Custom validators override built-in validators; unregistering restores the original
  • Supports both synchronous and asynchronous validators
  • Alert callbacks fire on FAIL results for integration with PagerDuty, Slack, etc.
  • InputSchemaValidator provides pre-inference input data validation (dtype, shape, range)
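
The override/restore and alert-callback behavior described above can be sketched as follows. This is a behavioral illustration, not the actual ValidatorRegistry class; validators here are plain callables returning "PASS", "MARGINAL", or "FAIL".

```python
class ValidatorRegistrySketch:
    """Sketch: custom validators shadow built-ins; unregistering restores."""

    def __init__(self):
        self._builtin = {}
        self._custom = {}
        self._alerts = []

    def register_builtin(self, workload, fn):
        self._builtin[workload] = fn

    def register(self, workload, fn):
        self._custom[workload] = fn  # overrides the built-in, if any

    def unregister(self, workload):
        self._custom.pop(workload, None)  # built-in is restored

    def add_alert_callback(self, cb):
        self._alerts.append(cb)

    def validate(self, workload, baseline, candidate):
        fn = self._custom.get(workload, self._builtin.get(workload))
        result = fn(baseline, candidate)
        if result == "FAIL":
            for cb in self._alerts:  # e.g. PagerDuty / Slack hooks
                cb(workload, result)
        return result
```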

Drift Monitoring

The DriftMonitor samples live traffic using shadow execution, computes workload-specific comparison metrics via the same validators as canary validation, and applies EWMA smoothing for streaming drift detection. When drift exceeds thresholds, the circuit breaker trips and the optimization is disabled.
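
EWMA smoothing over a streaming divergence score is a standard technique; a generic sketch (not the DriftMonitor internals, and with assumed alpha/threshold values) looks like this:

```python
class EwmaDriftDetector:
    """EWMA-smoothed drift score: ewma = alpha*score + (1-alpha)*ewma."""

    def __init__(self, alpha=0.1, threshold=0.2):
        self.alpha = alpha          # smoothing factor (assumed default)
        self.threshold = threshold  # drift trip level (assumed default)
        self.ewma = None

    def update(self, score: float) -> bool:
        """Feed one shadow-comparison score; return True if drift tripped."""
        if self.ewma is None:
            self.ewma = score
        else:
            self.ewma = self.alpha * score + (1 - self.alpha) * self.ewma
        return self.ewma > self.threshold
```

A True return would correspond to tripping the circuit breaker and disabling the optimization, as described above.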

Hysteresis Control

The HysteresisController prevents optimization state oscillation:

  • Disable: Immediate on circuit breaker trip
  • Re-enable: Requires minimum off-duration (default 300s) AND N consecutive canary PASS results (not MARGINAL)
  • Minimum on-duration: Prevents disabling too quickly before meaningful data is collected (default 60s)
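
The asymmetric enable/disable rule above (immediate disable; re-enable only after a minimum off-duration AND N consecutive PASSes) can be sketched with an injected clock for testability. Names and the N default are illustrative.

```python
class HysteresisSketch:
    """Sketch of anti-flapping control for an optimization's on/off state."""

    def __init__(self, clock, min_off_s=300.0, required_passes=3):
        self.clock = clock                    # callable returning seconds
        self.min_off_s = min_off_s
        self.required_passes = required_passes
        self.enabled = True
        self._disabled_at = None
        self._passes = 0

    def trip(self):
        """Circuit breaker tripped: disable immediately."""
        self.enabled = False
        self._disabled_at = self.clock()
        self._passes = 0

    def record_canary(self, result: str):
        if self.enabled:
            return
        # Only strict PASS counts; MARGINAL resets the streak.
        self._passes = self._passes + 1 if result == "PASS" else 0
        off_long_enough = self.clock() - self._disabled_at >= self.min_off_s
        if off_long_enough and self._passes >= self.required_passes:
            self.enabled = True
```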

Model Registry Architecture

ModelRegistryClient
|
+-- Backend Loaders
|   +-- local: File read + SHA-256 hash
|   +-- huggingface: transformers.AutoModel.from_pretrained
|   +-- mlflow: mlflow.pytorch.load_model
|
+-- Version Tracking
|   +-- SHA-256 content hashing
|   +-- on_version_change callback for cache invalidation
|
+-- Lineage Tracking (ModelLineage)
|   +-- data_version, code_hash, hyperparameters
|   +-- training_metrics, parent_model
|
+-- Staged Promotions (ModelStage)
|   +-- DRAFT -> STAGING -> PRODUCTION -> ARCHIVED
|   +-- Optional approval callbacks
|   +-- TOCTOU-safe atomic verify+apply
|   +-- Single PRODUCTION version per model (auto-archive)
|
+-- Rollback
    +-- Production version history (deque, max 100)
    +-- Archive current, restore previous
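
The single-PRODUCTION invariant with auto-archive can be sketched as below. A plain lock stands in for the TOCTOU-safe verify+apply described above; the class and method names are illustrative.

```python
import threading
from enum import Enum

class Stage(Enum):
    DRAFT = "draft"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"

class PromotionSketch:
    """Sketch: promoting to PRODUCTION auto-archives the previous holder."""

    def __init__(self):
        self.stages = {}  # version id -> Stage
        self._lock = threading.Lock()

    def promote(self, version: str, target: Stage):
        with self._lock:  # verify + apply as one atomic step
            if target is Stage.PRODUCTION:
                for v, s in self.stages.items():
                    if s is Stage.PRODUCTION:
                        self.stages[v] = Stage.ARCHIVED
            self.stages[version] = target
```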

A/B Testing Architecture

ABModelComparison
|
+-- Split Mode: Route to A or B per request
|   +-- Random selection based on traffic_split probability
|   +-- Per-model latency and error metrics
|
+-- Shadow Mode: Run both, return A
|   +-- B runs after A (no latency impact on critical path)
|   +-- Both results recorded for comparison
|
+-- Statistical Analysis
    +-- Welch's t-test (scipy or manual fallback)
    +-- Confidence intervals
    +-- Sample size tracking

MultiModelComparison (A/B/n)
+-- Weighted traffic distribution across N models
+-- Cumulative weight routing for efficient selection

GradualRollout
+-- Stepped traffic ramping: 1% -> 5% -> 10% -> 25% -> 50% -> 100%
+-- Health check gates at each step
+-- Minimum dwell time enforcement per step
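
Two pieces above are easy to sketch: split-mode routing by traffic_split probability, and the scipy-free manual Welch's t-statistic fallback (unequal variances, unequal sample sizes). Function names are illustrative, not the ab_testing.py API.

```python
import math
import random
import statistics

def route(traffic_split: float, rng=random) -> str:
    """Split mode: send this request to B with probability traffic_split."""
    return "B" if rng.random() < traffic_split else "A"

def welch_t(sample_a, sample_b) -> float:
    """Manual Welch's t-statistic: t = (mA - mB) / sqrt(sA^2/nA + sB^2/nB)."""
    m_a, m_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    v_a, v_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(v_a / len(sample_a) + v_b / len(sample_b))
    return (m_a - m_b) / se
```

A large negative t on latency samples (A minus B) indicates A is significantly faster; the real harness also computes degrees of freedom and confidence intervals.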

Metrics Architecture

InferenceMetrics (core collector)
|
+-- PrometheusExporter.export() --> /metrics HTTP endpoint
|   +-- Counters, gauges, histograms
|   +-- Per-model labels (model_name, model_version, endpoint)
|   +-- Exemplars for trace correlation
|   +-- Cost metrics (baseline vs optimized)
|
+-- InferenceOTelExporter.export() --> OTel SDK
|   +-- Spec Section 8.1 instrument names
|   +-- Safety metrics (canary, circuit breaker, drift, hysteresis)
|
+-- CostEstimator
    +-- Per-1k request cost (baseline vs optimized)
    +-- Projected hourly/monthly savings
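
The arithmetic behind a CostEstimator-style savings projection is straightforward. The function name is illustrative, and the 730 hours/month averaging convention is an assumption, not a documented constant.

```python
def projected_savings(baseline_cost_per_1k: float,
                      optimized_cost_per_1k: float,
                      requests_per_hour: float) -> dict:
    """Project per-1k request savings to hourly and monthly figures."""
    delta_per_1k = baseline_cost_per_1k - optimized_cost_per_1k
    hourly = delta_per_1k * requests_per_hour / 1000.0
    return {
        "hourly_savings": hourly,
        "monthly_savings": hourly * 730.0,  # assumed avg hours per month
    }
```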

Security Model

Startup validation checks:

  • Cache tenant isolation
  • Privacy control configuration (mode, redaction patterns, audit logging)
  • Safety bypass resistance
  • Input size limits

Privacy controls:

  • InputRedactor: Regex PII scrubbing before any data is stored
  • TenantIsolation: Namespace cache keys by tenant/deployment ID
  • AuditLogger: Append-only structured log of safety decisions
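
A regex PII scrubber in the spirit of InputRedactor can be sketched as below. The patterns shown (email, US SSN) are illustrative examples, not the shipped redaction set, and the class name is hypothetical.

```python
import re

class RedactorSketch:
    """Apply ordered (pattern, replacement) pairs before any data is stored."""

    DEFAULT_PATTERNS = [
        (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),   # email addresses
        (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),       # US SSN format
    ]

    def __init__(self, patterns=None):
        self.patterns = patterns or self.DEFAULT_PATTERNS

    def redact(self, text: str) -> str:
        for pattern, replacement in self.patterns:
            text = pattern.sub(replacement, text)
        return text
```

Running redaction before caching or audit logging ensures raw PII never reaches the L2/L3 stores or the append-only log.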