Epochly AI Inference: Architecture

How the inference accelerator works: framework detection, progressive enhancement levels, safety gates, and the optimization pipeline.

System Overview

The Epochly AI Inference Accelerator is an inference control layer for Python services. It observes, optimizes, gates, and proves impact on AI inference workloads without replacing the model runtime.

Four-Layer Architecture

Customer's Python Service
(FastAPI / Ray Serve / custom)
          |
+=========+==========+  <-- Data Plane (in-process)
|                    |
| Serving Adapters   |  Request-level interception
|  ASGIMiddleware    |   ASGI middleware: timing, policy, admission
|  RayServeWrapper   |   Deployment wrapper (v1 preview)
|  LLMCompanion      |   vLLM/TGI: admission, caching, attribution
|  ABModelComparison |   A/B testing: split/shadow traffic routing
|  GradualRollout    |   Staged traffic ramping with health checks
|                    |
| Framework Adapters |  Model-level interception
|  PyTorchAdapter    |   InferenceProxy(invoke=Module.__call__)
| TransformersAdapter|   InferenceProxy(invoke=pipeline.__call__)
|  OnnxAdapter       |   InferenceProxy(invoke=session.run)
|                    |
+=========+==========+
          |
  ModelRegistryClient    <-- Model loading, versioning, lineage
          |
  InferenceOptimizer     <-- Policy & orchestration
          |
  SafetyOrchestrator     <-- gate_optimization()
          |
+---------+---------+---------+------------------+
|                   |         |                  |
Micro-         Compilation  Cache       ValidatorRegistry
Batching    (torch.compile)          (custom + built-in)
          |
InferenceMetrics     --> Lens Dashboard
CostEstimator
PrometheusExporter   --> /metrics endpoint
OTelExporter         --> OpenTelemetry SDK

Data Plane

In-process components that intercept and instrument inference traffic:

  • Serving Adapters: ASGI middleware, Ray Serve wrappers, LLM companion
  • A/B Testing: Split and shadow mode traffic routing between model variants, with Welch's t-test statistical significance and gradual rollout
  • Framework Adapters: Model-specific proxies for PyTorch, HuggingFace, ONNX
  • Inference Proxies: Wrap individual model instances at the serving boundary
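
To make the request-level interception concrete, here is a minimal sketch of a timing ASGI middleware in the spirit of ASGIMiddleware. The class name, constructor, and callback are illustrative assumptions, not Epochly's actual API; the real adapter also applies policy and admission control.

```python
import time

class TimingMiddleware:
    """Illustrative ASGI middleware: record per-request latency.

    Hypothetical sketch; `on_complete(path, seconds)` stands in for
    the metrics pipeline described above.
    """

    def __init__(self, app, on_complete):
        self.app = app
        self.on_complete = on_complete

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through lifespan/websocket traffic untouched.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            # Record latency even when the wrapped app raises.
            self.on_complete(scope["path"], time.perf_counter() - start)
```

Because ASGI middleware wraps the app callable, it composes with FastAPI, Starlette, or any other ASGI framework without touching handler code.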

Model Registry

Version-tracked model loading from multiple backends (local, HuggingFace Hub, MLflow). Provides:

  • Version tracking: SHA-256 content hashing with automatic cache invalidation signaling
  • Lineage tracking: Provenance metadata (data version, code hash, hyperparameters, training metrics)
  • Staged promotions: Lifecycle management (DRAFT -> STAGING -> PRODUCTION -> ARCHIVED) with optional approval callbacks and TOCTOU-safe atomic transitions
  • Rollback: Revert to previous production version with full version history
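
The version-tracking behavior above can be sketched in a few lines: hash the model content with SHA-256 and fire a callback when the hash changes, so caches keyed on the old version can invalidate. Class and method names here are illustrative, not ModelRegistryClient's real interface.

```python
import hashlib

class VersionTracker:
    """Sketch of SHA-256 content hashing with a change callback."""

    def __init__(self, on_version_change=None):
        self._versions = {}  # model name -> last observed content hash
        self.on_version_change = on_version_change

    def observe(self, name, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        previous = self._versions.get(name)
        if previous is not None and previous != digest and self.on_version_change:
            # Content changed: signal downstream caches to invalidate.
            self.on_version_change(name, previous, digest)
        self._versions[name] = digest
        return digest
```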

Policy Engine

Per-endpoint configuration: batching window, precision mode, concurrency cap, cache TTL, fallback runtime, rollout percentage.
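
A per-endpoint policy record might look like the following dataclass. The field names mirror the knobs listed above, but the exact schema, defaults, and units are assumptions for illustration, not Epochly's InferenceConfig.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EndpointPolicy:
    """Hypothetical per-endpoint policy; values shown are placeholders."""
    batching_window_ms: float = 8.0         # max wait to fill a micro-batch
    precision_mode: str = "fp16"            # e.g. "fp32", "fp16", "bf16"
    concurrency_cap: int = 64               # max in-flight requests
    cache_ttl_s: int = 300                  # response cache TTL
    fallback_runtime: Optional[str] = None  # e.g. "eager" if compile fails
    rollout_percentage: float = 100.0       # share of traffic under this policy
```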

Validation Harness

Safety controls: golden datasets, canary validation, circuit breakers, drift monitoring, fallback chains, validator registry, hysteresis control.

Control Plane

Lens dashboard for fleet-wide visibility and performance monitoring.

Module Structure

src/epochly/inference/
  __init__.py                  # Lazy init, zero cost if no ML framework
  detector.py                  # Framework and model detection
  profiler.py                  # Inference-specific profiling + golden capture
  optimizer.py                 # Optimization orchestrator
  cache.py                     # L1 in-memory cache
  cache_l2.py                  # L2 SQLite WAL + AES-256-GCM cache
  cache_l3.py                  # L3 Redis distributed cache
  config.py                    # InferenceConfig dataclass
  context.py                   # RequestContext, BatchKey
  security.py                  # Startup security validation
  progression.py               # L0->L1->L2 progression validation criteria
  batch/
    dynamic_micro_batcher.py   # Async micro-batching with keyed sub-queues
    batch_optimizer.py
    request_queue.py
  compilation/
    torch_compiler.py          # torch.compile with pre-check and cache
    safety_monitor.py          # Graph break, memory, NaN monitoring
  frameworks/
    base_adapter.py
    inference_proxy.py
    pytorch_adapter.py
    transformers_adapter.py
    onnx_adapter.py
  registry/
    model_registry.py          # ModelRegistryClient, ModelStage, ModelLineage
  serving/
    ab_testing.py              # ABModelComparison, MultiModelComparison, GradualRollout
    fastapi_middleware.py
    fastapi_dependency.py
    llm_companion.py           # vLLM/TGI control layer
    ray_serve_wrapper.py       # Ray Serve wrapper (v1 preview)
  metrics/
    inference_metrics.py
    cost_estimator.py
    otel_exporter.py           # OpenTelemetry instrument mapping
    prometheus_exporter.py     # Prometheus exposition format export
  safety/
    canary_validator.py
    circuit_breaker.py
    drift_monitor.py           # EWMA-based online drift detection
    fallback_chain.py
    golden_store.py
    hysteresis.py              # Anti-flapping state transition control
    privacy.py                 # InputRedactor, TenantIsolation, AuditLogger
    safety_orchestrator.py
    validator_registry.py      # Custom workload validator plugin registry

Enhancement Levels

Level  Name                   Description
L0     Profiling              Framework detection + model profiling
L1     Pre/Post Optimization  CPU parallelization, tokenizer caching
L2a    Micro-Batching         Dynamic request batching
L2b    Compilation            torch.compile with golden validation
L3+    Verified Optimize      Quantization, ONNX export (v2)

Cache Architecture

Three-tier cache with progressive latency/capacity tradeoff:

L1 (In-Memory LRU)    L2 (SQLite WAL)       L3 (Redis)
- ~1us access         - ~1ms access         - ~5ms access
- 10K entries         - 1GB on disk         - Distributed
- Process-local       - AES-256-GCM         - TLS + key prefix
- Thread-safe         - TTL enforcement     - TTL via EXPIRE

Lookup order: L1 -> L2 -> L3. A hit in a slower tier is promoted back into the faster tiers.
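
The lookup-and-promote pattern can be sketched with plain dicts standing in for the real tiers (LRU memory, SQLite, Redis). This is a minimal illustration of the tiering logic only, not the shipped cache code.

```python
class TieredCache:
    """Minimal sketch of L1 -> L2 -> L3 lookup with promotion on hit."""

    def __init__(self):
        self.tiers = [{}, {}, {}]  # index 0 = L1 (fastest), 2 = L3 (slowest)

    def get(self, key):
        for i, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                # Promote the hit into every faster tier.
                for faster in self.tiers[:i]:
                    faster[key] = value
                return value
        return None  # miss in all tiers

    def put(self, key, value, tier=0):
        self.tiers[tier][key] = value
```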

Safety Architecture

Every optimization goes through:

  1. Golden output capture during L0 profiling (via InferenceProfiler golden callback)
  2. Pre-compile check via torch._dynamo.explain()
  3. Canary validation with workload-specific validators (built-in or custom via ValidatorRegistry)
  4. Circuit breaker per optimization
  5. Drift monitoring via EWMA-based DriftMonitor with shadow comparisons
  6. Hysteresis control to prevent state flapping (asymmetric enable/disable thresholds)
  7. Fallback chain for graceful degradation
  8. Alert hooks via ValidatorRegistry alert callbacks on FAIL reports
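
Of these gates, the per-optimization circuit breaker (step 4) is the simplest to sketch: open after a run of consecutive failures so the fallback chain takes over. The threshold and reset-on-success behavior here are illustrative assumptions, not the shipped implementation.

```python
class CircuitBreakerSketch:
    """Per-optimization circuit breaker sketch (threshold is assumed)."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False  # open == optimization disabled

    def record(self, ok: bool):
        if ok:
            self.failures = 0  # any success resets the streak
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # fall back to the baseline path
```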

Validator Registry

The ValidatorRegistry provides a plugin interface for custom workload validators. Built-in validators cover standard workload types (embedding, classifier, reranker, generation, encoder). Enterprise users can register domain-specific validators (financial accuracy, medical terminology, etc.) that override built-in validators for the same workload type.

Key properties:

  • Custom validators override built-in validators; unregistering restores the original
  • Supports both synchronous and asynchronous validators
  • Alert callbacks fire on FAIL results for integration with PagerDuty, Slack, etc.
  • InputSchemaValidator provides pre-inference input data validation (dtype, shape, range)
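
The override/restore and alert-callback behavior described above can be sketched as follows. This is a behavioral illustration, not the actual ValidatorRegistry class; validators here are plain callables returning "PASS", "MARGINAL", or "FAIL".

```python
class ValidatorRegistrySketch:
    """Sketch: custom validators shadow built-ins; unregistering restores."""

    def __init__(self):
        self._builtin = {}
        self._custom = {}
        self._alerts = []

    def register_builtin(self, workload, fn):
        self._builtin[workload] = fn

    def register(self, workload, fn):
        self._custom[workload] = fn  # overrides the built-in, if any

    def unregister(self, workload):
        self._custom.pop(workload, None)  # built-in is restored

    def add_alert_callback(self, cb):
        self._alerts.append(cb)

    def validate(self, workload, baseline, candidate):
        fn = self._custom.get(workload, self._builtin.get(workload))
        result = fn(baseline, candidate)
        if result == "FAIL":
            for cb in self._alerts:  # e.g. PagerDuty / Slack hooks
                cb(workload, result)
        return result
```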

Drift Monitoring

The DriftMonitor samples live traffic using shadow execution, computes workload-specific comparison metrics via the same validators as canary validation, and applies EWMA smoothing for streaming drift detection. When drift exceeds thresholds, the circuit breaker trips and the optimization is disabled.
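
EWMA smoothing over a streaming divergence score is a standard technique; a generic sketch (not the DriftMonitor internals, and with assumed alpha/threshold values) looks like this:

```python
class EwmaDriftDetector:
    """EWMA-smoothed drift score: ewma = alpha*score + (1-alpha)*ewma."""

    def __init__(self, alpha=0.1, threshold=0.2):
        self.alpha = alpha          # smoothing factor (assumed default)
        self.threshold = threshold  # drift trip level (assumed default)
        self.ewma = None

    def update(self, score: float) -> bool:
        """Feed one shadow-comparison score; return True if drift tripped."""
        if self.ewma is None:
            self.ewma = score
        else:
            self.ewma = self.alpha * score + (1 - self.alpha) * self.ewma
        return self.ewma > self.threshold
```

A True return would correspond to tripping the circuit breaker and disabling the optimization, as described above.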

Hysteresis Control

The HysteresisController prevents optimization state oscillation:

  • Disable: Immediate on circuit breaker trip
  • Re-enable: Requires minimum off-duration (default 300s) AND N consecutive canary PASS results (not MARGINAL)
  • Minimum on-duration: Prevents disabling too quickly before meaningful data is collected (default 60s)
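
The asymmetric enable/disable rule above (immediate disable; re-enable only after a minimum off-duration AND N consecutive PASSes) can be sketched with an injected clock for testability. Names and the N default are illustrative.

```python
class HysteresisSketch:
    """Sketch of anti-flapping control for an optimization's on/off state."""

    def __init__(self, clock, min_off_s=300.0, required_passes=3):
        self.clock = clock                    # callable returning seconds
        self.min_off_s = min_off_s
        self.required_passes = required_passes
        self.enabled = True
        self._disabled_at = None
        self._passes = 0

    def trip(self):
        """Circuit breaker tripped: disable immediately."""
        self.enabled = False
        self._disabled_at = self.clock()
        self._passes = 0

    def record_canary(self, result: str):
        if self.enabled:
            return
        # Only strict PASS counts; MARGINAL resets the streak.
        self._passes = self._passes + 1 if result == "PASS" else 0
        off_long_enough = self.clock() - self._disabled_at >= self.min_off_s
        if off_long_enough and self._passes >= self.required_passes:
            self.enabled = True
```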

Model Registry Architecture

ModelRegistryClient
|
+-- Backend Loaders
|   +-- local: File read + SHA-256 hash
|   +-- huggingface: transformers.AutoModel.from_pretrained
|   +-- mlflow: mlflow.pytorch.load_model
|
+-- Version Tracking
|   +-- SHA-256 content hashing
|   +-- on_version_change callback for cache invalidation
|
+-- Lineage Tracking (ModelLineage)
|   +-- data_version, code_hash, hyperparameters
|   +-- training_metrics, parent_model
|
+-- Staged Promotions (ModelStage)
|   +-- DRAFT -> STAGING -> PRODUCTION -> ARCHIVED
|   +-- Optional approval callbacks
|   +-- TOCTOU-safe atomic verify+apply
|   +-- Single PRODUCTION version per model (auto-archive)
|
+-- Rollback
    +-- Production version history (deque, max 100)
    +-- Archive current, restore previous
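
The single-PRODUCTION invariant with auto-archive can be sketched as below. A plain lock stands in for the TOCTOU-safe verify+apply described above; the class and method names are illustrative.

```python
import threading
from enum import Enum

class Stage(Enum):
    DRAFT = "draft"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"

class PromotionSketch:
    """Sketch: promoting to PRODUCTION auto-archives the previous holder."""

    def __init__(self):
        self.stages = {}  # version id -> Stage
        self._lock = threading.Lock()

    def promote(self, version: str, target: Stage):
        with self._lock:  # verify + apply as one atomic step
            if target is Stage.PRODUCTION:
                for v, s in self.stages.items():
                    if s is Stage.PRODUCTION:
                        self.stages[v] = Stage.ARCHIVED
            self.stages[version] = target
```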

A/B Testing Architecture

ABModelComparison
|
+-- Split Mode: Route to A or B per request
|   +-- Random selection based on traffic_split probability
|   +-- Per-model latency and error metrics
|
+-- Shadow Mode: Run both, return A
|   +-- B runs after A (no latency impact on critical path)
|   +-- Both results recorded for comparison
|
+-- Statistical Analysis
    +-- Welch's t-test (scipy or manual fallback)
    +-- Confidence intervals
    +-- Sample size tracking

MultiModelComparison (A/B/n)
+-- Weighted traffic distribution across N models
+-- Cumulative weight routing for efficient selection

GradualRollout
+-- Stepped traffic ramping: 1% -> 5% -> 10% -> 25% -> 50% -> 100%
+-- Health check gates at each step
+-- Minimum dwell time enforcement per step
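
Two pieces above are easy to sketch: split-mode routing by traffic_split probability, and the scipy-free manual Welch's t-statistic fallback (unequal variances, unequal sample sizes). Function names are illustrative, not the ab_testing.py API.

```python
import math
import random
import statistics

def route(traffic_split: float, rng=random) -> str:
    """Split mode: send this request to B with probability traffic_split."""
    return "B" if rng.random() < traffic_split else "A"

def welch_t(sample_a, sample_b) -> float:
    """Manual Welch's t-statistic: t = (mA - mB) / sqrt(sA^2/nA + sB^2/nB)."""
    m_a, m_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    v_a, v_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(v_a / len(sample_a) + v_b / len(sample_b))
    return (m_a - m_b) / se
```

A large negative t on latency samples (A minus B) indicates A is significantly faster; the real harness also computes degrees of freedom and confidence intervals.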

Metrics Architecture

InferenceMetrics (core collector)
|
+-- PrometheusExporter.export() --> /metrics HTTP endpoint
|   +-- Counters, gauges, histograms
|   +-- Per-model labels (model_name, model_version, endpoint)
|   +-- Exemplars for trace correlation
|   +-- Cost metrics (baseline vs optimized)
|
+-- InferenceOTelExporter.export() --> OTel SDK
|   +-- Spec Section 8.1 instrument names
|   +-- Safety metrics (canary, circuit breaker, drift, hysteresis)
|
+-- CostEstimator
    +-- Per-1k request cost (baseline vs optimized)
    +-- Projected hourly/monthly savings
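
The arithmetic behind a CostEstimator-style savings projection is straightforward. The function name is illustrative, and the 730 hours/month averaging convention is an assumption, not a documented constant.

```python
def projected_savings(baseline_cost_per_1k: float,
                      optimized_cost_per_1k: float,
                      requests_per_hour: float) -> dict:
    """Project per-1k request savings to hourly and monthly figures."""
    delta_per_1k = baseline_cost_per_1k - optimized_cost_per_1k
    hourly = delta_per_1k * requests_per_hour / 1000.0
    return {
        "hourly_savings": hourly,
        "monthly_savings": hourly * 730.0,  # assumed avg hours per month
    }
```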

Security Model

Startup validation checks:

  • Cache tenant isolation
  • Privacy control configuration (mode, redaction patterns, audit logging)
  • Safety bypass resistance
  • Input size limits

Privacy controls:

  • InputRedactor: Regex PII scrubbing before any data is stored
  • TenantIsolation: Namespace cache keys by tenant/deployment ID
  • AuditLogger: Append-only structured log of safety decisions
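
A regex PII scrubber in the spirit of InputRedactor can be sketched as below. The patterns shown (email, US SSN) are illustrative examples, not the shipped redaction set, and the class name is hypothetical.

```python
import re

class RedactorSketch:
    """Apply ordered (pattern, replacement) pairs before any data is stored."""

    DEFAULT_PATTERNS = [
        (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),   # email addresses
        (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),       # US SSN format
    ]

    def __init__(self, patterns=None):
        self.patterns = patterns or self.DEFAULT_PATTERNS

    def redact(self, text: str) -> str:
        for pattern, replacement in self.patterns:
            text = pattern.sub(replacement, text)
        return text
```

Running redaction before caching or audit logging ensures raw PII never reaches the L2/L3 stores or the append-only log.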