Epochly AI Inference: API Reference
Complete API reference for epochly.wrap(), InferenceProxy, cache tiers, micro-batching, safety gates, and serving integrations.
Top-Level API
epochly.wrap(model) -> InferenceProxy
Wrap a model instance for profiling and optimization.
Parameters:
- `model` -- A supported model instance (`torch.nn.Module`, `transformers.Pipeline`, `transformers.PreTrainedModel`, `onnxruntime.InferenceSession`)
Returns: InferenceProxy that delegates to the original model with profiling.
Raises: TypeError if model is not from a supported framework.
Usage:
```python
import epochly

model = MyModel().to("cuda").eval()
model = epochly.wrap(model)  # Wrap LAST
result = model(input_tensor)
```
Framework Adapters
InferenceAdapter Protocol
Protocol defining the common interface for all framework adapters.
Key Methods:
- `detect(obj) -> bool` -- Check if obj is a model from this framework
- `get_model_classes() -> List[type]` -- Base classes for isinstance detection
- `get_model_info(model) -> ModelInfo` -- Extract metadata
- `classify_model(model) -> ModelType` -- Classify workload type
- `wrap_forward(model) -> Any` -- Create proxy wrapper (returns InferenceProxy)
- `compile(model, config) -> Any` -- Compile model
- `run_inference(model, inputs) -> Any` -- Run inference
- `outputs_match(expected, actual, tolerance) -> bool` -- Compare outputs
- `hash_model(model) -> str` -- Hash model weights
- `capture_golden(model, n) -> List[GoldenOutput]` -- Capture golden outputs
InferenceProxy
Framework-agnostic proxy wrapping a model instance.
Properties:
- `unwrapped` -- Access the original model
- `model_id -> int` -- The id() of the wrapped model
- `call_count -> int` -- Number of inference calls routed through this proxy
Usage:
```python
proxy = epochly.wrap(model)
result = proxy(input)       # Profiled call
original = proxy.unwrapped  # Raw model access
```
Serving Adapters
EpochlyInferenceMiddleware
ASGI middleware for request-level observability and control.
Constructor Parameters:
- `app` -- ASGI application
- `max_concurrency: int = 64` -- Maximum concurrent requests
- `config: Any = None` -- Optional InferenceConfig
Properties:
- `request_count -> int` -- Total requests processed
- `avg_latency_ms -> float` -- Average latency in milliseconds
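The middleware's request accounting can be pictured with a minimal ASGI-style sketch. This is an illustrative standalone class, not the Epochly implementation: it shows only the semaphore-based concurrency cap and the running latency average behind `max_concurrency`, `request_count`, and `avg_latency_ms`.

```python
import asyncio
import time

class ConcurrencyLimitMiddleware:
    """Sketch: cap concurrent requests with a semaphore and track
    a running average latency. Not the Epochly implementation."""

    def __init__(self, app, max_concurrency=64):
        self.app = app
        self._sem = asyncio.Semaphore(max_concurrency)
        self.request_count = 0
        self._total_ms = 0.0

    @property
    def avg_latency_ms(self):
        return self._total_ms / self.request_count if self.request_count else 0.0

    async def __call__(self, scope, receive, send):
        async with self._sem:  # back-pressure: at most N requests in flight
            start = time.perf_counter()
            await self.app(scope, receive, send)
            self._total_ms += (time.perf_counter() - start) * 1000
            self.request_count += 1
```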
LLMCompanionAdapter
Control layer for vLLM/TGI deployments.
Constructor Parameters:
- `runtime_url: str` -- URL of the LLM runtime
- `config: Optional[LLMCompanionConfig] = None` -- Configuration
Methods:
- `async generate(prompt, **kwargs) -> str` -- Generate completion
- `get_stats() -> dict` -- Get companion metrics
LLMCompanionConfig
Configuration for LLM companion mode.
Fields:
- `max_concurrent_requests: int = 64`
- `cache_enabled: bool = True`
- `cache_max_size: int = 10_000`
- `default_timeout_seconds: float = 120.0`
- `api_path: str = "/v1/completions"`
EpochlyRayServeWrapper
Ray Serve deployment wrapper (v1 preview).
Constructor Parameters:
- `config: Optional[RayServeConfig] = None` -- Configuration (defaults to RayServeConfig())
Methods:
- `handle_request(model_fn, *args, priority=RequestPriority.NORMAL, **kwargs) -> Any`
- `get_stats() -> dict`
epochly_serve
Decorator for adding Epochly telemetry to inference functions.
Usage:
```python
@epochly_serve
def predict(model, data):
    return model(data)

@epochly_serve(config=RayServeConfig(enable_telemetry=True))
def predict(model, data):
    return model(data)
```
ABModelComparison
A/B testing for model variants. Routes a configurable percentage of traffic to a challenger model (B), compares outputs, and reports per-variant metrics. Supports split mode (one model per request) and shadow mode (both models run, only A's result returned).
Constructor Parameters:
- `model_a: Any` -- The production (champion) model callable
- `model_b: Any` -- The challenger model callable
- `adapter: InferenceAdapter` -- Framework adapter
- `traffic_split: float = 0.1` -- Fraction of traffic routed to model B (0.0 to 1.0)
- `shadow: bool = False` -- If True, both models run for every request; only model A's result is returned
Raises: ValueError if traffic_split is outside [0.0, 1.0].
Methods:
- `infer(input_data, **kwargs) -> Any` -- Route request to model A or B based on traffic split. In shadow mode, both models run and model A's result is returned. In split mode, exactly one model runs per request.
- `get_comparison_report() -> dict` -- Return comparison metrics between A and B. Returns a dictionary with keys: `model_a` (metrics dict), `model_b` (metrics dict), `traffic_split`, `total_requests`, `shadow`.
- `compute_significance(confidence_level=0.95) -> dict` -- Compute statistical significance of the latency difference using Welch's t-test. Uses `scipy.stats.ttest_ind` when available, falls back to a built-in manual implementation. Returns a dictionary with keys: `p_value`, `confidence_interval` (lower, upper), `significant` (bool), `sample_sizes`.
Usage:
```python
from epochly.inference.serving.ab_testing import ABModelComparison

ab = ABModelComparison(
    model_a=production_model,
    model_b=challenger_model,
    adapter=pytorch_adapter,
    traffic_split=0.1,
)
result = ab.infer(input_tensor)
report = ab.get_comparison_report()
sig = ab.compute_significance(confidence_level=0.95)
if sig["significant"]:
    print(f"Difference is significant (p={sig['p_value']:.4f})")
```
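The core arithmetic of the Welch's t-test fallback looks roughly like the sketch below. The function name is hypothetical, and a two-sided normal approximation stands in for the t-distribution CDF that scipy would provide; the t-statistic and Welch-Satterthwaite degrees of freedom follow the standard formulas.

```python
import math

def welch_t(a, b):
    """Welch's t-statistic, degrees of freedom, and an approximate
    two-sided p-value for two latency samples (standalone sketch)."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    # Two-sided p-value via the normal approximation (scipy would use the
    # t-distribution; the approximation is close for moderate sample sizes)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p
```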
MultiModelComparison
A/B/n testing for N models with weighted traffic distribution.
Constructor Parameters:
- `models: Dict[str, Any]` -- Mapping from model name to model callable
- `weights: Dict[str, float]` -- Mapping from model name to traffic weight (must sum to ~1.0)
- `adapter: InferenceAdapter` -- Framework adapter
Methods:
- `infer(input_data, **kwargs) -> Any` -- Route request based on weighted distribution
- `get_comparison_report() -> dict` -- Return comparison metrics across all models
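Weighted routing itself is a small idea; a standalone sketch (hypothetical helper, not the `MultiModelComparison` internals) is just a weighted random choice over the model names:

```python
import random

def route(models, weights, rng=random):
    """Pick one model per request according to traffic weights.
    Sketch only: real routing also records per-model metrics."""
    names = list(models)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return name, models[name]
```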
GradualRollout
Staged traffic ramping with health checks at each step.
Default steps: 1% -> 5% -> 10% -> 25% -> 50% -> 100%.
Constructor Parameters:
- `initial_pct: float` -- Starting traffic percentage
- `target_pct: float` -- Target traffic percentage
- `step_duration_sec: float` -- Minimum time at each step
- `steps: Optional[List[float]]` -- Custom step percentages
Properties:
- `current_pct -> float` -- Current traffic percentage
- `is_complete -> bool` -- Whether rollout reached target
Methods:
- `advance_step(health_ok: bool) -> bool` -- Attempt to advance to next step
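The staged-ramp contract can be sketched in a few lines. This is an illustrative standalone class, not `GradualRollout` itself: advancing requires a healthy check and the minimum dwell time at the current step, and a failed health check simply holds the current percentage.

```python
import time

class RolloutSketch:
    """Sketch of staged traffic ramping with health-gated advancement."""
    STEPS = [1.0, 5.0, 10.0, 25.0, 50.0, 100.0]  # default ramp

    def __init__(self, step_duration_sec=0.0):
        self.step_duration_sec = step_duration_sec
        self._idx = 0
        self._entered = time.monotonic()

    @property
    def current_pct(self):
        return self.STEPS[self._idx]

    @property
    def is_complete(self):
        return self._idx == len(self.STEPS) - 1

    def advance_step(self, health_ok):
        dwelled = time.monotonic() - self._entered >= self.step_duration_sec
        if health_ok and dwelled and not self.is_complete:
            self._idx += 1
            self._entered = time.monotonic()
            return True
        return False  # unhealthy, too soon, or already at target
```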
Model Registry
ModelRegistryClient
Version-tracked model loading from multiple backends.
Constructor Parameters:
- `backend: str` -- One of `"local"`, `"huggingface"`, or `"mlflow"`
- `on_version_change: Optional[Callable]` -- Callback on version hash change
Methods:
- `load(model_name, revision=None) -> Tuple[Any, str]` -- Load model and return (model, version_hash)
- `check_for_update(model_name) -> Optional[str]` -- Check for newer version
- `set_lineage(model_name, version, lineage) -> None` -- Store provenance metadata
- `get_lineage(model_name, version) -> Optional[ModelLineage]` -- Retrieve lineage
- `set_stage(model_name, version, stage) -> None` -- Set lifecycle stage
- `get_stage(model_name, version) -> Optional[ModelStage]` -- Get lifecycle stage
- `promote(model_name, version, approval_callback=None) -> bool` -- Promote to next stage
- `demote(model_name, version) -> bool` -- Demote to previous stage
- `rollback(model_name) -> Optional[str]` -- Rollback to previous production version
ModelStage
Lifecycle stage enum: DRAFT -> STAGING -> PRODUCTION -> ARCHIVED
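The promote/demote contract over this ordered enum can be sketched as follows. This is an assumption-laden illustration (hypothetical `Stage` and `promote` names, not the registry internals): stages are ordered, `ARCHIVED` is terminal, and an optional approval callback can veto the transition.

```python
from enum import IntEnum

class Stage(IntEnum):
    """Ordered lifecycle stages (sketch mirroring DRAFT -> ARCHIVED)."""
    DRAFT = 0
    STAGING = 1
    PRODUCTION = 2
    ARCHIVED = 3

def promote(stage, approval_callback=None):
    """Return (new_stage, succeeded). An approval callback, when given,
    can veto the promotion by returning False."""
    if stage is Stage.ARCHIVED:
        return stage, False  # terminal stage
    if approval_callback is not None and not approval_callback(stage):
        return stage, False  # vetoed
    return Stage(stage + 1), True
```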
ModelLineage
Provenance tracking dataclass.
Fields:
- `data_version: str`
- `code_hash: str`
- `hyperparameters: Dict[str, Any]`
- `training_metrics: Dict[str, float]`
- `parent_model: Optional[str]`
- `created_at: float`
Cache
InferenceCache
Multi-tier LRU cache for inference results (L1 in-memory).
Constructor Parameters:
- `config: Optional[CacheConfig]` -- Cache configuration
- `tenant_isolation: Optional[TenantIsolation]` -- Tenant isolation for key namespacing
Methods:
- `get(model_id, inputs) -> Optional[Any]`
- `put(model_id, inputs, outputs) -> None`
- `invalidate_model(model_id) -> None`
- `get_stats() -> dict`
L2Cache (SQLite)
SQLite WAL-mode persistent cache with AES-256-GCM encryption.
Methods:
- `get(key) -> Optional[bytes]`
- `put(key, value) -> None`
- `delete(key) -> None`
- `clear_prefix(prefix) -> None`
- `get_stats() -> dict`
- `close() -> None`
L3Cache (Redis)
Redis-backed distributed cache tier.
Methods:
- `get(key) -> Optional[bytes]`
- `put(key, value, ttl=None) -> None`
- `delete(key) -> None`
- `clear_all() -> None`
- `get_stats() -> dict`
Compilation
ModelCompiler
Background model compilation with safety gate.
Methods:
- `pre_compile_check(model, sample_input) -> Tuple[bool, Optional[str]]`
- `async compile_async(model, safety_orchestrator) -> Optional[Any]`
- `get_compiled(model) -> Optional[Any]`
TorchCompileSafetyMonitor
Safety monitoring for torch.compile operations.
Methods:
- `pre_compile_check(model, sample_input) -> PreCompileResult`
- `check_health(pre_compile_memory_mb, post_compile_memory_mb) -> HealthCheckResult`
- `check_output_validity(outputs) -> OutputValidityResult`
Configuration
InferenceConfig
Top-level inference configuration dataclass.
Fields:
- `enabled: bool = True` -- Enable/disable inference module
- `max_level: int = 2` -- Maximum enhancement level (0-4)
- `frameworks: List[str]` -- Supported framework names
- `batching: BatchingConfig` -- Batching configuration
- `compilation: CompilationConfig` -- Compilation configuration
- `cache: CacheSettings` -- Cache configuration
Class Methods:
- `from_dict(data) -> InferenceConfig` -- Create configuration from a dictionary
- `from_env() -> InferenceConfig` -- Create configuration from environment variables
Metrics
InferenceMetrics
Central metrics collector for request, model, cache, and GPU statistics.
Methods:
- `record_request(endpoint, total_ms, ...) -> None`
- `record_inference(model_id, latency_ms, batch_size=1, ...) -> None`
- `record_cache_hit() -> None` / `record_cache_miss() -> None`
- `get_all_metrics() -> dict`
- `get_model_stats(model_id) -> Optional[dict]`
- `get_cache_stats() -> dict`
CostEstimator
Inference cost estimation based on GPU time and pricing.
Constructor Parameters:
- `gpu_name: str` -- GPU identifier (e.g., `"A100_80GB"`, `"H100"`, `"T4"`). Falls back to $1.00/hr for unrecognized names.
- `custom_rate: Optional[float] = None` -- Override GPU hourly rate in USD. When provided (and > 0), overrides the lookup table.
Methods:
- `record_inference(baseline_gpu_seconds, optimized_gpu_seconds) -> None` -- Record a single inference for cost tracking
Properties:
- `gpu_seconds_per_request -> Tuple[float, float]` -- `(baseline, optimized)` GPU-seconds per request
- `cost_per_1k_requests -> Tuple[float, float]` -- `(baseline_cost, optimized_cost)` per 1000 requests in USD
- `observed_requests_per_hour -> float` -- Observed request rate based on actual traffic
- `projected_hourly_savings -> float` -- Projected hourly savings in USD
- `projected_monthly_savings -> float` -- Projected monthly savings in USD (730 hours)
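The cost arithmetic behind these properties is straightforward and worth seeing once. The helper below is a hypothetical sketch of the computation, not the `CostEstimator` code: cost per 1k requests is GPU-seconds per request times the per-second GPU rate times 1000, and projected savings scale that difference by observed traffic.

```python
def cost_per_1k(gpu_seconds_per_request, hourly_rate_usd):
    """Cost of 1000 requests given GPU-seconds/request and an hourly rate."""
    per_second = hourly_rate_usd / 3600.0
    return gpu_seconds_per_request * per_second * 1000.0

def monthly_savings(baseline_1k, optimized_1k, requests_per_hour):
    """Projected monthly savings, using the 730-hour month the docs cite."""
    savings_per_request = (baseline_1k - optimized_1k) / 1000.0
    return savings_per_request * requests_per_hour * 730.0

# Example: 50 ms baseline vs 20 ms optimized on a $4.00/hr GPU
baseline = cost_per_1k(0.050, 4.00)
optimized = cost_per_1k(0.020, 4.00)
```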
PrometheusExporter
Generates Prometheus exposition format text from InferenceMetrics.
Static Methods:
- `export(metrics, cost=None, latency_buckets=None, model_labels=None, exemplars=None) -> str`
Emitted Metrics:
| Metric Name | Type | Description |
|---|---|---|
| `epochly_inference_requests_total` | counter | Total HTTP requests processed |
| `epochly_inference_request_latency_avg_ms` | gauge | Average request latency in milliseconds |
| `epochly_inference_inferences_total` | counter | Total model inference calls |
| `epochly_inference_inferences_per_model` | counter | Per-model inference counts (labeled) |
| `epochly_inference_latency_seconds` | histogram | Inference latency distribution in seconds |
| `epochly_inference_cache_hits_total` | counter | Total cache hits |
| `epochly_inference_cache_misses_total` | counter | Total cache misses |
| `epochly_inference_cache_hit_rate` | gauge | Cache hit rate (0.0-1.0) |
| `epochly_inference_gpu_utilization` | gauge | Per-model GPU compute utilization (0.0-1.0) |
| `epochly_inference_gpu_memory_utilization` | gauge | Per-model GPU memory utilization (0.0-1.0) |
| `epochly_inference_cost_per_1k_requests_usd` | gauge | Cost per 1000 requests (baseline and optimized variants) |
| `epochly_inference_cost_savings_per_hour_usd` | gauge | Projected hourly savings |
| `epochly_inference_cost_savings_per_month_usd` | gauge | Projected monthly savings |
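For reference, each metric above is rendered in Prometheus exposition format as a `# HELP` line, a `# TYPE` line, and a sample line. A minimal sketch of that rendering (hypothetical helper, not `PrometheusExporter` itself):

```python
def render_metric(name, value, metric_type, help_text):
    """Render one metric in Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )

text = render_metric(
    "epochly_inference_requests_total", 1234,
    "counter", "Total HTTP requests processed",
)
```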
InferenceOTelExporter
Maps InferenceMetrics to OpenTelemetry instrument names.
Methods:
- `export(metrics) -> Dict[str, Any]`
- `export_cost(cost_estimator) -> Dict[str, Any]`
- `export_safety(breakers=None, drift_monitor=None, golden_store=None) -> Dict[str, Any]`
Profiling
InferenceProfiler
Per-model inference profiler. Profiles the first N calls to establish baselines, then continues sampling periodically for drift detection.
Constructor Parameters:
- `warmup_target: int = 10`
- `sample_rate: float = 0.01`
Methods:
- `set_golden_callback(callback) -> None`
- `record_call(model_id, adapter, call_time, gpu_time, batch_size, model, inputs=None, outputs=None) -> None`
- `get_summary(model_id) -> Optional[ModelProfileSummary]`
- `is_warmup_complete(model_id) -> bool`
- `get_call_count(model_id) -> int` -- Get total call count for a model
- `get_all_summaries() -> Dict[int, ModelProfileSummary]`
Safety
SafetyOrchestrator
Top-level safety coordinator. Orchestrates golden output capture, canary validation, circuit breakers, drift monitoring, fallback chains, and hysteresis control.
ValidatorRegistry
Registry for custom workload validators. Built-in validators: embedding, classifier, reranker, generation, llm, encoder.
Methods:
- `register(workload_type, validator) -> None`
- `unregister(workload_type) -> bool`
- `get(workload_type) -> Optional[Any]`
- `list_registered() -> List[str]`
- `is_async(workload_type) -> bool`
- `on_alert(callback) -> None`
- `fire_alerts(report) -> None`
InputSchemaValidator
Validates input data against an InputSchema before inference.
Usage:
```python
from epochly.inference.safety.validator_registry import (
    InputSchemaValidator, InputSchema, FieldSpec,
)
import numpy as np

schema = InputSchema(fields={
    "input_ids": FieldSpec(dtype="int64", shape=(-1, 128), min_val=0, max_val=30522),
    "attention_mask": FieldSpec(dtype="int64", shape=(-1, 128), min_val=0, max_val=1),
})
validator = InputSchemaValidator(schema)
result = validator.validate({
    "input_ids": np.zeros((4, 128), dtype=np.int64),
    "attention_mask": np.ones((4, 128), dtype=np.int64),
})
if not result.valid:
    for err in result.errors:
        print(f"Validation error: {err}")
```
DriftMonitor
EWMA-based online drift detection for optimized inference outputs.
Methods:
- `should_sample() -> bool`
- `record_shadow_comparison(optimization_name, original_output, optimized_output, adapter, workload_type) -> Optional[str]`
- `refresh_reference(optimization_name) -> None`
- `get_ewma_scores(optimization_name) -> Optional[Dict[str, float]]`
- `get_diagnostics() -> dict`
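The EWMA idea behind the monitor can be shown in a few lines. This is a standalone sketch of exponentially weighted moving-average detection over a per-call divergence score, with hypothetical names and a fixed alarm threshold, not `DriftMonitor`'s internals.

```python
class EwmaDrift:
    """Sketch: EWMA over a divergence score; alarm above a threshold."""

    def __init__(self, alpha=0.1, threshold=0.05):
        self.alpha = alpha          # weight on the newest observation
        self.threshold = threshold
        self.ewma = 0.0

    def update(self, divergence):
        """Fold in one original-vs-optimized divergence; True => alarm."""
        self.ewma = self.alpha * divergence + (1 - self.alpha) * self.ewma
        return self.ewma > self.threshold
```

The smoothing means one noisy comparison cannot trip the alarm, but sustained divergence raises the EWMA above the threshold.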
HysteresisController
Anti-flapping controller for optimization state transitions.
Methods:
- `should_enable(optimization_name, canary_report) -> bool`
- `should_disable(optimization_name) -> bool`
- `record_transition(optimization_name) -> None`
- `get_diagnostics() -> dict`
CircuitBreaker
Per-optimization circuit breaker. Opens on repeated failures, transitions to half-open for probing, and closes on success.
FallbackChain
Graceful degradation chain. When an optimization fails, falls back through a configurable chain of alternatives.
GoldenStore
Storage for golden outputs captured during L0 profiling. Used as ground truth for canary validation.
Privacy & Security
SecurityValidator
Startup security validation.
Constructor Parameters:
- `max_input_size_bytes: int = 10_485_760` -- Maximum input size (10 MB)
Methods:
- `check_cache_isolation(cache) -> SecurityCheckResult`
- `check_privacy_controls(privacy_mode, encryption_key_present, redaction_patterns_configured) -> SecurityCheckResult`
- `check_safety_bypass_resistance(config) -> SecurityCheckResult`
- `check_input_size(input_data) -> SecurityCheckResult`
- `run_all_checks(cache, privacy_mode, encryption_key_present, redaction_patterns_configured, config) -> Dict[str, Any]`
InputRedactor
Regex-based PII/sensitive data redaction before storage. Redaction runs before any data is written to disk or retained in golden stores.
Constructor Parameters:
- `patterns: List[str]` -- List of regex patterns to match sensitive data
- `replacement: str = "[REDACTED]"` -- Replacement string for matched patterns
Methods:
- `redact(data) -> Union[str, bytes]` -- Redact sensitive patterns from input data. Handles both `str` and `bytes` inputs. Returns the same type as the input.
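A minimal sketch of this str/bytes-preserving behavior (an illustration, not the `InputRedactor` implementation, which would compile its patterns once at construction):

```python
import re

def redact(data, patterns, replacement="[REDACTED]"):
    """Apply each regex in order, returning the same type as the input."""
    is_bytes = isinstance(data, bytes)
    text = data.decode("utf-8", errors="replace") if is_bytes else data
    for pattern in patterns:
        text = re.sub(pattern, replacement, text)
    return text.encode("utf-8") if is_bytes else text
```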
TenantIsolation
Namespace cache keys by tenant/deployment ID. Prevents cross-tenant cache leakage by prefixing all cache keys with the tenant identifier.
Constructor Parameters:
- `tenant_id: str = "default"` -- Tenant identifier for key namespacing
Properties:
- `tenant_id -> str` -- The configured tenant identifier
Methods:
- `namespace_key(key) -> str` -- Prefix a cache key with the tenant namespace. Returns `"{tenant_id}:{key}"`.
- `from_env() -> TenantIsolation` -- Class method. Create from `EPOCHLY_TENANT_ID` environment variable (defaults to `"default"`).
AuditLogger
Append-only structured audit log for safety decisions. Records all safety gate decisions, cache accesses, and optimization state changes. Bounded via deque(maxlen) to prevent unbounded memory growth.
Constructor Parameters:
- `max_entries: int = 10_000` -- Maximum number of audit entries retained in memory
Methods:
- `log_event(operation, model_id, optimization_name, result, details=None) -> None` -- Log a safety or cache event
- `get_entries() -> List[dict]` -- Return a copy of all audit entries
- `get_entries_since(since_timestamp) -> List[dict]` -- Return entries since a given Unix timestamp
Properties:
- `count -> int` -- Number of entries in the audit log
PrivacyConfig
Privacy configuration dataclass.
Fields:
- `mode: str = "ephemeral"` -- Privacy mode: `"ephemeral"`, `"persisted_encrypted"`, or `"hashes_only"`
- `encrypt_at_rest: bool = True` -- Enable encryption for persistent storage
- `encryption_key_source: str = "env"` -- Key source: `"env"`, `"file"`, or `"hsm"`
- `redact_patterns: List[str] = []` -- Regex patterns for PII redaction
- `cache_tenant_isolation: bool = True` -- Enable tenant-scoped cache keys
- `audit_log_enabled: bool = True` -- Enable safety audit logging
- `audit_log_path: str = "~/.epochly/audit/"` -- Path for persistent audit logs
Batching
DynamicMicroBatcher
Async request micro-batching with keyed sub-queues.
Constructor Parameters:
- `model: Any` -- The model to run inference on
- `adapter: InferenceAdapter` -- Framework adapter
- `config: Optional[MicroBatcherConfig]` -- Configuration
Methods:
- `async start() -> None`
- `async stop() -> None`
- `async infer(input_data, priority=1, batch_key=None) -> Any`
MicroBatcherConfig
Fields:
- `max_batch_size: int = 32`
- `max_queue_depth: int = 1024`
- `max_wait_ms: float = 50.0`
- `gpu_headroom_pct: int = 20`
Context
RequestContext
Per-request context propagated via contextvars.
Fields:
- `request_id: str`
- `endpoint: str`
- `enqueue_time: float`
- `model_calls: int`
- `timings: Dict[str, float]`
- `trace_id: Optional[str]`
- `span_id: Optional[str]`
BatchKey
Structured key for micro-batching request compatibility.
Fields:
- `model_id: int`
- `endpoint: str`
- `input_shape_bucket: Tuple[int, ...]`
- `dtype: str`
- `max_new_tokens: Optional[int]`
- `temperature: Optional[float]`
- `top_p: Optional[float]`
- `pad_token_id: Optional[int]`