Epochly AI Inference: Quickstart
Get started with Epochly's AI Inference Accelerator. Wrap your model in one line and start optimizing PyTorch, Transformers, and ONNX Runtime inference.
Installation
The inference accelerator ships as part of the epochly package:
```shell
pip install epochly
```
Optional dependencies enable the full feature set:
```shell
# L2 persistent cache encryption
pip install cryptography
# L3 Redis cache
pip install redis
# Async HTTP for LLM companion (stdlib fallback available)
pip install httpx
# Statistical significance for A/B testing (manual fallback available)
pip install scipy
# HuggingFace model registry backend
pip install transformers
# MLflow model registry backend
pip install mlflow
```
Zero-Config Framework Detection
```python
import epochly

# Importing a framework triggers automatic detection
import torch

# Check detected frameworks
from epochly.inference.detector import InferenceDetector

detector = InferenceDetector.instance()
print(detector.detected_frameworks)  # ['torch']
```
One-Call Model Profiling
```python
import epochly
import torch

# Create and configure model first
model = MyModel().to("cuda").eval()

# Wrap LAST (after .to(), .eval(), etc.)
model = epochly.wrap(model)

# Use normally -- profiling is automatic
output = model(input_tensor)

# Access raw model for serialization
torch.save(model.unwrapped.state_dict(), "model.pt")
```
FastAPI Serving Integration
```python
from fastapi import FastAPI
from epochly.inference.serving.fastapi_middleware import EpochlyInferenceMiddleware

app = FastAPI()
app.add_middleware(EpochlyInferenceMiddleware, max_concurrency=64)

@app.post("/predict")
async def predict(data: dict):
    result = model(data["input"])
    return {"prediction": result}
```
Prometheus Metrics Endpoint
Expose inference metrics for Prometheus scraping alongside your FastAPI application:
```python
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from epochly.inference.serving.fastapi_middleware import EpochlyInferenceMiddleware
from epochly.inference.metrics.prometheus_exporter import PrometheusExporter
from epochly.inference.metrics.inference_metrics import InferenceMetrics
from epochly.inference.metrics.cost_estimator import CostEstimator

app = FastAPI()
app.add_middleware(EpochlyInferenceMiddleware, max_concurrency=64)

inference_metrics = InferenceMetrics()
cost_estimator = CostEstimator(gpu_name="A100_80GB")

@app.get("/metrics")
def metrics():
    return PlainTextResponse(
        content=PrometheusExporter.export(
            inference_metrics,
            cost=cost_estimator,
            model_labels={
                id(model): {
                    "model_name": "bert-base",
                    "model_version": "v2.1",
                    "endpoint": "/predict",
                }
            },
        ),
        media_type="text/plain; version=0.0.4",
    )
```
This exposes counters, gauges, and histograms including epochly_inference_requests_total, epochly_inference_latency_seconds, epochly_inference_cache_hit_rate, and per-model GPU utilization.
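For orientation, the endpoint serves standard Prometheus text exposition format. An abridged, illustrative scrape with the labels configured above might look like this (metric values and HELP strings here are made up; the exact metric and label sets depend on your configuration):

```text
# HELP epochly_inference_requests_total Total inference requests
# TYPE epochly_inference_requests_total counter
epochly_inference_requests_total{model_name="bert-base",model_version="v2.1",endpoint="/predict"} 18231
# HELP epochly_inference_latency_seconds Inference latency
# TYPE epochly_inference_latency_seconds histogram
epochly_inference_latency_seconds_bucket{model_name="bert-base",le="0.05"} 17904
epochly_inference_latency_seconds_sum{model_name="bert-base"} 212.4
epochly_inference_latency_seconds_count{model_name="bert-base"} 18231
# HELP epochly_inference_cache_hit_rate Cache hit rate
# TYPE epochly_inference_cache_hit_rate gauge
epochly_inference_cache_hit_rate{model_name="bert-base"} 0.41
```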
A/B Model Testing
Compare a production model against a challenger with built-in statistical analysis:
```python
from epochly.inference.serving.ab_testing import ABModelComparison

# Split mode: 10% traffic to challenger
ab = ABModelComparison(
    model_a=production_model,
    model_b=challenger_model,
    adapter=pytorch_adapter,
    traffic_split=0.1,
)

# Use in your endpoint -- routing is automatic
result = ab.infer(input_tensor)

# After collecting enough samples, check statistical significance
report = ab.get_comparison_report()
sig = ab.compute_significance(confidence_level=0.95)
if sig["significant"]:
    print(f"Challenger is significantly different (p={sig['p_value']:.4f})")
    print(f"Latency diff CI: {sig['confidence_interval']}")
```
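Significance testing of this kind amounts to a two-sample test on per-request latencies (scipy when installed, a manual fallback otherwise). As a rough illustration of what such a fallback can look like, and not Epochly's actual code, here is a standalone Welch's t-test with a normal-approximation p-value:

```python
import math

def welch_t_test(a, b):
    """Welch's two-sample t-test; p-value via the normal approximation,
    which is reasonable for the large sample sizes A/B tests collect."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    t = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided
    return t, p

# Synthetic per-request latencies (ms) for model A vs model B
a = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.1, 9.7, 10.0, 10.2] * 20
b = [11.1, 11.3, 10.9, 11.0, 11.2, 11.4, 10.8, 11.1, 11.0, 11.2] * 20

t, p = welch_t_test(a, b)
print(f"t={t:.2f} p={p:.4g}")  # p far below 0.05 here: B is measurably slower
```
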
For safe comparison without affecting users, use shadow mode (both models run, only A's result returned):
```python
ab_shadow = ABModelComparison(
    model_a=production_model,
    model_b=challenger_model,
    adapter=pytorch_adapter,
    traffic_split=0.5,  # Ignored in shadow mode
    shadow=True,
)
```
For gradual rollouts with health check gates:
```python
from epochly.inference.serving.ab_testing import GradualRollout

rollout = GradualRollout(
    initial_pct=1.0,
    target_pct=100.0,
    step_duration_sec=300,  # 5 min per step
)
# Steps through: 1% -> 5% -> 10% -> 25% -> 50% -> 100%
```
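The step schedule can be pictured with a small standalone helper. This is illustrative only: the real `GradualRollout` also gates each step on health checks rather than advancing on elapsed time alone.

```python
# Fixed step ladder matching the comment in the snippet above
STEPS = [1, 5, 10, 25, 50, 100]

def rollout_pct(elapsed_sec: float, step_duration_sec: float = 300) -> int:
    """Challenger traffic percentage after elapsed_sec, assuming every
    step passes its health check and lasts exactly step_duration_sec."""
    idx = min(int(elapsed_sec // step_duration_sec), len(STEPS) - 1)
    return STEPS[idx]

print(rollout_pct(0))       # start of rollout
print(rollout_pct(650))     # two full 300 s steps have elapsed
print(rollout_pct(10_000))  # schedule exhausted: full traffic
```
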
Custom Validator Registration
Register domain-specific validators for enterprise workloads:
```python
from epochly.inference.safety.validator_registry import ValidatorRegistry

registry = ValidatorRegistry()

# Define a custom validator implementing the WorkloadValidator protocol
class FinancialAccuracyValidator:
    def validate(self, adapter, model, golden_outputs, config):
        # Validate that financial calculations maintain required precision
        ...

    def compute_comparison_metrics(self, adapter, original, optimized):
        # Compare original vs optimized outputs
        return {"precision_delta": 0.0001, "max_rounding_error": 0.005}

    def check_drift_thresholds(self, ewma_scores):
        if ewma_scores.get("precision_delta", 0) > 0.01:
            return "Financial precision drift exceeds 1% threshold"
        return None

registry.register("financial", FinancialAccuracyValidator())

# Register alert callback for FAIL results
def on_failure(report):
    send_pagerduty_alert(f"Canary FAIL: {report.optimization_name}")

registry.on_alert(on_failure)
```
Validate input data before inference with schema validation:
```python
from epochly.inference.safety.validator_registry import (
    InputSchemaValidator, InputSchema, FieldSpec,
)
import numpy as np

schema = InputSchema(fields={
    "input_ids": FieldSpec(dtype="int64", shape=(-1, 128), min_val=0, max_val=30522),
    "attention_mask": FieldSpec(dtype="int64", shape=(-1, 128), min_val=0, max_val=1),
})

validator = InputSchemaValidator(schema)
result = validator.validate({
    "input_ids": np.zeros((4, 128), dtype=np.int64),
    "attention_mask": np.ones((4, 128), dtype=np.int64),
})
assert result.valid
```
Model Registry
Load models from HuggingFace Hub, MLflow, or local paths with version tracking:
```python
from epochly.inference.registry.model_registry import (
    ModelRegistryClient,
    ModelLineage,
    ModelStage,
)

# Initialize with HuggingFace backend
registry = ModelRegistryClient(
    backend="huggingface",
    on_version_change=lambda name, old, new: print(f"{name}: {old} -> {new}"),
)

# Load a model (returns model object + version hash)
model, version = registry.load("bert-base-uncased", revision="main")

# Track provenance
registry.set_lineage("bert-base-uncased", version, ModelLineage(
    data_version="v2.3",
    code_hash="abc123f",
    hyperparameters={"lr": 3e-5, "epochs": 3},
    training_metrics={"loss": 0.042, "accuracy": 0.965},
))

# Promote through lifecycle stages
registry.set_stage("bert-base-uncased", version, ModelStage.DRAFT)
registry.promote("bert-base-uncased", version)  # DRAFT -> STAGING
registry.promote("bert-base-uncased", version)  # STAGING -> PRODUCTION

# Check for updates
new_version = registry.check_for_update("bert-base-uncased")
if new_version:
    model, version = registry.load("bert-base-uncased")

# Rollback to previous production version
previous = registry.rollback("bert-base-uncased")
```
For local model files:
```python
registry = ModelRegistryClient(backend="local")
model_bytes, version = registry.load("/models/bert-fine-tuned.pt")
```
Inference Cache
```python
from epochly.inference.cache import InferenceCache, CacheConfig

cache = InferenceCache(CacheConfig(
    l1_size=10_000,
    ttl_seconds=3600,
    privacy_mode="ephemeral",
))

# Automatic cache integration happens via InferenceOptimizer
# Manual usage:
cache.put(model_id=id(model), inputs=input_data, outputs=result)
cached = cache.get(model_id=id(model), inputs=input_data)
```
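Lookups like the one above key on the (model, inputs) pair, which requires a deterministic fingerprint of the inputs. A minimal standalone sketch of that idea (not Epochly's implementation; real tensor inputs would hash their raw bytes rather than JSON):

```python
import hashlib
import json

def cache_key(model_id: int, inputs: dict) -> str:
    # Canonical JSON (sorted keys) so logically-equal inputs map to the
    # same key; assumes JSON-serializable inputs for this sketch
    payload = json.dumps({"model": model_id, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key(42, {"text": "hello", "top_k": 5})
k2 = cache_key(42, {"top_k": 5, "text": "hello"})  # same content, reordered
k3 = cache_key(42, {"text": "goodbye", "top_k": 5})
print(k1 == k2, k1 == k3)  # key ordering does not matter; content does
```
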
LLM Companion Mode
For services wrapping vLLM or TGI:
```python
from epochly.inference.serving.llm_companion import (
    LLMCompanionAdapter,
    LLMCompanionConfig,
)

companion = LLMCompanionAdapter(
    runtime_url="http://localhost:8000",
    config=LLMCompanionConfig(
        max_concurrent_requests=32,
        cache_enabled=True,
    ),
)

# In your endpoint handler:
result = await companion.generate("What is machine learning?")
```
Configuration
Configure via pyproject.toml:
```toml
[tool.epochly.inference]
enabled = true
max_level = 2

[tool.epochly.inference.batching]
max_batch_size = 32
max_wait_ms = 50.0

[tool.epochly.inference.cache]
l1_size = 10000
ttl_seconds = 86400
privacy_mode = "ephemeral"

[tool.epochly.inference.privacy]
mode = "ephemeral"
redact_patterns = []
cache_tenant_isolation = true
audit_log_enabled = true
```
Or via environment variables:
```shell
EPOCHLY_INFERENCE_ENABLED=true
EPOCHLY_INFERENCE_MAX_LEVEL=2
EPOCHLY_INFERENCE_BATCHING_MAX_BATCH_SIZE=32
EPOCHLY_INFERENCE_CACHE_PRIVACY_MODE=ephemeral
EPOCHLY_INFERENCE_PRIVACY_REDACT_PATTERNS="\b\d{3}-\d{2}-\d{4}\b,\b\d{16}\b"
EPOCHLY_INFERENCE_PRIVACY_AUDIT_LOG_ENABLED=true
EPOCHLY_TENANT_ID=customer_abc
```
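EPOCHLY_INFERENCE_PRIVACY_REDACT_PATTERNS takes a comma-separated list of regexes; the example value matches US SSNs and bare 16-digit card numbers. A standalone sketch of how such patterns apply (illustrative only, not Epochly's redaction code, and the comma split assumes no pattern itself contains a comma):

```python
import re

# Same value as the environment variable above, split on commas
raw = r"\b\d{3}-\d{2}-\d{4}\b,\b\d{16}\b"
patterns = [re.compile(p) for p in raw.split(",")]

def redact(text: str) -> str:
    # Replace every match of every configured pattern
    for pattern in patterns:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact("SSN 123-45-6789 card 4111111111111111"))
# SSN [REDACTED] card [REDACTED]
```
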
Next Steps
- API Reference -- All public APIs including ValidatorRegistry, PrometheusExporter, ABModelComparison, ModelRegistryClient, and privacy controls
- Architecture -- System design with A/B testing, model registry, and metrics export layers
- Configuration -- Full config reference including privacy settings and Prometheus endpoint setup