Epochly AI Inference: Configuration

Configure inference optimization via pyproject.toml, environment variables, or the Python API. Control caching, batching, compilation, and safety thresholds.

Overview

The inference accelerator supports three configuration sources, applied in order of precedence:

  1. Programmatic (highest) -- InferenceConfig(...) constructor
  2. Environment variables -- EPOCHLY_INFERENCE_*
  3. pyproject.toml (lowest) -- [tool.epochly.inference]
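For instance, if all three sources set max_level, the programmatic value wins. The merge behaves like a layered dictionary update, as in this minimal sketch (illustrative only, not Epochly's actual implementation):

```python
# Each source expressed as a dict of settings it provides.
pyproject_cfg = {"enabled": True, "max_level": 2}   # [tool.epochly.inference] (lowest)
env_cfg = {"max_level": 1}                          # EPOCHLY_INFERENCE_MAX_LEVEL=1
programmatic_cfg = {"max_level": 3}                 # InferenceConfig(max_level=3) (highest)

# Later sources override earlier ones, so programmatic settings win:
effective = {**pyproject_cfg, **env_cfg, **programmatic_cfg}
# effective == {"enabled": True, "max_level": 3}
```

Settings absent from a higher-precedence source fall through to the next one down, so a partial InferenceConfig only overrides the fields it sets.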

pyproject.toml Configuration

[tool.epochly.inference]
enabled = true
max_level = 2
frameworks = ["pytorch", "transformers", "onnxruntime"]

[tool.epochly.inference.batching]
enabled = true
max_batch_size = 0 # 0 = auto-calibrate from GPU memory
max_queue_depth = 1024
max_wait_ms = 50.0
target_latency_ms = 100.0
gpu_headroom_pct = 20

[tool.epochly.inference.compilation]
enabled = true
backend = "auto" # "auto", "inductor", "cudagraphs"
cache_dir = "~/.epochly/inference_cache"
warmup_requests = 3
tolerance = 1e-5

[tool.epochly.inference.cache]
enabled = true
l1_size = 10000
ttl_seconds = 86400 # 24 hours
privacy_mode = "ephemeral" # "ephemeral", "hashes_only", "persisted_encrypted"
redis_url = "" # Empty = no L3 cache

[tool.epochly.inference.privacy]
mode = "ephemeral" # "ephemeral", "hashes_only", "persisted_encrypted"
redact_patterns = [] # Regex patterns for PII scrubbing
cache_tenant_isolation = true # Namespace cache keys by tenant ID
audit_log_enabled = true # Enable append-only safety audit log
audit_log_path = "~/.epochly/audit/"
encrypt_at_rest = true
encryption_key_source = "env" # "env", "file", "hsm"

Environment Variables

All environment variables follow the pattern EPOCHLY_INFERENCE_<SECTION>_<SETTING> (for example, EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS).
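For example, the following shell session disables micro-batching and caps the enhancement level for one process; the variable names come from the tables below, and the values are illustrative:

```shell
# Disable micro-batching entirely
export EPOCHLY_INFERENCE_BATCHING_ENABLED=false

# Or keep batching but flush after at most 25 ms
export EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS=25.0

# Cap enhancement level for this process
export EPOCHLY_INFERENCE_MAX_LEVEL=1
```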

Top-Level

| Variable | Type | Default | Description |
|---|---|---|---|
| EPOCHLY_INFERENCE_ENABLED | bool | true | Enable/disable inference module |
| EPOCHLY_INFERENCE_MAX_LEVEL | int | 2 | Maximum enhancement level (0-4) |

Batching

| Variable | Type | Default | Description |
|---|---|---|---|
| EPOCHLY_INFERENCE_BATCHING_ENABLED | bool | true | Enable micro-batching |
| EPOCHLY_INFERENCE_BATCHING_MAX_BATCH_SIZE | int | 0 | Max batch size (0 = auto) |
| EPOCHLY_INFERENCE_BATCHING_MAX_QUEUE_DEPTH | int | 1024 | Max pending requests |
| EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS | float | 50.0 | Max wait before flush |
| EPOCHLY_INFERENCE_BATCHING_TARGET_LATENCY_MS | float | 100.0 | Target p95 latency |
| EPOCHLY_INFERENCE_BATCHING_GPU_HEADROOM_PCT | int | 20 | GPU memory headroom % |

Compilation

| Variable | Type | Default | Description |
|---|---|---|---|
| EPOCHLY_INFERENCE_COMPILATION_ENABLED | bool | true | Enable torch.compile |
| EPOCHLY_INFERENCE_COMPILATION_BACKEND | str | "auto" | Compilation backend |
| EPOCHLY_INFERENCE_COMPILATION_CACHE_DIR | str | "~/.epochly/inference_cache" | Cache directory |
| EPOCHLY_INFERENCE_COMPILATION_WARMUP_REQUESTS | int | 3 | Warmup iterations |
| EPOCHLY_INFERENCE_COMPILATION_TOLERANCE | float | 1e-5 | Output tolerance |

Cache

| Variable | Type | Default | Description |
|---|---|---|---|
| EPOCHLY_INFERENCE_CACHE_ENABLED | bool | true | Enable caching |
| EPOCHLY_INFERENCE_CACHE_L1_SIZE | int | 10000 | L1 cache capacity |
| EPOCHLY_INFERENCE_CACHE_TTL_SECONDS | float | 86400 | Cache TTL |
| EPOCHLY_INFERENCE_CACHE_PRIVACY_MODE | str | "ephemeral" | Privacy mode |
| EPOCHLY_INFERENCE_CACHE_REDIS_URL | str | "" | Redis URL for L3 |

Privacy

| Variable | Type | Default | Description |
|---|---|---|---|
| EPOCHLY_INFERENCE_PRIVACY_MODE | str | "ephemeral" | Privacy mode |
| EPOCHLY_INFERENCE_PRIVACY_REDACT_PATTERNS | str | "" | Comma-separated regex patterns for PII scrubbing |
| EPOCHLY_INFERENCE_PRIVACY_CACHE_TENANT_ISOLATION | bool | true | Namespace cache keys by tenant ID |
| EPOCHLY_INFERENCE_PRIVACY_AUDIT_LOG_ENABLED | bool | true | Enable append-only safety audit log |
| EPOCHLY_INFERENCE_PRIVACY_AUDIT_LOG_PATH | str | "~/.epochly/audit/" | Path for persistent audit log files |
| EPOCHLY_INFERENCE_PRIVACY_ENCRYPT_AT_REST | bool | true | Enable at-rest encryption |
| EPOCHLY_INFERENCE_PRIVACY_ENCRYPTION_KEY_SOURCE | str | "env" | Encryption key source |
| EPOCHLY_TENANT_ID | str | "default" | Tenant ID for cache key namespacing |

Privacy Modes

| Mode | L1 Behavior | L2 Behavior | L3 Behavior |
|---|---|---|---|
| ephemeral | In-memory LRU | No persistence | No persistence |
| hashes_only | In-memory LRU | No persistence | No persistence |
| persisted_encrypted | In-memory LRU | SQLite + AES-256-GCM | Redis + TLS |

Privacy Controls

Privacy is configured via the PrivacyConfig dataclass or the [tool.epochly.inference.privacy] section in pyproject.toml.

Input Redaction

PII and sensitive data are scrubbed before any data is written to disk or retained in golden stores. Configure regex patterns to match sensitive data:

from epochly.inference.safety.privacy import InputRedactor

redactor = InputRedactor(
    patterns=[
        r"\b\d{3}-\d{2}-\d{4}\b",                               # SSN
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",  # Email
        r"\b\d{16}\b",                                          # Credit card
    ],
    replacement="[REDACTED]",
)

clean_text = redactor.redact("SSN: 123-45-6789")
# "SSN: [REDACTED]"

Tenant Isolation

Cache keys are namespaced by tenant ID to prevent cross-tenant data leakage:

from epochly.inference.safety.privacy import TenantIsolation

# From constructor
isolation = TenantIsolation(tenant_id="customer_abc")

# From environment (reads EPOCHLY_TENANT_ID)
isolation = TenantIsolation.from_env()

# Namespace a cache key
namespaced = isolation.namespace_key("model:bert:input_hash_abc123")
# "customer_abc:model:bert:input_hash_abc123"

Audit Logging

An append-only audit log records all safety gate decisions, cache accesses, and optimization state changes:

import time

from epochly.inference.safety.privacy import AuditLogger

audit = AuditLogger(max_entries=10_000)
audit.log_event(
    operation="canary_validation",
    model_id=42,
    optimization_name="torch_compile_bert",
    result="PASS",
    details={"cosine_sim": 0.998, "max_diff": 0.0012},
)

# Query recent entries
entries = audit.get_entries_since(time.time() - 3600)  # Last hour

L2 Cache (SQLite)

The L2 cache is configured separately when creating the cache directly:

from epochly.inference.cache_l2 import L2Cache, L2CacheConfig

l2 = L2Cache(L2CacheConfig(
    db_path="~/.epochly/inference_cache/cache.db",
    ttl_seconds=86400,
    encryption_key=b"32-byte-aes-256-key-here!!!!!!!!",  # Optional; must be exactly 32 bytes
))

Encryption Requirements:

  • AES-256-GCM requires a 32-byte key
  • Set via EPOCHLY_L2_ENCRYPTION_KEY environment variable
  • When cryptography library is absent, data is stored unencrypted
  • Each row uses a unique 12-byte nonce
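A 32-byte key can be generated with the standard library. The encoding that EPOCHLY_L2_ENCRYPTION_KEY expects (raw bytes, hex, or base64) is not specified here, so the hex encoding in this sketch is an assumption:

```python
import os

# AES-256-GCM requires exactly 32 bytes of key material
key = os.urandom(32)
assert len(key) == 32

# Export as hex for the environment variable (encoding is an assumption)
os.environ["EPOCHLY_L2_ENCRYPTION_KEY"] = key.hex()
```

For production use, prefer a managed secret store over generating keys ad hoc, and rotate keys according to your compliance requirements.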

L3 Cache (Redis)

from epochly.inference.cache_l3 import L3Cache, L3CacheConfig

l3 = L3Cache(L3CacheConfig(
    redis_url="redis://localhost:6379/0",
    key_prefix="tenant_a:",
    ttl_seconds=86400,
    tls_enabled=False,
    socket_timeout=5.0,
))

TLS Support: Use rediss:// URL scheme or set tls_enabled=True.

LLM Companion

from epochly.inference.serving.llm_companion import (
    LLMCompanionAdapter,
    LLMCompanionConfig,
)

companion = LLMCompanionAdapter(
    runtime_url="http://localhost:8000",
    config=LLMCompanionConfig(
        max_concurrent_requests=64,
        cache_enabled=True,
        cache_max_size=10_000,
        default_timeout_seconds=120.0,
        api_path="/v1/completions",
    ),
)

Ray Serve Wrapper

from epochly.inference.serving.ray_serve_wrapper import (
    EpochlyRayServeWrapper,
    RayServeConfig,
    epochly_serve,
)

# Via wrapper
wrapper = EpochlyRayServeWrapper(RayServeConfig(
    enable_telemetry=True,
    enable_priority_routing=True,
))

# Via decorator
@epochly_serve(config=RayServeConfig())
def predict(model, data):
    return model(data)

Security Validator

from epochly.inference.security import SecurityValidator

validator = SecurityValidator(max_input_size_bytes=10 * 1024 * 1024)
report = validator.run_all_checks(
    cache=inference_cache,
    privacy_mode="ephemeral",
    encryption_key_present=False,
    redaction_patterns_configured=False,
    config={"canary_enabled": True},
)

Compilation Safety Monitor

from epochly.inference.compilation.safety_monitor import TorchCompileSafetyMonitor

monitor = TorchCompileSafetyMonitor(
    max_graph_breaks=5,
    memory_growth_threshold_mb=500.0,
)

# Pre-compile check
result = monitor.pre_compile_check(model, sample_input)

# Post-compile health
health = monitor.check_health(pre_mb=1000, post_mb=1050)

# Output validity
validity = monitor.check_output_validity(model_output)

Prometheus Endpoint

Expose a Prometheus-compatible /metrics endpoint using the PrometheusExporter:

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from epochly.inference.metrics.prometheus_exporter import PrometheusExporter
from epochly.inference.metrics.inference_metrics import InferenceMetrics
from epochly.inference.metrics.cost_estimator import CostEstimator

app = FastAPI()
inference_metrics = InferenceMetrics()
cost_estimator = CostEstimator(gpu_name="A100_80GB")

@app.get("/metrics")
def metrics():
    return PlainTextResponse(
        content=PrometheusExporter.export(
            inference_metrics,
            cost=cost_estimator,
            model_labels={
                id(model): {
                    "model_name": "bert-base",
                    "model_version": "v2.1",
                    "endpoint": "/predict",
                }
            },
        ),
        media_type="text/plain; version=0.0.4",
    )

The exporter generates metrics in the standard Prometheus exposition format, including counters, gauges, and histograms.
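Output in this format looks like the following; the metric and label names here are illustrative only, not Epochly's actual metric names:

```
# HELP inference_requests_total Total inference requests served
# TYPE inference_requests_total counter
inference_requests_total{model_name="bert-base",endpoint="/predict"} 12453
# HELP inference_latency_seconds Request latency
# TYPE inference_latency_seconds histogram
inference_latency_seconds_bucket{le="0.1"} 11020
inference_latency_seconds_sum 812.4
inference_latency_seconds_count 12453
```

Any Prometheus-compatible scraper can consume this endpoint; point a scrape job at the /metrics path of your serving process.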