Epochly AI Inference: Configuration
Configure inference optimization via pyproject.toml, environment variables, or the Python API. Control caching, batching, compilation, and safety thresholds.
Overview
The inference accelerator supports three configuration sources, applied in order of precedence:
- Programmatic (highest) -- `InferenceConfig(...)` constructor
- Environment variables -- `EPOCHLY_INFERENCE_*`
- `pyproject.toml` (lowest) -- `[tool.epochly.inference]`
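To illustrate how this precedence order behaves, here is a minimal, self-contained sketch of a resolver (the function and dict shapes are hypothetical, not Epochly's actual implementation; note that environment values arrive as strings):

```python
import os


def resolve_setting(name: str, programmatic: dict, toml_defaults: dict):
    """Resolve a setting using the documented precedence:
    programmatic > environment variable > pyproject.toml."""
    if name in programmatic:  # highest precedence
        return programmatic[name]
    env_key = f"EPOCHLY_INFERENCE_{name.upper()}"
    if env_key in os.environ:  # middle precedence (string-valued)
        return os.environ[env_key]
    return toml_defaults.get(name)  # lowest precedence


os.environ["EPOCHLY_INFERENCE_MAX_LEVEL"] = "3"
resolve_setting("max_level", {}, {"max_level": 2})  # env wins -> "3"
resolve_setting("max_level", {"max_level": 4}, {"max_level": 2})  # programmatic wins -> 4
```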
pyproject.toml Configuration
```toml
[tool.epochly.inference]
enabled = true
max_level = 2
frameworks = ["pytorch", "transformers", "onnxruntime"]

[tool.epochly.inference.batching]
enabled = true
max_batch_size = 0         # 0 = auto-calibrate from GPU memory
max_queue_depth = 1024
max_wait_ms = 50.0
target_latency_ms = 100.0
gpu_headroom_pct = 20

[tool.epochly.inference.compilation]
enabled = true
backend = "auto"           # "auto", "inductor", "cudagraphs"
cache_dir = "~/.epochly/inference_cache"
warmup_requests = 3
tolerance = 1e-5

[tool.epochly.inference.cache]
enabled = true
l1_size = 10000
ttl_seconds = 86400        # 24 hours
privacy_mode = "ephemeral" # "ephemeral", "hashes_only", "persisted_encrypted"
redis_url = ""             # Empty = no L3 cache

[tool.epochly.inference.privacy]
mode = "ephemeral"              # "ephemeral", "hashes_only", "persisted_encrypted"
redact_patterns = []            # Regex patterns for PII scrubbing
cache_tenant_isolation = true   # Namespace cache keys by tenant ID
audit_log_enabled = true        # Enable append-only safety audit log
audit_log_path = "~/.epochly/audit/"
encrypt_at_rest = true
encryption_key_source = "env"   # "env", "file", "hsm"
```
Environment Variables
All environment variables use the `EPOCHLY_INFERENCE_` prefix, followed by the section and key names in upper snake case (for example, `EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS`).
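For example, to override a few settings for a single shell session (values taken from the tables below):

```shell
# Disable torch.compile and cap the batch size for this process
export EPOCHLY_INFERENCE_COMPILATION_ENABLED=false
export EPOCHLY_INFERENCE_BATCHING_MAX_BATCH_SIZE=64
export EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS=25.0
```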
Top-Level
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_ENABLED` | bool | `true` | Enable/disable inference module |
| `EPOCHLY_INFERENCE_MAX_LEVEL` | int | `2` | Maximum enhancement level (0-4) |
Batching
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_BATCHING_ENABLED` | bool | `true` | Enable micro-batching |
| `EPOCHLY_INFERENCE_BATCHING_MAX_BATCH_SIZE` | int | `0` | Max batch size (0=auto) |
| `EPOCHLY_INFERENCE_BATCHING_MAX_QUEUE_DEPTH` | int | `1024` | Max pending requests |
| `EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS` | float | `50.0` | Max wait before flush |
| `EPOCHLY_INFERENCE_BATCHING_TARGET_LATENCY_MS` | float | `100.0` | Target p95 latency |
| `EPOCHLY_INFERENCE_BATCHING_GPU_HEADROOM_PCT` | int | `20` | GPU memory headroom % |
Compilation
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_COMPILATION_ENABLED` | bool | `true` | Enable torch.compile |
| `EPOCHLY_INFERENCE_COMPILATION_BACKEND` | str | `"auto"` | Compilation backend |
| `EPOCHLY_INFERENCE_COMPILATION_CACHE_DIR` | str | `"~/.epochly/inference_cache"` | Cache directory |
| `EPOCHLY_INFERENCE_COMPILATION_WARMUP_REQUESTS` | int | `3` | Warmup iterations |
| `EPOCHLY_INFERENCE_COMPILATION_TOLERANCE` | float | `1e-5` | Output tolerance |
Cache
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_CACHE_ENABLED` | bool | `true` | Enable caching |
| `EPOCHLY_INFERENCE_CACHE_L1_SIZE` | int | `10000` | L1 cache capacity |
| `EPOCHLY_INFERENCE_CACHE_TTL_SECONDS` | float | `86400` | Cache TTL |
| `EPOCHLY_INFERENCE_CACHE_PRIVACY_MODE` | str | `"ephemeral"` | Privacy mode |
| `EPOCHLY_INFERENCE_CACHE_REDIS_URL` | str | `""` | Redis URL for L3 |
Privacy
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_PRIVACY_MODE` | str | `"ephemeral"` | Privacy mode |
| `EPOCHLY_INFERENCE_PRIVACY_REDACT_PATTERNS` | str | `""` | Comma-separated regex patterns for PII scrubbing |
| `EPOCHLY_INFERENCE_PRIVACY_CACHE_TENANT_ISOLATION` | bool | `true` | Namespace cache keys by tenant ID |
| `EPOCHLY_INFERENCE_PRIVACY_AUDIT_LOG_ENABLED` | bool | `true` | Enable append-only safety audit log |
| `EPOCHLY_INFERENCE_PRIVACY_AUDIT_LOG_PATH` | str | `"~/.epochly/audit/"` | Path for persistent audit log files |
| `EPOCHLY_INFERENCE_PRIVACY_ENCRYPT_AT_REST` | bool | `true` | Enable at-rest encryption |
| `EPOCHLY_INFERENCE_PRIVACY_ENCRYPTION_KEY_SOURCE` | str | `"env"` | Encryption key source |
| `EPOCHLY_TENANT_ID` | str | `"default"` | Tenant ID for cache key namespacing |
Privacy Modes
| Mode | L1 Behavior | L2 Behavior | L3 Behavior |
|---|---|---|---|
| `ephemeral` | In-memory LRU | No persistence | No persistence |
| `hashes_only` | In-memory LRU | No persistence | No persistence |
| `persisted_encrypted` | In-memory LRU | SQLite + AES-256-GCM | Redis + TLS |
Privacy Controls
Privacy is configured via the PrivacyConfig dataclass or the [tool.epochly.inference.privacy] section in pyproject.toml.
Input Redaction
PII and sensitive data are scrubbed before any data is written to disk or retained in golden stores. Configure regex patterns to match sensitive data:
```python
from epochly.inference.safety.privacy import InputRedactor

redactor = InputRedactor(
    patterns=[
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",  # Email
        r"\b\d{16}\b",  # Credit card
    ],
    replacement="[REDACTED]",
)

clean_text = redactor.redact("SSN: 123-45-6789")
# "SSN: [REDACTED]"
```
Tenant Isolation
Cache keys are namespaced by tenant ID to prevent cross-tenant data leakage:
```python
from epochly.inference.safety.privacy import TenantIsolation

# From constructor
isolation = TenantIsolation(tenant_id="customer_abc")

# From environment (reads EPOCHLY_TENANT_ID)
isolation = TenantIsolation.from_env()

# Namespace a cache key
namespaced = isolation.namespace_key("model:bert:input_hash_abc123")
# "customer_abc:model:bert:input_hash_abc123"
```
Audit Logging
An append-only audit log records all safety gate decisions, cache accesses, and optimization state changes:
```python
import time

from epochly.inference.safety.privacy import AuditLogger

audit = AuditLogger(max_entries=10_000)
audit.log_event(
    operation="canary_validation",
    model_id=42,
    optimization_name="torch_compile_bert",
    result="PASS",
    details={"cosine_sim": 0.998, "max_diff": 0.0012},
)

# Query recent entries
entries = audit.get_entries_since(time.time() - 3600)  # Last hour
```
L2 Cache (SQLite)
The L2 cache is configured separately when creating the cache directly:
```python
from epochly.inference.cache_l2 import L2Cache, L2CacheConfig

l2 = L2Cache(
    L2CacheConfig(
        db_path="~/.epochly/inference_cache/cache.db",
        ttl_seconds=86400,
        encryption_key=b"32-byte-aes-256-key-here!!!!!!!!",  # Optional; must be exactly 32 bytes
    )
)
```
Encryption Requirements:
- AES-256-GCM requires a 32-byte key
- Set via the `EPOCHLY_L2_ENCRYPTION_KEY` environment variable
- When the `cryptography` library is absent, data is stored unencrypted
- Each row uses a unique 12-byte nonce
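One way to generate a suitable key is Python's standard `secrets` module. The hex encoding below is an assumption for round-tripping raw bytes through the environment; check how your deployment expects `EPOCHLY_L2_ENCRYPTION_KEY` to be encoded:

```python
import os
import secrets

# Generate a random 32-byte key, as required for AES-256-GCM
key = secrets.token_bytes(32)

# Store it hex-encoded in the environment (env vars cannot hold raw bytes);
# decode back to bytes before passing it as encryption_key
os.environ["EPOCHLY_L2_ENCRYPTION_KEY"] = key.hex()
decoded = bytes.fromhex(os.environ["EPOCHLY_L2_ENCRYPTION_KEY"])
```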
L3 Cache (Redis)
```python
from epochly.inference.cache_l3 import L3Cache, L3CacheConfig

l3 = L3Cache(
    L3CacheConfig(
        redis_url="redis://localhost:6379/0",
        key_prefix="tenant_a:",
        ttl_seconds=86400,
        tls_enabled=False,
        socket_timeout=5.0,
    )
)
```
TLS Support: Use the `rediss://` URL scheme or set `tls_enabled=True`.
LLM Companion
```python
from epochly.inference.serving.llm_companion import (
    LLMCompanionAdapter,
    LLMCompanionConfig,
)

companion = LLMCompanionAdapter(
    runtime_url="http://localhost:8000",
    config=LLMCompanionConfig(
        max_concurrent_requests=64,
        cache_enabled=True,
        cache_max_size=10_000,
        default_timeout_seconds=120.0,
        api_path="/v1/completions",
    ),
)
```
Ray Serve Wrapper
```python
from epochly.inference.serving.ray_serve_wrapper import (
    EpochlyRayServeWrapper,
    RayServeConfig,
    epochly_serve,
)

# Via wrapper
wrapper = EpochlyRayServeWrapper(
    RayServeConfig(
        enable_telemetry=True,
        enable_priority_routing=True,
    )
)

# Via decorator
@epochly_serve(config=RayServeConfig())
def predict(model, data):
    return model(data)
```
Security Validator
```python
from epochly.inference.security import SecurityValidator

validator = SecurityValidator(max_input_size_bytes=10 * 1024 * 1024)

report = validator.run_all_checks(
    cache=inference_cache,
    privacy_mode="ephemeral",
    encryption_key_present=False,
    redaction_patterns_configured=False,
    config={"canary_enabled": True},
)
```
Compilation Safety Monitor
```python
from epochly.inference.compilation.safety_monitor import TorchCompileSafetyMonitor

monitor = TorchCompileSafetyMonitor(
    max_graph_breaks=5,
    memory_growth_threshold_mb=500.0,
)

# Pre-compile check
result = monitor.pre_compile_check(model, sample_input)

# Post-compile health
health = monitor.check_health(pre_mb=1000, post_mb=1050)

# Output validity
validity = monitor.check_output_validity(model_output)
```
Prometheus Endpoint
Expose a Prometheus-compatible `/metrics` endpoint using the `PrometheusExporter`:
```python
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

from epochly.inference.metrics.prometheus_exporter import PrometheusExporter
from epochly.inference.metrics.inference_metrics import InferenceMetrics
from epochly.inference.metrics.cost_estimator import CostEstimator

app = FastAPI()
inference_metrics = InferenceMetrics()
cost_estimator = CostEstimator(gpu_name="A100_80GB")

@app.get("/metrics")
def metrics():
    return PlainTextResponse(
        content=PrometheusExporter.export(
            inference_metrics,
            cost=cost_estimator,
            # `model` is the model object being served
            model_labels={
                id(model): {
                    "model_name": "bert-base",
                    "model_version": "v2.1",
                    "endpoint": "/predict",
                }
            },
        ),
        media_type="text/plain; version=0.0.4",
    )
```
The exporter generates metrics in the standard Prometheus exposition format, including counters, gauges, and histograms.
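For reference, a counter in exposition format 0.0.4 can be rendered by hand as follows (the metric name here is illustrative, not one of Epochly's actual metric names):

```python
def render_counter(name: str, help_text: str, value: float, labels: dict) -> str:
    """Render a single counter in Prometheus exposition format 0.0.4."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )


text = render_counter(
    "inference_requests_total",  # hypothetical metric name
    "Total inference requests served.",
    1234.0,
    {"model_name": "bert-base", "endpoint": "/predict"},
)
```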