Epochly AI Inference: Configuration
Configure inference optimization via pyproject.toml, environment variables, or the Python API. Control caching, batching, compilation, and safety thresholds.
Overview
The inference accelerator supports three configuration sources, applied in order of precedence:
- Programmatic (highest) -- `InferenceConfig(...)` constructor
- Environment variables -- `EPOCHLY_INFERENCE_*`
- `pyproject.toml` (lowest) -- `[tool.epochly.inference]`
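To illustrate how this precedence order behaves, here is a minimal, self-contained sketch of a resolver (the function and dict shapes are hypothetical, not Epochly's actual implementation; note that environment values arrive as strings):

```python
import os


def resolve_setting(name: str, programmatic: dict, toml_defaults: dict):
    """Resolve a setting using the documented precedence:
    programmatic > environment variable > pyproject.toml."""
    if name in programmatic:  # highest precedence
        return programmatic[name]
    env_key = f"EPOCHLY_INFERENCE_{name.upper()}"
    if env_key in os.environ:  # middle precedence (string-valued)
        return os.environ[env_key]
    return toml_defaults.get(name)  # lowest precedence


os.environ["EPOCHLY_INFERENCE_MAX_LEVEL"] = "3"
resolve_setting("max_level", {}, {"max_level": 2})  # env wins -> "3"
resolve_setting("max_level", {"max_level": 4}, {"max_level": 2})  # programmatic wins -> 4
```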
pyproject.toml Configuration
```toml
[tool.epochly.inference]
enabled = true
max_level = 2
frameworks = ["pytorch", "transformers", "onnxruntime"]

[tool.epochly.inference.batching]
enabled = true
max_batch_size = 0         # 0 = auto-calibrate from GPU memory
max_queue_depth = 1024
max_wait_ms = 50.0
target_latency_ms = 100.0
gpu_headroom_pct = 20

[tool.epochly.inference.compilation]
enabled = true
backend = "auto"           # "auto", "inductor", "cudagraphs"
cache_dir = "~/.epochly/inference_cache"
warmup_requests = 3
tolerance = 1e-5

[tool.epochly.inference.cache]
enabled = true
l1_size = 10000
ttl_seconds = 86400        # 24 hours
privacy_mode = "ephemeral" # "ephemeral", "hashes_only", "persisted_encrypted"
redis_url = ""             # Empty = no L3 cache

[tool.epochly.inference.privacy]
mode = "ephemeral"              # "ephemeral", "hashes_only", "persisted_encrypted"
redact_patterns = []            # Regex patterns for PII scrubbing
cache_tenant_isolation = true   # Namespace cache keys by tenant ID
audit_log_enabled = true        # Enable append-only safety audit log
audit_log_path = "~/.epochly/audit/"
encrypt_at_rest = true
encryption_key_source = "env"   # "env", "file", "hsm"
```
Environment Variables
All environment variables use the `EPOCHLY_INFERENCE_` prefix, followed by the section and key names in upper snake case (for example, `EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS`).
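For example, to override a few settings for a single shell session (values taken from the tables below):

```shell
# Disable torch.compile and cap the batch size for this process
export EPOCHLY_INFERENCE_COMPILATION_ENABLED=false
export EPOCHLY_INFERENCE_BATCHING_MAX_BATCH_SIZE=64
export EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS=25.0
```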
Top-Level
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_ENABLED` | bool | `true` | Enable/disable inference module |
| `EPOCHLY_INFERENCE_MAX_LEVEL` | int | `2` | Maximum enhancement level (0-4) |
Batching
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_BATCHING_ENABLED` | bool | `true` | Enable micro-batching |
| `EPOCHLY_INFERENCE_BATCHING_MAX_BATCH_SIZE` | int | `0` | Max batch size (0=auto) |
| `EPOCHLY_INFERENCE_BATCHING_MAX_QUEUE_DEPTH` | int | `1024` | Max pending requests |
| `EPOCHLY_INFERENCE_BATCHING_MAX_WAIT_MS` | float | `50.0` | Max wait before flush |
| `EPOCHLY_INFERENCE_BATCHING_TARGET_LATENCY_MS` | float | `100.0` | Target p95 latency |
| `EPOCHLY_INFERENCE_BATCHING_GPU_HEADROOM_PCT` | int | `20` | GPU memory headroom % |
Compilation
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_COMPILATION_ENABLED` | bool | `true` | Enable torch.compile |
| `EPOCHLY_INFERENCE_COMPILATION_BACKEND` | str | `"auto"` | Compilation backend |
| `EPOCHLY_INFERENCE_COMPILATION_CACHE_DIR` | str | `"~/.epochly/inference_cache"` | Cache directory |
| `EPOCHLY_INFERENCE_COMPILATION_WARMUP_REQUESTS` | int | `3` | Warmup iterations |
| `EPOCHLY_INFERENCE_COMPILATION_TOLERANCE` | float | `1e-5` | Output tolerance |
Cache
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_CACHE_ENABLED` | bool | `true` | Enable caching |
| `EPOCHLY_INFERENCE_CACHE_L1_SIZE` | int | `10000` | L1 cache capacity |
| `EPOCHLY_INFERENCE_CACHE_TTL_SECONDS` | float | `86400` | Cache TTL |
| `EPOCHLY_INFERENCE_CACHE_PRIVACY_MODE` | str | `"ephemeral"` | Privacy mode |
| `EPOCHLY_INFERENCE_CACHE_REDIS_URL` | str | `""` | Redis URL for L3 |
Privacy
| Variable | Type | Default | Description |
|---|---|---|---|
| `EPOCHLY_INFERENCE_PRIVACY_MODE` | str | `"ephemeral"` | Privacy mode |
| `EPOCHLY_INFERENCE_PRIVACY_REDACT_PATTERNS` | str | `""` | Comma-separated regex patterns for PII scrubbing |
| `EPOCHLY_INFERENCE_PRIVACY_CACHE_TENANT_ISOLATION` | bool | `true` | Namespace cache keys by tenant ID |
| `EPOCHLY_INFERENCE_PRIVACY_AUDIT_LOG_ENABLED` | bool | `true` | Enable append-only safety audit log |
| `EPOCHLY_INFERENCE_PRIVACY_AUDIT_LOG_PATH` | str | `"~/.epochly/audit/"` | Path for persistent audit log files |
| `EPOCHLY_INFERENCE_PRIVACY_ENCRYPT_AT_REST` | bool | `true` | Enable at-rest encryption |
| `EPOCHLY_INFERENCE_PRIVACY_ENCRYPTION_KEY_SOURCE` | str | `"env"` | Encryption key source |
| `EPOCHLY_TENANT_ID` | str | `"default"` | Tenant ID for cache key namespacing |
Privacy Modes
| Mode | L1 Behavior | L2 Behavior | L3 Behavior |
|---|---|---|---|
| `ephemeral` | In-memory LRU | No persistence | No persistence |
| `hashes_only` | In-memory LRU | No persistence | No persistence |
| `persisted_encrypted` | In-memory LRU | SQLite + AES-256-GCM | Redis + TLS |
Privacy Controls
Privacy is configured via the PrivacyConfig dataclass or the [tool.epochly.inference.privacy] section in pyproject.toml.
Input Redaction
PII and sensitive data are scrubbed before any data is written to disk or retained in golden stores. Configure regex patterns to match sensitive data:
```python
from epochly.inference.safety.privacy import InputRedactor

redactor = InputRedactor(
    patterns=[
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",  # Email
        r"\b\d{16}\b",  # Credit card
    ],
    replacement="[REDACTED]",
)

clean_text = redactor.redact("SSN: 123-45-6789")
# "SSN: [REDACTED]"
```
Tenant Isolation
Cache keys are namespaced by tenant ID to prevent cross-tenant data leakage:
```python
from epochly.inference.safety.privacy import TenantIsolation

# From constructor
isolation = TenantIsolation(tenant_id="customer_abc")

# From environment (reads EPOCHLY_TENANT_ID)
isolation = TenantIsolation.from_env()

# Namespace a cache key
namespaced = isolation.namespace_key("model:bert:input_hash_abc123")
# "customer_abc:model:bert:input_hash_abc123"
```
Audit Logging
An append-only audit log records all safety gate decisions, cache accesses, and optimization state changes:
```python
import time

from epochly.inference.safety.privacy import AuditLogger

audit = AuditLogger(max_entries=10_000)
audit.log_event(
    operation="canary_validation",
    model_id=42,
    optimization_name="torch_compile_bert",
    result="PASS",
    details={"cosine_sim": 0.998, "max_diff": 0.0012},
)

# Query recent entries
entries = audit.get_entries_since(time.time() - 3600)  # Last hour
```
L2 Cache (SQLite)
The L2 cache is configured separately when creating the cache directly:
```python
from epochly.inference.cache_l2 import L2Cache, L2CacheConfig

l2 = L2Cache(
    L2CacheConfig(
        db_path="~/.epochly/inference_cache/cache.db",
        ttl_seconds=86400,
        encryption_key=b"32-byte-aes-256-key-here!!!!!!!!",  # Optional; must be exactly 32 bytes
    )
)
```
Encryption Requirements:
- AES-256-GCM requires a 32-byte key
- Set via the `EPOCHLY_L2_ENCRYPTION_KEY` environment variable
- When the `cryptography` library is absent, data is stored unencrypted
- Each row uses a unique 12-byte nonce
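One way to generate a suitable key is Python's standard `secrets` module. The hex encoding below is an assumption for round-tripping raw bytes through the environment; check how your deployment expects `EPOCHLY_L2_ENCRYPTION_KEY` to be encoded:

```python
import os
import secrets

# Generate a random 32-byte key, as required for AES-256-GCM
key = secrets.token_bytes(32)

# Store it hex-encoded in the environment (env vars cannot hold raw bytes);
# decode back to bytes before passing it as encryption_key
os.environ["EPOCHLY_L2_ENCRYPTION_KEY"] = key.hex()
decoded = bytes.fromhex(os.environ["EPOCHLY_L2_ENCRYPTION_KEY"])
```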
L3 Cache (Redis)
```python
from epochly.inference.cache_l3 import L3Cache, L3CacheConfig

l3 = L3Cache(
    L3CacheConfig(
        redis_url="redis://localhost:6379/0",
        key_prefix="tenant_a:",
        ttl_seconds=86400,
        tls_enabled=False,
        socket_timeout=5.0,
    )
)
```
TLS Support: Use the `rediss://` URL scheme or set `tls_enabled=True`.
LLM Companion
```python
from epochly.inference.serving.llm_companion import (
    LLMCompanionAdapter,
    LLMCompanionConfig,
)

companion = LLMCompanionAdapter(
    runtime_url="http://localhost:8000",
    config=LLMCompanionConfig(
        max_concurrent_requests=64,
        cache_enabled=True,
        cache_max_size=10_000,
        default_timeout_seconds=120.0,
        api_path="/v1/completions",
    ),
)
```
Ray Serve Wrapper
```python
from epochly.inference.serving.ray_serve_wrapper import (
    EpochlyRayServeWrapper,
    RayServeConfig,
    epochly_serve,
)

# Via wrapper
wrapper = EpochlyRayServeWrapper(
    RayServeConfig(
        enable_telemetry=True,
        enable_priority_routing=True,
    )
)

# Via decorator
@epochly_serve(config=RayServeConfig())
def predict(model, data):
    return model(data)
```
Security Validator
```python
from epochly.inference.security import SecurityValidator

validator = SecurityValidator(max_input_size_bytes=10 * 1024 * 1024)

report = validator.run_all_checks(
    cache=inference_cache,
    privacy_mode="ephemeral",
    encryption_key_present=False,
    redaction_patterns_configured=False,
    config={"canary_enabled": True},
)
```
Compilation Safety Monitor
```python
from epochly.inference.compilation.safety_monitor import TorchCompileSafetyMonitor

monitor = TorchCompileSafetyMonitor(
    max_graph_breaks=5,
    memory_growth_threshold_mb=500.0,
)

# Pre-compile check
result = monitor.pre_compile_check(model, sample_input)

# Post-compile health
health = monitor.check_health(pre_mb=1000, post_mb=1050)

# Output validity
validity = monitor.check_output_validity(model_output)
```
Prometheus Endpoint
Expose a Prometheus-compatible `/metrics` endpoint using the `PrometheusExporter`:
```python
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

from epochly.inference.metrics.prometheus_exporter import PrometheusExporter
from epochly.inference.metrics.inference_metrics import InferenceMetrics
from epochly.inference.metrics.cost_estimator import CostEstimator

app = FastAPI()
inference_metrics = InferenceMetrics()
cost_estimator = CostEstimator(gpu_name="A100_80GB")

@app.get("/metrics")
def metrics():
    return PlainTextResponse(
        content=PrometheusExporter.export(
            inference_metrics,
            cost=cost_estimator,
            # `model` is the model object being served
            model_labels={
                id(model): {
                    "model_name": "bert-base",
                    "model_version": "v2.1",
                    "endpoint": "/predict",
                }
            },
        ),
        media_type="text/plain; version=0.0.4",
    )
```
The exporter generates metrics in the standard Prometheus exposition format, including counters, gauges, and histograms.
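For reference, a counter in exposition format 0.0.4 can be rendered by hand as follows (the metric name here is illustrative, not one of Epochly's actual metric names):

```python
def render_counter(name: str, help_text: str, value: float, labels: dict) -> str:
    """Render a single counter in Prometheus exposition format 0.0.4."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )


text = render_counter(
    "inference_requests_total",  # hypothetical metric name
    "Total inference requests served.",
    1234.0,
    {"model_name": "bert-base", "endpoint": "/predict"},
)
```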