ML pipelines and preprocessing

Preprocessing, scoring, and pipeline work around model training and inference.

Machine learning pipelines are computationally expensive. Training loops iterate millions of times. Hyperparameter searches multiply that cost by hundreds. Data preprocessing transforms run before every experiment.

Epochly accelerates the Python-level bottlenecks in these pipelines.


Where ML Training Time Goes

A typical ML training pipeline:

Phase                       Typical Time    CPU Profile
Data preprocessing          10-30%          CPU-bound Python loops
Data loading/augmentation   5-15%           I/O + CPU mixed
Forward pass                30-50%          GPU (PyTorch/TF)
Backward pass               20-30%          GPU (PyTorch/TF)
Metrics/logging             2-5%            CPU Python

The forward and backward passes already run on GPU through PyTorch or TensorFlow. Epochly won't speed those up -- they're already optimized. But data preprocessing, custom loss functions, and metric computation are often written in Python loops. That's where Epochly helps.
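To confirm where Python-level time concentrates before optimizing, standard-library profiling works as a generic stand-in (this sketch uses `cProfile`, not any Epochly-specific profiler):

```python
import cProfile
import io
import pstats

import numpy as np

def slow_preprocess(images):
    # Pure-Python per-channel normalization: a typical CPU-bound hotspot
    out = np.empty_like(images)
    for i in range(len(images)):
        for c in range(images.shape[1]):
            ch = images[i, c]
            out[i, c] = (ch - ch.mean()) / (ch.std() + 1e-8)
    return out

images = np.random.rand(64, 3, 32, 32)

profiler = cProfile.Profile()
profiler.enable()
slow_preprocess(images)
profiler.disable()

# Summarize the top functions by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

Functions that dominate the cumulative-time column in `report` are the candidates worth decorating.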


Accelerating Data Preprocessing

Custom data transformations written in Python loops are prime JIT targets.

import epochly
import numpy as np

@epochly.optimize
def preprocess_batch(images, labels):
    """Custom preprocessing pipeline."""
    processed = np.empty_like(images)
    for i in range(len(images)):
        # Normalize per-channel
        for c in range(images.shape[1]):
            channel = images[i, c]
            processed[i, c] = (channel - channel.mean()) / (channel.std() + 1e-8)
    return processed, labels

The nested loop above iterates over every image and channel. Level 2 JIT compiles it to native code, delivering a 58-193x speedup on numerical loops like this one.
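For reference, the loop above computes the same per-channel statistics as a fully vectorized NumPy expression, which makes a useful correctness baseline even without Epochly installed (assumes NCHW-shaped batches):

```python
import numpy as np

def preprocess_batch_loop(images):
    # The loop form from the example above (labels omitted for brevity)
    processed = np.empty_like(images)
    for i in range(len(images)):
        for c in range(images.shape[1]):
            channel = images[i, c]
            processed[i, c] = (channel - channel.mean()) / (channel.std() + 1e-8)
    return processed

def preprocess_batch_vectorized(images):
    # Per-image, per-channel mean/std over the spatial axes (H, W)
    mean = images.mean(axis=(2, 3), keepdims=True)
    std = images.std(axis=(2, 3), keepdims=True)
    return (images - mean) / (std + 1e-8)

images = np.random.rand(8, 3, 16, 16)
```

Both forms produce identical output; the loop form is what the JIT targets, the vectorized form is what you would write without a JIT.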

Custom Loss Functions

import epochly
import numpy as np

@epochly.optimize
def focal_loss(predictions, targets, gamma=2.0, alpha=0.25):
    """Focal loss for class imbalance -- custom Python implementation."""
    result = 0.0
    for i in range(len(predictions)):
        p = predictions[i]
        t = targets[i]
        pt = p * t + (1 - p) * (1 - t)
        weight = alpha * t + (1 - alpha) * (1 - t)
        result += -weight * (1 - pt) ** gamma * np.log(pt + 1e-8)
    return result / len(predictions)

This Python-level loop benefits from JIT compilation (Level 2). If you're using PyTorch's built-in loss functions, they already run as optimized C++/CUDA -- Epochly won't improve those.
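To sanity-check a loop like this, you can compare it against a vectorized NumPy equivalent; the two should agree to floating-point precision (a verification sketch, independent of Epochly):

```python
import numpy as np

def focal_loss_loop(predictions, targets, gamma=2.0, alpha=0.25):
    # Element-by-element form, as in the example above
    result = 0.0
    for i in range(len(predictions)):
        p, t = predictions[i], targets[i]
        pt = p * t + (1 - p) * (1 - t)
        weight = alpha * t + (1 - alpha) * (1 - t)
        result += -weight * (1 - pt) ** gamma * np.log(pt + 1e-8)
    return result / len(predictions)

def focal_loss_vectorized(predictions, targets, gamma=2.0, alpha=0.25):
    # Same math expressed as whole-array operations
    pt = predictions * targets + (1 - predictions) * (1 - targets)
    weight = alpha * targets + (1 - alpha) * (1 - targets)
    return np.mean(-weight * (1 - pt) ** gamma * np.log(pt + 1e-8))

rng = np.random.default_rng(0)
preds = rng.uniform(0.01, 0.99, size=100)
targs = rng.integers(0, 2, size=100).astype(float)

loop_val = focal_loss_loop(preds, targs)
vec_val = focal_loss_vectorized(preds, targs)
```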


Parallelizing Hyperparameter Search

Hyperparameter search runs the training pipeline many times with different configurations. Level 3 parallel execution distributes independent runs across CPU cores.

import epochly
from sklearn.model_selection import GridSearchCV

@epochly.optimize
def search_hyperparameters(X_train, y_train, param_grid):
    """Distribute hyperparameter search across cores."""
    model = create_model()  # create_model() is your own model factory
    search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_

Level 3 provides 8-12x speedup on 16 cores for CPU-bound workloads. Each hyperparameter combination runs as an independent task, making this embarrassingly parallel.
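The embarrassingly-parallel structure can be sketched with the standard library alone (this is not Epochly's scheduler; `evaluate` is a toy stand-in for "train with these hyperparameters, return a score", and a process pool would be the choice for real CPU-bound work):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    # Stand-in objective: peaks at lr=0.1, depth=5.
    # In practice this would train a model and return its CV score.
    lr, depth = params
    return -(lr - 0.1) ** 2 - (depth - 5) ** 2

param_grid = {"lr": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
combos = list(product(param_grid["lr"], param_grid["depth"]))

# Each combination is independent, so they can all run concurrently.
# For CPU-bound training, use ProcessPoolExecutor to sidestep the GIL.
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(evaluate, combos))

best_params = combos[scores.index(max(scores))]
```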


GPU Acceleration for Large Tensors

When working with large arrays outside of PyTorch/TensorFlow's compute graph, Epochly's Level 4 offloads to GPU.

import epochly
import numpy as np

@epochly.optimize
def compute_similarity_matrix(embeddings):
    """Compute pairwise cosine similarity for large embedding sets."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    similarity = normalized @ normalized.T
    return similarity

For arrays with 10M+ elements, GPU offload delivers up to 70x speedup. For smaller arrays, CPU execution is more efficient because transfer overhead outweighs the gain.
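Whether the function runs on CPU or GPU, the output should satisfy the defining properties of a cosine-similarity matrix, which makes a quick invariant check (plain NumPy, no Epochly assumed):

```python
import numpy as np

def compute_similarity_matrix(embeddings):
    # Normalize rows, then take all pairwise dot products
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return normalized @ normalized.T

emb = np.random.default_rng(1).normal(size=(50, 8))
sim = compute_similarity_matrix(emb)
```

A valid result is square, symmetric, has ones on the diagonal (each vector is perfectly similar to itself), and every entry lies in [-1, 1].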


What Epochly Does NOT Help in ML

  • PyTorch/TensorFlow forward/backward passes: Already on GPU. ~1.0x.
  • CUDA kernel execution: Already native GPU code. ~1.0x.
  • Data loading from disk: I/O-bound. ~1.0x.
  • Small batch preprocessing: Overhead exceeds benefit on small tensors.

Epochly works alongside your ML framework. It accelerates the Python glue -- not the framework internals.


Getting Started

pip install epochly

import epochly

@epochly.optimize
def your_preprocessing(batch):
    # Your existing preprocessing code
    pass

@epochly.optimize
def your_custom_metric(predictions, targets):
    # Your existing metric computation
    pass

Start with @epochly.optimize on your custom preprocessing and metric functions. Profile with Level 0 first to identify where Python-level time concentrates.


Benchmark conditions: Python 3.12.3, Linux WSL2, 16 cores, NVIDIA Quadro M6000 24GB (CUDA 12.1). January 29, 2026 comprehensive benchmark report.

Tags: python, machine-learning, deep-learning, training, pytorch, tensorflow