Machine learning pipelines are computationally expensive. Training loops iterate millions of times. Hyperparameter searches multiply that cost by hundreds. Data preprocessing transforms run before every experiment.
Epochly accelerates the Python-level bottlenecks in these pipelines.
## Where ML Training Time Goes
A typical ML training pipeline:
| Phase | Typical Time | CPU Profile |
|---|---|---|
| Data preprocessing | 10-30% | CPU-bound Python loops |
| Data loading/augmentation | 5-15% | I/O + CPU mixed |
| Forward pass | 30-50% | GPU (PyTorch/TF) |
| Backward pass | 20-30% | GPU (PyTorch/TF) |
| Metrics/logging | 2-5% | CPU Python |
The forward and backward passes already run on GPU through PyTorch or TensorFlow. Epochly won't speed those up -- they're already optimized. But data preprocessing, custom loss functions, and metric computation are often written in Python loops. That's where Epochly helps.
## Accelerating Data Preprocessing
Custom data transformations written in Python loops are prime JIT targets.
```python
import epochly
import numpy as np

@epochly.optimize
def preprocess_batch(images, labels):
    """Custom preprocessing pipeline."""
    processed = np.empty_like(images)
    for i in range(len(images)):
        # Normalize per-channel
        for c in range(images.shape[1]):
            channel = images[i, c]
            processed[i, c] = (channel - channel.mean()) / (channel.std() + 1e-8)
    return processed, labels
```
The nested loop above iterates over every image and channel. Level 2 JIT compiles this to native code: 58-193x speedup on numerical loops.
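Before relying on compiled output, it helps to have a reference implementation to check against. The sketch below is plain NumPy with no Epochly API involved, and it assumes NCHW-layout batches (an assumption, since the original snippet doesn't state the layout); it computes the same per-channel normalization as the loop:

```python
import numpy as np

def preprocess_batch_vectorized(images, labels):
    """Vectorized per-channel normalization; NCHW layout assumed."""
    # Mean/std over each (image, channel) pair's spatial dimensions
    mean = images.mean(axis=(2, 3), keepdims=True)
    std = images.std(axis=(2, 3), keepdims=True)
    return (images - mean) / (std + 1e-8), labels

# Cross-check against the loop version on a small batch
rng = np.random.default_rng(0)
images = rng.standard_normal((4, 3, 8, 8))
loop_out = np.empty_like(images)
for i in range(len(images)):
    for c in range(images.shape[1]):
        channel = images[i, c]
        loop_out[i, c] = (channel - channel.mean()) / (channel.std() + 1e-8)
vec_out, _ = preprocess_batch_vectorized(images, None)
assert np.allclose(loop_out, vec_out)
```

Vectorized NumPy is itself far faster than a pure-Python loop; JIT compilation matters most when the logic can't be expressed as whole-array operations.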
## Custom Loss Functions
```python
import epochly
import numpy as np

@epochly.optimize
def focal_loss(predictions, targets, gamma=2.0, alpha=0.25):
    """Focal loss for class imbalance -- custom Python implementation."""
    result = 0.0
    for i in range(len(predictions)):
        p = predictions[i]
        t = targets[i]
        pt = p * t + (1 - p) * (1 - t)
        weight = alpha * t + (1 - alpha) * (1 - t)
        result += -weight * (1 - pt) ** gamma * np.log(pt + 1e-8)
    return result / len(predictions)
```
This Python-level loop benefits from JIT compilation (Level 2). If you're using PyTorch's built-in loss functions, they already run as optimized C++/CUDA -- Epochly won't improve those.
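As a correctness check, the same loss can be written as whole-array NumPy operations. This is a plain-NumPy sketch with no Epochly involved; it should agree with the loop version to floating-point precision:

```python
import numpy as np

def focal_loss_vectorized(predictions, targets, gamma=2.0, alpha=0.25):
    """Vectorized focal loss; same arithmetic as the loop version."""
    p, t = np.asarray(predictions), np.asarray(targets)
    pt = p * t + (1 - p) * (1 - t)            # probability of the true class
    weight = alpha * t + (1 - alpha) * (1 - t)
    return np.mean(-weight * (1 - pt) ** gamma * np.log(pt + 1e-8))

preds = np.array([0.9, 0.2, 0.7, 0.4])
targs = np.array([1.0, 0.0, 1.0, 1.0])

# Loop reference with identical arithmetic (gamma=2.0, alpha=0.25)
ref = 0.0
for p, t in zip(preds, targs):
    pt = p * t + (1 - p) * (1 - t)
    weight = 0.25 * t + 0.75 * (1 - t)
    ref += -weight * (1 - pt) ** 2.0 * np.log(pt + 1e-8)
ref /= len(preds)
assert np.isclose(ref, focal_loss_vectorized(preds, targs))
```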
## Accelerating Hyperparameter Search
Hyperparameter search runs the training pipeline many times with different configurations. Level 3 parallel execution distributes independent runs across CPU cores.
```python
import epochly
from sklearn.model_selection import GridSearchCV

@epochly.optimize
def search_hyperparameters(X_train, y_train, param_grid):
    """Distribute hyperparameter search across cores."""
    model = create_model()  # your model factory, defined elsewhere
    search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_
```
Level 3 provides 8-12x speedup on 16 cores for CPU-bound workloads. Each hyperparameter combination runs as an independent task, making this embarrassingly parallel.
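The "independent task" model can be illustrated with the standard library alone. This is a conceptual sketch, not Epochly's internals: it uses threads for portability, whereas Level 3 would distribute the same task list across processes/cores, and `evaluate` is a hypothetical stand-in for one training run:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    """Hypothetical stand-in for training one model on one config."""
    lr, depth = params
    # Dummy score surface with a known optimum at lr=0.1, depth=5
    score = 1.0 - abs(lr - 0.1) - 0.01 * abs(depth - 5)
    return score, {"lr": lr, "max_depth": depth}

grid = {"lr": [0.01, 0.1, 1.0], "max_depth": [3, 5, 7]}
combos = list(product(grid["lr"], grid["max_depth"]))

# Every combination is independent of the others, so they can all
# be evaluated concurrently -- this is what makes the workload
# embarrassingly parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, combos))

best_score, best_params = max(results, key=lambda r: r[0])
assert best_params == {"lr": 0.1, "max_depth": 5}
```

Note that scikit-learn's `GridSearchCV` also exposes its own `n_jobs` parameter for process-level parallelism; the sketch above only illustrates why the task structure parallelizes so well.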
## GPU Acceleration for Large Tensors
When working with large arrays outside of PyTorch/TensorFlow's compute graph, Epochly's Level 4 offloads to GPU.
```python
import epochly
import numpy as np

@epochly.optimize
def compute_similarity_matrix(embeddings):
    """Compute pairwise cosine similarity for large embedding sets."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    similarity = normalized @ normalized.T
    return similarity
```
For arrays with 10M+ elements, GPU offload delivers up to 70x speedups. For smaller arrays, host-to-device transfer overhead outweighs the gain, and CPU execution is more efficient.
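A quick CPU-side sanity check of the function above, written in plain NumPy with the decorator omitted so it runs without Epochly: cosine similarity should be symmetric, bounded by [-1, 1], and exactly 1.0 on the diagonal.

```python
import numpy as np

def compute_similarity_matrix(embeddings):
    """Pairwise cosine similarity (undecorated CPU reference)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return normalized @ normalized.T

rng = np.random.default_rng(42)
emb = rng.standard_normal((100, 64))
sim = compute_similarity_matrix(emb)
assert sim.shape == (100, 100)
assert np.allclose(np.diag(sim), 1.0)  # every vector is maximally similar to itself
assert np.allclose(sim, sim.T)         # similarity is symmetric
assert sim.min() >= -1.0 - 1e-9 and sim.max() <= 1.0 + 1e-9
```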
## What Epochly Does NOT Help With in ML
- PyTorch/TensorFlow forward/backward passes: Already on GPU. ~1.0x.
- CUDA kernel execution: Already native GPU code. ~1.0x.
- Data loading from disk: I/O-bound. ~1.0x.
- Small batch preprocessing: Overhead exceeds benefit on small tensors.
Epochly works alongside your ML framework. It accelerates the Python glue -- not the framework internals.
## Getting Started
```bash
pip install epochly
```

```python
import epochly

@epochly.optimize
def your_preprocessing(batch):
    # Your existing preprocessing code
    pass

@epochly.optimize
def your_custom_metric(predictions, targets):
    # Your existing metric computation
    pass
```
Start with @epochly.optimize on your custom preprocessing and metric functions. Profile with Level 0 first to identify where Python-level time concentrates.
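Level 0 is Epochly's own profiling mode; the standard-library `cProfile` gives a comparable view of where Python-level time concentrates and works anywhere. A sketch, using a hypothetical loop-based metric as the hotspot:

```python
import cProfile
import io
import pstats
import numpy as np

def slow_metric(predictions, targets):
    """Loop-based accuracy: a typical candidate for @epochly.optimize."""
    correct = 0
    for p, t in zip(predictions, targets):
        correct += int((p >= 0.5) == t)
    return correct / len(predictions)

preds = np.random.rand(50_000)
targs = np.random.randint(0, 2, 50_000)

# Profile the metric and capture the report as text
profiler = cProfile.Profile()
profiler.enable()
slow_metric(preds, targs)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
assert "slow_metric" in report  # the Python-level hotspot shows up
```

Functions that dominate the cumulative-time column and consist of Python loops over numerical data are the ones worth decorating first.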
Benchmark conditions: Python 3.12.3, Linux WSL2, 16 cores, NVIDIA Quadro M6000 24GB (CUDA 12.1). January 29, 2026 comprehensive benchmark report.