Python is slow. You know it. We know it. But rewriting your codebase in Rust, C++, or Go isn't always an option. What if you could make Python faster without changing your code?
Epochly takes a different approach to Python performance: progressive enhancement. Instead of a single optimization that either works or doesn't, Epochly offers five levels of optimization that build on each other. You start with monitoring. As Epochly proves each optimization is safe for your specific workloads, it graduates to JIT compilation, parallel execution, and GPU acceleration.
No rewrites. Minimal risk. Every step is verified. Every optimization can roll back.
Here's what that journey looks like in practice.
Level 0: Monitor (1.0x)
Before optimizing anything, Epochly watches. Level 0 is pure telemetry -- it instruments your Python code to understand where time is spent without changing execution.
```python
import epochly

@epochly.optimize(level=0)
def process_data(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i] ** 2 + data[i] * 3.14
    return result
```
At Level 0, process_data runs at exactly the same speed as before. But now Epochly knows:
- How long each function takes
- Where CPU time concentrates
- Whether the workload is CPU-bound, I/O-bound, or memory-bound
- What optimization opportunities exist
This is the diagnostic step. You can't optimize what you can't measure.
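Epochly gathers this telemetry for you, but it is worth seeing what the baseline measurement looks like by hand. The sketch below uses only the standard library's cProfile to answer the same Level 0 question of where time concentrates; it is independent of Epochly, and the workload is illustrative.

```python
# A hand-rolled version of the Level 0 question "where does the time go?"
# using only the standard library. Epochly collects this automatically.
import cProfile
import pstats
import random

def process_data(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i] ** 2 + data[i] * 3.14
    return result

data = [random.random() for _ in range(1_000_000)]

profiler = cProfile.Profile()
profiler.enable()
process_data(data)
profiler.disable()

# Print the five entries where cumulative time concentrates.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```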
Level 1: Threading (<5% overhead)
Level 1 introduces GIL-aware scheduling. Epochly's runtime coordinates thread usage to minimize contention with Python's Global Interpreter Lock.
The overhead is less than 5% on average. This level is about preparing the ground for parallelism without disrupting existing behavior.
```python
@epochly.optimize(level=1)
def process_data(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i] ** 2 + data[i] * 3.14
    return result
```
Level 1 won't make your code faster in most cases. Its purpose is to establish the monitoring and scheduling infrastructure that Levels 2-4 build on.
Level 2: JIT Compilation (58-193x, 113x average)
This is where things get interesting. Level 2 compiles your numerical Python loops to native machine code using Numba JIT compilation.
```python
@epochly.optimize(level=2)
def process_data(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i] ** 2 + data[i] * 3.14
    return result
```
The same function. The same code. But now the inner loop runs as compiled native code instead of interpreted Python bytecode.
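Because Level 2 is built on Numba, you can approximate the transformation by hand. The sketch below is not Epochly's internals, just the equivalent `numba.njit` version of the same loop; the first call pays a one-time compilation cost, and later calls run native code.

```python
# A hand-written approximation of what Level 2 does behind the decorator:
# compile the hot loop with Numba's nopython JIT. Not Epochly's internals.
import numpy as np
from numba import njit

@njit(cache=True)
def process_data_jit(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i] ** 2 + data[i] * 3.14
    return result

data = np.random.rand(1_000_000)
process_data_jit(data)            # first call triggers compilation
result = process_data_jit(data)   # later calls run compiled native code
```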
What we measured
We benchmarked Level 2 across six numerical workloads on Linux (Python 3.12.3, 16 cores); three representative results:
| Workload | Baseline | Optimized | Speedup |
|---|---|---|---|
| Numerical loop (1M elements) | 101.25ms | 1.15ms | 88.3x |
| Nested loop (10K elements) | 66.54ms | 1.15ms | 58.0x |
| Polynomial evaluation (1M elements) | 324.16ms | 1.68ms | 193.0x |
Average across workloads: 113x.
The range is 58-193x depending on the workload. Polynomial and mathematical operations see the highest speedups because they have the most interpreter overhead per iteration. Nested loops with array access see lower but still substantial gains.
When JIT helps
- Simple numerical loops (50-100x)
- Nested loops with math operations (30-60x)
- Polynomial and mathematical operations (100-200x)
- Iterative algorithms (50-150x)
When JIT does not help
- String operations (not JIT-compilable)
- Dictionary manipulation (dynamic typing prevents compilation; see the sketch after this list)
- Object-heavy code (Python object overhead persists)
- Code with many Python C API calls
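As a concrete example, something like the word-count function below is a poor JIT candidate: it is all dictionary and string traffic, so Numba's nopython mode cannot type it and the work stays in the interpreter. (Sketch for illustration; exact fallback behavior depends on the JIT backend and version.)

```python
# Dict- and string-heavy code is not JIT-compilable in nopython mode:
# dynamic typing and Python object traffic keep it on the interpreter,
# so Level 2 leaves functions like this unchanged.
def count_words(lines):
    counts = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```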
Level 3: Parallel Execution (8-12x on 16 cores)
Level 3 adds multi-core parallel execution using ProcessPoolExecutor. Instead of running on a single core, CPU-bound work distributes across all available cores.
```python
@epochly.optimize(level=3)
def process_batch(items):
    results = []
    for item in items:
        results.append(heavy_computation(item))
    return results
```
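The mechanism underneath is the standard library's ProcessPoolExecutor: each worker gets its own interpreter, and therefore its own GIL. The sketch below is a hand-rolled equivalent of that distribution step, not Epochly's scheduler; `heavy_computation` is a stand-in for your real CPU-bound work.

```python
# A hand-rolled sketch of the Level 3 mechanism: spread CPU-bound work
# across processes so each worker has its own interpreter and GIL.
import os
from concurrent.futures import ProcessPoolExecutor

def heavy_computation(item):
    # Stand-in for real CPU-bound work.
    return sum(i * i for i in range(item))

def process_batch_parallel(items):
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(heavy_computation, items, chunksize=4))

if __name__ == "__main__":
    results = process_batch_parallel([200_000] * 64)
```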
What we measured
On Linux (Python 3.12.3, 16 cores, ProcessPool):
| Workload | Sequential | Parallel | Speedup |
|---|---|---|---|
| Pure Python loop | 40.03s | 9.02s | 4.44x |
| NumPy compute (64 tasks) | 13.48s | 5.16s | 2.61x |
| Monte Carlo (32 tasks) | 17.88s | 5.87s | 3.04x |
These individual benchmarks use varying task counts and workload types. When a workload fully saturates all 16 cores (higher task counts, embarrassingly parallel work), CPU-bound code reaches 8-12x.
On an Apple M2 Max (16 cores, Python 3.13.5), we measured 8.7x on pure Python parallel workloads with full core saturation -- 54% efficiency at 16 cores.
Why not 16x on 16 cores?
Amdahl's Law. Process spawn overhead is approximately 200ms per worker. Memory contention increases with worker count. OS scheduling adds overhead. The parallelizable portion of your code determines your ceiling.
Expect 50-60% parallel efficiency on CPU-bound Python workloads. That still turns 13 seconds into 1.5 seconds.
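Amdahl's Law makes that ceiling concrete: with a parallel fraction p of the work and n workers, the best possible speedup is 1 / ((1 - p) + p / n). The illustrative calculation below shows how quickly the ceiling drops as the serial fraction grows.

```python
# Amdahl's Law: even a small serial fraction caps the achievable speedup.
def amdahl_speedup(parallel_fraction, workers):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

for p in (1.0, 0.95, 0.90, 0.80):
    print(f"p={p:.2f}: {amdahl_speedup(p, 16):.1f}x maximum on 16 cores")
# p=1.00: 16.0x
# p=0.95: 9.1x
# p=0.90: 6.4x
# p=0.80: 4.0x
```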
ThreadPool vs ProcessPool
| Executor | CPU-bound speedup | Why |
|---|---|---|
| ThreadPool | ~1.1x | GIL prevents true parallelism |
| ProcessPool | 3-4x (8-12x all cores) | Separate processes bypass GIL |
For CPU-bound work, ProcessPool is the only executor that provides meaningful speedup. ThreadPool is only useful for I/O-bound workloads.
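You can see the difference with a rough micro-benchmark of your own (sketch below; `burn` is a throwaway CPU-bound function and absolute timings depend on your machine). The thread version finishes in roughly serial time, the process version in roughly serial time divided by your core count.

```python
# CPU-bound work: threads share one GIL, processes do not.
# Rough micro-benchmark sketch; absolute numbers depend on your machine.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers=8, n=2_000_000):
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(burn, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"ThreadPool:  {timed(ThreadPoolExecutor):.2f}s")   # ~serial time
    print(f"ProcessPool: {timed(ProcessPoolExecutor):.2f}s")  # ~serial / cores
```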
Level 4: GPU Acceleration (up to 70x)
Level 4 offloads computation to CUDA-capable GPUs using PyTorch and CuPy backends.
```python
import numpy as np

@epochly.optimize(level=4)
def elementwise_ops(data):
    return np.sin(data) + np.cos(data) + np.sqrt(np.abs(data))
```
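The underlying move is the one you could make by hand with CuPy: copy the array to device memory, run the kernels there, and copy the result back at the end. The sketch below assumes a CUDA-capable GPU and the `cupy` package; Epochly handles the backend choice (PyTorch or CuPy) itself.

```python
# Roughly what Level 4 does for elementwise math: keep the data on the
# GPU while the kernels run. CuPy sketch -- requires a CUDA GPU and cupy.
import numpy as np
import cupy as cp

def elementwise_ops_gpu(data):
    gpu_data = cp.asarray(data)                # host -> device copy
    result = cp.sin(gpu_data) + cp.cos(gpu_data) + cp.sqrt(cp.abs(gpu_data))
    return cp.asnumpy(result)                  # device -> host copy

data = np.random.rand(10_000_000)
out = elementwise_ops_gpu(data)
```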
What we measured
On NVIDIA Quadro M6000 (24GB, CUDA 12.1, PyTorch 2.5.1):
| Operation | Data Size | Speedup |
|---|---|---|
| Elementwise (sin+cos+sqrt) | 100K elements | 2.3x |
| Elementwise | 1M elements | 12.3x |
| Elementwise | 10M elements | 65.6x |
| Elementwise | 50M elements | 69.8x |
| Elementwise | 100M elements | 68.1x |
| Matrix multiply (1024x1024) | 8MB | 9.5x |
| Convolution (batch 16) | - | 19.4x |
| Reduction (100M elements) | 763MB | 35.9x |
The key threshold for elementwise operations is 10M+ elements. Below 1M elements, GPU kernel launch and transfer overhead eats most of the gain.
GPU summary by data size
| Data Size | Elementwise | Matrix | Convolution | Reduction |
|---|---|---|---|---|
| Small (<1M) | 2.3x | 2.5x | 16x | 1.8x |
| Medium (1-10M) | 12x | 6-10x | 16x | 22x |
| Large (10M+) | 66-70x | 7x | 19x | 36x |
When GPU does not help
- Small arrays (<1M elements): kernel launch overhead dominates
- Already vectorized NumPy: BLAS already optimizes these operations
- Operations with frequent CPU-GPU data transfers (see the timing sketch after this list)
- Workloads that don't parallelize across CUDA cores
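The transfer cost is easy to observe directly. The timing sketch below (illustrative; requires `cupy` and a CUDA GPU) compares re-copying a small array on every operation against keeping it resident on the device: the copies, not the math, dominate the first version.

```python
# Why frequent CPU-GPU transfers hurt: each round trip pays a transfer
# cost that a small kernel cannot amortize. Illustrative cupy sketch.
import time
import numpy as np
import cupy as cp

data = np.random.rand(100_000)

start = time.perf_counter()
for _ in range(100):
    out = cp.asnumpy(cp.sin(cp.asarray(data)))   # copy in and out every time
per_call_with_transfers = (time.perf_counter() - start) / 100

gpu_data = cp.asarray(data)                      # copy to the device once
start = time.perf_counter()
for _ in range(100):
    out = cp.sin(gpu_data)                       # stays on the device
cp.cuda.Stream.null.synchronize()                # wait for queued kernels
per_call_resident = (time.perf_counter() - start) / 100
```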
Combining Levels: The Full Journey
The enhancement levels compose. Level 2 JIT compiles loops to native code (113x average). Level 3 parallelizes across cores (8-12x on 16 cores). Combined, that's a theoretical maximum of approximately 1,350x on pure Python numerical loops (113x JIT * 12x parallel).
That's the journey from 1x to 1,350x. But let's be precise about what that means:
- 113x comes from JIT-compiling numerical loops to native code
- 12x comes from distributing work across 16 cores
- ~1,350x is the theoretical maximum when both apply to the same workload
- Individual results vary by workload characteristics
The combined speedup is not a single magic number. It's the product of two independent optimizations, each with its own applicability conditions.
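Written out by hand, the composition is nothing more exotic than a JIT-compiled kernel running inside each worker process. The sketch below shows that combination directly (not Epochly's internal scheduling; chunk sizes and data handling are illustrative).

```python
# The Level 2 + Level 3 composition written by hand: a Numba-compiled
# kernel running inside each worker process. Sketch only.
import os
import numpy as np
from numba import njit
from concurrent.futures import ProcessPoolExecutor

@njit(cache=True)
def kernel(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i] ** 2 + data[i] * 3.14
    return result

def run_chunk(seed):
    # Each worker generates (or loads) its own chunk and runs the
    # compiled kernel on it.
    rng = np.random.default_rng(seed)
    return kernel(rng.random(1_000_000))

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        totals = list(pool.map(run_chunk, range(64)))
```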
What Epochly Does NOT Help With
Transparency means acknowledging limitations:
- I/O-bound workloads: ~1.0x. If your bottleneck is disk or network, more CPU doesn't help.
- Already vectorized NumPy: ~1.0x. NumPy's BLAS backend (OpenBLAS, MKL) already parallelizes matrix operations.
- Small workloads: ~1.0x or slower. Process spawn overhead (~200ms) exceeds benefit on small data.
- GPU with small arrays: Below 1M elements, kernel launch overhead dominates.
Profile your workload first. Understand whether you're CPU-bound, I/O-bound, or memory-bound. Then choose the right enhancement level.
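A quick way to classify a workload before reaching for any enhancement level: compare wall-clock time with CPU time. If they are close, the function is CPU-bound; if CPU time is much smaller, it is mostly waiting on I/O. The helper below is a rough heuristic sketch, not part of Epochly.

```python
# Rough heuristic: CPU time ~= wall time means CPU-bound; CPU time much
# smaller than wall time means the function mostly waits on I/O.
import time

def classify(func, *args, **kwargs):
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return "CPU-bound" if wall and cpu / wall > 0.8 else "likely I/O-bound"
```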
Getting Started
```python
import epochly

# Start with monitoring
@epochly.optimize(level=0)
def your_function(data):
    # your existing code
    pass

# Graduate to JIT when ready
@epochly.optimize(level=2)
def your_function(data):
    # same code, 58-193x faster on numerical loops
    pass
```
The decorator is the only change. Your code stays the same. Epochly handles the rest.
Benchmark conditions: Python 3.12.3, Linux WSL2, 16 cores, NVIDIA Quadro M6000 24GB. All numbers from Jan 29, 2026 comprehensive benchmark report. Full benchmark suite available for reproduction.