Performance marketing is full of cherry-picked numbers. A 1000x speedup headline that only applies to a contrived microbenchmark. A "10x faster" claim that disappears under real workloads. We've seen it. You've seen it.
We're going to do something different. This post presents Epochly's benchmark results -- the impressive ones and the embarrassing ones -- with full context on every number.
Every claim here traces to our comprehensive benchmark report (January 29, 2026). Every test is reproducible. And we start with what doesn't work.
Where Epochly Does NOT Help
I/O-bound workloads: ~1.0x
If your Python code spends most of its time waiting for disk, network, or database responses, Epochly won't help. The CPU is already idle. Adding parallelism or JIT compilation to idle time changes nothing.
```python
# This will NOT benefit from Epochly
import requests

def fetch_all_records(urls):
    results = []
    for url in urls:
        response = requests.get(url)  # Waiting on network
        results.append(response.json())
    return results
```
Measured result: ~1.0x speedup. No improvement.
What to use instead: asyncio, aiohttp, or thread pools for I/O concurrency.
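For comparison, here is a minimal sketch of the async alternative using aiohttp (a third-party library, nothing to do with Epochly); the gain comes from overlapping the network waits, not from adding compute:

```python
import asyncio
import aiohttp

async def fetch_all_records(urls):
    # Overlap the network waits instead of parallelizing CPU work
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as response:
                return await response.json()
        return await asyncio.gather(*(fetch(url) for url in urls))

# results = asyncio.run(fetch_all_records(urls))
```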
Already-vectorized NumPy: ~1.0x
NumPy operations (matmul, dot, sum) already use optimized BLAS libraries (OpenBLAS, MKL) that parallelize internally. Adding Epochly's parallelism on top adds interception overhead without additional benefit.
```python
# This will NOT benefit from Epochly
import numpy as np

def matrix_multiply(a, b):
    return np.matmul(a, b)  # Already uses OpenBLAS/MKL threads
```
Measured result: 1.025x (within measurement noise).
Why: NumPy's BLAS backend already uses all available cores for matrix operations. There's nothing left to parallelize.
Small workloads: ~1.0x (or slower)
Process spawn overhead is approximately 200ms per worker. If your workload completes in 100ms, parallelization makes it slower.
```python
# This will NOT benefit from Epochly parallel
def quick_sum(data):
    # data has 1000 elements -- too small for parallelism
    return sum(x ** 2 for x in data)
```
Measured result: 0.998x on CPU_BOUND with size=1000. Essentially neutral.
Rule of thumb: Your workload needs to be at least a few seconds long for ProcessPool parallelism to pay off.
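A rough break-even sketch using the figures above (the worker count and timings are illustrative placeholders, not measurements):

```python
# Rough break-even estimate for ProcessPool parallelism
spawn_overhead_s = 0.2   # ~200ms spawn cost, paid up front
workers = 8              # hypothetical worker count
sequential_s = 0.1       # the 100ms workload from the example above

parallel_estimate_s = spawn_overhead_s + sequential_s / workers
print(parallel_estimate_s > sequential_s)  # True: parallelizing makes it slower
```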
GPU with small arrays: Minimal benefit
GPU kernel launch overhead means small arrays don't benefit from GPU acceleration.
| Array size | Elementwise speedup |
|---|---|
| 100K elements | 2.3x |
| 1M elements | 12.3x |
| 10M elements | 65.6x |
Below 1M elements, the transfer overhead to GPU memory dominates the actual computation time.
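If you want to see the crossover on your own hardware, a small PyTorch timing sketch like the one below works; the sin + cos + sqrt kernel mirrors the elementwise benchmark, but the exact numbers and the ~1M-element threshold will vary by GPU:

```python
import time
import torch

def elementwise(x):
    return torch.sin(x) + torch.cos(x) + torch.sqrt(x)

elementwise(torch.rand(10, device="cuda"))  # warm-up: CUDA init and kernel caching
torch.cuda.synchronize()

for n in (100_000, 10_000_000):
    cpu = torch.rand(n)
    start = time.perf_counter()
    elementwise(cpu)
    cpu_time = time.perf_counter() - start

    torch.cuda.synchronize()
    start = time.perf_counter()
    gpu = cpu.to("cuda")                 # host-to-device transfer counts as overhead
    elementwise(gpu)
    torch.cuda.synchronize()             # wait for the kernel to finish
    gpu_time = time.perf_counter() - start

    print(f"{n:>12,} elements: CPU {cpu_time * 1e3:.2f} ms, GPU {gpu_time * 1e3:.2f} ms")
```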
Where Epochly Helps: The Numbers
Level 2 JIT: 58-193x on Numerical Loops
Epochly's Level 2 uses Numba JIT compilation to transform Python numerical loops into native machine code.
| Workload | Size | Baseline | JIT | Speedup |
|---|---|---|---|---|
| Numerical loop | 1M elements | 101.25ms | 1.15ms | 88.3x |
| Nested loop | 10K elements | 66.54ms | 1.15ms | 58.0x |
| Polynomial eval | 1M elements | 324.16ms | 1.68ms | 193.0x |
Average: 113x across tested workloads.
The variance is real: polynomial evaluation (193x) benefits more than nested array access (58x). The workload characteristics determine where you land in the 58-193x range.
What makes a good JIT target:
- Pure Python loops with numerical operations
- Iterative algorithms (convergence loops, simulations)
- Element-wise operations not already vectorized
What makes a bad JIT target:
- String manipulation
- Dictionary operations
- Object-oriented patterns with polymorphism
- Code that calls back into CPython frequently
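To make the "good target" shape concrete, here is a minimal Numba sketch of the kind of loop Level 2 compiles; it uses Numba's public @njit decorator directly and illustrates the workload type, not the benchmark code itself:

```python
import numpy as np
from numba import njit

@njit
def polynomial_eval(x):
    # Pure numerical loop over an array: an ideal JIT target
    total = 0.0
    for i in range(x.shape[0]):
        v = x[i]
        total += 3.0 * v * v * v - 2.0 * v * v + v - 1.0
    return total

data = np.random.rand(1_000_000)
polynomial_eval(data)            # first call compiles to native code
result = polynomial_eval(data)   # later calls run the compiled version
```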
Level 3 Parallel: 8-12x on CPU-bound Work (16 cores)
Level 3 distributes CPU-bound work across multiple cores using ProcessPoolExecutor.
| Workload | Sequential | Parallel | Speedup |
|---|---|---|---|
| Pure Python loop | 40.03s | 9.02s | 4.44x |
| NumPy compute (64 tasks) | 13.48s | 5.16s | 2.61x |
| Monte Carlo (32 tasks) | 17.88s | 5.87s | 3.04x |
The table above shows individual benchmarks with varying degrees of task parallelism. Workloads that fully saturate all 16 cores (embarrassingly parallel, with enough independent tasks) reach 8-12x with ProcessPool.
On Apple M2 Max (16 cores, Python 3.13.5): 8.7x on pure Python parallel workloads with full core saturation, 54% parallel efficiency.
Important context: 16 cores does not mean 16x speedup. Amdahl's Law, process spawn overhead (~200ms), and memory contention limit practical efficiency to 50-60%. The gap between the table results (2.6-4.4x) and the 8-12x headline reflects workloads with different levels of core saturation.
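For reference, the underlying pattern is the standard library's ProcessPoolExecutor; here is a minimal sketch, where the work function and task count are placeholders rather than the benchmark code:

```python
from concurrent.futures import ProcessPoolExecutor

def simulate_chunk(seed):
    # Placeholder CPU-bound task; each chunk must be independent of the others
    total = 0.0
    for i in range(2_000_000):
        total += ((seed * 1103515245 + i) % 2_147_483_647) / 2_147_483_647
    return total

if __name__ == "__main__":
    tasks = range(64)  # enough independent tasks to keep 16 cores busy
    with ProcessPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(simulate_chunk, tasks))
    print(sum(results))
```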
Level 4 GPU: Up to 70x on Large Arrays
GPU acceleration requires large data volumes to overcome kernel launch overhead.
Elementwise operations (sin + cos + sqrt):
| Array Size | Data Volume | Speedup |
|---|---|---|
| 100K | 1MB | 2.3x |
| 1M | 8MB | 12.3x |
| 10M | 76MB | 65.6x |
| 50M | 381MB | 69.8x |
| 100M | 763MB | 68.1x |
Matrix multiplication:
| Matrix Size | Speedup |
|---|---|
| 512x512 | 2.5x |
| 1024x1024 | 9.5x |
| 2048x2048 | 6.3x |
| 4096x4096 | 7.0x |
Batched convolution:
| Batch Config | Speedup |
|---|---|
| batch=1, 64ch, 224px | 15.6x |
| batch=16, 64ch, 224px | 19.4x |
| batch=32, 128ch, 112px | 13.8x |
| batch=64, 256ch, 56px | 14.8x |
Large array reductions:
| Array Size | Speedup |
|---|---|
| 1M | 1.8x |
| 10M | 21.5x |
| 100M | 35.9x |
GPU results were measured on an NVIDIA Quadro M6000 24GB with CUDA 12.1 and PyTorch 2.5.1.
Combined JIT + Parallel: ~1,350x Theoretical Maximum
When Level 2 JIT (113x average) and Level 3 parallel (12x on 16 cores) apply to the same workload, the combined theoretical maximum is approximately 1,350x (113 * 12).
This applies to pure Python numerical loops that are both:
- JIT-compilable (numerical, no string/dict operations)
- Embarrassingly parallel (independent iterations)
The ~1,350x figure is a theoretical maximum, not an average. Individual results depend on workload characteristics, data dependencies, and parallelizability.
Mixed Workloads: The Realistic Scenario
Most real code isn't purely numerical loops or purely I/O. It's a mix. Here's how Epochly performs on mixed workloads:
| Mix | Speedup |
|---|---|
| Python loops around NumPy ops | 6.5x (ProcessPool) |
| Mixed compute + I/O | 1.5-1.7x |
| Mostly NumPy with thin Python glue | 1.0-1.2x |
The speedup depends on the ratio of parallelizable Python code to already-optimized library calls. If 80% of your time is in NumPy BLAS operations, parallelizing the remaining 20% Python overhead yields modest improvement.
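This is Amdahl's Law in action: if a fraction p of the runtime is sped up by a factor s, the overall speedup is capped at 1 / ((1 - p) + p / s). A quick check of the 80/20 case above, where the 8x factor for the Python portion is an assumed illustration:

```python
def amdahl(p, s):
    # p: fraction of runtime that gets faster, s: speedup of that fraction
    return 1.0 / ((1.0 - p) + p / s)

# 80% of time in BLAS (unchanged), 20% Python glue sped up 8x
print(f"{amdahl(0.20, 8):.2f}x")  # ~1.21x overall, consistent with the 1.0-1.2x row above
```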
How to Know If Epochly Will Help Your Code
Step 1: Profile
```python
import cProfile
import pstats

cProfile.run('your_function(your_data)', 'profile_output')
stats = pstats.Stats('profile_output')
stats.sort_stats('cumulative').print_stats(10)
```
Step 2: Classify your bottleneck
| If most time is in... | Epochly benefit | Recommended level |
|---|---|---|
| Python loops with math | High (58-193x) | Level 2 (JIT) |
| CPU-bound Python | Medium (8-12x) | Level 3 (Parallel) |
| Large array operations | High (up to 70x) | Level 4 (GPU) |
| NumPy BLAS calls | None (~1.0x) | Don't use Epochly |
| I/O waits | None (~1.0x) | Use asyncio instead |
| Small operations (<100ms) | None/negative | Don't parallelize |
Step 3: Start at Level 0
```python
import epochly

@epochly.optimize(level=0)
def your_function(data):
    pass
```
Monitor first. Understand the profile. Then graduate to the appropriate level.
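If the profile points at numerical loops, graduating might look like the snippet below; this assumes the same decorator accepts the higher levels described in this post, which is an extrapolation from the Level 0 example above:

```python
import epochly

@epochly.optimize(level=2)  # assumed usage: Level 2 (JIT) for a numerical hot loop
def polynomial_eval(data):
    total = 0.0
    for v in data:
        total += 3.0 * v * v * v - 2.0 * v * v + v - 1.0
    return total
```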
Our Benchmark Methodology
Transparency means showing how we test, not just the results.
Test environment: Linux (WSL2), Python 3.12.3, 16 cores (x86_64), NVIDIA Quadro M6000 24GB, CUDA 12.1
Methodology:
- Each benchmark runs multiple iterations
- Warm-up runs excluded from measurement
- Results verified against sequential execution for correctness
- Standard deviation reported where applicable
- Cold start and warm start measured separately
Reproduction: The complete benchmark suite is available. Run it on your hardware to see your specific results.
The Bottom Line
Epochly makes Python significantly faster for CPU-bound numerical work. A 113x average on numerical loops (Numba JIT, Python 3.12.3). 8-12x with multi-core parallel execution (16 cores, ProcessPool). Up to 70x with GPU acceleration on arrays with 10M+ elements (CUDA).
It does not help with I/O-bound code, already-vectorized NumPy, small workloads, or GPU operations on small arrays.
Profile first. Understand your bottleneck. Then choose the right enhancement level.
Benchmark conditions: Python 3.12.3, Linux WSL2, 16 cores, NVIDIA Quadro M6000 24GB (CUDA 12.1). January 29, 2026 comprehensive benchmark report.