
Honest Benchmarks: What Epochly Can and Can't Do

A transparent look at Epochly's performance across different workload types, including the ones where it doesn't help.

Epochly Team · January 30, 2026 · 8 min read

Performance marketing is full of cherry-picked numbers. A 1000x speedup headline that only applies to a contrived microbenchmark. A "10x faster" claim that disappears under real workloads. We've seen it. You've seen it.

We're going to do something different. This post presents Epochly's benchmark results -- the impressive ones and the embarrassing ones -- with full context on every number.

Every claim here traces to our comprehensive benchmark report (January 29, 2026). Every test is reproducible. And we start with what doesn't work.


Where Epochly Does NOT Help

I/O-bound workloads: 1.0x

If your Python code spends most of its time waiting for disk, network, or database responses, Epochly won't help. The CPU is already idle. Adding parallelism or JIT compilation to idle time changes nothing.

# This will NOT benefit from Epochly
import requests

def fetch_all_records(urls):
    results = []
    for url in urls:
        response = requests.get(url)  # Waiting on network
        results.append(response.json())
    return results

Measured result: ~1.0x speedup. No improvement.

What to use instead: asyncio, aiohttp, or thread pools for I/O concurrency.
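
For comparison, here is a minimal concurrent sketch of the same fetch, assuming aiohttp is installed (the async rewrite below is illustrative and not part of Epochly):

import asyncio
import aiohttp

async def fetch_all_records_async(urls):
    # Overlap the network waits that the sequential version spends idle
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as response:
                return await response.json()
        return await asyncio.gather(*(fetch(url) for url in urls))

# results = asyncio.run(fetch_all_records_async(urls))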

Already vectorized NumPy: ~1.0x

NumPy operations (matmul, dot, sum) already use optimized BLAS libraries (OpenBLAS, MKL) that parallelize internally. Adding Epochly's parallelism on top adds interception overhead without additional benefit.

# This will NOT benefit from Epochly
import numpy as np

def matrix_multiply(a, b):
    return np.matmul(a, b)  # Already uses OpenBLAS/MKL threads

Measured result: 1.025x (within measurement noise).

Why: NumPy's BLAS backend already uses all available cores for matrix operations. There's nothing left to parallelize.
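
You can verify this on your own machine. One way, assuming the optional threadpoolctl package is installed, is to inspect the thread pools NumPy's BLAS backend is already running:

import numpy as np
from threadpoolctl import threadpool_info  # optional third-party package

a = np.random.rand(2048, 2048)
b = np.random.rand(2048, 2048)
_ = a @ b  # forces the BLAS library to load

for pool in threadpool_info():
    # Typically shows OpenBLAS or MKL already using all physical cores
    print(pool["internal_api"], pool["num_threads"])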

Small workloads: ~1.0x (or slower)

Process spawn overhead is approximately 200ms per worker. If your workload completes in 100ms, parallelization makes it slower.

# This will NOT benefit from Epochly parallel
def quick_sum(data):
    # data has 1000 elements -- too small for parallelism
    return sum(x ** 2 for x in data)

Measured result: 0.998x on CPU_BOUND with size=1000. Essentially neutral.

Rule of thumb: Your workload needs to be at least a few seconds long for ProcessPool parallelism to pay off.
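
A quick way to apply that rule is to time one sequential run before reaching for parallelism. The one-second threshold below is our rough cut-off, not a hard limit:

import time

def worth_parallelizing(func, *args, threshold_s=1.0):
    # Time a single sequential run; if it finishes well under the
    # ~200ms-per-worker spawn cost times a few workers, stay sequential.
    start = time.perf_counter()
    func(*args)
    elapsed = time.perf_counter() - start
    return elapsed >= threshold_s

# worth_parallelizing(quick_sum, list(range(1000)))  # -> False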

GPU with small arrays: Minimal benefit

GPU kernel launch overhead means small arrays don't benefit from GPU acceleration.

Array size      Elementwise speedup
100K elements   2.3x
1M elements     12.3x
10M elements    65.6x

Below 1M elements, the transfer overhead to GPU memory dominates the actual computation time.


Where Epochly Helps: The Numbers

Level 2 JIT: 58-193x on Numerical Loops

Epochly's Level 2 uses Numba JIT compilation to transform Python numerical loops into native machine code.

Workload         Size           Baseline    JIT      Speedup
Numerical loop   1M elements    101.25ms    1.15ms   88.3x
Nested loop      10K elements   66.54ms     1.15ms   58.0x
Polynomial eval  1M elements    324.16ms    1.68ms   193.0x

Average: 113x across tested workloads.

The variance is real: polynomial evaluation (193x) benefits more than the nested loop (58x). The workload characteristics determine where you land in the 58-193x range.

What makes a good JIT target:

  • Pure Python loops with numerical operations
  • Iterative algorithms (convergence loops, simulations)
  • Element-wise operations not already vectorized

What makes a bad JIT target:

  • String manipulation
  • Dictionary operations
  • Object-oriented patterns with polymorphism
  • Code that calls back into CPython frequently
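
For a concrete picture of a good target, here is a minimal sketch of a polynomial-evaluation loop written the way Numba compiles it well. This is plain Numba shown for intuition; Epochly's Level 2 applies the JIT for you, and the sizes and coefficients are illustrative:

import numpy as np
from numba import njit

@njit  # compiles the loop to native code on first call
def poly_eval(x, coeffs):
    out = np.empty_like(x)
    for i in range(x.shape[0]):   # pure numerical loop, no Python objects
        acc = 0.0
        for c in coeffs:          # Horner's rule
            acc = acc * x[i] + c
        out[i] = acc
    return out

x = np.random.rand(1_000_000)
coeffs = np.array([2.0, -3.0, 0.5, 1.0])
poly_eval(x, coeffs)  # first call includes compile time; later calls are fast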

Level 3 Parallel: 8-12x on CPU-bound Work (16 cores)

Level 3 distributes CPU-bound work across multiple cores using ProcessPoolExecutor.

Workload                   Sequential   Parallel   Speedup
Pure Python loop           40.03s       9.02s      4.44x
NumPy compute (64 tasks)   13.48s       5.16s      2.61x
Monte Carlo (32 tasks)     17.88s       5.87s      3.04x

The table above shows specific benchmarks with varying degrees of task parallelism. Workloads that fully saturate all 16 cores (embarrassingly parallel, with enough tasks) reach 8-12x with the ProcessPool backend.

On Apple M2 Max (16 cores, Python 3.13.5): 8.7x on pure Python parallel workloads with full core saturation, 54% parallel efficiency.

Important context: 16 cores does not mean 16x speedup. Amdahl's Law, process spawn overhead (~200ms), and memory contention limit practical efficiency to 50-60%. The gap between the table results (2.6-4.4x) and the 8-12x headline reflects workloads with different levels of core saturation.
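
For intuition, this is roughly the pattern Level 3 automates. A minimal hand-rolled sketch with ProcessPoolExecutor, where the kernel, chunk count, and worker count are illustrative:

from concurrent.futures import ProcessPoolExecutor

def simulate_chunk(n):
    # Placeholder CPU-bound kernel: independent work per chunk
    total = 0.0
    for i in range(n):
        total += (i % 7) ** 0.5
    return total

if __name__ == "__main__":
    chunks = [2_000_000] * 64  # enough tasks to keep 16 cores busy
    with ProcessPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(simulate_chunk, chunks))
    print(sum(results))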

Level 4 GPU: Up to 70x on Large Arrays

GPU acceleration requires large data volumes to overcome kernel launch overhead.

Elementwise operations (sin + cos + sqrt):

Array Size   Data Volume   Speedup
100K         1MB           2.3x
1M           8MB           12.3x
10M          76MB          65.6x
50M          381MB         69.8x
100M         763MB         68.1x

Matrix multiplication:

Matrix Size   Speedup
512x512       2.5x
1024x1024     9.5x
2048x2048     6.3x
4096x4096     7.0x

Batched convolution:

Batch Config              Speedup
batch=1, 64ch, 224px      15.6x
batch=16, 64ch, 224px     19.4x
batch=32, 128ch, 112px    13.8x
batch=64, 256ch, 56px     14.8x

Large array reductions:

Array Size   Speedup
1M           1.8x
10M          21.5x
100M         35.9x

All GPU results were measured on an NVIDIA Quadro M6000 24GB with CUDA 12.1 and PyTorch 2.5.1.
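
A minimal sketch of the elementwise benchmark shape, assuming PyTorch with CUDA is available (array size and timing details simplified):

import time
import torch

def elementwise(x):
    # Matches the benchmark's elementwise mix: sin + cos + sqrt
    return torch.sin(x) + torch.cos(x) + torch.sqrt(x)

n = 10_000_000
x = torch.rand(n, device="cuda")   # allocate directly on the GPU

elementwise(x)                     # warm-up: triggers kernel launches once
torch.cuda.synchronize()           # wait for the GPU before timing

start = time.perf_counter()
y = elementwise(x)
torch.cuda.synchronize()
print(f"GPU elementwise on {n:,} elements: {time.perf_counter() - start:.4f}s")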

Combined JIT + Parallel: ~1,350x Theoretical Maximum

When Level 2 JIT (113x average) and Level 3 parallel (12x on 16 cores) apply to the same workload, the combined theoretical maximum is approximately 1,350x (113 * 12).

This applies to pure Python numerical loops that are both:

  1. JIT-compilable (numerical, no string/dict operations)
  2. Embarrassingly parallel (independent iterations)

The ~1,350x figure is a theoretical maximum, not an average. Individual results depend on workload characteristics, data dependencies, and parallelizability.


Mixed Workloads: The Realistic Scenario

Most real code isn't purely numerical loops or purely I/O. It's a mix. Here's how Epochly performs on mixed workloads:

Mix                                   Speedup
Python loops around NumPy ops         6.5x (ProcessPool)
Mixed compute + I/O                   1.5-1.7x
Mostly NumPy with thin Python glue    1.0-1.2x

The speedup depends on the ratio of parallelizable Python code to already-optimized library calls. If 80% of your time is in NumPy BLAS operations, parallelizing the remaining 20% Python overhead yields modest improvement.
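
To make that concrete, here is the Amdahl-style arithmetic for the 80/20 case. This is a back-of-the-envelope upper bound, not a measurement:

def amdahl_speedup(parallel_fraction, workers):
    # Best-case overall speedup when only `parallel_fraction` of the
    # runtime can be spread across `workers` cores.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# 20% parallelizable Python glue around 80% BLAS time, on 16 cores:
print(amdahl_speedup(0.20, 16))   # ~1.23x at best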


How to Know If Epochly Will Help Your Code

Step 1: Profile

import cProfile
import pstats

cProfile.run('your_function(your_data)', 'profile_output')
stats = pstats.Stats('profile_output')
stats.sort_stats('cumulative').print_stats(10)  # top 10 by cumulative time

Step 2: Classify your bottleneck

If most time is in...       Epochly benefit     Recommended level
Python loops with math      High (58-193x)      Level 2 (JIT)
CPU-bound Python            Medium (8-12x)      Level 3 (Parallel)
Large array operations      High (up to 70x)    Level 4 (GPU)
NumPy BLAS calls            None (~1.0x)        Don't use Epochly
I/O waits                   None (~1.0x)        Use asyncio instead
Small operations (<100ms)   None/negative       Don't parallelize

Step 3: Start at Level 0

import epochly

@epochly.optimize(level=0)
def your_function(data):
    pass

Monitor first. Understand the profile. Then graduate to the appropriate level.


Our Benchmark Methodology

Transparency means showing how we test, not just the results.

Hardware: Linux WSL2, Python 3.12.3, 16 cores (x86_64), NVIDIA Quadro M6000 24GB, CUDA 12.1

Methodology:

  • Each benchmark runs multiple iterations
  • Warm-up runs excluded from measurement
  • Results verified against sequential execution for correctness
  • Standard deviation reported where applicable
  • Cold start and warm start measured separately

Reproduction: The complete benchmark suite is available. Run it on your hardware to see your specific results.


The Bottom Line

Epochly makes Python significantly faster for CPU-bound numerical work. 113x on numerical loops (Numba JIT, Python 3.12.3). 8-12x with multi-core parallel execution (16 cores, ProcessPool). Up to 70x with GPU acceleration on arrays with 10M+ elements (CUDA).

It does not help with I/O-bound code, already-vectorized NumPy, small workloads, or GPU operations on small arrays.

Profile first. Understand your bottleneck. Then choose the right enhancement level.


Benchmark conditions: Python 3.12.3, Linux WSL2, 16 cores, NVIDIA Quadro M6000 24GB (CUDA 12.1). January 29, 2026 comprehensive benchmark report.

Tags: python, performance, benchmarks, transparency, limitations