Performance marketing is full of cherry-picked numbers. A 1000x speedup headline that only applies to a contrived microbenchmark. A "10x faster" claim that disappears under real workloads. We've seen it. You've seen it.
We're going to do something different. This post presents Epochly's benchmark results -- the impressive ones and the embarrassing ones -- with full context on every number.
Every claim here traces to our comprehensive benchmark report (January 29, 2026). Every test is reproducible. And we start with what doesn't work.
Where Epochly Does NOT Help
I/O-bound workloads: ~1.0x
If your Python code spends most of its time waiting for disk, network, or database responses, Epochly won't help. The CPU is already idle. Adding parallelism or JIT compilation to idle time changes nothing.
```python
# This will NOT benefit from Epochly
import requests

def fetch_all_records(urls):
    results = []
    for url in urls:
        response = requests.get(url)  # Waiting on network
        results.append(response.json())
    return results
```
Measured result: ~1.0x speedup. No improvement.
What to use instead: asyncio, aiohttp, or thread pools for I/O concurrency.
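For comparison, here is a minimal sketch of the async alternative using aiohttp (a third-party library, nothing to do with Epochly); the gain comes from overlapping the network waits, not from adding compute:

```python
import asyncio
import aiohttp

async def fetch_all_records(urls):
    # Overlap the network waits instead of parallelizing CPU work
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as response:
                return await response.json()
        return await asyncio.gather(*(fetch(url) for url in urls))

# results = asyncio.run(fetch_all_records(urls))
```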
Already-vectorized NumPy: ~1.0x
NumPy operations (matmul, dot, sum) already use optimized BLAS libraries (OpenBLAS, MKL) that parallelize internally. Adding Epochly's parallelism on top adds interception overhead without additional benefit.
```python
# This will NOT benefit from Epochly
import numpy as np

def matrix_multiply(a, b):
    return np.matmul(a, b)  # Already uses OpenBLAS/MKL threads
```
Measured result: 1.025x (within measurement noise).
Why: NumPy's BLAS backend already uses all available cores for matrix operations. There's nothing left to parallelize.
Small workloads: ~1.0x (or slower)
Process spawn overhead is approximately 200ms per worker. If your workload completes in 100ms, parallelization makes it slower.
```python
# This will NOT benefit from Epochly parallel
def quick_sum(data):
    # data has 1000 elements -- too small for parallelism
    return sum(x ** 2 for x in data)
```
Measured result: 0.998x on CPU_BOUND with size=1000. Essentially neutral.
Rule of thumb: Your workload needs to be at least a few seconds long for ProcessPool parallelism to pay off.
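A rough break-even sketch using the figures above (the worker count and timings are illustrative placeholders, not measurements):

```python
# Rough break-even estimate for ProcessPool parallelism
spawn_overhead_s = 0.2   # ~200ms spawn cost, paid up front
workers = 8              # hypothetical worker count
sequential_s = 0.1       # the 100ms workload from the example above

parallel_estimate_s = spawn_overhead_s + sequential_s / workers
print(parallel_estimate_s > sequential_s)  # True: parallelizing makes it slower
```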
GPU with small arrays: Minimal benefit
GPU kernel launch overhead means small arrays don't benefit from GPU acceleration.
| Array size | Elementwise speedup |
|---|---|
| 100K elements | 2.3x |
| 1M elements | 12.3x |
| 10M elements | 65.6x |
Below 1M elements, the transfer overhead to GPU memory dominates the actual computation time.
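If you want to see the crossover on your own hardware, a small PyTorch timing sketch like the one below works; the sin + cos + sqrt kernel mirrors the elementwise benchmark, but the exact numbers and the ~1M-element threshold will vary by GPU:

```python
import time
import torch

def elementwise(x):
    return torch.sin(x) + torch.cos(x) + torch.sqrt(x)

elementwise(torch.rand(10, device="cuda"))  # warm-up: CUDA init and kernel caching
torch.cuda.synchronize()

for n in (100_000, 10_000_000):
    cpu = torch.rand(n)
    start = time.perf_counter()
    elementwise(cpu)
    cpu_time = time.perf_counter() - start

    torch.cuda.synchronize()
    start = time.perf_counter()
    gpu = cpu.to("cuda")                 # host-to-device transfer counts as overhead
    elementwise(gpu)
    torch.cuda.synchronize()             # wait for the kernel to finish
    gpu_time = time.perf_counter() - start

    print(f"{n:>12,} elements: CPU {cpu_time * 1e3:.2f} ms, GPU {gpu_time * 1e3:.2f} ms")
```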
Where Epochly Helps: The Numbers
Level 2 JIT: 58-193x on Numerical Loops
Epochly's Level 2 uses Numba JIT compilation to transform Python numerical loops into native machine code.
| Workload | Size | Baseline | JIT | Speedup |
|---|---|---|---|---|
| Numerical loop | 1M elements | 101.25ms | 1.15ms | 88.3x |
| Nested loop | 10K elements | 66.54ms | 1.15ms | 58.0x |
| Polynomial eval | 1M elements | 324.16ms | 1.68ms | 193.0x |
Average: 113x across tested workloads.
The variance is real: polynomial evaluation (193x) benefits more than nested array access (58x). The workload characteristics determine where you land in the 58-193x range.
What makes a good JIT target:
- Pure Python loops with numerical operations
- Iterative algorithms (convergence loops, simulations)
- Element-wise operations not already vectorized
What makes a bad JIT target:
- String manipulation
- Dictionary operations
- Object-oriented patterns with polymorphism
- Code that calls back into CPython frequently
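To make the "good target" shape concrete, here is a minimal Numba sketch of the kind of loop Level 2 compiles; it uses Numba's public @njit decorator directly and illustrates the workload type, not the benchmark code itself:

```python
import numpy as np
from numba import njit

@njit
def polynomial_eval(x):
    # Pure numerical loop over an array: an ideal JIT target
    total = 0.0
    for i in range(x.shape[0]):
        v = x[i]
        total += 3.0 * v * v * v - 2.0 * v * v + v - 1.0
    return total

data = np.random.rand(1_000_000)
polynomial_eval(data)            # first call compiles to native code
result = polynomial_eval(data)   # later calls run the compiled version
```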
Level 3 Parallel: 8-12x on CPU-bound Work (16 cores)
Level 3 distributes CPU-bound work across multiple cores using ProcessPoolExecutor.
| Workload | Sequential | Parallel | Speedup |
|---|---|---|---|
| Pure Python loop | 40.03s | 9.02s | 4.44x |
| NumPy compute (64 tasks) | 13.48s | 5.16s | 2.61x |
| Monte Carlo (32 tasks) | 17.88s | 5.87s | 3.04x |
The table above shows individual benchmarks with varying degrees of task parallelism. Workloads that fully saturate all 16 cores (embarrassingly parallel, with enough independent tasks) reach 8-12x with ProcessPool.
On Apple M2 Max (16 cores, Python 3.13.5): 8.7x on pure Python parallel workloads with full core saturation, 54% parallel efficiency.
Important context: 16 cores does not mean 16x speedup. Amdahl's Law, process spawn overhead (~200ms), and memory contention limit practical efficiency to 50-60%. The gap between the table results (2.6-4.4x) and the 8-12x headline reflects workloads with different levels of core saturation.
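For reference, the underlying pattern is the standard library's ProcessPoolExecutor; here is a minimal sketch, where the work function and task count are placeholders rather than the benchmark code:

```python
from concurrent.futures import ProcessPoolExecutor

def simulate_chunk(seed):
    # Placeholder CPU-bound task; each chunk must be independent of the others
    total = 0.0
    for i in range(2_000_000):
        total += ((seed * 1103515245 + i) % 2_147_483_647) / 2_147_483_647
    return total

if __name__ == "__main__":
    tasks = range(64)  # enough independent tasks to keep 16 cores busy
    with ProcessPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(simulate_chunk, tasks))
    print(sum(results))
```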
Level 4 GPU: Up to 70x on Large Arrays
GPU acceleration requires large data volumes to overcome kernel launch overhead.
Elementwise operations (sin + cos + sqrt):
| Array Size | Data Volume | Speedup |
|---|---|---|
| 100K | 1MB | 2.3x |
| 1M | 8MB | 12.3x |
| 10M | 76MB | 65.6x |
| 50M | 381MB | 69.8x |
| 100M | 763MB | 68.1x |
Matrix multiplication:
| Matrix Size | Speedup |
|---|---|
| 512x512 | 2.5x |
| 1024x1024 | 9.5x |
| 2048x2048 | 6.3x |
| 4096x4096 | 7.0x |
Batched convolution:
| Batch Config | Speedup |
|---|---|
| batch=1, 64ch, 224px | 15.6x |
| batch=16, 64ch, 224px | 19.4x |
| batch=32, 128ch, 112px | 13.8x |
| batch=64, 256ch, 56px | 14.8x |
Large array reductions:
| Array Size | Speedup |
|---|---|
| 1M | 1.8x |
| 10M | 21.5x |
| 100M | 35.9x |
GPU results were measured on an NVIDIA Quadro M6000 24GB with CUDA 12.1 and PyTorch 2.5.1.
Combined JIT + Parallel: ~1,350x Theoretical Maximum
When Level 2 JIT (113x average) and Level 3 parallel (12x on 16 cores) apply to the same workload, the combined theoretical maximum is approximately 1,350x (113 * 12).
This applies to pure Python numerical loops that are both:
- JIT-compilable (numerical, no string/dict operations)
- Embarrassingly parallel (independent iterations)
The ~1,350x figure is a theoretical maximum, not an average. Individual results depend on workload characteristics, data dependencies, and parallelizability.
Mixed Workloads: The Realistic Scenario
Most real code isn't purely numerical loops or purely I/O. It's a mix. Here's how Epochly performs on mixed workloads:
| Mix | Speedup |
|---|---|
| Python loops around NumPy ops | 6.5x (ProcessPool) |
| Mixed compute + I/O | 1.5-1.7x |
| Mostly NumPy with thin Python glue | 1.0-1.2x |
The speedup depends on the ratio of parallelizable Python code to already-optimized library calls. If 80% of your time is in NumPy BLAS operations, parallelizing the remaining 20% Python overhead yields modest improvement.
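This is Amdahl's Law in action: if a fraction p of the runtime is sped up by a factor s, the overall speedup is capped at 1 / ((1 - p) + p / s). A quick check of the 80/20 case above, where the 8x factor for the Python portion is an assumed illustration:

```python
def amdahl(p, s):
    # p: fraction of runtime that gets faster, s: speedup of that fraction
    return 1.0 / ((1.0 - p) + p / s)

# 80% of time in BLAS (unchanged), 20% Python glue sped up 8x
print(f"{amdahl(0.20, 8):.2f}x")  # ~1.21x overall, consistent with the 1.0-1.2x row above
```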
How to Know If Epochly Will Help Your Code
Step 1: Profile
```python
import cProfile
import pstats

cProfile.run('your_function(your_data)', 'profile_output')
stats = pstats.Stats('profile_output')
stats.sort_stats('cumulative').print_stats(10)
```
Step 2: Classify your bottleneck
| If most time is in... | Epochly benefit | Recommended level |
|---|---|---|
| Python loops with math | High (58-193x) | Level 2 (JIT) |
| CPU-bound Python | Medium (8-12x) | Level 3 (Parallel) |
| Large array operations | High (up to 70x) | Level 4 (GPU) |
| NumPy BLAS calls | None (~1.0x) | Don't use Epochly |
| I/O waits | None (~1.0x) | Use asyncio instead |
| Small operations (<100ms) | None/negative | Don't parallelize |
Step 3: Start at Level 0
```python
import epochly

@epochly.optimize(level=0)
def your_function(data):
    pass
```
Monitor first. Understand the profile. Then graduate to the appropriate level.
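If the profile points at numerical loops, graduating might look like the snippet below; this assumes the same decorator accepts the higher levels described in this post, which is an extrapolation from the Level 0 example above:

```python
import epochly

@epochly.optimize(level=2)  # assumed usage: Level 2 (JIT) for a numerical hot loop
def polynomial_eval(data):
    total = 0.0
    for v in data:
        total += 3.0 * v * v * v - 2.0 * v * v + v - 1.0
    return total
```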
Our Benchmark Methodology
Transparency means showing how we test, not just the results.
Test environment: Linux (WSL2), Python 3.12.3, 16 cores (x86_64), NVIDIA Quadro M6000 24GB, CUDA 12.1
Methodology:
- Each benchmark runs multiple iterations
- Warm-up runs excluded from measurement
- Results verified against sequential execution for correctness
- Standard deviation reported where applicable
- Cold start and warm start measured separately
Reproduction: The complete benchmark suite is available. Run it on your hardware to see your specific results.
The Bottom Line
Epochly makes Python significantly faster for CPU-bound numerical work. A 113x average on numerical loops (Numba JIT, Python 3.12.3). 8-12x with multi-core parallel execution (16 cores, ProcessPool). Up to 70x with GPU acceleration on arrays with 10M+ elements (CUDA).
It does not help with I/O-bound code, already-vectorized NumPy, small workloads, or GPU operations on small arrays.
Profile first. Understand your bottleneck. Then choose the right enhancement level.
Benchmark conditions: Python 3.12.3, Linux WSL2, 16 cores, NVIDIA Quadro M6000 24GB (CUDA 12.1). January 29, 2026 comprehensive benchmark report.