Benchmarks

Internal performance results with a reproducible methodology. We believe in transparency: run these benchmarks yourself and verify the results.

Reproducible Results

These benchmarks use our open methodology. Run them yourself: pip install epochly && python -m epochly.benchmark

193x: JIT compilation (Level 2)

68x: GPU acceleration (Level 4)

<5%: Overhead when not helping

GPU example: a 100M-element array operation drops from 1,427ms to 21ms (68x).
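The exact benchmark kernel is not reproduced on this page; as a rough illustration of the workload shape, here is a minimal NumPy sketch of a chained elementwise operation like the one measured above (the function name and the operation chain are illustrative, not the published kernel; the Level 4 path runs the equivalent in PyTorch on CUDA). A small array is used here for brevity; the benchmark uses 100M elements.

```python
import numpy as np

def elementwise_workload(x: np.ndarray) -> np.ndarray:
    # A chain of elementwise operations: the kind of array math
    # that Level 4 can offload to the GPU in one pass.
    return np.sqrt(np.abs(x)) * 1.5 + np.sin(x)

# 1M elements for illustration; the benchmark above uses 100M.
x = np.random.default_rng(0).standard_normal(1_000_000)
y = elementwise_workload(x)
print(y.shape)
```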

Detailed Results

| Workload | Baseline | With Epochly | Speedup | Level |
|---|---|---|---|---|
| JIT polynomial evaluation (1M iterations) | 324.16ms | 1.68ms | 193x | Level 2 |
| JIT numerical loop (1M iterations) | 101.25ms | 1.15ms | 88x | Level 2 |
| JIT nested loop (10K iterations) | 66.54ms | 1.15ms | 58x | Level 2 |
| GPU elementwise ops (100M elements) | 1,427ms | 21ms | 68x | Level 4 |
| GPU reduction (100M elements) | 59ms | 1.6ms | 36x | Level 4 |
| GPU convolution (batch=16, 224x224) | 148ms | 7.6ms | 19x | Level 4 |
| GPU matrix multiply (4096x4096) | 200ms | 29ms | 7x | Level 4 |
| Parallel heavy CPU (16 cores) | 1,396ms | 166ms | 8x | Level 3 |
| Monte Carlo simulation (100M samples) | 3,639ms | 504ms | 7x | Level 3 |
| Unsuitable workload | measured | measured | ~1.0x | Disabled |
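Per the Methodology notes, the Level 3 rows run the work across all cores with ProcessPoolExecutor. A minimal sketch of that pattern for a Monte Carlo workload (the `count_hits` / `estimate_pi` helpers are hypothetical, not the benchmark's own code, and the sample count here is tiny compared with the 100M-sample run above):

```python
import random
from concurrent.futures import ProcessPoolExecutor

def count_hits(seed: int, n: int) -> int:
    """Count random points landing inside the unit quarter-circle."""
    rng = random.Random(seed)  # distinct per-worker seed
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(samples: int, workers: int = 4) -> float:
    """Split the sample budget across worker processes and combine."""
    chunk = samples // workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(count_hits, range(workers), [chunk] * workers))
    return 4.0 * hits / (chunk * workers)

if __name__ == "__main__":
    print(estimate_pi(200_000))
```

Each worker gets its own seed so the processes do not replay the same random stream; the per-process chunks keep inter-process communication to a single integer per worker.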

Test Environment

CPU

16-core CPU (all cores utilized for parallel benchmarks)

GPU

NVIDIA Quadro M6000 24GB, CUDA 12.1

Software

Linux WSL2 (x86_64)
Python 3.12.3
PyTorch 2.5.1+cu121, NumPy 1.26.4

Reproduce the Benchmark

We publish our benchmark suite so you can verify our results on your own hardware. Clone the repository and run the benchmarks yourself.

# Clone the benchmark repository
$ git clone https://github.com/epochly/benchmarks
$ cd benchmarks
# Set up the environment
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
# Run all benchmarks
$ python run_benchmarks.py --all
# Run specific benchmark
$ python run_benchmarks.py --workload numpy_matrix

Methodology

  1. Each benchmark runs 10 iterations, with the first 2 discarded as warm-up.
  2. Results show the median execution time to minimize outlier impact.
  3. Baseline measurements use vanilla Python with no optimization packages.
  4. GPU benchmarks (Level 4) use PyTorch on CUDA. Parallel benchmarks (Level 3) use ProcessPoolExecutor with all cores. Pure NumPy is intentionally not intercepted (~1.0x) because it already uses optimized C code.
  5. The system is idle during benchmarks, with no other CPU-intensive processes running.
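The measurement protocol in points 1 and 2 can be sketched as a small timing helper (the `benchmark` function below is illustrative, not the published harness):

```python
import statistics
import time

def benchmark(fn, *, iterations: int = 10, warmup: int = 2) -> float:
    """Run fn repeatedly; return the median time in milliseconds
    over the post-warm-up iterations."""
    times = []
    for i in range(iterations):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:  # discard the first `warmup` runs (JIT/cache warm-up)
            times.append(elapsed_ms)
    return statistics.median(times)

# Example: time a small numeric loop.
median_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{median_ms:.3f} ms")
```

Using the median rather than the mean keeps a single slow outlier (GC pause, scheduler hiccup) from skewing the reported figure.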