Benchmarks

Internal performance results with a reproducible methodology. We believe in transparency: run these benchmarks yourself and verify the results.

Reproducible Results

These benchmarks use our open methodology. Run them yourself: pip install epochly && python -m epochly.benchmark

193x: JIT compilation (Level 2)

68x: GPU acceleration (Level 4)

<5%: Overhead when not helping

GPU example: a 100M-element array operation drops from 1,427ms to 21ms (68x).
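The exact benchmark kernel is not reproduced on this page; as a rough illustration of the workload shape, here is a minimal NumPy sketch of a chained elementwise operation like the one measured above (the function name and the operation chain are illustrative, not the published kernel; the Level 4 path runs the equivalent in PyTorch on CUDA). A small array is used here for brevity; the benchmark uses 100M elements.

```python
import numpy as np

def elementwise_workload(x: np.ndarray) -> np.ndarray:
    # A chain of elementwise operations: the kind of array math
    # that Level 4 can offload to the GPU in one pass.
    return np.sqrt(np.abs(x)) * 1.5 + np.sin(x)

# 1M elements for illustration; the benchmark above uses 100M.
x = np.random.default_rng(0).standard_normal(1_000_000)
y = elementwise_workload(x)
print(y.shape)
```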

Detailed Results

| Workload | Baseline | With Epochly | Speedup | Level |
|---|---|---|---|---|
| JIT polynomial evaluation (1M iterations) | 324.16ms | 1.68ms | 193x | Level 2 |
| JIT numerical loop (1M iterations) | 101.25ms | 1.15ms | 88x | Level 2 |
| JIT nested loop (10K iterations) | 66.54ms | 1.15ms | 58x | Level 2 |
| GPU elementwise ops (100M elements) | 1,427ms | 21ms | 68x | Level 4 |
| GPU reduction (100M elements) | 59ms | 1.6ms | 36x | Level 4 |
| GPU convolution (batch=16, 224x224) | 148ms | 7.6ms | 19x | Level 4 |
| GPU matrix multiply (4096x4096) | 200ms | 29ms | 7x | Level 4 |
| Parallel heavy CPU (16 cores) | 1,396ms | 166ms | 8x | Level 3 |
| Monte Carlo simulation (100M samples) | 3,639ms | 504ms | 7x | Level 3 |
| Unsuitable workload | measured | measured | ~1.0x | Disabled |
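Per the Methodology notes, the Level 3 rows run the work across all cores with ProcessPoolExecutor. A minimal sketch of that pattern for a Monte Carlo workload (the `count_hits` / `estimate_pi` helpers are hypothetical, not the benchmark's own code, and the sample count here is tiny compared with the 100M-sample run above):

```python
import random
from concurrent.futures import ProcessPoolExecutor

def count_hits(seed: int, n: int) -> int:
    """Count random points landing inside the unit quarter-circle."""
    rng = random.Random(seed)  # distinct per-worker seed
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(samples: int, workers: int = 4) -> float:
    """Split the sample budget across worker processes and combine."""
    chunk = samples // workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(count_hits, range(workers), [chunk] * workers))
    return 4.0 * hits / (chunk * workers)

if __name__ == "__main__":
    print(estimate_pi(200_000))
```

Each worker gets its own seed so the processes do not replay the same random stream; the per-process chunks keep inter-process communication to a single integer per worker.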

Test Environment

CPU

16-core CPU (all cores utilized for parallel benchmarks)

GPU

NVIDIA Quadro M6000 24GB, CUDA 12.1

Software

Linux WSL2 (x86_64)
Python 3.12.3
PyTorch 2.5.1+cu121, NumPy 1.26.4

Reproduce the Benchmark

We publish our benchmark suite so you can verify our results on your own hardware. Clone the repository and run the benchmarks yourself.

# Clone the benchmark repository
$ git clone https://github.com/epochly/benchmarks
$ cd benchmarks
# Set up the environment
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
# Run all benchmarks
$ python run_benchmarks.py --all
# Run specific benchmark
$ python run_benchmarks.py --workload numpy_matrix

Methodology

  1. Each benchmark runs 10 iterations, with the first 2 discarded as warm-up.
  2. Results show the median execution time to minimize outlier impact.
  3. Baseline measurements use vanilla Python with no optimization packages.
  4. GPU benchmarks (Level 4) use PyTorch on CUDA. Parallel benchmarks (Level 3) use ProcessPoolExecutor with all cores. Pure NumPy is intentionally not intercepted (~1.0x) because it already uses optimized C code.
  5. The system is idle during benchmarks, with no other CPU-intensive processes running.
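The measurement protocol in points 1 and 2 can be sketched as a small timing helper (the `benchmark` function below is illustrative, not the published harness):

```python
import statistics
import time

def benchmark(fn, *, iterations: int = 10, warmup: int = 2) -> float:
    """Run fn repeatedly; return the median time in milliseconds
    over the post-warm-up iterations."""
    times = []
    for i in range(iterations):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:  # discard the first `warmup` runs (JIT/cache warm-up)
            times.append(elapsed_ms)
    return statistics.median(times)

# Example: time a small numeric loop.
median_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{median_ms:.3f} ms")
```

Using the median rather than the mean keeps a single slow outlier (GC pause, scheduler hiccup) from skewing the reported figure.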