Benchmarks
Internal performance results with a reproducible methodology. We believe in transparency: run these benchmarks yourself and verify the results.
Reproducible Results
These benchmarks use our open methodology. Run them yourself: `pip install epochly && python -m epochly.benchmark`
- 193x: JIT compilation (Level 2)
- 70x: GPU acceleration (Level 4)
- <5%: overhead when not helping

GPU example: a 100M-element array operation drops from 1,427ms to 21ms (68x).
Detailed Results
| Workload | Baseline | With Epochly | Speedup | Level |
|---|---|---|---|---|
| JIT polynomial evaluation (1M iterations) | 324.16ms | 1.68ms | 193x | Level 2 |
| JIT numerical loop (1M iterations) | 101.25ms | 1.15ms | 88x | Level 2 |
| JIT nested loop (10K iterations) | 66.54ms | 1.15ms | 58x | Level 2 |
| GPU elementwise ops (100M elements) | 1,427ms | 21ms | 68x | Level 4 |
| GPU reduction (100M elements) | 59ms | 1.6ms | 36x | Level 4 |
| GPU convolution (batch=16, 224x224) | 148ms | 7.6ms | 19x | Level 4 |
| GPU matrix multiply (4096x4096) | 200ms | 29ms | 7x | Level 4 |
| Parallel heavy CPU (16 cores) | 1,396ms | 166ms | 8x | Level 3 |
| Monte Carlo simulation (100M samples) | 3,639ms | 504ms | 7x | Level 3 |
| Unsuitable workload | measured | measured | ~1.0x | Disabled |
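The speedup column is simply baseline time divided by optimized time. As a quick sanity check, here is that arithmetic applied to a few rows of the table (the dictionary below just restates the table's numbers; it is not part of the benchmark suite):

```python
# Speedup = baseline time / optimized time, using figures from the table above.
results = {
    "JIT polynomial evaluation": (324.16, 1.68),  # (baseline ms, optimized ms)
    "GPU elementwise ops":       (1427.0, 21.0),
    "Parallel heavy CPU":        (1396.0, 166.0),
}

for name, (baseline_ms, optimized_ms) in results.items():
    speedup = baseline_ms / optimized_ms
    print(f"{name}: {speedup:.0f}x")  # → 193x, 68x, 8x
```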
Test Environment

- CPU: 16-core CPU (all cores utilized for parallel benchmarks)
- GPU: NVIDIA Quadro M6000 24GB, CUDA 12.1
- Software: Linux WSL2 (x86_64), Python 3.12.3, PyTorch 2.5.1+cu121, NumPy 1.26.4
Reproduce the Benchmark
We publish our benchmark suite so you can verify our results on your own hardware. Clone the repository and run the benchmarks yourself.
```
# Clone the benchmark repository
$ git clone https://github.com/epochly/benchmarks
$ cd benchmarks

# Set up the environment
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

# Run all benchmarks
$ python run_benchmarks.py --all

# Run specific benchmark
$ python run_benchmarks.py --workload numpy_matrix
```
Methodology
1. Each benchmark runs 10 iterations, with the first 2 discarded as warm-up.
2. Results report the median execution time to minimize outlier impact.
3. Baseline measurements use vanilla Python with no optimization packages.
4. GPU benchmarks (Level 4) use PyTorch on CUDA. Parallel benchmarks (Level 3) use ProcessPoolExecutor with all cores. Pure NumPy is intentionally not intercepted (~1.0x) because it already uses optimized C code.
5. The system is idle during benchmarks, with no other CPU-intensive processes running.
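The warm-up-and-median procedure above can be sketched in a few lines of pure Python. This is a minimal illustration only; the function names and the sample workload are ours, not the actual benchmark suite's API:

```python
import time
from statistics import median

def benchmark(fn, *args, iterations=10, warmup=2):
    """Time fn over `iterations` runs, discard the first `warmup`
    runs, and return the median of the rest in milliseconds."""
    times_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn(*args)
        times_ms.append((time.perf_counter() - start) * 1000)
    return median(times_ms[warmup:])

# Example workload: a small pure-Python polynomial loop (hypothetical)
def poly(n):
    return sum(3 * x * x + 2 * x + 1 for x in range(n))

print(f"median: {benchmark(poly, 100_000):.2f} ms")
```

Discarding warm-up runs avoids counting one-time costs (interpreter startup effects, JIT compilation, cache population), and the median is robust to a single slow outlier run.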