GPU Optimization Strategy
This document explains Epochly's approach to GPU acceleration and how to use it effectively.
GPU Acceleration Overview
Epochly's Level 4 provides GPU acceleration via CuPy for compatible workloads. GPU acceleration is applied selectively based on workload characteristics to ensure actual performance improvements.
Why NumPy Operations Are Not Auto-Intercepted
The Question
A common question is: "Why doesn't Epochly automatically GPU-accelerate all NumPy operations?"
The Answer: Benchmarks Showed It Hurts Performance
During development, we implemented automatic NumPy function interception. Our benchmarks revealed:
| Operation | Expected Speedup | Actual Result |
|---|---|---|
| numpy.matmul | 2-10x | 0.97x (3% slower) |
| numpy.dot | 2-10x | 0.95x (5% slower) |
| numpy.sum | 2-5x | 0.92x (8% slower) |
Result: Automatic interception made things worse, not better.
Why This Happens
- NumPy is already optimized: NumPy uses OpenBLAS or Intel MKL internally, which are highly optimized C/Fortran libraries with their own parallelization.
- Transfer overhead: Moving data CPU <-> GPU takes time. For operations that complete in microseconds, the transfer time dominates.
- Double parallelization: Intercepting already-parallel NumPy operations and adding GPU parallelization causes contention.
- IPC overhead: The interception mechanism itself adds inter-process communication overhead.
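The transfer-overhead point can be made concrete with a back-of-envelope model. The sketch below is not Epochly code, and every constant in it (bandwidth, latency, CPU and GPU throughput) is an illustrative assumption rather than a measurement. It only shows the shape of the trade-off: for small matrices the fixed transfer cost outweighs the compute saved.

```python
# Back-of-envelope model: when does CPU<->GPU transfer dominate a matmul?
# All constants are illustrative assumptions, not measured values.

PCIE_BYTES_PER_S = 12e9      # assumed effective host<->device bandwidth
TRANSFER_LATENCY_S = 20e-6   # assumed fixed cost per transfer
CPU_FLOPS = 100e9            # assumed CPU matmul throughput
GPU_FLOPS = 5e12             # assumed GPU matmul throughput


def matmul_estimate(n: int, itemsize: int = 8) -> dict:
    """Estimate CPU time and GPU time (including transfers) for an n x n matmul."""
    flops = 2 * n**3                      # floating-point operations in a dense matmul
    matrix_bytes = n * n * itemsize       # itemsize=8 corresponds to float64
    # Two inputs go to the device, one result comes back.
    transfer = 3 * (TRANSFER_LATENCY_S + matrix_bytes / PCIE_BYTES_PER_S)
    cpu_time = flops / CPU_FLOPS
    gpu_time = flops / GPU_FLOPS + transfer
    return {"n": n, "cpu_s": cpu_time, "gpu_s": gpu_time,
            "speedup": cpu_time / gpu_time}


for n in (100, 1000, 10000):
    est = matmul_estimate(n)
    print(f"n={est['n']:>6}: cpu={est['cpu_s']:.6f}s "
          f"gpu={est['gpu_s']:.6f}s speedup={est['speedup']:.2f}x")
```

Under these assumed constants the 100x100 case comes out slower on the GPU and the 10000x10000 case faster. Real crossover points depend on the hardware and should be measured.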
Design Decision
Based on empirical evidence, automatic NumPy interception is disabled. This is not a missing feature - it's a data-driven engineering decision.
When GPU Acceleration Helps
GPU acceleration provides significant speedup when:
1. Large Array Operations
```python
import numpy as np

# Small array - GPU won't help (transfer overhead dominates)
small = np.random.randn(100, 100)
result = small @ small  # ~0.1ms, GPU transfer takes longer

# Large array - GPU helps significantly
large = np.random.randn(10000, 10000)
result = large @ large  # ~200ms on CPU, ~29ms on GPU = 7x speedup
```
2. Custom Numerical Kernels
Functions that perform element-wise operations on large arrays benefit from GPU acceleration.
3. Batch Operations
```python
# Processing many arrays benefits from GPU
for batch in batches:
    result = process(batch)  # Amortizes transfer cost
```
4. Operations NumPy Doesn't Parallelize
Custom operations that aren't in OpenBLAS/MKL benefit from GPU parallelization.
GPU Module API
The epochly.gpu module provides the following functionality:
Checking GPU Availability
```python
from epochly.gpu import is_gpu_available, get_gpu_info

# Check if GPU is available
if is_gpu_available():
    print("GPU acceleration available")

    # Get GPU information
    info = get_gpu_info()
    print(f"Device: {info.device_name}")
    print(f"Memory: {info.memory_total // (1024**3)} GB")
    print(f"Compute capability: {info.compute_capability}")
```
Running GPU Diagnostics
```python
from epochly.gpu import run_diagnostics, format_report, get_installation_guide

# Run comprehensive GPU diagnostics
report = run_diagnostics()
print(format_report(report))

# If GPU not working, get installation instructions
if not report.overall_status:
    print(get_installation_guide())
```
Using the CuPy Manager
The CuPyManager class handles GPU operations with automatic memory management:
```python
from epochly.gpu import get_gpu_manager

# Get the singleton manager instance
manager = get_gpu_manager()

# Check if manager is enabled
if manager.is_enabled():
    # Check if a specific operation should use GPU
    array_size = 100 * 1024 * 1024  # 100MB
    if manager.should_use_gpu(array_size, 'matmul'):
        print("GPU recommended for this operation")
```
Converting Arrays Between CPU and GPU
```python
import numpy as np
from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

if manager.is_enabled():
    # Convert NumPy array to CuPy (GPU)
    numpy_array = np.random.randn(1000, 1000)
    gpu_array = manager.numpy_to_cupy(numpy_array)

    # Perform operations on GPU...

    # Convert back to NumPy (CPU)
    result = manager.cupy_to_numpy(gpu_array)
```
GPU Context Manager
Use the context manager for GPU operations with automatic cleanup:
```python
from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

with manager.gpu_context() as cp:
    if cp is not None:
        # CuPy is available, perform GPU operations
        gpu_array = cp.random.randn(1000, 1000)
        result = cp.linalg.svd(gpu_array)
# Memory cleanup happens automatically
```
Creating GPU-Accelerated Functions
Create wrapper functions that automatically use GPU when beneficial:
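```python
import numpy as np
from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

# Create a GPU-accelerated version of a NumPy function
gpu_matmul = manager.create_gpu_accelerated_function(np.matmul, 'matmul')

# Example inputs (sizes chosen for illustration only)
large_array_a = np.random.randn(5000, 5000)
large_array_b = np.random.randn(5000, 5000)

# Use it - GPU is used automatically when beneficial
result = gpu_matmul(large_array_a, large_array_b)
```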
Performance Statistics
Track GPU operation performance:
```python
from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

# Get performance statistics
stats = manager.get_performance_stats()
print(f"GPU operations: {stats['gpu_operations']}")
print(f"CPU operations: {stats['cpu_operations']}")
print(f"GPU ratio: {stats['gpu_operation_ratio']:.1%}")
print(f"Avg GPU time: {stats['avg_gpu_time']*1000:.2f}ms")
print(f"Memory transfers: {stats['memory_transfers']}")

# Reset statistics
manager.reset_stats()
```
Intelligent Memory Management
Epochly provides intelligent GPU memory management that guards against out-of-memory failures, falling back to the CPU where needed:
Enabling Intelligent Memory
```python
from epochly.gpu.intelligent_memory import (
    enable_intelligent_memory,
    disable_intelligent_memory,
    is_intelligent_memory_enabled,
    cleanup_gpu_memory,
)

# Enable intelligent memory management
enable_intelligent_memory()

# Check if enabled
if is_intelligent_memory_enabled():
    print("Intelligent memory management active")

# Force cleanup of GPU memory
cleanup_gpu_memory()

# Disable when done
disable_intelligent_memory()
```
Safe GPU Operations Decorator
Use the @safe_gpu_operation decorator for operations that might run out of GPU memory:
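```python
import cupy as cp
from epochly.gpu.intelligent_memory import safe_gpu_operation


@safe_gpu_operation
def my_gpu_function(x, y):
    """This function automatically falls back to CPU on OOM."""
    return cp.linalg.svd(x @ y)


# Example inputs (sizes chosen for illustration only)
large_array_a = cp.random.randn(5000, 5000)
large_array_b = cp.random.randn(5000, 5000)

# If GPU runs out of memory, it falls back to CPU automatically
result = my_gpu_function(large_array_a, large_array_b)
```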
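To illustrate the general idea behind an OOM-fallback decorator, here is a minimal, self-contained sketch. It is not Epochly's `@safe_gpu_operation` and makes no claim about how that decorator is implemented. The name `with_cpu_fallback` is hypothetical, and the built-in `MemoryError` stands in for the GPU library's out-of-memory exception so that the example runs without a GPU.

```python
import functools

# Hypothetical sketch of an OOM-fallback decorator. This is NOT Epochly's
# @safe_gpu_operation; it only illustrates the general pattern. MemoryError
# stands in for the GPU library's out-of-memory exception.


def with_cpu_fallback(cpu_impl, oom_errors=(MemoryError,)):
    """Run the decorated GPU function; on an OOM error, call cpu_impl instead."""
    def decorator(gpu_impl):
        @functools.wraps(gpu_impl)
        def wrapper(*args, **kwargs):
            try:
                return gpu_impl(*args, **kwargs)
            except oom_errors:
                # Device memory exhausted: retry the same call on the CPU path.
                return cpu_impl(*args, **kwargs)
        return wrapper
    return decorator


def cpu_sum_of_squares(values):
    return sum(v * v for v in values)


@with_cpu_fallback(cpu_sum_of_squares)
def gpu_sum_of_squares(values):
    # Simulate a device that cannot hold more than 1000 elements.
    if len(values) > 1000:
        raise MemoryError("simulated GPU out-of-memory")
    return sum(v * v for v in values)


print(gpu_sum_of_squares(list(range(10))))     # stays on the simulated GPU path
print(gpu_sum_of_squares(list(range(2000))))   # falls back to the CPU path
```

With CuPy, the exception to pass as `oom_errors` would typically be `cupy.cuda.memory.OutOfMemoryError`. Only the out-of-memory case is caught, so other errors still surface to the caller.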
OOM Safety Mode
Enable global OOM (out-of-memory) protection:
```python
from epochly.gpu.intelligent_memory import enable_oom_safety, disable_oom_safety

# Enable OOM safety for all GPU operations
enable_oom_safety()

# Perform operations - automatic fallback on memory errors

# Disable when done
disable_oom_safety()
```
Multi-Backend Support
Epochly supports multiple GPU backends through the backend registry:
```python
import numpy as np
from epochly.gpu.backend_registry import (
    GPUBackendRegistry,
    GPUBackendKind,
)

# Detect all available backends
available = GPUBackendRegistry.detect_available_backends()
for kind, info in available.items():
    print(f"{kind.value}: {info.device_count} devices, {info.memory_gb:.1f}GB")

# Get the best available backend
backend = GPUBackendRegistry.get_best_available()
print(f"Using backend: {backend.info.kind.value}")

# Transfer array to GPU (example input for illustration)
numpy_array = np.random.randn(1000, 1000)
gpu_array = backend.to_gpu(numpy_array)

# Perform operations...

# Transfer back to CPU
result = backend.to_cpu(gpu_array)
```
Supported Backends
| Backend | Vendor | Library | Notes |
|---|---|---|---|
| CUDA | NVIDIA | CuPy | Fully supported |
| CPU | N/A | NumPy | Fallback when no GPU |
> Note: AMD ROCm and Intel oneAPI are not currently supported. GPU acceleration requires NVIDIA GPUs with CUDA.
GPU Selection Heuristics
Epochly uses intelligent heuristics to decide CPU vs GPU execution:
Automatic Selection Criteria
| Factor | GPU Preferred | CPU Preferred |
|---|---|---|
| Array size | > 10MB | < 10MB |
| Operation type | Matrix multiply, FFT | Simple arithmetic |
| Data locality | Data already on GPU | Data on CPU |
| Operation count | Many operations | Single operation |
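The criteria in the table can be read as a simple decision rule. The following sketch is a hypothetical illustration of such a rule, not Epochly's actual heuristic. The function names, the operation set, and the order in which the factors are checked are all assumptions made for the example.

```python
# Hypothetical sketch of a CPU-vs-GPU heuristic based on the table above.
# Not Epochly's implementation; names and rule ordering are illustrative.

GPU_FRIENDLY_OPS = {"matmul", "fft"}       # operation types from the table
SIZE_THRESHOLD_BYTES = 10 * 1000 * 1000    # 10 MB, matching the table


def array_nbytes(shape: tuple, itemsize: int = 8) -> int:
    """Size in bytes of a dense array (itemsize=8 corresponds to float64)."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


def prefer_gpu(nbytes: int, operation: str, data_on_gpu: bool = False,
               operation_count: int = 1) -> bool:
    """Return True when the table's factors point toward GPU execution."""
    if data_on_gpu:
        return True                        # no transfer cost left to pay
    if nbytes < SIZE_THRESHOLD_BYTES:
        return False                       # transfer overhead likely dominates
    # Large data: offload heavy operations, or cheap ones if there are many.
    return operation in GPU_FRIENDLY_OPS or operation_count > 1


for shape in ((100, 100), (1000, 1000), (10000, 10000)):
    size = array_nbytes(shape)
    print(shape, size, prefer_gpu(size, "matmul"))
```

Under this rule a 1000x1000 float64 matrix (8 MB) stays on the CPU, while a 10000x10000 matrix (800 MB) is offloaded.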
Configuring Offload Optimizer
The GPUOffloadOptimizer makes decisions about GPU usage:
```python
from epochly.gpu import GPUDetector, GPUOffloadOptimizer

# Get GPU info
gpu_info = GPUDetector.get_gpu_info()

# Create optimizer with custom minimum array size
optimizer = GPUOffloadOptimizer(
    gpu_info=gpu_info,
    min_array_size=50 * 1024 * 1024,  # 50MB minimum
)

# Check if operation should use GPU
should_use = optimizer.should_offload(
    data_size=100 * 1024 * 1024,  # 100MB
    operation='matmul',
)

# Get detailed analysis
analysis = optimizer.analyze_offload_opportunity(
    data_size=100 * 1024 * 1024,
    operation='matmul',
)
print(f"Decision: {analysis.decision}")
print(f"Estimated speedup: {analysis.estimated_speedup}x")
print(f"Reason: {analysis.reason}")
```
Performance Guidelines
Operations That Benefit Most from GPU
- Large matrix operations (> 1000x1000)
- FFT on large arrays (> 100,000 elements)
- Batch image processing
- Custom element-wise operations on large arrays
- Repeated operations on same data
Operations That Don't Benefit from GPU
- Small arrays (< 10,000 elements)
- Single scalar operations
- Operations dominated by Python overhead
- I/O-bound operations
- Operations with heavy CPU-GPU data transfer
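These lists are rules of thumb, so it is worth timing both paths on your own workload before committing to one. The helper below is a minimal, hypothetical sketch using only the standard library. It is not part of Epochly, and the two functions being compared are stand-ins for a CPU and a GPU implementation.

```python
import statistics
import time

# Minimal timing helper for comparing two implementations of the same
# operation. Hypothetical; not part of Epochly.


def median_runtime(func, *args, repeats: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock time of func(*args) in seconds."""
    for _ in range(warmup):
        func(*args)                  # warm caches and lazy initialisation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline, candidate, *args) -> float:
    """Ratio above 1 means candidate ran faster than baseline."""
    return median_runtime(baseline, *args) / median_runtime(candidate, *args)


def baseline_impl(values):
    return sum(v * v for v in values)


def candidate_impl(values):
    return sum(map(lambda v: v * v, values))


data = list(range(50_000))
print(f"candidate vs baseline: {speedup(baseline_impl, candidate_impl, data):.2f}x")
```

When timing real GPU code, keep in mind that kernel launches are usually asynchronous. With CuPy, for example, the timed function should synchronize the device (such as via `cp.cuda.Stream.null.synchronize()`) before the timer stops, and the measurement should include the CPU-GPU transfers if they are part of the real workload.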
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| EPOCHLY_GPU_ENABLED | Enable/disable GPU | true |
| EPOCHLY_GPU_MEMORY_LIMIT | GPU memory limit (MB) | Auto-detected |
| EPOCHLY_GPU_WORKLOAD_THRESHOLD | Minimum workload for GPU | 10000000 (10MB) |
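As an illustration of how these variables and their defaults fit together, the sketch below parses them from an environment mapping. It is a hypothetical helper, not Epochly's own configuration code. The accepted spellings for the boolean and the integer parsing are assumptions.

```python
import os

# Hypothetical sketch of reading the variables above. Epochly's own parsing
# rules (accepted spellings, units, validation) may differ.


def read_gpu_settings(env=None) -> dict:
    """Parse GPU-related settings from an environment mapping."""
    env = os.environ if env is None else env
    enabled_raw = env.get("EPOCHLY_GPU_ENABLED", "true")
    limit_raw = env.get("EPOCHLY_GPU_MEMORY_LIMIT")   # in MB; unset means auto-detect
    threshold_raw = env.get("EPOCHLY_GPU_WORKLOAD_THRESHOLD", "10000000")
    return {
        "enabled": enabled_raw.strip().lower() in {"1", "true", "yes"},
        "memory_limit_mb": int(limit_raw) if limit_raw else None,
        "workload_threshold": int(threshold_raw),
    }


# Defaults, as listed in the table.
print(read_gpu_settings({}))

# Explicit overrides.
print(read_gpu_settings({
    "EPOCHLY_GPU_ENABLED": "false",
    "EPOCHLY_GPU_MEMORY_LIMIT": "4096",
}))
```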
Troubleshooting
GPU Not Detected
```python
from epochly.gpu import run_diagnostics, format_report

# Run diagnostics
report = run_diagnostics()
print(format_report(report, verbose=True))
```
Common Issues
- CuPy not installed: `pip install cupy-cuda12x`
- CUDA drivers missing: Install NVIDIA drivers
- Insufficient GPU memory: Reduce workload size or use intelligent memory
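The first two issues can be checked quickly without importing Epochly. The sketch below is a hypothetical helper that uses only the standard library: it checks whether the `cupy` package can be found and whether the `nvidia-smi` utility, which ships with the NVIDIA driver, is on the `PATH`. It is a rough pre-check, not a replacement for `run_diagnostics()`.

```python
import importlib.util
import shutil

# Hypothetical pre-check for the common issues above; not part of Epochly.


def check_gpu_prerequisites() -> dict:
    """Report whether CuPy is importable and nvidia-smi is on the PATH."""
    return {
        # find_spec locates the package without importing it.
        "cupy_installed": importlib.util.find_spec("cupy") is not None,
        # nvidia-smi is installed alongside the NVIDIA driver.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


for name, ok in check_gpu_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```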
Getting User-Friendly Errors
```python
from epochly.gpu import get_user_friendly_gpu_error

try:
    # GPU operation that fails
    result = problematic_operation()
except Exception as e:
    message, suggestion = get_user_friendly_gpu_error(e)
    print(f"Error: {message}")
    print(f"Suggestion: {suggestion}")
```
Summary
- Automatic NumPy interception is disabled by design based on benchmarks
- GPU acceleration provides real speedup for large arrays and custom kernels
- Use the CuPy Manager for explicit GPU operations
- Use intelligent memory management for automatic OOM protection
- Use the `@safe_gpu_operation` decorator for functions that might exceed GPU memory
- Always measure before assuming GPU will help
- Epochly's automatic level progression handles most cases correctly