GPU Optimization Strategy

This document explains Epochly's approach to GPU acceleration and how to use it effectively.

GPU Acceleration Overview

Epochly's Level 4 provides GPU acceleration via CuPy for compatible workloads. It is applied selectively, based on workload characteristics, so that work is offloaded only when doing so is expected to improve performance.

Why NumPy Operations Are Not Auto-Intercepted

The Question

A common question is: "Why doesn't Epochly automatically GPU-accelerate all NumPy operations?"

The Answer: Benchmarks Showed It Hurts Performance

During development, we implemented automatic NumPy function interception. Our benchmarks revealed:

Operation      Expected Speedup   Actual Result
numpy.matmul   2-10x              0.97x (3% slower)
numpy.dot      2-10x              0.95x (5% slower)
numpy.sum      2-5x               0.92x (8% slower)

Result: Automatic interception made things worse, not better.

Why This Happens

  1. NumPy is already optimized: NumPy dispatches linear-algebra routines such as matmul and dot to OpenBLAS or Intel MKL, which are highly optimized C/Fortran libraries with their own multithreading.
  2. Transfer overhead: Moving data CPU <-> GPU takes time. For operations that complete in microseconds, the transfer time dominates.
  3. Double parallelization: Intercepting already-parallel NumPy operations and adding GPU parallelization causes contention.
  4. IPC overhead: The interception mechanism itself adds inter-process communication overhead.
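
The transfer-overhead argument in point 2 can be made concrete with a back-of-envelope model. The sketch below is illustrative only: the function name `offload_speedup`, the 12 GB/s effective bandwidth, the 10 microsecond fixed latency, and all timings are assumed placeholder values, not Epochly measurements.

```python
def offload_speedup(nbytes, cpu_seconds, gpu_seconds,
                    bandwidth_bytes_per_s=12e9, latency_s=10e-6):
    """Estimate the speedup of GPU offload including host<->device copies.

    All defaults are assumed placeholder values, not measured figures.
    """
    # One copy to the device and one copy back, each paying a fixed latency.
    transfer_seconds = 2 * (latency_s + nbytes / bandwidth_bytes_per_s)
    return cpu_seconds / (gpu_seconds + transfer_seconds)


# Small input: 80 KB, 30 us on CPU, 10 us of GPU compute (assumed numbers).
small = offload_speedup(80_000, cpu_seconds=30e-6, gpu_seconds=10e-6)

# Large input: 800 MB, 2 s on CPU, 0.1 s of GPU compute (assumed numbers).
large = offload_speedup(800_000_000, cpu_seconds=2.0, gpu_seconds=0.1)

print(f"small: {small:.2f}x, large: {large:.2f}x")
```

With these assumed numbers the small case comes out slower than the CPU, while the large case still benefits even though the copies take longer than the GPU compute itself.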

Design Decision

Based on this empirical evidence, automatic NumPy interception is disabled. This is not a missing feature; it is a data-driven engineering decision.

When GPU Acceleration Helps

GPU acceleration provides a significant speedup in the following cases:

1. Large Array Operations

import numpy as np
# Small array - GPU won't help (transfer overhead dominates)
small = np.random.randn(100, 100)
result = small @ small # ~0.1ms, GPU transfer takes longer
# Large array - GPU helps significantly
large = np.random.randn(10000, 10000)
result = large @ large # ~200ms on CPU, ~29ms on GPU = 7x speedup

2. Custom Numerical Kernels

Functions that perform element-wise operations on large arrays benefit from GPU acceleration.

3. Batch Operations

# Processing many arrays benefits from GPU
for batch in batches:
    result = process(batch)  # Amortizes transfer cost

4. Operations NumPy Doesn't Parallelize

Custom operations that aren't in OpenBLAS/MKL benefit from GPU parallelization.

GPU Module API

The epochly.gpu module provides the following functionality:

Checking GPU Availability

from epochly.gpu import is_gpu_available, get_gpu_info

# Check if GPU is available
if is_gpu_available():
    print("GPU acceleration available")

    # Get GPU information
    info = get_gpu_info()
    print(f"Device: {info.device_name}")
    print(f"Memory: {info.memory_total // (1024**3)} GB")
    print(f"Compute capability: {info.compute_capability}")

Running GPU Diagnostics

from epochly.gpu import run_diagnostics, format_report, get_installation_guide

# Run comprehensive GPU diagnostics
report = run_diagnostics()
print(format_report(report))

# If GPU not working, get installation instructions
if not report.overall_status:
    print(get_installation_guide())

Using the CuPy Manager

The CuPyManager class handles GPU operations with automatic memory management:

from epochly.gpu import get_gpu_manager

# Get the singleton manager instance
manager = get_gpu_manager()

# Check if manager is enabled
if manager.is_enabled():
    # Check if a specific operation should use GPU
    array_size = 100 * 1024 * 1024  # 100MB
    if manager.should_use_gpu(array_size, 'matmul'):
        print("GPU recommended for this operation")

Converting Arrays Between CPU and GPU

import numpy as np
from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

if manager.is_enabled():
    # Convert NumPy array to CuPy (GPU)
    numpy_array = np.random.randn(1000, 1000)
    gpu_array = manager.numpy_to_cupy(numpy_array)

    # Perform operations on GPU...

    # Convert back to NumPy (CPU)
    result = manager.cupy_to_numpy(gpu_array)

GPU Context Manager

Use the context manager for GPU operations with automatic cleanup:

from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

with manager.gpu_context() as cp:
    if cp is not None:
        # CuPy is available, perform GPU operations
        gpu_array = cp.random.randn(1000, 1000)
        result = cp.linalg.svd(gpu_array)
# Memory cleanup happens automatically
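
For readers curious how a context manager of this shape can behave when CuPy is missing, here is a minimal sketch. It is not Epochly's implementation: `gpu_context_sketch` is a hypothetical name, and freeing the default memory pool on exit is an assumed cleanup policy. The call `cupy.get_default_memory_pool().free_all_blocks()` is CuPy's own API for releasing cached blocks.

```python
from contextlib import contextmanager


@contextmanager
def gpu_context_sketch():
    """Yield the cupy module, or None when CuPy cannot be imported."""
    try:
        import cupy as cp
    except ImportError:
        cp = None
    try:
        yield cp
    finally:
        if cp is not None:
            # Release cached device memory held by the pool (assumed policy).
            cp.get_default_memory_pool().free_all_blocks()


with gpu_context_sketch() as cp:
    backend_name = "cupy" if cp is not None else "cpu-fallback"

print(backend_name)
```

Yielding `None` instead of raising keeps the calling code identical on machines with and without a GPU, which matches the `if cp is not None` check in the example above.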

Creating GPU-Accelerated Functions

Create wrapper functions that automatically use GPU when beneficial:

import numpy as np
from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

# Create a GPU-accelerated version of a NumPy function
gpu_matmul = manager.create_gpu_accelerated_function(np.matmul, 'matmul')

# Example inputs (sizes are illustrative)
large_array_a = np.random.randn(5000, 5000)
large_array_b = np.random.randn(5000, 5000)

# Use it - GPU is used automatically when beneficial
result = gpu_matmul(large_array_a, large_array_b)

Performance Statistics

Track GPU operation performance:

from epochly.gpu import get_gpu_manager

manager = get_gpu_manager()

# Get performance statistics
stats = manager.get_performance_stats()
print(f"GPU operations: {stats['gpu_operations']}")
print(f"CPU operations: {stats['cpu_operations']}")
print(f"GPU ratio: {stats['gpu_operation_ratio']:.1%}")
print(f"Avg GPU time: {stats['avg_gpu_time']*1000:.2f}ms")
print(f"Memory transfers: {stats['memory_transfers']}")

# Reset statistics
manager.reset_stats()

Intelligent Memory Management

Epochly provides intelligent GPU memory management that prevents out-of-memory errors:

Enabling Intelligent Memory

from epochly.gpu.intelligent_memory import (
    enable_intelligent_memory,
    disable_intelligent_memory,
    is_intelligent_memory_enabled,
    cleanup_gpu_memory,
)

# Enable intelligent memory management
enable_intelligent_memory()

# Check if enabled
if is_intelligent_memory_enabled():
    print("Intelligent memory management active")

# Force cleanup of GPU memory
cleanup_gpu_memory()

# Disable when done
disable_intelligent_memory()

Safe GPU Operations Decorator

Use the @safe_gpu_operation decorator for operations that might run out of GPU memory:

from epochly.gpu.intelligent_memory import safe_gpu_operation
import cupy as cp

@safe_gpu_operation
def my_gpu_function(x, y):
    """This function automatically falls back to CPU on OOM."""
    return cp.linalg.svd(x @ y)

# If GPU runs out of memory, it falls back to CPU automatically
result = my_gpu_function(large_array_a, large_array_b)
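
The fallback pattern behind a decorator like this can be sketched in plain Python. This is an illustration, not the source of `@safe_gpu_operation`: the name `with_cpu_fallback` and its explicit `cpu_fn` argument are hypothetical, and the sketch assumes the GPU library reports out-of-memory with a `MemoryError` subclass (CuPy's `OutOfMemoryError` derives from `MemoryError` in current releases, to the best of our knowledge).

```python
import functools


def with_cpu_fallback(cpu_fn):
    """Decorator factory: run the GPU function, retry with cpu_fn on OOM."""
    def decorator(gpu_fn):
        @functools.wraps(gpu_fn)
        def wrapper(*args, **kwargs):
            try:
                return gpu_fn(*args, **kwargs)
            except MemoryError:
                # Assumption: GPU out-of-memory surfaces as a MemoryError subclass.
                return cpu_fn(*args, **kwargs)
        return wrapper
    return decorator


def cpu_total(values):
    """CPU implementation used when the GPU path fails."""
    return sum(values)


@with_cpu_fallback(cpu_total)
def gpu_total(values):
    # Stand-in for a real GPU call; simulates running out of device memory.
    raise MemoryError("simulated GPU out-of-memory")


result = gpu_total([1, 2, 3])
print(result)
```

Catching only `MemoryError` is deliberate: other exceptions usually indicate a bug rather than resource pressure and should still propagate.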

OOM Safety Mode

Enable global OOM (out-of-memory) protection:

from epochly.gpu.intelligent_memory import enable_oom_safety, disable_oom_safety

# Enable OOM safety for all GPU operations
enable_oom_safety()

# Perform operations - automatic fallback on memory errors

# Disable when done
disable_oom_safety()

Multi-Backend Support

Epochly supports multiple GPU backends through the backend registry:

import numpy as np
from epochly.gpu.backend_registry import (
    GPUBackendRegistry,
    GPUBackendKind,
)

# Detect all available backends
available = GPUBackendRegistry.detect_available_backends()
for kind, info in available.items():
    print(f"{kind.value}: {info.device_count} devices, {info.memory_gb:.1f}GB")

# Get the best available backend
backend = GPUBackendRegistry.get_best_available()
print(f"Using backend: {backend.info.kind.value}")

# Transfer array to GPU (array size is illustrative)
numpy_array = np.random.randn(1000, 1000)
gpu_array = backend.to_gpu(numpy_array)

# Perform operations...

# Transfer back to CPU
result = backend.to_cpu(gpu_array)
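
To illustrate the registry idea (detect what is usable, then pick by priority with CPU as the last resort), here is a small standalone sketch. The names `BackendKind`, `PRIORITY`, `detect_available`, and `best_available` are hypothetical and do not reflect how `GPUBackendRegistry` is implemented. The only external call is CuPy's `cupy.cuda.runtime.getDeviceCount()`.

```python
from enum import Enum


class BackendKind(Enum):
    CUDA = "cuda"
    CPU = "cpu"


# Highest-priority backend first (assumed ordering).
PRIORITY = [BackendKind.CUDA, BackendKind.CPU]


def detect_available():
    """Return the set of usable backends; CPU is always included."""
    available = {BackendKind.CPU}
    try:
        import cupy
        if cupy.cuda.runtime.getDeviceCount() > 0:
            available.add(BackendKind.CUDA)
    except Exception:
        # CuPy is missing, or there is no working CUDA driver or device.
        pass
    return available


def best_available(available):
    """Pick the first backend in PRIORITY that is available."""
    for kind in PRIORITY:
        if kind in available:
            return kind
    raise RuntimeError("no backend available")


best = best_available(detect_available())
print(f"Using backend: {best.value}")
```

Because the CPU backend is always in the available set, `best_available` only raises if it is handed an empty set.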

Supported Backends

Backend   Vendor   Library   Notes
CUDA      NVIDIA   CuPy      Fully supported
CPU       N/A      NumPy     Fallback when no GPU

Note: AMD ROCm and Intel oneAPI are not currently supported. GPU acceleration requires NVIDIA GPUs with CUDA.

GPU Selection Heuristics

Epochly uses intelligent heuristics to decide CPU vs GPU execution:

Automatic Selection Criteria

Factor            GPU Preferred          CPU Preferred
Array size        > 10MB                 < 10MB
Operation type    Matrix multiply, FFT   Simple arithmetic
Data locality     Data already on GPU    Data on CPU
Operation count   Many operations        Single operation
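
As a rough illustration of how the four factors might combine into a single decision, consider the sketch below. The combination rule, the function name `prefer_gpu`, and the operation names in `COMPUTE_HEAVY_OPS` are assumptions made for this example; Epochly's actual heuristics are not documented here and may weigh the factors differently.

```python
COMPUTE_HEAVY_OPS = {"matmul", "fft"}  # assumed operation names


def prefer_gpu(nbytes, operation, data_on_gpu=False, op_count=1,
               threshold_bytes=10_000_000):
    """Toy decision rule built from the four factors in the table."""
    # Small data only qualifies if it already lives on the GPU,
    # because then no host-to-device transfer is needed.
    if nbytes <= threshold_bytes and not data_on_gpu:
        return False
    # Otherwise require work that can repay the offload:
    # a compute-heavy operation, or several operations on the same data.
    return operation in COMPUTE_HEAVY_OPS or op_count > 1


print(prefer_gpu(100 * 1024 * 1024, "matmul"))                 # large matmul
print(prefer_gpu(1_000, "add"))                                # tiny arithmetic
print(prefer_gpu(1_000, "add", data_on_gpu=True, op_count=5))  # data resident on GPU
```

Treating array size as a gate rather than one vote among four reflects the earlier point that transfer overhead dominates for small inputs.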

Configuring Offload Optimizer

The GPUOffloadOptimizer makes decisions about GPU usage:

from epochly.gpu import GPUDetector, GPUOffloadOptimizer

# Get GPU info
gpu_info = GPUDetector.get_gpu_info()

# Create optimizer with custom minimum array size
optimizer = GPUOffloadOptimizer(
    gpu_info=gpu_info,
    min_array_size=50 * 1024 * 1024,  # 50MB minimum
)

# Check if operation should use GPU
should_use = optimizer.should_offload(
    data_size=100 * 1024 * 1024,  # 100MB
    operation='matmul',
)

# Get detailed analysis
analysis = optimizer.analyze_offload_opportunity(
    data_size=100 * 1024 * 1024,
    operation='matmul',
)
print(f"Decision: {analysis.decision}")
print(f"Estimated speedup: {analysis.estimated_speedup}x")
print(f"Reason: {analysis.reason}")

Performance Guidelines

Operations That Benefit Most from GPU

  1. Large matrix operations (> 1000x1000)
  2. FFT on large arrays (> 100,000 elements)
  3. Batch image processing
  4. Custom element-wise operations on large arrays
  5. Repeated operations on same data

Operations That Don't Benefit from GPU

  1. Small arrays (< 10,000 elements)
  2. Single scalar operations
  3. Operations dominated by Python overhead
  4. I/O-bound operations
  5. Operations with heavy CPU-GPU data transfer
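
Because these lists are rules of thumb, it helps to time both paths on your own workload. The helper below is a minimal sketch using only the standard library. `best_time` and `measured_speedup` are hypothetical names, and the two workload functions are stand-ins that you would replace with the CPU and GPU versions of your operation.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest of `repeats` wall-clock runs of fn(), in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def measured_speedup(baseline_fn, candidate_fn, repeats=5):
    """Speedup of candidate over baseline; values below 1.0 mean slower."""
    return best_time(baseline_fn, repeats) / best_time(candidate_fn, repeats)


# Stand-in workloads; replace with the CPU and GPU versions of your operation.
def cpu_version():
    return sum(i * i for i in range(50_000))


def gpu_version():
    return sum(i * i for i in range(50_000))


speedup = measured_speedup(cpu_version, gpu_version)
print(f"speedup: {speedup:.2f}x")
```

When timing real CuPy code, synchronize the device before stopping the clock (for example with `cupy.cuda.Stream.null.synchronize()`), since GPU kernels launch asynchronously and would otherwise appear faster than they are. Include the host-to-device and device-to-host copies in the timed region if your workflow needs them.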

Environment Variables

Variable                         Purpose                    Default
EPOCHLY_GPU_ENABLED              Enable/disable GPU         true
EPOCHLY_GPU_MEMORY_LIMIT         GPU memory limit (MB)      Auto-detected
EPOCHLY_GPU_WORKLOAD_THRESHOLD   Minimum workload for GPU   10000000 (10MB)
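
The sketch below shows one way to read these three variables with the defaults listed above. It is illustrative only: `read_gpu_settings` is a hypothetical helper, and the accepted spellings for the boolean ("1", "true", "yes") are an assumption, since the exact parsing rules Epochly applies are not specified here.

```python
import os

DEFAULT_THRESHOLD = 10_000_000  # documented default for the workload threshold


def read_gpu_settings(env=os.environ):
    """Parse the three GPU variables; the parsing rules are assumptions."""
    raw_enabled = env.get("EPOCHLY_GPU_ENABLED", "true")
    enabled = raw_enabled.strip().lower() in ("1", "true", "yes")
    limit = env.get("EPOCHLY_GPU_MEMORY_LIMIT")  # in MB; None means auto-detect
    threshold = int(env.get("EPOCHLY_GPU_WORKLOAD_THRESHOLD", DEFAULT_THRESHOLD))
    return {
        "enabled": enabled,
        "memory_limit_mb": int(limit) if limit is not None else None,
        "workload_threshold": threshold,
    }


# Example: GPU disabled and a 4096 MB limit, passed as a plain dict for clarity.
settings = read_gpu_settings({
    "EPOCHLY_GPU_ENABLED": "false",
    "EPOCHLY_GPU_MEMORY_LIMIT": "4096",
})
print(settings)
```

In practice, set the variables in the shell or in `os.environ` before Epochly is imported, so that they are visible when the library initializes.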

Troubleshooting

GPU Not Detected

from epochly.gpu import run_diagnostics, format_report

# Run diagnostics
report = run_diagnostics()
print(format_report(report, verbose=True))

Common Issues

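  1. CuPy not installed: run pip install cupy-cuda12x (choose the wheel that matches your CUDA version, for example cupy-cuda11x for CUDA 11)
  2. CUDA drivers missing: install the NVIDIA drivers for your GPU
  3. Insufficient GPU memory: reduce the workload size or enable intelligent memory management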

Getting User-Friendly Errors

from epochly.gpu import get_user_friendly_gpu_error

try:
    # GPU operation that fails
    result = problematic_operation()
except Exception as e:
    message, suggestion = get_user_friendly_gpu_error(e)
    print(f"Error: {message}")
    print(f"Suggestion: {suggestion}")

Summary

  • Automatic NumPy interception is disabled by design based on benchmarks
  • GPU acceleration provides real speedup for large arrays and custom kernels
  • Use the CuPy Manager for explicit GPU operations
  • Use intelligent memory management for automatic OOM protection
  • Use @safe_gpu_operation decorator for functions that might exceed GPU memory
  • Always measure before assuming GPU will help
  • Epochly's automatic level progression handles most cases correctly