GPU Optimization Strategy

This document explains Epochly's approach to GPU acceleration and how to use it effectively.

GPU Acceleration Overview

Epochly's Level 4 provides GPU acceleration via CuPy for compatible workloads. It is applied selectively: a workload is offloaded only when its characteristics (array size, operation type, data location) suggest a net speedup once transfer costs are included.

Why NumPy Operations Are Not Auto-Intercepted

The Question

A common question is: "Why doesn't Epochly automatically GPU-accelerate all NumPy operations?"

The Answer: Benchmarks Showed It Hurts Performance

During development, we implemented automatic NumPy function interception. Our benchmarks revealed:

Operation	Expected Speedup	Actual Result
`numpy.matmul`	2-10x	0.97x (3% slower)
`numpy.dot`	2-10x	0.95x (5% slower)
`numpy.sum`	2-5x	0.92x (8% slower)

Result: Automatic interception made things worse, not better.

Why This Happens

NumPy is already optimized: NumPy's linear algebra routines typically call into OpenBLAS or Intel MKL, which are highly optimized C/Fortran libraries with their own multithreading.
Transfer overhead: Moving data between CPU and GPU memory takes time. For operations that complete in microseconds, the transfer time dominates.
Double parallelization: Intercepting already-parallel NumPy operations and adding GPU parallelization causes contention.
IPC overhead: The interception mechanism itself adds inter-process communication overhead.
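
The transfer-overhead point above can be made concrete with a rough cost model. The sketch below is illustrative only: `estimated_speedup` is a hypothetical helper, and the throughput, bandwidth, and overhead figures are assumed placeholder values, not measurements from Epochly.

```python
def estimated_speedup(
    flops,
    bytes_moved,
    cpu_flops_per_s=1e11,       # assumed CPU throughput
    gpu_flops_per_s=2e12,       # assumed GPU throughput
    transfer_bytes_per_s=1e10,  # assumed CPU <-> GPU bandwidth
    fixed_overhead_s=50e-6,     # assumed launch/interception overhead
):
    """Return estimated CPU time divided by estimated GPU time."""
    cpu_time = flops / cpu_flops_per_s
    gpu_time = (
        fixed_overhead_s
        + bytes_moved / transfer_bytes_per_s
        + flops / gpu_flops_per_s
    )
    return cpu_time / gpu_time


# Summing 10,000 float64 values: about 1e4 additions over 8e4 bytes
small = estimated_speedup(flops=1e4, bytes_moved=8e4)

# 4000x4000 float64 matmul: 2*n^3 operations, three n*n matrices moved
n = 4000
large = estimated_speedup(flops=2 * n**3, bytes_moved=3 * n * n * 8)

print(f"small op: {small:.4f}x, large op: {large:.1f}x")
```

Under these assumed numbers the small operation comes out far below 1x because the fixed overhead and transfer time dwarf the compute time, while the large matmul comes out well above 1x.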

Design Decision

Based on empirical evidence, automatic NumPy interception is disabled. This is not a missing feature - it's a data-driven engineering decision.

When GPU Acceleration Helps

GPU acceleration provides significant speedup when:

1. Large Array Operations

import numpy as np
# Small array - GPU won't help (transfer overhead dominates)
small = np.random.randn(100, 100)
result = small @ small  # ~0.1ms, GPU transfer takes longer
# Large array - GPU helps significantly
large = np.random.randn(10000, 10000)
result = large @ large  # ~200ms on CPU, ~29ms on GPU = 7x speedup
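
One way to see why size matters so much for matrix multiply is to compare the work done with the data moved. The sketch below uses the standard operation count for a dense n x n matmul (2n^3) and assumes two inputs and one output cross the CPU/GPU boundary; `matmul_arithmetic_intensity` is a hypothetical helper for illustration.

```python
def matmul_arithmetic_intensity(n, itemsize=8):
    """Floating-point operations per byte transferred for an n x n matmul."""
    flops = 2 * n**3                   # n^3 multiplies + n^3 additions
    bytes_moved = 3 * n * n * itemsize  # two inputs and one output
    return flops / bytes_moved


for n in (100, 1000, 10000):
    print(n, round(matmul_arithmetic_intensity(n), 1))
```

The ratio grows linearly with n, so each transferred byte pays for more computation as the matrices get larger.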

2. Custom Numerical Kernels

Functions that perform element-wise operations on large arrays benefit from GPU acceleration.

3. Batch Operations

# Processing many arrays benefits from GPU
# (`batches` and `process` are placeholders for your own data and GPU routine)
for batch in batches:
    result = process(batch)  # Amortizes transfer cost
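
The amortization mentioned in the comment above is simple arithmetic: a fixed setup or transfer cost is paid once and spread over every item processed. A minimal sketch, with a hypothetical helper name and made-up timings:

```python
def per_item_cost_ms(n_items, fixed_overhead_ms, per_item_ms):
    """Average cost per item when a fixed overhead is paid once."""
    return (fixed_overhead_ms + n_items * per_item_ms) / n_items


# Assumed figures: 1 ms fixed overhead, 0.1 ms of GPU work per item
for n_items in (1, 10, 100):
    print(n_items, round(per_item_cost_ms(n_items, 1.0, 0.1), 3))
```

As the batch count rises, the average cost approaches the per-item compute cost and the fixed overhead becomes negligible.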

4. Operations NumPy Doesn't Parallelize

Custom operations that aren't in OpenBLAS/MKL benefit from GPU parallelization.

GPU Module API

The epochly.gpu module provides the following functionality:

Checking GPU Availability

from epochly.gpu import is_gpu_available, get_gpu_info
# Check if GPU is available
if is_gpu_available():
    print("GPU acceleration available")
    # Get GPU information
    info = get_gpu_info()
    print(f"Device: {info.device_name}")
    print(f"Memory: {info.memory_total // (1024**3)} GB")
    print(f"Compute capability: {info.compute_capability}")

Running GPU Diagnostics

from epochly.gpu import run_diagnostics, format_report, get_installation_guide
# Run comprehensive GPU diagnostics
report = run_diagnostics()
print(format_report(report))
# If GPU not working, get installation instructions
if not report.overall_status:
    print(get_installation_guide())

Using the CuPy Manager

The CuPyManager class handles GPU operations with automatic memory management:

from epochly.gpu import get_gpu_manager
# Get the singleton manager instance
manager = get_gpu_manager()
# Check if manager is enabled
if manager.is_enabled():
    # Check if a specific operation should use GPU
    array_size = 100 * 1024 * 1024  # 100MB
    if manager.should_use_gpu(array_size, 'matmul'):
        print("GPU recommended for this operation")

Converting Arrays Between CPU and GPU

import numpy as np
from epochly.gpu import get_gpu_manager
manager = get_gpu_manager()
if manager.is_enabled():
    # Convert NumPy array to CuPy (GPU)
    numpy_array = np.random.randn(1000, 1000)
    gpu_array = manager.numpy_to_cupy(numpy_array)
    # Perform operations on GPU...
    # Convert back to NumPy (CPU)
    result = manager.cupy_to_numpy(gpu_array)

GPU Context Manager

Use the context manager for GPU operations with automatic cleanup:

from epochly.gpu import get_gpu_manager
manager = get_gpu_manager()
with manager.gpu_context() as cp:
    if cp is not None:
        # CuPy is available, perform GPU operations
        gpu_array = cp.random.randn(1000, 1000)
        u, s, vh = cp.linalg.svd(gpu_array)  # SVD returns three arrays
        # Memory cleanup happens automatically when the with block exits
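
For readers curious how a context manager of this shape could work, here is a minimal sketch. It is not Epochly's implementation: `optional_gpu_context` is a hypothetical name, and the cleanup step assumes CuPy's default memory pool (`get_default_memory_pool().free_all_blocks()`) is the resource to release.

```python
import contextlib
import importlib


@contextlib.contextmanager
def optional_gpu_context(module_name="cupy"):
    """Yield the array module if it can be imported, otherwise None."""
    try:
        xp = importlib.import_module(module_name)
    except ImportError:
        xp = None  # library unavailable; the caller should use the CPU path
    try:
        yield xp
    finally:
        # Release cached GPU blocks if the module exposes a CuPy-style pool
        if xp is not None and hasattr(xp, "get_default_memory_pool"):
            xp.get_default_memory_pool().free_all_blocks()
```

The `finally` clause is what makes cleanup automatic: it runs whether the body finishes normally or raises.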

Creating GPU-Accelerated Functions

Create wrapper functions that automatically use GPU when beneficial:

import numpy as np
from epochly.gpu import get_gpu_manager
manager = get_gpu_manager()
# Create a GPU-accelerated version of a NumPy function
gpu_matmul = manager.create_gpu_accelerated_function(np.matmul, 'matmul')
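# Example inputs (sizes are illustrative)
large_array_a = np.random.randn(4000, 4000)
large_array_b = np.random.randn(4000, 4000)
# Use it - GPU is used automatically when beneficial
result = gpu_matmul(large_array_a, large_array_b)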
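
Conceptually, a wrapper like this chooses between two implementations at call time. The sketch below shows one possible size-based dispatch. `make_size_dispatched`, its parameters, and the threshold rule are assumptions for illustration and do not describe how `create_gpu_accelerated_function` is implemented.

```python
import functools
import sys


def make_size_dispatched(cpu_fn, gpu_fn, threshold_bytes, size_of=sys.getsizeof):
    """Return a function that calls gpu_fn only for large enough inputs."""

    @functools.wraps(cpu_fn)
    def wrapper(*args, **kwargs):
        total = sum(size_of(arg) for arg in args)
        chosen = gpu_fn if total >= threshold_bytes else cpu_fn
        return chosen(*args, **kwargs)

    return wrapper


# Stand-in implementations so the sketch runs without a GPU
def cpu_add(a, b):
    return [x + y for x, y in zip(a, b)]


def gpu_add(a, b):
    return [x + y for x, y in zip(a, b)]  # imagine this ran on the GPU


# Here "size" is just the list length; the threshold is 10 elements in total
add = make_size_dispatched(cpu_add, gpu_add, threshold_bytes=10, size_of=len)
print(add([1, 2], [3, 4]))
```

With NumPy inputs, `size_of=lambda a: a.nbytes` would measure the actual array size in bytes.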

Performance Statistics

Track GPU operation performance:

from epochly.gpu import get_gpu_manager
manager = get_gpu_manager()
# Get performance statistics
stats = manager.get_performance_stats()
print(f"GPU operations: {stats['gpu_operations']}")
print(f"CPU operations: {stats['cpu_operations']}")
print(f"GPU ratio: {stats['gpu_operation_ratio']:.1%}")
print(f"Avg GPU time: {stats['avg_gpu_time']*1000:.2f}ms")
print(f"Memory transfers: {stats['memory_transfers']}")
# Reset statistics
manager.reset_stats()
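
To interpret these numbers, it can help to see how such statistics might be derived from raw counters. The class below is a hypothetical sketch that reuses the key names shown above. The formulas (ratio = GPU calls over all calls, average = total GPU time over GPU calls) are assumptions, not a description of Epochly's internals.

```python
class OperationStats:
    """Minimal counter-based tracker for GPU versus CPU operations."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.gpu_operations = 0
        self.cpu_operations = 0
        self.total_gpu_time = 0.0

    def record(self, on_gpu, seconds=0.0):
        if on_gpu:
            self.gpu_operations += 1
            self.total_gpu_time += seconds
        else:
            self.cpu_operations += 1

    def summary(self):
        total = self.gpu_operations + self.cpu_operations
        return {
            "gpu_operations": self.gpu_operations,
            "cpu_operations": self.cpu_operations,
            # Guard against division by zero before anything is recorded
            "gpu_operation_ratio": self.gpu_operations / total if total else 0.0,
            "avg_gpu_time": (
                self.total_gpu_time / self.gpu_operations
                if self.gpu_operations
                else 0.0
            ),
        }
```

A low GPU ratio under this definition would mean most calls stayed on the CPU, which is expected when inputs are small.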

Intelligent Memory Management

Epochly provides intelligent GPU memory management designed to guard against out-of-memory errors:

Enabling Intelligent Memory

from epochly.gpu.intelligent_memory import (
    enable_intelligent_memory,
    disable_intelligent_memory,
    is_intelligent_memory_enabled,
    cleanup_gpu_memory
)
# Enable intelligent memory management
enable_intelligent_memory()
# Check if enabled
if is_intelligent_memory_enabled():
    print("Intelligent memory management active")
# Force cleanup of GPU memory
cleanup_gpu_memory()
# Disable when done
disable_intelligent_memory()

Safe GPU Operations Decorator

Use the @safe_gpu_operation decorator for operations that might run out of GPU memory:

from epochly.gpu.intelligent_memory import safe_gpu_operation
import cupy as cp
@safe_gpu_operation
def my_gpu_function(x, y):
    """This function automatically falls back to CPU on OOM."""
    return cp.linalg.svd(x @ y)
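# Example inputs (illustrative sizes), created on the GPU because cp.linalg.svd expects CuPy arrays
large_array_a = cp.random.randn(4000, 4000)
large_array_b = cp.random.randn(4000, 4000)
# If GPU runs out of memory, it falls back to CPU automatically
result = my_gpu_function(large_array_a, large_array_b)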
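
The fallback pattern itself is easy to sketch in plain Python. The decorator below is a hypothetical illustration, not the implementation of `safe_gpu_operation`: unlike the real decorator, it takes an explicit CPU fallback, and it assumes the GPU library signals exhaustion with `MemoryError` or a subclass of it.

```python
import functools


def fallback_on_oom(cpu_fallback):
    """Retry with cpu_fallback if the decorated function raises MemoryError."""

    def decorator(gpu_fn):
        @functools.wraps(gpu_fn)
        def wrapper(*args, **kwargs):
            try:
                return gpu_fn(*args, **kwargs)
            except MemoryError:
                return cpu_fallback(*args, **kwargs)

        return wrapper

    return decorator


def total_on_cpu(values):
    return sum(values)


@fallback_on_oom(total_on_cpu)
def total_on_gpu(values):
    # Stand-in for a GPU routine: pretend inputs over 3 items exhaust memory
    if len(values) > 3:
        raise MemoryError("simulated GPU out-of-memory")
    return sum(values)


print(total_on_gpu([1, 2, 3]), total_on_gpu([1, 2, 3, 4, 5]))
```

Only `MemoryError` is caught here, so genuine bugs in the GPU path still surface instead of being masked by the fallback.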

OOM Safety Mode

Enable global OOM (out-of-memory) protection:

from epochly.gpu.intelligent_memory import enable_oom_safety, disable_oom_safety
# Enable OOM safety for all GPU operations
enable_oom_safety()
# Perform operations - automatic fallback on memory errors
# Disable when done
disable_oom_safety()
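
If an exception is raised between the enable and disable calls, the disable call is skipped. One way to make the pairing robust is a small context manager. This is a generic sketch with a hypothetical `scoped` helper; it takes the two functions as arguments so it runs without Epochly installed.

```python
import contextlib


@contextlib.contextmanager
def scoped(enable, disable):
    """Call enable() on entry and disable() on exit, even after an error."""
    enable()
    try:
        yield
    finally:
        disable()


# Stand-in callables so the sketch is runnable on its own
events = []
with scoped(lambda: events.append("enabled"), lambda: events.append("disabled")):
    events.append("working")

print(events)
```

With the functions imported above, this would be used as `with scoped(enable_oom_safety, disable_oom_safety): ...`.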

Multi-Backend Support

Epochly supports multiple GPU backends through the backend registry:

from epochly.gpu.backend_registry import (
    GPUBackendRegistry,
    GPUBackendKind
)
# Detect all available backends
available = GPUBackendRegistry.detect_available_backends()
for kind, info in available.items():
    print(f"{kind.value}: {info.device_count} devices, {info.memory_gb:.1f}GB")
# Get the best available backend
backend = GPUBackendRegistry.get_best_available()
print(f"Using backend: {backend.info.kind.value}")
# Transfer array to GPU (numpy_array is any NumPy array, e.g. np.random.randn(1000, 1000))
gpu_array = backend.to_gpu(numpy_array)
# Perform operations...
# Transfer back to CPU
result = backend.to_cpu(gpu_array)

Supported Backends

Backend	Vendor	Library	Notes
CUDA	NVIDIA	CuPy	Fully supported
CPU	N/A	NumPy	Fallback when no GPU

> Note: AMD ROCm and Intel oneAPI are not currently supported. GPU acceleration requires NVIDIA GPUs with CUDA.

GPU Selection Heuristics

Epochly uses intelligent heuristics to decide CPU vs GPU execution:

Automatic Selection Criteria

Factor	GPU Preferred	CPU Preferred
Array size	> 10MB	< 10MB
Operation type	Matrix multiply, FFT	Simple arithmetic
Data locality	Data already on GPU	Data on CPU
Operation count	Many operations	Single operation
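
The table can be read as a decision procedure. The sketch below is one possible encoding. The function name, the set of GPU-friendly operations, and especially the order in which factors are checked are assumptions for illustration. Epochly's actual heuristics may weigh them differently.

```python
GPU_FRIENDLY_OPS = {"matmul", "fft"}     # "Operation type" row
SIZE_THRESHOLD_BYTES = 10 * 1000 * 1000  # "Array size" row (10MB)


def prefer_gpu(size_bytes, operation, data_on_gpu=False, operation_count=1):
    """Return True if the table's factors point toward GPU execution."""
    if data_on_gpu:
        return True   # no transfer needed, so the main cost is avoided
    if size_bytes < SIZE_THRESHOLD_BYTES:
        return False  # transfer overhead is likely to dominate
    # Large data on the CPU: worth moving for heavy or repeated work
    return operation in GPU_FRIENDLY_OPS or operation_count > 1


print(prefer_gpu(100 * 1000 * 1000, "matmul"))  # large matmul
print(prefer_gpu(1000, "matmul"))               # tiny matmul
```

For a real decision, prefer `should_use_gpu` or `should_offload` from the Epochly API, since they can take the detected hardware into account.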

Configuring Offload Optimizer

The GPUOffloadOptimizer makes decisions about GPU usage:

from epochly.gpu import GPUDetector, GPUOffloadOptimizer
# Get GPU info
gpu_info = GPUDetector.get_gpu_info()
# Create optimizer with custom minimum array size
optimizer = GPUOffloadOptimizer(
    gpu_info=gpu_info,
    min_array_size=50 * 1024 * 1024  # 50MB minimum
)
# Check if operation should use GPU
should_use = optimizer.should_offload(
    data_size=100 * 1024 * 1024,  # 100MB
    operation='matmul'
)
# Get detailed analysis
analysis = optimizer.analyze_offload_opportunity(
    data_size=100 * 1024 * 1024,
    operation='matmul'
)
print(f"Decision: {analysis.decision}")
print(f"Estimated speedup: {analysis.estimated_speedup}x")
print(f"Reason: {analysis.reason}")

Performance Guidelines

Operations That Benefit Most from GPU

Large matrix operations (> 1000x1000)
FFT on large arrays (> 100,000 elements)
Batch image processing
Custom element-wise operations on large arrays
Repeated operations on same data

Operations That Don't Benefit from GPU

Small arrays (< 10,000 elements)
Single scalar operations
Operations dominated by Python overhead
I/O-bound operations
Operations with heavy CPU-GPU data transfer

Environment Variables

Variable	Purpose	Default
`EPOCHLY_GPU_ENABLED`	Enable/disable GPU	`true`
`EPOCHLY_GPU_MEMORY_LIMIT`	GPU memory limit (MB)	Auto-detected
`EPOCHLY_GPU_WORKLOAD_THRESHOLD`	Minimum workload for GPU	`10000000` (10MB)
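
As an illustration of how these variables and their defaults fit together, the sketch below reads them from a mapping. `read_gpu_settings` is a hypothetical helper, and the accepted spellings for the boolean ("1", "true", "yes", "on") are an assumption. Epochly's own parsing may differ.

```python
import os


def read_gpu_settings(env=None):
    """Read the three GPU variables, applying the defaults from the table."""
    env = os.environ if env is None else env
    enabled_raw = env.get("EPOCHLY_GPU_ENABLED", "true")
    limit_raw = env.get("EPOCHLY_GPU_MEMORY_LIMIT")
    threshold_raw = env.get("EPOCHLY_GPU_WORKLOAD_THRESHOLD", "10000000")
    return {
        "enabled": enabled_raw.strip().lower() in {"1", "true", "yes", "on"},
        # None stands for "auto-detect" when no limit is set
        "memory_limit_mb": int(limit_raw) if limit_raw else None,
        "workload_threshold": int(threshold_raw),
    }


print(read_gpu_settings({"EPOCHLY_GPU_ENABLED": "false"}))
```

Passing a plain dict, as in the last line, makes the defaults easy to inspect without touching the real environment.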

Troubleshooting

GPU Not Detected

from epochly.gpu import run_diagnostics, format_report
# Run diagnostics
report = run_diagnostics()
print(format_report(report, verbose=True))

Common Issues

CuPy not installed: pip install cupy-cuda12x (choose the wheel that matches your CUDA version, e.g. cupy-cuda11x for CUDA 11)
CUDA drivers missing: Install NVIDIA drivers
Insufficient GPU memory: Reduce workload size or enable intelligent memory management

Getting User-Friendly Errors

from epochly.gpu import get_user_friendly_gpu_error
try:
    # GPU operation that fails
    result = problematic_operation()
except Exception as e:
    message, suggestion = get_user_friendly_gpu_error(e)
    print(f"Error: {message}")
    print(f"Suggestion: {suggestion}")

Summary

Automatic NumPy interception is disabled by design, based on benchmarks showing it slowed common operations
GPU acceleration provides real speedup for large arrays and custom kernels
Use the CuPy Manager for explicit GPU operations
Use intelligent memory management for automatic OOM protection
Use @safe_gpu_operation decorator for functions that might exceed GPU memory
Always measure before assuming GPU will help
Epochly's automatic level progression handles most cases correctly
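
To act on the "always measure" advice, a small timing harness is enough. The sketch below uses only the standard library. `best_of` and `measured_speedup` are hypothetical helper names, and the two lambdas are stand-ins for a CPU and a GPU version of the same computation.

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of fn() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def measured_speedup(baseline_fn, candidate_fn, repeats=5):
    """Baseline time divided by candidate time; above 1 means faster."""
    candidate = max(best_of(candidate_fn, repeats), 1e-12)  # avoid dividing by 0
    return best_of(baseline_fn, repeats) / candidate


# Stand-ins: replace with your CPU and GPU implementations
slow = lambda: sum(range(200_000))
fast = lambda: sum(range(100))
print(f"speedup: {measured_speedup(slow, fast):.1f}x")
```

When timing real CuPy code, note that GPU kernels run asynchronously, so the timed function should synchronize (for example with `cp.cuda.Stream.null.synchronize()`) and include any CPU/GPU transfers you would pay in practice.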