
Profiling Python: Where Your Time Actually Goes

A practical guide to profiling Python code. Find your real bottlenecks before optimizing. Includes before/after analysis showing how optimization shifts bottlenecks.

Epochly Team · January 30, 2026 · 9 min read

The fastest way to waste time on performance optimization is to optimize the wrong thing. Profile first. Always.

This post walks through practical Python profiling techniques, shows you how to read profile output, and demonstrates how optimization shifts bottlenecks from one part of your code to another.


The Cardinal Rule

Never optimize without profiling first.

Most developers have incorrect intuitions about where their code spends time. A function that "feels slow" might take 2% of total execution time. The actual bottleneck might be a line you've never looked at.

Profiling replaces intuition with data.


Tool 1: cProfile (Built-in)

Python ships with cProfile, a deterministic profiler that records every function call.

import cProfile
import pstats
from io import StringIO


def your_workload():
    """Replace with your actual code."""
    import numpy as np
    data = np.random.randn(1_000_000)
    result = 0.0
    # Python loop (slow path)
    for i in range(0, len(data), 1000):
        chunk = data[i:i+1000]
        partial = np.sum(np.sqrt(np.abs(chunk)))
        result += partial
    return result


# Profile
pr = cProfile.Profile()
pr.enable()
your_workload()
pr.disable()

# Print results
s = StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
ps.print_stats(15)
print(s.getvalue())

Reading cProfile output

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1000    0.045    0.000    0.089    0.000  {built-in method numpy.core._multiarray_umath.implement_array_function}
     1000    0.031    0.000    0.031    0.000  {method 'reduce' of 'numpy.ufunc' objects}
        1    0.028    0.028    0.156    0.156  profile_example.py:8(your_workload)
     1000    0.018    0.000    0.052    0.000  fromnumeric.py:2184(sum)

Key columns:

  • ncalls: Number of times the function was called
  • tottime: Total time spent in this function (excluding subcalls)
  • cumtime: Cumulative time including subcalls
  • percall: Per-call time (it appears twice: tottime/ncalls, then cumtime/ncalls)

What to look for: Functions with high tottime are your bottlenecks. High ncalls with low percall suggests a tight loop.
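To surface high-tottime functions directly, re-sort the same profile data. Continuing from the snippet above, where pr is the already-populated Profile object:

# Sort by time spent inside each function (excluding subcalls) so the
# hot functions appear first. With no stream given, this prints to stdout.
pstats.Stats(pr).sort_stats('tottime').print_stats(10)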


Tool 2: line_profiler (Line-by-Line)

When cProfile tells you which function is slow, line_profiler tells you which line is slow.

# Install: pip install line_profiler
# Add @profile decorator to functions you want to profile
import numpy as np

@profile
def mixed_workload(data):
    result = 0.0
    for i in range(0, len(data), 1000):           # Loop overhead
        chunk = data[i:i+1000]                    # Array slicing
        partial = np.sum(np.sqrt(np.abs(chunk)))  # NumPy operations
        result += partial                         # Accumulation
    return result

Run with:

kernprof -l -v your_script.py

Output:

Line #   Hits       Time   Per Hit  % Time  Line Contents
     3                                      def mixed_workload(data):
     4      1        0.5       0.5     0.0      result = 0.0
     5   1001     8234.0       8.2    52.3      for i in range(0, len(data), 1000):
     6   1000     2156.0       2.2    13.7          chunk = data[i:i+1000]
     7   1000     4891.0       4.9    31.1          partial = np.sum(np.sqrt(np.abs(chunk)))
     8   1000      452.0       0.5     2.9          result += partial
     9      1        0.2       0.2     0.0      return result

This tells you that 52.3% of time is in the for loop overhead itself -- not in the computation. The Python interpreter is spending more time managing the loop than doing useful work.
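When loop overhead dominates like this, the usual fix is to push the loop into NumPy itself. A minimal sketch of the same computation with no Python-level loop (the chunking above exists only to illustrate loop overhead, so dropping it gives the same result up to floating-point rounding):

import numpy as np

def vectorized_workload(data):
    # abs, sqrt, and the sum each run as a single NumPy call over the
    # whole array, so the interpreter never executes the per-chunk loop.
    return float(np.sum(np.sqrt(np.abs(data))))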


Tool 3: time.perf_counter (Manual Timing)

For quick measurements, time.perf_counter provides high-resolution timing.

import time
start = time.perf_counter()
result = your_function(data)
elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.4f}s")

Use this for:

  • Quick A/B comparisons between approaches
  • Timing specific sections of code
  • Automated benchmark scripts

Don't use this for:

  • Understanding where time goes (use cProfile)
  • Understanding which lines are slow (use line_profiler)
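For repeated A/B comparisons, a small helper keeps the boilerplate out of the way. A minimal sketch; the timer name and the approach_a/approach_b placeholders are illustrative, not a standard API:

import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Print wall-clock time for the enclosed block.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.4f}s")

# Usage: compare two approaches on the same data.
# with timer("approach A"):
#     result_a = approach_a(data)
# with timer("approach B"):
#     result_b = approach_b(data)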

Tool 4: memory_profiler

Sometimes the bottleneck isn't CPU -- it's memory allocation.

# Install: pip install memory_profiler
from memory_profiler import profile

@profile
def memory_intensive(n):
    data = [i ** 2 for i in range(n)]          # List comprehension
    filtered = [x for x in data if x > 100]    # Filter creates new list
    return sum(filtered)

Run with:

python -m memory_profiler your_script.py

Output:

Line #   Mem usage   Increment  Occurrences  Line Contents
     4    45.2 MiB    45.2 MiB            1  def memory_intensive(n):
     5    83.7 MiB    38.5 MiB            1      data = [i ** 2 for i in range(n)]
     6   121.9 MiB    38.2 MiB            1      filtered = [x for x in data if x > 100]
     7   121.9 MiB     0.0 MiB            1      return sum(filtered)

This shows each list comprehension allocating ~38MB. If n is large enough, this becomes the bottleneck -- not CPU.
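When allocation is the bottleneck, the usual fix is to avoid materializing the intermediate lists. A sketch of the same computation using generator expressions, which stream values instead of building ~38 MiB lists:

def memory_light(n):
    # Generator expressions produce values lazily, so no intermediate
    # list of squares or filtered values is ever held in memory.
    squares = (i ** 2 for i in range(n))
    filtered = (x for x in squares if x > 100)
    return sum(filtered)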


Case Study: Before and After Optimization

Here's a real-world profiling example showing how optimization shifts the bottleneck.

Before: Vanilla Python

Profiling a typical mixed workload (Python loops around NumPy operations):

cumulative time by function:
87.3% pure_python_loops
8.2% numpy_operations
3.1% data_serialization
1.4% other

87% of execution time is Python loop overhead. The actual numerical work (NumPy) is 8%. Most of the runtime goes to the Python interpreter dispatching bytecode, not to useful computation.

After: Optimized with Epochly

After applying Level 2 JIT and Level 3 parallel execution:

cumulative time by function:
72.1% numpy_operations
15.3% parallel_coordination
8.9% data_transfer
3.7% python_loops

Python loop overhead dropped from 87% to 4%. NumPy operations (72%) are now the bottleneck -- exactly where you want them.

What this means

The bottleneck shifted from "Python interpreter overhead" to "actual computation." NumPy's BLAS operations are already highly optimized. When they become the bottleneck, you've reached the point of diminishing returns for this workload.

This is what good optimization looks like. Not making everything faster, but making the right thing the bottleneck.

The speedup on this workload: Python loops went from dominating at 87% to 4%. That's where the measured 8.7x speedup on pure Python parallel workloads comes from (Apple M2 Max, 16 cores, Python 3.13.5). Not making Python faster -- making Python loops irrelevant.


Profiling Workflow: Step by Step

Step 1: Measure the baseline

import time
start = time.perf_counter()
result = your_function(your_data)
baseline = time.perf_counter() - start
print(f"Baseline: {baseline:.3f}s")

If it's fast enough already, stop. Don't optimize code that doesn't need it.

Step 2: Profile with cProfile

import cProfile
cProfile.run('your_function(your_data)', 'profile_output')
import pstats
stats = pstats.Stats('profile_output')
stats.sort_stats('tottime').print_stats(10)

Identify the top 3 functions by tottime. These are your candidates.

Step 3: Classify the bottleneck

If top functions are...              Bottleneck type     Action
Python loops, list comprehensions    CPU-bound Python    JIT compilation (Level 2)
Multiple independent tasks           Parallelizable      ProcessPool (Level 3)
Large array operations               Compute-bound       GPU (Level 4)
read, write, recv, send              I/O-bound           asyncio or ThreadPool
malloc, free, gc.collect             Memory-bound        Reduce allocations
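A rough way to automate the first pass, continuing from the profile_output file saved in Step 2. This is only a heuristic sketch: it relies on the Stats.stats dictionary (stable but not formally documented) and on function names alone, so treat its labels as hints, not verdicts:

import pstats

IO_HINTS = ("read", "write", "recv", "send")
MEM_HINTS = ("malloc", "free", "collect")

def classify_top_functions(profile_path, top_n=3):
    stats = pstats.Stats(profile_path)
    # Each entry maps (filename, lineno, funcname) -> (cc, nc, tottime, cumtime, callers)
    top = sorted(stats.stats.items(), key=lambda kv: kv[1][2], reverse=True)[:top_n]
    for (filename, lineno, funcname), (cc, nc, tottime, cumtime, callers) in top:
        if any(hint in funcname for hint in IO_HINTS):
            kind = "likely I/O-bound"
        elif any(hint in funcname for hint in MEM_HINTS):
            kind = "likely memory-bound"
        else:
            kind = "likely CPU-bound (inspect the source)"
        print(f"{funcname:<40} tottime={tottime:.3f}s  {kind}")

classify_top_functions('profile_output')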

Step 4: Optimize the bottleneck

Apply the appropriate optimization for the bottleneck type. Don't optimize everything -- focus on the function that takes the most time.
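For example, if the profile shows independent CPU-bound tasks, the standard-library version of that fix is a process pool. A minimal sketch (the chunk and worker counts are arbitrary placeholders):

from concurrent.futures import ProcessPoolExecutor

import numpy as np

def process_chunk(chunk):
    # CPU-bound work on one independent piece of the data.
    return float(np.sum(np.sqrt(np.abs(chunk))))

def parallel_workload(data, n_workers=8):
    chunks = np.array_split(data, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

if __name__ == "__main__":
    print(parallel_workload(np.random.randn(1_000_000)))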

Step 5: Profile again

After optimization, profile again. Check:

  1. Did the bottleneck shrink?
  2. What's the new bottleneck?
  3. Is further optimization worth the effort?

The goal isn't zero execution time. It's making the right things the bottleneck.


Common Profiling Mistakes

Mistake 1: Profiling in development mode

Debug mode, verbose logging, and development settings add overhead that skews profiling data. Profile with production-like settings.

Mistake 2: Profiling cold starts only

First-run timing includes import overhead, JIT compilation warmup, and cache population. Measure both cold start and warm steady-state performance separately.
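A minimal way to separate the two, reusing the your_function and your_data placeholders from above:

import time

def time_call(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

# First call: includes imports, cache population, any JIT warmup.
cold = time_call(your_function, your_data)

# Later calls: warm steady state. Average several runs.
warm = [time_call(your_function, your_data) for _ in range(10)]
print(f"cold: {cold:.3f}s, warm: {sum(warm) / len(warm):.3f}s")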

Mistake 3: Profiling too small a workload

With small data, profiling overhead itself may dominate. Use realistic data sizes that match your production workload.

Mistake 4: Optimizing without a target

Define "fast enough" before you start. Is it 100ms response time? 1 second for a batch job? 10 seconds for a report? Without a target, optimization never ends.

Mistake 5: Ignoring the profiler

The most common mistake. Developers optimize based on intuition instead of data. The profiler might show that your "slow" function is 2% of runtime while the bottleneck is a line you've never considered.


Quick Reference: Python Profiling Tools

Tool                What it measures          When to use
time.perf_counter   Wall-clock time           Quick A/B comparisons
cProfile            Function-level CPU time   Finding which function is slow
line_profiler       Line-level CPU time       Finding which line is slow
memory_profiler     Memory allocation         Finding memory bottlenecks
py-spy              Sampling profiler         Profiling running processes
scalene             CPU + memory + GPU        Comprehensive profiling
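Of these, py-spy is worth a note: as a sampling profiler it attaches to an already-running process with no code changes. Typical invocations (the PID is a placeholder):

pip install py-spy
py-spy top --pid 12345
py-spy record -o profile.svg -- python your_script.py

top gives a live, top-style view of where the running process spends time; record samples the process and writes a flame graph you can open in a browser.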

What Comes After Profiling

Once you know where time goes, you have clear options:

  1. Python loops dominate (>50% time): JIT compilation transforms these to native code. Level 2 achieves 58-193x (113x average) on numerical loops.
  2. Independent tasks dominate: Multi-core parallel execution distributes work. Level 3 achieves 8-12x on 16 cores using ProcessPool.
  3. Large array ops dominate: GPU acceleration offloads to CUDA cores. Level 4 achieves up to 70x on arrays with 10M+ elements.
  4. I/O dominates: Use asyncio or threading. Epochly won't help here (~1.0x).
  5. NumPy BLAS dominates: Already optimized. Epochly won't help here (~1.0x). This is the optimal state -- it means there's no low-hanging fruit left.

Profile first. Classify the bottleneck. Choose the right tool. Measure again.


Benchmark conditions: Python 3.12.3, Linux WSL2, 16 cores. Apple M2 Max results on Python 3.13.5, 16 cores. January 29, 2026 comprehensive benchmark report.

python · profiling · cProfile · performance · optimization