Python Zebra Stacks

May 20, 2026

The Problem: Truncated PyTorch Stacks

PyTorch Lightning training workloads can get complicated! Like any powerful and
flexible framework there's a lot of indirection, interfaces and modularity to
accommodate a very diverse set of use cases and hardware. Dynamic languages can
be, well, dynamic. One of our customers found that the stacks being reported
were being truncated. This means we hit an error in our attempt to unwind the
stack from the leaf to the root. Two problems needed to be addressed to fix it
and we'll discuss both as they are kinda both interesting to BPF development. If
you don't care about BPF development you might be in the wrong room!

Both problems are related to the fact that these stacks are very deep and training
workloads tend to be long running.

Read errors

The first problem we encountered was an error in the python unwinder hitting
bpf_probe_read. This can happen when
an address is bogus but it can also happen when an address is perfectly fine
and the memory in question has been paged out. Since BPF programs execute in
a privileged kernel context page fault handling is disabled. In this case the
address was good but it was paged out. Normally when this happens you just give
up and bail but this was happening to a lot of samples, like 1/3 of them! So
this wasn't some blip we could ignore. In a typical program stacks grow and shrink
and as programs run pages get touched and stay in cache, but if you have long
running programs the likelihood increases that stack frames high on the stack
stay in place for hours or even days during long training workloads. During
that time memory demands may cause things to get paged out. But since eBPF reads
don't trigger page faults, every time we walk the stack we'll bail and error
out on any reads that fail.

The solution in this case is to just keep going! Turns out the pointers we were
faulting on aren't necessary to keep walking the stack so we can just record
the pointer and walk to the next frame. The pointer in question will be read
again in userspace where the OS will handle the page fault and the next time
we walk the stack the read error will go away. It's a tradeoff, do we care more
about getting complete stacks or do we not want the profiler to cause additional
memory pressure by paging things in the program doesn't need? We think a little
extra memory pressure is worth getting complete stacks and in the end the OS
will find something else to evict if it has to.

All the gory details can be found in the PR.

Tail Call Limit

The second problem we faced was hitting the BPF tail call limit. To understand
why we need to look at a typical PyTorch training stack:

Trainer.fit
  _PyFunction_Vectorcall          ← CPython C runtime
  _PyEval_EvalFrameDefault        ← CPython interpreter loop
  do_call_core                    ← CPython dispatch
Trainer._fit_impl
  _PyFunction_Vectorcall
  _PyEval_EvalFrameDefault
  do_call_core
Trainer._run_stage
  _PyFunction_Vectorcall
  _PyEval_EvalFrameDefault
  do_call_core
_FitLoop.run
  ...25+ more Python frames, each separated by a few native frames...

Every Python function call passes through CPython's C eval loop —
_PyFunction_Vectorcall → _PyEval_Vector → _PyEval_EvalFrame →
_PyEval_EvalFrameDefault → do_call_core — creating roughly 5 native
frames between each Python frame. A typical ML training stack has 25+
Python frames, which means 25+ transitions between Python and native code.

The stacks were ending with unwinding aborted due to error max_tail_calls errors.
To understand why, we need to understand how eBPF profilers unwind
mixed-language stacks.

How eBPF Stack Unwinding Works

The eBPF profiler uses separate unwinder programs for each
language runtime. There's a native unwinder that handles standard C calling
convention frames using DWARF stack deltas, and interpreter-specific
unwinders for Python, Ruby, Java HotSpot, V8, PHP, etc.

When a profiling sample fires, the native unwinder starts walking the stack.
When it hits a frame belonging to an interpreter (detected via memory
mappings), it does a BPF tail call to hand off to the appropriate
interpreter unwinder. The interpreter unwinder processes its frames, then
tail calls back to the native unwinder when it hits native code again.

native_unwinder → tail_call → python_unwinder → tail_call → native_unwinder → ...

Why Tail Calls?

BPF programs have a hard limit of 1 million verified instructions. The
verifier statically analyzes every possible execution path through the
program, and the sum of all instruction visits must stay under 1M. This is
how the linux kernel knows the program will terminate and is safe to run in a
privileged, interrupt-free, state.

A single unwinder program that handled every language would blow through this
limit or would be extremely constrained in the number of frames it could walk.
Currently our unwinders handle ~5-12 frames per program and its common for
programs to have dozens or even hundreds of frames in the call stack.

Tail calls solve this by splitting the work across multiple programs. Each
bpf_tail_call() transfers control to a new program with its own fresh
instruction budget. The native unwinder processes a few frames, tail calls
to Python, Python processes a few frames, tail calls back to native — each
invocation staying well under the 1M limit.

The Tail Call Limit

But tail calls have their own limit: 33 per chain (the profiler
conservatively caps at 29 to leave room for cleanup). Each Python ↔ native
transition costs one tail call in each direction. With 25+ transitions in a
PyTorch training stack, the profiler was hitting the tail call limit and
truncating the stack mid-unwind.

The Fix: Combining the Unwinders

The solution was to combine the Python and native unwinders into a single
BPF program that handles both in one loop. Instead of:

for (i = 0; i < FRAMES_PER_PROGRAM; i++) {
    unwind_one_python_frame();
    if (done) break;
}
tail_call(PROG_UNWIND_NATIVE);  // hand off to native unwinder

The combined approach uses a single loop that switches between unwinding
modes without tail calls:

for (u32 t = 0; t < python_native_loop_iters; t++) {
    switch (unwinder) {
    case PROG_UNWIND_PYTHON:
        error = step_python(record, pyinfo, &py_frame, &unwinder);
        break;
    case PROG_UNWIND_NATIVE:
        error = step_native(record, &unwinder);
        break;
    default:
        goto done;
    }
}

Each iteration handles one frame — either Python or native — and the
unwinder variable tracks which mode to use next. Transitions between
Python and native code happen within the loop with zero tail calls.

The deep PyTorch stack that previously needed ~50 tail calls now
unwinds fully within roughly a dozen program invocations.

The Instruction Budget Challenge

Combining two unwinders into one program means the BPF verifier has to
analyze both unwinders' code paths in a single pass. The native unwinder
alone consumes ~220K verified instructions. Inlining it into the Python
unwinder, multiplied across loop iterations and branch paths, pushes the
combined program past the 1M limit. We can reduce the number of frames
to make the verifier happy but this leads to the inability to walk deep
stacks. Can we support mixed stacks like this but not sacrifice any
stack depth?

The DEBUG_PRINT Problem

A significant portion of the instruction budget goes to debug logging.
The profiler uses a DEBUG_PRINT macro that compiles to bpf_trace_printk:

DEBUG_PRINT("==== unwind_native %d ====", trace->num_frames);

This expands to a runtime check of a global variable followed by the
format string construction:

; if (with_debug_output) {
ldxw     r1, [r6+0x0]           ; load with_debug_output
jeq      r1, 0x0, +0x19         ; skip if debug off

; Push format string onto BPF stack 
mov64    r1, 0xa3d3d            ; "==\n\0"
stxw     [r10-0x48], r1         ; store to stack
lddw     r1, 0x3d3d206425206576 ; "ev %d =="
stxdw    [r10-0x50], r1
lddw     r1, 0x6974616e5f646e69 ; "ind_nati"
stxdw    [r10-0x58], r1
lddw     r1, 0x776e75203d3d3d3d ; "==== unw"
stxdw    [r10-0x60], r1

; Set up call arguments
ldxh     r3, [r9+0x2be]         ; trace->num_frames
mov64    r1, r10
add64    r1, -0x60              ; r1 = pointer to format string
mov64    r2, 0x1c               ; r2 = string length (28)
call     0x6                    ; bpf_trace_printk()

Because of how the DEBUG_PRINT macro works every format string must be
constructed on the stack at runtime, one 8-byte immediate at a time.
A single DEBUG_PRINT with a 28-byte format string compiles to 13
instructions. With dozens of debug prints in the unwinder code, each
guarded by an if (with_debug_output) check, the verifier must analyze
both the taken and not-taken paths.

Now to keep things efficient with_debug_output is a read-only variable
(BPF .rodata), so the verifier knows its value at load time. When debug
output is disabled (with_debug_output = 0), the verifier prunes the
entire debug branch — it knows the condition is always false and skips
analyzing the string construction and bpf_trace_printk call. When debug
output is enabled, the verifier must analyze all of those instructions.

This means the verified instruction count depends on whether debug output
is enabled:

Mode	Verified Instructions	% of 1M Limit
Debug OFF	476,052	47.6%
Debug ON	668,515	66.9%

(both measured at the original compile-time default of 9 loop iterations)

With debug output enabled, the verifier explores roughly 40% more
instructions due to the debug branches.

Dynamic Loop Tuning via RODATA Variables

The combined unwinder's loop iteration count directly controls how many
frames it can process per invocation. More iterations = more frames
unwound, but more verified instructions. We needed to find the sweet
spot for both debug-on and debug-off modes.

The solution was to make the loop count a RODATA variable instead of a
compile-time constant:

// Old: fixed at compile time
#define PYTHON_NATIVE_LOOP_ITERS 9

// New: tunable at load time
BPF_RODATA_VAR(u32, python_native_loop_iters, 6)

At program load time, the host agent sets the loop count based on
whether debug output is enabled:

if cfg.VerboseMode {
    // Debug output enabled — verifier explores debug branches,
    // leaving less room for loop iterations
    // Default of 6 is already correct
} else {
    // Debug output disabled — verifier prunes debug branches,
    // freeing up instruction budget for more iterations
    coll.Variables["python_native_loop_iters"].Set(uint32(12))
}

This gives us the best of both worlds:

Mode	Loop Iterations	Verified Instructions	Headroom
Debug ON (verbose)	6	452,849	55%
Debug OFF (production)	12	913,603	9%

In production (debug off), the profiler unwinds up to 12 frames per
program invocation — a 2x improvement over the 6-frame debug mode — while
staying safely under the 1M instruction limit.

Results

We validated the fix with a coredump test using a 20-level deep Python
call stack (each level going through CPython's C eval loop). This creates
the same interleaving pattern seen in a PyTorch workload:

Branch	Frames Unwound	Result
Main (tail call design)	Truncated	`unwinding aborted due to error max_tail_calls`
Combined loop	138 frames	Full stack through `_start`

The combined unwinder eliminates the tail call bottleneck entirely for
Python workloads, while the RODATA-based loop tuning ensures the verifier
instruction budget is used optimally in both debug and production modes.

But wait there's more!

No engineering blog would be complete without a section on paths not taken! At one point
we considered revamping all the DEBUG_PRINT statements on the critical path with some
aggressive abbreviations and error codes so that each DEBUG_PRINT statement was smaller. What a terrible idea!
Luckily the dynamic looping counts killed that idea but there was still this
nagging feeling about how DEBUG_PRINT abused the stack. I'd initially dismissed
it as just some weird quirk of eBPF, but eventually decided to dig in. The
reason our DEBUG_PRINT used the stack is that it's defined like this:

  #define printt(fmt, ...)                                                                         \
    ({                                                                                             \
      const char ____fmt[] = fmt "\n";                                                             \
      bpf_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__);                                   \
    })

  #define DEBUG_PRINT(fmt, ...)                                                                    \
    ({                                                                                             \
      if (__builtin_expect(with_debug_output, 0)) {                                                \
        printt(fmt, ##__VA_ARGS__);                                                                \
      }                                                                                            \
    })

Because we introduce a new ___fmt variable with a newline appended and it's a local
variable it gets put on the stack, but if you just add the static keyword all
that data moves to rodata. The results are kind of staggering, most unwinders
got ~40% smaller, full results here.

But does shrinking the eBPF programs really matter? Actually it does because it
means we can handle much bigger stacks and/or have plenty of room to enhance (or combine!)
our programs and stay within the 1M verifier limit. Early testing shows we can 2x our native unwinder
from 5 frames per program to 10. Python goes from 12 to 90! Basically this is found by cranking up the
number of loops up until we hit verifier errors. Clearly the relationship between instruction size
and verifier complexity is non-linear.

Do all languages need a hybrid unwinder? Probably not, JIT'd languages (Node, .Net, Java)
tend not to bounce back and forth like this, it's possible Ruby, PHP and Perl programs
could benefit from such a technique.

Hopefully you enjoyed this little journey in the world of eBPF profiling, what started out
as trying to support complicated PyTorch stacks led to really moving the needle on supporting
large stacks for all languages!

Discuss: