Zig dangling pointers and segfaults

Jun 10, 2023

If you're coming to Zig from a background of garbage-collected languages, you have to be ready to stumble and be patient while you acclimatize to your new responsibility of manually managing memory. It seems to me that the biggest challenge, and the one I still consistently run into, is referencing memory that is no longer valid, most often memory from the stack of a function that has already returned. This is known as a dangling pointer.

Consider this example. What's the output?

const std = @import("std");

pub fn main() !void {
  const warning1 = try powerLevel(9000);
  const warning2 = try powerLevel(10);

  std.debug.print("{s}\n", .{warning1});
  std.debug.print("{s}\n", .{warning2});
}

fn powerLevel(over: i32) ![]u8 {
  var buf: [20]u8 = undefined;
  return std.fmt.bufPrint(&buf, "over {d}!!!", .{over});
}

I believe the behavior here is undefined, so the output might vary based on a number of factors, but I get:

over 10!!!��
over 10!!!

The issue with the above code, and the reason for the weird output, is that std.fmt.bufPrint writes the formatted output into the supplied &buf. &buf is the address of buf which exists on the stack of the powerLevel function. This is a specific memory address which is only valid while powerLevel is executing. However, both warning1 and warning2 reference this address and, crucially, do so after the function has returned. warning1 and warning2 essentially point to invalid addresses.

But if the two variables in main point to an invalid address, why do we get a weird output for warning1 and a correct output for warning2? Well, it's often said that a function's stack is destroyed or uninitialized when a function returns. That might be true in some cases. But in my case, what I'm seeing is that the stack memory isn't cleared when a function returns, but it is re-initialized on each call: in Debug builds, Zig fills memory that is set to undefined with 0xaa bytes. So, in this simple case, while warning2 points to a technically invalid address, nothing has overwritten it yet. warning1 points to the same address, which the second call to powerLevel re-initialized and then wrote the new value to. The reason we're seeing � is that the slice points to the updated memory but keeps its original length of 12 bytes ("over 9000!!!"), while the new value is only 10 bytes long. The last two bytes come from the re-initialized but unwritten part of buf, and 0xaa isn't valid UTF-8.
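To make the "same memory, different length" point concrete, here is a small sketch that inspects the two slices without reading the memory they point to. It reuses the buggy powerLevel from above unchanged. I'd expect the address comparison to print true on a typical Debug build, but nothing guarantees that, so treat it as an observation rather than a rule. The lengths, on the other hand, are fixed by the format string.

```zig
const std = @import("std");

pub fn main() !void {
  const warning1 = try powerLevel(9000);
  const warning2 = try powerLevel(10);

  // Reading a slice's ptr and len fields doesn't dereference anything,
  // so this is safe even though both slices are dangling.
  std.debug.print("same address: {}\n", .{warning1.ptr == warning2.ptr});

  // "over 9000!!!" is 12 bytes, "over 10!!!" is 10 bytes
  std.debug.print("lengths: {d} and {d}\n", .{ warning1.len, warning2.len });
}

// Same buggy function as above: returns a slice into its own stack frame.
fn powerLevel(over: i32) ![]u8 {
  var buf: [20]u8 = undefined;
  return std.fmt.bufPrint(&buf, "over {d}!!!", .{over});
}
```

A slice is just a pointer plus a length, and only the memory behind the pointer went stale. The length stored in warning1 is still the 12 it was given when the first call returned.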

The powerLevel example is simple code. It's obvious that buf is scoped to the powerLevel function and that std.fmt.bufPrint returns a slice that points into buf.
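One way to fix it, sketched here under the assumption that the caller knows a reasonable maximum size, is to let the caller own the buffer. The extra buf parameter is my addition and not part of the original function. Allocating the string on the heap with std.fmt.allocPrint would be another option.

```zig
const std = @import("std");

pub fn main() !void {
  // Each buffer lives in main's stack frame, which outlives both calls.
  var buf1: [20]u8 = undefined;
  var buf2: [20]u8 = undefined;

  const warning1 = try powerLevel(&buf1, 9000);
  const warning2 = try powerLevel(&buf2, 10);

  std.debug.print("{s}\n", .{warning1});
  std.debug.print("{s}\n", .{warning2});
}

// The returned slice points into the caller's buffer, not into this frame.
fn powerLevel(buf: []u8, over: i32) ![]u8 {
  return std.fmt.bufPrint(buf, "over {d}!!!", .{over});
}
```

Both slices now reference memory that is still valid when they're printed, so this prints "over 9000!!!" followed by "over 10!!!".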

Let's look at another more complex example. This is convoluted, but it encapsulates a scenario I commonly see people asking for help with:

const std = @import("std");
const Allocator = std.mem.Allocator;

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  const res1 = Response.init(allocator);
  defer res1.deinit();

  const res2 = Response.init(allocator);
  res2.deinit();

  const warning = try std.fmt.allocPrint(res1.allocator, "over {d}\n", .{9000});
  std.debug.print("{s}\n", .{warning});
}

const Response = struct {
  arena: std.heap.ArenaAllocator,
  allocator: Allocator,

  fn init(parent_allocator: Allocator) Response {
    var arena = std.heap.ArenaAllocator.init(parent_allocator);
    const allocator = arena.allocator();
    return Response{
      .arena = arena,
      .allocator = allocator,
    };
  }

  fn deinit(self: Response) void {
    self.arena.deinit();
  }
};

If you were to run this code, you'd almost certainly see a segmentation fault (aka a segfault). We create a Response, which involves creating an ArenaAllocator and, from that, an Allocator. This allocator is then used to format our string. For the purpose of this example, we create a second response and immediately free it. We need this for the same reason that warning1 in our first example printed an almost ok value: we want the second call to init to re-initialize the stack memory that the first call used.

At first glance, this code might look ok (it did to me the first and second time that I wrote it!). res2 doesn't seem to be problematic, because none of our code is illegally referencing anything on the Response.init stack, right? Sure, arena is created there, but we move that into the Response, which is then moved to main, which is where the allocator is used. What's the problem?

The problem is that the allocator created via const allocator = arena.allocator(); references the arena at the point of creation, which is to say, the arena that exists on init's stack. Sure, we copy arena into the Response, but the allocator still points at the original's stack address, which becomes invalid once init returns. Without a garbage collector, when we move an object, existing references to the old address become dangling.

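And this is where I keep stumbling: it isn't always obvious when this is happening. Maybe I'm missing something obvious, but I don't think there's a foolproof way to tell whether arena.allocator() returns something self-contained or something dependent on arena. The best hint I know of is the signature: allocator takes self: *ArenaAllocator, and the std.mem.Allocator it returns stores that pointer in its ptr field.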
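If you want to see the dependency for yourself, here is a minimal sketch that compares addresses and never touches stale memory. It relies on std.mem.Allocator exposing a ptr field of type *anyopaque, and it uses page_allocator only to keep things short. The copy variable stands in for what happens when init returns the Response by value.

```zig
const std = @import("std");

pub fn main() void {
  var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
  defer arena.deinit();

  // the allocator is created from this specific arena, at this address
  const allocator = arena.allocator();

  // copying the struct is what returning it from init does
  var copy = arena;

  const original_addr: *anyopaque = &arena;
  const copy_addr: *anyopaque = &copy;

  std.debug.print("points at original: {}\n", .{allocator.ptr == original_addr});
  std.debug.print("points at copy: {}\n", .{allocator.ptr == copy_addr});
}
```

This prints true for the original and false for the copy. Here both variables are still alive, so nothing breaks. In Response.init, the original lives on a stack frame that is gone by the time the allocator is used.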

Knowing this specific issue, what's the solution? Generally, the answer is that any allocator we create has to be tied to the arena's scope. In the above, rather than creating allocator in init and storing it in Response, we could narrow allocator's scope:

// res1 has to be declared with var here: arena.allocator() takes a *ArenaAllocator
const warning = try std.fmt.allocPrint(res1.arena.allocator(), "over {d}\n", .{9000});
std.debug.print("{s}\n", .{warning});

Or, a bit more broadly, scoped to the owner of the arena, res1:

var res1 = Response.init(gpa.allocator());
defer res1.deinit();
const aa = res1.arena.allocator();
...
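A variation on the same idea, sketched here as one possibility rather than the one true way: drop the stored allocator field and build the Allocator on demand. The allocator method below is a hypothetical helper I'm adding for illustration, and deinit takes a pointer in this version. page_allocator is used only to keep the sketch short.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

const Response = struct {
  arena: std.heap.ArenaAllocator,

  fn init(parent_allocator: Allocator) Response {
    return .{ .arena = std.heap.ArenaAllocator.init(parent_allocator) };
  }

  fn deinit(self: *Response) void {
    self.arena.deinit();
  }

  // Hypothetical helper: builds the Allocator from the arena's current
  // address on every call, so nothing stale is stored in the struct.
  fn allocator(self: *Response) Allocator {
    return self.arena.allocator();
  }
};

pub fn main() !void {
  var res1 = Response.init(std.heap.page_allocator);
  defer res1.deinit();

  const warning = try std.fmt.allocPrint(res1.allocator(), "over {d}", .{9000});
  std.debug.print("{s}\n", .{warning});
}
```

This prints "over 9000". The returned Allocator is still only valid while res1 stays where it is, so the same rule applies: don't hold on to it past a move or copy of the Response.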

Finally, if allocator has to be stored alongside arena, as in our original Response, we can give the arena a "global" scope, and an address that stays put, by putting it on the heap:

const std = @import("std");
const Allocator = std.mem.Allocator;

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  const res1 = try Response.init(allocator);
  defer res1.deinit(allocator);

  const res2 = try Response.init(allocator);
  res2.deinit(allocator);

  const warning = try std.fmt.allocPrint(res1.allocator, "over {d}\n", .{9000});
  std.debug.print("{s}\n", .{warning});
}

const Response = struct {
  // notice this is now a pointer
  arena: *std.heap.ArenaAllocator,
  allocator: Allocator,

  // this can now return an error, since our heap allocation can fail
  fn init(parent_allocator: Allocator) !Response {
    // create the arena on the heap
    const arena = try parent_allocator.create(std.heap.ArenaAllocator);
    arena.* = std.heap.ArenaAllocator.init(parent_allocator);
    const allocator = arena.allocator();

    return Response{
      .arena = arena,
      .allocator = allocator,
    };
  }

  fn deinit(self: Response, parent_allocator: Allocator) void {
    self.arena.deinit();

    // we need to delete the arena from the heap
    parent_allocator.destroy(self.arena);
  }
};

Here's one last example, which is, again, something I've run into more than once:

const std = @import("std");

const User = struct {
  id: i32,
};

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  var lookup = std.StringHashMap(User).init(allocator);

  try lookup.put("u1", User{.id = 9001});

  const entry = lookup.getPtr("u1").?;

  // returns true if the item was removed, false otherwise
  _ = lookup.remove("u1");

  std.debug.print("{d}\n", .{entry.id});
}

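Same question as before: what does this print? I get -1431655766. Why? Because entry points to memory that's made invalid by our call to remove. (That number is 0xAAAAAAAA read as an i32: the same 0xaa fill that Debug builds use for undefined memory.) If you comment out the remove, it'll print the expected 9001.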

Like our first example, in isolation, this is pretty obvious. But consider something like the cache for Zig that I wrote. The cache uses a StringHashMap, and, when full, items are removed. But what if Thread1 gets entry "u1" from the cache while Thread2 deletes entry "u1"? This is similar to our simplified example above, but less obvious - the issue only surfaces when two threads interact in a specific manner. As above, part of the solution is to allocate the map value on the heap. So our HashMap now holds *User instead of User. This means our HashMap doesn't "own" the value (the heap does). If we remove a user from the HashMap, it's only removing the reference to the heap-allocated value, which remains valid.
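Here's a single-threaded sketch of that change, reusing the User struct from the example above. It uses page_allocator just to keep things short, and it says nothing about the locking a real multi-threaded cache would need.

```zig
const std = @import("std");

const User = struct {
  id: i32,
};

pub fn main() !void {
  // page_allocator keeps the sketch short; any allocator would do
  const allocator = std.heap.page_allocator;

  // the map now stores pointers to users, not the users themselves
  var lookup = std.StringHashMap(*User).init(allocator);
  defer lookup.deinit();

  const user = try allocator.create(User);
  user.* = .{ .id = 9001 };
  try lookup.put("u1", user);

  // get returns a copy of the stored *User, which points at the heap
  const entry = lookup.get("u1").?;

  // only the map's slot is invalidated, not the User on the heap
  _ = lookup.remove("u1");

  std.debug.print("{d}\n", .{entry.id});

  // the map no longer tracks this user, so we have to free it ourselves
  allocator.destroy(entry);
}
```

This prints 9001 even though the entry was removed before the print. Note the explicit destroy at the end: once the map stops owning the value, someone else has to.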

(That, of course, introduces a new problem: when/how do we delete the heap value? But that's a different blog post, and the answer is: we add reference counting.)

I'd be willing to bet that, if you're new to Zig and coming from a garbage-collected language, you've run into some variation of this. The good news is that it gets easier to spot with practice. The bad news is that, at least for me, I'm pretty sure I'll never be able to eliminate 100% of them from my code. Sometimes it's just not obvious to me that I've referenced invalid memory and, more terrifyingly, it doesn't always manifest in a way that's guaranteed to be caught with a test.