a--- layout: post date: 2024-01-16 title: "Zig's HashMap - Part 1" description: "A Look at Zig's HashMap, AutoHashMap and StringHashMap as well as the unmanaged variants" tags: [zig] ---

Like most hash map implementations, Zig's std.HashMap relies on 2 functions, hash(key: K) u64 and eql(key_a: K, key_b: K) bool. The hash function takes a key and returns an unsigned 64 bit integer, known as the hash code. The same key always returns the same hash code. The hash function can produce the same hash code for two different keys which is one reason we also need eql: to determine if two keys are the same.

This is all standard stuff, but Zig's implementation has a few specific details worth exploring. This is particularly true given the number of hash map types in the standard library as well as the fact that the documentation seems incomplete and confusing. Specifically, there are six hash map variants: std.HashMap, std.HashMapUnmanaged, std.AutoHashMap, std.AutoHashMapUnmanaged, std.StringHashMap, std.StringHashMapUnmanaged. One of those, std.HashMapUnmanaged, contains the bulk of the implementation. The other five are thin wrappers: they are composed of an std.HashMapUnmanaged. The documentation for those five variants is challenging because of this composition, which is not, in my opinion, handled well by the document generator.

If we look at the put method of std.HashMap, we'll see a pattern that's often repeated:

pub fn put(self: *Self, key: K, value: V) Allocator.Error!void {
  return self.unmanaged.putContext(self.allocator, key, value, self.ctx);
}

As I said, most of the heavy lifting is done by std.HashMapUnmanaged, which the other variants wrap in a field named "unmanaged".

Unmanaged

"Unmanaged" types are sprinkled throughout the standard library. It's a naming convention which indicates that the type in question doesn't maintain an allocator. Any method that requires allocations takes an explicit allocator as a parameter. To see this in practice, consider this linked list:

pub fn LinkedList(comptime T: type) type {
  return struct {
    head: ?*Node = null,
    allocator: Allocator,

    const Self = @This();

    pub fn init(allocator: Allocator) Self {
      return .{
        .allocator = allocator,
      };
    }

    pub fn deinit(self: Self) void {
      var node = self.head;
      while (node) |n| {
        node = n.next;
        self.allocator.destroy(n);
      }
    }

    pub fn append(self: *Self, value: T) !void {
      const node = try self.allocator.create(Node);
      node.value = value;
      const h = self.head orelse {
        node.next = null;
        self.head = node;
        return;
      };
      node.next = h;
      self.head = node;
    }

    pub const Node = struct {
      value: T,
      next: ?*Node,
    };
  };
}

Our init function takes and stores an std.mem.Allocator. This allocator is then used as needed in append and deinit. This is a common pattern in Zig. The "unmanaged" version of the above code is only slightly different:

pub fn LinkedListUnmanaged(comptime T: type) type {
  return struct {
    head: ?*Node = null,

    const Self = @This();

    pub fn deinit(self: Self, allocator: Allocator) void {
      var node = self.head;
      while (node) |n| {
        node = n.next;
        allocator.destroy(n);
      }
    }

    pub fn append(self: *Self, allocator: Allocator, value: T) !void {
      const node = try allocator.create(Node);
      // .. same as above
    }

    // Node is the same as above
    pub const Node = struct {...}
  };
}

We no longer have an allocator field. The append and deinit functions both take an extra parameter: allocator. Because we no longer need to store the allocator, we're able to initialize a LinkedListUnmanaged(T) exclusively with default values (i.e. head: ?*Node = null) and are able to remove the init function altogether. This isn't a requirement of unamanged types, but it is common. To create a LinkedListUnmanaged(i32), you'd do:

var ll = LinkedListUnmanaged(i32){};

It's a bit cryptic, but it's standard Zig. LinkedListUnmanaged(i32) returns a type, so the above is no different than doing: var user = User{} and relying on the default field values of User.

You might be wondering what's the point of unamaged types?. But before we answer that, let's consider how easy it is to provide both managed and unmanaged versions of our LinkedList. We keep our LinkedListUnmanaged as-is, and change our LinkedList to wrap it:

pub fn LinkedList(comptime T: type) type {
  return struct {
    allocator: Allocator,
    unmanaged: LinkedListUnmanaged(T),

    const Self = @This();

    pub fn init(allocator: Allocator) Self {
      return .{
        .unmanaged = .{},
        .allocator = allocator,
      };
    }

    pub fn deinit(self: Self) void {
      self.unmanaged.deinit(self.allocator);
    }

    pub fn append(self: *Self, value: T) !void {
      return self.unmanaged.append(self.allocator, value);
    }

    pub const Node = LinkedListUnmanaged(T).Node;
  };
}

This simple composition is, as we saw above, the same as how the various hash map types wrap an std.HashMapUnmanaged.

There are a few benefits to unmanaged types. The most important is that they're more explicit. Rather than knowing that a type, like LinkList(T), will probably need to allocate memory at some point, the explicit API of the unmanaged variant identifies the specific methods that allocate/deallocate. This can help reduce surprises and gives greater control to the caller. A secondary benefit of unamanged types is that they save a few bytes of memory by not having a reference to the allocator. Some applications might need to store thousands or even millions of these structures, in which case it can add up.

In the name of simplicity, the rest of this post won't mentioned unamanged. Anything we see about the StringHashMap or AutoHashMap or HashMap applies equally to their *Unmanaged variant.

HashMap vs AutoHashMap

std.HashMap is a generic type which takes two type parameters: the type of the key and the type of the value. And, as we saw, hash maps require two functions: hash and eql. Combined, these functions are called the "context". Both functions operate on the key, and there isn't a single hash or eql function that'll work for all types. For example, for integer keys, eql is going to be a_key == b_key where as for []const u8 keys we want std.mem.eql(u8, a_key, b_key).

When we use std.HashMap we need to provide the context (the two functions). We'll look at this shortly, but for now we can rely on std.AutoHashMap which automatically generate these functions for us. It might surprise you to know that AutoHashMap can even generate a context for more complex keys. The following works:

const std = @import("std");

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  var h = std.AutoHashMap(User, i32).init(allocator);
  try h.put(User{.id = 3, .state = .active}, 9001);
  defer h.deinit();
}

const User = struct{
  id: i32,
  state: State,

  const State = enum {
    active,
    pending,
  };
};

But there's a limit to AutoHashMap's capabilities. Change our User struct to:

const User = struct{
  id: i32,
  state: State,
  login_ids: []i32,

  ...
};

And we get the following compilation error:

std.hash.autoHash does not allow slices as well as unions and structs containing slices here (User) because the intent is unclear. Consider using std.hash.autoHashStrat or providing your own hash function instead.

The reality is that there are types with ambiguous equality rules. Slices, like our login_ids above, are a good example. Should two slices pointing to different memory, but with the same length and same content be considered equal? It's application-specific. Similarly, I was surprised to find out that AutoHashMap won't allow floats (either directly or as part of a struct):

var h = std.AutoHashMap(f32, i32).init(allocator);
defer h.deinit();
try h.put(1.1, 9001);

The above gives a compilation error: unable to hash type f32. It turns out that hashing floats isn't straightforward. It isn't that floats and slices can't be hashed or compared. It's that whatever implementation Zig picks would likely surprise some developers and that unexpectedness could result in misuse which could lead to bugs. In the end, if AutoHashMap can handle your key type, then use it. If not, use a HashMap and provide your own context (which we'll look at shortly).

AutoHashMap vs StringHashMap

You'd be forgiven for thinking that StringHashMap(V) is an alias for AutoHashMap([]const u8, V). But, as we just saw, AutoHashMap doesn't support slice keys. We can confirm this. Trying to run:

const std = @import("std");

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  var h = std.AutoHashMap([]const u8, i32).init(allocator);
  try h.put("over", 9000);
  defer h.deinit();
}

gives us the following error:

error: std.auto_hash.autoHash does not allow slices here ([]const u8) because the intent is unclear. Consider using std.StringHashMap for hashing the contents of []const u8. Alternatively, consider using std.auto_hash.hash or providing your own hash function instead.

As I said earlier, it isn't that slices can't be hashed or compared, it's that some cases might consider slices equal only if they reference the same memory, while others might consider two slices equal if their elements are the same. But, in the cases of strings, most people expect "teg" to equal "teg" regardless of where each is stored.

const std = @import("std");

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  const name1: []const u8 = &.{'T', 'e', 'g'};
  const name2 = try allocator.dupe(u8, name1);

  const eql1 = std.meta.eql(name1, name2);
  const eql2 = std.mem.eql(u8, name1, name2);
  std.debug.print("{any}\n{any}", .{eql1, eql2});
}

The above program prints "false" followed by "true". std.meta.eql compares pointers using: a.ptr == b.ptr and a.len == b.len. But, specifically for strings, most programmers probably expect the behavior of std.mem.eql, which compares the bytes within the strings.

Therefore, just like AutoHashMap wraps HashMap with a auto-generated context, StringHashMap also wraps HashMap with a string-specific context. We'll look more closely at contexts next, but here's the context that StringHashMap uses:

pub const StringContext = struct {
  pub fn hash(self: @This(), s: []const u8) u64 {
    _ = self;
    return std.hash.Wyhash.hash(0, s);
  }
  pub fn eql(self: @This(), a: []const u8, b: []const u8) bool {
    _ = self;
    return std.mem.eql(u8, a, b);
  }
};

Custom Context

We'll finish part 1 by looking at using HashMap directly, which means providing our own context. We'll start with a simple example: creating a HashMap for case-insensitive ascii strings. We want the following to output: Goku = 9000. Notice though that while we insert using the key "GOKU" we fetch using "goku":

const std = @import("std");

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  var h = std.HashMap([]const u8, i32, CaseInsensitiveContext, std.hash_map.default_max_load_percentage).init(allocator);
  defer h.deinit();
  try h.put("Goku", 9000);
  std.debug.print("Goku = {d}\n", .{h.get("goku").?});
}

Unlike the StringHashMap generic which only takes the value type, or AutoHashMap that takes both the key and value type, HashMap takes the key type, value type, context type and a fill factor. I'm not going to cover the fill factor; above we're using Zig's default fill factor (80). The part that we are interested in is the CaseInsensitiveContext. Let's look at its implementation:

const CaseInsensitiveContext = struct {
  pub fn hash(_: CaseInsensitiveContext, s: []const u8) u64 {
    var key = s;
    var buf: [64]u8 = undefined;
    var h = std.hash.Wyhash.init(0);
    while (key.len >= 64) {
      const lower = std.ascii.lowerString(buf[0..], key[0..64]);
      h.update(lower);
      key = key[64..];
    }

    if (key.len > 0) {
      const lower = std.ascii.lowerString(buf[0..key.len], key);
      h.update(lower);
    }
    return h.final();
  }

  pub fn eql(_: CaseInsensitiveContext, a: []const u8, b: []const u8) bool {
    return std.ascii.eqlIgnoreCase(a, b);
  }
};

The first parameter to both functions is an instance of the context itself. This allows more advanced patterns where the context might have state. In many cases though, it isn't used.

Our eql function uses the existing std.ascii.eqlIgnoreCase function to compare two keys in a case-insensitive manner. Straightforward.

Our hash function can be broken down into two parts. The first is lower-casing the key. If we want "goku" and "GOKU" to be treated as equal, our hash function has to return the same hash code for both. We do this in batches of 64 bytes to avoid having to allocate a buffer to hold the lower-cased value. This is possible because our hashing function can be updated with new bytes (which is quite common).

This leads us to the second part, what's std.hash.Wyhash? When talking about a hashing algorithm for hash maps (which are distinct from cryptographic hashing algorithms), we're looking for a number of properties, such as performance (every operation on a hash map requires hashing the key), uniform distribution (if our hash function returns an u64, then a random set of inputs should be well distributed within that range) and collision resistance (different values can result in the same hash code, but the less it happens the better). There are a number of algorithms, some specialized for specific inputs (e.g. short strings), some designed for specific hardware. WyHash is a popular option that works well across a number of inputs and characteristics. You essentially feed it bytes and, once done, get a u64 out (or u32 depending on the version).

const std = @import("std");

pub fn main() !void {
  {
    const name = "Teg";

    var h = std.hash.Wyhash.init(0);
    h.update(name);
    std.debug.print("{d}\n", .{h.final()});
  }

  {
    const name = "Teg";
    const err = @intFromError(error.OutOfMemory);

    var h = std.hash.Wyhash.init(0);
    h.update(name);
    h.update(std.mem.asBytes(&err));
    std.debug.print("{d}\n", .{h.final()});
  }
}

This code outputs: 17623169834704516898 followed by 7758855794693669122. The numbers shouldn't mean anything. The goal is only to show how data can be fed into our hashing function to generate a hash code.

Let's look at another example. Say we have a User that we'd like to use as a key in a hash map:

const User = struct {
  id: u32,
  name: []const u8,
};

We can't use an AutoHashMap for this since it doesn't support slices (which our name is). Let's start with a skeleton:

const std = @import("std");

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  var h = std.HashMap(User, i32, User.HashContext, std.hash_map.default_max_load_percentage).init(allocator);
  defer h.deinit();
  try h.put(.{.id = 1, .name = "Teg"}, 100);
  try h.put(.{.id = 2, .name = "Duncan"}, 200);

  std.debug.print("{d}\n", .{h.get(.{.id = 1, .name = "Teg"}).?});
  std.debug.print("{d}\n", .{h.get(.{.id = 2, .name = "Duncan"}).?});
}

const User = struct {
  id: u32,
  name: []const u8,

  pub const HashContext = struct {
    pub fn hash(_: HashContext, u: User) u64 {
      // TODO
    }

    pub fn eql(_: HashContext, a: User, b: User) bool {
      // TODO
    }
  };
};

We need to implement the hash and eql functions. eql, as is often the case, is straightforward:

pub fn eql(_: HashContext, a: User, b: User) bool {
  if (a.id != b.id) return false;
  return std.mem.eql(u8, a.name, b.name);
}

And if you look at our other hash examples, you might be able to come up with its implementation:

pub fn hash(_: HashContext, u: User) u64 {
  var h = std.hash.Wyhash.init(0);
  h.update(u.name);
  h.update(std.mem.asBytes(&u.id));
  return h.final();
}

Plug in the two functions and the above example should work.

Conclusion

Hopefully you now have a better understanding of how Zig's hash maps are implemented and how to leverage them within your code. In most cases, std.StringHashMap or std.AutoHashMap are all you'll need. But knowing about the existence and purpose of the *Unmanaged variants, as well as the more generic std.HashMap, might prove useful. If nothing else, the documentation and their implementation should now be easier to follow.

In the next part we'll dive deep into hash map keys and values, how they're stored and how they're managed.