Building a Cache in Elixir

Jul 11, 2022

I recently wrote a caching library for Elixir which aims to have a simple but effective implementation. It's meant to provide good out of the box performance while being extensible. This posts examines some of the ideas that came from it.

Without access to mutable data structures that can be shared across processes, building a cache in Elixir isn't that straightforward. If we're willing to compromise a little, we can build something quite powerful with minimal effort.

We use use an ETS table as our concurrent-safe mutable map. We'll use this to store and get values. Note that the process that creates the table owns it. In most cases, you'll want to create it when the application starts. While it might not be "pure", I generally start my caches directly in the root supervisor.

defmodule MyApp.Users do
  def setup() do
    :ets.new(:myapp_users, [

      # gives us key=>value semantics
      :set,

      # allows any process to read/write to our table
      :public,

      # allow the ETS table to access by it's name, `:myapp_users`
      :named_table,

      # favor read-locks over write-locks
      read_concurrency: true,

      # internally split the ETS table into buckets to reduce
      # write-lock contention
      write_concurrency: true,
    ])
  end

  def get(key) do
    case :ets.lookup(:myapp_users, key) do
      [{_key, value}] -> {:ok, value}
      _ -> nil
    end
  end

  def put(key, value) do
    :ets.insert(:myapp_users, {key, value})
  end
end

This is a start, but there are two things we need to add. The first is the ability to expire items. The second is placing some limit on the size of our cache.

Time To Live

Adding an expiry to items is easy, providing we're ok with lazily removing them. In order to expire items, we need to know when to expire them. We'll change our put/2 function and add a third parameter: ttl. This is the time-to-live in seconds:

def put(key, value, ttl) do
  expires = :erlang.system_time(:second) + ttl
  :ets.insert(:myapp_users, {key, value, expires})
end

Here we take a time-to-live (ttl) as relative seconds, but we could change this to take an absolute DateTime or some other time unit. What's important is that by storing the expiration time, we can modify our existing get/1 function to evict expired values:

def get(key) do
  entry = :ets.lookup(:myapp_users, key)
  with [{_key, value, expiry}] <- entry,
       true <- :erlang.system_time(:second) > expiry
  do
    value
  else
     [] -> nil  # key not found in ets
     false ->
      # key found but past expiry, delete
      :ets.delete(:myapp_users, key)
      nil
  end
end

You could call this lazily evicting expired values. We say it's lazy because it only happens re-actively when we try to get the value, not proactively when the value actually expires.

Purging

If you're only dealing with a small and slow-growing number of entries, the above might be all you need. But for larger and more dynamic data sets, solely depending on lazy removal can result in serious memory issues. Entries that aren't fetched after they expire will never be deleted.

Full Scan

The simplest option is to scan our ETS table and remove expired entries:

def purge() do
  # we'll delete any where the expiry < now
  now = :erlang.system_time(:second)

  # to be able to safely iterate through an ETS table
  # (to handle additions and removals), we need to fix the table
  :ets.safe_fixtable(:t, true)

  # iterate through the table
  purge(:ets.first(:t), now)
after

  # were're done iterate, we can unfix the table
  :ets.safe_fixtable(:t, false)
end

defp purge(:"$end_of_table", _), do: :ok

defp purge(key, now) do
  with [{_, _, expiry}] <- :ets.lookup(:t, key),
       true <- expiry < now
  do
    :ets.delete(:t, key)
  end
  purge(:ets.next(:t, key), now)
end

One question we need to figure out is when should purge be called? One option is to call it periodically. But then we need to decide on a good frequency. Too often, and we're needlessly scanning the entire cache. But if we wait too long, the cache might grow significantly between scans. As a general solution, I dislike the idea of periodic purges.

Max Size

In order to move forward, we need a way to enforce a limit on the number of values in our cache. To accomplish this, we'll store a configurable max_size within our ETS on setup:

def setup(max_size) do
  :ets.new(:myapp_users, [...])
  :ets.insert(:myapp_users, {:__max_size, max_size, nil})
end

Why are we storing {:__max_size, max_size, nil} instead of just {:__max_size, max_size}? Because this makes our __max_size entry look like any other regular entry (i.e. {key, value, ttl}). nil works as our ttl because nil is greater than any integer. In other words, the :__max_size entry never expires.

Purge When Full

With a configured max_size we can initiate our scan whenever our cache grows:

def put(key, value, ttl) do
  expires = :erlang.system_time(:second)
  entry = {key, value, expires}

  # Only purge if we're inserting a new value
  # AND we've reached our size limit
  # (If we're updating a value, then our size won't increase
  # so we won't need to purge)
  if !exists?(key) && :ets.info(:myapp_users, :size) > get_max_size() do
    purge()
  end

  :ets.insert(:myapp_users, entry)
end

def exists?(key) do
  :ets.member(:myapp_users, key)
end

defp get_max_size() do
  [{_key, max_size, _dummy_ttl} = :ets.get(:myapp_users, __max_size)
  max_size
end

If our cache has 10_000 entries in it, and we update one of those entries (i.e. if we call put/3 on a key that already exists), then the cache won't grow, it will still have 10_000 entries. Only when we insert a new entry do we need to check if we've outgrown the cache and, if so, purge.

Forced Eviction

What if our cache reaches its maximum size and there are no expired items to evict? The above purge function won't remove any items. This is a real challenge to solve in Elixir. In most other languages, we'd use a separate mutable data structure to track usage pattern to inform our eviction policy. For example, we could use a doubly linked list to track usage recency or creation age and evict the least recently used (LRU) or oldest created (FIFO), but this doesn't work easily in Elixir.

Imagine an implementation that tracks when an item was inserted into our cache. The entry would now look like {key, value, ttl, creation}. Deleting the N-oldest entries would require our purge code to build up a sorted creation=>key lookup. This is certainly doable, using Erlang's gb_trees module for example, but not particularly efficient.

Another ETS table, defined as a ordered_set, could be used. The creation date would be the key, and the value would be a list of cache keys (a list because creation date isn't unique). One concern with this approach is that, while ETS tables allow access from multiple processes, such access still involves copying values. This would add overhead to both our get and put operation (get for when we lazily delete expired items).

We could come up with other strategies. But there isn't a single best solution, especially considering what's best for one application might not be best for another. My dcache library opts for a customizable purging behavior with a simple default: randomly evicting existing entries when no expired entries are found. You can imagine a slight change to our purge path which deletes entries regardless of the expiry

# we've deleted 100 keys, that's enough
defp purge(_, 100), do: :ok

# we're reach the end of the table
defp purge(:"$end_of_table", _), do: :ok

defp purge(key, count) do
  :ets.delete(t, key)
  purge(:ets.next(key), count + 1)
end

Conclusion

It's a bit unfortunate that ETS gets us 90% of what we want, but that last 10% - enforcing an upper size - isn't straightforward to implement (at least, not without compromise). Still, I believe that in the vast number of cases, falling back to random eviction is more than sufficient.

I find it particularly interesting and fun to explore this simple approach with the goal of maximizing flexibility and performance. For example, while writing my library, I learnt about the :ets.slot/2 function, which I used instead of first/1 and next/2. I've always been a firm believer that micro-optimization is a critical task for developers to undertake. In the worst case, it tends to be a great teacher. It's also only through such efforts that informed decisions, based on previously gained knowledge and facts, can be made about what truly is and isn't worth pursuing with respect to performance.