Learning Elixir's ETS - Part 1

May 28, 2017

This post, the first of two, continues our previous discussion about concurrent programming in Elixir. From what we've looked at so far though, calling things "concurrent" is a stretch. Elixir favors safety above everything else. To this end, one Elixir process owns a piece of data. Through various OTP modules like GenServer, other processes can ask the owning process to manipulate this data. At best, we can let other processes ask for a copy of the data. Put differently, everything we've looked at so far is specifically structured to serialize concurrency.

This gives some insight into why they're called processes: their memory is isolated. If we wanted to scale an Agent or GenServer, we'd need to apply the same techniques that we'd use to scale an OS process. Namely, we could duplicate the data per process or shard the data and distribute a request to the owning shard. There are a lot of cases where the disadvantages to these approach are unpalatable.

Where does that leave us? There's another OTP module, called Erlang Term Storage (ETS), that we can use. On the downside, there isn't an idiomatic Elixir layer around like we have with GenServer. So, using it is a bit ugly (but the API is simple enough, so no big deal). On the plus side, the performance characteristics (and I believe even the implementation) is going to feel a lot more natural to people used to threads.

What do I mean by this last point? Well, as far as I can tell, ETS is implemented using a read-write mutex against an actual hashtable. Maps in elixir are trees and have Log(N) performance for typical operations. ETS, backed by a hashtable, not only provides O(1) access, but also allows concurrent reads and writes from multiple processes.

Let's first look at a very basic example, then go over it in more details:

:ets.new(:players, [:set, :public, :named_table])
:ets.insert(:players, {1, "leto", 28.1, 29.5})
:ets.insert(:players, {2, "paul", 28.1, 44.4})
:ets.lookup(:players, 2)
out => [{2, "paul", 28.1, 44.4}]

First, we create the table and give it the name :players. Because we passed the :named_table option, we can use this name, from any process, to access the table. Without the :named_table option, we'd have to use the id returned by :ets.new.

Now, if the process that creates the table dies, the table disappears. For this reason, you'll want to create a GenServer to create the table and add it to our supervisor tree as a worker:

defmodule Game.Board do
  use GenServer

  def start_link(_) do
    {:ok, pid} = GenServer.start_link(__MODULE__, [])
    GenServer.call(pid, :create_tables)
    {:ok, pid}

  # It's important that we do this in a GenServer.call,
  # since start_link executes under the calling processes,
  # not the GenServer process.
  def handle_call(:create_tables, _from, _) do
    :ets.new(:players, [:set, :public, :named_table])

Our GenServer does nothing, so it hopefully won't crash. If it does, the supervisor will restart it, but our players table will be empty (this is also true of using a GenServer or Agent directly). There are more advanced recovery options possible though.

Although our GenServer owns the table, any process can read and write to it directly. This is only true because we passed the public flag. The other two options are protected which allows any process to read, but only the owning process to write, and private which only allows the owning process to read and write. (If you're using private maybe you should just use a GenServer.)

The last option that we passed was set. Other options include sorted_set, bag and duplicate_bag. To understand the difference between these, first know that ETS deals with tuples. By default, the first element of our tuple is the key. This can be changed by passing keypos: index as another option to :ets.new. A set only allows one value with a given key. An ordered_set has the same restriction, but also keeps all your data in order. The order is done by comparing the entire tuple. Therefore, if the key is the first value, and since this is a set and the key must be unique, it would order by the key.

A bag allows multiple values for a given key, as long as those values are unique. If we had a table that gave us the family members of a person, we could use a bag:

:ets.new(:family, [:bag, :public, :named_table])
:ets.insert(:family, {"leto", "ghanima"})
:ets.insert(:family, {"leto", "paul"})
:ets.insert(:family, {"leto", "chani"})
:ets.lookup(:family, "leto")
out => [{"leto", "ghanima"}, {"leto", "paul"}, {"leto", "chani"}]

Inserting {"leto", "chani"} again would not cause a duplicate entry. That is, unless we use the last type: duplicate_bag. It's worth noting that insert into a duplicate_code bag is faster than inserting into a bag so, if your application already makes it impossible to get duplicates, you might want to use a duplicate_bag.

So far, we've only looked at insert and lookup. You probably already guessed that there's a delete function. There's also a number of other things we can do, such as checking for membership, inserting only if new, iterating (though read the documentation carefully for the steps needed to do this safely, or just use the foldl function). There are specialized function for dealing with counters and a convenient update_element function that lets you update a specific element of the tuple without having to first select it.

However, what I want to talk about are the various match function which largely all take a "pattern". They are useful but a little confusing. The important thing to remember is that they work just like all the other matching in Elixir. So, if your tuple has 5 element, you need to provide a 5 element tuple (though some of those elements can be wildcards). And, if you add a 6th element, then all your matches will fail because, just like in normal Elixir, {1, 2, 3, 4, 5} does not match {1, 2, 3, 4, 5, 6}.

To match, we use the :_ atom as a wildcard. So, if we go back to our original table, which looks like:

:ets.insert(:players, {1, "leto", 28.1, 29.5})
:ets.insert(:players, {2, "paul", 28.1, 44.4})


# these will match
:ets.match(:players, {1, :_, :_, :_})
:ets.match(:players, {:_, "leto", :_, :_})
:ets.match(:players, {2, "paul", 28.1, 44.4})

# this will match both
:ets.match(:players, {:_, :_, 28.1, :_})

# these will match nothing
:ets.match(:players, 1)
:ets.match(:players, {:_, "leto"})
:ets.match(:players, {2, "paul", 28.1, 44.4, "spice"})

If you run the above though, you'll see that the "matching" lines all return [[]] and all the non-matching line returns []. What gives?

First, one simple solution is to use match_object instead, which will return the full tuple. However, if you only want some of the data, you can "select" it by specifying :"$INDEX" where index is the position you want that element returned in. Consider:

:ets.match(:players, {1, :"$1", :"$2", :_})
out => [["leto", 32.4]]

If we switch the :"$1" and :"$2", the order of the returned data will be reversed:

:ets.match(:players, {1, :"$2", :"$1", :_})
out => [[32.4, "leto"]]

Note that it doesn't matter what numbers you use. They will be returned in numeric order. We could replace :"$2", :"$1" with :"$4994822", :"$0" and get the same result.

From what I can tell, there's no way to both match and select. If you need this, use match_object.

If your match includes the key, the query will be fast (but lookup appears quite a bit faster, so use that if you just was the record by id). Otherwise, it'll scan the entire table. match can also take a 3rd integer parameter to limit the number of matched results return (and in such a case, it'll return a continuation that can be passed as the sole parameter to a subsequent match to page through the results). There's also a match_delete function that lets us delete based on a match.

Beyond match there are a number of select function which accepts an even more powerful structure (called a MatchSpec) to query the table. We can use the :ets.fun2ms to convert an elixir function into a match spec. It's really cool. Look:

ms = :ets.fun2ms fn {_, name, x, y} when name == "leto" -> {x, y} end
out => [
  {{:_, :"$1", :"$2", :"$3"}, [{:==, :"$1", "leto"}], [{{:"$2", :"$3"}}]}

ets.select(:players, ms)
out => [{32.4, 29.5}]

The above is the same as matching on {:_, "leto", "$1", $2}, but hopefully you can see how much more powerful a MatchSpec is. Of course, this is also going to be a linear scan through the table.

Maybe you're thinking "ETS all-the-things". But, as a generalization, you should favor Agents and GenServers and only switch to ETS after some deliberation.