Data Modeling In Redis

Jun 28, 2012

When you first start using Redis you probably see it with a fairly narrow perspective. Maybe you're thinking of it for caching, or you are looking to its sets to maintain a unique list, like tags. The more you use it, the more you find it useful.

Up until now, I've always used Redis to implement specific features (or parts of a feature). I've used it for ranking, real time statistics, simple queues and so on. I've never used it to hold an applications' domain data. Lately though I've been playing with the idea of re-writing mogade to use Redis exclusively. So far it's been fun to think about how I'd model it and to play with some prototypes. I thought I'd share some preliminary findings.

Specifically, I've been focusing on scores, since that's the most important part of mogade. Redis doesn't have secondary indexes. For the most part, you can only get a value by a key. That means that if you store a user with her username as the key, there's no way to find the user by email. That is, not unless you maintain your own index. The point is that you really need to think about how you need to query your data.

In mogade scores are queried two different ways. The first, is to get a page of scores for a leaderboard. The second is to get a specific user's existing score to see if the new score is better. There's a bit more to it than that. For example, a score belongs to a leaderboard which belongs to a game. Also, a score belongs to a specific scope: daily, weekly or overall. As you'll see in the first example, this doesn't really change anything.

Before we can look at how to model our scores, note that we'll be using sorted sets to order them. For example, if we want to get the 10 leaders of a leaderboard (say, id 23323) for the daily scope (1), we'd do:

# zrevrange instead of zrange because we want the top/highest scores first
  redis.zrevrange '23323:1', 0, 10

What the above would do is return the top 10 usernames. I've already explored this part of the design. What we want to focus on here is how to store and retrieve the other information associated with a score (time, points, level and so on).

Our first modeling attempt will be to use Redis strings, the most basic data structure. As an example, we might have the following keys:

redis.set '23323:1:leto',    {:points => 9001, :time => ..., :level -> ...}
  redis.set '23323:1:ghanima', {:points => 3994, :time => ..., :level -> ...}
  redis.set '23323:1:duncan',  {:points => 3444, :time => ..., :level -> ...}
  redis.set '23323:2:leto',    {:points => 10000, :time => ..., :level -> ...}

Using this approach, we can retrieve a page of leaderboard records using the mget command:

users = redis.zrevrange '23323:1', 0, 10
  redis.mget users.map{|u| "23323:1:#{u}" }

Give the above dummy data, the code would essentially result in the following being executes:

redis.mget ['23323:1:leto', '23323:1:ghanima', '23323:1:duncan']

To avoid polluting the global key space, we could store all the scores within a hash. Pretty much just adding a level of indirection:

redis.hset 'scores', '23323:1:leto',    {:points => 9001, :time => ..., :level -> ...}
  redis.hset 'scores', '23323:1:ghanima', {:points => 3994, :time => ..., :level -> ...}
  redis.hset 'scores', '23323:1:duncan',  {:points => 3444, :time => ..., :level -> ...}
  redis.hset 'scores', '23323:2:leto',    {:points => 1000, :time => ..., :level -> ...}

Using this approach, we can retrieve a page of leaderboard records using the hmget command:

users = redis.zrevrange '23323:1', 0, 10
  redis.hmget 'scores', users.map{|u| "23323:1:#{u}" }

There's no practical difference between the two. hmget and mget are both O(N) operations (where N is the number of values/fields being retrieved). Also, from what I've observed, there isn't too much of a memory difference between them. The second version is just a bit cleaner.

We can split our hash a number of different ways. For example, each leaderboard could be our key, or leaderboard + scope:

redis.hset '23323', '1:leto',    {:points => 9001, :time => ..., :level -> ...}
  redis.hset '23323', '1:ghanima', {:points => 3994, :time => ..., :level -> ...}
  redis.hset '23323', '1:duncan',  {:points => 3444, :time => ..., :level -> ...}

  #or
  redis.hset '23323:1', 'leto',    {:points => 9001, :time => ..., :level -> ...}
  redis.hset '23323:1', 'ghanima', {:points => 3994, :time => ..., :level -> ...}
  redis.hset '23323:1', 'duncan',  {:points => 3444, :time => ..., :level -> ...}

Again, in terms of memory and processing time, there isn't much (if any) difference between these different approaches. However, each one allows you to do unique things. For example, by storing everything in a single hash (named 'scores', for example), we can quickly get a count of all scores in our system:

redis.hlen 'scores'

Of course, by breaking it up by leaderboard, we can quickly get a count per-leaderboard, and we can easily delete/reset an entire leaderboard. What if we needed all of these features? An efficient way to count all the scores as well as reset an individual leaderboard? Depending on which approach you take, you'll get one and have to build the other. In this case, it'd probably be easiest (and much more efficient) to organize them by leaderboard for easy reset, and introduce a score counter:

#whenever there's a new score
  redis.incr 'score:count', 1

(It's worth noting that our sorted set used for ranking would come in handy here as it's essentially a secondary index that contains a list of all the scores in a particular leaderboard + scope. Therefore, both approaches, in this case, are quite simple).

So far the different ways we've seen to organize scores don't impact our main goal. One approach makes it easier to reset a leaderboard, while the other to get a total count of all scores, but they all take the same memory and getting a page of scores driven by rank is two efficient calls.

There is another way we could organize our data which could have a significant impact. This is because mogade's actual usernames are a SHA1 hash (username + deviceid). Given a key, the username accounts for most of the size (232:1:86f7e437faa5a7fce15d1ddcb9eaeaea377667b8). All of the examples we've looked at so far include the username for every single score of every single scope. We can organize our data and minimize username duplications:

redis.hset 'leto',    '23323:1', {:points => 9001, :time => ..., :level -> ...}
  redis.hset 'leto',    '23323:2', {:points => 1000, :time => ..., :level -> ...}
  redis.hset 'ghanima', '23323:1', {:points => 3994, :time => ..., :level -> ...}
  redis.hset 'duncal',  '23323:1', {:points => 3444, :time => ..., :level -> ...}

This means instead of having:

23323:1:86f7e437faa5a7fce15d1ddcb9eaeaea377667b8 -> data
  23323:2:86f7e437faa5a7fce15d1ddcb9eaeaea377667b8 -> data
  23323:3:86f7e437faa5a7fce15d1ddcb9eaeaea377667b8 -> data

We have:

86f7e437faa5a7fce15d1ddcb9eaeaea377667b8  -> 23323:1 -> data
                                            -> 23323:2 -> data
                                            -> 23323:3 -> data

For 3 million scores across 5 leaderboards and 3 scopes, this approach saved about 250MB. Given the small data associated with each score, that's around 50% of the overall memory.

There's a steep price to pay for this though. Our previous examples were all able to make use of either hmget or mget. However, organized this way, we need to get different fields in different keys. Rather than being able to get all our scores in a single query, we need to issue 1 query per score:

users = redis.zrevrange '23323:1', 0, 10
  scores = []
  users.each do |user|
    scores << redis.hget user, '23323:1'
  end

Redis can pipeline commands (fire a command before it received a response to the previous one), which we could leverage in this case:

users = redis.zrevrange '23323:1', 0, 10
  redis.pipelined do
    scores = users.map {|user| redis.hget user, '23323:1' }
  end

While fast, it won't be as fast as hmget/mget. Some initial benchmark showed it to be ~2x slower in Ruby. The node version, while faster in all other tests, struggled massively with this multi-query approach. I suspect that 20 000 000 asynchronous fetches is beyond the node's driver capability to gracefully handle.

2x slower might seem bad. But we are literally talking about microseconds. Users won't be able to tell the difference between 1 microsecond and 2 microseconds. And, CPU is cheaper to scale than memory. So, if I did rebuild this, I'd strongly consider this last approach for my specific needs.

Finally, in Redis 2.6 we can leverage lua scripting to implement our own command which will act similar to mget and hmget, but across different keys:

script = <<eos
  local scores = {}
  for index, user in ipairs(KEYS) do
    scores[index] = redis.call('hget', user, ARGV[1])
  end
  return scores
eos

users = redis.zrevrange '23323:1', 0, 10
scores = redis.eval script, users, '23323:1'

Rather than being 2.x, it's roughly only 20% slower.

To be clear, these performance numbers are from simulated runs fetching millions of scores in a tight loop. Running it again might result in 10% difference, or maybe a 50% difference. But we are really talking about the difference between 1 microsecond and 1.5 microsecond (for example).

Conclusion

Hopefully you've learned a couple things. First of all, there are a lot more ways to organize data in redis. Strings versus hashes, and where to partition your key versus your fields, may very well be driven by your specific requirements. Under some circumstances, it definitely can have an impact on memory size and query speed. The memory aspect of it is probably only significant when you have a lot of keys and the size of the data doesn't dwarf the size of the key (which is the case for mogade, so key size is worth considering.)

Pipelining is great, but the multi-get commands are better. If there is something to learn about that though it's not to worry about hitting Redis many times to achieve something. This is a big shift from relational databases, but an important one in order to effectively use redis.