Rethink Your Data

Nov 07, 2011

NoSQL isn't just about new tools, it's also about new [and old] modeling and data access techniques. Since traditional modeling approaches are still relevant, it would be foolish to forget everything that you know about data modeling. Equally foolish would be to fail to learn, experiment and consider different techniques; especially in the face of poor performing queries.

For most applications, reads far outnumber writes. Yet most code is written so that reads are often complex while writes are dead simple. When performance is an issue, going from complex queries (involving a lot of fields, aggregation or map reduce) to simple queries can have devastatingly awesome consequences.

The Sample Scenario

We want to build a leaderboard system for groups of friends. When a user goes to his or her score page, he doesn't only see his best-score per level, but also the best score out of his friends.

lvl	your score	best score	best by
1	500	900	Leto
2	200	350	Jessica
3	1423	1445	Paul

A First Attempt

Our first approach might be to store our data like so:

db.users.find()
  {_id: 'Gurney', friends: ['Leto', 'Paul', 'Duncan', 'Jessica', 'Glossu']}
  ...
db.scores.find()
 {_id: ObjectId("..."), level: 1, score: 500, user: 'Gurney'},
 {_id: ObjectId("..."), level: 2, score: 200, user: 'Gurney'},
 {_id: ObjectId("..."), level: 1, score: 900, user: 'Leto'},
 {_id: ObjectId("..."), level: 2, score: 150, user: 'Leto'},
 {_id: ObjectId("..."), level: 1, score: 800, user: 'Jessica'},
 {_id: ObjectId("..."), level: 2, score: 350, user: 'Jessica'},
 ...

This approach fits with how we've been taught to build systems. You have well defined and distinct entities which map to real world objects.

With this model, saving scores is simple. We just need to figure out how to display our results. It turns out that, at scale, there isn't a great way to do that. The two most likely solutions, I see, are to either do some type of aggregation (GROUP BY in SQL, Map Reduce in MongoDB) or hit the database multiple times (linear to the number of levels we are showing). Neither of which work particularly well with a lot of levels, a lot of friends, or with a sharded architecture.

Saving a score is fast and easy:

// do this as an upsert, 500 is the new score
db.scores.update({user: 'Paul', level: 4, score: {$lt: 500}}, {$set: {score: 500}}, true)

Getting scores, not so much:

var m = function() {
  emit(this.level, {user: this.user, score: this.score})
}

var r = function(name, scores) {
  var max = scores[0];
  for(var i = 1; i < scores.length; ++i) {
    if (scores[i].score > max.score) { max = scores[i]; }
  }
  return max;
}
var friendsList = db.users.findOne({_id: 'Gurney'}, {friends:1}).friends;
db.scores.mapReduce(m, r, {query: {user: {$in: friendsList}}, out: {inline: 1}})

A Better Approach

Rather than looking at your data and wondering how to query it, consider thinking about an ideal query and figuring out how to model it. Our idea query would be:

db.scores.find({user: 'Gurney'})

This query is simple, it requires a single index on user and it can scale (we'd shard on user). In order to work, a score document would need to look like:

db.scores.find()
 {_id: ObjectId("..."), level: 1, score: 500, user: 'Gurney', best: {user: Leto, score: 900}},
  ...

To reach this state, whenever a user scores a new personal best, we potentially have to update anyone who has that user as a friend:

// same as before
db.scores.update({user: 'Paul', level: 4, score: {$lt: 500}}, {$set: {score: 500}}, true)

// n returns the number of updated scores
var updated = db.runCommand( "getlasterror" ).n

// if this was a new personal best for Paul
if (updated) {
  //find everyone who has Paul as a friend
  var cursor = db.users.find({friends: 'Paul'}, {_id:1})
  var friends = cursor.map(function(doc) { return doc._id; })

  // update their best entry if Paul's new score is better than their existing best
  db.scores.update({user: {$in: friends}, 'best.score': {$lt: 500}},
                   {$set: {'best.user': 'Paul', 'best.score': 500}}, false, true);
}

Saving a score has gotten more complicated. However, writes often aren't the bottleneck. In this case, the complexity only happens when the score is a personal best. The rest of the time, it's the same code. If we wanted to, we could further optimize the code by having Paul know who has him as a friend.

To The Purists

Purists might not like this. But to me, this is more than about being pragmatic. For our system, this denormalized representation of a score is the correct one, because it's what a score actually is. In the real world, data is duplicated. And while duplicate date does have a cost associated with it, it also has a lot of advantages.

Conclusion

If a read query is slow, and you've done what you can with indexes, then the solution may be to push some of the work onto inserts, updates or a background task. If you have difficulty visualizing it, think of it first in terms of an ideal query, then the structure which would satisfy the query, and finally how to maintain that structure.

This approach can be abused. But more often than not, people are not, whatsoever, thinking about the problem from this perspective.

When a query is slow, don't only ask how can I make this faster? Ask how can I make this simpler? If another part must pay a price, at least that's an option for you to consider.