Yesterday, we experienced a very long downtime. All told, we were down for about 11 hours, which is unacceptably long. It sucked for everyone (including our team – we all check in every day, too). We know how frustrating this was for all of you because many of you told us how much you’ve come to rely on foursquare when you’re out and about. For the 32 of us working here, that’s quite humbling. We’re really sorry.
This blog post is a bit technical. It has the details of what happened, and what we’re doing to make sure it doesn’t happen again.
What happened
The vast bulk of the data we store is from user check-in histories. Our databases are structured so that this data is spread evenly across multiple database “shards”, each of which can only store so many check-ins. Starting around 11:00am EST yesterday, we noticed that one of these shards was performing poorly because a disproportionate share of check-ins was being written to it. For the next hour and a half, until about 12:30pm, we tried various measures to ensure a proper load balance. None of them worked. As a next step, we introduced a new shard, intending to move some of the data from the overloaded shard to this new one.
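To make the imbalance concrete, here is a minimal sketch in Python of how one shard can end up taking most of the writes even when users are spread evenly. This is not the production code: the routing rule (user ID modulo the shard count), the function names, the sample data, and the 1.5x threshold are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical setup: the post does not say how check-ins map to shards.
# We assume each user is routed to a shard by user ID.
NUM_SHARDS = 2


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Assumed routing rule: user ID modulo the shard count."""
    return user_id % num_shards


def writes_per_shard(checkins, num_shards=NUM_SHARDS):
    """Count how many check-ins land on each shard."""
    counts = Counter({shard: 0 for shard in range(num_shards)})
    for user_id in checkins:
        counts[shard_for(user_id, num_shards)] += 1
    return dict(counts)


def overloaded_shards(counts, threshold=1.5):
    """Return shards receiving more than `threshold` times the average load."""
    average = sum(counts.values()) / len(counts)
    return [shard for shard, n in counts.items() if n > threshold * average]


# Users are spread evenly (two per shard), but check-in volume is not:
# user 2 checks in far more often than anyone else.
checkins = [1, 2, 2, 2, 2, 2, 2, 2, 3, 4]
counts = writes_per_shard(checkins)
print(counts)
print(overloaded_shards(counts))
```

In this toy data each shard holds the same number of users, yet one shard receives four times the writes of the other. An even split of users does not guarantee an even split of load.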