Dog-pile Effect and How to Avoid it with Ruby on Rails memcache-client Patch
10 Mar 2008

We had been using memcached in our application for a long time, and it helped a lot to reduce the load on our DB servers from some huge queries. But there was a problem (sometimes called the “dog-pile effect”): when a cached value expired under heavy traffic, too many threads in our application would try to calculate the new value at the same time in order to cache it.
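To make the problem concrete, here is a hypothetical sketch of a plain read-through cache (the `NaiveCache` class and the simulated query are illustration only, not part of the patch). Every caller that sees a miss runs the expensive work itself, so when a hot key expires under load, many threads hit the database at once:

```ruby
require 'thread'

# Minimal thread-safe in-memory cache, just enough to show the effect
class NaiveCache
  def initialize
    @store = {}
    @mutex = Mutex.new
  end

  def get(key)
    @mutex.synchronize { @store[key] }
  end

  def put(key, value)
    @mutex.synchronize { @store[key] = value }
  end
end

cache   = NaiveCache.new
db_hits = Queue.new # counts how many threads ran the "query"

fetch_counter = lambda do
  value = cache.get('counter')
  unless value
    db_hits << 1 # simulate the slow COUNT(*) query
    sleep 0.05
    value = 42
    cache.put('counter', value)
  end
  value
end

# Ten concurrent requests arrive just after the key expired:
threads = 10.times.map { Thread.new { fetch_counter.call } }
results = threads.map(&:value)
puts "expensive query ran #{db_hits.size} times" # typically close to 10
```

Because no thread knows another is already recalculating, almost every one of them re-runs the slow query instead of just one.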

For example, if you have a simple but really heavy query like

SELECT COUNT(*) FROM some_table WHERE some_flag = X

which can be really slow on huge tables. When your cache expires, ALL your clients requesting a page with this counter end up waiting for the counter to be updated. There can be tens or even hundreds of such queries running on your DB, killing your server and breaking the entire application (the number of application instances is constant, but more and more of them are locked waiting for the counter).

So, how could we avoid this problem? The first thing that came to my mind was: “What if we marked the old counter as ‘expired’, and then only one thread would recalculate it while all other clients used the old value?”. The idea looked great, but when we cache something in memcached, it is hard to say when a value was saved to the cache and when it is going to expire. After a little research I found a much more elegant solution: create two keys in memcached: a MAIN key with an expiration time a bit higher than normal, plus a STALE key which expires earlier. When we read a value from memcached, we read the STALE key too. If it has expired, it is time to start recalculating (and to set the STALE key again with a short TTL, so that only one client does the work).

The final solution we ended up using is the following monkey patch for the ActiveRecord::Cache class from RobotCoop’s memcache-client library:

# Anti-dog-pile effect caching extension
module ActiveRecord
  class << Cache
    STALE_REFRESH = 1
    STALE_CREATED = 2
 
    # Caches data received from a block
    #
    # The difference between this method and usual Cache.get
    # is following: this method caches data and allows user
    # to re-generate data when it is expired w/o running
    # data generation code more than once so dog-pile effect
    # won't bring our servers down
    #
    def smart_get(key, ttl = nil, generation_time = 30.seconds)
      # Fallback to default caching approach if no ttl given
      return get(key) { yield } unless ttl
   
      # Create window for data refresh
      real_ttl = ttl + generation_time * 2
      stale_key = "#{key}.stale"
   
      # Try to get data from memcache
      value = get(key)
      stale = get(stale_key)
       
      # If stale key has expired, it is time to re-generate our data
      unless stale
        put(stale_key, STALE_REFRESH, generation_time) # lock
        value = nil # force data re-generation
      end
   
      # If no data retrieved or data re-generation forced, re-generate data and reset stale key
      unless value
        value = yield
        put(key, value, real_ttl)
        put(stale_key, STALE_CREATED, ttl) # unlock
      end
   
      return value
    end
  end
end

Since it is a monkey patch, you can place this piece of code wherever you want, but it must be loaded AFTER memcache-client (for example, you can put it in your config/initializers/ directory or just copy-paste it into your environment.rb). Example usage of this patch:

# This would fall back to a generic get() method because TTL was not provided
Cache.smart_get('test') { some_huge_calc }

# This would cache your calculation results for 160 seconds (100 + 30 * 2, using the default generation_time) and re-generate the cache after 100 seconds
Cache.smart_get('test', 100) { some_huge_calc }

# This would cache your calculation results for 120 seconds (100 + 10 * 2) and re-generate the cache after 100 seconds
Cache.smart_get('test', 100, 10) { some_huge_calc }

So, this is it: with a simple change we fixed a really annoying problem and made our application much more stable.
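To see the two-key scheme working in isolation, here is a self-contained sketch of the same algorithm against a tiny in-memory TTL cache (the `TinyCache` class is hypothetical, just enough to run the example; the real patch talks to memcached, and the Rails-only `30.seconds` default is replaced with a plain integer):

```ruby
class TinyCache
  STALE_REFRESH = 1
  STALE_CREATED = 2

  def initialize
    @store = {} # key => [value, expires_at]
  end

  def get(key)
    value, expires_at = @store[key]
    return nil if value.nil? || Time.now > expires_at
    value
  end

  def put(key, value, ttl)
    @store[key] = [value, Time.now + ttl]
  end

  # Same algorithm as the smart_get patch above
  def smart_get(key, ttl = nil, generation_time = 1)
    return get(key) || yield unless ttl

    real_ttl  = ttl + generation_time * 2
    stale_key = "#{key}.stale"

    value = get(key)
    stale = get(stale_key)

    unless stale
      put(stale_key, STALE_REFRESH, generation_time) # lock
      value = nil                                    # force regeneration
    end

    unless value
      value = yield
      put(key, value, real_ttl)
      put(stale_key, STALE_CREATED, ttl)             # unlock
    end

    value
  end
end

cache = TinyCache.new
calls = 0

first  = cache.smart_get('counter', 10) { calls += 1; 42 } # miss: runs the block
second = cache.smart_get('counter', 10) { calls += 1; 42 } # hit: served from cache
puts [first, second, calls].inspect # => [42, 42, 1]
```

The second call never runs the block: the MAIN key is still fresh and the STALE key has not expired, so the cached value is served directly. Only when the STALE key expires (after 10 seconds here) does the next caller take the lock and regenerate, while everyone else keeps reading the still-valid MAIN key.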