We had been using memcached in our application for a long time, and it helped a lot to reduce the load that some huge queries put on our DB servers. But there was a problem (sometimes called the “dog-pile effect”): when a cached value expired while we had heavy traffic, too many threads in our application would try to calculate the new value at the same time in order to cache it.

For example, if you have some simple but really bad query like

SELECT COUNT(*) FROM some_table WHERE some_flag = X

which could be really slow on huge tables, and your cache expires, then ALL your clients calling a page with this counter will end up waiting for the counter to be updated. There could be tens or even hundreds of such queries running on your DB at once, killing your server and breaking the entire application (the number of application instances is constant, but more and more of them are blocked waiting for the counter).
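
To make the effect concrete, here is a minimal sketch of a naive read-through cache hit by ten concurrent clients right after the value has expired. NaiveCache and its fetch method are hypothetical stand-ins written for this illustration (a Hash instead of memcached, a sleep instead of the slow query); they are not part of memcache-client.

```ruby
# Hypothetical in-process stand-in for memcached, only good enough for this demo.
class NaiveCache
  attr_reader :query_runs

  def initialize
    @store = {}
    @lock = Mutex.new
    @query_runs = 0
  end

  # Classic read-through: on a miss, the caller computes the value itself.
  def fetch(key)
    cached = @lock.synchronize { @store[key] }
    return cached if cached

    @lock.synchronize { @query_runs += 1 }
    value = yield
    @lock.synchronize { @store[key] = value }
    value
  end
end

cache = NaiveCache.new

# Ten clients request the same counter while the cache is empty.
clients = Array.new(10) do
  Thread.new do
    cache.fetch('counter') do
      sleep 0.2 # stands in for the slow COUNT(*) query
      42
    end
  end
end

results = clients.map(&:value)
puts "clients served: #{results.size}, slow query runs: #{cache.query_runs}"
```

Every client that arrives before the first query finishes sees a miss and runs the query itself, so the number of query runs is usually close to the number of clients rather than 1.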

So, how could we avoid this problem? The first thing that came to my mind was: “What if we marked the old counter as ‘expired’ and then only one thread re-calculated it while all other clients kept using the old value?” The idea looks great, but when we cache something in memcached, it is hard to say when a value was saved to the cache and when it is going to expire. After a little research I found a much more elegant solution: we could create two keys in memcached: a MAIN key with an expiration time a bit higher than normal, plus a STALE key which expires earlier. When we read a value from memcached, we try to read the STALE key too. If it has expired, it is time to start the re-calculation (and to set the STALE key again with a short TTL, so that other clients keep using the old value in the meantime).
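
The lifetime of the two keys can be sketched as three phases. The helper below is hypothetical and only does the arithmetic; it assumes the MAIN key lives for ttl plus twice the generation time, which is the window used by the patch further down.

```ruby
# Hypothetical helper: which phase is a cached entry in, `age` seconds after it was written?
def cache_phase(age, ttl, generation_time = 30)
  main_ttl = ttl + generation_time * 2 # MAIN key outlives the STALE key

  if age < ttl
    :fresh          # both keys present: serve from cache
  elsif age < main_ttl
    :refresh_window # STALE key gone, MAIN key still there: time to re-calculate
  else
    :expired        # both keys gone: a plain cache miss
  end
end

puts cache_phase(50, 100)  # prints "fresh"
puts cache_phase(110, 100) # prints "refresh_window"
puts cache_phase(170, 100) # prints "expired"
```

The refresh window is the interesting part: the old value is still readable there, so only the client that notices the missing STALE key has to pay for the query.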

The final solution we ended up using is the following (a monkey patch for the ActiveRecord::Cache class from RobotCoop’s memcache-client library):

# Anti-dog-pile effect caching extension
module ActiveRecord
  class << Cache
    STALE_REFRESH = 1
    STALE_CREATED = 2
 
    # Caches data received from a block
    #
    # Unlike the usual Cache.get, this method lets the caller
    # re-generate data once it has expired without running the
    # data generation code more than once, so the dog-pile
    # effect won't bring our servers down
    #
    def smart_get(key, ttl = nil, generation_time = 30.seconds)
      # Fallback to default caching approach if no ttl given
      return get(key) { yield } unless ttl
   
      # Create window for data refresh
      real_ttl = ttl + generation_time * 2
      stale_key = "#{key}.stale"
   
      # Try to get data from memcache
      value = get(key)
      stale = get(stale_key)
       
      # If stale key has expired, it is time to re-generate our data
      unless stale
        put(stale_key, STALE_REFRESH, generation_time) # lock
        value = nil # force data re-generation
      end
   
      # If no data retrieved or data re-generation forced, re-generate data and reset stale key
      unless value
        value = yield
        put(key, value, real_ttl)
        put(stale_key, STALE_CREATED, ttl) # unlock
      end
   
      return value
    end
  end
end

Since it is a monkey patch, you can place this piece of code wherever you want, but it must be loaded AFTER memcache-client (for example, you can put it in your config/initializers/ directory or just copy-paste it into your environment.rb). Example usage of the patch:

# This would fall back to a generic get() method because TTL was not provided
Cache.smart_get('test') { some_huge_calc }

# This would cache your calculation results for 160 seconds and re-generate the cache after 100 seconds
Cache.smart_get('test', 100) { some_huge_calc }

# This would cache your calculation results for 120 seconds and re-generate the cache after 100 seconds
Cache.smart_get('test', 100, 10) { some_huge_calc }
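
If you want to see the behaviour without a memcached server, the sketch below replays the same logic against a hypothetical FakeCache: an in-memory Hash with a clock that is advanced by hand. FakeCache, its now attribute and its get/put methods are assumptions made for this illustration; only the body of smart_get mirrors the patch above (minus the no-TTL fallback).

```ruby
# Hypothetical in-memory stand-in for memcached with a manually advanced clock.
class FakeCache
  STALE_REFRESH = 1
  STALE_CREATED = 2

  attr_accessor :now

  def initialize
    @now = 0
    @data = {} # key => [value, expires_at]
  end

  def get(key)
    value, expires_at = @data[key]
    return nil if expires_at.nil? || expires_at <= @now
    value
  end

  def put(key, value, ttl)
    @data[key] = [value, @now + ttl]
    value
  end

  # Same two-key logic as the patch, without the no-TTL fallback.
  def smart_get(key, ttl, generation_time = 30)
    real_ttl = ttl + generation_time * 2
    stale_key = "#{key}.stale"
    value = get(key)
    unless get(stale_key)
      put(stale_key, STALE_REFRESH, generation_time) # lock
      value = nil # force data re-generation
    end
    unless value
      value = yield
      put(key, value, real_ttl)
      put(stale_key, STALE_CREATED, ttl) # unlock
    end
    value
  end
end

cache = FakeCache.new
runs = 0

first = cache.smart_get('counter', 100) { runs += 1; 'v1' } # cold cache: block runs

cache.now = 50
second = cache.smart_get('counter', 100) { runs += 1; 'v2' } # still fresh: block skipped

cache.now = 110 # STALE key has expired, MAIN key is still alive
seen_by_other = nil
third = cache.smart_get('counter', 100) do
  runs += 1
  # Another client arriving while we are still re-calculating gets the old value.
  seen_by_other = cache.smart_get('counter', 100) { runs += 1; 'other' }
  'v3'
end

puts [first, second, seen_by_other, third].inspect # prints ["v1", "v1", "v1", "v3"]
puts runs                                          # prints 2
```

The nested call stands in for a second client that shows up in the middle of the re-calculation: it finds the short-lived STALE key, skips its own block and is served the old value.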

So, this is it - with a simple change we’ve fixed a really annoying problem and made our application much more stable.