Best way to persist only a subset of Redis keys to disk - caching

Is it possible to persist only certain keys to disk using Redis? Is the best solution for this, as of right now, to run separate Redis servers, where one server holds throwaway caches and the other holds more important data that we need to flush to disk periodically (such as counters of visits to a web page)?

You can set expirations on a subset of your keys. They will be persisted to disk, but only until they expire. This may be sufficient for your use case.
You can then use the Redis maxmemory and maxmemory-policy configuration options to cap memory usage and tell Redis what to do when it hits that limit. If you use the volatile-lru or volatile-ttl policies, Redis will discard only keys that have an expiration when it runs out of memory, evicting either the least recently used key or the one with the nearest expiration (time to live), respectively.
However, as stated, these values are still written to disk until they expire. If you really need to avoid that, then your assumption is correct and a separate server looks to be the only option.
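A minimal sketch of this setup with the Python redis client (the keys, values, and memory limit are illustrative; the two config options can equally be set in redis.conf):

import redis

r = redis.Redis()

# Cap memory and evict only keys that carry a TTL; keys without an expiration
# are never discarded by the volatile-* policies.
r.config_set("maxmemory", "2gb")
r.config_set("maxmemory-policy", "volatile-lru")

# Throwaway cache entry: it has a TTL, so it may expire or be evicted.
r.setex("cache:page:/home", 3600, "<rendered html>")

# Important counter: no TTL, so volatile-lru / volatile-ttl never evict it.
r.incr("counter:visits:/home")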

Related

Can a cache admission strategy be useful to prune distributed cache writes?

Assume a distributed CRUD service that uses a distributed cache that is not read-through (just some key-value store agnostic of the DB). So there are n service nodes connected to m cache nodes (with round-robin routing). The cache is supposed to hold data stored in a DB layer.
So the default retrieval sequence seems to be:
check if data is in cache, if so return data
else fetch from DB
send data to cache (cache does eviction)
return data
The question is whether the individual service nodes can be smarter about what data to send to the cache, to reduce cache capacity costs (achieve similar hit ratio with less required cache storage space).
Given recent benchmarks on optimal eviction/admission strategies (in particular LFU), some newer caches might not even store data if it is deemed too infrequently used; maybe the application nodes can make a similar best-effort guess.
So my idea is that the individual service nodes could evaluate whether data fetched from the DB should be sent to the distributed cache or not, based on an algorithm like LFU, thus reducing the network traffic between service and cache. I am thinking of local checks (which would be less effective on cold starts), but checks against a shared list of cached keys could also be considered.
So the sequence would be:
check if data is in cache, if so return data
else fetch from DB
check if data key is frequently used
if yes, send data to cache (cache does eviction). Else not.
return data
Is this possible, reasonable, has it already been done?
It is common for databases, search engines, and analytical products to guard their LRU caches with filters to avoid pollution caused by scans. For example, see Postgres' Buffer Ring Replacement Strategy and Elasticsearch's filter cache. These are admission policies detached from the cache itself, and they could be dropped if the caching algorithm itself were more intelligent. It sounds like your idea is similar, except as a distributed version.
Most remote / distributed caches use classic eviction policies (LRU, LFU). That is okay because they are often excessively large; e.g. Twitter requires a 99.9% hit rate for their SLA targets. This means they likely won't drop recent items, because the penalty of a miss is too high, and they oversize the cache so that the eviction victim is ancient.
However, that breaks down when batch jobs run and pollute the remote caching tier. In those cases, it's not uncommon to see cache population disabled to avoid impacting user requests. This is then a distributed variant of Postgres' problem described above.
The largest drawback of your idea is checking the item's popularity. This check might be local only, which has a frequent cold-start problem, or a remote call, which adds a network hop. That remote call would be cheaper than the traffic of shipping the item, but you are unlikely to be bandwidth-limited. Likely your goal would be to reduce capacity costs through a higher hit rate, but if your SLA requires a nearly perfect hit rate then you'll over-provision anyway. It all depends on whether the gains from reducing cache-aside population operations are worth the implementation effort. I suspect that for most it hasn't been.
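For illustration, a rough sketch of the local-check variant, assuming a naive in-process frequency table in front of a Redis-backed cache (the names, threshold, and stand-in DB are hypothetical; a real implementation would use a compact decayed sketch such as TinyLFU):

import redis

r = redis.Redis()
DB = {"user:1": b"alice"}      # stand-in for the real database layer
freq = {}                      # naive local frequency table (no decay)
ADMIT_THRESHOLD = 2            # only populate the cache for keys seen this often

def get(key):
    value = r.get(key)
    if value is not None:
        return value                        # cache hit
    value = DB.get(key)                     # cache miss: fetch from the DB
    freq[key] = freq.get(key, 0) + 1
    if value is not None and freq[key] >= ADMIT_THRESHOLD:
        r.set(key, value, ex=3600)          # admission check passed: populate
    return value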

Redis: using two instances or just one (caching and storage)?

We need to perform rate limiting for requests to our API. We have a lot of web servers, and the rate limit should be shared between all of them. Also, the rate limiting demands a certain amount of ephemeral storage (we want to store each user's quota for a certain period of time).
We have a great rate limiting implementation that works with Redis by using SETEX. In this use case we need Redis to also be used as storage (for a short while, according to the expiration set on the SETEX calls). Also, the cache needs to be shared across all servers, and there is no way we could use something like an in-memory cache on each web server for the rate limiting, since the rate limiting is per user, so we expect a lot of memory to be consumed for this purpose. This makes it a great use case for a Redis cluster.
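To illustrate the kind of limiter meant here, a minimal fixed-window sketch (the key name, limit, and window size are hypothetical, and edge cases around expiry between the two calls are ignored):

import redis

r = redis.Redis()

def allow_request(user_id, limit=100, window_secs=60):
    key = f"ratelimit:{user_id}"
    # Create the per-user quota for this window if it does not exist yet
    # (SET with NX + EX behaves like SETEX when the key is missing).
    r.set(key, limit, ex=window_secs, nx=True)
    remaining = r.decr(key)        # consume one unit of the quota
    return remaining >= 0          # negative means the quota for this window is spent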
Thing is, the same web server that performs the rate limiting also has some other caching needs. It fetches some stuff from a DB and then caches the results in two layers: first in an in-memory LRU cache (on the actual server), and the second layer is Redis again, this time used as cache-only (no storage). When an item gets evicted from the in-memory LRU cache, it is passed on to be saved in Redis (so that even when a cache miss occurs in memory, there can still be a cache hit thanks to Redis).
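A minimal sketch of that two-layer read path, with an evict-to-Redis second layer (the capacity, TTL, and class name are illustrative):

from collections import OrderedDict
import redis

r = redis.Redis()

class TwoLayerCache:
    # Tiny local LRU; entries evicted from it are written to Redis (second layer).
    def __init__(self, capacity=1024, ttl=300):
        self.capacity = capacity
        self.ttl = ttl
        self.local = OrderedDict()

    def get(self, key):
        if key in self.local:
            self.local.move_to_end(key)       # refresh the LRU position
            return self.local[key]
        value = r.get(key)                    # in-memory miss: try Redis
        if value is not None:
            self.put(key, value)
        return value

    def put(self, key, value):
        self.local[key] = value
        self.local.move_to_end(key)
        if len(self.local) > self.capacity:
            old_key, old_value = self.local.popitem(last=False)  # evict the LRU entry
            r.set(old_key, old_value, ex=self.ttl)               # push it down to Redis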
Should we use the same Redis instance for both needs (the rate limiter that needs storage on one hand and the cache layer that does not on the other)? I guess we could use a single Redis instance that includes storage (not the cache-only option) and just use that for both needs? Or would it be better, performance-wise, for each of our servers to talk to two Redis instances: one used as cache-only and one that also features the storage option?
I always recommend dividing your setup into distinct data roles. Combining them sounds neat, but in practice can be a real pain. In your case you have two distinct "data roles": cached data and stored data. That is two major classes of distinction, which means you should use two different instances.
In your particular case isolating them will be easier from an operational standpoint when things go wrong or need upgrading. You'll avoid intermingling services such that an issue in caching causes issues in your "storage" layer - or the inverse.
Redis usage tends to grow into more areas. If you get in the habit of dedicated Redis endpoints now you'll be better able to grow your usage in the future, as opposed to having to refactor and restructure into it when things get a bit rough.
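Concretely, each web server would then simply hold two connections, something like this (the hostnames are hypothetical):

import redis

# One persistence-backed instance dedicated to rate limiting (the "storage" role)
# and one cache-only instance with maxmemory and an eviction policy configured.
rate_limiter = redis.Redis(host="redis-ratelimit.internal", port=6379)
cache = redis.Redis(host="redis-cache.internal", port=6379)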

NoSQL replacement for memcache

We have a situation in which the values we store in memcache are bigger than 1 MB.
It is not possible to make such values smaller, and even if there were a way, we still need to persist them to disk.
One solution would be to recompile the memcache server to allow, say, 2 MB values, but this is neither clean nor a complete solution (again, we need to persist the values).
The good news is that:
We can predict quite accurately how many key/value pairs we are going to have.
We can also predict the total size we will need.
A key feature for us is the speed of memcache.
So the question is: is there any NoSQL replacement for memcache which will allow us to have values larger than 1 MB AND store them on disk without loss of speed?
In the past I have used Tokyo Tyrant/Cabinet, but it seems to be deprecated now.
Any idea?
I'd use Redis.
Redis addresses the issues you've listed; keys and string values can each be up to 512 MB.
You can persist data to disk using the append-only file (AOF), fsynced every second or on every write, or using periodic RDB snapshots; RDB persistence provides better performance than AOF in most cases.
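For illustration, the relevant persistence settings could look like this (shown via CONFIG SET from the Python client, though they normally belong in redis.conf; the values are just examples):

import redis

r = redis.Redis()
r.config_set("appendonly", "yes")         # enable the append-only file (AOF)
r.config_set("appendfsync", "everysec")   # fsync the AOF roughly once per second
r.config_set("save", "60 1000")           # RDB snapshot if 1000+ keys change within 60s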
We use Redis for caching JSON documents. We've learned that, for maximum performance, you should deploy Redis on physical hardware if you can; virtual machines dramatically impact Redis network performance.
You also have Couchbase, which is compatible with the memcached API and allows you to store your data either in a memcached-only bucket or in a persisted cluster.
Redis is fine if the total amount of your data will not exceed the size of your physical memory. If the total amount of your data is too large to fit in memory, you will need to install more Redis instances on different servers.
Or you may try SSDB (https://github.com/ideawu/ssdb), which automatically migrates cold data to disk, so you get more storage capacity with SSDB.
Any key/value store will do, really. See this list for example: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
Also take a look at MongoDB: durability doesn't seem to be an issue for you, and that's basically where Mongo sucks, so you can get a fast document database (a key/value store on steroids, basically) with indexes for free. At least until you grow too large.
I would go with Couchbase. It can support up to 20 MB per document, and it's possible to run a bucket with either the memcached or the Couchbase protocol, the latter providing persistence.
Take a look at the other limits for keys/metadata here: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-limits.html
And here is a presentation on how MongoDB, Cassandra, and Couchbase stack up on throughput/operations per second: http://www.slideshare.net/renatko/couchbase-performance-benchmarking
I've used both Redis and Couchbase in production; for a persistent drop-in replacement for memcache, it's hard to argue against a NoSQL DB that is built upon the memcached protocol.

Configuring redis to consistently evict older data first

I'm storing a bunch of real-time data in Redis. I'm setting a TTL of 14400 seconds (4 hours) on all of the keys. I've set maxmemory to 10G, which currently is not enough space to fit 4 hours of data in memory, and I'm not using virtual memory, so Redis is evicting data before it expires.
I'm okay with redis evicting the data, but I would like it to evict the oldest data first. So even if I don't have a full 4 hours of data, at least I can have some range of data (3 hours, 2 hours, etc) with no gaps in it. I tried to accomplish this by setting maxmemory-policy=volatile-ttl, thinking that the oldest keys would be evicted first since they all have the same TTL, but it's not working that way. It appears that redis is evicting data somewhat arbitrarily, so I end up with gaps in my data. For example, today the data from 2012-01-25T13:00 was evicted before the data from 2012-01-25T12:00.
Is it possible to configure redis to consistently evict the older data first?
Here are the relevant lines from my redis.conf file. Let me know if you want to see any more of the configuration:
maxmemory 10gb
maxmemory-policy volatile-ttl
vm-enabled no
AFAIK, it is not possible to configure Redis to consistently evict the older data first.
When the *-ttl or *-lru options are chosen in maxmemory-policy, Redis does not use an exact algorithm to pick the keys to be removed. An exact algorithm would require an extra list (for *-lru) or an extra heap (for *-ttl) in memory, and cross-referencing it with the normal Redis dictionary data structure. It would be expensive in terms of memory consumption.
With the current mechanism, evictions occur in the main event loop (i.e. potential evictions are checked at each loop iteration, before each command is executed). Until memory is back under the maxmemory limit, Redis randomly picks a sample of n keys and selects for eviction the most idle one (for *-lru) or the one closest to its expiration (for *-ttl). By default only 3 samples are considered. The result is non-deterministic.
One way to increase the accuracy of this algorithm and mitigate the problem is to increase the number of considered samples (maxmemory-samples parameter in the configuration file).
Do not set it too high, since it will consume some CPU. It is a tradeoff between eviction accuracy and CPU consumption.
Now if you really require a consistent behavior, one solution is to implement your own eviction mechanism on top of Redis. For instance, you could add a list (for non updatable keys) or a sorted set (for updatable keys) in order to track the keys that should be evicted first. Then, you add a daemon whose purpose is to periodically check (using INFO) the memory consumption and query the items of the list/sorted set to remove the relevant keys.
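A rough sketch of that approach, assuming a sorted set scored by insertion time and a memory threshold checked against INFO (the key names and threshold are hypothetical):

import time
import redis

r = redis.Redis()
MEMORY_BUDGET = 9 * 1024 ** 3       # start evicting before Redis' own maxmemory kicks in

def record(key, value, ttl=14400):
    # Store the value and remember when it was inserted.
    pipe = r.pipeline()
    pipe.setex(key, ttl, value)
    pipe.zadd("eviction:index", {key: time.time()})   # score = insertion time
    pipe.execute()

def evict_oldest(batch=100):
    # Daemon loop body: while over budget, delete the oldest keys first.
    while int(r.info("memory")["used_memory"]) > MEMORY_BUDGET:
        oldest = r.zrange("eviction:index", 0, batch - 1)
        if not oldest:
            break
        r.delete(*oldest)
        r.zrem("eviction:index", *oldest)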
Please note other caching systems have their own way to deal with this problem. For instance with memcached, there is one LRU structure per slab (which depends on the object size), so the eviction order is also not accurate (although more deterministic than with Redis in practice).

MongoDB caching counters

I'm writing a visit counter for products on a website which uses MongoDB as its DB engine.
Here it says that Mongo keeps frequently accessed stuff in memory and has an integrated in-memory caching engine.
So can I just rely on this integrated caching system and naively increment the counters on every visit, or does one still need another caching layer in a high-traffic environment?
They're two separate things. MongoDB uses a simple paged memory management system that, by design, keeps the most accessed parts of the memory-mapped disk space in memory.
As a result, this will help you most for counters that are requested frequently but do not change often. Unfortunately, for website visit counters those two things are mutually exclusive. Still, because increasing a counter will generally not cause MongoDB to move the document holding the counter on disk, the read caching will remain fairly effective.
The main issue is your writes: an update per visit is not going to be very cost-effective. I suggest a strategy where your counter webapp caches incoming visits and only pushes counter updates every X visits or every Y seconds, whichever comes first. Your main goal here is to reduce writes per second, so you definitely do not want a DB write per visit.
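A minimal sketch of that batching strategy with pymongo (the collection, field names, and thresholds are made up, and thread safety is ignored):

import time
from pymongo import MongoClient

counters = MongoClient().mydb.counters    # hypothetical database/collection

FLUSH_EVERY_N = 100       # push an update after this many buffered visits per product...
FLUSH_EVERY_SECS = 30     # ...or after this many seconds, whichever comes first

buffer = {}               # product_id -> visits not yet written to MongoDB
last_flush = time.time()

def record_visit(product_id):
    buffer[product_id] = buffer.get(product_id, 0) + 1
    if buffer[product_id] >= FLUSH_EVERY_N or time.time() - last_flush >= FLUSH_EVERY_SECS:
        flush()

def flush():
    global last_flush
    for pid, count in buffer.items():
        counters.update_one({"_id": pid}, {"$inc": {"visits": count}}, upsert=True)
    buffer.clear()
    last_flush = time.time()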
Although I have never worked on the kind of system you describe, I would do the following (assuming that I have read your question correctly and that you do indeed simply want to increment the counter for each visit).
Use the $inc operator to perform the increment atomically, or use upserts with modifiers to create the document structure if it is not already there (see the sketch after this list)
Use an appropriate write concern to speed up updates if it is safe to do so (i.e. with an unacknowledged write concern such as w=0, your call to update will return immediately and you'll just have to trust Mongo to persist it to disk). Of course whether this is safe or not depends on the use case. If you are counting millions of hits then 1 failed hit may not be a problem.
If the scale of data you are storing is truly enormous, look into using sharding to partition writes
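A minimal sketch of the $inc upsert with a relaxed write concern, using pymongo (the collection and field names are illustrative):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()
# w=0: the driver does not wait for the server to acknowledge the write.
counters = client.mydb.get_collection("counters", write_concern=WriteConcern(w=0))

# Atomically increment the counter, creating the document if it does not exist yet.
counters.update_one({"_id": "product:42"}, {"$inc": {"visits": 1}}, upsert=True)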

Resources