Growing hash-of-queues beyond main memory limits - data-structures

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 234 – 237 bytes of storage.

Ultimately, it all boils down on how you define efficiency needed on part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory hog requirement, some form of persistent storage seems a reaonsable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears very RAM-centric
to make me feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory hog problem, unless you set up a cluster of machines that each holds a part of the overall data. This makes sense only if keeping all data in-memory is needed due to low response times requirements. Yet, given the nature of your jobs, taking tens of seconds to complete, response times, respective to workers, hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
*Requirements
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requestes/per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more more data per job?)
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.

I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/

Related

Which caching mechanism to use in my spring application in below scenarios

We are using Spring boot application with Maria DB database. We are getting data from difference services and storing in our database. And while calling other service we need to fetch data from db (based on mapping) and call the service.
So to avoid database hit, we want to cache all mapping data in cache and use it to retrieve data and call service API.
So our ask is - Add data in Cache when it gets created in database (could add up-to millions records) and remove from cache when status of one of column value is "xyz" (for example) or based on eviction policy.
Should we use in-memory cache using Hazelcast/ehCache or Redis/Couch base?
Please suggest.
Thanks
I mostly agree with Rick in terms of don't build it until you need it, however it is important these days to think early of where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it into a non-prepared system is always possible but much more expensive (in terms of hours) and complicated.
Ok to the actual question; disclaimer: Hazelcast employee
In general for caching Hazelcast, ehcache, Redis and others are all good candidates. The first question you want to ask yourself though is, "can I hold all necessary records in the memory of a single machine. Especially in terms for ehcache you get replication (all machines hold all information) which means every single node needs to keep them in memory. Depending on the size you want to cache, maybe not optimal. In this case Hazelcast might be the better option as we partition data in a cluster and optimize the access to a single network hop which minimal overhead over network latency.
Second question would be around serialization. Do you want to store information in a highly optimized serialization (which needs code to transform to human readable) or do you want to store as JSON?
Third question is about the number of clients and threads that'll access the data storage. Obviously a local cache like ehcache is always the fastest option, for the tradeoff of lots and lots of memory. Apart from that the most important fact is the treading model the in-memory store uses. It's either multithreaded and nicely scaling or a single-thread concept which becomes a bottleneck when you exhaust this thread. It is to overcome with more processes but it's a workaround to utilize todays systems to the fullest.
In more general terms, each of your mentioned systems would do the job. The best tool however should be selected by a POC / prototype and your real world use case. The important bit is real world, as a single thread behaves amazing under low pressure (obviously way faster) but when exhausted will become a major bottleneck (again obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

Cassandra client code with high read throughput with row_cache optimization

Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over? I believe row_cache_size_in_mb is supposed to cache frequently used records in memory, but setting it to say 10MB seems to make no difference.
I tried cassandra-stress of course, but the highest read throughput it achieves with 1KB records (-col size=UNIFORM\(1000..1000\)) is ~15K/s.
With low numbers like above, I can easily write an in-memory hashmap based cache that will give me at least a million reads per second for a small working set size. How do I make cassandra do this automatically for me? Or is it not supposed to achieve performance close to an in-memory map even for a tiny working set size?
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over?
There are some solution for this scenario
One idea is to use row cache but be careful, any update/delete to a single column will invalidate the whole partition from the cache so you loose all the benefit. Row cache best usage is for small dataset and are frequently read but almost never modified.
Are you sure that your cassandra-stress scenario never update or write to the same partition over and over again ?
Here are my findings: when I enable row_cache, counter_cache, and key_cache all to sizable values, I am able to verify using "top" that cassandra does no disk I/O at all; all three seem necessary to ensure no disk activity. Yet, despite zero disk I/O, the throughput is <20K/s even for reading a single record over and over. This likely confirms (as also alluded to in my comment) that cassandra incurs the cost of serialization and deserialization even if its operations are completely in-memory, i.e., it is not designed to compete with native hashmap performance. So, if you want get native hashmap speeds for a small-working-set workload but expand to disk if the map grows big, you would need to write your own cache on top of cassandra (or any of the other key-value stores like mongo, redis, etc. for that matter).
For those interested, I also verified that redis is the fastest among cassandra, mongo, and redis for a simple get/put small-working-set workload, but even redis gets at best ~35K/s read throughput (largely independent, by design, of the request size), which hardly comes anywhere close to native hashmap performance that simply returns pointers and can do so comfortably at over 2 million/s.

How to distribute data and computation to maximize locality?

Please bear with me, this is a basic architectural question for my first attempt at a "big data" project, but I believe your answers will be of general interest to anyone who is starting out in this field.
I've googled and read the high-level descriptions of Kafka, Storm, Memcached, MongoDB, etc., but now that I'm ready to dig in to start designing my app, I still need some further insight on how in fact the data should be distributed and shared.
The performance of my app is critical, so one objective is to somehow maximize the locality of the data in the RAM of the machines doing the distributed calculations. I need advice for this part of the design.
If my app had some clear criteria for a priori sharding the data and distributing the calculations (such as geographical regions or company divisions) then the solution would be obvious. But unfortunately my app's data access patterns are dynamic and depend on the results of previous calculations.
My app is an analysis program with distinct stages. In the first stage, all the data is accessed once and a metric is calculated for each data object. In the second stage, a subset of the data objects may be accessed, with the probability of access being proportional to each data object's metric that was calculated in the previous stage. In the final stage, a relatively small subset of data objects will be accessed many times for many calculations.
At all stages, it is required that the calculations be distributed across several servers. The calculations are embarassingly parallel, and each distributed calculation only needs to access a few data objects. It is also required that the number of servers can be specified before the app runs (for example, run on one server, or run on fifty servers).
It seems to me that I need some mechanism that distributes the appropriate data objects to the appropriate compute servers, as opposed to just blindly fetching the data from some database service (whether centralized or distributed). Also, it seems to me that some sort of smart caching system might be appropriate, since the data access pattern depends on the previous calculations and cannot be predicted a priori. But as far as I can tell, Memcached is not such a system because the sharding is determined a priori.
I've read many times that the operating system cache performs better than any monkeying around that we may try. I think the ideal solution is that each compute server's RAM cache somehow captures the data objects' dynamic access patterns, but it's not clear to me how this would work with a NoSQL or Memcached service.
Thanks for bearing with me this far. I realize this is a basic question, but the answer eludes me so far. I can't resolve the dynamic access patterns of my app with the a priori sharding of the NoSQL/Memcached packages. Any advice would be greatly appreciated.
I recommend you to take a look at http://tarantool.org. Shard to maximize locality for the most common data access pattern, use Lua for local computations, and net.box to issue a remote RPC when calculation needs to continue on another node. All data is stored in RAM, if you write your computation code carefully it could take advantage of the Just In Time compiler.

What is the difference and how to choose between distributed queue and distributed computing platform?

there are many files need to process with two computers real-timely,I want to distribute them to the two computers and these tasks need to be completed as soon as possibile(means real-time processing),I am thinking about the below plan:
(1) distributed queue like Gearman
(2)distributed computing platform like hadoop/spark/storm/s4 and so on
I have two questions
(1)what is the advantage and disadvantage between (1) and (2)?
(2) How to choose in (2),hadoop?spark?storm?s4?or other?
thanks!
Maybe I have not described the question clearly. In most case,there are 1000-3000 files with the same format , these files are independent,you do not need to care their order,the size of one file maybe tens to hundreds of KB and in the future, the number of files and size of single file will rise. I have wrote a program , it can process the file and pick up the data and then store the data in mongodb. Now there are only two computers, I just want a solution that can process these files with the program quickly(as soon as possibile) and is easy to extend and maintain
distributed queue is easy to use in my case bur maybe hard to extend and maintain , hadoop/spark is to "big" in the two computers but easy to extend and maintain, which is better, i am confused.
It depends a lot on the nature of your "processing". Some dimensions that apply here are:
Are records independent from each other or you need some form of aggregation? i.e: do you need some pieces of data to go together? Say, all transactions from a single user account.
Is you processing CPU bound? Memory bound? FileSystem bound?
What will be persisted? How will you persist it?
Whenever you see new data, do you need to recompute any of the old?
Can you discard data?
Is the data somewhat ordered?
What is the expected load?
A good solution will depend on answers to these (and possibly others I'm forgetting). For instance:
If computation is simple but storage and retrieval is the main concern, you should maybe look into a distributed DB rather than either of your choices.
It could be that you are best served by just logging things into a distributed filesystem like HDFS and then run batch computations with Spark (should be generally better than plain hadoop).
Maybe not, and you can use Spark Streaming to process as you receive the data.
If order and consistency are important, you might be better served by a publish/subscribe architecture, especially if your load could be more than what your two servers can handle, but there are peak and slow hours where your workers can catch up.
etc. So the answer to "how you choose?" is "by carefully looking at the constraints of your particular problem, estimate the load demands to your system and picking the solution that better matches those". All of these solutions and frameworks dominate the others, that's why they are all alive and kicking. The choice is all in the tradeoffs you are willing/able to make.
Hope it helps.
First of all, dannyhow is right - this is not what real-time processing is about. There is a great book http://www.manning.com/marz/ which says a lot about lambda archtecture.
The two ways you mentioned serves completly different purposes and are connected to the definition of word "task". For example, Spark will take a whole job you got for him and divide it into "tasks", but the outcome of one task is useless for you, you still need to wait for whole job to finish. You can create small jobs working on the same dataset and use spark's caching to speed it up. But then you won't get much advantage from distribution (if they have to be run one after another).
Are the files big? Are there connected somehow to each other? If yes, I'd go with Spark. If no, distributed queue.

Memcached vs. Redis? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
We're using a Ruby web-app with Redis server for caching. Is there a point to test Memcached instead?
What will give us better performance? Any pros or cons between Redis and Memcached?
Points to consider:
Read/write speed.
Memory usage.
Disk I/O dumping.
Scaling.
Summary (TL;DR)
Updated June 3rd, 2017
Redis is more powerful, more popular, and better supported than memcached. Memcached can only do a small fraction of the things Redis can do. Redis is better even where their features overlap.
For anything new, use Redis.
Memcached vs Redis: Direct Comparison
Both tools are powerful, fast, in-memory data stores that are useful as a cache. Both can help speed up your application by caching database results, HTML fragments, or anything else that might be expensive to generate.
Points to Consider
When used for the same thing, here is how they compare using the original question's "Points to Consider":
Read/write speed: Both are extremely fast. Benchmarks vary by workload, versions, and many other factors but generally show redis to be as fast or almost as fast as memcached. I recommend redis, but not because memcached is slow. It's not.
Memory usage: Redis is better.
memcached: You specify the cache size and as you insert items the daemon quickly grows to a little more than this size. There is never really a way to reclaim any of that space, short of restarting memcached. All your keys could be expired, you could flush the database, and it would still use the full chunk of RAM you configured it with.
redis: Setting a max size is up to you. Redis will never use more than it has to and will give you back memory it is no longer using.
I stored 100,000 ~2KB strings (~200MB) of random sentences into both. Memcached RAM usage grew to ~225MB. Redis RAM usage grew to ~228MB. After flushing both, redis dropped to ~29MB and memcached stayed at ~225MB. They are similarly efficient in how they store data, but only one is capable of reclaiming it.
Disk I/O dumping: A clear win for redis since it does this by default and has very configurable persistence. Memcached has no mechanisms for dumping to disk without 3rd party tools.
Scaling: Both give you tons of headroom before you need more than a single instance as a cache. Redis includes tools to help you go beyond that while memcached does not.
memcached
Memcached is a simple volatile cache server. It allows you to store key/value pairs where the value is limited to being a string up to 1MB.
It's good at this, but that's all it does. You can access those values by their key at extremely high speed, often saturating available network or even memory bandwidth.
When you restart memcached your data is gone. This is fine for a cache. You shouldn't store anything important there.
If you need high performance or high availability there are 3rd party tools, products, and services available.
redis
Redis can do the same jobs as memcached can, and can do them better.
Redis can act as a cache as well. It can store key/value pairs too. In redis they can even be up to 512MB.
You can turn off persistence and it will happily lose your data on restart too. If you want your cache to survive restarts it lets you do that as well. In fact, that's the default.
It's super fast too, often limited by network or memory bandwidth.
If one instance of redis/memcached isn't enough performance for your workload, redis is the clear choice. Redis includes cluster support and comes with high availability tools (redis-sentinel) right "in the box". Over the past few years redis has also emerged as the clear leader in 3rd party tooling. Companies like Redis Labs, Amazon, and others offer many useful redis tools and services. The ecosystem around redis is much larger. The number of large scale deployments is now likely greater than for memcached.
The Redis Superset
Redis is more than a cache. It is an in-memory data structure server. Below you will find a quick overview of things Redis can do beyond being a simple key/value cache like memcached. Most of redis' features are things memcached cannot do.
Documentation
Redis is better documented than memcached. While this can be subjective, it seems to be more and more true all the time.
redis.io is a fantastic easily navigated resource. It lets you try redis in the browser and even gives you live interactive examples with each command in the docs.
There are now 2x as many stackoverflow results for redis as memcached. 2x as many Google results. More readily accessible examples in more languages. More active development. More active client development. These measurements might not mean much individually, but in combination they paint a clear picture that support and documentation for redis is greater and much more up-to-date.
Persistence
By default redis persists your data to disk using a mechanism called snapshotting. If you have enough RAM available it's able to write all of your data to disk with almost no performance degradation. It's almost free!
In snapshot mode there is a chance that a sudden crash could result in a small amount of lost data. If you absolutely need to make sure no data is ever lost, don't worry, redis has your back there too with AOF (Append Only File) mode. In this persistence mode data can be synced to disk as it is written. This can reduce maximum write throughput to however fast your disk can write, but should still be quite fast.
There are many configuration options to fine tune persistence if you need, but the defaults are very sensible. These options make it easy to setup redis as a safe, redundant place to store data. It is a real database.
Many Data Types
Memcached is limited to strings, but Redis is a data structure server that can serve up many different data types. It also provides the commands you need to make the most of those data types.
Strings (commands)
Simple text or binary values that can be up to 512MB in size. This is the only data type redis and memcached share, though memcached strings are limited to 1MB.
Redis gives you more tools for leveraging this datatype by offering commands for bitwise operations, bit-level manipulation, floating point increment/decrement support, range queries, and multi-key operations. Memcached doesn't support any of that.
Strings are useful for all sorts of use cases, which is why memcached is fairly useful with this data type alone.
Hashes (commands)
Hashes are sort of like a key value store within a key value store. They map between string fields and string values. Field->value maps using a hash are slightly more space efficient than key->value maps using regular strings.
Hashes are useful as a namespace, or when you want to logically group many keys. With a hash you can grab all the members efficiently, expire all the members together, delete all the members together, etc. Great for any use case where you have several key/value pairs that need to grouped.
One example use of a hash is for storing user profiles between applications. A redis hash stored with the user ID as the key will allow you to store as many bits of data about a user as needed while keeping them stored under a single key. The advantage of using a hash instead of serializing the profile into a string is that you can have different applications read/write different fields within the user profile without having to worry about one app overriding changes made by others (which can happen if you serialize stale data).
Lists (commands)
Redis lists are ordered collections of strings. They are optimized for inserting, reading, or removing values from the top or bottom (aka: left or right) of the list.
Redis provides many commands for leveraging lists, including commands to push/pop items, push/pop between lists, truncate lists, perform range queries, etc.
Lists make great durable, atomic, queues. These work great for job queues, logs, buffers, and many other use cases.
Sets (commands)
Sets are unordered collections of unique values. They are optimized to let you quickly check if a value is in the set, quickly add/remove values, and to measure overlap with other sets.
These are great for things like access control lists, unique visitor trackers, and many other things. Most programming languages have something similar (usually called a Set). This is like that, only distributed.
Redis provides several commands to manage sets. Obvious ones like adding, removing, and checking the set are present. So are less obvious commands like popping/reading a random item and commands for performing unions and intersections with other sets.
Sorted Sets (commands)
Sorted Sets are also collections of unique values. These ones, as the name implies, are ordered. They are ordered by a score, then lexicographically.
This data type is optimized for quick lookups by score. Getting the highest, lowest, or any range of values in between is extremely fast.
If you add users to a sorted set along with their high score, you have yourself a perfect leader-board. As new high scores come in, just add them to the set again with their high score and it will re-order your leader-board. Also great for keeping track of the last time users visited and who is active in your application.
Storing values with the same score causes them to be ordered lexicographically (think alphabetically). This can be useful for things like auto-complete features.
Many of the sorted set commands are similar to commands for sets, sometimes with an additional score parameter. Also included are commands for managing scores and querying by score.
Geo
Redis has several commands for storing, retrieving, and measuring geographic data. This includes radius queries and measuring distances between points.
Technically geographic data in redis is stored within sorted sets, so this isn't a truly separate data type. It is more of an extension on top of sorted sets.
Bitmap and HyperLogLog
Like geo, these aren't completely separate data types. These are commands that allow you to treat string data as if it's either a bitmap or a hyperloglog.
Bitmaps are what the bit-level operators I referenced under Strings are for. This data type was the basic building block for reddit's recent collaborative art project: r/Place.
HyperLogLog allows you to use a constant extremely small amount of space to count almost unlimited unique values with shocking accuracy. Using only ~16KB you could efficiently count the number of unique visitors to your site, even if that number is in the millions.
Transactions and Atomicity
Commands in redis are atomic, meaning you can be sure that as soon as you write a value to redis that value is visible to all clients connected to redis. There is no wait for that value to propagate. Technically memcached is atomic as well, but with redis adding all this functionality beyond memcached it is worth noting and somewhat impressive that all these additional data types and features are also atomic.
While not quite the same as transactions in relational databases, redis also has transactions that use "optimistic locking" (WATCH/MULTI/EXEC).
Pipelining
Redis provides a feature called 'pipelining'. If you have many redis commands you want to execute you can use pipelining to send them to redis all-at-once instead of one-at-a-time.
Normally when you execute a command to either redis or memcached, each command is a separate request/response cycle. With pipelining, redis can buffer several commands and execute them all at once, responding with all of the responses to all of your commands in a single reply.
This can allow you to achieve even greater throughput on bulk importing or other actions that involve lots of commands.
Pub/Sub
Redis has commands dedicated to pub/sub functionality, allowing redis to act as a high speed message broadcaster. This allows a single client to publish messages to many other clients connected to a channel.
Redis does pub/sub as well as almost any tool. Dedicated message brokers like RabbitMQ may have advantages in certain areas, but the fact that the same server can also give you persistent durable queues and other data structures your pub/sub workloads likely need, Redis will often prove to be the best and most simple tool for the job.
Lua Scripting
You can kind of think of lua scripts like redis's own SQL or stored procedures. It's both more and less than that, but the analogy mostly works.
Maybe you have complex calculations you want redis to perform. Maybe you can't afford to have your transactions roll back and need guarantees every step of a complex process will happen atomically. These problems and many more can be solved with lua scripting.
The entire script is executed atomically, so if you can fit your logic into a lua script you can often avoid messing with optimistic locking transactions.
Scaling
As mentioned above, redis includes built in support for clustering and is bundled with its own high availability tool called redis-sentinel.
Conclusion
Without hesitation I would recommend redis over memcached for any new projects, or existing projects that don't already use memcached.
The above may sound like I don't like memcached. On the contrary: it is a powerful, simple, stable, mature, and hardened tool. There are even some use cases where it's a little faster than redis. I love memcached. I just don't think it makes much sense for future development.
Redis does everything memcached does, often better. Any performance advantage for memcached is minor and workload specific. There are also workloads for which redis will be faster, and many more workloads that redis can do which memcached simply can't. The tiny performance differences seem minor in the face of the giant gulf in functionality and the fact that both tools are so fast and efficient they may very well be the last piece of your infrastructure you'll ever have to worry about scaling.
There is only one scenario where memcached makes more sense: where memcached is already in use as a cache. If you are already caching with memcached then keep using it, if it meets your needs. It is likely not worth the effort to move to redis and if you are going to use redis just for caching it may not offer enough benefit to be worth your time. If memcached isn't meeting your needs, then you should probably move to redis. This is true whether you need to scale beyond memcached or you need additional functionality.
Use Redis if
You require selectively deleting/expiring items in the cache. (You need this)
You require the ability to query keys of a particular type. eq. 'blog1:posts:*', 'blog2:categories:xyz:posts:*'. oh yeah! this is very important. Use this to invalidate certain types of cached items selectively. You can also use this to invalidate fragment cache, page cache, only AR objects of a given type, etc.
Persistence (You will need this too, unless you are okay with your cache having to warm up after every restart. Very essential for objects that seldom change)
Use memcached if
Memcached gives you headached!
umm... clustering? meh. if you gonna go that far, use Varnish and Redis for caching fragments and AR Objects.
From my experience I've had much better stability with Redis than Memcached
Memcached is multithreaded and fast.
Redis has lots of features and is very fast, but completely limited to one core as it is based on an event loop.
We use both. Memcached is used for caching objects, primarily reducing read load on the databases. Redis is used for things like sorted sets which are handy for rolling up time-series data.
This is too long to be posted as a comment to already accepted answer, so I put it as a separate answer
One thing also to consider is whether you expect to have a hard upper memory limit on your cache instance.
Since redis is an nosql database with tons of features and caching is only one option it can be used for, it allocates memory as it needs it — the more objects you put in it, the more memory it uses. The maxmemory option does not strictly enforces upper memory limit usage. As you work with cache, keys are evicted and expired; chances are your keys are not all the same size, so internal memory fragmentation occurs.
By default redis uses jemalloc memory allocator, which tries its best to be both memory-compact and fast, but it is a general purpose memory allocator and it cannot keep up with lots of allocations and object purging occuring at a high rate. Because of this, on some load patterns redis process can apparently leak memory because of internal fragmentation. For example, if you have a server with 7 Gb RAM and you want to use redis as non-persistent LRU cache, you may find that redis process with maxmemory set to 5Gb over time would use more and more memory, eventually hitting total RAM limit until out-of-memory killer interferes.
memcached is a better fit to scenario described above, as it manages its memory in a completely different way. memcached allocates one big chunk of memory — everything it will ever need — and then manages this memory by itself, using its own implemented slab allocator. Moreover, memcached tries hard to keep internal fragmentation low, as it actually uses per-slab LRU algorithm, when LRU evictions are done with object size considered.
With that said, memcached still has a strong position in environments, where memory usage has to be enforced and/or be predictable. We've tried to use latest stable redis (2.8.19) as a drop-in non-persistent LRU-based memcached replacement in workload of 10-15k op/s, and it leaked memory A LOT; the same workload was crashing Amazon's ElastiCache redis instances in a day or so because of the same reasons.
Memcached is good at being a simple key/value store and is good at doing key => STRING. This makes it really good for session storage.
Redis is good at doing key => SOME_OBJECT.
It really depends on what you are going to be putting in there. My understanding is that in terms of performance they are pretty even.
Also good luck finding any objective benchmarks, if you do find some kindly send them my way.
If you don't mind a crass writing style, Redis vs Memcached on the Systoilet blog is worth a read from a usability standpoint, but be sure to read the back & forth in the comments before drawing any conclusions on performance; there are some methodological problems (single-threaded busy-loop tests), and Redis has made some improvements since the article was written as well.
And no benchmark link is complete without confusing things a bit, so also check out some conflicting benchmarks at Dormondo's LiveJournal and the Antirez Weblog.
Edit -- as Antirez points out, the Systoilet analysis is rather ill-conceived. Even beyond the single-threading shortfall, much of the performance disparity in those benchmarks can be attributed to the client libraries rather than server throughput. The benchmarks at the Antirez Weblog do indeed present a much more apples-to-apples (with the same mouth) comparison.
I got the opportunity to use both memcached and redis together in the caching proxy that i have worked on , let me share you where exactly i have used what and reason behind same....
Redis >
1) Used for indexing the cache content , over the cluster . I have more than billion keys in spread over redis clusters , redis response times is quite less and stable .
2) Basically , its a key/value store , so where ever in you application you have something similar, one can use redis with bothering much.
3) Redis persistency, failover and backup (AOF ) will make your job easier .
Memcache >
1) yes , an optimized memory that can be used as cache . I used it for storing cache content getting accessed very frequently (with 50 hits/second)with size less than 1 MB .
2) I allocated only 2GB out of 16 GB for memcached that too when my single content size was >1MB .
3) As the content grows near the limits , occasionally i have observed higher response times in the stats(not the case with redis) .
If you ask for overall experience Redis is much green as it is easy to configure, much flexible with stable robust features.
Further , there is a benchmarking result available at this link , below are few higlight from same,
Hope this helps!!
Test. Run some simple benchmarks. For a long while I considered myself an old school rhino since I used mostly memcached and considered Redis the new kid.
With my current company Redis was used as the main cache. When I dug into some performance stats and simply started testing, Redis was, in terms of performance, comparable or minimally slower than MySQL.
Memcached, though simplistic, blew Redis out of water totally. It scaled much better:
for bigger values (required change in slab size, but worked)
for multiple concurrent requests
Also, memcached eviction policy is in my view, much better implemented, resulting in overall more stable average response time while handling more data than the cache can handle.
Some benchmarking revealed that Redis, in our case, performs very poorly. This I believe has to do with many variables:
type of hardware you run Redis on
types of data you store
amount of gets and sets
how concurrent your app is
do you need data structure storage
Personally, I don't share the view Redis authors have on concurrency and multithreading.
Another bonus is that it can be very clear how memcache is going to behave in a caching scenario, while redis is generally used as a persistent datastore, though it can be configured to behave just like memcached aka evicting Least Recently Used items when it reaches max capacity.
Some apps I've worked on use both just to make it clear how we intend the data to behave - stuff in memcache, we write code to handle the cases where it isn't there - stuff in redis, we rely on it being there.
Other than that Redis is generally regarded as superior for most use cases being more feature-rich and thus flexible.
It would not be wrong, if we say that redis is combination of (cache + data structure) while memcached is just a cache.
A very simple test to set and get 100k unique keys and values against redis-2.2.2 and memcached. Both are running on linux VM(CentOS) and my client code(pasted below) runs on windows desktop.
Redis
Time taken to store 100000 values is = 18954ms
Time taken to load 100000 values is = 18328ms
Memcached
Time taken to store 100000 values is = 797ms
Time taken to retrieve 100000 values is = 38984ms
Jedis jed = new Jedis("localhost", 6379);
int count = 100000;
long startTime = System.currentTimeMillis();
for (int i=0; i<count; i++) {
jed.set("u112-"+i, "v51"+i);
}
long endTime = System.currentTimeMillis();
System.out.println("Time taken to store "+ count + " values is ="+(endTime-startTime)+"ms");
startTime = System.currentTimeMillis();
for (int i=0; i<count; i++) {
client.get("u112-"+i);
}
endTime = System.currentTimeMillis();
System.out.println("Time taken to retrieve "+ count + " values is ="+(endTime-startTime)+"ms");
One major difference that hasn't been pointed out here is that Memcache has an upper memory limit at all times, while Redis does not by default (but can be configured to). If you would always like to store a key/value for certain amount of time (and never evict it because of low memory) you want to go with Redis. Of course, you also risk the issue of running out of memory...
Memcached will be faster if you are interested in performance, just even because Redis involves networking (TCP calls). Also internally Memcache is faster.
Redis has more features as it was mentioned by other answers.
The biggest remaining reason is specialization.
Redis can do a lot of different things and one side effect of that is developers may start using a lot of those different feature sets on the same instance. If you're using the LRU feature of Redis for a cache along side hard data storage that is NOT LRU it's entirely possible to run out of memory.
If you're going to setup a dedicated Redis instance to be used ONLY as an LRU instance to avoid that particular scenario then there's not really any compelling reason to use Redis over Memcached.
If you need a reliable "never goes down" LRU cache...Memcached will fit the bill since it's impossible for it to run out of memory by design and the specialize functionality prevents developers from trying to make it so something that could endanger that. Simple separation of concerns.
We thought of Redis as a load-takeoff for our project at work. We thought that by using a module in nginx called HttpRedis2Module or something similar we would have awesome speed but when testing with AB-test we're proven wrong.
Maybe the module was bad or our layout but it was a very simple task and it was even faster to take data with php and then stuff it into MongoDB. We're using APC as caching-system and with that php and MongoDB. It was much much faster then nginx Redis module.
My tip is to test it yourself, doing it will show you the results for your environment. We decided that using Redis was unnecessary in our project as it would not make any sense.
Redis is better.
The Pros of Redis are ,
It has a lot of data storage options such as string , sets , sorted sets , hashes , bitmaps
Disk Persistence of records
Stored Procedure (LUA scripting) support
Can act as a Message Broker using PUB/SUB
Whereas Memcache is an in-memory key value cache type system.
No support for various data type storages like lists , sets as in
redis.
The major con is Memcache has no disk persistence .
Here is the really great article/differences provided by Amazon
Redis is a clear winner comparing with memcached.
Only one plus point for Memcached
It is multithreaded and fast. Redis has lots of great features and is very fast, but limited to one core.
Great points about Redis, which are not supported in Memcached
Snapshots - User can take a snapshot of Redis cache and persist on
secondary storage any point of time.
Inbuilt support for many data structures like Set, Map, SortedSet,
List, BitMaps etc.
Support for Lua scripting in redis

Resources