Is Redis just a cache?

I have been reading some Redis docs and trying the tutorial at http://try.redis-db.com/. So far, I can't see any difference between Redis and caching technologies like Velocity or the Enterprise Library Caching Framework.
You're effectively just adding objects to an in-memory data store using a unique key. There do not seem to be any relational semantics...
What am I missing?

No, Redis is much more than a cache.
Like a cache, Redis stores key=value pairs. But unlike a cache, Redis lets you operate on the values. There are 5 data types in Redis - Strings, Hashes, Lists, Sets and Sorted Sets. Each data type exposes various operations.
The best way to understand Redis is to model an application without thinking about how you are going to store it in a database.
Let's say we want to build StackOverflow.com. To keep it simple, we need Questions, Answers, Tags and Users.
Modeling Questions, Users and Answers
Each object can be modeled as a map. For example, a Question is a map with fields {id, title, date_asked, votes, asked_by, status}. An Answer is a map with fields {id, question_id, answer_text, answered_by, votes, status}, and a User object can be modeled the same way.
Each of these objects can be stored directly in Redis as a Hash. To generate unique ids, you can use the atomic increment command. Something like this:
$ HINCRBY unique_ids question 1
(integer) 1
$ HMSET question:1 title "Is Redis just a cache?" asked_by 12 votes 0
OK
$ HINCRBY unique_ids answer 1
(integer) 1
$ HMSET answer:1 question_id 1 answer_text "No, its a lot more" answered_by 15 votes 1
OK
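From application code the same modeling is just as direct. Here is a rough sketch assuming the Python redis-py client (HSET with a mapping is the modern equivalent of the HMSET shown above):

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Atomically grab the next question id, then store the question as a hash.
qid = r.hincrby("unique_ids", "question", 1)
r.hset(f"question:{qid}", mapping={
    "title": "Is Redis just a cache?",
    "asked_by": 12,
    "votes": 0,
})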
Handling Up Votes
Now, every time someone upvotes a question or an answer, you just need to do this:
$ HINCRBY question:1 votes 1
(integer) 1
$ HINCRBY question:1 votes 1
(integer) 2
List of Questions for Homepage
Next, we want to store the most recent questions to display on the home page. If you were writing a .NET or Java program, you would store the questions in a List. Turns out, that is the best way to store this in Redis as well.
Every time someone asks a question, we add its id to the list.
$ lpush questions question:1
(integer) 1
$ lpush questions question:2
(integer) 2
Now, when you want to render your homepage, you ask Redis for the most recent 25 questions.
$ lrange questions 0 24
1) "question:100"
2) "question:99"
3) "question:98"
4) "question:97"
5) "question:96"
...
25) "question:76"
Now that you have the ids, retrieve items from Redis using pipelining and show them to the user.
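A rough sketch of that fetch, assuming the Python redis-py client (key names follow the convention above); the pipeline batches all the HGETALL calls into a single round trip:

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Ids of the 25 most recent questions, newest first.
question_keys = r.lrange("questions", 0, 24)

# Fetch every question hash in one network round trip.
pipe = r.pipeline()
for key in question_keys:
    pipe.hgetall(key)
questions = pipe.execute()  # a list of dicts, one per question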
Questions by Tags, Sorted by Votes
Next, we want to retrieve questions for each tag. But SO allows you to see top voted questions, new questions or unanswered questions under each tag.
To model this, we use Redis' Sorted Set feature. A Sorted Set allows you to associate a score with each element. You can then retrieve elements based on their scores.
Let's go ahead and do this for the Redis tag:
$ zadd questions_by_votes_tagged:redis 2 question:1
(integer) 1
$ zadd questions_by_votes_tagged:redis 10 question:2
(integer) 1
$ zadd questions_by_votes_tagged:redis 5 question:613
(integer) 1
$ zrange questions_by_votes_tagged:redis 0 5
1) "question:1"
2) "question:613"
3) "question:2"
$ zrevrange questions_by_votes_tagged:redis 0 5
1) "question:2"
2) "question:613"
3) "question:1"
What did we do over here? We added questions to a sorted set, and associated a score (the number of votes) with each question. Each time a question gets upvoted, we increment its score. And when a user clicks "Questions tagged Redis, sorted by votes", we just do a zrevrange and get back the top questions.
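As a hedged sketch with the Python redis-py client, reusing the key names above, an upvote plus the "sorted by votes" read looks like this:

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Upvote question:1 under the redis tag: bump its score by 1.
r.zincrby("questions_by_votes_tagged:redis", 1, "question:1")

# Top 5 questions for the tag, highest score first.
top = r.zrevrange("questions_by_votes_tagged:redis", 0, 4, withscores=True)
# e.g. [(b'question:2', 10.0), (b'question:613', 5.0), (b'question:1', 3.0)]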
Realtime Questions Without Refreshing the Page
And finally, a bonus feature. If you keep the questions page open, SO will notify you when a new question is added. How can Redis help here?
Redis has a pub-sub model. You can create channels, for example "channel_questions_tagged_redis". You then subscribe users to a particular channel. When a new question is added, you publish a message to that channel, and all subscribed users get it. You will have to use a web technology like WebSockets or Comet to actually deliver the message to the browser, but Redis helps you with all the plumbing on the server side.
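A minimal sketch of both sides, assuming the Python redis-py client; the channel name follows the paragraph above, and notify_browser is just a stand-in for your WebSocket/Comet delivery:

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def notify_browser(data):
    print(data)  # stand-in for the WebSocket/Comet push to the client

# Publisher side: announce a new question on the tag's channel.
r.publish("channel_questions_tagged_redis", "question:101")

# Subscriber side: a server process listens and forwards to the browser.
pubsub = r.pubsub()
pubsub.subscribe("channel_questions_tagged_redis")
for message in pubsub.listen():
    if message["type"] == "message":
        notify_browser(message["data"])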
Persistence, Reliability etc.
Unlike a plain cache, Redis can persist data to disk. You can also have a master-slave setup to provide better reliability. To learn more, go through the Persistence and Replication topics here: http://redis.io/documentation

Not just a cache.
In-memory key-value storage
Supports multiple data types (strings, hashes, lists, sets, sorted sets, bitmaps, and hyperloglogs)
Can persist cached data to physical storage (if needed)
Supports the pub-sub model
Provides replication for high availability (master/slave)

Redis has unique abilities like ultra-fast Lua scripting, with execution speed comparable to that of native commands. Scripting also brings atomicity to the sophisticated data manipulation required by many advanced objects like locks and semaphores.
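For a flavor of that atomicity, here is a hedged sketch with the Python redis-py client: one Lua script bumps a question's vote counter and its score in a tag's sorted set (key names borrowed from the first answer), and no client can ever observe the two updates half-done:

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

UPVOTE = """
redis.call('hincrby', KEYS[1], 'votes', 1)
return redis.call('zincrby', KEYS[2], 1, KEYS[1])
"""
upvote = r.register_script(UPVOTE)  # loads the script and returns a callable

# Both updates run as a single atomic unit on the server.
new_score = upvote(keys=["question:1", "questions_by_votes_tagged:redis"])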
There is a Redis-based in-memory data grid called Redisson which makes it easy to build distributed applications in Java, thanks to distributed Lock, Semaphore, ReadWriteLock, CountDownLatch, ConcurrentMap objects and many others.
It works well in the cloud and supports AWS ElastiCache, AWS ElastiCache Cluster and Azure Redis Cache.

Actually there is no dependency between relational data representation (or any other type of data representation) and database role (cache, permanent persistence, etc.).
Redis is good as a cache, it's true, but it's much more than just a cache. It's a high-speed, fully in-memory database that can also persist data on disk. It's not relational; it's a key-value store.
We use it in production. Redis helps us build software that handles thousands of requests per second and keeps customer business data throughout its whole natural lifecycle.

Redis is a cache that is best suited for distributed environments and microservice architectures.
It is fast, reliable, provides atomicity and consistency, and has a range of data types such as sets, hashes, lists, etc.
I have been using it for the past year, and it really comes to the rescue when you need to deliver a production-ready solution quickly, as well as for performance-related issues, since you can always use it to cache data.

Redis supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
Implementation with Python:
https://beyondexperiment.com/vijayravichandran06/redis-data-structure-with-python/

Usages of Redis:
Cache with multiple data structures, like string, set, zset, list, hash and bitmap (which can be used in many aggregation use cases)
KV DB. Data in Redis memory can be stored on disk: RDB takes snapshots and AOF keeps an append-only log of edits.
Message queue. But each message can only be consumed by one consumer.
Pub/sub
Distributed lock. This relies on the SETNX command; only the first client to execute it successfully holds the lock (see the sketch below). https://redis.io/commands/setnx
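A hedged sketch of that lock pattern with the Python redis-py client (SET with nx=True is the modern equivalent of SETNX, and ex gives the lock a timeout so a crashed holder cannot block everyone forever; the key and value are made up):

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Only the first client to set the key wins the lock.
acquired = r.set("lock:report-job", "worker-42", nx=True, ex=30)
if acquired:
    try:
        pass  # ... critical section ...
    finally:
        # Simplified release; a safe release should verify ownership first
        # (e.g. with a small Lua script) before deleting the key.
        r.delete("lock:report-job")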

It is not just a key-value cache, it is a key-to-data-structure cache.
Redis is not only a cache, but also a data store: whatever is written to the cache can also be written to disk. That allows us to take backups, and it allows us to restart our cache nodes; when they restart, they are prepopulated from the backup, and we can even restart the entire cluster. In Memcached, by contrast, when a node fails or restarts, all keys stored on that node are lost.
Redis is also used as a message queue.

In addition, Redis has capabilities beyond caching. Based on the latest Redis documentation (https://redis.io/docs/modules/), Redis has external modules that support different kinds of tasks, such as:
Redis Search, full-text search capability
Redis Graph, graph database on top of Redis
Redis Time Series, module that adds a time series data structure to Redis.
Redis AI, module for running machine learning models inside Redis
Neural Network for Redis, neural networks module for Redis
etc.
Personally, I have used Redis as a message queue (as the Celery broker for a Django REST Framework application) in addition to caching in production.

It's a key-value datastore, mainly deployed in a private subnet in conjunction with cloud databases to provide microsecond latency. It can do that with either a lazy-loading or a write-through strategy, depending on the specific use case.
It is far more capable than Memcached and operates in cluster-enabled or cluster-disabled mode.
It supports sharding and multi-AZ deployment, which makes data highly available.
It supports encryption of data at rest and in transit,
and is extremely useful for use cases such as streaming applications, messaging, real-time analytics, and applications where the value of data depreciates very quickly over time.
Hence it's not just a cache; it brings a lot more features with it, which makes it all the more useful.

Besides being a cache server, Redis is specifically a data structure server.
Being a cache in the form of a data structure server means a lot, because data structures are the fundamentals of programs and applications. If you are using a SQL database as your storage technology and need to construct a list, a hash map, a ranking set or things like that, it's a pain in the neck. Redis provides these capabilities directly and very simply, which greatly simplifies development.
On the other hand, a data structure server does not have to be in the form of a cache. There are projects that are compatible with Redis but have persistent storage engines.
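For instance, a capped "recently viewed" list, which takes real effort to maintain in a relational schema, is two commands in Redis. A rough sketch with the Python redis-py client (the key name is invented for illustration):

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def record_view(user_id, item_id):
    """Keep a per-user list of the 10 most recently viewed items."""
    key = f"recently_viewed:{user_id}"
    r.lpush(key, item_id)  # newest first
    r.ltrim(key, 0, 9)     # drop everything beyond the 10 newest

record_view(42, 1001)
print(r.lrange("recently_viewed:42", 0, -1))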

In addition to the answers made so far, and to summarize:
Redis is a very fast non-relational database that stores a mapping of keys to several different types of values (strings, hashes, lists, sets, sorted sets, bitmaps, and hyperloglogs). This is explained in detail in Sripathi Krishnan's answer.
Redis supports persistence of its in-memory data to disk
Replication to scale read performance
Client-side sharding to scale write performance
If you want more detailed and in-depth information about Redis, you can look at the books Redis in Action and Redis Essentials.

Related

Difference between In-Memory cache and In-Memory Database

I was wondering if I could get an explanation between the differences between In-Memory cache(redis, memcached), In-Memory data grids (gemfire) and In-Memory database (VoltDB). I'm having a hard time distinguishing the key characteristics between the 3.
Cache - By definition, it is stored in memory. Any data stored in memory (RAM) for faster access is called a cache. Examples: Ehcache, Memcached. Typically you put an object in the cache with a String as the key and access the cache using that key. It is very straightforward. It is up to the application to decide when to hit the cache vs. the database, and no complex processing happens in the cache. If the cache spans multiple machines, it is called a distributed cache. For example, Netflix uses EVCache, which is built on top of Memcached, to store the movie recommendations you see on the home screen.
In-Memory Database - It has all the features of a cache plus some processing/querying capabilities. Redis falls under this category. Redis supports multiple data structures, and you can query the data in Redis (for example: get the last 10 accessed items, get the most used item, etc.). It can span multiple machines, is usually very performant, and also supports persistence to disk if needed. For example, Twitter uses Redis to store timeline information.
I don't know about GemFire and VoltDB, but even memcached and Redis are very different. Memcached is really simple caching: a place to store variables in a very simple fashion, and then retrieve them so you don't have to go to a file or database lookup every time you need that data. The variable types are very simple. Redis, on the other hand, is actually an in-memory database with a very interesting selection of data types. It has a wonderful data type for doing sorted lists, which works great for applications such as leaderboards. You add your new record to the data, and it gets sorted automagically.
So I wouldn't get too hung up on the categories. You really need to examine each tool differently to see what it can do for you, and the application you're building. It's kind of like trying to draw comparisons on nosql databases - they are all very different, and do different things well.
I would add that things in the "database" category tend to have more features to protect and replicate your data than a simple "cache". A cache is (usually) temporary, whereas database data should be persistent. Many cache solutions I've seen do not persist to disk, so if you lost power to your whole cluster, you'd lose everything in cache.
But there are some cache solutions that have persistence and replication features too, so the line is blurry.
An in-memory cache is a common query store and therefore relieves the DB of read workloads. A common example of an in-memory cache is Redis. An example could be a website storing popular searches made by clients, thereby relieving the DB of some load.
An in-memory cache provides query functionality on top of caching (storing session data in RAM (temporary storage)).
Memcached falls into the temporary-store caching category.

Proper strategy for Redis caching relational data

We have the following use case example:
We have users, stores, friends (relationships between users) and likes. We store these tables in MySQL and also as key-value stores in Redis, so we can read from the Redis cache and not hit the database. Writes are done to both data stores.
Our app is therefore VERY fast, and scalable since we rarely hit the database for reads. We are using AWS for scalable Redis.
However, we have a problem when a user is logged in and we have to show a list of stores, AND which of his friends like each store. This is a join, and Redis does not support joins directly. We'd like to know the best way to store and show this data. For example: should it be stored in a Redis structure whose key is "store/user_who likes" and maintained on every write, or maybe built by an hourly cron job? Should we then read the already-stored data, or should we construct this join on demand?
We notice that not even Facebook updates this info in realtime, but rather it takes several minutes for a friend to see which of my friends likes a page we have in common.
Thanks in advance for any responses.
Depends how important it is to you. Why not store each person's friends as a set, and each store's likes as a set, and then when you need the friends who like a given store, you just take the SINTER (set intersection) between the two. Should be fast, and storing friends and store likes as sets will get you a lot of similarly nice operations as well. Not sure how you're currently using Redis cache, but you could use these as a likely cheaper memory replacement as well for getting users' friends, stores' likes, etc...
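A hedged sketch of that idea with the Python redis-py client (key names are invented for illustration):

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Writes: one set per user (their friends) and one per store (its likers).
r.sadd("friends:42", 7, 13, 99)
r.sadd("store_likes:117", 13, 55, 99)

# Read: which of user 42's friends like store 117? One set intersection.
friends_who_like = r.sinter("friends:42", "store_likes:117")  # e.g. {b'13', b'99'}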
As for cron, not sure how that would help. Redis is more than fast enough to handle the above sorts of writes. Memory will be your bottleneck first.

Handle huge data imported from facebook

I'm currently creating a program that imports all the groups and feeds from Facebook that the user wants.
I have been using the Graph API with OAuth and this works very well.
But I have come to the point of realizing that one request can't handle the import of 1000 groups plus their feeds.
So I'm looking for a solution that imports this data in the background (like a cron job) into a database.
Requirements
Runs in background
Runs under Linux
RESTful
Questions
What's your experience with this?
Would Hadoop be the right solution?
You can use neo4j.
Neo4j is a graph database, reliable and fast for managing and querying highly connected data
http://www.neo4j.org/
1) Decide the structure of nodes, relationships, and their properties; accordingly,
you need to create an API that will get data from Facebook and store it in Neo4j.
I have used Neo4j in 3 big projects, and it is best for graph data.
2) Create a cron job that will get data from Facebook and store it in Neo4j.
I think using MySQL as a graph database is not a good idea; for large data, Neo4j is the better option.
Interestingly, you have already designed the appropriate solution yourself. So in fact you need the following components:
a relational database, since you want to request data in a structured, quick way
-> from experience I would stress having a fully normalized data model (in your case with tables users, groups, users2groups), and using 4-byte surrogate keys rather than the larger keys from Facebook (for back-referencing you can store their keys as attributes, but internal relations are more efficient on surrogate keys)
-> establish indexes based on hashes rather than strings (e.g. crc32(lower(STRING))) - an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING))
-> never, ever establish unique columns based on strings longer than 8 bytes; unique bulk inserts can be done based on hash+string checking via insert...select
-> once you have that settled you could also look into sparse matrices (see Wikipedia) and bitmaps to optimize users2groups (however, I have learned that this is an extra that should not keep you from coming up with a first version soon)
a cron job that is run periodically
-> ideally aligned with the caps Facebook gives you (so if they rule that you may not request more often than once per second, stick to that - no more, but also try to come as close as possible to the cap); invest some time in getting the management of this settled if different types of requests need to be fired (requests for user records <> requests for group records, but maybe hit by the same cap)
-> most of the optimization can only be done during development - so if I were you I would stick to a high-level programming language that does not bother too much with variable type juggling and that comes with broad support for associative arrays, such as PHP, and I would program the thing myself
-> I have had good experiences with setting up the cron job as a web page with output buffering deactivated (for PHP look at ob_end_flush()) - it is easy to test and the job can be triggered via curl; if you channel status output through your own function (e.g. with timestamps) it also becomes flexible enough to run either via browser or via the command line, which means efficient testing + efficient production running
your user UI, which only queries your database and never, ever, never the external system's API
lots of memory, to keep your performance high (optimal: all your data+index data fits into database memory/cache dedicated to the database)
-> if you use MySQL as the database you should look into innodb_flush_log_at_trx_commit=0 and innodb_buffer_pool_size (just Google them if interested)
Hadoop is a file system layer - it could help you with availability. However, I would put this in the category of "sparse matrix": nothing that stops you from coming up with a solution. From my experience, availability is not a primary constraint in data exposure projects.
-------------------------- UPDATE -------------------
I like Neo4j from the other answer, so I wondered what I could learn for my future projects. My experience with MySQL is that RAM is usually the biggest constraint: increasing your RAM so you can load the full database can gain you performance improvements by a factor of 2-1000, depending on where you are coming from. Everything else, such as index improvements and structure, somehow follows. So if I had to make a performance prioritization list, it would be something like this:
MYSQL + enough RAM dedicated to the database to load all data
NEO4J + enough RAM dedicated to the database to load all data
I would still prefer MySQL. It stores records efficiently, but needs to run joins to derive relations (which Neo4j does not require to that extent). Join costs are usually low with the right indexes, and according to http://docs.neo4j.org/chunked/milestone/configuration-caches.html Neo4j needs to add extra management data for the property separation. For big-data projects that management data adds up, and in full-load-to-memory setups it requires you to buy more memory. Performance-wise, both options are the ultimate. Further, much further down the line, you would find this:
NEO4J + not enough RAM dedicated to the database to load all data
MYSQL + not enough RAM dedicated to the database to load all data
In the worst case MySQL will even put indexes on disk (at least partly), which can result in massive read delays. In comparison, with Neo4j you could perform a 'direct jump from node to node' exercise, which should - at least in theory - be faster.

Couchbase as a cache and cache invalidation

I'm thinking about using Couchbase as a cache layer. I'm aware of the many advantages provided by Couchbase, like the easy scalability. But what interests me more is the rich document model of couchbase, compared to the simple key-value one of memcached.
My RDBMS is SQL Server, and we use NHibernate. The queries and the database are already quite optimized and I think that caching is the best option for further scaling.
My project is to implement a simple relational model between entities (much simpler than the one in the RDBMS), to handle invalidation. When an entity is invalidated (removed from cache) by the application, all dependent entities can also be removed. The logic of defining the dependencies between entities would be handled at the application level by a dedicated component. There would be 10 or 12 different entities (I don't want to cache my whole application domain).
My document model in Couchbase would look like this:
Key (the one generated by the application), keys' format depends on entity type
Hashed key (to have a uniform unique key across all entities)
Entity
Dependencies - list of hashed keys of the entities that must be removed when the main entity is removed
So my questions are:
On invalidation, we would need to resolve a graph of dependencies (asynchronously). Is it fast to look for specific keys with around 500k entities?
Any feedback on the general idea?
Maintaining the dependencies between entities can be quite simplified, and might not be such a big issue.
Pierre
I use Couchbase 2.2 in production as a persistent cache layer and am really happy with it (running about 2M documents). My app gets really fast reads (1 millisecond). Your idea is valid and I don't see anything wrong with using Couchbase as entity storage for invalidation. It's a mature and very stable product.
You are correct in your entity design. You can have a main JSON doc that holds a list of references to the child documents, so that before deleting the main document you delete all the children first.
Also, not sure if it's applicable in your case, but you can take advantage of Couchbase's ability to expire documents. When you insert a key/value (JSON doc) you can specify a TTL (time to live) if you know it upfront. This way you don't need to explicitly delete entities from Couchbase.
The delete operation itself is fast (you can run it as an asynchronous operation), and 500K documents is a really small size for a Couchbase cluster. You should see get operations under 1 millisecond.
But consider having a minimum of 3 Couchbase nodes in one cluster, so that you can take one node down at any point in time without compromising the data stored in the cluster. See Sizing a Couchbase Server 2.0 cluster.
Some additional resources:
10 things developers should know about Couchbase
Top 10 things an Ops / Sys admin must know about Couchbase
App Development with Documents, their Schemas and Relationships
Couchbase Models
Here are my thoughts:
On invalidation, we would need to resolve a graph of dependencies (asynchronously). Is it fast to look for specific keys with around 500k entities?
Are you looking for keys in your RDBMS or in CB? If in CB, you will need to use a view/index; now, views are disk-based, but stored in sorted order so they are no slower than SQL indices. Accessing them in parallel will be faster than in series. It will be the slow point in your operation though if you use CB.
Continuing along with this thought, I have used CB successfully to store and navigate a hierarchical data structure with 500k+ nodes in it. CB performs well, but does take a few seconds to spit out the whole index if I need it (which I do if I need to do a mass-update operation).
Any feedback on the general idea?
The idea is sound. In fact, I'm seeing 10x the performance of SQL with hierarchical queries when I run them on my Couchbase cluster. I also found that a single Couchbase instance outperforms multiple instances when doing an index lookup - I do not know why that is (the 2-instance CB index is 5x faster than my SQL setup). To speed things up further, you can parallelize the queries to the CB index.

Caching strategy suggestions needed

We have a fantasy football application that uses memcached and the classic memcached-object-read-with-sql-server-fallback. This works fairly well, but recently I've been contemplating the overhead involved and whether or not this is the best approach.
Case in point - we need to generate a drop-down list of the user's teams, so we follow this pattern:
Get a list of the user's teams from memcached.
If not available, get the list from SQL Server and store it in memcached.
Do a multiget to get the team objects.
Fall back to loading the objects from SQL and store them in memcached.
This is all very well - each cached piece of data is relatively easily cached and invalidated, but there are two major downsides to this:
1) Because we are operating on objects, we incur a rather large overhead - a single team occupies a few hundred bytes in memcached, and all we really need in this case is a list of team names and ids - not all the other stuff in the team objects.
2) Due to the fallback to loading individual objects, the number of SQL queries generated on an empty cache or when the items expire can be massive:
1 x Memcached multiget (which misses, and which causes the following)
1 x SELECT ... FROM Team WHERE Id IN (...)
20 x Store in memcached
So that's 21 network requests just for this one query, and the IN query is also slower than a specific join.
Obviously we could just do a simple
SELECT Id, Name FROM Teams WHERE UserId = XYZ
And cache that result, but this would mean that the data would need to be specifically invalidated whenever the user creates a new team. In this case it might seem relatively simple, but we have many of these types of queries, and many of them operate on axes that are not easily invalidated (like a list of the ids and names of the teams that your friends have created in a specific game).
Sooo... My question is - do any of you have ideas for resolving the mentioned drawbacks, or should I just accept that there is an overhead, that cache misses are bad, and live with it?
First, cache what you need - maybe just those two fields, not the complete record.
Second, cache what you need again: break the result set into records and cache them separately.
About caching:
You generally use caching to offload the slower disk-based storage, in this case MySQL. The memory cache scales up rather easily; MySQL scales less easily.
Given that, even if you double the CPU/network/memory usage of the cache by putting it all together again, it will still offload the DB. Adding another Node.js instance or another memcached server is easy.
Back to your question:
You say it's a user's teams; you could fetch them when the user logs in and keep them updated in the cache as the user changes them throughout the session.
I presume the team members' names do not change; if so, you can load all team members by id and name and store those in the cache, or even locally on Node.js, using the same fallback strategy as you do now. Only steps 1, 2 and 4 will be left then.
Personally, I usually try to split the SQL results into smaller ready-made pieces and cache those, and keep the cache updated for as long as possible, ultimately trying to use MySQL only as storage and never read from it.
Usually you will run some logic on the rows returned from MySQL anyway; there's no need to keep repeating that.
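As a rough sketch of the "cache only what you need" idea, using Redis purely for illustration (the key name is invented, and db stands in for whatever SQL access layer you use):

import json
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def team_names_for_user(user_id, db):
    """Cache-aside: store just the (id, name) pairs, not whole team objects."""
    key = f"user:{user_id}:team_names"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    rows = db.execute(
        "SELECT Id, Name FROM Teams WHERE UserId = ?", (user_id,)
    ).fetchall()
    rows = [list(row) for row in rows]    # make the rows JSON-friendly
    r.set(key, json.dumps(rows), ex=300)  # expire, or DEL explicitly on team creation
    return rows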
