Will caching be useful when we need multiple items in one go?

We are working on an e-commerce site where an admin can store configuration for a combination of Product-Category-Manufacturer or of Product-Category.
We have some reports that can return 10,000 product transactions (with 100-1000 unique Product-Category-Manufacturer combinations).
These reports also need the configuration.
One option is to fetch the configurations for all unique Product-Category-Manufacturer combinations from the same stored procedure.
Another option is to cache all these combinations in an out-of-process cache (like Redis). Once the transaction data is fetched from the stored procedure, the system pulls the configuration from the cache for all 1000 Product-Category-Feature combinations. But in this case we have to query the cache 1000 times, and if some of the keys are not found in the cache, we have to hit the database.
In fact, there can be combinations for which no data exists in the database. If we request these combinations, the system will not find them in the cache and will have to hit the database every time. To resolve this, we would have to maintain a set of all the Product-Category-Feature combinations for which data is available in the cache.
Could anybody suggest whether a cache will be useful in this case?

We use caching mainly for two reasons:
To reduce latency: the cache is closer to the client, so it takes less time for the resource to reach the client.
To reduce network traffic: resources are often reusable but are fetched from the original source every time, which is costly and creates unnecessary traffic. Adding a cache layer solves this.
So to answer your question, "Will caching be useful when we need multiple items in one go?", you have to think about the two points above: how much you are reusing (the cache hit percentage), and the cost difference between a cache call and a call to the original source.
If your issue is getting 1000 items at once, Redis has no problem providing that. It will be much faster than the transactional DB. You can also keep a set of all the Product-Category-Feature combinations; that is better, as we will then have no cache misses. However, think about the size of the Redis DB before you proceed.
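As a rough illustration of fetching many keys in one go, here is a minimal sketch. It assumes a client that offers Redis-style mget and mset calls returning strings (with redis-py that would mean decode_responses=True). FakeCache, the "cfg:" key prefix, the MISSING sentinel and fetch_from_db are all made-up names for this example, not part of the question. The sentinel is one possible way to remember combinations that have no configuration in the database, so they stop causing a database hit on every report.

```python
import json

MISSING = "__none__"  # hypothetical sentinel: "no configuration exists in the DB"


class FakeCache:
    """In-memory stand-in exposing the two Redis-style calls used below."""

    def __init__(self):
        self.data = {}

    def mget(self, keys):
        return [self.data.get(k) for k in keys]

    def mset(self, mapping):
        self.data.update(mapping)


def cache_key(combo):
    # combo is a tuple such as (product_id, category_id, manufacturer_id)
    return "cfg:" + ":".join(map(str, combo))


def load_configs(cache, combos, fetch_from_db):
    """Return {combo: config or None} using one batched cache read."""
    combos = list(combos)
    cached = cache.mget([cache_key(c) for c in combos])  # one round trip

    result, misses = {}, []
    for combo, raw in zip(combos, cached):
        if raw is None:
            misses.append(combo)          # never seen: must ask the database
        elif raw == MISSING:
            result[combo] = None          # known to be absent in the database
        else:
            result[combo] = json.loads(raw)

    if misses:
        found = fetch_from_db(misses)     # one query covering every miss
        to_store = {}
        for combo in misses:
            config = found.get(combo)
            result[combo] = config
            to_store[cache_key(combo)] = (
                MISSING if config is None else json.dumps(config)
            )
        cache.mset(to_store)
    return result
```

With this shape, a report needing 1000 combinations costs one cache round trip plus at most one database query, rather than 1000 separate cache calls.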

Related

What is the best strategy to sync data between DB and Redis cache?

We are using an Oracle DB and would like to use Redis as a cache. We add a subset of the DB data to the cache. Does it sync with the DB automatically when the data in the DB changes, or will we have to implement a sync strategy? If so, what is the best way to do it?
"Does it sync with the DB automatically when there is a change in the data in the DB?"
No, it doesn't.
"Will we have to implement the sync strategy? If yes, what is the best way to do it?"
This will depend on your particular case. Caches are usually synced in one of two common ways:
1) Data cached with expiration. Once cached data has expired, a background process adds fresh data to the cache, and so on. Different data is usually refreshed at different intervals: 10 minutes, 1 hour, every day...
2) Data cached on demand. When a user requests some data, that request goes through the non-cached road and stores the result in the cache; a limited number of subsequent requests then read the cached data directly while it is available. This approach can also fall under #1 in terms of the cache invalidation interval.
Now I believe you have enough details to think about what the best strategy could be in your particular case!
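To make the "cached on demand, with an expiration" idea concrete, here is a small in-memory sketch. It is not Redis itself: TtlCache, get_or_load and the injectable clock are names invented for this example, and with a real Redis you would normally let the server expire keys instead.

```python
import time


class TtlCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be simulated
        self.entries = {}           # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: serve the cached value
        value = loader(key)         # absent or expired: take the non-cached road
        self.entries[key] = (now + self.ttl, value)
        return value
```

The first request for a key pays for the database read; later requests are served from memory until the interval passes, after which the next request refreshes the entry.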
In addition to what mathias wrote, you can look at the problem from a dynamic/static perspective:
Real-time approach: each time a process changes the DB data, you dispatch an event or a message to a queue, where a worker handles the corresponding indexing of the cache. Some might even implement it as a DB trigger (which I don't like).
Static/delayed approach: once a day/hour/minute, depending on your needs, a process does a batch or whole indexing of the DB data into the cache.
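Both approaches can be sketched with plain Python objects. The dicts below stand in for the database and the cache, and queue.Queue stands in for a real message queue; write_row, drain_events and full_refresh are hypothetical names for this illustration only.

```python
import queue


def write_row(db, events, key, value):
    """Real-time approach, producer side: change the DB, then publish an event."""
    db[key] = value
    events.put(key)


def drain_events(db, cache, events):
    """Real-time approach, worker side: re-index every key that changed."""
    handled = 0
    while True:
        try:
            key = events.get_nowait()
        except queue.Empty:
            return handled
        if key in db:
            cache[key] = db[key]    # copy the fresh value into the cache
        else:
            cache.pop(key, None)    # row was deleted: drop the cached copy
        handled += 1


def full_refresh(db, cache):
    """Static/delayed approach: rebuild the whole cache on a schedule."""
    cache.clear()
    cache.update(db)
```

In a real deployment the worker would run continuously and full_refresh would be triggered by a scheduler; here both are plain functions so the flow is easy to follow.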

Is it normal to have a lot of records in Memcached with Laravel?

I have an instance of Laravel up and running with a load balancer in place. We've set up memcached (two server nodes) to handle session management. So far the site is running fine in our test environment. The site largely ties into a web-based API, so we only store a few values (other than user authentication data) in a user's session to work with the site.
After a short amount of usage by one or two users, there are about 3000 items in the cache. I don't have full access to the nodes, so I don't know exactly what the items are. However we don't appear to be maxing out the nodes with memory and the application functionality is good.
Is this to be expected? I understand that the cache management will clear out old records over time as they expire, so these could just be "remnant" data records, but this is my first time working with memcached so I want to verify that this is normal behavior.
It's quite normal for any caching solution to rack up a number of items. Especially for lots of small objects it's often more efficient for a cache to keep them beyond their expiry (but no longer serve them) and then clear them out in a big sweep periodically.
"Remnant records" pretty much describes it.
As long as your application performs as expected, I wouldn't worry. You should worry when you get a lot of cache misses for objects that were supposed to be in the cache but were kicked out before expiry due to a lack of memory to store them all.
Yes, it is normal to have lots of records in Memcache. But you need to have proper session management:
Store a small number of values per session (data that most of the APIs require, like the user access token).
Cache expiration
The biggest challenge when using Memcache is avoiding cache staleness while still writing clean code. Most developers store data to Memcache and delete or update data when it changes. This strategy can get messy very quickly – Memcache code becomes riddled throughout an application. Rails’ Sweepers can help with this problem, but other languages and frameworks don’t have similar alternatives.
One simple strategy to avoid code complexity is to write data to Memcache with an expiration. Data with an expiration will automatically expire when the expiration is reached. Most applications can benefit from time-based cache expiration with infrequently changing content such as static assets, headers, footers, blog posts, etc.
List management
A simple list stored in Memcache can be useful for maintaining denormalized relationships.
For example, an e-commerce website may want to store a small table of recent purchases. Rather than keeping a serialized list in Memcache and recalculating it when a new purchase is made, append and prepend can be used to store denormalized data, avoiding a database query.
Note: Memcache only supports a maximum value size of 1 MB. Be careful when creating lists that may grow larger than the maximum allowed value size.
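One way to respect that size limit is a client-side guard that trims the oldest entries before the value is written back. The sketch below is only that guard, not memcached's own prepend command (which does not trim), and it assumes the list is stored as comma-separated ids that contain no commas themselves. prepend_bounded and MAX_VALUE_BYTES are names made up for this example.

```python
MAX_VALUE_BYTES = 1024 * 1024  # the 1 MB limit mentioned above


def prepend_bounded(current, item, limit=MAX_VALUE_BYTES):
    """Put item at the front of a comma-separated byte string.

    The oldest entries (at the end) are dropped until the value fits the limit.
    """
    encoded = item.encode("utf-8")
    if len(encoded) > limit:
        raise ValueError("single item exceeds the value size limit")
    value = encoded + b"," + current if current else encoded
    while len(value) > limit:
        value = value[: value.rindex(b",")]  # drop the oldest entry
    return value
```

For instance, prepend_bounded(b"p2,p1", "p3", limit=5) keeps only the two newest purchases.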
Also check these links:
https://cloud.google.com/appengine/docs/adminconsole/memcache
http://docs.oracle.com/cd/E17952_01/refman-5.6-en/ha-memcached-faq.html
http://symas.com/mdb/memcache/

Torquebox Infinispan Cache - Too many open files

I looked around and apparently Infinispan has a limit on the amount of keys you can store when persisting data to the FileStore. I get the "too many open files" exception.
I love the idea of Torquebox and was anxious to slim down the stack and just use Infinispan instead of Redis. I have an app that needs to cache a lot of data. The queries are computationally expensive and need to be re-computed daily (phone and other productivity metrics by agent in a call center).
I don't run a cluster, though I understand the cache would persist if I had at least one app running. I would rather like to persist the cache. Has anybody run into this issue and found a workaround?
Yes, Infinispan's FileCacheStore used to have an issue with opening too many files. The new SingleFileStore in 5.3.x solves that problem, but it looks like Torquebox still uses Infinispan 5.1.x (https://github.com/torquebox/torquebox/blob/master/pom.xml#L277).
I am also using an Infinispan cache in a live application.
Basically, we store database queries and their results in the cache for tables that are not updatable and are small in data size.
There are two approaches to designing it:
1) Use the query as the key and its data as the value.
This leads to too many entries in the cache when many different queries are placed into it.
2) Use xyz as the key and a Map as the value (the Map contains the queries as keys and their data as values).
This leads to a single entry in the cache. Whenever data is needed from this cache (I call it the query cache), retrieve the Map first using the key xyz, then find the query in the Map itself.
We are using the second approach.
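The difference between the two layouts can be shown with a plain dict standing in for the Infinispan cache. This is only an illustration in Python, not Infinispan's API; the function names are invented, and "xyz" is the placeholder key from the description above.

```python
QUERY_CACHE_KEY = "xyz"  # the single key used by the second approach


# Approach 1: one cache entry per query.
def put_per_query(cache, sql, rows):
    cache[sql] = rows


def get_per_query(cache, sql):
    return cache.get(sql)


# Approach 2: one cache entry holding a map of query -> rows.
def put_in_map(cache, sql, rows):
    query_map = dict(cache.get(QUERY_CACHE_KEY, {}))  # copy, then replace
    query_map[sql] = rows
    cache[QUERY_CACHE_KEY] = query_map


def get_from_map(cache, sql):
    return cache.get(QUERY_CACHE_KEY, {}).get(sql)
```

Note the trade-off: the second layout keeps the number of cache entries at one, but every read and write moves the whole map, so it suits small, rarely changing result sets like the ones described.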

Caching strategy suggestions needed

We have a fantasy football application that uses memcached and the classic memcached-object-read-with-sql-server-fallback. This works fairly well, but recently I've been contemplating the overhead involved and whether or not this is the best approach.
Case in point: we need to generate a drop-down list of the user's teams, so we follow this pattern:
1) Get a list of the user's teams from memcached.
2) If not available, get the list from SQL Server and store it in memcached.
3) Do a multiget to get the team objects.
4) Fall back to loading the objects from SQL Server and store these in memcached.
This is all very well - each cached piece of data is relatively easily cached and invalidated, but there are two major downsides to this:
1) Because we are operating on objects we are incurring a rather large overhead - a single team occupies some hundred bytes in memcached and what we really just need for this case is a list of team names and ids - not all the other stuff in the team objects.
2) Due to the fallback to loading individual objects, the number of SQL queries generated on an empty cache or when the items expire can be massive:
1 x Memcached multiget (which misses, which causes)
1 x SELECT ... FROM Team WHERE Id IN (...)
20 x Store in memcached
So that's 21 network requests just for this one query, and the IN query is also slower than a specific join.
Obviously we could just do a simple
SELECT Id, Name FROM Teams WHERE UserId = XYZ
And cache that result, but this would mean that this data would need to be specifically invalidated whenever the user creates a new team. In this case it might seem relatively simple, but we have many queries of this type, and many of them operate on axes that are not easily invalidated (like a list of ids and names of the teams that your friends have created in a specific game).
Sooo... My question is: do any of you have ideas for resolving the mentioned drawbacks, or should I just accept that there is an overhead and that cache misses are bad, and live with it?
First, cache what you need, maybe just those two fields, not a complete record.
Second, cache what you need again: break the result set into records and cache them separately.
About caching:
You generally use caching to offload the slower disk-based storage, in this case MySQL. The memory cache scales up rather easily; MySQL scales less easily.
Given that, even if you double the CPU/network/memory usage of the cache by putting it all together again, it will still offload the DB. Adding another Node.js instance or another memcached server is easy.
Back to your question:
You say it's a user's team; you could fetch it when the user logs in and keep it updated in the cache while the user changes it throughout the session.
I presume the team members' names do not change. If so, you can load all team members by id and name and store those in the cache, or even locally in Node.js, using the same fallback strategy as you do now. Only steps 1, 2 and 4 will then be left.
Personally, I usually try to split the SQL results into smaller ready-made pieces and cache those, and keep the cache updated as long as possible, ultimately trying to use MySQL only as storage and never read from it.
Usually you will run some logic on the rows returned from MySQL anyway; there's no need to keep repeating that.
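Here is a rough sketch of "cache only what you need and keep it updated", written in Python for brevity. The list of dicts stands in for the Teams table, the plain dict stands in for memcached, and team_options, add_team and the key format are hypothetical names for this example.

```python
def options_key(user_id):
    return f"user:{user_id}:team_options"


def team_options(cache, team_rows, user_id):
    """Return [(id, name), ...] for the drop-down, caching only those two fields."""
    key = options_key(user_id)
    options = cache.get(key)
    if options is None:
        # Equivalent of: SELECT Id, Name FROM Teams WHERE UserId = ...
        options = [(r["id"], r["name"]) for r in team_rows
                   if r["user_id"] == user_id]
        cache[key] = options
    return options


def add_team(cache, team_rows, user_id, team_id, name):
    """Write to storage, then update the cached list instead of invalidating it."""
    team_rows.append({"id": team_id, "user_id": user_id, "name": name})
    key = options_key(user_id)
    if key in cache:
        cache[key] = cache[key] + [(team_id, name)]
```

The cached value stays small because the rest of each team object never enters it, and creating a team costs one extra cache write rather than a later miss and reload.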

How to avoid database query storms using cache-aside pattern

We are using a PostgreSQL database and AppFabric Server, running a moderately busy ASP.NET MVC e-commerce site.
Following the cache-aside pattern we request data from our cache, and if it is not available, we query the database.
This approach results in 'query storms' where the database receives multiple queries for the same data in a short space of time, while a given object in the cache is being refreshed. This issue is exacerbated by longer running queries, and obviously multiple requests for the same data can cause the query to run longer, forming an unpleasant feedback loop.
One solution to this problem is to use read-locking on the cache. However this can itself cause performance issues in a web farm situation (or even on a single busy web server) as web servers are blocked on reads for no reason, in case there is a database query taking place.
Another solution is to drop the cache-aside pattern and seed the cache independently. This is the approach we have taken to mitigate the immediate issues we are seeing with this problem, however it is not possible with all data.
Am I missing something here? And what other approaches have people taken to avoid this behaviour?
Depending on the number of servers you have and your current cache architecture it may be worthwhile to evaluate adding a server-level (or in-process) cache as well. In effect you use this as a fallback cache, and it's especially helpful where hitting the primary storage (database) is either very resource intensive or slow.
When I've used this I've used the cache-aside pattern for the primary cache and a read-through design for the secondary--in which the secondary is locking and ensures the database isn't over-saturated by the same request. With this architecture a primary cache-miss results in at most one query per entity per server (or process) to the database.
So the basic workflow is:
1) Try to retrieve from primary / shared cache pool
* If successful, return
* If unsuccessful, continue
2) Check in-process cache for value
* If successful, return (optionally seeding primary cache)
* If unsuccessful, continue
3) Get lock by cache key (and double-check in-process cache, in case it's been added by another thread)
4) Retrieve object from primary persistence (db)
5) Seed in-process cache and return
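The workflow above could look roughly like this. It is a Python sketch rather than the C#/StructureMap setup described here; TwoLevelCache is an invented name, a dict stands in for the shared cache pool, and a value of None is treated as "not cached".

```python
import threading
from collections import defaultdict


class TwoLevelCache:
    def __init__(self, shared, load_from_db):
        self.shared = shared                  # stand-in for the shared cache pool
        self.local = {}                       # in-process cache
        self.load_from_db = load_from_db
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks[key]

    def get(self, key):
        value = self.shared.get(key)          # 1) primary / shared cache
        if value is not None:
            return value
        value = self.local.get(key)           # 2) in-process cache
        if value is not None:
            self.shared[key] = value          # optionally seed the primary cache
            return value
        with self._lock_for(key):             # 3) lock by cache key
            value = self.local.get(key)       #    double-check after waiting
            if value is None:
                value = self.load_from_db(key)    # 4) primary persistence
                self.local[key] = value           # 5) seed in-process cache
            return value
```

Because the lock is per key, concurrent requests for the same missing entry wait for a single database query, while requests for other keys are not blocked.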
I've done this using injectable wrappers, my cache layers all implement the relevant IRepository interface, and StructureMap injects the correct stack of caches. This keeps the actual cache behaviors flexible, focused, and easy to maintain despite being fairly complex.
We've used AppFabric successfully with the seeding strategy you mention above. We actually do use both solutions:
Seed known data where possible (we have a limited set, so this is actually easy for us to figure out)
Within each cache access method, make sure to do look-aside as necessary, and populate cache on retrieval from data store.
The look-aside is necessary, as items may be evicted due to memory pressure, or simply because they were missed in the seeding operation. We have a "warming" service that pulses on an interval (an hour) and keeps the cache populated with the necessary data. We keep analysis on cache misses, and use that to tweak our warming strategy if we see frequent misses during the warming interval.
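A minimal sketch of that combination, in Python and without any AppFabric API: WarmedCache, warm and promote_frequent_misses are hypothetical names, and the scheduling of warm() on an interval is left out.

```python
from collections import Counter


class WarmedCache:
    def __init__(self, load_from_store, known_keys):
        self.load = load_from_store
        self.known_keys = set(known_keys)   # the data we know how to seed
        self.data = {}
        self.misses = Counter()             # kept for later analysis

    def warm(self):
        """Seed every known key; meant to be called by a timer or service."""
        for key in self.known_keys:
            self.data[key] = self.load(key)

    def get(self, key):
        if key in self.data:
            return self.data[key]
        self.misses[key] += 1               # evicted, or missed by seeding
        value = self.load(key)              # look-aside fallback
        self.data[key] = value
        return value

    def promote_frequent_misses(self, threshold):
        """Add keys that missed at least threshold times to the warming set."""
        for key, count in self.misses.items():
            if count >= threshold:
                self.known_keys.add(key)
        self.misses.clear()
```

The miss counter is what lets the warming set be tuned over time, in the spirit of the analysis described above.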

Resources