How to keep your distributed cache clean? - caching

In an N-tier architecture, what would be the best patterns to use so that you can keep your cache clean?
I know it's easy to just set an absolute/sliding timeout, but is there a better mechanism available that lets you mark your cache as dirty after you update the underlying persistence?
The difficulty I'm trying to wrap my head around is that cache entries are usually stored as key-value pairs (KVP), but a query is usually a fair bit more complex than that. So how can the gateway service tell the cache store that, for such-and-such a query, it needs to refetch from persistence?
I also can't afford to hand-code the cache update per query. I'm looking for a more systematic approach.
Is this just a pipe dream, or is there some way to do this elegantly?
Link/Guide/Post appreciated.

I have worked with AppFabric and I think I tried to do what you are asking about. I was working on an auction site and I wanted to pro-actively invalidate items in the cache.
For example, we had listings (things for sale) and they would be present all over the cache (AppFabric). The data that represented a listing was in 10 different places. What I initially wanted was a way to say, "Ok, my listing has changed. Let me go find everywhere it exists in cache, and then update." (I think you say "mark as dirty" in your question)
I found doing this was incredibly difficult. There are tags in AppFabric that I tried to use, so I would mark a given object (or collection of objects) with a tag and that would let me query the cache and remove items. In other words, if an object had a LISTING tag, I would find it and invalidate it.
Eventually I settled on a two-pronged attack.
For 95% of the data I let it expire. It was a happy day when I decided this because everything got much easier to develop. I had to make some concessions in the UI etc., but it was well worth it.
For the last 5% of the data I resolved to only ever store it once. For example, a bid on a listing. Whenever a new bid came in, we'd pro-actively invalidate that object, and then everything that needed that information would be updated as well.
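To make the two prongs concrete, here is a minimal sketch in Python rather than AppFabric. The class `TwoProngedCache`, its method names, and the key names are all hypothetical, and a plain dict stands in for the distributed cache. Most entries get a TTL and simply expire. The few critical ones live under exactly one key and are invalidated explicitly on write.

```python
import time


class TwoProngedCache:
    """Hypothetical in-memory stand-in for a distributed cache client."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl_seconds=None):
        # Prong 1: pass ttl_seconds and let the entry expire on its own.
        # Prong 2: omit it and rely on an explicit invalidate() call.
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._entries[key] = (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._entries[key]  # expired, so treat it as a miss
            return None
        return value

    def invalidate(self, key):
        # Only practical because the value is stored under a single key.
        self._entries.pop(key, None)
```

With this shape, a listing summary would be stored with something like `put("listing:42:summary", data, ttl_seconds=60)`, while the current bid sits under a single key such as `"listing:42:high_bid"` that the bid handler invalidates whenever a new bid arrives.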

Related

Design a share, re-share functionality for a website, avoiding duplication

This is an interesting interview question that I found somewhere. To elaborate more:
You are expected to design classes and data structures for a website such as Facebook or LinkedIn where your activity can be shared and re-shared. The design should avoid redundancy and duplication.
While thinking about this problem I got stuck on the "link vs copy" problem as discussed here.
But since the problem states that duplication should be avoided, I decided to go the "link" way. This makes sharing/re-sharing easier but deleting very difficult, i.e. if the original user deletes their post, all the shares should be deleted. (Programmatically speaking, all the objects pointing to the particular activity should be made null. And this is the difficult part here, i.e. finding all the pointing objects.)
Wouldn't it be better to keep the shares? The original user deletes their post, fine, it's gone. But everyone who has linked to it should not suddenly have it disappear on them.
This could be done the way Unix handles hard links. "Deleting" just means removing one link to an object -- an inode, in Unix terms. You don't remove the object itself until the link count is zero.
It's not obvious from the original specification that deletion should work as you describe. It might be desired that when the original user deletes the item, it is not deleted elsewhere; in that case you don't necessarily need to track all references, just keep a reference count on each post, and remove it from the database only when the count hits zero.
If you do want the behavior you describe, it may be achievable by simply removing broken links as and when you encounter them, again relieving you of the need to track each reference. The cost of tracking and updating every reference to every post is replaced with the comparable cost of one failed lookup for each referring page. The latter case is simpler to implement, though, and the cost doesn't hit your server all at once.
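Both options above can be sketched in a few lines of Python. This is only an illustration: `PostStore`, `render_feed`, and the in-memory dicts are made-up names standing in for whatever the database layer would be. `unlink` is the reference-count version (the post is removed when the last link goes), and `render_feed` is the lazy version (a dead link is dropped the first time a feed runs into it).

```python
class PostStore:
    """Hypothetical store that keeps a link count per post."""

    def __init__(self):
        self._posts = {}  # post_id -> content
        self._links = {}  # post_id -> number of feeds linking to the post

    def create(self, post_id, content):
        self._posts[post_id] = content
        self._links[post_id] = 1  # the author's own link

    def share(self, post_id):
        if post_id not in self._posts:
            raise KeyError(post_id)
        self._links[post_id] += 1

    def unlink(self, post_id):
        # "Deleting" removes one link; the post goes away at count zero.
        self._links[post_id] -= 1
        if self._links[post_id] == 0:
            del self._links[post_id]
            del self._posts[post_id]

    def get(self, post_id):
        return self._posts.get(post_id)


def render_feed(feed, store):
    """Return visible posts and prune links whose target no longer exists."""
    visible, live_ids = [], []
    for post_id in feed:
        content = store.get(post_id)
        if content is not None:
            visible.append(content)
            live_ids.append(post_id)
    feed[:] = live_ids  # lazy cleanup: broken links are dropped in place
    return visible
```

Neither version needs a list of who links to a post, which is the part that was hard to maintain.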
In real life, I would implement all references as bidirectional anyway, because it's likely to be needed sooner or later as you add features. For example, a "like" counter seems pretty simple, but to prevent duplicate votes you need to keep track of who has liked each item, and then if you want to remove their "like" when they delete their profile, you need to keep a list of each user's outbound "likes" too.
It takes a lot of database activity to implement something like Facebook...

What should be stored in cache for web app?

I realize that this might be a vague question that begets a vague answer, but I'm in need of some real-world examples, thoughts, and/or best practices for caching data for a web app. All of the examples I've read are more technical in nature (how to add or remove cache data from the respective cache store), but I've not been able to find a higher-level strategy for caching.
For example, my web app has an inbox/mail feature for each user. What I've been doing to date is storing typical session data in the cache. In this example, when the user logs in I go to the database and retrieve the user's mail messages and store them in cache. I'm beginning to wonder if I should just maintain a copy of all users' messages in the cache, all the time, and just retrieve them from cache when needed, instead of loading from the database upon login. I have a bunch of other data that's loaded on login (product catalogs and related entities) and login is starting to slow down.
So I guess my question to the community, is what would you do/recommend as an approach in this scenario?
Thanks.
This might be better suited to https://softwareengineering.stackexchange.com/, but generally you want to cache:
Metadata/configuration data that does not change frequently. E.g. country/state lists, external resource addresses, logic/branching settings, product/price/tax definitions, etc.
Data that is costly to retrieve or generate and that does not need to frequently change. E.g. historical data sets for reports.
Data that is unique to the current user's session.
The last item above is where you need to be careful as you can drastically increase your app's memory usage, by adding a few megabytes to the data for every active session. It also implies different levels of caching -- application wide, user session, etc.
Generally you should NOT cache data that is under active change.
In larger systems you also need to think about where the cache(s) will sit. Is it possible to have one central cache server, or is it good enough for each server/process to handle its own caching?
Also: you should have some method to quickly reset/invalidate the cached data. For a smaller or less mission-critical app, this could be as simple as restarting the web server. For the large system that I work on, we use a 12 hour absolute expiration window for most cached data, but we have a way of forcing immediate expiration if we need it.
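One common way to get a "force immediate expiration" switch without enumerating keys is to prefix every key with a generation number and bump it when you want a reset. The sketch below is an assumption about how this could be done, not a description of the system mentioned above. `VersionedCache` and its methods are hypothetical names, and a dict stands in for the cache server.

```python
class VersionedCache:
    """Hypothetical wrapper that namespaces every key with a generation."""

    def __init__(self, store=None):
        # `store` stands in for a real cache client; a dict is used here.
        self._store = {} if store is None else store
        self._generation = 0

    def _key(self, key):
        return f"g{self._generation}:{key}"

    def get(self, key):
        return self._store.get(self._key(key))

    def put(self, key, value):
        self._store[self._key(key)] = value

    def expire_everything(self):
        # Old entries become unreachable at once. In a real cache they
        # would age out through the normal absolute expiration window;
        # in this dict-based sketch they simply stay behind.
        self._generation += 1
```

In a multi-server setup the generation number would itself have to live somewhere shared (for example in the cache), which this sketch does not attempt.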
This is a really broad question, and the answer depends heavily on the specific application/system you are building. I don't know enough about your specific scenario to say if you should cache all the users' messages, but instinctively it seems like a bad idea since you would seem to be effectively caching your entire data set. This could lead to problems if new messages come in or get deleted. Would you then update them in the cache? Would that not simply duplicate the backing store?
Caching is only a performance optimization technique, and as with any optimization, measure first before making substantial changes, to avoid wasting time optimizing the wrong thing. Maybe you don't need much caching, and it would only complicate your app. Maybe the data you are thinking of caching can be retrieved in a faster way, or less of it can be retrieved at once.
Cache anything that causes duplicate database queries.
Client side file caching is important as well. Assuming files are marked with an id in your database, cache them on every network request to avoid many network requests for the same file. A resource to do this can be found here (https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API). If you don't need to cache files, web storage, local storage and cookies are good for smaller pieces of data.
//if file is in cache
//refer to cache
//else
//make network request and push file to cache
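The pseudocode above is the usual check-then-fetch pattern. Here is a minimal Python sketch of the same flow, for illustration only. In a browser you would use IndexedDB or web storage instead of a dict, and `get_file` and `fetch` are hypothetical names.

```python
def get_file(file_id, cache, fetch):
    """Return the file for file_id, calling fetch only on a cache miss."""
    if file_id in cache:
        return cache[file_id]  # file is in cache: refer to cache
    data = fetch(file_id)      # else: make the network request
    cache[file_id] = data      # and push the file to the cache
    return data
```

Here `cache` can be any dict-like object and `fetch` any callable that performs the actual request, so the second call for the same id never reaches the network.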

Improving NHibernate performance with too many objects in session

Our app was originally built with NHibernate and its limitations of batch processing in mind. However, over time it has transformed into a data cruncher and we are observing a significant performance decay.
The session ends up having to maintain about 1000 objects or more, and our profiling has revealed that auto flushing and dirty checking are the biggest offenders here. We tried turning off auto flush and managing it ourselves on Save/Update operations, but that led to disastrous performance for a batch save/update.
We're now looking at the option of evicting unrequired objects from the session.
I came across 2nd level-cache eviction method (sessionFactory.Evict(typeof(Cat));) which lets us evict by type but we do not use a 2nd level cache. Can I still use this method to evict objects from the 1st level cache?
I also read about one pattern of fetching objects, evicting them from the session, and then reassociating them, if needed, with the session by calling Update() on them. Is this a recommended and accepted pattern? I ask because I also read that NH3 has put up a wall to this. (We can still use it as we have not upgraded to NH3.)
While we realize that we are not using NHibernate in the best way, we are just looking to improve the current situation somehow. Answers to the above questions and any other suggestions/recommendations are greatly appreciated. Thanks.
Update
After looking at the NH documentation and code, I realize that the first option (using sessionFactory.Evict() against the 1st level cache) is probably not possible. I'm still looking for pointers or tips on using Evict(). I was able to drastically reduce the number of objects in a session, but I still do not know if there is a price to pay when updating or deleting evicted objects. Thanks in advance for your help.
It's hard to say without knowing more about your requirements but maybe you could use IStatelessSession. It doesn't have a 1st level cache to worry about.
Ayende has a good post on using it for bulk operations here
Why not use more sessions, instead of one large one? That, in conjunction with turning off autoflush has helped me in the past. Also, you should really think about using HQL for bulk updates if possible.
I know that this is old, but I came across it while looking for something else, having just solved the same problem. I solved it as Trent mentioned, by using more than one session. My case was iterating through a list, operating on each object, and committing on each iteration. I created one session to fetch all of the objects I wanted, then closed that session. Then, inside the foreach over my list, I created and disposed of a new session on each iteration, reattaching the object from the list to the new session. That took a process that was taking about 2.5 hours down to 2 minutes 40 seconds!
See this article for the inspiration behind my solution, although mine is not exactly the same because I have unit-of-work wrappers around NHibernate:
http://weblogs.asp.net/ricardoperes/archive/2013/03/21/attaching-disconnected-entities-in-nhibernate-without-going-to-the-database.aspx

Cache vs HashMap for simple usecase

This must be a very basic question, but I'm just curious: if I don't need distributed or cache-as-SoR models, why would I need third-party cache libraries (ehcache, memcached) when all I need for a simple use case is a key-value pair holder, something like a HashMap?
A lot of thought goes into producing software, and the more thought, testing, and fixes others have put into it, the more valuable the software becomes. That use also validates the code as a model (I didn't say a good model).
For the example above, how would you handle the deleting of "old" cache items? You would have to add more code/features to ensure that the cache could be emptied.
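As a rough illustration of that "more code", here is what a plain map needs just to stay bounded. It is sketched in Python (the question is about Java's HashMap, but the idea is the same), the `LruCache` name is made up, and least-recently-used eviction is only one possible policy.

```python
from collections import OrderedDict


class LruCache:
    """Map that evicts the least recently used entry once it is full."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry
```

This still has no expiry, statistics, or thread safety, which are the kinds of problems the existing libraries have already dealt with.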
Using memcache may be overkill for a simple program, but it's already solved many of the problems that you will have and gives you a bit of extra ability.
I would also use Redis as an example. You can DO a lot of stuff in your own language, but sometimes, Redis would make other items easier.
YMMV!
-daniel

Organizing memcache keys

I'm trying to find a good way to handle memcache keys for storing, retrieving and updating data to/from the cache layer in a more civilized way.
Found this pattern, which looks great, but how do I turn it into a functional part of a PHP application?
The Identity Map pattern: http://martinfowler.com/eaaCatalog/identityMap.html
Thanks!
Update: I have been told about the modified memcache (memcache-tag) that apparently does a lot of this, but I can't install Linux software on my Windows development box...
Well, memcache use IS an identity map pattern. You check your cache, then you hit your database (or whatever else you're using). You can go about finding information about the source by storing objects instead of just values, but you'll take a performance hit for that.
You effectively cannot ask the cache what it contains as a list. To mass invalidate, you'll have to keep a list of what you put in and iterate it, or you'll have to iterate every possible key that could fit the pattern of concern. The resource you point out, memcache-tag, can simplify this, but it doesn't appear to be maintained in line with the memcache project.
So your options now are iterative deletes or totally flushing everything that is cached. I would suggest that the real question here is one of design. To give you a useful answer, I have to ask: why do you want to do this?
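For what it's worth, the "keep a list of what you put in and iterate it" option looks roughly like this. It is sketched in Python rather than PHP, with a dict standing in for memcache, and `TrackedCache` and its method names are made up for the example.

```python
class TrackedCache:
    """Hypothetical wrapper that remembers which keys belong to a group."""

    def __init__(self):
        self._store = {}   # stand-in for the memcache server
        self._groups = {}  # group name -> set of keys stored under it

    def set(self, key, value, group):
        self._store[key] = value
        self._groups.setdefault(group, set()).add(key)

    def get(self, key):
        return self._store.get(key)

    def invalidate_group(self, group):
        # Iterative delete: walk the remembered keys and remove each one.
        keys = self._groups.pop(group, set())
        for key in keys:
            self._store.pop(key, None)
        return len(keys)
```

In a real deployment the key registry has to live somewhere that every web server can see, and if it is kept in memcache itself it can be evicted like any other entry, so treat this as a sketch of the bookkeeping and not a complete solution.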

Resources