Possibility of stale data in cache-aside pattern

Just to recap, the cache-aside pattern defines the following steps for fetching and updating data.
Fetching Item
1. Return the item from the cache if it is found there.
2. If not found in the cache, read the item from the data store.
3. Put the read item in the cache and return it.
Updating Item
1. Write the item to the data store.
2. Remove the corresponding entry from the cache.
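Sketched in C# (hypothetical ICache and IDataStore interfaces standing in for, say, memcached and a database; this is an illustration, not any particular library's API):

public interface ICache<TKey, TValue>
{
    bool TryGet(TKey key, out TValue value);
    void Put(TKey key, TValue value);
    void Remove(TKey key);
}

public interface IDataStore<TKey, TValue>
{
    TValue Read(TKey key);
    void Write(TKey key, TValue value);
}

public class CacheAside<TKey, TValue>
{
    private readonly ICache<TKey, TValue> cache;
    private readonly IDataStore<TKey, TValue> store;

    public CacheAside(ICache<TKey, TValue> cache, IDataStore<TKey, TValue> store)
    {
        this.cache = cache;
        this.store = store;
    }

    public TValue Fetch(TKey key)
    {
        TValue value;
        if (cache.TryGet(key, out value))   // fetch step 1: return from cache if found
            return value;
        value = store.Read(key);            // fetch step 2: read from the data store
        cache.Put(key, value);              // fetch step 3: populate the cache
        return value;
    }

    public void Update(TKey key, TValue value)
    {
        store.Write(key, value);            // update step 1: write to the data store
        cache.Remove(key);                  // update step 2: invalidate the cache entry
    }
}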
This works perfectly in almost all cases, but it seems to fail in one theoretical scenario.
What if update steps 1 and 2 happen between fetch steps 2 and 3? In other words: suppose the data store initially held the value 'A' and the item was not in the cache. While fetching, we read 'A' from the data store, but before we could put it into the cache, another thread updated the item to 'B' (so 'B' was written to the data store, and the updater tried to remove the cache entry, which was not there at that time). The fetching thread then puts the item it read (i.e. 'A') into the cache. So 'A' will stay cached, and further fetches will return stale data until the item expires or is updated again.
So am I missing something here? Is my understanding of the pattern wrong? Or is the scenario just practically impossible, so there is no need to worry about it?
Also I would like to know if some changes can be made in the pattern to avoid this problem.

Your understanding of the pattern appears perfectly correct, according to the MSDN definition. In fact, the article mentions the same failure scenario that you describe.
The order of the steps in this sequence is important. If the item is removed before the cache is updated, there is a small window of opportunity for a client application to fetch the data (because it is not found in the cache) before the item in the data store has been changed, resulting in the cache containing stale data.
The MSDN article does note that, "it is usually impractical to expect that cached data will always be completely consistent with the data in the data store." Expiration and eviction are two strategies mentioned for dealing with this problem.
An old computer science joke goes like this.
There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
You've stumbled upon the first of these problems.

Also I would like to know if some changes can be made in the pattern to avoid this problem.
There is no way to avoid this situation in general. The memcached protocol introduces a special command for it:
"cas" is a check and set operation which means "store this data but only if no one else has updated since I last fetched it."
The scenario should be modified as follows:
Fetching Item
1. Return the item from the cache if it is found there.
2. If not found in the cache, read the item from the data store.
3. Check-and-swap the corresponding entry in the cache and return it.
Updating Item
1. Check-and-swap the corresponding entry in the cache.
2. Write the item to the data store.
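A minimal sketch of this variant, assuming a memcached-style client that exposes a Gets operation (value plus an opaque CAS token), an Add (store only if absent), and a Cas (store only if unchanged). The interface and names here are hypothetical, modeled on the memcached protocol:

public interface ICasCache<TKey, TValue>
{
    // Read the value together with a token identifying the version that was read.
    bool TryGets(TKey key, out TValue value, out ulong casToken);
    // Store only if the key is not present at all.
    bool Add(TKey key, TValue value);
    // Store only if the entry has not been modified since the token was issued.
    bool Cas(TKey key, TValue value, ulong casToken);
}

public class CasCacheAside<TKey, TValue>
{
    private readonly ICasCache<TKey, TValue> cache;
    private readonly IDataStore<TKey, TValue> store;   // same IDataStore as above

    public CasCacheAside(ICasCache<TKey, TValue> cache, IDataStore<TKey, TValue> store)
    {
        this.cache = cache;
        this.store = store;
    }

    public TValue Fetch(TKey key)
    {
        TValue value;
        ulong token;
        if (cache.TryGets(key, out value, out token))   // step 1: return from cache if found
            return value;
        value = store.Read(key);                        // step 2: read from the data store
        cache.Add(key, value);                          // step 3: populate only if nobody
                                                        // touched the key in the meantime
        return value;
    }

    public void Update(TKey key, TValue value)
    {
        TValue cached;
        ulong token;
        if (cache.TryGets(key, out cached, out token))  // step 1: swap only the exact
            cache.Cas(key, value, token);               // version we just observed
        store.Write(key, value);                        // step 2: write to the data store
    }
}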
This scenario still does not guarantee full consistency.
Imagine the following situation:
Writing the item to the data store fails while updating it in the cache succeeds. The latest value is then kept only in the cache.

Related

crawl small homepage with metadata.transfer and N:M-relationships

hi folks,
We use StormCrawler with Elasticsearch to make an index of our homepage, which consists of "old pages" and "new pages".
My Question in short:
If two pages A (old) and B (new) link to page X, how do we pass metadata from B to X?
My Question in long:
We relaunched our homepage step by step, so at the moment we have PDF files which are reachable only via the old HTML pages, only via the new HTML pages, or both ways.
For "order by" purposes we must mark all PDF files which are reachable from the new HTML pages.
So we insert "newHomepage=true" into seeds.txt and "metadata.transfer/-newHomepage" into crawler-conf.yaml: fine :-)
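(For reference, the metadata.transfer entry in crawler-conf.yaml takes a list of metadata keys to copy from a page to its outlinks, so presumably something like:

metadata.transfer:
 - newHomepage
)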
But for the PDF files which are reachable from old !and! new HTML pages, we now have a race condition: if a PDF file is "DISCOVERED" from an old page, this information (newHomepage=false) lands in the status index and cannot be overridden
(StatusUpdaterBolt does not overwrite documents; IndexerBolt does overwrite by default).
To make things more complicated: in our case a URL (on an HTML page) pointing to a PDF is redirected two times before the file is delivered.
So from my point of view we have two possibilities:
1. Start the crawler two times: first we index only our new pages (and all reachable PDF files), then we index our old pages.
--> Problem: new pages which are changed after the crawler was started.
2. Store "outbound_links" and use them to set "newHomepage" independently of the crawler.
--> Problem: short periods with wrong metadata in the index.
Any advice or other ideas?
Best regards
Karsten
Thanks for sharing your problem, and great to hear that you are using SC. This is an interesting and unusual use case.
Your analysis of the problem is correct. An intuitive approach would be to extend the default StatusUpdaterBolt so that it updates the metadata if a document already exists. You'd need to remove the part that does the check on whether the doc has a status of DISCOVERED.
This would slow things down, but since you are dealing with a single website, this should not have a massive impact.
You could push the logic even further by setting a new nextFetchDate if the document had been fetched so that it gets refetched and updated quicker in the doc index (as opposed to the status one).

When to use Update vs Invalidate Cache Protocols

In what scenarios would it be better to use an update protocol rather than an invalidate protocol, and vice versa?
I'm not able to think of any scenario in which one would be preferred over the other. If you're going to invalidate a cache line, why not just update it at the same time?
Cache invalidation can be triggered on several bases: time, a sliding window, dependencies on other items within the cache, or signals from the data source.
Updating a cache entry is a comparatively expensive operation. Depending on what your data source is, it may cost you precious resources to refresh something that will not be needed for some time.
So the question becomes: why invalidate items, and why/when should you update them?
It depends entirely on your use case. Do you want your items to expire automatically, or to have a dependency on another item?
When and why to update them is also use-case dependent. Would you still need an item that has not been accessed for the last 15 minutes, or for hours? Why not refresh it only once it has been invalidated or expired?
Caches also offer the related concept of read-through: the cache itself loads an item from the data source when the item is not present in the cache.
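A minimal read-through sketch (a hypothetical in-process cache with a caller-supplied loader; real products wire the loader into the cache configuration instead):

using System;
using System.Collections.Generic;

public class ReadThroughCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> cache = new Dictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> loadFromSource;   // e.g. a database query

    public ReadThroughCache(Func<TKey, TValue> loadFromSource)
    {
        this.loadFromSource = loadFromSource;
    }

    public TValue Get(TKey key)
    {
        TValue value;
        if (!cache.TryGetValue(key, out value))   // miss?
        {
            value = loadFromSource(key);          // read through to the data source
            cache[key] = value;                   // populate for subsequent reads
        }
        return value;
    }
}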

How to keep your distributed cache clean?

In a N-Tier architecture, what would be the best patterns to use so that you can keep your cache clean?
I know it's easy to just set an absolute/sliding timeout, but is there a better mechanism available that lets you mark your cache as dirty after you update the underlying persistence?
The difficulty I'm trying to wrap my head around is that caches are usually stored as key-value pairs, while a query is usually a fair bit more complex than that. So how can the gateway service tell the cache store that, for such and such a query, it needs to refetch from persistence?
I also can't afford to hand-code the cache update per query. I'm looking for a more systematic approach.
Is this just a pipe dream, or is there some way to do this elegantly?
Link/Guide/Post appreciated.
I have worked with AppFabric, and I tried to do something close to what you are asking about. I was working on an auction site and I wanted to proactively invalidate items in the cache.
For example, we had listings (things for sale) and they would be present all over the cache (AppFabric). The data that represented a listing was in 10 different places. What I initially wanted was a way to say, "Ok, my listing has changed. Let me go find everywhere it exists in cache, and then update." (I think you say "mark as dirty" in your question)
I found doing this was incredibly difficult. There are tags in AppFabric that I tried to use, so I would mark a given object (or collection of objects) with a tag and that would let me query the cache and remove items. In other words, if an object had a LISTING tag, I would find it and invalidate it.
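In spirit that approach is a tag index. A generic sketch of the idea (not AppFabric's actual API) keeps a map from tag to keys, so "invalidate everything tagged LISTING" becomes a lookup plus a batch of removals:

using System.Collections.Generic;

public class TaggedCache
{
    private readonly Dictionary<string, object> items = new Dictionary<string, object>();
    private readonly Dictionary<string, HashSet<string>> tagIndex = new Dictionary<string, HashSet<string>>();

    public void Put(string key, object value, params string[] tags)
    {
        items[key] = value;
        foreach (string tag in tags)
        {
            HashSet<string> keys;
            if (!tagIndex.TryGetValue(tag, out keys))
            {
                keys = new HashSet<string>();
                tagIndex[tag] = keys;
            }
            keys.Add(key);    // remember that this entry carries the tag
        }
    }

    // Invalidate everything carrying the tag, e.g. InvalidateTag("LISTING").
    public void InvalidateTag(string tag)
    {
        HashSet<string> keys;
        if (!tagIndex.TryGetValue(tag, out keys))
            return;
        foreach (string key in keys)
            items.Remove(key);
        tagIndex.Remove(tag);
    }
}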
Eventually I settled on a two-pronged attack.
For 95% of the data I let it expire. It was a happy day when I decided this because everything got much easier to develop. I had to make some concessions in the UI etc., but it was well worth it.
For the last 5% of the data I resolved to only ever store it once. For example, a bid on a listing. Whenever a new bid came in, we'd pro-actively invalidate that object, and then everything that needed that information would be updated as well.

Key based caching

I'm reading this article:
http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works
I'm not using Rails, so I don't really understand their example.
It says in #3:
When the key changes, you simply write the new content to this new key. So if you update the todo, the key changes from todos/5-20110218104500 to todos/5-20110218105545, and thus the new content is written based on the updated object.
How does the view know to read from the new todos/5-20110218105545 instead of the old one?
I was confused about that too at first -- how does this save a trip to the database if you have to read from the database anyway to see if the cache is valid? However, see Jesse's comments (1, 2) from Feb 12th:
How do you know what the cache key is? You would have to fetch it from the database to know the mtime right? If you’re pulling the record from the database already, I would expect that to be the greatest hit, no?
Am I missing something?
and then
Please remove my brain-dead comment. I just realized why this doesn’t matter: the caching is cascaded, so yes a full depth regeneration incurs a DB hit. The next cache hit will incur one DB query for the top-level object—all the descendant objects are not queried because the cache for the parent object includes cached versions for the children (thus, no query necessary).
And Paul Leader's comment 2 below that:
Bingo. That's why it works soooo well. If you do it right, it doesn't just eliminate the need to generate the HTML but any need to hit the db. With this caching system in place, our data-vis app is almost instantaneous; it's actually usable and the code is much nicer.
So given the models that DHH lists in step 5 of the article and the views he lists in step 6, and given that you've properly set up your relationships to touch the parent objects on update, and that your partials access your child data as parent.children (or even child.children in nested partials), this caching system should yield a net gain: as long as the parent's cache key is still valid, the parent.children lookup never happens, because the children come out of the cache along with the parent.
However, this method may be pointless if your partials reference lots of instance variables from the controller since those queries will already have been performed by the time Rails sees the calls to cache in the view templates. In that case you would probably be better off using other caching patterns.
Or at least this is my understanding of how it works. HTH
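Outside Rails, the key trick itself is tiny; a sketch with hypothetical names (C# for concreteness):

using System;

public static class TodoCacheKeys
{
    // The key embeds the object's last-modified timestamp. Updating the todo
    // changes updatedAt, hence the key, so readers simply never ask for the
    // old entry again; it ages out via expiration/LRU rather than explicit
    // invalidation.
    public static string For(int todoId, DateTime updatedAt)
    {
        return $"todos/{todoId}-{updatedAt:yyyyMMddHHmmss}";
    }
}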

LINQ to XML updates - how does it handle multiple concurrent readers/writers?

I have an old system that uses XML for its data storage. I'm going to be using the data for another mini-project and wanted to use LINQ to XML for querying/updating the data, but there are 2 scenarios that I'm not sure whether I need to handle myself or not:
1. If I have something similar to the following code and 2 people happen to hit Save() at the same time, does LINQ to XML wait until the file is available again before saving, or will it just throw? I don't want to put locks in unless I need to :)
// I assume the next line doesn't lock the file
XElement doc = XElement.Load("Books.xml");
XElement newBook = new XElement("Book",
    new XAttribute("publisher", "My Publisher"),
    new XElement("author", "Me"));
doc.Add(newBook);
// What happens if two people try this at the same time?
doc.Save("Books.xml");
2. If I Load() a document, add an entry under a particular node, and then hit Save(), what happens if another user has already added a value under that node (since my Load()), or, even worse, deleted the node?
Obviously I can work around these issues, but I couldn't find any documentation that could tell me whether I have to or not, and the first one at least would be a bit of a pig to test reliably.
It's not really a LINQ to XML issue, but a basic concurrency issue.
Assuming the two people hit Save at the same time and the backing store is a file, then, depending on how the file was opened for saving, you might get an error. If you leave it to the XDocument class (by just passing in a file name), chances are it opens the file exclusively, and someone else trying to do the same (assuming the same code is hitting it) will get an exception. You basically have to synchronize access to any shared resource that you read from or write to.
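One way to do that synchronization across processes is a named mutex around the whole load-modify-save cycle. A minimal sketch (the mutex name "Global\BooksXml" is made up):

using System;
using System.Threading;
using System.Xml.Linq;

class BookStore
{
    // One named mutex shared by every process that touches Books.xml.
    private static readonly Mutex FileMutex = new Mutex(false, @"Global\BooksXml");

    public static void AddBook(string publisher, string author)
    {
        FileMutex.WaitOne();   // blocks until no other process holds the mutex
        try
        {
            XElement doc = XElement.Load("Books.xml");
            doc.Add(new XElement("Book",
                new XAttribute("publisher", publisher),
                new XElement("author", author)));
            doc.Save("Books.xml");   // load-modify-save is now one critical section
        }
        finally
        {
            FileMutex.ReleaseMutex();
        }
    }
}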
If another user has already added a value, then, assuming you have no problem obtaining the resource to write to, your changes will simply overwrite theirs. This is a frequent issue with databases, usually handled via optimistic concurrency: you need some sort of value that indicates whether a change has occurred between the time you loaded the data and the time you save it (most databases will generate timestamp/rowversion values for you).
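Applied to the XML file, the same idea could look like this (a made-up "version" attribute on the root element; combine it with the locking above so the check and the save are atomic):

using System;
using System.Xml.Linq;

static class OptimisticBookStore
{
    public static void Save(XElement doc, int versionWhenLoaded)
    {
        // Reject the save if someone bumped the version since our Load().
        XElement current = XElement.Load("Books.xml");
        if ((int)current.Attribute("version") != versionWhenLoaded)
            throw new InvalidOperationException(
                "Books.xml changed since it was loaded; reload and retry.");

        doc.SetAttributeValue("version", versionWhenLoaded + 1);
        doc.Save("Books.xml");
    }
}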
