Key based caching - caching

I'm reading this article:
http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works
I'm not using Rails, so I don't really understand their example.
It says in #3:
When the key changes, you simply write the new content to this new
key. So if you update the todo, the key changes from
todos/5-20110218104500 to todos/5-20110218105545, and thus the new
content is written based on the updated object.
How does the view know to read from the new todos/5-20110218105545 instead of the old one?

I was confused about that too at first -- how does this save a trip to the database if you have to read from the database anyway to see if the cache is valid? However, see Jesse's comments (1, 2) from Feb 12th:
How do you know what the cache key is? You would have to fetch it from the database to know the mtime right? If you’re pulling the record from the database already, I would expect that to be the greatest hit, no?
Am I missing something?
and then
Please remove my brain-dead comment. I just realized why this doesn’t matter: the caching is cascaded, so yes a full depth regeneration incurs a DB hit. The next cache hit will incur one DB query for the top-level object—all the descendant objects are not queried because the cache for the parent object includes cached versions for the children (thus, no query necessary).
And Paul Leader's comment 2 below that:
Bingo. That’s why is works soooo well. If you do it right it doesn’t just eliminate the need to generate the HTML but any need to hit the db. With this caching system in place, our data-vis app is almost instantaneous, it’s actually useable and the code is much nicer.
So, given the models that DHH lists in step 5 of the article and the views he lists in step 6, suppose you've properly set up your relationships to touch the parent objects on update, and your partials access child data as parent.children (or even child.children in nested partials). Then this caching system should be a net gain: as long as the parent's cache key is still valid, the parent.children lookup never happens, because the children's rendered output is pulled from the cache along with the parent's.
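To make the cascade concrete, here is a minimal Python sketch (not Rails, and not the article's code). The names Todo, TodoList, cache_key, fetch and the child_loads counter are all made up for illustration; the dict stands in for memcached and load_todos() stands in for the parent.children query. The view never has to be told about the new key: it recomputes the key from the record's id and updated_at on every render, so a changed timestamp simply produces a cache miss.

```python
from dataclasses import dataclass, field


@dataclass
class Todo:
    id: int
    text: str
    updated_at: int  # timestamp, e.g. 20110218104500


@dataclass
class TodoList:
    id: int
    updated_at: int
    todos: list = field(default_factory=list)
    child_loads: int = 0  # counts simulated "parent.children" queries

    def load_todos(self):
        # Stand-in for the database query behind parent.children.
        self.child_loads += 1
        return self.todos


cache = {}  # stand-in for memcached (hypothetical)


def cache_key(prefix, record):
    # The key is derived from the record itself, e.g. "todos/5-20110218104500".
    return f"{prefix}/{record.id}-{record.updated_at}"


def fetch(key, build):
    # Return the cached fragment, building and storing it on a miss.
    if key not in cache:
        cache[key] = build()
    return cache[key]


def render_todo(todo):
    return fetch(cache_key("todos", todo), lambda: f"<li>{todo.text}</li>")


def render_list(todo_list):
    # The children are only loaded when the parent's fragment is missing.
    return fetch(
        cache_key("lists", todo_list),
        lambda: "<ul>"
        + "".join(render_todo(t) for t in todo_list.load_todos())
        + "</ul>",
    )


def update_todo(todo_list, todo, text, now):
    todo.text = text
    todo.updated_at = now
    todo_list.updated_at = now  # "touch" the parent so its key changes too
```

Rendering the same list twice calls load_todos() once. After update_todo() touches the parent, the next render misses on the new key and rebuilds; the old entries are never read again and are left for the cache to evict.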
However, this method may be pointless if your partials reference lots of instance variables from the controller since those queries will already have been performed by the time Rails sees the calls to cache in the view templates. In that case you would probably be better off using other caching patterns.
Or at least this is my understanding of how it works. HTH

Related

Possibility of stale data in cache-aside pattern

Just to recap, the cache-aside pattern defines the following steps when fetching and updating data.
Fetching Item
1. Return the item from cache if found in it.
2. If not found in cache, read from data store.
3. Put the read item in cache and return it.
Updating Item
1. Write the item in data store.
2. Remove the corresponding entry from cache.
This works perfectly in almost all cases, but it seems to fail in one theoretical scenario.
What if steps 1 and 2 of updating an item happen between steps 2 and 3 of fetching it? In other words, suppose the data store initially holds the value 'A' and the item is not in the cache. When fetching the item, we read 'A' from the data store, but before we put it into the cache, another thread updates the item to 'B' (so 'B' is written to the data store, and the update tries to remove the cache entry, which does not exist yet). The fetching thread then puts the value it read ('A') into the cache. Now 'A' stays cached, and further fetches return stale data until the item expires or is updated again.
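The interleaving can be replayed deterministically in a single thread. This is only a sketch: the two dicts stand in for the data store and the cache, and replay_race is a made-up name.

```python
def replay_race():
    """Replay the problematic interleaving step by step in one thread."""
    store = {"item": "A"}  # stand-in for the data store
    cache = {}             # stand-in for the cache, initially empty

    # Fetch, step 1: look in the cache (a miss).
    hit = cache.get("item")
    # Fetch, step 2: not found, so read from the data store.
    value_read = store["item"] if hit is None else hit

    # Update, step 1: another thread writes the new value.
    store["item"] = "B"
    # Update, step 2: it removes the cache entry (nothing to remove yet).
    cache.pop("item", None)

    # Fetch, step 3: the fetching thread caches what it read earlier.
    cache["item"] = value_read
    return store, cache
```

After the replay the store holds 'B' while the cache holds 'A', which is exactly the stale state in question.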
So am I missing something here? Is my understanding of the pattern wrong? Or is the scenario just practically impossible, so there is no need to worry about it?
Also I would like to know if some changes can be made in the pattern to avoid this problem.
Your understanding of the pattern appears perfectly correct, according to the MSDN definition. In fact, it mentions the same failure scenario that you describe.
The order of the steps in this sequence is important. If the item is removed before the cache is updated, there is a small window of opportunity for a client application to fetch the data (because it is not found in the cache) before the item in the data store has been changed, resulting in the cache containing stale data.
The MSDN article does note that, "it is usually impractical to expect that cached data will always be completely consistent with the data in the data store." Expiration and eviction are two strategies mentioned for dealing with this problem.
An old computer science joke goes like this.
There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
You've stumbled upon the first of these problems.
Also I would like to know if some changes can be made in the pattern
to avoid this problem.
There is no way to avoid this situation in general. The Memcached protocol introduces a special command:
"cas" is a check and set operation which means "store this data but
only if no one else has updated since I last fetched it."
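As a rough illustration of those semantics, here is an in-memory Python model. VersionedCache, gets and cas are hypothetical names chosen to echo the memcached commands; this is not a memcached client. One deliberate simplification, which is an assumption of this sketch: a miss is treated as version 0 so that cas can also create an entry, whereas real memcached reports a cas on a missing key as not found.

```python
class VersionedCache:
    """Toy cache where every entry carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        # Return (value, version); a miss is reported as (None, 0).
        return self._data.get(key, (None, 0))

    def cas(self, key, value, version):
        # Store the value only if nobody has written since `version` was read.
        _, current = self._data.get(key, (None, 0))
        if current != version:
            return False
        self._data[key] = (value, current + 1)
        return True
```

If two clients both call gets() before either writes, only the first cas() succeeds; the second gets False back and has to re-read before retrying.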
The scenario should be modified as follows:
Fetching Item
1. Return the item from cache if found in it.
2. If not found in cache, read from data store.
3. Check and swap the corresponding entry in cache and return it.
Updating Item
1. Check and swap the corresponding entry in cache.
2. Write the item in data store.
This scenario also does not guarantee full consistency.
Imagine the following situation:
Writing the item to the data store fails while updating the item in the cache succeeds. The latest item value is then kept in the cache only.

slow-loading persistent store coordinator in core data

I have been developing a Cocoa app with Core Data. Initially everything seemed fine, but as I added data to the application, I found that the initial data window took ages to load. To fix that, I moved to another startup window that didn't have the data, so start-up was snappy. However, no matter what I do, my first fetch AND my first attempt to load a data window (with table views) are always slow. (That is, if I fetch slowly and then ask for the data window, both will be slow the first time around.) After that, performance is acceptable.
I traced through my application and found that while I can quickly step through the program, no matter what, the step that retrieves the persistent store coordinator is incredibly slow ... 15 - 20 seconds can elapse with a spinning beach ball.
I've read elsewhere that I might want to denormalize the data. I don't think that will be sufficient. An earlier version was far less "interconnected" between the entities, and it still was a slug at startup. Now I'm looking at entities that may have as high as 18,000 managed objects. Some of the relations are essential to having the data work correctly.
I've also read about the option of employing a separate managed object context in the background. The problem with this is that even this background context would take too long to be usable. If the user tries to run a search, he or she will still be waiting forever for that context to load. I might buy myself a few seconds while the user decides what to type in to the search field, but I can't afford to stall for 25 seconds.
I noticed that once data is imported into the persistent store, even searches on a table that is not related to others (and only has 1000 objects) still take ages to load. The reason seems to be that it's the coordinator retrieval itself that's slow, not the actual fetch or the context.
Can anyone point me in the right direction on how to resolve this? Thanks!
Before you create your data model:
If you’re storing large objects such as photos, audio or video, you need to be very careful with your model design.
The key point to remember is that when you bring a managed object into a context, you’re bringing all of its data into memory.
If large photos are within managed objects cut from the same entity that drives a table-view, performance will suffer. Even if you’re using a fetched results controller, you could still be loading over a dozen high-resolution images at once, which isn’t going to be instant.
To get around this issue, attributes that will hold large objects should be split off into a related entity. This way the large objects can remain in the persistent store and can be represented by a fault instead, until they really are needed.
If you need to display photos in a table view, you should use auto-generated thumbnail images instead.
Read the whole article
You might be getting ahead of yourself thinking PSC is the culprit.
There is more going on behind the scenes with CoreData than is readily obvious -- PSC is very flexible and must be directed.
A realistic approach for the data size you specified (18K) is to focus on modularizing the logic of your fetch request templates and validation for specific size cases (think small medium large XtraLarge, etc.).
The suggestion to denormalize your data does not take into account the overhead to get your data into a fully denormalized state, plus a (sometimes) unintended side-effect of denormalization is sparsity (unless you have very specific model of course).
Since you usually do not know beforehand what data will be accessed and modified, make a one-to-many relationship between your central task and any subtasks. This will free up some constraints on your data access.
You can always give your end users the option to choose how they want to handle the larger datasets.

How to keep your distributed cache clean?

In a N-Tier architecture, what would be the best patterns to use so that you can keep your cache clean?
I know it's easy to just set an absolute/sliding timeout, but is there a better mechanism available to allow you to mark your cache as dirty after you update the underlying persistence?
The difficulty I'm trying to wrap my head around is that cache entries are usually stored as key-value pairs (KVP), but a query is usually a fair bit more complex than that. So how can the gateway service tell the cache store that, for such and such a query, it needs to refetch from persistence?
I also can't afford to hand-code the cache update per query. I'm looking for a more systematic approach.
Is this just a pipe dream, or is there some way to do this elegantly?
Link/Guide/Post appreciated.
I have worked with AppFabric and I think I tried to do what you are asking about. I was working on an auction site and I wanted to pro-actively invalidate items in the cache.
For example, we had listings (things for sale) and they would be present all over the cache (AppFabric). The data that represented a listing was in 10 different places. What I initially wanted was a way to say, "Ok, my listing has changed. Let me go find everywhere it exists in cache, and then update." (I think you say "mark as dirty" in your question)
I found doing this was incredibly difficult. There are tags in AppFabric that I tried to use, so I would mark a given object (or collection of objects) with a tag and that would let me query the cache and remove items. In other words, if an object had a LISTING tag, I would find it and invalidate it.
Eventually I settled on a two-pronged attack.
For 95% of the data I let it expire. It was a happy day when I decided this because everything got much easier to develop. I had to make some concessions in the UI etc., but it was well worth it.
For the last 5% of the data I resolved to only ever store it once. For example, a bid on a listing. Whenever a new bid came in, we'd pro-actively invalidate that object, and then everything that needed that information would be updated as well.
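The two prongs can be sketched in a few lines of Python. This is not AppFabric's API: TtlCache, its get/invalidate methods and the injectable clock are all hypothetical, and the dict stands in for the distributed cache. Most entries are simply left to expire; the few that must be fresh live under a single key and are invalidated explicitly.

```python
import time


class TtlCache:
    """Toy cache with a fixed time-to-live and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = {}      # key -> (value, expires_at)

    def get(self, key, load):
        # Return the cached value, reloading it if missing or expired.
        entry = self._data.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]
        value = load()
        self._data[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Pro-active invalidation for data that is stored exactly once.
        self._data.pop(key, None)
```

Because the "must be fresh" data lives under exactly one key, invalidate() has one place to clear, which avoids the search for every cached copy described above.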

When is it better to generate a static page or dynamically generate?

The title pretty much sums up my question.
When is it more efficient to generate a static page that a user can access, as opposed to using dynamically generated pages that query a database? In what situations would one be better than the other?
To serve up a static page, your web server just needs to read the page off the disk and send it. Virtually no processing will be required. If the page is frequently accessed, it will probably be cached in memory, so even the disk access will not be needed.
Generating pages dynamically obviously has more overhead. There is a cost for every DB access you make, no matter how simple the query is. (On a project I worked on recently, I measured a minimum overhead of 0.7ms for each query, even for SELECT 1;) So if you can just generate a static page and save it to disk, page accesses will be faster. How much faster? It just depends on how much work is being done to generate the page dynamically. We don't know what you are doing, so we can't comment on that.
Now, if you generate a static page and save it to disk, that means you need to re-generate it every time the data which went into generating that page changes. If the data changes more often than the page is actually accessed, you could be doing more work rather than less! But in most cases, that's a very unlikely situation.
More likely, the biggest problem you will experience from generating static pages and saving them to disk is coding (and maintaining) the logic for re-generating the pages whenever necessary. You will need to keep track of exactly what data goes into each page, and in every place in the code where data can be changed, you will need to invoke re-generation of all the relevant pages. If you forget just one, then your users may be looking at stale data some of the time.
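One way to keep that bookkeeping in a single place is to record, for each page, which pieces of data it was built from, and to route every data change through one function. The Python sketch below is only an illustration under made-up names (register_page, update_data, dependents); the static_pages dict stands in for files written to disk.

```python
from collections import defaultdict

data = {"title": "Hello", "price": 10}  # stand-in for the database
static_pages = {}                       # page path -> generated HTML
renderers = {}                          # page path -> function that builds it
dependents = defaultdict(set)           # data key -> pages built from it


def register_page(path, depends_on, render):
    # Remember what the page depends on, then generate it once.
    renderers[path] = render
    for key in depends_on:
        dependents[key].add(path)
    static_pages[path] = render(data)


def update_data(key, value):
    # The only place data is changed, so no re-generation can be forgotten.
    data[key] = value
    for path in dependents[key]:
        static_pages[path] = renderers[path](data)
```

Pages that do not depend on the changed key are left untouched, so an update costs only as much as the pages it actually affects.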
If you mix dynamic generation per-request and caching generated pages on disk, then your code will be harder to read and maintain, because of mixing the two styles.
And you can't really cache generated pages on disk in certain situations -- like responding to POST requests which come from a form submission. Or imagine that when your users invoke certain actions, you have to send a request to a 3rd party API, and the data which comes back from that API will be used in the page. What comes back from the API may be different each time, so in this case, you need to generate the page dynamically each time.
Static pages (or, better, static resources) are filled with content that does not change, or at least not often, and that does not allow further queries on it: About page, Contact, ...
In this case it doesn't make any sense to query these pages. On the other side we have Data (e.g. in a Database) and want to query it/give the user the opportunity to query it. In this case you give the User a page with the possibility to specify the query and return a rendered page with the dynamically generated data.
In my opinion it depends on the result you want to present to the user: either it is purely informational, or it offers the possibility of querying a data source. The first result is known before you do anything; the second (query data) is known only once you have the query parameters, which means you don't know the result beforehand (it could be empty or invalid).
It depends on your architecture, but since GET requests should be idempotent, it should also be easy to cache dynamic pages with a proxy and invalidate the cache when something changes in the data displayed at the cached path. This can save a lot of time, because the system behaves as if the cached pages were static, but instead of coming from the filesystem they come from memory, which is really fast.

LINQ to XML updates - how does it handle multiple concurrent readers/writers?

I have an old system that uses XML for its data storage. I'm going to be using the data for another mini-project and wanted to use LINQ to XML for querying/updating the data; but there are 2 scenarios that I'm not sure whether I need to handle myself or not:
1- If I have something similar to the following code, and 2 people happen to hit the Save() at the same time? Does LINQ to XML wait until the file is available again before saving, or will it just throw? I don't want to put locks in unless I need to :)
using System.Xml.Linq;

// I assume the next line doesn't lock the file
XElement doc = XElement.Load("Books.xml");
XElement newBook = new XElement("Book",
    new XAttribute("publisher", "My Publisher"),
    new XElement("author", "Me"));
doc.Add(newBook);
// What happens if two people try this at the same time?
doc.Save("Books.xml");
2- If I Load() a document, add an entry under a particular node, and then hit Save(); what happens if another user has already added a value under that node (since I hit my Load()) or even worse, deleted the node?
Obviously I can work around these issues, but I couldn't find any documentation that could tell me whether I have to or not, and the first one at least would be a bit of a pig to test reliably.
It's not really a LINQ to XML issue, but a basic concurrency issue.
Assuming the two people are hitting Save at the same time, and the backing store is a file, then depending on how you opened the file for saving, you might get an error. If you leave it to the XDocument class (by just passing in a file name), then chances are it is opening it exclusively, and someone else trying to do the same (assuming the same code hitting it) will get an exception. You basically have to synchronize access to any shared resource that you are reading from/writing to.
If another user has already added a value, then assuming you don't have a problem obtaining the resource to write to, your changes will overwrite theirs. This is a frequent issue with databases (a lost update), and the usual remedy is known as optimistic concurrency: you need some sort of value to indicate whether a change has occurred between the time you loaded the data and the time you save it (most databases will generate timestamp values for you).
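The version-check idea is language-neutral, so here is a small sketch in Python rather than C#. VersionedStore and ConflictError are hypothetical names, and the in-memory list stands in for the XML file; LINQ to XML provides nothing like this out of the box.

```python
class ConflictError(Exception):
    """Raised when the data changed between load and save."""


class VersionedStore:
    """Toy store that rejects saves based on a stale load."""

    def __init__(self, content):
        self._content = list(content)
        self._version = 0

    def load(self):
        # Hand back a copy together with the version it was read at.
        return list(self._content), self._version

    def save(self, content, loaded_version):
        # Refuse to overwrite if someone else saved since our load.
        if loaded_version != self._version:
            raise ConflictError("store changed since it was loaded")
        self._content = list(content)
        self._version += 1
```

A caller that gets ConflictError reloads, reapplies its change and tries again. In a real multi-user setup the compare-and-save inside save() would itself have to be atomic, for example by holding a lock on the file while it runs.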

Resources