Cache like mechanism (which Data Structure)? - algorithm

I am fetching some questions from the server (database) and showing it to client (user) in the browser. The client will answer the question and based on his/her answer the next set of questions will be fetched from the database. Now, I want to pre-fetch the next set of questions while the user read the present question so that the waiting time for user to see the next question will be shorter.
My questions is, how to store the pre-fetched questions i.e. which data structure should I use to store the pre-fetched questions in the memory so that I can get better performance? I want a "cache" type of thing. Also once the user hit any question from the cache the question won't be there any more.
PS: Each question has unique Id.
Thanks
Naveen

There are multiple options to go about it. One that makes a big difference, one that makes little.
Little difference would be to fetch questions and store it in user's session. It's basically depends on where your session is stored, could also be database, or a file. This only makes sense if your db tables are very denormalized and it requires lots of joins to get the answer. I doubt that's the case so this won't make much difference for the user no matter which data structure used.
Big difference would make prefetching them with AJAX using javascript straight into the browser. In this case a simple array would suffice. JS gives you flexibility to build any objects with any properties, anything would be good enough. So write a poller in JS which fetches the questions from server while user is looking at the question, return them using JSON for example. JSON will become a simple object. Since each user stores only a couple of questions prefetched in their browser particular data structure choice won't make a difference here either.

Try using LinkedHashMap as You will have LRU algorithm implemented quickly with good performance.
Read this link as well :
LinkedHashMap as cache

First a few questions to adapt to your context :
assuming you use Java ?
using Hibernate also ?
If you want to prefetch in the server, many caching solutions exists.
Taking into account your unique id (see PS), if this ID is database related and you are using Hibernate, the easiest solution would be to configure the Hibernate second-level cache for that entity. Then, your only code would be to run the query in advance....
If theses requisites do not fit, I used EhCache as the caching solutions.
Somehow easy to start using, and it has plenty of features available when you later need them.

Related

Caching data with frequently changed fields, what are the best solutions?

I'm currently implementing a news feed feature in our application, where an user should be able to query a number of posts that were pre-generated for them and were cached in redis.
The problem is, each post contains a lot of fields that are frequently updated (number of likes, comments, etc...) and if I run these write operations to redis itself, I'm afraid it would affects the read performance, since there are very large number of users currently using our application.
Do you recommend any solution for this?
Even a few seconds can help greatly. I set most API objects to at least 3-5 seconds.
Here are some best practices recommended from AWS:
https://aws.amazon.com/caching/best-practices/

How to keep your distributed cache clean?

In a N-Tier architecture, what would be the best patterns to use so that you can keep your cache clean?
I know it's easy to just set an absolute/sliding timeout, but is there a better mechanism available to allow you to mark your cache as dirty after you update the underlying persistence.
The difficulty I"m trying to wrap my head around is that Cache are usually stored as KVP. But a query is usually a fair bit more complex than that. So how can the gateway service tell the cache store that for such and such query, it needs to refetch from persistence.
I also can't afford to hand-code the cache update per query. I'm looking for a more systematic approach.
Is this just a pipe dream, or is there some way to do this elegantly?
Link/Guide/Post appreciated.
I have worked with AppFabric and I think tried to do what you are asking about. I was working on an auction site and I wanted to pro-actively invalidate items in the cache.
For example, we had listings (things for sale) and they would be present all over the cache (AppFabric). The data that represented a listing was in 10 different places. What I initially wanted was a way to say, "Ok, my listing has changed. Let me go find everywhere it exists in cache, and then update." (I think you say "mark as dirty" in your question)
I found doing this was incredibly difficult. There are tags in AppFabric that I tried to use, so I would mark a given object (or collection of objects) with a tag and that would let me query the cache and remove items. In other words, if an object had a LISTING tag, I would find it and invalidate it.
Eventually I settled on a two-pronged attack.
For 95% of the data I let it expire. It was a happy day when I decided this because everything got much easier to develop. I had to make some concessions in the UI etc., but it was well worth it.
For the last 5% of the data I resolved to only ever store it once. For example, a bid on a listing. Whenever a new bid came in, we'd pro-actively invalidate that object, and then everything that needed that information would be updated as well.

What should be stored in cache for web app?

I realize that this might be a vague question the bequests a vague answer, but I'm in need of some real world examples, thoughts, &/or best practices for caching data for a web app. All of the examples I've read are more technical in nature (how to add or remove cache data from the respective cache store), but I've not been able to find a higher level strategy for caching.
For example, my web app has an inbox/mail feature for each user. What I've been doing to date is storing typical session data in the cache. In this example, when the user logs in I go to the database and retrieve the user's mail messages and store them in cache. I'm beginning to wonder if I should just maintain a copy of all users' messages in the cache, all the time, and just retrieve them from cache when needed, instead of loading from the database upon login. I have a bunch of other data that's loaded on login (product catalogs and related entities) and login is starting to slow down.
So I guess my question to the community, is what would you do/recommend as an approach in this scenario?
Thanks.
This might be better suited to https://softwareengineering.stackexchange.com/, but generally you want to cache:
Metadata/configuration data that does not change frequently. E.g. country/state lists, external resource addresses, logic/branching settings, product/price/tax definitions, etc.
Data that is costly to retrieve or generate and that does not need to frequently change. E.g. historical data sets for reports.
Data that is unique to the current user's session.
The last item above is where you need to be careful as you can drastically increase your app's memory usage, by adding a few megabytes to the data for every active session. It also implies different levels of caching -- application wide, user session, etc.
Generally you should NOT cache data that is under active change.
In larger systems you also need to think about where the cache(s) will sit. Is it possible to have one central cache server, or is it good enough for each server/process to handle its own caching?
Also: you should have some method to quickly reset/invalidate the cached data. For a smaller or less mission-critical app, this could be as simple as restarting the web server. For the large system that I work on, we use a 12 hour absolute expiration window for most cached data, but we have a way of forcing immediate expiration if we need it.
This is a really broad question, and the answer depends heavily on the specific application/system you are building. I don't know enough about your specific scenario to say if you should cache all the users' messages, but instinctively it seems like a bad idea since you would seem to be effectively caching your entire data set. This could lead to problems if new messages come in or get deleted. Would you then update them in the cache? Would that not simply duplicate the backing store?
Caching is only a performance optimization technique, and as with any optimization, measure first before making substantial changes, to avoid wasting time optimizing the wrong thing. Maybe you don't need much caching, and it would only complicate your app. Maybe the data you are thinking of caching can be retrieved in a faster way, or less of it can be retrieved at once.
Cache anything that causes duplicate database queries.
Client side file caching is important as well. Assuming files are marked with an id in your database, cache them on every network request to avoid many network requests for the same file. A resource to do this can be found here (https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API). If you don't need to cache files, web storage, local storage and cookies are good for smaller pieces of data.
//if file is in cache
//refer to cache
//else
//make network request and push file to cache

Cache Management with Numerous Similar Database Queries

I'm trying to introduce caching into an existing server application because the database is starting to become overloaded.
Like many server applications we have the concept of a data layer. This data layer has many different methods that return domain model objects. For example, we have an employee data access object with methods like:
findEmployeesForAccount(long accountId)
findEmployeesWorkingInDepartment(long accountId, long departmentId)
findEmployeesBySearch(long accountId, String search)
Each method queries the database and returns a list of Employee domain objects.
Obviously, we want to try and cache as much as possible to limit the number of queries hitting the database, but how would we go about doing that?
I see a couple possible solutions:
1) We create a cache for each method call. E.g. for findEmployeesForAccount we would add an entry with a key account-employees-accountId. For findEmployeesWorkingInDepartment we could add an entry with a key department-employees-accountId-departmentId and so on. The problem I see with this is when we add a new employee into the system, we need to ensure that we add it to every list where appropriate, which seems hard to maintain and bug-prone.
2) We create a more generic query for findEmployeesForAccount (with more joins and/or queries because more information will be required). For other methods, we use findEmployeesForAccount and remove entries from the list that don't fit the specified criteria.
I'm new to caching so I'm wondering what strategies people use to handle situations like this? Any advice and/or resources on this type of stuff would be greatly appreciated.
I've been struggling with the same question myself for a few weeks now... so consider this a half-answer at best. One bit of advice that has been working out well for me is to use the Decorator Pattern to implement the cache layer. For example, here is an article detailing this in C#:
http://stevesmithblog.com/blog/building-a-cachedrepository-via-strategy-pattern/
This allows you to literally "wrap" your existing data access methods without touching them. It also makes it very easy to swap out the cached version of your DAL for the direct access version at runtime quite easily (which can be useful for unit testing).
I'm still struggling to manage my cache keys, which seem to spiral out of control when there are numerous parameters involved. Inevitably, something ends up not being properly cleared from the cache and I have to resort to heavy-handed ClearAll() approaches that just wipe out everything. If you find a solution for cache key management, I would be interested, but I hope the decorator pattern layer approach is helpful.

Organizing memcache keys

Im trying to find a good way to handle memcache keys for storing, retrieving and updating data to/from the cache layer in a more civilized way.
Found this pattern, which looks great, but how do I turn it into a functional part of a PHP application?
The Identity Map pattern: http://martinfowler.com/eaaCatalog/identityMap.html
Thanks!
Update: I have been told about the modified memcache (memcache-tag) that apparently does do a lot of this, but I can't install linux software on my windows development box...
Well, memcache use IS an identity map pattern. You check your cache, then you hit your database (or whatever else you're using). You can go about finding information about the source by storing objects instead of just values, but you'll take a performance hit for that.
You effectively cannot ask the cache what it contains as a list. To mass invalidate, you'll have to keep a list of what you put in and iterate it, or you'll have to iterate every possible key that could fit the pattern of concern. The resource you point out, memcache-tag can simplify this, but it doesn't appear to be maintained inline with the memcache project.
So your options now are iterative deletes, or totally flushing everything that is cached. Thus, I propose a design consideration is the question that you should be asking. In order to get a useful answer for you, I query thus: why do you want to do this?

Resources