What is the name of this kind of cache/data structure? - algorithm

I need a fixed-size cache of objects that keeps track of how many times each object was requested. When it is full and a new object is added, the object with the lowest usage score gets removed.
So this is different from an LRU cache of size N in that if some object is heavily requested, even adding N new objects won't push it out of the cache.
Some kind of mix of a cache and a priority queue. Is there a name for that?
Thanks!

Without a time element, this kind of cache clogs up with things that were used a lot in the past, but aren't used currently. Replacement becomes impossible, because everything in the cache has been used more than once, so you won't evict anything in favor of a new item.
You could write some code that degrades the value of the count over time (i.e. take into account the time since last used), but doing so is just a really complicated way of simulating an LRU cache. I experimented with it at one point, but found that it didn't perform any better than the simple LRU cache. At least not in my application.
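For what it's worth, the policy described in the question is usually called an LFU (Least Frequently Used) cache. Below is a minimal Java sketch of the idea, including the crude count decay mentioned above; the class and method names are made up, and eviction is a linear scan for brevity rather than the frequency buckets a production LFU would use.
import java.util.HashMap;
import java.util.Map;
// Minimal LFU cache sketch: evicts the entry with the lowest hit count.
class LfuCache<K, V> {
    private static class Entry<V> {
        V value;
        long hits;
        Entry(V value) { this.value = value; this.hits = 1; }
    }
    private final int capacity;
    private final Map<K, Entry<V>> map = new HashMap<>();
    LfuCache(int capacity) { this.capacity = capacity; }
    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        e.hits++;                                   // bump the usage score on every request
        return e.value;
    }
    void put(K key, V value) {
        Entry<V> e = map.get(key);
        if (e != null) { e.value = value; e.hits++; return; }
        if (map.size() >= capacity) evictLeastUsed();
        map.put(key, new Entry<>(value));
    }
    private void evictLeastUsed() {
        K victim = null;
        long min = Long.MAX_VALUE;
        for (Map.Entry<K, Entry<V>> e : map.entrySet()) {
            if (e.getValue().hits < min) { min = e.getValue().hits; victim = e.getKey(); }
        }
        map.remove(victim);                         // drop the least-requested object
    }
    // Crude aging, as suggested above: periodically halve all counts so entries
    // that were hot long ago can eventually be evicted.
    void decay() {
        for (Entry<V> e : map.values()) e.hits = Math.max(1, e.hits / 2);
    }
}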

Related

VSAM Search VS COBOL search/loop

I have a file that could contain about 3 million records. Certain records of this file will need to be updated multiple times throughout the run of the program. If I need to pull specific records from this file, which of the following is more efficient:
Indexed VSAM search
Indexed flat file with a COBOL search all
Buffering all of the data into working storage and writing a loop to handle the search
Obviously, if you can buffer all of the data into memory (and if the host system can support a working set of pages big enough to allow all of it to actually remain in RAM without paging), then this would probably be the fastest possible approach.
But, be very careful to consider "hidden disk I/O" caused by the virtual-memory paging subsystem! If the requested "in-memory" data is, in fact, not in memory, a page fault will occur and your process will stop in its tracks until the page has been retrieved. (And if "page stealing" occurs, well, you're in trouble: your "in-memory" strategy just turned into a possibly very inefficient disk-based one.) If keys are distributed randomly, then your process has a gigantic working set that it is accessing randomly; if all of that data is not actually in memory, and does not stay there, you're in trouble.
If you are making updates to a large file, consider sorting the updates-delta file before processing it, so that all occurrences of the same key will be adjacent. You can now write your COBOL program to take advantage of this (and, of course, to abend if an out-of-sequence record is ever detected!). If the key in "this" record is identical to the key of the "previous" one, then you do not need to re-read the record. (And, you do not actually need to write the old record, until the key does change.) As the indexed-file access method is presented with the succession of keys, each key is likely to be "close to" the one previously-requested, such that some of the necessary index-tree pages will already be in-memory. Obviously, you will need to benchmark this, but the amount of time spent sorting the file can be far less than the amount of time spent in index-lookups. (Which actually can be considerable.)
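To make that control flow concrete, here is a rough Java rendering of the sorted-delta idea (the original setting is COBOL/VSAM, so treat this purely as a sketch). MasterFile and Delta are hypothetical stand-ins for the indexed file and the update records; the point is the key-change logic, not the I/O.
import java.util.Iterator;
// Apply a delta file that has been pre-sorted by key: one READ per distinct key,
// one REWRITE only when the key changes, and an abend-style failure on any
// out-of-sequence record.
class SortedDeltaApplier {
    interface MasterFile {
        String read(String key);                    // READ the master record by key
        void rewrite(String key, String record);    // REWRITE the updated record
    }
    record Delta(String key, String newValue) {}
    static void apply(Iterator<Delta> sortedDeltas, MasterFile master) {
        String previousKey = null;
        String current = null;
        while (sortedDeltas.hasNext()) {
            Delta d = sortedDeltas.next();
            if (previousKey != null && d.key().compareTo(previousKey) < 0)
                throw new IllegalStateException("delta file out of sequence: " + d.key());
            if (!d.key().equals(previousKey)) {
                if (previousKey != null) master.rewrite(previousKey, current); // write only on key change
                current = master.read(d.key());                                // one READ per distinct key
                previousKey = d.key();
            }
            current = d.newValue();                                            // apply the change in memory
        }
        if (previousKey != null) master.rewrite(previousKey, current);         // flush the last record
    }
}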
Mike's answer raises the important issue of "hidden I/O" (it depends on the machine, configuration, and amount of data)...
If you are likely to update many records, the option Mike suggests is the most useful one.
If you are likely to update only a few records (I'd guess below 2%), another approach can be quite a bit faster (it needs a benchmark!):
read every key via indexed VSAM search
store the changed record in memory (a big OCCURS table); if you will only change some values and the record is quite big, then only store the key plus all possible changed values in the table, without an actual REWRITE yet
before doing a VSAM search: look in your OCCURS table to see whether you have read the key already, and take the values either from there or read a new record
...
at program end: go through your OCCURS table and REWRITE all records (if you have the complete record a REWRITE is enough, otherwise you'd need a READ first to get the complete record)
Performance is often: "know your data and possible program flow, then try the best 2-3 approaches, benchmark, and decide".
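Here is a small Java sketch of that second approach, assuming for simplicity that the whole changed record is kept in the in-memory table; the VsamFile interface is a hypothetical stand-in for the indexed READ/REWRITE calls.
import java.util.LinkedHashMap;
import java.util.Map;
// Collect changes in memory and defer all REWRITEs to program end.
class DeferredRewriteCache {
    interface VsamFile {
        String read(String key);
        void rewrite(String key, String record);
    }
    private final Map<String, String> changed = new LinkedHashMap<>(); // plays the role of the OCCURS table
    private final VsamFile file;
    DeferredRewriteCache(VsamFile file) { this.file = file; }
    // Look in the in-memory table first; only hit VSAM for unseen keys.
    String get(String key) {
        String pending = changed.get(key);
        return pending != null ? pending : file.read(key);
    }
    // Record the change in memory instead of issuing a REWRITE right away.
    void update(String key, String newRecord) {
        changed.put(key, newRecord);
    }
    // At program end: one REWRITE per changed key.
    void flush() {
        for (Map.Entry<String, String> e : changed.entrySet()) {
            file.rewrite(e.getKey(), e.getValue());
        }
        changed.clear();
    }
}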

LRU cache with objects that should not be removed

I use an LRU cache (the LruCache class from android.util) in my Android app, which is generally working fine.
Now I have a special requirement for this LRU cache: I would like some objects to never be removed. To explain the situation: I have an array of objects (named mymetadata objects) that should never be removed, and I have a lot of other objects (named dynamicdata objects) that should be removed with the LRU rule. I would like to store the mymetadata objects in the LRU cache because the array of objects can also grow, and using an LRU cache helps avoid running out of memory.
Is there any trick to guarantee that mymetadata objects are never removed from the LRU cache? Or should I simply access an object from the array so that it is marked as last used?
Is there any trick to guarantee that mymetadata is never removed from the LRU cache? Or should I simply access an object of the array so that it is marked as last used?
Besides regularly touching the objects you want to keep in the LRU cache (to force raising their ranks), I don't see what else could be done. One problem with this approach is deciding when these objects should be touched, and what the performance impact of that operation is.
A different approach would be to split the storage of your objects depending on their persistence. Keep a standard map for your persistent objects and an LRU cache for objects that can expire. This mix of two data structures can then be hidden behind a single interface similar to the ones of Map or LruCache (each query is directed to the right internal storage).
I would like to put the mymetadata objects into the LRU cache, because the array of objects can also grow
This seems to be in conflict with your "never removed" requirement for some objects. How do you decide when a persistent object is allowed to expire?
Anyway, yet another approach would consist in reimplementing the LRU cache data structure, keeping two separate ordered lists of objects instead of a single one: one for mymetadata objects and one for dynamicdata objects. Each query to this data structure is then directed to the right list, and both kinds of objects can expire independently (the size of the cache can also be chosen independently for each set of objects). But both kinds of objects are stored in the same hash table / map.
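As an illustration of the split-storage idea, here is a minimal Java sketch; the names are made up, and it uses a plain LinkedHashMap-based LRU instead of android.util.LruCache so it runs anywhere.
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
// Persistent ("mymetadata") entries live in a plain map and are never evicted;
// expirable ("dynamicdata") entries live in a small LRU. Both sit behind one
// get/put-style interface.
class PinnedLruCache<K, V> {
    private final Map<K, V> pinned = new HashMap<>();
    private final LinkedHashMap<K, V> lru;
    PinnedLruCache(final int lruCapacity) {
        // access-order LinkedHashMap gives LRU eviction via removeEldestEntry
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > lruCapacity;
            }
        };
    }
    void putPinned(K key, V value) { pinned.put(key, value); }   // never evicted
    void putEvictable(K key, V value) { lru.put(key, value); }   // subject to LRU
    V get(K key) {
        V v = pinned.get(key);
        return v != null ? v : lru.get(key);   // lru.get() also refreshes the access order
    }
}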

How to store a few million cache entries and then track the 20 oldest

I got an interview question saying I need to store a few million cache entries and keep track of the 20 oldest; as soon as the cache collection grows past its threshold, I should replace the 20 oldest entries with the next set of oldest ones.
I answered that I would keep a hashmap for it. The follow-up question was: what if we want to access any element of the hashmap quickly, how do we do that? I said that since it is a map, access would not be time-consuming, but the interviewer was not satisfied. So what would be the ideal approach for such scenarios?
A queue is well-suited to finding and removing the oldest members.
A queue implemented as a doubly linked list has O(1) insertion and deletion at both ends.
A priority queue lends itself to giving different weights to different items in the queue (e.g. some queue elements may be more expensive to re-create than others).
You can use a hash map to hold the actual elements and find them quickly based on the hash key, and a queue of the hash keys to track age of cache elements.
By using a doubly linked list for the queue and also maintaining a hash map of the elements, you should be able to make a cache that supports a max size (or even an LRU cache). Note that this results in references to each object being stored multiple times, not in the object itself being stored twice; be sure to check for this if you implement it (a simple way to avoid the extra references is to just queue the hash key).
When checking for overflow, you just pop the last item off the queue and then remove it from the hash map.
When accessing an item, you use the hash map to find the cached item. Then, if you are implementing an LRU cache, you just remove it from the queue and add it back to the beginning.
By using this structure, insert, update, read, and delete are all going to be O(1).
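A minimal Java sketch of the hash map + queue combination, assuming the eviction policy from the question (drop the 20 oldest entries once the threshold is exceeded) and queuing only the keys; all names are made up.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
// Hash map for O(1) lookup; a queue of keys records insertion order so the
// oldest entries can be dropped in a batch once the cache grows too large.
class OldestEvictingCache<K, V> {
    private final int threshold;     // max number of entries
    private final int evictBatch;    // how many oldest entries to drop (20 in the question)
    private final Map<K, V> map = new HashMap<>();
    private final Deque<K> insertionOrder = new ArrayDeque<>(); // queue only the keys
    OldestEvictingCache(int threshold, int evictBatch) {
        this.threshold = threshold;
        this.evictBatch = evictBatch;
    }
    V get(K key) { return map.get(key); }   // O(1) access
    void put(K key, V value) {
        if (!map.containsKey(key)) insertionOrder.addLast(key);
        map.put(key, value);
        if (map.size() > threshold) {
            for (int i = 0; i < evictBatch && !insertionOrder.isEmpty(); i++) {
                map.remove(insertionOrder.pollFirst());   // drop the oldest entries
            }
        }
    }
}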
The follow-on question to expect is for the interviewer to ask for the ability for items to have a time-to-live (TTL) that varies per cached item. For this you need another queue that maintains items ordered by time-to-live; the only problem is that inserts now become O(n), as you have to scan the TTL queue to find the right spot, so you have to decide whether the memory usage of storing the TTL queue as a tree is worthwhile (thereby yielding O(log n) insert time). Or you could implement your TTL queue as buckets for each ~1-minute interval or similar (see the sketch below); you still get ~O(1) inserts and only degrade the performance of your expiration background process slightly, but not greatly (and it's a background process).
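A sketch of that bucketed-TTL variant, grouping expiring keys by the minute in which they expire; a background thread would call sweep() periodically. The names are made up.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
// Inserts stay ~O(1) because each key just lands in its expiry-minute bucket;
// the sweep drops whole buckets whose minute has passed.
class TtlBuckets<K> {
    private final Map<Long, List<K>> buckets = new HashMap<>(); // expiry minute -> keys
    void add(K key, long ttlMillis) {
        long expiryMinute = (System.currentTimeMillis() + ttlMillis) / 60_000;
        buckets.computeIfAbsent(expiryMinute, m -> new ArrayList<>()).add(key);
    }
    // Returns the keys whose buckets have expired and removes those buckets.
    List<K> sweep() {
        long nowMinute = System.currentTimeMillis() / 60_000;
        List<K> expired = new ArrayList<>();
        buckets.keySet().removeIf(minute -> {
            if (minute <= nowMinute) {
                expired.addAll(buckets.get(minute));
                return true;
            }
            return false;
        });
        return expired;
    }
}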

MongoDB caching counters

I'm writing a visit counter for products on a website which uses MongoDB as its DB engine.
Here it says that Mongo keeps frequently accessed stuff in memory and has an integrated in-memory caching engine.
So can I just rely on this integrated caching system and simply increment the counters on every visit, or does one still need another caching layer in a high-traffic environment?
They're two separate things. MongoDB uses a simple paged memory management system that, by design, keeps the most accessed parts of the memory-mapped disk space in memory.
As a result, this will help you most for counters that are requested frequently but do not change often. Unfortunately, for website counters these two things are mutually exclusive. Because increasing counters will generally not cause MongoDB to move the document holding the counter on disk, the read caching will still be fairly effective.
The main issue is your writes: basically, doing an increment per visit is not going to be very cost effective. I suggest a strategy where your counter webapp caches incoming visits and only pushes counter updates every X visits or every Y seconds, whichever comes first. Your main goal here is to reduce writes per second, so you definitely do not want a DB write per counter visit.
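A sketch of that batching strategy: counts are accumulated in memory and flushed every X visits or Y seconds, whichever comes first. flushToDb() is a placeholder for whatever bulk update you send to MongoDB, and the names are made up.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
// Buffer visit counts per product and push them to the database in batches.
class VisitCounterBuffer {
    private final Map<String, LongAdder> pending = new ConcurrentHashMap<>();
    private final int flushEveryVisits;
    private final long flushEveryMillis;
    private volatile long lastFlush = System.currentTimeMillis();
    private volatile int visitsSinceFlush = 0;   // approximate under concurrency; only controls flush timing
    VisitCounterBuffer(int flushEveryVisits, long flushEveryMillis) {
        this.flushEveryVisits = flushEveryVisits;
        this.flushEveryMillis = flushEveryMillis;
    }
    void recordVisit(String productId) {
        pending.computeIfAbsent(productId, id -> new LongAdder()).increment();
        visitsSinceFlush++;
        long now = System.currentTimeMillis();
        if (visitsSinceFlush >= flushEveryVisits || now - lastFlush >= flushEveryMillis) {
            flush();
        }
    }
    synchronized void flush() {
        visitsSinceFlush = 0;
        lastFlush = System.currentTimeMillis();
        for (Map.Entry<String, LongAdder> e : pending.entrySet()) {
            long count = e.getValue().sumThenReset();
            if (count > 0) flushToDb(e.getKey(), count);   // e.g. one $inc of `count` per product
        }
    }
    private void flushToDb(String productId, long count) {
        // placeholder: issue an update with { $inc: { visits: count } } here
    }
}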
Although I have never worked on the kind of system you describe, I would do the following (assuming that I have read your question correctly and that you do indeed simply want to increment the counter for each visit).
Use the $inc operator to perform the increment atomically, or use upserts with modifiers to create the document structure if it is not already there (see the sketch after this list)
Use an appropriate Write Concern to speed up updates if it is safe to do so (i.e. with a Write Concern of NONE your call to update will return immediately and you'll just have to trust Mongo to persist it to disk). Of course, whether this is safe or not depends on the use case. If you are counting millions of hits, then one failed hit may not be a problem.
If the scale of data you are storing is truly enormous, look into using sharding to partition writes
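As a sketch of the first two points using the MongoDB Java driver: the "shop"/"counters" names and the "visits" field are made up, and UNACKNOWLEDGED plays the role of the old fire-and-forget write concern mentioned above.
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;
// Atomically increment a per-product visit counter, creating the document on
// first visit, and trade write safety for speed via the write concern.
public class ProductCounter {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> counters = client
                    .getDatabase("shop")
                    .getCollection("counters")
                    .withWriteConcern(WriteConcern.UNACKNOWLEDGED); // fire-and-forget writes
            counters.updateOne(
                    Filters.eq("_id", "product-42"),
                    Updates.inc("visits", 1),                       // atomic $inc
                    new UpdateOptions().upsert(true));              // create the doc if missing
        }
    }
}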

How to deal with expiring item (due to TTL) in memcached on high-load website?

When you have peaks of 600 requests/second, the memcache flushing an item due to the TTL expiring has some pretty negative effects. At almost the same time, 200 threads/processes find the cache empty and fire off a DB request to fill it up again.
What is the best practice to deal with these situations?
p.s. what is the term for this situation? (it would give me a chance to get better Google results on the topic)
If you have memcached objects which will be needed on a large number of requests (which you imply is the case), then I would look into having a separate process or cron job that regularly recalculates and refreshes these objects. That way they should never expire. It's a common trade-off: you add a little unnecessary load during low-traffic time to help reduce the load during peaks (the time you probably care the most about).
I found out this is referred to as "stampeding herd" by the memcached folks, and they discuss it here: http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Avoiding_stampeding_herd
My next suggestion was actually going to be using soft cache limits as discussed in the link above.
If your object is expiring because you've set an expiry and it has gone past that date, there is nothing you can do but increase the expiry time.
If you are worried about stale data, a few techniques exist you could consider:
Consider making the cache the authoritative source for whatever data you are looking at, and make a thread whose job is to keep it fresh. This will make the other threads block on refilling the cache, so it may only make sense if you can
Rather than setting a TTL on the data, change whatever process updates the data to also update the cache. One technique I use for frequently changing data is to do this probabilistically -- 10% of the time the data is written, the cache is updated. You can tune this to whatever is sensible, depending on how expensive the DB query is and how severe the impact of stale data is.
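A sketch of that probabilistic refresh, with writeToDb and refreshCache as placeholders for the actual DB write and cache update:
import java.util.concurrent.ThreadLocalRandom;
// On each write of the underlying data, also refresh the cached copy some
// fraction of the time (10% here) instead of relying on a TTL.
class ProbabilisticRefresher {
    private static final double REFRESH_PROBABILITY = 0.10;
    void writeData(String key, String value) {
        writeToDb(key, value);
        if (ThreadLocalRandom.current().nextDouble() < REFRESH_PROBABILITY) {
            refreshCache(key, value);   // most writes skip the cache update
        }
    }
    private void writeToDb(String key, String value) { /* placeholder */ }
    private void refreshCache(String key, String value) { /* placeholder */ }
}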
