How to implement LRU algorithm? - caching

I'm writing a program that simulates LRU caching. For each new input, the program should check whether it is already in the cache. If it is, it just increments the number of hits and the frequency of the index that holds the element. If the input is not in the cache and the cache still has empty space, it pushes the input into the cache. If the cache is full and the input is not in it, it finds the index with the lowest frequency, stores the input there, and resets that index's frequency to 0. However, I didn't get the expected result. After some analysis, I figured out that I don't handle the case of two indexes with the same minimum frequency. I have no idea how to handle that. Do you?
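What the question describes (evicting the entry with the lowest hit count) is closer to a least-frequently-used policy than to strict LRU. One possible way to handle ties, offered here only as a sketch and not as the asker's code, is to break them by recency: among the entries sharing the minimum frequency, evict the one that was touched longest ago. The class name `FrequencyCache`, its fields, and the logical clock are all hypothetical, and the capacity is assumed to be at least 1.

```python
class FrequencyCache:
    """Frequency-based cache that breaks ties by evicting the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be >= 1
        self.freq = {}            # key -> hits since the key was inserted
        self.last_used = {}       # key -> logical time of the last access
        self.clock = 0            # logical clock, incremented on every access
        self.hits = 0

    def access(self, key):
        self.clock += 1
        if key in self.freq:
            # Cache hit: count it and bump the entry's frequency.
            self.hits += 1
            self.freq[key] += 1
        else:
            if len(self.freq) >= self.capacity:
                # Lowest frequency first; among equals, the oldest access loses.
                victim = min(self.freq,
                             key=lambda k: (self.freq[k], self.last_used[k]))
                del self.freq[victim]
                del self.last_used[victim]
            self.freq[key] = 0    # a new entry starts with frequency 0
        self.last_used[key] = self.clock
```

With a capacity of 2, accessing "a", "b", "c" in that order evicts "a": "a" and "b" are tied at frequency 0, and "a" was used less recently.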

Related

Sequential data output without writing to disk

4kb memory
1gb data
8 bytes per data
Sequential data output without writing to disk.
First, you can only accumulate and output a maximum of 4 KB per pass through the data. So you will need at least 1 GB / 4 KB, or something like 250,000 passes.
How can you accumulate stuff in a pass?
In pseudocode the idea is like this.
while not done:
    for each data (8 bytes) in dataset:
        if this data has never been output:
            if it might belong in the current batch:
                add to current batch (evicting something else if needed)
    if current_batch not empty:
        sort current batch
        emit current batch
        update "never been output" filter
    else:
        done
What does that filter look like? It needs to know three things:
What is the maximum value so far emitted?
How many times has it been emitted?
How many times has it been seen on this pass?
Any value below the maximum value gets ignored. A value equal to the maximum is ignored until you've seen it more times on this pass than it has already been emitted; after that, you can add it to the current batch. Any value above the maximum is a candidate right away.
Now how about the current batch you're accumulating? That can be a max-heap, which tells you the maximum value in the batch. If the heap is not full, you add the current value to the batch. If the heap is full and the current value is below the maximum in the batch, you add it to the batch and lose the current max.
If the heap is stored in an array, then when the batch is done you can remove the max, which will free up the last slot (that's how heaps work), and put the max there. Keep doing that and you'll heapsort the batch so that the smallest is first. Now you can easily update the filter, and then emit the batch.
I don't think you can get significantly more efficient than this.
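The multi-pass scheme above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a memory-exact implementation: `sorted_passes`, `read_all`, and `batch_size` are hypothetical names, `read_all()` is assumed to return a fresh iterator over the whole dataset on every call, the max-heap is emulated by pushing negated values onto `heapq` (a min-heap), and `sorted()` stands in for the in-place heapsort described in the answer.

```python
import heapq


def sorted_passes(read_all, batch_size):
    """Yield the data in ascending batches of at most batch_size values."""
    max_emitted = None  # largest value emitted so far
    emitted_count = 0   # how many copies of max_emitted were already emitted
    while True:
        seen = 0        # copies of max_emitted seen on this pass
        heap = []       # max-heap of the current batch (values stored negated)
        for value in read_all():
            if max_emitted is not None:
                if value < max_emitted:
                    continue                  # already output in an earlier pass
                if value == max_emitted:
                    seen += 1
                    if seen <= emitted_count:
                        continue              # this copy was already output
            if len(heap) < batch_size:
                heapq.heappush(heap, -value)
            elif value < -heap[0]:
                heapq.heapreplace(heap, -value)  # evict the current batch max
        if not heap:
            return                            # nothing left to emit
        batch = sorted(-v for v in heap)
        top, top_count = batch[-1], batch.count(batch[-1])
        if top == max_emitted:
            emitted_count += top_count        # more copies of the same maximum
        else:
            max_emitted, emitted_count = top, top_count
        yield batch
```

For example, `list(sorted_passes(lambda: iter([5, 3, 5, 1, 3]), 2))` makes one pass per batch plus a final empty pass, and concatenating the batches gives the data in sorted order, duplicates included.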
If I was asked this in an interview, I'd know the answer, but I'd also see being asked the question as a sign that the company's hiring process is suboptimal. This would make me less inclined to be hired there unless there was some purpose I could see to why they hired this way. (I know why FAANGs do. But at most companies I'd call it a red flag.)

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input. Specifically, it will receive a stream of integers of up to nine decimal digits and has to write each of them to a log file. Input data is totally random, and one of the requirements is that the application should not write duplicate items to the log file and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to find a proper solution to keep track of the duplicate entries, as checking the whole log (text) file every time it writes is not a suitable solution for sure. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the volume of input data can be really high, I don't think that is the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a large bitmap in memory with one bit per possible value. For integers of up to nine decimal digits that is 10^9 bits, about 125 MB (roughly 119 MiB). Because this data structure is far larger than the CPU caches, memory accesses to it may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using atomic bitwise operations.
To check whether a value has been seen before, you can check the value of its bit in the bitmap and then set it (atomically if done in parallel).
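The check-then-set step can be sketched as follows. This is a single-threaded illustration only: the `Bitmap` class and its `test_and_set` method are hypothetical names, the size is a parameter rather than a fixed figure, and a plain `bytearray` gives no atomicity, so a parallel version would need real atomic operations in a lower-level language.

```python
class Bitmap:
    """One bit per possible value, packed eight to a byte."""

    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)  # all bits start at 0

    def test_and_set(self, value):
        """Return True if value was already marked, then mark it."""
        index, mask = value >> 3, 1 << (value & 7)
        already_seen = bool(self.bits[index] & mask)
        self.bits[index] |= mask
        return already_seen


seen = Bitmap(1000)             # small size, for illustration only
first = seen.test_and_set(42)   # False: 42 had not been seen
second = seen.test_and_set(42)  # True: 42 is now a duplicate
```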
If you know that your stream contains fewer than one million integers, or the stream of random integers is not uniformly distributed, you can use a hash-set data structure, as it stores the data in a more compact way (in sequential code).
Bloom filters could help you speed up the filtering when the number of values in the stream is quite large and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
Here is an example using hash-sets in Python:
seen = set()  # Set of distinct values seen so far
for value in inputStream:  # Iterate over the stream values
    if value not in seen:  # O(1) average-case lookup
        log.write(value)  # First occurrence: write it to the log
        seen.add(value)  # O(1) average-case insertion
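For the Bloom filter idea mentioned above, a minimal sketch might look like this. Everything in it is an assumption made for illustration: the `BloomFilter` class, its parameters, and the choice of deriving the bit positions from a SHA-256 digest by double hashing. A negative answer is definite (the value was never added), while a positive answer only means "possibly seen" and would have to be confirmed by an exact structure such as the hash-set.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, value):
        # True only if every bit for this value is set.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(value))
```

In a deduplication loop, a value for which `might_contain` returns False can be logged immediately, and only the "possibly seen" values need the slower exact check.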

What is the name of this kind of cache/ data structure?

I need a fixed-size cache of objects that keeps track of how many times each object was requested. When it is full and a new object is added, the object with the lowest usage score gets removed.
So this is different from an LRU cache of size N in that if some object is heavily requested, then even adding N new objects won't push it out of the cache.
Some kind of mix of a cache and a priority queue. Is there a name for that?
Thanks!
Without a time element, this kind of cache clogs up with things that were used a lot in the past, but aren't used currently. Replacement becomes impossible, because everything in the cache has been used more than once, so you won't evict anything in favor of a new item.
You could write some code that degrades the value of the count over time (i.e. take into account the time since last used), but doing so is just a really complicated way of simulating an LRU cache. I experimented with it at one point, but found that it didn't perform any better than the simple LRU cache. At least not in my application.
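For comparison, the simple LRU cache that the answer refers to can be sketched in a few lines with `collections.OrderedDict`. The class name `LRUCache` and its `get`/`put` methods are hypothetical, and this is one common way to write it rather than the answerer's actual code.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self.items:
            return default
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used
```

Here recency alone decides eviction, so an entry that was popular in the past but is no longer requested eventually leaves the cache.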

VSAM Search VS COBOL search/loop

I have a file that could contain about 3 million records. Certain records of this file will need to be updated multiple times throughout the run of the program. If I need to pull specific records from this file, which of the following is more efficient:
Indexed VSAM search
Indexed flat file with a COBOL search all
Buffering all of the data into working storage and writing a loop to handle the search
Obviously, if you can buffer all of the data into memory (and if the host system can support a working set of pages big enough to allow all of it to actually remain in RAM without paging), then this would probably be the fastest possible approach.
But, be very careful to consider "hidden disk-I/O" caused by the virtual-memory paging subsystem! If the requested "in-memory" data is, in fact, not "in memory," a page-fault will occur and your process will stop in its tracks until the page has been retrieved. (And if "page stealing" occurs, well, you're in trouble.) Your "in-memory" strategy just turned into a possibly very inefficient(!) disk-based one. If keys are distributed randomly, then your process has a gigantic working set that it is accessing randomly. If all of that memory is not actually in memory, and will not stay there, you're in trouble.
If you are making updates to a large file, consider sorting the updates-delta file before processing it, so that all occurrences of the same key will be adjacent. You can now write your COBOL program to take advantage of this (and, of course, to abend if an out-of-sequence record is ever detected!). If the key in "this" record is identical to the key of the "previous" one, then you do not need to re-read the record. (And, you do not actually need to write the old record, until the key does change.) As the indexed-file access method is presented with the succession of keys, each key is likely to be "close to" the one previously-requested, such that some of the necessary index-tree pages will already be in-memory. Obviously, you will need to benchmark this, but the amount of time spent sorting the file can be far less than the amount of time spent in index-lookups. (Which actually can be considerable.)
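The sort-then-apply idea is language independent, so here is a small Python sketch of it (not COBOL). The names `apply_sorted_updates`, `read_record`, and `rewrite_record` are hypothetical stand-ins for the indexed read and REWRITE, and records are assumed to be dictionaries. Sorting the updates by key makes all changes to one record adjacent, so each record is read once and written once.

```python
from itertools import groupby
from operator import itemgetter


def apply_sorted_updates(updates, read_record, rewrite_record):
    """Apply (key, changes) pairs with one read and one write per key."""
    by_key = itemgetter(0)
    # sorted() is stable, so updates to the same key keep their original order.
    for key, group in groupby(sorted(updates, key=by_key), key=by_key):
        record = read_record(key)        # read only when the key changes
        for _, changes in group:
            record.update(changes)       # apply every update for this key
        rewrite_record(key, record)      # write once, after the last update
```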
Mike's answer raises the important issue of "hidden I/O" (how much it matters depends on the machine, configuration, and amount of data)...
If you very likely need to update many records, the option Mike suggests is the most useful one.
If you very likely need to update only a few records (I'd guess you're likely below 2%), another approach can be considerably faster (needs a benchmark!):
read every key via indexed VSAM search
store the changed record in memory (a big OCCURS table); if you will only change some values and the record is quite big, then store only the values that can change plus the key in the table, without an actual REWRITE
before doing a VSAM search: look in your OCCURS table to see whether you have already read the key, and take the values either from there or from a new read
...
at program end: go through your OCCURS table and REWRITE all records (if you have the complete record a REWRITE is enough, otherwise you'd need a READ first to get the complete record)
Performance is often: "know your data and possible program flow, then try the best 2-3 approach, benchmark and decide".
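The in-memory table approach from the list above amounts to a small write-back cache. A rough Python sketch of the idea, assuming complete records are kept in memory as dictionaries, could look like this; `WriteBackCache`, `read_record`, and `rewrite_record` are hypothetical names standing in for the OCCURS table, the indexed VSAM read, and the REWRITE.

```python
class WriteBackCache:
    """Keep touched records in memory and write them back once at the end."""

    def __init__(self, read_record, rewrite_record):
        self.read_record = read_record        # stand-in for the indexed read
        self.rewrite_record = rewrite_record  # stand-in for the REWRITE
        self.table = {}                       # key -> record held in memory
        self.dirty = set()                    # keys changed since loading

    def get(self, key):
        if key not in self.table:             # read the file only on a miss
            self.table[key] = self.read_record(key)
        return self.table[key]

    def update(self, key, **changes):
        self.get(key).update(changes)         # change the in-memory copy only
        self.dirty.add(key)

    def flush(self):
        for key in sorted(self.dirty):        # write each changed record once
            self.rewrite_record(key, self.table[key])
        self.dirty.clear()
```

Calling `flush()` at program end performs one write per changed record, no matter how many times that record was updated during the run.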

What is the best buffer management drop policy?

I am working on a project that contains a fixed-size FIFO (first in, first out) buffer, where clients send their requests to that buffer and the system handles them. When the buffer is full, I have to apply one of the following overload policies (drop policies):
DRPH: drop one request from the head of the buffer.
DRPT: drop one request from the tail of the buffer.
DRPR: drop 25% of the elements in the buffer randomly.
BLCK: block new connections until space is available in the buffer.
I ran a simulation to measure the performance using httperf, sending many requests per second and measuring the response time, but I got unstable response times, especially when the number of requests is large. So I cannot determine the best drop policy by simulation: I repeated it many times, and each time I got different values.
The question is:
Theoretically, what is the best buffer management drop policy among the mentioned policies?
It definitely depends on your data and in which order it is needed. But usually, with a FIFO, the data at the end of the buffer is the oldest and so the one with the least likelihood to be required again. So DRPR is probably the best solution. But only if you can afford losing data (e.g. because it can be re-inserted later). If that is not the case you have to block connections until buffer space is available again.
Another thing: I would strive for a dynamic buffer. Start with a reasonable default size and see how quickly it fills up. Above a certain rate increase the buffer size (and below a certain threshold you can lower it again) up to a certain maximum.
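To compare the four policies from the question in a repeatable way, they could be simulated on a plain `collections.deque`. The sketch below makes several assumptions that are not in the original: the function name `enqueue`, the convention that the head (left end) holds the oldest request and the tail (right end) the newest, that DRPR drops at least one element, and that BLCK is modelled by returning False so the caller can retry later. Passing a seeded `random.Random` makes the DRPR runs reproducible.

```python
import random
from collections import deque


def enqueue(buffer, capacity, request, policy, rng=random):
    """Try to add request to a bounded FIFO; return True if it was accepted."""
    if len(buffer) >= capacity:
        if policy == "DRPH":
            buffer.popleft()               # drop the oldest queued request
        elif policy == "DRPT":
            buffer.pop()                   # drop the newest queued request
        elif policy == "DRPR":
            count = max(1, len(buffer) // 4)
            doomed = set(rng.sample(range(len(buffer)), count))
            survivors = [r for i, r in enumerate(buffer) if i not in doomed]
            buffer.clear()
            buffer.extend(survivors)       # keep the FIFO order of the rest
        elif policy == "BLCK":
            return False                   # caller must wait and retry
        else:
            raise ValueError(f"unknown policy: {policy}")
    buffer.append(request)
    return True
```

For example, with `buffer = deque([1, 2, 3, 4])` and a capacity of 4, `enqueue(buffer, 4, 5, "DRPH")` leaves the buffer holding 2, 3, 4, 5.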

Resources