Is there an efficient way to implement the CLOCK cache algorithm?

The CLOCK cache algorithm is quite easy to implement with a reference array recording whether an item has been referenced or not, and this is an example. However, it seems that the time complexity of this solution is O(N) (N is the capacity of the cache), since we need to iterate the reference array to find a victim whose reference bit is 0. Is there any more efficient solution?
Another question is how to support the del operation efficiently in the CLOCK algorithm. There are many resources on how to do get and set operations, but nothing about the del operation. If some items are deleted, e.g. explicitly deleted or expired via a TTL, there will be holes in the reference array.
In this case, when we need to find a victim, the following solution should work:
Each element in the reference array has one of 3 states: hole, referenced, not referenced.
If the cache is not full, iterate the reference array to find a hole and use that hole as the victim. No item is evicted. In this case, we don't move the clock hand and leave the state of the other elements untouched.
If the cache is full, iterate the array while moving the clock hand to find a victim, changing the state of each element we pass from referenced to not referenced.
Again, this also has O(N) complexity. Is there any way to optimize it?
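For concreteness, here is a minimal C++ sketch of the three-state scheme described above (names and structure are illustrative; the victim search is still O(N), which is exactly the cost in question):

#include <cstddef>
#include <vector>

enum class Slot { Hole, Referenced, NotReferenced };

struct ClockState {
    std::vector<Slot> slots;   // one entry per cache slot
    std::size_t hand = 0;      // clock hand
    std::size_t used = 0;      // number of non-hole slots, maintained by insert/delete (not shown)

    // Returns the index of the slot to (re)use; O(N) in the worst case.
    std::size_t findVictim() {
        if (used < slots.size()) {                      // not full: reuse a hole, hand untouched
            for (std::size_t i = 0; i < slots.size(); ++i)
                if (slots[i] == Slot::Hole) return i;
        }
        for (;;) {                                      // full: classic CLOCK sweep
            if (slots[hand] == Slot::NotReferenced) {
                std::size_t victim = hand;
                hand = (hand + 1) % slots.size();
                return victim;
            }
            slots[hand] = Slot::NotReferenced;          // referenced slots get a second chance
            hand = (hand + 1) % slots.size();
        }
    }
};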

Related

Best statically allocated data structure for writing and extending contiguous blocks of data?

Here's what I want to do:
I have an arbitrary number of values of different kinds: string, int, float, bool, etc. that I need to store somehow. Multiple elements are often written and read as a whole, forming "contiguous blocks" that can also be extended and shortened at the user's wish, and even elements in the middle might be taken out. Also, the whole thing should be statically allocated.
I was thinking about using some kind of statically allocated forward list. The way I imagine this working is to define an array of structs, each containing one std::variant field and a "previous head" field which always points to the location of the previous head of the list. A new element is always placed at the globally known "head", which it stores inside its "previous head" field. This way I can keep track of holes inside my list, because once an element is taken out, its location is written to the global head and will be filled by subsequent inserts.
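A rough sketch of that layout as I read it (C++; the capacity, the variant's alternatives and all names are placeholders chosen for illustration):

#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>

using Value = std::variant<std::monostate, int, float, bool, std::string>;

struct Slot {
    Value value;
    std::int32_t previous_head = -1;   // index of the head at the time this slot was filled; -1 = none
};

constexpr std::size_t kCapacity = 256; // statically allocated pool of slots
std::array<Slot, kCapacity> pool;
std::int32_t head = -1;                // globally known head: where the next insert goes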
This approach however has downsides: When a "contiguous block" is extended, there might be the case that further elements of other blocks have already queued up in the list past its last element. So I either need to move all subsequent entries or copy over the last element in the previous list and insert a link object that allows me to jump to the new location when traversing the contiguous block.
The priorities for optimizing this data structure are as follows (by number of use cases):
Initially write contiguous blocks
read the whole data structure
add new elements to contiguous blocks
remove elements from contiguous blocks
At the moment my data structure will have time complexity of O(1) for writes, O(n) for contiguous reads (with the caveat that in the worst case there is a jump to the next location inside the array every other element), O(1) for adding new elements and O(1) for removing elements. However, space complexity is O(2n) in the worst case (when I have to do a jump every second element, the slot that would store data is lost to the "link").
What I'm wondering now is: is the described way the best viable way to accomplish what I'm trying to do, or is there a better data structure? Is there an official name for this data structure?

How to save a matrix in C++ in a non-linear way

I have to program an optimized multi-threaded implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix; the Wikipedia page on Levenshtein distance covers that well enough.
Now, I can compute diagonal elements concurrently. That is all alright.
My problem now comes with caches. Matrices in C++ are normally saved in memory row by row, correct? Well, that is not good for me, as I need 2 elements of the previous row and 1 element of the current row to compute my result, and that is horrible cache-wise. The cache will hold the current row (or part of it), then I ask for the previous one, which it will probably not hold anymore.
Then for another one, I need a different part of the diagonal, so yet again, I ask for completely different rows and the cache will not have those ready for me.
Therefore, I would like to save my matrix to memory in blocks or maybe diagonals. That will result in fewer cache misses and make my implementation faster again.
How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell C++ how to order that type in memory?
EDIT: As some of you seem confused about the nature of my question: I want to save a matrix (it does not matter if I make it a 2D array or do it some other way) in a custom way into MEMORY. Normally, a 2D array is saved row after row; I need to work with diagonals, therefore caches will miss a lot on the huge matrices I will work with (possibly millions of rows and columns).
I believe you may have a misconception about the (CPU) cache.
It's true that CPU caching is linear - that is, if you access an address in memory, it will bring into the cache some previous and some successive memory locations - which is like "guessing" that subsequent accesses will involve 1-dimensional-close elements. However, this is true on the micro-level. A CPU's cache is made up of a large number of small "lines" (64 Bytes on all cache levels in recent Intel CPUs). The locality is limited to the line; different cache lines can come from completely different places in memory.
Thus, if you "need two elements of the previous row and one element of the current row" of your matrix, then the cache should work very well for you: Some of the cache will hold elements of the previous row, and some will hold elements of the current row. And when you advance to the next element, the cache overall will usually contain the matrix elements you need to access. Just make sure your order of iteration agrees with the order of progression within the cache line.
Also, in some cases you could be faced with a situation where different threads are thrashing the same cache lines due to the mapping from main memory into the cache. Without getting into details, that is something you need to think about (but again, has nothing to do with 2D vs 1D data).
Edit: As geza notes, if your matrix's rows are long, you will still be reading each memory location twice with the straightforward approach: once as the current row, then again as the previous row, since each value will be evicted from the cache before it's used as a previous-row value. If you want to avoid this, you can iterate over tiles of your matrix, whose size (length x width x sizeof(element)) fits into the L1 cache (along with whatever else needs to be there). You can also consider storing your data in tiles, but I don't think that would be too useful.
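For illustration, tiled iteration over the full DP matrix might look like the following sketch (names and the tile size are illustrative, and D is assumed to be a flat buffer of size (m+1) x (n+1)). Row-major order of tiles, and of cells inside each tile, keeps every dependency already computed:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

void fillTiled(std::vector<int>& D, const std::string& a, const std::string& b) {
    const std::size_t m = a.size(), n = b.size(), cols = n + 1;
    const std::size_t TILE = 64;   // illustrative; pick it so the working set of a tile fits in L1
    auto at = [&](std::size_t i, std::size_t j) -> int& { return D[i * cols + j]; };
    for (std::size_t j = 0; j <= n; ++j) at(0, j) = static_cast<int>(j);   // first row
    for (std::size_t i = 0; i <= m; ++i) at(i, 0) = static_cast<int>(i);   // first column
    for (std::size_t bi = 1; bi <= m; bi += TILE)
        for (std::size_t bj = 1; bj <= n; bj += TILE)
            for (std::size_t i = bi; i < std::min(bi + TILE, m + 1); ++i)
                for (std::size_t j = bj; j < std::min(bj + TILE, n + 1); ++j)
                    at(i, j) = std::min({ at(i - 1, j) + 1,                         // deletion
                                          at(i, j - 1) + 1,                         // insertion
                                          at(i - 1, j - 1)
                                              + (a[i - 1] != b[j - 1] ? 1 : 0) });  // substitution
}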
Preliminary comment: "Levenshtein distance" is edit distance (under the common definition). This is a very common problem; you probably don't even need to bother writing a solution yourself. Look for existing code.
Now, finally, for a proper answer... You don't actually need to have a matrix at all, and you certainly don't need to "save" it: it's enough to keep merely a "front" of your dynamic programming matrix rather than the whole thing.
But what "front" shall you choose, and how do you advance it? I suggest you use anti-diagonals as your front, and given each anti-diagonal, compute concurrently the next anti-diagonal. Thus it'll be {(0,0)}, then {(0,1),(1,0)}, then {(0,2),(1,1),(2,0)} and so on. Each anti-diagonal requires at most two earlier anti-diagonals - and if we keep the values of each anti-diagonal consecutively in memory, then the access pattern going up the next anti-diagonal is a linear progression along the previous anti-diagonals - which is great for the cache (see my other answer).
So, you'll "concurrentize" the computation: give each thread a bunch of consecutive anti-diagonal elements to compute; that should do the trick. And at any time you will only keep 3 anti-diagonals in memory: the one you're working on and the two previous ones. You can cycle between three such buffers so you don't re-allocate memory all the time (but then make sure to pre-allocate buffers with the maximum anti-diagonal length).
This whole thing should work basically the same for the non-square case.
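Here is a single-threaded sketch of that scheme, to make the indexing concrete (names are illustrative; the inner loop over i is the part you would split among threads, and for brevity it swaps buffers instead of cycling three pre-allocated ones):

#include <algorithm>
#include <string>
#include <vector>

// Anti-diagonal d holds D[i][j] for all i + j == d; within a diagonal the
// values are stored consecutively, indexed by i - max(0, d - n).
int levenshtein(const std::string& a, const std::string& b) {
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    std::vector<int> prev2, prev1, cur;              // anti-diagonals d-2, d-1 and d
    for (int d = 0; d <= m + n; ++d) {
        const int lo = std::max(0, d - n), hi = std::min(m, d);
        cur.assign(hi - lo + 1, 0);
        const int lo1 = std::max(0, d - 1 - n);      // offset of anti-diagonal d-1
        const int lo2 = std::max(0, d - 2 - n);      // offset of anti-diagonal d-2
        for (int i = lo; i <= hi; ++i) {             // this loop can be split across threads
            const int j = d - i;
            if (i == 0) { cur[i - lo] = j; continue; }   // first row:    D[0][j] = j
            if (j == 0) { cur[i - lo] = i; continue; }   // first column: D[i][0] = i
            const int del = prev1[(i - 1) - lo1] + 1;    // D[i-1][j]   + 1
            const int ins = prev1[i - lo1] + 1;          // D[i][j-1]   + 1
            const int sub = prev2[(i - 1) - lo2]         // D[i-1][j-1] + substitution cost
                            + (a[i - 1] != b[j - 1] ? 1 : 0);
            cur[i - lo] = std::min({del, ins, sub});
        }
        prev2.swap(prev1);                           // rotate buffers for the next diagonal
        prev1.swap(cur);
    }
    return prev1[0];                                 // the last anti-diagonal holds only D[m][n]
}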
I'm not absolutely sure, but I think a matrix is stored as one long array, one row after the other, and is mapped onto 2D coordinates with pointer arithmetic, so you always refer to the same base address and compute the offset in memory where your value is located.
Otherwise you can implement such a type yourself fairly easily and provide an operator taking (row, column) indices for your matrix.
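For illustration, a small wrapper along these lines could store its elements anti-diagonal by anti-diagonal in one flat buffer while client code keeps addressing it by (row, column); all names here are placeholders:

#include <algorithm>
#include <cstddef>
#include <vector>

class DiagonalMatrix {
public:
    DiagonalMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), start_(rows + cols, 0), data_(rows * cols) {
        // start_[d] = flat offset of the first element of anti-diagonal d
        for (std::size_t d = 1; d < rows + cols; ++d)
            start_[d] = start_[d - 1] + diagLength(d - 1);
    }
    int& operator()(std::size_t i, std::size_t j) {
        const std::size_t d = i + j;
        const std::size_t firstRow = d < cols_ ? 0 : d - cols_ + 1;  // first row present on diagonal d
        return data_[start_[d] + (i - firstRow)];
    }
private:
    std::size_t diagLength(std::size_t d) const {
        const std::size_t lo = d < cols_ ? 0 : d - cols_ + 1;
        const std::size_t hi = std::min(rows_ - 1, d);
        return hi - lo + 1;
    }
    std::size_t rows_, cols_;
    std::vector<std::size_t> start_;   // per-diagonal start offsets
    std::vector<int> data_;            // elements, diagonal by diagonal
};

Swapping in a different layout (row-major, tiled, Morton order) then only means changing the index mapping inside operator().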

Why Use A Doubly Linked List and HashMap for an LRU Cache Instead of a Deque?

I have implemented the Design LRU Cache problem on LeetCode using the conventional method (doubly linked list + hash map). For those unfamiliar with the problem, the implementation pairs a doubly linked list, ordered from most to least recently used, with a hash map from key to list node.
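A minimal sketch of that conventional design in C++, for reference (std::list as the doubly linked list, std::unordered_map as the index; illustrative only, not the original submission, and it assumes capacity >= 1):

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

class LRUCache {
public:
    explicit LRUCache(std::size_t capacity) : capacity_(capacity) {}

    int get(int key) {
        auto it = index_.find(key);
        if (it == index_.end()) return -1;
        order_.splice(order_.begin(), order_, it->second);   // move node to the front, O(1)
        return it->second->second;
    }

    void put(int key, int value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = value;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (index_.size() == capacity_) {                     // evict the least recently used
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, value);
        index_[key] = order_.begin();
    }

private:
    std::size_t capacity_;
    std::list<std::pair<int, int>> order_;   // front = most recently used
    std::unordered_map<int, std::list<std::pair<int, int>>::iterator> index_;
};

The splice call is exactly the "move an element from the middle to one end in O(1)" operation discussed in the answers below, which an array-backed deque does not offer.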
I understand why this method is used (quick removal/insertion at both ends, fast access in the middle). What I am failing to understand is why someone would use both a HashMap and a LinkedList when one could simply use an array-based deque (in Java ArrayDeque, in C++ simply deque). This deque allows for easy insertion/deletion at both ends and quick access in the middle, which is exactly what you need for an LRU cache. You would also use less space because you wouldn't need to store a pointer to each node.
Is there a reason why the LRU cache is almost universally designed (in most tutorials at least) using the HashMap/LinkedList method as opposed to the Deque/ArrayDeque method? Would the HashMap/LinkedList method have any benefits?
When an LRU cache is full, we discard the Least Recently Used item.
If we're discarding items from the front of the queue, then, we have to make sure the item at the front is the one that hasn't been used for the longest time.
We ensure this by making sure that an item goes to the back of the queue whenever it is used. The item at the front is then the one that hasn't been moved to the back for the longest time.
To do this, we need to maintain the queue on every put OR get operation:
When we put a new item in the cache, it becomes the most recently used item, so we put it at the back of the queue.
When we get an item that is already in the cache, it becomes the most recently used item, so we move it from its current position to the back of the queue.
Moving items from the middle to the end is not a deque operation and is not supported by the ArrayDeque interface. It's also not supported efficiently by the underlying data structure that ArrayDeque uses. Doubly-linked lists are used because they do support this operation efficiently.
The purpose of an LRU cache is to support two operations in O(1) time: get(key) and put(key, value), with the additional constraint that least recently used keys are discarded first. Normally the keys are the parameters of a function call and the value is the cached output of that call.
Regardless of how you approach this problem we can agree that you MUST use a hashmap. You need a hashmap to map a key already present in the cache to the value in O(1).
In order to deal with the additional constraint of least recently used keys being discarded first you can use a LinkedList or ArrayDeque. However since we don't actually need to access the middle, a LinkedList is better since you don't need to resize.
Edit:
Mr. Timmermans discussed in his answer why ArrayDeques cannot be used in an LRU cache due to the necessity of moving elements from the middle to the end. With that being said, here is an implementation of an LRU cache that successfully submits on LeetCode using only appends and poplefts on the deque. Note that Python's collections.deque is implemented as a doubly linked list; however, we are only using operations on collections.deque that are also O(1) in a circular array, so the algorithm stays the same regardless.
from collections import deque


class LRUCache:
    def __init__(self, capacity: 'int'):
        self.capacity = capacity
        self.hashmap = {}      # key -> [value, number of copies of key still in the deque]
        self.deque = deque()

    def get(self, key: 'int') -> 'int':
        res = self.hashmap.get(key, [-1, 0])[0]
        if res != -1:
            self.put(key, res)  # re-adding the key marks it as recently used
        return res

    def put(self, key: 'int', value: 'int') -> 'None':
        self.add(key, value)
        while len(self.hashmap) > self.capacity:
            self.remove()

    def add(self, key, value):
        if key in self.hashmap:
            self.hashmap[key][1] += 1
            self.hashmap[key][0] = value
        else:
            self.hashmap[key] = [value, 1]
        self.deque.append(key)  # every add appends the key, duplicates are counted

    def remove(self):
        k = self.deque.popleft()
        self.hashmap[k][1] -= 1
        if self.hashmap[k][1] == 0:
            del self.hashmap[k]
I do agree with Mr. Timmermans that using the LinkedList approach is preferable - but I want to highlight that using an ArrayDeque to build an LRU cache is possible.
The main mixup between myself and Mr. Timmermans is how we interpreted capacity. I took capacity to mean caching the last N get / put requests, while Mr. Timmermans took it to mean caching the last N unique items.
The above code does have a loop in put which slows the code down - but this is just to get the code to conform to caching the last N unique items. If we had the code cache the last N requests instead, we could replace the loop with:
if len(self.deque) > self.capacity: self.remove()
This will make it as fast as, if not faster than, the linked-list variant.
Regardless of what maxsize is interpreted as, the above method still works as an LRU cache - least recently used elements get discarded first.
I just want to highlight that designing an LRU cache in this manner is possible. The source is right there - try submitting it on LeetCode!
A doubly linked list is the classic implementation of a queue. Because doubly linked lists have immediate access to both the front and the end of the list, they can insert data on either side in O(1) as well as delete data on either side in O(1). Because doubly linked lists can insert data at the end in O(1) time and delete data from the front in O(1) time, they make the perfect underlying data structure for a queue. Queues are lists of items in which data can only be inserted at the end and removed from the beginning.
Queues are an example of an abstract data type, and we are able to use an array to implement them under the hood. Now, since queues insert at the end and delete from the beginning, arrays are only so good as the underlying data structure. While arrays are O(1) for insertions at the end, they're O(N) for deleting from the beginning. A doubly linked list, on the other hand, is O(1) for both inserting at the end and for deleting from the beginning. That's what makes it a perfect fit for serving as the queue's underlying data structure.
Python's deque uses a doubly linked list as part of its data structure, which is what lets a deque insert or delete elements at both ends of a queue with constant O(1) performance.

Assembly sorting

I am working on a Task Schedule Simulator which needs to be programmed in Assembly language.
I've been struggling with task sorting:
I am allocating new memory for each task (the user can insert a task, and by using sbrk I allocate 20 bytes that contain a word for the task's numeric ID, another word for its priority expressed as an int, and another word for the number of cycles needed to finish the task), and I'm storing the address of each new task on the stack.
My problem is: I need to sort these tasks, and the sorting can be based either on priority or on the number of cycles. When I pop these tasks I can easily access the right field (since the structure is very rigid, I just need to use the right offset in the lw instruction and voilà), but then comparing and sorting gets complicated.
I am working on the pseudocode for this part of the program and can't find any way to untie the knot.
Let me first try and paraphrase what you have indicated as the problem.
You have a stack that holds "records" of the structure
{ word : id, word : priority, word : cycle_count, dword : address}
Since the end objective is to "pop" these in the desired order, we have to execute an in-place sort. There are many choices of algorithms, but to keep matters simple (also taking a cue from an underlying assumption that the count of tasks is not that large), I am explaining using bubble sort. There exists a vast cornucopia of literature comparing each probable sort algorithm down to the finest details, and if relevant, you may consider Wikipedia as the perfect starting point.
Step 1: Make the data pointer = stack pointer + count_of_records * 20 - effectively, for the next few steps, the data pointer points to the top of the "table of records", which happens to be located on the stack. An advanced consideration, though not required in MIPS, is to assert DS=SS.
Step 2: Next, identify which record pair needs to be swapped, and use the appropriate index within a record to identify the field that defines the swapping order.
Step 3: Allocate a 20-byte space as a temporary, and use that space to temporarily hold the swapped record. An advanced consideration here is whether the environment can take an interrupt while the swap is going on. MIPS does not seem to have an atomic lock, so this memory move needs to be done carefully.
Once the requisite number of passes are completed, the table will appear sorted, and will remain in place. The temporary buffer to store a record may be released.
The vital statistics for bubble sort are that it's O(n^2), it responds well to almost-sorted situations (not very likely in your example), and it handles well the fact that in the midst of sorting, some records may find the processor free to start running and will therefore have to be removed from the queue by a POP, forcing the sort to be restarted. This restart, however, will find the table almost sorted, so on a continuous basis the table will display fairly strong pre-sorted behavior. Most importantly, it has perhaps the smallest code footprint among all in-place algorithms.
Trust this helps
You might want to introduce a level of indirection, and sort pointers to your structs based on comparing the pointed-to data.
If your sort keys are all integers of the same size at different offsets within the structs, your sort function could take an offset as a parameter. e.g. lw from base + off to get the integer that you're going to compare.
Insertion sort is probably easiest to code, and much better than BubbleSort in the not-almost-sorted case. If you care about having a result ready to pop ASAP, before the whole array is sorted, then use Selection Sort.
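To make the offset-as-a-parameter idea concrete, here is a C++ sketch of an insertion sort over an array of record pointers, where the comparison key is a word read at a given byte offset into each record (offsets and names are illustrative; 4 would be the priority word and 8 the cycle count in the layout paraphrased above):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads the 32-bit key located key_offset bytes into a record,
// much like `lw $key, key_offset($record)` would in MIPS.
static std::uint32_t keyAt(const unsigned char* record, std::size_t key_offset) {
    std::uint32_t key;
    std::memcpy(&key, record + key_offset, sizeof key);
    return key;
}

// Insertion sort over an array of record pointers; key_offset selects the field to sort by.
void sortByField(unsigned char* records[], std::size_t count, std::size_t key_offset) {
    for (std::size_t i = 1; i < count; ++i) {
        unsigned char* current = records[i];
        const std::uint32_t key = keyAt(current, key_offset);
        std::size_t j = i;
        while (j > 0 && keyAt(records[j - 1], key_offset) > key) {
            records[j] = records[j - 1];   // shift pointers to larger records up
            --j;
        }
        records[j] = current;
    }
}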
It wasn't clear if your code is itself going to be multi-threaded, or if you can just write a normal sort function. #qasar66's answer seems to be suggesting a BubbleSort with atomic swaps, so other threads can safely look at the partially-sorted array while it's being sorted.
If you only ever need to pop the min element, one of the best data structures is a Heap. It takes more code to implement, so if ease of implementation is your top goal, use a simple sort. Heapifying an un-sorted array is cheaper than doing a full sort: the full O(n log n) cost of extracting all elements in order is amortized over the extracts. So it's great if you want to be able to change the sort key, since you don't have to do all the work of fully sorting.

Optimizing Inserting into the Middle of a List

I have algorithms that work with dynamically growing lists (contiguous memory, like a C++ vector, Java ArrayList or C# List). Until recently, these algorithms would insert new values into the middle of the lists. Of course, this was usually a very slow operation: every time an item was added, all the items after it needed to be shifted to a higher index. Do this a few times for each algorithm and things get really slow.
My realization was that I could add the new items to the end of the list and then rotate them into position later. That's one option!
Another option, when I know how many items I'm adding ahead of time, is to add that many items to the back, shift the existing items and then perform the algorithm in-place in the hole I've made for myself. The negative is that I have to add some default values to the end of the list and then just overwrite them.
I did a quick analysis of these options and concluded that the second option is more efficient. My reasoning was that the rotation with the first option would result in in-place swaps (requiring a temporary). My only concern with the second option is that I am creating a bunch of default values that just get thrown away. Most of the time, these default values will be null or a mem-filled value type.
However, I'd like someone else familiar with algorithms to tell me which approach would be faster. Or, perhaps there's an even more efficient solution I haven't considered.
Arrays aren't efficient for lots of insertions or deletions into anywhere other than the end of the array. Consider whether using a different data structure (such as one suggested in one of the other answers) may be more efficient. Without knowing the problem you're trying to solve, it's near-impossible to suggest a data structure (there's no one solution for all problems). That being said...
The second option is definitely the better option of the two. A somewhat better option (avoiding the default-value issue): simply copy 789 to the end and overwrite the middle 789 with 456. So the only intermediate step would be 0123789789.
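A sketch of that trick with a std::vector (names are illustrative; it assumes at least as many elements follow the insertion point as are being inserted, which holds in the 0123789 example above):

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Insert newItems at position pos without appending default-constructed
// placeholders: append copies of the existing tail instead, then overwrite.
void insertInMiddle(std::vector<int>& v, std::size_t pos, const std::vector<int>& newItems) {
    const std::size_t n = v.size(), k = newItems.size();
    assert(pos + k <= n);                                          // enough tail elements to copy
    v.reserve(n + k);
    for (std::size_t i = n - k; i < n; ++i) v.push_back(v[i]);     // 0123789 -> 0123789789
    for (std::size_t i = n - k; i-- > pos; ) v[i + k] = v[i];      // shift any remaining tail right
    std::copy(newItems.begin(), newItems.end(), v.begin() + pos);  // overwrite the hole: 0123456789
}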
Your default-value concern is, however, (generally) not a big issue:
In Java, for one, you cannot (to my knowledge) even allocate memory for an array that's not 0- or null-filled. C++ STL containers also enforce this, I believe (but not C++ itself).
The size of a pointer compared to any moderate-sized class is minimal, so assigning it a default value also takes minimal time. (In Java and C# everything is a reference; in C++ you can use pointers, where something like boost::shared_ptr or a pointer-vector is preferred over raw pointers. None of this applies to primitives, which are small to start with, so they're generally not a big issue either.)
I'd also suggest forcing a reallocation to a specified size before you start inserting to the end of the array (Java's ArrayList::ensureCapacity or C++'s vector::reserve). In case you didn't know - varying-length-array implementations tend to have an internal array that's bigger than what size() returns or what's accessible (in order to prevent constant reallocation of memory as you insert or delete values).
Also note that there are more efficient methods to copy parts of an array than doing it manually with for loops (e.g. Java's System.arraycopy).
You might want to consider changing your representation of the list from using a dynamic array to using some other structure. Here are two options that allow you to implement these operations efficiently:
An order statistic tree is a modified type of binary tree that supports insertions and selections anywhere in O(log n) time, as well as lookups in O(log n) time. This will increase your memory usage quite a bit because of the overhead for the pointers and extra bookkeeping, but should dramatically speed up insertions. However, it will slow down lookups a bit.
If you always know the insertion point in advance, you could consider switching to a linked list instead of an array, and just keep a pointer to the linked list cell where insertions will occur. However, this slows down random access to O(n), which could possibly be an issue in your setup.
Alternatively, if you always know where insertions will happen, you could consider representing your array as two stacks - one stack holding the contents of the array to the left of the insertion point and one holding the (reverse of the) elements to the right of the insertion point. This makes insertions fast, and with the right type of stack implementation it can keep random access fast.
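A minimal sketch of that two-stack representation (names are illustrative; the idea is closely related to a gap buffer):

#include <cstddef>
#include <vector>

// `left` holds the elements before the insertion point; `right` holds the
// elements after it, stored in reverse (its back() is the element just after
// the gap). Inserting at the gap and moving the gap by one are both O(1).
struct GapList {
    std::vector<int> left, right;

    void insert(int value) { left.push_back(value); }    // insert at the gap

    void moveGapRight() {                                 // gap moves one element to the right
        left.push_back(right.back());
        right.pop_back();
    }
    void moveGapLeft() {                                  // gap moves one element to the left
        right.push_back(left.back());
        left.pop_back();
    }
    int& at(std::size_t i) {                              // random access stays O(1)
        return i < left.size() ? left[i]
                               : right[right.size() - 1 - (i - left.size())];
    }
    std::size_t size() const { return left.size() + right.size(); }
};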
Hope this helps!
HashMaps and linked lists were designed for the problem you are having. Given an indexed data structure with numbered items, the difficulty of inserting items in the middle requires a renumbering of every item in the list.
You need a data structure which is optimized to make inserts a constant O(1) complexity. HashMaps were designed to make insert and delete operations lightning quick regardless of dataset size.
I can't pretend to do the HashMap subject justice by describing it. Here is a good intro: http://en.wikipedia.org/wiki/Hash_table
