Pseudo Least Recently Used Binary Tree - caching

The idea behind Pseudo-LRU is to use fewer bits and to speed up block replacement. The rule is given as "let 1 represent that the left side has been referenced more recently than the right side, and 0 vice-versa".
But I am unable to understand the implementation given in the following diagram:
Details are given at : http://courses.cse.tamu.edu/ejkim/614/CSCE614-2011c-HW4-tutorial.pptx

I'm also studying Pseudo-LRU.
Here is my understanding. I hope it's helpful.
"Hit CL1": there is a reference to CL1, and it hits.
The LRU state bits (B0 and B1) are changed to record that CL1 was recently referenced.
"Hit CL0": there is a reference to CL0, and it hits.
The LRU state bit (B1) is updated to record that CL0 was used more recently than CL1.
"Miss; CL2 replace":
There is a miss, and the LRU logic is asked for a replacement index.
Given the current state, CL2 is chosen.
The LRU state bits (B0 and B2) are updated to record that CL2 was recently used.
(This also means the next replacement will be CL1.)
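The steps above can be reproduced with a small sketch. It assumes a 4-way set where B0 is the root bit, B1 covers CL0/CL1 and B2 covers CL2/CL3, with 1 meaning "the left side was referenced more recently", as in the question. The class and method names are made up for illustration, and the all-zero starting state is an assumption, not something taken from the slides.

```python
class TreePLRU4:
    """Hypothetical 4-way tree pseudo-LRU with three state bits.

    bits[0] = B0 (root), bits[1] = B1 (CL0 vs CL1), bits[2] = B2 (CL2 vs CL3).
    A bit value of 1 means the left side was referenced more recently.
    """

    def __init__(self):
        self.bits = [0, 0, 0]  # assumed initial state

    def touch(self, way):
        """Record a reference (hit or fill) to cache line `way` (0..3)."""
        in_left_half = way < 2
        self.bits[0] = 1 if in_left_half else 0
        leaf = 1 if in_left_half else 2
        # Even ways (CL0, CL2) are the left child of their leaf bit.
        self.bits[leaf] = 1 if way % 2 == 0 else 0

    def victim(self):
        """Follow the less recently used side at each level."""
        if self.bits[0] == 1:          # left half more recent, evict from right
            return 3 if self.bits[2] == 1 else 2
        return 1 if self.bits[1] == 1 else 0


plru = TreePLRU4()
plru.touch(1)            # "Hit CL1": B0 and B1 are written
plru.touch(0)            # "Hit CL0": B1 flips, B0 keeps its value
chosen = plru.victim()   # "Miss": ask for a replacement index
plru.touch(chosen)       # fill the chosen line: B0 and B2 are written
next_victim = plru.victim()
```

With this starting state the miss picks CL2 and the following replacement would be CL1, which matches the sequence described above.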

I know there is already an answer which clearly explains the picture, but I wanted to post my way of thinking about implementing a fast pseudo-LRU algorithm and its advantages over classic LRU.
From the memory point of view, if there are N objects (pointers, 32/64-bit values), you need N-1 flag bits and a HashMap that stores information about each object (a pointer to its actual address and its position in the array), so you can query whether an element already exists in the cache. It doesn't use less memory than classic LRU; in fact it uses N-1 auxiliary bits.
The optimization comes from CPU time. Comparing a few flags takes almost no time because they are single bits. In classic LRU you need some sort of structure that permits insertion/deletion and lets you find the LRU element quickly (maybe a heap). Such a structure takes O(log(N)) for a typical operation, and the comparison between values is also expensive. So you end up with O(log(N)^2) complexity per operation, instead of O(log(N)) for Pseudo-LRU.
Even though Pseudo-LRU doesn't always evict the true LRU object on a cache miss, in practice it seems to behave pretty well, so this is not a major drawback.
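To make the "N-1 flag bits, O(log(N)) per operation" point concrete, here is a minimal sketch of a tree pseudo-LRU for any power-of-two number of ways. It is an illustration under my own assumptions (implicit heap layout, names like `TreePLRU`, all-zero start), not the implementation from the slides, and it leaves out the HashMap lookup mentioned above.

```python
class TreePLRU:
    """Hypothetical N-way tree pseudo-LRU using N-1 bits (N a power of two).

    Nodes live in an implicit heap: node 1 is the root, node i has children
    2*i and 2*i + 1, and way w corresponds to leaf position w + N.
    A bit of 1 means the left subtree was referenced more recently.
    """

    def __init__(self, ways):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two >= 2")
        self.ways = ways
        self.bits = [0] * ways  # index 0 is unused, so N-1 bits carry state

    def touch(self, way):
        """Walk from the leaf to the root: log2(N) bit writes."""
        node = way + self.ways
        while node > 1:
            parent = node // 2
            self.bits[parent] = 1 if node % 2 == 0 else 0  # even = left child
            node = parent

    def victim(self):
        """Walk from the root away from the recently used side."""
        node = 1
        while node < self.ways:
            node = 2 * node + (1 if self.bits[node] == 1 else 0)
        return node - self.ways


cache = TreePLRU(8)
for way in range(8):
    cache.touch(way)
victim_after_sweep = cache.victim()
```

After touching ways 0 through 7 in order, the sketch picks way 0, which here happens to coincide with the true LRU. In general the choice is only an approximation.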

Related

Space complexity of reassigning an array

What is the space complexity of the following Java code?
public int[] foo(int[] x) {
    x = new int[x.length];
    // Do stuff with x that does not require additional memory
    return x;
}
Is it O(1) or O(N)? I've seen both answers, but I can't understand how it could be O(1). I would guess that it's O(N). We create a new array of the same size while the original array might still exist. Thus the original array is not replaced, i.e. we allocate additional storage that grows linearly with the length N of the input array. Am I correct?
The semantics of this piece of code are unclear, as the language is not specified. In any case, O(1) isn't possible because a new array is allocated while the original still exists. (In a garbage-collected language, one could imagine, with a lot of bad faith, that x is deallocated and then immediately reallocated at the same place.)
O(N) with details that are highly dependent on various external factors such as your operating system.
A naive implementation requires actually zeroing out memory and assigning it. If the space requested is small, and particularly if you've already freed a good place to put it, this is probably what is going to happen. That operation is O(N).
If the space requested is large, you're probably just going to set up page table entries and NOT allocate any space. This is again O(N), but with extremely good constants. As you use the memory it actually has to get assigned, which is no faster than doing it up front. (It is actually slower.) But, in the meantime, being slow to use up memory is good for reducing contention on RAM.
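As a rough way to see the O(N) extra space, here is a Python analogue of the method in the question. It is a sketch only: `foo` mirrors the Java code, `peak_bytes` is a helper name I made up, and the numbers reported by `tracemalloc` are specific to CPython, not to the JVM.

```python
import struct
import tracemalloc

POINTER_BYTES = struct.calcsize("P")  # size of one list slot on this platform


def foo(x):
    # Rebinding the local name does not free the caller's list,
    # so both lists are alive at the same time.
    x = [0] * len(x)
    return x


def peak_bytes(func, arg):
    """Peak memory allocated while func(arg) runs, as seen by tracemalloc."""
    tracemalloc.start()
    func(arg)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


small = peak_bytes(foo, [0] * 100_000)
large = peak_bytes(foo, [0] * 200_000)
print(small, large)
```

The second measurement should come out at roughly twice the first, because the new list needs one slot per input element.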

Is there difference between Cache index address calculation vs Division hash function?

Upon studying hash data structure and cache memory from computer architecture, I noticed that they're very similar.
The division hash function calculates the index as hash(k) = k mod M (M being the table size), but my DS book says M should be a prime number or at least an odd number. If M is even, the result is always even when k is even and odd when k is odd, so an even M should be avoided, since keys are often memory addresses, which are always even.
And yet my CA book says that for a direct-mapped cache you use (Block address) mod (Number of blocks in the cache), and the resulting indices look uniform. Why is this? It's all very confusing because MIPS uses 32-bit addresses, and word addresses are multiples of 4, which are always even. But I think it's because they throw out the last 2 bits, since they're byte offsets?
And, since it uses (Block address) Mod (Number of blocks in the cache), it makes the cache size power of 2 so that you can just use the lower x bits of the block address.
But this method looks exactly the same as division hash function, except you make the hash table power of 2, which is even (data structure book said use prime or odd) and use the lower bits of the block address.
Are these 2 different methods? If so, what's the cache one called? I would really appreciate a reply please. Thank you.
The reason for not using an even number for the hash table size is described here.
And how caches use addresses to calculate line numbers is described here. It's OK for caches to map more than one address to the same line. Just because an address maps to a cache line that holds data, we don't blindly use the data in that line. We also do a tag comparison to make sure that the content of the cache line is exactly what we are looking for.
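A small sketch of the index and tag split may help here. The block size, line count and the `DirectMappedCache` name are assumptions picked for illustration. The point is that with a power-of-two number of lines the modulo is just the low bits of the block address, and the remaining high bits become the tag that is compared on every access.

```python
BLOCK_BYTES = 16   # assumed block size (4 offset bits)
NUM_LINES = 256    # assumed number of lines (8 index bits)


class DirectMappedCache:
    """Hypothetical direct-mapped cache that tracks tags only, no data."""

    def __init__(self):
        self.tags = [None] * NUM_LINES

    @staticmethod
    def split(addr):
        block_addr = addr // BLOCK_BYTES   # drop the byte offset
        index = block_addr % NUM_LINES     # same as the low 8 bits
        tag = block_addr // NUM_LINES      # remaining high bits
        return tag, index

    def access(self, addr):
        """Return True on a hit; on a miss, install the new tag."""
        tag, index = self.split(addr)
        hit = self.tags[index] == tag      # the tag comparison
        self.tags[index] = tag
        return hit


cache = DirectMappedCache()
results = [cache.access(a) for a in (0x0000, 0x0004, 0x1000, 0x0000)]
```

Addresses 0x0000 and 0x1000 land on the same line, so the tag check is what stops the second one from being served the first one's data.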
The reason for taking the modulo by a prime is to get "mixing" of the bits, which is helpful if the integers that you're hashing have a poor structure. That isn't the only way to deal with it though. For example, the Java standard library doesn't do that: it uses a separate "mixing" function (that XORs the input with right-shifted versions of itself) and then uses a power-of-two sized table. Either way it's protection against badly distributed input, which isn't necessary in and of itself. If the input were always nicely distributed you wouldn't need it.
Memory addresses are usually fairly nicely distributed, because memory is typically used in sequential pieces. The obvious exception is that you will see highly aligned big objects, which would conflict with each other in the cache if nothing was done about it. Of course you will probably use a set-associative cache rather than a direct-mapped one, since it is far more robust against degradation, and that would take care of a lot of that. But nothing is ever immune to bad patterns (that also goes for hash-mod-prime, which you can easily defeat if you know the prime). A fairly simple improvement (which is also used in practice, or at least was; more advanced techniques exist now, combined with adaptive replacement strategies that mitigate bad access patterns) is to XOR some of the higher address bits into the index. This is hash-strengthening, the same technique used in the Java standard library, but a much simpler version of it.
Computing a remainder by a prime number (or really anything that isn't a power of two) is not something you'd want to do in this case, it's a slow computation by itself, and it leaves you with an awkwardly sized cache that doesn't fully use the power of its decoders, which adds to the slowness (or reduces cache size for a given latency, depending on how you look at it). The difference between that and XORing some of the high bits into the low bits is much bigger in hardware than it is in software, since XOR is really a trivial operation in hardware, much faster as a circuit operation than as an instruction.
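Here is a minimal sketch of the "XOR some higher address bits into the index" idea, compared with the plain low-bits index. The bit widths and function names are assumptions for illustration. Real caches choose which bits to fold in their own ways.

```python
OFFSET_BITS = 4                  # assumed 16-byte blocks
INDEX_BITS = 8                   # assumed 256 sets
INDEX_MASK = (1 << INDEX_BITS) - 1


def plain_index(addr):
    """Low bits of the block address, i.e. block address mod 256."""
    return (addr >> OFFSET_BITS) & INDEX_MASK


def xor_index(addr):
    """Fold the next-higher 8 bits into the index with a single XOR."""
    block = addr >> OFFSET_BITS
    return (block ^ (block >> INDEX_BITS)) & INDEX_MASK


# Sixteen objects aligned on 4 KiB boundaries: a "highly aligned" pattern.
aligned = [i * 4096 for i in range(16)]
plain_sets = {plain_index(a) for a in aligned}
xor_sets = {xor_index(a) for a in aligned}
```

With these assumed sizes, every aligned address collides on one set under the plain index, while the XOR-folded index spreads them over sixteen different sets.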

"cut and paste" the last k elements of std::vector efficiently?

Is it possible in C++11 to "cut and paste" the last k elements of a std::vector A to a new std::vector B in constant time?
One way would be to use B.insert(B.end(), A.end() - k, A.end()) and then use erase on A, but these are both O(k) time operations.
Mau
No, vectors own their memory.
This operation is known as splice. forward_list is ridiculously slow otherwise, but it does have an O(1) splice.
Typically, the process of deciding which elements to move is already O(n), so having the splice itself take O(n) time is not a problem. The other operations being faster on vector more than make up for it.
This isn't possible in general, since (at least in the C++03 version -- there it's 23.2.4/1) the C++ standard guarantees that the memory used by a vector<T> is a single contiguous block. Thus the only way to "transfer" more than a fixed number of elements in O(1) time would be if the receiving vector were empty, and you had somehow arranged to have its allocated block of memory begin at the right place inside the first vector -- in which case the "transfer" could be argued to have taken no time at all. (Deliberately overlapping objects in this way is almost certain to constitute Undefined Behaviour in theory -- and in practice, it's also very fragile, since any operation that invalidates iterators to a vector<T> can also reallocate memory, thus breaking things.)
If you're prepared to sacrifice a whole bunch of portability, I've heard it's possible to play OS-level (or hardware-level) tricks with virtual memory mapping to achieve tricks like no-overhead ring buffers. Maybe these tricks could also be applied here -- but bear in mind that the assumption that the mapping of virtual to physical memory within a single process is one-to-one is very deeply ingrained, so you could be in for some surprises.
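The same cost shows up in other languages with contiguous arrays. As a loose analogue only (Python lists rather than std::vector, with a helper name of my own), the portable "cut and paste" copies the k elements into the new container and then truncates the old one, so the work is proportional to k:

```python
def move_tail(a, k):
    """Move the last k elements of list `a` into a new list, in O(k)."""
    if not 0 <= k <= len(a):
        raise ValueError("k must be between 0 and len(a)")
    cut = len(a) - k
    b = a[cut:]   # copies k element references into new storage
    del a[cut:]   # truncates the original list
    return b


a = [1, 2, 3, 4, 5]
b = move_tail(a, 2)
```

After the call, `a` holds the first three elements and `b` the last two. Nothing is shared between the two lists' storage, which is why a constant-time version is not available for containers that own one contiguous block.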

Purpose of Xor Linked List?

I stumbled on a Wikipedia article about the Xor linked list, and it's an interesting bit of trivia, but the article seems to imply that it occasionally gets used in real code. I'm curious what it's good for, since I don't see why it makes sense even in severely memory/cache constrained systems:
The main advantage of a regular doubly linked list is the ability to insert or delete elements from the middle of the list given only a pointer to the node to be deleted/inserted after. This can't be done with an xor-linked list.
If one wants O(1) prepending or O(1) insertion/removal while iterating then singly linked lists are good enough and avoid a lot of code complexity and caveats with respect to garbage collection, debugging tools, etc.
If one doesn't need O(1) prepending/insertion/removal then using an array is probably more efficient in both space and time. Even if one only needs efficient insertion/removal while iterating, arrays can be pretty good since the insertion/removal can be done while iterating.
Given the above, what's the point? Are there any weird corner cases where an xor linked list is actually worthwhile?
Apart from saving memory, it allows for O(1) reversal, while still supporting all the other destructive update operations efficiently, like
concatenating two lists destructively in O(1)
insertAfter/insertBefore in O(1), when you only have a reference to the node and its successor/predecessor (which differs slightly from standard doubly linked lists)
remove in O(1), also with a reference to either the successor or predecessor.
I don't think the memory aspect is really important, since for most scenarios where you might use a XOR list, you can use a singly-linked list instead.
It is about saving memory. I had a situation where my data structure was 40 bytes. The memory manager aligned things on a 16 byte boundary, so each allocation was 48 bytes; regardless of the fact that I only needed 40. By using xor chain list, I was able to eliminate 8 bytes and drop my data structure size down to 32 bytes. Now, I can fit 2 nodes in the 64 byte pipeline cache at the same time. So, I was able to reduce memory usage, and improve performance.
Its purpose is (or more precisely was) just to save memory.
With an xor-linked list you can do anything you can do with an ordinary doubly-linked list. The only difference is that you have to decode the previous and next memory addresses from the xor-ed pointer for each node every time you need them.
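To show the decoding step and the O(1) reversal mentioned above, here is a toy sketch. Python has no raw pointers, so list indices stand in for memory addresses and index 0 plays the role of the null pointer. The `XorList` class and its methods are invented for illustration.

```python
class XorList:
    """Toy XOR-linked list: each node stores prev_index XOR next_index."""

    def __init__(self):
        self.values = [None]  # slot 0 is the "null address"
        self.links = [0]
        self.head = 0
        self.tail = 0

    def append(self, value):
        idx = len(self.values)
        self.values.append(value)
        self.links.append(self.tail)      # prev = old tail, next = null (0)
        if self.tail:
            self.links[self.tail] ^= idx  # old tail: next changes 0 -> idx
        else:
            self.head = idx
        self.tail = idx

    def reverse(self):
        """O(1): the links are symmetric, so just swap the two ends."""
        self.head, self.tail = self.tail, self.head

    def __iter__(self):
        prev, cur = 0, self.head
        while cur:
            yield self.values[cur]
            # Decode the next address from the link and where we came from.
            prev, cur = cur, self.links[cur] ^ prev


xs = XorList()
for item in "abc":
    xs.append(item)
forward = list(xs)
xs.reverse()
backward = list(xs)
```

Traversal always needs the address of the node you came from, which is the limitation raised in the question: a pointer to a single node in the middle is not enough to move in either direction.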

Is it possible to count the number of Set bits in Number in O(1)? [duplicate]

I was asked the above question in an interview, and the interviewer was very certain of the answer, but I am not sure. Can anyone help me here?
Sure. The obvious brute force method is just a big lookup table with one entry for every possible value of the input number. That's not very practical if the number is very big, but is still enough to prove it's possible.
Edit: the notion has been raised that this is complete nonsense, and the same could be said of essentially any algorithm.
To a limited degree, that's a fair statement -- but the limitations are so severe that for most algorithms it remains utterly meaningless.
My original point (at least as well as I remember it) was that population counting is about equivalent to many other operations like addition and subtraction that we normally assume are O(1).
At the hardware level, circuitry for a single-cycle POPCNT instruction is probably easier than for a single-cycle ADD instruction. Just for one example, for any practical size of data word, we can use table lookups on 4-bit chunks in parallel, then add the results from those pieces together. Even using fairly unlikely worst-case assumptions (e.g., separate storage for each of those tables) this would still be easy to implement in a modern CPU -- in fact, it's probably at least somewhat simpler than the single-cycle addition or subtraction mentioned above (see note 1 below).
This is a decided contrast to many other algorithms. For one obvious example, let's consider sorting. For even the most trivial sort most people can imagine -- 2 items, 8 bits apiece, we're already at a 64 kilobyte lookup table to get constant complexity. Long before we can do even a fairly trivial sort (e.g., 100 items) we need a lookup table that contains far more data items than there are atoms in the universe.
Looking at it from the opposite direction, it's certainly true that at some point, essentially nothing is O(1) any more. Let's consider the most trivial operations possible. For an N-bit CPU, bitwise OR is normally implemented as a set of N OR gates in parallel. Unlike addition, there's no interaction between one bit and another, so for any practical size of CPU, this is easy to execute in a single instruction.
Nonetheless, if I specify a bit-wise OR in which each operand is 100 petabits, there's nothing even approaching a practical way to do the job with constant complexity. Using the usual method of parallel OR gates, we end up with (among other things) 300 petabits worth of input and output lines -- a number that completely dwarfs even the number of pins on the largest CPUs.
On reasonable hardware, doing a bitwise OR on 100 petabit operands is going to take a while (not to mention quite a bit of hard drive space). If we increase that to 200 petabit operands, the time is likely to (about) double -- so from that viewpoint, it's an O(N) operation. Obviously enough, the same is going to be true with the other "trivial" operations like addition, subtraction, bit-wise AND, bit-wise XOR, and so on.
Nonetheless, unless you have very specific instructions to say you're going to be dealing with utterly immense operands, you're typically going to treat every one of these as a constant complexity operation. Looked at in these terms, a POPCNT instruction falls about halfway between bit-wise AND/OR/XOR on one hand, and addition/subtraction on the other, in terms of the difficulty to execute in fixed time.
1. You might wonder how it could possibly be simpler than an add when it actually includes an add after doing some other operations. If so, kudos -- it's an excellent question.
The answer is that it's because it only needs a much smaller adder. For example, a 64-bit CPU needs one half-adder and 63 full-adders. In the simple implementation, you carry out the addition bit-wise -- i.e., you add bit-0 of one operand to bit-0 of the other. That generates an output bit, and a carry bit. That carry bit becomes an input to the addition for the next pair of bits. There are some tricks to parallelize that to some degree, but the nature of the beast (so to speak) is bit-serial.
With a POPCNT instruction, we have an addition after doing the individual table lookups, but our result is limited to the size of the input words. Given the same size of inputs (64 bits) our final result can't be any larger than 64. That means we only need a 7-bit adder (enough for the values 0 through 64) instead of a 64-bit adder.
Since, as outlined above, addition is basically bit-serial, this means that the addition at the end of the POPCNT instruction is fundamentally a lot faster than a normal add. To be specific, it's logarithmic on the operand size, whereas simple addition is roughly linear on the operand size.
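The "table lookups on 4-bit chunks, then add the pieces" scheme described above can be sketched in software as follows. This is only an illustration: hardware would do the lookups in parallel, whereas this loop does them one after another, and the names and the 64-bit default width are my assumptions.

```python
# Population count of every 4-bit value, computed once.
NIBBLE_COUNTS = [bin(i).count("1") for i in range(16)]


def popcount_nibbles(x, width=64):
    """Count set bits of a fixed-width value using 4-bit table lookups."""
    x &= (1 << width) - 1  # treat x as an unsigned `width`-bit word
    total = 0
    for shift in range(0, width, 4):
        total += NIBBLE_COUNTS[(x >> shift) & 0xF]  # small additions only
    return total


all_ones = popcount_nibbles(0xFFFFFFFFFFFFFFFF)
```

For a fixed width the loop runs a fixed number of times (16 lookups for 64 bits), and the running total never exceeds the word width, which is why only a narrow adder is needed.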
If the bit size is fixed (e.g. natural word size of a 32- or 64-bit machine), you can just iterate over the bits and count them directly in O(1) time (though there are certainly faster ways to do it). For arbitrary precision numbers (BigInt, etc.), the answer must be no.
Some processors can do it in one instruction, obviously for integers of limited size. Look up the POPCNT mnemonic for further details.
For integers of unlimited size obviously you need to read the whole input, so the lower bound is O(n).
The interviewer probably meant the bit counting trick (the first Google result follows): http://www.gamedev.net/topic/547102-bit-counting-trick---new-to-me/
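I can't tell which trick the linked thread shows, so treat the following as an assumption: one widely known bit counting trick for 32-bit words is the parallel (SWAR) reduction, which sums neighbouring bit fields in a fixed number of steps. A sketch, with the function name chosen by me:

```python
def popcount32(x):
    """Count set bits in a 32-bit value with a fixed number of operations."""
    x &= 0xFFFFFFFF
    x = x - ((x >> 1) & 0x55555555)                 # 2-bit field sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)  # 4-bit field sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                 # 8-bit field sums
    # Multiplying adds the four byte sums into the top byte.
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24


count = popcount32(0xF0F0F0F0)
```

The number of steps does not depend on the value of the input, only on the fixed word size, which is the sense in which it is O(1).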

Resources