LRU with arrays - R&D - data-structures

Why does an LRU cache use a linked list? Can't we use an array to store the items, so that the most recently used item sits at the front of the array and the least recently used at the end?
The only reason I can think of not to use an array is that removing or updating elements performs worse than with a linked list.
Are there any other reasons?
Note: this question is for academic or research purposes, to get a better understanding of LRU.

Some context: an LRU cache is a custom data structure that keeps the most recently used results available for fast access and evicts the least recently used ones when it is full (this makes sense; take a messaging app for example: you don't want to wait 30 seconds to open up a chat, you want the last 10 messages to appear instantly, and if you want to get stuff even farther back then you can get that from the database and wait).
Now that you have been reminded of this important context, we can get to why we want to use a linked list:
It gives constant insertion/deletion time provided you already hold a reference to the node to insert or delete (which you usually get in constant time as well with a HashMap). If you had to insert into or delete from the middle of an array constantly, shifting the remaining elements (which is O(N)) would be very detrimental to performance (remember we don't want chat messages taking forever to load), whereas deleting from a doubly linked list is as easy as:
public void deleteNode(ListNode node) {
    // Assumes a doubly linked list with sentinel head and tail nodes,
    // so node.prev and node.next are never null.
    node.prev.next = node.next;
    node.next.prev = node.prev;
}
You can use an array, it just won't be as efficient. There is another reason for using linked lists, however: we want to keep a structure where (in constant time) we can move the node of the most recently used request to the front of our linked list (right after the head/sentinel dummy node). Combined with the HashMap lookup, this allows a constant-time get() method.

Related

Why Use A Doubly Linked List and HashMap for a LRU Cache Instead of a Deque?

I have implemented the design a LRU Cache Problem on LeetCode using the conventional method (Doubly Linked List+Hash Map). For those unfamiliar with the problem, this implementation looks something like this:
I understand why this method is used (quick removal/insertion at both ends, fast access in the middle). What I am failing to understand is why someone would use both a HashMap and a LinkedList when one could simply use an array-based deque (ArrayDeque in Java, simply deque in C++). This deque allows for easy insertion/deletion at both ends and quick access in the middle, which is exactly what you need for an LRU cache. You would also use less space because you wouldn't need to store a pointer to each node.
Is there a reason why the LRU cache is almost universally designed (in most tutorials at least) using the former method (HashMap + LinkedList) as opposed to the Deque/ArrayDeque method? Would the HashMap/LinkedList method have any benefits?
When an LRU cache is full, we discard the Least Recently Used item.
If we're discarding items from the front of the queue, then, we have to make sure the item at the front is the one that hasn't been used for the longest time.
We ensure this by making sure that an item goes to the back of the queue whenever it is used. The item at the front is then the one that hasn't been moved to the back for the longest time.
To do this, we need to maintain the queue on every put OR get operation:
When we put a new item in the cache, it becomes the most recently used item, so we put it at the back of the queue.
When we get an item that is already in the cache, it becomes the most recently used item, so we move it from its current position to the back of the queue.
Moving items from the middle to the end is not a deque operation and is not supported by the ArrayDeque interface. It's also not supported efficiently by the underlying data structure that ArrayDeque uses. Doubly-linked lists are used because they do support this operation efficiently.
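To make the "move from the middle to the back" operation concrete, here is a minimal sketch of the conventional design in Python. The class and method names (`LRUCache`, `_Node`, `_unlink`, `_append`) are illustrative choices, not taken from the question. The `-1` return on a miss follows the LeetCode convention, and the sketch assumes `capacity >= 1`.

```python
class _Node:
    """Doubly linked list node holding one cache entry."""
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    """Front of the list = least recently used, back = most recently used."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed to be >= 1
        self.nodes = {}               # key -> node, for O(1) lookup
        self.front = _Node()          # sentinel before the oldest item
        self.back = _Node()           # sentinel after the newest item
        self.front.next = self.back
        self.back.prev = self.front

    def _unlink(self, node):
        # O(1) removal from anywhere in the list.
        node.prev.next = node.next
        node.next.prev = node.prev

    def _append(self, node):
        # O(1) insertion just before the back sentinel.
        last = self.back.prev
        last.next = node
        node.prev = last
        node.next = self.back
        self.back.prev = node

    def get(self, key):
        node = self.nodes.get(key)
        if node is None:
            return -1
        self._unlink(node)            # move from the middle ...
        self._append(node)            # ... to the back
        return node.value

    def put(self, key, value):
        node = self.nodes.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
            self._append(node)
            return
        if len(self.nodes) >= self.capacity:
            oldest = self.front.next  # least recently used item
            self._unlink(oldest)
            del self.nodes[oldest.key]
        node = _Node(key, value)
        self.nodes[key] = node
        self._append(node)
```

Every step in `get` and `put` is a dictionary operation or a fixed number of pointer updates, which is where the O(1) bound comes from.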
The purpose of an LRU cache is to support two operations in O(1) time: get(key) and put(key, value), with the additional constraint that least recently used keys are discarded first. Normally the keys are the parameters of a function call and the value is the cached output of that call.
Regardless of how you approach this problem we can agree that you MUST use a hashmap. You need a hashmap to map a key already present in the cache to the value in O(1).
In order to deal with the additional constraint of least recently used keys being discarded first you can use a LinkedList or ArrayDeque. However since we don't actually need to access the middle, a LinkedList is better since you don't need to resize.
Edit:
Mr. Timmermans discussed in his answer why an ArrayDeque cannot be used in an LRU cache due to the necessity of moving elements from the middle to the end. With that being said, here is an implementation of an LRU cache that submits successfully on LeetCode using only appends and poplefts on the deque. Note that Python's collections.deque is implemented as a doubly linked list (of fixed-size blocks); however, we are only using operations of collections.deque that are also O(1) in a circular array, so the algorithm stays the same regardless.
from collections import deque

class LRUCache:
    def __init__(self, capacity: 'int'):
        self.capacity = capacity
        self.hashmap = {}      # key -> [value, number of entries for key in the deque]
        self.deque = deque()   # keys in request order, oldest on the left

    def get(self, key: 'int') -> 'int':
        res = self.hashmap.get(key, [-1, 0])[0]
        if res != -1:
            self.put(key, res)   # re-append the key to mark it as recently used
        return res

    def put(self, key: 'int', value: 'int') -> 'None':
        self.add(key, value)
        while len(self.hashmap) > self.capacity:
            self.remove()

    def add(self, key, value):
        if key in self.hashmap:
            self.hashmap[key][1] += 1
            self.hashmap[key][0] = value
        else:
            self.hashmap[key] = [value, 1]
        self.deque.append(key)

    def remove(self):
        k = self.deque.popleft()
        self.hashmap[k][1] -= 1
        if self.hashmap[k][1] == 0:   # no newer entry for k remains in the deque
            del self.hashmap[k]
I do agree with Mr. Timmermans that using the LinkedList approach is preferable - but I want to highlight that using an ArrayDeque to build an LRU cache is possible.
The main mixup between myself and Mr. Timmermans is how we interpreted capacity. I took capacity to mean caching the last N get / put requests, while Mr. Timmermans took it to mean caching the last N unique items.
The above code does have a loop in put which slows the code down - but this is just to get the code to conform to caching the last N unique items. If we had the code cache the last N requests instead, we could replace the loop with:
if len(self.deque) > self.capacity: self.remove()
This will make it as fast if not faster than the linked-list variant.
Regardless of how capacity is interpreted, the above method still works as an LRU cache: least recently used elements get discarded first.
I just want to highlight that designing an LRU cache in this manner is possible. The source is right there; try submitting it on LeetCode!
A doubly linked list is a common implementation of a queue. Because doubly linked lists have immediate access to both the front and the end of the list, they can insert data on either side in O(1) as well as delete data on either side in O(1). Because they can insert data at the end in O(1) time and delete data from the front in O(1) time, they make a natural underlying data structure for a queue. Queues are lists of items in which data can only be inserted at the end and removed from the beginning.
Queues are an example of an abstract data type, and we are able to use an array to implement them under the hood. Now, since queues insert at the end and delete from the beginning, plain arrays are only so good as the underlying data structure. While arrays are O(1) for insertions at the end, they're O(N) for deleting from the beginning (unless the array is used as a circular buffer). A doubly linked list, on the other hand, is O(1) for both inserting at the end and deleting from the beginning. That's what makes it a good fit for serving as the queue's underlying data structure.
Python's deque uses a doubly linked list as part of its data structure. With doubly linked lists, deque is capable of inserting or deleting elements at both ends of a queue with constant O(1) performance.

Synchronize two lists of objects

Problem
I have two lists of objects. Each object contains the following:
GUID (allows determining whether objects are the same from a business point of view)
Timestamp (updated to the current UTC time each time the object changes)
Version (positive integer; incremented each time the object changes)
Deleted (boolean flag; switched to "true" instead of actually deleting the object)
Data (some useful payload)
Any other fields if needed
Next, I need to sync two lists according to these rules:
If an object with some GUID is present in only one list, it should be copied to the other list
If an object with some GUID is present in both lists, the instance with the lower Version should be replaced with the one having the greater Version (nothing to do if the versions are equal)
Real-world requirements:
Each list has 50k+ objects, each object is about 1 Kb
The lists are located on different machines connected via the Internet (e.g., a mobile app and a remote server); thus, the algorithm shouldn't waste much traffic or CPU
Most of the time (say, 96%) the lists are already in sync before the sync process starts; hence, the algorithm should determine this with minimal effort
If there are any differences, most of the time they are pretty small (3-5 objects changed/added)
It should work fine if one list is empty (and the other still has 50k+ items)
Solution #1 (currently implemented)
Client stores the time-of-last-sync-succeed (say T)
Both lists are asked for all objects having Timestamp > T (i.e. recently modified; in the production it's ... > (T - day) for better robustness)
These lists of recently modified objects are synced naively:
items present only in the first list are saved to the second list
items present only in the second list are saved to the first list
other items have their Versions compared and are saved to the appropriate list (if needed)
Pros:
Works great with small changes
Almost fits the requirements
Cons:
Depends on T, which makes the algorithm fragile: it's easy to sync last updates, but hard to make sure lists are completely synced (using minimal T like 1970-01-01 just hangs the sync process)
My questions:
Is there any common / best-practice / proven way to sync object lists?
Are there any better solutions [than #1] for my case?
P.S. Already viewed, not duplicates:
Compare Two List Of Objects For Synchronization
Two list synchronization
Summary
All the answers make some worthwhile points. To summarize, here is the compiled answer I was looking for, based on the sync system I finally implemented:
In general, use Merkle trees. They are very efficient at comparing large amounts of data.
If you can, rebuild your hash tree from scratch every time you need it.
Check the time required to rebuild the hash tree. Most likely it's pretty fast (e.g., in my case on a Nexus 4, rebuilding the tree for 20k items takes ~2 sec: 1.8 sec for fetching data from the DB + 0.2 sec for building the tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when data changes (my first try was rebuilding only the relevant branches; it's not too complicated to implement, but it is very fragile).
Nevertheless, it's OK to cache and reuse the tree if no data modifications were made at all. Once a modification happens, invalidate the whole cache.
Technical details
GUID is 32 chars long without any hyphens/braces, lowercase;
I use 16-ary tree with the height of 4, where each branch is related to the GUID's char. It may be implemented as actual tree or map:
0000 → (hash of items with GUID 0000*)
0001 → (hash of items with GUID 0001*)
...
ffff → (hash of items with GUID ffff*);
000 → (hash of hashes 000_)
...
00 → (hash of hashes 00_)
...
() → (root hash, i.e. hash of hashes _)
Thus, the tree has 65536 leaves and requires 2 MB of memory; each leaf covers ~N/65536 data items. A binary tree would be 2x more efficient in terms of memory, but it's harder to implement.
I had to implement these methods:
getHash(): returns the root hash; used for the primary check (as mentioned, in 96% of cases that's all we need to test);
getHashChildren(x): returns the list of hashes x_ (at most 16); used for efficient, single-request discovery of data differences;
findByIdPrefix(x): returns items with GUID x*, where x must contain exactly 4 chars; used for requesting leaf items;
count(x): returns the number of items with GUID x*; when it is reasonably small, we can skip checking the tree branch-by-branch and transfer the bunch of items with a single request;
Since syncing is done per branch, transmitting small amounts of data, it's very responsive (you can check the progress at any time), very robust against unexpected termination (e.g., due to network failure), and easily restarts from the last point if needed.
IMPORTANT: sometimes you will get stuck in a conflicting state: {version_1 = version_2, but hash_1 != hash_2}. In this case you must make some decision (maybe with the user's help, or by comparing timestamps as a last resort) and overwrite one item with the other to resolve the conflict; otherwise you'll end up with unsynced and unsyncable hash trees.
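The prefix-keyed hash tree described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation from the answer: the function names (`build_tree`, `differing_leaves`), the use of SHA-256, and hashing just `guid:version` per item are all assumptions. Both trees are held locally here, whereas in the real system the remote side would be queried with something like getHashChildren(x).

```python
import hashlib

HEX = "0123456789abcdef"
DEPTH = 4  # leaves are keyed by the first 4 hex chars of the GUID


def _digest(parts):
    """Hash a sequence of strings into one hex digest."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent parts cannot run together
    return h.hexdigest()


def build_tree(items):
    """items: dict guid -> version (assumed item fingerprint).

    Returns a sparse dict prefix -> hash; '' is the root.
    Prefixes with no items are simply absent.
    """
    buckets = {}
    for guid in sorted(items):
        buckets.setdefault(guid[:DEPTH], []).append(f"{guid}:{items[guid]}")
    tree = {prefix: _digest(entries) for prefix, entries in buckets.items()}
    # Build parents level by level: 3-char prefixes, then 2, 1, and the root.
    for length in range(DEPTH - 1, -1, -1):
        children = {}
        for prefix in sorted(p for p in tree if len(p) == length + 1):
            children.setdefault(prefix[:length], []).append(prefix + "=" + tree[prefix])
        for parent, entries in children.items():
            tree[parent] = _digest(entries)
    return tree


def differing_leaves(local, remote, prefix=""):
    """Descend only into branches whose hashes differ; return leaf prefixes."""
    if local.get(prefix) == remote.get(prefix):
        return []
    if len(prefix) == DEPTH:
        return [prefix]
    out = []
    for c in HEX:
        out.extend(differing_leaves(local, remote, prefix + c))
    return out
```

When the two root hashes match, `differing_leaves` returns immediately, which corresponds to the cheap "already synced" check; otherwise only the mismatching branches are visited.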
Possible improvements
Implement transmitting (GUID, Version) pairs without payload for lightweighting requests.
Two suggestions come to mind, the first one is possibly something you're doing already:
1) Don't send entire lists of items with timestamps > T. Instead, send a list of (UUID, Version) tuples of objects with timestamps > T. Then the other side can figure out which objects it needs to update from that. Send the UUIDs of those back to request the actual objects. This avoids sending full objects if they have timestamp > T, but are nonetheless newer already (or present already with the latest Version) on the other side.
2) Don't process the full list at once, but in chunks, i.e. first sync 10%, then the next 10% etc. to avoid transferring too much data at once for big syncs (and to allow for restarting points if a connection should break). This can be done by e.g. starting with all UUIDs with a checksum equivalent to 0 modulo 10, then 1 modulo 10, etc.
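As a rough sketch of both suggestions: the first function compares (UUID, Version) pairs to decide which objects each side actually needs, and the second assigns a UUID to one of ten chunks by checksum. The names `plan_sync` and `chunk_of`, and the choice of CRC32 as the checksum, are assumptions made for illustration.

```python
import zlib


def plan_sync(local_versions, remote_versions):
    """Both arguments map uuid -> version (positive integers).

    Returns (to_pull, to_push): UUIDs whose objects must be fetched from
    the remote side, and UUIDs whose objects must be sent to it.
    A missing object is treated as version 0.
    """
    to_pull = sorted(u for u, v in remote_versions.items()
                     if local_versions.get(u, 0) < v)
    to_push = sorted(u for u, v in local_versions.items()
                     if remote_versions.get(u, 0) < v)
    return to_pull, to_push


def chunk_of(uuid, chunks=10):
    """Assign a UUID to a chunk by checksum modulo the chunk count."""
    return zlib.crc32(uuid.encode("utf-8")) % chunks
```

Only the UUIDs in `to_pull` and `to_push` need their full payload transferred. Syncing chunk 0, then chunk 1, and so on gives natural restart points.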
Another possibility would be proactive syncing, e.g. asynchronously posting changes, possibly via UDP (unreliable, as opposed to TCP). You would still need to sync when you need current information, but chances are most of it is current.
You need to store not the time of the last synchronization, but the state of the objects (e.g. the hash of the object data) at the time of the last synchronization. Then you compare each list with the stored list and find which objects have changed on each side.
This is much more reliable than relying on time, because time requires that both sides have synchronized clocks which give precise time (and this is not the case on most systems). For the same reason, your idea of detecting changes based on time + version can be more error-prone than it initially seems.
Also, you don't initially transfer object data but only GUIDs.
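A minimal sketch of this state-based change detection, assuming each side keeps a snapshot of per-object hashes taken at the last successful sync. The names `fingerprint` and `changed_since` and the use of SHA-256 are illustrative assumptions.

```python
import hashlib


def fingerprint(data):
    """Hash of an object's payload (bytes)."""
    return hashlib.sha256(data).hexdigest()


def changed_since(snapshot, current):
    """snapshot: guid -> fingerprint stored at the last sync.
    current:  guid -> payload bytes as they are now.

    Returns the GUIDs that are new or whose payload changed; no clock is involved.
    """
    return sorted(g for g, data in current.items()
                  if snapshot.get(g) != fingerprint(data))
```

Each side runs this against its own snapshot, and only the resulting GUIDs need to be exchanged at first.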
BTW we've made a framework (free with source) which addresses exactly your problems. I am not giving the link because some alternatively talented people would complain.

What data structure would be optimized to represent a stock market?

Data for various stocks arrives continuously from various stock exchanges. Which data structure is suitable for storing this data?
Things to consider are:
a) Efficient retrieval and update of data is required, as stock data changes every second or microsecond during trading hours.
I thought of using a heap, as the number of stocks would be more or less constant and the most frequently used operations are retrieval and update, so a heap should perform well for this scenario.
b) We need to show stocks which are currently trending (as in volume of shares being sold: most active and least active, highest profit and loss on a particular day).
I am not sure how to go about this.
c) As storing to a database from any programming language has some latency, considering the number of stocks that will be traded during a particular time, how can you store all the transactional data persistently?
P.S.: This is an interview question from Morgan Stanley.
A heap doesn't support efficient random access (i.e. look-up by index) nor getting the top k elements without removing elements (which is not desired).
My answer would be something like:
A database would be the preferred choice for this, as, with a proper table structure and indexing, all of the required operations can be done efficiently.
So I suppose this is more a theoretical question about understanding of data structures (related to in-memory storage, rather than persistent).
It seems multiple data structures is the way to go:
a) Effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
A map would make sense for this one. Hash-map or tree-map allows for fast look-up.
b) How to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)?
Just about any sorted data structure seems to make sense here (with the above map having pointers to the correct node, or pointing to the same node). One for activity and one for profit.
I'd probably go with a sorted (double) linked-list. It takes minimal time to get the first or last n items. Since you have a pointer to the element through the map, updating takes as long as the map lookup plus the number of moves of that item required to get it sorted again (if any). If an item often moves many indices at once, a linked-list would not be a good option (in which case I'd probably go for a Binary Search Tree).
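The map-plus-sorted-structure idea can be sketched as below. Note that this sketch swaps the sorted linked list for an array-backed sorted list maintained with `bisect`, which is simpler to write in Python but makes each update O(n) in the worst case because of element shifting. The class name `VolumeBoard` and its methods are hypothetical.

```python
import bisect


class VolumeBoard:
    """Map for O(1) lookup plus a sorted list for most/least active queries."""

    def __init__(self):
        self.volume = {}   # symbol -> latest traded volume
        self.ranked = []   # sorted list of (volume, symbol) pairs

    def update(self, symbol, volume):
        old = self.volume.get(symbol)
        if old is not None:
            # Locate the existing entry by binary search and drop it.
            i = bisect.bisect_left(self.ranked, (old, symbol))
            del self.ranked[i]
        self.volume[symbol] = volume
        bisect.insort(self.ranked, (volume, symbol))

    def most_active(self, n):
        start = max(len(self.ranked) - n, 0)
        return [s for _, s in reversed(self.ranked[start:])]

    def least_active(self, n):
        return [s for _, s in self.ranked[:n]]
```

A second instance keyed on profit instead of volume would cover the "high profit and loss" view in the same way.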
c) How can you store all the transactional data persistently?
I understand this question as - if the connection to the database is lost or the database goes down at any point, how do you ensure there is no data corruption? If this is not it, I would've asked for a rephrase.
Just about any database course should cover this.
As far as I remember - it has to do with creating another record, updating this record, and only setting the real pointer to this record once it has been fully updated. Before this you might also have to set a pointer to the old record so you can check if it's been deleted if something happens after setting the pointer away, but before deletion.
Another option is having an active transaction table which you add to when starting a transaction and remove from when a transaction completes (which also stores all required details to roll back or resume the transaction). Thus, whenever everything is okay again, you check this table and roll back or resume any transactions that have not yet completed.
If I had to choose, I would go for a hash table:
Reason: it is synchronized and thread-safe (at least in the case of Java's Hashtable), with O(1) average-case complexity.
Provided:
1. A good hash function to avoid collisions.
2. A high-performance cache.
While this is a language agnostic question, a few of the requirements jumped out at me. For example:
efficient retrieval and update of data is required as stock data changes every second or microsecond during trading hours.
The Java class HashMap uses the hash code of a key to rapidly access values in its collection. It has O(1) average-case runtime complexity for lookups, which is ideal.
need to show stocks which are currently trending (as in volume of shares being sold: most active and least active, highest profit and loss on a particular day)
This is an implementation-based issue. Your best bet is to implement a fast sorting algorithm, like QuickSort or Mergesort.
as storing to a database from any programming language has some latency, considering the number of stocks that will be traded during a particular time, how can you store all the transactional data persistently?
A database would have been my first choice, but it depends on your resources.

Optimizing Inserting into the Middle of a List

I have algorithms that work with dynamically growing lists (contiguous memory like a C++ vector, Java ArrayList or C# List). Until recently, these algorithms would insert new values into the middle of the lists. Of course, this was usually a very slow operation. Every time an item was added, all the items after it needed to be shifted to a higher index. Do this a few times for each algorithm and things get really slow.
My realization was that I could add the new items to the end of the list and then rotate them into position later. That's one option!
Another option, when I know how many items I'm adding ahead of time, is to add that many items to the back, shift the existing items and then perform the algorithm in-place in the hole I've made for myself. The negative is that I have to add some default value to the end of the list and then just overwrite them.
I did a quick analysis of these options and concluded that the second option is more efficient. My reasoning was that the rotation with the first option would result in in-place swaps (requiring a temporary). My only concern with the second option is that I am creating a bunch of default values that just get thrown away. Most of the time, these default values will be null or a mem-filled value type.
However, I'd like someone else familiar with algorithms to tell me which approach would be faster. Or, perhaps there's an even more efficient solution I haven't considered.
Arrays aren't efficient for lots of insertions or deletions into anywhere other than the end of the array. Consider whether using a different data structure (such as one suggested in one of the other answers) may be more efficient. Without knowing the problem you're trying to solve, it's near-impossible to suggest a data structure (there's no one solution for all problems). That being said...
The second option is definitely the better option of the two. A somewhat better option (avoiding the default-value issue): simply copy 789 to the end and overwrite the middle 789 with 456. So the only intermediate step would be 0123789789.
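The "copy the tail to the end, then overwrite" idea from the paragraph above can be sketched like this. The helper name `insert_block` is hypothetical, and the branch for a tail shorter than the inserted block is an extension beyond the 0123789 example, added so the function works for any position.

```python
def insert_block(lst, pos, items):
    """Insert all of `items` into `lst` at index `pos` with one growth step
    and one shift, instead of one shift per inserted item."""
    n, k = len(lst), len(items)
    m = n - pos                           # existing items that must move right
    if m >= k:
        lst.extend(lst[n - k:])           # e.g. 0123789 -> 0123789789
        lst[pos + k:n] = lst[pos:n - k]   # shift the rest of the tail by k
    else:
        lst.extend(items[m:])             # new items that land past the old end
        lst.extend(lst[pos:n])            # then the old tail
    head = min(k, m)
    lst[pos:pos + head] = items[:head]    # overwrite the hole with new items
```

With `lst = [0, 1, 2, 3, 7, 8, 9]`, `pos = 4` and `items = [4, 5, 6]`, the intermediate state after the first `extend` is 0123789789, and no placeholder default values are ever written.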
Your default-value concern is, however, (generally) not a big issue:
In Java, for one, you cannot (to my knowledge) even allocate memory for an array that's not 0- or null-filled. C++ STL containers also enforce this, I believe (but not C++ itself).
The size of a pointer compared to any moderate-sized class is minimal, so assigning it a default value also takes minimal time. In Java and C#, objects are accessed through references; in C++ you can use pointers (something like boost::shared_ptr or a pointer-vector is preferred over raw pointers). This doesn't apply to primitives, which are small to start with, so they are generally not a big issue either.
I'd also suggest forcing a reallocation to a specified size before you start inserting to the end of the array (Java's ArrayList::ensureCapacity or C++'s vector::reserve). In case you didn't know - varying-length-array implementations tend to have an internal array that's bigger than what size() returns or what's accessible (in order to prevent constant reallocation of memory as you insert or delete values).
Also note that there are more efficient methods to copy parts of an array than doing it manually with for loops (e.g. Java's System.arraycopy).
You might want to consider changing your representation of the list from using a dynamic array to using some other structure. Here are two options that allow you to implement these operations efficiently:
An order statistic tree is a modified type of binary tree that supports insertions and selections anywhere in O(log n) time, as well as lookups in O(log n) time. This will increase your memory usage quite a bit because of the overhead for the pointers and extra bookkeeping, but should dramatically speed up insertions. However, it will slow down lookups a bit.
If you always know the insertion point in advance, you could consider switching to a linked list instead of an array, and just keep a pointer to the linked list cell where insertions will occur. However, this slows down random access to O(n), which could possibly be an issue in your setup.
Alternatively, if you always know where insertions will happen, you could consider representing your array as two stacks - one stack holding the contents of the array to the left of the insert point and one holding the (reverse) of the elements to the right of the insertion point. This makes insertions fast, and if you have the right type of stack implementation could keep random access fast.
Hope this helps!
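The two-stack representation mentioned above can be sketched as follows, using Python lists as the stacks so that indexing into either stack stays O(1). The class name `TwoStackList` and its methods are hypothetical.

```python
class TwoStackList:
    """A sequence split at an insertion point into two stacks."""

    def __init__(self, items=()):
        self.left = list(items)   # items before the insertion point, in order
        self.right = []           # items after the insertion point, reversed

    def move_cursor(self, pos):
        # Costs O(distance moved); insertions afterwards are cheap.
        while len(self.left) > pos:
            self.right.append(self.left.pop())
        while len(self.left) < pos and self.right:
            self.left.append(self.right.pop())

    def insert(self, value):
        self.left.append(value)   # amortized O(1), nothing is shifted

    def __getitem__(self, i):
        if not 0 <= i < len(self.left) + len(self.right):
            raise IndexError(i)
        if i < len(self.left):
            return self.left[i]
        # The right stack is reversed, so index it from its end.
        return self.right[len(self.left) - i - 1]

    def to_list(self):
        return self.left + self.right[::-1]
```

Repeated insertions at the same point only append to `left`. The cost is paid when the insertion point moves, which is why this fits the case where the insertion point is known in advance.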
HashMaps and linked lists were designed for the problem you are having. Given an indexed data structure with numbered items, inserting an item in the middle requires renumbering every item after it.
You need a data structure that is optimized to make inserts constant, O(1), complexity. HashMaps were designed to make insert and delete operations lightning quick regardless of dataset size.
I can't pretend to do the HashMap subject justice by describing it. Here is a good intro: http://en.wikipedia.org/wiki/Hash_table

class/object to generate unique id's

I'm using C# but even if you don't know it, it should be pretty easy to follow along with this question.
Here's my problem: I have some objects that I'd like to keep in a hashset-like-data structure so that I can look them up based on an int ID. These objects have mutable properties, so hashing them is not an option (I would need something constant about them to hash, yes?).
What I've done is develop the following interface:
public interface IUniqueIDCollection
{
    // Can return any int that hasn't been requested yet.
    public int RequestUniqueID();
    // Undoes the requesting of an int.
    public int ReleaseUniqueID(int uniqueID);
}
My initial thought is to just store an internal counter in the IUniqueIDCollection that increments as ID's are requested. However, once ID's are released, I would have to keep track of the ranges or individual ID's that have been released. I think the latter would be better. But if I used a counter (or any cyclic function) to generate the ID's, I would have the problem of having to check through sequences of ID's that have been successively requested but not released once the counter wraps around.
The heuristics are these: let's say a maximum of 5,000 ID's will be requested at once. HOWEVER, very often ID's will be requested and then released. Releasing will tend to happen in ranges, i.e. maybe 100 will be requested all at once, and then all 100 will be released in a short time interval.
I know I could use a GUID or something instead of an int, but I'd like to save space/bandwidth/processing time of the ID's.
So my question is: What should the request and release methods look like in the interface I gave above, in terms of pseudo code, given the heuristics?
If you're sure that released ID's are safe to be reused immediately (i.e., there won't be stale references to old ID's hanging around that would be confused if a new object was assigned a recently-released ID), you can use the released ID's first. So when an ID is released, you put it at the end of a queue. When a new ID is requested, you use the first one in the queue. If the queue is empty, you increment the internal counter and give out the new number.
Advantage of this implementation:
All operations are O(1). You're never iterating over a collection or range. You only ever insert at the end of the queue, remove from the front of the queue, or increment your counter.
The memory footprint should be fairly low because you're trying to use up the queue as quickly as possible.
The implementation is straightforward.
Disadvantages:
You'll be reusing ID's quickly, so you won't be using your whole index range to keep new objects from using the same ID as recently-released objects.
You won't be able to even guess at the age of an object by looking at its ID.
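The queue-plus-counter scheme described in this answer can be sketched as below. It is written in Python rather than C# purely for illustration. The class name `IdPool` is hypothetical, and the sketch assumes callers never release the same ID twice.

```python
from collections import deque


class IdPool:
    """Hands out released IDs first; otherwise increments a counter."""

    def __init__(self):
        self.next_id = 0          # next never-used ID
        self.released = deque()   # IDs available for reuse, oldest first

    def request(self):
        if self.released:
            return self.released.popleft()   # O(1) reuse
        new_id = self.next_id
        self.next_id += 1
        return new_id

    def release(self, unique_id):
        # Assumes unique_id is currently in use and is not released twice.
        self.released.append(unique_id)
```

Both operations are a single deque or counter operation, which matches the O(1) claim above.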
Probably a worse idea than Tom Panning's above in almost all cases, but you could use a BitArray to keep track of IDs that are in use. The memory usage is as many bits as you would ever have live IDs in total; worst case would be 512MB for mapping out all 32-bit ints. Releasing is easy: just set the corresponding bit to 0. Acquiring (or requesting) an ID requires searching for a 0 bit, and if you don't find it, extending the BitArray.
If you still have the option of extending your BitArray (i.e. you're not at 512MB yet), you would probably not want to search all of the BitArray before deciding to extend - doing that all the time would be slow. You certainly wouldn't always want to start at the same index: it might be a good idea to keep track of the last 0 that you found and start searching from there.
The one advantage that I can see is memory usage once all, or almost all, of the objects are released. Then Tom Panning's solution requires at least 32 times as much memory as this one. However, I'd expect that in typical usage that solution uses less.
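For comparison, here is a rough sketch of the bit-array approach, using a Python `bytearray` in place of C#'s BitArray. The class name `BitIdPool` is hypothetical. As a simplification, this sketch scans the whole array (starting from the last position found) before extending, which the answer above notes you would probably want to avoid in practice.

```python
class BitIdPool:
    """One bit per ID: 1 means in use, 0 means free."""

    def __init__(self):
        self.bits = bytearray()   # each byte covers 8 IDs
        self.hint = 0             # index where the next search starts

    def _in_use(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def request(self):
        total = len(self.bits) * 8
        for step in range(total):
            i = (self.hint + step) % total   # wrap around from the hint
            if not self._in_use(i):
                break
        else:
            i = total                # no free bit found: extend by 8 IDs
            self.bits.append(0)
        self.bits[i >> 3] |= 1 << (i & 7)
        self.hint = i + 1
        return i

    def release(self, i):
        self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF
```

Memory use is one bit per ID ever handed out, regardless of how many IDs are currently released.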

Resources