How can I merge k sorted data streams using O(1) RAM? How should I define the data stream object and its related functions/operations?
My solution: I thought of using array lists as the data stream object. I planned to find the minimum value among the 0th indices of the k array lists. The minimum value should be removed from its array list and put into the output array list. This process should be repeated until all k array lists are empty. But I guess this would take O(k * length of each array list). Any ideas how to do it with O(1) RAM?
Making an O(1) RAM algorithm is very dependent on your underlying data structure and language of choice. Assuming you know how to manipulate your data structure with O(1) RAM, see this:
http://en.wikipedia.org/wiki/Merge_sort
The merging function takes O(1) memory. Now all you need is an index into your set of data streams; merge each stream into the first one in turn, and you are done.
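A minimal sketch of that idea in Python, assuming each stream is an iterator that yields values in ascending order and never yields None (None is used here as an end-of-stream sentinel): the two-way merge keeps only two values in working memory, and folding it over the k streams merges everything.

    from functools import reduce

    def merge_two(a, b):
        # Lazily merge two sorted iterators with O(1) working memory.
        x, y = next(a, None), next(b, None)
        while x is not None and y is not None:
            if x <= y:
                yield x
                x = next(a, None)
            else:
                yield y
                y = next(b, None)
        while x is not None:
            yield x
            x = next(a, None)
        while y is not None:
            yield y
            y = next(b, None)

    def merge_streams(streams):
        # Fold the two-way merge over all k streams.
        return reduce(merge_two, streams)

    print(list(merge_streams([iter([1, 4, 7]), iter([2, 5]), iter([3, 6, 8])])))
    # [1, 2, 3, 4, 5, 6, 7, 8]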
Related
How is the problem of sorting a very huge list tackled?
I suppose we divide the list, have each part processed on a separate CPU, and produce small sorted lists.
But how can we combine and produce a final sorted list?
You can merge multiple sorted lists using a priority queue (based on a binary heap).
Fill the queue with pairs (current element of the list, or its index; list id).
At every step:
extract the pair with the minimum element from the queue
add its value to the result
get the next element of the same list (if any)
insert the new pair into the queue
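A sketch of those steps using Python's heapq module, assuming the inputs are in-memory sorted lists (the standard library also ships heapq.merge, which does the same for arbitrary iterables):

    import heapq

    def k_way_merge(lists):
        # Seed the heap with (value, list id, index) for each non-empty list.
        heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
        heapq.heapify(heap)
        result = []
        while heap:
            value, i, j = heapq.heappop(heap)    # min element across all lists
            result.append(value)
            if j + 1 < len(lists[i]):            # next element of the same list
                heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
        return result

    print(k_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))
    # [1, 2, 3, 4, 5, 6, 7, 8]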
How huge is your list relative to available memory?
For useful clues, start from the Wikipedia page on external sorting.
The basic approach is to create a min-heap of size n, where n is the number of sorted partitions of the huge list.
Each node of the binary heap holds a pair: the number (index) of the sorted list it came from and the value.
The top node of the min-heap holds the minimum value of the whole huge list, and its index tells you which sorted list it came from. Pop the top of the min-heap, append its value to the output, push the next value from the list the popped node came from, and heapify again.
Repeat until all nodes are consumed; also take care to shrink the heap as one or more of the lists become empty in the process.
Since the issue is that your list is larger than memory, I would say external sort is the solution:
https://en.wikipedia.org/wiki/External_sorting
Say we have N blocks of main memory. We can load N-1 blocks of the two lists and use the remaining block as an output buffer.
Merge the two lists by the usual merge procedure, comparing front elements, and write the result to the output buffer.
When the buffer is full, write the output back to secondary memory.
Repeat the steps until all the lists are merged.
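A small Python sketch of one such merge pass, assuming each run is a text file with one integer per line; the name buffer_size stands in for the size of the output block (both the function and parameter names are illustrative, not from the original answer):

    def merge_runs(run_a, run_b, out, buffer_size=4096):
        # Merge two sorted runs (open text files, one integer per line)
        # into `out`, holding at most `buffer_size` output values in memory.
        a = (int(line) for line in run_a)
        b = (int(line) for line in run_b)
        buf = []
        x, y = next(a, None), next(b, None)
        while x is not None or y is not None:
            if y is None or (x is not None and x <= y):
                buf.append(x)
                x = next(a, None)
            else:
                buf.append(y)
                y = next(b, None)
            if len(buf) >= buffer_size:           # buffer full: flush to disk
                out.writelines(f"{v}\n" for v in buf)
                buf.clear()
        out.writelines(f"{v}\n" for v in buf)     # final partial buffer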
So I have a set of objects X, and each of them has a value v[x].
How can I store the objects X in a way that allows me to efficiently compute the x with the highest value?
Also I would like to be able to change the value of v[x], and have x automatically fall to the correct place in the data structure.
I thought about using a priority queue for this, but my friend told me I should use a hash map instead, which confused me because hash maps are unordered.
You are correct, and your friend is wrong: a hash map is not going to work, because it is unordered. A hash map may be useful if you wish to maintain the values v externally to your objects x, but then it would be a separate data structure, in addition to the one providing the ordering.
Priority queue with a comparator that compares the value v attached to the object x will provide you with a fast way to get the object with the highest value.
No matter what data structure you are going to use, it would be up to you to update it when the value v[x] changes. Generally, you will need to remove the object from the structure, and then insert it back right away, so that it could be placed at its new position according to its updated value.
You have 2 operations that you wish to support efficiently:
Find maximum
Update value
For #1, a priority queue (i.e. heap) is a good idea, but it doesn't allow you to efficiently do #2 - you'll have to look through the whole queue to find the correct node, then update and move (or delete and reinsert) it - this takes O(n).
To support #2 efficiently, you can use a hash map in addition to a priority queue (perhaps this is what your friend was talking about) - have each object map to the applicable node in the heap, then you can find the correct node in expected O(1) and update it in O(log n).
As an alternative, you can use a (self-balancing) binary search tree. You'll firstly sort on the value, then on a unique member of the object (like a unique ID). This will allow you to find any object in O(log n). #1 can be implemented to take O(1) and #2 will take O(log n) (through delete and reinsert).
Lastly, for completeness, elements in a hash map are unordered - you'll have to look through all the values to find the maximum (i.e. it takes O(n)) (but update can be performed in expected O(1)).
Summary:
                 Find Max   Update
    Heap only    O(1)       O(n)
    Heap + HM    O(1)       O(log n) (expected)
    BST          O(1)       O(log n)
    HM only      O(n)       O(1) (expected)
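A sketch of the heap-plus-hash-map option in Python (class and method names are my own, not from a library): the map from key to heap position is what turns the O(n) search into an expected O(1) lookup, after which the usual sift repairs cost O(log n).

    class IndexedMaxHeap:
        # Max-heap of (value, key) pairs plus a map key -> heap index,
        # so find-max is O(1) and update-value is O(log n).
        def __init__(self):
            self.heap = []
            self.pos = {}

        def _swap(self, i, j):
            self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
            self.pos[self.heap[i][1]] = i
            self.pos[self.heap[j][1]] = j

        def _sift_up(self, i):
            while i > 0 and self.heap[i][0] > self.heap[(i - 1) // 2][0]:
                self._swap(i, (i - 1) // 2)
                i = (i - 1) // 2

        def _sift_down(self, i):
            n = len(self.heap)
            while True:
                largest = i
                for c in (2 * i + 1, 2 * i + 2):
                    if c < n and self.heap[c][0] > self.heap[largest][0]:
                        largest = c
                if largest == i:
                    return
                self._swap(i, largest)
                i = largest

        def insert(self, key, value):
            self.heap.append((value, key))
            self.pos[key] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)

        def find_max(self):
            return self.heap[0][1]               # key with the highest value

        def update(self, key, value):
            i = self.pos[key]                    # expected O(1) via the hash map
            old = self.heap[i][0]
            self.heap[i] = (value, key)
            if value > old:
                self._sift_up(i)
            else:
                self._sift_down(i)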
This was an interview question asked to me almost 3 years back and I was pondering about this a while back.
Design a data structure that supports the following operations:
insert_back(), remove_front() and find_mode(). Best complexity
required.
The best solution I could think of was O(log n) for insertion and deletion and O(1) for mode. This is how I solved it: keep a queue to track the order in which elements are inserted and deleted.
Also keep an array that is max-heap ordered, and a hash table.
The hash table maps an integer key to the index of that element in the heap array. The heap array contains ordered pairs (count, element) and is ordered on the count property.
Insertion: Insert the element into the queue. Look up the element's heap array index in the hash table. If none exists, add the element to the heap, heapify upwards, and store its final location in the hash table. Increment the count at that location and heapify upwards or downwards as needed to restore the heap property.
Deletion: Remove the element from the head of the queue. Look up its heap array index in the hash table. Decrement the count in the heap and re-heapify upwards or downwards as needed to restore the heap property.
Find Mode: The element at the head of the heap array (getMax()) gives us the mode.
Can someone please suggest something better. The only optimization I could think of was using a Fibonacci heap but I am not sure if that is a good fit in this problem.
I think there is a solution with O(1) for all operations.
You need a deque, and two hashtables.
The first one is a linked hashtable, where for each element you store its count, the next element in count order, and the previous element in count order. Then you can look up the next and previous elements' entries in that hashtable in constant time. For this hashtable you also keep and update the element with the largest count. (element -> count, next_element, previous_element)
In the second hashtable, for each distinct count, you store the elements at the start and at the end of the run of elements with that count in the first hashtable. Note that the number of distinct counts is less than n (it's O(sqrt(n)), I think). (count -> (first_element, last_element))
Basically, when you add an element to or remove an element from the deque, you can find its new position in the first hashtable by analyzing its next and previous elements, and the values for the old and new count in the second hashtable in constant time. You can remove and add elements in the first hashtable in constant time, using algorithms for linked lists. You can also update the second hashtable and the element with the maximum count in constant time as well.
I'll try writing pseudocode if needed, but it seems to be quite complex with many special cases.
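For what it's worth, here is a Python sketch of a closely related O(1) scheme with fewer special cases. It is a variant, not the exact two-hashtable structure described above: instead of a linked list ordered by count, it keeps a bucket of elements per count plus the current maximum count. The key observation is that remove_front lowers a count by exactly one, so the maximum count can only drop by one at a time.

    from collections import deque, defaultdict

    class ModeQueue:
        # O(1) insert_back, remove_front and find_mode.
        def __init__(self):
            self.q = deque()
            self.count = defaultdict(int)      # element -> count
            self.by_count = defaultdict(set)   # count -> elements with that count
            self.max_count = 0

        def insert_back(self, x):
            self.q.append(x)
            c = self.count[x]
            if c:
                self.by_count[c].discard(x)
            self.count[x] = c + 1
            self.by_count[c + 1].add(x)
            self.max_count = max(self.max_count, c + 1)

        def remove_front(self):
            x = self.q.popleft()
            c = self.count[x]
            self.by_count[c].discard(x)
            if c > 1:
                self.count[x] = c - 1
                self.by_count[c - 1].add(x)
            else:
                del self.count[x]
            # The max count can only drop by one per removal.
            if self.max_count == c and not self.by_count[c]:
                self.max_count -= 1
            return x

        def find_mode(self):
            # Any element in the highest non-empty bucket is a mode.
            if not self.max_count:
                return None
            return next(iter(self.by_count[self.max_count]))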
Deleting a node from the middle of a heap can be done in O(lg n), provided we can find the element in the heap in constant time. Suppose each node of the heap contains an id as a field. Now, given the id, how can we delete the node in O(lg n) time?
One solution could be to store in each node the address of a location where we maintain the node's index in the heap; this array would be ordered by node ids. This requires an additional array to be maintained, though. Is there any other good method to achieve the same?
PS: I came across this problem while implementing Dijkstra's Shortest Path algorithm.
The index (id, node) can be maintained separately in a hashtable which has O(1) lookup complexity (on average). The overall complexity then remains O(log n).
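As a sketch of that combination in Python (a min-heap of (key, id) pairs with a separate id -> index dict; the helper names are mine), delete-by-id replaces the victim with the last element and then repairs the heap in both directions:

    def _sift_up(h, pos, i):
        # Repair upwards, keeping the id -> index map in sync.
        while i > 0:
            parent = (i - 1) // 2
            if h[i] >= h[parent]:
                break
            h[i], h[parent] = h[parent], h[i]
            pos[h[i][1]], pos[h[parent][1]] = i, parent
            i = parent

    def _sift_down(h, pos, i):
        n = len(h)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and h[c] < h[smallest]:
                    smallest = c
            if smallest == i:
                return
            h[i], h[smallest] = h[smallest], h[i]
            pos[h[i][1]], pos[h[smallest][1]] = i, smallest
            i = smallest

    def delete_by_id(h, pos, node_id):
        # O(1) lookup via the hash table, then O(lg n) heap repair.
        i = pos.pop(node_id)
        last = h.pop()
        if i < len(h):
            h[i] = last
            pos[last[1]] = i
            _sift_down(h, pos, i)
            _sift_up(h, pos, i)

    h = [(3, 'a'), (5, 'b'), (9, 'c')]
    pos = {'a': 0, 'b': 1, 'c': 2}
    delete_by_id(h, pos, 'b')
    print(h, pos)   # [(3, 'a'), (9, 'c')] {'a': 0, 'c': 1}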
Each data structure is designed with certain operations in mind. From Wikipedia, about heap operations:
The operations commonly performed with a heap are:
create-heap: create an empty heap
find-max or find-min: find the maximum item of a max-heap or a minimum item of a min-heap, respectively
delete-max or delete-min: removing the root node of a max- or min-heap, respectively
increase-key or decrease-key: updating a key within a max- or min-heap, respectively
insert: adding a new key to the heap
merge: joining two heaps to form a valid new heap containing all the elements of both
This means a heap is not the best data structure for the operation you are looking for. I would advise you to look for a better-suited data structure (depending on your requirements).
I've had a similar problem and here's what I've come up with:
Solution 1: if your calls to delete some random item have a pointer to the item, you can store your individual data items outside of the heap; have the heap hold pointers to these items; and have each item contain its current heap array index.
Example: the heap contains pointers to items with keys [2 10 5 11 12 6]. The item holding value 10 has a field called ArrayIndex = 1 (counting from 0). So if I have a pointer to item 10 and want to delete it, I just look at its ArrayIndex and use that in the heap for a normal delete. O(1) to find heap location, then usual O(log n) to delete it via recursive heapify.
Solution 2: if you only have the key field of the item you want to delete, not its address, try this. Switch to a red-black tree, putting your payload data in the actual tree nodes. This is also O(log n) for insert and delete. It can additionally find an item with a given key in O(log n), which makes delete-by-key O(log n) as well.
Between these, solution 1 will require an overhead of constantly updating ArrayIndex fields with every swap. It also results in a kind of strange one-off data structure that the next code maintainer would need to study and understand. I think solution 2 would be about as fast, and has the advantage that it's a well-understood algo.
I have to implement a cache with normal cache operations along with the facility of fast retrieval of the maximum element from the cache.
Can you please suggest data structures to implement this?
I was thinking of using a hash map along with a list to maintain the maximum element.
Suggest other approaches with better complexity.
A heap is great for fast retrieval of the max element.
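One concrete way to combine the two, as a sketch (the lazy-deletion trick is my addition, since Python's heapq offers no decrease-key): back the cache with a dict for O(1) get/put and keep a max-heap of (value, key) entries, skipping stale entries when asked for the maximum.

    import heapq

    class MaxCache:
        # Dict for O(1) get/put plus a max-heap for fast get_max.
        def __init__(self):
            self.data = {}
            self.heap = []   # (-value, key); heapq is a min-heap, so negate

        def put(self, key, value):
            self.data[key] = value
            heapq.heappush(self.heap, (-value, key))

        def get(self, key):
            return self.data.get(key)

        def get_max(self):
            while self.heap:
                neg_value, key = self.heap[0]
                if self.data.get(key) == -neg_value:
                    return key, -neg_value       # entry is still current
                heapq.heappop(self.heap)         # stale entry, discard lazily
            return None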
There is a type of structure that I call exponential lookaside lists that is frequently used by OSes for keeping track of free chunks of memory. You start with some base size N (somewhere between 8 bytes and the page size of the OS) and then build an array (or stack) of lists:
[list N]
[list N*2]
[list N*4]
[list N*8]
...
And so on, up to some maximum. To maintain them, you just take the size of a new entry (S) and use LOG2(S/N) as your offset into the lists array to determine which list to add the new chunk to. When you need to release (or return) your largest chunk, you just scan from the highest-sized list down until you find the first non-empty list, then scan for the largest chunk in that list. A sketch of this bookkeeping follows.
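Here is a minimal Python sketch of that scheme, tracking chunk sizes only (the class and method names are illustrative; a real allocator would store the chunks themselves):

    import math

    class LookasideLists:
        # Bucket i holds chunks of size roughly base * 2**i.
        def __init__(self, base=8, levels=12):
            self.base = base
            self.lists = [[] for _ in range(levels)]

        def _index(self, size):
            # LOG2(S/N) chooses the bucket for a chunk of size S.
            return min(int(math.log2(max(size // self.base, 1))),
                       len(self.lists) - 1)

        def add(self, size):
            self.lists[self._index(size)].append(size)

        def take_largest(self):
            # Scan from the highest-sized list down to the first
            # non-empty one, then take its largest chunk.
            for bucket in reversed(self.lists):
                if bucket:
                    largest = max(bucket)
                    bucket.remove(largest)
                    return largest
            return None

    pool = LookasideLists()
    for s in (8, 64, 24, 4096):
        pool.add(s)
    print(pool.take_largest())   # 4096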