What I mean is that the implementation may only allocate a small (O(1) or O(log n)) amount of memory on its own; the bulk of the queue's data must live inside the hash table.
EDIT: the data structure should support the operations Push, Pop, Top, and Len, but under the hood, instead of a linked list or array, it will use a hash table. The O(n) majority of the memory will be contained in the hash table.
Any list-like data structure can be represented by a hash table, where every element in the list is mapped to its position. So this list: [a, b, c, d] can be represented by a hash table like this:
0: a
1: b
2: c
3: d
A queue is a FIFO data structure: first in, first out. Elements are popped in the same order they were pushed. It can be modeled with a list-like data structure where we push new elements by adding them to the tail and pop elements by taking them from the head.
the implementation can only allocate a small (O(1)/O(log n)) amount of memory independently
The only necessary data to handle independently from the hash table itself are the head and tail indexes.
So, using the [a, b, c, d] example, our head points to index 0 (which corresponds to a) and our tail to index 3 (which corresponds to d).
To push a new element to the queue (e.g. e), we insert it into our hash table with key tail + 1, that is 4, and we increment our tail by 1.
To pop an element, we get the element at the head position, remove it from the hash table and increment head by 1.
After this, our hash table ends up like this:
1: b
2: c
3: d
4: e
With this implementation, top and len are trivial to implement.
This basic idea can be extended to handle more complex hash tables.
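The head/tail scheme above can be sketched as follows; the class name HashQueue and method names are just illustrative, matching the (Push, Pop, Top, Len) interface from the question:

```python
class HashQueue:
    """Queue backed by a hash table; only head/tail indexes live outside it."""

    def __init__(self):
        self.table = {}   # the O(n) storage: index -> element
        self.head = 0     # index of the oldest element
        self.tail = -1    # index of the newest element; head > tail means empty

    def push(self, value):
        # insert at key tail + 1 and advance tail, as described above
        self.tail += 1
        self.table[self.tail] = value

    def pop(self):
        # take the element at head, remove it, and advance head
        if self.head > self.tail:
            raise IndexError("pop from empty queue")
        value = self.table.pop(self.head)
        self.head += 1
        return value

    def top(self):
        if self.head > self.tail:
            raise IndexError("top of empty queue")
        return self.table[self.head]

    def __len__(self):
        return self.tail - self.head + 1
```

With the [a, b, c, d] example, pushing e and popping once leaves the table holding keys 1 through 4, exactly as shown above.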
I came across this question after googling whether it's advisable to do that. From what I know, the goal of a queue is constant retrieval time and constant removal time, O(1). The best implementation would be a linked-list approach, or an array using the unshift and pop methods.
I'm looking for the data structure that stores an ordered list of E = (K, V) elements and supports the following operations in at most O(log(N)) time where N is the number of elements. Memory usage is not a problem.
E get(index) // get element by index
int find(K) // find the index of the element whose K matches
delete(index) // delete element at index, the following elements have their indexes decreased by 1
insert(index, E) // insert element at index, the following elements have their indexes increased by 1
I have considered the following incorrect solutions:
Use an array: find, delete, and insert still cost O(N)
Use an array + a map of K to index: delete and insert still cost O(N) for shifting elements and updating the map
Use a linked list + a map of K to element address: get and find still cost O(N)
In my imagination, the last solution is the closest, but instead of linked list, a self-balancing tree where each node stores the number of elements on the left of it will make it possible for us to do get in O(log(N)).
However I'm not sure if I'm correct, so I want to ask whether my imagination is correct and whether there is a name for this kind of data structure so I can look for off-the-shelf solution.
The closest data structure I could think of is the treap.
An implicit treap is a simple modification of the regular treap, which is a very powerful data structure. In fact, an implicit treap can be considered an array with the following procedures implemented (all in O(log N), online):
Inserting an element in the array in any location
Removal of an arbitrary element
Finding sum, minimum / maximum element etc. on an arbitrary interval
Addition, painting on an arbitrary interval
Reversing elements on an arbitrary interval
Using the modification with implicit keys allows you to do every operation except the second one (finding the index of the element whose K matches). I'll edit this answer if I come up with a better idea :)
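A minimal sketch of an implicit treap supporting get/insert/delete by index in expected O(log N); as noted, find(K) would still need an auxiliary map from K to the element. All names here are illustrative:

```python
import random

class Node:
    def __init__(self, value):
        self.value = value
        self.prio = random.random()  # random heap priority keeps the tree balanced
        self.size = 1                # number of nodes in this subtree
        self.left = None
        self.right = None

def size(t):
    return t.size if t else 0

def update(t):
    if t:
        t.size = 1 + size(t.left) + size(t.right)

def split(t, k):
    """Split t into (first k elements, the rest)."""
    if not t:
        return None, None
    if size(t.left) >= k:
        left, t.left = split(t.left, k)
        update(t)
        return left, t
    t.right, right = split(t.right, k - size(t.left) - 1)
    update(t)
    return t, right

def merge(a, b):
    """Concatenate treaps a and b, keeping the heap property on prio."""
    if not a or not b:
        return a or b
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        update(a)
        return a
    b.left = merge(a, b.left)
    update(b)
    return b

def insert(t, i, value):
    left, right = split(t, i)
    return merge(merge(left, Node(value)), right)

def delete(t, i):
    left, rest = split(t, i)
    _, right = split(rest, 1)
    return merge(left, right)

def get(t, i):
    # descend using subtree sizes, like indexing into an array
    while True:
        l = size(t.left)
        if i < l:
            t = t.left
        elif i == l:
            return t.value
        else:
            i -= l + 1
            t = t.right
```

Insert at any position and delete an arbitrary element both reduce to split and merge, which is what makes the implicit-key variant so flexible.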
I see similar questions to this have been asked before, but I've been searching for a while and can't seem to find an answer.
The assignment I have right now is to use the quicksort algorithm to sort a simple array of 7 letters.
We need to show each step of the sort, underlining the pivot each time.
Our instructor asked that we use the rightmost value as the pivot for each step.
Based on this video, https://www.youtube.com/watch?v=aQiWF4E8flQ , this is what I have so far (the pivot is the rightmost value at each step):
GACEFBD
A|GCEFBD
AC|GEFBD
ACB|EFGD
ACBD|FGE
But I'm unsure of where to go from here. On the left side of the partition, D is the pivot, but there are no values larger than D. So where does the pivot go?
Every tutorial I've seen uses the median of three as a pivot, or the leftmost, and I'm not the best at algorithms.
Part B has us showing every step of sorting ABCDEFG, with the same rules. Not sure where to begin there, since I have the same problem.
Sorry if this is a dumb question.
Consider what happens on each iteration.
Remember, quick sort works like this:
If the array is empty, return an empty array and exit
If the array has only one entry, return the entry and exit
Choose a pivot
Split the array into three sub-arrays:
Values smaller than the pivot
The pivot
Values larger than the pivot
For each non-empty array, apply quick sort again and concatenate the resulting arrays
(I know I'm using the word "array" vaguely... but I think the idea is clear)
I think you're missing a simple fact: once you choose the pivot, you put it in its right place and apply quicksort to the other sub-arrays; you don't apply quicksort to the pivot again.
Let's say you have a function called QuickSort(Array, Pivot), and let's assume you always take the right-most entry of the array as the pivot:
Start: QuickSort(GACEFBD , D)
1st. iteration: [QuickSort(ACB, B), D, QuickSort(GEF, F)]
As you can see, the right-most value can be a "good" pivot.
After the first iteration, D is already in its right place
2nd. iteration: [[QuickSort(A,A), B, QuickSort(C,C)], D, [QuickSort(E,E), F, QuickSort(G,G)]]
Result:
[A, B, C, D, E, F, G]
Punch line: Even if you take the right-most entry of the array, there may be cases where that entry is a "good" pivot value.
The real worst case would be applying quick sort on an already sorted array. But the same rules apply. Try to apply the above process to something like this: QuickSort(ABCDEFG, G)
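The three-way split described above can be sketched like this (not in-place, for clarity; a classroom trace like the one in the question would use in-place partitioning instead):

```python
def quicksort(a):
    """Sort a list by recursively splitting around the rightmost element."""
    if len(a) <= 1:
        return a                      # empty or single-element arrays are done
    pivot = a[-1]                     # rightmost value as pivot
    smaller = [x for x in a[:-1] if x < pivot]
    larger = [x for x in a[:-1] if x >= pivot]
    # the pivot lands in its final place; only the sides are sorted again
    return quicksort(smaller) + [pivot] + quicksort(larger)
```

Running it on the already sorted input shows the worst case: the "larger" side is empty at every level, so the recursion depth is linear.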
Looking for a data structure that logically represents a sequence of elements keyed by unique ids (for simplicity, let's consider them to be strings, or at least hashable objects). Each element can appear only once, there are no gaps, and the first position is 0.
The following operations should be supported (demonstrated with single-letter strings):
insert(id, position) - add the element keyed by id into the sequence at offset position. Naturally, the position of each element later in the sequence is now incremented by one. Example: [S E L F].insert(H, 1) -> [S H E L F]
remove(position) - remove the element at offset position. Decrements the position of each element later in the sequence by one. Example: [S H E L F].remove(2) -> [S H L F]
lookup(id) - find the position of element keyed by id. [S H L F].lookup(H) -> 1
The naïve implementation would be either a linked list or an array. Both would give O(n) lookup, remove, and insert.
In practice, lookup is likely to be used the most, with insert and remove happening frequently enough that it would be nice not to be linear (which a simple combination of hashmap + array/list would get you).
In a perfect world it would be O(1) lookup, O(log n) insert/remove, but I actually suspect that wouldn't work from a purely information-theoretic perspective (though I haven't tried it), so O(log n) lookup would still be nice.
A combination of a trie and a hash map allows O(log n) lookup/insert/remove.
Each node of the trie contains an id as well as a counter of valid elements rooted at this node, and up to two child pointers. A bit string, determined by the left (0) or right (1) turns while traversing the trie from its root to a given node, is the value stored in the hash map for the corresponding id.
The remove operation marks the trie node as invalid and updates the counters of valid elements on the path from the deleted node to the root. It also deletes the corresponding hash map entry.
The insert operation should use the position parameter and the counters of valid elements in each trie node to find the new node's predecessor and successor nodes. If the in-order traversal from predecessor to successor contains any deleted nodes, choose the one with the lowest rank and reuse it. Otherwise choose either the predecessor or the successor and add a new child node to it (a right child for the predecessor, a left child for the successor). Then update the counters of valid elements on the path from this node to the root and add the corresponding hash map entry.
The lookup operation gets the bit string from the hash map and uses it to go from the trie root to the corresponding node while summing all the counters of valid elements to the left of this path.
All this allows O(log n) expected time for each operation if the sequence of inserts/removes is random enough. If not, the worst-case complexity of each operation is O(n). To get it back to O(log n) amortized, watch the sparsity and balance factors of the tree: if there are too many deleted nodes, re-create a new perfectly balanced and dense tree; if the tree is too imbalanced, rebuild the most imbalanced subtree.
Instead of a hash map it is possible to use a binary search tree or any other dictionary data structure. And instead of the bit string used to identify a path in the trie, the hash map may store a pointer to the corresponding trie node.
Another alternative to the trie in this data structure is an indexable skiplist.
O(log N) time for each operation is acceptable, but not perfect. It is possible, as explained by Kevin, to use an algorithm with O(1) lookup complexity in exchange for larger complexity of other operations: O(sqrt(N)). But this can be improved.
If you choose some number of memory accesses (M) for each lookup operation, the other operations may be done in O(M·N^(1/M)) time. The idea of such an algorithm is presented in this answer to a related question. The trie structure described there allows easily converting a position to the array index and back. Each non-empty element of this array contains an id, and each element of the hash map maps this id back to the array index.
To make it possible to insert element to this data structure, each block of contiguous array elements should be interleaved with some empty space. When one of the blocks exhausts all available empty space, we should rebuild the smallest group of blocks, related to some element of the trie, that has more than 50% empty space. When total number of empty space is less than 50% or more than 75%, we should rebuild the whole structure.
This rebalancing scheme gives O(M·N^(1/M)) amortized complexity only for random and evenly distributed insertions/removals. The worst-case complexity (for example, if we always insert at the leftmost position) is much larger for M > 2. To guarantee O(M·N^(1/M)) worst case we need to reserve more memory and change the rebalancing scheme so that it maintains an invariant like this: keep the empty space reserved for the whole structure at least 50%, the empty space reserved for all data related to the top trie nodes at least 75%, for the next level of trie nodes 87.5%, etc.
With M=2, we have O(1) time for lookup and O(sqrt(N)) time for other operations.
With M=log(N), we have O(log(N)) time for every operation.
But in practice small values of M (like 2 .. 5) are preferable. This may be treated as O(1) lookup time and allows this structure (while performing typical insert/remove operation) to work with up to 5 relatively small contiguous blocks of memory in a cache-friendly way with good vectorization possibilities. Also this limits memory requirements if we require good worst case complexity.
You can achieve everything in O(sqrt(n)) time, but I'll warn you that it's going to take some work.
Start by having a look at a blog post I wrote on ThriftyList. ThriftyList is my implementation of the data structure described in Resizable Arrays in Optimal Time and Space along with some customizations to maintain O(sqrt(n)) circular sublists, each of size O(sqrt(n)). With circular sublists, one can achieve O(sqrt(n)) time insertion/removal by the standard insert/remove-then-shift in the containing sublist followed by a series of push/pop operations across the circular sublists themselves.
Now, to get the index at which a query value falls, you'll need to maintain a map from value to sublist/absolute-index. That is to say, a given value maps to the sublist containing the value, plus the absolute index at which the value falls (the index at which the item would fall were the list non-circular). From these data, you can compute the relative index of the value by taking the offset from the head of the circular sublist and summing with the number of elements which fall behind the containing sublist. To maintain this map requires O(sqrt(n)) operations per insert/delete.
Sounds roughly like Clojure's persistent vectors: they provide O(log₃₂ n) cost for lookup and update. For smallish values of n, O(log₃₂ n) is as good as constant...
Basically they are array mapped tries.
Not quite sure on the time complexity for remove and insert - but I'm pretty sure that you could get a variant of this data structure with O(log n) removes and inserts as well.
See this presentation/video: http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
Source code (Java): https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/PersistentVector.java
Merge sort divides the list into the smallest unit (1 element), then compares each element with the adjacent list to sort and merge the two adjacent lists. Finally all the elements are sorted and merged.
I want to implement the merge sort algorithm in such a way that it divides the list into a smallest unit of two elements and then sorts and merges them.
How can I implement that?
MERGE-SORT (A, p, r)
IF p < r                        // Check for base case
THEN q = FLOOR[(p + r)/2]       // Divide step
MERGE-SORT (A, p, q)            // Conquer step.
MERGE-SORT (A, q + 1, r)        // Conquer step.
MERGE (A, p, q, r)              // Combine step.
To stop the recursion at two-element units instead of single elements, change the check to something like p < r - 1 and sort each two-element subarray directly before merging.
I've done something like this before. Here are two variations.
Variation 1: Go through the list, sorting each pair. Then go through the list, merging each pair of pairs. Then each pair of 4s, and so on. When you've merged the whole list, you're done.
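Variation 1 can be sketched as a bottom-up merge sort that starts from sorted pairs, which is exactly what the question asks for (function names are illustrative):

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def bottom_up_merge_sort(a):
    # start with runs of size 2: each pair is sorted directly
    runs = [sorted(a[i:i + 2]) for i in range(0, len(a), 2)]
    # repeatedly merge adjacent runs until one run remains
    while len(runs) > 1:
        runs = [merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []
```

Each pass doubles the run length (2, 4, 8, ...), so there are O(log n) passes of O(n) work each.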
Variation 2: Keep a stack of sorted arrays. Each new element merges into the top array and then cascades, merging downward until only one array remains or the second array from the top is larger than the top one. After the last element has been added, collapse the stack by merging everything together.
The case where I used variation 2 was one where I had a very large amount of data streaming in. I kept the first few stacks of sorted arrays in memory, with later ones stored on disk. This led to good locality of reference and efficient use of disk. (You ask why I didn't use an off-the-shelf solution? Well, the dataset I had coming in was bigger than the disk I had to handle it with, there was custom merging logic involved, and the sort really wasn't that hard to write.)
There are two sets of URLs, each containing millions of URLs. How can I get the URLs from A that are not in B? What are the best methods?
Note: you can use any technique and any tools, like a database, MapReduce, hash codes, etc. We should consider memory efficiency and time efficiency, given that each set (A and B) has millions of URLs. We should try to find the specific URLs using less memory and less time.
A decent algorithm might be:
load all of set A into a hashmap, O(a)
traverse set B, and for each item, delete the identical value from set A (from the hashmap) if it exists, O(b)
Then your hashmap has the result. This would be O(a+b) where a is size of set A and b is size of set B. (In practice, this would be multiplied by the hash time, which ideally corresponds to approximately O(1) for a good hash.)
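The two steps above can be sketched with a hash set (a hash map where only the keys matter); the function name is illustrative:

```python
def urls_in_a_not_b(a_urls, b_urls):
    """Return the URLs from A that do not appear in B."""
    remaining = set(a_urls)      # load all of A into a hash set: O(a) expected
    for url in b_urls:           # one pass over B: O(b) expected
        remaining.discard(url)   # delete the identical value from A if present
    return remaining
```

Because b_urls is only iterated, B can be streamed from disk; only A needs to fit in memory.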
Something perhaps a little naive might be a procedure like
Sort list A
Sort list B
Navigate lists A and B together such that:
a. Increment the pointer into A and the pointer into B when the elements match
b. Increment the pointer into B until its element matches the next element in A, or until the next record b in B would appear after the next element in A (this rule discards elements of B that are not in A)
c. An element of A that is not in B has been found when, incrementing subject to these rules, the next element b in B does not match the next element a in A.
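The two-pointer walk over the sorted lists can be sketched like this (names are illustrative):

```python
def sorted_difference(a, b):
    """Given sorted lists a and b, return the elements of a not present in b."""
    out = []
    i = j = 0
    while i < len(a):
        if j >= len(b) or a[i] < b[j]:
            out.append(a[i])   # a[i] cannot appear later in b: it is A-only
            i += 1
        elif a[i] == b[j]:
            i += 1             # present in both: skip it
        else:
            j += 1             # b[j] < a[i]: discard this B-only element
    return out
```

After the two O(n log n) sorts, the scan itself is linear and needs no extra memory beyond the output, which is the appeal of this approach for very large sets.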
This might actually be an interesting place to apply Bloom filters: construct a Bloom filter for set B then for every URL in set A determine if it is in set B. With diminishingly small probability of error you should be able to find all URLs in A not in B.
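A minimal Bloom filter sketch, with the class name and sizing parameters purely illustrative. Note the asymmetry: a "no" answer from the filter is definite, so every URL it reports as absent from B is genuinely in A but not B, while false positives can only cause some A-only URLs to be missed:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter using sha256 with per-hash seeds (illustrative sizing)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # derive num_hashes bit positions from seeded sha256 digests
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

To apply it here: add every URL of B to the filter, then emit each URL of A for which might_contain returns False. The filter needs only a few bits per element of B, far less than a hash set of the URLs themselves.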
(sort -u A; cat B B) | sort | uniq -u
Deduplicated lines from A appear once, while every line of B appears at least twice; uniq -u keeps only the lines that occur exactly once, i.e. the URLs in A that are not in B.