I have read about memory allocation for applications and I understand that, more or less, the heap in memory for a given application is dynamically allocated at startup. However, there is also another concept called a min-heap, which is a data structure organized in the form of a tree where each parent node is smaller than or equal to its children.
So, my question is: What is the relationship between the heap that is allocated at startup for a given application and the min-heap data structure that includes functions commonly referred to as 'heapify' and so on? Is there any relationship, or is the min-heap data structure more of a higher-level programming concept? If there is no relationship, is there any reason they've been given the same name?
This may seem like a silly question for some, but it has actually stirred up a debate at work.
A heap is a data structure which is essentially a complete binary tree with some extra properties.
There are two types of heaps:
Min heap
Max heap
In a min heap the root has the lowest value in the tree, and when you pop the root off, the next-lowest element comes to the top. To convert a tree into a heap we use the heapify algorithm.
It is also known as a priority queue in C++, and as competitive programmers we usually use the STL so that we don't have to get into the hassle of creating a heap from scratch.
A max heap is just the opposite, with the largest value at the root.
Usually a heap is used because it has O(log N) time complexity for removing and inserting elements, and hence can work even with tight constraints like 10^6.
Now I can understand your confusion between the heap in memory and the heap data structure, but they are completely different things. The heap data structure is just a way to store data.
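For illustration only (this is my sketch, not the original answer's code), here is the min-heap behaviour described above using Java's java.util.PriorityQueue, which is a binary min-heap under the hood; the C++ STL counterpart is std::priority_queue:

```java
import java.util.PriorityQueue;

public class MinHeapDemo {
    public static void main(String[] args) {
        // PriorityQueue is a binary min-heap by default:
        // the smallest element is always at the root.
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        minHeap.add(5);
        minHeap.add(1);
        minHeap.add(3);

        // poll() removes the root; the next-lowest element moves to the top.
        System.out.println(minHeap.poll()); // 1
        System.out.println(minHeap.poll()); // 3
        System.out.println(minHeap.poll()); // 5

        // A max heap is the same structure with a reversed comparator:
        // new PriorityQueue<>(java.util.Collections.reverseOrder())
    }
}
```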
I want to know the best algorithm with which I can create a "sorted" list based on a key (ranging from 0 to 2^32) and traverse it in sorted order when needed on an embedded device. I am aware of these possible options, namely:
Sorted linked list
As the number of nodes in the linked list increases, searching for the right node for insertion/update operations takes more time: O(n)
Hash table
Might be the best choice as long as we do not have collisions with the hashing logic
Table of size 2^32
Wastage of space
Is there any other alternative better suited to an embedded device?
There are many design choices to be weighed.
Generalities
Since you're working on an embedded device, it's reasonable to assume that you have limited memory. In such a situation, you'll probably want to choose memory-compact data structures over performant data structures.
Linked lists tend to scatter their contents across memory in a way which can make accesses slow, though this will depend somewhat on your architecture.
The Options You've Proposed
Sorted linked-list. This structure is already slow to access (O(N)), slow to construct (O(N²)), and slow to traverse (because a linked-list scatters memory, which reduces your ability to pre-fetch).
Hash table: This is a fast structure (O(1) access, O(N) construction). There are two problems, though. If you use open addressing, the table must be no more than about 70% full or performance will degrade. That means you'll be wasting some memory. Alternatively, you can use linked list buckets, but this has performance implications for traversal. I have an answer here which shows order-of-magnitude differences in traversal performance between a linked-list bucket design and open addressing for a hash table. More problematically, hash tables work by "randomly" distributing data across memory space. Getting an in-order traversal out of that will require an additional data structure of some sort.
Table of size 2^32. There's significant wastage of space for this solution, but also poor performance since, I expect, most of the entries of this table will be empty, yet they must all be traversed.
An Alternative
Sort before use
If you do not need your list to always be sorted, I'd suggest adding new entries to an array and then sorting just prior to traversal. This gives you tight control over your memory layout, which is contiguous, so you'll get good memory performance. Insertion is quick: just throw your new data at the beginning or end of the array. Traversal, when it happens, will be fast because you just walk along the array. The only potentially slow bit is the sort.
You have several options for the sort. You'll want to keep in mind that your array is either mostly sorted (only a few insertions between traversals) or mostly unsorted (many insertions between traversals). In the mostly-sorted case, insertion sort is a good choice. In the mostly-unsorted case, [quicksort](https://en.wikipedia.org/wiki/Quicksort) is solid. Both have the benefit of being in-place, which reduces memory consumption. Timsort balances these strategies.
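As a rough sketch of the idea (my own, with invented names, not taken from the answer), append-then-sort might look like this in Java; Arrays.sort on a primitive array is in-place, and for the mostly-sorted case you could swap in an insertion sort:

```java
import java.util.Arrays;
import java.util.function.LongConsumer;

/** Sketch: append keys cheaply, sort only when a traversal is actually needed. */
class SortBeforeUse {
    private long[] keys = new long[16];
    private int size = 0;
    private boolean sorted = true;

    /** Amortized O(1) insertion: just append at the end of the array. */
    void add(long key) {
        if (size == keys.length) keys = Arrays.copyOf(keys, size * 2);
        keys[size++] = key;
        sorted = false;
    }

    /** Sort lazily, then walk the contiguous array in key order. */
    void traverseInOrder(LongConsumer visit) {
        if (!sorted) {
            Arrays.sort(keys, 0, size);   // in-place sort of the live prefix
            sorted = true;
        }
        for (int i = 0; i < size; i++) visit.accept(keys[i]);
    }
}
```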
One of the biggest advantages of using heaps for priority queues (as opposed to, say, red-black trees) seems to be space-efficiency: unlike balanced BSTs, heaps only require O(1) auxiliary storage.
i.e., the ordering of the elements alone is sufficient to satisfy the invariants and guarantees of a heap, and no child/parent pointers are necessary.
However, my question is: is the above really true?
It seems to me that, in order to use an array-based heap while satisfying the O(log n) running time guarantee, the array must be dynamically expandable.
The only way we can dynamically expand an array in O(1) time is to over-allocate the memory that backs the array, so that we can amortize the memory operations into the future.
But then, doesn't that overallocation imply a heap also requires auxiliary storage?
If the above is true, that seems to imply heaps have no complexity advantages over balanced BSTs whatsoever, so then, what makes heaps "interesting" from a theoretical point of view?
You appear to confuse binary heaps, heaps in general, and implementations of binary heaps that use an array instead of an explicit tree structure.
A binary heap is simply a binary tree with properties that make it theoretically interesting beyond memory use. They can be built in linear time, whereas building a BST necessarily takes n log n time. For example, this can be used to select the k smallest/largest values of a sequence in better-than-n log n time.
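For example (my sketch, not the answer's code), the standard bottom-up construction builds a binary min-heap over an array in O(n), after which the k smallest values can be popped off in O(k log n):

```java
import java.util.Arrays;

class KSmallest {
    /** Returns the k smallest values of a[] (k <= a.length) in O(n + k log n) time. */
    static int[] kSmallest(int[] a, int k) {
        int[] heap = Arrays.copyOf(a, a.length);
        // Bottom-up heap construction: O(n) overall.
        for (int i = heap.length / 2 - 1; i >= 0; i--) siftDown(heap, i, heap.length);

        int[] result = new int[k];
        int n = heap.length;
        for (int j = 0; j < k; j++) {
            result[j] = heap[0];      // the root is the current minimum
            heap[0] = heap[--n];      // move the last element to the root
            siftDown(heap, 0, n);     // restore the heap property in O(log n)
        }
        return result;
    }

    private static void siftDown(int[] h, int i, int n) {
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < n && h[left] < h[smallest]) smallest = left;
            if (right < n && h[right] < h[smallest]) smallest = right;
            if (smallest == i) return;
            int tmp = h[i]; h[i] = h[smallest]; h[smallest] = tmp;
            i = smallest;
        }
    }
}
```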
Implementing a binary heap as an array yields an implicit data structure. This is a topic formalized by theorists, but not actively pursued by most of them. But in any case, the classification is justified: To dynamically expand this structure one indeed needs to over-allocate, but not every data structure has to grow dynamically, so the case of a statically-sized pre-allocated data structure is interesting as well.
Furthermore, there is a difference between space needed for faster growth and space needed because each element is larger than it has to be. The first can be avoided by not growing, and also reduced to arbitrarily small constant factor of the total size, at the cost of a greater constant factor on running time. The latter kind of space overhead is usually unavoidable and can't be reduced much (the pointers in a tree are at least log n bits, period).
Finally, there are many heaps other than binary heaps (Fibonacci, binomial, leftist, pairing, ...), and almost all of them offer better bounds for at least some operations. The most common ones are decrease-key (alter the value of a key already in the structure in a certain way) and merge (combine two heaps into one). The complexity of these operations is important for the analysis of several algorithms using priority queues, hence the motivation for a lot of research into heaps.
In practice the memory use is important. But with or without over-allocation, the difference is only a constant factor overall, so theorists are not terribly interested in binary heaps. They'd rather get better complexity for decrease-key and merge; most of them are happy if the data structure takes O(n) space. The extremely high memory density, ease of implementation and cache friendliness are far more interesting for practitioners, and it's practitioners who sing the praises of binary heaps far and wide.
I'm looking for heap structures for priority queues that can be implemented using Object[] arrays instead of Node objects.
Binary heaps of course work well, and so do n-ary heaps. Java's java.util.PriorityQueue is a binary heap, using an Object[] array as storage.
There are plenty of other heaps, such as Fibonacci heaps for example, but as far as I can tell, these need to be implemented using nodes. And from my benchmarks I have the impression that the overhead of managing all these node objects may well eat up all the benefits gained. I find it very hard to implement a heap that can compete with the simple array-backed binary heap.
So I'm currently looking for advanced heap / priority queue structures that also do not have the overhead of using Node objects. Because I want it to be fast in reality, not just better in complexity theory... and there is much more happening in reality: for example the CPU has L1, L2 and L3 caches etc. that do affect performance.
My question also focuses on Java for the very reason that I have little influence on memory management here, and there are no structs as in C. A lot of heaps that work well when implemented in C become costly in Java because of the memory management overhead and garbage collection costs.
Several uncommon heap structures can be implemented this way. For example, the Leonardo heap used in Dijkstra's smoothsort algorithm is typically implemented as an array augmented with two machine words. Like a binary heap, it is an array representation of a specific tree structure (or rather a forest, since it's a collection of trees).
The poplar heap was introduced as a structure that is slightly less efficient in theory but more efficient in practice, with the same basic idea: the data is represented as an array plus some small extra state information, and is a compressed representation of a forest structure.
A more canonical array-based heap is a d-ary heap, which is a natural generalization of the binary heap to have d children instead of two.
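As a small illustration (assuming 0-based indexing; this is not quoted from anywhere), the entire tree structure of a d-ary heap is implicit in index arithmetic over the backing array:

```java
/** Index arithmetic for a d-ary heap stored in a flat array (0-based). */
class DaryHeapIndexing {
    final int d;   // number of children per node, d >= 2

    DaryHeapIndexing(int d) { this.d = d; }

    int parent(int i)       { return (i - 1) / d; }     // parent of node i
    int child(int i, int k) { return d * i + k + 1; }   // k-th child of node i, 0 <= k < d
}
```

With d = 2 these collapse to the familiar binary-heap formulas: parent (i - 1) / 2 and children 2i + 1 and 2i + 2.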
As a very silly example, a sorted array can be used as a priority queue, though it's very inefficient.
Hope this helps!
The definition of heap given in wikipedia (http://en.wikipedia.org/wiki/Heap_(data_structure)) is
In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: If A is a parent node of B then key(A) is ordered with respect to key(B) with the same ordering applying across the heap. Either the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node (this kind of heap is called max heap) or the keys of parent nodes are less than or equal to those of the children (min heap)
The definition says nothing about the tree being complete. For example, according to this definition, the binary tree 5 => 4 => 3 => 2 => 1 where the root element is 5 and all the descendants are right children also satisfies the heap property. I want to know the precise definition of the heap data structure.
As others have said in comments: That is the definition of a heap, and your example tree is a heap, albeit a degenerate/unbalanced one. The tree being complete, or at least reasonably balanced, is useful for more efficient operations on the tree. But an inefficient heap is still a heap, just like an unbalanced binary search tree is still a binary search tree.
Note that "heap" does not refer to a data structure, it refers to any data structure fulfilling the heap property or (depending on context) a certain set of operations. Among the data structures which are heaps, most efficient ones explicitly or implicitly guarantee the tree to be complete or somewhat balanced. For example, a binary heap is by definition a complete binary tree.
In any case, why do you care? If you care about specific lower or upper bounds on specific operations, state those instead of requiring a heap. If you discuss specific data structure which are heaps and complete trees, state that instead of just speaking about heaps (assuming, of course, that the completeness matters).
Since this question was asked, the Wikipedia definition has been updated:
In computer science, a heap is a specialized tree-based data structure which is essentially an almost complete tree that satisfies the heap property: in a max heap, for any given node C, if P is a parent node of C, then the key (the value) of P is greater than or equal to the key of C. In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap (with no parents) is called the root node.
However, "heap data structure" really denotes a family of different data structures, which also includes:
Binomial heap
Fibonacci heap
Leftist heap
Skew heap
Pairing heap
2-3 heap
...and these are certainly not necessarily complete trees.
On the other hand, the d-ary heap data structures -- including the binary heap -- most often refer to complete trees, such that they can be implemented in an array in level-order, without gaps:
The d-ary heap consists of an array of n items, each of which has a priority associated with it. These items may be viewed as the nodes in a complete d-ary tree, listed in breadth first traversal order.
The classic O(1) random access data structure is the array. But an array relies on the programming language supporting guaranteed contiguous memory allocation (since the array relies on being able to take a simple offset from the base to find any element).
This means that the language must have semantics regarding whether or not memory is contiguous, rather than leaving this as an implementation detail. Thus it could be desirable to have a data structure that has O(1) random access, yet doesn't rely on contiguous storage.
Is there such a thing?
How about a trie where the length of keys is limited to some constant K (for example, 4 bytes, so you can use 32-bit integers as indices)? Then lookup time will be O(K), i.e. O(1), with non-contiguous memory. Seems reasonable to me.
Recalling our complexity classes, don't forget that every big-O has a constant factor, i.e. O(n) + C. This approach will certainly have a much larger C than a real array.
EDIT: Actually, now that I think about it, it's O(K*A) where A is the size of the "alphabet". Each node has to have a list of up to A child nodes, which will have to be a linked list to keep the implementation non-contiguous. But A is still constant, so it's still O(1).
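To make that concrete, here is a rough Java sketch (the class name and the lazily allocated 256-way node arrays are my own choices; the answer suggests linked-list children instead) of a fixed-depth trie over 32-bit keys, one level per byte, so every lookup takes exactly K = 4 steps:

```java
/** Illustrative fixed-depth trie: 32-bit keys, one level per byte (always 4 steps per lookup). */
class FixedDepthTrie<V> {
    private final Object[] root = new Object[256];

    @SuppressWarnings("unchecked")
    V get(int key) {
        Object[] node = root;
        for (int shift = 24; shift > 0; shift -= 8) {   // walk three internal levels
            node = (Object[]) node[(key >>> shift) & 0xFF];
            if (node == null) return null;
        }
        return (V) node[key & 0xFF];                    // the last level stores the values
    }

    void put(int key, V value) {
        Object[] node = root;
        for (int shift = 24; shift > 0; shift -= 8) {
            int idx = (key >>> shift) & 0xFF;
            if (node[idx] == null) node[idx] = new Object[256];
            node = (Object[]) node[idx];
        }
        node[key & 0xFF] = value;
    }
}
```

Each node is a small fixed-size array, so no single large contiguous block is ever required, yet every lookup touches a constant number of nodes.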
In practice, for small datasets using contiguous storage is not a problem, and for large datasets O(log(n)) is just as good as O(1); the constant factor is rather more important.
In fact, for really large datasets, O(root3(n)) random access is the best you can get in a 3-dimensional physical universe.
Edit:
Assuming base-10 logarithms and that the O(log(n)) algorithm is twice as fast as the O(1) one at a million elements, it will take a trillion elements for them to break even, and a quintillion for the O(1) algorithm to become twice as fast - rather more than even the biggest databases on earth have.
All current and foreseeable storage technologies require a certain physical space (let's call it v) to store each element of data. In a 3-dimensional universe, this means for n elements there is a minimum distance of root3(n*v*3/4/pi) between at least some of the elements and the place that does the lookup, because that's the radius of a sphere of volume n*v. And then, the speed of light gives a physical lower boundary of root3(n*v*3/4/pi)/c for the access time to those elements - and that's O(root3(n)), no matter what fancy algorithm you use.
Apart from a hashtable, you can have a two-level array-of-arrays (see the sketch after this list):
Store the first 10,000 elements in the first sub-array
Store the next 10,000 elements in the next sub-array
etc.
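A minimal sketch of that layout (block size and names are just illustrative):

```java
/** Two-level array-of-arrays: index i lives in block i / BLOCK at offset i % BLOCK. */
class TwoLevelArray {
    private static final int BLOCK = 10_000;
    private final long[][] blocks;

    TwoLevelArray(int capacity) {
        blocks = new long[(capacity + BLOCK - 1) / BLOCK][];
        for (int b = 0; b < blocks.length; b++) blocks[b] = new long[BLOCK];
    }

    long get(int i)         { return blocks[i / BLOCK][i % BLOCK]; }   // two O(1) index calculations
    void set(int i, long v) { blocks[i / BLOCK][i % BLOCK] = v; }
}
```

No single allocation needs to be larger than one block (plus the small top-level array of block references).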
Thus it could be desirable to have a data structure that has O(1) random access, yet doesn't rely on contiguous storage.
Is there such a thing?
No, there is not. Sketch of proof:
If you have a limit on your continuous block size, then obviously you'll have to use indirection to get to your data items. Fixed depth of indirection with a limited block size gets you only a fixed-sized graph (although its size grows exponentially with the depth), so as your data set grows, the indirection depth will grow (only logarithmically, but not O(1)).
Hashtable?
Edit:
An array is O(1) lookup because a[i] is just syntactic sugar for *(a+i). In other words, to get O(1), you need either a direct pointer or an easily-calculated pointer to every element (along with some assurance that the memory you're about to look up belongs to your program). In the absence of a pointer to every element, you're not likely to have an easily-calculated pointer (and know the memory is reserved for you) without contiguous memory.
Of course, it's plausible (if terrible) to have a Hashtable implementation where each lookup's memory address is simply *(a + hash(i)), not backed by an array, i.e. with entries dynamically created at the computed memory location, if you have that sort of control. The point is that the most efficient implementation is going to be an underlying array, but it's certainly plausible to take hits elsewhere to do a WTF implementation that still gets you constant-time lookup.
Edit2:
My point is that an array relies on contiguous memory because it's syntactic sugar, but a Hashtable chooses an array because it's the best implementation method, not because it's required. Of course I must be reading the DailyWTF too much, since I'm imagining overloading C++'s array-index operator to also do it without contiguous memory in the same fashion.
Aside from the obvious nested structures to finite depth noted by others, I'm not aware of a data structure with the properties you describe. I share others' opinions that with a well-designed logarithmic data structure, you can have non-contiguous memory with fast access times to any data that will fit in main memory.
I am aware of an interesting and closely related data structure:
Cedar ropes are immutable strings that provide logarithmic rather than constant-time access, but they do provide a constant-time concatenation operation and efficient insertion of characters. The paper is copyrighted but there is a Wikipedia explanation.
This data structure is efficient enough that you can represent the entire contents of a large file using it, and the implementation is clever enough to keep bits on disk unless you need them.
Surely what you're talking about here is not contiguous memory storage as such, but more the ability to index a containing data structure. It is common to internally implement a dynamic array or list as an array of pointers with the actual content of each element elsewhere in memory. There are a number of reasons for doing this - not least that it enables each entry to be a different size. As others have pointed out, most hashtable implementations rely on indexing too. I can't think of a way to implement an O(1) algorithm that doesn't rely on indexing, but this implies contiguous memory for the index at least.
Distributed hash maps have such a property. Well, actually, not quite: basically a hash function tells you which storage bucket to look in, and in there you'll probably need to rely on a traditional hash map. It doesn't completely cover your requirements, as the list containing the storage areas / nodes (in a distributed scenario) is, again, usually a hash map (essentially making it a hash table of hash tables), although you could use some other algorithm, e.g. if the number of storage areas is known.
EDIT:
Forgot a little tidbit: you'd probably want to use different hash functions for the different levels, otherwise you'll end up with a lot of similar hash values within each storage area.
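Sketching the two-level idea in a single process (a real distributed hash map adds networking, replication and so on; the secondary hash mix below is an arbitrary choice of mine):

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative two-level hash map: an outer hash picks a "storage area",
 *  and an ordinary HashMap handles lookups inside that area. */
class TwoLevelHashMap<K, V> {
    private final Map<K, V>[] areas;

    @SuppressWarnings("unchecked")
    TwoLevelHashMap(int numAreas) {
        areas = new HashMap[numAreas];
        for (int i = 0; i < numAreas; i++) areas[i] = new HashMap<>();
    }

    /** Outer level: a different mix of the key's hash than HashMap uses internally. */
    private Map<K, V> area(K key) {
        int h = key.hashCode() * 0x9E3779B9;            // cheap secondary mix
        return areas[Math.floorMod(h, areas.length)];
    }

    V get(K key)         { return area(key).get(key); }
    void put(K key, V v) { area(key).put(key, v); }
}
```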
A bit of a curiosity: the hash trie saves space by interleaving in memory the key-arrays of trie nodes that happen not to collide. That is, if node 1 has keys A,B,D while node 2 has keys C,X,Y,Z, for example, then you can use the same contiguous storage for both nodes at once. It's generalized to different offsets and an arbitrary number of nodes; Knuth used this in his most-common-words program in Literate Programming.
So this gives O(1) access to the keys of any given node, without reserving contiguous storage to it, albeit using contiguous storage for all nodes collectively.
It's possible to allocate a memory block not for the whole data, but only for a reference array to pieces of data. This brings a dramatic decrease in the length of necessary contiguous memory.
Another option: if the elements can be identified with keys and these keys can be uniquely mapped to the available memory locations, it is possible not to place all the objects contiguously, leaving spaces between them. This requires control over the memory allocation so you can still distribute free memory and relocate 2nd-priority objects somewhere else when you have to use that memory location for your 1st-priority object. They would still be contiguous in a sub-dimension, though.
Can I name a common data structure which answers your question? No.
Some pseudo-O(1) answers:
A VList is O(1) access (on average), and doesn't require that the whole of the data is contiguous, though it does require contiguous storage in small blocks. Other data structures based on numerical representations are also amortized O(1).
A numerical representation applies the same 'cheat' that a radix sort does, yielding an O(k) access structure - if there is an upper bound on the index, such as it being a 64-bit int, then a binary tree where each level corresponds to a bit in the index takes constant time. Of course, that constant k is greater than ln N for any N which can be used with the structure, so it's not likely to be a performance improvement (radix sort can get performance improvements if k is only a little greater than ln N and the implementation of the radix sort better exploits the platform).
If you use the same representation of a binary tree that is common in heap implementations, you end up back at an array.