Heap Type Implementation - data-structures

I was implementing a heap sort and I start wondering about the different implementations of heaps. When you don need to access the elements by index(like in a heap sort) what are the pros and cons of implementing a heap with an array or doing it like any other linked data structure.
I think it's important to take into account the memory wasted by the nodes and pointers vs the memory wasted by empty spaces in an array, as well as the time it takes to add or remove elements when you have to resize the array.
When I should use each one and why?

As far as space is concerned, there's very little issue with using arrays if you know how much is going into the heap ahead of time -- your values in the heap can always be pointers to the larger structures. This may afford for better cache localization on the heap itself, but you're still going to have to go out someplace to memory for extra data. Ideally, if your comparison is based on a small morsel of data (often just a 4 byte float or integer) you can store that as the key with a pointer to the full data and achieve good cache coherency.
Heap sorts are already not particularly good on cache hits throughout traversing the heap structure itself, however. For small heaps that fit entirely in L1/L2 cache, it's not really so bad. However, as you start hitting main memory performance will dive bomb. Usually this isn't an issue, but if it is, merge sort is your savior.
The larger problem comes in when you want a heap of undetermined size. However, this still isn't so bad, even with arrays. Anymore, in non-embedded environments with nice, pretty memory systems growing an array with some calls (e.g. realloc, please forgive my C background) really isn't all that slow because the data may not need to physically move in memory -- just some address pointer magic for most of it. Added to the fact that if you use a array-size-doubling strategy (array is too small, double the size in a realloc call) you're still ending up with an O(n) amortized cost with relatively few reallocs and at most double wasted space -- but hey, you'd get that with linked lists anyways if you're using a 32-bit key and 32-bit pointer.
So, in short, I'd stick with arrays for the smaller base data structures. When the heap goes away, so do the pointers I don't need anymore with a single deallocation. However, it's easier to read pointer-based code for heaps in my opinion since dealing with the indexing magic isn't quite as straightforward. If performance and memory aren't a concern, I'd recommend that to anyone in a heartbeat.

Related

List Cache Behavior

OCaml From the Ground Up states that ...
At the machine level, a linked list is a pair of a head value and a pointer to the tail.
I have heard that linked lists (in imperative languages) tend to be slow because of cache misses, memory overhead and pointer chasing. I am curious if OCaml's garbage collector or memory management system avoids any of these issues, and if they do what sort of techniques or optimizations they employ internally that might be different from linked lists in other languages.
OCaml manages its own memory, it calls system-level memory allocation and deallocation primitives in its own terms (e.g. it can allocate a big chunk of heap memory during the start of the program, and manage OCaml values on it), so if the compiler and/or the runtime knows that you are allocating a list of a fixed sized, it can arrange for the cells to be close to each other in memory. And since there is no pointer type in the language itself, it can move values around during garbage collection, to avoid memory fragmentation, something a language like C or C++ cannot do (or with great effort to maintain the abstraction while allowing moves).
Those are general pointers about how garbage collected languages (imperative or not) may optimize memory management, but Understanding the Garbage Collector has more details about how the garbage collector actually works in OCaml.
A linked list is indeed a horrible structure to iterate over in general.
But this is mitigated a lot by the way OCaml allocates memory and how lists are created most of the time.
In OCaml the GC allocates a large block of memory as it's (minor) heap and maintains a pointer to the end of the used portion. An allocation simply increases the pointer by the needed amount of memory.
Combine that with the fact that most of the time lists are constructed in a very short time. Often the list creation is the only thing allocating memory. Think of List.map for example, or List.rev. That will produce a list where the nodes of the list are contiguous in memory. So the linked list isn't jumping all over the address space but is rather contained on a small chunk. This allows caching to work far better than you would expect for a linked list. Iterating the list will actually access memory sequentially.
The above means that a lot of lists are much more ordered than in other languages. And a lot of the time lists are temporary and will be purely in cache. It performs a lot better than it ought to.
There is another layer to this. In OCaml the garbage collector is a generational GC. New values are created on the minor heap which is scanned frequently. Temporary values are thus quickly reclaimed. Values that remain alive on the minor heap are copied to the major heap, which is scanned less frequent. The copy operation compacts the values, eliminating any holes caused by values that are no longer alive. This will bring list nodes closer together again if it had gaps in it in the first place. The same thing happens when the major heap is scanned, the memory is compacted bringing values that where allocated close in time nearer together.
While none of that guarantees that lists will be contiguous in memory it seems to avoid a lot of the bad effects associated with linked lists in other languages. None the less you shouldn't use lists when you need to iterate over data, or worse access the n-th node, frequently. Use an array instead. Appending is bad too unless your list is small (and will overflow the stack for large lists). Due to the later you often build a list in reverse, adding items to the front instead of appending at the end, and then reverse the list as final step. And that final List.rev will give you a perfectly contiguous list.

How can I better understand the impact of modern caching on algorithm performance?

I'm reading the following paper: http://www-db.in.tum.de/~leis/papers/ART.pdf and in it, they say in the abstract:
Main memory capacities have grown up to a point where most databases
fit into RAM. For main-memory database systems, index structure
performance is a critical bottleneck. Traditional in-memory data
structures like balanced binary search trees are not efficient on
modern hardware, because they do not optimally utilize on-CPU caches.
Hash tables, also often used for main-memory indexes, are fast but
only support point queries.
How can I better understand this utilization of on-CPU caches and how it impacts the performance of particular data structures/algorithms?
Just somewhere to get started would be great because this sort of analysis is really opaque to me and I don't know where to go to start understanding.
This is going to be a really basic answer, as it would otherwise be extremely broad. I'm also not an expert on the subject (picking up bits and pieces to help understand how to optimize my hotspots better). But it might help you get started investigating this subject.
The topic reminds me of my university days when computer architecture
courses only taught about registers, DRAM, and disk, while glossing
over the CPU cache in between. The CPU cache is one of the most
dominant factors these days in performance.
The memory of the computer is divided into a hierarchy ranging from the absolute biggest but slowest (disk) to absolute smallest but fastest (registers).
Below disk is DRAM which is still pretty slow. And above registers is the CPU cache which is pretty damned fast (especially the smallest L1 cache).
Accessing One Node
Now let's say you request to access memory in some form from some data structure, say a linked structure like a tree or linked list and we're just accessing one node.
Note, I'm inverting the view of memory access for simplicity. Typically it begins with an instruction to load something into a register with the process working backwards and forwards, rather than merely forwards.
Virtual to Physical (DRAM)
In this case, unless the memory is already mapped to physical memory, the operating system has to map a page from virtual memory to a physical address in DRAM (this is freaking slow, especially in the worst-case scenario where the page fault involves a disk access). This is often done in pretty hefty chunks (the machine grabs memory by the handful), like aligned 4-kilobyte chunks. So we end up grabbing a big old 4-kilobyte aligned chunk of memory just for this one node.
DRAM to CPU Cache
Now that this 4-kilobyte page is physically mapped, we still want to do something with the node (most instructions have to operate at the register level) so the computer moves it down through the CPU cache hierarchy (this is pretty slow). Typically all levels of CPU cache have the same cache-line size, like 64-byte cache lines on Intel.
To move the memory from DRAM into these CPU caches, we have to grab a chunk of cache-line-sized-and-aligned memory from DRAM and move it into the CPU cache. We might also have to evict some data already in various levels of the CPU cache hierarchy on the way, like the least recently used memory. So now we're grabbing a 64-byte aligned handful of memory for this node.
Maybe at this point, the cache line memory might look like this. Let's say the relevant node data is 42, while the stuff in ??? is irrelevant memory surrounding it that's not part of our linked data structure.
CPU Cache to Register
Now we move the memory from CPU cache into a register (this occurs very quickly). And here we're still grabbing memory in sort of a handful, but a pretty small one. For example, we might grab a 64-bit aligned chunk of memory and move it into a general-purpose register. So we grab the memory around "42" here and move it into a register.
Finally we do some operations on the register and store the results, and the results often kind of work their way back up the memory hierarchy.
Accessing One Other Node
When we access the next node in the linked structure, we end up having to potentially do this all over again, just to read one little node's data. The contents of the cache line might look like this (with 22 being the node data of interest).
We can see potentially how much wasted effort the hardware and operating system are applying, moving big, aligned chunks of data from slower memory to faster memory only in order to access one little teeny bit of it prior to eviction.
And that's why little objects all allocated separately, as in the case of linked nodes or languages which can't represent user-defined types contiguously, aren't very cache or page-friendly. They tend to invoke a lot of page faults and cache misses as we traverse them, accessing their data. That is, unless they have help from a memory allocator which allocates these nodes in a more contiguous fashion (in which case the data or two or more nodes might be right next to each other and accessed together).
Contiguity and Spatial Locality
The most cache-friendly data structures tend to be based on contiguous arrays (it doesn't have to be one gigantic array, but perhaps arrays linked together, e.g., as is the case of an unrolled list). When we iterate through an array and access the first element, we might have to do the motions described above yet we might be able to get this once the memory is moved into a cache line:
Now we can iterate through the array and access all the elements while it's in the second-fastest form of memory on the machine, the L1 cache, simply moving data from L1 cache to register after the initial compulsory cache miss/page fault. If we start at 17, we have the initial compulsory cache miss but all the subsequent elements in this cache line can then be accessed without repeating the motions above. This is extremely fast, and the computer can blaze through such data.
So that was what was meant by this part:
Traditional in-memory data structures like balanced binary search
trees are not efficient on modern hardware, because they do not
optimally utilize on-CPU caches.
Note that it is possible to make linked structures like trees and linked lists substantially more cache-friendly than they would naturally be using a custom memory allocator, but they lack this inherent cache-friendliness at the basic data structure level.
Hash tables, on the other hand, tend to be contiguous table structures based on arrays. They might use chaining and linked bucket structures, but those are also easier to make cache-efficient with a little teeny bit of help from the custom allocator (far less than the tree due to the simpler, sequential access patterns within a hash bucket).
So anyway, that's a little brief overview on the subject, a bit oversimplified, but hopefully enough to help get started. If you want to understand this subject at a deeper level, keywords would be cache/memory efficiency/optimization and locality of reference.

Hashtable for a system with memory constraints

I have read about the variants of hashtable but it is not clear to me which one is more appropriate for a system that is low on memory (we have a memory constraint limit).
Linear/Quadratic probing works well for sparse tables.
I think Double hashing is the same as Quadratic in this aspect.
External chaining does not have issue with clustering.
Most textbooks I have checked seem to assume that a extra space will always be available but practically in most example implementations I have seen since the hashtable is never halved take much more space than really needed.
So which variant of a hashtable is most efficient when we want to make the best usage of memory?
Update:
So my question is not only about the size of the buckets. My understanding is that both the size of the buckets and the performance under load is what matters. Because if the size of the bucket is small but the table degrades on 50% load then this means we need to resize to a larger table often.
See this variant of Cukoo Hashing.
This will require from you more hash functions, but, it makes sense - you need to pay something for the memory savings.

What is the main performance gain from garbage collection?

The llvm documentation says:
In practice, however, the locality and performance benefits of using aggressive garbage collection techniques dominates any low-level losses.
So what is it, exactly, that causes the performance gain when using garbage collection as opposed to manually managing memory? (besides the obvious decrease in code writing time) Is the benefit solely that performing heap compaction increases spatial locality and cache utilization? Or is there something else that helps more, like deleting everything at once?
On modern processors the memory caches are King. Suffering a cache miss can stall the processor for hundreds of cpu cycles, waiting for the slow bus to supply the data.
Making the caches effective requires locality of reference. In other words, if the next memory access is close to the previous one then the odds that the data is already in the cache are high.
A garbage collector can help a lot to make that work out well. The big win is not the collection, it is its ability to rebuild the object graph and reorganize the data structure while doing so. Compacting.
Imagine the typical data structure, an array of pointers to objects. Which is slowly being built up while, say, reading a bunch of strings from a file and turning them into field values of an object. Allocated objects will be scatter-shot in the address space doing so. Long lived objects pointed-to by the array separated by the worker objects, like strings. Iterating that array later is going to be pretty slow.
Until the garbage collector runs and rebuilds the data structure. Putting all of the pointed-to objects in order.
Now iterating the collection is very fast, since accessing element N makes it very likely that element N+1 is readily available. If not in the L1 cache then very good odds for L2 or L3 (if you have it).
Very big win, it is the one feature that made garbage collection competitive with explicit memory management. With the explicit kind having the problem of not supporting moving objects because it will invalidate a pointer.
I can only speak for the Oracle (ex-Sun) and IBM JVMs; their efficiency relies on the fact that newly-created objects are unlikely to live very long. So segregating them into their own area allows that area to be frequently compacted, since with few survivors that's a cheap operation. Frequent compaction means that free space can be kept contiguous, so object creation is also cheap because there's no free chain to traverse and no memory fragmentation.
Manual memory management schemes are rarely this efficient because this is a relatively complex way of doing things that is unlikely to be reinvented for each application. These garbage collectors have evolved and been optimised over a longer period and with more effort than individual applications ever receive. It would be surprising and disappointing if they weren't much more performant.
I doubt locality helps performance at all - admittedly small objects tend to be created at the same time in the same area of the heap (but this applies to C as well), over time, these small objects that remain will be compacted into a closely related area of the heap and it is supposedly this that give you an advantage over C-style allocations. However, show me a program that uses just these small objects and I'll show you a program that does sod all. Show me a program that passes all objects that are to be used on the stack and I'll show you one that screams with speed.
The de-allocation of memory is a performance benefit, short-term as they do not need to be de-allocated. However, when the garbage collector does kick in, this benefit disappears. Usually though, the collection occurs when nothing else is happening in the system (theoretically) so the cost is effectively nullified.
Compaction of the heap also helps allocation, all allocations can come from the beginning of the heap, and the memory manager doesn't have to walk the heap looking for the next free space block of the right size. However, traditional systems can gain the same amount of speed by using multiple fixed-block heaps (which mean you always allocate from a heap for the size of block you want, and you always allocate a fixed block, so walking the heap is just to find the first free block, and this can be removed using a bitmap)
So all in all, there isn't much of a benefit at all, except in benchmarks of course. In my experience the GC can and will jump in and slow you down dramatically at just the wrong time, usually when the system memory is getting filled because the user has done something like load a new page that required a lot of memory allocations.... which in turn required a collection.
It also has a tendency to use a lot of memory - 'memory is cheap' is the mantra of GC languages, so programs are written with this in mind, which means memory allocations are much more common, especially for temporaries and intermediate objects. Just look to StringBuilder classes for the evidence that this is well known. Strings may be 'solved' using this, but many other objects are still allocated with wild abandon. Any program that uses a lot of memory will find itself struggling with RAM IO - all that memory has to be brought into the CPU caches to be used, the more memory you use, the more IO your CPU MM will have to do and that can kill performance in the wrong circumstances.
In addition, when a GC occurs, you have to handle Finalised objects too, this isn't quite as bad as it used to be, but it can still halt your program while the finalisers are run.
Old Java GCs were dreadful for perf, though a lot of research has made them significantly better, they are still not perfect.
EDIT:
one more thing about localisation, imagine creating an array and adding a few items, then do a load of allocations, then you want to add another item to the array - with a GC system the added array element will not be localised, even after a compaction, each object in the array will be stored as an individual item on the heap. This is why I think the localisation issue is not as big a deal as it's made out to be. Now, compare that to an array that is allocated with a buffer and objects are allocated within the buffer space. That may require a re-alloc and copy to add a new item, but reading and modifying it is super fast.
One factor not yet mentioned is that, especially in multi-threaded systems, it can sometimes be difficult to predict with certainty what object will end up holding the last surviving reference to some other object. If one doesn't have to worry about object graphs that might contain cycles, it's possible to use reference counts for this purpose. Before copying a reference to an object, increment its reference count. Before destroying a reference to an object, decrement its reference count. It decrementing the reference count makes it hit zero, destroy the object as well as the reference. Such an approach works well on computers with only one CPU core; if only one thread can actually be running at any given time, one doesn't have to worry about what will happen if two threads try to adjust the same object's reference count simultaneously. Unfortunately, in systems with multiple CPU cores, any CPU that wants to adjust a reference count would have to coordinate that action with all the other CPUs to ensure that two CPUs never hit the counter at the exact same time. Such coordination is "free" with a single CPU, but is relatively expensive in multi-core systems.
When using a batch-mode garbage collector, object references may generally be freely assigned, copied, and destroyed, without inter-CPU coordination. It will periodically be necessary to have all the CPUs stop and run a garbage-collection cycle, but requiring all the CPUs to coordinate with each other once every few seconds or so is a lot cheaper than requiring them to coordinate with each other on every single object-reference assignment.

Reducing calls to main memory, given heap-allocated objects

The OP here mentions in the final post (4th or so para from bottom):
"Now one thing that always bothered me about this is all the child
pointer checking. There are usually a lot of null pointers, and
waiting on memory access to fill the cache with zeros just seems
stupid. Over time I added a byte that contains a 1 or 0 to tell if
each of the pointers is NULL. At first this just reduced the cache
waste. However, I've managed cram 9 distance comparisons, 8 pointer
bits, and 3 direction bits all through some tables and logic to
generate a single switch variable that allows the cases to skip the
pointer checks and only call the relevant children directly. It is in
fact faster than the above, but a lot harder to explain if you haven't
seen this version."
He is referring to octrees as the data structure for real-time volume rendering. These would be allocated on the heap, due to their size. What I am trying to figure out is:
(a) Are his assumptions in terms of waiting on memory access, valid? My understanding is that he's referring to waiting on a full run out to main memory to fetch data, since he's assuming it won't be found in the cache due to generally not-too-good locality of reference when using dynamically-allocated octrees (common for this data structure in this sort of application).
(b) Should (a) prove to be true, I am trying to figure out how this workaround
Over time I added a byte that contains a 1 or 0 to tell if each of the
pointers is NULL.
would be implemented without still using the heap, and thus still incurring the same overhead, since I assume it would need to be stored in the octree node.
(a) Yes, his concerns about memory wait time are valid. In this case, he seems to be worried about the size of the node itself in memory; just the children take up 8 pointers, which is 64 bytes on a 64-bit architecture, or one cache line just for the children.
(b) That bitfield is stored in the node itself, but now takes up only 1 byte (1 bit for 8 pointers). It's not clear to me that this is an advantage though, as the line(s) containing the children will get loaded anyway when they are searched. However, he's apparently doing some bit tricks that allow him to determine which children to search with very few branches, which may increase performance. I wish he had some benchmarks that would show the benefit.

Resources