Hashtable for a system with memory constraints - performance

I have read about the variants of hashtable but it is not clear to me which one is more appropriate for a system that is low on memory (we have a memory constraint limit).
Linear/Quadratic probing works well for sparse tables.
I think Double hashing is the same as Quadratic in this aspect.
External chaining does not have issue with clustering.
Most textbooks I have checked seem to assume that extra space will always be available, but in most example implementations I have seen the hashtable is never halved, so it ends up taking much more space than it really needs.
So which variant of a hashtable is most efficient when we want to make the best usage of memory?
Update:
So my question is not only about the size of the buckets. My understanding is that both the size of the buckets and the performance under load matter: if the buckets are small but the table degrades at 50% load, we have to resize to a larger table often.
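For reference, the open-addressing schemes named above differ only in how the i-th probe index is computed. A minimal sketch, assuming a table of m slots, a primary hash value h, and (for double hashing) a second hash value h2 that is non-zero and coprime with m; the function names are made up for illustration:

```python
def linear_probe(h, i, m):
    # i-th probe: step one slot at a time from the home slot.
    return (h + i) % m

def quadratic_probe(h, i, m):
    # i-th probe: the distance from the home slot grows quadratically.
    return (h + i * i) % m

def double_hash_probe(h, h2, i, m):
    # i-th probe: the step size comes from a second hash of the key.
    return (h + i * h2) % m

# Probe sequences for a key whose home slot is 3 in a table of 8 slots.
linear_seq = [linear_probe(3, i, 8) for i in range(4)]
quadratic_seq = [quadratic_probe(3, i, 8) for i in range(4)]
double_seq = [double_hash_probe(3, 3, i, 8) for i in range(4)]
```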

See this variant of Cuckoo Hashing.
It requires more hash functions, but that makes sense: you have to pay something for the memory savings.
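To make the idea concrete, here is a minimal sketch of plain two-table cuckoo hashing. It is not the linked variant, only the basic scheme: every key has exactly one candidate slot per table, so a lookup touches at most two slots. The class name `CuckooSet`, the seeded use of Python's built-in `hash`, and the `max_kicks` limit are assumptions made for illustration; `None` cannot be stored as a key.

```python
import random

class CuckooSet:
    """Toy cuckoo hash set: two tables, one hash function per table."""

    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity      # slots per table
        self.max_kicks = max_kicks    # displacement limit before growing
        self.tables = [[None] * capacity, [None] * capacity]
        self._new_seeds()

    def _new_seeds(self):
        # Two seeds stand in for two independent hash functions.
        self.seeds = (random.getrandbits(32), random.getrandbits(32))

    def _slot(self, which, key):
        return hash((self.seeds[which], key)) % self.capacity

    def __contains__(self, key):
        # A lookup probes at most two slots, one in each table.
        return any(self.tables[i][self._slot(i, key)] == key for i in (0, 1))

    def add(self, key):
        if key in self:
            return
        which = 0
        for _ in range(self.max_kicks):
            idx = self._slot(which, key)
            evicted = self.tables[which][idx]
            self.tables[which][idx] = key
            if evicted is None:
                return
            # The previous occupant must move to its slot in the other table.
            key = evicted
            which = 1 - which
        # Too many displacements: rebuild larger and reinsert the homeless key.
        self._grow(key)

    def _grow(self, pending):
        old = [k for table in self.tables for k in table if k is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        self._new_seeds()
        for k in old + [pending]:
            self.add(k)

s = CuckooSet(capacity=4)
for k in range(200):
    s.add(k)
```

The price mentioned above shows up in `add`: an insertion may displace a chain of existing keys, and occasionally forces a rebuild with fresh hash functions.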


Practical Efficiency of binary search

When searching an element or an insertion point in a sorted array, there are basically two approaches: straight search (element by element) or binary search. From the time complexities O(n) vs O(log(n)) we know that binary search is ultimately more efficient, however this does not automatically imply that binary search will always be faster than "normal search".
My question therefore is: can binary search be practically less efficient than "normal" search for low n? If yes, can we estimate the point at which binary search becomes more efficient?
Thanks!
Yes, a binary search can be practically less efficient than a "normal" search for small n. However, the point at which a binary search becomes more efficient is very hard to estimate (if it is possible at all), because it depends heavily on the problem (e.g. data type, search predicate), the hardware (e.g. processor, RAM) and, on modern systems, even the dynamic state of the hardware when the search is performed and the actual data in the sorted array.
The first reason a binary search can be less efficient is vectorization. Modern processors support SIMD instructions that operate on fairly wide vectors, so a linear search can process many items per cycle. Modern processors can often even execute a few SIMD instructions in parallel per cycle. While linear searches can often be trivially vectorized, this is not the case for binary searches, which are almost inherently sequential. Keep in mind that vectorization is not always possible, nor always done automatically by compilers, especially with non-trivial data types (e.g. composite data structures, pointer-based types) or non-trivial search predicates (e.g. ones with conditionals or memory indirections).
The second reason a binary search can be less efficient is branch predictability. Modern processors try to predict branches ahead of time to avoid pipeline stalls. If the prediction is right, the branch costs almost nothing; otherwise the processor can stall for several cycles (up to dozens). A branch is easy to predict if it is always taken or never taken. A randomly taken branch cannot be predicted, which causes stalls. Because the array is sorted, branches in linear searches are easy to predict (they go the same way every time until the element is found), while this is clearly not the case for binary searches. As a result, the speed of a search depends on the searched item and on the data inside the sorted array.
The same thing applies to cache misses and memory fetches: because the latency of RAM is very high compared to executing arithmetic instructions, modern processors contain dedicated hardware prefetching units that try to predict the next memory fetches and load data ahead of time in order to avoid cache misses. Prefetchers are good at predicting linear/contiguous memory accesses but very bad at random ones. The memory accesses of a linear search are trivial to predict, while those of a binary search appear mostly random to many processors. A cache miss during a binary search will almost certainly stall the processor for many cycles. If the sorted array is already loaded in cache, a binary search on it can be much faster.
But this is not all: using wide SIMD instructions or incurring cache misses can affect the clock frequency of the computing core, and so the speed of the algorithm. The size of the data type also matters a lot, since memory throughput is limited and strided memory accesses are slower than contiguous ones. One should also take into account the additional complexity of binary searches compared to linear ones (i.e. often more instructions to execute). I have probably missed some important points in the list above.
As a programmer, you may need to define a threshold to choose which algorithm to use. If you really need that, the best solution is to find it automatically using a benchmark or autotuning methods. Practical experiments show that, for a given fixed context (data type, cache state, etc.), the threshold has shifted over the last decades in favour of linear searches (so the thresholds generally increase over time).
My personal advice is not to use a binary search when n is smaller than 256 / data_type_size_in_bytes with trivial/native data types on mainstream processors. I think it is a good idea to use a binary search when n is bigger than 1000, when the data type is non-trivial, or when the predicate is expensive.
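A threshold like this is usually wired in as a small dispatcher. The sketch below only shows the shape of such a hybrid search; the names `lower_bound` and `LINEAR_THRESHOLD` are hypothetical, and the value 64 is simply 256 / 4 for a 4-byte element, taken from the rule of thumb above. In interpreted Python the real crossover point will be different, so treat the constant as a placeholder to be tuned by benchmarking.

```python
from bisect import bisect_left

# Hypothetical cutoff: 256 / 4 bytes per element. Tune it by benchmarking.
LINEAR_THRESHOLD = 64

def linear_lower_bound(a, x):
    # Scan left to right; return the first index whose value is >= x.
    for i, v in enumerate(a):
        if v >= x:
            return i
    return len(a)

def lower_bound(a, x, threshold=LINEAR_THRESHOLD):
    # Short arrays: simple sequential scan. Long arrays: binary search.
    if len(a) < threshold:
        return linear_lower_bound(a, x)
    return bisect_left(a, x)

small = [1, 3, 5, 7]
large = list(range(0, 2000, 2))
small_pos = lower_bound(small, 4)
large_pos = lower_bound(large, 1001)
```

Both branches return the same insertion point, so the threshold only affects speed, never the result.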

Memory data layout vs algorithm performance

How does the layout of data in memory affect algorithm performance?
For example, merge sort is known for its computational complexity of O(n log n).
But on a real-world machine, the algorithm will load and unload blocks of memory into CPU caches and CPU registers, and spend extra time doing so.
The elements of the collection to be sorted could be scattered throughout memory, and I wonder whether that makes sorting slower than it would be with elements stored close together.
Is it necessary to take into account how collections actually store their data in memory?
In terms of big O notation: no. The time to read each block from RAM into the CPU cache is bounded by some constant, call it C, so even if you need to load each element from RAM into cache in every iteration, you are going to need O(C * n log n) time, and since C is a constant, it remains O(n log n) time complexity.
In real-world applications, especially real-time ones, cache performance can indeed be a factor and should be considered, so the order in which data is accessed can matter. This is one of the reasons why quicksort is usually regarded as "faster": it tends to have nice cache performance.
In addition, there are some algorithms that are designed to enjoy the "best of both worlds", i.e. O(n log n) worst case with better constants, such as Timsort.
However, as a rule of thumb, you should usually first implement the "easy way", then benchmark to see if it's fast enough, profile if it's not, and optimize the bottleneck. If you try to optimize every piece of your code for best cache performance, you will probably never finish writing it.
Profiling, profiling, profiling.
Modern computer architectures have become so complicated that accurate predictions of running time have become impossible. You should prefer an experimental approach.
Also note that running times are no longer deterministic, so you should resort to statistical methods.
Architecture killed the algorithmician.
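The experimental, statistical approach recommended above might look like the following sketch: repeat each measurement several times and report the median and the spread instead of a single number. The helper name `time_sum`, the array size and the repeat count are arbitrary choices for illustration. In Python the list stores references to integer objects, so the gap between sequential and shuffled access may be smaller than it would be in a language with flat arrays.

```python
import random
import statistics
import timeit

def time_sum(indices, data, repeats=7):
    """Time summing data[i] in the given index order, several times over."""
    def run():
        total = 0
        for i in indices:
            total += data[i]
        return total
    samples = timeit.repeat(run, number=1, repeat=repeats)
    return statistics.median(samples), min(samples), max(samples)

n = 200_000
data = list(range(n))
sequential = list(range(n))   # visit elements in memory order
shuffled = sequential[:]
random.shuffle(shuffled)      # visit the same elements in random order

seq_stats = time_sum(sequential, data)
rnd_stats = time_sum(shuffled, data)
print("sequential: median %.4fs (min %.4fs, max %.4fs)" % seq_stats)
print("shuffled:   median %.4fs (min %.4fs, max %.4fs)" % rnd_stats)
```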
How does the layout of data in memory affect algorithm performance?
Layout is very important, especially for large amounts of data, because access to main memory is still expensive even for modern CPUs:
http://mechanical-sympathy.blogspot.ru/2013/02/cpu-cache-flushing-fallacy.html
And your algorithm may spend a lot of time on each cache miss:
http://mechanical-sympathy.blogspot.ru/2012/08/memory-access-patterns-are-important.html
Moreover, there is now a dedicated area of computer science devoted to cache-friendly data structures and algorithms. See, for example, this paper, found with a quick search:
http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/LFN02.pdf
etc etc

Heap Type Implementation

I was implementing a heap sort and started wondering about the different implementations of heaps. When you don't need to access the elements by index (as in a heap sort), what are the pros and cons of implementing a heap with an array versus doing it like any other linked data structure?
I think it's important to take into account the memory wasted by the nodes and pointers vs the memory wasted by empty spaces in an array, as well as the time it takes to add or remove elements when you have to resize the array.
When should I use each one, and why?
As far as space is concerned, there's very little issue with using arrays if you know how much is going into the heap ahead of time -- your values in the heap can always be pointers to the larger structures. This may give better cache locality on the heap itself, but you're still going to have to go out someplace to memory for the extra data. Ideally, if your comparison is based on a small morsel of data (often just a 4-byte float or integer), you can store that as the key with a pointer to the full data and achieve good cache locality.
Heap sorts are already not particularly good on cache hits while traversing the heap structure itself, however. For small heaps that fit entirely in L1/L2 cache, it's not really so bad. However, as you start hitting main memory, performance will dive bomb. Usually this isn't an issue, but if it is, merge sort is your savior.
The larger problem comes in when you want a heap of undetermined size. However, this still isn't so bad, even with arrays. Nowadays, in non-embedded environments with nice, pretty memory systems, growing an array with some calls (e.g. realloc, please forgive my C background) really isn't all that slow, because the data may not need to physically move in memory -- just some address pointer magic for most of it. Add to that the fact that with an array-size-doubling strategy (array is too small, double the size in a realloc call) you still end up with an O(n) total cost for n insertions (amortized O(1) each), with relatively few reallocs and at most double the space wasted -- but hey, you'd get that with linked lists anyway if you're using a 32-bit key and a 32-bit pointer.
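The cost of the doubling strategy can be checked with a small simulation. This sketch assumes the worst case, where every reallocation physically copies all existing elements; the function name and its return values are made up for illustration.

```python
def copies_with_doubling(n, initial_capacity=1):
    """Simulate n appends to an array that doubles whenever it is full."""
    capacity, size = initial_capacity, 0
    copies = reallocs = 0
    for _ in range(n):
        if size == capacity:
            copies += size    # worst case: realloc moves every element
            capacity *= 2
            reallocs += 1
        size += 1
    return copies, reallocs, capacity

copies, reallocs, capacity = copies_with_doubling(1000)
# 1000 appends: 10 reallocs, 1023 element copies, final capacity 1024.
```

The total number of copies stays below 2n and the final capacity stays below twice the number of stored elements, which is where the "at most double wasted space" figure comes from.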
So, in short, I'd stick with arrays for the smaller base data structures. When the heap goes away, so do the pointers I don't need anymore with a single deallocation. However, it's easier to read pointer-based code for heaps in my opinion since dealing with the indexing magic isn't quite as straightforward. If performance and memory aren't a concern, I'd recommend that to anyone in a heartbeat.
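For comparison, the "indexing magic" of the array version is fairly compact. The sketch below is a textbook array-based heap sort, not code from the question: the children of the node at index i live at indices 2*i + 1 and 2*i + 2, so no pointers are stored at all.

```python
def sift_down(a, root, end):
    """Restore the max-heap property for the subtree at root within a[:end]."""
    while True:
        child = 2 * root + 1              # left child index
        if child >= end:
            return
        if child + 1 < end and a[child + 1] > a[child]:
            child += 1                    # right child is the larger one
        if a[root] >= a[child]:
            return
        a[root], a[child] = a[child], a[root]
        root = child

def heap_sort(a):
    n = len(a)
    # Build a max-heap bottom-up, starting from the last parent node.
    for root in range(n // 2 - 1, -1, -1):
        sift_down(a, root, n)
    # Repeatedly move the maximum to the end and shrink the heap.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)
    return a

result = heap_sort([5, 3, 8, 1, 9, 2])
```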

Best heuristic for malloc

Consider using malloc() to allocate x bytes of memory in a fragmented heap. Assume the heap has multiple contiguous locations of size greater than x bytes.
Which is the best (that leads to least heap wastage) heuristic to choose a location among the following?
Select smallest location that is bigger than x bytes.
Select largest location that is bigger than x bytes.
My intuition is smallest location that is bigger than x bytes. I am not sure which is the best in practice.
No, this is not an assignment question. I was reading "How do malloc() and free() work?" and this looked like a good follow-up question to ask.
In a generic heap where allocations of different sizes are mixed, of the two I'd go for putting the allocation in the smallest block that can accommodate it (to avoid reducing the size of the largest block we can allocate before we need to).
There are other ways of implementing a heap however that would make this question less relevant (such as the popular dlmalloc by Doug Lea - where it pools blocks of similar sizes to improve speed and reduce overall fragmentation).
Which solution is best always comes down to the way the application performs its memory allocations. If you know an application's allocation pattern in advance, you should be able to beat the generic heaps in both size and speed.
It's better to select the smallest location. Think about future malloc requests. You don't know what they'll be, and you want to satisfy as many requests as you can. So it's better to find a location that exactly fits your needs, so that bigger requests can be satisfied in the future. In other words, selecting the smallest location reduces fragmentation.
The heuristics you listed are used in the Best Fit and Worst Fit algorithms, respectively. There is also the First Fit algorithm which simply takes the first space it finds that is large enough. It is approximately as good as Best Fit, and much faster.
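The three placement policies can be sketched over a simplified free list. Here the free list is just a Python list of block sizes and each function returns the index of the chosen block (or `None` if nothing fits); a real allocator also tracks addresses, splits blocks and coalesces neighbours, which this illustration ignores.

```python
def first_fit(free_blocks, x):
    # Take the first block that is large enough.
    for i, size in enumerate(free_blocks):
        if size >= x:
            return i
    return None

def best_fit(free_blocks, x):
    # Take the smallest block that is large enough.
    candidates = [(size, i) for i, size in enumerate(free_blocks) if size >= x]
    return min(candidates)[1] if candidates else None

def worst_fit(free_blocks, x):
    # Take the largest block that is large enough.
    candidates = [(size, i) for i, size in enumerate(free_blocks) if size >= x]
    return max(candidates)[1] if candidates else None

free_blocks = [100, 40, 500, 60]
choices = (first_fit(free_blocks, 50),
           best_fit(free_blocks, 50),
           worst_fit(free_blocks, 50))
# For a 50-byte request: first fit picks the 100-byte block (index 0),
# best fit the 60-byte block (index 3), worst fit the 500-byte block (index 2).
```

First Fit can stop at the first match, while the other two have to examine every free block, which is the speed difference mentioned above.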

Algorithms for Optimization with Fast Disk Storage (SSDs)?

Given that Solid State Disks (SSDs) are decreasing in price and will soon become more prevalent as system drives, and given that their access rates are significantly higher than those of rotating magnetic media, what standard algorithms will gain in performance from the use of SSDs for local storage? For example, the high random read speed of SSDs makes something like a disk-based hashtable viable for large hashtables; 4GB of disk space is readily available, which makes hashing to the entire range of a 32-bit integer viable (more for lookup than population, though, which would still take a long time). A hashtable of this size would be prohibitive to work with on rotating media due to the access speed, but it shouldn't be as much of an issue with SSDs.
Are there any other areas where the impending transition to SSDs will provide potential gains in algorithmic performance? I'd rather see reasoning as to how one thing will work rather than opinion; I don't want this to turn contentious.
Your example of hashtables is indeed the key database structure that will benefit. Instead of having to load a whole 4GB or more file into memory to probe for values, the SSD can be probed directly. The SSD is still slower than RAM, by orders of magnitude, but it's quite reasonable to have a 50GB hash table on disk, but not in RAM unless you pay big money for big iron.
An example is chess position databases. I have over 50GB of hashed positions. There is complex code to try to group related positions near each other in the hash, so I can page in 10MB of the table at a time and hope to reuse some of it for multiple similar position queries. There's a ton of code and complexity to make this efficient.
After switching to an SSD, I was able to drop all the complexity of the clustering and just use really dumb randomized hashes. I also got an increase in performance, since I only fetch the data I need from the disk, not big 10MB chunks. The latency is indeed larger, but the net speedup is significant, and the super-clean code (20 lines, not 800+) is perhaps even nicer.
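A direct-probe table of that kind can be sketched in a few lines. This is not the chess database code described above, only an illustration of the idea: fixed-size slots in a file, one `seek` per probe, linear probing on collisions. The class name `DiskTable`, the slot layout, the multiplicative hash and the use of unsigned 64-bit integer keys and values are all assumptions.

```python
import os
import struct
import tempfile

SLOT = struct.Struct("<BQQ")   # used flag, 64-bit key, 64-bit value

class DiskTable:
    """Open-addressing hash table stored in a file of fixed-size slots."""

    def __init__(self, path, slots):
        self.slots = slots
        self.f = open(path, "w+b")
        self.f.write(bytes(slots * SLOT.size))   # zero bytes: all slots empty

    def _home(self, key):
        # Simple multiplicative hash (an assumption, any mixer would do).
        return ((key * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) % self.slots

    def _read(self, i):
        self.f.seek(i * SLOT.size)               # one random read per probe
        return SLOT.unpack(self.f.read(SLOT.size))

    def put(self, key, value):
        i = self._home(key)
        for _ in range(self.slots):
            used, k, _v = self._read(i)
            if not used or k == key:
                self.f.seek(i * SLOT.size)
                self.f.write(SLOT.pack(1, key, value))
                return
            i = (i + 1) % self.slots             # linear probing
        raise RuntimeError("table full")

    def get(self, key):
        i = self._home(key)
        for _ in range(self.slots):
            used, k, v = self._read(i)
            if not used:
                return None
            if k == key:
                return v
            i = (i + 1) % self.slots
        return None

    def close(self):
        self.f.close()

with tempfile.TemporaryDirectory() as d:
    table = DiskTable(os.path.join(d, "table.bin"), slots=16)
    table.put(42, 7)
    table.put(58, 9)
    table.put(42, 8)          # overwrite an existing key
    found = table.get(42)
    missing = table.get(99)
    table.close()
```

Each lookup reads only the slots it probes, which is the access pattern that is cheap on an SSD and painful on a rotating disk.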
SSDs are only significantly faster for random access. For sequential disk access they are only about twice as fast as mainstream rotational drives. Many SSDs also perform poorly in a number of scenarios, sometimes worse than rotational drives, as described here.
While SSDs do move the needle considerably, they are still much slower than CPU operations and physical memory. For your 4GB hash table example, you may be able to sustain 250+ MB/s off of an SSD when accessing random hash table buckets. With a rotational drive, you'd be lucky to get out of single-digit MB/s. If you can keep this 4GB hash table in memory, you could access it on the order of gigabytes per second, much faster than even a very swift SSD.
The referenced article lists several changes MS made for Windows 7 when running on SSDs, which can give you an idea of the sort of changes you could consider making. First, SuperFetch, which prefetches data off of disk, is disabled: it is designed to get around the slow random access times of disks, which SSDs alleviate. Defrag is disabled too, because having files scattered across the disk isn't a performance hit for SSDs.
Ipso facto, any algorithm you can think of that requires lots of random disk I/O will benefit (random being the key word: it throws the principle of locality to the birds and so eliminates the usefulness of a lot of the caching that goes on).
I could see certain database systems gaining from this though. MySQL, for instance using the MyISAM storage engine (where data records are basically glorified CSVs). However, I think very large hashtables are going to be your best bet for good examples.
SSDs are a lot faster for random reads, a bit faster for sequential reads, and probably slower for writes (random or not).
So a disk-based hashtable is probably not useful with an SSD, since updating it now takes significant time, even though searching the disk becomes very cheap (compared to a normal HDD).
Don't kid yourself. SSDs are still a whole lot slower than system memory. Any algorithm that chooses to use system memory over the hard disk is still going to be much faster, all other things being equal.
