Binary trees used for data stored in internal memory

Binary trees generally favor data stored in internal memory.
Why do they favor internal-memory retrieval?
And why can't we use them for retrieving data stored in external memory?

Because round trips to external memory are expensive, and when we do pay for one we want to retrieve a larger block than a single binary tree node. Data structures like the B-tree are specifically designed to serve that purpose. Random access to internal memory, however, is not that expensive, so binary trees are fast and simple enough there.
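To make the block-size argument concrete, here is a rough back-of-the-envelope sketch (the 4 KiB block size, 8-byte keys, and 8-byte child pointers are assumptions for illustration only): a B-tree node sized to fill one disk block holds hundreds of keys, so looking up one of 100 million keys costs only a handful of block reads instead of the roughly 27 levels a binary tree would need.

    #include <cmath>
    #include <cstdio>

    int main() {
        const double block_bytes = 4096;       // assumed disk block size
        const double entry_bytes = 8 + 8;      // assumed key + child pointer size
        const double fanout = block_bytes / entry_bytes;   // ~256 entries per B-tree node
        const double n = 1e8;                  // 100 million keys, purely illustrative

        // Each level of either tree costs roughly one block read from external memory.
        std::printf("binary tree: ~%.0f block reads\n", std::log2(n));
        std::printf("B-tree:      ~%.0f block reads\n", std::log(n) / std::log(fanout));
    }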

I would second Li: the cost of retrieval is a lot to bear, and it also depends on the requirements. If you need fast, frequent gets, don't go to an external source. If you only want to dump data, however, external memory is what you should look for.

External memory is slower and generally larger than the smaller, faster internal memory.
With that in mind, binary trees are space efficient and have relatively good access times (depending on the operation).

Related

Practical Efficiency of binary search

When searching for an element or an insertion point in a sorted array, there are basically two approaches: straight search (element by element) or binary search. From the time complexities O(n) vs. O(log n) we know that binary search is ultimately more efficient; however, this does not automatically imply that binary search will always be faster than "normal" search.
My question therefore is: can binary search be practically less efficient than "normal" search for small n? If yes, can we estimate the point at which binary search becomes more efficient?
Thanks!
Yes, a binary search can be practically less efficient than a "normal" search for small n. However, it is very hard to estimate the point at which a binary search becomes more efficient (if that is even possible), because it depends strongly on the problem (e.g. data type, search predicate), the hardware (e.g. processor, RAM), and even the dynamic state of the hardware at the moment the search is performed, as well as the actual data in the sorted array, on modern systems.
The first reason a binary search can be less efficient is vectorization. Modern processors support SIMD instructions working on pretty big vectors, so a linear search can work on many items per processing cycle, and modern processors can often even execute a few SIMD instructions in parallel per cycle. While linear searches can often be trivially vectorized, binary searches are almost inherently sequential. Keep in mind that vectorization is not always possible, nor always done automatically by compilers, especially for non-trivial data types (e.g. composite data structures, pointer-based types) or non-trivial search predicates (e.g. ones with conditionals or memory indirections).
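To illustrate the vectorization point, here is a minimal sketch (the function names and the use of int keys are my own choices for illustration): a branchless linear scan over a sorted array that compilers can often auto-vectorize, next to the usual binary search via std::lower_bound, whose data-dependent branching makes it essentially sequential.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Branchless linear scan over a sorted array: the loop body has no
    // data-dependent branch, so compilers can often auto-vectorize it with SIMD.
    std::size_t linear_lower_bound(const std::vector<int>& a, int key) {
        std::size_t pos = 0;
        for (int x : a)
            pos += (x < key);   // count elements smaller than the key
        return pos;             // index of the first element >= key
    }

    // Binary search for comparison: O(log n) comparisons, but which half is
    // inspected next depends on the data, which defeats vectorization.
    std::size_t binary_lower_bound(const std::vector<int>& a, int key) {
        return static_cast<std::size_t>(std::lower_bound(a.begin(), a.end(), key) - a.begin());
    }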
The second reason a binary search can be less efficient is branch predictability. Modern processors try to predict branches ahead of time to avoid pipeline stalls. If the prediction works, branches are taken very quickly; otherwise the processor can stall for several cycles (up to dozens). A branch that is always true or always false is easy to predict; a branch taken at random cannot be predicted and causes stalls. Because the array is sorted, the branches in a linear search are easy to predict (they go the same way on every iteration until the element is found), while this is clearly not the case for a binary search. As a result, the speed of the search depends on the item searched for and on the data inside the sorted array.
The same thing applies to cache misses and memory fetches: because the latency of RAM is very high compared to executing arithmetic instructions, modern processors contain dedicated hardware prefetching units that try to predict the next memory fetches and bring the data in ahead of time to avoid cache misses. Prefetchers are good at predicting linear/contiguous memory accesses but very bad at random ones. The memory accesses of a linear search are trivial to prefetch, while those of a binary search appear mostly random to many processors. A cache miss during a binary search will certainly stall the processor for a lot of cycles. If the sorted array is already loaded in cache, a binary search on it can be much faster.
But this is not all: using wide SIMD instructions or incurring cache misses can affect the frequency of the computing core and therefore the speed of the algorithm. The size of the data type also matters a lot, since memory throughput is limited and strided memory accesses are slower than contiguous ones. One should also take into account the additional complexity of a binary search compared to a linear one (i.e. often more instructions to execute). I have probably missed some important points in the above list.
As a programmer, you may need to define a threshold to choose which algorithm to use. If you really need that, the best solution is to find it automatically, using a benchmark or autotuning methods. Practical experimentation shows that for a given fixed context (data type, cache state, etc.), the threshold has shifted over the last decades in favour of linear searches (i.e. the thresholds have generally been increasing over time).
My personal advice is not to use a binary search for values of n smaller than 256 / data_type_size_in_bytes with trivial/native data types on mainstream processors. I think it is a good idea to use a binary search when n is bigger than 1000, and also when the data type is non-trivial or the predicate is expensive.
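If you do settle on a threshold, the dispatch itself is trivial; here is a minimal sketch, where the value 128 is only a placeholder that should really come from benchmarking on the target machine:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hybrid lower_bound: linear scan below a tuned threshold, binary search above.
    std::size_t hybrid_lower_bound(const std::vector<int>& a, int key) {
        constexpr std::size_t kLinearThreshold = 128;   // placeholder; tune per platform
        if (a.size() <= kLinearThreshold) {
            std::size_t pos = 0;
            for (int x : a) pos += (x < key);           // branchless linear scan
            return pos;
        }
        return static_cast<std::size_t>(std::lower_bound(a.begin(), a.end(), key) - a.begin());
    }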

How can I better understand the impact of modern caching on algorithm performance?

I'm reading the following paper: http://www-db.in.tum.de/~leis/papers/ART.pdf and in it, they say in the abstract:
Main memory capacities have grown up to a point where most databases
fit into RAM. For main-memory database systems, index structure
performance is a critical bottleneck. Traditional in-memory data
structures like balanced binary search trees are not efficient on
modern hardware, because they do not optimally utilize on-CPU caches.
Hash tables, also often used for main-memory indexes, are fast but
only support point queries.
How can I better understand this utilization of on-CPU caches and how it impacts the performance of particular data structures/algorithms?
Just somewhere to get started would be great because this sort of analysis is really opaque to me and I don't know where to go to start understanding.
This is going to be a really basic answer, as it would otherwise be extremely broad. I'm also not an expert on the subject (picking up bits and pieces to help understand how to optimize my hotspots better). But it might help you get started investigating this subject.
The topic reminds me of my university days, when computer architecture courses only taught about registers, DRAM, and disk, while glossing over the CPU cache in between. The CPU cache is one of the most dominant factors in performance these days.
The memory of the computer is divided into a hierarchy ranging from the absolute biggest but slowest (disk) to the absolute smallest but fastest (registers).
Just below disk in that hierarchy is DRAM, which is still pretty slow. And just above registers is the CPU cache, which is pretty damned fast (especially the smallest L1 cache).
Accessing One Node
Now let's say we request access to memory in some form from some data structure, say a linked structure like a tree or linked list, and we're just accessing one node.
Note, I'm inverting the view of memory access for simplicity. Typically it begins with an instruction to load something into a register with the process working backwards and forwards, rather than merely forwards.
Virtual to Physical (DRAM)
In this case, unless the memory is already mapped to physical memory, the operating system has to map a page from virtual memory to a physical address in DRAM (this is freaking slow, especially in the worst-case scenario where the page fault involves a disk access). This is often done in pretty hefty chunks (the machine grabs memory by the handful), like aligned 4-kilobyte chunks. So we end up grabbing a big old 4-kilobyte aligned chunk of memory just for this one node.
DRAM to CPU Cache
Now that this 4-kilobyte page is physically mapped, we still want to do something with the node (most instructions have to operate at the register level) so the computer moves it down through the CPU cache hierarchy (this is pretty slow). Typically all levels of CPU cache have the same cache-line size, like 64-byte cache lines on Intel.
To move the memory from DRAM into these CPU caches, we have to grab a chunk of cache-line-sized-and-aligned memory from DRAM and move it into the CPU cache. We might also have to evict some data already in various levels of the CPU cache hierarchy on the way, like the least recently used memory. So now we're grabbing a 64-byte aligned handful of memory for this node.
Maybe at this point the cache line looks something like this, where 42 is the relevant node data and each ??? is irrelevant surrounding memory that's not part of our linked data structure:

    [ ??? | ??? | 42 | ??? | ??? | ??? | ??? | ??? ]   <- one 64-byte cache line
CPU Cache to Register
Now we move the memory from CPU cache into a register (this occurs very quickly). And here we're still grabbing memory in sort of a handful, but a pretty small one. For example, we might grab a 64-bit aligned chunk of memory and move it into a general-purpose register. So we grab the memory around "42" here and move it into a register.
Finally we do some operations on the register and store the results, and the results often kind of work their way back up the memory hierarchy.
Accessing One Other Node
When we access the next node in the linked structure, we potentially end up having to do all of this again, just to read one little node's data. The cache line we pull in this time might look like this, with 22 being the node data of interest:

    [ ??? | ??? | ??? | ??? | 22 | ??? | ??? | ??? ]   <- another 64-byte cache line
We can see how much effort the hardware and operating system potentially waste, moving big, aligned chunks of data from slower memory to faster memory only to access one teeny little bit of it before it gets evicted.
And that's why little objects allocated separately (as with linked nodes, or in languages that can't represent user-defined types contiguously) aren't very cache- or page-friendly. They tend to incur a lot of page faults and cache misses as we traverse them and access their data, unless they get help from a memory allocator that lays these nodes out more contiguously (in which case the data of two or more nodes might sit right next to each other and be fetched together).
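To illustrate that last point with one possible scheme (an index-linked node pool; the layout is my own simplification, not something from the paper being discussed): if nodes are kept side by side in one contiguous buffer and linked by index instead of by raw pointer, neighbouring nodes frequently share a cache line or page, so traversal touches far fewer lines.

    #include <cstdint>
    #include <vector>

    // Nodes live side by side in one contiguous vector and refer to each other
    // by index rather than by pointer, so consecutive nodes often share a cache
    // line or page instead of being scattered across the heap.
    struct Node {
        int          value;
        std::int32_t next;   // index of the next node in the pool, or -1 for "null"
    };

    struct NodePool {
        std::vector<Node> pool;

        std::int32_t make(int value, std::int32_t next) {
            pool.push_back(Node{value, next});
            return static_cast<std::int32_t>(pool.size()) - 1;
        }

        int sum(std::int32_t head) const {   // traversal never leaves 'pool'
            int total = 0;
            for (std::int32_t i = head; i != -1; i = pool[i].next)
                total += pool[i].value;
            return total;
        }
    };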
Contiguity and Spatial Locality
The most cache-friendly data structures tend to be based on contiguous arrays (not necessarily one gigantic array; it could be arrays linked together, e.g. as in an unrolled list). When we iterate through an array and access the first element, we might have to go through the motions described above, yet once the memory is moved into a cache line we might get something like this:

    [ 17 | 22 | 42 | ... ]   <- several array elements sharing one 64-byte cache line

Now we can iterate through the array and access all of these elements while they sit in the second-fastest form of memory on the machine, the L1 cache, simply moving data from L1 cache to register after the initial compulsory cache miss/page fault. If we start at 17, we pay that initial compulsory cache miss, but all the subsequent elements in this cache line can then be accessed without repeating the motions above. This is extremely fast, and the computer can blaze through such data.
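A tiny sketch of the contrast (using std::vector and std::list simply as stand-ins for "contiguous array" and "separately allocated nodes"): both sums are O(n), but the vector streams through whole cache lines while the list chases a pointer, and often eats a cache miss, per element.

    #include <list>
    #include <numeric>
    #include <vector>

    // Contiguous storage: each 64-byte cache line fetched holds many elements.
    long long sum_vector(const std::vector<int>& v) {
        return std::accumulate(v.begin(), v.end(), 0LL);
    }

    // Node-per-allocation storage: each element is reached through a pointer,
    // frequently landing on a different cache line (or page).
    long long sum_list(const std::list<int>& l) {
        return std::accumulate(l.begin(), l.end(), 0LL);
    }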
So that was what was meant by this part:
Traditional in-memory data structures like balanced binary search
trees are not efficient on modern hardware, because they do not
optimally utilize on-CPU caches.
Note that it is possible to make linked structures like trees and linked lists substantially more cache-friendly than they naturally are by using a custom memory allocator, but they lack this inherent cache-friendliness at the basic data-structure level.
Hash tables, on the other hand, tend to be contiguous table structures based on arrays. They might use chaining and linked bucket structures, but those are also easier to make cache-efficient with a little teeny bit of help from the custom allocator (far less than the tree due to the simpler, sequential access patterns within a hash bucket).
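For a feel of why table-based hash maps play well with the cache, here is a minimal open-addressing sketch (the fixed capacity, int keys, and linear probing are all simplifying assumptions of mine, and it assumes the table never fills up): every slot lives in one contiguous array, and a probe sequence walks consecutive slots, usually within one or two cache lines.

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Minimal fixed-capacity open-addressing map with linear probing.
    class FlatMap {
        struct Slot { int key; int value; bool used; };
        std::vector<Slot> slots;

        std::size_t hash(int key) const {
            return static_cast<std::size_t>(key) * 2654435761u % slots.size();
        }

    public:
        explicit FlatMap(std::size_t capacity) : slots(capacity, Slot{0, 0, false}) {}

        void put(int key, int value) {               // assumes the table never fills up
            for (std::size_t i = hash(key); ; i = (i + 1) % slots.size()) {
                if (!slots[i].used || slots[i].key == key) {
                    slots[i] = Slot{key, value, true};
                    return;
                }
            }
        }

        std::optional<int> get(int key) const {      // probes consecutive, contiguous slots
            for (std::size_t i = hash(key); slots[i].used; i = (i + 1) % slots.size())
                if (slots[i].key == key) return slots[i].value;
            return std::nullopt;
        }
    };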
So anyway, that's a little brief overview on the subject, a bit oversimplified, but hopefully enough to help get started. If you want to understand this subject at a deeper level, keywords would be cache/memory efficiency/optimization and locality of reference.

Data structures used to build file systems?

Which data structure is best to use for file organization? Are B-trees the best, or is there another data structure that gives faster access to files and good organization? Thanks
All file systems are different, so there are a huge number of data structures that actually get used in file systems.
Many file systems use some sort of bit vector (usually referred to as a bitmap) to track which blocks are free, since bitmaps are excellent for querying whether a specific block of the disk is in use and (for disks that aren't overwhelmingly full) support reasonably fast searches for free blocks.
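As a rough sketch of the bitmap idea (one bit per block; nothing here is specific to any real filesystem's on-disk layout):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Free-block bitmap: one bit per disk block, 1 = in use, 0 = free.
    struct BlockBitmap {
        std::size_t num_blocks;
        std::vector<std::uint64_t> words;

        explicit BlockBitmap(std::size_t n) : num_blocks(n), words((n + 63) / 64, 0) {}

        bool in_use(std::size_t block) const {
            return (words[block / 64] >> (block % 64)) & 1u;
        }
        void mark_used(std::size_t block) { words[block / 64] |=  std::uint64_t{1} << (block % 64); }
        void mark_free(std::size_t block) { words[block / 64] &= ~(std::uint64_t{1} << (block % 64)); }

        // Linear scan for the first clear bit; fast in practice because a fully
        // used group of 64 blocks is skipped with a single comparison.
        long long find_free() const {
            for (std::size_t w = 0; w < words.size(); ++w) {
                if (words[w] == ~std::uint64_t{0}) continue;   // all 64 blocks in use
                for (std::size_t b = 0; b < 64; ++b) {
                    const std::size_t block = w * 64 + b;
                    if (block >= num_blocks) break;
                    if (!in_use(block)) return static_cast<long long>(block);
                }
            }
            return -1;   // no free block
        }
    };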
Many older file systems (ext and ext2) stored directory structures using simple linked lists. Apparently this was actually fast enough for most applications, though some types of applications that used lots of large directories suffered noticeable performance hits.
The XFS file system was famous for using B+-trees for just about everything, including directory structures and its journaling system. From what I remember from my undergrad OS course, the philosophy was that since it took so long to write, debug, and performance tune the implementation of the B+-tree, it made sense to use it as much as possible.
Other file systems (ext3 and ext4) use a variant of the B-tree called the HTree that I'm not very familiar with. Apparently it uses some sort of hashing scheme to keep the branching factor high so that very few disk accesses are used.
I have heard anecdotally that some operating systems tried using splay trees to store their directory structures but ran into trouble with them. Specifically, splaying prevented multithreaded reads of the same directory (since in a splay tree each access reshapes the tree), and there was an edge case where the tree would degenerate into a linked list if all of its elements were accessed sequentially. That said, I don't know if this is just an urban legend, since these problems would have been apparent before anyone tried to code them up.
Microsoft's FAT32 system used a huge array (the file allocation table) that stored which files were stored where and which disk sectors logically follow one another within a file. The main drawback is that the table had to be set up in advance, so there ended up being upper limits on the sizes of files that could be stored on the disk. However, the array-based system was pretty easy to implement.
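A sketch of the idea (the end-of-chain marker and types here are placeholders, not the real on-disk FAT32 encoding): the table is one array with an entry per cluster, and entry i names the next cluster of the same file, so reading a file means walking a chain through the array.

    #include <cstdint>
    #include <vector>

    constexpr std::uint32_t kEndOfChain = 0xFFFFFFFF;   // placeholder end-of-chain marker

    // Follow a file's cluster chain through the (in-memory) file allocation table.
    std::vector<std::uint32_t> clusters_of_file(const std::vector<std::uint32_t>& fat,
                                                std::uint32_t first_cluster) {
        std::vector<std::uint32_t> chain;
        for (std::uint32_t c = first_cluster; c != kEndOfChain; c = fat[c])
            chain.push_back(c);        // entry c names the next cluster of this file
        return chain;
    }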
This is not an exhaustive list - I'm sure that other file systems use other data structures. However, I hope it helps give you a push in the right direction.
Hope this helps!

Heap Type Implementation

I was implementing heap sort and started wondering about the different implementations of heaps. When you don't need to access elements by index (as in a heap sort), what are the pros and cons of implementing a heap with an array versus a linked data structure?
I think it's important to take into account the memory wasted by the nodes and pointers vs the memory wasted by empty spaces in an array, as well as the time it takes to add or remove elements when you have to resize the array.
When should I use each one, and why?
As far as space is concerned, there's very little issue with using arrays if you know how much is going into the heap ahead of time -- the values in the heap can always be pointers to larger structures. This may afford better cache locality within the heap itself, but you're still going to have to go out someplace in memory for the extra data. Ideally, if your comparison is based on a small morsel of data (often just a 4-byte float or integer), you can store that as the key with a pointer to the full data and get good cache coherency.
Heap sorts are already not particularly good on cache hits throughout traversing the heap structure itself, however. For small heaps that fit entirely in L1/L2 cache, it's not really so bad. However, as you start hitting main memory performance will dive bomb. Usually this isn't an issue, but if it is, merge sort is your savior.
The larger problem comes in when you want a heap of undetermined size. Even then it isn't so bad with arrays. These days, in non-embedded environments with nice, pretty memory systems, growing an array with a few calls (e.g. realloc, please forgive my C background) really isn't all that slow, because the data may not need to physically move in memory -- mostly just some address pointer magic. Add to that the fact that if you use an array-size-doubling strategy (array is too small, double the size in a realloc call), you end up with O(n) total amortized cost over n insertions with relatively few reallocs and at most double the wasted space -- and hey, you'd get that wasted space with linked lists anyway if you're using a 32-bit key and a 32-bit pointer.
So, in short, I'd stick with arrays for the smaller base data structures. When the heap goes away, so do the pointers I no longer need, with a single deallocation. However, pointer-based heap code is easier to read in my opinion, since dealing with the indexing magic isn't quite as straightforward. If performance and memory aren't a concern, I'd recommend that to anyone in a heartbeat.
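For reference, the "indexing magic" the array version relies on is small; here is a minimal sketch of a binary max-heap over an int array (just to show the parent/child arithmetic, not a full heap implementation):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // In an array-backed binary heap, the children of the node at index i live
    // at 2*i + 1 and 2*i + 2, and its parent lives at (i - 1) / 2.
    void sift_down(std::vector<int>& heap, std::size_t i) {
        const std::size_t n = heap.size();
        for (;;) {
            std::size_t largest = i;
            const std::size_t left = 2 * i + 1, right = 2 * i + 2;
            if (left  < n && heap[left]  > heap[largest]) largest = left;
            if (right < n && heap[right] > heap[largest]) largest = right;
            if (largest == i) return;
            std::swap(heap[i], heap[largest]);
            i = largest;                 // keep pushing the displaced value down
        }
    }

    void sift_up(std::vector<int>& heap, std::size_t i) {
        while (i > 0 && heap[(i - 1) / 2] < heap[i]) {
            std::swap(heap[(i - 1) / 2], heap[i]);
            i = (i - 1) / 2;             // climb to the parent slot
        }
    }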

Disk-based trie?

I'm trying to build a Trie but on a mobile phone which has very limited memory capacity.
I figured that it is probably best that the whole structure be stored on disk, and only loaded as necessary since I can tolerate a few disk reads. But, after a few attempts, it seems like this is a very complicated thing to do.
What are some ways to store a Trie on disk (i.e. only partially loaded) and keep the fast lookup property?
Is this even a good idea to begin with?
The paper B-tries for disk-based string management answers your question.
It makes the observation:
To our knowledge, there has yet to be a proposal in literature for a
trie-based data structure, such as the burst trie, that can reside
efficiently on disk to support common string processing tasks.
I've only glanced at it briefly, but Shang's "Trie methods for text and spatial data on secondary storage" discusses paged trie representations, and might be a useful starting point.
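To give a flavour of the general approach (this is only a sketch of the idea of pointer-free, record-addressed nodes, not the B-trie or burst-trie layout from those papers; the fixed 26-way, lowercase-only node and binary-mode file are assumptions): each node is a fixed-size record in a file, children are referenced by record number, and a lookup reads one record per character, so only the nodes on the search path are ever loaded.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <string>

    // One fixed-size on-disk record per trie node; children are record numbers.
    struct DiskTrieNode {
        std::array<std::uint32_t, 26> child;   // record number per letter, 0 = no child
        std::uint8_t is_word;                  // 1 if a key ends at this node
    };

    // 'file' is assumed to be open in std::ios::binary and written with this layout.
    bool contains(std::ifstream& file, const std::string& key) {
        std::uint32_t record = 0;              // record 0 is assumed to be the root
        DiskTrieNode node{};
        for (char ch : key) {
            file.seekg(static_cast<std::streamoff>(record) * sizeof(DiskTrieNode));
            file.read(reinterpret_cast<char*>(&node), sizeof(node));
            record = node.child[static_cast<std::size_t>(ch - 'a')];   // lowercase a-z only
            if (record == 0) return false;     // no child for this character
        }
        file.seekg(static_cast<std::streamoff>(record) * sizeof(DiskTrieNode));
        file.read(reinterpret_cast<char*>(&node), sizeof(node));
        return node.is_word != 0;
    }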
