What are the complicated data structures you should have heard of? [closed]

This is a derivative question, but I'm asking about the data structures you should at least be familiar with because of their usefulness, even though they are too hard to implement yourself without some expertise.
I would say a good boundary between the two is a heap -- you should be able to code a heap, but it would take you a day. Something as simple as a BST wouldn't qualify. Edit: I see the point that it depends on what you are doing. I think it would be awesome to have a list with a phrase summarizing why you use each one!
Here's a list to start:
B+ trees: good general indexing structure on a single key
K-d tree: spatial data
Red-black tree: self-balancing BST; also AVL or splay tree
Skip list: good hybrid structure for either random or (pseudo)sequential access
Trie: linear time string search

Bloom filters: space-efficient, probabilistic set-membership tests (false positives are possible, false negatives are not).
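To make that concrete, here's a minimal Python sketch of the idea; the class name, sizing parameters, and the trick of deriving k bit positions by salting one hash are illustrative choices, not a canonical implementation.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False is definitive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("apple")
print("apple" in bf)   # True
print("banana" in bf)  # almost certainly False
```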

What about:
Binomial Heaps
Fibonacci Heaps
Disjoint Set Data Structures (union-find; see the sketch after this list)
Splay Trees
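For the disjoint-set entry above, here's a minimal union-find sketch with path halving and union by rank; the array-based layout and names are illustrative.

```python
class DisjointSet:
    """Union-find with path halving and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point every other visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False          # already in the same set
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra       # attach the shorter tree under the taller
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

ds = DisjointSet(5)
ds.union(0, 1)
ds.union(1, 2)
print(ds.find(0) == ds.find(2))  # True
print(ds.find(0) == ds.find(4))  # False
```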

Finger trees

That is a good start; there is a comprehensive list of data structures on Wikipedia, and some of them are worth examining. But as for which ones you need, that depends on the area you intend to... do whatever it is that you are doing.
Embedded systems people will have very different ideas from web people, who will strongly disagree with the business logic people. Figure out what you want to do; language and platform will also affect the list you need.

To quote Martin Kay:
Suffix trees constitute a well understood, extremely elegant, but regrettably poorly appreciated data structure with potentially many applications (...)
See also: What are the lesser known but cool data structures?

van Emde Boas trees. I don't literally think that you "should" have heard of them, but I do believe they're an interesting example of what kind of complexity you can achieve with "bit tricks": O(log log u) operations over a bounded integer universe of size u, exponentially better than binary trees!

R-Tree

Closely related to the B+ tree you mentioned: the B*-tree. Along with a balancing scheme known as the "dancing tree", these form the basis of Reiser4.

Binary Decision Diagrams, specifically Reduced Ordered Binary Decision Diagrams (ROBDDs). These get reinvented (poorly) a lot when someone decides to create their own filtering system.

Cuckoo hashing, a simple and elegant way of resolving hash-table collisions with worst-case constant lookup time and expected constant insertion time.
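A rough sketch of the scheme in Python, assuming two tables and two hash functions (the second one is improvised here by salting Python's built-in hash); updating existing keys and other refinements are omitted.

```python
class CuckooHashTable:
    """Two tables, two hash functions; a lookup probes at most two slots."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]

    def _hashes(self, key):
        # Two independent-ish hash values; the salt is an illustrative hack.
        return (hash(key) % self.capacity,
                hash((key, "salt")) % self.capacity)

    def get(self, key):
        # Worst-case constant time: exactly two probes.
        for table, h in zip(self.tables, self._hashes(key)):
            if table[h] is not None and table[h][0] == key:
                return table[h][1]
        return None

    def put(self, key, value, max_kicks=32):
        # Sketch only: does not update a key that is already present.
        entry = (key, value)
        for _ in range(max_kicks):
            for i in (0, 1):
                h = self._hashes(entry[0])[i]
                if self.tables[i][h] is None:
                    self.tables[i][h] = entry
                    return
                # Evict the resident entry and try to re-place it.
                entry, self.tables[i][h] = self.tables[i][h], entry
        self._rehash()        # too many kicks: grow and retry
        self.put(*entry)

    def _rehash(self):
        old = [e for t in self.tables for e in t if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)
```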

Deterministic finite automata (DFAs), or finite state machines, useful for expressing many things, such as basic lexers, regular expressions, state transitions, etc. See also the related directed acyclic word graphs, which can be useful for storing dictionaries compactly.
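As a tiny illustration, here's a DFA encoded as a transition table; the machine (accepting binary strings with an even number of 1s) is invented for the sketch.

```python
# DFA accepting binary strings that contain an even number of 1s.
# States: "even", "odd"; start and accept state: "even".
transitions = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def accepts(s):
    state = "even"
    for ch in s:
        state = transitions[(state, ch)]  # KeyError on invalid input symbols
    return state == "even"

print(accepts("1010"))  # True: two 1s
print(accepts("111"))   # False: three 1s
```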

I would add Hash Tables to the list. They are pretty simple in concept, but can be complicated once you look at how to implement a good hashing function and efficient probing methods.
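To illustrate the probing half of that, here's a minimal open-addressing table with linear probing; it's a sketch (no deletion, illustrative names), not a production design.

```python
class LinearProbingMap:
    """Open addressing with linear probing; keeps load factor <= 0.5."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.slots = [None] * capacity

    def _probe(self, key):
        # Walk forward from the home slot until we hit the key or a gap.
        i = hash(key) % self.capacity
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.capacity
        return i

    def put(self, key, value):
        if (self.size + 1) * 2 > self.capacity:
            self._grow()
        i = self._probe(key)
        if self.slots[i] is None:
            self.size += 1
        self.slots[i] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None

    def _grow(self):
        old = [s for s in self.slots if s is not None]
        self.capacity *= 2
        self.slots = [None] * self.capacity
        self.size = 0
        for k, v in old:
            self.put(k, v)

m = LinearProbingMap()
m.put("a", 1)
m.put("b", 2)
print(m.get("a"), m.get("c"))  # 1 None
```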

R-Tree and its variants, such as R*-Tree, X-Tree, Pyramid-Tree. Various M-Tree variants, such as the Slim-Tree.
As is often the case, querying the tree is easy, and there may be an easy bulk-loading method too (for R-Trees, STR often does a good job). The tricky part is usually maintaining a good tree across updates.

You can try:
y-fast trees
Approximate ordered sets
select heap
compact arrays
Monolithic lists
Succinct lists

Related

A tree that is both memory efficient and disk-space efficient?

I recently started reading about data structures in detail and came across trees. AVL trees are designed with fast memory access in mind, and B-trees are designed with efficient disk storage in mind. Suppose I want to design a tree that is both memory efficient and disk-storage efficient: what tree should I use? Is there any way to combine an AVL tree and a B-tree? Is there any other tree that can do both? Is this fundamentally possible in a real-world scenario?
I want to design a tree which is both memory efficient and disk storage efficient (...) Is there any way I can combine AVL tree and B Tree?
The short answer is no, there isn't, unless you make a breakthrough discovery in the field of data structures. Both were designed with specific optimization requirements in mind; you can't have the best of both worlds.
There's a concept in computing called the space–time tradeoff, which can be extended to other kinds of tradeoffs, like the one you're interested in. You can think of it like this: to improve one property of an already-optimized algorithm, you will have to worsen another (unless you discover some new approach no one has thought of before).
I suggest you take a look at the available implementations of optimized Binary Trees and start with the one that best fits your needs.

When is a treap useful?

In what kind of situation is a treap the optimal data structure to use? I have been searching for answers on this but haven't really found anything concrete.
There's another stackoverflow question asking when to use a treap but no real world examples are given there.
The most commonly cited advantage seems to be that they are much easier to implement than, for example, a red-black tree, but almost everyone uses pre-written implementations anyway, so that doesn't seem very relevant.
It's an optimal data structure to use as an example in randomized algorithms classes.
OK, flippancy aside, the narrow advantages suggested by Aragon and Seidel include the following.
They're simple. Yes, your standard library may have a red-black tree available, but it's likely that it doesn't provide enough hooks to do some of the interesting things that can be done with binary search trees (e.g., order statistics). Split and merge are much simpler too.
They use slightly less space than red-black trees, assuming that the priorities are computed by hashing the keys. In practice this doesn't matter if the red-black trees can steal a pointer bit for color.
They may be faster than red-black trees. I haven't searched for evidence either way.
The big downside is that the performance guarantees are in expectation only. People learned the hard way with hash tables that the oblivious adversary assumed by analyses of randomized algorithms usually isn't so oblivious in the real world.
I think it's fair to say that treaps were an interesting idea but one that turned out not to have a lot of practical impact. It's research. That happens.
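To back up the "they're simple" point, here's a minimal treap insertion sketch: a plain BST insert followed by rotations that restore the heap order on random priorities. The node layout and function names are just for illustration.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()  # heap order uses these
        self.left = self.right = None

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r

def insert(root, key):
    # Ordinary BST insert, then rotate the new node up while its
    # priority beats its parent's (max-heap on priorities).
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.priority > root.priority:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root

root = None
for k in [5, 2, 8, 1, 9]:
    root = insert(root, k)
```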
One very unusual property of treaps is that their shape is not sensitive to the order of insertions and deletions.
Since each node's position is determined by its priority, if n elements with fixed priorities (say, priorities derived by hashing the keys) are added to an empty treap, the treap will look exactly the same irrespective of the order in which the insertions happen.
So an adversary cannot look at the treap and figure out the order in which the elements were inserted.
As a real-world example, a treap can be used in an LFU cache implementation: the cache combines a hash map with a treap.
Caching policies are named after their eviction policy; an LFU cache purges the least frequently used item. To support this, each item holds a counter recording how many times it has been used.
But we have to be careful. We want to make sure that, among the elements with the minimum use count, the oldest is removed first; otherwise we could end up removing the newest entry over and over, without giving it a chance to have its counter increased. So we have to keep track of two things: the counter and the time of insertion.
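As a sketch of that bookkeeping (not the treap-based design described above; a heap with lazy deletion keeps the example short), here's a minimal LFU cache that breaks count ties by age:

```python
import heapq
from itertools import count

class LFUCache:
    """Hash map for O(1) access plus a min-heap ordered by
    (use count, age) to pick the eviction victim."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}   # key -> (value, use_count)
        self.heap = []   # (use_count, tick, key); may hold stale entries
        self.tick = count()

    def get(self, key):
        if key not in self.data:
            return None
        value, uses = self.data[key]
        self.data[key] = (value, uses + 1)
        heapq.heappush(self.heap, (uses + 1, next(self.tick), key))
        return value

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            self._evict()
        uses = self.data[key][1] + 1 if key in self.data else 1
        self.data[key] = (value, uses)
        heapq.heappush(self.heap, (uses, next(self.tick), key))

    def _evict(self):
        # Pop stale entries until the top matches a live (key, count) pair.
        while self.heap:
            uses, _, key = heapq.heappop(self.heap)
            if key in self.data and self.data[key][1] == uses:
                del self.data[key]
                return
```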

fp growth algorithm

I have to implement the FP-growth algorithm in any language. The code should be serial, with no recursion. Is it possible to implement such an algorithm without recursion? I am not looking for code; I just need an explanation of how to do it.
FP-growth is a recursive algorithm. As others have said here, you can always transform a recursive algorithm into a non-recursive one by using a stack. But I don't see any good reason to do that for FP-growth.
By the way, if you want a Java implementation of FPGrowth and other frequent pattern mining algorithms such as Apriori, HMine, Eclat, etc., you can check my website. I have implemented more than 40 algorithms for frequent pattern mining, association rule mining, etc.:
http://www.philippe-fournier-viger.com/spmf/
I don't know the algorithm you're talking about, but anything that is possible with recursion is also possible without it: you can implement such algorithms using an explicit stack.
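For example, here's the same computation written recursively and then with an explicit stack; the nested-list tree is an arbitrary stand-in for whatever structure the real algorithm recurses over.

```python
# Recursive version: sum the integers in a tree of nested lists.
def tree_sum_recursive(node):
    if isinstance(node, int):
        return node
    return sum(tree_sum_recursive(child) for child in node)

# Same computation with an explicit stack and no recursion.
def tree_sum_iterative(root):
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        if isinstance(node, int):
            total += node
        else:
            stack.extend(node)   # defer the children instead of recursing
    return total

tree = [1, [2, 3], [[4], 5]]
print(tree_sum_recursive(tree))  # 15
print(tree_sum_iterative(tree))  # 15
```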
Here is a very clear explanation of how the code works. It looks like you have to build a tree and validate it.
Assuming that by "FP growth algorithm" you mean the frequent pattern growth algorithm, I would point you to this document, which gives a decent explanation of how it works.
http://www.florian.verhein.com/teaching/2008-01-09/fp-growth-presentation_v1%20%28handout%29.pdf
I wonder though, is this homework related?
You can probably visit http://code.google.com/p/lofia/ to get something over FP Tree.
This is for longest frequent itemset mining.
You can also take a look at the concept and implementation of the FP-growth algorithm in Mahout.

Why are "Algorithms" and "Data Structures" treated as separate disciplines?

This question was the last straw; I've been wondering about it for a long time.
Why do people think of "Algorithms" and "Data Structures" as things that can be separated from each other?
I see a lot of evidence that they're separated in programmers' minds.
they request "Data Structures & Algorithms" books
they refer to "Data Structures" and "Algorithms" as separate university courses
they "know Algorithms", but are "weak in Data Structures" (can't find the link, sorry).
etc.
In my opinion "Data Structures" are algorithms, since the concept of "Data Structure" is about Algorithms to operate data that go in and out of the structures. But the opinion seems not a mainstream. What do I miss?
Edit: unfortunately, I did not formulate the question well. The separation of data structures and algorithms in the programs people write is natural, since the former is data and the latter is functions (and in semi-functional frameworks like the STL, this separation is the core of the whole thing).
But the points above, and the question itself, refer to the way people think, the way they arrange knowledge in their heads. This doesn't even have to relate to writing code.
Here are some links where people separate "algorithms" and "data structures" when they're the same thing:
Revisions: algorithm and data structure
They are different. Consider graphs, or trees to be more specific. A tree appears to just be a tree, but you can traverse it in preorder, inorder, or postorder (three algorithms for one structure).
You can have many children per node or only two. The tree can be self-balancing (like an AVL tree) or carry additional information (like B-tree indexes in databases). Those are different structures. But you still traverse them with the same algorithms.
See it now?
Another point: algorithms sometimes are and sometimes are not independent of data structures. Some algorithms have different complexity over different structures (for example, finding paths in a graph represented as an adjacency list versus an adjacency matrix).
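To make the "three algorithms for one structure" point concrete, here's one small tree and its three classic traversals (an illustrative sketch):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def preorder(n):
    return [n.value] + preorder(n.left) + preorder(n.right) if n else []

def inorder(n):
    return inorder(n.left) + [n.value] + inorder(n.right) if n else []

def postorder(n):
    return postorder(n.left) + postorder(n.right) + [n.value] if n else []

#     2
#    / \
#   1   3
tree = Node(2, Node(1), Node(3))
print(preorder(tree))   # [2, 1, 3]
print(inorder(tree))    # [1, 2, 3]
print(postorder(tree))  # [1, 3, 2]
```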
Algorithms and data structures are tightly wound together. An algorithm depends on its data structures: if you change either of them, the complexity can change considerably. They are not the same, but they are definitely two sides of the same coin. Selecting a good data structure is itself a step toward a better algorithm.
For instance, priority queues can be implemented using binary heaps or binomial heaps; binary heaps allow peeking at the highest-priority element in constant time, whereas binomial heaps require O(log n) time for a peek.
So, a particular algorithm works best for that particular data-structure (in a particular context), hence Algorithms and Data Structures go hand-in-hand!
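For instance, with Python's heapq (a binary min-heap, where a smaller number means higher priority), the constant-time peek is just an index:

```python
import heapq

# A binary min-heap: the smallest element is always at index 0,
# so peeking is O(1); push and pop are O(log n).
pq = []
heapq.heappush(pq, (2, "low priority task"))
heapq.heappush(pq, (1, "high priority task"))

print(pq[0])              # O(1) peek: (1, 'high priority task')
print(heapq.heappop(pq))  # O(log n) removal
```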
People refer to them as different entities because they are. Suppose I want to find an element from a set of data. If I put that data into an array, the array is a data-structure. Once it's in the array, I can use multiple different algorithms to find the element I'm interested in. I could sort the array (with any of multiple sorts) then use a binary search, I could just check each element linearly, etc. The choice of the array as the data structure I would use as opposed to say, a linked list, is not choosing an algorithm.
That said, it is important to understand one to understand the other. If you do not understand algorithms well then it is not obvious what the advantages and disadvantages of different data structures are, and vice versa. As such, it makes sense to teach them simultaneously. They are however different entities.
[Edit] Think about this: If you look at pseudo-code for most algorithms, a data structure isn't specified. You may have a "list" of elements to iterate through etc, but the exact implementation of that list is unimportant to the correctness of the algorithm.
I would say it's because functional programming separates what is operated on from the operations themselves. Targets and actions are certainly different, even if they're closely intertwined.
It was object-oriented programming that put data and operations into a single component. Perhaps if OO had come along earlier there would have been one discipline.
The way I see it is that algorithms are something that work with or on data structures, so there is a difference between the two. A simple data structure is an array, but there are a lot of algorithms that operate on simple arrays, so there has to be a way of separating the two. An array can also represent a tree, and trees are handled with specialized algorithms.
The difference isn't big, because most of the time you can't really have one without the other, but sometimes you can. Consider the trivial algorithm that determines whether a number is prime: it uses no data structures. Consider the GCD algorithm: also no data structures. You can talk about an algorithm without talking about data structures, but you usually can't talk about a data structure without talking about algorithms. You can talk about a tree, but you'll need algorithms for insertion, removal, etc.
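Both examples really are data-structure-free; they operate on bare integers:

```python
# Euclid's algorithm: no data structure, just integer arithmetic.
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

# Trial-division primality test: again, nothing but integers.
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

print(gcd(48, 18))   # 6
print(is_prime(97))  # True
```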
I think it's good that there is a distinction because they are, conceptually, different things. An algorithm is a set of steps used for accomplishing a task, while a data structure is something used to store data, the manipulation of said data is done with algorithms.
They are separate university courses. Typically, the data structures course emphasizes programming and is prerequisite to the algorithms course, which emphasizes mathematical analysis of algorithms. I don't think it's hard to see why many people with an undergraduate education in CS might think of them as separate.
I agree with you. Both are two sides of one and the same thing.
When talking about data structures, it's always about storing data in a way to optimize certain operations on this data, which leads us to algorithms and complexity.
The two are, of course, closely intertwined. This is why the posts you refer to request books on both. Not always, though: the core of a sorting algorithm, for example, is unchanged no matter what sort of data structure you're working on.
The title of the book Algorithms + Data Structures = Programs (1976) by none other than Niklaus Wirth suggests that both are essential to writing a program.

How do I determine which kind of tree data structure to choose?

Ok, so this is something that's always bothered me. The tree data structures I know of are:
Unbalanced binary trees
AVL trees
Red-black trees
2-3 trees
B-trees
B*-trees
Tries
Heaps
How do I determine what kind of tree is the best tool for the job? Obviously heaps are canonically used to form priority queues. But the rest of them just seem to be different ways of doing the same thing. Is there any way to choose the best one for the job?
Let’s pick them off one by one, shall we?
Unbalanced binary trees
For search tasks, never. Basically, their performance characteristics will be completely unpredictable and the overhead of balancing a tree won’t be so big as to make unbalanced trees a viable alternative.
Apart from that, unbalanced binary trees of course have other uses, but not as search trees.
AVL trees
They are easy to develop but their performance is generally surpassed by other balancing strategies because balancing them is comparatively time-intensive. Wikipedia claims that they perform better in lookup-intensive scenarios because their height is slightly less in the worst case.
Red-black trees
These are used inside most of C++'s std::map implementations and probably in a few other standard libraries as well. However, there's good evidence that they are actually worse than B(+) trees in every scenario, due to the caching behaviour of modern CPUs. Historically, when caching wasn't as important (or as good), they surpassed B trees when used in main memory.
2-3 trees
B-trees
B*-trees
These require the most careful consideration of all the trees, since the various constants used are basically "magic" constants that relate in weird and sometimes unpredictable ways to the underlying hardware architecture. For example, the optimal number of child nodes per level can depend on the size of a memory page or cache line.
I know of no good, general rule to distinguish between them.
Tries
Completely different. Tries are also search trees, but for text retrieval of substrings in a corpus. A trie is an uncompressed prefix tree (i.e. a tree in which the paths from root to leaf nodes correspond to all the prefixes of a given string).
Tries should be compared to, and offset against, suffix trees, suffix arrays and q-gram indices – not so much against other search trees because the data that they search is different: instead of discrete words in a corpus, the latter index structures allow a factor search.
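A minimal trie sketch, illustrating lookups that cost O(length of the query string) regardless of how many words are stored; the class and method names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # marks the end of a stored word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # O(len(word)) lookup, independent of how many words are stored.
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

t = Trie()
t.insert("tree")
t.insert("trie")
print(t.contains("trie"))  # True
print(t.contains("tri"))   # False: stored only as a prefix
```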
Heaps
As you’ve already said, they are not search trees at all.
The same as any other data structure, you have to know the characteristics (complexity of search, insert, and delete operations) of each type of tree, and the requirements of the job you're selecting a tool for. The tree that has the best performance for the type of operations you'll do most often is usually the best tool for the job.
You can usually find the general characteristics for any kind of data structure on Wikipedia. Introduction to Algorithms also has at least a section (in some cases a whole chapter) on most of the data structures you've listed, so it's another good reference.
Similar question: When to choose RB tree, B-Tree or AVL tree?
Offhand, I'd say, write the simplest code that could possibly work (availing yourself of library-provided data structures if possible). Then measure its performance problems, if any.
If your performance needs are really extreme, read Konrad Rudolph's awesome answer. :)
Each of these has different complexity for insertion, deletion, and retrieval; most offer O(log n) access times.
Each tree has specific characteristics that make it useful in a certain way. You should compare those characteristics with the needs you have.
