Which tree-based dictionary is the easiest to implement functionally? - data-structures

I'm looking for a tree-based dictionary data structure which is easy to implement in Haskell.
Do you have any experience with implementing AVL trees or RB trees? I'm also thinking about splay trees, but don't see how they could be implemented using immutable data.

Red-black trees are very easy to implement in a functional language, since you don't need to spend effort trying to shave off a few assignments, and the usual description of algorithms corresponds very well to pattern matching. See Okasaki, Red-Black Trees in a Functional Setting. In fact, his book, which is the revised and extended version of his thesis, is an excellent reference for many purely functional data structures.
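To give a feel for how closely the rebalancing cases map onto pattern matching, here is a minimal sketch of persistent red-black insertion in the style of Okasaki's paper. It is transcribed into Python tuples rather than Haskell, stores only keys (a dictionary would carry a value next to each key), and the names `balance`, `insert`, `member` and `is_red` are my own choices for illustration, not taken from the paper.

```python
# A node is the tuple (color, left, key, right); the empty tree is None.
# Tuples are never mutated, so insert returns a new tree that shares all
# untouched subtrees with the old one.
R, B = "R", "B"

def is_red(t):
    return t is not None and t[0] == R

def balance(color, left, key, right):
    """Rewrite a black node that has a red child with a red child (4 cases)."""
    if color == B:
        if is_red(left) and is_red(left[1]):        # red left, red left-left
            _, (_, a, x, b), y, c = left
            return (R, (B, a, x, b), y, (B, c, key, right))
        if is_red(left) and is_red(left[3]):        # red left, red left-right
            _, a, x, (_, b, y, c) = left
            return (R, (B, a, x, b), y, (B, c, key, right))
        if is_red(right) and is_red(right[1]):      # red right, red right-left
            _, (_, b, y, c), z, d = right
            return (R, (B, left, key, b), y, (B, c, z, d))
        if is_red(right) and is_red(right[3]):      # red right, red right-right
            _, b, y, (_, c, z, d) = right
            return (R, (B, left, key, b), y, (B, c, z, d))
    return (color, left, key, right)

def insert(tree, key):
    def ins(t):
        if t is None:
            return (R, None, key, None)             # new nodes start out red
        color, left, k, right = t
        if key < k:
            return balance(color, ins(left), k, right)
        if key > k:
            return balance(color, left, k, ins(right))
        return t                                    # key already present
    _, left, k, right = ins(tree)
    return (B, left, k, right)                      # the root is always black

def member(tree, key):
    while tree is not None:
        _, left, k, right = tree
        if key < k:
            tree = left
        elif key > k:
            tree = right
        else:
            return True
    return False
```

Because nothing is updated in place, the tree you had before calling `insert` is still a valid tree afterwards, which is exactly the property you want in a functional setting.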

Related

A tree that is both memory efficient and disk-space efficient?

I recently started reading about Data Structures in detail. I came across trees. AVL trees are designed taking fast memory access into consideration and B trees are designed taking efficient disk storage into consideration. Suppose I want to design a tree which is both memory efficient and disk storage efficient, what tree should I use? Is there any way I can combine AVL tree and B Tree? Is there any other tree that can do both? Is this fundamentally possible in a real-world scenario?
I want to design a tree which is both memory efficient and disk storage efficient (...) Is there any way I can combine AVL tree and B Tree?
Short answer is no, there isn't, unless you make a breakthrough discovery in the field of data structures. Both of them were designed with specific optimization requirements in mind, so you can't have the best of both worlds.
There's a concept in computing called the space–time tradeoff, which can be extended to other types of tradeoffs, like the one you're interested in. You can think of it like this: to improve one property of an already optimized algorithm you will have to worsen another (unless you discover some new approach no one has thought of before).
I suggest you take a look at the available implementations of optimized Binary Trees and start with the one that best fits your needs.

Is (pure) functional programming antagonistic with "algorithm classics"?

The classic algorithm books (TAOCP, CLR) (and not so classic ones, such as the fxtbook) are full of imperative algorithms. This is most obvious with algorithms whose implementation is heavily based on arrays, such as combinatorial generation (where both array index and array value are used in the algorithm) or the union-find algorithm.
The worst-case complexity analysis of these algorithms depends on array accesses being O(1). If you replace arrays with array-ish persistent structures, such as Clojure does, the array accesses are no longer O(1), and the complexity analysis of those algorithms is no longer valid.
Which brings me to the following questions: is pure functional programming incompatible with the classical algorithms literature?
With respect to data structures, Chris Okasaki has done substantial research into adapting classic data structures to a purely functional setting, as many of the standard data structures no longer work when destructive updates are unavailable. His book "Purely Functional Data Structures" shows how some structures, like binomial heaps and red/black trees, can be implemented quite well in a functional setting, while other basic structures like arrays and queues must be implemented with more elaborate concepts.
If you're interested in pursuing this branch of the core algorithms, his book would be an excellent starting point.
The short answer is that, so long as the algorithm does not have effects that can be observed after it finishes (other than what it returns), then it is pure. This holds even when you do things like destructive array updates or mutation.
If you had an algorithm like say:
    function zero(array):
        ix <- 0
        while(ix < length(array)):
            array[ix] <- 0
            ix <- ix+1
        return array
Assuming our pseudocode above is lexically scoped, so long as the array parameter is first copied over and the returned array is a wholly new thing, this algorithm represents a pure function (in this case, the Haskell function fmap (const 0) would probably work). Most "imperative" algorithms shown in books are really pure functions, and it is perfectly fine to write them that way in a purely functional setting using something like ST.
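As a rough Python analogue of the pseudocode above (in Haskell you would reach for ST instead), the sketch below copies the input first and mutates only the private copy. The variable names `original` and `zeroed` are just for illustration.

```python
def zero(values):
    """Return a new list of zeros; the caller's list is left untouched."""
    result = list(values)       # copy first, so the mutation below stays local
    ix = 0
    while ix < len(result):
        result[ix] = 0          # destructive update, but only on the private copy
        ix += 1
    return result

original = [3, 1, 4]
zeroed = zero(original)         # original is still [3, 1, 4] afterwards
```

From the outside, `zero` cannot be told apart from a function written without any assignment at all, which is the sense of "pure" used here.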
I would recommend looking at Mercury or the Disciplined Disciple Compiler to see pure languages that still thrive on destruction.
You may be interested in this related question: Efficiency of purely functional programming.
is there any problem for which the best known non-destructive algorithm is asymptotically worse than the best known destructive algorithm, and if so by how much?
It is not. But it is true that many algorithms in books look as if they are only usable in imperative languages. The main reason is that pure functional programming was confined to academic use for a long time, so the authors of these algorithms relied heavily on imperative features to stay in the mainstream. Now, consider two widespread algorithms: quicksort and merge sort. Quicksort is more "imperative" than merge sort; one of its advantages is that it works in place. Merge sort is more "pure" than quicksort (in some way) since it needs to copy and keep its data persistent. In fact, many algorithms can be implemented in pure functional programming without losing too much efficiency. This is true for many algorithms in the famous Dragon Book, for example.

Why are "Algorithms" and "Data Structures" treated as separate disciplines?

This question was the last straw; I've been wondering about it for a long time.
Why do people think about "Algorithms" and "Data structures" as about something that can be separated from each other?
I see a lot of evidence that they're separated in programmers' minds.
they request "Data Structures & Algorithms" books
they refer to "Data Structures" and "Algorithms" as separate university courses
they "know Algorithms", but are "weak in Data Structures" (can't find the link, sorry).
etc.
In my opinion "Data Structures" are algorithms, since the concept of a "Data Structure" is about the algorithms that operate on data going in and out of the structure. But this opinion does not seem to be mainstream. What am I missing?
Edit: unfortunately, I did not formulate the question well. A separation of data structures and algorithms in programs people write is natural, since, well, the former is data, and the latter is functions (and in semi-functional frameworks like STL it's the core of the whole thing).
But the points above, and the question itself, refer to the way people think, to the way they arrange the knowledge in their heads. This doesn't even have to relate to writing code.
Here are some links where people separate "algorithms" and "data structures" when they're the same thing:
Revisions: algorithm and data structure
They are different. Consider graphs, or trees to be more specific. Now, a tree appears to only be a tree. But you can browse it in preorder, inorder or postorder (3 algorithms for one structure).
You can have multiple or only 2 children for one node. The tree can be balanced (like AVL) or contain additional information (like B-tree indexes in data bases). That's different structures. But still you traverse them with the same algorithm.
See it now?
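To make the "three algorithms for one structure" point concrete, here is a small sketch. The tuple layout `(left, value, right)` and the function names are assumptions picked for brevity.

```python
# A binary tree node is (left, value, right); None is the empty tree.
def preorder(t):
    return [] if t is None else [t[1]] + preorder(t[0]) + preorder(t[2])

def inorder(t):
    return [] if t is None else inorder(t[0]) + [t[1]] + inorder(t[2])

def postorder(t):
    return [] if t is None else postorder(t[0]) + postorder(t[2]) + [t[1]]

# One structure ...
tree = ((None, 1, None), 2, (None, 3, None))

# ... three different algorithms over it.
print(preorder(tree))   # [2, 1, 3]
print(inorder(tree))    # [1, 2, 3]
print(postorder(tree))  # [1, 3, 2]
```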
Another point: Algorithms sometimes are and sometimes are not independent from data structures. Certain algorithms have different complexity over different structures (finding paths in graph represented as list or a 2D table).
Algorithms and data structures are tightly wound together. An algorithm depends on its data structures; if you change either of them, the complexity can change considerably. They are not the same, but they are definitely two sides of the same coin. Selecting a good data structure is itself a path towards a better algorithm.
For instance, priority queues can be implemented using binary heaps or binomial heaps. Binary heaps allow peeking at the highest-priority element in constant time, whereas binomial heaps require O(log N) time for peeking.
So, a particular algorithm works best for that particular data-structure (in a particular context), hence Algorithms and Data Structures go hand-in-hand!
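The constant-time peek of a binary heap can be seen with Python's standard `heapq` module, which keeps the smallest item at index 0 of a plain list. The `peek` helper and the sample tasks are hypothetical; note that `heapq` is a min-heap, so the lowest number counts as the highest priority in this sketch.

```python
import heapq

# (priority, task) pairs; a lower number means a higher priority here.
tasks = [(3, "write docs"), (1, "fix bug"), (2, "review")]
heapq.heapify(tasks)            # rearranges the list into a binary heap in O(n)

def peek(heap):
    """Look at the top element without removing it: a single index access."""
    return heap[0]

top = peek(tasks)               # (1, "fix bug"), and the heap is unchanged
```

Removing the top element with `heapq.heappop` is the O(log n) part, since the heap has to be repaired afterwards.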
People refer to them as different entities because they are. Suppose I want to find an element from a set of data. If I put that data into an array, the array is a data-structure. Once it's in the array, I can use multiple different algorithms to find the element I'm interested in. I could sort the array (with any of multiple sorts) then use a binary search, I could just check each element linearly, etc. The choice of the array as the data structure I would use as opposed to say, a linked list, is not choosing an algorithm.
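A quick sketch of that situation: the same array, searched with two different algorithms. The function names and sample data are made up for illustration; `bisect_left` comes from the standard library.

```python
from bisect import bisect_left

def linear_search(items, target):
    """Check each element in turn; works on unsorted data, O(n)."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """Halve the search range each step; needs sorted data, O(log n)."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = [42, 7, 19, 3]                        # the data structure: one array
found_linear = linear_search(data, 19)       # index in the original order
found_binary = binary_search(sorted(data), 19)  # index in the sorted copy
```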
That said, it is important to understand one to understand the other. If you do not understand algorithms well then it is not obvious what the advantages and disadvantages of different data structures are, and vice versa. As such, it makes sense to teach them simultaneously. They are however different entities.
[Edit] Think about this: If you look at pseudo-code for most algorithms, a data structure isn't specified. You may have a "list" of elements to iterate through etc, but the exact implementation of that list is unimportant to the correctness of the algorithm.
I would say it's because functional programming separates what is operated on from the operations themselves. Targets and actions are certainly different, even if they're closely intertwined.
It was object-oriented programming that put data and operations into a single component. Perhaps if OO had come along earlier there would have been one discipline.
The way I see it is that algorithms are something that work with or on data structures, so there is a difference between the two. A simple data structure is an array, but there are a lot of algorithms that operate on simple arrays, so there has to be a way of separating the two. An array can also represent a tree, and trees are handled with specialized algorithms.
The difference isn't big, because most of the time you can't really have one without the other, but sometimes you can. Consider the trivial algorithm that determines whether a number is prime: it uses no data structures. Consider the GCD algorithm, also no data structures. You can talk about an algorithm without talking about data structures, but you usually can't talk about a data structure without talking about algorithms. You can talk about a tree, but you'll need algorithms for insertions, removals etc.
I think it's good that there is a distinction because they are, conceptually, different things. An algorithm is a set of steps used for accomplishing a task, while a data structure is something used to store data; the manipulation of said data is done with algorithms.
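For reference, both of those algorithms fit in a few lines and touch nothing but a couple of integers. This is a plain sketch using trial division and Euclid's algorithm.

```python
def gcd(a, b):
    """Euclid's algorithm: only two integer variables, no container at all."""
    while b:
        a, b = b, a % b
    return a

def is_prime(n):
    """Trial division up to the square root of n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

print(gcd(48, 18))    # 6
print(is_prime(97))   # True
```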
They are separate university courses. Typically, the data structures course emphasizes programming and is prerequisite to the algorithms course, which emphasizes mathematical analysis of algorithms. I don't think it's hard to see why many people with an undergraduate education in CS might think of them as separate.
I agree with you. Both are two sides of one and the same thing.
When talking about data structures, it's always about storing data in a way to optimize certain operations on this data, which leads us to algorithms and complexity.
The two are, of course, closely intertwined. This is why the posts you refer to request books on both. Not always, though. The core of a sort algorithm, for example, is unchanged no matter what sort of data structure you're working on.
The title of the book Algorithms + Data Structures = Programs (1976) by none other than Niklaus Wirth suggests that both are essential in writing a program.

How do I determine which kind of tree data structure to choose?

Ok, so this is something that's always bothered me. The tree data structures I know of are:
Unbalanced binary trees
AVL trees
Red-black trees
2-3 trees
B-trees
B*-trees
Tries
Heaps
How do I determine what kind of tree is the best tool for the job? Obviously heaps are canonically used to form priority queues. But the rest of them just seem to be different ways of doing the same thing. Is there any way to choose the best one for the job?
Let’s pick them off one by one, shall we?
Unbalanced binary trees
For search tasks, never. Basically, their performance characteristics will be completely unpredictable and the overhead of balancing a tree won’t be so big as to make unbalanced trees a viable alternative.
Apart from that, unbalanced binary trees of course have other uses, but not as search trees.
AVL trees
They are easy to develop but their performance is generally surpassed by other balancing strategies because balancing them is comparatively time-intensive. Wikipedia claims that they perform better in lookup-intensive scenarios because their height is slightly less in the worst case.
Red-black trees
These are used inside most C++ std::map implementations and probably in a few other standard libraries as well. However, there’s good evidence that they are actually worse than B(+) trees in every scenario due to caching behaviour of modern CPUs. Historically, when caching wasn’t as important (or as good), they surpassed B trees when used in main memory.
2-3 trees
B-trees
B*-trees
These require the most careful consideration of all the trees, since the different constants used are basically “magical” constants which relate in weird and sometimes unpredictable ways to the underlying hardware architecture. For example, the optimal number of child nodes per level can depend on the size of a memory page or cache line.
I know of no good, general rule to distinguish between them.
Tries
Completely different. Tries are also search trees, but for text retrieval of substrings in a corpus. A trie is an uncompressed prefix tree (i.e. a tree in which the paths from root to leaf nodes correspond to all the prefixes of a given string).
Tries should be compared to, and offset against, suffix trees, suffix arrays and q-gram indices – not so much against other search trees because the data that they search is different: instead of discrete words in a corpus, the latter index structures allow a factor search.
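For readers who have not met the structure, here is a minimal sketch of an uncompressed prefix tree built from nested dicts, one level per character. The function names and the `END` marker key are hypothetical choices, and this is the plain word-indexing form, not a suffix tree.

```python
END = "<end>"   # marker key; cannot collide with a single-character edge label

def trie_insert(root, word):
    """Walk down one dict per character, creating levels as needed."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True            # a complete word ends at this node

def _walk(root, chars):
    node = root
    for ch in chars:
        if ch not in node:
            return None
        node = node[ch]
    return node

def trie_contains(root, word):
    node = _walk(root, word)
    return node is not None and END in node

def trie_has_prefix(root, prefix):
    return _walk(root, prefix) is not None

root = {}
for w in ["tree", "trie", "heap"]:
    trie_insert(root, w)
```

Lookup cost depends on the length of the key rather than on the number of stored words, which is one reason tries are hard to compare directly with the balanced trees above.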
Heaps
As you’ve already said, they are not search trees at all.
The same as any other data structure, you have to know the characteristics (complexity of search, insert, and delete operations) of each type of tree, and the requirements of the job you're selecting a tool for. The tree that has the best performance for the type of operations you'll do most often is usually the best tool for the job.
You can usually find the general characteristics for any kind of data structure on Wikipedia. Introduction to Algorithms also has at least a section (in some cases a whole chapter) on most of the data structures you've listed, so it's another good reference.
Similar question: When to choose RB tree, B-Tree or AVL tree?
Offhand, I'd say, write the simplest code that could possibly work (availing yourself of library-provided data structures if possible). Then measure its performance problems, if any.
If your performance needs are really extreme, read Konrad Rudolph's awesome answer. :)
Each of these has different complexity for insertion, deletion and retrieval. Most have O(log n) access times.
Each tree has specific characteristics which make it useful in a certain way. You should compare these characteristics with the needs you have.

What are the complicated data structures you should have heard of? [closed]

This is a derivative question, but I'm asking about the data structures that you should at least be familiar with for their usefulness. These structures are, however, too hard to implement without some expertise.
I would say a good boundary between the two is a heap -- you should be able to code a heap, but it would take you a day. Not appropriate for this would be a BST, etc. Edit: I see the point that it depends on what you are doing. I think it would be awesome to have a list with a phrase summarizing why you use it!
Here's a list to start:
B+ trees: good general indexing structure on a single key
K-d tree: spatial data
Red-black tree: self-balancing BST; also AVL or splay tree
Skip list: good hybrid structure for either random or (pseudo)sequential access
Trie: linear time string search
Bloom filters
What about:
Binomial Heaps
Fibonacci Heaps
Disjoint Set Data Structures
Splay Trees
Finger trees
That is a good start; there is a comprehensive list of data structures on Wikipedia, and some of them should be examined. But as to which ones you need, that depends on the area you intend to... do whatever it is that you are doing.
Embedded systems guys will have very different ideas from web guys who will strongly disagree with the business logic guys. Figure out what you want to do; languages and platform will also affect the list you need.
To quote Martin Kay:
Suffix trees constitute a well understood, extremely elegant, but regrettably poorly appreciated data structure with potentially many applications (...)
See also: What are the lesser known but cool data structures?
van Emde Boas trees. I don't literally think that you "should" have heard of them, but I do believe they're an interesting example of what kind of complexity you can achieve with "bit tricks" --- namely O(log log n), exponentially better than binary trees!
R-Tree
Closely related to the B+ tree you mentioned: B*-tree. Along with a balancing approach known as the "dancing tree" approach, these form the basis of Reiser4.
Binary Decision Diagrams, specifically Reduced Order Binary Decision Diagrams (ROBDD). These get reinvented (poorly) a lot when someone decides to create their own filtering system.
Cuckoo hashing, a simple and elegant way of resolving hash-table collisions in expected constant time.
Deterministic finite automata (DFAs), or finite state machines, useful for expressing many things, such as basic lexers, regular expressions, state transitions, etc. See also the related directed acyclic word graphs, which can be useful for storing dictionaries compactly.
I would add Hash Tables to the list. They are pretty simple in concept, but can be complicated once you look at how to implement a good hashing function and efficient probing methods.
R-Tree and its variants, such as R*-Tree, X-Tree, Pyramid-Tree. Various M-Tree variants, such as the Slim-Tree.
As is often the case, querying the tree is easy. There might be an easy bulk-loading method, too (for R-Trees, STR often does a good job). The tricky part is usually the maintenance of a good tree across updates.
You can try:
y-fast trees
Approximate ordered sets
select heap
compact arrays
Monolithic lists
Succinct lists
