Given a data structure specification such as a purely functional map with known complexity bounds, one has to pick between several implementations. There is some folklore on how to pick the right one; for example, Red-Black trees are generally considered to be faster overall, while AVL trees have better performance on workloads with many lookups.
Is there a systematic presentation (published paper) of this knowledge (as relates to sets/maps)? Ideally I would like to see statistical analysis performed on actual software. It might conclude, for example, that there are N typical kinds of map usage, and list the input probability distribution for each.
Are there systematic benchmarks that test map and set performance on different distributions of inputs?
Are there implementations that use adaptive algorithms to change representation depending on actual usage?
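To make the last question concrete, here is the kind of adaptive behaviour I have in mind, as a rough Python sketch I wrote for illustration (the threshold, the two representations, and all names are my own placeholders, not taken from any real library): a map that starts out as a small sorted association list and switches to a hash table once it grows past a cutoff.

    # Hypothetical sketch of an "adaptive" map: it keeps a small sorted list of
    # (key, value) pairs while it is tiny, and switches to a dict once it
    # crosses a size threshold.
    import bisect

    class AdaptiveMap:
        SWITCH_AT = 32  # arbitrary cutoff, purely illustrative

        def __init__(self):
            self._small = []    # sorted list of (key, value) pairs
            self._big = None    # dict, once we have switched representation

        def insert(self, key, value):
            if self._big is not None:
                self._big[key] = value
                return
            i = bisect.bisect_left(self._small, (key,))
            if i < len(self._small) and self._small[i][0] == key:
                self._small[i] = (key, value)
            else:
                self._small.insert(i, (key, value))
                if len(self._small) > self.SWITCH_AT:
                    self._big = dict(self._small)   # change representation
                    self._small = []

        def lookup(self, key):
            if self._big is not None:
                return self._big.get(key)
            i = bisect.bisect_left(self._small, (key,))
            if i < len(self._small) and self._small[i][0] == key:
                return self._small[i][1]
            return None

This only adapts on size; what I am really asking about is implementations that also adapt to the observed mix of operations (lookup-heavy versus update-heavy, for example).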
These are basically research topics, and the results are generally given in the form of conclusions, while the statistical data is hidden. You can, however, run a statistical analysis on your own data.
As for benchmarks, it is better to go through the implementation details yourself.
The third part of the question is a very subjective matter, and the actual intentions may never be known at the time of implementation. However, languages like Perl do their best to provide highly optimized implementations of every operation.
The following might be of help:
Purely Functional Data Structures by Chris Okasaki
http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf
Locality sensitive hashing seems like a great technique for KNNs without any disadvantages. However, what would be a disadvantage of locality sensitive hashing if someone is using it in industry for practical applications? Under what situations will the LSH fail or do somewhat badly? Or does it take long time to code/tune?
This is a rather broad question, but since you are new here, I will attempt to answer.
LSH is not as perfect as you describe; do search for papers about it. This question may help: How to understand Locality Sensitive Hashing?
There are many LSH libraries that provide automatic parameter configuration, but not for the most important parameter, R, which is used in solving a randomized version of the R-near-neighbor problem. This is a major drawback, since the user has to identify R manually for every input. In my opinion, that is a very important aspect to take into account when it comes to practical applications.
As for performance, it all depends on your input! For example, in my kd-GeRaF project I tested LSH thoroughly and saw that it can have significant issues with accuracy and search speed. The datasets were in a high-dimensional space, where ANNS (approximate nearest-neighbor search) was performed.
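For what it is worth, here is a stripped-down random-projection LSH sketch in Python, just to show where a radius-like scale parameter enters; the bucket width w below plays a role analogous to the R discussed above, and all names and parameter values are illustrative rather than taken from any particular library.

    # Toy random-projection LSH for Euclidean distance. The bucket width w must
    # be tuned per dataset, which is exactly the manual work described above.
    import random

    def make_hash(dim, w, num_projections=4, seed=0):
        rng = random.Random(seed)
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_projections)]
        offsets = [rng.uniform(0, w) for _ in range(num_projections)]
        def h(point):
            return tuple(
                int((sum(a * x for a, x in zip(plane, point)) + b) // w)
                for plane, b in zip(planes, offsets)
            )
        return h

    def build_index(points, h):
        buckets = {}
        for i, p in enumerate(points):
            buckets.setdefault(h(p), []).append(i)
        return buckets

    def query(q, points, buckets, h):
        # Only candidates that land in the same bucket are compared exactly.
        candidates = buckets.get(h(q), [])
        return min(candidates,
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)),
                   default=None)

If w is badly chosen, either everything collides into a few buckets (slow) or true neighbours end up in different buckets (inaccurate), which is the tuning problem in practice.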
I have read about "probabilistic" data structures like bloom filters and skip lists.
What are the common characteristics of probabilistic data structures and what are they used for?
There are probably a lot of different (and good) answers, but in my humble opinion, the common characteristic of probabilistic data structures is that they give you an approximate, rather than a precise, answer.
How many items are here?
About 1523425 with probability of 99%
Update:
A quick search produced a link to a decent article on the issue:
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
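To give the "about N, with some probability" idea a concrete shape, here is a tiny Morris-style approximate counter in Python. It is only a sketch of the general trick (store an exponent instead of the exact count); a single counter like this has a large variance, and real systems average several of them.

    # Morris approximate counter: stores a small exponent instead of the exact
    # count, and answers "roughly how many increments happened?".
    import random

    class MorrisCounter:
        def __init__(self):
            self.exponent = 0

        def increment(self):
            # Bump the exponent with probability 2^-exponent.
            if random.random() < 2 ** -self.exponent:
                self.exponent += 1

        def estimate(self):
            return 2 ** self.exponent - 1

    c = MorrisCounter()
    for _ in range(1000000):
        c.increment()
    print(c.estimate())   # somewhere in the broad vicinity of a million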
If you are interested in probabilistic data structures, you might want to read my recently published book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 9783748190486, available at Amazon), where I have explained many such space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications.
In this book, you can find state-of-the-art algorithms and data structures that help to handle such common problems in Big Data processing as:
Membership querying (Bloom filter, Counting Bloom filter, Quotient filter, Cuckoo filter).
Cardinality (Linear counting, probabilistic counting, LogLog, HyperLogLog, HyperLogLog++).
Frequency (Majority algorithm, Frequent, Count Sketch, Count-Min Sketch).
Rank (Random sampling, q-digest, t-digest).
Similarity (LSH, MinHash, SimHash).
You can get a free preview and all related information about the book at https://pdsa.gakhov.com
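As a taste of how small the core of one of these structures can be, here is a minimal, purely illustrative Bloom filter in Python for the membership-querying case listed above; the bit-array size, the number of hashes, and the way the hashes are derived are arbitrary choices for brevity, not recommendations.

    # Minimal Bloom filter: answers "definitely not present" or "probably present".
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def __contains__(self, item):
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("alice")
    print("alice" in bf)   # True
    print("bob" in bf)     # almost certainly False; false positives are possible, false negatives are not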
Probabilistic data structures can't give you a definite answer; instead, they provide a reasonable approximation of the answer together with a way to estimate how good that approximation is. They are extremely useful for big data and streaming applications because they allow you to dramatically decrease the amount of memory needed (in comparison to data structures that give you exact answers).
In the majority of cases these data structures use hash functions to randomize the items. Because they ignore collisions, they keep their size constant, but this is also the reason why they can't give you exact values. The advantages they bring:
they use a small amount of memory (you can control how much)
they can be easily parallelized (the hashes are independent)
they have constant query time (not even amortized constant, as with a dictionary)
Frequently used probabilistic data structures are:
Bloom filters
Count-Min sketch
HyperLogLog
There is a list of probabilistic data structures on Wikipedia for your reference:
https://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
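To illustrate the points above (small, hash-based, constant-time queries), here is a bare-bones Count-Min sketch in Python; the width and depth are arbitrary here, whereas in practice they would be derived from the error and confidence bounds you want.

    # Bare-bones Count-Min sketch: may over-estimate a frequency, never under-estimates it.
    import hashlib

    class CountMinSketch:
        def __init__(self, width=256, depth=4):
            self.width = width
            self.depth = depth
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, row, item):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, item, count=1):
            for row in range(self.depth):
                self.table[row][self._index(row, item)] += count

        def estimate(self, item):
            # Taking the minimum across rows limits the damage from hash collisions.
            return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    cms = CountMinSketch()
    for word in ["a", "b", "a", "c", "a"]:
        cms.add(word)
    print(cms.estimate("a"))   # at least 3, and usually exactly 3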
There are different definitions of what a "probabilistic data structure" is. IMHO, a probabilistic data structure is one that uses some randomized algorithm or takes advantage of some probabilistic characteristics internally, but it does not have to behave probabilistically or non-deterministically from the user's perspective.
There are many "probabilistic data structures" with probabilistic behavior, such as the Bloom filter and HyperLogLog mentioned in the other answers.
At the same time, there are other "probabilistic data structures" with deterministic behavior (from a user's perspective), such as the skip list. Users can use a skip list much like a balanced binary search tree, but it is implemented with a probability-related idea internally. As the skip list's author, William Pugh, puts it:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
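Since the quote may sound abstract, here is a compact and deliberately simplified skip-list sketch in Python showing search and insert only; a real implementation would also support deletion and tune the level cap, but the coin-flipping idea that makes it "probabilistic" is all here.

    # Simplified skip list: nodes are promoted to higher levels by coin flips,
    # so searches can skip ahead and the expected time is O(log n).
    import random

    MAX_LEVEL = 16

    class Node:
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * (level + 1)

    class SkipList:
        def __init__(self):
            self.head = Node(None, MAX_LEVEL)
            self.level = 0

        def _random_level(self):
            level = 0
            while random.random() < 0.5 and level < MAX_LEVEL:
                level += 1
            return level

        def search(self, key):
            node = self.head
            for lvl in range(self.level, -1, -1):
                while node.forward[lvl] and node.forward[lvl].key < key:
                    node = node.forward[lvl]
            node = node.forward[0]
            return node is not None and node.key == key

        def insert(self, key):
            update = [self.head] * (MAX_LEVEL + 1)
            node = self.head
            for lvl in range(self.level, -1, -1):
                while node.forward[lvl] and node.forward[lvl].key < key:
                    node = node.forward[lvl]
                update[lvl] = node
            new_level = self._random_level()
            self.level = max(self.level, new_level)
            new_node = Node(key, new_level)
            for lvl in range(new_level + 1):
                new_node.forward[lvl] = update[lvl].forward[lvl]
                update[lvl].forward[lvl] = new_node

    s = SkipList()
    for k in [5, 1, 9, 3]:
        s.insert(k)
    print(s.search(3), s.search(4))   # True False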
Probabilistic data structures allow for constant memory usage and extremely fast processing while still maintaining a low error rate with a specified degree of uncertainty.
Some use cases are:
Checking the presence of a value in a data set
Counting the frequency of events
Estimating the approximate size of a data set
Ranking and grouping
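For the "approximate size of a data set" case, the classic trick is to look at patterns in hash values; here is a toy Flajolet-Martin style estimator in Python. A single estimator like this is very noisy, and real systems (HyperLogLog, for instance) combine many of them, so treat it purely as an illustration of the idea.

    # Toy Flajolet-Martin distinct-count estimate: the number of trailing zero
    # bits in a hash is geometrically distributed, so the maximum observed run
    # of zeros hints at how many distinct items have been seen.
    import hashlib

    def trailing_zeros(n, bits=32):
        if n == 0:
            return bits
        count = 0
        while n % 2 == 0:
            n //= 2
            count += 1
        return count

    def estimate_distinct(items):
        max_zeros = 0
        for item in items:
            h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
            max_zeros = max(max_zeros, trailing_zeros(h))
        return 2 ** max_zeros

    data = [i % 1000 for i in range(100000)]    # 1000 distinct values
    print(estimate_distinct(data))              # a power of two in the rough vicinity of 1000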
I have been reading parts of Introduction to Algorithms by Cormen et al, and have implemented some of the algorithms.
In order to test my implementations I wrote some glue code to do file I/O, then made some sample inputs by hand and generated more by writing small programs.
However, I am doubtful about the quality of my own sample inputs: I may have missed corner cases or the more interesting possibilities, I may have miscalculated the proper output, and so on.
Is there a set of test inputs and outputs for various algorithms collected somewhere on the Internet so that I might be able to test my code? I am looking for test data reasonably specific to particular algorithms, rather than contest problems that often involve a problem solving component as well.
I understand that I might have to adjust my code depending on the format the input is collected in (e.g., the various constraints on the inputs; for graph algorithms, the representation of the graph; etc.), although I am hoping that any changes I would have to make would be reasonably trivial.
Edit:
Some particular datasets I am currently looking for are:
Lists of numbers
Skewed so that Quick sort performs badly (see the sketch after this list).
Skewed so that Fibonacci Heap performs particularly well or poorly for specific operations.
Graphs (for which High Performance Mark has offered a number of interesting references)
Sparse graphs (with specific bounds on the number of edges)
Dense graphs
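To make the quicksort item above concrete, this is roughly the kind of generator I have been writing myself; it assumes a naive quicksort that picks the first (or last) element as the pivot, since median-of-three or randomized pivots need different adversarial inputs.

    # Inputs that push a naive, first-element-pivot quicksort into its O(n^2) case.
    import random

    def already_sorted(n):
        return list(range(n))

    def reverse_sorted(n):
        return list(range(n, 0, -1))

    def all_equal(n, value=42):
        # Also quadratic for many simple two-way partition schemes.
        return [value] * n

    def random_input(n, seed=0):
        rng = random.Random(seed)
        return [rng.randint(0, n) for _ in range(n)]

    if __name__ == "__main__":
        for gen in (already_sorted, reverse_sorted, all_equal, random_input):
            print(gen.__name__, gen(10))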
Since I am still working through the book, if you are in a similar situation or just feel the list could be improved, please feel free to edit it; some time soon, I may come to need datasets similar to what you are looking for. I am not entirely sure how editing privileges work, but if I have any say over it, I will try to approve your edit.
I don't know of any one resource which will provide you with sample inputs for all the types of algorithm that Cormen et al cover but for graph datasets here are a couple of references:
Knuth's Stanford GraphBase
and
the Stanford Large Network Dataset Collection
which I stumbled across while looking for the link to the former. You might be interested in this one too:
the Matrix Market
Why not edit your question and let SO know what other types of input you are looking for?
I am going to stick my neck out and say that I do not know of any such source, and I very much doubt that one exists.
As you seem to be aware, algorithms can be applied to almost any sort of data, and so it would be fruitless to attempt to provide sample data.
Is there a chart or table anywhere that displays a lot of (at least the popular) data structures and algorithms along with their running times and efficiency?
What I am looking for is something that I can glance at, and decide which structure/algorithm is best for a particular case. It would be helpful when working on a new project or just as a study guide.
A chart or table isn't going to be a particularly useful reference.
If you're going to be using a particular algorithm or data structure to tackle a problem, you'd better know and understand it inside and out, and that includes knowing (and knowing how to derive) its efficiency. It's not particularly difficult: most standard algorithms have simple, intuitive run-times like N^2, N*logN, etc.
That being said, run-time Big-O isn't everything. Take sorting, for example. Heap sort has a better worst-case Big-O than, say, quicksort, yet quicksort performs much better in practice. The constant factors hidden by Big-O can also make a huge difference.
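If you want to see the constant-factor effect yourself, a quick and admittedly unscientific experiment is easy to run in Python; the quicksort below is a simple textbook-style version written for illustration (not in-place, and not a claim about any particular library), so the absolute numbers mean little, but it shows how two O(n log n) sorts can differ in practice.

    # Rough timing comparison on random data: heapsort (via heapq) versus a
    # simple recursive quicksort. Both are O(n log n) here; the constants differ.
    import heapq, random, time

    def heapsort(items):
        heap = list(items)
        heapq.heapify(heap)
        return [heapq.heappop(heap) for _ in range(len(heap))]

    def quicksort(items):
        if len(items) <= 1:
            return items
        pivot = items[len(items) // 2]
        less = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        greater = [x for x in items if x > pivot]
        return quicksort(less) + equal + quicksort(greater)

    data = [random.random() for _ in range(200000)]
    for sorter in (heapsort, quicksort):
        start = time.perf_counter()
        sorter(data)
        print(sorter.__name__, round(time.perf_counter() - start, 3), "seconds")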
When you're talking about data structures, there's a lot more to them than meets the eye. For example, a hash map seems like just a tree map with much better performance, but you get additional sorting structure with a tree map.
Knowing which algorithm or data structure is best to use is a matter of knowledge and experience, not a lookup table.
Though back to your question, I don't know of any such reference. It would be a good exercise to make one yourself though. Wikipedia has pretty decent articles on common algorithms/data structures along with some decent analysis.
I don't believe that any such list exists. The sheer number of known algorithms and data structures is staggering, and new ones are being developed all the time. Moreover, many of these algorithms and data structures are specialized, meaning that even if you had a list in front of you it would be difficult to know which ones were applicable for the particular problems you were trying to solve.
Another concern with such a list is how to quantify efficiency. If you were to rank algorithms in terms of asymptotic complexity (big-O), then you might end up putting certain algorithms and data structures that are asymptotically optimal but impractically slow on small inputs ahead of algorithms that are known to be fast for practical cases but might not be theoretically perfect. As an example, consider the median-of-medians algorithm for linear-time order statistics, which has such a huge constant factor that other algorithms tend to be much better in practice. Or consider quicksort, which is O(n^2) in the worst case but has average complexity O(n lg n) and in practice is much faster than many other sorting algorithms.
On the other hand, were you to try to list the algorithms by runtime efficiency, the list would be misleading. Runtime efficiency is based on a number of factors that are machine- and input-specific (such as locality, size of the input, shape of the input, speed of the machine, processor architecture, etc.). It might be useful as a rule of thumb, but in many cases you might be misled by the numbers into picking one algorithm when another is far superior.
There's also implementation complexity to consider. Many algorithms exist only in papers, or have reference implementations that are not optimized or are written in a language that isn't what you're looking for. If you find a Holy Grail algorithm that does exactly what you want but no implementation for it, it might be impossibly difficult to code up and debug your own version. For example, if there weren't a preponderance of red/black tree implementations, do you think you'd be able to code it up on your own? How about Fibonacci heaps? Or (from personal experience) van Emde Boas trees? Often it may be a good idea to pick a simpler algorithm that's "good enough" but easy to implement over a much more complex algorithm.
In short, I wish a table like this could exist that really had all this information, but practically speaking I doubt it could be constructed in a way that's useful. The Wikipedia links from #hammar's comments are actually quite good, but the best way to learn what algorithms and data structures to use in practice is by getting practice trying them out.
Collecting all algorithms and/or data structures is essentially impossible; even as I'm writing this, somebody, somewhere, is undoubtedly inventing a new algorithm or data structure. In the greater scheme of things it's probably not of much significance, but it's still probably new and (ever so slightly) different from anything anybody has done before (though, of course, it's always possible it will turn out to be a big, important thing).
That said, the US NIST has a Dictionary of Algorithms and Data Structures that lists more than most people ever know or care about. It covers most of the obvious "big" ones that everybody knows, and an awful lot of less-known ones as well. The University of Canterbury has another that is (or at least seems to me) a bit more modest, but still covers most of what a typical programmer probably cares about, and is a bit better organized for finding an algorithm to solve a particular problem, rather than being based primarily on already knowing the name of the algorithm you want.
There are also various collections/lists that are more specialized. For example, The Stony Brook Algorithm Repository is devoted primarily (exclusively?) to combinatorial algorithms. It's based on the Algorithm Design Manual, so it can be particularly useful if you have/use that book (and in case you're wondering, this book is generally quite highly regarded).
The first priority for a computer program is correctness; the second, most of the time, is programmer time, something directly linked to maintainability and extensibility.
Because of this, there is a school of programming that advocates just using simple stuff like arrays of records, unless it happens to be a performance-sensitive part, in which case you need to consider not only data structures and algorithms but also the "architecture" that led you to that problem in the first place.
There are various types of trees I know. For example, binary trees can be classified as binary search trees, two trees, etc.
Can anyone give me a complete classification of all the trees in computer science?
Please provide me with reliable references or web links.
It's virtually impossible to answer this question since there are essentially arbitrarily many different ways of using trees. The issue is that a tree is a structure - it's a way of showing how various pieces of data are linked to one another - and what you're asking for is every possible way of interpreting the meaning of that structure. This would be similar, for example, to asking for all uses of calculus in engineering; calculus is a tool with which you can solve an enormous class of problems, but there's no concise way to explain all possible uses of the integral because in each application it is used a different way.
In the case of trees, I've found that there are thousands of research papers describing different tree structures and ways of using trees to solve problems. They arise in string processing, genomics, computational geometry, theory of computation, artificial intelligence, optimization, operating systems, networking, compilers, and a whole host of other areas. In each of these domains they're used to encode specific structures that are domain-specific and difficult to understand without specialized knowledge of the field. No one reference can cover all these areas in any reasonable depth.
In short, you seem to already know the structure of a tree, and this general notion is transferable to any of the above domains. But trying to learn every possible way of using this structure, or all of its applications, would be a Herculean undertaking that no one, not even the legendary Don Knuth, could ever hope to achieve in a lifetime.
Wikipedia has a nice compilation of the various trees at the bottom of the page
The Dictionary of Algorithms and Data Structures has more information.
What specifics are you looking for?