Locality-sensitive hashing (LSH) seems like a great technique for k-nearest-neighbour search, apparently without any disadvantages. But what would be a disadvantage of LSH if someone were using it in industry for practical applications? Under what situations will LSH fail or do somewhat badly? Or does it take a long time to code and tune?
This is a rather broad question, but since you are new here, I will attempt to answer.
LSH is not as perfect as you describe; there are plenty of papers discussing its limitations, so please search for them. This question may also help: How to understand Locality Sensitive Hashing?
Many LSH libraries provide automatic parameter configuration, but not for the most important parameter, R, which is used in solving a randomized version of the R-near-neighbor problem. This is a major drawback, since the user has to identify a suitable R manually for every input. That, in my opinion, is a very important aspect you have to take into account when it comes to practical applications.
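To make the role of R concrete, here is a minimal sketch (my own illustration in Java, not taken from any particular library) of a single p-stable Euclidean LSH hash function; note that the bucket width, and hence the whole hashing scheme, is tied to a target radius R that the user has to choose up front:

```java
import java.util.Random;

/** Minimal sketch of one p-stable (Euclidean) LSH hash function.
 *  h(v) = floor((a . v + b) / w), where the bucket width w is typically
 *  chosen relative to the target radius R, which the user must supply. */
class EuclideanLshFunction {
    private final double[] a;   // random Gaussian projection vector
    private final double b;     // random offset in [0, w)
    private final double w;     // bucket width, tied to the chosen radius R

    EuclideanLshFunction(int dim, double r, Random rng) {
        this.w = 4.0 * r;               // illustrative heuristic only: w proportional to R
        this.a = new double[dim];
        for (int i = 0; i < dim; i++) a[i] = rng.nextGaussian();
        this.b = rng.nextDouble() * w;
    }

    int hash(double[] v) {
        double dot = 0.0;
        for (int i = 0; i < v.length; i++) dot += a[i] * v[i];
        return (int) Math.floor((dot + b) / w);
    }
}
```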
As for performance, it all depends on your input. For example, in my kd-GeRaF project I tested LSH thoroughly and saw that it can have significant issues with both accuracy and search speed. The datasets were high-dimensional, and approximate nearest-neighbor search (ANNS) was performed on them.
I have a very basic and general doubt related to algorithm design. I've learnt basic algorithms and am now learning randomized algorithms. Everywhere I have observed that professors mostly focus on designing algorithms that ultimately try to reduce complexity.
The usual way (from what I have observed) is to learn some basic (or older) algorithm that behaves badly in terms of complexity, and then the objective is to replace it with a newer algorithm that reduces the complexity without affecting the output.
But in most of the algorithms I've studied, especially distributed algorithms (in distributed operating systems) such as algorithms for distributed mutual exclusion or distributed deadlock detection, what I observed is that the design of an algorithm should not focus only on reducing complexity; it should focus on the perfection of the algorithm as well.
Let's take the example of distributed mutual exclusion. The basic algorithm is Lamport's algorithm, and the modified version (with improved message complexity) is the Ricart-Agrawala algorithm. Since communication in a distributed environment is mostly by message passing, distributed mutual exclusion uses three kinds of messages: a) request the critical resource, b) reply to a request, c) release the critical resource. The basic algorithm uses extra release messages (to inform all sites that my site has released the critical resource, so they may enter). In the improved version these release messages are discarded by folding that information into the reply messages, which yields a solution with lower message complexity.
But when I tried implementing these algorithms in Java, I observed that even though the complexity of the basic algorithm was a bit higher, it was more "perfect" than the improved one. By reducing the number of messages transferred (in the improved solution), the local site is no longer explicitly told that a remote site has actually released the resource, because a site only updates its local data structures, such as the request queue, on receipt of a release message. If no explicit release notification is sent, requests can remain pending unnecessarily in the local site's request queue for the entire run.
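For context, the message kinds exchanged by the two algorithms can be sketched roughly as below (a simplified illustration in Java, not my actual implementation); the point is that Ricart-Agrawala drops the RELEASE message and instead defers the REPLY until a site leaves its critical section:

```java
/** Simplified sketch of the message kinds used by the two algorithms. */
enum MutexMessage {
    REQUEST,   // ask all other sites for permission to enter the critical section
    REPLY,     // grant permission (Ricart-Agrawala defers this until after its own CS)
    RELEASE    // Lamport only: explicit notification that the CS has been released
}

/** Illustrative handler skeleton: in Ricart-Agrawala the "release" information
 *  is implicit in the deferred REPLY, so there is no separate RELEASE message. */
class RicartAgrawalaSite {
    private final java.util.Queue<Integer> deferredReplies = new java.util.ArrayDeque<>();

    void onExitCriticalSection() {
        // Instead of broadcasting RELEASE (as Lamport's algorithm does), send the
        // replies that were deferred while this site was requesting or inside its CS.
        while (!deferredReplies.isEmpty()) {
            sendReply(deferredReplies.poll());
        }
    }

    private void sendReply(int siteId) { /* network send omitted in this sketch */ }
}
```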
So my doubt is: if reducing complexity is so important, why not perfection? If an algorithm produces perfect results at the cost of slightly higher complexity, why does that matter, as long as I get perfection in the output, compared with the lower-complexity solution that lacks it?
Note: by perfection I don't mean correct versus incorrect results. The results are always correct; only the perfection or accuracy of the produced result varies.
In principle, a fair complexity comparison is done between two algorithms that produce exactly the same output, e.g. sorting.
Your case is different: you describe algorithms with different behaviour.
Many factors decide which algorithm is better suited:
Ease of implementation (very important in practice).
A faster algorithm that lacks some functionality, as in your case, must be dramatically faster (say a factor of 10 on the expected data volume), or easier to implement, to be worth choosing.
Robustness: a well-known algorithm, successfully used for 10 years, versus a new algorithm from a paper where the chances are high that it only works in the environment (optimized for the algorithm) set up by the scientist. (I know of such a case for a telecom network algorithm.)
Consider any NP-complete problem (e.g. the travelling salesman problem).
There are no known polynomial-time exact algorithms for these problems (except in special cases), so it can literally take years (or much longer) to find an exact solution for any reasonably sized instance.
So, instead we use heuristics and approximations (and possibly some randomness) to get a non-exact solution in a reasonable time-frame.
NP-complete problems are just an extreme example: we may also have only a few seconds to produce a solution (for whatever reason) when finding an exact one would take a few minutes. So it all comes down to balancing how long we want to run the algorithm against how good we want the results to be (and development time certainly plays a role as well).
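As a concrete illustration of such a heuristic (my own example, not tied to the discussion above): a greedy nearest-neighbour tour for the travelling salesman problem produces a usable, non-optimal answer in O(n^2) time instead of examining all n! tours.

```java
import java.util.ArrayList;
import java.util.List;

/** Greedy nearest-neighbour heuristic for TSP: fast, but with no optimality guarantee. */
class NearestNeighbourTsp {
    /** dist is a symmetric n x n distance matrix; returns a tour as a list of city indices. */
    static List<Integer> tour(double[][] dist) {
        int n = dist.length;
        boolean[] visited = new boolean[n];
        List<Integer> tour = new ArrayList<>();
        int current = 0;                     // arbitrary start city
        visited[current] = true;
        tour.add(current);
        for (int step = 1; step < n; step++) {
            int next = -1;
            for (int city = 0; city < n; city++) {
                if (!visited[city] && (next == -1 || dist[current][city] < dist[current][next])) {
                    next = city;             // closest unvisited city so far
                }
            }
            visited[next] = true;
            tour.add(next);
            current = next;
        }
        return tour;                         // approximate; no optimality guarantee
    }
}
```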
I hope I understood what you were asking correctly and that this helps.
Instead of "perfection", maybe you should consider "fitness for a particular purpose".
For your example of a distributed mutual exclusion algorithm, consider the "simple" and "improved" algorithms from different viewpoints. As another answer pointed out, the two algorithms behave differently; my point is that different people are interested in different aspects of that behavior.
Someone using an algorithm for a particular purpose probably does not care about all aspects of its behavior. For your example, you are concerned about pending resource locks. However, if the mutual exclusion algorithm is expected to be running all the time, the user might not care, because the locks will be returned promptly anyway, while using fewer messages than the simple version. If you want both efficiency and promptness, there is likely some way to accommodate both -- at the cost of greater complexity -- and if you're looking for practical "perfection", this is the logical endpoint.
A computer scientist does not know how his algorithm might be used. In general, he cannot anticipate all possible variations on a particular technique, and you would not want to read them all if he could! When publishing an algorithm, clarity of expression is the "perfection" you're pursuing -- the idea should be described as simply as possible.
Given a data structure specification such as a purely functional map with known complexity bounds, one has to pick between several implementations. There is some folklore on how to pick the right one; for example, red-black trees are considered to be generally faster, but AVL trees have better performance on workloads with many lookups.
Is there a systematic presentation (published paper) of this knowledge (as relates to sets/maps)? Ideally I would like to see statistical analysis performed on actual software. It might conclude, for example, that there are N typical kinds of map usage, and list the input probability distribution for each.
Are there systematic benchmarks that test map and set performance on different distributions of inputs?
Are there implementations that use adaptive algorithms to change representation depending on actual usage?
These are basically research topics, and the results are generally given in the form of conclusions, while the statistical data is hidden. One can, however, run a statistical analysis on one's own data.
For the benchmarks, it is better to go through the implementation details.
The third part of the question is a very subjective matter, and the actual usage pattern may never be known at the time of implementation. However, languages like Perl do their best to ship highly optimized implementations of every operation.
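To illustrate what an adaptive representation could look like in principle (a hypothetical sketch of my own, not how Perl or any real library actually does it): a map that keeps small contents in a flat array and switches to a hash table once it grows past a threshold.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical adaptive map: linear scan over a small array while tiny,
 *  promotion to a HashMap once the size crosses a threshold. */
class AdaptiveMap<K, V> {
    private static final int THRESHOLD = 8;    // illustrative cut-over point
    private Object[] keys = new Object[THRESHOLD];
    private Object[] values = new Object[THRESHOLD];
    private int size = 0;
    private Map<K, V> big = null;               // non-null once promoted

    @SuppressWarnings("unchecked")
    V get(K key) {
        if (big != null) return big.get(key);
        for (int i = 0; i < size; i++)
            if (keys[i].equals(key)) return (V) values[i];
        return null;
    }

    @SuppressWarnings("unchecked")
    void put(K key, V value) {
        if (big != null) { big.put(key, value); return; }
        for (int i = 0; i < size; i++)
            if (keys[i].equals(key)) { values[i] = value; return; }
        if (size == THRESHOLD) {                // promote: copy everything into a HashMap
            big = new HashMap<>();
            for (int i = 0; i < size; i++) big.put((K) keys[i], (V) values[i]);
            big.put(key, value);
        } else {
            keys[size] = key;
            values[size] = value;
            size++;
        }
    }
}
```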
The following might be of help:
Purely Functional Data Structures by Chris Okasaki
http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf
Is there a chart or table anywhere that displays a lot of data structures and algorithms (at least the popular ones) with their running times and efficiency?
What I am looking for is something that I can glance at, and decide which structure/algorithm is best for a particular case. It would be helpful when working on a new project or just as a study guide.
A chart or table isn't going to be a particularly useful reference.
If you're going to use a particular algorithm or data structure to tackle a problem, you'd better know and understand it inside and out, and that includes knowing (and knowing how to derive) its efficiency. It's not particularly difficult: most standard algorithms have simple, intuitive run-times like N^2, N*log N, etc.
That being said, run-time Big-O isn't everything. Take sorting, for example: heap sort has a better worst-case Big-O than quicksort (O(N log N) versus O(N^2)), yet quicksort performs much better in practice. Constant factors hidden behind Big-O notation can also make a huge difference.
When you're talking about data structures, there's a lot more to them than meets the eye. For example, a hash map may look like just a tree map with much better performance, but a tree map additionally keeps its keys in sorted order.
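A quick illustration of that ordering difference using the standard Java collections (an example of mine, not from the answer above): TreeMap supports ordered traversal and range queries that HashMap simply does not offer.

```java
import java.util.HashMap;
import java.util.TreeMap;

class MapOrderingDemo {
    public static void main(String[] args) {
        TreeMap<Integer, String> tree = new TreeMap<>();
        HashMap<Integer, String> hash = new HashMap<>();
        for (int key : new int[] {42, 7, 19}) {
            tree.put(key, "v" + key);
            hash.put(key, "v" + key);
        }
        System.out.println(tree.keySet());     // [7, 19, 42] -- always sorted
        System.out.println(hash.keySet());     // some unspecified order
        System.out.println(tree.firstKey());   // 7
        System.out.println(tree.headMap(20));  // {7=v7, 19=v19} -- range query via tree navigation
        // HashMap has no firstKey()/headMap(): constant-time lookups, but no ordering structure.
    }
}
```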
Knowing the best algorithm or data structure to use is a matter of knowledge and experience, not a lookup table.
Coming back to your question, though: I don't know of any such reference. It would be a good exercise to make one yourself. Wikipedia has pretty decent articles on common algorithms and data structures, along with some decent analysis.
I don't believe that any such list exists. The sheer number of known algorithms and data structures is staggering, and new ones are being developed all the time. Moreover, many of these algorithms and data structures are specialized, meaning that even if you had a list in front of you it would be difficult to know which ones were applicable for the particular problems you were trying to solve.
Another concern with such a list is how to quantify efficiency. If you were to rank algorithms in terms of asymptotic complexity (big-O), then you might end up putting algorithms and data structures that are asymptotically optimal but impractically slow on small inputs ahead of algorithms that are known to be fast in practical cases but might not be theoretically perfect. As an example, consider the median-of-medians algorithm for linear-time order statistics, which has such a huge constant factor that other algorithms tend to be much better in practice. Or consider quicksort, which is O(n^2) in the worst case but has average complexity O(n lg n) and is much faster than other sorting algorithms in practice.
On the other hand, were you to list algorithms by empirical runtime efficiency, the list would be misleading. Runtime efficiency depends on a number of machine- and input-specific factors (such as locality, size of the input, shape of the input, speed of the machine, processor architecture, etc.). It might be useful as a rule of thumb, but in many cases you might be misled by the numbers into picking one algorithm when another is far superior.
There's also implementation complexity to consider. Many algorithms exist only in papers, or have reference implementations that are not optimized or are written in a language that isn't what you're looking for. If you find a Holy Grail algorithm that does exactly what you want but no implementation for it, it might be impossibly difficult to code up and debug your own version. For example, if there weren't a preponderance of red/black tree implementations, do you think you'd be able to code it up on your own? How about Fibonacci heaps? Or (from personal experience) van Emde Boas trees? Often it may be a good idea to pick a simpler algorithm that's "good enough" but easy to implement over a much more complex algorithm.
In short, I wish a table like this could exist that really had all this information, but practically speaking I doubt it could be constructed in a way that's useful. The Wikipedia links from #hammar's comments are actually quite good, but the best way to learn what algorithms and data structures to use in practice is by getting practice trying them out.
Collecting all algorithms and/or data structures is essentially impossible -- even as I'm writing this, there's undoubtedly somebody, somewhere, inventing some new algorithm or data structure. In the greater scheme of things it's probably not of much significance, but it's still probably new and (ever so slightly) different from anything done before (though, of course, it's always possible it'll turn out to be a big, important thing).
That said, the US NIST has a Dictionary of Algorithms and Data Structures that lists more than most people ever know or care about. It covers most of the obvious "big" ones that everybody knows, and an awful lot of less-known ones as well. The University of Canterbury has another that is (or at least seems to me) a bit more modest, but still covers most of what a typical programmer probably cares about, and is a bit better organized for finding an algorithm to solve a particular problem, rather than being based primarily on already knowing the name of the algorithm you want.
There are also various collections/lists that are more specialized. For example, The Stony Brook Algorithm Repository is devoted primarily (exclusively?) to combinatorial algorithms. It's based on the Algorithm Design Manual, so it can be particularly useful if you have/use that book (and in case you're wondering, this book is generally quite highly regarded).
The first priority for a computer program is correctness, and the second, most of the time, is programmer time - something directly linked to maintainability and extensibility.
Because of this, there is a school of programming that advocates just using simple stuff like arrays of records, unless it happens to be a performance-sensitive part, in which case you need to consider not only data structures and algorithms but also the "architecture" that led you to that problem in the first place.
I did a little GP (note: very little) work in college and have been playing around with it recently. My question is about the initial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work better when you have a noisy fitness function; I think this is because the growth of sub-groups in the population over successive generations acts to give more sampling of the fitness function. I typically use 100 for less noisy/deterministic functions, 1000+ for noisy ones.
For the number of generations it is best to measure improvements in the fitness function and stop when it meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out; if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an implementation without explicit limits but penalise or eliminate programs that run too long (which is probably what you really care about....). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation / recombination operators don't seem to matter all that much. As long as you have a comprehensive set of mutations, any reasonably balanced distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements, so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities. (A rough configuration sketch pulling these numbers together follows below.)
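Pulling the rough numbers above together, a hypothetical run configuration might look like the sketch below (the field names and defaults are my own illustration, not any particular GP framework's API):

```java
/** Hypothetical GP run configuration, roughly matching the ranges discussed above. */
class GpRunConfig {
    int populationSize = 1000;        // ~100 for clean fitness functions, 1000+ for noisy ones
    int maxGenerations = 300;         // a few hundred, but stop early once the target is met
    double targetFitness = 0.95;      // stop criterion on the fitness function
    int maxTotalNodes = 1000;         // hard cap on program size instead of a strict depth limit
    long maxEvalMillis = 50;          // penalise/kill programs that run too long
    double crossoverRate = 0.7;       // the exact operator split matters less than having
    double mutationRate = 0.2;        // a comprehensive, reasonably balanced set
    double reproductionRate = 0.1;

    boolean shouldStop(int generation, double bestFitness) {
        return generation >= maxGenerations || bestFitness >= targetFitness;
    }
}
```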
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
Any problem in computer science can be solved with another layer of indirection (except for too many layers of indirection.)
-- David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data by varying parameters on a very simple problem and link given operators and parameter values (such as mutation rates, etc.) to given results as a function of population size and so on.
Once I started getting into GAs a bit more, I realized that, given the enormous number of variables, this is a huge task and generalization is extremely difficult.
Speaking from my (limited) experience: if you decide to simplify the problem, fix the way you implement crossover and selection, and just play with population size and mutation rate (implemented in a given way) in order to come up with general results, you will soon realize that too many variables are still in play. At the end of the day, the number of generations after which you statistically get a decent result (however you want to define "decent") still depends primarily on the problem you are solving, and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of the effect of given GA parameters!).
It is certainly possible to draft a set of guidelines, as the (rare but good) literature proves, but you will only be able to generalize the results effectively in statistical terms when the problem at hand can be encoded in exactly the same way and the fitness is evaluated in a somehow equivalent way (which more often than not means you are dealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community -
Some regard populations in the (low) thousands as sufficient, whereas Koza and others often don't deem it worthwhile to start a GP run with fewer than a million individuals in the GP population ;-)
As mentioned before, it depends on your personal taste and experience, resources, and probably the GP system used!
Cheers,
Jan
I'm not interested in tiny optimizations giving few percents of the speed.
I'm interested in the most important heuristics for alpha-beta search, and in the most important components of an evaluation function.
I'm particularly interested in algorithms that have the greatest (improvement/code_size) ratio
(NOT (improvement/complexity)).
Thanks.
PS
Killer move heuristic is a perfect example - easy to implement and powerful.
Database of heuristics is too complicated.
Not sure if you're already aware of it, but check out the Chess Programming Wiki - it's a great resource that covers just about every aspect of modern chess AI. In particular, relating to your question, see the Search and Evaluation sections (under Principle Topics) on the main page. You might also be able to discover some interesting techniques used in some of the programs listed here. If your questions still aren't answered, I would definitely recommend you ask in the Chess Programming Forums, where there are likely to be many more specialists around to answer. (Not that you won't necessarily get good answers here, just that it's rather more likely on topic-specific expert forums).
MTD(f) or one of the MTD variants is a big improvement over standard alpha-beta, provided you don't have really fine detail in your evaluation function and assuming that you're using the killer heuristic. The history heuristic is also useful.
The top-rated chess program Rybka has apparently abandoned MTD(f) in favour of PVS with a zero-aspiration window on the non-PV nodes.
Extended futility pruning, which incorporates both normal futility pruning and deep razoring, is theoretically unsound, but remarkably effective in practice.
Iterative deepening is another useful technique. And I listed a lot of good chess programming links here.
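For reference, the MTD(f) driver mentioned above is only a few lines wrapped around a zero-window alpha-beta; here is a rough sketch, where alphaBetaWithMemory is an assumed helper (an alpha-beta search backed by a transposition table) that is not shown:

```java
/** Sketch of the MTD(f) driver loop (after Plaat et al.): repeated zero-window
 *  alpha-beta searches that converge on the minimax value. */
abstract class MtdF {

    /** Assumed helper: an alpha-beta search that stores and reuses bounds in a
     *  transposition table (essential for MTD(f) to be efficient). Not shown here. */
    abstract int alphaBetaWithMemory(Board root, int alpha, int beta, int depth);

    int mtdf(Board root, int firstGuess, int depth) {
        int g = firstGuess;
        int lowerBound = Integer.MIN_VALUE / 2;
        int upperBound = Integer.MAX_VALUE / 2;
        while (lowerBound < upperBound) {
            int beta = Math.max(g, lowerBound + 1);
            g = alphaBetaWithMemory(root, beta - 1, beta, depth);  // zero-window probe
            if (g < beta) upperBound = g; else lowerBound = g;     // narrow the bracket
        }
        return g;
    }
}

/** Placeholder position type for the sketch. */
class Board {}
```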
Even though many optimizations based on heuristics (I mean ways to increase the effective tree depth without actually searching) are discussed in the chess programming literature, I think most of them are rarely used. The reason is that they are good performance boosters in theory, but not in practice.
Sometimes these heuristics can also return a bad (that is, not the best) move.
The people I have talked to always recommend optimizing the alpha-beta search and implementing iterative deepening rather than trying to add the other heuristics.
The main reason is that computers keep increasing in processing power, and research [citation needed, I suppose] has shown that programs which use their full CPU time to brute-force the alpha-beta tree to the maximum depth have consistently outperformed programs that split their time between a certain depth of alpha-beta and then some heuristics.
Even though using some heuristics to extend the tree depth can cause more harm than good, there are many performance boosters you can add to the alpha-beta search algorithm.
I am sure you are aware that for alpha-beta to work exactly as intended you need a move-ordering mechanism; iterative deepening provides one and can give you roughly a 10% performance boost.
Adding the principal variation search (PVS) technique to alpha-beta may give you an additional 10% boost.
Try the MTD(f) algorithm too. It can also increase the performance of your engine.
One heuristic that hasn't been mentioned is Null move pruning.
Also, Ed Schröder has a great page explaining a number of tricks he used in his Rebel engine, and how much improvement each contributed to speed/performance: Inside Rebel
Using a transposition table with a Zobrist hash
It takes very little code to implement [one XOR on each move or unmove, and an if statement before recursing in the game tree], and the benefits are pretty good, especially if you are already using iterative deepening, and it's quite tweakable (use a bigger table, a smaller table, different replacement strategies, etc.).
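Here is a rough sketch of those two pieces, assuming a precomputed table of random keys per (square, piece) pair; it is my own illustration of the standard technique, not code from any particular engine:

```java
import java.util.Random;

/** Sketch of incremental Zobrist hashing plus a tiny transposition-table probe. */
class ZobristSketch {
    static final int SQUARES = 64, PIECE_KINDS = 12;
    static final long[][] KEYS = new long[SQUARES][PIECE_KINDS];
    static {
        Random rng = new Random(2024);                 // fixed seed so keys are reproducible
        for (int sq = 0; sq < SQUARES; sq++)
            for (int p = 0; p < PIECE_KINDS; p++)
                KEYS[sq][p] = rng.nextLong();
    }

    long hash = 0L;

    /** One XOR per placement or removal: the same call both makes and unmakes. */
    void togglePiece(int square, int pieceKind) {
        hash ^= KEYS[square][pieceKind];
    }

    /** Minimal transposition table: hash -> stored search result. */
    static final int TT_SIZE = 1 << 20;
    final long[] ttKey = new long[TT_SIZE];
    final int[] ttScore = new int[TT_SIZE];
    final int[] ttDepth = new int[TT_SIZE];

    /** The "if statement before recursing": reuse a stored score if the position was
     *  searched at least as deeply as we need now (bound handling omitted in this sketch). */
    Integer probe(int depthNeeded) {
        int slot = (int) (hash & (TT_SIZE - 1));
        if (ttKey[slot] == hash && ttDepth[slot] >= depthNeeded) return ttScore[slot];
        return null;
    }

    void store(int depth, int score) {
        int slot = (int) (hash & (TT_SIZE - 1));
        ttKey[slot] = hash; ttDepth[slot] = depth; ttScore[slot] = score;  // always-replace strategy
    }
}
```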
Killer moves are a good example of small code size and a great improvement in move ordering.
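As a rough sketch of the idea (a generic illustration, not from a specific engine): remember a couple of quiet moves that caused a beta cutoff at each ply and try them early when ordering moves at that ply.

```java
/** Sketch of the killer-move heuristic: per-ply storage of recent cutoff moves. */
class KillerMoves {
    static final int MAX_PLY = 64;
    private final int[][] killers = new int[MAX_PLY][2];   // two killer slots per ply, moves encoded as ints

    /** Called when a quiet move causes a beta cutoff at the given ply. */
    void record(int ply, int move) {
        if (killers[ply][0] != move) {          // keep the two most recent distinct killers
            killers[ply][1] = killers[ply][0];
            killers[ply][0] = move;
        }
    }

    /** Used during move ordering: killer moves get searched right after hash/capture moves. */
    boolean isKiller(int ply, int move) {
        return killers[ply][0] == move || killers[ply][1] == move;
    }
}
```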
Most board game AI algorithms are based on minimax (http://en.wikipedia.org/wiki/Minmax). The goal is to minimize the opponent's options while maximizing your own. With chess, though, this is a very large and expensive runtime problem. To help reduce it you can combine minimax with a database of previously played games: any game with a similar board position and an established pattern for how that layout was won for your colour can be used for "analyzing" where to move next.
I am a bit confused about what you mean by improvement/code_size. Do you really mean improvement / runtime analysis (big O(n) vs. o(n))? If that is the case, talk to IBM and Big Blue, or Microsoft's Parallels team. At PDC I spoke with a guy (whose name escapes me now) who was demonstrating Mahjong using 8 cores per opponent, and they won first place in the game algorithm design competition (whose name also escapes me).
I do not think there are any "canned" algorithms out there that always win at chess and do it very fast. The way you would have to do it is to have EVERY possible previously played game indexed in a very large dictionary-based database and pre-cache the analysis of every game. It would be a VERY complex algorithm and, in my opinion, a very poor improvement/complexity trade-off.
I might be slightly off topic, but "state of the art" chess programs, Deep Blue being a famous example, rely on massive parallelism (e.g. MPI-style message passing) for raw power.
Just consider that parallel processing plays a great role in modern chess.