When data mining, when should you choose one of these algorithms (decision trees, Naive Bayes, or kNN) over the others? Is there a specific reason? Also, which of them is the most efficient?
I'm going to give a table for example purposes.
One way to choose would be to try all of them and pick the best.
If I were to try to construct data to favour one or the other, here is what I might do.
1) To favour decision trees, have only a few attributes determine the correct answer, with all the others being useless distractions.
2) To favour Naive Bayes, construct 2n+1 attributes by choosing at random either n +1s and n+1 -1s, or n+1 +1s and n -1s, and assigning those values to the attributes at random. Make the right answer be whether the bare majority is +1 or -1.
3) To favour kNN, use two-dimensional data and draw a broad spiral pattern of 1s on a background of 0s, with about equal numbers of 0s and 1s. The right answer is whether you are on a 0 or a 1.
kNN will certainly take up more memory at the time you are making decisions, as you have to remember all the instances instead of boiling them down into weights and tree rules. I would also expect it to take more time at decision time, although there are libraries to attempt to speed this up. Naive Bayes is probably the fastest and smallest.
There are a huge number of different ways to use decision trees, and some very sophisticated developments of the idea, such as random forests, which can take a noticeable amount of time and memory but might do better on some data.
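To make the "try all of them and pick the best" suggestion concrete, here is a minimal sketch (my own illustration, assuming Python with NumPy and scikit-learn are available) that builds the bare-majority data from idea 2 above and cross-validates the three classifiers on it:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 5                      # 2n+1 = 11 attributes
rows = 1000

# Each row is a random shuffle of n (or n+1) +1s and n+1 (or n) -1s;
# the label is the sign of the bare majority (idea 2 above).
X = np.empty((rows, 2 * n + 1))
for i in range(rows):
    ones = n + rng.integers(0, 2)          # n or n+1 attributes set to +1
    row = np.array([1] * ones + [-1] * (2 * n + 1 - ones))
    rng.shuffle(row)
    X[i] = row
y = (X.sum(axis=1) > 0).astype(int)

for name, clf in [("decision tree", DecisionTreeClassifier()),
                  ("naive Bayes", BernoulliNB()),
                  ("kNN", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:13s} accuracy: {scores.mean():.3f}")
```

Swapping in the spiral data from idea 3 or the few-relevant-attributes data from idea 1 is a one-function change, which makes this kind of comparison cheap to run on your own table.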
I'm interested in finding a comparison sorting algorithm that minimizes the number of times each single element is compared with others during the run of the algorithm.
For a randomly sorted list, I'm interested in two distributions: the number of comparisons that are needed to sort a list (this is the traditional criterion) and the number of comparisons in which each single element of the list is involved.
Among the algorithms that have a good performance in terms of the number of comparisons, say achieving O(n log(n)) on average, I would like to find out the one for which, on average, the number of times a single element is compared with others is minimized.
I suppose the theoretical minimum is O(log(n)), which is obtained by dividing the figure above for the total number of comparisons by n.
I'm also interested in the case where data are likely to be already ordered to some extent.
Is perhaps a simulation the best way to go about finding an answer?
(My previous question was put on hold. This is now a very clear question; if you can't understand it, please explain why.)
Yes, you definitely should run simulations.
There you will implicitly set the size and pre-ordering constraints in a way that may allow more specific statements than the general question you raised.
There cannot, however, be a clear answer to such a question in general.
Big-O deals with asymptotic behaviour, while your question seems to target smaller problem sizes. So Big-O can only hint at the best candidates for sufficiently large inputs to a sort run. (But if, for example, you are interested in size <= 5, the results may be completely different!)
To get a proper estimate of the comparison operations, you would need to analyze each individual algorithm.
In the end, the result (for a given algorithm) will necessarily be specific to the data set being sorted.
Also, "on average" is not well defined in your context. I assume you mean the number of comparisons involving the participating objects for a given sort, not an average over a (sufficiently large) set of sort runs.
Even within a single algorithm, the distribution of comparisons an individual object takes part in may show a large standard deviation in one case and be (nearly) uniform in another.
As the complexity of a sorting algorithm is stated in terms of the total number of comparisons (and the position changes they cause), I do not expect theoretical analysis to contribute much to an answer.
Maybe you can add some background on what would make an answer to your question "interesting" in a practical sense?
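In the meantime, here is a rough sketch of how such a simulation could be instrumented (my own Python illustration; the wrapper class and the use of the built-in sort are assumptions, and you would swap in the algorithms you actually care about). It records, per element, how many comparisons that element participates in:

```python
import random

class Counted:
    """Wraps a value and counts how many comparisons it participates in."""
    def __init__(self, value):
        self.value = value
        self.comparisons = 0

    def __lt__(self, other):
        # Python's sort only uses "<", so counting here catches every comparison.
        self.comparisons += 1
        other.comparisons += 1
        return self.value < other.value

def per_element_comparisons(n, trials=100):
    totals = []
    for _ in range(trials):
        data = [Counted(random.random()) for _ in range(n)]
        sorted(data)                      # Timsort; swap in your own algorithm here
        totals.append([x.comparisons for x in data])
    return totals

stats = per_element_comparisons(1000)
flat = [c for run in stats for c in run]
print("mean comparisons per element:", sum(flat) / len(flat))
print("max comparisons for a single element:", max(flat))
```

Running the same harness on partially pre-sorted inputs would cover the "data likely to be already ordered" case from the question.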
Sports tracker applications usually record a timestamp and a location at regular intervals in order to store the entire track. Analytical applications then allow users to find certain statistics, such as the track subsection with the highest speed over a fixed distance (e.g. the time needed for 5 miles), or, vice versa, the longest distance traversed in a certain time span (e.g. the Cooper distance in 12 minutes).
I'm wondering what's the most elegant and/or efficient approach to compute such sections.
In a naive approach, I'd normalize and interpolate the waypoints to get a more fine-grained list of waypoints, either with fixed time intervals or fixed distance steps. Then I'd move a sliding window representing my time span or distance segment over the list and search for the best sub-list matching my criteria. Is there any better way?
Elegance and efficiency are in the eye of the beholder.
Personally, I think your interpolation idea is elegant.
I imagine the interpolation algorithm is easy to build and the search you'll perform on the subsequent data is easy to perform. This can lead to tight code whose correctness can be easily verified. Furthermore, the interpolation algorithms probably already exist and are multi-purpose, so you don't have to repeat yourself (DRY). Your suggested solution has the benefit of separating data processing from data analysis. Modularity of this nature is often considered a component of elegance.
Efficiency - are we talking about speed, space, or lines of code? You could try to combine the interpolation step with the search step to save space, but this will probably sacrifice speed and code simplicity. Certainly speed is sacrificed in the sense that multiple queries cannot take advantage of previous calculations.
When you consider the efficiency of your code, worry not so much about how the computer will handle it, or how you will code it. Think more deeply about the intrinsic time complexity of your approach. I suspect both the interpolation and search can be made to take place in O(N) time, in which case it would take vast amounts of data to bog you down: it is difficult to make an O(N) algorithm perform very badly.
In support of the above, interpolation is just estimating intermediate points between two values, so this is linear in the number of values and linear in the number of intermediate points. Searching could probably be done with a numerical variant of the Knuth-Morris-Pratt Algorithm, which is also linear.
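Here is a minimal sketch of the interpolation-plus-sliding-window idea (my own Python/NumPy illustration; the function name, field layout and the 5-mile target are assumptions, not an established API):

```python
import numpy as np

def fastest_section(timestamps, cum_distance, section_miles=5.0, step=0.01):
    """Find the fastest contiguous section of a fixed length.

    timestamps   -- seconds since start, one per recorded waypoint
    cum_distance -- cumulative distance in miles at each waypoint
    Returns (start_mile, elapsed_seconds) of the best section, or None.
    """
    # Resample: time as a function of distance on a regular distance grid.
    grid = np.arange(0.0, cum_distance[-1], step)
    t = np.interp(grid, cum_distance, timestamps)

    # Slide a window covering exactly `section_miles` over the grid.
    w = int(round(section_miles / step))
    if w >= len(grid):
        return None
    elapsed = t[w:] - t[:-w]              # time needed for each candidate section
    best = int(np.argmin(elapsed))
    return grid[best], float(elapsed[best])

# Hypothetical track: 10 miles at a slightly varying pace.
dist = np.linspace(0, 10, 200)
time = dist * 600 + 30 * np.sin(dist)     # ~10 min/mile with some variation
print(fastest_section(time, dist))
```

Both the resampling and the window scan are linear in the number of grid points, in line with the O(N) estimate above; the fixed-duration variant is the same code with the roles of time and distance swapped.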
Suppose I have n line segments in general position. How can I quickly count, for each of my n segments, how many of the other n-1 it intersects?
I can do this naively in O(n²) time. I can find all intersections using a fairly straightforward sweep line algorithm (Bentley-Ottmann) in O((n + k) log n) time, where k is the number of such intersections, and then aggregate the intersections I found into a bunch of counts.
I don't need to find the intersections, though; I just want to know how many there are. I don't see how to modify the sweep line algorithm to be faster since it needs to reorder two things in a tree for every intersection, and I can't think of any other techniques that don't suffer from the same problem.
I'm also interested in hearing how to count how many total intersections there are.
I have a hard time believing that you can do better than Bentley-Ottmann in the general case. You can simplify the computation a bit if you don't care where the line segments intersect, but I don't see how you could count crossings without finding them.
In essence, Bentley-Ottmann is a way to narrow the search space for intersections. There are other ways, which might work for particular arrangements of line segments, but unless there is some predictable geometric relationship between your segments, you're not going to be able to do better than some clever way of finding candidate intersections combined with an individual verification of each candidate.
Unless your problem domain has some specific features which might provide exploitable substructure, I'd think your best bet for speed would be to adapt Bentley-Ottmann (or some similar sweeping algorithm) for parallel execution. (Clipping the line segments into bands is a simple approach. Figuring out an optimal set of bands would be interesting, too.) Of course, that's a practical rather than an academic exercise; the parallel algorithm might well end up doing more work in total; it just exploits hardware to do that work in (a constant factor) less time.
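For reference, here is a minimal sketch of the naive O(n²) per-segment count that the question mentions as a baseline (my own Python illustration; it assumes general position, so the proper-crossing orientation test alone suffices):

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(a, b):
    p1, p2 = a
    p3, p4 = b
    # Proper crossing test; collinear/touching cases are excluded by general position.
    return (orientation(p1, p2, p3) * orientation(p1, p2, p4) < 0 and
            orientation(p3, p4, p1) * orientation(p3, p4, p2) < 0)

def intersection_counts(segments):
    counts = [0] * len(segments)
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if segments_cross(segments[i], segments[j]):
                counts[i] += 1
                counts[j] += 1
    return counts

segs = [((0, 0), (2, 2)), ((0, 2), (2, 0)), ((3, 0), (3, 1))]
print(intersection_counts(segs))  # [1, 1, 0]
```

A banded or sweep-based version would only change which pairs get fed to segments_cross; the per-pair test stays the same.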
I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast in floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM to chip rate is somewhat limited, and the board to PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O takes (many) seconds and kills the demo.
So I am looking for some kind of calculation with higher complexity than n^3, the higher the better (but preferably easy to program and to explain/understand), to make the computation part more dominant in the time budget, with the data set preferably bounded to about 16 KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
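A minimal sketch of the n-body idea with a classical Runge-Kutta (RK4) integrator follows (Python/NumPy purely for illustration; on the target hardware this would presumably be rewritten for the vendor toolchain, and the body count, masses and step size are arbitrary). Each force evaluation costs O(n²) floating-point work, and RK4 needs four evaluations per step:

```python
import numpy as np

def accelerations(pos, masses, eps=1e-3):
    """Pairwise gravitational accelerations with softening (O(n^2) per call)."""
    diff = pos[None, :, :] - pos[:, None, :]                     # diff[i, j] = x_j - x_i
    dist3 = (np.sum(diff ** 2, axis=-1) + eps ** 2) ** 1.5
    return np.sum(diff * (masses[None, :, None] / dist3[:, :, None]), axis=1)

def rk4_step(pos, vel, masses, dt):
    """One classical Runge-Kutta step for the coupled position/velocity system."""
    k1x, k1v = vel, accelerations(pos, masses)
    k2x, k2v = vel + 0.5 * dt * k1v, accelerations(pos + 0.5 * dt * k1x, masses)
    k3x, k3v = vel + 0.5 * dt * k2v, accelerations(pos + 0.5 * dt * k2x, masses)
    k4x, k4v = vel + dt * k3v, accelerations(pos + dt * k3x, masses)
    new_pos = pos + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
    new_vel = vel + dt / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
    return new_pos, new_vel

rng = np.random.default_rng(1)
n = 256                                   # state fits comfortably in a small per-core budget
pos = rng.standard_normal((n, 3))
vel = rng.standard_normal((n, 3)) * 0.1
masses = rng.uniform(0.5, 1.5, n)
for _ in range(100):
    pos, vel = rk4_step(pos, vel, masses, dt=1e-3)
print(pos.sum())
```

The work per step grows quadratically with the body count while the state itself stays tiny, which is exactly the compute-to-I/O ratio the demo needs.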
What you can do is:
1) Declare a std::vector of int.
2) Populate it with 0 up to N-1 (ascending order, the first permutation in lexicographic order).
3) Keep calling std::next_permutation repeatedly until the elements are sorted again, i.e. next_permutation returns false.
With N integers this will need O(N!) calculations, and it is also deterministic.
PageRank may be a good fit. Articulated as a linear algebra problem, one repeatedly squares a certain floating-point matrix of controllable size until convergence. In the graphical metaphor, one "ripples" change coming into each node onto the other edges. Both treatments can be made parallel.
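A minimal sketch of the matrix-squaring formulation (my own NumPy illustration; the graph size, damping factor and tolerance are arbitrary choices). Each squaring is a dense O(n³) floating-point multiply, which is the part that keeps the cores busy:

```python
import numpy as np

def pagerank_by_squaring(adj, damping=0.85, tol=1e-10, max_iter=60):
    """PageRank via repeated squaring of the Google matrix until convergence."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling nodes jump uniformly.
    P = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    G = damping * P + (1 - damping) / n            # Google matrix
    for _ in range(max_iter):
        G_next = G @ G                             # the floating-point workhorse
        done = np.abs(G_next - G).max() < tol
        G = G_next
        if done:
            break
    return G[0]                                    # every row converges to the PageRank vector

rng = np.random.default_rng(2)
adj = (rng.random((512, 512)) < 0.01).astype(float)
print(pagerank_by_squaring(adj)[:5])
```

The matrix size directly controls how much arithmetic each convergence step costs, which makes the compute/I/O balance easy to tune for the demo.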
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
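A rough sketch of that setup (my own Python/NumPy illustration of an approximate least trimmed squares fit using random restarts plus concentration steps; the polynomial, contamination level and subset size h are arbitrary):

```python
import numpy as np

def lts_polyfit(x, y, degree, h, restarts=50, c_steps=10, seed=0):
    """Approximate least trimmed squares: random restarts + concentration steps.

    h is the size of the subset kept; returns (best coefficients, best trimmed SSR).
    """
    rng = np.random.default_rng(seed)
    best_coef, best_ssr = None, np.inf
    for _ in range(restarts):
        subset = rng.choice(len(x), size=degree + 2, replace=False)
        for _ in range(c_steps):
            coef = np.polyfit(x[subset], y[subset], degree)
            resid = (y - np.polyval(coef, x)) ** 2
            subset = np.argsort(resid)[:h]         # keep the h best-fitting points
        ssr = np.sort(resid)[:h].sum()
        if ssr < best_ssr:
            best_coef, best_ssr = coef, ssr
    return best_coef, best_ssr

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 400)
y = 2 * x**2 - x + 0.5 + 0.05 * rng.standard_normal(x.size)
outliers = rng.choice(x.size, size=80, replace=False)
y[outliers] += 5 * rng.standard_normal(outliers.size)   # contaminate 20% of the samples
print(lts_polyfit(x, y, degree=2, h=300))
```

The restarts are independent of each other, so they parallelize trivially across cores, and the number of fits (restarts times concentration steps) gives a simple knob for the compute load.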
Well, one way to go would be to implement a brute-force solver for the Traveling Salesman problem in some M-dimensional space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square root operation to determine their pythagorean distance apart. Each permutation has (N-1) pairs of points to calculate and add to the total distance.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should be neither too hard nor too costly to parallelize across the cores, though it does add some overhead. (I can demonstrate, if needed; a rough sketch of the serial version follows.)
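A minimal sketch of the brute-force enumeration described above (my own Python illustration; the point count and dimensionality are arbitrary, and it is deliberately left unoptimized):

```python
from itertools import permutations
from math import dist
import random

def brute_force_tsp(points):
    """Try every permutation and return the shortest open tour (no optimizations)."""
    best_order, best_length = None, float("inf")
    for order in permutations(range(len(points))):       # N! candidate tours
        length = sum(dist(points[a], points[b])           # M subtractions, M multiplies,
                     for a, b in zip(order, order[1:]))   # one sqrt per pair; N-1 pairs
        if length < best_length:
            best_order, best_length = order, length
    return best_order, best_length

random.seed(4)
pts = [tuple(random.random() for _ in range(3)) for _ in range(9)]  # 9 points in M=3 space
print(brute_force_tsp(pts))   # 9! = 362,880 tours; deliberately expensive by design
```

Splitting the work across cores can be done by fixing the first city (or the first few cities) per core, so each core enumerates a disjoint slice of the permutations.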
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point numbers, but still). The ones that don't stop you'll have to "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You can usefully use O(gridcells) number of processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia sets and Lyapunov fractals come to mind as potential candidates, but any should do.
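A minimal sketch of the escape-time idea (my own Python/NumPy illustration; on the target hardware you would map grid cells to cores rather than vectorise, and the grid size and cut-off are arbitrary):

```python
import numpy as np

def mandelbrot_grid(width=800, height=800, max_iter=2000, bound=2.0):
    """Escape-time iteration count for every grid cell; cells are fully independent."""
    re = np.linspace(-2.0, 1.0, width)
    im = np.linspace(-1.5, 1.5, height)
    c = re[None, :] + 1j * im[:, None]
    z = np.zeros_like(c)
    counts = np.full(c.shape, max_iter, dtype=np.int32)   # "cut off" for non-escaping points
    for k in range(max_iter):
        mask = np.abs(z) <= bound                          # cells still iterating
        z[mask] = z[mask] ** 2 + c[mask]
        counts[mask & (np.abs(z) > bound)] = k             # record when a cell escapes
    return counts

print(mandelbrot_grid(200, 200, 500).mean())
```

Raising the cut-off scales the floating-point work almost arbitrarily while the per-core data stays tiny, and no inter-core communication is needed at all.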
So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2, if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.
Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (for example the 4th NN to your query, while you are looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
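As a minimal sketch of combining dimensionality reduction with a neighbor search (my own illustration, assuming Python with scikit-learn; the target dimensionality and k are arbitrary), note that the search below is exact in the reduced space, hence approximate with respect to the original 75 dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
X = rng.standard_normal((16000, 75))           # stand-in for the real data set

# Reduce to a dimensionality a tree-based index can still cope with.
X_reduced = PCA(n_components=15).fit_transform(X)

# Exact search in the reduced space (scikit-learn picks a suitable index, e.g. a ball tree).
nn = NearestNeighbors(n_neighbors=3).fit(X_reduced)
dist, idx = nn.kneighbors(X_reduced)           # first column is each point itself

# idx[:, 1:3] now holds the 2 nearest neighbours of every point in the reduced space.
print(idx[:5, 1:])
```

Whether 15 components are enough depends on how much variance the data actually concentrate in the leading directions, so the component count is something to validate rather than assume.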
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation and n log(n) for each sort, but you have to do the sort n times (once per observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees, however I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the Nearest Neighbours array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use indirect partial sorting, for example numpy.argpartition in the NumPy library, to pick out only the closest K values you are interested in. There is no need to sort the entire array.
The cost in Grembo's answer above can be reduced significantly, since you only need the K nearest values and there is no need to sort all of the distances from each point.
If you just need the K neighbours, this method works very well and reduces your computational cost and time complexity.
If you need the K neighbours sorted, sort that output again.
See the documentation for argpartition.
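A minimal sketch of that approach (my own NumPy illustration; the sizes are placeholders, and note that for 16,000 points the full n x n distance matrix is about 2 GB in float64, so you may want to process it in row chunks):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((2000, 75))
k = 2

# Full pairwise squared distances (n x n); fine for a few thousand points.
sq = np.sum(X ** 2, axis=1)
dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour

# argpartition puts the k smallest distances first without a full sort.
neighbours = np.argpartition(dists, k, axis=1)[:, :k]

# If you need them ordered, sort just those k columns.
row = np.arange(len(X))[:, None]
order = np.argsort(dists[row, neighbours], axis=1)
neighbours = neighbours[row, order]
print(neighbours[:5])
```

The partition step is O(n) per row instead of O(n log n), which is where the saving over a full sort comes from.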