Does the number of training documents affect classification time? I know that for k-NN essentially all of the computation is carried out at classification time, while little or no work is done in training. Is the same true of SVM, Naive Bayes, decision trees, etc.?
Only lazy classifiers have this characteristic, and KNN is one of them.
SVM - classification time depends on the number of support vectors, which may, but does not have to, depend on the number of training documents (the number of training documents is an upper bound on the number of SVs)
Naive Bayes - there is no impact unless the new documents carry many new words. NB classification time is O(number of features), so as long as you do not enlarge the vocabulary (in the case of a BOW model) you are safe to use as much training data as you like
Decision Tree - the same as for NB: it depends only on the number of features (and the complexity of the problem, which does not change with the number of instances)
Neural Network - here classification time depends only on the number of neurons
KNN is a straightforward algorithm that is easy to implement:
# for each test data point in X_test:
# calculate its distance from every point in X_train
# find the k closest points
# take a majority vote of the k neighbors and use that as the prediction for this test data point
However, I think the time complexity is not good enough. How is the algorithm optimized in real implementations? (For example, what tricks or data structures does it use?)
The k-nearest neighbor algorithm differs from other learning methods because no
model is induced from the training examples. The data remains as they are; they
are simply stored in memory.
A genetic algorithm can be combined with k-NN to improve performance. Another
successful technique, known as instance selection, has been proposed to address
both the storage cost and the noise sensitivity of k-NN. You can try this: when
a new instance has to be classified, instead of involving all training instances
to retrieve the k neighbors, which increases the computing time, first select a
smaller subset of instances and search only that subset.
You can also try:
Improving k-NN speed by reducing the number of training documents
Improving k-NN by neighborhood size and similarity function
Improving k-NN by advanced storage structures
What you describe is the brute force kNN calculation with O(size(X_test)*size(X_train)*d), where d is the number of dimensions in the feature vectors.
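To make the cost concrete, here is a minimal brute-force sketch in plain Python. The function name `knn_predict` and the toy data are only illustrative; each query scans all of X_train, which is where the size(X_train)*d factor comes from.

```python
import heapq
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Brute-force k-NN for a single query point x."""
    # One distance computation per training point: O(size(X_train) * d).
    distances = ((math.dist(p, x), label) for p, label in zip(X_train, y_train))
    # Keep only the k smallest distances instead of sorting everything.
    nearest = heapq.nsmallest(k, distances, key=lambda t: t[0])
    # Majority vote among the k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters labelled "a" and "b".
X_train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
y_train = ["a", "a", "b", "b", "b"]
X_test = [(0.05, 0.1), (1.0, 0.9)]

# The outer loop over X_test adds the size(X_test) factor.
predictions = [knn_predict(X_train, y_train, x, k=3) for x in X_test]
print(predictions)
```

Using `heapq.nsmallest` avoids a full sort, but it does not remove the linear scan over the training set; that is what the index structures are for.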
More efficient solutions use spatial indexing to build an index on the X_train data. This typically reduces an individual lookup to O( log(size(X_train)) * d) or even O( log(size(X_train)) + d).
Common spatial indexes are:
kD-Trees (they are often used, but scale badly with 'd')
R-Trees, such as the RStarTree
Quadtrees (usually not efficient for large 'd', but the PH-Tree, for example, works well with d=1000 and has excellent remove/insertion times; disclaimer: this is my own work)
BallTrees (I don't really know much about them)
CoverTrees (very fast lookup for high 'd', but long build-up times)
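As an illustration of the indexing idea, here is a minimal kd-tree sketch in plain Python. It is not how any particular library implements it; `build_kdtree` and `nearest` are hypothetical names, and the tree is stored as nested tuples for brevity.

```python
import math

def build_kdtree(points, depth=0):
    """Build a kd-tree as nested tuples: (point, left_subtree, right_subtree)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # the median point becomes the node
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return (distance, point) for the nearest neighbour of query."""
    if node is None:
        return best
    point, left, right = node
    d = math.dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the
    # best distance found so far; this pruning is what saves work.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
dist, neighbour = nearest(tree, (9, 2))
print(neighbour, dist)
```

The pruning test becomes less effective as 'd' grows, because more and more subtrees lie within the current best distance, which is one way to see why kd-trees scale badly with dimension.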
There is also the class of 'approximate' NN searches/queries. These trade correctness for speed: they may skip a few of the closest neighbors. You can find a performance comparison and numerous implementations in python here.
If you are looking for Java implementations of some of the spatial indexes above, have a look at my implementations.
I read this question about finding the closest neighbor for 3-dimensional points. An octree is a solution for this case.
A kd-tree is a solution for low-dimensional spaces (generally fewer than 50 dimensions).
For higher dimensions (vectors of hundreds of dimensions and millions of points) LSH is a popular solution for the AKNN (Approximate K-NN) problem, as pointed out in this question.
However, LSH is popular for K-NN solutions, where K>>1. For example, LSH has been successfully used for Content Based Image Retrieval (CBIR) applications, where each image is represented through a vector of hundreds of dimensions and the dataset is millions (or billions) of images. In this case, K is the number of top-K most similar images w.r.t. the query image.
But what if we are interested only in the single approximate nearest neighbor (i.e. A1-NN) in high-dimensional spaces? Is LSH still the winner, or have ad-hoc solutions been proposed?
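For reference, the LSH idea mentioned above can be sketched as follows. This is one common variant (random hyperplanes, which group vectors by angle), written in plain Python under my own assumptions; the class name `HyperplaneLSH` and the parameters `n_bits` and `n_tables` are hypothetical and not taken from any specific library.

```python
import math
import random
from collections import defaultdict

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

class HyperplaneLSH:
    """Random-hyperplane LSH: vectors with a small angle tend to share buckets."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        # Each table has its own set of n_bits random hyperplanes.
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(n_bits)] for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _key(self, planes, v):
        # One bit per hyperplane: which side of the plane v falls on.
        return tuple(sum(a * b for a, b in zip(p, v)) >= 0 for p in planes)

    def add(self, v):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, v)].append(v)

    def query(self, q):
        """Approximate 1-NN: rank only the points that collide with q."""
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, q), []))
        if not candidates:
            return None  # no collision in any table: the search can miss
        return min(candidates, key=lambda v: cosine_distance(v, q))

rng = random.Random(42)
data = [tuple(rng.gauss(0, 1) for _ in range(16)) for _ in range(200)]
lsh = HyperplaneLSH(dim=16)
for v in data:
    lsh.add(v)
result = lsh.query(data[0])
```

More bits per table make buckets smaller (faster, but more misses); more tables raise the chance that the true neighbor collides with the query in at least one of them. That trade-off is what gets tuned for a given K.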
You might look at http://papers.nips.cc/paper/2666-an-investigation-of-practical-approximate-nearest-neighbor-algorithms.pdf and http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CTPAMI-TPTree.pdf. Both have figures and graphs showing the performance of LSH vs the performance of tree-based methods which also produce only approximate answers, for different values of k including k=1. The Microsoft paper claims that "It has been shown in [34] that randomized KD trees can outperform the LSH algorithm by about an order of magnitude". Table 2 on page 7 of the other paper appears to show speedups over LSH which are reasonably consistent for different values of k.
Note that this is not LSH vs kd-trees. This is LSH vs various cleverly tuned approximate search tree structures. In these you typically search only the most promising parts of the tree, rather than every part that could possibly contain the closest point. To compensate, you search a number of different trees to get a decent probability of finding good points, and you tune various parameters to get the fastest possible performance.
I'm currently working on a Machine Learning project for my Artificial Intelligence exam. The goal is to correctly choose two classification algorithms to compare using WEKA, bearing in mind that these two algorithms must be different enough to give the comparison a reason to be made. Besides, the algorithms must handle both nominal and numeric data (I suppose this is mandatory to let the comparison be made).
My professor suggested choosing a statistical classifier and a decision tree classifier, for example, or delving into a comparison between a bottom-up classifier and a top-down one.
Since I have very little experience in the Machine Learning field, I am doing some research on the various algorithms WEKA offers, and I came across kNN, that is, the k-nearest neighbors algorithm.
Is it statistical? And could it be compared with a Decision Stump algorithm, for example?
Or else, can you suggest a couple of algorithms that match these requirements I have pointed out above?
P. S.: Handled data must be both numerical and nominal. On WEKA there are numerical/nominal features and numerical/nominal classes. Do I have to choose algorithms with both numerical/nominal features AND classes or just one of them?
I would really appreciate any help guys, thanks for your patience!
Based on your professor's description, I would not consider k-Nearest Neighbors (kNN) a statistical classifier. In most contexts, a statistical classifier is one that generalizes via statistics of the training data (either by using statistics directly or by transforming them). An example of this is the Naïve Bayes Classifier.
By contrast, kNN is an example of Instance-Based Learning. It doesn't use statistics of the training data; rather, it compares new observations directly to the training instances to perform classification.
With regard to comparison, yes, you can compare the performance of kNN with a Decision Stump (or any other classifier). Since any two supervised classifiers will each yield a classification accuracy with respect to your training/testing data, you can compare their performance.
Can anyone give some references showing how to determine the maximum likelihood and support vector machine classifiers' computation complexity?
I have been searching the web but can't seem to find good documentation that details how to derive the equations that model the computational complexity of those classifier algorithms.
Thanks
Support vector machines and a number of maximum likelihood fits are convex minimization problems. Therefore they could in theory be solved in polynomial time using the ellipsoid method (http://en.wikipedia.org/wiki/Ellipsoid_method).
I suspect that you can get much better estimates if you consider specific methods. http://www.cse.ust.hk/~jamesk/papers/jmlr05.pdf says that standard SVM fitting on m instances costs O(m^3) time and O(m^2) space. http://research.microsoft.com/en-us/um/people/minka/papers/logreg/minka-logreg.pdf gives costs per iteration for logistic regression but does not give a theoretical basis for estimating the number of iterations. In practice I would hope that this reaches quadratic convergence most of the time and is not too bad.
Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
For the evaluation metric, I am using the reciprocal of a simplified version of the Dunn index. It is simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is used so that the index works in the correct direction (i.e. lower is generally better).
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d according to the figure shown:
Note: I haven't read the paper you linked to yet..
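Since the figure is not reproduced here, the following is only a sketch under an assumption: that d is the perpendicular distance from each point on the curve to the straight line joining the first and last points. The function name `knee_by_max_distance` and the sample values are hypothetical.

```python
import math

def knee_by_max_distance(xs, ys):
    """Return (x, d) for the curve point farthest from the end-to-end chord.

    Assumption: d is the perpendicular distance to the line through the
    first and last points of the curve.
    """
    x0, y0 = xs[0], ys[0]
    x1, y1 = xs[-1], ys[-1]
    length = math.hypot(x1 - x0, y1 - y0)

    def distance(x, y):
        # Point-to-line distance via the cross product of the two vectors.
        return abs((y1 - y0) * (x - x0) - (x1 - x0) * (y - y0)) / length

    distances = [distance(x, y) for x, y in zip(xs, ys)]
    i = max(range(len(xs)), key=distances.__getitem__)
    return xs[i], distances[i]

# Made-up L-shaped curve: steep drop, then a flat tail.
ks = list(range(1, 11))
metric = [100, 70, 40, 10, 9, 8, 7, 6, 5, 4]
knee_x, knee_d = knee_by_max_distance(ks, metric)
print(knee_x, knee_d)
```

Because this is a geometric distance, the result depends on how the two axes are scaled, so it may be worth normalising both to a common range first.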