I'm back with a question on data mining with Weka and WekaSharp. Through WekaSharp I have been doing some analysis on a fairly large dataset, the KDD Cup 1999 10% database (~70 MB). I have had good results with the J48 decision tree and Naive Bayes algorithms, each taking between 10 and 30 minutes to complete. When I run the same data through the KNN algorithm, it never finishes the analysis; it does not error out, it simply runs forever. I have tried all different parameters with no effect. When I run the same KNN algorithm on a smaller sample dataset such as iris.arff, it finishes with no difficulty. Here is the setup I have for the KNN parameters:
"-K 1 -W 0 -A \"weka.core.neighboursearch.KDTree -A \\"weka.core.EuclideanDistance -R first-last\\"\""
Is there an inherent issue with KNN and large datasets or is there a setup issue? Thank you very much.
kNN is subject to the "curse of dimensionality": spatial queries on high-dimensional datasets cannot be optimized the way queries on lower-dimensional datasets can, so they effectively degrade into brute-force searches.
Naive Bayes laughs at dimensionality because it treats each dimension independently, and many decision tree variants are also fairly good at dealing with high-dimensional data. kNN does not like high-dimensional data. Expect to wait a long time.
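To get a feel for the effect outside Weka, here is a minimal sketch; it assumes scikit-learn and synthetic data rather than WekaSharp and the KDD Cup file, and simply contrasts a KD-tree-backed neighbour search with an explicit brute-force one as the number of dimensions grows:

```python
# Hedged sketch: scikit-learn stand-in for the Weka IBk/KDTree setup.
# Synthetic data is used; the KDD Cup 1999 file is not loaded.
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

for n_dims in (4, 40):
    X = rng.random((50_000, n_dims))
    for algorithm in ("kd_tree", "brute"):
        nn = NearestNeighbors(n_neighbors=1, algorithm=algorithm).fit(X)
        start = time.perf_counter()
        nn.kneighbors(X[:1_000])            # query a small probe set
        elapsed = time.perf_counter() - start
        print(f"{n_dims:>2} dims, {algorithm:<7}: {elapsed:.2f} s")

# In low dimensions the KD-tree wins clearly; as the dimensionality grows its
# pruning stops working and query cost approaches the brute-force search.
```

If Weka's KDTree search behaves the same way on the KDD data's dozens of attributes, the run is probably not hung, it is just doing an enormous amount of work; capping the dataset size or the number of attributes is usually the quicker fix.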
I want to know if there is a spatial partitioning data structure better suited for a placement system than a quadtree. By better suited I mean a structure with O(log n) or better time complexity for search queries while using less memory. I want to know what data structure can organize my data in such a way that querying it is faster than a quadtree. It's all 2D, and it's all rectangles which should never overlap. I currently have a quadtree that works great and is fast; I am just curious whether there is a data structure that uses fewer resources and is faster than a quadtree for this case.
The fastest is probably brute forcing it on a GPU.
Also, it is really worth trying out different implementations; I found the performance differences between implementations to be absolutely wild.
Another tip: measure performance with realistic data (potentially multiple scenarios); data and usage characteristics can have an enormous influence on index performance.
Some of these characteristics are (you already mentioned "rectangle data" and "2D"):
How big is your dataset?
How much overlap do you have between rectangles?
Do you need to update data often?
Do you have a large variance between small and large rectangles?
Do you have dense clusters of rectangles?
How large is the area you cover?
Are your coordinates integers or floats?
Is it okay if the execution time of operations varies or should it be consistent?
Can you pre-load data? Do you need to update the index?
Quadtrees can be a good initial choice. However, they have some problems, e.g.:
They can get very deep (and inefficient) with dense clusters
They don't work very well when there is a lot of overlap between rectangles
Update operations may take longer if nodes are merged or split.
Another popular choice is the R-tree (I found R*-trees to be the best). Some properties:
Balanced (good for predictable search time but bad because update times can be very unpredictable due to rebalancing)
Quite complex to implement.
R-trees can also be bulk-loaded (loading takes longer but queries become faster); this variant is called the STR-tree (Sort-Tile-Recursive R-tree). A small usage sketch follows at the end of this answer.
It may be worth looking at the PH-Tree (disclaimer: self advertisement):
Similar to a quadtree, but the depth is limited to the bit width of the data (usually 32 or 64 bits).
No rebalancing. Merging or splitting is guaranteed to move only one entry (=cheap)
Prefers integer coordinates but works reasonably well with floating point data as well.
Implementations can be quite space efficient (they don't need to store all bits of the coordinates). However, not all implementations support that, and the effect varies and is strongest with integer coordinates.
I made some measurements here. The measurements include a 2D dataset where I store line segments from OpenStreetMap as boxes, the relevant diagrams are labeled with "OSM-R" (R for rectangles).
Fig. 3a shows timings for inserting a given amount of data into a tree
Fig. 9a shows memory usage
Fig. 15a shows query times for queries that return on average 1000 entries
Fig. 17a shows how query performance changes when varying the query window size (on an index with 1M entries)
Fig. 41a shows average times for updating an index with 1M entries
PH/PHM is the PH-Tree, PHM has coordinates converted to integer before storing them
RSZ/RSS are two different R-Tree implementations
STR is an STR-Tree
Q(T)Z is a quadtree
In case you are using Java, have a look at my spatial index collection.
Similar collections exist for other programming languages.
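To make the R-tree/STR discussion above a bit more concrete, here is a minimal sketch; it assumes the Python rtree package (a libspatialindex wrapper) rather than any of the Java libraries mentioned, and the rectangles are made up:

```python
# Hedged sketch: bulk-load an R-tree of small, non-overlapping rectangles
# and run a window query. Uses the Python "rtree" package (libspatialindex).
import random
from rtree import index

random.seed(0)

def rect_stream(n):
    """Yield (id, (minx, miny, maxx, maxy), obj) tuples for bulk loading."""
    for i in range(n):
        x, y = random.uniform(0, 1000), random.uniform(0, 1000)
        yield (i, (x, y, x + 2, y + 2), None)

# Passing a generator to the constructor lets the library bulk-load the tree,
# which is usually faster and better packed than inserting one rectangle at a time.
idx = index.Index(rect_stream(100_000))

# All rectangles intersecting the query window.
hits = list(idx.intersection((100.0, 100.0, 120.0, 120.0)))
print(len(hits), "rectangles intersect the window")
```

Swapping in one of the Java indexes from the collection above looks structurally the same: build once, then issue window queries.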
I am using t-SNE to make a 2D projection for visualization from a higher dimensional dataset (in this case 30-dims) and I have a question about the perplexity hyperparameter.
It's been a while since I used t-SNE, and I had previously only used it on smaller datasets (<1000 data points), where the advised perplexity of 5-50 (van der Maaten and Hinton) was sufficient to display the underlying data structure.
Currently, I am working with a dataset of 340,000 data points, and since perplexity influences the balance between local and non-local representation of the data, I feel that more data points would require a perplexity much higher than 50 (especially if the data is not highly segregated in the higher-dimensional space).
Does anyone have any experience with setting the optimal perplexity on datasets with a larger number of data points (>100k)?
I would be really interested to hear your experiences and which methods you use to determine the optimal perplexity (or optimal perplexity range).
An interesting article suggests that the optimal perplexity follows a simple power law (~N^0.5); I would be interested to know what others think about that.
Thanks for your help
Largely this is empirical, and so I recommend just playing around with values. But I can share my experience...
I had a dataset of about 400k records, each with ~70 dimensions. I reran scikit-learn's implementation of t-SNE with perplexity values 5, 15, 50 and 100, and noticed that the clusters looked the same from 50 onwards. I gathered that 5-15 was too small, 50 was enough, and increasing the perplexity further didn't make much difference. The run time was a nightmare, though.
The openTSNE implementation is much faster, and offers an interesting guide on how to use smaller and larger perplexity values at different stages of the same run of the algorithm in order to get the advantages of both. Loosely speaking, it initialises the algorithm with a high perplexity for a small number of steps to find the global structure, then continues with the lower perplexity.
I used this implementation on a dataset of 1.5 million records with ~200 dimensions. The data came from the same source system as the first dataset I mentioned. I didn't play with the perplexity values here because the total runtime on a 32-CPU VM was several hours, but the clusters looked remarkably similar to the ones on the smaller dataset (it recreated binary-classification-esque distinct clusters), so I was happy.
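A minimal sketch of that kind of perplexity sweep, assuming scikit-learn and a random subsample of the data; the subsample size is arbitrary, and the ~N^0.5 value from the question is included only as one more candidate to eyeball:

```python
# Hedged sketch: sweep a few perplexity values on a subsample and compare the
# embeddings visually. Random data stands in for the real 30-dimensional set.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(340_000, 30))                 # placeholder for the real data

# t-SNE is slow, so tune perplexity on a random subsample first.
sub = X[rng.choice(len(X), size=20_000, replace=False)]

candidates = [5, 15, 50, 100, int(np.sqrt(len(sub)))]   # last one: sqrt(N) rule
embeddings = {}
for perplexity in candidates:
    embeddings[perplexity] = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",
        random_state=0,
    ).fit_transform(sub)
    # Plot each embedding (e.g. with matplotlib) and keep the smallest
    # perplexity beyond which the cluster structure stops changing.
```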
I have about 44 million training examples across about 6200 categories.
After training, the model comes out to be ~450 MB.
While testing with 5 parallel mappers (each given enough RAM), classification proceeds at a rate of ~4 items a second, which is WAY too slow.
How can I speed things up?
One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
Another way I thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
Any other thoughts?
Edit:
After some of the answers given below, I started contemplating some form of down-sampling: running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those "highly" close groups plus other samples that are not as tightly close to one another.
I also started thinking about using some form of data normalization that incorporates edit distances over n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html).
I'm also considering using the Hadoop streaming API to leverage some of the ML libraries available in Python listed here http://pydata.org/downloads/ and here http://scikit-learn.org/stable/modules/svm.html#svm (these, I think, use the liblinear mentioned in one of the answers below).
Prune stopwords and otherwise useless words (too low support etc.) as early as possible.
Depending on how you use clustering, it may actually make the test phase in particular even more expensive.
Try other tools than Mahout. I found Mahout to be really slow in comparison; it seems to come with a really high overhead somewhere.
Using fewer training examples would be an option. You will see that beyond a certain number of training examples, your classification accuracy on unseen examples won't increase any more. I would recommend training with 100, 500, 1000, 5000, ... examples per category and using 20% for cross-validating the accuracy. When accuracy stops increasing, you have found the amount of data you need, which may be a lot less than you use now (a sketch of such an experiment follows below).
Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout.
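A rough sketch of that learning-curve experiment, assuming scikit-learn: LinearSVC is backed by liblinear, 20 Newsgroups stands in for the real 44M-document corpus, and the sizes below are total counts rather than the per-category counts suggested above:

```python
# Hedged sketch: accuracy vs. training-set size, to find where more data stops
# helping. 20 Newsgroups is a small stand-in for the real corpus.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC            # liblinear-backed linear SVM

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vec = TfidfVectorizer(stop_words="english", max_df=0.8)   # mirrors maxDFPercent=80
X_train, y_train = vec.fit_transform(train.data), train.target
X_test, y_test = vec.transform(test.data), test.target

order = np.random.default_rng(0).permutation(X_train.shape[0])

for model in (MultinomialNB(), LinearSVC()):
    for n in (500, 1_000, 2_000, 5_000, len(order)):
        subset = order[:n]
        model.fit(X_train[subset], y_train[subset])
        print(f"{type(model).__name__:>13}  n={n:>6}  "
              f"accuracy={model.score(X_test, y_test):.3f}")
    # Stop adding data once the accuracy curve flattens out.
```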
"but i fear losing accuracy" Have you actually tried using less features or less documents? You may not lose as much accuracy as you fear. There may be a few things at play here:
Such a high number of documents are not likely to be from the same time period. Over time, the content of a stream will inevitably drift and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data is just confusing it. You may get much better performance if you train on less data.
The majority of features are not helpful, as #Anony-Mousse said already. You might want to perform some form of feature selection before you train your classifier. This will also speed up training. I've had good results in the past with mutual information.
I've previously trained classifiers for a data set of similar scale and found the system worked best with only 200k features, and using any more than 10% of the data for training did not improve accuracy at all.
PS Could you tell us a bit more about your problem and data set?
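For the feature-selection point above, a minimal sketch assuming scikit-learn; chi-squared scoring is used as a cheap, sparse-friendly stand-in for mutual information, 20 Newsgroups stands in for the real corpus, and the k value is only a placeholder (the 200k figure mentioned above would apply to a full-sized vocabulary):

```python
# Hedged sketch: select the most informative features before training Naive Bayes.
# 20 Newsgroups is a small stand-in for the real 44M-document corpus.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# chi2 is used here instead of mutual information because it is much faster on
# wide, sparse text data; mutual_info_classif also exists but is slow.
model = make_pipeline(
    CountVectorizer(stop_words="english", max_df=0.8),  # prune stopwords, maxDF=80%
    SelectKBest(chi2, k=20_000),                        # keep the top-k features
    MultinomialNB(),
)
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))
```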
Edit after question was updated:
Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.
I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run them for every pair of words in the corpus. That's a long time!
I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there's a reason Naive Bayes is called Naive) because they are much faster than the alternative but performance will often be just a tiny bit worse. Let me give you an example from my experience:
RBF SVM: 85% F1 score, training time ~1 month
Linear SVM: 83% F1 score, training time ~1 day
Naive Bayes: 82% F1 score, training time ~1 day
You find the same thing in the literature: paper. Out of curiosity, what kind of accuracy are you getting?
I have around 10K points in 5-dimensional space. We can assume that the points are randomly distributed between (0,0,0,0,0) and (100,100,100,100,100). Clearly, the whole dataset can easily reside in memory.
I would like to know which algorithm for k-nearest-neighbour search would run faster, a kd-tree or an R-tree.
Although I have some very high level idea of these two algorithms, I am not sure which will run faster, and why. I am open to exploring other algorithms if any, which could run fast. Please, if possible, specify why an algorithm may run faster.
This depends on various parameters. Most importantly on your capability to implement these algorithms.
I've personally found bulk-loaded R*-trees to be faster for large data, probably because they have a better fan-out. Bulk-loaded R-trees are the fairer comparison, as kd-trees are commonly bulk-loaded (in fact, they don't support incremental operation very well at all).
For tiny data, kd-trees will likely be faster, plus they are much simpler to implement.
For other things, please refer to this earlier question / answer:
https://stackoverflow.com/a/11109467/1060350
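For data of this size everything fits comfortably in memory, so a minimal sketch with a bulk-built kd-tree (SciPy's cKDTree is assumed here; point count, range and k follow the question) gives a quick baseline to compare any R-tree against:

```python
# Hedged sketch: k-nearest-neighbour queries over 10K random 5-D points
# using a bulk-built kd-tree (SciPy's cKDTree).
import time
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(10_000, 5))

start = time.perf_counter()
tree = cKDTree(points)                       # bulk build over all points
build_time = time.perf_counter() - start

start = time.perf_counter()
dist, idx = tree.query(points, k=10)         # 10 nearest neighbours of every point
query_time = time.perf_counter() - start

print(f"build: {build_time:.3f} s, 10-NN query for all points: {query_time:.3f} s")
```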
I wrote a very simple distributed computing platform (based on the Map/Reduce paradigm), and I'm in the process of writing some demos and showcases. I have a very small team and have to prioritize which demos I'll write first.
To prioritize, I need to rank the demos by weighting roughly 70% how relevant, common, and significant a use case of distributed computing they are, and 30% how easy they are to write.
So far I have it ordered like this:
Discovering pi digits with Monte Carlo
Numerical integration with Monte Carlo
Large matrix multiplication (dense matrices)
Linear regressions
Large matrix inversion
Multiple regressions
Sorting
Clustering (K-Means)
Clustering (Hierarchical)
Number 1 is on the list because it took 10 minutes to write, although it's completely useless (I'm not sure, but I figure there aren't a lot of people trying to find more digits of pi).
Due to the nature of my platform, it will shine most on things that are embarrassingly parallel, and not on I/O-bound or reduce-dominated workloads.
How would you change my list? What would you add to it? Is sorting useful at all in the enterprise world or is it only for benchmarking distributed computing platforms?
Your list suggests that you are not distinguishing between parallel computing and distributed computing. This is not necessarily wrong but someone looking for a demonstration of the excellence of a distributed computing platform might be left tepidly enthused upon seeing parallel computations, such as your items 2 - 5, being performed.
Sorting is certainly useful everywhere there is data: large enterprises, small enterprises, in your desk drawers, across the Googlesphere. So too is searching, which is a surprising omission from your list. The other omission which strikes me immediately is any sort of data fusion, merging large datasets to get information from their intersections beyond what can be extracted from the datasets individually.
I second Mark in that you are mixing distributed computing and HPC. Here are some comments on each of your topics:
(1) There are people trying to compute as many digits of pi as they can, but the Monte Carlo algorithm is completely useless there: its precision scales with the inverse square root of the number of trials, so in order to get one more decimal digit of precision you would need roughly 100 times more trials. There are other algorithms - see if you can implement some of them using Map/Reduce. (A small sketch of the Map/Reduce shape of the Monte Carlo demo, and of this precision limit, is at the end of this answer.)
(2) This one is fine, although seldom used - same problem with precision as (1).
(5) Pure matrix inversions are seldom performed, mainly because of numerical instabilities. How about solving a dense system of linear equations instead?
I would say that you are missing one of the main uses of M/R processing nowadays, namely graph processing (read: social and other network/flow analysis). Also, some more general optimisation problems might be nice, e.g. genetic algorithms.
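As promised above, a minimal sketch of the pi demo in Map/Reduce form; Python's multiprocessing stands in for the actual platform, and the chunk count and size are arbitrary:

```python
# Hedged sketch: Monte Carlo estimate of pi as a map step (count hits per chunk)
# followed by a reduce step (sum the counts). multiprocessing is only a stand-in
# for the real distributed platform.
import random
from multiprocessing import Pool

def count_hits(args):
    """Map: count darts landing inside the unit quarter-circle."""
    seed, n = args
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

if __name__ == "__main__":
    chunks = [(seed, 1_000_000) for seed in range(16)]    # 16 independent map tasks
    with Pool() as pool:
        hits = sum(pool.map(count_hits, chunks))          # reduce: sum the counts
    total = sum(n for _, n in chunks)
    print("pi ~", 4 * hits / total)
    # The standard error shrinks like 1/sqrt(total), so each additional correct
    # decimal digit needs roughly 100x more samples -- hence "useless" for
    # computing many digits of pi, exactly as noted in point (1) above.
```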