Linear regression with a high-dimensional dataset is slow - performance

I am looking for a regression model that's very efficient with a large number of features.
Basically, I am using an OLS model with a dataset that has around 600 features, and I noticed that it's very slow.
Could you kindly suggest other models that are more effective and efficient than OLS with high-dimensional datasets?
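For context, here is a rough sketch of how one might benchmark OLS against a couple of commonly suggested faster alternatives in scikit-learn (Ridge and SGDRegressor); the model choices, data sizes, and parameters are illustrative assumptions, not part of the original question.

```python
# Hypothetical benchmark: OLS vs. two faster alternatives on synthetic data
# with ~600 features (sizes and models are my own assumptions).
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor

X, y = make_regression(n_samples=100_000, n_features=600, noise=0.1, random_state=0)

models = {
    "OLS (LinearRegression)": LinearRegression(),
    "Ridge (cholesky solver)": Ridge(alpha=1.0, solver="cholesky"),
    "SGDRegressor": SGDRegressor(max_iter=20, tol=1e-3, random_state=0),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{name}: {time.perf_counter() - start:.2f} s")
```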

Related

Spatial partition data structure that is better suited for a placement system than a quadtree

I want to know if there is a spatial partition data structure that is better suited for a placement system than a quadtree. By better suited I mean a data structure with O(log n) time complexity or less when search-querying it, and that uses less memory. I want to know what data structure can organize my data in such a way that querying it is faster than with a quadtree. It's all 2D and it's all rectangles, which should never overlap. I currently have a quadtree done and it works great and it's fast; I am just curious to know whether there is a data structure that uses fewer resources and is faster than a quadtree for this case.
The fastest is probably brute forcing it on a GPU.
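For illustration (this is my own sketch, not the answerer's code), a brute-force window query over axis-aligned rectangles is just a vectorized overlap test; written with NumPy it runs on the CPU, and CuPy offers a largely drop-in GPU equivalent.

```python
# Minimal brute-force range query over axis-aligned rectangles (illustrative
# sketch). rects is an (N, 4) array of (min_x, min_y, max_x, max_y); swapping
# numpy for the largely API-compatible cupy moves the same code to the GPU.
import numpy as np

rng = np.random.default_rng(0)
lo = rng.uniform(0, 1000, size=(100_000, 2))
rects = np.hstack([lo, lo + rng.uniform(1, 10, size=(100_000, 2))])

def query(rects, qmin_x, qmin_y, qmax_x, qmax_y):
    """Return indices of rectangles overlapping the query window."""
    overlap = (
        (rects[:, 0] <= qmax_x) & (rects[:, 2] >= qmin_x) &
        (rects[:, 1] <= qmax_y) & (rects[:, 3] >= qmin_y)
    )
    return np.flatnonzero(overlap)

print(query(rects, 100, 100, 110, 110)[:10])
```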
Also, it is really worth trying out different implementations, I found performance differences between implementations to be absolutely wild.
Another tip: measure performance with realistic data (potentially multiple scenarios); data and usage characteristics can have an enormous influence on index performance.
Some of these characteristics are (you already mentioned "rectangle data" and "2D"):
How big is your dataset?
How much overlap do you have between rectangles?
Do you need to update data often?
Do you have a large variance between small and large rectangles?
Do you have dense clusters of rectangles?
How large is the area you cover?
Are your coordinates integers or floats?
Is it okay if the execution time of operations varies or should it be consistent?
Can you pre-load data? Do you need to update the index?
Quadtrees can be a good initial choice. However, they have some problems, e.g.:
They can get very deep (and inefficient) with dense clusters
They don't work very well when there is a lot of overlap between rectangles
Update operations may take longer if nodes are merged or split.
Another popular choice is R-Trees (I found R*-Trees to be the best). Some properties:
Balanced (good for predictable search time but bad because update times can be very unpredictable due to rebalancing)
Quite complex to implement.
R-Trees can also be preloaded (takes longer but allows queries to be faster); this is called an STR-Tree (Sort-Tile-Recursive R-Tree)
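If you want to experiment outside Java, one option is the Python rtree package (a wrapper around libspatialindex); the sketch below is my own illustration of incremental inserts, a window query, and bulk loading from a generator. Treat the detail that the stream constructor triggers packed/STR-style construction as an assumption based on my reading of that library's docs.

```python
# Sketch using the Python `rtree` package (libspatialindex wrapper).
from rtree import index

# Incremental inserts: (id, (min_x, min_y, max_x, max_y))
idx = index.Index()
idx.insert(0, (10.0, 10.0, 12.0, 14.0))
idx.insert(1, (30.0, 5.0, 33.0, 9.0))

# Window query: ids of rectangles intersecting the query box
print(list(idx.intersection((9.0, 9.0, 31.0, 11.0))))

# Bulk loading from a generator (assumption: this uses the library's
# packed/STR-style construction, which is typically faster to query).
def stream():
    for i, box in enumerate([(0, 0, 1, 1), (5, 5, 6, 7), (2, 2, 3, 3)]):
        yield (i, box, None)  # (id, bbox, optional payload)

bulk_idx = index.Index(stream())
print(list(bulk_idx.intersection((0, 0, 10, 10))))
```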
It may be worth looking at the PH-Tree (disclaimer: self advertisement):
Similar to a quadtree, but depth is limited to the bit-width of the data (usually 32 or 64 bits).
No rebalancing. Merging or splitting is guaranteed to move only one entry (=cheap)
Prefers integer coordinates but works reasonably well with floating point data as well.
Implementations can be quite space efficient (they don't need to store all bits of the coordinates). However, not all implementations support that. Also, the effect varies and is strongest with integer coordinates.
I made some measurements here. The measurements include a 2D dataset where I store line segments from OpenStreetMap as boxes, the relevant diagrams are labeled with "OSM-R" (R for rectangles).
Fig. 3a shows timings for inserting a given amount of data into a tree
Fig. 9a shows memory usage
Fig. 15a shows query times for queries that return on average 1000 entries
Fig. 17a shows how query performance changes when varying the query window size (on an index with 1M entries)
Fig. 41a shows average times for updating an index with 1M entries
PH/PHM is the PH-Tree; PHM has coordinates converted to integers before storing them
RSZ/RSS are two different R-Tree implementations
STR is an STR-Tree
Q(T)Z is a quadtree
In case you are using Java, have a look at my spatial index collection.
Similar collections exist for other programming languages.

Optimal perplexity for t-SNE when using larger datasets (>300k data points)

I am using t-SNE to make a 2D projection for visualization from a higher dimensional dataset (in this case 30-dims) and I have a question about the perplexity hyperparameter.
It's been a while since I used t-SNE and had previously only used it on smaller datasets <1000 data points, where the advised perplexity of 5-50 (van der Maaten and Hinton) was sufficient to display the underlying data structure.
Currently, I am working with a dataset with 340,000 data points and feel that as the perplexity influences the local vs non-local representation of the data, more data points would require a perplexity much higher than 50 (especially if the data is not highly segregated in the higher dimensional space).
Does anyone have any experience with setting the optimal perplexity on datasets with a larger number of data points (>100k)?
I would be really interested to hear your experiences and which methods you go about using to determine the optimal perplexity (or optimal perplexity range).
An interesting article suggests that the optimal perplexity follows a simple power law (~N^0.5); I would be interested to know what others think about that.
Thanks for your help
Largely this is empirical, and so I recommend just playing around with values. But I can share my experience...
I had a dataset of about 400k records, each of ~70 dimensions. I reran scikit-learn's implementation of t-SNE with perplexity values 5, 15, 50, 100 and noticed that the clusters looked the same after 50. I gathered that 5-15 was too small, 50 was enough, and increased perplexity didn't make much difference. The run time was a nightmare though.
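For reference, this kind of sweep looks roughly like the sketch below with scikit-learn's TSNE; the subsample size and perplexity grid are my own illustrative choices, not the exact setup described above.

```python
# Sketch of a perplexity sweep with scikit-learn's t-SNE (illustrative only).
# On very large data you would typically sweep on a random subsample first.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 70))                      # stand-in for the real dataset
sub = X[rng.choice(len(X), 2_000, replace=False)]     # random subsample

embeddings = {}
for perplexity in (5, 15, 50, 100):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(sub)
    print(f"perplexity={perplexity}: done")
```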
The openTSNE implementation is much faster, and offers an interesting guide on how to use smaller and larger perplexity values at different stages of the same run of the algorithm in order to get the advantages of both. Loosely speaking, it initiates the algorithm with a high perplexity to find global structure for a small number of steps, then repeats the algorithm with the lower perplexity.
I used this implementation on a dataset of 1.5 million records with dimension ~200. The data came from the same source system as the first dataset I mentioned. I didn't play with the perplexity values here because the total runtime on a 32-CPU VM was several hours, but the clusters looked remarkably similar to the ones on the smaller dataset (it recreated binary-classification-esque distinct clusters), so I was happy.
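For completeness, basic openTSNE usage looks roughly like the sketch below (my own illustration; the data, perplexity, and job count are assumptions). The staged small/large-perplexity trick mentioned above is described in the openTSNE documentation and is not reproduced here.

```python
# Minimal openTSNE sketch; TSNE.fit returns an embedding that behaves like
# an (n_samples, 2) array.
import numpy as np
from openTSNE import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 200))   # stand-in for the real dataset

tsne = TSNE(
    n_components=2,
    perplexity=500,     # larger perplexity for large N, per the discussion above
    n_jobs=8,           # parallelism is where much of the speedup comes from
    random_state=0,
    verbose=True,
)
embedding = tsne.fit(X)
print(np.asarray(embedding).shape)
```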

What is the theoretical benchmark for selecting an artificial neural network to predict data?

I have trained neural networks 50 times (trainings), and now I am not sure which network (out of the 50 nets) I should choose to predict my data. Should I decide based on the network's MSE (mean squared error), R-squared, validation performance, or test performance? Thanks for any suggestion.
If you have training data which you actually use to build a predictive model (so you need the network for some actual application, and not for research purposes), then you should just split it into two subsets: training and validation, nothing else. Thus, when you run your 50 trainings (over k-fold CV or random splits), you are supposed to use them to find out which set of hyperparameters is best, and you select the hyperparameters which lead to the best mean score over these 50 splits. Then you retrain your model on the whole dataset with these hyperparameters. Similarly, if you want to select between K algorithms, you use the splits with each of them to approximate their generalization capabilities, select the one with the biggest mean score, and retrain that particular model on the whole dataset.
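One concrete way to follow this workflow is scikit-learn's GridSearchCV with an MLP, as sketched below; the hyperparameter grid, model, and data are my own illustrative assumptions, not part of the original answer.

```python
# Cross-validated hyperparameter selection, then retraining on all data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "alpha": [1e-4, 1e-3, 1e-2],
}

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)

# Retrain on the whole dataset with the selected hyperparameters.
final_model = MLPClassifier(max_iter=500, random_state=0, **search.best_params_)
final_model.fit(X, y)
```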

KNN Algorithm in Weka Never Completing On Large Dataset

Back with a question on data mining and working with Weka and WekaSharp. Through WekaSharp I have been doing some analysis on a fairly large dataset, the KDD Cup 1999 10% database (~70 MB). I have had good results with the J48 decision tree algorithm and the Naive Bayes algorithm, each taking between 10 and 30 minutes to complete. When I run this same data through the KNN algorithm, it never finishes the analysis; it does not error out, it simply runs forever. I have tried all different parameters with no effect. When I run the same KNN algorithm on a smaller sample dataset such as iris.arff, it finishes with no difficulty. Here is the setup I have for the KNN parameters:
"-K 1 -W 0 -A \"weka.core.neighboursearch.KDTree -A \\"weka.core.EuclideanDistance -R first-last\\"\""
Is there an inherent issue with KNN and large datasets or is there a setup issue? Thank you very much.
kNN is subject to the "curse of dimensionality": spatial queries over high-dimensional datasets cannot be optimized in the same way lower-dimensional ones can, which effectively turns them into brute-force searches.
NB laughs at dimensionality because it treats each dimension independently. Many decision tree variants are also fairly good at dealing with high-dimensional data. kNN does not like high-dimensional data. Expect to wait for a long time.
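As a rough illustration of this point using scikit-learn rather than Weka (the dataset sizes and feature counts below are arbitrary), tree-based neighbour search degrades toward brute force once the dimensionality grows:

```python
# Compare kNN query times for low- vs. higher-dimensional data and for
# tree-based vs. brute-force search (illustrative sketch only).
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

for n_features in (5, 40):
    X = rng.normal(size=(100_000, n_features))
    y = rng.integers(0, 2, size=100_000)
    for algorithm in ("kd_tree", "brute"):
        clf = KNeighborsClassifier(n_neighbors=1, algorithm=algorithm).fit(X, y)
        start = time.perf_counter()
        clf.predict(X[:1_000])
        print(f"{n_features} features, {algorithm}: "
              f"{time.perf_counter() - start:.2f} s for 1000 queries")
```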

Neural Network with correlated features

Is there a neural network algorithm that supports adding features on the fly (a non-fixed feature set) and that does not assume the features are uncorrelated with each other?
I don't think you can add features on the fly, because a NN, like many other algorithms, works with input vectors of a fixed size, even if they are sparse vectors. You can train with one feature set, then store the weights, add new features, and start a new training; I think it will converge much faster than the first one.
A NN (of first order) works like logistic regression and solves the problem for a global maximum. There are no assumptions about the features at all; it just finds a function, related to a probability distribution, which maximizes the likelihood of the training data, unlike Naive Bayes, where each probability is calculated separately and then combined under an independence assumption.
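The "store the weights, add new features, retrain" idea from the first paragraph might look like the following PyTorch sketch (my own illustration; the layer sizes are arbitrary): the learned weights for the original inputs are copied into a wider first layer, and only the columns for the new features start from a fresh initialization.

```python
# Warm-starting a wider input layer after adding features (illustrative sketch).
import torch
import torch.nn as nn

old_in, new_in, hidden = 10, 13, 32   # 3 features added on top of the original 10

old_layer = nn.Linear(old_in, hidden)
# ... train the network containing old_layer on the original feature set ...

new_layer = nn.Linear(new_in, hidden)
with torch.no_grad():
    # Reuse the learned weights for the original features; the columns for the
    # new features keep their fresh random initialization.
    new_layer.weight[:, :old_in] = old_layer.weight
    new_layer.bias.copy_(old_layer.bias)

# Continue training with the wider input; convergence should be faster than
# training from scratch, as suggested above.
x = torch.randn(4, new_in)
print(new_layer(x).shape)
```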

Resources