What is the sampling method of H2O's "sample_rate"?

I would like to know what algorithm H2O uses for sampling. For example, does it use Latin Hypercube Sampling?

Related

Machine learning algorithm for correlation between indicators

I have a dataset with several indicators related to some geographical entities, and I want to study the factors that influence one indicator A (among the other indicators). I need to determine which indicators affect it the most (correlation).
Which ML algorithm should I use?
I would also like to have a kind of scoring function for my indicator A to allow its prediction.
What you are looking for are correlation coefficients; you have multiple choices, the most common being:
Pearson's coefficient, which only measures the linear relationship between two variables (see SciPy's implementation)
Spearman's coefficient, which can also capture non-linear (monotonic) relationships (see SciPy's implementation)
You can also normalize your data using z-normalization and then fit a simple linear regression. The regression coefficients can give you an idea of the influence of each variable on the outcome. However, this method is highly sensitive to multicollinearity, which might be present, especially if your variables are geographical.
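A minimal Python sketch of both approaches, assuming a hypothetical pandas DataFrame with an indicator column A and other indicator columns (the names and data below are made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression

# Hypothetical data: rows are geographical entities, columns are indicators.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])

# Pairwise correlation of each indicator with the target indicator A.
for col in ["B", "C", "D"]:
    r_p, p_p = pearsonr(df["A"], df[col])    # linear relationship
    r_s, p_s = spearmanr(df["A"], df[col])   # monotonic (possibly non-linear) relationship
    print(f"{col}: Pearson r={r_p:.2f} (p={p_p:.3f}), Spearman rho={r_s:.2f} (p={p_s:.3f})")

# z-normalize everything and fit a linear regression; the standardized
# coefficients give a rough idea of each indicator's influence on A,
# but beware of multicollinearity between indicators.
X = (df[["B", "C", "D"]] - df[["B", "C", "D"]].mean()) / df[["B", "C", "D"]].std()
y = (df["A"] - df["A"].mean()) / df["A"].std()
coefs = LinearRegression().fit(X, y).coef_
print(dict(zip(X.columns, coefs)))
```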
Could you provide an example of the dataset? Discrete or continuous variables? Which software are you using?
Anyway, an easy way to test correlation (without going into ML algorithms in the strict sense) is to simply compute Pearson's or Spearman's correlation coefficient on selected features, or on the whole dataset by creating a matrix of the data. You can do that in Python with NumPy (see this) or in R (see this).
You can also use simple linear regression or logistic/multinomial logistic regression (depending on the nature of your data) to quantify the influence of the other features on your target variable. Just keep in mind that "correlation is not causation". Look here to see some models.
Then it depends on the objective of your analysis whether to aggregate all the features of all the geographical points or to create covariance matrices for each "subset" of observations related to the geographical points.
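For the whole-dataset matrix approach, a rough Python sketch (the indicator columns here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical indicator table: rows = geographical entities, columns = indicators.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["A", "B", "C", "D"])

# Pearson correlation matrix over all indicators at once (NumPy).
corr = np.corrcoef(df.to_numpy(), rowvar=False)
print(corr)

# pandas gives the same matrix directly, and also supports Spearman.
print(df.corr(method="spearman"))
```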

LightGBM: Intent of lightgbm.dataset()

What is the purpose of lightgbm.Dataset() as per the docs, when I can use the sklearn API to feed the data and train a model?
Any real-world examples explaining the usage of lightgbm.Dataset() would be interesting to learn from.
LightGBM uses a few techniques to speed up training which require preprocessing one time before training starts.
The most important of these is bucketing continuous features into histograms. When LightGBM searches splits to possibly add to a tree, it only searches the boundaries of these histogram bins. This greatly reduces the number of splits to evaluate.
I think the picture from "What Makes LightGBM Fast?" describes this well.
The Dataset object in the library is where this preprocessing happens. Histograms are created one time, and then don't need to be calculated again for the rest of training.
You can get some more information about what happens in the Dataset object by looking at the parameters that control that Dataset, available at https://lightgbm.readthedocs.io/en/latest/Parameters.html#dataset-parameters. Some examples of other tasks:
optimization for sparse features
filtering out features that are not splittable
when I can use the sklearn API to feed the data and train a model
The lightgbm.sklearn interface is intended to make it easy to use LightGBM alongside other libraries like xgboost and scikit-learn. It takes in data in formats like scipy sparse matrices, pandas data frames, and numpy arrays to be compatible with those other libraries. Internally, LightGBM constructs a Dataset from those inputs.
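For illustration, a minimal sketch contrasting the two interfaces (the data and parameter values below are made up, not taken from the docs):

```python
import numpy as np
import lightgbm as lgb

# Hypothetical regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 2.0 * X[:, 0] + rng.normal(size=1000)

# Native interface: build a Dataset first. This is where the one-time
# preprocessing (e.g. bucketing features into histograms, controlled by
# Dataset parameters such as max_bin) happens; then train with lgb.train.
train_set = lgb.Dataset(X, label=y, params={"max_bin": 255})
booster = lgb.train({"objective": "regression", "learning_rate": 0.1},
                    train_set, num_boost_round=50)

# sklearn-style interface: accepts numpy/pandas/scipy inputs directly and
# constructs the Dataset internally, so it plugs into scikit-learn tooling.
model = lgb.LGBMRegressor(n_estimators=50, learning_rate=0.1)
model.fit(X, y)
```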

Will all algorithms in H2O be supported in H2O AutoML (e.g. Naive Bayes, time series)?

Are all algorithms that are available in H2O applicable in AutoML? For example, will H2O AutoML run algorithms such as time series, Cox Proportional Hazards (CoxPH), or Naive Bayes?
As mentioned in the docs, during H2O's AutoML
all appropriate H2O algorithms will be used if the search stopping criteria allows and if the include_algos option is not specified
If you would like to specify certain algos, you can specify a list or vector in the include_algos argument (see here).
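For example, a minimal Python sketch (the file path, target name, and algorithm list here are placeholders):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical data: replace the path and column names with your own.
train = h2o.import_file("your_training_data.csv")
y = "target"
x = [c for c in train.columns if c != y]

# Restrict AutoML to specific algorithms via include_algos;
# leave it unset to let AutoML try all appropriate algorithms.
aml = H2OAutoML(max_models=10, include_algos=["GLM", "DRF", "GBM"], seed=1)
aml.train(x=x, y=y, training_frame=train)
print(aml.leaderboard)
```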

What is the algorithm used by the "Universal Recommender" on Prediction.IO?

Good afternoon,
What is the name of the algorithm used by the "Universal Recommender (UR)" on Prediction.IO?
As far as I know, the algorithms for recommender systems are "collaborative filtering" and "content-based filtering".
Thanks!
It uses the Correlated Cross-Occurrence (CCO) algorithm from Apache Mahout.
Check out these:
https://actionml.com/blog/cco
https://mahout.apache.org/users/algorithms/recommender-overview.html
Prediction.IO uses Apache Spark MLlib's Alternating Least Squares (ALS) matrix factorization method. It is one of the basic methods of collaborative filtering, alongside the user-based and item-based approaches. Documentation can be found at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
The Universal Recommender template uses this algorithm for computing "events" that appear "often" together with "buying" some "item". The use of factorization is not what the authors of the Universal Recommender principle describe in their original idea; instead they used LLR (log-likelihood ratio) similarity to find statistically significant "events". I personally doubt the suitability of matrix factorization and of HBase here (use a Redis cluster instead). You can read about the Universal Recommender's general idea at https://www.mapr.com/practical-machine-learning and http://mahout.apache.org/users/algorithms/recommender-overview.html

VowpalWabbit: Differences and scalability

I am trying to ascertain how VowpalWabbit's "state" is maintained as the size of our input set grows. In a typical machine learning environment, if I have 1000 input vectors, I would expect to send all of those at once, wait for a model building phase to complete, and then use the model to create new predictions.
In VW, it appears that the "online" nature of the algorithm shifts this paradigm to be more performant and capable of adjusting in real-time.
How is this real-time model modification implemented?
Does VW take increasing resources with respect to total input data size over time? That is, as I add more data to my VW model (when it is small), do the real-time adjustment calculations begin to take longer once the cumulative number of feature-vector inputs increases to the thousands, tens of thousands, or millions?
Just to add to carlosdc's good answer.
Some of the features that set vowpal wabbit apart, and allow it to scale to tera-feature (10^12) data sizes, are:
The online weight vector:
vowpal wabbit maintains an in-memory weight vector, which is essentially the vector of weights for the model that it is building. This is what you call "the state" in your question.
Unbounded data size:
The size of the weight-vector is proportional to the number of features (independent input variables), not the number of examples (instances). This is what makes vowpal wabbit, unlike many other (non online) learners, scale in space. Since it doesn't need to load all the data into memory like a typical batch-learner does, it can still learn from data-sets that are too big to fit in memory.
Cluster mode:
vowpal wabbit supports running on multiple hosts in a cluster, imposing a binary tree graph structure on the nodes and using the all-reduce reduction from leaves to root.
Hash trick:
vowpal wabbit employs what's called the hashing trick. All feature names get hashed into an integer using murmurhash-32. This has several advantages: it is very simple and time-efficient, since there is no hash-table management or collision resolution to deal with, at the cost of letting features occasionally collide. It turns out (in practice) that a small number of feature collisions in a training set with thousands of distinct features is similar to adding an implicit regularization term. This, counter-intuitively, often improves model accuracy rather than decreasing it. It is also agnostic to the sparseness (or density) of the feature space. Finally, it allows the input feature names to be arbitrary strings, unlike most conventional learners which require feature names/IDs to be both a) numeric and b) unique. (A toy sketch of this idea appears after this answer.)
Parallelism:
vowpal wabbit exploits multi-core CPUs by running the parsing and the learning in two separate threads, adding further to its speed. This is what makes vw able to learn as fast as it can read data. It turns out that most supported algorithms in vw, counter-intuitively, are bottlenecked by I/O speed rather than by learning speed.
Checkpointing and incremental learning:
vowpal wabbit allows you to save your model to disk while you learn, and then to load the model and continue learning where you left off with the --save_resume option.
Test-like error estimate:
The average loss calculated by vowpal wabbit "as it goes" is always on unseen (out of sample) data (*). This eliminates the need to bother with pre-planned hold-outs or do cross validation. The error rate you see during training is 'test-like'.
Beyond linear models:
vowpal wabbit supports several algorithms, including matrix factorization (roughly sparse matrix SVD), Latent Dirichlet Allocation (LDA), and more. It also supports on-the-fly generation of term interactions (bi-linear, quadratic, cubic, and feed-forward sigmoid neural-net with user-specified number of units), multi-class classification (in addition to basic regression and binary classification), and more.
There are tutorials and many examples in the official vw wiki on github.
(*) One exception is if you use multiple passes with --passes N option.
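Here is a toy Python sketch of the hashing trick described above; it only illustrates the idea (vw itself uses MurmurHash3 and its own fixed-size weight vector):

```python
# Toy illustration of the hashing trick: arbitrary feature-name strings are mapped
# into a fixed number of slots, so no feature dictionary has to be kept in memory.
NUM_SLOTS = 2 ** 18   # fixed slot count, analogous in spirit to vw's -b (bits) option

def hashed_features(named_features: dict[str, float]) -> dict[int, float]:
    vec: dict[int, float] = {}
    for name, value in named_features.items():
        # Python's built-in hash() is a stand-in here (not stable across runs);
        # vw uses MurmurHash3 so the mapping is deterministic.
        idx = hash(name) % NUM_SLOTS
        vec[idx] = vec.get(idx, 0.0) + value   # colliding features simply share a slot
    return vec

print(hashed_features({"user=alice": 1.0, "word:great": 2.0}))
```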
VW is a (very) sophisticated implementation of stochastic gradient descent. You can read more about stochastic gradient descent here.
It turns out that a good implementation of stochastic gradient descent is basically I/O-bound: it goes as fast as you can feed it the data, so VW has some sophisticated data structures to "compile" the data.
Therefore the answer to question (1) is: by doing stochastic gradient descent, and the answer to question (2) is: definitely not.
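A minimal toy sketch of why an online SGD learner's resource use does not grow with the number of examples: each example updates a fixed-size weight vector and is then discarded (illustrative only, not vw's actual update rule):

```python
import numpy as np

NUM_FEATURES = 10
w = np.zeros(NUM_FEATURES)   # the only persistent state: one weight per feature

def sgd_step(x: np.ndarray, y: float, lr: float = 0.05) -> None:
    # One squared-loss gradient step for a single example; the example is not stored.
    err = float(w @ x) - y
    w[:] -= lr * err * x

# Stream examples one at a time; memory stays O(NUM_FEATURES) whether we
# see thousands or millions of examples, and each update costs the same.
rng = np.random.default_rng(0)
for _ in range(5_000):
    x = rng.normal(size=NUM_FEATURES)
    y = 3.0 * x[0]            # hypothetical target: depends only on the first feature
    sgd_step(x, y)

print(w.round(2))             # first weight should be close to 3.0, the rest near 0
```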

Resources