Attribute selection in h2o - h2o

I am a complete beginner with h2o, and I want to know whether the h2o framework has any attribute selection capabilities that can be applied to H2OFrames.

No, there are currently no dedicated feature selection functions in H2O. My advice would be to use Lasso regression (in H2O, this means a GLM with alpha = 1.0) to do the feature selection, or simply to let whatever machine learning algorithm you are planning to use (e.g. GBM) see all the features. Such algorithms tend to ignore the bad features, but having them in the training data can still degrade performance.
If you'd like, you can make a feature request by filling out a ticket on the H2O-3 JIRA. This seems like a nice feature to have.
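
The Lasso suggestion above could be sketched roughly as follows in the H2O Python API. This is a minimal sketch, not an official recipe: the helper names (`selected_features`, `lasso_feature_selection`) and the `tolerance` parameter are made up for illustration, and `lambda_search=True` is an assumption about how you might pick the penalty strength. The idea is only that an L1 penalty drives some coefficients to exactly zero, and the predictors left with non-zero coefficients are the "selected" ones.

```python
def selected_features(coefficients, tolerance=0.0):
    """Return the names whose coefficient magnitude exceeds `tolerance`.

    `coefficients` is a dict of name -> value, such as the one returned by
    an H2O GLM's coef() method. The intercept is skipped.
    """
    return [
        name
        for name, value in coefficients.items()
        if name != "Intercept" and abs(value) > tolerance
    ]


def lasso_feature_selection(frame, predictors, response, family="gaussian"):
    """Fit an L1-penalised GLM on an H2OFrame and list the surviving terms.

    Hypothetical helper: assumes h2o.init() has been called and `frame` is
    an H2OFrame. For categorical predictors H2O reports one coefficient per
    level (e.g. "color.red"), so names may not match column names exactly.
    """
    # Imported lazily so selected_features() works without an H2O install.
    from h2o.estimators import H2OGeneralizedLinearEstimator

    glm = H2OGeneralizedLinearEstimator(
        family=family,
        alpha=1.0,           # alpha = 1.0 means a pure Lasso (L1) penalty
        lambda_search=True,  # assumption: let H2O search over penalty strengths
    )
    glm.train(x=predictors, y=response, training_frame=frame)
    return selected_features(glm.coef())
```

You would then retrain your final model (e.g. GBM) on the returned subset and compare its validation metrics against a model trained on all features.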

In my opinion, yes.
My approach is to use AutoML to train on your data.
After training, you will have a lot of models.
Use the h2o.get_model method or the H2O server's web page to inspect a model you like.
Each such model gives you a VARIABLE IMPORTANCES table.
Then pick your features from it.
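
The AutoML approach described above might look like this in Python. Treat it as a sketch under assumptions: the helper names and the choice of `k` are hypothetical, and Stacked Ensembles are excluded here because they do not expose variable importances the way the individual models do. The rows returned by `varimp(use_pandas=False)` are assumed to be tuples of (variable, relative_importance, scaled_importance, percentage).

```python
def top_features(varimp_rows, k=10):
    """Return the k variable names with the highest relative importance.

    Each row is expected to start with (variable, relative_importance, ...).
    """
    ranked = sorted(varimp_rows, key=lambda row: row[1], reverse=True)
    return [row[0] for row in ranked[:k]]


def automl_top_features(frame, predictors, response, k=10, max_models=10):
    """Run AutoML and read the variable importances of the leader model.

    Hypothetical helper: assumes h2o.init() has been called and `frame` is
    an H2OFrame.
    """
    # Imported lazily so top_features() works without an H2O install.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(
        max_models=max_models,
        seed=1,
        exclude_algos=["StackedEnsemble"],  # so the leader has varimp
    )
    aml.train(x=predictors, y=response, training_frame=frame)
    return top_features(aml.leader.varimp(use_pandas=False), k)
```

Any other model from the leaderboard can be fetched with `h2o.get_model(model_id)` and passed through `top_features` in the same way.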

Related

How can we improve the accuracy of form-recognizer model?

I am using the Microsoft form-recognizer service. My forms are a bit complex, and I tried training a model for them. The performance I achieved is not really good. Is there any way I can improve the accuracy? Is there any way to tune this model? I have trained the model using 5 different populated forms of the same type.
Not sure if you're still interested in this. I've been using the labeling tool in v2. You can manually label areas to create tags (features). This can be done without too much effort. I've been using it and get very accurate results.

Is H2O DAI's MLI display menu dependent on the algorithms used in its experiments?

I see that H2O DAI automatically picks the optimized algorithm for the dataset. I hear that on other platforms (like SAS Viya) the contents of MLI (machine learning interpretation) depend on the algorithm used. For example, LOCO is not available for GBM, etc. (Of course, this is a purely hypothetical example.)
Is it the same with H2O Driverless AI? Or does it always show the same MLI menus regardless of the algorithms it used?
Currently, MLI will always display the same dashboard for any DAI algorithm, with the following exceptions: Shapley plots are not currently supported for RuleFit and TensorFlow models, multinomial experiments currently show only Shapley and Feature Importance (global and local), and Time Series experiments are not yet supported for MLI. This means you can expect to always see K-LIME/LIME-SUP, Surrogate Decision Tree, Partial Dependence Plot and Individual Conditional Expectation. Note that this may change in the future. For the most up-to-date details, please see the documentation.

Time series database - interpolation and filtering

Does any existing time series database contain functionality for interpolation and digital filtering for outlier detection (Hampel, Savitzky-Golay)? Or at least provide interfaces to enable custom query result post-processing?
As far as I know InfluxDB does not offer you anything more than a simple linear interpolation and the basic set of InfluxQL functions.
It looks like anything more complex than that has to be done by hand in the programming language of your choice. Influx has clients for a number of languages.
There is an article on anomaly detection with Prometheus, but that looks like an attempt rather than a capability.
However, there is a thing called Kapacitor for InfluxDB. It is a quite powerful data processing tool that lets you define User Defined Functions (UDFs). Here is an article on how to implement custom anomaly detection with Kapacitor.
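
As an illustration of the "by hand" post-processing mentioned above, here is a minimal Hampel filter in plain Python that could be run over values returned by a database client. It is a sketch, not tied to any InfluxDB or Kapacitor API: the function name, the default window size and the three-sigma threshold are all assumptions chosen for the example.

```python
from statistics import median


def hampel_filter(values, half_window=3, n_sigmas=3.0):
    """Flag and replace outliers using a sliding median and MAD.

    Returns (cleaned_values, outlier_indices). Windows are truncated at the
    edges of the series. If the MAD of a window is zero, any point that
    differs from the window median is flagged.
    """
    values = list(values)
    scale = 1.4826  # makes the MAD comparable to a std dev for Gaussian data
    cleaned = list(values)
    outliers = []
    for i, value in enumerate(values):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        center = median(window)
        mad = median([abs(v - center) for v in window])
        if abs(value - center) > n_sigmas * scale * mad:
            cleaned[i] = center  # replace the outlier with the local median
            outliers.append(i)
    return cleaned, outliers
```

For example, calling `hampel_filter([1.0, 1.1, 0.9, 10.0, 1.0, 1.2, 0.8], half_window=2)` flags the spike at index 3 and replaces it with the local median.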
Akumuli time-series database provides some basic outlier detection, for instance the EWMA and SMA. You can issue a query that will return the difference between the predicted value (given by EWMA or SMA) and the actual value.
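
The idea of comparing each value against an EWMA prediction can be sketched generically in Python. This is not Akumuli's implementation or query syntax, only an illustration of the technique: the function name and the smoothing factor `alpha` are assumptions, and the forecast is seeded with the first observation.

```python
def ewma_residuals(values, alpha=0.3):
    """Return actual minus EWMA-predicted value for each point.

    The prediction for a point is the exponentially weighted moving average
    of all earlier points; the first point is predicted by itself, so its
    residual is zero.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    residuals = []
    prediction = None
    for value in values:
        if prediction is None:
            prediction = value  # seed the forecast with the first observation
        residuals.append(value - prediction)
        # Update the forecast after scoring the current point.
        prediction = alpha * value + (1.0 - alpha) * prediction
    return residuals
```

Points whose residual is large in absolute value are candidate outliers; what counts as "large" is a threshold you would have to choose for your data.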

Training Model for Sentiment Analysis with Google Prediction API

I am planning to use the Google Prediction API for sentiment analysis. How can I generate the training model for this? Or where can I find a standard training model available for commercial use? I have already tried the Sentiment Predictor provided in the Prediction Gallery of the Google Prediction API, but it does not seem to work properly.
From my understanding, the "model" for the Google Prediction API is actually not a model, but a suite of models for regression as well as classification. That being said, it's not clear how the Prediction API decides what kind of regression or classification model is used when you present it with training data. You may want to look at how to train a model on the Google Prediction API if you haven't already done so.
If you're not happy with the results of the Prediction API, it might be an issue with your training data. You may want to think about adding more examples to the training file to see if the model comes up with better results. I don't know how many examples you used, but generally, the more you can add, the better.
However, if you want to look at creating one yourself, NLTK is a Python library that you can use to train your own model. Another Python library you can use is scikit-learn.
Hope this helps.
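
To make the "train your own model" suggestion concrete, here is a dependency-free sketch of a bag-of-words Naive Bayes sentiment classifier, the kind of model that NLTK and scikit-learn provide ready-made. It is only an illustration: the function names, the whitespace tokenizer and the four training sentences are invented for the example, and in practice you would use one of those libraries with far more data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
EXAMPLES = [
    ("great product love it", "pos"),
    ("really good and great value", "pos"),
    ("terrible quality hate it", "neg"),
    ("bad experience really terrible", "neg"),
]
MODEL = train_naive_bayes(EXAMPLES)
```

With this toy model, `predict(MODEL, "great value")` returns "pos" and `predict(MODEL, "terrible experience")` returns "neg".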
The Google Prediction API is great, BUT to train a model you will need a LOT of data.
Alternatively, you can use the sentiment model that is already trained.

Use clustering for prediction in Weka

Can I use clustering (e.g. using k-means) to make predictions in Weka?
I have some data from a survey about a presidential election. I have answers from questionnaires (numeric attributes), and one attribute that is the answer to the question "Who are you going to vote for?" (1, 2 or 3).
I make predictions using some classifiers (e.g. Bayes) in Weka. My results are based on that answer (vote intention), and I get about 60% recall (rate of correct predictions).
I understand that clustering is a different thing, but can I use clustering to make predictions? I've already tried, but I've realized that clustering always selects its own centroids and does not use my vote intention attribute.
The author of "Explain results of K-means" must be a colleague of yours. He seems to use the same data set, and it would be helpful if we could all have a look at the data.
In general, clustering is not classification or prediction.
However, you can try to improve your classification by using the information gained from clustering. Two such techniques:
Substitute your data set with the cluster centers, and use these for classification (at least if your clusters are reasonably pure with respect to the class label!).
Train a separate classifier on each cluster, and build an ensemble out of them (in particular, if your clusters are inhomogeneous).
But I believe your understanding of classification and clustering is not yet advanced enough to try these out. You need to handle them carefully and know your data very well.
Yes. You can use the Weka interface to do prediction via clustering. First, load your training data in the Preprocess tab. Then go to the Classify tab, click Choose under Classifier, and under meta select ClassificationViaClustering. The default clustering algorithm used by Weka is SimpleKMeans, but you can change that by clicking on the options string (i.e. the text next to the Choose button). Weka will display a dialog box; click Choose there and a set of clustering algorithms will be listed to pick from (e.g. EM). After that, you can run cross-validation or supply a test set by clicking Set, as you normally do when you use Weka for classification.
Hope this will help anyone having the same question!
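
For readers who want to see the idea behind classification via clustering outside the Weka GUI, here is a small plain-Python sketch. It is not Weka's code: the function names, the fixed number of iterations and the explicit initial centroids are assumptions made to keep the example short and deterministic. The points are clustered without looking at the class, each cluster is then mapped to the majority class of its members, and a new point receives the class of its nearest centroid.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, init_centroids, iterations=20):
    """Plain k-means starting from the given initial centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit(points, labels, init_centroids):
    """Cluster ignoring the labels, then map each cluster to its majority label."""
    centroids = kmeans(points, init_centroids)
    votes = [Counter() for _ in centroids]
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return centroids, cluster_label


def predict(model, point):
    """Predict the label of the cluster whose centroid is nearest to `point`."""
    centroids, cluster_label = model
    return cluster_label[nearest(point, centroids)]
```

This only works well when the clusters line up with the classes, which is the same caveat raised in the earlier answer about cluster purity.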
