H2O AutoML building a large number of GBM models - h2o

I tried to use AutoML for a binary classification task with a 100-hour time budget. It appears to be building only a large number of GBM models and not getting to other model types (40 built so far).
Is there a way to set the maximum number of GBM models?

There is an order in which AutoML builds the models (the GBMs are first in line). The length of the GBM model building process will depend on how much time you set for max_runtime_secs. If you plan to run it for 100 hours, then a good portion of that will be spent in the GBM hyperparameter space, so I am not surprised that your first 40 models are GBMs. In other words, this is expected behavior.
If you want variety in your models as they are training, then you can run a single AutoML job for a smaller max_runtime_secs (say 2 hours), and then run the AutoML process again on that same project (49 more times at 2 hours each -- or some combination that adds up to 100 hours). If you use the same project_name when you start an AutoML job, a full new set of models (GBMs, RFs, DNNs, GLMs) should be added to the existing AutoML leaderboard.
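As a rough sketch of the "many short runs, one project" idea above (Python bindings): the project name, the 2-hour slice, and the per-run seed offset are assumptions for illustration, not recommended settings. The H2O import is done lazily so the planning helper can be tried without a running cluster.

```python
def plan_runs(total_hours, hours_per_run, project_name="binary_clf_100h", base_seed=1):
    """Split one long budget into equal short runs.

    Returns a list of keyword-argument dicts for H2OAutoML.
    project_name and base_seed are hypothetical placeholders.
    """
    n_runs = int(total_hours // hours_per_run)
    return [
        {
            "max_runtime_secs": int(hours_per_run * 3600),
            "project_name": project_name,  # same name, so results accumulate
            "seed": base_seed + i,         # assumption: vary the seed per run
        }
        for i in range(n_runs)
    ]


def run_all(train_path, target, runs):
    """Run every planned AutoML job against the same project (needs H2O)."""
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(train_path)
    train[target] = train[target].asfactor()  # binary classification target

    aml = None
    for kwargs in runs:
        aml = H2OAutoML(**kwargs)
        aml.train(y=target, training_frame=train)
    return aml.leaderboard if aml is not None else None


runs = plan_runs(total_hours=100, hours_per_run=2)
print(len(runs), "runs of", runs[0]["max_runtime_secs"], "seconds each")
```

Calling run_all("train.csv", "label", runs) (file and column names are made up) would then execute the jobs one after another.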

As Erin said, if you run AutoML multiple times with the same project_name the results will accumulate into a single leaderboard and the hyperparameter searches will accumulate into the same grid objects. However, AutoML will still run through the same sequence of model builds, so it will do a GBM hyperparameter search again before it gets to the DL model builds.
It feels like your GBM hyperparameter search isn't converging because the stopping_tolerance is too small for your dataset. There was a bug in pre-release versions of the bindings which forced the stopping_tolerance to 0.001 instead of letting AutoML set it higher, if it calculated that that tolerance was too tight for a small dataset. Which version of H2O-3 are you using?
A bit about stopping criteria:
The stopping_criteria such as max_models, stopping_rounds, and stopping_tolerance apply to the overall AutoML process as well as to the hyperparameter searches and the individual model builds. At the beginning of the run max_runtime_secs is used to compute the end time for the entire process, and then at each stage the remaining overall time is computed and is passed down to the model build or hyperparameter search subtask.
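To illustrate how a single overall deadline gets handed down to each stage, here is a minimal sketch. It is not the code in AutoML.java; the class and function names are invented, and the stages are stand-ins that simply consume time.

```python
import time


class TimeBudget:
    """Tracks one overall deadline computed once at the start of the run."""

    def __init__(self, max_runtime_secs, clock=time.monotonic):
        self._clock = clock
        self._end = clock() + max_runtime_secs

    def remaining(self):
        return max(0.0, self._end - self._clock())


def run_stages(stages, budget):
    """Give each stage whatever time is left; stop once the budget is spent.

    stages is a list of (name, callable) pairs; each callable receives the
    remaining seconds as its own time limit.
    """
    log = []
    for name, stage in stages:
        remaining = budget.remaining()
        if remaining <= 0:
            break
        stage(remaining)
        log.append((name, remaining))
    return log
```

With a 100-second budget and a first stage that uses 70 seconds, the second stage is offered only the remaining 30, and anything after that never starts. This is the same effect as a long GBM search crowding out later model types.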
The Run Time 558:10:56.131 that you posted is really weird. I don't see that sort of output in the AutoML.java code nor in the Python or R bindings. It looks at first glance like this is coming from outside of H2O. . . Do you have any sense of what the real time was for this run?
We should be able to figure out what's going on if you do the following:
If you're not on the recent release version 3.14.x, please upgrade.
While we're debugging please set the seed parameter for your AutoML run so that we get repeatable results.
Please post your stopping criteria, your leaderboard output, your User Feedback output, and send your H2O logs to rpeck (at) h2o.ai and support (at) h2o.ai in case we need to delve further. You can grab the H2O logs from the server or download them using Flow.

Related

Model Monitoring for Image Data not working in Vertex AI

My use case is multiclass image classification. I deployed a CNN model in production and enabled Model Monitoring for prediction drift detection only, which does not require training data. Two buckets, analysis and predict, were created automatically in the storage bucket. Then I sent 1000 prediction requests for model testing (the same request 1000 times through Apache Bench), as this was a prerequisite. I set the monitoring job to run every hour with a 100% sampling rate. I am not getting any output or logs in the newly created buckets.
What's the error here?
Is Model Monitoring (prediction drift detection) not enabled for image data in Vertex AI?
What steps do I need to take to check that Model Monitoring is working for an image classification model? We need evidence in the form of logs generated in the two buckets.
Model monitoring is only supported for tabular AutoML and tabular custom-trained models at the moment. It is not supported for custom-trained image classification models.
For a more proactive approach that should minimize prediction drift in image classification models, Vertex AI Team would recommend the following:
• Augmenting your data so that you have a more diverse set of samples. This set should match your business needs and contain transformations that are meaningful in your context. Please refer to [2] for more information about data augmentation.
• Utilizing Vertex Explainable AI to identify the features which are contributing the most to your model's classification decisions. This would help you to augment your data in a more educated manner. Please refer to [3] for more information about Vertex Explainable AI.
[1] https://cloud.google.com/vertex-ai/docs/model-monitoring/overview
[2] https://www.tensorflow.org/tutorials/images/data_augmentation
[3] https://cloud.google.com/vertex-ai/docs/explainable-ai/overview
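Since the built-in monitoring does not cover this case, one option is to log the predicted labels yourself and compare their distribution over time. The following is only a hand-rolled sketch of that idea, not a Vertex AI feature. The function names and the 0.3 threshold are assumptions.

```python
from collections import Counter


def class_distribution(labels, classes):
    """Share of predictions that fell into each class."""
    if not labels:
        raise ValueError("need at least one prediction")
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def max_share_change(baseline, recent):
    """Largest absolute change in any class share (L-infinity distance)."""
    return max(abs(baseline[c] - recent[c]) for c in baseline)


def has_drifted(baseline_labels, recent_labels, classes, threshold=0.3):
    baseline = class_distribution(baseline_labels, classes)
    recent = class_distribution(recent_labels, classes)
    return max_share_change(baseline, recent) > threshold


classes = ["cat", "dog"]
last_week = ["cat"] * 50 + ["dog"] * 50
this_hour = ["cat"] * 90 + ["dog"] * 10
print(has_drifted(last_week, this_hour, classes))
```

Note that sending the same request 1000 times produces a single predicted class, so any distribution-based check will have very little to compare.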

Designing an algorithm for detecting anomalies and statistical significance for ordinal data using python

Firstly, I would like to apologise for the detailed problem statement. Being a novice, I couldn't express it in any lesser words.
Environment Setup Details:
To give some background, I work in a cloud company where we have multiple servers geographically located in all continents. So, we have hierarchy like this:
Several partitions
Each partition has 7 PoPs
Each PoP has multiple nodes, all set up with redundancy.
TURN servers connecting traffic to each node depending on the client location
Actual clients: iOS, Android, Mac, Windows, etc.
Now, every time the user uses our product/service, he leaves a rating out of 5, 5 being outstanding. This data is stored in our databases and we mine it and analyse it to pin-point the exact issue on any particular day.
For example, if the users from Asia are giving more bad ratings on Tuesday this week than on a usual Tuesday, what factors could cause this? Is it something to do with the client app version, the server release, physical factors, loss, increased round-trip delay, etc.?
What we have done:
Till now we have been using visualization tools to track each of these metrics separately per day to see the trends and detect the issues manually.
But, due to the growing number of microservices, it is becoming more difficult day by day. Now, we want to automate it using python/pandas.
What I want to do:
If the ratings drop on a particular day/hour, I run the script and it should do all the manual work by going through all the combinations of factors and listing the exact combinations which could have led to the drop.
The second step would be to check whether the drop is statistically significant, given that the number of ratings varies.
What I know:
I understand that I can do this using pandas by creating a dataframe for each predictor variable and trying to do it per variable.
And then I can apply tests like the Mann-Whitney U test etc. for ordinal data.
What I need help with:
But I just wanted to know if there is a better way to do it? It is perfectly fine if there is a learning curve involved. I can learn and do it. I just wanted some help in choosing the right approach for this.
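One possible shape for such a script, as a sketch only: enumerate factor combinations, compare today's ratings with a baseline for each segment, and rank segments by a Mann-Whitney U test (normal approximation with tie correction, no continuity correction). The row layout (dicts with a "rating" key), the factor names, and min_n are assumptions. The standard library is used here to keep it self-contained; pandas groupby and scipy.stats.mannwhitneyu would be the usual tools.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations
from statistics import mean


def mann_whitney_p(xs, ys):
    """Two-sided p-value, normal approximation with tie correction."""
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    counts = Counter(xs) + Counter(ys)
    rank_of, seen = {}, 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = seen + (t + 1) / 2  # average rank of tied values
        seen += t
    u1 = sum(rank_of[x] for x in xs) - n1 * (n1 + 1) / 2
    tie_term = sum(t**3 - t for t in counts.values()) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    if variance == 0:
        return 1.0  # every rating identical, nothing to distinguish
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def group_ratings(rows, combo):
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in combo)].append(row["rating"])
    return groups


def rank_segments(baseline, today, factors, max_order=2, min_n=5):
    """Return (p_value, mean_drop, segment) tuples, most significant first."""
    results = []
    for order in range(1, max_order + 1):
        for combo in combinations(factors, order):
            base_groups = group_ratings(baseline, combo)
            for key, now in group_ratings(today, combo).items():
                base = base_groups.get(key, [])
                if len(now) < min_n or len(base) < min_n:
                    continue  # too few ratings to say anything
                p = mann_whitney_p(now, base)
                results.append((p, mean(base) - mean(now), dict(zip(combo, key))))
    results.sort(key=lambda r: r[0])
    return results
```

Because many segments are tested at once, some will look significant by chance, so the p-values are better treated as a ranking than as proof. A multiple-testing correction would be a sensible next step.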

What estimator to use in scikit-learn?

This is my first brush with machine learning, so I'm trying to figure out how this all works. I have a dataset where I've compiled all the statistics of each player to play with my high school baseball team. I also have a list of all the players that have ever made it to the MLB from my high school. What I'd like to do is split the data into a training set and a test set, and then feed it to some algorithm in the scikit-learn package and predict the probability of making the MLB.
So I looked through a number of sources and found this cheat sheet that suggests I start with linear SVC.
So, then as I understand it I need to break my data into training samples where each row is a player and each column is a piece of data about the player (batting average, on base percentage, yada, yada), X_train; and a corresponding truth vector with a single entry per player that is simply 1 (played in MLB) or 0 (did not play in MLB), Y_train. From there, I just do fit(X_train, Y_train) and then I can use predict(X_test) to see if it gets the right values for Y_test.
Does this seem a logical choice of algorithm, method, and application?
EDIT to provide more information:
The data is made of 20 features such as number of games played, number of hits, number of Home Runs, number of Strike Outs, etc. Most are basic counting statistics about the players career; a few are rates such as batting average.
I have about 10k total rows to work with, so I can split the data based on that; but I have no idea how to optimally split the data, given that <1% have made the MLB.
Alright, here are a few steps that you might want to take:
Prepare your data set. In practice, you might want to scale the features, but we'll leave it out to make the first working model as simple as possible. So we will just need to split the dataset into train and test sets. You could shuffle the records manually and take the first X% of the examples as the train set, but there's already a function for it in the scikit-learn library: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. You might want to make sure that both positive and negative examples are present in the train and test sets. To do so, you can separate them before the train/test split to make sure that, say, 70% of the negative examples and 70% of the positive examples go to the training set.
Let's pick a simple classifier. I'll use logistic regression here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but other classifiers have a similar API.
Creating the classifier and training it is easy:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
Now it's time to make our first predictions:
y_pred = clf.predict(X_test)
A very important part of the model is its evaluation. Using accuracy is not a good idea here: the number of positive examples is very small, so the model that unconditionally returns 0 can get a very high score. We can use the f1 score instead: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
If you want to predict probabilities instead of labels, you can just use the predict_proba method of the classifier.
That's it. We have a working model! Of course, there are a lot of things you may try to improve it, such as scaling the features, trying different classifiers, and tuning their hyperparameters, but this should be enough to get started.
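Putting the steps above together, a sketch might look like this. The data is a synthetic stand-in (10k rows, 20 features, roughly 1% positives), and the stratify option, the scaler, and class_weight="balanced" are additions assumed here, not part of the steps as written.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player table (assumption, not real data).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99], random_state=0
)

# stratify=y keeps the share of positives the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" is one way to stop the rare class being ignored.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("F1:", f1_score(y_test, y_pred))

# Column 1 holds the estimated probability of the positive class.
mlb_probability = clf.predict_proba(X_test)[:, 1]
```

Passing stratify=y does the "70% of positives and 70% of negatives" separation in one call.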
If you don't have a lot of experience in ML: scikit-learn offers classification algorithms (for when the target of your dataset is a boolean or categorical variable) and regression algorithms (for when the target is a continuous variable).
If you have a classification problem and your variables are on very different scales, a good starting point is a decision tree:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
The classifier is a tree, and you can see the decisions that are taken at each node.
After that you can use a random forest, which is a group of decision trees whose results are averaged:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
After that you can put every feature on the same scale:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
And you can use other algorithms like SVMs.
For every algorithm you need a technique to select its parameters, for example cross validation:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
But a good course is the best way to learn. On Coursera you can find several good courses, like this one:
https://www.coursera.org/learn/machine-learning
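As a small sketch of the progression described above, this compares a decision tree and a random forest with cross-validation. The data is synthetic, and the model settings (max_depth=5, 50 trees, 5 folds) are arbitrary assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for a real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9], random_state=0
)

# Stratified folds keep some positive examples in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1")
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(name, "mean F1:", fold_scores.mean())
```

The same loop can be reused to compare different parameter values for a single algorithm.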

K-Means on time series data with Apache Spark

I have a data pipeline system where all events are stored in Apache Kafka. There is an event processing layer, which consumes and transforms that data (time series) and then stores the resulting data set into Apache Cassandra.
Now I want to use Apache Spark in order to train some machine learning models for anomaly detection. The idea is to run the k-means algorithm on the past data, for example for every single hour in a day.
For example, I can select all events from 4pm-5pm and build a model for that interval. If I apply this approach, I will get exactly 24 models (centroids for every single hour).
If the algorithm performs well, I can reduce the size of my interval to be for example 5 minutes.
Is it a good approach to do anomaly detection on time series data?
I have to say that the strategy is good for finding outliers, but you need to take care of a few steps. First, using all the events of every 5 minutes to create a new centroid for each interval is, I think, not a good idea.
Because with too many centroids it becomes really hard to find the outliers, and that is what you don't want.
So let's see a good strategy:
Find a good number of K for your K-means.
That is really important: with too many or too few clusters you get a bad representation of reality. So select a good K.
Take a good Training set
So, you don't need to use all the data to create a model every time and every day. You should take a sample of what is normal for you. You don't need to include what is not normal, because that is what you want to find. So use this sample to create your model and then find the clusters.
Test it!
You need to test whether it is working or not. Do you have any examples of what you consider strange? And do you have a set that you know is not strange? Take these and check whether it works. Cross-validation can help with this.
So, is your idea good? Yes! It works, but make sure not to overwork the cluster. Of course you can use each day's data to train your model further, but run the centroid-finding process only once a day, and let the Euclidean distance decide what does or does not belong to your groups.
I hope that I helped you!
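To make the "train on normal data, then let the Euclidean distance decide" idea concrete, here is a small sketch in plain Python rather than Spark. The farthest-point initialisation, the 99% distance quantile used as the cut-off, and the toy two-blob data are all assumptions chosen to keep the example short and deterministic.

```python
import math
import random


def nearest(point, centroids):
    """Return (index, Euclidean distance) of the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]


def kmeans(points, k, iterations=20):
    # Deterministic farthest-point start (a simplification for this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: nearest(p, centroids)[1]))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


def fit_detector(normal_points, k, quantile=0.99):
    """Learn centroids from normal data and a distance cut-off."""
    centroids = kmeans(normal_points, k)
    distances = sorted(nearest(p, centroids)[1] for p in normal_points)
    threshold = distances[int(quantile * (len(distances) - 1))]
    return centroids, threshold


def is_anomaly(point, centroids, threshold):
    return nearest(point, centroids)[1] > threshold


rng = random.Random(0)
normal = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in [(0, 0), (10, 10)]
    for _ in range(100)
]
centroids, threshold = fit_detector(normal, k=2)
print(is_anomaly((5.0, 5.0), centroids, threshold))
print(is_anomaly((0.2, -0.1), centroids, threshold))
```

For the per-hour setup from the question, one (centroids, threshold) pair could be stored per hour of the day and looked up when scoring a new event.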

How to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes classifier, following this tutorial.
For the training data, I am using 308 questions, manually tagged into 26 categories.
Before sending the data, I preprocess it with NLP steps: punctuation removal, tokenization, stopword removal, and stemming.
I use this filtered data as input for Mahout.
Using Mahout's Naive Bayes classifier, I train on this data and get the model file. Now when I run the
mahout testnb
command, I get Correctly Classified Instances as 96%.
Now for my test data I am using 100 questions which I have manually tagged. When I use the trained model on the test data, I get Correctly Classified Instances as 1%.
This is pissing me off.
Can anyone suggest what I am doing wrong, or suggest some ways to increase the performance of the NBC?
Also, ideally, how much question data should I use to train and test?
This appears to be the classic problem of "overfitting"... where you get a very high % accuracy on the training set, but a low % in real situations.
You probably need more training instances. There is also the possibility that the 26 categories don't correlate with the features you have. Machine learning isn't magic; it needs some statistical relationship between the variables and the outcomes. What the NBC might effectively be doing here is "memorizing" the training set, which is completely useless for questions it has not seen.
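To see the gap between "accuracy on the data the model was trained on" and "accuracy on unseen questions", here is a toy sketch with a tiny multinomial Naive Bayes written in plain Python. It is not Mahout's implementation, and the example questions and categories are invented. With 26 categories, 1% is below what random guessing would give, so it may also be worth checking that the test questions go through the same preprocessing and label set as the training data; that is a guess, not something the post confirms.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for w in tokens:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


train_docs = [q.split() for q in [
    "reset my password", "change account email",
    "refund my payment", "invoice payment failed",
]]
train_labels = ["account", "account", "billing", "billing"]

held_out_docs = [q.split() for q in [
    "password reset email", "payment refund invoice", "my card was charged twice",
]]
held_out_labels = ["account", "billing", "billing"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
train_acc = accuracy(model, train_docs, train_labels)
held_out_acc = accuracy(model, held_out_docs, held_out_labels)
print("train:", train_acc, "held out:", held_out_acc)
```

The third held-out question uses mostly words the model never saw, which is the situation that more (and more varied) training questions are meant to reduce.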
