Should I train the AutoML model for a day? - google-cloud-automl

I have 12,000 images spread across 12 categories. I uploaded them and trained for 1 hour (free plan). I am not happy with the average precision of 20%.
If I train for 6 or 12 hours, will I get better precision? If so, will it be around 70 to 80%?
I am asking this because the training cost is very high and I am not sure if I will get good returns on the investment :)

It's mentioned in the Image prediction pricing that the free plan trains for one hour and uses only 1,000 images, meaning that your model didn't train on your full 12,000-image dataset. This might explain your low precision.
Yes, the price per training hour is high, but you should consider paying so that your model can train on your whole dataset.
I can't say whether you'll get 70% or 80%, because your model's accuracy generally depends on how long you let it train and on the quality of your training dataset.
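If you do decide to pay for a longer run, here is a rough sketch of requesting a larger training budget with the AutoML Vision Python client (google-cloud-automl). The project ID, dataset ID and display name are placeholders, and 24,000 milli node hours (about 24 node hours, i.e. roughly "a day" of training) is only an illustration:

```python
from google.cloud import automl

# Placeholder values - substitute your own project, region and dataset ID
project_id = "my-project"
dataset_id = "ICN0000000000000000000"

client = automl.AutoMlClient()
parent = f"projects/{project_id}/locations/us-central1"

# 24,000 milli node hours ~= 24 node hours of training budget
metadata = automl.ImageClassificationModelMetadata(
    train_budget_milli_node_hours=24000
)
model = automl.Model(
    display_name="my_12k_image_model",
    dataset_id=dataset_id,
    image_classification_model_metadata=metadata,
)

# create_model returns a long-running operation; training runs server-side
response = client.create_model(parent=parent, model=model)
print("Training operation name:", response.operation.name)
```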
Hope this is helpful :D

Related

Can we measure bulk density from wetland soil samples (of unknown volume)?

I am part of an undergrad pilot study - sampling wetlands for carbon stock. Field samples have been collected. Now it is winter, the ground is frozen, and we realize we should have collected bulk density samples. Dried, ground, + sieved samples have been sent to our lab partners for carbon analysis - per the lab's instructions.
Can we still measure bulk density from our samples?
We still have composite cores with marble- to golf-ball-sized hardened chunks of clay-heavy soil. We also have the remaining ground samples.
Could water displacement work to determine this soil's volume, with the samples sealed in a plastic bag and as much air removed as possible?
One prof suggested that measuring the chunks this way may work.
Or the chunks could be tied to a 'sinker', sans plastic bag - assuming such short-term water immersion would have a minimal effect on the volume of clay-heavy soil.
Is there any way to determine bulk density from our samples?
please and thank you.
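For context, the arithmetic we would be doing is simple: bulk density is the oven-dry mass divided by the intact volume of the chunk. A minimal sketch with made-up numbers, assuming we can oven-dry and weigh the same chunk whose volume we measure by displacement:

```python
def bulk_density(dry_mass_g, displaced_volume_cm3):
    """Bulk density = oven-dry mass / total (intact) volume, in g/cm^3."""
    return dry_mass_g / displaced_volume_cm3

# Hypothetical example: a 35.2 g oven-dried chunk displaces 26.0 cm^3 of water
print(bulk_density(35.2, 26.0))  # ~1.35 g/cm^3
```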

How effective is early stopping when the validation accuracy varies

I am building a time-series forecasting model that consists of 2 BiLSTM layers followed by a dense layer. I have a total of 120 products whose values I need to forecast, and a relatively small dataset (monthly data over a period of 2 years, i.e. at most 24 time steps). When I look at the overall validation accuracy, I get this:
At every epoch, I saved the model weights so that I can load any model at any time in the future.
When I look at the validation accuracy of individual products, I get the following (roughly, for a few products):
For this product, can I use the model saved at epoch ~90 to forecast for this product?
And for the following product, can I use the model saved at epoch ~40 for forecasting?
Am I cheating? Please note that the products are quite diverse and their purchasing behavior differs from one to another. To me, following this strategy is equivalent to training 120 models (one per product), while at the same time feeding more data to each model as a bonus, hoping to achieve better results per product. Is that a fair assumption?
Any help is much appreciated!
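For reference on the checkpointing part, a minimal sketch of saving weights every epoch in Keras (assuming TensorFlow/Keras; the layer sizes, input shape and file paths are illustrative, not the actual model):

```python
import tensorflow as tf

# Illustrative architecture: 2 BiLSTM layers followed by a dense output layer
model = tf.keras.Sequential([
    tf.keras.Input(shape=(24, 1)),  # up to 24 monthly time steps
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Save the weights after every epoch so any epoch's model can be reloaded later
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="weights_epoch_{epoch:03d}.weights.h5",  # hypothetical path
    save_weights_only=True,
    save_freq="epoch",
)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=150, callbacks=[checkpoint])

# Later, to use the checkpoint from, say, epoch 90 for a given product:
# model.load_weights("weights_epoch_090.weights.h5")
```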

h2o autoML track convergence

AutoML stops on a clock. I compared two AutoML runs, where one used a subset of the inputs the other had, to make the same predictions; at 3600 seconds of runtime the fuller model looked better. I repeated this with a 5000-second re-run, and the subset model looked better. They traded places, and that isn't supposed to happen.
I think it is a convergence issue. Is there any way to track the convergence history of the stacked ensemble learners to determine whether they are relatively stable? We have that for parallel and series CART ensembles, and I don't see why a heterogeneous ensemble couldn't do the same.
I have plenty of data, and especially with cross-validation, I would like to rule out the difference being due to random draws of the training vs. validation sets.
I'm running on relatively high-performance hardware, so I don't think it is a case of "too short a runtime". My "all models" count is in the hundreds to a thousand, for what it's worth.
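As an aside, one way to make two such runs more directly comparable is to pin the seed, the folds and the runtime and then compare cross-validated leaderboard metrics instead of a single validation split. A rough sketch with the H2O Python API (the file path, response column and feature subset are placeholders):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

train = h2o.import_file("train.csv")              # placeholder path
y = "target"                                      # placeholder response column
x_full = [c for c in train.columns if c != y]
x_subset = x_full[:20]                            # placeholder feature subset

# Same seed, folds and clock for both runs so the comparison is apples-to-apples
aml_full = H2OAutoML(max_runtime_secs=3600, nfolds=5, seed=1)
aml_full.train(x=x_full, y=y, training_frame=train)

aml_subset = H2OAutoML(max_runtime_secs=3600, nfolds=5, seed=1)
aml_subset.train(x=x_subset, y=y, training_frame=train)

# Compare cross-validated leaderboard metrics, not a single random holdout
print(aml_full.leaderboard.head(rows=10))
print(aml_subset.leaderboard.head(rows=10))
```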

Clustering + Regression - the right approach or not?

I have the task of predicting how quickly goods will sell (for example, within one category). E.g., the client inputs the price he wants his item to sell for, and the algorithm should display that at that price it will sell within n days. It should also have 3 intervals: quick, medium and long sell. Like in the picture:
The question: how exactly should I design the algorithm?
My suggestion: use clustering techniques to find these three price ranges, then solve a regression task within each cluster to predict the number of days. Is that the right approach?
There are two questions here, and I think the answer to each lies in a different domain:
Given an input price, predict how long it will take to sell the item. This is a well-defined prediction problem and can be tackled with ML algorithms, e.g. use your entire dataset to train and test a regression model for prediction.
Translate the prediction into a class: quick-, medium- or slow-sell. This problem is product-oriented - there doesn't seem to be any concrete data allowing you to train a classifier on this translation, and I agree with #anony-mousse that using unsupervised learning might not yield easy-to-use results.
You can either consult your users or a product manager on reasonable thresholds to use (there might be considerations here like the type of item, season etc.), or try getting some additional data in order to train a supervised classifier.
E.g. you could ask your users, post-sale, whether they think the sale was quick, medium or slow. Then you'll have some data to use for thresholding or for classification.
I suggest you simply define thresholds of 10 days and 31 days. Keep it simple.
Because these are the values the users will want to understand. If you use clustering, you may end up with 0.31415 days or similar nonintuitive values that you cannot explain to the user anyway.
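Putting the two answers together, here is a minimal sketch of the simple route: fit one regressor that maps price (plus any other features) to days-to-sell, then bucket the prediction with the fixed 10/31-day thresholds. The model choice and the training data below are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative data only: listing price -> days until the item sold
X_train = np.array([[10.0], [25.0], [40.0], [60.0], [80.0]])
y_train = np.array([4, 9, 15, 28, 45])

model = GradientBoostingRegressor()
model.fit(X_train, y_train)

def sell_speed(price, quick_max=10, medium_max=31):
    """Map the predicted days-to-sell onto fixed, explainable buckets."""
    days = float(model.predict(np.array([[price]]))[0])
    if days <= quick_max:
        return days, "quick"
    if days <= medium_max:
        return days, "medium"
    return days, "slow"

print(sell_speed(30.0))  # e.g. roughly 12 days -> "medium" on this toy data
```

If you later collect post-sale quick/medium/slow feedback from users, you can swap the fixed thresholds for a classifier trained on that feedback.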

Mahout - Naive Bayes Model Very Slow

I have about 44 Million training examples across about 6200 categories.
After training, the model comes out to be ~ 450MB
And while testing, with 5 parallel mappers (each given enough RAM), the classification proceeds at a rate of ~ 4 items a second which is WAY too slow.
How can I speed things up?
One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
Another way I thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
Any other thoughts?
Edit:
After some of the answers below, I started contemplating some form of down-sampling: running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those "highly" close groups together with the other samples that are not so tightly grouped.
I also started thinking about using some form of data normalization technique that incorporates edit distances over n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html).
I'm also considering using the Hadoop streaming API to leverage some of the ML libraries available in Python, listed here http://pydata.org/downloads/ and here http://scikit-learn.org/stable/modules/svm.html#svm (these, I think, use the liblinear mentioned in one of the answers below).
Prune stopwords and otherwise useless words (too low support etc.) as early as possible.
Depending on how you use clustering, it may actually make the test phase in particular even more expensive.
Try tools other than Mahout. I found Mahout to be really slow in comparison; it seems to come with a really high overhead somewhere.
Using fewer training examples would be an option. You will see that after a certain number of training examples, your classification accuracy on unseen examples won't increase any further. I would recommend training with 100, 500, 1000, 5000, ... examples per category, using 20% for cross-validating the accuracy. When it doesn't increase any more, you have found the amount of data you need, which may be a lot less than what you use now.
Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout; a rough sketch of this route is below.
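For illustration, here is what the liblinear route might look like via scikit-learn's LinearSVC (which wraps liblinear), with stopword and common-term pruning done up front. The corpus and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus - substitute your real documents and category labels
docs = [
    "cheap flights and hotel deals",
    "hotel booking discount offers",
    "python machine learning tutorial",
    "deep learning with python examples",
]
labels = ["travel", "travel", "tech", "tech"]

clf = make_pipeline(
    # Prune stopwords and overly common terms up front; on a 44M-document
    # corpus you would also raise min_df to drop very rare terms
    TfidfVectorizer(stop_words="english", max_df=0.8, min_df=1),
    LinearSVC(),  # backed by liblinear
)
clf.fit(docs, labels)
print(clf.predict(["hotel deals in python"]))
```

The max_df=0.8 setting plays roughly the same role as the maxDFPercent=80 you already use.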
"but i fear losing accuracy" Have you actually tried using less features or less documents? You may not lose as much accuracy as you fear. There may be a few things at play here:
Such a high number of documents are not likely to be from the same time period. Over time, the content of a stream will inevitably drift and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data is just confusing it. You may get much better performance if you train on less data.
The majority of features are not helpful, as #Anony-Mousse said already. You might want to perform some form of feature selection before you train your classifier. This will also speed up training. I've had good results in the past with mutual information.
I've previously trained classifiers for a data set of similar scale and found the system worked best with only 200k features, and using any more than 10% of the data for training did not improve accuracy at all.
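As a rough illustration of mutual-information feature selection (shown with scikit-learn rather than Mahout; the corpus is a placeholder, and k should be tuned - 200k just happened to be my sweet spot):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Placeholder corpus and labels - substitute your own data
docs = [
    "cheap flights and hotel deals",
    "hotel booking discount offers",
    "python machine learning tutorial",
    "deep learning with python examples",
]
labels = ["travel", "travel", "tech", "tech"]

X = CountVectorizer().fit_transform(docs)

# Keep only the k features with the highest mutual information with the label
selector = SelectKBest(mutual_info_classif, k=min(200_000, X.shape[1]))
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)
```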
PS Could you tell us a bit more about your problem and data set?
Edit after question was updated:
Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.
I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run it for each pair of words in the corpus. That's a long time!
I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but you are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there's a reason Naive Bayes is called naive) because they are much faster than the alternative, but performance will often be just a tiny bit worse. Let me give you an example from my experience:
RBF SVM: 85% F1 score, training time ~ a month
Linear SVM: 83% F1 score, training time ~ a day
Naive Bayes: 82% F1 score, training time ~ a day
You find the same thing in the literature: paper. Out of curiosity, what kind of accuracy are you getting?
