Hierarchical time series reconciliation using my own forecasts in Python

I am trying to find a way to reconcile my hierarchical time series, but the only library I have found is
scikit-hts
It forces you to use one of the models it offers.
Is there any other library, like the one in R, where I can use my own predictions in Python?
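Reconciliation itself is just linear algebra, so you can apply it to forecasts produced by any model. A minimal NumPy sketch of OLS reconciliation, assuming a toy two-level hierarchy (total = A + B); the summing matrix and the forecast values are invented for illustration:

    import numpy as np

    # Summing matrix S for a toy hierarchy: total = A + B
    # Rows: [total, A, B]; columns: bottom-level series [A, B]
    S = np.array([[1, 1],
                  [1, 0],
                  [0, 1]])

    # Incoherent base forecasts from any model, one value per node
    y_hat = np.array([102.0, 60.0, 45.0])  # note: 60 + 45 != 102

    # OLS reconciliation: project the base forecasts onto the space
    # of coherent forecasts, y_tilde = S (S'S)^-1 S' y_hat
    P = S @ np.linalg.inv(S.T @ S) @ S.T
    y_tilde = P @ y_hat

    print(y_tilde)                  # reconciled forecasts
    print(y_tilde[1] + y_tilde[2])  # equals y_tilde[0]: coherent now

The same projection works for hierarchies of any depth once you write down the summing matrix.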


Creating simulations with ARIMA and ARIMA-EGARCH

I have a projection of returns on an asset for 20 years (annual).
I want to simulate 10,000 stochastic scenarios using ARIMA and ARIMA-EGARCH models.
Is there any code readily available for this? Or where is a good place to look for examples? What sort of packages/functions would work here?
I have only very basic R skills so far.
Thanks
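Not a full answer, but here is a sketch of the idea in Python (the same approach carries over to R): statsmodels can simulate paths from a fitted ARIMA model, and the arch package can simulate from an AR + EGARCH model. All orders, parameters, and the stand-in data below are placeholders:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA
    from arch import arch_model

    rng = np.random.default_rng(42)
    returns = pd.Series(rng.normal(0.05, 0.15, 60))  # stand-in annual returns

    # ARIMA: fit, then simulate 10,000 twenty-step paths beyond the sample
    arima_res = ARIMA(returns, order=(1, 0, 1)).fit()
    arima_paths = arima_res.simulate(nsimulations=20, repetitions=10_000,
                                     anchor="end")  # one path per repetition

    # AR mean + EGARCH variance: fit, then simulate via the forecast API
    egarch_res = arch_model(returns, mean="AR", lags=1,
                            vol="EGARCH", p=1, o=1, q=1).fit(disp="off")
    fc = egarch_res.forecast(horizon=20, method="simulation",
                             simulations=10_000)
    egarch_paths = fc.simulations.values  # simulated return paths

In R, the analogous tools would be the forecast and rugarch packages.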

Foreground extraction using OpenCV

I am working on a dataset (training + testing) which contains different shopping cart items (e.g. biscuits, soaps, etc.) with different backgrounds, and I need to predict the product ID for all testing images (product IDs are unique for each product; let's say Good-day 10 rs has product ID 1, and so on for the different products).
My approach was to:
1) extract the foreground from the image;
2) apply the SIFT/SURF algorithm to find matching keypoints, or train a Faster R-CNN.
I was also thinking of building a Haar cascade classifier. Can anyone suggest an easy foreground extraction algorithm for this scenario in Python?
For real-time purposes I don't recommend the R-CNN models, since they are built for precision rather than real-time use. SIFT or SURF can recognise scenes, but if the object is deformed in some way they fail easily. A Haar cascade seems to be a good solution. I also recommend checking out YOLO or SSD models, since they can easily be trained with transfer learning and are very successful at real-time object classification. OpenCV also has a DNN module for running these kinds of neural networks.
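For the foreground extraction step specifically, OpenCV's GrabCut is probably the simplest baseline to try. A minimal sketch, assuming the product sits roughly in the middle of the frame so a loose rectangle can initialize the segmentation (the file name and margins are placeholders):

    import cv2
    import numpy as np

    img = cv2.imread("cart_item.jpg")  # placeholder path
    h, w = img.shape[:2]
    mask = np.zeros((h, w), np.uint8)

    # Loose rectangle around the (assumed centered) product
    rect = (int(0.05 * w), int(0.05 * h), int(0.90 * w), int(0.90 * h))

    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5,
                cv2.GC_INIT_WITH_RECT)

    # Keep pixels marked as definite or probable foreground
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    foreground = img * fg[:, :, np.newaxis].astype(np.uint8)
    cv2.imwrite("foreground.png", foreground)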

H2O AutoML building a large number of GBM models

I tried to use AutoML for a binary classification task with a 100-hour budget. It appears that it is just building a large number of GBM models and not getting to the other types (40 built so far).
Is there a way to set the maximum number of GBM models?
There is an order in which AutoML builds the models (the GBMs are first in line). The length of the GBM model building process will depend on how much time you set for max_runtime_secs. If you plan to run it for 100 hours, then a good portion of that will be spent in the GBM hyperparameter space, so I am not surprised that your first 40 models are GBMs. In other words, this is expected behavior.
If you want variety in your models as they are training, then you can run a single AutoML job for a smaller max_runtime_secs (say 2 hours), and then run the AutoML process again on that same project (49 more times at 2 hours each -- or some combination that adds up to 100 hours). If you use the same project_name when you start an AutoML job, a full new set of models (GBMs, RFs, DNNs, GLMs) should be added to the existing AutoML leaderboard.
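In the Python client that pattern would look roughly like this (a sketch only; the file path and column name are placeholders):

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file("train.csv")  # placeholder path
    x = [c for c in train.columns if c != "target"]

    # Several short runs under one project_name accumulate their models
    # on a single shared leaderboard
    for _ in range(3):  # e.g. 50 runs of 2 hours each for a 100-hour budget
        aml = H2OAutoML(max_runtime_secs=7200, project_name="my_automl")
        aml.train(x=x, y="target", training_frame=train)

    print(aml.leaderboard)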
As Erin said, if you run AutoML multiple times with the same project_name the results will accumulate into a single leaderboard and the hyperparameter searches will accumulate into the same grid objects. However, AutoML will still run through the same sequence of model builds, so it will do a GBM hyperparameter search again before it gets to the DL model builds.
It feels like your GBM hyperparameter search isn't converging because the stopping_tolerance is too small for your dataset. There was a bug in pre-release versions of the bindings which forced the stopping_tolerance to 0.001 instead of letting AutoML set it higher, if it calculated that that tolerance was too tight for a small dataset. Which version of H2O-3 are you using?
A bit about stopping criteria:
The stopping_criteria such as max_models, stopping_rounds, and stopping_tolerance apply to the overall AutoML process as well as to the hyperparameter searches and the individual model builds. At the beginning of the run max_runtime_secs is used to compute the end time for the entire process, and then at each stage the remaining overall time is computed and is passed down to the model build or hyperparameter search subtask.
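For example, in the Python client (the values here are arbitrary):

    from h2o.automl import H2OAutoML

    # These criteria apply to the overall run and are passed down to the
    # hyperparameter searches and individual model builds
    aml = H2OAutoML(max_runtime_secs=7200,    # overall time budget
                    max_models=20,            # cap on models built
                    stopping_rounds=3,        # early-stopping patience
                    stopping_tolerance=0.01,  # relative improvement required
                    seed=1)                   # for repeatability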
The Run Time 558:10:56.131 that you posted is really weird. I don't see that sort of output in the AutoML.java code nor in the Python or R bindings. It looks at first glance like this is coming from outside of H2O... Do you have any sense of what the real time was for this run?
We should be able to figure out what's going on if you do the following:
If you're not on the most recent release version (3.14.x), please upgrade.
While we're debugging, please set the seed parameter for your AutoML run so that we get repeatable results.
Please post your stopping criteria, your leaderboard output, your User Feedback output, and send your H2O logs to rpeck (at) h2o.ai and support (at) h2o.ai in case we need to delve further. You can grab the H2O logs from the server or download them using Flow.

Attribute selection in H2O

I am a complete beginner with H2O, and I want to know whether there are any attribute selection capabilities in the H2O framework that can be applied to H2OFrames.
No, there are currently no feature selection functions in H2O -- my advice would be to use Lasso regression (in H2O this means using GLM with alpha = 1.0) to do the feature selection, or simply to let whatever machine learning algorithm you are planning to use (e.g. GBM) see all the features (it will tend to ignore the bad ones, but having bad features in the training data can still degrade the algorithm's performance).
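A minimal sketch of the Lasso approach in the Python client (the file path and column name are placeholders):

    import h2o
    from h2o.estimators import H2OGeneralizedLinearEstimator

    h2o.init()
    train = h2o.import_file("train.csv")  # placeholder path
    x = [c for c in train.columns if c != "target"]

    # alpha = 1.0 gives pure L1 (Lasso) regularization; lambda_search
    # tries a range of penalty strengths
    glm = H2OGeneralizedLinearEstimator(alpha=1.0, lambda_search=True)
    glm.train(x=x, y="target", training_frame=train)

    # Features whose coefficients were shrunk to exactly zero can be dropped
    selected = [name for name, coef in glm.coef().items()
                if coef != 0 and name != "Intercept"]
    print(selected)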
If you'd like, you can make a feature request by filling out a ticket on the H2O-3 JIRA. This seems like a nice feature to have.
In my opinion, yes.
My way is to use AutoML to train on your data; after training, you get a lot of models.
Use the h2o.get_model method or the H2O server page to inspect a model you like.
From it you can get the VARIABLE IMPORTANCES frame,
then pick your features.
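That workflow might look like this in the Python client (a sketch; the path and column names are placeholders):

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file("train.csv")  # placeholder path
    x = [c for c in train.columns if c != "target"]

    aml = H2OAutoML(max_runtime_secs=600, seed=1)
    aml.train(x=x, y="target", training_frame=train)

    # Pull a model off the leaderboard and inspect its variable importances
    # (stacked ensembles have none, so pick a GBM/DRF/GLM row instead)
    model = h2o.get_model(aml.leaderboard[1, "model_id"])
    print(model.varimp(use_pandas=True))  # sort, then pick your features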

Contextual Search: Classifying shopping products

I have got a new (non-traditional) task from my client, and it involves machine learning.
As I have never done "machine learning" beyond some little data mining stuff, I need your help.
My task is to classify a product found on any shopping site on the basis of gender (whom the product is for), age group, etc. The training data we have is the product's title, keywords (available in the HTML of the product page), and product description.
I did a lot of R&D: I found image recognition APIs (cloudsight, vufind) that returned details of the product image, but that did not fulfill the need; I used Google suggestqueries and looked into many machine learning algorithms, and finally...
I came to know about the "decision tree learning" algorithm, but I cannot figure out how it applies to my problem.
I tried out the "PlayingTennis" dataset but couldn't make sense of what to do.
Can you give me some direction on where to start this journey? Should I focus on the decision tree learning algorithm, or is there another algorithm you would suggest I focus on to categorize the products on the basis of context?
If you like, I will share in detail the things I have already searched to solve my problem.
I would suggest the following:
1) Go through the items in your dataset and classify them manually (decide which gender each item is for). Store each decision so that you can link each item in the original dataset to its target class.
2) Develop an algorithm for converting each item in your dataset into a feature vector, i.e. a vector of numbers (more on how to do this later).
3) Convert your whole dataset, with the appropriate classes, into a dataset that looks like this:
Feature_1, Feature_2, Feature_3, ..., Gender
value_1, value_2, value_3, ..., male
It would be a good decision to store it in a CSV file, since then you can load and process it in different machine learning tools (more about those later).
4) Load the dataset you created at step 3 into a machine learning tool of your choice and try to come up with the best model that can classify the items in your dataset by gender.
5) Store the model created at step 4. It will be part of your production system.
6) Develop production code that takes an unclassified product, creates a feature vector out of it, and passes this feature vector to the model you saved at step 5. The result of this operation should be a predicted gender (see the sketch after this list).
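A compressed sketch of steps 2)-6) in Python with scikit-learn (I suggest Weka below; this just shows the same flow in code, with toy data and invented file names):

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # Step 1) output: manually labelled items (title/description, gender)
    texts = ["men leather wallet brown", "floral summer dress",
             "unisex running shoes"]
    labels = ["male", "female", "unisex"]

    # Steps 2)-4): feature extraction plus model training in one pipeline
    model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier())
    model.fit(texts, labels)

    # Step 5): store the model for the production system
    joblib.dump(model, "gender_model.joblib")

    # Step 6): production code loads the model and classifies new products
    loaded = joblib.load("gender_model.joblib")
    print(loaded.predict(["blue denim jacket"]))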
Details
If there are too many items in your original dataset (say tens of thousands), it may be impractical to classify them all yourself. What you can do is use Amazon Mechanical Turk to simplify the task. If you are unable to use it (the last time I checked, you had to have a USA address to use it), you can classify just a few hundred items to start working on your model, and classify the rest later to improve the accuracy of your classifier (the more training data you use, the better the accuracy, but only up to a certain point).
How to extract features from a dataset
If a keyword has a form like tag=true/false, it's a boolean feature.
If a keyword has a form like tag=42, it's a numerical or ordinal one. For example, it can be a price value or a price range (0-10, 10-50, 50-100, etc.).
If a keyword has a form like tag=string_value, you can convert it into a categorical value.
The class (gender) is simply a boolean value, 0/1.
You can experiment a bit with how you extract your features, since it may influence the result accuracy.
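For instance, scikit-learn's DictVectorizer handles exactly this kind of mixed tag data (a minimal sketch; the tag names are invented):

    from sklearn.feature_extraction import DictVectorizer

    # One dict of tag=value pairs per product
    items = [
        {"on_sale": True,  "price": 42.0, "category": "shoes"},
        {"on_sale": False, "price": 9.5,  "category": "biscuits"},
    ]

    # Booleans and numbers pass through as numeric features;
    # strings become one-hot categorical features
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(items)
    print(vec.get_feature_names_out())
    print(X)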
How to extract features from product description
There are different ways to convert a text into a feature vector. Look for TF-IDF algorithms or something similar.
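For example, with scikit-learn (a minimal sketch on toy descriptions):

    from sklearn.feature_extraction.text import TfidfVectorizer

    descriptions = ["soft cotton t-shirt for men",
                    "elegant evening dress with lace"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(descriptions)  # sparse (n_items, n_terms) matrix
    print(X.shape)
    print(vec.get_feature_names_out())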
Machine learning tools
You can use one of the existing machine learning libraries and hack together some code that loads your CSV dataset, trains a model, and checks the accuracy, but at first I would suggest using something like Weka. It has a more or less intuitive UI, and you can quickly start experimenting with different machine learning algorithms, converting features in your dataset from strings to categories, from real values to ordinal values, etc. A good thing about Weka is that it has a Java API, so you can automate the whole process of data conversion, train models programmatically, etc.
What algorithms to choose
I would suggest using decision tree algorithms like C4.5. They are fast and show good results on a wide range of machine learning tasks. Additionally, you can use an ensemble of classifiers: there are various algorithms that combine several models (google for boosting or random forest to find out more). They usually give better results but work more slowly, since you need to run a single feature vector through several models.
Another trick you can use to make your classifier more accurate is to use models that work on different sets of features (say, one algorithm uses features extracted from the tags and another uses data extracted from the product description). You can then combine them using algorithms like stacking to come up with a final result.
For classification on the basis of features extracted from text, you can try the Naive Bayes algorithm or an SVM. Both show good results in text classification.
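Both are drop-in classifiers on top of TF-IDF features in scikit-learn (a sketch with toy data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["men leather wallet", "floral summer dress"]
    labels = ["male", "female"]

    for clf in (MultinomialNB(), LinearSVC()):
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(texts, labels)
        print(type(clf).__name__, model.predict(["denim jacket for men"]))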
Do consider the support vector classifier (SVC), or, for Google's sake, the support vector machine (SVM). If you have a large training set (which I suspect you do), search for implementations described as "fast" or "scalable".
