How should I use H2O DAI on 2 target columns? The current version (AMI ID: h2oai-driverless-ai-1.0.19 (ami-46e5dd3c)) only allows 1 target column. The 2 target columns of interest are both float64 type. Thanks.
Currently H2O DAI only supports one target. As noted in the comments, you can run two separate experiments. For ideas on how to handle predicting lat and long, the following Kaggle competition may be of interest: https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/
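If you prefer to script the two experiments rather than click through the GUI, newer DAI releases ship a driverlessai Python client. A rough sketch of the two-experiment approach (the client postdates 1.0.19, and the address, credentials, file name, and column names below are placeholders):
import driverlessai
# Connect to a running DAI instance (address/credentials are placeholders)
dai = driverlessai.Client(address="http://localhost:12345", username="user", password="pass")
ds = dai.datasets.create(data="train.csv", data_source="upload")
# One regression experiment per float64 target column
exp_lat = dai.experiments.create(train_dataset=ds, target_column="lat", task="regression")
exp_lon = dai.experiments.create(train_dataset=ds, target_column="lon", task="regression")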
Is it possible (and how?) to provide time series for binary classification in H2O.ai's Driverless AI? I have a dataframe that looks like this:
ID
Status/Target [0/1]
TimeStamp for events that happened on given ID, in last 90 days
Details of those events (category, description, values, etc...)
Ideally, what I want is to build a model that predicts the status for a given ID, based on the provided history of events.
You can use H2O's Driverless AI out of the box for time-series modeling. See this section. You need to provide the "Time Column" as your TimeStamp and add ID to your "Time Groups Column".
If your target column is 0s or 1s, then it should automatically be identified as binary. If not, you can toggle it from regression to binary classification.
Just started with H2O AutoML so apologies in advance if I have missed something basic.
I have a binary classification problem where the data are observations from K years. I want to train on the first K-1 years, and tune the models and select the best one explicitly based on the remaining year, year K.
If I switch off cross-validation (with nfolds=0) to avoid randomly blending years into the N folds, and define the data of year K as the validation_frame, then the ensemble is not created (as expected, according to the documentation), which in fact I need.
If I train with cross-validation (default nfolds) and define a validation frame to be the year-K data
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(x=x, y=y, training_frame=years_1_to_k_minus_1, validation_frame=year_k)
then according to
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
the validation_frame is ignored
"...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."
Is there a way to get the tuning of the models and the selection of the best one (ensemble or not) based on the year-K data only, while the ensemble of models is also available in the output?
Thanks a lot!
You don't want to use cross-validation (CV) if you are dealing with time-series (non-IID) data, since you don't want folds from the future predicting the past.
I would explicitly add nfolds=0 so that CV is disabled in AutoML:
aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x, y=y, training_frame=years_1_to_k_minus_1, validation_frame=year_k)
To have an ensemble, add a blending_frame, which also applies to time series. See more info here.
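For reference, a minimal sketch of that with the Python API (the frame names follow the question's setup and are assumptions; blending_frame and leaderboard_frame are parameters of H2OAutoML.train in recent H2O-3 releases):
import h2o
from h2o.automl import H2OAutoML
h2o.init()
# Hold out a slice of years 1..K-1 for blending; with time series you may
# prefer the most recent slice over a random split
train, blend = years_1_to_k_minus_1.split_frame(ratios=[0.8], seed=1)
aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x, y=y,
          training_frame=train,
          blending_frame=blend,      # enables Stacked Ensembles without CV
          leaderboard_frame=year_k)  # rank models on year-K data only
print(aml.leaderboard)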
Additionally, since you are dealing with time-series data, I would recommend adding time-series transformations (e.g. lags), so that your model gets info from previous years and their aggregates (e.g. a weighted moving average).
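For example, a quick pandas sketch of lag and trailing-aggregate features computed before importing into H2O (the file and column names here are hypothetical):
import pandas as pd
df = pd.read_csv("data.csv")          # hypothetical input file
df = df.sort_values(["id", "year"])   # order each series chronologically
grp = df.groupby("id")["value"]
df["value_lag1"] = grp.shift(1)       # previous year's value, per id
df["value_lag2"] = grp.shift(2)
# Trailing 3-year mean, shifted so the current year never sees itself (no leakage)
df["value_ma3"] = grp.transform(lambda s: s.shift(1).rolling(3).mean())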
We are using Google AutoML Tables with CSV files as input. We imported the data, linked the schema (with nullable columns), trained a model, deployed it, and used online prediction to predict the value of one column.
The column we targeted has values with a min-max range of 44-263.
When we deployed and ran online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it back to values in the range 44-263? We didn't find much documentation online about this.
We are looking for a documentation reference and an interpretation of the result, along with the meaning of the 95% prediction interval.
Actually, to clarify (I'm the PM of AutoML Tables):
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a min/max of 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column
2) Your input features for this prediction are dramatically different from what was seen in the data used for training.
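To rule out (1), a quick sanity check is to confirm that the label column you selected really spans 44-263 in the training CSV (a pandas sketch; the file and column names are hypothetical):
import pandas as pd
df = pd.read_csv("training_data.csv")  # the CSV imported into AutoML Tables
print(df["target"].describe())         # min/max should be roughly 44 and 263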
Please feel free to reach out to cloud-automl-tables-discuss@googlegroups.com if you'd like us to help debug further.
I've been working with the h2o.ai automl function on a few problems with quite a bit of success, but have come across a bit of a roadblock.
I've got a problem that uses 500-odd predictors (all float) to map onto 6 responses (again, all float).
Required Data Parameters
y: This argument is the name (or index) of the response column.
3.16 docs
It seems that the automl library only handles a single response. Am I missing something? Perhaps in the terminology even?
In the case that I'm not, my plan is to build 6 separate leaderboards, one for each response, and use the results to kick-start a manual network search.
In theory I guess I could actually run the 6 automl models individually to get the vector response, but that feels like an odd approach.
Any insight would be appreciated,
Cheers.
Not just AutoML, but H2O generally, will only let you predict a single thing.
Without more information about what those 6 outputs represent, and their relationship to each other, I can think of 3 approaches.
Approach 1: 6 different models, as you suggest (a sketch follows below).
Approach 2: Train an auto-encoder to compress the 6 dimensions down to 1. Then train your model to predict that single value, and expand it back out, e.g. with a lookup table on the training data: if your model predicts 1.123, and [1,2,3,4,5,6] was represented by 1.122 while [3.14,0,0,3.14,0,0] was represented by 1.125, you could choose [1,2,3,4,5,6], or a weighted average of those 2 closest matches. (Other dimension-reduction approaches, such as PCA, are the same idea.)
Approach 3: If the possible combinations of your 6 floats is a (relatively small) finite set, you could have an explicit lookup table, to N categories.
I assume each is a continuous variable (hence the floats), so I expect approach 3 to be inferior to approach 2. If there is very little correlation/relationship between the 6 outputs, approach 1 is going to be best.
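To make approach 1 concrete, a minimal sketch with the H2O Python API (the file names, response names, and runtime budget are hypothetical):
import h2o
from h2o.automl import H2OAutoML
h2o.init()
train = h2o.import_file("train.csv")  # hypothetical training file
test = h2o.import_file("test.csv")    # hypothetical scoring file
responses = ["y1", "y2", "y3", "y4", "y5", "y6"]  # hypothetical response names
predictors = [c for c in train.columns if c not in responses]
# One AutoML run (and one leaderboard) per response column
leaders = {}
for resp in responses:
    aml = H2OAutoML(max_runtime_secs=600, seed=1)
    aml.train(x=predictors, y=resp, training_frame=train)
    leaders[resp] = aml.leader
# Column-bind the 6 single-column predictions into the 6-float response vector
preds = leaders[responses[0]].predict(test)
for resp in responses[1:]:
    preds = preds.cbind(leaders[resp].predict(test))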
I've been asked to reduce an existing data model using DataStage ETL.
It's more of an exercise and a way to get to know this program, which I'm very new to.
Of course, the data shall be reduced following some functional rules.
Table : MEMBERSHIP (..,A,B,C) # where A,B,C are different attributes (our filters)
Reducing data from ~700k rows to 7k rows or so.
I was thinking about keeping the same percentages as in the data source.
Therefore, if we have 70% A, 20% B and 10% C, we would have pretty much the same percentages in the reduced version.
I'm looking for the best way to do so and the built-in tools to use (maybe the Aggregator stage?).
Is there any way to do some scripting similar to PL with DataStage?
I hope I've been clear enough. If you have any advice, I'd be very grateful.
Thanks to all of you.
~Whitoo
DataStage does not do percentage-wise reductions.
What you can do is use a Transformer stage or a Filter stage to filter out data from the source based on certain conditions. But like I said, the conditions have to be very specific (for example: select only those records which have A = [somevalue] or A not= [somevalue]).
DataStage PX has the Sample stage, which allows you to specify what percentage of the data you want it to sample: http://datastage4you.blogspot.com/2014/01/sample-stage-in-datastage.html.
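DataStage aside, the proportional reduction described in the question is ordinary stratified sampling. For comparison, a minimal pandas sketch (the file name and the use of A, B, C as stratification columns are assumptions):
import pandas as pd
df = pd.read_csv("membership.csv")  # hypothetical export of MEMBERSHIP
frac = 7_000 / len(df)              # roughly 1% when reducing ~700k rows to 7k
# Sample the same fraction within each stratum so the A/B/C mix is preserved
reduced = df.groupby(["A", "B", "C"], group_keys=False).sample(frac=frac, random_state=1)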