I am using H2O AutoML for modelling in R. I found that AutoML supports the keep_cross_validation_predictions option on the H2O web interface (i.e. Flow), but it doesn't support it when we use the R interface. Please help me understand why this is happening.
Neither the Flow web interface nor the R/Python APIs expose the keep_cross_validation_predictions option for AutoML. EDIT: This parameter is now exposed as of H2O 3.20.0.1.
However, under the hood, all the models will have this set to TRUE by default because this is required in order to build the Stacked Ensembles at the end of the AutoML run.
If you want to prevent cross-validation from occurring, you can set nfolds=0 for AutoML, in which case no Stacked Ensembles will be built (though I think the CV predictions will still be saved).
Please see the screenshot below, which indicates there is no exposed parameter for keep_cross_validation_predictions. Note, however, that if you are building a regular model in H2O Flow, R, or Python, you will see the parameter keep_cross_validation_predictions.
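On versions where the parameter is exposed (3.20.0.1+), it can be set from the Python API as well. A minimal sketch of the relevant settings, with the cluster-dependent calls commented out since they require a running H2O cluster (parameter values here are illustrative):

```python
# Sketch: AutoML settings including keep_cross_validation_predictions
# (exposed as of H2O 3.20.0.1). Assumes a recent h2o Python package.
automl_params = dict(
    max_models=10,
    seed=1,
    nfolds=5,                                # set nfolds=0 to skip CV entirely
    keep_cross_validation_predictions=True,  # required for Stacked Ensembles
)

# With a running H2O cluster this would be:
# import h2o
# from h2o.automl import H2OAutoML
# h2o.init()
# aml = H2OAutoML(**automl_params)
# aml.train(x=predictors, y=response, training_frame=train)
```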
Related
I am training a model in GCP's AutoML Natural Language Entity extraction.
I have 50+ annotations for each label but still can't start training a model.
Take a look at a screenshot of the train section: the Start training button remains grey and cannot be selected.
Looking at the screenshot, it seems you are training an AutoML Entity Extraction model, in which case this issue appears to be the same as in Unable to start training my GCP AutoML Entity Extraction model on Web UI.
There are thus a couple of reasons that may result in this behavior:
Your dataset is located in a specific region (e.g. "EU"), in which case you need to specify the proper endpoint, as shown in the official documentation.
You might need to increase the number of "Training items per label" to 100 at minimum (see Natural Language limits).
From the aforementioned post, the solution seems to be the first one.
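For that first cause, the fix is to point the client at the regional endpoint. A minimal sketch assuming the `google-cloud-automl` package and an EU-hosted dataset (the client construction is commented out since it requires credentials; the project ID below is a placeholder):

```python
# Sketch: use the EU regional endpoint for an EU-located AutoML dataset,
# per the official documentation.
client_options = {"api_endpoint": "eu-automl.googleapis.com"}

# With credentials configured this would be:
# from google.cloud import automl_v1
# client = automl_v1.AutoMlClient(client_options=client_options)
# parent = "projects/your-project-id/locations/eu"  # placeholder project ID
```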
I have a processor that generates time series data in JSON format. Based on the received data I need to make a forecast using machine learning algorithms on python. Then write the new forecast values to another flow file.
The problem is that when you run such a Python script, it must perform several expensive preprocessing operations: queries to a database, building a complex data structure, initializing forecasting models, etc.
If you use ExecuteStreamCommand, then for each flow file the script will be run every time. Is this true?
Can I create a Python script in NiFi that starts once and then receives flow files many times, keeping a history of previously received data? Or do I need to build an HTTP service that will receive data from NiFi?
You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have Python dependencies, etc. However, I have seen examples of this approach for ML model application (see Tim Spann's examples) and this is generally very effective. The initialization and individual flowfile trigger logic is cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc., onScheduled in NiFi processor parlance) with the execution phase (onTrigger). Some examples exist but I have not personally pursued this with Python specifically. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application in such a way that it ran a long-lived "server" component and each ESC command sent data to it and returned an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example using CDSW to host and deploy the model and NiFi to send it data via HTTP to evaluate.
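The key idea behind the first two options is separating one-time initialization from per-flowfile work. A minimal pure-Python sketch of that pattern (the class and method names here are illustrative, not the actual NiFi scripting API):

```python
class ForecastProcessor:
    """Illustrative init-once / trigger-many pattern (not the NiFi API)."""

    def __init__(self):
        # Expensive setup runs exactly once (onScheduled in NiFi terms):
        # DB connections, model loading, building data structures, etc.
        self.model = self._load_model()
        self.history = []             # retained across flow files

    def _load_model(self):
        # Placeholder for a heavyweight model initialization;
        # here just a running-mean "forecast".
        return lambda series: sum(series) / len(series)

    def on_trigger(self, flowfile_values):
        # Cheap per-flowfile work (onTrigger in NiFi terms).
        self.history.extend(flowfile_values)
        return self.model(self.history)   # forecast from accumulated history

processor = ForecastProcessor()           # initialization happens once
first = processor.on_trigger([1.0, 2.0, 3.0])
second = processor.on_trigger([4.0])      # history is preserved between calls
```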
Make a custom processor that can do that; Java is more appropriate. I believe you can do pretty much everything with Java, you just need to find the libraries. There might be some issues with initialization and preprocessing, but all of that can be handled in NiFi's init function, which allows you to preserve the state of certain components.
In my use case, I had to build a custom processor that takes in images and counts the number of people in each image. For that, I loaded a deep learning model once in the init method, and afterwards the onTrigger method could reuse the reference to that model every time it processed an image.
Relatively new to ML and h2o. Is there a way to do collaborative learning/training with h2o? I would prefer a way that uses the Flow UI; otherwise I would use Python.
My use case is that new feature samples x=[a, b, c, d] would periodically come into a system where an h2o algorithm (say, running from a Java program using a MOJO) assigns a binary class. Users should be able to manually reclassify each sample as either good (0) or bad (1), at which point these samples (with their newly assigned responses) get sent back to the h2o algorithm to further train it.
Thanks
The Flow UI is great for prototyping something very quickly with H2O without writing a single line of code. You can ingest the data, build the desired model, and evaluate the results. Unfortunately, the Flow UI cannot be extended to do what you're asking; Flow is limited in that respect.
For collaborative learning you can write your whole application directly in python or R and it will work as expected.
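One way to structure that feedback loop in Python is to buffer the user-relabeled samples and retrain once enough corrections accumulate (for algorithms such as GBM, DRF, or Deep Learning, the `checkpoint` parameter lets the retrain continue from the previous model). A minimal, library-agnostic sketch where the retrain callable stands in for your actual H2O training call:

```python
class FeedbackLoop:
    """Buffer user-relabeled samples and retrain once enough accumulate.
    The retrain callable is a stand-in for an H2O training call."""

    def __init__(self, retrain, batch_size=3):
        self.retrain = retrain
        self.batch_size = batch_size
        self.buffer = []          # (features, corrected_label) pairs
        self.retrain_count = 0

    def add_correction(self, features, corrected_label):
        self.buffer.append((features, corrected_label))
        if len(self.buffer) >= self.batch_size:
            self.retrain(self.buffer)   # e.g. model.train(...) in H2O
            self.retrain_count += 1
            self.buffer = []            # start collecting the next batch

# Usage with a dummy retrain callback that just records each batch:
trained_batches = []
loop = FeedbackLoop(retrain=trained_batches.append, batch_size=2)
loop.add_correction([1, 2, 3, 4], 0)
loop.add_correction([5, 6, 7, 8], 1)    # triggers a retrain
```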
Using h2o Steam's prediction service for a deployed model, the default threshold used by the prediction service seems to be the max F1 threshold. However, in my case I would like to be able to use other thresholds (as displayed by the model when built in h2o Flow), e.g. the max F2 or max accuracy thresholds.
Is there a way to set these thresholds in steam?
Looking at the inspector on the prediction service page, it seems that the logic for the predictor comes from a script called "predict.js" (see below):
But I can't find where these files are in the Steam launch directory (running from localhost based on these instructions); a file search in this directory for anything named "predict.js" returns nothing.
I believe for POJOs there's no way to change it as it's hardcoded. Don't know the answer for MOJOs.
Is there a way to automate hyper-parameter optimization for models in H2O's Flow UI, similar to how Python's scikit-learn package provides GridSearchCV and RandomizedSearchCV? Thanks
You can find out how to use Grid Search in Flow here. (Use CV in your grid search by setting nfolds > 1.)
H2O also supports Random (Grid) Search through the programmatic APIs, but it's not currently supported via Flow, so I created a JIRA for that. More info on that at the Grid Search section of the H2O User Guide.
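From the programmatic Python API, random grid search is driven by a `search_criteria` dict with `strategy="RandomDiscrete"`. A minimal sketch with the grid construction commented out, since it requires a running H2O cluster (the hyper-parameter values are illustrative):

```python
# Sketch: H2O random grid search configuration (Python API).
hyper_params = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
}
search_criteria = {
    "strategy": "RandomDiscrete",   # random search instead of full Cartesian
    "max_models": 10,
    "seed": 1,
}

# With a running H2O cluster this would be:
# from h2o.grid.grid_search import H2OGridSearch
# from h2o.estimators import H2OGradientBoostingEstimator
# grid = H2OGridSearch(
#     model=H2OGradientBoostingEstimator(nfolds=5),  # CV via nfolds > 1
#     hyper_params=hyper_params,
#     search_criteria=search_criteria,
# )
# grid.train(x=predictors, y=response, training_frame=train)
```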