Automated hyperparameter optimization in H2O Flow UI

Is there a way to automate hyperparameter optimization for models in H2O's Flow UI, similar to how Python's scikit-learn package includes GridSearchCV and RandomizedSearchCV? Thanks

You can find out how to use Grid Search in Flow here. (Use CV in your grid search by setting nfolds > 1.)
H2O also supports Random (Grid) Search through the programmatic APIs, but it's not currently supported via Flow, so I created a JIRA for that. More information is available in the Grid Search section of the H2O User Guide.
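Until Random Search is supported in Flow, here is a minimal sketch of a random grid search through H2O's Python API (the file name, column names, and hyperparameter values are placeholders):

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.grid import H2OGridSearch

    h2o.init()
    train = h2o.import_file("train.csv")  # placeholder dataset
    train["target"] = train["target"].asfactor()  # placeholder response column

    hyper_params = {
        "max_depth": [3, 5, 7, 9],
        "learn_rate": [0.01, 0.05, 0.1],
        "ntrees": [50, 100, 200],
    }
    # "RandomDiscrete" turns the grid into a random search;
    # max_models and seed bound the search and make it reproducible.
    search_criteria = {"strategy": "RandomDiscrete", "max_models": 20, "seed": 1}

    grid = H2OGridSearch(
        model=H2OGradientBoostingEstimator(nfolds=5),  # nfolds > 1 adds CV
        hyper_params=hyper_params,
        search_criteria=search_criteria,
    )
    grid.train(x=["x1", "x2", "x3"], y="target", training_frame=train)
    print(grid.get_grid(sort_by="auc", decreasing=True))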

Related

In Google Cloud Platform, «Start training» is disabled

I am training a model in GCP's AutoML Natural Language Entity Extraction.
I have 50+ annotations for each label but still can't start training a model. Take a look at a screenshot of the train section: the Start training button remains grey and cannot be selected.
Looking at the screenshot, it seems you are training an AutoML Entity Extraction model, so this issue appears to be the same as in Unable to start training my GCP AutoML Entity Extraction model on Web UI.
There are a couple of reasons that may result in this behavior:
Your dataset is located in a specific region (e.g. "EU"), in which case you need to specify the proper endpoint, as shown in the official documentation (see the sketch after this answer).
You might need to increase the number of "Training items per label" to at least 100 (see Natural Language limits).
From the aforementioned post, the solution seems to be the first one.
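If the dataset lives in the EU region, here is a minimal sketch of pointing the client at the regional endpoint, assuming the google-cloud-automl Python client ("my-project" is a placeholder project ID, and exact method signatures vary slightly across client versions):

    from google.cloud import automl_v1beta1 as automl

    # Point the client at the EU regional endpoint; required when the
    # dataset lives in the "eu" location rather than the default "us-central1".
    client = automl.AutoMlClient(
        client_options={"api_endpoint": "eu-automl.googleapis.com:443"}
    )

    # "my-project" is a placeholder project ID.
    parent = "projects/my-project/locations/eu"
    for dataset in client.list_datasets(parent=parent):
        print(dataset.name)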

keep_cross_validation_predictions parameter in H2O AutoML

I am using H2O AutoML for modelling in R. I found that AutoML supports the keep_cross_validation_predictions option on the H2O web interface (i.e. Flow) but doesn't support it when run through the R interface. Please help me understand why this is happening.
Neither the Flow web interface nor R/Python expose the keep_cross_validation_predictions option for AutoML. EDIT: This parameter is now exposed as of H2O 3.20.0.1.
However, under the hood, all the models will have this set to TRUE by default because this is required in order to build the Stacked Ensembles at the end of the AutoML run.
If you want to prevent cross-validation from occurring, you can set nfolds = 0 for AutoML, in which case no Stacked Ensembles will be built (though I think the CV predictions will still be saved).
Please see the screenshot below, which shows that there is no exposed keep_cross_validation_predictions parameter. Note, however, that if you are building a regular (non-AutoML) model in H2O Flow, R, or Python, you will see the keep_cross_validation_predictions parameter.
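As of H2O 3.20.0.1 you can set it explicitly. A minimal sketch via the Python API (shown in Python to keep the examples in one language; the R h2o.automl() call accepts the same argument, and the file and column names below are placeholders):

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file("train.csv")  # placeholder dataset
    train["target"] = train["target"].asfactor()  # placeholder response column

    aml = H2OAutoML(
        max_models=10,
        nfolds=5,  # nfolds=0 would skip CV (and then no Stacked Ensembles)
        keep_cross_validation_predictions=True,  # exposed since 3.20.0.1
        seed=1,
    )
    aml.train(y="target", training_frame=train)
    print(aml.leaderboard)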

Is there a way to do collaborative learning with h2o.ai (flow)?

Relatively new to ML and H2O. Is there a way to do collaborative learning/training with H2O? I would prefer a way that uses the Flow UI, else I would use Python.
My use case is that new feature samples x = [a, b, c, d] would periodically come into a system where an H2O algorithm (say, running from a Java program using a MOJO) assigns a binary class. Users should be able to manually reclassify each sample as either good (0) or bad (1), at which point these samples (with their newly assigned responses) get sent back to the H2O algorithm to be used to further train it.
Thanks
The Flow UI is great for prototyping something very quickly with H2O without writing a single line of code. You can ingest the data, build the desired model, and evaluate the results. Unfortunately, the Flow UI cannot be extended to cover the use case you describe; Flow is limited in that respect.
For collaborative learning you can write your whole application directly in Python or R and it will work as expected.
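A rough sketch of that feedback loop in Python, under the assumption that each round fully retrains the model on the grown training set (H2O has no general incremental-training mode); the file names, column names, and the relabeling stub are hypothetical:

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()

    def retrain(frame):
        # Full retrain on the grown training set; each round rebuilds
        # the model from scratch.
        frame["label"] = frame["label"].asfactor()
        model = H2OGradientBoostingEstimator(ntrees=100)
        model.train(x=["a", "b", "c", "d"], y="label", training_frame=frame)
        return model

    def collect_user_relabels(batch, preds):
        # Placeholder for the manual-review step: in a real system users
        # would flip each prediction to good(0) or bad(1); here we simply
        # accept the model's predictions as the new labels.
        batch["label"] = preds["predict"]
        return batch

    train = h2o.import_file("labeled_history.csv")  # placeholder file
    model = retrain(train)

    # Periodically: score a new batch, collect user corrections,
    # append them to the training set, and retrain.
    new_batch = h2o.import_file("new_samples.csv")  # placeholder file
    corrected = collect_user_relabels(new_batch, model.predict(new_batch))
    train = train.rbind(corrected)
    model = retrain(train)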

How to do text analysis in Spark

I'm quite familiar with Hadoop but totally new to Apache Spark. Currently I'm using the LDA (Latent Dirichlet Allocation) algorithm implemented in Mahout to do topic discovery, but as I need to make the process faster I'd like to use Spark. However, the LDA (or CVB) algorithm is not implemented in Spark MLlib. Does this mean I have to implement it from scratch myself? If so, does Spark provide tools that make that easier?
LDA has been added to Spark very recently; it is not part of the current 1.2.1 release.
However, you can find an example in the current SNAPSHOT version: LDAExample.scala
You can also read interesting information in the SPARK-1405 issue.
So how can I use it?
The simplest way, while it is not yet released, is probably to copy the following classes into your project, as if you had coded them yourself:
LDA.scala
LDAModel.scala
Actually, Spark 1.3.0 is out now, so LDA is available!
cf. https://issues.apache.org/jira/browse/SPARK-1405
Regarding how to use the new Spark LDA API in 1.3:
Here is an article describing the new API: Topic modeling with LDA: MLlib meets GraphX
It also links to example code showing how to vectorize text input: Github LDA Example
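The 1.3-era API was MLlib/Scala-first; on a newer Spark you can also do this from Python with the DataFrame-based API. A minimal sketch (the toy corpus and k value are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, CountVectorizer
    from pyspark.ml.clustering import LDA

    spark = SparkSession.builder.appName("lda-example").getOrCreate()

    # Toy corpus; real input would be your documents.
    docs = spark.createDataFrame(
        [(0, "spark is fast and general"),
         (1, "hadoop mapreduce does batch processing"),
         (2, "topic models discover latent topics")],
        ["id", "text"],
    )

    # Vectorize the text: tokenize, then build term-count vectors.
    words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
    cv = CountVectorizer(inputCol="words", outputCol="features").fit(words)
    vectors = cv.transform(words)

    # Fit LDA with k topics and inspect the top terms per topic.
    model = LDA(k=2, maxIter=20).fit(vectors)
    model.describeTopics(3).show(truncate=False)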

Charting Software for Performance Stats

I'm looking for a charting library similar to amCharts that allows you to create graphs with a timeline, e.g.:
http://www.amcharts.com/stock/
It should also allow you to select a range within the chart and zoom in to see further specifics. The purpose of this is visualizing performance-related information such as I/O stats. Does anybody know of an open source library that allows this? A Ruby (most preferable) or Python library would be ideal.
I don't know if it's going to be at all suitable for your needs, but a recent issue of Communications of the ACM had an article about Protovis. The graphics are very impressive. Still on my todo list, though.
