XGBoost with AutoML in Flow - h2o

I have fitted various H2O models, including XGBoost, in R and also within Flow, predicting count data (non-negative integers).
I can fit XGBoost models in Flow from the "Model" menu, and I would like XGBoost to be included when I run AutoML as well; however, XGBoost is not listed. The available algorithms are:
GLM
DRF
GBM
DeepLearning
StackedEnsemble
The response column is coded as INT, and the version details are:
H2O Build git branch rel-wright
H2O Build git hash 0457fda98594a72aca24d06e8c3622d45bd545d2
H2O Build git describe jenkins-rel-latest-stable-1-g0457fda
H2O Build project version 3.20.0.8
H2O Build age 1 month and 15 days
H2O Built by jenkins
H2O Built on 2018-09-21 16:54:12
H2O Internal Security Disabled
Flow version 0.7.36
How can I include XGBoost when running AutoML in Flow?

XGBoost was only added to AutoML recently (you can see the changes for each version here: https://github.com/h2oai/h2o-3/blob/master/Changes.md).
If you would like access to XGBoost within H2OAutoML, please upgrade to the latest version, which is currently 3.22.0.1: http://h2o-release.s3.amazonaws.com/h2o/rel-xia/1/index.html
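Once you are on a 3.22+ build, AutoML should try XGBoost by default on platforms where H2O's XGBoost backend is available, and it should likewise appear in Flow's AutoML algorithm list. A minimal sketch from the Python API, with a hypothetical dataset path and response column name:
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("path/to/counts.csv")  # hypothetical dataset

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="count_response", training_frame=train)  # hypothetical column
print(aml.leaderboard)  # XGBoost_* models should appear on supported platforms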

Related

What is Apache Maven, and how do I install GeoMesa FS on Ubuntu 20.04 through Apache Maven?

I am completely new to spatiotemporal data analysis, and GeoMesa seems to provide all the functionality I need for my project.
Let's say I have a pandas DataFrame or a SQL server with all the location data, like:
latitude
longitude
shopid
and
latitude
longitude
customerid
timestamp
Now (to my knowledge, and assuming the other required data is available) GeoMesa will help me find the nearest shops to a customer on their route, decide whether to show that shop's ad to the customer, find popular shops, and so on.
The GeoMesa installation documentation requires installing Apache Maven, which I did with:
sudo apt install maven
[screenshot of the Maven version output]
Now there are a lot of options for running GeoMesa, and I have several questions:
Is GeoMesa only for distributed systems?
Is it even possible to use GeoMesa for my problem?
Is it a dependency?
Can I use it through Python?
Also, can you suggest the best choice of database for spatiotemporal data?
I downloaded GeoMesa FS since my data has no distributed aspect, but I don't know how to use it.
GeoMesa is mainly used with distributed systems, but not always. Take a look at the introduction in the documentation for more details. For choosing a database, take a look at the Getting Started page. Python is mainly supported through PySpark. Maven is only required for building from the source code, which you generally would not need to do.
If you already have your data in MySQL, you may just want to use GeoTools and GeoServer, which support MySQL.
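To illustrate the PySpark route, here is a minimal sketch, not a tested setup: it assumes the geomesa_pyspark package and a GeoMesa Spark runtime jar are installed, and the bootstrap call (init_sql) and the st_* functions should be checked against the GeoMesa docs for your version.
from pyspark.sql import SparkSession
import geomesa_pyspark  # assumption: installed per the GeoMesa PySpark docs

spark = SparkSession.builder.appName("shops-demo").getOrCreate()
geomesa_pyspark.init_sql(spark)  # registers the st_* spatial UDFs (assumption)

# Stand-ins for the shop table described in the question.
shops = spark.createDataFrame(
    [(1, -73.99, 40.72), (2, -73.98, 40.75)],
    ["shopid", "longitude", "latitude"])
shops.createOrReplaceTempView("shops")

# Distance from each shop to one customer location, nearest first.
spark.sql("""
    SELECT shopid,
           st_distance(st_point(longitude, latitude),
                       st_point(-73.985, 40.73)) AS dist
    FROM shops
    ORDER BY dist
""").show()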

How to specify the H2O version in Sparkling Water?

In a Databricks notebook, I am trying to load an H2O model that was trained for H2O version 3.30.1.3.
I have installed the version of Sparkling Water which corresponds to the Spark version used for the model training (3.0), h2o-pysparkling-3.0, which I pulled from PyPI.
The Sparkling Water server is using the latest version of H2O rather than the version I need. Maybe there is a way to specify the H2O version when I initialize the Sparkling Water context? Something like this:
import h2o
from pysparkling import H2OContext
from pysparkling.ml import H2OBinaryModel
hc = H2OContext.getOrCreate(h2o_version='3.30.1.3')
model = H2OBinaryModel.read('s3://bucket/model_file')
When I run the above code without the argument to H2OContext.getOrCreate(), I get this error:
IllegalArgumentException:
The binary model has been trained in H2O of version
3.30.1.3 but you are currently running H2O version of 3.34.0.6.
Please make sure that running Sparkling Water/H2O-3 cluster and the loaded binary
model correspond to the same H2O-3 version.
Where is the Python API documentation for Sparkling Water? If I could find it, I might be able to determine whether the context initializer takes an H2O version argument, but surprisingly I haven't been able to find it so far with Google or by poking around in the docs.
Or is this something that's instead handled by installing an H2O version-specific build of Sparkling Water? Or perhaps there's another relevant configuration setting someplace?
Did you try notebook-scoped libraries? Notebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library; other notebooks attached to the same cluster are not affected. For reference, see: link
Limitations: notebook-scoped libraries do not persist across sessions. You must reinstall them at the beginning of each session, or whenever the notebook is detached from a cluster.
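Since each Sparkling Water release bundles a specific H2O version, a notebook-scoped install also lets you pin the build whose bundled H2O matches the model's training version (3.30.1.3). A hedged sketch of such a notebook cell; the exact PyPI version string below is an assumption, so check PyPI for the real identifier:
# Databricks notebook cell; version string is an assumption, check PyPI
%pip install h2o-pysparkling-3.0==3.30.1.3-1-3.0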

What's the difference between h2o on multi-nodes and h2o on hadoop?

In H2O site, it says
H2O’s core code is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading.
Does this mean H2O will not work better than other libraries when it runs on a single-node cluster, but will work well on a multi-node cluster? Is that right?
Also what's the difference between h2o on multi-nodes and h2o on hadoop?
Please see the documentation on how to run H2O on Hadoop (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html#hadoop-users), as well as this presentation.
You can think of "H2O on Hadoop" as H2O's certified integration with Hadoop. However, you don't need Hadoop to run H2O in a multi-node environment; you can always set that up manually if you want to, as sketched below.
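A hedged sketch of such a manual (non-Hadoop) multi-node launch: run the same h2o.jar on every machine, pointing each at a shared flatfile that lists all node addresses. Nodes started with the same cluster name and flatfile discover each other and form one cloud; the IPs and memory setting below are illustrative.
# flatfile.txt (one ip:port per node, same file on every machine):
#   192.168.1.10:54321
#   192.168.1.11:54321
java -Xmx4g -jar h2o.jar -name my_cluster -flatfile flatfile.txt -port 54321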

Automating H2O Flow: run flow from CLI

I've been an h2o user for a little over a year and a half now, but my work has been limited to the R API; H2O Flow is relatively new to me. If it's new to you as well: it's basically 0xdata's version of IPython. However, IPython lets you export your notebook to a script, and I can't find a similar option in Flow...
I'm at the point of moving a model (built in Flow) to production, and I'm wondering how to automate it. With the R API, after the model was built and saved, I could easily load it in R and make predictions on new data simply by running nohup Rscript <the_file> & from the CLI, but I'm not sure how I can do something similar with Flow, especially since it's running on Hadoop.
As it currently stands, every run is broken into three pieces, with the flow creating a relatively clunky process in the middle:
preprocess data, move it to hdfs
start h2o on hadoop, nslookup the IP address h2o is running on, manually run the flow cell-by-cell
run the post-prediction clean-up and final steps
This is a terribly intrusive production process, and I want to tie up all the loose ends; however, Flow is making it rather difficult. To distill the question: is there a way to compress the flow into a Hadoop jar and then later just run the jar, like hadoop jar <my_flow_jar.jar> ...?
Edit:
Here's the h2o R package documentation. The R API allows you to load an H2O model, so I tried loading the flow (as if it were an H2O model), and unsurprisingly it did not work (failed with a water.api.FSIOException) as it's not technically an h2o model.
This is really late, but H2O Flow models now come with auto-generated Java code that represents the trained model (called a POJO) and can be copied, say, from your remote Hadoop session into a local Java file. See this quickstart tutorial on how to use the Java object: https://h2o-release.s3.amazonaws.com/h2o/rel-turing/1/docs-website/h2o-docs/pojo-quick-start.html. You'll have to refer to the h2o-genmodel Java API (https://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/docs-website/h2o-genmodel/javadoc/hex/genmodel/easy/EasyPredictModelWrapper.html) to customize how you use the POJO, but you essentially use it as a black box that makes predictions on properly formatted inputs.
Assuming your Hadoop session is remote, replace "localhost" in the example with the IP address of your (remote) Flow session.
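A rough Java sketch of that black-box usage, following the quickstart's pattern; the model class name ("GBM_model_1") and the column names are hypothetical, and you would compile against h2o-genmodel.jar together with the downloaded POJO source:
import hex.genmodel.GenModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.RegressionModelPrediction;

public class PojoDemo {
  public static void main(String[] args) throws Exception {
    // Load the generated model class by name (hypothetical class name).
    GenModel rawModel = (GenModel) Class.forName("GBM_model_1").newInstance();
    EasyPredictModelWrapper model = new EasyPredictModelWrapper(rawModel);

    // One properly formatted input row; keys must match the training columns.
    RowData row = new RowData();
    row.put("feature_1", "0.5");  // hypothetical columns
    row.put("feature_2", "red");

    RegressionModelPrediction p = model.predictRegression(row);
    System.out.println("prediction: " + p.value);
  }
}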

Hadoop cluster set up with 0.23 release (MRv2 or NextGen MR)

As I see it, the latest stable release of Hadoop is 0.20.x, and the latest release is 0.23.x. There seem to be a lot of changes from 0.20.x to 0.23.x.
We were able to set up a small cluster with the stable release (0.20.2) and practice MapReduce programming.
We have seen a lot of new APIs added in 0.23.x. In order to explore them, we need to set up a cluster with the 0.23.x release as well.
Could you point us to documentation for setting up a cluster with the 0.23.x release?
When I untar the tar file, 0.23.x seems completely different; it's not like 0.20.x. Please give us a book or documentation reference where cluster setup is covered from the beginning.
Thanks
MRK
The major difference between 0.23 and pre-0.23 releases is that in 0.23, resource management and application life-cycle management have been separated. Pre-0.23 allowed only MapReduce applications to run, but 0.23 allows other applications besides MapReduce; Hama, Giraph, and some other applications have already been ported, and porting of MPI is in progress.
"We have seen a lot of new APIs added in 0.23.x. In order to explore them, we need to set up a cluster with the 0.23.x release as well."
There haven't been any changes in the user API, so existing applications should run without any code changes, though configuration file changes are required. The 0.23 release is backward compatible from an API perspective.
Here is a consolidated list of MRv2 architecture resources, videos, articles, etc. I will try to keep it updated as I come across new information.
http://www.thecloudavenue.com/p/mrv2resources.html
This is the official documentation for cluster setup in r0.23.0:
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
