How to specify the H2O version in Sparkling Water?

In a Databricks notebook, I am trying to load an H2O model that was trained for H2O version 3.30.1.3.
I have installed the version of Sparkling Water which corresponds to the Spark version used for the model training (3.0), h2o-pysparkling-3.0, which I pulled from PyPI.
The Sparkling Water server is using the latest version of H2O rather than the version I need. Is there perhaps a way to specify the H2O version when I initialize the Sparkling Water context? Something like this:
import h2o
from pysparkling import H2OContext
from pysparkling.ml import H2OBinaryModel
hc = H2OContext.getOrCreate(h2o_version='3.30.1.3')
model = H2OBinaryModel.read('s3://bucket/model_file')
I run the above code without an argument to H2OContext.getOrCreate() and I get this error:
IllegalArgumentException:
The binary model has been trained in H2O of version
3.30.1.3 but you are currently running H2O version of 3.34.0.6.
Please make sure that running Sparkling Water/H2O-3 cluster and the loaded binary
model correspond to the same H2O-3 version.
Where is the Python API reference for Sparkling Water? If I could find it, I might be able to determine whether there is an H2O version argument for the context initializer, but surprisingly I have not been able to find it so far with Google or by poking around in the docs.
Or is this something that's instead handled by installing an H2O version-specific build of Sparkling Water? Or perhaps there's another relevant configuration setting someplace?

Did you try notebook-scoped libraries? Notebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library; other notebooks attached to the same cluster are not affected. See the Databricks documentation on notebook-scoped libraries for details.
Limitations: notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
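For this specific version mismatch, a notebook-scoped install can pin the client to the H2O build the model was trained with. A minimal sketch of a Databricks notebook cell follows; the h2o-pysparkling-3.0 version string is a placeholder, since the release that bundles H2O 3.30.1.3 has to be looked up on the Sparkling Water download page before running this:
# Databricks notebook cell: notebook-scoped installs via the %pip magic.
# The pysparkling version below is a placeholder; look up the
# h2o-pysparkling-3.0 release that bundles H2O 3.30.1.3 before running.
%pip install h2o==3.30.1.3
%pip install h2o-pysparkling-3.0==<release-that-bundles-3.30.1.3>

# In a new cell, confirm the client version after the install:
import h2o
print(h2o.__version__)  # should report 3.30.1.3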

Related

What is Apache Maven, and how do I install GeoMesa FS on Ubuntu 20.04 through Apache Maven?

I am completely new to spatiotemporal data analysis, and I saw that GeoMesa provides all the functionality I need for my project.
Let's say I have a pandas DataFrame or a SQL Server database with all the location data, like
latitude
longitude
shopid
and
latitude
longitude
customerid
timestamp
Now GeoMesa will (to my knowledge, and assuming the other required data is available) help me analyze the nearest shops to a customer along their route and whether to show the customer an ad for that shop, find popular shops, and so on.
The GeoMesa installation documentation requires installing Apache Maven, which I did with:
sudo apt install maven
[screenshot of the installed Maven version]
Now there are a lot of options for running GeoMesa.
Is GeoMesa only for distributed systems?
Is it even possible to use GeoMesa for my problem?
Is it a dependency?
Can I use it through Python?
Also, can you suggest the best choice of database for spatiotemporal data?
I downloaded GeoMesa FS since my data doesn't have any distributed properties,
but I don't know how to use it.
GeoMesa is mainly used with distributed systems, but not always. Take a look at the introduction in the documentation for more details. For choosing a database, take a look at the Getting Started page. Python is mainly supported through PySpark. Maven is only required for building from the source code, which you generally would not need to do.
If you already have your data in MySQL, you may just want to use GeoTools and GeoServer, which support MySQL.
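For the nearest-shop part of the question, plain PySpark is often enough before reaching for GeoMesa at all. The sketch below deliberately uses no GeoMesa functions, only Spark SQL; the shops and customers DataFrames and their sample rows are hypothetical stand-ins for the columns listed in the question:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nearest-shop-sketch").getOrCreate()

# Hypothetical inputs with the columns described in the question.
shops = spark.createDataFrame(
    [(1, 52.52, 13.40), (2, 52.50, 13.35)],
    ["shopid", "lat", "lon"])
customers = spark.createDataFrame(
    [(100, 52.51, 13.38, "2024-01-01 10:00:00")],
    ["customerid", "lat", "lon", "timestamp"])

# Haversine distance (km) between two latitude/longitude pairs.
def haversine_km(lat1, lon1, lat2, lon2):
    dlat, dlon = F.radians(lat2 - lat1), F.radians(lon2 - lon1)
    a = (F.pow(F.sin(dlat / 2), 2)
         + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) * F.pow(F.sin(dlon / 2), 2))
    return 6371.0 * 2 * F.asin(F.sqrt(a))

# Distance from every customer to every shop, then keep the nearest one.
pairs = (customers.alias("c").crossJoin(shops.alias("s"))
         .withColumn("dist_km", haversine_km(F.col("c.lat"), F.col("c.lon"),
                                             F.col("s.lat"), F.col("s.lon"))))
nearest = pairs.groupBy(F.col("c.customerid")).agg(F.min("dist_km").alias("nearest_shop_km"))
nearest.show()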

XGBoost with AutoML in Flow

I have fitted various H2O models, including XGBoost, in R and also within Flow, predicting count data (non-negative integers).
I can fit XGBoost models in Flow from the "Model" menu. However, I would like to include XGBoost when using AutoML, but XGBoost is not listed. The available algorithms are:
GLM
DRF
GBM
DeepLearning
StackedEnsemble
The response column is coded as INT, and the version details are:
H2O Build git branch rel-wright
H2O Build git hash 0457fda98594a72aca24d06e8c3622d45bd545d2
H2O Build git describe jenkins-rel-latest-stable-1-g0457fda
H2O Build project version 3.20.0.8
H2O Build age 1 month and 15 days
H2O Built by jenkins
H2O Built on 2018-09-21 16:54:12
H2O Internal Security Disabled
Flow version 0.7.36
How can I include XGBoost when running AutoML in Flow?
XGBoost was only recently added to AutoML (you can see the changes for each version here: https://github.com/h2oai/h2o-3/blob/master/Changes.md).
If you would like to have access to XGBoost within H2OAutoML, please upgrade to the latest version, which is currently 3.22.0.1: http://h2o-release.s3.amazonaws.com/h2o/rel-xia/1/index.html
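If you prefer not to wait for the Flow UI to expose it, the upgraded Python (or R) API also lets you request XGBoost explicitly. A minimal sketch, assuming a release recent enough to support the include_algos argument; the training file path and the "response" column name are placeholders:
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")  # placeholder path to your training data
aml = H2OAutoML(max_models=10, seed=1,
                include_algos=["XGBoost", "GBM", "GLM"])  # requires a release with include_algos support
aml.train(y="response", training_frame=train)  # "response" is a placeholder column name
print(aml.leaderboard.head())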

What's the difference between h2o on multi-nodes and h2o on hadoop?

On the H2O site, it says:
H2O’s core code is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading.
Does this mean H2O will not perform better than other libraries when it runs on a single-node cluster, but will perform well on a multi-node cluster? Is that right?
Also, what's the difference between H2O on multiple nodes and H2O on Hadoop?
Please see the documentation on how to run H2O on Hadoop: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html#hadoop-users
as well as this presentation.
You can think of "H2O on Hadoop" as H2O's certified integration for Hadoop. However, you don't need Hadoop to run H2O in a multi-node environment; you could always do this manually if you wanted to.
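To make the manual route concrete: every node runs the same h2o.jar with a shared flatfile of node addresses, and the client then connects to any one of them. The addresses, cluster name, and memory setting below are hypothetical:
# On each node, start H2O with the same cluster name and a flatfile listing
# every node as "ip:port", one per line, e.g.:
#   java -Xmx8g -jar h2o.jar -name my-cluster -flatfile flatfile.txt -port 54321
# Then connect the Python client to any node of the running cluster:
import h2o

h2o.init(ip="10.0.0.11", port=54321)  # hypothetical address of one cluster node
h2o.cluster().show_status()           # the reported cloud size should equal the number of nodes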

Automating H2O Flow: run flow from CLI

I've been an H2O user for a little over a year and a half now, but my work has been limited to the R API; H2O Flow is relatively new to me. If it's new to you as well, it's basically 0xdata's version of IPython; however, IPython lets you export your notebook to a script, and I can't find a similar option in Flow...
I'm at the point of moving a model (built in Flow) to production, and I'm wondering how to automate it. With the R API, after the model was built and saved, I could easily load it in R and make predictions on the new data simply by running nohup Rscript <the_file> & from the CLI, but I'm not sure how I can do something similar with Flow, especially since it's running on Hadoop.
As it currently stands, every run is broken into three pieces with the flow creating a relatively clunky process in the middle:
preprocess data, move it to hdfs
start h2o on hadoop, nslookup the IP address h2o is running on, manually run the flow cell-by-cell
run the post-prediction clean-up and final steps
This is a terribly intrusive production process, and I want to tie up all the loose ends, but Flow is making it rather difficult. To distill the question: is there a way to compress the flow into a Hadoop jar and then later just run the jar like hadoop jar <my_flow_jar.jar> ...?
Edit:
Here's the h2o R package documentation. The R API allows you to load an H2O model, so I tried loading the flow (as if it were an H2O model), and unsurprisingly it did not work (it failed with a water.api.FSIOException), because a flow is not technically an H2O model.
This is really late, but (now) H2O Flow models have auto-generated Java code that represents the trained model (called a POJO), which can be cut and pasted (say, from your remote Hadoop session to a local Java file). See here for a quick-start tutorial on how to use the Java object (https://h2o-release.s3.amazonaws.com/h2o/rel-turing/1/docs-website/h2o-docs/pojo-quick-start.html). You'll have to refer to the H2O Java API (https://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/docs-website/h2o-genmodel/javadoc/hex/genmodel/easy/EasyPredictModelWrapper.html) to start customizing how you want to use the POJO, but you essentially use it as a black box that makes predictions on properly formatted inputs.
Assuming your Hadoop session is remote, replace "localhost" in the example with the IP address of your (remote) Flow session.
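If you'd rather not touch Java, the same batch pattern the question describes for R also works from a Python script launched with nohup; in this sketch the model path, input file, and output path are all hypothetical, and it scores with the saved binary model rather than the POJO:
# score_batch.py -- run with: nohup python score_batch.py &
import h2o

h2o.init()  # or h2o.init(ip=..., port=...) to attach to an H2O cluster already running on Hadoop
model = h2o.load_model("hdfs://namenode/models/my_model")          # hypothetical saved-model path
new_data = h2o.import_file("hdfs://namenode/data/new_batch.csv")   # hypothetical input data
preds = model.predict(new_data)
h2o.export_file(preds, "hdfs://namenode/output/predictions.csv", force=True)
h2o.cluster().shutdown()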

Setting up Hadoop in a public cloud

As part of my college project, I would like to modify Hadoop's source code. However, the problem is that I would need at least 20 systems to test it. Is it possible to set up this modified version of Hadoop on public clouds such as Google Cloud Platform or Amazon Web Services? Can you give me an idea of the procedure to follow? I could only find information about setting up the original Hadoop versions in a public cloud; I couldn't find anything relevant to my case. Please do help me out.
Amazon offers Elastic MapReduce, but as you correctly pointed out, you will not be able to deploy your version of Hadoop there.
However, you can still use Amazon or Google Cloud to get plain Linux servers and install your Hadoop on them. It is just a longer process, but not different from any other Hadoop installation if you have done one before.
