Is multi-CPU supported by h2o-xgboost?

Is there a configuration that allows H2OXGBoostEstimator to run multithreaded rather than in the minimal single-CPU configuration, with h2o version 3.15.0.4035?

The XGBoost implementation in H2O is multithreaded, like all other algorithms H2O supports; however, it is platform dependent, as described in the H2O documentation.
So if you run it on Linux and have all the supported libraries available, you will get the advantage of distributed XGBoost; otherwise (for example on OSX) you may fall back to a single-CPU runtime. It all depends on which library is loaded from your OS.
When H2O starts, you will see the following in the log:
10-02 09:25:34.579 10.0.0.46:54321 54229 main INFO: Registered 3 core extensions in: 57ms
10-02 09:25:34.580 10.0.0.46:54321 54229 main INFO: Registered H2O core extensions: [Watchdog, XGBoost, KrbStandalone]
10-02 09:25:34.791 10.0.0.46:54321 54229 main INFO: Registered: 161 REST APIs in: 211ms
10-02 09:25:34.791 10.0.0.46:54321 54229 main INFO: Registered REST API extensions: [XGBoost, Algos, AutoML, Core V3, Core V4]
Then you will see whether the CPU or GPU backend is used, as below:
10-02 09:23:49.952 10.0.0.46:54321 54143 FJ-1-5 INFO: No GPU (gpu_id: 0) found. Using CPU backend.
If you run objdump or ldd to see the libraries loaded with H2O, you will have a better idea of what is missing and causing your XGBoost runtime to fall back to a single CPU.
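If you also want to check from the Python client rather than the logs, here is a minimal sketch (not part of the original answer; the dataset path and response column are placeholders) that uses H2OXGBoostEstimator.available() to test whether a native XGBoost backend was loaded:
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator

h2o.init()  # the startup log should list the XGBoost extension as shown above

# False (with a printed explanation) means only the fallback or no native backend is present
print("XGBoost backend available:", H2OXGBoostEstimator.available())

# Train as usual; on Linux with the native libraries present this runs multithreaded
train = h2o.import_file("train.csv")                 # placeholder path
model = H2OXGBoostEstimator(ntrees=50, backend="auto")
model.train(y="response", training_frame=train)      # placeholder response column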

Related

How to specify the H2O version in Sparkling Water?

In a Databricks notebook, I am trying to load an H2O model that was trained for H2O version 3.30.1.3.
I have installed the version of Sparkling Water which corresponds to the Spark version used for the model training (3.0), h2o-pysparkling-3.0, which I pulled from PyPI.
The Sparkling Water server is using the latest version of H2O rather than the version I need. Maybe there is a way to specify the H2O version when I initiate the Sparkling Water context? Something like this:
import h2o
from pysparkling import H2OContext
from pysparkling.ml import H2OBinaryModel
hc = H2OContext.getOrCreate(h2o_version='3.30.1.3')
model = H2OBinaryModel.read('s3://bucket/model_file')
When I run the above code without an argument to H2OContext.getOrCreate(), I get this error:
IllegalArgumentException:
The binary model has been trained in H2O of version
3.30.1.3 but you are currently running H2O version of 3.34.0.6.
Please make sure that running Sparkling Water/H2O-3 cluster and the loaded binary
model correspond to the same H2O-3 version.
Where is the Python API reference for Sparkling Water? If I could find it, I might be able to determine whether there is an H2O version argument for the context initializer, but surprisingly I have been unable to find it so far with Google and by poking around in the docs.
Or is this something that's instead handled by installing an H2O version-specific build of Sparkling Water? Or perhaps there's another relevant configuration setting someplace?
Did you try notebook-scoped libraries? Notebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library; other notebooks attached to the same cluster are not affected. You can refer to: link
Limitations: notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
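For example, a notebook-scoped install followed by a quick version check might look like the sketch below (the pinned Sparkling Water version string is an assumption; pick the release on PyPI whose bundled H2O is 3.30.1.3):
# Databricks notebook cell 1 -- notebook-scoped install (version pin is an assumption):
#   %pip install h2o-pysparkling-3.0==3.30.1.3-1-3.0

# Databricks notebook cell 2 -- verify that the H2O version now matches the saved model
import h2o
from pysparkling import H2OContext

hc = H2OContext.getOrCreate()
print("H2O client version:", h2o.__version__)   # should print 3.30.1.3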

Cannot find nifi processor called PutHive3QL

Using NiFi 1.11.3, I am not able to find the processor called PutHive3QL.
This processor does not show up in the "Add processor" panel; I only have PutHiveQL.
How can I add this processor, or where can I find it?
The Hive 3 components are not included with the NiFi distribution due to size limitations, but they are built and published as part of the release process. For version 1.11.3, you can find the NAR here.

XGBoost with AutoML in Flow

I have fitted various H2O models, including XGBoost, in R and also within Flow, predicting count data (non-negative integers).
I can fit XGBoost models in Flow from the "Model" menu. However, I would like to include XGBoost when using AutoML, but XGBoost is not listed. The available algorithms are:
GLM
DRF
GBM
DeepLearning
StackedEnsemble
The response column is coded as INT, and the version details are:
H2O Build git branch rel-wright
H2O Build git hash 0457fda98594a72aca24d06e8c3622d45bd545d2
H2O Build git describe jenkins-rel-latest-stable-1-g0457fda
H2O Build project version 3.20.0.8
H2O Build age 1 month and 15 days
H2O Built by jenkins
H2O Built on 2018-09-21 16:54:12
H2O Internal Security Disabled
Flow version 0.7.36
How can I include XGBoost when running AutoML in Flow?
XGBoost was only added to AutoML recently (you can see the changes for each version here: https://github.com/h2oai/h2o-3/blob/master/Changes.md).
If you would like access to XGBoost within H2OAutoML, please upgrade to the latest version, which is currently 3.22.0.1: http://h2o-release.s3.amazonaws.com/h2o/rel-xia/1/index.html
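If you also drive AutoML from the Python client (as an alternative to Flow), a minimal sketch after upgrading looks like this (the dataset path and response column are placeholders; XGBoost is included by default in 3.22+ when the backend is available):
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")            # placeholder path
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y="response", training_frame=train)   # placeholder response column
print(aml.leaderboard)                          # XGBoost models appear here once supported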

h2o is not using all processors

I have a server with 48 processors.
The server is not virtualized and H2O sees 48 processors, but for some reason 16 of them are not being used.
Any advice?
(Screenshot of the H2O cluster info, showing "H2O cluster allowed cores: 32".)
It looks like your H2O cluster was somehow launched with 32 cores instead of the full 48; that is what "H2O cluster allowed cores: 32" indicates. To use all the cores, do the following:
Shut down your existing H2O cluster using h2o.shutdown()
Start a new H2O cluster from R using h2o.init(nthreads = -1), which means that it will use all available cores. If for some reason that does not work, try h2o.init(nthreads = 48).
You can also start the H2O cluster from the command line by typing the following: java -Xmx30g -jar h2o.jar -nthreads 48 and then use h2o.init() to connect inside R.
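The equivalent steps from the Python client would look roughly like this (a sketch, not part of the original answer; the memory size mirrors the -Xmx30g above and is only illustrative):
import h2o

h2o.init()                                   # connect to the existing cluster (capped at 32 cores)
h2o.cluster().shutdown()                     # shut it down

h2o.init(nthreads=-1, max_mem_size="30G")    # restart; nthreads=-1 means use all available cores
h2o.cluster().show_status()                  # "H2O cluster allowed cores" should now report 48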
Feel free to also upgrade to the latest stable version of H2O (3.8.0.2 is slightly outdated; the latest is now 3.8.1.1).
It looks like this was a limitation of the old version. Using 3.10, and now testing 3.12, the issue was fixed.

Hadoop services are not getting started if I use a particular zlib library

I'm trying to use a different zlib library with Hadoop. When I use a particular library, the Hadoop services do not start; they start properly if I use a different library. The only difference between the libraries is that the non-working one uses a device file to send data to a kernel driver for compression. Where could the issue be, and where should I look for the error log in Hadoop?
