Flaky Assert "java.lang.AssertionError: Can't unlock: Not locked!" - h2o

While trying to build an H2O Random Forest via the Python API, I got a flaky error (I've saved the .err file, which is empty, and the .out file in case somebody wants to look at them):
"java.lang.AssertionError: Can't unlock: Not locked!"
It failed two out of the three times I tried. One failure showed progress up to 86% and the other up to 90%. On the third try, I got all the way through. I restarted the H2O server after the first failure.
Running on x86_64 GNU/Linux (Ubuntu, kernel 4.4.0-101-generic), H2O 3.18.0.4, Python 2.7.12.
There are about 33K training-set examples for a multinomial classifier with about 86 classes and 137 numeric input features. Previously, we had no problems (other than some timeout issues) with the system on similar data using much older versions of H2O.
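For reference, here is a minimal sketch of the kind of call my _trainh2o() makes; it is just the standard H2O Random Forest API, and the names and file below are placeholders rather than my actual code:

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# Placeholder data: ~33K rows, 137 numeric features, one multinomial target (~86 classes).
combo_h2odf = h2o.import_file("training_data.csv")
combo_h2odf["label"] = combo_h2odf["label"].asfactor()
feature_cols = [c for c in combo_h2odf.columns if c != "label"]

rf = H2ORandomForestEstimator(ntrees=50, seed=1234)
rf.train(x=feature_cols, y="label", training_frame=combo_h2odf)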
Here's the output to stdout from running my program.
Attempting to start a local H2O server...
Java Version: java version "1.8.0_144"; Java(TM) SE Runtime Environment (build 1.8.0_144-b01); Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
Starting server from /home/ubuntu/django-env/lib/python2.7/site-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpViqbs4
JVM stdout: /tmp/tmpViqbs4/h2o_ubuntu_started_from_python.out
JVM stderr: /tmp/tmpViqbs4/h2o_ubuntu_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
-------------------------- ----------------------------------------
H2O cluster uptime: 08 secs
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.18.0.4
H2O cluster version age: 24 days
H2O cluster name: H2O_from_python_ubuntu_x4p9wv
H2O cluster total nodes: 1
H2O cluster free memory: 6.545 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 2.7.12 final
-------------------------- ----------------------------------------
[stuff omitted]
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Running trainh2o()..
drf Model Build progress: |█████████████████████████████████████████████████████████▉ (failed)| 86%
Traceback (most recent call last):
File "/home/ubuntu/XXX/webapp/XXX/classify_unified/buildpc.py", line 130, in <module>
print("\nTrain Error: {}".format(pc.train()))
File "/home/ubuntu/XXX/webapp/XXX/classify_unified/ProvisionClassifier.py", line 2130, in train
self._trainh2o()
File "/home/ubuntu/XXX/webapp/XXX/classify_unified/ProvisionClassifier.py", line 2090, in _trainh2o
training_frame=self._combo_h2odf)
File "/home/ubuntu/django-env/local/lib/python2.7/site-packages/h2o/estimators/estimator_base.py", line 232, in train
model.poll(verbose_model_scoring_history=verbose)
File "/home/ubuntu/django-env/local/lib/python2.7/site-packages/h2o/job.py", line 77, in poll
"\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
EnvironmentError: Job with key $03017f00000132d4ffffffff$_8f9d9edcff82420eefea1d6cff0f4396 failed with an exception: java.lang.AssertionError: Can't unlock: Not locked!
stacktrace:
java.lang.AssertionError: Can't unlock: Not locked!
at water.Lockable$Unlock.atomic(Lockable.java:197)
at water.Lockable$Unlock.atomic(Lockable.java:187)
at water.TAtomic.atomic(TAtomic.java:17)
at water.Atomic.compute2(Atomic.java:56)
at water.Atomic.fork(Atomic.java:39)
at water.Atomic.invoke(Atomic.java:31)
at water.Lockable.unlock(Lockable.java:181)
at water.Lockable.unlock(Lockable.java:176)
at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:358)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
H2O session _sid_9f4c closed.

Related

A fatal error has been detected by the Java Runtime Environment in JMeter

I'm trying to run a test from JMeter using the "WebDriver Sampler" with 1500 users and a ramp-up of 60 seconds, over one hour.
Everything goes well, but at some point, for example 15 minutes later, I get this error:
ChromeDriver was started successfully.
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to allocate stack guard pages failed.
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to unguard stack red zone failed.
An unrecoverable stack overflow has occurred.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_STACK_OVERFLOW (0xc00000fd) at pc=0x000000006671bbfb, pid=12248, tid=0x0000000000000358
#
# JRE version: Java(TM) SE Runtime Environment (8.0_341-b10) (build 1.8.0_341-b10)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.341-b10 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# V [jvm.dll+0x20bbfb]
#
# Failed to write core dump. Call to MiniDumpWriteDump() failed (Error 0x800705af: The paging file is too small for this operation to complete.
)
#
# An error report file with more information is saved as:
# D:\workspace\test\hs_err_pid12248.log
errorlevel=-1073741819
Press any key to continue . . .
Build step 'Execute Windows batch command' marked build as failure
Finished: FAILURE
I use ChromeDriver headless.
This is the command line that I use in Jenkins:
apache-jmeter-5.5/bin/jmeter.bat -n -t "test.jmx"
JMeter version 5.5.
What is the problem and what are the possible causes?
I also sometimes get this message in the output:
WARNING: Unable to find version of CDP to use for . You may need to include a dependency on a specific version of the CDP using something similar to `org.seleniumhq.selenium:selenium-devtools-v86:4.5.0` where the version ("v86") matches the version of the chromium-based browser you're using and the version number of the artifact is the same as Selenium's.
An unrecoverable stack overflow has occurred
It means that either you have an endless loop somewhere or you are creating a large object which exceeds the thread stack size.
The possible solutions are:
inspect your code for any loops (for, foreach, while) which may fail to exit
increase the stack size by passing the relevant -Xss argument (see the example after this list)
allocate another machine and switch to distributed testing mode with 750 users per machine
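For the second option, JMeter's startup scripts honour the JVM_ARGS environment variable, so an illustrative way to try a larger stack on Windows would be something like (the 2m value is only a starting point to experiment with):

set JVM_ARGS=-Xss2m
apache-jmeter-5.5/bin/jmeter.bat -n -t "test.jmx"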
In general, using Selenium for performance testing is not recommended. It might be a better idea to conduct the main load using JMeter's HTTP Request samplers and use 1-2 threads in another Thread Group running WebDriver Samplers to measure frontend performance, rendering speed, and script execution time, collect Web Vitals metrics, and so on.

SQL Error when querying any tables/views on a Databricks cluster via DBeaver

I am able to connect to the cluster, browse its Hive catalog, and see tables/views and columns/datatypes.
Running a simple select statement from a view on a parquet file produces this error and no other results:
SQL Error [500540] [HY000]: [Databricks][DatabricksJDBCDriver](500540) Error caught in BackgroundFetcher. Foreground thread ID: 180. Background thread ID: 223. Error caught: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available.
Standard Databricks cluster:
Standard_DS3_v2
JDBC URL:
jdbc:databricks://<reducted>.1.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<reducted>/<reducted>;AuthMech=3;UID=token;PWD=<reducted>
Advanced Options Spark Config:
spark.databricks.cluster.profile singleNode
spark.databricks.io.directoryCommit.createSuccessFile false
spark.master local[*, 4]
spark.driver.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true
spark.hadoop.fs.azure.account.key.<reducted>.blob.core.windows.net <reducted>
spark.executor.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true
parquet.enable.summary-metadata false
My local machine:
DBeaver version: 22.1.2.202207091909
macOS version (M1 chip): Monterey 12.4
Java version:
java --version
openjdk 18.0.1 2022-04-19
OpenJDK Runtime Environment Homebrew (build 18.0.1+0)
OpenJDK 64-Bit Server VM Homebrew (build 18.0.1+0, mixed mode, sharing)
I am able to do the following with no errors (Databricks default test dataset):
CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true");
When I run this: select color from diamonds; or this: select * from diamonds;
I get this:
SQL Error [500618] [HY000]: [Databricks][DatabricksJDBCDriver](500618) Error occured while deserializing arrow data: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
Hence, any select query on any object (parquet file or anything else) causes the error described above.
What could be the problem? Any recommendations how to resolve this error? Why am I able to connect and see the metadata of the schemas/tables/views/columns, but not query or view the data?
P.S. I followed this guide exactly: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/dbeaver#step-3-connect-dbeaver-to-your-azure-databricks-databases

Error starting exported model from AutoML Vision

I trained an AutoML Vision Edge model and exported it as a TensorFlow package model. I then tried to run it using the 'gcr.io/automl-vision-ondevice/gcloud-container-1.12.0' image:
docker run --rm --name ${CONTAINER_NAME} -p ${PORT}:8501 -v ${MODEL_PATH}:/tmp/mounted_model/0001 -t ${CPU_DOCKER_GCR_PATH}
This is the output:
2020-03-24 18:49:11.574773: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: default model_base_path: /tmp/mounted_model/
2020-03-24 18:49:11.576100: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2020-03-24 18:49:11.576174: I tensorflow_serving/model_servers/server_core.cc:559] (Re-)adding model: default
2020-03-24 18:49:11.676338: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: default version: 1}
2020-03-24 18:49:11.676387: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: default version: 1}
2020-03-24 18:49:11.676457: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: default version: 1}
2020-03-24 18:49:11.676491: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /tmp/mounted_model/0001
2020-03-24 18:49:11.676551: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /tmp/mounted_model/0001
2020-03-24 18:49:11.713626: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-03-24 18:49:11.748933: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-24 18:49:11.821336: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:310] SavedModel load for tags { serve }; Status: fail. Took 144731 microseconds.
2020-03-24 18:49:11.821400: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: default version: 1} failed: Not found: Op type not registered 'FusedBatchNormV3' in binary running on 2f729ee881b6. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
It seems that the error is "failed: Not found: Op type not registered 'FusedBatchNormV3'"
The model is a standard exported AutoML Vision model that I never touched. The model works fine when served by the Google AutoML Vision deployment, but I want to run it myself. Any help?
Best
André
The error message "failed: Not found: Op type not registered 'FusedBatchNormV3'" is indeed symptomatic of a conflict in runtime versions used for model training and deployment.
The issue lies with the (not configurable) runtime version used by the Console when creating a model version.
The workaround is to train and deploy your model exclusively through the CLI.

Updated H2O in R, Flow won't start

I am using H2O via an Amazon Ubuntu EC2 AMI I created half a year ago. It works well: when needed, I fire up an instance, start H2O in RStudio, go to the Flow interface, do my thing, and close it down again.
But when I try to update H2O to the latest build, I cannot access Flow. Everything apparently works in RStudio, but not Flow. I suspect Java, a needed restart of RStudio, and/or that the build I get is a bleeding-edge build number even though I request the latest stable version.
I follow the instructions here:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html#install-in-r
and this is the RStudio console output:
h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpKNp0jt/h2o_rstudio_started_from_r.out
/tmp/RtmpKNp0jt/h2o_rstudio_started_from_r.err
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
Starting H2O JVM and connecting: .......... Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 22 seconds 380 milliseconds
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.21.0.4364
H2O cluster version age: 3 months and 13 days !!!
H2O cluster name: H2O_started_from_R_rstudio_urm169
H2O cluster total nodes: 1
H2O cluster total memory: 0.86 GB
H2O cluster total cores: 2
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.2 (2017-09-28)
Warning message:
In h2o.clusterInfo() :
Your H2O cluster version is too old (3 months and 13 days)!
Please download and install the latest version from http://h2o.ai/download/
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
[1] "A shutdown has been triggered. "
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
Removing package from ‘/home/rstudio/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
+ if (! (pkg %in% rownames(installed.packages()))) {
install.packages(pkg) }
+ }
install.packages("h2o", type="source", repos=(c("http://h2o-
release.s3.amazonaws.com/h2o/latest_stable_R")))
Installing package into ‘/home/rstudio/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R
/src/contrib/h2o_3.23.0.4471.tar.gz'
Content type 'application/x-tar' length 120706169 bytes (115.1 MB)
==================================================
downloaded 115.1 MB
* installing *source* package ‘h2o’ ...
** R
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (h2o)
The downloaded source packages are in
‘/tmp/RtmpKNp0jt/downloaded_packages’
library(h2o)
Error: package or namespace load failed for ‘h2o’ in get(method, envir =
home):
lazy-load database '/home/rstudio/R/x86_64-pc-linux-gnu-
library/3.4/h2o/R/h2o.rdb' is corrupt
In addition: Warning message:
In get(method, envir = home) : internal error -3 in R_decompress1
Because of the error message I restart R via the menu in rstudio
Restarting R session...
library(h2o)
----------------------------------------------------------------------
Your next step is to start H2O:
> h2o.init()
For H2O package documentation, ask for help:
> ??h2o
After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai
----------------------------------------------------------------------
Attaching package: ‘h2o’
The following objects are masked from ‘package:stats’:
cor, sd, var
The following objects are masked from ‘package:base’:
||, &&, %*%, apply, as.factor, as.numeric, colnames, colnames<-, ifelse,
%in%,
is.character, is.factor, is.numeric, log, log10, log1p, log2, round, signif,
trunc
h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpMdVz9z/h2o_rstudio_started_from_r.out
/tmp/RtmpMdVz9z/h2o_rstudio_started_from_r.err
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.16.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
Starting H2O JVM and connecting: . Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 744 milliseconds
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.23.0.4471
H2O cluster version age: 9 hours and 21 minutes
H2O cluster name: H2O_started_from_R_rstudio_rrc849
H2O cluster total nodes: 1
H2O cluster total memory: 0.86 GB
H2O cluster total cores: 2
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.2 (2017-09-28)
From here on, H2O works in RStudio but Flow won't start.
Any suggestions?
I suggest updating to the newest version, 3.22.0.1. Then initialise the cluster so that it does not bind only to localhost: init(bind_to_localhost=False) (see the sketch below). When you initialise H2O from R or Python, the instance binds to localhost only by default, which means that you can access it from RStudio (because RStudio runs on the server) but not via Flow (because you access Flow from your remote browser).
Another option is to start H2O independently from the command line.
Beware that if you do not bind H2O to localhost only, it is then accessible to anybody who can access the port and the network interface, which can pose a significant security hole (exposing your data, models, etc.).
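A minimal sketch of that option from the Python client, treated as illustrative rather than a drop-in (the R h2o.init() accepts an equivalent bind_to_localhost argument):

import h2o

# Start a local H2O instance that is reachable from other machines,
# so Flow at http://<server-ip>:54321 can be opened from a remote browser.
h2o.init(bind_to_localhost=False)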

H2O cluster startup frequently timing out

Trying to start an H2O cluster on (MapR) Hadoop via Python:
# startup hadoop h2o cluster
import os
import subprocess
import h2o
import shlex
import re
import sys  # needed for sys.exit() below
from Queue import Queue, Empty
from threading import Thread

def enqueue_output(out, queue):
    """
    Communicate streaming text lines from a separate thread.
    See https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
    """
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(
    shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)),
    shell=False,
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# set up message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True  # thread dies with the program
t.start()

# read lines without blocking
h2o_url_out = ''
while True:
    try:
        line = q.get_nowait()  # or q.get(timeout=.1)
    except Empty:
        continue
    else:  # got line
        print line
        # check for first instance of connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit()

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
This frequently throws a timeout error:
Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster
Looking at the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) (and just searching for the word "timeout"), I was unable to find anything that helped (e.g. extending the timeout via hadoop jar h2odriver.jar -timeout <some time> did nothing but extend the time until the timeout error popped up).
I have noticed that this often happens when another H2O cluster instance is already up and running (which I don't understand, since I would think that YARN could support multiple instances), yet it also sometimes happens when no other cluster is initialized.
Does anyone know anything else that could be tried to solve this problem, or how to get more debugging info beyond the error message being thrown by H2O?
UPDATE:
Trying to recreate the problem from the command line, I get:
[me#mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 172.18.4.62]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
mapreduce.map.java.opts: -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
----- YARN cluster metrics -----
Number of YARN worker nodes: 6
----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 2.0 / 8.7 GB used, 1 / 2 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
----------------------------------------------------------------------
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
and I notice the later outputs:
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
I am confused by the reported 0 GB memory and 0 vcores, because there are no other applications running on the cluster, and looking at the cluster details in the YARN RM web UI shows otherwise
(I'm using an image because I could not find a unified place in the log files for this info; why the memory availability is so uneven despite there being no other running applications, I do not know). At this point I should mention that I don't have much experience tinkering with or examining YARN configs, so it's difficult for me to find the relevant information.
Could it be that I am starting the H2O cluster with -mapperXmx 6g, but (as shown in the image) one of the nodes only has 5 GB of memory available, so if this node is randomly selected to contribute to the initialized H2O application, it does not have enough memory to support the requested mapper memory? Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping multiple times without error seems to support this theory (though I need to check further to determine whether I'm interpreting things correctly).
This is most likely because your Hadoop cluster is busy and there just isn't room to start new YARN containers.
If you ask for N nodes, then you either get all N nodes, or the launch process times out like you are seeing. You can optionally use the -timeout command line flag to increase the timeout (see the example below).
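For example, an illustrative variant of the launch command above that lowers the per-mapper memory to fit the smallest node (as tested in the update) and raises the timeout; adjust the values to your queue's actual capacity:

/bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 600 -output hdfsOutputDir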
