H2O H2OServerError: HTTP 500 Server Error when training model - h2o

When trying to train a DRF classifier in h2o (version 3.20.0.5), I get the error "H2OServerError: HTTP 500 Server Error" with no further explanation.
---------------------------------------------------------------------------
H2OServerError Traceback (most recent call last)
<ipython-input-44-f52d1cb4b77a> in <module>()
4 training_frame=train_u, validation_frame=val_u,
5 weights_column='weight',
----> 6 max_runtime_secs=max_train_time_hrs*60*60)
7
8
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/estimators/estimator_base.pyc in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
224 rest_ver = parms.pop("_rest_version") if "_rest_version" in parms else 3
225
--> 226 model_builder_json = h2o.api("POST /%d/ModelBuilders/%s" % (rest_ver, self.algo), data=parms)
227 model = H2OJob(model_builder_json, job_type=(self.algo + " Model Build"))
228
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
101 # type checks are performed in H2OConnection class
102 _check_connection()
--> 103 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
104
105
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
400 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
401 self._log_end_transaction(start_time, resp)
--> 402 return self._process_response(resp, save_to)
403
404 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/backend/connection.pyc in _process_response(response, save_to)
728 # Note that it is possible to receive valid H2OErrorV3 object in this case, however it merely means the server
729 # did not provide the correct status code.
--> 730 raise H2OServerError("HTTP %d %s:\n%r" % (status_code, response.reason, data))
731
732
H2OServerError: HTTP 500 Server Error:
Server error java.lang.NullPointerException:
Error: Caught exception: java.lang.NullPointerException
Request: None
The code snippet in question is shown below:
max_train_time_hrs = 8
drf_proc.train(
    x=train_features, y=train_response,
    training_frame=train_u, validation_frame=val_u,
    weights_column='weight',
    max_runtime_secs=max_train_time_hrs*60*60)
The output from running the h2o.init() command looks like
Checking whether there is an H2O instance running at http://172.18.4.62:54321. connected.
Warning: Your H2O cluster version is too old (7 months and 24 days)! Please download and install the latest version from http://h2o.ai/download/
H2O cluster uptime: 06 secs
H2O cluster timezone: Pacific/Honolulu
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.5
H2O cluster version age: 7 months and 24 days !!!
H2O cluster name: H2O_88021
H2O cluster total nodes: 4
H2O cluster free memory: 15.34 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://172.18.4.62:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: AutoML, XGBoost, Algos, Core V3, Core V4
Python version: 2.7.12 final
While I realize there is a warning that my h2o version is "too old", the h2o python package I am using and the cluster I am connecting to still match, and the cluster cannot be upgraded because other h2o applications that access it expect this particular version (all of those applications appear to run on the cluster without problems). Meanwhile, no web browser I have tried is able to connect to the H2O connection url.
Any ideas about what could be going on here, or debugging steps I could look into?

15GB of memory might not be enough for a training process you expect to last 8hrs. (Aside: I'd recommend using early stopping, rather than, or as well as, max_runtime_secs.)
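For reference, here is a minimal sketch of what early stopping could look like on the DRF estimator, assuming drf_proc is an H2ORandomForestEstimator (its construction is not shown in the question, so the other parameter values are only illustrative):
from h2o.estimators.random_forest import H2ORandomForestEstimator

drf_proc = H2ORandomForestEstimator(
    ntrees=500,                  # illustrative; early stopping will usually end sooner
    score_tree_interval=10,      # score every 10 trees so stopping can trigger
    stopping_rounds=5,           # stop after 5 scoring rounds without improvement
    stopping_metric="logloss",
    stopping_tolerance=1e-3)
drf_proc.train(x=train_features, y=train_response,
               training_frame=train_u, validation_frame=val_u,
               weights_column='weight',
               max_runtime_secs=max_train_time_hrs * 60 * 60)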
As a debugging step, I would recommend watching the Flow interface (point your browser to port 54321 - see the connection URL in your h2o.init() output). In particular, watch how memory usage rises over time.
(Sometimes a "500" error just means it has gone unstable, and lack of memory is a common trigger.)
If you are getting the error immediately, that is less likely to be the problem (unless you have a huge dataset).
In that case I'd try to narrow down whether a particular column or data row could be causing the problem, e.g. (a rough sketch of these splits follows the list):
Experiment 1: first half of columns in train_features
Experiment 2: second half of columns in train_features
Experiment 3: first half of rows in train_u
Experiment 4: second half of rows in train_u
Experiments 5/6 (if still no luck): the same for val_u
If one of the experiment pair crashes but the other doesn't, then repeat the experiment on the crashing half.
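A rough sketch of how those splits could be built with the h2o Python API, assuming train_features is a list of column names and reusing the variable names from the question:
half = len(train_features) // 2
cols_first, cols_second = train_features[:half], train_features[half:]

n = train_u.nrow
rows_first = train_u[0:n // 2, :]     # first half of rows
rows_second = train_u[n // 2:n, :]    # second half of rows

# Experiment 1: first half of columns only
drf_proc.train(x=cols_first, y=train_response,
               training_frame=train_u, validation_frame=val_u,
               weights_column='weight')

# Experiment 3: first half of rows only
drf_proc.train(x=train_features, y=train_response,
               training_frame=rows_first, validation_frame=val_u,
               weights_column='weight')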

Related

Please suggest hardware configuration for network-intensive Flink job (Async I/O)

TL;DR: I am running a Flink streaming job in mode=Batch on EMR. I have tried several EMR cluster configurations, but none of them works as required and some do not work at all. The workflow is very network-intensive, which causes the main problems.
Question: What EMR cluster configuration (ec2 instance types) would you recommend for this use-case?
--
The job has the following stages:
Read from MySQL
KeyBy user_id
Reduce by user_id
Async I/O enriching from Redis
Async I/O enriching from other Redis
Async I/O enriching from REST #1
Async I/O enriching from REST #2
Async I/O enriching from REST #2
Write to Elasticsearch
Other info:
Flink version: 1.13.1
EMR version: 6.4.0
Java version: JDK version Corretto-8.302.08.1 (provided by EMR)
Input data size: ~800 GB
Output data size: ~300 GB
"taskmanager.network.sort-shuffle.min-parallelism": 1
"taskmanager.memory.framework.off-heap.batch-shuffle.size": 256m
"taskmanager.network.sort-shuffle.min-buffers": 2048
"taskmanager.network.blocking-shuffle.compression.enabled": true
"taskmanager.memory.framework.off-heap.size": 512m
"taskmanager.memory.network.max": 2g
Configurations we tried:
#1
master: r6g.xlarge
core: r6g.xlarge (per/hour: $0.2; CPU: 4; RAM: 32 GiB; Disk: EBS 128 GB, network: 1.25 Gigabit baseline with burst up to 10 Gigabit)
min_scale: 2
max_scale: 25
expected: finishes within 24 hours
actual: works with sort-based shuffling enabled, but very slowly (~36h). This instance type has baseline & burst network performance; once burst credits are exhausted it degrades to the ~1.25 Gigabit baseline, which slows down I/O. With hash-based shuffling it fails on KeyBy -> Reduce with "Connection reset by peer": a Task Manager fails -> the job fails -> the Job Manager is not able to restart it.
#2
master: m5.xlarge
core: r6g.12xlarge (per/hour: $2.4; CPU: 48; RAM: 384 GiB; Disk: EBS 1.5 TB, network: 20 Gigabit)
min_scale: 1
max_scale: 4
expected: finishes within 24 hours, as there is much higher network bandwidth
actual: does not work. With sort-based shuffling it fails in the writing phase with the exception "Failed to transfer file from TaskExecutor". With hash-based shuffling it fails at the same stage with "Connection reset by peer".

AutoML: out of memory on small training file

I am attempting to run H2OAutoML on a 2.7MB training CSV on a system with 4GB RAM using the python API and it is running out of memory.
The error messages I am encountering are either:
h2o_ubuntu_started_from_python.out:
02-17 17:57:25.063 127.0.0.1:54321 27097 FJ-3-15 INFO: Stopping XGBoost training because of timeout
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 247463936 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/h20.ai/h2o-3.28.0.2/hs_err_pid27097.log
or
03:37:07.509: XRT_1_AutoML_20200217_030816 [DRF XRT (Extremely Randomized Trees)] failed: java.lang.OutOfMemoryError: Java heap space
in the Python output, depending on the exact crash instance I look at.
My init is:
h2o.init(max_mem_size='3G',min_mem_size='2G',jvm_custom_args=["-Xmx3g"])
Though I have tried with:
h2o.init()
My H2OAutoML call is:
aml = H2OAutoML(nfolds=5, max_models=20, max_runtime_secs_per_model=600, seed=1, project_name=project_name)
aml.train(x=x, y=y, training_frame=train, validation_frame=test)
These are the server stats:
H2O cluster uptime: 02 secs
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.2
H2O cluster version age: 27 days
H2O cluster name: H2O_from_python_ubuntu_htq5aj
H2O cluster total nodes: 1
H2O cluster free memory: 3 Gb
H2O cluster total cores: 2
H2O cluster allowed cores: 2
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.6.9 final
Does this sound right? Am I not able to run 20 models?
I can run this just fine setting the max_models=10. This takes about 60 min.
Are there guidelines for the amount of RAM needed for a given max_models and filesize?
Connect to the Flow interface, running at 127.0.0.1:54321.
There is a section there where you can view the remaining memory, and you can also see what models and data frames are being created. You have max_runtime_secs_per_model set to 600, and you say 10 models take an hour, so if you check in every 5-10 minutes you can get an idea of how much memory each model is taking up.
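If keeping Flow open is awkward, here is a small sketch of checking roughly the same things from the Python client (these calls exist in recent h2o versions; the exact output format varies):
import h2o

h2o.cluster().show_status()   # free memory, node count, version, health
print(h2o.ls())               # keys (frames, models) currently held in the cluster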
Your h2o.init() response looks fine. The guideline is to have 3-4 times the dataset size free; if your data is only 2.7MB, then this should not be a concern. Though if you have a lot of categorical columns, especially ones with many levels, they can take up more memory than you expect.
The memory used by a model can vary quite a lot, depending on the parameters chosen. Again, it is best to look on Flow, to see what parameters AutoML is choosing for you.
If it is simply the case that 10 models will fit in memory, and 20 models won't, and you don't want to take manual control of the parameters, then you could do batches of 10 models, and save after each hour. (Choose a different seed for each run.)
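A minimal sketch of that batching idea, assuming x, y and train are defined as in the question (the seeds, project names and save path below are only illustrative):
import h2o
from h2o.automl import H2OAutoML

for batch, seed in enumerate([1, 2], start=1):
    aml = H2OAutoML(nfolds=5, max_models=10,
                    max_runtime_secs_per_model=600, seed=seed,
                    project_name="my_project_batch_%d" % batch)
    aml.train(x=x, y=y, training_frame=train)
    # keep the best model of this batch on disk before the next run starts
    h2o.save_model(aml.leader, path="./models_batch_%d" % batch)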

What are the minimum resources to deploy an Elasticsearch cluster? 1 master - 2 nodes

I have deployed an Elasticsearch cluster on 4 CentOS 7 nodes, but the cluster status is red. I attached the error below:
Error: [search_phase_execution_exception] all shards failed
at respond (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:111:161556)
at checkRespForFailure (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:111:160796)
at http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:105:285566
at processQueue (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:58:132456)
at http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:58:133349
at Scope.$digest (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:58:144239)
at Scope.$apply (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:58:147018)
at done (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:58:100026)
at completeRequest (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:58:104697)
at XMLHttpRequest.xhr.onload (http://172.18.13.66:5601/bundles/vendors.bundle.js?v=16602:58:105435
I have read in other posts that this error means I do not have sufficient disk or RAM available. What are the minimum resources needed to solve this?

Elasticsearch DB giving Client request timeout [Status Code 504, error : Gateway Time-Out]

I am trying to send requests to an Elasticsearch DB. Most of the time I get this error:
{
"statusCode": 504,
"error": "Gateway Time-out",
"message": "Client request timeout"
}
Currently I have 24 GB of data in the Elasticsearch DB.
System configuration:
8 cores, 4 GB RAM, OS Ubuntu.
I have only one node in the cluster.
I am unable to find out why I am getting this timeout issue so frequently.
Is it because the size of the data I have?
"Is it because the size of the data I have?" No, I would not say so. We have about 100 GB in our one-node cluster and it works fine.
As for troubleshooting your problem, it is really hard to say anything, as you have not given much info.
Is the 24 GB in one index or more?
What kind of queries are you using?
What is your heap size?
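A quick way to check the index layout and heap size yourself is via the standard _cat APIs; a sketch in Python, assuming the node is reachable at localhost:9200 without authentication:
import requests

base = "http://localhost:9200"

# How many indices hold the 24 GB, and how large is each one?
print(requests.get(base + "/_cat/indices?v&h=index,docs.count,store.size").text)

# Heap usage per node; the default heap is usually much smaller than the machine's 4 GB of RAM
print(requests.get(base + "/_cat/nodes?v&h=name,heap.current,heap.max,heap.percent").text)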

Redis sync fails. Redis copy keys and values works

I have two redis instances both running on the same machine on win64. The version is the one from https://github.com/MSOpenTech/redis with no amendments and the binaries are running as per download from github (ie version 2.6.12).
I would like to create a slave and sync it to the master. I am doing this on the same machine to ensure it works before creating a slave on a WAN located machine which will take around an hour to transfer the data that exists in the primary.
However, I get the following error:
[4100] 15 May 18:54:04.620 * Connecting to MASTER...
[4100] 15 May 18:54:04.620 * MASTER <-> SLAVE sync started
[4100] 15 May 18:54:04.620 * Non blocking connect for SYNC fired the event.
[4100] 15 May 18:54:04.620 * Master replied to PING, replication can continue...
[4100] 15 May 18:54:28.364 * MASTER <-> SLAVE sync: receiving 2147483647 bytes from master
[4100] 15 May 18:55:05.772 * MASTER <-> SLAVE sync: Loading DB in memory
[4100] 15 May 18:55:14.508 # Short read or OOM loading DB. Unrecoverable error, aborting now.
The only way I can sync up is via a mini script, something along the lines of:
import orm.model

if __name__ == "__main__":
    src = orm.model.caching.Redis(**{"host": "source_host", "port": 6379})
    dest = orm.model.caching.Redis(**{"host": "source_host", "port": 7777})
    ks = src.handle.keys()
    for i, k in enumerate(ks):
        if i % 1000 == 0:
            print i, "%2.1f %%" % ((i * 100.0) / len(ks))   # progress every 1000 keys
        dest.handle.set(k, src.handle.get(k))               # copy the value key by key
where orm.model.caching.* are my middleware cache implementation bits (which for redis is just creating a self.handle instance variable).
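As an aside, if I end up keeping the key-by-key copy, a variant using redis-py's DUMP/RESTORE would also preserve non-string types (hashes, lists, sets), which a plain GET/SET loop only handles for string values. A rough sketch, with hosts and ports assumed to match the setup above:
import redis

src = redis.StrictRedis(host="localhost", port=6379)
dest = redis.StrictRedis(host="localhost", port=7777)

for key in src.keys():              # acceptable here since the instances are otherwise idle
    payload = src.dump(key)         # DUMP serialises the value regardless of its type
    if payload is not None:
        dest.restore(key, 0, payload)   # ttl=0 stores the copy without an expiry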
Firstly, I am very suspicious of the number of receiving bytes, as 2147483647 is 2^31-1 (the maximum signed 32-bit integer) .. a very strange coincidence. Secondly, OOM can mean out of memory, yet I can fire up a 2nd instance and sync it via the script, but doing this via redis --slaveof fails with what appears to be out of memory. Surely this can't be right?
redis-check-dump does not run as this is the windows implementation.
Unfortunately there is sensitive data in the keys I am syncing so I can't offer it to anybody to investigate. Sorry about that.
I am definitely running the 64 bit version as it states this upon startup in the header.
I don't mind syncing via my mini script and then just enabling slave mode, but I don't think that is possible as the moment slaveof is executed, it drops all known data and resyncs from scratch (and then fails).
Any ideas ??
I have also seen this error earlier, but the latest bits from 2.8.4 seem to have resolved it: https://github.com/MSOpenTech/redis/tree/2.8.4_msopen
