AutoML: out of memory on small training file - h2o

I am attempting to run H2OAutoML on a 2.7 MB training CSV, on a system with 4 GB of RAM, using the Python API, and it is running out of memory.
The error messages I am encountering are either:
h2o_ubuntu_started_from_python.out:
02-17 17:57:25.063 127.0.0.1:54321 27097 FJ-3-15 INFO: Stopping XGBoost training because of timeout
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 247463936 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/h20.ai/h2o-3.28.0.2/hs_err_pid27097.log
or
03:37:07.509: XRT_1_AutoML_20200217_030816 [DRF XRT (Extremely Randomized Trees)] failed: java.lang.OutOfMemoryError: Java heap space
in the Python output, depending on the exact crash instance I look at.
My init is:
h2o.init(max_mem_size='3G',min_mem_size='2G',jvm_custom_args=["-Xmx3g"])
Though I have tried with:
h2o.init()
My H2OAutoML call is:
aml = H2OAutoML(nfolds=5, max_models=20, max_runtime_secs_per_model=600, seed=1, project_name=project_name)
aml.train(x=x, y=y, training_frame=train, validation_frame=test)
These are the server stats:
H2O cluster uptime: 02 secs
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.2
H2O cluster version age: 27 days
H2O cluster name: H2O_from_python_ubuntu_htq5aj
H2O cluster total nodes: 1
H2O cluster free memory: 3 Gb
H2O cluster total cores: 2
H2O cluster allowed cores: 2
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.6.9 final
Does this sound right? Am I not able to run 20 models?
I can run this just fine with max_models=10; that takes about 60 minutes.
Are there guidelines for the amount of RAM needed for a given max_models and filesize?

Connect to the Flow interface, running at 127.0.0.1:54321.
There is a section there where you can view the remaining memory, and you can also see which models and data frames are being created. You have max_runtime_secs_per_model set to 600, and you say 10 models take about an hour, so if you check in every 5-10 minutes you can get an idea of how much memory each model is taking up.
Your h2o.init() response looks fine. The guideline is to have 3-4 times the dataset size free. If your data is only 2.7 MB, then this should not be a concern. However, if you have a lot of categorical columns, especially ones with many levels, they can take up more memory than you expect; a quick way to check is sketched below.
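As an illustration of that point (my sketch, not part of the original answer), a minimal h2o-py snippet that lists the categorical ("enum") columns and how many levels each has; the CSV path is hypothetical and train stands in for your training frame:
import h2o

h2o.init()
train = h2o.import_file("train.csv")  # illustrative path; use your own frame

# print every categorical column and how many distinct levels it has
for col, dtype in train.types.items():
    if dtype == "enum":
        print(col, train[col].unique().nrow, "levels")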
The memory used by a model can vary quite a lot, depending on the parameters chosen. Again, it is best to look on Flow, to see what parameters AutoML is choosing for you.
If it is simply the case that 10 models will fit in memory, and 20 models won't, and you don't want to take manual control of the parameters, then you could do batches of 10 models, and save after each hour. (Choose a different seed for each run.)
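To make the batching idea concrete, here is a rough sketch, assuming the x, y, train and test objects from the question; the project names and save paths are illustrative, not anything AutoML requires:
import h2o
from h2o.automl import H2OAutoML

for seed in (1, 2):  # a different seed for each batch
    aml = H2OAutoML(nfolds=5, max_models=10,
                    max_runtime_secs_per_model=600,
                    seed=seed,
                    project_name="batch_%d" % seed)  # separate leaderboard per batch
    aml.train(x=x, y=y, training_frame=train, validation_frame=test)
    # persist this batch's models before starting the next one
    for model_id in aml.leaderboard.as_data_frame()["model_id"]:
        h2o.save_model(h2o.get_model(model_id), path="./models_batch_%d" % seed)
If memory is tight, you can also shut the cluster down and re-initialise it between batches, reloading the frames each time.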

Related

Please suggest hardware configuration for network-intensive Flink job (Async I/O)

TL;DR: I am running a Flink Streaming job in mode=Batch on EMR. I have tried several EMR cluster configurations, but none of them works as required, and some do not work at all. The workflow is very network-intensive, which causes the main problems.
Question: What EMR cluster configuration (ec2 instance types) would you recommend for this use-case?
--
The job has following stages:
Read from MySQL
KeyBy user_id
Reduce by user_id
Async I/O enriching from Redis
Async I/O enriching from other Redis
Async I/O enriching from REST #1
Async I/O enriching from REST #2
Async I/O enriching from REST #2
Write to Elasticsearch
Other info:
Flink version: 1.13.1
EMR version: 6.4.0
Java version: JDK version Corretto-8.302.08.1 (provided by EMR)
Input data size: ~800 GB
Output data size: ~300 GB
"taskmanager.network.sort-shuffle.min-parallelism": 1
"taskmanager.memory.framework.off-heap.batch-shuffle.size": 256m
"taskmanager.network.sort-shuffle.min-buffers": 2048
"taskmanager.network.blocking-shuffle.compression.enabled": true
"taskmanager.memory.framework.off-heap.size": 512m
"taskmanager.memory.network.max": 2g
Configurations we tried:
#1
master: r6g.xlarge
core: r6g.xlarge (per/hour: $0.2; CPU: 4; RAM: 32 GiB; Disk: EBS 128 GB, network: 1.25 Gigabit baseline with burst up to 10 Gigabit)
min_scale: 2
max_scale: 25
expected: finishes within 24 hours
actual: works with sort-based shuffling enabled, but very slowly (~36h). This instance type has baseline and burst network performance; once the burst credits are exhausted it degrades to the baseline, which slows down I/O. With hash-based shuffling it fails on KeyBy -> Reduce with "Connection reset by peer": the Task Manager fails -> the job fails -> the Job Manager is not able to restart it.
#2
master: m5.xlarge
core: r6g.12xlarge (per/hour: $2.4; CPU: 48; RAM: 384 GiB; Disk: EBS 1.5 TB, network: 20 Gigabit)
min_scale: 1
max_scale: 4
expected: finishes within 24 hours, as there is much higher network bandwidth
actual: does not work. With sort-based shuffling it fails in the writing phase with the exception "Failed to transfer file from TaskExecutor". With hash-based shuffling it fails at the same stage with "Connection reset by peer".

H2O H2OServerError: HTTP 500 Server Error when training model

Trying to train a DRF classifier in h2o (version 3.20.0.5), I get the error "H2OServerError: HTTP 500 Server Error" with no further explanation.
---------------------------------------------------------------------------
H2OServerError Traceback (most recent call last)
<ipython-input-44-f52d1cb4b77a> in <module>()
4 training_frame=train_u, validation_frame=val_u,
5 weights_column='weight',
----> 6 max_runtime_secs=max_train_time_hrs*60*60)
7
8
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/estimators/estimator_base.pyc in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
224 rest_ver = parms.pop("_rest_version") if "_rest_version" in parms else 3
225
--> 226 model_builder_json = h2o.api("POST /%d/ModelBuilders/%s" % (rest_ver, self.algo), data=parms)
227 model = H2OJob(model_builder_json, job_type=(self.algo + " Model Build"))
228
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
101 # type checks are performed in H2OConnection class
102 _check_connection()
--> 103 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
104
105
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
400 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
401 self._log_end_transaction(start_time, resp)
--> 402 return self._process_response(resp, save_to)
403
404 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/backend/connection.pyc in _process_response(response, save_to)
728 # Note that it is possible to receive valid H2OErrorV3 object in this case, however it merely means the server
729 # did not provide the correct status code.
--> 730 raise H2OServerError("HTTP %d %s:\n%r" % (status_code, response.reason, data))
731
732
H2OServerError: HTTP 500 Server Error:
Server error java.lang.NullPointerException:
Error: Caught exception: java.lang.NullPointerException
Request: None
The code snippet in question is shown below:
max_train_time_hrs = 8
drf_proc.train(
x=train_features, y=train_response,
training_frame=train_u, validation_frame=val_u,
weights_column='weight',
max_runtime_secs=max_train_time_hrs*60*60)
The output from running the h2o.init() command looks like
Checking whether there is an H2O instance running at http://172.18.4.62:54321. connected.
Warning: Your H2O cluster version is too old (7 months and 24 days)! Please download and install the latest version from http://h2o.ai/download/
H2O cluster uptime: 06 secs
H2O cluster timezone: Pacific/Honolulu
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.5
H2O cluster version age: 7 months and 24 days !!!
H2O cluster name: H2O_88021
H2O cluster total nodes: 4
H2O cluster free memory: 15.34 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://172.18.4.62:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: AutoML, XGBoost, Algos, Core V3, Core V4
Python version: 2.7.12 final
While I realize there is a warning that the version of h2o I am using is "too old", the h2o Python package I am using and the cluster I am connecting to still match. The cluster cannot be upgraded, because other h2o applications access it and expect this particular version (all of those applications appear to run on the cluster without problems). Meanwhile, no web browser is able to connect to the H2O connection url.
Any ideas about what could be going on here or debugging steps that could be looked into?
15 GB of memory might not be enough for a training process you expect to last 8 hours. (Aside: I'd recommend using early stopping, rather than, or as well as, max_runtime_secs; a sketch follows below.)
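A minimal sketch of that early-stopping aside, reusing the estimator name and frames from the question; the parameter values are illustrative only:
from h2o.estimators import H2ORandomForestEstimator

drf_proc = H2ORandomForestEstimator(
    ntrees=200,                 # upper bound; early stopping usually cuts this short
    stopping_rounds=3,          # stop after 3 scoring rounds without improvement
    stopping_metric="logloss",  # choose a metric that matches your problem
    stopping_tolerance=1e-3,
)
drf_proc.train(x=train_features, y=train_response,
               training_frame=train_u, validation_frame=val_u,
               weights_column='weight',
               max_runtime_secs=max_train_time_hrs * 60 * 60)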
As a debugging step, I would recommend watching in the Flow interface (point your browser to port 54321 - see the connection URL in your h2o.init() output). Especially watch how memory usage is rising over time.
(Sometimes a "500" error just means it has gone unstable, and lack of memory is a common trigger.)
If you are getting the error immediately, that is less likely to be the problem (unless you have a huge dataset).
In that case I'd try to narrow down whether a particular column or data row could be causing the problem, e.g.:
Experiment 1: first half of columns in train_features
Experiment 2: second half of columns in train_features
Experiment 3: first half of rows in train_u
Experiment 4: second half of rows in train_u
Experiment 5/6 (if still no luck): the same for valid_u
If one experiment in a pair crashes but the other doesn't, repeat the split on the crashing half (a rough sketch of these splits follows).
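A rough sketch of those split experiments, assuming the train_features list and the train_u/val_u frames from the question; each train() call here simply replaces the previous model, since all we care about is which run crashes:
# Experiments 1 and 2: first and second half of the predictor columns
half = len(train_features) // 2
for cols in (train_features[:half], train_features[half:]):
    drf_proc.train(x=cols, y=train_response,
                   training_frame=train_u, validation_frame=val_u,
                   weights_column='weight')

# Experiments 3 and 4: first and second half of the training rows
n = train_u.nrow
for frame in (train_u[0:n // 2, :], train_u[n // 2:n, :]):
    drf_proc.train(x=train_features, y=train_response,
                   training_frame=frame, validation_frame=val_u,
                   weights_column='weight')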

Spark job just hangs with large data

I am trying to query data from S3 (15 days of data). Querying each day separately works fine, and querying 14 days together works fine as well. But when I query all 15 days, the job keeps running forever (hangs) and the task count does not update.
My settings :
I am using a 51-node r3.4xlarge cluster with dynamic allocation and maximize resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
.map( start.plusDays )
.map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same query with 14 days, I get a count of 197337380, and running the 15th day separately gives 27676788. But when I query all 15 days together, the job hangs.
Update :
The job works fine with:
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  val tempDF = sqlSession.read.schema(schema).json(n)
  df = df.union(tempDF)  // append each day's data
}
df.count
But can someone explain why it works now but not before?
UPDATE: After setting mapreduce.input.fileinputformat.split.minsize to 256 GB, it works fine now.
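For reference, a hedged PySpark sketch (the question's code is Scala) of how such a Hadoop input-format property is typically passed via Spark's spark.hadoop.* prefix; the app name is illustrative and the value simply mirrors the update above:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-15-day-count")  # illustrative name
         # spark.hadoop.* properties are copied into the Hadoop Configuration
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
                 str(256 * 1024 ** 3))  # value is in bytes
         .getOrCreate())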
Dynamic allocation and maximize resource allocation are different settings; one is disabled when the other is active. With maximize resource allocation on EMR, one executor per node is launched, and it allocates all of the node's cores and memory to that executor.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I am not sure it is even required. However, follow this rule of thumb to begin with, and you will get the hang of how to tune these configurations.
Cluster memory - minimum of 2X the data you are dealing with.
Now assuming 51 nodes is what you require, try below:
r3.4xlarge has 16 vCPUs - leave one for the OS and other processes, and put the remaining 15 to use.
Set your number of executors to 150 - this will allocate 3 executors per node.
Set number of cores per executor to 5 (3 executors per node)
Set your executor memory to roughly total host memory/3 = 35G
You need to control the parallelism (default partitions); set this to the total number of cores you have, ~800.
Adjust shuffle partitions - make this twice the number of cores: 1600.
Above configurations have been working like a charm for me. You can monitor the resource utilization on Spark UI.
Also, in your YARN config file /etc/hadoop/conf/capacity-scheduler.xml, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, which will allow Spark to really go full throttle with those CPUs. Restart the YARN service after the change. (A sketch of the settings above follows.)
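A rough sketch of that sizing advice expressed as Spark properties, shown here through PySpark purely for illustration (the question's job is Scala, and the same values can be passed as spark-submit --conf flags); fixing the executor count assumes dynamic allocation is turned off:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "false")
         .config("spark.executor.instances", "150")       # ~3 executors per node
         .config("spark.executor.cores", "5")             # 3 x 5 = 15 cores, 1 left for the OS
         .config("spark.executor.memory", "35g")          # ~ host memory / 3
         .config("spark.default.parallelism", "800")      # ~ total cores
         .config("spark.sql.shuffle.partitions", "1600")  # ~ 2x total cores
         .getOrCreate())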
You should be increasing the executor memory and the number of executors. If the data is huge, try increasing the driver memory.
My suggestion is not to use dynamic resource allocation: let the job run and see whether it still hangs. (Note that a Spark job can consume the entire cluster's resources and starve other applications, so try this approach when no other jobs are running.) If it doesn't hang, that tells you the resource allocation is the thing to play with: start hardcoding the resources and keep increasing them until you find the best allocation you can use.
Below links can help you understand the resource allocation and optimization of resources.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed
tasks = 407667]
My current setup:
Two nodes. One is the master (data: true, master: true) while the other is data only (data: true, master: false). They are both EC2 I2.4XL (16 cores, 122 GB RAM, 320 GB instance storage). 2 shards, 1 replica.
Those two nodes are fed by our aggregation server, which has 20 separate workers. Each worker makes bulk indexing requests to our ES cluster with 50 items per request. Each item is between 1,000 and 4,000 characters.
Current server setup: 4x client facing servers -> aggregation server -> ElasticSearch.
Now, the issue is that this error only started occurring when we introduced the second node. Before, with one machine, we got a consistent indexing throughput of 20k requests per second. With two machines, once it hits the 10k mark (~20% CPU usage) we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator that produces random documents to be indexed; these documents are generally the same size but have random parameters. We use it for stress testing and to check stability. The mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. Interestingly, with the mock data we are able to index around 40-45k items per second (~80% CPU usage) without getting this error, so it is puzzling why we see it at all. Has anyone seen this error or know what could be causing it?

Hector is unable to read Cassandra data when nodes reboot or terminate

We are trying to run a Cassandra cluster on AWS/EC2 within a standard VPC footprint (Cassandra nodes on private subnets). Because this is AWS, there is always a chance that an EC2 instance will terminate or reboot with no warning. I have been simulating this case on a test cluster, and I am seeing things that I thought a cluster was supposed to prevent. Specifically, if a node reboots, some data goes temporarily missing until the node completes its reboot. If a node terminates, it appears that some data is lost forever.
For my test I just did a bunch of writes (using QUORUM consistency) to some keyspaces, then interrogated the contents of those keyspaces as I brought down nodes (either through reboot or terminate). I'm just using cqlsh SELECT at consistency level ONE to do the keyspace/column-family interrogation of the cluster.
Note that even though I am performing no writes to the cluster while doing the SELECTs, rows temporarily disappear during a reboot and can go missing permanently during a termination.
I thought Netflix Priam might be able to help, but sadly it doesn't work in a VPC the last time I checked.
Also, because we are using ephemeral storage instances there is no equivalent of 'shutdown' so I cannot run any scripts during reboot/terminate of an instance to perform a nodetool decommission or nodetool removenode before an instance goes away. Terminate is the equivalent of kicking the plug out of the wall.
Since I am using a replication factor of 3 and QUORUM writes, all data should be written to at least 2 nodes. So, unless I am totally misunderstanding things (which is possible), losing one node should not mean that I lose any data for any period of time when I am reading at consistency level ONE.
Questions
Why wouldn't a 6 node cluster with a replication factor of 3 work?
Do I need to run something like a 12 node cluster with a replication factor of 7? Don't bother telling me that will fix the problem, because it doesn't.
Do I need to use consistency level of ALL on the writes then use ONE or QUORUM on the reads?
Is there something not quite right with virtual nodes? unlikely
Are there nodetool commands besides removenode that I need to run when a node terminates to recover missing data? As mentioned earlier, when a reboot occurs, eventually the missing data reappears.
Is there some cassandra savant who can look at my cassandra.yaml file below and send me on the path to salvation?
More Info added 7/19
I don't think QUORUM vs ONE vs ALL is the issue. The test I set up performs no writes to the keyspaces after the initial population of the column families, so the data has had plenty of time (hours) to make it to all the nodes as required by the replication factor. Plus, the test dataset is REALLY small (2 column families with about 300-1000 values each). In other words, the data is completely static.
The behavior I am seeing seems to be tied to the fact that the EC2 instance is no longer on the network. The reason I say this is that if I log on to a node and just do a cassandra stop, I see no loss of data. But if I reboot or terminate, I start seeing the following in a stack trace.
CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
CassandraHostRetryService - Downed Host retry shutdown complete
CassandraHostRetryService - Downed Host retry shutdown hook called
Caused by: TimedOutException()
Caused by: TimedOutException()
So it seems to be more of a network communication issue, in that the cluster expects a node, for example 10.0.12.74, to be on the network after it has joined the cluster. If that IP is suddenly unreachable, due to either a reboot or a termination, the timeouts start happening.
When I do a nodetool status under all three scenarios (cassandra stop, reboot, or terminate), the status of the node shows up as DN, which is what you would expect. Eventually nodetool status returns to UN after a cassandra start or a reboot, but a terminated node obviously stays DN.
Details of my Configuration
Here are some details of my configuration (cassandra.yaml is at the bottom of this posting):
Nodes are running in private subnets of a VPC.
Cassandra 1.2.5 with num_tokens: 256 (virtual nodes) and initial_token: (blank). I am really hoping this works, because all of our nodes run in autoscaling groups, so the thought that redistribution could be handled dynamically is appealing.
EC2 m1.large one seed and one non-seed node in each availability zone. (so 6 total nodes in the cluster).
Ephemeral storage, not EBS.
Ec2Snitch with NetworkTopologyStrategy and all keyspaces have replication factor of 3.
Non-seed nodes are auto_bootstrapped; seed nodes are not.
sample cassandra.yaml file
cluster_name: 'TestCluster'
num_tokens: 256
initial_token:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider
saved_caches_directory: /opt/company/dbserver/caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "SEED_IP_LIST"
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 8
memtable_flush_queue_size: 4
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: LISTEN_ADDRESS
start_native_transport: false
native_transport_port: 9042
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
auto_bootstrap: AUTO_BOOTSTRAP
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: Ec2Snitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
server_encryption_options:
  internode_encryption: none
  keystore: conf/.keystore
  keystore_password: cassandra
  truststore: conf/.truststore
  truststore_password: cassandra
client_encryption_options:
  enabled: false
  keystore: conf/.keystore
  keystore_password: cassandra
internode_compression: all
I think http://www.datastax.com/documentation/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html will clear up a lot of this. In particular, QUORUM/ONE is not guaranteed to return the most recent data. QUORUM/QUORUM is. So is ALL/ONE, but that will be intolerant to failure on write.
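As a quick worked check of that rule (my addition, not from the original answer): a read is guaranteed to see the latest write only when the replicas read plus the replicas written exceed the replication factor, i.e. R + W > RF.
rf = 3
quorum = rf // 2 + 1           # QUORUM = 2 when RF = 3

print(quorum + 1 > rf)         # QUORUM write + ONE read: 2 + 1 = 3, not > 3 -> False (not guaranteed)
print(quorum + quorum > rf)    # QUORUM write + QUORUM read: 2 + 2 = 4 > 3 -> True (guaranteed)
print(rf + 1 > rf)             # ALL write + ONE read: 3 + 1 = 4 > 3 -> True, but a single down node fails the write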
Edit to go with the new information:
CassandraHostRetryService is part of Hector. I assumed you were testing with cqlsh like a sane person would. Lessons:
Use cqlsh for testing
Use the DataStax Java Driver for building your application, which is faster, easier to use, and has more insight into the cluster state than Hector thanks to the native protocol it's built on.
