H2O Spark streaming 2.1 distribution - spark-streaming

I have been intermittently getting a DistributedException when running a sample IRIS model in Sparkling Water.
Sparkling Water: 2.1
Spark Streaming Kafka: 0.10.0.0
Running locally using spark-submit, master only
DistributedException from xxx:54321, caused by java.lang.NullPointerException
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:390)
at water.MRTask.doAll(MRTask.java:396)
at hex.Model.predictScoreImpl(Model.java:1103)
at hex.Model.score(Model.java:964)
at hex.Model.score(Model.java:932)
....
Caused by: java.lang.NullPointerException
at water.fvec.Vec.chunkForChunkIdx(Vec.java:1014)
at water.fvec.CategoricalWrappedVec.chunkForChunkIdx(CategoricalWrappedVec.java:49)
at water.MRTask.compute2(MRTask.java:618)
at water.MRTask.compute2(MRTask.java:591)
at water.MRTask.compute2(MRTask.java:591)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1223)
at hex.Model$BigScore$Icer.compute1(Model$BigScore$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1219)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

So the problem is that the H2O model is not seeing the data, which causes the NPE. The most likely cause is that the H2O frame is deleted either at the time of prediction or just before the prediction call.
We would like to know how you process the mini-batch data, i.e. how each mini batch is transformed into an H2O frame.
It would also help if you explained how the H2O model is being called to make the prediction.
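For reference, here is a minimal sketch of how a streaming mini batch is typically converted to an H2OFrame and scored in PySparkling; the model variable and the surrounding setup are assumptions, and in Sparkling Water 2.1 the conversion methods are named as_h2o_frame/as_spark_frame (newer releases use asH2OFrame/asSparkFrame):
from pysparkling import H2OContext

# Assumptions: `spark` is the active SparkSession and `model` is an H2O model
# that was trained or loaded earlier (e.g. via h2o.load_model).
hc = H2OContext.getOrCreate(spark)

def score_batch(batch_df):
    # Convert the Spark mini batch into an H2OFrame and keep the reference
    # alive for the whole predict() call, so the frame cannot be removed
    # from the H2O cluster while scoring is still in flight.
    h2o_frame = hc.as_h2o_frame(batch_df)
    predictions = model.predict(h2o_frame)
    return hc.as_spark_frame(predictions)
Knowing where in your pipeline the frame created this way is removed, relative to the predict() call, would help pin down the NPE.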

Related

AutoML: out of memory on small training file

I am attempting to run H2OAutoML on a 2.7 MB training CSV on a system with 4 GB of RAM using the Python API, and it is running out of memory.
The error messages I am encountering are either:
h2o_ubuntu_started_from_python.out:
02-17 17:57:25.063 127.0.0.1:54321 27097 FJ-3-15 INFO: Stopping XGBoost training because of timeout
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 247463936 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/h20.ai/h2o-3.28.0.2/hs_err_pid27097.log
or
03:37:07.509: XRT_1_AutoML_20200217_030816 [DRF XRT (Extremely Randomized Trees)] failed: java.lang.OutOfMemoryError: Java heap space
in the Python output, depending on the exact crash instance I look at.
My init is:
h2o.init(max_mem_size='3G', min_mem_size='2G', jvm_custom_args=["-Xmx3g"])
Though I have tried with:
h2o.init()
My H2OAutoML call is:
aml = H2OAutoML(nfolds=5, max_models=20, max_runtime_secs_per_model=600, seed=1, project_name=project_name)
aml.train(x=x, y=y, training_frame=train, validation_frame=test)
These are the server stats:
H2O cluster uptime: 02 secs
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.2
H2O cluster version age: 27 days
H2O cluster name: H2O_from_python_ubuntu_htq5aj
H2O cluster total nodes: 1
H2O cluster free memory: 3 Gb
H2O cluster total cores: 2
H2O cluster allowed cores: 2
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.6.9 final
Does this sound right? Am I not able to run 20 models?
I can run this just fine setting the max_models=10. This takes about 60 min.
Are there guidelines for the amount of RAM needed for a given max_models and filesize?
Connect to the Flow interface, running at 127.0.0.1:54321.
There is a section there where you can view the remaining memory, and you can also see which models and data frames are being created. You have max_runtime_secs_per_model set to 600, and you say 10 models take about an hour, so if you check in every 5-10 minutes you can get an idea of how much memory each model is taking up.
Your h2o.init() response looks fine. The guideline is to have 3-4 times the dataset size free. If your data is only 2.7 MB, that should not be a concern, though if you have a lot of categorical columns, especially ones with many levels, they can take up more memory than you expect.
The memory used by a model can vary quite a lot depending on the parameters chosen. Again, it is best to look in Flow to see what parameters AutoML is choosing for you.
If it is simply the case that 10 models will fit in memory and 20 models won't, and you don't want to take manual control of the parameters, then you could do batches of 10 models and save after each batch. (Choose a different seed for each run.)
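A minimal sketch of that batching idea, assuming the same x, y, train, and test objects from the question (the project names and save path are illustrative):
import h2o
from h2o.automl import H2OAutoML

for seed in (1, 2):  # two batches of 10 models each, with a different seed per run
    aml = H2OAutoML(nfolds=5, max_models=10, max_runtime_secs_per_model=600,
                    seed=seed, project_name="automl_batch_%d" % seed)
    aml.train(x=x, y=y, training_frame=train, validation_frame=test)
    # Persist this batch's models to disk before starting the next batch.
    for model_id in aml.leaderboard.as_data_frame()["model_id"]:
        h2o.save_model(h2o.get_model(model_id), path="./automl_models", force=True)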

java.lang.StackoverflowError when writing dataframe into Postgresql using JDBC

I'm trying to write the result of multiple operations into an AWS Aurora PostgreSQL cluster. All the calculations run correctly, but when I try to write the result into the database I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o12179.jdbc.
: java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
I already tried increasing the cluster size (15 r4.2xlarge machines), changing the number of partitions for the data to 120, and setting executor and driver memory to 4 GB each, and I'm still seeing the same results.
The current SparkSession configuration is:
spark = pyspark.sql.SparkSession\
    .builder\
    .appName("profile")\
    .config("spark.sql.shuffle.partitions", 120)\
    .config("spark.executor.memory", "4g")\
    .config("spark.driver.memory", "4g")\
    .getOrCreate()
I don't know if it is a Spark configuration problem or a programming problem.
Finally I found the problem.
The problem was an iterative read from S3 that created a really big DAG. I changed the way I read CSV files from S3 to the following:
df = spark.read\
    .format('csv')\
    .option('header', 'true')\
    .option('delimiter', ';')\
    .option('mode', 'DROPMALFORMED')\
    .option('inferSchema', 'true')\
    .load(list_paths)
Where list_paths is a precalculated list of paths to S3 objects.
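For contrast, the kind of iterative pattern that grows the DAG looks roughly like this (an assumption about the original loop, not the actual code from the question):
# Each union() adds another layer to the logical plan; with many paths the plan
# eventually becomes deep enough to overflow the stack during analysis/optimization.
df = None
for path in list_paths:
    part = spark.read.format('csv').option('header', 'true').option('delimiter', ';').load(path)
    df = part if df is None else df.union(part)
Passing the whole list to a single load() keeps the plan flat. If the lineage has already grown too deep, df.checkpoint() (after setting spark.sparkContext.setCheckpointDir(...)) is another way to truncate it.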

writing rdd from spark to Elastic Search fails

I am trying to write a pair RDD to Elasticsearch 2.4.0 on Elastic Cloud.
I am using the elasticsearch-spark_2.10-2.4.0 plugin to write to ES.
Here is the code I am using to write to ES:
def predict_imgs(r):
    import json
    out_d = {}
    out_d["pid"] = r["pid"]
    out_d["other_stuff"] = r["other_stuff"]
    return (r["pid"], json.dumps(out_d))

res2 = res1.map(predict_imgs)

es_write_conf = {
    "es.nodes": image_es,
    # "es.port": "9243",
    "es.resource": "index/type",
    "es.nodes.wan.only": "True",
    "es.write.operation": "upsert",
    "es.mapping.id": "product_id",
    "es.nodes.discovery": "false",
    "es.net.http.auth.user": "username",
    "es.net.http.auth.pass": "pass",
    "es.input.json": "true",
    "es.http.timeout": "1m",
    "es.scroll.size": "10",
    "es.batch.size.bytes": "1mb",
    "es.http.retries": "1",
    "es.batch.size.entries": "5",
    "es.batch.write.refresh": "False",
    "es.batch.write.retry.count": "1",
    "es.batch.write.retry.wait": "10s"}

res2.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)
The Error I get is as follows:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 744 in stage 26.0 failed 4 times, most recent failure: Lost task 744.3 in stage 26.0 (TID 2841, 10.181.252.29): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
The interesting part is that when I take the first few elements of res2, make a new RDD out of it, and write that to ES, it works flawlessly:
x = sc.parallelize([res2.take(1)])
x.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)
I am using Elastic Cloud (the cloud offering of Elasticsearch) and Databricks (a cloud offering of Apache Spark).
Could it be that ES is not able to keep up with the throughput of Spark writing to ES?
I increased our Elastic Cloud size from 2GB RAM to 8GB RAM.
Are there any recommended configs for the es_write_conf I used above? Any other confs that you can think of?
Would updating to ES 5.0 help?
Any help is appreciated. I have been struggling with this for a few days now. Thank you.
It looks like a problem with the PySpark calculations, not necessarily the Elasticsearch saving process. Ensure your RDDs are OK by:
Performing count() on res1 (to "materialize" the results)
Performing count() on res2
If counts are OK, try with caching results before saving into ES:
res2.cache()
res2.count() # to fill the cache
res2.saveAsNewAPIHadoopFile(...
If the problem still appears, try looking at the dead executors' stderr and stdout (you can find them on the Executors tab in the Spark UI).
I also noticed the very small batch sizes in es_write_conf; try increasing them to 500 or 1000 entries to get better performance.
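A sketch of those adjustments, reusing the es_write_conf dictionary from the question (the exact values are illustrative, not tuned for your cluster):
# Raise the very conservative per-bulk-request limits set above.
es_write_conf.update({
    "es.batch.size.entries": "1000",    # was 5
    "es.batch.size.bytes": "5mb",       # was 1mb
    "es.batch.write.retry.count": "3",  # allow a few more retries per bulk request
})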

Connection Refused Exception while using Teradata Hadoop Connector

I am currently using the free version of the Teradata Hadoop connector, teradata-connector 1.3.4, to load data to Teradata. I am using internal.fastload to load the data.
Database version is 14.10
JDBC driver version is 15.0
Sometimes I get a Connection refused exception while running the job, but the issue goes away after resubmitting the load job 2-3 times. It also has nothing to do with the load on the Teradata database, as the load is pretty normal. The exception thrown is below:
15/10/29 22:52:54 INFO mapreduce.Job: Running job: job_1445506804193_290389
com.teradata.connector.common.exception.ConnectorException: Internal fast load socket server time out
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat$InternalFastloadCoordinator.beginLoading(TeradataInternalFastloadOutputFormat.java:642)
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat$InternalFastloadCoordinator.run(TeradataInternalFastloadOutputFormat.java:503)
at java.lang.Thread.run(Thread.java:745)
15/10/29 23:39:29 INFO mapreduce.Job: Job job_1445506804193_290389 running in uber mode : false
15/10/29 23:39:29 INFO mapreduce.Job: map 0% reduce 0%
15/10/29 23:40:08 INFO mapreduce.Job: Task Id : attempt_1445506804193_290389_m_000001_0, Status : FAILED
Error: com.teradata.connector.common.exception.ConnectorException: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat.getRecordWriter(TeradataInternalFastloadOutputFormat.java:301)
at com.teradata.connector.common.ConnectorOutputFormat$ConnectorFileRecordWriter.<init>(ConnectorOutputFormat.java:84)
at com.teradata.connector.common.ConnectorOutputFormat.getRecordWriter(ConnectorOutputFormat.java:33)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:624)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:744)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1591)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Any pointers in this regard will definitely help.
Thanks in advance.
Root cause: com.teradata.connector.common.exception.ConnectorException: Internal fast load socket server time out
When running an export job using the "internal.fastload" method, the following error may occur: Internal fast load socket server time out.
This error occurs because the number of map tasks currently available is less than the number of map tasks specified on the command line via the "-nummappers" parameter.
This error can occur in the following conditions:
(1) There are other map/reduce jobs running concurrently in the Hadoop cluster, so there are not enough resources to allocate the specified number of map tasks to the export job.
(2) The maximum number of map tasks in the Hadoop cluster is smaller than the existing map tasks plus the expected map tasks of the export job.
When the above error occurs, try increasing the maximum number of map tasks of the Hadoop cluster, or decreasing the number of map tasks for the export job.
There is a good troubleshooting PDF available from Teradata.
If you get any of these errors, have a look at the above PDF to get them fixed, and look at the other map-reduce properties if you need to fine-tune them.
ravindra-babu's answer is correct, since the answer is buried in the PDF docs. The support.teradata.com KB article KB0023556 also offers more details on the why.
Resolution
All mappers should run simultaneously. If they cannot all run simultaneously, try to reduce the number of mappers in the TDCH job through the -nummappers argument, then resubmit the TDCH job with the changed -nummappers value.
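For illustration, a typical TDCH export invocation with a reduced mapper count looks roughly like this (the jar name, paths, and credentials are placeholders, and the exact options depend on your TDCH version):
# Keep -nummappers at or below the number of map slots actually free in the cluster.
hadoop jar teradata-connector-1.3.4.jar com.teradata.connector.common.tool.ConnectorExportTool \
    -url jdbc:teradata://<td-host>/database=<db> \
    -username <user> -password <pass> \
    -jobtype hdfs \
    -method internal.fastload \
    -sourcepaths /path/to/source/data \
    -targettable <table> \
    -nummappers 4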
Honestly it's a very confusing error and could be displayed better.

Hbase Bulk load - Map Reduce job failing

I have a map-reduce job for HBase bulk load. The job converts data into HFiles and loads them into HBase, but after a certain map percentage the job fails. Below is the exception I am getting.
Error: java.io.FileNotFoundException: /var/mapr/local/tm4/mapred/nodeManager/spill/job_1433110149357_0005/attempt_1433110149357_0005_m_000000_0/spill83.out.index
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:198)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:800)
at org.apache.hadoop.io.SecureIOUtils.openFSDataInputStream(SecureIOUtils.java:156)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:74)
at org.apache.hadoop.mapred.MapRFsOutputBuffer.mergeParts(MapRFsOutputBuffer.java:1382)
at org.apache.hadoop.mapred.MapRFsOutputBuffer.flush(MapRFsOutputBuffer.java:1627)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:709)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
The only thing I noticed is that the job works fine for a small data set, but as the data grows it starts failing.
Let me know if anyone has faced this issue.
Thanks
This was a bug in MapR; I got a reply on the MapR forum. If someone is facing a similar issue, refer to the link below:
http://answers.mapr.com/questions/163440/hbase-bulk-load-map-reduce-job-failing-on-mapr.html

Resources