Spark - Container is running beyond physical memory limits - hadoop

I have a cluster of two worker nodes.
Worker_Node_1 - 64GB RAM
Worker_Node_2 - 32GB RAM
Background Summary:
I am trying to execute spark-submit on yarn-cluster to run Pregel on a Graph to calculate the shortest path distances from one source vertex to all other vertices and print the values on console.
Experiment:
For a small graph with 15 vertices, execution completes with application final status: SUCCEEDED.
My code works perfectly and prints the shortest distances for a 241-vertex graph with a single source vertex, but there is a problem.
Problem:
When I dig into the log file, the task completes successfully in 4 minutes 26 seconds, but the terminal keeps showing the application status as RUNNING, and after approximately 12 more minutes the execution terminates with:
Application application_1447669815913_0002 failed 2 times due to AM Container for appattempt_1447669815913_0002_000002 exited with exitCode: -104 For more detailed output, check application tracking page:http://myserver.com:8088/proxy/application_1447669815913_0002/
Then, click on links to logs of each attempt.
Diagnostics: Container [pid=47384,containerID=container_1447669815913_0002_02_000001] is running beyond physical memory limits. Current usage: 17.9 GB of 17.5 GB physical memory used; 18.7 GB of 36.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1447669815913_0002_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 47387 47384 47384 47384 (java) 100525 13746 20105633792 4682973 /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -Xmx16384m -Djava.io.tmpdir=/yarn/nm/usercache/cloudera/appcache/application_1447669815913_0002/container_1447669815913_0002_02_000001/tmp -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://myserver.com:8020/user/spark/applicationHistory -Dspark.executor.memory=14g -Dspark.shuffle.service.enabled=false -Dspark.yarn.executor.memoryOverhead=2048 -Dspark.yarn.historyServer.address=http://myserver.com:18088 -Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.shuffle.service.port=7337 -Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.authenticate=false -Dspark.app.name=com.path.PathFinder -Dspark.master=yarn-cluster -Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.path.PathFinder --jar file:/home/cloudera/Documents/Longest_Path_Data_1/Jars/ShortestPath_Loop-1.0.jar --arg /home/cloudera/workspace/Spark-Integration/LongestWorstPath/configFile --executor-memory 14336m --executor-cores 32 --num-executors 2
|- 47384 47382 47384 47384 (bash) 2 0 17379328 853 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native::/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -Xmx16384m -Djava.io.tmpdir=/yarn/nm/usercache/cloudera/appcache/application_1447669815913_0002/container_1447669815913_0002_02_000001/tmp '-Dspark.eventLog.enabled=true' '-Dspark.eventLog.dir=hdfs://myserver.com:8020/user/spark/applicationHistory' '-Dspark.executor.memory=14g' '-Dspark.shuffle.service.enabled=false' '-Dspark.yarn.executor.memoryOverhead=2048' '-Dspark.yarn.historyServer.address=http://myserver.com:18088' '-Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' '-Dspark.shuffle.service.port=7337' '-Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.authenticate=false' '-Dspark.app.name=com.path.PathFinder' '-Dspark.master=yarn-cluster' '-Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' '-Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.path.PathFinder' --jar file:/home/cloudera/Documents/Longest_Path_Data_1/Jars/ShortestPath_Loop-1.0.jar --arg '/home/cloudera/workspace/Spark-Integration/LongestWorstPath/configFile' --executor-memory 14336m --executor-cores 32 --num-executors 2 1> /var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
Things I tried:
yarn.scheduler.maximum-allocation-mb = 32GB
mapreduce.map.memory.mb = 2048 (previously it was 1024)
Tried varying --driver-memory up to 24g
Could you please shed more light on how I can configure the Resource Manager so that large graphs (> 300K vertices) can also be processed? Thanks.

Just increasing the default spark.driver.memory from 512m to 2g solved this error in my case.
You may set the memory higher if you keep hitting the same error. Then you can keep reducing it until the error reappears, so that you know the optimum driver memory for your job.

The more data you are processing, the more memory is needed by each Spark task. And if your executor is running too many tasks then it can run out of memory. When I had problems processing large amounts of data, it usually was a result of not properly balancing the number of cores per executor. Try to either reduce the number of cores or increase the executor memory.
One easy way to tell that you are having memory issues is to check the Executor tab on the Spark UI. If you see a lot of red bars indicating high garbage collection time, you are probably running out of memory in your executors.
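To make the cores-versus-memory trade-off concrete, here is a rough sketch that divides an executor's heap evenly among the tasks it can run at once. The function name is hypothetical, and the even split is a simplification: it ignores how Spark actually partitions the heap internally.

```python
def memory_per_task_mb(executor_memory_mb: int, executor_cores: int,
                       task_cpus: int = 1) -> float:
    """Rough heap share per concurrently running task (simplified model)."""
    concurrent_tasks = executor_cores // task_cpus
    return executor_memory_mb / concurrent_tasks


# Values from the question's command line:
# --executor-memory 14336m --executor-cores 32
print(memory_per_task_mb(14336, 32))  # 448.0

# Same memory with fewer cores leaves more room per task.
print(memory_per_task_mb(14336, 8))   # 1792.0
```

With 32 cores, each task in the question's setup has well under 1 GB to work with, which is consistent with the advice to reduce cores or increase executor memory.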

I solved the error in my case by increasing spark.yarn.executor.memoryOverhead, which stands for off-heap memory.
When you increase driver-memory and executor-memory, do not forget this config item.
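As a rough illustration of why the overhead matters: the container YARN polices is the heap plus the overhead, not the heap alone. The sketch below assumes a default overhead of 10% of the heap with a 384 MB floor (the defaults described in recent Spark documentation; older releases may use a different factor), and an allocation increment that depends on your scheduler configuration. The function name is hypothetical.

```python
import math


def yarn_container_mb(heap_mb, overhead_mb=None, overhead_factor=0.10,
                      min_overhead_mb=384, allocation_increment_mb=1):
    """Estimate the container size requested for a given JVM heap.

    overhead_factor, min_overhead_mb and allocation_increment_mb are
    assumptions; check the defaults for your Spark and YARN versions.
    """
    if overhead_mb is None:
        overhead_mb = max(min_overhead_mb, int(heap_mb * overhead_factor))
    requested = heap_mb + overhead_mb
    # YARN rounds requests up to a multiple of its allocation increment.
    return math.ceil(requested / allocation_increment_mb) * allocation_increment_mb


# Executor settings from the question: 14336m heap, 2048 MB explicit overhead.
print(yarn_container_mb(14336, overhead_mb=2048))          # 16384

# A 4096 MB heap with the assumed default overhead.
print(yarn_container_mb(4096))                              # 4505
print(yarn_container_mb(4096, allocation_increment_mb=1024))  # 5120
```

If the process needs more non-heap memory than the overhead allows, the container is killed even though the heap itself never filled up.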

I had a similar problem:
Key error info:
exitCode: -104
'PHYSICAL' memory limit
Application application_1577148289818_10686 failed 2 times due to AM Container for appattempt_1577148289818_10686_000002 exited with **exitCode: -104**
Failing this attempt.Diagnostics: [2019-12-26 09:13:54.392]Container [pid=18968,containerID=container_e96_1577148289818_10686_02_000001] is running 132722688B beyond the **'PHYSICAL' memory limit**. Current usage: 1.6 GB of 1.5 GB physical memory used; 4.6 GB of 3.1 GB virtual memory used. Killing container.
Increasing both spark.executor.memory and spark.executor.memoryOverhead had no effect.
Then I increased spark.driver.memory, which solved it.
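To compare what a killed container actually used against its limit, the "Current usage" part of the diagnostics can be parsed. This is a minimal sketch that assumes the message format shown in the logs above; the helper name is hypothetical.

```python
import re

# Matches e.g. "Current usage: 1.6 GB of 1.5 GB physical memory used"
_USAGE = re.compile(
    r"Current usage: (?P<used>[\d.]+) (?P<used_unit>[KMGT]B) of "
    r"(?P<limit>[\d.]+) (?P<limit_unit>[KMGT]B) physical memory used"
)
_TO_MB = {"KB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}


def physical_memory_usage_mb(diagnostics):
    """Return (used_mb, limit_mb) from a YARN diagnostics string, or None."""
    match = _USAGE.search(diagnostics)
    if match is None:
        return None
    used = float(match["used"]) * _TO_MB[match["used_unit"]]
    limit = float(match["limit"]) * _TO_MB[match["limit_unit"]]
    return used, limit


line = ("Current usage: 1.6 GB of 1.5 GB physical memory used; "
        "4.6 GB of 3.1 GB virtual memory used. Killing container.")
used_mb, limit_mb = physical_memory_usage_mb(line)
print(f"over limit by about {used_mb - limit_mb:.0f} MB")
```

In the log above the killed container is the AM container, which may explain why raising the driver memory helped where the executor settings did not.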

Spark jobs request resources from the resource manager differently from MapReduce jobs. Try tuning the number of executors and the memory/vcores allocated to each executor. See http://spark.apache.org/docs/latest/submitting-applications.html

Related

How to avoid OutOfMemoryErrors when processing big analysis reports?

Running SonarQube 5.6.6 from Jenkins on CentOS 7.3, I got the following error:
2017.09.01 19:05:16 ERROR [o.s.s.c.t.CeWorkerCallableImpl] Failed to execute task AV485bp0qXlQ-QPWWE9A
java.lang.OutOfMemoryError: Java heap space
2017.09.01 19:05:17 ERROR [o.s.s.c.t.CeWorkerCallableImpl] Executed task | project=PP::Symphony3M | type=REPORT | id=AV485bp0qXlQ-QPWWE9A | time=74089ms
sonar.ce.javaOpts is set like below:
sonar.ce.javaOpts=-Xmx60g -Xms1g -XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true
How much heap space should I give to SonarQube, when analyzing a one million LOC project? Or is there another way of avoiding Java heap space issues?
The maximum heap you can allocate depends on the free RAM on your server. The free command can help you check this. Set your Xmx value based on the free RAM.
By the way, make sure the code compiles on the server. Increasing the heap will only help if the code compiles but the scan fails.

cluster-mode SPARK refuses to run more than two jobs concurrently

My Spark cluster refuses to run more than two jobs simultaneously. One of the three will invariably stay stuck in 'ACCEPTED' state.
Hardware
4 data nodes with Spark clients, 24 GB RAM, 4 processors
Cluster Metrics show there should be enough cores
Apps Submitted 3
Apps Pending 1
Apps Running 2
Apps Completed 0
Containers Running 4
Memory Used 8GB
Memory Total 32GB
Memory Reserved 0B
VCores Used 4
VCores Total 8
VCores Reserved 0
Active Nodes 2
Decommissioned Nodes 0
Lost Nodes 0
Unhealthy Nodes 0
Rebooted Nodes 0
In the Application Manager you can see that the only way to run the third app is to kill a running one:
application_1504018580976_0002 adm com.x.app1 SPARK default 0 [date] N/A RUNNING UNDEFINED 2 2 5120 25.0 25.0
application_1500031233020_0090 adm com.x.app2 SPARK default 0 [date] N/A RUNNING UNDEFINED 2 2 3072 25.0 25.0
application_1504024737012_0001 adm com.x.app3 SPARK default 0 [date] N/A ACCEPTED UNDEFINED 0 0 0 0.0 0.0
The running apps have 2x containers and 2x allocated vcores, 25% of the queue and 25% of the cluster.
Deployment command for all 3 apps.
/usr/hdp/current/spark2-client/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-cores 1 \
--driver-memory 512m \
--num-executors 1 \
--executor-cores 1 \
--executor-memory 1G \
--class com..x.appx ../lib/foo.jar
Capacity Scheduler
yarn.scheduler.capacity.default.minimum-user-limit-percent = 100
yarn.scheduler.capacity.maximum-am-resource-percent = 0.2
yarn.scheduler.capacity.maximum-applications = 10000
yarn.scheduler.capacity.node-locality-delay = 40
yarn.scheduler.capacity.root.accessible-node-labels = *
yarn.scheduler.capacity.root.acl_administer_queue = *
yarn.scheduler.capacity.root.capacity = 100
yarn.scheduler.capacity.root.default.acl_administer_jobs = *
yarn.scheduler.capacity.root.default.acl_submit_applications = *
yarn.scheduler.capacity.root.default.capacity = 100
yarn.scheduler.capacity.root.default.maximum-capacity = 100
yarn.scheduler.capacity.root.default.state = RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor = 1
yarn.scheduler.capacity.root.queues = default
Your setting:
yarn.scheduler.capacity.maximum-am-resource-percent = 0.2
implies:
total vcores (8) x maximum-am-resource-percent (0.2) = 1.6
1.6 gets rounded up to 2, since partial vcores make no sense. This means you can only have 2 application masters at a time, which is why you can only run 2 jobs at a time.
Solution: bump up yarn.scheduler.capacity.maximum-am-resource-percent to a higher value, like 0.5.
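The arithmetic in this answer can be sketched as follows. This simply reproduces the answer's reasoning (vcores times the AM percentage, rounded up); the scheduler's real accounting may differ, for example if it limits application masters by memory. The function name and the one-vcore-per-AM default are assumptions.

```python
import math


def max_concurrent_ams(total_vcores, max_am_resource_percent, vcores_per_am=1):
    """Application masters that fit in the AM budget, per the answer's model."""
    am_budget = math.ceil(total_vcores * max_am_resource_percent)
    return am_budget // vcores_per_am


print(max_concurrent_ams(8, 0.2))  # 2
print(max_concurrent_ams(8, 0.5))  # 4
```

Under this model, raising the setting to 0.5 leaves room for four application masters on the 8-vcore cluster.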
The following parameters control parallel execution:
spark.executor.instances -> number of executors
spark.executor.cores -> number of cores per executor
spark.task.cpus -> number of cores allocated to each task
https://spark.apache.org/docs/latest/submitting-applications.html
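As a small sketch of how these three settings combine: each executor can run as many tasks at once as its cores divided by the cores each task reserves (spark.task.cpus). The function name is hypothetical.

```python
def max_concurrent_tasks(executor_instances, executor_cores, task_cpus=1):
    """Upper bound on tasks running at the same time across all executors."""
    tasks_per_executor = executor_cores // task_cpus
    return executor_instances * tasks_per_executor


# The deployment command in the question: 1 executor with 1 core.
print(max_concurrent_tasks(1, 1))      # 1

# Reserving 2 cores per task halves the parallelism of a 4 x 8-core setup.
print(max_concurrent_tasks(4, 8))      # 32
print(max_concurrent_tasks(4, 8, 2))   # 16
```

Note that these settings control parallelism inside one application; they do not change how many applications YARN admits at once.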

Spring DataFlow Yarn - Container is running beyond physical memory

I'm running Spring Cloud Tasks on YARN. Simple tasks work fine, but when running bigger tasks that require more resources I get a "Container is running beyond physical memory" error:
onContainerCompleted:ContainerStatus: [ContainerId:
container_1485796744143_0030_01_000002, State: COMPLETE, Diagnostics: Container [pid=27456,containerID=container_1485796744143_0030_01_000002] is running beyond physical memory limits. Current usage: 652.5 MB of 256 MB physical memory used; 5.6 GB of 1.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1485796744143_0030_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 27461 27456 27456 27456 (java) 1215 126 5858455552 166335 /usr/lib/jvm/java-1.8.0/bin/java -Dserver.port=0 -Dspring.jmx.enabled=false -Dspring.config.location=servers.yml -jar cities-job-0.0.1.jar --spring.datasource.driverClassName=org.h2.Driver --spring.datasource.username=sa --spring.cloud.task.name=city2 --spring.datasource.url=jdbc:h2:tcp://localhost:19092/mem:dataflow
|- 27456 27454 27456 27456 (bash) 0 0 115806208 705 /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Dserver.port=0 -Dspring.jmx.enabled=false -Dspring.config.location=servers.yml -jar cities-job-0.0.1.jar --spring.datasource.driverClassName='org.h2.Driver' --spring.datasource.username='sa' --spring.cloud.task.name='city2' --spring.datasource.url='jdbc:h2:tcp://localhost:19092/mem:dataflow' 1>/var/log/hadoop-yarn/containers/application_1485796744143_0030/container_1485796744143_0030_01_000002/Container.stdout 2>/var/log/hadoop-yarn/containers/application_1485796744143_0030/container_1485796744143_0030_01_000002/Container.stderr
I tried tuning options in DataFlow's server.yml settings:
spring:
  deployer:
    yarn:
      app:
        baseDir: /dataflow
        taskappmaster:
          memory: 512m
          virtualCores: 1
          javaOpts: "-Xms512m -Xmx512m"
        taskcontainer:
          priority: 1
          memory: 512m
          virtualCores: 1
          javaOpts: "-Xms256m -Xmx512m"
I found that taskappmaster memory changes take effect (the AM container in YARN is set to this value), but the taskcontainer memory options do not: every container created for a Cloud Task has only 256 MB, which is the default for YarnDeployer.
For this server.yml the expected result is the allocation of 2 containers with 512 MB each, one for the Application Master and one for the application container. Instead, YARN allocates 512 MB for the Application Master and 256 MB for the application.
I don't think this problem is caused by wrong YARN options, because Spark applications work correctly and take gigabytes of memory.
Some of my YARN settings:
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.reduce.memory.mb 2880
mapreduce.map.java.opts -Xmx3277m
mapreduce.map.memory.mb 4096
yarn.nodemanager.vmem-pmem-ratio 5
yarn.nodemanager.vmem-check-enabled false
yarn.scheduler.minimum-allocation-mb 32
yarn.nodemanager.resource.memory-mb 11520
My Hadoop runtime is EMR 4.4.0; I also had to change the default Java to 1.8.
Cleaning up the /dataflow directory in HDFS resolves the problem: after this directory is deleted, Spring DataFlow uploads all the needed files again. The other way is to remove the file yourself and upload a new one.

elasticsearch JDBC -RIVER java.lang.OutOfMemoryError: unable to create new native thread

I am using Elasticsearch 1.4.2 with the river plugin on an AWS instance with 8 GB RAM. Everything worked fine for a week, but then the river plugin [plugin=org.xbib.elasticsearch.plugin.jdbc.river.JDBCRiverPlugin version=1.4.0.4] stopped working, and I was also unable to log in to the server over SSH. After a server restart, SSH login worked fine. When I checked the Elasticsearch logs, I found this error:
[2015-01-29 09:00:59,001][WARN ][river.jdbc.SimpleRiverFlow] no river mouth
[2015-01-29 09:00:59,001][ERROR][river.jdbc.RiverThread ] java.lang.OutOfMemoryError: unable to create new native thread
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread
After restarting the service everything works normally, but after a certain interval the same thing happens. Can anyone tell me what the reason and solution could be? If any other details are required, please let me know.
When I checked the number of file descriptors using
sudo ls /proc/1503/fd/ | wc -l
I could see that it increases every time. It was 320 and has now reached 360 (and keeps increasing). And
sudo grep -E "^Max open files" /proc/1503/limits
this shows 65535
processor info
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
microcode : 0x415
cpu MHz : 2500.096
cache size : 25600 KB
siblings : 8
cpu cores : 4
memory
MemTotal: 62916320 kB
MemFree: 57404812 kB
Buffers: 102952 kB
Cached: 3067564 kB
SwapCached: 0 kB
Active: 2472032 kB
Inactive: 2479576 kB
Active(anon): 1781216 kB
Inactive(anon): 528 kB
Active(file): 690816 kB
Inactive(file): 2479048 kB
Do the following
Run the following two commands as root:
ulimit -l unlimited
ulimit -n 64000
In /etc/elasticsearch/elasticsearch.yml make sure you uncomment or add a line that says:
bootstrap.mlockall: true
In /etc/default/elasticsearch uncomment the line (or add a line) that says MAX_LOCKED_MEMORY=unlimited and also set the ES_HEAP_SIZE line to a reasonable number. Make sure it's a high enough amount of memory that you don't starve elasticsearch, but it should not be higher than half the memory on your system generally and definitely not higher than ~30GB. I have it set to 8g on my data nodes.
In one way or another the process is obviously being starved of resources. Give your system plenty of memory and give elasticsearch a good part of that.
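The heap sizing rule of thumb in this answer (no more than half of system memory, and never above roughly 30 GB) can be written as a one-line bound. This is only a sketch of that rule; the function name is hypothetical and the 30 GB cap is taken from the answer's "~30GB".

```python
def es_heap_upper_bound_gb(total_ram_gb, cap_gb=30):
    """Upper bound for ES_HEAP_SIZE following the answer's rule of thumb."""
    return min(total_ram_gb / 2, cap_gb)


print(es_heap_upper_bound_gb(8))    # 4.0
print(es_heap_upper_bound_gb(128))  # 30
```

This is a ceiling, not a recommendation: the right value below it depends on how much memory the rest of the system needs.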
I think you need to analyze your server log, for example /var/log/messages.

Container is running beyond physical memory. Hadoop Streaming python MR

I am running a Python script which needs a file (genome.fa) as a dependency (reference) to execute. When I run this command:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./methratio.py -file '../Test_BSMAP/genome.fa' -mapper './methratio.py -r -g ' -input /TextLab/sravisha_test/SamFiles/test_sam -output ./outfile
I am getting this Error:
15/01/30 10:48:38 INFO mapreduce.Job: map 0% reduce 0%
15/01/30 10:52:01 INFO mapreduce.Job: Task Id : attempt_1422600586708_0001_m_000009_0, Status : FAILED
Container [pid=22533,containerID=container_1422600586708_0001_01_000017] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
I am using Cloudera Manager (Free Edition). These are my config settings:
yarn.app.mapreduce.am.resource.cpu-vcores = 1
ApplicationMaster Java Maximum Heap Size = 825955249 B
mapreduce.map.memory.mb = 1GB
mapreduce.reduce.memory.mb = 1 GB
mapreduce.map.java.opts = -Djava.net.preferIPv4Stack=true
mapreduce.map.java.opts.max.heap = 825955249 B
yarn.app.mapreduce.am.resource.mb = 1GB
Java Heap Size of JobHistory Server in Bytes = 397 MB
Can someone tell me why I am getting this error?
I think your Python script is consuming a lot of memory while reading your large input file (clue: genome.fa).
Here is my reasoning (Ref: http://courses.coreservlets.com/Course-Materials/pdf/hadoop/04-MapRed-6-JobExecutionOnYarn.pdf, Container is running beyond memory limits, http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/):
Container’s Memory Usage = JVM Heap Size + JVM Perm Gen + Native Libraries + Memory used by spawned processes
The last term, 'Memory used by spawned processes' (the Python code), might be the culprit.
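Plugging the question's numbers into this formula shows how little room is left for the Python process. The sketch below uses the 1 GB map container and the 825955249-byte maximum heap from the posted config; the perm gen and native library sizes are unknown, so they are set to 0 here as an assumption, which means the real headroom is smaller. The function name is hypothetical.

```python
MB = 1024 * 1024


def headroom_bytes(container_limit_bytes, jvm_heap_bytes,
                   perm_gen_bytes=0, native_bytes=0):
    """Memory left for spawned processes once the JVM's share is subtracted."""
    jvm_total = jvm_heap_bytes + perm_gen_bytes + native_bytes
    return container_limit_bytes - jvm_total


limit = 1024 * MB   # mapreduce.map.memory.mb = 1GB
heap = 825955249    # mapreduce.map.java.opts.max.heap (a ceiling, not actual usage)

print(round(headroom_bytes(limit, heap) / MB, 1))  # 236.3
```

If the JVM grows toward its maximum heap, a script that loads a large reference file can easily need more than the couple of hundred megabytes that remain.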
Try increasing the memory size of these 2 parameters: mapreduce.map.java.opts
and mapreduce.reduce.java.opts.
Try increasing the number of map tasks spawned at execution time. You can increase the number of mappers by decreasing the split size (mapred.max.split.size).
It will add overhead but will mitigate the problem.
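The relationship between split size and mapper count can be approximated as below. This is a simplified model: the real number of splits also depends on the input format, block size and number of input files. The 1 GiB input size is a made-up figure for illustration, and the function name is hypothetical.

```python
import math

MB = 1024 * 1024


def estimated_map_tasks(input_bytes, split_bytes):
    """Approximate mapper count for a single splittable input."""
    return math.ceil(input_bytes / split_bytes)


input_size = 1024 * MB  # hypothetical input size

print(estimated_map_tasks(input_size, 128 * MB))  # 8
print(estimated_map_tasks(input_size, 32 * MB))   # 32
```

Smaller splits give each mapper less input to process, at the cost of more task startup overhead.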
