Spring DataFlow Yarn - Container is running beyond physical memory

I'm running Spring Cloud Tasks on YARN. Simple tasks work fine, but when running bigger tasks that require more resources I get a "Container is running beyond physical memory" error:
onContainerCompleted:ContainerStatus: [ContainerId:
container_1485796744143_0030_01_000002, State: COMPLETE, Diagnostics: Container [pid=27456,containerID=container_1485796744143_0030_01_000002] is running beyond physical memory limits. Current usage: 652.5 MB of 256 MB physical memory used; 5.6 GB of 1.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1485796744143_0030_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 27461 27456 27456 27456 (java) 1215 126 5858455552 166335 /usr/lib/jvm/java-1.8.0/bin/java -Dserver.port=0 -Dspring.jmx.enabled=false -Dspring.config.location=servers.yml -jar cities-job-0.0.1.jar --spring.datasource.driverClassName=org.h2.Driver --spring.datasource.username=sa --spring.cloud.task.name=city2 --spring.datasource.url=jdbc:h2:tcp://localhost:19092/mem:dataflow
|- 27456 27454 27456 27456 (bash) 0 0 115806208 705 /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Dserver.port=0 -Dspring.jmx.enabled=false -Dspring.config.location=servers.yml -jar cities-job-0.0.1.jar --spring.datasource.driverClassName='org.h2.Driver' --spring.datasource.username='sa' --spring.cloud.task.name='city2' --spring.datasource.url='jdbc:h2:tcp://localhost:19092/mem:dataflow' 1>/var/log/hadoop-yarn/containers/application_1485796744143_0030/container_1485796744143_0030_01_000002/Container.stdout 2>/var/log/hadoop-yarn/containers/application_1485796744143_0030/container_1485796744143_0030_01_000002/Container.stderr
I tried tuning options in DataFlow's server.yml settings:
spring:
  deployer:
    yarn:
      app:
        baseDir: /dataflow
        taskappmaster:
          memory: 512m
          virtualCores: 1
          javaOpts: "-Xms512m -Xmx512m"
        taskcontainer:
          priority: 1
          memory: 512m
          virtualCores: 1
          javaOpts: "-Xms256m -Xmx512m"
I found out that the taskappmaster memory change takes effect (the AM container in YARN is set to this value), but the taskcontainer memory option doesn't change anything: every container created for a Cloud Task gets only 256 MB, which is the default for the YARN deployer.
With this server.yml the expected result is the allocation of two containers of 512 MB each, for both the Application Master and the application container. Instead, YARN allocates two containers: 512 MB for the Application Master and 256 MB for the application.
I don't think this problem is caused by wrong YARN options, because Spark applications run correctly and claim gigabytes of memory.
Some of my YARN settings:
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.reduce.memory.mb 2880
mapreduce.map.java.opts -Xmx3277m
mapreduce.map.memory.mb 4096
yarn.nodemanager.vmem-pmem-ratio 5
yarn.nodemanager.vmem-check-enabled false
yarn.scheduler.minimum-allocation-mb 32
yarn.nodemanager.resource.memory-mb 11520
My Hadoop runtime is EMR 4.4.0; I also had to change the default Java to 1.8.

Cleaning up the /dataflow directory in HDFS resolves the problem: after deleting this directory, Spring DataFlow uploads all needed files again. Alternatively, remove the stale file yourself and upload a new one.
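For reference, a minimal way to do that cleanup from the command line, assuming the baseDir configured in the server.yml above (/dataflow) and a standard HDFS client:
hdfs dfs -ls /dataflow      # inspect what the deployer has staged
hdfs dfs -rm -r /dataflow   # remove it; DataFlow re-uploads the artifacts on the next deployment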

Related

Sonar-scanner hangs after 'Load active rules (done)' is shown in the logs

The tail of logging shows the following:
22:09:11.016 DEBUG: GET 200 http://someserversomewhere:9000/api/rules/search.protobuf?f=repo,name,severity,lang,internalKey,templateKey,params,actives,createdAt,updatedAt&activation=true&qprofile=AXaXXXXXXXXXXXXXXXw0&ps=500&p=1 | time=427ms
22:09:11.038 INFO: Load active rules (done) | time=12755ms
I attached to the running container to see whether the scanner process is pegged/running/etc., and it shows the following:
Mem: 2960944K used, 106248K free, 67380K shrd, 5032K buff, 209352K cached
CPU: 0% usr 0% sys 0% nic 99% idle 0% io 0% irq 0% sirq
Load average: 5.01 5.03 4.83 1/752 46
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 root S 3811m 127% 1 0% /opt/java/openjdk/bin/java -Djava.awt.headless=true -classpath /opt/sonar-scanner/lib/sonar-scann
40 0 root S 2424 0% 0 0% bash
46 40 root R 1584 0% 0 0% top
I was unable to find any logging in the sonar-scanner-cli container to help indicate the state. It appears to just be hung and waiting for something to happen.
I am running SonarQube locally from Docker at the LTS version 7.9.5.
I am also running the Docker container sonarsource/sonar-scanner-cli, which currently uses the following version in its Dockerfile:
SONAR_SCANNER_VERSION=4.5.0.2216
I am triggering the scan via the following command:
docker run --rm \
-e SONAR_HOST_URL="http://someserversomewhere:9000" \
-e SONAR_LOGIN="nottherealusername" \
-e SONAR_PASSWORD="not12345likeinspaceballs" \
-v "$DOCKER_TEST_DIRECTORY:/usr/src" \
--link "myDockerContainerNameForSonarQube" \
sonarsource/sonar-scanner-cli -X -Dsonar.password=not12345likeinspaceballs -Dsonar.verbose=true \
-Dsonar.sources=app -Dsonar.tests=test -Dsonar.branch=master \
-Dsonar.projectKey="${PROJECT_KEY}" -Dsonar.log.level=TRACE \
-Dsonar.projectBaseDir=/usr/src/$PROJECT_NAME -Dsonar.working.directory=/usr/src/$PROJECT_NAME/$SCANNER_WORK_DIR
I have done a lot of digging to find anyone with similar issues, and found the following older question, which seems similar, but it is unclear how to determine whether I am experiencing something related: Why does sonar-maven-plugin hang at loading global settings or active rules?
I am stuck and not sure what to do next; any help or hints would be appreciated.
An additional note: this process does work for the 8.4.2-developer version of SonarQube that I am planning to migrate to. The purpose of verifying 7.9.5 is to follow SonarQube's recommended upgrade path, which suggests first bringing your current version to the latest LTS and running the data migration before jumping to the next major version.

Spark - Container is running beyond physical memory limits

I have a cluster of two worker nodes.
Worker_Node_1 - 64GB RAM
Worker_Node_2 - 32GB RAM
Background Summary:
I am trying to execute spark-submit on yarn-cluster to run Pregel on a graph to calculate the shortest path distances from one source vertex to all other vertices and print the values on the console.
Experiment:
For a small graph with 15 vertices, execution completes with application final status: SUCCEEDED.
My code works perfectly and prints the shortest distances for a 241-vertex graph with a single source vertex, but there is a problem.
Problem:
When I dig into the log file, the task completes successfully in 4 minutes and 26 seconds, but the terminal keeps showing the application status as Running, and after approximately 12 more minutes the execution terminates saying:
Application application_1447669815913_0002 failed 2 times due to AM Container for appattempt_1447669815913_0002_000002 exited with exitCode: -104 For more detailed output, check application tracking page:http://myserver.com:8088/proxy/application_1447669815913_0002/
Then, click on links to logs of each attempt.
Diagnostics: Container [pid=47384,containerID=container_1447669815913_0002_02_000001] is running beyond physical memory limits. Current usage: 17.9 GB of 17.5 GB physical memory used; 18.7 GB of 36.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1447669815913_0002_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 47387 47384 47384 47384 (java) 100525 13746 20105633792 4682973 /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -Xmx16384m -Djava.io.tmpdir=/yarn/nm/usercache/cloudera/appcache/application_1447669815913_0002/container_1447669815913_0002_02_000001/tmp -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://myserver.com:8020/user/spark/applicationHistory -Dspark.executor.memory=14g -Dspark.shuffle.service.enabled=false -Dspark.yarn.executor.memoryOverhead=2048 -Dspark.yarn.historyServer.address=http://myserver.com:18088 -Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.shuffle.service.port=7337 -Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.authenticate=false -Dspark.app.name=com.path.PathFinder -Dspark.master=yarn-cluster -Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.path.PathFinder --jar file:/home/cloudera/Documents/Longest_Path_Data_1/Jars/ShortestPath_Loop-1.0.jar --arg /home/cloudera/workspace/Spark-Integration/LongestWorstPath/configFile --executor-memory 14336m --executor-cores 32 --num-executors 2
|- 47384 47382 47384 47384 (bash) 2 0 17379328 853 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native::/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -Xmx16384m -Djava.io.tmpdir=/yarn/nm/usercache/cloudera/appcache/application_1447669815913_0002/container_1447669815913_0002_02_000001/tmp '-Dspark.eventLog.enabled=true' '-Dspark.eventLog.dir=hdfs://myserver.com:8020/user/spark/applicationHistory' '-Dspark.executor.memory=14g' '-Dspark.shuffle.service.enabled=false' '-Dspark.yarn.executor.memoryOverhead=2048' '-Dspark.yarn.historyServer.address=http://myserver.com:18088' '-Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' '-Dspark.shuffle.service.port=7337' '-Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.authenticate=false' '-Dspark.app.name=com.path.PathFinder' '-Dspark.master=yarn-cluster' '-Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' '-Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.path.PathFinder' --jar file:/home/cloudera/Documents/Longest_Path_Data_1/Jars/ShortestPath_Loop-1.0.jar --arg '/home/cloudera/workspace/Spark-Integration/LongestWorstPath/configFile' --executor-memory 14336m --executor-cores 32 --num-executors 2 1> /var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
Things I tried:
yarn.scheduler.maximum-allocation-mb = 32GB
mapreduce.map.memory.mb = 2048 (Previously it was 1024)
Tried varying --driver-memory up to 24g
Could you please shed more light on how I can configure the Resource Manager so that large graphs (> 300K vertices) can also be processed? Thanks.
Just increasing the default spark.driver.memory from 512m to 2g solved this error in my case.
You may set the memory higher if you keep hitting the same error. Then you can keep reducing it until it hits the same error again, so that you know the optimum driver memory to use for your job.
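As a sketch of how that can be applied (the class and jar names here are placeholders, and 2g is just the value that worked above), the setting can go on the spark-submit command line or into spark-defaults.conf:
spark-submit --master yarn-cluster \
  --class com.example.PathFinder \
  --driver-memory 2g \
  my-app.jar
# or, equivalently, in conf/spark-defaults.conf:
# spark.driver.memory  2g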
The more data you are processing, the more memory is needed by each Spark task. And if your executor is running too many tasks then it can run out of memory. When I had problems processing large amounts of data, it usually was a result of not properly balancing the number of cores per executor. Try to either reduce the number of cores or increase the executor memory.
One easy way to tell that you are having memory issues is to check the Executor tab on the Spark UI. If you see a lot of red bars indicating high garbage collection time, you are probably running out of memory in your executors.
I solved the error in my case by increasing spark.yarn.executor.memoryOverhead, which stands for off-heap memory.
When you increase driver-memory and executor-memory, do not forget this config item.
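For illustration, the overhead settings can be passed as extra --conf flags on the spark-submit command line (the class and jar names are placeholders and the sizes are only examples):
spark-submit --master yarn-cluster \
  --driver-memory 4g --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --class com.example.PathFinder my-app.jar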
I had a similar problem:
Key error info:
exitCode: -104
'PHYSICAL' memory limit
Application application_1577148289818_10686 failed 2 times due to AM Container for appattempt_1577148289818_10686_000002 exited with **exitCode: -104**
Failing this attempt.Diagnostics: [2019-12-26 09:13:54.392]Container [pid=18968,containerID=container_e96_1577148289818_10686_02_000001] is running 132722688B beyond the **'PHYSICAL' memory limit**. Current usage: 1.6 GB of 1.5 GB physical memory used; 4.6 GB of 3.1 GB virtual memory used. Killing container.
Increasing both spark.executor.memory and spark.executor.memoryOverhead didn't take effect.
Then increasing spark.driver.memory solved it.
Spark jobs request resources from the resource manager in a different way than MapReduce jobs. Try to tune the number of executors and the memory/vcores allocated to each executor. Follow http://spark.apache.org/docs/latest/submitting-applications.html
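As a rough sketch for the cluster described above (64 GB and 32 GB workers), something along these lines could be a starting point; the numbers are only illustrative, the class and jar names are placeholders, and the sizing has to leave headroom for the YARN memory overhead of each container:
spark-submit --master yarn-cluster \
  --class com.example.PathFinder \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my-app.jar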

elasticsearch JDBC river: java.lang.OutOfMemoryError: unable to create new native thread

I am using elasticsearch "1.4.2" with the river plugin on an AWS instance with 8 GB RAM. Everything was working fine for a week, but then the river plugin [plugin=org.xbib.elasticsearch.plugin.jdbc.river.JDBCRiverPlugin version=1.4.0.4] stopped working, and I was also unable to SSH into the server. After a server restart, SSH login worked fine, and when I checked the Elasticsearch logs I found this error:
[2015-01-29 09:00:59,001][WARN ][river.jdbc.SimpleRiverFlow] no river mouth
[2015-01-29 09:00:59,001][ERROR][river.jdbc.RiverThread ] java.lang.OutOfMemoryError: unable to create new native thread
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread
After restarting the service everything works normally, but after a certain interval the same thing happens again. Can anyone tell me what the reason and solution could be? If any other details are required, please let me know.
When I checked the number of file descriptors using
sudo ls /proc/1503/fd/ | wc -l
I could see that it keeps increasing over time; it was 320 and has now reached 360. And
sudo grep -E "^Max open files" /proc/1503/limits
this shows 65535
processor info
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
microcode : 0x415
cpu MHz : 2500.096
cache size : 25600 KB
siblings : 8
cpu cores : 4
memory
MemTotal: 62916320 kB
MemFree: 57404812 kB
Buffers: 102952 kB
Cached: 3067564 kB
SwapCached: 0 kB
Active: 2472032 kB
Inactive: 2479576 kB
Active(anon): 1781216 kB
Inactive(anon): 528 kB
Active(file): 690816 kB
Inactive(file): 2479048 kB
Do the following
Run the following two commands as root:
ulimit -l unlimited
ulimit -n 64000
In /etc/elasticsearch/elasticsearch.yml make sure you uncomment or add a line that says:
bootstrap.mlockall: true
In /etc/default/elasticsearch uncomment the line (or add a line) that says MAX_LOCKED_MEMORY=unlimited and also set the ES_HEAP_SIZE line to a reasonable number. Make sure it's a high enough amount of memory that you don't starve elasticsearch, but it should not be higher than half the memory on your system generally and definitely not higher than ~30GB. I have it set to 8g on my data nodes.
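For illustration only, the relevant lines in /etc/default/elasticsearch would end up looking roughly like this (8g matches the data-node example above; pick a value appropriate for your own hardware):
ES_HEAP_SIZE=8g
MAX_LOCKED_MEMORY=unlimited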
In one way or another the process is obviously being starved of resources. Give your system plenty of memory and give elasticsearch a good part of that.
I think you need to analyze your server log, maybe in /var/log/messages.

Container is running beyond physical memory. Hadoop Streaming python MR

I am running a Python script which needs a file (genome.fa) as a dependency (reference) to execute. When I run this command:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./methratio.py -file '../Test_BSMAP/genome.fa' -mapper './methratio.py -r -g ' -input /TextLab/sravisha_test/SamFiles/test_sam -output ./outfile
I am getting this Error:
15/01/30 10:48:38 INFO mapreduce.Job: map 0% reduce 0%
15/01/30 10:52:01 INFO mapreduce.Job: Task Id : attempt_1422600586708_0001_m_000009_0, Status : FAILED
Container [pid=22533,containerID=container_1422600586708_0001_01_000017] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
I am using Cloudera Manager (Free Edition). These are my configs:
yarn.app.mapreduce.am.resource.cpu-vcores = 1
ApplicationMaster Java Maximum Heap Size = 825955249 B
mapreduce.map.memory.mb = 1GB
mapreduce.reduce.memory.mb = 1 GB
mapreduce.map.java.opts = -Djava.net.preferIPv4Stack=true
mapreduce.map.java.opts.max.heap = 825955249 B
yarn.app.mapreduce.am.resource.mb = 1GB
Java Heap Size of JobHistory Server in Bytes = 397 MB
Can someone tell me why I am getting this error?
I think your python script is consuming a lot of memory during the reading of your large input file (clue: genome.fa).
Here is my reason (Ref: http://courses.coreservlets.com/Course-Materials/pdf/hadoop/04-MapRed-6-JobExecutionOnYarn.pdf, Container is running beyond memory limits, http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/)
Container’s Memory Usage = JVM Heap Size + JVM Perm Gen + Native Libraries + Memory used by spawned processes
The last variable 'Memory used by spawned processes' (the Python code) might be the culprit.
Try increasing the memory size of these two parameters: mapreduce.map.java.opts and mapreduce.reduce.java.opts.
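A sketch of how those properties could be passed to the streaming job from the question (the values are only illustrative, and the generic -D options must come before the streaming-specific arguments):
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.map.java.opts=-Xmx1638m \
  -D mapreduce.reduce.memory.mb=2048 \
  -D mapreduce.reduce.java.opts=-Xmx1638m \
  -file ./methratio.py -file ../Test_BSMAP/genome.fa \
  -mapper './methratio.py -r -g ' \
  -input /TextLab/sravisha_test/SamFiles/test_sam \
  -output ./outfile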
Try increasing the number of maps spawned at execution time: you can increase the number of mappers by decreasing the split size (mapred.max.split.size).
It will add overhead but will mitigate the problem.

RethinkDB: why does rethinkdb service use so much memory?

After encountering situations where I found the rethinkdb service down for unknown reasons, I noticed it uses a lot of memory:
# free -m
total used free shared buffers cached
Mem: 7872 7744 128 0 30 68
-/+ buffers/cache: 7645 226
Swap: 4031 287 3744
# top
top - 23:12:51 up 7 days, 1:16, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 133 total, 1 running, 132 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061372k total, 7931724k used, 129648k free, 32752k buffers
Swap: 4128760k total, 294732k used, 3834028k free, 71260k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1835 root 20 0 7830m 7.2g 5480 S 1.0 94.1 292:43.38 rethinkdb
29417 root 20 0 15036 1256 944 R 0.3 0.0 0:00.05 top
1 root 20 0 19364 1016 872 S 0.0 0.0 0:00.87 init
# cat log_file | tail -9
2014-09-22T21:56:47.448701122 0.052935s info: Running rethinkdb 1.12.5 (GCC 4.4.7)...
2014-09-22T21:56:47.452809839 0.057044s info: Running on Linux 2.6.32-431.17.1.el6.x86_64 x86_64
2014-09-22T21:56:47.452969820 0.057204s info: Using cache size of 3327 MB
2014-09-22T21:56:47.453169285 0.057404s info: Loading data from directory /rethinkdb_data
2014-09-22T21:56:47.571843375 0.176078s info: Listening for intracluster connections on port 29015
2014-09-22T21:56:47.587691636 0.191926s info: Listening for client driver connections on port 28015
2014-09-22T21:56:47.587912507 0.192147s info: Listening for administrative HTTP connections on port 8080
2014-09-22T21:56:47.595163724 0.199398s info: Listening on addresses
2014-09-22T21:56:47.595167377 0.199401s info: Server ready
That seems like a lot considering the size of the files:
# du -h
4.0K ./tmp
156M .
Do I need to configure a different cache size? Do you think it has something to do with the service unexpectedly going down? I'm using v1.12.5.
There were a few leaks in previous versions, the main one being https://github.com/rethinkdb/rethinkdb/issues/2840
You should probably update RethinkDB -- the current version being 1.15.
If you run 1.12, you need to export your data, but that should be the last time you need it since 1.14 introduced seamless migrations.
From Understanding RethinkDB memory requirements - RethinkDB
By default, RethinkDB automatically configures the cache size limit according to the formula (available_mem - 1024 MB) / 2, where available_mem is the memory available on the system.
You can change this via a config file as they document, or change it with a size (in MB) from the command line:
rethinkdb --cache-size 2048
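Or, for example, as a line in the instance's configuration file (the path varies by installation, commonly under /etc/rethinkdb/instances.d/); the value is again in MB:
cache-size=2048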
