Jmeter slave (distributed testing): Getting GC overhead limit exceeded in slave machines - jmeter

Jmeter slave (distributed testing): Getting "GC overhead limit exceeded" in the slave machines when the number of users reaches 300+. I have already changed the jmeter.sh file on all 3 machines (1 master and 2 slaves) to a 3 GB heap, but for some reason these values are not picked up. Please guide me on how and where to set the heap size on the slave machines.
Running in non-GUI mode without any listeners or graphs.
Running command:
sudo docker exec -i master /bin/bash -c "/jmeter/apache-jmeter-3.1/bin/jmeter -n -t /home/xx_journey_new.jmx -Djava.rmi.server.hostname=zz.zz.zz.zz -Dclient.rmi.localport=60000 -Rxx.xx.xx.xx,yy.yy.yy.yy -j jmeter.log -l result.csv"
jmeter.sh file in slave machines:
HEAP="-Xms1024m -Xmx3072m"
I also tried the following:
set HEAP=-Xms4g -Xmx4g
Please guide me. The attached file has full details about the errors.

Looking into the GC overhead limit exceeded article:
The java.lang.OutOfMemoryError: GC overhead limit exceeded error is displayed when your application has exhausted pretty much all the available memory and GC has repeatedly failed to clean it.
So it indicates a problem with your test: it uses all the heap space allocated to JMeter, and the Java Garbage Collector cannot free up enough memory to continue.
So make sure you are:
Following JMeter Best Practices
Sticking to the recommendations from 9 Easy Solutions for a JMeter Load Test “Out of Memory” Failure
Optimizing your test and keeping only those test elements which are absolutely required
Assessing other Garbage Collector implementations, e.g. CMS or G1 (see the sketch below for one way of applying the heap and collector settings on the slaves)
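For the specific question of where the heap is actually picked up on the slaves, here is a minimal sketch, not a verified fix. It assumes the slaves also run as Docker containers (the names slave1 and slave2 are placeholders) and relies on the JVM_ARGS environment variable, which the standard bin/jmeter startup script reads; editing a jmeter.sh on the host has no effect on a JMeter running inside a container, which may be why your 3 GB value is ignored.
# Hypothetical container name; adjust the path to your image layout
sudo docker exec -i slave1 /bin/bash -c 'JVM_ARGS="-Xms1g -Xmx3g -XX:+UseG1GC" /jmeter/apache-jmeter-3.1/bin/jmeter-server'
The same override can be applied on the master by prefixing your existing jmeter -n command with JVM_ARGS=...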

You need to monitor your slaves' memory and CPU utilization as well, along with the application server. Check here for more details - http://www.testautomationguru.com/jmeter-server-performance-monitoring-with-collectd-influxdb-grafana/
Regarding the heap allocation, you can give JMeter up to about 80% of the machine's RAM. If the error still occurs, you need to reduce the load per slave, which means you need more slaves to run your test.
You could also revisit your test plan: remove any unnecessary test elements and remove all listeners from the test plan. http://www.testautomationguru.com/jmeter-tips-tricks-for-beginners/
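As a quick complement to the monitoring setup above, and purely as an illustration (the container names master, slave1 and slave2 are placeholders), you can stream live CPU and memory figures for the JMeter containers while the test runs:
sudo docker stats master slave1 slave2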

Related

Nifi memory continues to expand

I use a three-node NiFi cluster; the NiFi version is 1.16.3. Each node has 8 cores, 32 GB of memory and a 2 TB high-speed SSD. The OS is CentOS 7.9 on ARM64 hardware.
The initial NiFi configuration is -Xms12g and -Xmx12g (bootstrap.conf).
It is a native installation; Docker is not used, only NiFi is installed on all those machines, and it uses the embedded ZooKeeper.
We run 20 workflows every day from 00:00 to 03:00, and the total data size is 1.2 GB. They collect CSV documents into a Greenplum database.
My problem is that NiFi's memory usage increases every day, about 0.2 GB per day, and all three nodes behave like this. The memory slowly fills up and then the machine dies. This takes about a month (with the memory set to 12 GB).
That is to say, I need to restart the cluster every month. I use only native processors and workflows.
I can't locate the problem. Who can help me?
I may have missed some details. Please feel free to let me know, thanks.
I have made the following attempts:
I set the initial memory to 18 GB or 6 GB; the workflow processing speed did not change. The only difference is that, after setting it to 18 GB, the freeze lasts a shorter time.
I used OpenJDK 1.8 and tried upgrading to 11, but it made no difference.
I added the following configuration, which was also useless:
java.arg.7=-XX:ReservedCodeCacheSize=256m
java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.9=-XX:+UseCodeCacheFlushing
The daily scheduled tasks consume few resources. Even with the memory reduced to 6 GB and 20 tasks running at the same time, memory consumption is about 30%, and the whole run finishes in about half an hour.
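Not a fix, just a diagnostic sketch, under the assumption that the JDK tools are on the PATH and that NiFi's JVM runs the main class org.apache.nifi.NiFi. Comparing the process RSS with a heap histogram over a few days can at least tell you whether the growth is on-heap (a leak visible in the histogram) or native/off-heap (RSS grows while the heap stays flat):
# find the NiFi JVM (assumes a single NiFi process per node)
pid=$(pgrep -f org.apache.nifi.NiFi | head -n1)
# resident memory of the whole process, in KB
ps -o rss= -p "$pid"
# top object classes currently on the heap
jcmd "$pid" GC.class_histogram | head -n 30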

Why is hadoop slow for a simple hello world job

I am following the tutorial on the hadoop website: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html.
I run the following example in Pseudo-Distributed Mode.
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'
It takes 1 min 47 s to complete. When I turn off the network (Wi-Fi), it finishes in approximately 50 seconds.
When I run the same command using the Local (Standalone) Mode, it finishes in approx 5 seconds (on a mac).
I understand that in Pseudo-Distributed Mode there is more overhead involved and hence it will take more time, but in this case it takes way more time. The CPU is completely idle during the run.
Do you have any idea what can cause this issue?
First, I don't have an explanation for why turning off your network would result in faster times. You'd have to dig through the Hadoop logs to figure out that problem.
This is typical behavior most people encounter when running Hadoop on a single node. Effectively, you are trying to use FedEx to deliver something to your next-door neighbor. It will always be faster to walk it over because of the inherent overhead of operating a distributed system. When you run in local mode, you are only performing the Map-Reduce function. When you run pseudo-distributed, it uses all the Hadoop servers (NameNode and DataNodes for data; ResourceManager and NodeManagers for compute), and what you are seeing are the latencies involved in that.
When you submit your job, the ResourceManager has to schedule it. As your cluster is not busy, it will ask for resources from the NodeManager. The NodeManager will give it a container which will run your ApplicationMaster. Typically, this loop takes about 10 seconds. Once your AM is running, it will ask the ResourceManager for resources for its Map and Reduce tasks. This takes another 10 seconds. Also, when you submit your job, there is around a 3 second wait before it is actually submitted to the ResourceManager. So far that's 23 seconds and you haven't done any computation yet.
Once the job is running, the most likely cause of waiting is allocating memory. On smaller systems (< 32 GB of memory) the OS might take a while to allocate space. If you were to run the same thing on what is considered commodity hardware for Hadoop (16+ cores, 64+ GB) you would probably see a run time closer to 25-30 seconds.
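If you want to confirm that the extra time is scheduling and startup overhead rather than the computation itself, one rough check (a sketch only; it assumes the examples jar accepts the usual -D generic options, and output_local is just a placeholder for a fresh output directory) is to run the same job against HDFS but with the local MapReduce framework, which takes YARN out of the picture:
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep -Dmapreduce.framework.name=local input output_local 'dfs[a-z.]+'
If that run lands near your 5-second standalone figure, most of the difference comes from YARN container allocation and AM startup rather than HDFS or the grep itself.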

How to get memory usage information from Yarn to measure performance?

How can I find the allocated memory and utilized memory for a YARN container from the command line? Basically I want to measure the performance of the YARN containers in order to set optimal parameters.
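As a hedged starting point (the application ID and node ID below are placeholders), YARN itself exposes per-application and per-node figures from the command line: yarn application -status prints the aggregate resource allocation in MB-seconds and vcore-seconds, yarn top should show memory allocated per running application, and yarn node -status shows memory used versus capacity on a node:
yarn application -list
yarn application -status application_1234567890123_0001
yarn top
yarn node -list
yarn node -status <node-id>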

Distcp - Container is running beyond physical memory limits

I've been struggling with distcp for several days and I swear I have googled enough. Here is my use case:
USE CASE
I have a main folder in a certain location, say /hdfs/root, with a lot of subdirs (the depth is not fixed) and files.
Volume: 200,000 files ~= 30 GB
I need to copy only a subset of /hdfs/root for a client, into another location, say /hdfs/dest
This subset is defined by a list of absolute paths that can be updated over time.
Volume: 50,000 files ~= 5 GB
You understand that I can't use a simple hdfs dfs -cp /hdfs/root /hdfs/dest because it is not optimized: it would copy every file, and it has no -update mode.
SOLUTION POC
I ended up using hadoop distcp in two ways:
Algo 1 (simplified):
# I start up to N distcp jobs in parallel, one per subdir, with N=MAX_PROC (~30)
foreach subdir in subdirs:
    # mylist = /hdfs/root/dirX/file1 /hdfs/root/dirX/file2 ...
    mylist = buildList(subdir)
    hadoop distcp -i -pct -update mylist /hdfs/dest/subdir &
and
Algo 2
# I start one distcp that has a blacklist
blacklist = buildBlackList()
hadoop distcp -numListstatusThread 10 -filters blacklist -pct -update /hdfs/root /hdfs/dest
Algo 2 does not even start; it seems that building the diff between the source and the blacklist is too hard for it, so I use Algo 1, and it works.
OOZIE WORKFLOW
Now I need to schedule the whole thing in an Oozie workflow.
I have put Algo 1 in a shell action, since I have a lot of distcp commands and I don't master recursion or loops in Oozie.
Once started, after a while, I get the following error:
Container runs beyond physical memory limits. Current usage: 17.2 GB of 16 GB physical memory used
Alright then, I'm going to add more memory:
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>32768</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx512m</value>
</property>
</configuration>
And still I get: Container runs beyond physical memory limits. Current usage: 32.8 GB of 32 GB physical memory used. But the job lived twice as long as the previous one.
The RAM on my cluster is not infinite, so I can't go further. Here are my hypothesis:
A distcp job does not release memory (JVM garbage collector ?)
Oozie sees the addition of all distcp jobs as the current memory usage, which is stupid
This is not the right way to do this (well I know, but still)
Also, there are a lot of things I did not understand about memory management, it's pretty foggy (yarn, oozie, jvm, mapreduce).
While googling, I noticed that few people talk about real distcp use cases; this post is 4 days old: https://community.hortonworks.com/articles/71775/managing-hadoop-dr-with-distcp-and-snapshots.html and explains snapshot usage, which I can't use in my case.
I've also heard about http://atlas.incubator.apache.org which would eventually solve my problem by "tagging" files and granting access to specific users, so we could avoid copying to a certain location. My admin team is working on it, but we won't have it in production for now.
I'm quite desperate. Help me.
YARN containers are built on top of Linux "cgroups". These "cgroups" are used to put soft limits on CPU, but not on RAM...
Therefore YARN uses a clumsy workaround: it periodically checks how much RAM each container uses, and brutally kills anything that goes over quota. So you lose the execution logs, and only get that dreadful message you have seen.
In most cases, you are running some kind of JVM binary (i.e. a Java/Scala utility or custom program), so you can get away with setting your own JVM quotas (especially -Xmx) so that you always stay under the YARN limit. That means some wasted RAM because of the safety margin, but then the worst case is a clean failure of the JVM when it runs out of memory: you get the execution logs in extenso and can start adjusting the quotas -- or fixing your memory leaks :-/
So what happens in your specific case? You are using Oozie to start a shell -- then the shell starts a hadoop command, which runs in a JVM. It is on that embedded JVM that you must set the Max Heap Size.
Long story short: if you allocate 32GB to the YARN container that runs your shell (via oozie.launcher.mapreduce.map.memory.mb) then you must ensure that the Java commands inside the shell do not consume more than, say, 28GB of Heap (to stay on the safe side).
If you are lucky, setting a single env variable will do the trick:
export HADOOP_OPTS=-Xmx28G
hadoop distcp ...........
If you are not lucky, you will have to unwrap the whole mess of hadoop-env.sh mixing different env variables with different settings (set by people that visibly hate you, in init scripts that you cannot even know about) to be interpreted by the JVM using complex precedence rules. Have fun. You may peek at that very old post for hints about where to dig.
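To make this concrete, here is a rough sketch of what the shell action behind Algo 1 could look like with an explicit driver heap. It is an illustration only: subdirs.txt, the buildList helper, MAX_PROC and the 2 GB figure are assumptions, not the actual script. The point is that every backgrounded hadoop distcp starts its own driver JVM inside the same Oozie launcher container, so the heaps of all drivers running in parallel (plus their native overhead) have to fit together under the oozie.launcher.mapreduce.map.memory.mb limit.
#!/usr/bin/env bash
# Illustrative only: cap each driver so that MAX_PROC * heap stays under the container limit
export HADOOP_OPTS="-Xmx2g"
MAX_PROC=30
while read -r subdir; do
  files=$(buildList "$subdir")      # space-separated source paths, built by the user's own helper
  hadoop distcp -i -pct -update $files "/hdfs/dest/$subdir" &
  # keep at most MAX_PROC copies in flight
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_PROC" ]; do sleep 5; done
done < subdirs.txt
wait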

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120 GB RAM in total); the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (thus requiring repartitioning) and each row has around 100 KB of data after serialization. The job always gets stuck at repartitioning. Namely, the job constantly gets the following errors and retries:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer
org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...
I've tried to identify the problem, but it seems like both memory and disk consumption of the machines throwing these errors are below 50%. I've also tried different configurations, including:
let the driver/executor memory use 60% of total memory.
let Netty prioritize the JVM shuffling buffer.
increase the shuffling streaming buffer to 128m.
use KryoSerializer and max out all buffers.
increase the shuffle memoryFraction to 0.4.
But none of them works. The small job always triggers the same series of errors and maxes out the retries (up to 1000 times). How can I troubleshoot this in such a situation?
Thanks a lot if you have any clue.
Check your log to see if you get an error similar to this:
ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated
Every time you get this error, it is because you lost an executor. As for why you lost the executor, that is another story; again, check your log for clues.
For one thing, YARN can kill your job if it thinks you are using "too much memory".
Check for something like this:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.
Also see: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
The current state of the art is to increase
spark.yarn.executor.memoryOverhead until the job stops failing. We do have
plans to try to automatically scale this based on the amount of memory
requested, but it will still just be a heuristic.
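For reference, a sketch of how that setting is typically passed on the command line; the 2048 MB value and the job file are placeholders, not a tuned recommendation, and in later Spark versions the property was renamed spark.executor.memoryOverhead:
spark-submit --master yarn --conf spark.yarn.executor.memoryOverhead=2048 my_job.py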
I was also getting the error
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
and looking further in the log I found
Container killed on request. Exit code is 143
After searching for the exit code, I realized that it is mainly related to memory allocation. So I checked the amount of memory I had configured for the executors. I found that by mistake I had configured 7g for the driver and only 1g for the executor. After increasing the executor memory, my Spark job ran successfully.
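For illustration only (the sizes and the job file are placeholders, not the poster's real settings), that kind of rebalancing is done with the standard spark-submit flags:
spark-submit --driver-memory 2g --executor-memory 6g my_job.py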
It seems that the changeQueue operation I used may have caused this problem: the server was changed after I changed the queue.
