Why does a Spark application run much slower with a lower MaxGCPauseMillis? - performance

I am testing Spark-1.5.1 with different G1 configurations and observe that my application takes 2 min to complete with MaxGCPauseMillis = 200 (the default) and 4 min with MaxGCPauseMillis = 1. The heap usage is depicted below. We can see from the statistics below that the GC time of the two configs differs by only 5 sec.
I am wondering why the execution time increases this much?
Some statistics:
MaxGCPauseMillis = 200 - No. young GCs: 67; GC time of an executor: 9.8 sec
MaxGCPauseMillis = 1 - No. young GCs: 224; GC time of an executor: 14.7 sec
The red area is the young generation, the black area is the old generation. The application runs on 10 nodes with 1 executor and a 6 GB heap each.
The application is a Word Count example:
// SPACE is assumed to be a whitespace pattern (as in Spark's bundled WordCount examples); sc is the SparkContext
val SPACE = java.util.regex.Pattern.compile(" ")
val lines = sc.textFile(args(0), 1)
val words = lines.flatMap(l => SPACE.split(l))
val ones = words.map(w => (w, 1))
val counts = ones.reduceByKey(_ + _)
//val output = counts.collect()
//output.foreach(t => println(t._1 + ": " + t._2))
counts.saveAsTextFile(args(1))

MaxGCPauseMillis is a hint to the JVM that the overall pause times caused by GC should not exceed the specified value (in milliseconds). The recommended value is 200 milliseconds for most production-grade systems.
Anything lower may force the GC to run more often than necessary, which hurts the overall throughput of the application, and that is exactly what is happening in your case.
The number of young GCs is 67 with MaxGCPauseMillis=200, and almost 4 times as many (224) with MaxGCPauseMillis=1.
Refer here for more detailed explanations.

Your intuition is wrong. Rather, for a chosen heap size, throughput and latency (which MaxGCPauseMillis is a hint for) trade off against each other. So when you lower MaxGCPauseMillis, and hence the pause-time target, your throughput goes down too.
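For reference, a minimal sketch of how such G1 settings are typically passed to Spark executors; the 6 GB heap and the 200 ms pause target below just mirror the numbers in the question, and depending on your deploy mode you may prefer to pass these via spark-submit instead:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: forward G1 flags to every executor JVM via the standard
// spark.executor.extraJavaOptions config key.
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.executor.memory", "6g")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:MaxGCPauseMillis=200")
val sc = new SparkContext(conf)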

Related

Unable to analyse Throughput Performance of One API which has been migrated to new platform

I have been checking the performance of one API that runs on two systems. The API has been migrated to a new system, so I am comparing its performance against the old system.
Statistics as shown below:

New System:
Threads - 25
Ramp-up - ~25
Avg - 8 sec
Median - 7.8 sec
95th percentile - 8.8 sec
Throughput - 0.39

Old System:
Threads - 25
Ramp-up - ~25
Avg - 10 sec
Median - 10 sec
95th percentile - 10 sec
Throughput - 0.74
Here we can observe that the new system took less time for 25 threads than the old system, yet the throughput of the old system is higher, even though the old system took more time.
I am confused about the throughput: which system is more efficient?
The one that takes less time should have more throughput, but here the system with the lower times has the lower throughput, which confuses me. Can anyone help?
As per JMeter Glossary:
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time)
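As a minimal sketch of that formula (the request counts and durations below are made up for illustration, not taken from your results):

// JMeter-style throughput: total requests divided by the span from the start
// of the first sample to the end of the last sample.
def throughput(requests: Long, totalSeconds: Double): Double = requests / totalSeconds

// Hypothetical example: the same 1000 requests spread over different test durations.
println(throughput(1000, 1350))  // ~0.74 req/s  (shorter overall test)
println(throughput(1000, 2560))  // ~0.39 req/s  (longer overall test, despite faster averages)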
So double check the total test duration for both test executions; my expectation is that the "new system" test took longer.
With regards to the reason, I cannot state anything meaningful without seeing the full .jtl results files for both executions. I can only assume that there was one very long request in the "new system" test, or that you have a Timer with random think time somewhere in your test plan.

How is total throughput value calculated in Aggregate Report?

I discovered that in the Aggregate Report the TOTAL throughput value depends on the thread count. If we run tests with only one thread, total throughput is calculated as 1 / Total Average (multiplied by 1000 to convert milliseconds to seconds, see the screenshot below).
But when we set the thread count to 2 or more, total throughput is calculated in some unknown way. What I want to know is which formula is used to calculate total throughput in this case (thread count > 1), because it does not appear to be an average of the per-request throughputs, and it is also not 1 / Total Average as described in the first case. So how exactly does this work? (Screenshot for 2 threads attached below.)
Thanks.
Screenshot for 1 thread used:
aggregate_1_thread.png
Screenshot for 2 threads used:
aggregate_2_threads.png
As per doc:
http://jmeter.apache.org/usermanual/component_reference.html#Aggregate_Report
Throughput - the Throughput is measured in requests per second/minute/hour. The time unit is chosen so that the displayed rate is at least 1.0. When the throughput is saved to a CSV file, it is expressed in requests/second, i.e. 30.0 requests/minute is saved as 0.5.
So the result depends both on the response times and on the number of threads, which influences those response times.
The total number of requests is divided by the time taken to run them, see:
https://github.com/apache/jmeter/blob/trunk/src/core/org/apache/jmeter/visualizers/SamplingStatCalculator.java#L198
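A minimal sketch of that calculation (my own illustration of the idea, not JMeter's actual code): given each sample's start time and elapsed time, the total throughput is the number of samples divided by the span from the earliest start to the latest end.

// Each sample: (startMillis, elapsedMillis)
def totalThroughput(samples: Seq[(Long, Long)]): Double = {
  val firstStart = samples.map(_._1).min
  val lastEnd    = samples.map { case (start, elapsed) => start + elapsed }.max
  // requests per second over the whole test window
  samples.size * 1000.0 / (lastEnd - firstStart)
}

With two or more threads the individual requests overlap in time, so the test window is shorter than the sum of the response times, which is why the total is not simply 1 / Total Average.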

Prometheus - Convert cpu_user_seconds to CPU Usage %?

I'm monitoring docker containers via Prometheus.io. My problem is that I'm just getting cpu_user_seconds_total or cpu_system_seconds_total.
How to convert this ever-increasing value to a CPU percentage?
Currently I'm querying:
rate(container_cpu_user_seconds_total[30s])
But I don't think that it is quite correct (comparing to top).
How to convert cpu_user_seconds_total to CPU percentage? (Like in top)
Rate returns a per second value, so multiplying by 100 will give a percentage:
rate(container_cpu_user_seconds_total[30s]) * 100
I also found this way to get CPU Usage to be accurate:
100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)
From: http://www.robustperception.io/understanding-machine-cpu-usage/
Note that container_cpu_user_seconds_total and container_cpu_system_seconds_total are per-container counters, which show the CPU time used by a particular container in user space and in kernel space respectively (see these docs for more details). cAdvisor exposes an additional metric, container_cpu_usage_seconds_total, which equals the sum of container_cpu_user_seconds_total and container_cpu_system_seconds_total, i.e. it shows the overall CPU time used by each container. See these docs for more details.
container_cpu_usage_seconds_total is a counter, i.e. it increases over time. That by itself isn't very informative for determining CPU usage at a particular moment. Prometheus provides the rate() function, which returns the average per-second increase of a counter. For example, the following query returns the average per-second increase of the per-container container_cpu_usage_seconds_total metrics over the last 5 minutes (note the 5m lookbehind window in square brackets):
rate(container_cpu_usage_seconds_total[5m])
This is basically the average number of CPU cores used during the last 5 minutes. Just multiply it by 100 in order to get CPU usage in %. Note that the resulting value may exceed 100% if the container uses more than a single CPU core during the last 5 minutes.
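For intuition, a worked example with made-up numbers: if container_cpu_usage_seconds_total increased from 1000 to 1150 over a 5-minute (300-second) window, then rate() returns (1150 - 1000) / 300 = 0.5 CPU cores, i.e. 50% of a single core; an increase of 450 seconds over the same window would give 1.5 cores, i.e. 150%.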
The rate(container_cpu_usage_seconds_total[5m]) usually returns a TON of time series with many long labels in production Kubernetes, so it is better to use the following queries:
The average number of CPU cores used during the last 5 minutes per each pod:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
The average number of CPU cores used during the last 5 minutes per each node:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
The average number of CPU cores used during the last 5 minutes per each namespace:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
The container!="" filter removes superfluous metrics related to cgroups hierarchy - see this answer for more details.
For Windows Users - wmi_exporter
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)

UDF optimization in Hadoop

I am testing my UDF on a Windows virtual machine with 8 cores and 8 GB RAM. I created 5 files of about 2 GB each and ran the Pig script after modifying "mapred.tasktracker.map.tasks.maximum".
The runtimes and statistics are as follows:
mapred.tasktracker.map.tasks.maximum = 2
duration = 20 min 54 sec
mapred.tasktracker.map.tasks.maximum = 4
duration = 13 min 38 sec, about 30 sec per task (35% better)
mapred.tasktracker.map.tasks.maximum = 8
duration = 12 min 44 sec, about 1 min per task (only 7% better)
Why such a small improvement when changing the setting? Any ideas? The job was divided into 145 tasks.
(Screenshots: 4 slots, 8 slots)
A couple of observations:
I imagine your Windows machine only has a single disk backing this VM, so there is a limit to how much data you can read off disk at any one time (and write back for the spills). By increasing the task slots, you're effectively driving up the read/write demand on that disk (and potentially causing more disk thrashing too). If you had multiple disks backing your VM (not virtual disks all on the same physical disk, but virtual disks backed by different physical disks), you would probably see a performance increase beyond what you've already seen.
By adding more map slots, you've reduced the number of assignment waves that the JobTracker needs to do, and each wave has a polling overhead (the TT polling the jobs, the JT polling the TTs and assigning new tasks to free slots). A 2-slot TT vs an 8-slot TT means roughly 145/2 ≈ 73 assignment waves (if all tasks ran in equal time, which is obviously not realistic) vs 145/8 ≈ 19 waves: nearly 4x as much polling to be done (and it all adds up).
mapred.tasktracker.map.tasks.maximum configures the maximum number of map tasks that a TaskTracker will run simultaneously. There is a practical hardware limit to how many tasks a single node can run at a time, so there are diminishing returns as you keep increasing this number.
For example, say the TaskTracker node has 8 cores and 4 of them are being used by processes other than the TaskTracker. That leaves 4 cores for the map tasks, so your task time will improve as you go from mapred.tasktracker.map.tasks.maximum = 1 to 4, but beyond that it will just stay flat because the extra tasks will simply be waiting. In fact, if you increase it too much, contention and context switching might make things slower. The recommended value for this parameter is the number of CPU cores - 1.
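As a sketch of where this is configured (assuming Hadoop 1.x / MRv1, and assuming you want to leave roughly one core free for the daemons on an 8-core node; the value 7 is an illustration, not a measured optimum):

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
</configuration>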

Question about Little's Law

I know that Little's Law states (paraphrased):
the average number of things in a system is the product of the average rate at which things leave the system and the average time each one spends in the system,
or:
n = x * (r + z)
where:
x - throughput
r - response time
z - think time
r + z - average time per interaction (response time plus think time)
Now I have a question about a problem from Programming Pearls:
Suppose that a system makes 100 disk accesses to process a transaction (although some systems require fewer, some systems will require several hundred disk accesses per transaction). How many transactions per hour per disk can the system handle?
Assumption: a disk access takes 20 milliseconds.
Here is the solution to this problem:
Ignoring slowdown due to queuing, 20 milliseconds (of seek time) per disk operation gives 2 seconds per transaction, or 1800 transactions per hour.
I am confused because I do not understand the solution to this problem. Please help.
It will be more intuitive if you forget about that formula and think that the rate at which you can do something is inversely proportional to the time that it takes you to do it. For example, if it takes you 0.5 hour to eat a pizza, you eat pizzas at a rate of 2 pizzas per hour because 1/0.5 = 2.
In this case the rate is the number of transactions per time and the time is how long a transaction takes. According to the problem, a transaction takes 100 disk accesses, and each disk access takes 20 ms. Therefore each transaction takes 2 seconds total. The rate is then 1/2 = 0.5 transactions per second.
Now, more formally:
The rate of transactions per second, R, is the inverse of the transaction time in seconds, TT.
R = 1/TT
The transaction time TT in this case is:
TT = disk access time * number of disk accesses per transaction =
20 milliseconds * 100 = 2000 milliseconds = 2 seconds
R = 1/2 transactions per second
= 3600/2 transactions per hour
= 1800 transactions per hour
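To tie this back to Little's Law itself (a sketch, under the assumption that the disk serves one transaction's accesses at a time, so on average n = 1 transaction is in the system, and there is no think time, z = 0):
x = n / (r + z) = 1 transaction / 2 seconds = 0.5 transactions per second = 1800 transactions per hour,
which matches the solution above.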
