Aggregate Resource Allocation for a job in YARN - hadoop

I am new to Hadoop. When I run a job, I see the aggregate resource allocation for that job as 251248654 MB-seconds and 24462 vcore-seconds. However, when I look at the details of the cluster, it shows there are 888 VCores-Total and 15.90 TB Memory-Total. Can anyone tell me how these are related? What do MB-seconds and vcore-seconds refer to for a job?
Is there any material online about these? I tried searching but didn't find a proper answer.

VCores-Total: Indicates the total number of VCores available in the cluster.
Memory-Total: Indicates the total memory available in the cluster.
For example, I have a single-node cluster where I have set the memory requirement per container to 1228 MB (determined by the config yarn.scheduler.minimum-allocation-mb) and the vCores per container to 1 vCore (determined by the config yarn.scheduler.minimum-allocation-vcores).
I have set yarn.nodemanager.resource.memory-mb to 9830 MB. So there can be a total of 8 containers per node (9830 / 1228 = 8).
So, for my cluster:
VCores-Total = 1 (node) * 8 (containers) * 1 (vCore per container) = 8
Memory-Total = 1 (node) * 8 (containers) * 1228 MB (memory per container) = 9824 MB = 9.59375 GB = 9.6 GB
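For reference, the same arithmetic as a small Scala snippet (using the values from the example above; runnable in a Scala REPL):
val nodes = 1
val containerMemoryMb = 1228      // yarn.scheduler.minimum-allocation-mb
val containerVcores = 1           // yarn.scheduler.minimum-allocation-vcores
val nodeMemoryMb = 9830           // yarn.nodemanager.resource.memory-mb

val containersPerNode = nodeMemoryMb / containerMemoryMb             // 9830 / 1228 = 8
val vcoresTotal = nodes * containersPerNode * containerVcores        // 8
val memoryTotalMb = nodes * containersPerNode * containerMemoryMb    // 9824 MB (~9.6 GB)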
Now let's see "MB-seconds" and "vcore-seconds".
As per the description in the code (ApplicationResourceUsageReport.java):
MB-seconds: The aggregated amount of memory (in megabytes) the application has allocated times the number of seconds the application has been running.
vcore-seconds: The aggregated number of vcores that the application has allocated times the number of seconds the application has been running.
The description is self-explanatory (remember the keyword: Aggregated).
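To make the keyword "aggregated" concrete, here is a small sketch (the container list is hypothetical): each container contributes its allocated memory (or vcores) multiplied by the number of seconds it ran, and the per-container products are summed.
// Hypothetical containers: (allocated memory in MB, allocated vcores, seconds the container ran)
val containers = Seq((1228, 1, 300), (1228, 1, 340), (2456, 2, 120))

val mbSeconds = containers.map { case (mb, _, secs) => mb.toLong * secs }.sum            // aggregated MB-seconds
val vcoreSeconds = containers.map { case (_, vcores, secs) => vcores.toLong * secs }.sum  // aggregated vcore-seconds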
Let me explain this with an example.
I ran a DistCp job (which spawned 25 containers), for which I got the following:
Aggregate Resource Allocation: 10361661 MB-seconds, 8424 vcore-seconds
Now, let's do a rough calculation of how much time each container took:
For memory:
10361661 MB-seconds / 25 (containers) / 1228 MB (memory per container) = 337.51 seconds = 5.62 minutes
For CPU:
8424 vcore-seconds / 25 (containers) / 1 (vCore per container) = 336.96 seconds = 5.62 minutes
This indicates that, on average, each container took about 5.6 minutes to execute.
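The same rough calculation as a snippet (numbers from the DistCp job above):
val aggregateMbSeconds = 10361661L
val aggregateVcoreSeconds = 8424L
val numContainers = 25

val avgSecondsFromMemory = aggregateMbSeconds.toDouble / numContainers / 1228   // ~337.5 s (~5.6 min)
val avgSecondsFromVcores = aggregateVcoreSeconds.toDouble / numContainers / 1   // ~337.0 s (~5.6 min)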
I hope this makes it clear. You can execute a job and confirm it yourself.

Related

Hadoop YARN - Job consuming more resources than user limit factor

I've got a default queue in YARN with the following configuration:
Capacity: 70%
Max Capacity: 90%
User Limit Factor 0.08
Minimum User Limit 8%
Maximum Applications Inherited
Maximum AM Resource Inherited
Priority 0
Ordering Policy Fair
Maximum Application Lifetime -1
Default Application Lifetime -1
Enable Size Based Weight Ordering Disabled
Maximum Allocation Vcores Inherited
Maximum Allocation Mb Inherited
But even with a User Limit Factor of 0.08 and a Minimum User Limit of 8%, there are jobs running with more than 8% of the queue's resources.
How is this even possible? Is the User Limit Factor/Minimum User Limit not working? Is there any other configuration I should be aware of?
A minimum user limit of 8% means that if 12 users arrive at the same time, they will each get 8% of the resources, so as to give them an equal share.
However, in your case, the timing of the job starts suggests that the tasks submitted first have more cluster resources available, since the jobs are not coming in at the same time. A good understanding of these parameters can be gained from here and here.
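As a rough illustration of the "minimum user limit" part, here is a simplified sketch of the intuition described above (not the Capacity Scheduler's actual algorithm):
// Simplified: with N users active at the same time, each active user is guaranteed
// at least the larger of an even split and the minimum user limit percent.
val minUserLimitPercent = 8.0

def guaranteedSharePercent(activeUsers: Int): Double =
  math.max(100.0 / activeUsers, minUserLimitPercent)

guaranteedSharePercent(2)    // 50.0 -> two concurrent users split the queue evenly
guaranteedSharePercent(12)   // ~8.3 -> with 12 concurrent users each gets roughly the 8% minimum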

Why Spark application runs much slower with lower MaxGCPauseMillis?

I am testing Spark-1.5.1 with different G1 configurations and observe that my application takes 2 min to complete with MaxGCPauseMillis = 200 (the default) and 4 min with MaxGCPauseMillis = 1. The heap usage is depicted below. We can see from the statistics below that the GC times of the two configurations differ by only about 5 seconds.
I am wondering why execution time increases this much?
Some statistics:
MaxGCPauseMillis = 200 - No. of young GCs: 67; GC time of an executor: 9.8 sec
MaxGCPauseMillis = 1 - No. of young GCs: 224; GC time of an executor: 14.7 sec
In the heap-usage plot, the red area is the young generation and the black area is the old generation. The application runs on 10 nodes with 1 executor and a 6 GB heap each.
The application is a Word Count example:
val SPACE = java.util.regex.Pattern.compile(" ")   // word delimiter (assumed here; defined elsewhere in the original code)
val lines = sc.textFile(args(0), 1)                // read the input file as an RDD of lines
val words = lines.flatMap(l => SPACE.split(l))     // split each line into words
val ones = words.map(w => (w, 1))                  // pair each word with a count of 1
val counts = ones.reduceByKey(_ + _)               // sum the counts per word
//val output = counts.collect()
//output.foreach(t => println(t._1 + ": " + t._2))
counts.saveAsTextFile(args(1))                     // write the word counts to the output path
MaxGCPauseMillis is a hint to the JVM that the overall pause times caused by GC should not exceed the specified value (in milliseconds). The recommended value is 200 milliseconds for most production-grade systems.
Anything lower may force the GC to run more often than necessary and hurt the overall throughput of the application, which is exactly what is happening in your case.
The number of young GCs is 67 with MaxGCPauseMillis=200, and more than 3 times as many (224) with MaxGCPauseMillis=1.
Refer here for more detailed explanations.
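For reference, a minimal sketch of how these G1 options can be passed to Spark executors via spark.executor.extraJavaOptions (the value shown is the recommended default, not the one from the question):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")
  // G1 with the default pause-time goal of 200 ms; lowering MaxGCPauseMillis trades throughput for latency
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:MaxGCPauseMillis=200")
val sc = new SparkContext(conf)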
Your intuition is wrong. Theoretically, for a given heap size, throughput and latency (hinted at via MaxGCPauseMillis in this case) trade off against each other. So when you lower MaxGCPauseMillis, and hence the latency target, your throughput goes down too.

Cassandra Reading Benchmark with Spark

I'm benchmarking Cassandra's read performance. In the test setup I created a cluster with 1 / 2 / 4 EC2 instances as data nodes. I wrote one table with 100 million entries (a ~3 GB CSV file). Then I launch a Spark application which reads the data into an RDD using the spark-cassandra-connector.
I expected the following behavior: the more instances Cassandra uses (with the same number of Spark instances), the faster the reads. The writes behave as expected (roughly 2 times faster when the cluster is twice as large).
But: in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!
My Benchmark Results:
Cluster-size 4: Write: 1750 seconds / Read: 360 seconds
Cluster-size 2: Write: 3446 seconds / Read: 420 seconds
Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
ADDITIONAL TEST - WITH THE CASSANDRA-STRESS TOOL
I launched the "cassandra-stress" tool on the Cassandra cluster (sizes 1 / 2 / 3 / 4 nodes), with the following results:
Cluster size  Threads  Ops/sec  Time
1             4        10146    30.1
1             8        15612    30.1
1             16       20037    30.2
1             24       24483    30.2
1             121      43403    30.5
1             913      50933    31.7
2             4        8588     30.1
2             8        15849    30.1
2             16       24221    30.2
2             24       29031    30.2
2             121      59151    30.5
2             913      73342    31.8
3             4        7984     30.1
3             8        15263    30.1
3             16       25649    30.2
3             24       31110    30.2
3             121      58739    30.6
3             913      75867    31.8
4             4        7463     30.1
4             8        14515    30.1
4             16       25783    30.3
4             24       31128    31.1
4             121      62663    30.9
4             913      80656    32.4
Results: With 4 or 8 threads the single-node cluster is as fast as or faster than the larger clusters!
The results as a diagram: the data sets are the cluster sizes (1/2/3/4), the x-axis is the number of threads, and the y-axis the ops/sec.
--> My question here: are these results cluster-wide results, or is this a test of a single local node (and thus the result of only one instance of the ring)?
Can someone give an explanation? Thank you!
I ran a similar test with a spark worker running on each Cassandra node.
Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a spark job to create an RDD from the table with each row as a string, and then printed a count of the number of rows.
Here are the times I got:
1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds
So it seems to scale pretty well with the number of nodes when the spark workers are co-located with the C* nodes.
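For reference, a minimal sketch of the kind of job described above, using the spark-cassandra-connector (keyspace and table names are placeholders):
import com.datastax.spark.connector._   // adds cassandraTable() to the SparkContext

// Read the table into an RDD of CassandraRow objects and count the rows.
val rows = sc.cassandraTable("my_keyspace", "my_table")
println(rows.count())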
By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow, and in your environment it is perhaps the bottleneck. If you co-locate them, you benefit from data locality, since Spark will create the RDD partitions from the tokens that are local to each machine.
You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.

Elastic Search and Logstash Performance Tuning

On a single-node Elasticsearch setup together with Logstash, we tested parsing 20 MB and 200 MB log files into Elasticsearch on different types of AWS instances, i.e. medium, large, and xlarge.
Environment details: medium instance, 3.75 GB RAM, 1 core, storage: 4 GB SSD, 64-bit, network performance: moderate
Instance running: Logstash, Elasticsearch
Scenario 1
**With default settings**
Result:
20 MB log file: 23 mins, 175 events/second
200 MB log file: 3 hrs 3 mins, 175 events/second
Added the following settings:
Java heap size: 2 GB
bootstrap.mlockall: true
indices.fielddata.cache.size: "30%"
indices.cache.filter.size: "30%"
index.translog.flush_threshold_ops: 50000
indices.memory.index_buffer_size: 50%
# Search thread pool
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100
**With added settings**
Result:
20 MB log file: 22 mins, 180 events/second
200 MB log file: 3 hrs 7 mins, 180 events/second
Scenario 2
Environment details: R3 large instance, 15.25 GB RAM, 2 cores, storage: 32 GB SSD, 64-bit, network performance: moderate
Instance running: Logstash, Elasticsearch
**With default settings**
Result:
20 MB log file: 7 mins, 750 events/second
200 MB log file: 65 mins, 800 events/second
Added the following settings:
Java heap size: 7 GB
Other parameters same as above
**With added settings**
Result:
20 MB log file: 7 mins, 800 events/second
200 MB log file: 55 mins, 800 events/second
Scenario 3
Environment details: R3 high-memory extra large (r3.xlarge), 30.5 GB RAM, 4 cores, storage: 32 GB SSD, 64-bit, network performance: moderate
Instance running: Logstash, Elasticsearch
**With default settings**
Result:
20 MB log file: 7 mins, 1200 events/second
200 MB log file: 34 mins, 1200 events/second
Added the following settings:
Java heap size: 15 GB
Other parameters same as above
**With added settings**
Result:
20 MB log file: 7 mins, 1200 events/second
200 MB log file: 34 mins, 1200 events/second
I wanted to know:
What is the benchmark for the performance?
Does the performance meet the benchmark, or is it below the benchmark?
Why, even after I increased the Elasticsearch JVM heap, am I not able to see a difference?
How do I monitor Logstash and improve its performance?
I'd appreciate any help on this, as I am new to Logstash and Elasticsearch.
I think this situation is related to the fact that Logstash uses fixed size queues (The Logstash event processing pipeline)
Logstash sets the size of each queue to 20. This means a maximum of 20 events can be pending for the next stage. The small queue sizes mean that Logstash simply blocks and stalls safely when there’s a heavy load or temporary pipeline problems. The alternatives would be to either have an unlimited queue or drop messages when there’s a problem. An unlimited queue can grow unbounded and eventually exceed memory, causing a crash that loses all of the queued messages.
I think what you should try is to increase the worker count with the '-w' flag.
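For example, something like the following (the config file name is a placeholder):
bin/logstash -f logstash.conf -w 4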
On the other hand, many people say that Logstash should be scaled horizontally rather than by adding more cores and GB of RAM (How to improve Logstash performance).
You have given the Java heap size correctly with respect to your total memory, but I think you are not utilizing it properly. I hope you have an idea of what the fielddata size is; the default is 60% of the heap size, and you are reducing it to 30%.
I don't know why you are doing this; my perception might be wrong for your use case, but it is a good habit to allocate indices.fielddata.cache.size: "70%" or even 75%. With this setting, you must also set something like indices.breaker.total.limit: "80%" to avoid OutOfMemory (OOM) exceptions. You can check the documentation on Limiting Memory Usage for further details.
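In elasticsearch.yml that would look something like this (values from the suggestion above; adjust them to your workload):
indices.fielddata.cache.size: "70%"
indices.breaker.total.limit: "80%"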

UDF optimization in Hadoop

I am testing my UDF on a Windows virtual machine with 8 cores and 8 GB RAM. I created 5 files of about 2 GB each and ran the Pig script after modifying "mapred.tasktracker.map.tasks.maximum".
The runtimes and statistics are as follows:
mapred.tasktracker.map.tasks.maximum = 2
duration = 20 min 54 sec
mapred.tasktracker.map.tasks.maximum = 4
duration = 13 min 38 sec, about 30 sec per task
35% better
mapred.tasktracker.map.tasks.maximum = 8
duration = 12 min 44 sec, about 1 min per task
only 7% better
Why such a small improvement from changing the setting? Any ideas? The job was divided into 145 tasks.
A couple of observations:
I imagine your Windows machine only has a single disk backing this VM, so there is a limit to how much data you can read off disk at any one time (and write back for the spills). By increasing the task slots, you're effectively driving up the read/write demands on your disk (and potentially causing more disk thrashing). If you have multiple disks backing your VM (not virtual disks all on the same physical disk, but virtual disks backed by different physical disks), you would probably see a performance increase over what you've already seen.
By adding more map slots, you've reduced the number of assignment waves that the JobTracker needs to do, and each wave has a polling overhead (the TT polling the jobs, the JT polling the TTs and assigning new tasks to free slots). A 2-slot TT vs an 8-slot TT means you have 145/2 = ~73 assignment waves (if all tasks ran in equal time - obviously not realistic) vs 145/8 = ~19 waves - that's roughly a 4x increase in the amount of polling needed (and it all adds up).
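Roughly, with the 145 tasks from your job:
val tasks = 145
def assignmentWaves(mapSlots: Int): Int = math.ceil(tasks.toDouble / mapSlots).toInt

assignmentWaves(2)   // 73 waves with 2 map slots
assignmentWaves(8)   // 19 waves with 8 map slots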
mapred.tasktracker.map.tasks.maximum configures the maximum number of map tasks that will be run simultaneously by a task tracker. There is a practical hardware limit to how many tasks a single node can run at a time. So there will be diminishing returns when you keep increasing this number.
For example, say the TaskTracker node has 8 cores and 4 of them are being used by processes other than the TaskTracker. That leaves 4 cores for the MapReduce tasks. So your task time will improve as you go from mapred.tasktracker.map.tasks.maximum = 1 to 4, but beyond that it will just remain static because the other tasks will simply be waiting. In fact, if you increase it too much, the contention and context switching might make it slower. The recommended value for this parameter is the number of CPU cores - 1.
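For instance, a quick way to check how many cores the JVM sees on a node (the recommendation above would then be this value minus 1):
val cores = Runtime.getRuntime.availableProcessors()   // cores visible to the JVM on this node
val recommendedMapSlots = cores - 1                    // suggested mapred.tasktracker.map.tasks.maximum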
