Hadoop Terasort unstable benchmark results

I have a Cloudera Hadoop cluster and I'm running Terasort benchmarks, but the results are very unstable, ranging from 105 to 150 minutes. Sometimes I've seen the cluster replicating more than usual or doing a lot of garbage collection, but at other times the runs looked pretty much the same.
I don't know the reason for the unstable results; any hint or recommendation would be very welcome :)
I run the benchmarks as follows:
I've chosen the number of map and reduce tasks following this guide: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Speculative execution is off for both map and reduce tasks.
Generating dataset:
10,000,000,000 rows of 100 bytes ~= 953674 M
Block size = 128 MB
Number of map tasks = 3725, i.e. (number-of-rows * row-size) / (block-size * 2). I multiply the block size by 2 because the map tasks were otherwise too short, around 7 seconds each.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -Ddfs.replication=3 -Dmapred.map.tasks=3725 10000000000 /terasort-in
Running terasort:
num-of-worker-nodes = 4
num-of-cores-per-node = 8
Reduce tasks = 56 ( 1.75 * num-of-worker-nodes * num-of-cores-per-node )
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort -Ddfs.replication=1 -Dmapred.reduce.tasks=56 /terasort-in /terasort-out
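The task counts above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation described in the question; the function names are made up for illustration and are not part of Hadoop.

```python
# Sketch of the sizing arithmetic from the question (helper names are hypothetical).
MB = 1024 * 1024

def dataset_mb(rows, row_bytes):
    """Size of the generated dataset in megabytes."""
    return rows * row_bytes / MB

def map_tasks(rows, row_bytes, block_mb, blocks_per_map=2):
    """One map per `blocks_per_map` HDFS blocks, rounded down."""
    return int(dataset_mb(rows, row_bytes) // (block_mb * blocks_per_map))

def reduce_tasks(worker_nodes, cores_per_node, factor=1.75):
    """Reduce count following the 1.75 * nodes * cores rule used above."""
    return int(factor * worker_nodes * cores_per_node)

print(int(dataset_mb(10_000_000_000, 100)))   # dataset size in MB
print(map_tasks(10_000_000_000, 100, 128))    # value passed to mapred.map.tasks
print(reduce_tasks(4, 8))                     # value passed to mapred.reduce.tasks
```

With the numbers from the question this gives 953674 MB, 3725 maps and 56 reduces, matching the values on the command lines.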
The service and role distribution among nodes is as follows:
6 Nodes - 8 cores, 16 GB RAM and 2 HD each - running just HDFS and MapReduce:
1st node, just master roles:
Namenode.
Cloudera management services.
2nd node, just master roles:
JobTracker.
SecondaryNamenode.
3rd to 6th nodes, just worker roles:
TaskTracker.
Datanode.
I use the 2nd node as the client because it is the one with the lowest load.
Please tell me if you need any configuration property value or detail.
Update: After Chris White's answer I've tried to reduce the number of pollings between the jobtracker and tasktrackers by having just 1 worker and very few maps and reduces, now the benchmarks are pretty stable :)

There are many factors that you need to take into consideration when looking at performance:
This could be a polling problem combined with the small number of processing slots you have available.
The Task Trackers poll the running tasks periodically to determine if they have finished, and the Job Tracker also polls the Task Trackers. With your ~3700 map tasks (if I've read your question correctly), if there was, say, a ~1 second difference in polling times per task, then this could account for the ~hour of difference you are seeing in the timings.
If you had a larger cluster with more processing slots, I imagine this number would become more stable, but no MR job will ever have a constant running time; there are too many polling and other external timings (JVM start-up time, for example) that can affect the overall runtime.
What do the data locality counters say for both jobs? If one job had considerably more data-local tasks than the other, then I would expect it to run faster too.
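To put a rough number on the polling argument, here is a back-of-the-envelope sketch. It assumes a fixed extra delay per task, and the `concurrent_slots` parameter is my own addition to show how the estimate shrinks when tasks run in parallel; neither is a measured value.

```python
# Back-of-the-envelope estimate of accumulated polling delay (all inputs are assumptions).
def extra_minutes(num_tasks, delay_seconds_per_task, concurrent_slots=1):
    """Total extra wall-clock minutes if every task is delayed by the same amount.

    concurrent_slots=1 models the worst case where the delays add up serially.
    """
    return num_tasks * delay_seconds_per_task / concurrent_slots / 60

print(extra_minutes(3725, 1.0))        # serial worst case, roughly an hour
print(extra_minutes(3725, 1.0, 32))    # hypothetical 32 slots running in parallel
```

The serial case is the one that gets close to an hour; with many slots running at once the same per-task delay adds far less, so treat the hour as an upper bound.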

Related

Spark partition on nodes foreachpartition

I have a Spark cluster (Dataproc) with a master and 4 workers (2 preemptible). In my code I have something like this:
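// myArray holds MyData objects, so the RDD is typed accordingly
JavaRDD<MyData> rdd_data = javaSparkContext.parallelize(myArray);
rdd_data.foreachPartition(partitionOfRecords -> {
    // Runs once per partition, on whichever executor owns it
    while (partitionOfRecords.hasNext()) {
        MyData d = partitionOfRecords.next();
        LOG.info("my data: " + d.getId().toString());
    }
});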
myArray is composed of 1200 MyData objects.
I don't understand why Spark uses only 2 cores and divides my array into 2 partitions instead of using all 16 cores.
Do I need to set the number of partitions?
Thanks in advance for any help.
Generally it's always a good idea to specify the number of partitions as the second argument to parallelize, since the optimal slicing of your dataset should really be independent from the particular shape of the cluster you're using, and Spark can at best use current sizes of executors as a "hint".
What you're seeing here is that Spark will default to asking taskScheduler for current number of executor cores to use as the defaultParallelism, combined with the fact that in Dataproc Spark dynamic allocation is enabled. Dynamic allocation is important because otherwise a single job submitted to a cluster might just specify max executors even if it sits idle and then it will prevent other jobs from being able to use those idle resources.
So on Dataproc, if you're using default n1-standard-4, Dataproc configures 2 executors per machine and gives each executor 2 cores. The value of spark.dynamicAllocation.minExecutors should be 1, so your default job, upon startup without doing any work, would sit on 1 executor with 2 cores. Then taskScheduler will report that 2 cores are currently reserved in total, and therefore defaultParallelism will be 2.
If you had a large cluster and you were already running a job for a while (say, you have a map phase that runs for longer than 60 seconds), you'd expect dynamic allocation to have taken all available resources, so for the next step of the job defaultParallelism would then presumably be 16, which is the total cores on your cluster (or possibly 14, if 2 are consumed by an appmaster).
In practice, you probably want to parallelize into a larger number of partitions than total cores available anyways. Then if there's any skew in how long each element takes to process, you can have nice balancing where fast tasks finish and then those executors can start taking on new partitions while the slow ones are still running, instead of always having to wait for a single slowest partition to finish. It's common to choose a number of partitions anywhere from 2x the number of available cores to something 100x or more.
Here's another related StackOverflow question: spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit
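To see what the partition count does to the 1200-element array, here is a small sketch of even, contiguous slicing. It mirrors how I understand parallelize to cut a collection into slices, which is an assumption on my part; the helper names are made up.

```python
# Sketch: how a collection of `length` items is cut into `num_slices` contiguous slices.
def slice_bounds(length, num_slices):
    """(start, end) index pairs for each slice, using integer division."""
    return [((i * length) // num_slices, ((i + 1) * length) // num_slices)
            for i in range(num_slices)]

def partition_sizes(length, num_slices):
    return [end - start for start, end in slice_bounds(length, num_slices)]

print(partition_sizes(1200, 2))    # what the question observed: two large partitions
print(partition_sizes(1200, 32))   # 2x the 16 cores: many small partitions
```

With 2 slices each partition holds 600 elements, so at most 2 cores can ever be busy. Passing an explicit count as the second argument, e.g. javaSparkContext.parallelize(myArray, 32), gives every core several small partitions to pick up.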

Why some worker nodes cost more CPU for system during running Spark application?

I have 1 master node and 4 worker nodes. I set up the cluster using Ambari, and all monitoring metrics are collected from its dashboard. Spark runs on top of Hadoop, so I have YARN and HDFS. I ran a very simple word count script and found that one of the worker nodes did most of the work. The word count job is divided into 149 tasks, and 98 of them were done by one node.
Here is my code for counting words
val file = sc.textFile("/data/2gdata.txt") //read file from HDFS
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect()
This picture illustrates the event timeline and CPU usage for each worker node
Aggregated Metrics by Executor are shown here
Each task has the same input size. I assumed they would take a similar time, around 30 seconds, to count the words in their piece of the input file. Some tasks took more than 10 minutes.
I noticed that the nodes doing less work spend more CPU on system operations, as shown by the blue area in the first graph. The worker that did more tasks spent more CPU on user (application) work.
I am wondering what kinds of system operations a Spark application requires. Why do three of the worker nodes spend more CPU on system operations? I also enabled spark.speculation, but the stragglers are only killed after 10 minutes and performance didn't get better. Moreover, those stragglers are node_local, so I assume this issue is not related to HDFS replication. (There are 3 replicas within the rack.)
Thank you very much.
Even if the input size is the same for each task, during the shuffle and reduce phase some tasks might process more data than others; data skew may cause higher CPU costs.
Repartitioning the data in between may improve the performance.
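As a toy illustration of skew, the sketch below assigns words to partitions by hash and counts the raw records that land in each. The word list is made up, and crc32 is only a deterministic stand-in, not Spark's own partitioner. Note also that reduceByKey combines values on the map side first, so the real imbalance in a word count may be smaller than this raw count suggests.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (not Spark's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def records_per_partition(keys, num_partitions):
    """Number of raw records routed to each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Made-up input: one very frequent word plus 1000 rare ones.
words = ["the"] * 9000 + ["word%d" % i for i in range(1000)]
sizes = records_per_partition(words, 4)
print(sizes)   # one partition receives every occurrence of the frequent word
```

Whichever partition owns the frequent key ends up with at least 9000 of the 10000 records, no matter how evenly the input file itself was split.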

Hadoop Running reducers in parallel

I have a 4 GB file with ~16 million lines. The maps run distributed, 6 in parallel out of 15 in total, and the job generates 35,000 keys. I am using MultipleTextoutput, so each reducer generates an output independent of the other reducers.
I have configured the conf with 25-50 reducers, but it always runs 1 reducer at a time.
Machine: a single machine with 4 cores and 32 GB RAM, running the Hortonworks stack.
How do I get more than 1 reduce task to run in parallel?
Have a look at the Hadoop MapReduce Tutorial:
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Have a look at related SE questions:
How hadoop decides how many nodes will do map and reduce tasks
What is Ideal number of reducers on Hadoop?
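The 0.95 and 1.75 factors quoted above are easier to picture as "waves" of reducers. A minimal sketch, assuming a hypothetical cluster of 4 nodes with 8 containers each (these numbers are not from the question):

```python
import math

def reduce_plan(nodes, containers_per_node, factor):
    """Return (number of reduces, number of waves needed to run them)."""
    slots = nodes * containers_per_node      # reduces that can run at once
    reduces = int(factor * slots)
    waves = math.ceil(reduces / slots)
    return reduces, waves

print(reduce_plan(4, 8, 0.95))   # everything fits in a single wave
print(reduce_plan(4, 8, 1.75))   # faster nodes pick up a second wave
```

On a single 4-core machine the number of slots is small, so the number of reducers that actually run in parallel is bounded by how many containers fit at once rather than by the configured reducer count.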
After specifying a lower reducer memory of 2 GB (the default in mapred-site.xml was 6 GB), the framework brings up 3 reducers in parallel rather than 1.

Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN.
The test environment is as follows:
Number of data nodes: 3
Data node machine spec:
CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
RAM: 32GB (8GB x 4)
HDD: 8TB (2TB x 4)
Network: 1Gb
Spark version: 1.0.0
Hadoop version: 2.4.0 (Hortonworks HDP 2.1)
Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair -> reduceByKey -> map -> saveAsTextFile
Input data
Type: single text file
Size: 165GB
Number of lines: 454,568,833
Output
Number of lines after second filter: 310,640,717
Number of lines of the result file: 99,848,268
Size of the result file: 41GB
The job was run with following configurations:
(1) --master yarn-client --executor-memory 19G --executor-cores 7 --num-executors 3 (one executor per data node, using as many cores as possible)
(2) --master yarn-client --executor-memory 19G --executor-cores 4 --num-executors 3 (# of cores reduced)
(3) --master yarn-client --executor-memory 4G --executor-cores 2 --num-executors 12 (fewer cores, more executors)
Elapsed times:
(1) 50 min 15 sec
(2) 55 min 48 sec
(3) 31 min 23 sec
To my surprise, (3) was much faster.
I thought that (1) would be faster, since there would be less inter-executor communication when shuffling.
Although the total number of cores in (1) is lower than in (3), the number of cores does not seem to be the key factor, since (2) performed nearly as well as (1) with fewer cores.
(The following was added after pwilmot's answer.)
For reference, the performance monitor screen captures are as follows:
Ganglia data node summary for (1) - job started at 04:37.
Ganglia data node summary for (3) - job started at 19:47. Please ignore the graph before that time.
The graph roughly divides into 2 sections:
First: from start to reduceByKey: CPU intensive, no network activity
Second: after reduceByKey: CPU lowers, network I/O is done.
As the graph shows, (1) can use as much CPU power as it was given. So, it might not be the problem of the number of the threads.
How to explain this result?
To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors.
--executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 – 1.47 ~ 19.
The explanation was given in an article in Cloudera's blog, How-to: Tune Your Apache Spark Jobs (Part 2).
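The arithmetic in the quoted example can be written down as a small function. This is only a sketch of that one worked example; the function name and its defaults (one core and one gigabyte reserved per node, 5 cores per executor, 7% memory overhead) are taken from the quote, not from any Spark API.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.07):
    """Return (num_executors, executor_cores, executor_memory_gb)."""
    usable_cores = cores_per_node - 1          # leave a core for OS and daemons
    usable_mem_gb = mem_gb_per_node - 1        # leave a gigabyte as well
    executors_per_node = usable_cores // cores_per_executor
    num_executors = nodes * executors_per_node - 1   # one slot goes to the AM
    mem_per_executor = usable_mem_gb / executors_per_node
    executor_memory_gb = int(mem_per_executor * (1 - overhead_fraction))
    return num_executors, cores_per_executor, executor_memory_gb

print(size_executors(nodes=6, cores_per_node=16, mem_gb_per_node=64))
```

For the six-node cluster in the quote this returns 17 executors with 5 cores and 19 GB each, i.e. --num-executors 17 --executor-cores 5 --executor-memory 19G.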
Short answer: I think tgbaggio is right. You hit HDFS throughput limits on your executors.
I think the answer here may be a little simpler than some of the recommendations here.
The clue for me is in the cluster network graph. For run 1 the utilization is steady at ~50 M bytes/s. For run 3 the steady utilization is doubled, around 100 M bytes/s.
From the cloudera blog post shared by DzOrd, you can see this important quote:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.
So, let's do a few calculations see what performance we expect if that is true.
Run 1: 19 GB, 7 cores, 3 executors
3 executors x 7 threads = 21 threads
with 7 cores per executor, we expect limited IO to HDFS (maxes out at ~5 cores)
effective throughput ~= 3 executors x 5 threads = 15 threads
Run 3: 4 GB, 2 cores, 12 executors
12 executors x 2 threads = 24 threads
2 cores per executor, so hdfs throughput is ok
effective throughput ~= 12 executors x 2 threads = 24 threads
If the job is 100% limited by concurrency (the number of threads), we would expect runtime to be perfectly inversely correlated with the number of threads.
ratio_num_threads = nthread_job1 / nthread_job3 = 15/24 = 0.625
inv_ratio_runtime = 1/(duration_job1 / duration_job3) = 1/(50/31) = 31/50 = 0.62
So ratio_num_threads ~= inv_ratio_runtime, and it looks like we are network limited.
This same effect explains the difference between Run 1 and Run 2.
Run 2: 19 GB, 4 cores, 3 executors
3 executors x 4 threads = 12 threads
with 4 cores per executor, ok IO to HDFS
effective throughput ~= 3 executors x 4 threads = 12 threads
Comparing the number of effective threads and the runtime:
ratio_num_threads = nthread_job2 / nthread_job1 = 12/15 = 0.8
inv_ratio_runtime = 1/(duration_job2 / duration_job1) = 1/(55/50) = 50/55 = 0.91
It's not as perfect as the last comparison, but we still see a similar drop in performance when we lose threads.
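The comparisons above can be recomputed in a few lines. This sketch caps each executor at 5 effective threads, following the rough guess quoted from the Cloudera post; the cap is that guess, not a measured limit. It uses the full elapsed times in seconds, so the second ratio comes out slightly lower than the 0.91 obtained from whole minutes.

```python
def effective_threads(executors, cores_per_executor, hdfs_cap=5):
    """Threads that contribute to throughput, capped per executor."""
    return executors * min(cores_per_executor, hdfs_cap)

def seconds(minutes, secs):
    return minutes * 60 + secs

# run id -> (executors, cores per executor, elapsed seconds), from the question
runs = {
    1: (3, 7, seconds(50, 15)),
    2: (3, 4, seconds(55, 48)),
    3: (12, 2, seconds(31, 23)),
}

threads = {k: effective_threads(e, c) for k, (e, c, _) in runs.items()}
elapsed = {k: t for k, (_, _, t) in runs.items()}

# Thread ratio vs inverse runtime ratio for run 1 against run 3
print(threads[1] / threads[3], elapsed[3] / elapsed[1])
# Thread ratio vs inverse runtime ratio for run 2 against run 1
print(threads[2] / threads[1], elapsed[1] / elapsed[2])
```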
Now for the last bit: why is it the case that we get better performance with more threads, esp. more threads than the number of CPUs?
A good explanation of the difference between parallelism (what we get by dividing up data onto multiple CPUs) and concurrency (what we get when we use multiple threads to do work on a single CPU) is provided in this great post by Rob Pike: Concurrency is not parallelism.
The short explanation is that if a Spark job is interacting with a file system or network the CPU spends a lot of time waiting on communication with those interfaces and not spending a lot of time actually "doing work". By giving those CPUs more than 1 task to work on at a time, they are spending less time waiting and more time working, and you see better performance.
Since you run your Spark app on top of HDFS, note what Sandy Ryza says:
I’ve noticed that the HDFS client has trouble with tons of concurrent
threads. A rough guess is that at most five tasks per executor can
achieve full write throughput, so it’s good to keep the number of
cores per executor below that number.
So I believe that your first configuration is slower than the third one because of bad HDFS I/O throughput.
From the excellent resources available at RStudio's Sparklyr package page:
SPARK DEFINITIONS:
It may be useful to provide some simple definitions for the Spark nomenclature:
Node: A server
Worker Node: A server that is part of the cluster and are available to run Spark jobs
Master Node: The server that coordinates the Worker nodes.
Executor: A sort of virtual machine inside a node. One Node can have multiple Executors.
Driver Node: The Node that initiates the Spark session. Typically, this will be the server where sparklyr is located.
Driver (Executor): The Driver Node will also show up in the Executor list.
I haven't played with these settings myself, so this is just speculation, but if we think about this issue in terms of normal cores and threads in a distributed system, then in your cluster you can use up to 12 cores (4 * 3 machines) and 24 threads (8 * 3 machines). In your first two examples you are giving your job a fair number of cores (potential computation space), but the number of threads (jobs) to run on those cores is so limited that you aren't able to use much of the processing power allocated, and thus the job is slower even though more computation resources are allocated.
You mention that your concern was the shuffle step. While it is nice to limit the overhead in the shuffle step, it is generally much more important to utilize the parallelization of the cluster. Think about the extreme case: a single-threaded program with zero shuffle.
I think one of the major reasons is locality. Your input file is 165 GB, so its blocks are certainly distributed over multiple DataNodes, and more executors can avoid network copies.
Try setting the number of executors equal to the block count; I think it could be faster.
Spark dynamic allocation gives flexibility and allocates resources dynamically. With it, the minimum and maximum number of executors can be specified, as well as the number of executors to launch when the application starts.
Read more about it here:
http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
I think there is a small issue in the first two configurations, and it comes down to the concepts of threads and cores. The idea of threading is that if a core is idle, it is used to process data. So the memory is not fully utilized in the first two cases. If you want to benchmark this example, choose machines that have more than 10 cores each, and then run the benchmark.
But don't give more than 5 cores per executor, or there will be a bottleneck in I/O performance.
So the best machines for this benchmark might be data nodes that have 10 cores.
Data node machine spec:
CPU: Core i7-4790 (# of cores: 10, # of threads: 20)
RAM: 32GB (8GB x 4)
HDD: 8TB (2TB x 4)
In configuration (2) you're reducing the number of parallel tasks, so I believe your comparison isn't fair.
Set --num-executors to at least 5.
That way you will have 20 tasks running, compared to 21 tasks in configuration (1).
Then the comparison will be fair, in my opinion.
Also, please calculate the executor memory accordingly.

hadoop cassandra cpu utilization

Summary: How can I get Hadoop to use more CPUs concurrently on my server?
I'm running Cassandra and Hadoop on a single high-end server with 64GB RAM, SSDs, and 16 CPU cores. The input to my mapreduce job has 50M rows. During the map phase, Hadoop creates seven mappers. Six of those complete very quickly, and the seventh runs for two hours to complete the map phase. I've suggested more mappers like this ...
job.getConfiguration().set("mapred.map.tasks", "12");
but Hadoop continues to create only seven. I'd like to get more mappers running in parallel to take better advantage of the 16 cores in the server. Can someone explain how Hadoop decides how many mappers to create?
I have a similar concern during the reduce phase. I tell Hadoop to create 12 reducers like this ...
job.setNumReduceTasks(12);
Hadoop does create 12 reducers, but 11 complete quickly and the last one runs for hours. My job has 300K keys, so I don't imagine they're all being routed to the same reducer.
Thanks.
The number of map tasks depends on your input data.
For example:
If your data source is HBase, the number is the number of regions of your data.
If your data source is a file, the number of maps is your file size divided by the block size (64 MB or 128 MB).
You cannot specify the number of maps in code.
The problem of 6 fast mappers and 1 slow one arises because the data is unbalanced. I have not used Cassandra before, so I cannot tell you how to fix it.
