Hadoop Running reducers in parallel - hadoop

I have a 4 GB file with about 16 million lines. The maps run distributed, 6 in parallel out of 15 in total, and the job generates 35,000 keys. I am using MultipleTextoutput, so each reducer generates its output independently of the other reducers.
I have configured the job with 25-50 reducers, but it always runs 1 reducer at a time.
Machine: a single 4-core machine with 32 GB of RAM, running the Hortonworks stack.
How do I get more than 1 reduce task to run in parallel?

Have a look at the Hadoop MapReduce Tutorial:
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Have a look at related SE questions:
How hadoop decides how many nodes will do map and reduce tasks
What is Ideal number of reducers on Hadoop?

After lowering the reducer memory to 2 GB (the default in mapred-site.xml was 6 GB), the framework brings up 3 reducers in parallel rather than 1.
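That result is consistent with YARN fitting as many reduce containers as the memory budget allows. Here is a minimal sketch of the arithmetic, assuming a hypothetical budget of 7168 MB for containers (the thread does not state the real value) and ignoring memory taken by the application master or by running maps. The class and method names are made up for illustration.

```java
// Rough estimate of how many containers of a given size fit into a memory budget.
// The 7168 MB budget below is an assumed figure, not taken from the thread.
final class ContainerFit {
    static int concurrentContainers(int budgetMb, int containerMb) {
        if (containerMb <= 0) {
            throw new IllegalArgumentException("containerMb must be positive");
        }
        return budgetMb / containerMb; // integer division rounds down
    }

    public static void main(String[] args) {
        int budgetMb = 7168; // hypothetical YARN memory available for containers
        System.out.println(concurrentContainers(budgetMb, 6144)); // 6 GB reducers
        System.out.println(concurrentContainers(budgetMb, 2048)); // 2 GB reducers
    }
}
```

In Hadoop 2.x the reducer container size is normally set with mapreduce.reduce.memory.mb, and the per-node budget with yarn.nodemanager.resource.memory-mb, so those are the two values to compare on your own cluster.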

Related

Spark partition on nodes foreachpartition

I have a Spark cluster (Dataproc) with a master and 4 workers (2 preemptible). In my code I have something like this:
    JavaRDD<MyData> rdd_data = javaSparkContext.parallelize(myArray);
    rdd_data.foreachPartition(partitionOfRecords -> {
        // Runs once per partition, on whichever executor holds it
        while (partitionOfRecords.hasNext()) {
            MyData d = partitionOfRecords.next();
            LOG.info("my data: " + d.getId().toString());
        }
    });
myArray is composed of 1200 MyData objects.
I don't understand why Spark uses only 2 cores, dividing my array into 2 partitions, instead of using all 16 cores.
Do I need to set the number of partitions?
Thanks in advance for any help.
Generally it's a good idea to specify the number of partitions as the second argument to parallelize, since the optimal slicing of your dataset should really be independent of the particular shape of the cluster you're using, and Spark can at best use the current sizes of executors as a "hint".
What you're seeing here is that Spark will default to asking taskScheduler for current number of executor cores to use as the defaultParallelism, combined with the fact that in Dataproc Spark dynamic allocation is enabled. Dynamic allocation is important because otherwise a single job submitted to a cluster might just specify max executors even if it sits idle and then it will prevent other jobs from being able to use those idle resources.
So on Dataproc, if you're using default n1-standard-4, Dataproc configures 2 executors per machine and gives each executor 2 cores. The value of spark.dynamicAllocation.minExecutors should be 1, so your default job, upon startup without doing any work, would sit on 1 executor with 2 cores. Then taskScheduler will report that 2 cores are currently reserved in total, and therefore defaultParallelism will be 2.
If you had a large cluster and you were already running a job for a while (say, you have a map phase that runs for longer than 60 seconds), you'd expect dynamic allocation to have taken all available resources, so defaultParallelism in the next step of the job would presumably be 16, which is the total number of cores on your cluster (or possibly 14, if 2 are consumed by an appmaster).
In practice, you probably want to parallelize into a larger number of partitions than the total cores available anyway. Then if there's any skew in how long each element takes to process, you get nice balancing: fast tasks finish and their executors start taking on new partitions while the slow ones are still running, instead of always having to wait for the single slowest partition to finish. It's common to choose a number of partitions anywhere from 2x the number of available cores to 100x or more.
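The numbers in this answer can be sketched as plain arithmetic. The class below is illustrative only (its names are not Spark APIs), and the floor of 2 is an assumption that matches the default observed in the question. The partition count it suggests would then be passed explicitly, as in javaSparkContext.parallelize(myArray, numPartitions).

```java
// Illustrative arithmetic only; none of these names are part of Spark.
final class ParallelismEstimate {
    // Cores currently held by executors, with an assumed floor of 2.
    static int defaultParallelism(int executors, int coresPerExecutor) {
        return Math.max(executors * coresPerExecutor, 2);
    }

    // A partition count chosen as a multiple of the total cores.
    static int suggestedPartitions(int totalCores, int factor) {
        return totalCores * factor;
    }

    public static void main(String[] args) {
        // At startup: 1 executor with 2 cores (minExecutors = 1).
        System.out.println(defaultParallelism(1, 2));
        // Fully scaled: 4 workers x 2 executors x 2 cores.
        int totalCores = defaultParallelism(4 * 2, 2);
        System.out.println(totalCores);
        // Low end of the "2x to 100x" range mentioned above.
        System.out.println(suggestedPartitions(totalCores, 2));
    }
}
```

With 1200 elements, even 32 partitions leaves fewer than 40 elements per partition, so a larger factor mostly helps when individual elements are slow or uneven to process.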
Here's another related StackOverflow question: spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit

Shuffle phase lasts too long Hadoop

I have an MR job in which the shuffle phase lasts too long.
At first I thought it was because I was emitting a lot of data from the Mapper (around 5 GB). I addressed that by adding a Combiner, so less data is emitted to the Reducer. The shuffle period did not shorten as I thought it would.
My next idea was to eliminate the Combiner by combining in the Mapper itself. I got that idea from here, where it says that data needs to be serialized/deserialized to use a Combiner. Unfortunately the shuffle phase is still the same.
My only remaining thought is that it is because I'm using a single Reducer. But that shouldn't be the case, since I'm not emitting a lot of data when using the Combiner or combining in the Mapper.
Here are my stats:
Here are all the counters for my Hadoop (YARN) job:
I should also add that this is run on a small cluster of 4 machines. Each has 8GB of RAM (2GB reserved) and number of virtual cores is 12 (2 reserved).
These are virtual machines. At first they were all on a single unit, but then I separated them 2-2 on two units. So they were sharing HDD at first, now there are two machines per disk. Between them is a gigabit network.
And here are more stats:
Whole memory is occupied
CPU is constantly under pressure while the job is run (the picture shows CPU for two consecutive runs of same job)
My question is: why is the shuffle time so long, and how can I fix it? I also don't understand why there was no speedup even though I have dramatically reduced the amount of data emitted from the Mapper.
A few observations:
For a 30-minute job, the GC time is too high. (Try reusing objects rather than creating a new one for each call in the map()/reduce() methods.)
The average map time is far too high at 16 minutes. What are you doing in your map?
YARN memory is at 99%, which suggests you are running too many services on your HDP cluster and the RAM is not sufficient to support them.
Please increase the YARN container memory; give it at least 1 GB.
This looks like a GC plus overscheduled-cluster problem.
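The object-reuse advice above can be illustrated without Hadoop on the classpath. In the sketch below, MutableText is a hypothetical stand-in for a mutable Writable such as Text, and Sink stands in for the context that serializes each key as soon as it is written. That immediate serialization is what makes reusing a single instance safe.

```java
// Stand-in types only; a real job would use Hadoop's Text and Context instead.
final class ObjectReuseSketch {
    static final class MutableText {
        static int created = 0; // counts allocations for the comparison
        private String value = "";
        MutableText() { created++; }
        void set(String v) { value = v; }
        String get() { return value; }
    }

    interface Sink { void write(MutableText key); }

    // Allocates a fresh key object for every token: more garbage for the GC.
    static final class AllocatingMapper {
        void map(String line, Sink out) {
            for (String token : line.split("\\s+")) {
                MutableText word = new MutableText();
                word.set(token);
                out.write(word);
            }
        }
    }

    // Reuses one key object for the lifetime of the mapper.
    static final class ReusingMapper {
        private final MutableText word = new MutableText();
        void map(String line, Sink out) {
            for (String token : line.split("\\s+")) {
                word.set(token);
                out.write(word);
            }
        }
    }

    // Returns how many MutableText objects were created while mapping all lines.
    static int countCreated(boolean reuse, String[] lines) {
        MutableText.created = 0;
        StringBuilder serialized = new StringBuilder();
        // The sink copies the value out immediately, like serialization would.
        Sink sink = key -> serialized.append(key.get()).append('\n');
        if (reuse) {
            ReusingMapper m = new ReusingMapper();
            for (String line : lines) m.map(line, sink);
        } else {
            AllocatingMapper m = new AllocatingMapper();
            for (String line : lines) m.map(line, sink);
        }
        return MutableText.created;
    }

    public static void main(String[] args) {
        String[] lines = {"a b c", "d e"};
        System.out.println(countCreated(false, lines)); // one object per token
        System.out.println(countCreated(true, lines));  // one object in total
    }
}
```

Whether this is the main cause of the GC time in the job above cannot be determined from the thread; it is only the pattern the observation refers to.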

Default number of reducers

In Hadoop, if we have not set the number of reducers, how many reducers will be created?
The number of mappers depends on (total data size)/(input split size).
E.g. if the data size is 1 TB and the input split size is 100 MB, then the number of mappers will be (1000*1000)/100 = 10000 (ten thousand).
Which factors does the number of reducers depend on? How many reducers are created for a job?
How Many Reduces? (From the official documentation)
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
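As a quick illustration of the rule quoted above, here is a small sketch that evaluates both factors. The 4-node, 8-containers-per-node cluster is a made-up example, the class name is hypothetical, and rounding down is an assumption, since the documentation does not say how to round.

```java
// Evaluates factor * nodes * maxContainersPerNode, rounded down (assumed rounding).
final class ReducerEstimate {
    static int recommended(int nodes, int maxContainersPerNode, double factor) {
        return (int) Math.floor(factor * nodes * maxContainersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 4;                // hypothetical cluster size
        int maxContainersPerNode = 8; // hypothetical containers per node
        // Single wave: every reduce can start as soon as maps begin finishing.
        System.out.println(recommended(nodes, maxContainersPerNode, 0.95));
        // Two waves: faster nodes pick up a second round of reduces.
        System.out.println(recommended(nodes, maxContainersPerNode, 1.75));
    }
}
```

For this example the two factors give 30 and 56 reduces respectively.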
The same documentation covers the mapper count too.
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
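The map count in that example is the input size divided by the block size, rounded up. A minimal sketch, assuming binary units for the documentation example (10 TiB, 128 MiB blocks) and decimal units for the 1 TB / 100 MB example from the question. The class name is hypothetical.

```java
// Number of splits needed to cover the input, rounding up for a partial last split.
final class MapEstimate {
    static long maps(long inputBytes, long splitBytes) {
        if (splitBytes <= 0) {
            throw new IllegalArgumentException("splitBytes must be positive");
        }
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long tib = mib * 1024L * 1024L;
        // Documentation example: 10 TB of input with 128 MB blocks.
        System.out.println(maps(10L * tib, 128L * mib));
        // Question's example in decimal units: 1 TB of input, 100 MB splits.
        long mb = 1_000_000L;
        System.out.println(maps(1_000_000L * mb, 100L * mb));
    }
}
```

The first call gives 81,920, which the documentation rounds to 82,000. The second gives the 10,000 computed in the question. This treats the input as one large file; many small files each produce at least one split of their own.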
If you want to change the default value of 1 for the number of reducers, you can set the property below (from Hadoop 2.x) as a command-line parameter:
mapreduce.job.reduces
OR
you can set it programmatically with
job.setNumReduceTasks(integer_number);
Have a look at one more related SE question: What is Ideal number of reducers on Hadoop?
By default the number of reducers is set to 1.
You can change it by adding the parameter
mapred.reduce.tasks on the command line, in the Driver code, or in the conf file that you pass.
E.g. as a command-line argument: bin/hadoop jar ... -Dmapred.reduce.tasks=<num reduce tasks>
or, in the Driver code: conf.setNumReduceTasks(int num);
Recommended read:
https://wiki.apache.org/hadoop/HowManyMapsAndReduces

Hadoop Terasort unstable benchmark results

I have a Cloudera Hadoop cluster and I'm running some Terasort benchmarks, but I'm getting very unstable results, ranging from 105 to 150 minutes. Sometimes I've seen it replicating more than usual or doing a lot of garbage collection, but at other times the runs looked pretty much the same.
I don't know the reason of the unstable results, any hint or recommendation will be very welcome :)
I run the benchmarks as follows:
I've chosen the number of maps and reduces tasks following this guide http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Speculative execution of map and reduce tasks is off.
Generating dataset:
10,000,000,000 rows of 100 bytes ~= 953674 M
Block size = 128 MB
Number of map tasks = 3725, computed as (number-of-rows * row-size) / (block-size * 2). I multiply the block size by 2 because the map task time was too low otherwise, around 7 seconds.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -Ddfs.replication=3 -Dmapred.map.tasks=3725 10000000000 /terasort-in
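The dataset size and map count above can be checked with integer arithmetic. This is only a sketch of the calculation described in the question; the class and method names are made up.

```java
// Reproduces the teragen sizing arithmetic from the question.
final class TeragenSizing {
    static final long MIB = 1024L * 1024L;

    // Dataset size in whole MiB (rounded down).
    static long datasetMib(long rows, long rowBytes) {
        return (rows * rowBytes) / MIB;
    }

    // Map count when each map handles blocksPerMap blocks (rounded down).
    static long mapTasks(long rows, long rowBytes, long blockMib, long blocksPerMap) {
        return (rows * rowBytes) / (blockMib * MIB * blocksPerMap);
    }

    public static void main(String[] args) {
        long rows = 10_000_000_000L;
        long rowBytes = 100L;
        System.out.println(datasetMib(rows, rowBytes));        // dataset size in MiB
        System.out.println(mapTasks(rows, rowBytes, 128L, 2L)); // two blocks per map
    }
}
```

Both values agree with the figures in the question: 953674 MiB and 3725 map tasks.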
Running terasort:
num-of-worker-nodes = 4
num-of-cores-per-node = 8
Reduce tasks = 56 ( 1.75 * num-of-worker-nodes * num-of-cores-per-node )
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort -Ddfs.replication=1 -Dmapred.reduce.tasks=56 /terasort-in /terasort-out
The service and role distribution among nodes is as follows:
6 Nodes - 8 cores, 16 GB RAM and 2 HD each - running just HDFS and MapReduce:
1st node, just master roles:
Namenode.
Cloudera management services.
2nd node, just master roles:
JobTracker.
SecondaryNamenode.
3rd to 6th nodes, just worker roles:
TaskTracker.
Datanode.
I use the 2nd node as the client because it is the one with the lowest load.
Please tell me if you need any configuration property value or detail.
Update: After Chris White's answer I've tried to reduce the number of pollings between the jobtracker and tasktrackers by having just 1 worker and very few maps and reduces, now the benchmarks are pretty stable :)
There are many factors that you need to take into consideration when looking at performance:
This could be a polling problem combined with the small number of processing slots you have available.
The Task Trackers poll the running tasks periodically to determine whether they have finished, and the Job Tracker also polls the Task Trackers. With your ~3700 map tasks (if I've read your question correctly), a difference of about 1 second in polling times per task could account for the roughly one-hour difference in timing you are seeing.
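The estimate behind that statement is simple multiplication. The sketch below treats it as a serial worst case, assuming every task's polling delay adds directly to the wall-clock time; with tasks running in parallel across slots, the real effect would be smaller. The names are hypothetical.

```java
// Worst-case extra runtime if each task adds a fixed delay and delays never overlap.
final class PollingOverhead {
    static double serialUpperBoundMinutes(int tasks, double delaySecondsPerTask) {
        return tasks * delaySecondsPerTask / 60.0;
    }

    public static void main(String[] args) {
        // 3725 map tasks from the question, ~1 second of polling slack each.
        System.out.printf("%.1f minutes%n", serialUpperBoundMinutes(3725, 1.0));
    }
}
```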
If you have a larger cluster with more processing slots, I imagine this number would become more stable, but no MR job will ever have a constant running time: there are too many polling and other external timings (JVM start-up time, for example) that can change the overall runtime.
What did the data locality counters say for both jobs? If one job had considerably more data-local tasks than the other, I would expect it to run faster too.

hadoop cassandra cpu utilization

Summary: How can I get Hadoop to use more CPUs concurrently on my server?
I'm running Cassandra and Hadoop on a single high-end server with 64GB RAM, SSDs, and 16 CPU cores. The input to my mapreduce job has 50M rows. During the map phase, Hadoop creates seven mappers. Six of those complete very quickly, and the seventh runs for two hours to complete the map phase. I've suggested more mappers like this ...
job.getConfiguration().set("mapred.map.tasks", "12");
but Hadoop continues to create only seven. I'd like to get more mappers running in parallel to take better advantage of the 16 cores in the server. Can someone explain how Hadoop decides how many mappers to create?
I have a similar concern during the reduce phase. I tell Hadoop to create 12 reducers like this ...
job.setNumReduceTasks(12);
Hadoop does create 12 reducers, but 11 complete quickly and the last one runs for hours. My job has 300K keys, so I don't imagine they're all being routed to the same reducer.
Thanks.
The number of map tasks depends on your input data.
For example:
if your data source is HBase, the number of maps is the number of regions in your data
if your data source is a file, the number of maps is your file size divided by the block size (64 MB or 128 MB).
You cannot set the number of maps directly in code; a value such as mapred.map.tasks is only a hint to the framework.
The problem of 6 fast mappers and 1 slow one is caused by unbalanced data. I have not used Cassandra before, so I cannot tell you how to fix it.
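The same kind of imbalance can explain the reduce phase in the question: keys may be spread evenly while records are not. The sketch below uses the formula of Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but with String.hashCode standing in for the Writable key's hash. The key names and record counts are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Shows how one heavy key can overload a single reducer even with many distinct keys.
final class PartitionSkew {
    // Same formula as the default hash partitioner; String.hashCode is a stand-in.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sums the records each reducer receives, given the record count per key.
    static long[] recordsPerReducer(Map<String, Long> recordsPerKey, int numReduceTasks) {
        long[] load = new long[numReduceTasks];
        for (Map.Entry<String, Long> e : recordsPerKey.entrySet()) {
            load[partitionFor(e.getKey(), numReduceTasks)] += e.getValue();
        }
        return load;
    }

    public static void main(String[] args) {
        Map<String, Long> recordsPerKey = new HashMap<>();
        for (int i = 0; i < 300_000; i++) {
            recordsPerKey.put("key-" + i, 100L); // many light keys (hypothetical)
        }
        recordsPerKey.put("hot-key", 20_000_000L); // one heavy key (hypothetical)
        // One reducer ends up with at least the 20 million records of the hot key.
        System.out.println(Arrays.toString(recordsPerReducer(recordsPerKey, 12)));
    }
}
```

Checking the per-reducer input record counters of the real job would show whether a pattern like this applies.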
