Estimation Of Mappers for a cluster - hadoop

I need some clarification regarding the estimation of mappers for a particular job in a Hadoop cluster. As per my understanding, the number of mappers depends on the input splits taken for processing, but that applies when the input data already resides in HDFS. Here I need clarification regarding the mappers and reducers triggered by a Sqoop job. Please find my questions below.
How is the mapper count estimated for a dedicated cluster: based on RAM or based on the input splits/blocks? (General)
How is the mapper count estimated for a Sqoop job that retrieves data from an RDBMS into HDFS, based on input size? (Sqoop-specific)
What is meant by CPU cores, and how do they affect the number of mappers that can run in parallel? (General)
Thanks.

How is the mapper count estimated for a dedicated cluster: based on RAM or based on the input splits/blocks? (General)
You are right. The number of mappers is usually based on the number of HDFS blocks (input splits) in the input, not on RAM.
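For file-based input, here is a minimal sketch of the split-size arithmetic that FileInputFormat applies; the 1 GB file and 128 MB block size are example figures only, and the real split logic allows a little slack on the last split:

// Rough sketch of how FileInputFormat derives the number of splits (and hence mappers)
// for a file-based input. File size and block size below are example values only.
public class SplitCountSketch {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // HDFS block size, e.g. 128 MB
        long minSize   = 1L;                   // mapreduce.input.fileinputformat.split.minsize
        long maxSize   = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize
        long fileSize  = 1024L * 1024 * 1024;  // example: a 1 GB input file

        // FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize))
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        long numSplits = (fileSize + splitSize - 1) / splitSize;   // roughly ceil(fileSize / splitSize)

        System.out.println("split size = " + splitSize + " bytes, mappers ~ " + numSplits);
        // With these numbers: 1 GB / 128 MB = 8 splits, i.e. about 8 map tasks.
    }
}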
How is the mapper count estimated for a Sqoop job that retrieves data from an RDBMS into HDFS, based on input size? (Sqoop-specific)
By default, Sqoop uses four parallel map tasks to import or export data.
You can change this with the -m <number of mappers> (or --num-mappers) option.
Refer: Sqoop parallelism
What is meant by CPU cores, and how do they affect the number of mappers that can run in parallel? (General)
CPU cores are processing units. In simple terms, the more cores a node has, the more tasks it can run in parallel.
Example: if you have 4 cores, 4 mappers can run in parallel (theoretically!).
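On a YARN cluster (Hadoop 2.x), the number of mappers that actually run at the same time on one node is bounded by the node's advertised vcores and memory rather than by the raw core count alone. A back-of-the-envelope sketch, with illustrative property values:

// The values below are examples of what yarn-site.xml / mapred-site.xml might contain;
// they are not recommendations.
public class ConcurrentMapperEstimate {
    public static void main(String[] args) {
        int nodeVcores = 16;     // yarn.nodemanager.resource.cpu-vcores
        int nodeMemMb  = 65536;  // yarn.nodemanager.resource.memory-mb (64 GB)
        int mapVcores  = 1;      // mapreduce.map.cpu.vcores
        int mapMemMb   = 2048;   // mapreduce.map.memory.mb

        int byCpu = nodeVcores / mapVcores;
        int byMem = nodeMemMb / mapMemMb;
        System.out.println("concurrent mappers per node ~ " + Math.min(byCpu, byMem));
        // Here min(16, 32) = 16, i.e. roughly one mapper per core.
    }
}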

How is the mapper count estimated for a dedicated cluster: based on RAM or based on the input splits/blocks? (General)
Answer: No, it has nothing to do with the RAM size. It depends entirely on the number of input splits.
How is the mapper count estimated for a Sqoop job that retrieves data from an RDBMS into HDFS, based on input size? (Sqoop-specific)
Answer: By default, a Sqoop job uses 4 mappers. You can change the default with -m (1, 2, 3, 4, 5 ...) or the --num-mappers parameter, but make sure that either the table has a primary key or you use the --split-by parameter; otherwise the data cannot be split, only one mapper will run, and you have to explicitly say -m 1.
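As an illustration, here is a hedged sketch of a parallel import; the JDBC URL, credentials, table and column names are made up, and it assumes Sqoop 1.x, whose org.apache.sqoop.Sqoop class exposes a runTool(String[]) entry point (the same options work on the sqoop command line):

import org.apache.sqoop.Sqoop;

// Hypothetical import: connection details and table/column names are illustrative only.
public class SqoopParallelImport {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",   // hypothetical RDBMS
            "--username", "etl",
            "--password-file", "/user/etl/.pw",
            "--table", "orders",
            "--split-by", "order_id",                   // needed if the table has no primary key
            "-m", "8",                                  // 8 parallel mappers instead of the default 4
            "--target-dir", "/data/orders"
        };
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}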
What is meant by CPU cores, and how do they affect the number of mappers that can run in parallel? (General)
Answer: A core is the processing unit of a CPU that can run a task; a 4-core processor can run 4 tasks at a time. The number of cores plays no part in how the MapReduce framework calculates the number of mappers. But if there are 4 cores and MapReduce calculates that 12 mappers are needed, then 4 mappers will run in parallel at a time and the rest will run afterwards as cores become free.

Related

Number of reducers in hadoop

I am learning Hadoop, and I find the number of reducers very confusing:
1) The number of reducers is the same as the number of partitions.
2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) The number of reducers is set by mapred.reduce.tasks.
4) The number of reducers is closest to: a multiple of the block size, a task time of between 5 and 15 minutes, and whatever creates the fewest files possible.
I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?
How is the number of reducers calculated? Please tell me how to calculate the number of reducers.
1 - The number of reducers is the same as the number of partitions - False. A single reducer might work on one or more partitions, but a given partition is processed entirely by the reducer it is started on.
2 - That is just a theoretical maximum for the number of reducers you can configure for a Hadoop cluster, and it also depends heavily on the kind of data you are processing (which decides how much heavy lifting the reducers are burdened with).
3 - The mapred-site.xml configuration is just a suggestion to YARN; internally the ResourceManager runs its own algorithm, optimizing things on the go. So that value is not necessarily the number of reducer tasks that run every time.
4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128*5 as the minimum number of reducers. That, again, is false, I believe.
There is no fixed number of reducer tasks that can be configured or calculated once and for all. It depends on how many resources are actually available to allocate at that moment.
The number of reducers is calculated internally from the size of the data being processed (this is Hive's behaviour) if you don't explicitly specify it with the API below in the driver program:
job.setNumReduceTasks(x)
By default, one reducer is used per 1 GB of data.
So if you are working with less than 1 GB of data and you do not explicitly set the number of reducers, 1 reducer will be used.
Similarly, if your data is 10 GB, 10 reducers will be used.
You can change this configuration as well: instead of 1 GB you can specify a bigger or smaller size per reducer.
The Hive property for setting the data size handled by each reducer is:
hive.exec.reducers.bytes.per.reducer
You can view this property by running the set command in the Hive CLI.
The partitioner only decides which data goes to which reducer.
Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so and that each produce at least one HDFS block's worth of output. With too many reducers you end up with lots of small files.
The partitioner makes sure that the same key from multiple mappers goes to the same reducer. This does not mean that the number of partitions is equal to the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, like job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, it is picked up from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output goes to a single reducer.
Also note that the programmer does not have control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.
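A minimal sketch of a custom partitioner together with an explicit reducer count; the Text/IntWritable types and the two-way split on the first letter of the key are made up purely for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends keys starting with 'a'..'m' to reducer 0 and everything else to reducer 1.
// All records that share a key land in the same partition, hence the same reducer.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
        return (numPartitions < 2 || first <= 'm') ? 0 : 1;
    }
}

// In the driver:
// job.setPartitionerClass(AlphabetPartitioner.class);
// job.setNumReduceTasks(2);   // two partitions handled by two reducers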

Hadoop Map/Reduce Job distribution

I have 4 nodes, and I am running a MapReduce sample project to see if the job is distributed among all 4 nodes. I ran the project multiple times and noticed that the mapper tasks are split among all 4 nodes, but the reducer task is only executed by one node. Is this how it is supposed to be, or is the reducer task supposed to be split among all 4 nodes as well?
Thank you
The distribution of mappers depends on which block of data each mapper will operate on. By default the framework tries to assign a task to a node that has the relevant block of data stored locally, which avoids transferring the data over the network.
For reducers, it again depends on the number of reducers your job requires. If your job uses only one reducer, it may be assigned to any of the nodes.
Speculative execution also has an impact here: if it is enabled, multiple instances of a map or reduce task may be started on different nodes, and the JobTracker, based on percentage completion, decides which one goes through while the other instances are killed.
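To see why map tasks land on particular nodes, you can list where the blocks of an input file actually live; a small sketch, with the file path made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints the datanodes holding each block of an HDFS file; the framework prefers to
// schedule a map task on (or near) one of these nodes.
public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Path input = new Path("/data/sample.txt");   // hypothetical input file
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(input);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(",", block.getHosts()));
        }
    }
}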
Let us say you have a 224 MB file. When you add that file to HDFS, with the default block size of 64 MB it is split into 4 blocks [blk1=64M, blk2=64M, blk3=64M, blk4=32M]. Let us assume blk1 is on node1 (represented as blk1::node1), blk2::node2, blk3::node3, blk4::node4. Now when you run the MR job, the map tasks need to access the input file, so the MR framework creates 4 mappers, one executed on each node. Now come the reducers: as Venkat said, it depends on the number of reducers configured for your job. The reducers can be configured using the Hadoop org.apache.hadoop.mapreduce.Job setNumReduceTasks(int tasks) API.

hadoop cassandra cpu utilization

Summary: How can I get Hadoop to use more CPUs concurrently on my server?
I'm running Cassandra and Hadoop on a single high-end server with 64GB RAM, SSDs, and 16 CPU cores. The input to my mapreduce job has 50M rows. During the map phase, Hadoop creates seven mappers. Six of those complete very quickly, and the seventh runs for two hours to complete the map phase. I've suggested more mappers like this ...
job.getConfiguration().set("mapred.map.tasks", "12");
but Hadoop continues to create only seven. I'd like to get more mappers running in parallel to take better advantage of the 16 cores in the server. Can someone explain how Hadoop decides how many mappers to create?
I have a similar concern during the reduce phase. I tell Hadoop to create 12 reducers like this ...
job.setNumReduceTasks(12);
Hadoop does create 12 reducers, but 11 complete quickly and the last one runs for hours. My job has 300K keys, so I don't imagine they're all being routed to the same reducer.
Thanks.
The number of map tasks depends on your input data.
For example:
if your data source is HBase, the number of maps is the number of regions in your data;
if your data source is a file, the number of maps is your file size divided by the block size (64 MB or 128 MB).
You cannot specify the map number directly in code.
The problem of 6 fast mappers and 1 slow one is that the data is unbalanced. I have not used Cassandra before, so I cannot tell you how to fix it.
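You cannot set the mapper count directly, but for FileInputFormat-based inputs you can influence it by shrinking the maximum split size so each file yields more splits; whether Cassandra's input format honours a similar knob is not something covered here. A hedged sketch with an illustrative 32 MB value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// For FileInputFormat-based inputs only: a smaller maximum split size yields more
// splits and therefore more map tasks. 32 MB is just an example figure.
public class MoreMappersSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "more-mappers");
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        // Equivalent configuration property: mapreduce.input.fileinputformat.split.maxsize
    }
}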

Hadoop cluster for non-MapReduce algorithms in parallel

Apache Hadoop is inspired by Google's MapReduce paper. The flow of MapReduce can be considered as two sets of SIMD operations (single instruction, multiple data): one for mappers, another for reducers. Reducers consume the output of mappers through predefined keys. The essence of the MapReduce framework (and Hadoop) is to automatically partition the data, determine the number of partitions and parallel jobs, and manage distributed resources.
I have a general algorithm (not necessarily MapReducible) to be run in parallel. I am not implementing the algorithm itself in the MapReduce way; instead, the algorithm is just a single-machine Python/Java program. I want to run 64 copies of this program in parallel (assuming there is no concurrency issue in the program). In other words, I am more interested in the computing resources of the Hadoop cluster than in the MapReduce framework. Is there any way I can use the Hadoop cluster in this old-fashioned way?
Another way of thinking about MapReduce is that Map does the transformation and Reduce does some sort of aggregation.
Hadoop also allows for a map-only job. This way it should be possible to run 64 copies of the map program in parallel.
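A minimal sketch of such a map-only driver, assuming the single-machine program can be invoked from inside a mapper (the mapper body below is only a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: setting the reducer count to 0 skips shuffle/sort entirely and writes
// each mapper's output straight to HDFS. One input split = one parallel task.
public class MapOnlyDriver {

    public static class WorkMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Placeholder: hand `value` (one work item) to the single-machine program
            // here and emit whatever it produces.
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(WorkMapper.class);
        job.setNumReduceTasks(0);                    // no reducers at all
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // file listing the work items
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}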
Hadoop (in the classic MRv1 model) has the concept of slots. By default there are 2 map and 2 reduce slots per node/machine, so for 64 processes in parallel, 32 nodes are required. If the nodes have a higher-end configuration, the number of map/reduce slots per node can also be bumped up.

Hadoop streaming api - limit number of mappers on a per job basis

I have a job running on a small Hadoop cluster, and I want to limit the number of mappers it spawns per datanode. When I use -Dmapred.map.tasks=12, it still spawns 17 mappers for some reason. I've figured out a way to limit it globally, but I want to do it on a per-job basis.
In MapReduce, the total number of mappers spawned depends on the input splits created from your data.
One mapper task is spawned per input split, so you cannot directly decrease the mapper count in MapReduce; mapred.map.tasks is only a hint to the framework.

Resources