Splitsize for the Hive table in TDCH - hadoop

I'm using TDCH to export a hive data into teradata table. For this I need to specify the number of mappers to my TDCH job. So, my question is "is this number of mappers option we give to TDCH job is just a hint to TDCH? or are these total number of mappers created by TDCH will always be equal to number of mappers given in the option (of the TDCH job)"?
My assumption is that the number of mappers mainly depends on the split-size than on the given number of mappers (in the options of TDCH job). Is my assumption correct for the TDCH jobs?.
Also, for the Hive table how is the split size defined? is that defined based on the number of rows? or it's just defined based on the size of the data (like 60MB or 120MB etc) similar to the cases like the "textfiles"?

"is this number of mappers option we give to TDCH job is just a hint to TDCH? or are these total number of mappers created by TDCH will always be equal to number of mappers given in the option (of the TDCH job)"?
Splitsize in TDCH is always equals to "number of mappers" specified (I read this in one of the TDCH tutorials). So, number of mappers is not just a hint (unlike in conventional mapreduce programming) it's nothing but the number of splits.
As it's equals to number of splits total number of mappers generated for a TDCH job is always equals to "number of mappers" (option) specified while running the job.

Related

how many mappers and reduces will get created for a partitoned table in hive

I am always confused on how many mappers and reduces will get created for a particular task in hive.
e.g If block size = 128mb and there are 365 files each maps to a date in a year(file size=1 mb each). There is partition based on date column. In this case how many mappers and reducers will be run during loading the data?
Mappers:
Number of mappers depends on various factors such as how the data is distributed among nodes, input format, execution engine and configuration params. See also here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
MR uses CombineInputFormat, while Tez uses grouped splits.
Tez:
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split
MapReduce:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
Also Mappers are running on data nodes where the data is located, that is why manually controlling the number of mappers is not an easy task, not always possible to combine input.
Reducers:
Controlling the number of reducers is much easier.
The number of reducers determined according to
mapreduce.job.reduces - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.
hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.
Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
So, if you want to increase reducers parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer

Number of reducers in hadoop

I was learning hadoop,
I found number of reducers very confusing :
1) Number of reducers is same as number of partitions.
2) Number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) Number of reducers is set by mapred.reduce.tasks.
4) Number of reducers is closest to: A multiple of the block size * A task time between 5 and 15 minutes * Creates the fewest files possible.
I am very confused, Do we explicitly set number of reducers or it is done by mapreduce program itself?
How is number of reducers is calculated? Please tell me how to calculate number of reducers.
1 - The number of reducers is as number of partitions - False. A single reducer might work on one or more partitions. But a chosen partition will be fully done on the reducer it is started.
2 - That is just a theoretical number of maximum reducers you can configure for a Hadoop cluster. Which is very much dependent on the kind of data you are processing too (decides how much heavy lifting the reducers are burdened with).
3 - The mapred-site.xml configuration is just a suggestion to the Yarn. But internally the ResourceManager has its own algorithm running, optimizing things on the go. So that value is not really the number of reducer tasks running every time.
4 - This one seems a bit unrealistic. My block size might 128MB and everytime I can't have 128*5 minimum number of reducers. That's again is false, I believe.
There is no fixed number of reducers task that can be configured or calculated. It depends on the moment how much of the resources are actually available to allocate.
Number of reducer is internally calculated from size of the data we are processing if you don't explicitly specify using below API in driver program
job.setNumReduceTasks(x)
By default on 1 GB of data one reducer would be used.
so if you are playing with less than 1 GB of data and you are not specifically setting the number of reducer so 1 reducer would be used .
Similarly if your data is 10 Gb so 10 reducer would be used .
You can change the configuration as well that instead of 1 GB you can specify the bigger size or smaller size.
property in hive for setting size of reducer is :
hive.exec.reducers.bytes.per.reducer
you can view this property by firing set command in hive cli.
Partitioner only decides which data would go to which reducer.
Your job may or may not need reducers, it depends on what are you trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so, and which produce at least one HDFS block’s worth of output. Too many reducers and you end up with lots of small files.
Partitioner makes sure that same keys from multiple mappers goes to the same reducer. This doesn't mean that number of partitions is equal to number of reducers. However, you can specify number of reduce tasks in the driver program using job instance like job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program then it picks from the mapred.reduce.tasks which has the default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html) i.e. all mappers output will go to the same reducer.
Also, note that programmer will not have control over number of mappers as it depends on the input split where as programmer can control the number of reducers for any job.

Distributed by Clause in HIVE

I have table of with huge data like 100TB.
When I am querying the table I used distributed by clause on particular column (say x).
The table contains 200 distinct or unique values of X.
So When I queried the table with distributed by clause on X the maximum reducers should be 200. But I am seeing it is utilizing MAX reducers i.e. 999
Let me explain with example
Suppose the description of the emp_table is as fallows with 3 columns.
1.emp_name
2. emp_ID
3.Group_ID
and Group_ID has **200 distinct** values
Now I want to query the table
select * from emp_table distributed by Group_ID;
This Query should use 200 Reducers as per distributed clause. But I am seeing 999 reducers getting utilized.
I am doing it as part optimization. So how can I make sure it should be utilize 200 reducers?
The number of reducers in Hive is decided by either two properties.
hive.exec.reducers.bytes.per.reducer - The default value is 1GB, This makes hive to create one reducer for each 1GB of input table's size.
mapred.reduce.tasks - takes an intger value, and those many reducers will be prepared for the job.
The distribute by clause doesn't have any role in deciding the number of reducers, all its work is to distribute/partition the key value from mappers to prepared reducers based on the column given in the clause.
Consider setting the mapred.reduce.tasks as 200, and the distribute by will take care of partitioning the key values to the 200 reducers in even manner.
The reduce number of hive depends on the size of your input file.But if the output of the mapper contains only 200 groups.Then I guess most of the reduce job will recieve nothing.
If you really want to control the reduce number.set mapred.reduce.tasks will help.

Creating more partitions than reducers

When developing locally on my single machine, I believe the default number of reducers is 6. In a particular MR step, I actually divide up the data into n partitions where n can be greater than 6. From what I have observed, it looks like only 6 of those partitions actually get processed because I only see output from 6 specific partitions only. A few questions:
(a) Do I need to set the number of reducers to be greater than the number of partitions? If so, can I do this before/during/after running the Mapper?
(b) Why is it that the other partitions are not queued up? Is there a way to wait for a reducer to finish processing one partition before working on another partition such that all partitions can be processed regardless of whether the actual number of reducers is less than the number of partitions?
(a) No. You can have any number of reducers based on your needs. Partitioning just decides which set of key/value pairs will go to which reducer. It doesn't decide how many reducers will be generated. But, if there is a situation wherein you want to set the number of reducers as per your requirement, you can do that through Job :
job.setNumReduceTasks(2);
(b) This is actually what happens. Based on the availability of slots a set reducers is initiated which process all the input fed to them. If all the reducers have finished and some data is still left unprocessed a second batch of reducers will start and finish rest of the data. All of your data will eventually get processed irrespective of the number of partitions and reducers.
Please make sure your partition logic is correct.
P.S. : Why do you believe the default number of reducers is 6?
You can also ask for a number of reducers when you submit the job to hadoop.
$hadoop jar myjarfile mymainclass -Dmapreduce.job.reduces=n myinput myoutputdir
For more options and some details see:
Hadoop Number of Reducers Configuration Options Priority

How does Hive choose the number of reducers for a job?

Several places say the default # of reducers in a Hadoop job is 1. You can use the mapred.reduce.tasks symbol to manually set the number of reducers.
When I run a Hive job (on Amazon EMR, AMI 2.3.3), it has some number of reducers greater than one. Looking at job settings, something has set mapred.reduce.tasks, I presume Hive. How does it choose that number?
Note: here are some messages while running a Hive job that should be a clue:
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...
The default of 1 maybe for a vanilla Hadoop install. Hive overrides it.
In open source hive (and EMR likely)
# reducers = (# bytes of input to mappers)
/ (hive.exec.reducers.bytes.per.reducer)
This post says default hive.exec.reducers.bytes.per.reducer is 1G.
You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max.
If you know exactly the number of reducers you want, you can set mapred.reduce.tasks, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
In some cases - say 'select count(1) from T' - Hive will set the number of reducers to 1 , irrespective of the size of input data. These are called 'full aggregates' - and if the only thing that the query does is full aggregates - then the compiler knows that the data from the mappers is going to be reduced to trivial amount and there's no point running multiple reducers.

Resources