To determine the number of reducers in Hive "order by" clause - hadoop

I have a 2.6 MB sized CSV file. I created a hive table and loaded the csv file in it.
Now, if I run the query "select * from abc order by a;", MapReduce uses 1 reducer. How did it arrive at 1 reducer? Did it use the default value "1" or something else?
In general, how does Hive decide how many reducers to use for an "order by", "sort by" or "group by" clause?

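A global "order by" always runs on a single reducer, because Hive needs one reducer to produce a total ordering. For other queries it goes with the data size: the default is one reducer per 1 GB of input (256 MB from Hive 0.14 onward), and it is regulated by this property: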
hive.exec.reducers.bytes.per.reducer
If you want a specific number of reducers, set it with this:
mapred.reduce.tasks
The full list of settings with explanations can be found here.

The number of reducers in Hive is calculated from the hive.exec.reducers.bytes.per.reducer property, whose default value is 1 GB (1000000000 bytes).
You can influence the number of reducers by changing that property. Alternatively, you can fix the number of reducers for a job with the property mapred.reduce.tasks.
// hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>xxxxxxx</value>
</property>
// console
$ hive -e "set hive.exec.reducers.bytes.per.reducer=xxxxxxx"
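The sizing rule described above can be sketched in Python as a worked calculation. This is only an approximation of what Hive does, not its actual code: the function name estimate_reducers is made up for illustration, and the cap of 999 is an assumed value for hive.exec.reducers.max (the default differs between Hive versions).

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Rough estimate: one reducer per bytes_per_reducer of input.

    bytes_per_reducer stands in for hive.exec.reducers.bytes.per.reducer and
    max_reducers for hive.exec.reducers.max (assumed values, check your version).
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Never fewer than one reducer, never more than the configured cap.
    return min(max_reducers, max(1, wanted))

print(estimate_reducers(2_600_000))      # the 2.6 MB CSV from the question
print(estimate_reducers(3_000_000_000))  # a 3 GB input
```

With these assumptions the 2.6 MB file comes out at a single reducer, because it is far below the per-reducer threshold.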

Related

Why is hive writing 2 part files to hdfs even though number of mappers and reducers is set to 1

I have a hive insert overwrite query - set mapred.map.tasks=1; set mapred.reduce.tasks=1; insert overwrite table staging.table1 partition(dt) select * from testing.table1;
When I inspect the HDFS directory for staging.table1, I see that there are 2 part files created.
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0
Why is it that 2 files are created?
I am using beeline client and hive 2.1.1-cdh6.3.1
The insert query you executed is map-only, which means there is no reduce task, so setting mapred.reduce.tasks has no effect.
Also, the number of mappers is determined by the number of input splits, so setting mapred.map.tasks won't change the parallelism of the mappers.
There are at least two feasible ways to force the result into a single file:
1] Enforce a post-job file merge.
    Set hive.merge.mapfiles to true. This is already the default.
    Set hive.merge.smallfiles.avgsize so that it is larger than the average output file size. The merge is only triggered when the average output file size is below this value.
    Increase hive.merge.size.per.task so that it is at least the target file size after merging.
2] Configure how mappers combine input files, to cut down the number of mappers.
    Make sure that hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
    Then increase mapreduce.input.fileinputformat.split.maxsize to allow a larger split size.
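To make the merge trigger concrete, here is a small Python sketch of the documented condition for hive.merge.smallfiles.avgsize: the extra merge job runs when the average output file size is below the threshold. The function name should_merge and the file sizes are hypothetical; 16000000 is assumed to be the default threshold.

```python
def should_merge(output_file_sizes, smallfiles_avgsize=16_000_000):
    """Return True when the average output file size is below the threshold.

    smallfiles_avgsize stands in for hive.merge.smallfiles.avgsize.
    """
    if not output_file_sizes:
        return False
    average = sum(output_file_sizes) / len(output_file_sizes)
    return average < smallfiles_avgsize

# Two hypothetical 20 MB part files.
part_files = [20_000_000, 20_000_000]
print(should_merge(part_files))                                 # default threshold
print(should_merge(part_files, smallfiles_avgsize=64_000_000))  # raised threshold
```

Under these assumed sizes the default threshold does not trigger a merge, while the raised one does.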

hive compaction using insert overwrite partition

I am trying to address the small files problem by compacting the files under Hive partitions with an INSERT OVERWRITE ... PARTITION command in Hadoop.
Query :
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1,col2,col3 from tbl1
WHERE year=2016 and month=03 and day=11;
Input Files:
For testing purposes I have three files under the Hive partition (2016/03/11) in HDFS, each 40 MB in size.
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
For example, my block size is 128 MB, so I would like to create only one output file. But I am getting 3 different compressed files.
Please help me find the Hive configuration to restrict the output file size. If I do not use compression, I get a single file.
Hive Version : 1.1
It's interesting that you still get 3 files when you specify the partition and use compression. You may want to look into dynamic partitioning, or set the partitioning aside and focus on the number of mappers and reducers your job creates. If your files are small, I can see why you would want them all in one file on the target, but then I would also question the need to compress them.
The number of files created in your target is directly tied to the number of reducers or mappers. If the SQL you write needs to reduce then the number of files created will be the same as the number of reducers used in the job. This can be controlled by setting the number of reducers used in the job.
set mapred.reduce.tasks = 1;
In your example SQL there most likely wouldn't be any reducers, so the number of files in the target equals the number of mappers used, which equals the number of files in the source. It isn't as easy to control the number of output files in a map-only job, but there are a number of configuration settings you can try.
Enable combining of small input files so that fewer mappers are spawned (the default is false):
set hive.hadoop.supports.splittable.combineinputformat = true;
Try setting a threshold in bytes for the input files. Hive will try to convert anything under this threshold to a map join, which can affect the number of output files.
set hive.mapjoin.smalltable.filesize = 25000000;
As for the compression, I would try changing the compression type just to see whether it makes any difference in your output. For ORC tables, for example, valid values include NONE, ZLIB and SNAPPY:
set hive.exec.orc.default.compress = SNAPPY;

how to set Hive reduce operator since reduce operator is always 0

I am trying to load data into Hive RC and ORC files, but the number of reducers is always 0. I tried to set the reducers in Hive with set mapred.reducer.tasks=1 but it does not work. I read online that the default size per reducer is 1 GB, so I tried loading 3 GB of data so that there would be at least 2 reducers. What do I have to do to get a reduce operator?
I would need more information about the query to know for sure, but my guess is that the query you are running is a map-only job and therefore does not require any reducers. You can add a DISTRIBUTE BY clause to force Hadoop to use reducers. For example,
SELECT txn_id FROM table;
will be a map only job. You can force Hive to add a reduce step by adding this clause.
SELECT txn_id FROM table
DISTRIBUTE BY txn_id;
Try
set mapred.reduce.tasks=99;
set hive.exec.reducers.max=99;
However, it is likely that your tasks do not require a reducer.

what determines the number of map tasks and reduce tasks in hive?

I use Hive to run the query "select * from T1,T2 where T1.a=T2.b", where the schema is T1(a int, b int), T2(a int, b int). When it runs, 6 map tasks and one reduce task are generated. What determines the number of map tasks and reduce tasks? Is it the data volume?
The number of map tasks depends on the data volume, the block size and the split size.
For example: if your block size is 128 MB and your file size is 1 GB, there will be 8 map tasks. You can control this through the split size.
Hadoop's default number of reducers is 1, whereas Hive defaults to -1 and works out the number itself. You can override it via configuration:
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
<description>The default number of reduce tasks per job. Typically set
to a prime close to the number of available hosts. Ignored when
mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas hive uses
-1 as its default value.
By setting this property to -1, Hive will automatically figure out what should be
the number of reducers.
</description>
</property>
The parameters that decide your split size, and in turn your number of map tasks, are:
> mapred.max.split.size
> mapred.min.split.size
"mapred.max.split.size" can be set per job individually through
your conf object. Don't change "dfs.block.size", because that affects
your HDFS too.
If mapred.min.split.size is less than the block size and
mapred.max.split.size is greater than the block size, then one block is
sent to each map task. The block data is split into key-value pairs
based on the input format you use.
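The interplay of the three sizes can be sketched in Python. The formula below mirrors the one used by Hadoop's FileInputFormat (the larger of the minimum size and the smaller of the maximum size and the block size). The function name and the sample values are illustrative, and Hive's combining input format may group splits differently.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    """Split size = max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))

block = 128 * MB

# min < block < max: each split is exactly one block.
print(compute_split_size(block, min_size=1, max_size=256 * MB) // MB)

# max below the block size: splits shrink, so more map tasks run.
print(compute_split_size(block, min_size=1, max_size=64 * MB) // MB)

# min above the block size: splits grow, so fewer map tasks run.
print(compute_split_size(block, min_size=256 * MB, max_size=1024 * MB) // MB)
```

This is why lowering the maximum split size increases mapper parallelism without touching the HDFS block size.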
hive> select * from emp;
In this case typically no map or reduce task starts, because we are only dumping the data.
Suppose I want to know how many map and reduce stages start when I run a query such as:
hive> select count(*) from emp group by name;
If we add the explain keyword before the query, it shows the plan, including the map and reduce stages that will run.
hive> explain select count(*) from emp group by name;

Hive unable to manually set number of reducers

I have the following hive query:
select count(distinct id) as total from mytable;
which automatically spawns:
1408 Mappers
1 Reducer
I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but none of these settings seems to be honored. The query takes forever to run. Is there a way to manually set the reducers, or maybe rewrite the query so that it results in more reducers? Thanks!
Writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always result in using only one reducer.
You should:
1] use this command to set the desired number of reducers:
set mapred.reduce.tasks=50
2] rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This will result in 2 map+reduce jobs instead of one, but performance gain will be substantial.
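Why the rewrite parallelizes can be illustrated with a small Python simulation. This is only a conceptual sketch, not how Hive executes the query: the function name and the hash-based routing are assumptions made for illustration.

```python
from collections import defaultdict

def count_distinct_two_stage(ids, num_reducers=50):
    """Simulate SELECT COUNT(*) FROM (SELECT DISTINCT id ...) with two stages."""
    # Stage 1: route each id to a reducer by hash, so that all duplicates of
    # an id land on the same reducer, which then deduplicates its own share.
    per_reducer = defaultdict(set)
    for value in ids:
        per_reducer[hash(value) % num_reducers].add(value)
    # Stage 2: a single reducer only has to add up the per-reducer counts.
    return sum(len(unique_ids) for unique_ids in per_reducer.values())

sample_ids = [1, 2, 2, 3, 3, 3, 42, 42, 1000]
print(count_distinct_two_stage(sample_ids, num_reducers=4))
```

The expensive deduplication is spread over many reducers, and the final single-reducer step only sums a handful of numbers.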
The number of reducers also depends on the size of the input.
By default Hive allots 1 GB (1000000000 bytes) of input per reducer. You can change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>1000000</value>
</property>
or using set
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"
You could set the number of reducers spawned per node in the conf/mapred-site.xml config file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.
In particular, you need to set this property:
mapred.tasktracker.reduce.tasks.maximum
The number of mappers depends entirely on the number of input splits, which follows from the number and size of the input files. A split is nothing but a logical division of the data.
Ex: my file size is 150MB and my HDFS default block size is 128MB. This creates two splits, that is, two blocks, so two mappers are assigned to the job.
Imp Note: Suppose I have specified a split size of 50MB. Then it will start 3 mappers, because the count depends entirely on the number of splits.
Imp Note: if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
Note: If we haven't specified the split size, the default HDFS block size is used as the split size.
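The examples above reduce to a ceiling division, sketched here in Python. The helper name num_splits is made up for illustration, and the calculation deliberately ignores details such as split boundaries across multiple files and the small slack Hadoop allows on the last split.

```python
MB = 1024 ** 2
TB = 1024 ** 4

def num_splits(file_size, split_size):
    """Ceiling division: number of splits needed to cover file_size."""
    return (file_size + split_size - 1) // split_size

print(num_splits(150 * MB, 128 * MB))  # 150MB file, 128MB blocks
print(num_splits(150 * MB, 50 * MB))   # same file, 50MB split size
print(num_splits(10 * TB, 128 * MB))   # 10TB of input, 128MB blocks
```

The last case gives 81,920, which is the "82,000 maps" figure quoted above after rounding.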
Reducer has 3 primary phases: shuffle, sort and reduce.
Command :
1] Set Map Task : -D mapred.map.tasks=4
2] Set Reduce Task : -D mapred.reduce.tasks=2

Resources