I used Hive to run the query "select * from T1,T2 where T1.a=T2.b", where the schemas are T1(a int, b int) and T2(a int, b int). When it ran, 6 map tasks and one reduce task were generated. What determines the number of map tasks and reduce tasks? Is it the data volume?
The number of map tasks is dependent on the data volume, block size and split size.
For example: if your block size is 128 MB and your file size is 1 GB, there will be 8 map tasks. You can control this by adjusting the split size.
The number of reduce tasks defaults to 1 in Hadoop, but Hive sets it to -1 so it can estimate the count itself. You can override it via configuration:
<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>
    The default number of reduce tasks per job. Typically set to a prime
    close to the number of available hosts. Ignored when mapred.job.tracker
    is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its
    default value. By setting this property to -1, Hive will automatically
    figure out what the number of reducers should be.
  </description>
</property>
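For example, a minimal sketch of overriding this for a single session (the value 8 is arbitrary; leaving it at -1 lets Hive estimate the count from the input size):
-- Override the reducer count for the current session only
set mapred.reduce.tasks=8;
-- Revert to Hive's automatic estimation
set mapred.reduce.tasks=-1;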
The parameters that decide your split size, and in turn your number of map tasks, are:
> mapred.max.split.size
> mapred.min.split.size
"mapred.max.split.size" which can be set per job individually through
your conf Object. Don't change "dfs.block.size" which affects your
HDFS too.
If mapred.min.split.size is less than the block size and
mapred.max.split.size is greater than the block size, then one block is
sent to each map task. The block data is split into key-value pairs based
on the InputFormat you use.
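As a rough sketch (the sizes here are arbitrary examples, not recommendations), you could shrink the split size before running the join from the question to get more map tasks over the same input:
-- Smaller max split size => more splits => more map tasks
set mapred.min.split.size=16000000;   -- ~16 MB
set mapred.max.split.size=64000000;   -- ~64 MB
select * from T1, T2 where T1.a = T2.b;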
hive> select * from emp;
Then no map or reduce tasks will start; we are only dumping the data.
If you want to know how many map and reduce tasks will start for a query such as:
hive> select count(*) from emp group by name;
then adding the explain keyword before the query will show how many map and reduce tasks will be started:
hive> explain select count(*) from emp group by name;
I have a Hive insert overwrite query:
set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;
When I inspect the HDFS directory for staging.table1, I see that there are 2 part files created.
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0
Why is it that 2 files are created?
I am using beeline client and hive 2.1.1-cdh6.3.1
The insert query you executed is map-only, which means there is no reduce task, so there is no point in setting mapred.reduce.tasks.
Also, the number of mappers is determined by the number of splits, so setting mapred.map.tasks won't change the parallelism of the mappers.
There are at least two feasible ways to force the resulting number of files to be 1 (a sketch of both appears after this list):
Enforce a post job for file merging.
Set hive.merge.mapfiles to true (the default value is already true).
Decrease hive.merge.smallfiles.avgsize to actually trigger the merging.
Increase hive.merge.size.per.task to be big enough as the target size after merging.
Configure split combining for mappers to cut down the number of mappers.
Make sure that hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
Then increase mapreduce.input.fileinputformat.split.maxsize to allow larger split size.
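A sketch of both approaches as session settings (the sizes are illustrative, not recommendations):
-- Approach 1: merge small output files in a post job
set hive.merge.mapfiles=true;                   -- already the default
set hive.merge.smallfiles.avgsize=16000000;     -- trigger merging when avg file size < ~16 MB
set hive.merge.size.per.task=256000000;         -- target ~256 MB files after merging
-- Approach 2: combine input splits so fewer mappers (and output files) are produced
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;    -- already the default
set mapreduce.input.fileinputformat.split.maxsize=1073741824;                 -- allow splits up to ~1 GB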
It is possible to enable the Fetch task in Hive for simple queries, instead of a Map or MapReduce job, using the hive.fetch.task.conversion parameter.
Please explain why the Fetch task runs so much faster than a map task, especially for simple work (for example, select * from table limit 10;). What is a map-only task doing additionally in this case? The performance difference is more than 20x in my case. Both tasks have to read the table data, don't they?
A FetchTask fetches the data directly, whereas the alternative is to launch a MapReduce job.
<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to a single FETCH task,
    minimizing latency. Currently the query should be single sourced,
    not having any subquery, and should not have any aggregations or
    distincts (which incur RS), lateral views or joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>
There is also another parameter, hive.fetch.task.conversion.threshold, which defaults to -1 in Hive 0.10-0.13 and to 1 GB (1073741824 bytes) in 0.14 and later.
This means that if the table size is greater than 1 GB, a MapReduce job is used instead of a Fetch task.
See the Hive documentation for more detail.
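As a small sketch (the threshold value below is just the documented default, restated explicitly):
-- Convert more kinds of simple queries to a direct fetch
set hive.fetch.task.conversion=more;
-- Fall back to MapReduce once the input exceeds ~1 GB
set hive.fetch.task.conversion.threshold=1073741824;
select * from emp limit 10;   -- served by a FetchTask, no MapReduce job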
I have a 2.6 MB CSV file. I created a Hive table and loaded the CSV file into it.
Now, if I write a query such as "select * from abc order by a;", MapReduce uses 1 reducer. How does it decide that the number of reducers is 1? Does it use the default value of 1, or something else?
In general, how does Hive decide how many reducers to use for an "order by", "sort by" or "group by" clause?
It goes by data size; the default is one reducer per 1 GB of input, regulated by this property:
hive.exec.reducers.bytes.per.reducer
If you want to have more reducers set it with this:
mapred.reduce.tasks
A full list of settings with explanations can be found here.
The number of reducers in Hive is calculated using the hive.exec.reducers.bytes.per.reducer property, whose default value is 1 GB (1000000000 bytes).
You can change the number of reducers by changing the above-mentioned property. Alternatively, you can set a constant number of reducers for a job with the property mapred.reduce.tasks.
// hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>xxxxxxx</value>
</property>
// console
$ hive -e "set hive.exec.reducers.bytes.per.reducer=xxxxxxx"
I am trying to load data into Hive RC and ORC file tables, but the number of reducers is always 0. I tried to set the reducers in Hive with set mapred.reducer.tasks=1, but it does not work. I found on the internet that the default size per reducer is 1 GB, so I tried to load 3 GB of data so there would be at least 2 reducers. What do I have to do to get a reduce operator?
I would need more information about the query to know for sure but my guess is that the query you are running is a map only job, thus not requiring any reducers. You can add a DISTRIBUTE BY statement to force Hadoop to use reducers. For example,
SELECT txn_id FROM table;
will be a map only job. You can force Hive to add a reduce step by adding this clause.
SELECT txn_id FROM table
DISTRIBUTE BY txn_id;
Try
set mapred.reduce.tasks=99;
set hive.exec.reducers.max=99;
However it is likely that your tasks do not require a reducer.
I have the following hive query:
select count(distinct id) as total from mytable;
which automatically spawns:
1408 Mappers
1 Reducer
I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but neither of these settings seems to be honored. The query takes forever to run. Is there a way to manually set the reducers, or maybe rewrite the query so it results in more reducers? Thanks!
Writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always result in using only one reducer.
You should:
use this command to set the desired number of reducers:
set mapred.reduce.tasks=50
rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This will result in 2 map+reduce jobs instead of one, but the performance gain will be substantial.
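Putting the two steps together (50 reducers is just an example count):
-- Spread the DISTINCT across 50 reducers in the first job; the second job does the final count
set mapred.reduce.tasks=50;
SELECT COUNT(*) FROM (SELECT DISTINCT id FROM mytable) t;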
The number of reducers also depends on the size of the input file.
By default it is 1GB (1000000000 bytes). You could change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>1000000</value>
</property>
or using set
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"
You could set the number of reducers spawned per node in the conf/mapred-site.xml config file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.
In particular, you need to set this property:
mapred.tasktracker.reduce.tasks.maximum
The number of mappers depends entirely on the number of input splits, which come from the file sizes. A split is nothing but a logical division of the data.
For example: my file size is 150 MB and my HDFS default block size is 128 MB. This creates two splits, meaning two blocks, and two mappers will be assigned for this job.
Important note: if I specify a split size of 50 MB instead, 3 mappers will start, because the count depends entirely on the number of splits.
Important note: if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
Note: if we haven't specified a split size, the default HDFS block size is used as the split size.
Reducer has 3 primary phases: shuffle, sort and reduce.
Commands:
1] Set map tasks: -D mapred.map.tasks=4
2] Set reduce tasks: -D mapred.reduce.tasks=2
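The same settings can be sketched from within the Hive CLI (the values are examples only; mapred.map.tasks is only a hint, since the real mapper count comes from the splits):
set mapred.map.tasks=4;
set mapred.reduce.tasks=2;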