How to determine file size in HDFS using Hive - hadoop

The workspace I am using is set up with Hive 1.1.0 and CDH 5.5.4. I run a query whose result is written into 22 partitions. The file saved in each partition directory is always a single file, and its size varies from 20MB to 700MB.
From what I understood, this is related to the number of reducers used while processing the query. Let's assume I want to have 5 files for each partition instead of 1, so I use this command:
set mapreduce.job.reduces=5;
This makes the system use 5 reduce tasks in stage 1, but it automatically switches to 1 reducer in stage 2 (determined automatically at compile time). From what I read, this is because the compiler takes precedence over the configuration when choosing the number of reducers. It seems that some tasks cannot be parallelized and can only be done by one process or reducer task, so the system determines that automatically.
Code :
insert into table core.pae_ind1 partition (project,ut,year,month)
select ts, date_time,
       -- m1
       if(code_ac_dcu_m1_d1=0
          and (min(case when code_ac_dcu_m1_d1=1 then ts end)
                 over (partition by ut order by ts rows between 1 following and 1000 following) - ts) <= 15,
          min(case when code_ac_dcu_m1_d1=1 then ts end)
            over (partition by ut order by ts rows between 1 following and 1000 following) - ts,
          NULL) as t_open_dcu_m1_d1,
       if(code_ac_dcu_m1_d1=2
          and (min(case when code_ac_dcu_m1_d1=3 then ts end)
                 over (partition by ut order by ts rows between 1 following and 1000 following) - ts) <= 15,
          min(case when code_ac_dcu_m1_d1=3 then ts end)
            over (partition by ut order by ts rows between 1 following and 1000 following) - ts,
          NULL) as t_close_dcu_m1_d1,
       project, ut, year, month
from core.pae_open_close
where ut='902'
order by ut, ts
This leads to huge files at the end. I would like to know if there is a way of splitting these result files into smaller ones (preferably limiting them by size).

As @DuduMarkovitz pointed out, your code contains an instruction to order the dataset globally. This will run on a single reducer. You are better off ordering later, when you select from the table: even if the files are in order after such an insert, and they are splittable, they will be read by many mappers, so the result will not be in order anyway due to that parallelism and you will still need to order at read time. Just get rid of the order by ut,ts in the insert and use these configuration settings to control the number of reducers:
set hive.exec.reducers.bytes.per.reducer=67108864;
set hive.exec.reducers.max = 2000; --default 1009
The number of reducers is determined according to:
mapred.reduce.tasks - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what the number of reducers should be.
hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.
Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
Each reducer will create one file per partition (not bigger than hive.exec.reducers.bytes.per.reducer). It is possible for one reducer to receive data for many partitions and, as a result, create many small files in each of those partitions, because during the shuffle phase each partition's data may be spread across many reducers.
If you do not want each reducer to write every partition (or too many of them), then distribute by the partition key instead of ordering. In that case the number of files per partition will be closer to partition_size/hive.exec.reducers.bytes.per.reducer; see the sketch below.
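For illustration, here is a sketch of the insert from the question rewritten this way; the table and column names come from the query above, while the byte threshold and reducer cap are just example values:
set hive.exec.reducers.bytes.per.reducer=67108864;  -- aim for files of roughly 64 MB
set hive.exec.reducers.max=2000;

insert into table core.pae_ind1 partition (project,ut,year,month)
select ts, date_time,
       -- ... the same t_open_dcu_m1_d1 and t_close_dcu_m1_d1 window expressions as above ...
       project, ut, year, month
from core.pae_open_close
where ut='902'
distribute by project, ut, year, month;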

Related

What will happen if the Hive number of reducers is different from the number of keys?

In Hive I often do queries like:
select columnA, sum(columnB) from ... group by ...
I read some MapReduce examples saying that one reducer can only produce one key. It seems the number of reducers completely depends on the number of keys in columnA.
Therefore, why does Hive let me set the number of reducers manually?
If there are 10 different values in columnA and I set the number of reducers to 2, what will happen? Will each reducer be reused 5 times?
If there are 10 different values in columnA and I set the number of reducers to 20, what will happen? Will Hive only generate 10 reducers?
Normally you should not set the exact number of reducers manually. Use bytes.per.reducer instead:
-- The number of reduce tasks is determined at compile time
-- The default size is 1G, so if the estimated input size is 10G then 10 reducers will be used
set hive.exec.reducers.bytes.per.reducer=67108864;
If you want to limit cluster usage by job reducers, you can set this property: hive.exec.reducers.max
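For example (the cap value here is only an illustration):
set hive.exec.reducers.max=100;  -- never start more than 100 reducers for this job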
If you are running on Tez, Hive can dynamically adjust the number of reducers at execution time if this property is set:
set hive.tez.auto.reducer.parallelism = true;
In this case the number of reducers initially started may be bigger, because it was estimated based on size; at runtime the extra reducers can be removed.
One reducer can process many keys; it depends on the data size and on the bytes.per.reducer and reducer-limit configuration settings. The same key will always go to the same reducer for a query like yours, because each reducer container runs in isolation and all rows having a particular key need to reach a single reducer so the aggregate for that key can be computed.
Extra reducers can be forced (mapreduce.job.reduces=N) or started automatically based on a wrong estimate (because of stale statistics). If they are not removed at run time, they will do nothing and finish quickly because there is nothing to process, but such reducers will still be scheduled and have containers allocated, so it is better not to force extra reducers and to keep statistics fresh for better estimation.
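Keeping statistics fresh is typically done with ANALYZE TABLE; a minimal sketch, assuming a hypothetical table t:
-- refresh basic and column statistics so Hive's size estimate (and hence the
-- reducer count) is based on current data rather than stale numbers
analyze table t compute statistics;
analyze table t compute statistics for columns;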

How many mappers and reducers will get created for a partitioned table in Hive

I am always confused about how many mappers and reducers will get created for a particular task in Hive.
E.g., if the block size is 128MB and there are 365 files, each mapping to a date in a year (file size = 1MB each), and the table is partitioned on the date column, how many mappers and reducers will run while loading the data?
Mappers:
The number of mappers depends on various factors, such as how the data is distributed among nodes, the input format, the execution engine and configuration parameters. See also here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
MR uses CombineInputFormat, while Tez uses grouped splits.
Tez:
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split
MapReduce:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
Also, mappers run on the data nodes where the data is located, which is why manually controlling the number of mappers is not an easy task; it is not always possible to combine the input.
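For the 365 x 1 MB files in the question, the relevant knob on the MapReduce side is the combine split size; a sketch with illustrative values (how many mappers you actually get still depends on how the files are spread across nodes and racks):
-- combine many small files into fewer splits (this input format is the default in recent Hive)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=268435456;  -- pack up to ~256 MB of small files per split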
Reducers:
Controlling the number of reducers is much easier.
The number of reducers is determined according to:
mapreduce.job.reduces - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what the number of reducers should be.
hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.
Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
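A rough worked example of that estimate (the numbers are illustrative, not taken from the question):
-- estimated input to the reducers: 10 GB, with bytes.per.reducer at the old 1 GB default
-- reducers = min(ceil(10 GB / 1 GB), hive.exec.reducers.max) = min(10, 1009) = 10
-- halving bytes.per.reducer roughly doubles the reducer count for the same input:
set hive.exec.reducers.bytes.per.reducer=536870912;  -- 512 MB per reducer -> about 20 reducers here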

Distributed by Clause in HIVE

I have a table with huge data, around 100TB.
When I query the table, I use the distribute by clause on a particular column (say X).
The table contains 200 distinct (unique) values of X.
So when I query the table with distribute by on X, the maximum number of reducers should be 200. But I am seeing it utilize the MAX number of reducers, i.e. 999.
Let me explain with an example.
Suppose the description of emp_table is as follows, with 3 columns:
1. emp_name
2. emp_ID
3. Group_ID
and Group_ID has 200 distinct values.
Now I want to query the table
select * from emp_table distribute by Group_ID;
This query should use 200 reducers as per the distribute by clause, but I am seeing 999 reducers getting utilized.
I am doing this as part of an optimization effort, so how can I make sure it utilizes 200 reducers?
The number of reducers in Hive is decided by either of two properties.
hive.exec.reducers.bytes.per.reducer - The default value is 1GB; this makes Hive create one reducer for each 1GB of the input table's size.
mapred.reduce.tasks - Takes an integer value, and that many reducers will be prepared for the job.
The distribute by clause doesn't play any role in deciding the number of reducers; all it does is distribute/partition the key values from the mappers to the prepared reducers based on the column given in the clause.
Consider setting mapred.reduce.tasks to 200; the distribute by will then take care of partitioning the key values evenly across the 200 reducers, as sketched below.
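A sketch of that suggestion against the question's hypothetical emp_table:
set mapred.reduce.tasks=200;  -- one reducer per distinct Group_ID value (assumed to be 200)
select * from emp_table distribute by Group_ID;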
The number of reducers in Hive depends on the size of your input files. But if the output of the mappers contains only 200 groups, then I guess most of the reducers will receive nothing.
If you really want to control the number of reducers, setting mapred.reduce.tasks will help.

Relationship between a HIVE query and the number of mapreducers provided by Hadoop?

I am executing a query in HIVE shell as
SELECT tradeId, bookid, foid from trades where bookid='"ABCDEFG"'
The table "trades" has index on bookid. When the query runs, it shows the details of Mappers and Reducers as follows :-
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for Stage-1: number of mappers: 48; number of reducers: 0
Time taken: 606.183 seconds, Fetched: 18 row(s)
As you can see, it took an enormous amount of time to fetch just 18 rows. My question is: what am I doing wrong here? Should the number of reducers be non-zero? Will it help if I set it using
set mapred.reduce.tasks = some_number
Shouldn't the index help retrieve the data faster?
When you are doing a simple select, all the filtering and column selection is done by the mappers themselves. There is no purpose for a reduce task here, hence the number of reducers is zero, which is fine. You probably have around 48 * block size worth of data in your table, so it spawned 48 mappers. How many map slots per DataNode do you have, and how many of them were free when you fired your query? Chances are that not all 48 of them ran in parallel. Although the query returned only 18 rows, it read the full table. Is your table bucketed and clustered on the bookid column? In that case you may use the TABLESAMPLE clause to make it read only the bucket that contains your ABCDEFG value; see the sketch below.
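A sketch of the bucketing idea mentioned above; the table layout, bucket count and storage format are assumptions, and the bucket number in TABLESAMPLE has to be the one that 'ABCDEFG' actually hashes to, which you would work out from the bucketing hash:
-- hypothetical bucketed copy of the trades table, clustered on the filter column
create table trades_bucketed (tradeId string, bookid string, foid string)
clustered by (bookid) into 48 buckets
stored as orc;

set hive.enforce.bucketing=true;  -- needed on older Hive so the insert honours the buckets
insert into table trades_bucketed
select tradeId, bookid, foid from trades;

-- scan a single bucket instead of the whole table; replace 7 with the bucket
-- that bookid='ABCDEFG' maps to
select tradeId, bookid, foid
from trades_bucketed tablesample (bucket 7 out of 48 on bookid)
where bookid='ABCDEFG';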

How does Hive choose the number of reducers for a job?

Several places say the default number of reducers in a Hadoop job is 1. You can use the mapred.reduce.tasks property to manually set the number of reducers.
When I run a Hive job (on Amazon EMR, AMI 2.3.3), it has some number of reducers greater than one. Looking at job settings, something has set mapred.reduce.tasks, I presume Hive. How does it choose that number?
Note: here are some messages while running a Hive job that should be a clue:
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...
The default of 1 may be for a vanilla Hadoop install. Hive overrides it.
In open source Hive (and likely EMR):
# reducers = (# bytes of input to mappers) / (hive.exec.reducers.bytes.per.reducer)
This post says default hive.exec.reducers.bytes.per.reducer is 1G.
You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max.
If you know exactly the number of reducers you want, you can set mapred.reduce.tasks, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
In some cases - say 'select count(1) from T' - Hive will set the number of reducers to 1, irrespective of the size of the input data. These are called 'full aggregates', and if the only thing the query does is full aggregates, the compiler knows that the data coming from the mappers will be reduced to a trivial amount and there is no point running multiple reducers.
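For instance (the table name is hypothetical):
-- full aggregate: the compiler forces a single reducer regardless of the settings above
select count(1) from T;
-- grouped aggregate: the size-based reducer heuristics apply again
select colA, count(1) from T group by colA;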
