Hive query generating multiple small files - hadoop

I have two Hive tables as sources, say:
DEV.INPUT_01
DEV.INPUT_02
I have one more table, DEV.TARGET, which I want to load with data from the two input tables above. The HQL I have is:
insert overwrite table DEV.TARGET partition(c30)
select
c1
,c2
,c3
,c4
,c5
,c6
,c7
,c8
,c9
,c10
,c11
,c12
,c13
,c14
,c15
,c16
,c17
,c18
,c19
,c20
,c21
,c22
,c23
,c24
,c25
,c26
,c27
,c28
,c29
,c30
from
DEV.SOURCE_01 t1 left join
DEV.SOURCE_02 t2 on
t1.tab_id = t2.tab_id;
The query works fine. It runs with 700 mappers and 400 reducers.
The problem is that the query generates 400 files per partition, and each file is only around 200K.
I have tried several parameter combinations:
Setting 1:
set hive.exec.reducers.bytes.per.reducer=256000000;
Result 1: The number of reducers decreased to 100, and hence 100 files were generated per partition.
Setting 2:
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
Result 2: This setting launched 2 MR steps, and the result was the same.
Setting 3:
set mapred.reduce.tasks=40;
Result 3:
The number of files is reduced to 40 (which is expected).
Query performance degraded threefold (the original query took 20 mins; with this setting it took 55 mins).
The other problem with this setting is data size: as the data grows, a fixed reducer count will degrade performance further and become tough to manage.
Question: How can I generate files of size 128M?
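For scale, the figures quoted above can be checked with a little arithmetic. The sketch below is illustrative Python, not Hive code. It assumes the files are about 200 KiB each and that 128M means 128 MiB; the helper name files_needed is made up for this example.

```python
MB = 1024 * 1024

def files_needed(partition_bytes, target_file_bytes):
    """Smallest number of files of at most target_file_bytes that hold the partition."""
    return max(1, -(-partition_bytes // target_file_bytes))  # ceiling division

# 400 reducer output files of roughly 200 KiB each (assumed from the question)
partition_bytes = 400 * 200 * 1024
print(partition_bytes / MB)                     # 78.125
print(files_needed(partition_bytes, 128 * MB))  # 1
```

Under these assumptions each partition holds well under 128M in total, so the realistic goal is one file per partition rather than files of exactly 128M.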

I don't think you can generate files of a specific size as Hive output.
However, you can get part of the way there with partitioning.
This SO question has an answer explaining how to split data across files:
Hive -- split data across files

Please set the following properties:
set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.compute.query.using.stats=true;
set stats.reliable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.exec.compress.output=true;
I tried to find out exactly which combination of settings made the difference, but it only worked for me with all of them set together.

If you want to reduce the number of files per partition in HDFS, you need to limit the block size through the Hive parameters. For instance, if the block size in the cluster is configured as 128M:
SET dfs.blocksize=134217728;
(The number above is 128 MB expressed in bytes: 128 x 1024 x 1024.)
With that you will sort out the small partition file issue.

Related

Why is hive writing 2 part files to hdfs even though number of mappers and reducers is set to 1

I have a Hive insert overwrite query:
set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;
When I inspect the HDFS directory for staging.table1, I see that there are 2 part files created.
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0
Why is it that 2 files are created?
I am using beeline client and hive 2.1.1-cdh6.3.1
The insert query you executed is map-only, which means there is no reduce task, so there is no point in setting mapred.reduce.tasks.
Also, the number of mappers is determined by the number of input splits, so setting mapred.map.tasks won't change the mapper parallelism.
There are at least two feasible ways to force the resulting number of files to be 1:
Option 1: Enforce a post-job file merge.
Set hive.merge.mapfiles to true (this is already the default).
Raise hive.merge.smallfiles.avgsize above the average output file size so that the merge is actually triggered.
Increase hive.merge.size.per.task so it is at least as large as the target file size after merging.
Option 2: Have the mappers combine input files, cutting down the number of mappers.
Make sure that hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
Then increase mapreduce.input.fileinputformat.split.maxsize to allow larger splits.
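To see why a larger maximum split size means fewer mappers (and so fewer output files in a map-only job), here is a deliberately simplified Python model of combining whole files into splits. It is only a sketch: the real CombineHiveInputFormat also takes node and rack locality into account and can break large files into blocks, and the file sizes below are hypothetical.

```python
MB = 1024 * 1024

def combine_splits(file_sizes, max_split_size):
    """Greedily pack whole files into splits of at most max_split_size bytes.

    Simplification: a single file larger than max_split_size is kept whole.
    """
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new split if this file would push the current one over the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

files = [100 * MB, 100 * MB]                   # hypothetical input files
print(len(combine_splits(files, 128 * MB)))    # 2 splits, so 2 mappers
print(len(combine_splits(files, 256 * MB)))    # 1 split, so 1 mapper
```

In this model the number of mappers equals the number of splits, so raising the limit until all input fits in one split yields a single mapper.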

Hive map only job/stages creating multiple zero byte file

I have a Hive query with multiple joins, and therefore multiple stages. In some scenarios the query produces no output. In those cases the job completes at an intermediate stage where the number of mappers is N and the number of reducers is 0 (no reducer), which creates N zero-byte files.
I tried the following settings:
set hive.merge.mapfiles=true
set hive.merge.mapredfiles=true
set hive.merge.smallfiles.avgfilesize=128000000
set hive.merge.size.per.task=256000000
If there are some records in the output, we get the expected result as per the settings.
Basically, the problem occurs when a map-only job/stage outputs no records.
I get a single 0-byte output file if the number of reducers is set to 1 (so that every stage of the query uses a single reducer) or if compressed output is set to true. Either way, a 0-byte file is still there.
A solution would be appreciated. Thanks in advance.

hive compaction using insert overwrite partition

I am trying to address the small files problem by compacting the files under Hive partitions with the INSERT OVERWRITE PARTITION command in Hadoop.
Query :
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1,col2,col3 from tbl1
WHERE year=2016 and month=03 and day=11;
Input Files:
For testing purposes I have three files under the Hive partition (2016/03/11) in HDFS, each 40 MB in size.
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
My block size is 128 MB, so I would like to create only one output file, but I am getting 3 different compressed files.
Please help me find the Hive configuration that restricts the output file size. If I do not use compression, I get a single file.
Hive Version : 1.1
It's interesting that you still get 3 files when specifying the partition with compression enabled, so you may want to look into dynamic partitioning, or drop the partitioning and focus on the number of mappers and reducers created by your job. If your files are small, I can see why you would want them all in one file on your target, but then I would also question the need to compress them.
The number of files created in your target is directly tied to the number of reducers or mappers. If the SQL you write needs a reduce phase, the number of files created will equal the number of reducers used in the job. This can be controlled by setting the number of reducers:
set mapred.reduce.tasks = 1;
In your example SQL there most likely wouldn't be any reducers used, so the number of files in the target is equal to the number of mappers used which is equal to the number of files in the source. It isn't as easy to control the number of output files on a map only job but there are a number of configuration settings that can be tried.
This setting combines small input files so that fewer mappers are spawned; the default is false.
set hive.hadoop.supports.splittable.combineinputformat = true;
Try setting a threshold in bytes for the input files. Hive will try to convert joins on tables under this threshold into map joins, which can affect the number of output files.
set hive.mapjoin.smalltable.filesize = 25000000;
As for the compression, I would try changing the compression type just to see whether that makes any difference in your output. For ORC tables, for example (SNAPPY shown here; ZLIB and NONE are the other documented values):
set hive.exec.orc.default.compress = SNAPPY;

Google cloud storage - Tez output files

When I run a query using Tez, the number of output files is very large. I have some 4-5 GB of data in files of 46 MB or 16 MB each. I want to have only 2-3 output files.
My output location is Google Cloud Storage. How do I merge the files?
set mapred.reduce.tasks = 1;
set hive.merge.mapfiles = true;
set hive.mergejob.maponly = true;
set hive.merge.mapredfiles=true;
I set these parameters and wrote an insert overwrite query to overwrite the data in the same location, but it made no difference. Please help.
I was able to get this done. Earlier, when I was doing this, it was a map-only job. Now I have changed the query a bit so that it also uses a reducer (I added DISTRIBUTE BY). With that, setting the number of reducers to 1 works. The other parameters, which should work for a map-only job, still have no effect.

Hive unable to manually set number of reducers

I have the following hive query:
select count(distinct id) as total from mytable;
which automatically spawns:
1408 Mappers
1 Reducer
I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but none of these settings seem to be honored. The query takes forever to run. Is there a way to manually set the reducers or maybe rewrite the query so it can result in more reducers? Thanks!
Writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always result in using only one reducer.
You should:
use this command to set the desired number of reducers:
set mapred.reduce.tasks=50
rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This will result in 2 map+reduce jobs instead of one, but the performance gain will be substantial.
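The reason the rewrite parallelizes can be illustrated with a small Python model (not Hive code; the function name and the hash-based routing are assumptions made for illustration). In the first stage each id is routed to exactly one of several reducers, which deduplicates its own share; the second stage only has to add up the per-reducer counts.

```python
def two_stage_distinct_count(ids, num_reducers):
    """Count distinct ids by deduplicating in parallel buckets, then summing."""
    # Stage 1: every occurrence of an id lands in the same bucket,
    # so each bucket can deduplicate independently.
    buckets = [set() for _ in range(num_reducers)]
    for value in ids:
        buckets[hash(value) % num_reducers].add(value)
    # Stage 2: a single cheap step sums the partial counts.
    return sum(len(bucket) for bucket in buckets)

print(two_stage_distinct_count([1, 2, 2, 3, 3, 3], num_reducers=4))  # 3
```

Because no id can appear in two buckets, the sum of the partial counts equals the overall distinct count.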
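The number of reducers also depends on the size of the input.
By default, Hive allots 1GB (1000000000 bytes) of input per reducer (this is the default in Hive releases before 0.14; later releases default to 256 MB). You can change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml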
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>1000000</value>
</property>
or using set
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"
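How this property translates into a reducer count can be sketched in Python. The estimate below follows the documented rule of dividing the input size by the bytes per reducer and capping the result at hive.exec.reducers.max; it only applies when mapred.reduce.tasks is not set explicitly. The input size and the cap of 999 are hypothetical values chosen for the example.

```python
def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Approximate reducer count: ceil(input / bytes_per_reducer), capped, at least 1."""
    wanted = -(-input_bytes // bytes_per_reducer)  # ceiling division
    return max(1, min(max_reducers, wanted))

input_bytes = 10_000_000_000  # hypothetical 10 GB of input

# 1 GB per reducer
print(estimate_reducers(input_bytes, 1_000_000_000, max_reducers=999))  # 10
# 1 MB per reducer, as in the example value above; the cap takes over
print(estimate_reducers(input_bytes, 1_000_000, max_reducers=999))      # 999
```

A very small bytes-per-reducer value therefore does not give unlimited parallelism: the cap decides the final count.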
You could set the number of reducers spawned per node in the conf/mapred-site.xml config file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.
In particular, you need to set this property:
mapred.tasktracker.reduce.tasks.maximum
The number of mappers depends entirely on the number of input splits, which is determined by the number and size of the input files. A split is nothing but a logical division of the data.
Ex: my file size is 150MB and my HDFS default block size is 128MB. This creates two splits (two blocks), so two mappers are assigned to the job.
Imp Note: If I specify a split size of 50MB instead, it will start 3 mappers, because the mapper count depends entirely on the number of splits.
Imp Note: If you expect 10TB of input data and have a block size of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
Note: If we haven't specified the split size, the default HDFS block size is used as the split size.
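The figures in these notes can be reproduced with a short Python calculation. This is only a sketch of the arithmetic: it treats the split count as the file size divided by the split size, rounded up, and ignores details such as the small slack the input format allows on the final split.

```python
MB = 1024 ** 2
TB = 1024 ** 4

def num_splits(file_bytes, split_bytes):
    """Number of splits (and hence mappers) for one file: ceil(file / split)."""
    return -(-file_bytes // split_bytes)  # ceiling division

print(num_splits(150 * MB, 128 * MB))  # 2
print(num_splits(150 * MB, 50 * MB))   # 3
print(num_splits(10 * TB, 128 * MB))   # 81920, i.e. roughly 82,000 maps
```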
Reducer has 3 primary phases: shuffle, sort and reduce.
Command :
1] Set Map Task : -D mapred.map.tasks=4
2] Set Reduce Task : -D mapred.reduce.tasks=2

Resources