I am trying to address the small files problem by compacting the files under Hive partitions with an INSERT OVERWRITE PARTITION statement in Hadoop.
Query :
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1,col2,col3 from tbl1
WHERE year=2016 and month=03 and day=11;
Input Files:
For testing purposes I have three files under the Hive partition (2016/03/11) in HDFS, each 40 MB in size.
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
My block size is 128 MB, so I would like to end up with only one output file, but I am getting 3 different compressed files.
Please help me find the Hive configuration that controls the output file size. If I do not use compression, I get a single file.
Hive Version : 1.1
It's interesting that you are still getting 3 files when specifying the partition and using compression, so you may want to look into dynamic partitioning, or ditch the partitioning and focus on the number of mappers and reducers being created by your job. If your files are small, I can see how you would want them all in one file on your target, but then I would also question the need for compression on them.
The number of files created in your target is directly tied to the number of reducers or mappers. If the SQL you write needs to reduce then the number of files created will be the same as the number of reducers used in the job. This can be controlled by setting the number of reducers used in the job.
set mapred.reduce.tasks = 1;
In your example SQL there most likely wouldn't be any reducers used, so the number of files in the target is equal to the number of mappers used, which is equal to the number of files in the source. It isn't as easy to control the number of output files for a map-only job, but there are a number of configuration settings that can be tried.
This setting combines small input files so fewer mappers are spawned; the default is false.
set hive.hadoop.supports.splittable.combineinputformat = true;
Try setting a threshold in bytes for the input files; anything under this threshold is a candidate for conversion to a map join, which can affect the number of output files.
set hive.mapjoin.smalltable.filesize = 25000000;
As for the compression I would play with changing the type of compression being used just to see if that makes any difference in your output.
set hive.exec.orc.default.compress = SNAPPY, ZLIB, etc...
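For example, these suggestions could be combined in one session with the original query (a minimal sketch; the table and columns come from the question above, and the property values are only illustrative):
set hive.hadoop.supports.splittable.combineinputformat=true;
set mapred.reduce.tasks=1;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11;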
Related
I have a hive insert overwrite query:
set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;
When I inspect the HDFS directory for staging.table1, I see that there are 2 part files created.
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0
Why is it that 2 files are created?
I am using beeline client and hive 2.1.1-cdh6.3.1
The insert query you executed is map-only, which means there is no reduce task, so there is no point in setting mapred.reduce.tasks.
Also, the number of mappers is determined by the number of splits, so setting mapred.map.tasks won't change the parallelism of the mappers.
There are at least two feasible ways to force the resulting number of files to be 1 (both are sketched after this list):
Enforcing a post job for file merging.
Set hive.merge.mapfiles to be true. Well, the default value is already true.
Increase hive.merge.smallfiles.avgsize so that it is larger than the average size of your output files; the merge is only triggered when the average output file size falls below this value.
Increase hive.merge.size.per.task to be big enough to serve as the target size after merging.
Configuring how input files are combined into splits, to cut down the number of mappers.
Make sure that hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
Then increase mapreduce.input.fileinputformat.split.maxsize to allow larger split size.
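A combined sketch of both options, with illustrative values (the property names are the ones discussed above, applied to the insert from the question):
-- Option 1: post job that merges small output files
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=256000000;
set hive.merge.size.per.task=256000000;
-- Option 2: combine input splits so fewer mappers (and output files) are produced
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=256000000;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;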
I have a Hive query with multiple joins, and therefore multiple stages. In some scenarios the query produces no output at all. In those scenarios the job completes at an intermediate stage where the number of mappers is N and the number of reducers is 0 (no reducer), which creates N zero-byte files.
I tried providing the following settings:
set hive.merge.mapfiles=true
set hive.merge.mapredfiles=true
set hive.merge.smallfiles.avgsize=128000000
set hive.merge.size.per.task=256000000
If there are records in the output, we get the expected output as per the settings.
Basically this happens when a map-only job/stage outputs no records.
I get the 0-byte output as a single file if the number of reducers is configured to be 1 (so every stage of the query uses a single reducer) or if compressed output is set to true; either way, a 0-byte file is still there.
A solution would be appreciated. Thanks in advance
I have 2 Hive tables as source, say:
DEV.SOURCE_01
DEV.SOURCE_02
I have one more table, DEV.TARGET. I want to load data into this table from the above two source tables. The HQL I have is:
insert overwrite table DEV.TARGET partition(c30)
select
c1
,c2
,c3
,c4
,c5
,c6
,c7
,c8
,c9
,c10
,c11
,c12
,c13
,c14
,c15
,c16
,c17
,c18
,c19
,c20
,c21
,c22
,c23
,c24
,c25
,c26
,c27
,c28
,c29
,c30
from
DEV.SOURCE_01 t1 left join
DEV.SOURCE_02 t2 on
t1.tab_id = t2.tab_id;
The query works fine. The number of mappers is 700 and the number of reducers is 400.
The problem is that the above query generates 400 files per partition, and the size of every file is around 200 KB.
I have tried multiple parameter combinations:
Setting 1:
set hive.exec.reducers.bytes.per.reducer=256000000;
Result 1: The number of reducers decreased to 100, and hence 100 files were generated per partition.
Setting 2
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
Result 2: The above settings launched 2 MR steps, but the result is the same.
Setting 3
set mapred.reduce.tasks=40;
Result 3:
The number of files is reduced to 40 (which is expected).
Query performance degraded roughly 3-fold (the original query took 20 minutes; with this setting it took 55 minutes).
The other problem is data volume: as the data grows, this setting degrades further and hence will be tough to manage.
Question: How can I generate files of size 128 MB?
I don't think you can generate files of a specific size as Hive output.
However, you can achieve part of it with partitioning.
This SO question has an answer explaining how to split data across files:
Hive -- split data across files
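One common approach along those lines (a sketch, not necessarily the exact technique in the linked answer) is to add a DISTRIBUTE BY on the partition column, so that all rows for a given partition land on a single reducer and each partition is written as one file rather than 400; the table and column names below come from the question:
-- All rows with the same c30 value go to a single reducer, so each partition is written as one file:
insert overwrite table DEV.TARGET partition(c30)
select c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30
from DEV.SOURCE_01 t1
left join DEV.SOURCE_02 t2 on t1.tab_id = t2.tab_id
distribute by c30;
This trades file count for per-partition write parallelism, since one reducer writes each partition.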
Please set the following properties
set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.exec.compress.output=true;
I tried to find exactly which combination of settings worked for me, but only all of them together worked.
If you want to reduce the number of files per partition in HDFS, you need to limit the block size with Hive parameters. For instance, if the block size in the cluster is configured to 128 MB:
SET dfs.blocksize=134217728;
(The number above is 128 MB expressed in bytes.)
With that you should sort out the small partition file issue.
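For reference, the value is simply the target size in bytes: 128 * 1024 * 1024 = 134217728 for 128 MB, and a 256 MB target would be set the same way (an illustrative sketch):
SET dfs.blocksize=268435456;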
I want each Hadoop mapper to process a separate portion of the data in an M/R job, and I would like to test, on a pseudo-distributed (single-node) setup, the case where many mappers would be needed as a result of a bigger input data size. Given the size of my current input and the single-node mode I am experimenting on, I can only see 1 map task.
My input comes from an HBase table, and I thought that the number of regions per HBase table is equal to the number of mappers used to process the table's data.
So, to reproduce a case where many mappers would process the input data, I predefined the regions of the table through the shell like this:
create 't1', 'f1', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
or with 'UniformSplit' as the SPLITALGO (shown below). But even though the mappers do increase to the specified number of regions (after importing data into the respective table), all the input data (in a subsequent test job where I try to read from this table) passes through only one mapper, with the others processing none of the input rows.
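For completeness, the 'UniformSplit' variant mentioned above uses the same shell syntax; only the split algorithm changes:
create 't1', 'f1', {NUMREGIONS => 4, SPLITALGO => 'UniformSplit'}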
I work on a pseudo-distributed (single-node) setup and I really don't know how to solve this. Does anyone have any ideas? Thanks!
Are you scanning the entire table or just a section of it? If you are scanning a section of the table, then that might be the cause of your problem as your data source isn't big enough to trigger multiple mappers.
You can try to decrease the region size in your hbase-site.xml configuration and restart HBase to achieve the desired effect.
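A minimal sketch of what that could look like, assuming the standard hbase.hregion.max.filesize property is the one to lower (the value shown is only illustrative):
<property>
<name>hbase.hregion.max.filesize</name>
<value>134217728</value>
</property>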
Lastly, in your mapred-site.xml configuration, how many mapper slots do you have? If it is just 1, this will not limit the number of map tasks in a job, but it will limit the number of map tasks that can run at a time on that server.
Other than that, I don't think you have much control over specifying the number of mappers per job, not like you do with the number of reducers.
I have the following hive query:
select count(distinct id) as total from mytable;
which automatically spawns:
1408 Mappers
1 Reducer
I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but none of these settings seem to be honored. The query takes forever to run. Is there a way to manually set the reducers or maybe rewrite the query so it can result in more reducers? Thanks!
Writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always result in using only one reducer.
You should:
use this command to set desired number of reducers:
set mapred.reduce.tasks=50
rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This will result in 2 map+reduce jobs instead of one, but performance gain will be substantial.
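Putting the two together for the query in the question (table and column names are taken from the question; the reducer count is just illustrative):
set mapred.reduce.tasks=50;
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM mytable ) t;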
The number of reducers also depends on the size of the input data.
By default each reducer handles 1 GB (1,000,000,000 bytes). You can change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>1000000</value>
</property>
or using set
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"
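As a rough worked example (the exact estimation formula can vary by Hive version): if roughly 50 GB of data feeds the reduce stage and hive.exec.reducers.bytes.per.reducer is left at 1 GB, Hive estimates about 50 reducers; lowering the property to 100 MB pushes the estimate toward 500, capped by hive.exec.reducers.max.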
You could set the number of reducers spawned per node in the conf/mapred-site.xml config file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.
In particular, you need to set this property:
mapred.tasktracker.reduce.tasks.maximum
The number of mappers depends entirely on the number and size of the input files, i.e., on the input splits. A split is nothing but a logical division of the data.
Example: my file size is 150 MB and my HDFS default block size is 128 MB. This creates two splits, meaning two blocks, so two mappers get assigned for this job.
Important note: suppose I specify the split size as 50 MB; then it will start 3 mappers, because the count depends entirely on the number of splits.
Important note: if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with about 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
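As a quick check of that figure: 10 TB is 10 × 1024 × 1024 = 10,485,760 MB, and 10,485,760 / 128 ≈ 81,920, i.e. roughly the 82,000 maps quoted.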
Note: if we haven't specified a split size, the default HDFS block size is taken as the split size.
Reducer has 3 primary phases: shuffle, sort and reduce.
Command :
1] Set Map Task : -D mapred.map.tasks=4
2] Set Reduce Task : -D mapred.reduce.tasks=2
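For context, these -D flags are passed on the command line when submitting the job, and they are only honored if the driver uses ToolRunner/GenericOptionsParser; the jar name, driver class, and paths below are hypothetical:
hadoop jar my-job.jar com.example.MyDriver -D mapred.map.tasks=4 -D mapred.reduce.tasks=2 /input/path /output/path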