Google Cloud Storage - Tez output files - Hadoop

When I run a query using Tez, the number of output files is very large. I have about 4-5 GB of data split into files of roughly 46 MB or 16 MB each, and I want only 2-3 output files.
My output file location will be Google Cloud Storage. How do I merge the files?
set mapred.reduce.tasks = 1;
set hive.merge.mapfiles = true;
set hive.mergejob.maponly = true;
set hive.merge.mapredfiles=true;
I set these parameters and ran an INSERT OVERWRITE query to rewrite the data in the same location, but it made no difference. Please help.

I was able to get this done. Earlier the query ran as a map-only job. I have now changed the query a bit so it also uses a reducer (by adding DISTRIBUTE BY), and with the number of reducers set to 1 it works. The other parameters, which should work for a map-only job, still have no effect. A sketch of the approach is below.
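A minimal sketch of that working approach (the table name my_table and the column part_col are placeholders of mine, not names from the question):
set mapred.reduce.tasks = 1;
INSERT OVERWRITE TABLE my_table
SELECT col1, col2, col3
FROM my_table
DISTRIBUTE BY part_col;  -- DISTRIBUTE BY forces a reduce stage, so the single reducer writes one output file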

Related

hive compaction using insert overwrite partition

I am trying to address the small-files problem by compacting the files under Hive partitions with an INSERT OVERWRITE ... PARTITION command in Hadoop.
Query:
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1,col2,col3 from tbl1
WHERE year=2016 and month=03 and day=11;
Input Files:
For testing purposes I have three files under the Hive partition (2016/03/11) in HDFS, each about 40 MB in size.
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
For example, my block size is 128 MB, so I would like to create only one output file. But I am getting 3 different compressed files.
Please help me find the Hive configuration to control the output file size. If I do not use compression, I get a single file.
Hive Version : 1.1
It's interesting that you are still getting 3 files when you specify the partition with compression enabled, so you may want to look into dynamic partitioning, or ditch the partitioning and focus on the number of mappers and reducers being created by your job. If your files are small, I can see how you would want them all in one file on your target, but then I would also question the need to compress them.
The number of files created in your target is directly tied to the number of reducers or mappers. If the SQL you write needs a reduce stage, then the number of files created will equal the number of reducers used in the job, which you can control directly:
set mapred.reduce.tasks = 1;
In your example SQL there most likely wouldn't be any reducers used, so the number of files in the target equals the number of mappers used, which in turn equals the number of files in the source. It isn't as easy to control the number of output files in a map-only job, but there are several configuration settings that can be tried.
This setting combines small input files so fewer mappers are spawned (the default is false):
set hive.hadoop.supports.splittable.combineinputformat = true;
Try setting a threshold in bytes for the small-table input; anything under this threshold will attempt to be converted to a map join, which can affect the number of output files:
set hive.mapjoin.smalltable.filesize = 25000000;
As for the compression, I would experiment with the type of compression being used just to see whether that makes any difference in your output:
set hive.exec.orc.default.compress = SNAPPY;  -- or ZLIB, NONE, etc., if the output table is stored as ORC
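Putting the merge-related suggestions together for your compaction query, a rough sketch (the byte values are illustrative, not tuned recommendations) could look like this:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=268435456;       -- target size of the merged files, ~256 MB
set hive.merge.smallfiles.avgsize=134217728;  -- run a merge pass when the average output file is under ~128 MB
set hive.hadoop.supports.splittable.combineinputformat=true;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11;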

Hive query generating multiple small files

I have 2 Hive tables as sources, say:
DEV.INPUT_01
DEV.INPUT_02
I have 1 more table, DEV.TARGET. I want to load data into this table from the above 2 input tables. The HQL I have is:
insert overwrite table DEV.TARGET partition(c30)
select
c1
,c2
,c3
,c4
,c5
,c6
,c7
,c8
,c9
,c10
,c11
,c12
,c13
,c14
,c15
,c16
,c17
,c18
,c19
,c20
,c21
,c22
,c23
,c24
,c25
,c26
,c27
,c28
,c29
,c30
from
DEV.SOURCE_01 t1 left join
DEV.SOURCE_02 t2 on
t1.tab_id = t2.tab_id;
The query works fine; the number of mappers is 700 and the number of reducers is 400.
The problem is that the above query generates 400 files per partition, and the size of every file is around 200 KB.
I have tried multiple parameter combinations:
Setting 1:
set hive.exec.reducers.bytes.per.reducer=256000000;
Result 1: The number of reducers decreased to 100, and hence 100 files were generated per partition.
Setting 2:
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
Result 2: The above settings launched 2 MR stages, but the result was the same.
Setting 3:
set mapred.reduce.tasks=40;
Result 3: The number of files is reduced to 40 (which is expected), but query performance degraded roughly 3-fold (the original query took 20 minutes; with this setting it took 55 minutes).
Another problem with this setting is data size: as the data grows, performance degrades further, so it will be tough to manage.
Question: How can I generate files of size 128 MB?
I don't think you can generate files of a specific size as Hive output.
However, you can achieve part of it with partitioning.
This SO question has an answer explaining how to split data across files:
Hive -- split data across files
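As a rough sketch of one common workaround (this is my assumption, not the linked answer verbatim): size the reducers to roughly the file size you want and send each partition's rows to the same reducer so it writes fewer, larger files.
set hive.exec.reducers.bytes.per.reducer=134217728;  -- aim for roughly 128 MB of data per reducer
insert overwrite table DEV.TARGET partition(c30)
select c1, c2, c3, c30                               -- abbreviated; list all 30 columns as in the original query
from DEV.SOURCE_01 t1 left join DEV.SOURCE_02 t2 on t1.tab_id = t2.tab_id
distribute by c30;                                   -- rows for each c30 value go to the same reducer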
Please set the following properties
set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.exec.compress.output=true;
I tried to find exactly which combination of settings made the difference, but only all of them together worked for me.
If you want to reduce the number of partition files in HDFS, you need to limit the block size with the Hive parameters. For instance, if the block size in the cluster is configured to 128 MB:
SET dfs.blocksize=134217728;
(the number above is 128 MB expressed in bytes)
With that you should sort out the small partition-file issue.

Fail to Increase Hive Mapper Tasks?

I have a managed Hive table that contains only one 150 MB file. I then run "select count(*) from tbl" on it, and it uses 2 mappers. I want to set this to a bigger number.
First I tried 'set mapred.max.split.size=8388608;', hoping it would use 19 mappers, but it only used 3. Somehow it still split the input at 64 MB. I also tried 'set dfs.block.size=8388608;', which did not work either.
Then I tried a vanilla map-reduce job to do the same thing. It initially uses 3 mappers, and when I set mapred.max.split.size, it uses 19. So the problem lies in Hive, I suppose.
I read some of the Hive source code, like CombineHiveInputFormat, ExecDriver, etc., but can't find a clue.
What other settings can I use?
I combined javadba's answer with what I received from the Hive mailing list; here's the solution:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks = 20;
select count(*) from dw_stage.st_dw_marketing_touch_pi_metrics_basic;
From the mailing list:
It seems that Hive is using the old Hadoop MapReduce API, so mapred.max.split.size won't work.
I will dig into the source code later.
Try adding the following:
set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
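A hedged sketch combining the two answers (8388608 is the 8 MB split size from the question; 150 MB / 8 MB gives roughly 19 splits):
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.merge.mapfiles=false;
set mapred.max.split.size=8388608;  -- 8 MB splits, so the 150 MB file yields about 19 mappers
set mapred.map.tasks=20;            -- a hint to the InputFormat; the split size is what actually takes effect
select count(*) from tbl;           -- tbl is the placeholder table name from the question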

Increasing the number of mappers

I am new to PIG and HDFS. Here is what I am trying to do.
I have a lot of flat-text, LZO-compressed, poorly formatted server log files - about 2 GB each, generated by around 400 servers daily.
I am trying to take advantage of MapReduce to format and clean up the data in HDFS using my Java formatter, and then load the output into Hive.
My problem is that my Pig script spawns only one mapper, which takes around 15 minutes to read the file sequentially. This is not practical for the amount of data I have to load into Hive daily.
Here is my Pig script:
SET default_parallel 100;
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
SET mapred.min.split.size 256000;
SET mapred.max.split.size 256000;
SET pig.noSplitCombination true;
SET mapred.max.jobs.per.node 1;
register file:/apps/pig/pacudf.jar;
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING PigStorage() as (field1);
pac = foreach raw1 generate pacudf.filegenerator(field1);
store pac into '/data/bazooka/';
It looks like the mapred.min.split.size setting isn't working. I can see only 1 mapper being started, which processes the whole 2 GB file on a single server of the cluster. Since we have a 100-node cluster, I was wondering whether I could make use of more servers in parallel if I could spawn more mappers.
Thanks in advance
Compression support in PigStorage does not provide splitting ability. For splittable LZO support in Pig you need the elephant-bird library from Twitter. Also, to get splitting to work properly with existing plain LZO files, you need to index them before loading them in your Pig script. A rough sketch is below.
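A sketch of that approach (the jar paths are placeholders of mine; LzoIndexer and LzoTextLoader are the classes shipped with hadoop-lzo and elephant-bird respectively, so verify them against the versions you install). First index the file from the shell:
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/serverx/20120710/serverx_20120710.lzo
Then load it in Pig with an LZO-aware loader instead of plain PigStorage:
register file:/path/to/elephant-bird-pig.jar;
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader() AS (field1:chararray);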

What's wrong with my Hive UDF? How do I set the number of mappers in Hive?

I use Hadoop/Hive to analyse Apache logs and compute access statistics. I wrote a UDF named GetCity to convert remote_ip to a city name, but when I run "select GetCity(remote_ip) from log_pre;" it is very slow, and it even fails when the data grows to more than about 1000 items.
I tried setting mapred.reduce.tasks=10, but the JobTracker still shows the total number of map tasks as 1. How can I get more mappers for the SELECT?
When performing a query like this, the "GetCity(remote_ip)" call always happens in the mapper. In fact, I doubt there is anything going on in the reducer here except maybe file concatenation. You can control the number of map tasks from Hive by setting:
SET mapred.map.tasks=10;
Hope this helps,
synctree

Resources