I am new to PIG and HDFS. Here is what I am trying to do.
I have a lot of LZO-compressed, poorly formatted flat-text server log files, about 2 GB each, generated by around 400 servers daily.
I am trying to take advantage of MapReduce to format and clean up the data in HDFS using my Java formatter, and then load the output into Hive.
My problem is that my Pig script spawns only one mapper, which takes around 15 minutes to read the file sequentially. That is not practical for the amount of data I have to load into Hive daily.
Here is my pig script.
SET default_parallel 100;
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
SET mapred.min.split.size 256000;
SET mapred.max.split.size 256000;
SET pig.noSplitCombination true;
SET mapred.max.jobs.per.node 1;
register file:/apps/pig/pacudf.jar;
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING PigStorage() as (field1);
pac = foreach raw1 generate pacudf.filegenerator(field1);
store pac into '/data/bazooka/';
Looks like the mapred.min.split.size setting isn't working. I can see only one mapper being initiated, and it works on the whole 2 GB file on a single server of the cluster. As we have a 100-node cluster, I was wondering if I could make use of more servers in parallel by spawning more mappers.
Thanks in advance
Compression support in PigStorage does not provide splitting ability. For splittable LZO compression support with Pig, you need the elephant-bird library from Twitter. Also, to get splitting to work properly with existing regular LZO files, you need to index them before loading them in your Pig script.
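For reference, a minimal sketch of that approach, assuming the hadoop-lzo and elephant-bird jars are available on your cluster (jar names and paths below are illustrative, not exact):
-- one-time step, run outside Pig: build an index next to the .lzo file so it becomes splittable
-- hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /data/serverx/20120710/serverx_20120710.lzo
REGISTER /path/to/elephant-bird-core.jar;
REGISTER /path/to/elephant-bird-pig.jar;
REGISTER /path/to/elephant-bird-hadoop-compat.jar;
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader() AS (field1);
With the index file present and a splittable loader, the job should get one mapper per split instead of one mapper for the whole file.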
Related
I tried processing a large dataset (about 150 GB) with Tez, labeling the words in each sentence, but it took far too long (a week or more), so I tried to specify the number of mappers.
Even though I set mapred.map.tasks=2000, the mapper count stays at about 150, so I can't do what I want.
I specify the map value in the Oozie workflow file and use Tez.
How can I specify the number of mappers?
Ultimately I just want to speed up the process; it is fine not to use Tez.
In addition, I count the labeled sentences with a reducer, and that also takes a long time.
I would also like to know how to adjust the memory size used by each mapper and reducer process.
In order to manually set the number of mappers in a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used...
... set tez.grouping.split-count=4 will create 4 mappers
https://community.pivotal.io/s/article/How-to-manually-set-the-number-of-mappers-in-a-TEZ-Hive-job
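If you would rather bound split sizes than pick an exact count, the Tez grouping size properties are an alternative (a hedged example; the byte values here are purely illustrative):
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=134217728;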
However, overall, you should optimize the storage format and the Hive partitions before you even begin tuning the Tez settings. Do not try to process data STORED AS TEXT in Hive. Convert it to ORC or Parquet first.
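A minimal sketch of that conversion, using hypothetical table names:
CREATE TABLE logs_orc STORED AS ORC
AS SELECT * FROM logs_text;
After that, run the labeling query against logs_orc and tune the Tez grouping settings on the ORC-backed table.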
If Tez isn't working out for you, you can always try Spark. Plus, labelling sentences is probably a Spark MLlib workflow you can find somewhere.
I am trying to address the small-files problem by compacting the files under Hive partitions with an INSERT OVERWRITE ... PARTITION command in Hadoop.
Query :
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1,col2,col3 from tbl1
WHERE year=2016 and month=03 and day=11;
Input Files:
For testing purpose I have three files under the hive partition (2016/03/11) in HDFS with the size of 40 MB each.
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
For example, my block size is 128 MB, so I would like to create only one output file. But I am getting 3 different compressed files.
Please help me find the Hive configuration to control the output file size. If I am not using compression, I get a single file.
Hive Version : 1.1
It's interesting that you are still getting 3 files when specifying the partition with compression enabled, so you may want to look into dynamic partitioning, or ditch the partitioning and focus on the number of mappers and reducers being created by your job. If your files are small, I can see why you would want them all in one file on your target, but then I would also question the need to compress them.
The number of files created in your target is directly tied to the number of reducers or mappers. If the SQL you write needs to reduce then the number of files created will be the same as the number of reducers used in the job. This can be controlled by setting the number of reducers used in the job.
set mapred.reduce.tasks = 1;
In your example SQL there most likely wouldn't be any reducers used, so the number of files in the target is equal to the number of mappers used, which is equal to the number of files in the source. It isn't as easy to control the number of output files on a map-only job, but there are a number of configuration settings that can be tried.
This setting combines small input files so that fewer mappers are spawned; the default is false.
set hive.hadoop.supports.splittable.combineinputformat = true;
Try setting a threshold in bytes for the input files; anything under this threshold will be considered for conversion to a map join, which can affect the number of output files.
set hive.mapjoin.smalltable.filesize = 25000000;
As for compression, I would try changing the type of compression being used just to see if that makes any difference in your output.
set hive.exec.orc.default.compress = ZLIB, SNAPPY, etc...
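Putting those together, one possible combination for a map-only compaction of your partition (a sketch only; the 128 MB values and the Snappy codec are assumptions, and hive.merge.mapfiles is the map-only counterpart of the hive.merge.mapredfiles setting you already use):
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=134217728;
set hive.merge.size.per.task=134217728;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11;
The combineinputformat flag is the one suggested above; it lets the merge step combine the compressed output files.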
When I run a query using Tez, the number of output files is very large. I have some 4-5 GB of data in files of about 46 MB or 16 MB each. I want to have only 2-3 output files.
My output file location will be Google Cloud Storage. How do I merge the files?
set mapred.reduce.tasks = 1;
set hive.merge.mapfiles = true;
set hive.mergejob.maponly = true;
set hive.merge.mapredfiles=true;
I did set these parameters, and I wrote an INSERT OVERWRITE query to overwrite the data in the same location. No use. Please help.
I was able to get this done. Earlier, when I was doing this, it was a map-only job. Now I have changed the query a bit to also use a reducer (added DISTRIBUTE BY). Then, if I set the number of reducers to 1, it works. But it does not work with the other parameters, which should work for a map-only job.
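For anyone looking for the shape of that workaround, a rough sketch (the table, columns, and bucket path are hypothetical):
set mapred.reduce.tasks=2;
INSERT OVERWRITE DIRECTORY 'gs://my-bucket/output'
SELECT col1, col2
FROM src_table
DISTRIBUTE BY col1;
DISTRIBUTE BY forces a reduce phase, so the reducer count above directly controls the number of output files.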
I'm running a Pig script that does a series of joins and writes using AvroStorage().
All is running well, and I am getting the data that I want... but it is being written to 845 Avro files (~30 KB each). This does not seem right at all... but I cannot find any settings I may have changed to go from my previous output of one large Avro file to 845 small ones (except adding another data source).
Would this change anything? And how can I get back to one or two files?
Thanks!
One possibility is to change your block size. If you want to go back to fewer files, you can also try Parquet: transform your .avro files with a Pig script and store them as .parquet files; this will reduce your 845 files to fewer.
But it isn't necessary to get back to fewer files, except for a performance advantage.
The number of files written by an MR job is defined by the number of reducers that ran. You can use PARALLEL in your Pig script to control the number of reducers.
If you are sure that the final data is small enough (comparable to your block size), you can add PARALLEL 1 to your JOIN statement to make sure that the JOIN is translated to 1 reducer and thus writes its output to only 1 file.
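A minimal sketch of that, with hypothetical aliases and output path:
joined = JOIN big_data BY id, other_data BY id PARALLEL 1;
STORE joined INTO '/output/path' USING AvroStorage();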
I solved that using SET pig.maxCombinedSplitSize 134217728;
With SET default_parallel 10; it may still output many small files, depending on the Pig job.
Is it possible to have Pig process several small files with one mapper (assuming doing so will improve the speed of the job)? We have an issue where there are thousands of small files in HDFS and Pig creates hundreds of mappers. Is there a simple (full or partial) solution that Pig provides to address this issue?
You can make use of these properties to combine multiple files into one split, so that they are processed by a single map:
pig.maxCombinedSplitSize – Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached.
pig.splitCombination – Turns combine split files on or off (set to “true” by default).
This feature works with PigStorage without having to write any custom loader. More on this can be found here.
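For example, a sketch with an illustrative input path and a combined split size of ~128 MB:
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 134217728;
logs = LOAD '/data/small_files/' USING PigStorage() AS (line:chararray);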
HTH
A common approach in Hadoop with a large number of small files is to aggregate them into large Sequence or Avro files and then use the respective storage functions to read them.
For Pig and Avro, take a look at AvroStorage.
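A minimal sketch of reading such aggregated Avro files (the path is illustrative; depending on your Pig version you may need to REGISTER the piggybank and Avro jars first):
records = LOAD '/data/aggregated/' USING AvroStorage();
DUMP records;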