How do I merge all files in a directory on HDFS, that I know are all compressed, into a single compressed file, without copying the data through the local machine? For example, but not necessarily, using Pig?
As an example, I have a folder /data/input that contains the files part-m-00000.gz and part-m-00001.gz. Now I want to merge them into a single file /data/output/foo.gz
I would suggest looking at FileCrush (https://github.com/edwardcapriolo/filecrush), a tool that merges files on HDFS using MapReduce. It does exactly what you described and provides several options to deal with compression and to control the number of output files.
Crush --max-file-blocks XXX /data/input /data/output
max-file-blocks represents the maximum number of dfs blocks per output file. For example, according to the documentation:
With the default value of 8, 80 small files, each being 1/10th of a dfs block, will be grouped into a single output file since 80 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created. One output file will contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files will be converted into a smaller number of larger files where each output file is roughly the same size.
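For reference, a hedged sketch of how the tool is typically launched: FileCrush runs as a MapReduce job via hadoop jar. The jar file name and main class below are assumptions, so check the README of the version you build (some versions also expect a trailing timestamp argument).
# jar name and main class are assumptions; the options mirror the command above
hadoop jar filecrush.jar com.m6d.filecrush.crush.Crush \
  --max-file-blocks 8 \
  /data/input /data/output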
If you set PARALLEL to 1, you will get a single output file.
This can be done in 2 ways:
Add set default_parallel 1; to your Pig script, but note that this affects everything in the script.
Set PARALLEL on a single operation, e.g. DISTINCT ID PARALLEL 1;
You can read more in the Pig documentation on Parallel Features; a minimal sketch follows below.
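A minimal sketch of the Pig route for the original question, assuming gzip-compressed text input. The compression properties and the ORDER step (used only to force a reduce phase so the parallelism setting applies) are common Hadoop/Pig idioms and may need adjusting for your version.
# run a small Pig script; the output appears as /data/output/part-*.gz, not a single named foo.gz
pig <<'EOF'
set default_parallel 1;                                            -- one reducer => one output part
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
A = LOAD '/data/input' USING PigStorage();                         -- .gz input is read transparently
B = ORDER A BY $0;                                                 -- reduce-side step so parallelism applies (reorders lines)
STORE B INTO '/data/output' USING PigStorage();
EOF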
I know there's an option to merge to the local filesystem using the hdfs dfs -getmerge command. Perhaps you can use that to merge to the local filesystem and then use hdfs dfs -copyFromLocal to copy it back into HDFS.
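A hedged sketch of that round trip (the data does stream through the local machine, which the question wanted to avoid, but the result is valid because concatenated gzip streams form a single valid .gz file):
# merge everything under /data/input into one local file, then push it back
hdfs dfs -getmerge /data/input /tmp/foo.gz
hdfs dfs -mkdir -p /data/output
hdfs dfs -copyFromLocal /tmp/foo.gz /data/output/foo.gz
rm /tmp/foo.gz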
Related
I have 4 files on HDFS:
1.txt, 2.txt, 3.txt and 4.txt. The first 3 files have the contents shown below and 4.txt is empty. How many mappers are executed?
Number of mappers = number of input splits.
My question is: are all these files stored in one 64 MB block or in 4 different blocks, since each file is less than 64 MB in size?
1.txt: This is text file 1
2.txt: This is text file 2
3.txt: This is text file 3
4.txt: (empty)
They would be stored in 4 different blocks unless you wrap them up and store them in a HAR file. If a file is larger than the block size, it is split and stored across multiple blocks; if it is smaller, it is stored on its own in a single block. In either case a block never occupies more disk space than the actual file size, even if the configured block size is 64 MB or more. Quoting from Hadoop: The Definitive Guide:
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode.
So in your case it would still use 4 mappers, since there are 4 blocks.
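For completeness, a hedged sketch of the HAR route mentioned above (the paths are hypothetical; archiving runs a MapReduce job):
# pack the small files into one archive; /user/me/input and /user/me/archives are hypothetical
hadoop archive -archiveName small.har -p /user/me/input 1.txt 2.txt 3.txt 4.txt /user/me/archives
# the archive is then addressed through the har:// scheme
hdfs dfs -ls har:///user/me/archives/small.har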
By default, HDFS does not combine small files into a single block.
HDFS stores each file in its own block, so it will use 4 blocks to store your 4 files (each smaller than dfs.block.size). This does not mean that HDFS will occupy 4 * 64 MB of disk space. Hence your MR job will spawn 4 mappers to read the files.
Ideally, you should not store small files on HDFS, as they increase the load on the NameNode.
You can combine the files before uploading to HDFS with a Unix utility (see the sketch below), convert them to sequence files, or write a Pig script/Hive script/MapReduce job to combine the small files into bigger ones.
The small files problem on HDFS is described very well here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
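A hedged sketch of the simplest pre-upload combination with a Unix utility (the file names are hypothetical):
# concatenate many small local files into one before loading it into HDFS
cat small-part-*.txt > combined-001.txt
hdfs dfs -mkdir -p /data/combined
hdfs dfs -put combined-001.txt /data/combined/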
I want to copy text files from external sources to HDFS. Let's assume that I can combine and split the files based on their size; what should the size of the text files be for the best custom MapReduce job performance? Does size matter?
HDFS is designed to support very large files, not small files. Applications that are compatible with HDFS are those that deal with large data sets.
These applications write their data once, but read it one or more times and require those reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. The HDFS architecture is built around the concept of blocks; a typical block size used by HDFS is 64 MB.
When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS: there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes.
The goal of splitting a file is parallel processing and failover of the data. Which DataNode each block/chunk resides on depends on your cluster configuration.
How mappers get assigned
The number of mappers is determined by the number of splits of your data in the MapReduce job.
With a typical InputFormat, it is directly proportional to the number of files and the file sizes.
Suppose your HDFS block size is configured at 64 MB (the default) and you have a file of 100 MB: there will be 2 splits, occupying 2 blocks, and 2 mappers will be assigned based on those blocks. If instead you have 2 files of 30 MB each, then each file will occupy one block and a mapper will be assigned for each.
So you don't need to split the large file, but if you are dealing with very small files it is worth combining them.
This link will be helpful for understanding the problem with small files.
Please refer to the link below for more detail about the HDFS design:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
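If you want to see how a particular file was actually split into blocks, a hedged sketch (the path is hypothetical):
# list the blocks and their locations for one file
hdfs fsck /data/big/file.txt -files -blocks -locations
# print the block size the cluster is configured with
hdfs getconf -confKey dfs.blocksize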
I have around 14000 small .gz files (from ~90 KB to 4 MB) which are loaded into HDFS, all in the same directory.
So each of them is far below the standard 64 MB or 128 MB block size of HDFS, which can lead to serious trouble (the "small files problem", see this blog post by Cloudera) when running MR jobs that process these files.
The aforementioned blog post contains a number of solutions to this problem, which mostly involve writing a MapReduce Job or using Hadoop Archives (HAR).
However, I would like to tackle the problem at the source and merge the small files into 64 MB or 128 MB .gz files which will then be fed directly into HDFS.
What's the simplest way of doing this?
cat small-*.gz > large.gz
should be enough, since concatenated gzip streams form a single valid gzip file. That assumes you don't need to extract the individual files from it later and that one big file suits your needs.
If you want separate files, just tar it:
tar cf large.tar small-*.gz
After experimenting a bit further, the following two steps do what I want:
zcat small-*.gz | split -d -l2000000 -a 3 - large_
This works in my case, because there is very little variance in the length of a line.
2,000,000 lines correspond to almost exactly 300 MB files.
Unfortunately, the output chunks of split cannot simply be piped through gzip here, so I have to do another step:
gzip *
This will then also compress the generated large files.
Gzip compresses each of these files by a factor of ~5, leading to 60 MB files and thus satisfying my initial constraint of getting .gz files < 64 MB.
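As a side note, a hedged one-pass alternative: GNU split (coreutils 8.13 or newer) has a --filter option that pipes each chunk through a command as it is written, so the recompression step can be folded in:
# $FILE is set by split itself; keep it single-quoted so the shell does not expand it
zcat small-*.gz | split -d -l 2000000 -a 3 --filter='gzip > $FILE.gz' - large_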
What is the advantage of a Hadoop sequence file over an HDFS flat (text) file? In what way is a sequence file more efficient?
Small files can be combined and written into a sequence file, but the same can be done with an HDFS text file too. I need to know the difference between the two approaches. I have been googling this for a while; it would be helpful to get some clarity on this.
Sequence files are appropriate for situations in which you want to store keys and their corresponding values. For text files you can do that too, but you have to parse each line.
They can be compressed and still be splittable, which means a better-balanced workload; you can't split a compressed text file unless you use a splittable compression format.
They can be treated as binary files, which is more storage efficient; in a text file a double becomes a string of characters, with a large storage overhead.
Advantages of Hadoop sequence files (as per Siva's article on the hadooptutorial.info website):
More compact than text files
Support compression at different levels - block or record
Can be split and processed in parallel
Can address the large-number-of-small-files problem in Hadoop, whose main strength is processing large files with MapReduce jobs: a sequence file can be used as a container for a large number of small files
The temporary output of a mapper can be stored in sequence files
Disadvantages:
Sequence files are append-only
Sequence files are also used as intermediate files between the mapper and reducer phases of MapReduce processing. They are compressible and fast to process: the mapper writes its output to them and the reducer reads from them.
There are APIs in Hadoop and Spark to read/write sequence files.
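As a quick way to compare the two formats in practice, hadoop fs -text can dump both as text (the paths below are hypothetical); it decompresses gzip files and deserializes sequence files, whose header also records the key/value classes and the compression codec:
hadoop fs -text /data/flat/part-00000.gz | head
hadoop fs -text /data/seq/part-00000 | head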
In the Hadoop filesystem I have two files, say X and Y. Normally, Hadoop makes chunks of files X and Y that are 64 MB in size. Is it possible to force Hadoop to divide the two files such that a 64 MB chunk is created out of 32 MB from X and 32 MB from Y? In other words, is it possible to override the default behaviour of file partitioning?
File partitioning is a function of the FileInputFormat, since it logically depends on the file format. You can create your own InputFormat with any other splitting behaviour, so per single file you can do it.
Mixing parts of two different files in a single split sounds problematic, since a file is the basic unit of processing.
Why do you have such a requirement?
I see the requirement below. It has to be said that data locality has to be sacrificed at least in part - we can run the map local to one file, but not to both.
I would suggest building some kind of "file pairs" file, putting it into the distributed cache and then, in the map function, loading the second file from HDFS.
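A hedged sketch of that suggestion from the command line (the jar, driver class, and paths are hypothetical, and the driver is assumed to use ToolRunner so generic options are parsed): the -files option ships the "file pairs" file through the distributed cache, and each map task can then open the paired file from HDFS.
# my-job.jar, com.example.PairDriver and all paths are hypothetical
hadoop jar my-job.jar com.example.PairDriver \
  -files hdfs:///data/meta/file_pairs.txt \
  /data/input/X /data/output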