I have 4 files on HDFS: 1.txt, 2.txt, 3.txt and 4.txt. The first 3 files have the contents shown below and 4.txt is empty. How many mappers are executed?
Number of mappers = number of input splits.
My question is: are all these files stored in one 64 MB block or in 4 different blocks, since each file is less than 64 MB in size?
1.txt: This is text file 1
2.txt: This is text file 2
3.txt: This is text file 3
4.txt: (empty)
They would be stored in 4 different blocks unless you wrap them up and store them in a HAR (Hadoop archive) file. The rule is: if a file is larger than the block size, that single file is split and stored across multiple blocks; if it is smaller than the block size, each file is still stored independently in its own block. However, a block does not consume more disk space than the actual file size, even if the configured block size is 64 MB or more. Quoting from Hadoop: The Definitive Guide:
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode.
So in your case the job would still use 4 mappers, as there are 4 blocks.
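As a quick way to verify the split count, here is a minimal sketch using the MapReduce API (the input path /user/demo/input is a placeholder for a directory holding the four files): it asks TextInputFormat how many splits it would create, which with the four small files above should print 4, i.e. one mapper per split.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-count");

        // Hypothetical directory holding 1.txt, 2.txt, 3.txt and the empty 4.txt
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));

        // Each small file (even the empty one) gets its own split,
        // so this should print 4 -- and the job would run 4 map tasks.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Input splits: " + splits.size());
    }
}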
HDFS does not, by default, combine small files into a single block.
HDFS stores each file in its own block(s), so it will use 4 blocks to store your 4 files (each smaller than dfs.block.size). This does not mean that HDFS will occupy 4 * 64 MB of disk space, since a block only uses as much space as the file actually contains. Hence your MR job will spawn 4 mappers to read the 4 files.
Ideally, you should not store many small files on HDFS, as they increase the load on the NameNode.
You can combine the files before uploading to HDFS with a Unix utility, convert them to SequenceFiles, or write a Pig/Hive/MapReduce job that combines all the small files into bigger ones.
The small-files problem on HDFS is described very well here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
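As an illustration of the SequenceFile option, here is a minimal sketch (the paths, and the choice of file name as key and raw bytes as value, are assumptions for the example) that packs every file in a directory into a single SequenceFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/user/demo/small-files"); // hypothetical input directory
        Path output = new Path("/user/demo/combined.seq");  // hypothetical output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Key = original file name, value = raw file contents
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        }
    }
}

Each small file then becomes one record inside a single large file, so the NameNode only has to track the blocks of that one SequenceFile.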
Related
I'm having some trouble understanding how I should store large files.
For example, the block size in my HDFS is 128 MB, and I have a 1 GB file.
I know that saving files that are smaller than the block size is not the best practice, and I understand why.
But what should I do with big files? For my 1 GB file, should I save 1 file or 8 files of 128 MB each, and why?
You can store it as a single 1 GB file. Hadoop will automatically store that file in 8 blocks.
Hadoop is designed for large files, not small ones. Note that a block is the unit of physical storage in Hadoop.
As you did not mention the split size of your cluster, I assume it is 128 MB. The split is what your parallelism depends on, so if you process a 1 GB file with a 128 MB split size, 8 mappers will be invoked (1 mapper per split).
If you store 8 files of 128 MB each, there will be unnecessary overhead on your NameNode for maintaining metadata about those 8 files. With 8 files, processing performance may be more or less similar, but the single 1 GB file stored in 8 blocks is definitely the better choice.
Do not get confused by blocks in Hadoop; they are just a storage unit, as in any other file system. Hadoop automatically takes care of storage no matter how big the file is and divides the file into blocks. Storing small files only adds unnecessary overhead in I/O operations.
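To see this for yourself, here is a rough sketch using the FileSystem API (the path is a placeholder) that asks the NameNode for the block locations of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/bigfile"); // hypothetical 1 GB file
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // With a 128 MB block size and a 1 GB file this should list 8 blocks,
        // each with the DataNodes holding its replicas.
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}

The command-line equivalent, hdfs fsck /user/demo/bigfile -files -blocks -locations, reports the same information.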
I want to copy text files from external sources to HDFS. Let's assume that I can combine and split the files based on their size; what should the size of the text files be for the best custom MapReduce job performance? Does size matter?
HDFS is designed to support very large files, not small files. Applications that are a good fit for HDFS are those that deal with large data sets: they write their data once but read it one or more times, and require these reads to be served at streaming speeds. HDFS supports write-once-read-many semantics on files.
In the HDFS architecture there is a concept of blocks; a typical block size used by HDFS is 64 MB. When you place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS: there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes. The goal of splitting the file is parallel processing and failover of the data. The blocks/chunks will reside on different DataNodes based on your cluster configuration.
How mappers get assigned
The number of mappers is determined by the number of input splits of your data in the MapReduce job.
With a typical InputFormat, it is directly proportional to the number of files and the file sizes.
Suppose your HDFS block size is configured to 64 MB (the default) and you have a file of 100 MB: there will be 2 splits, the file will occupy 2 blocks, and 2 mappers will be assigned based on those blocks. If instead you have 2 files of 30 MB each, then each file will occupy its own block and a mapper will be assigned per block, so again 2 mappers.
So you don't need to split the large file; but if you are dealing with very small files, then it is worth combining them.
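If you are stuck with many small input files, one possible approach (a sketch, not the only option) is CombineTextInputFormat, which packs several small files into each split so the job launches far fewer mappers. The paths and the 128 MB target split size are assumptions; tune them to your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFilesJob.class);

        // Pack small files together until a split reaches ~128 MB,
        // instead of launching one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

        FileInputFormat.addInputPath(job, new Path("/user/demo/small-files")); // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));    // hypothetical

        // Mapper and reducer classes are omitted here; plug in your own job logic.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}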
This link will be helpful for understanding the problem with small files.
Please refer to the link below for more detail about the HDFS design.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
How do I merge all files in a directory on HDFS, which I know are all compressed, into a single compressed file, without copying the data through the local machine? For example, but not necessarily, using Pig?
As an example, I have a folder /data/input that contains the files part-m-00000.gz and part-m-00001.gz. Now I want to merge them into a single file /data/output/foo.gz.
I would suggest looking at FileCrush (https://github.com/edwardcapriolo/filecrush), a tool to merge files on HDFS using MapReduce. It does exactly what you described and provides several options for dealing with compression and for controlling the number of output files.
Crush --max-file-blocks XXX /data/input /data/output
max-file-blocks represents the maximum number of dfs blocks per output file. For example, according to the documentation:
With the default value 8, 80 small files, each being 1/10th of a dfs block, will be grouped into a single output file, since 80 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created: one output file will contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files will be converted into a smaller number of larger files, where each output file is roughly the same size.
If you set the Parallel to 1, then you will have a single output file.
This can be done in 2 ways:
in your Pig script, add set default_parallel 1; but note that this affects everything in the script
change the Parallel for a single operation, like DISTINCT ID PARALLEL 1;
You can read more about Pig's Parallel Features in its documentation.
I know there's an option to merge to the local filesystem using the "hdfs dfs -getmerge" command. Perhaps you can use that to merge to the local filesystem and then use the "hdfs dfs -copyFromLocal" command to copy it back into HDFS.
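If you would rather avoid the round trip to the local filesystem, another option is a small client program that streams the compressed parts directly into one new HDFS file: the data still flows through the machine running it, but never touches local disk, and because gzip readers accept concatenated gzip members the result remains a valid .gz. A rough sketch, with error handling and part ordering kept minimal:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatGzParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/input");      // the part-m-*.gz files
        Path output = new Path("/data/output/foo.gz");

        try (FSDataOutputStream out = fs.create(output, false)) {
            for (FileStatus part : fs.listStatus(inputDir)) {
                if (!part.getPath().getName().endsWith(".gz")) {
                    continue; // skip _SUCCESS and other non-gzip files
                }
                try (FSDataInputStream in = fs.open(part.getPath())) {
                    // Append the raw compressed bytes; concatenated gzip members
                    // still form a readable gzip stream.
                    IOUtils.copyBytes(in, out, conf, false);
                }
            }
        }
    }
}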
So from my understanding, when HDFS stores a bzip2-compressed 1 GB file with a block size of 64 MB, the file will be stored as 16 different blocks. If I want to run a MapReduce job on this compressed file, MapReduce tries to split the file again. Why doesn't MapReduce automatically use the 16 blocks in HDFS instead of splitting the file again?
I think I see where your confusion is coming from. I'll attempt to clear it up.
HDFS slices your file up into blocks. These are physical partitions of the file.
MapReduce creates logical splits on top of these blocks. The splits are defined based on a number of parameters, with block boundaries and locations being a major factor. For example, you can set your minimum split size to 128 MB, in which case each split will likely be exactly two 64 MB blocks.
All of this is unrelated to your bzip2 compression, because bzip2 is splittable and the splits can line up with the blocks. If you had used gzip compression, each split would be the entire file, because gzip is not a splittable compression format.
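If you want to check which codecs MapReduce treats as splittable, a small sketch using CompressionCodecFactory (the file names are placeholders) performs roughly the same check TextInputFormat makes when deciding whether it can split a compressed file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CheckSplittable {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        for (String name : new String[] {"/data/file.bz2", "/data/file.gz"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            // bzip2 reports splittable (splits can start at its block markers);
            // gzip does not, so the whole file becomes a single split.
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> codec="
                    + (codec == null ? "none" : codec.getClass().getSimpleName())
                    + ", splittable=" + splittable);
        }
    }
}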
My understanding is that Hadoop takes a large file and saves it in chunks called "data blocks". Are these data blocks stored in a T-File? Is the relationship between data block and T-File 1:1?
HDFS stores large files as a series of data blocks (typically of a fixed size like 64/128/256/512 MB). Say you have a 1 GB file and a block size of 256 MB: HDFS will represent this file as 4 blocks. The NameNode will track which DataNodes have copies (replicas) of these blocks.
T-Files are a file format containing key/value pairs. Hadoop would store a T-File using one or more data blocks in HDFS, depending on the size of the T-File and the defined block size (either the system default or a file-specific setting).
In summary, you can store any file format in HDFS; it will just be chunked up into fixed-size blocks, distributed, and replicated throughout the cluster.