File split/partition in hadoop - hadoop

In hadoop filesystem, I have two files say X and Y. Normally, hadoop makes chunks of files X and Y of 64 MB in size. Is it possible to force hadoop to divide the two files such that a 64 MB chunk is created out of 32 MB from X and 32 MB from Y. In other words, is it possible to override the default behaviour of file partitioning?

File partitioning is a function of the FileInputFormat, since it is logically depends on the file format. You can create you own input with any other format. So per single file - you can do it.
Mixing two part of the different files in the single split sounds problematic - since file is a basic unit of processing.
Why do you have such requirement?
I see the requriement below. Can be stated that data locality has to be sucrificed at least in part - we can run map local to one file but not to both.
I would suggest building some kind of "file pairs" file, putting it into distributed cache and then, in the map function load second file from the HDFS.

Related

HDFS behavior on lots of small files and 128 Mb block size

I have lots (up to hundreds of thousands) of small files, each 10-100 Kb. I have HDFS block size equal 128 MB. I have replication factor equal 1.
Is there any drawbacks of allocating HDFS block per small file?
I've seen pretty contradictory answers:
Answer which said the smallest file takes the whole block
Answer which said that HDFS is clever enough, and small file will take small_file_size + 300 bytes of metadata
I made a test like in this answer, and it proves that the 2nd option is correct - HDFS doesn't allocate the whole block for small files.
But, how about batch read of 10.000 small files from HDFS? Does it will be slow down because of 10.000 blocks and metadatas? Is there any reason to keep multiple small files within single block?
Update: my use case
I have only one use case for small files, from 1.000 up to 500.000. I calculate that files once, store it, and than read them all at once.
1) As I understand, NameNode space problem is not a problem for me. 500.000 is an absolute maximum, I will never have more. If each small file takes 150 bytes on NN, than the absolute maximum for me is - 71.52 MB, which is acceptable.
2) Does Apache Spark eliminate MapReduce problem? Will sequence files or HAR help me to solve the issue? As I understand, Spark shouldn't depend on Hadoop MR, but it's still too slow. 490 files takes 38 seconds to read, 3420 files - 266 seconds.
sparkSession
.read()
.parquet(pathsToSmallFilesCollection)
.as(Encoders.kryo(SmallFileWrapper.class))
.coalesce(numPartitions);
As you have noticed already, the HDFS file does not take anymore space than it needs, but there are other drawbacks of having the small files in the HDFS cluster. Let's go first through the problems without taking into consideration batching:
NameNode(NN) memory consumption. I am not aware about Hadoop 3 (which is being currently under development) but in previous versions NN is a single point of failure (you can add secondary NN, but it will not replace or enhance the primary NN at the end). NN is responsible for maintaining the file-system structure in memory and on the disk and has limited resources. Each entry in file-system object maintained by NN is believed to be 150 bytes (check this blog post). More files = more RAM consumed by the NN.
MapReduce paradigm (and as far as I know Spark suffers from the same symptoms). In Hadoop Mappers are being allocated per split (which by default corresponds to the block), this means, that for every small file you have out there a new Mapper will need to be started to process its contents. The problem is that for small files it actually takes much more for Hadoop to start the Mapper than process the file content. Basically, you system will be doing unnecessary work of starting/stopping Mappers instead of actually processing the data. This is the reason Hadoop processes much fast 1 128MBytes file (with 128MBytes blocks size) rather than 128 1MBytes files (with same block size).
Now, if we talk about batching, there are few options you have out there: HAR, Sequence File, Avro schemas, etc. It depends on the use case to give the precise answers to your questions. Let's assume you do not want to merge files, in this case you might be using HAR files (or any other solution featuring efficient archiving and indexing). In this case the NN problem is solved, but the number of Mappers still will be equal to the number of splits. In case merging files into large one is an option, you can use Sequence File, which basically aggregates small files into bigger ones, solving to some extend both problems. In both scenarios though you cannot really update/delete the information directly like you would be able to do with small files, thus more sophisticated mechanisms are required for managing those structures.
In general, in the main reason for maintaining many small files is an attempt to make fast reads, I would suggest to take a look to different systems like HBase, which were created for fast data access, rather than batch processing.

How many output files are created after an MR Job in Hadoop?

I have a file which is less than (very less) default block size. The output from my Mapper is a large number of <key,list<values>> pairs (greater than 20).
I read somewhere that the number of output files generated after an MR job is equal to the number of reducers which in my case are greater than 20. But I got a single file in the output.
Then I made job.setNumReduceTasks(2) hoping that it would generate two files in the output. But it still generated a single file.
So can I conclude that the number of output files is equal to the number of blocks?
And also, is one block of data fed to one Mapper?
- Block - A Physical Division:
HDFS was designed to hold and manage large amounts of data. A default block size is 64 MB. That means if a 128-MB text file was put in to HDFS, HDFS would divide the file into two blocks (128 MB/64 MB) and distribute the two chunks to the data nodes in the cluster.
- Split - A Logical Division:
When Hadoop submits jobs, it splits the input data logically and process by each Mapper. Split is only a reference. Split has details in org.apache.hadoop.mapreduce.InputSplitand rules (how to split) decided by getSplits() in class org.apache.hadoop.mapreduce.Input.FileInputFormat.
By default, the size of split = block size = 64M.
Now consider your block size is 64MB. The file which you are processing should be greater than 64MB to create its physical splits. If it is less than 64 MB then you will see only single file as you mentioned in your output. (No matter how many key-value your mapper will produce!)

What should be the size of the file in HDFS for best MapReduce job performance

I want to do a copy text files from external sources to HDFS. Lets assume that I can combine and split the files based on their size, what should be the size of the text file for best custom Map Reduce job performance. Does size matter ?
HDFS is designed to support very large files not small files. Applications that are compatible with HDFS are those that deal with large data sets.
These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files.In HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB.
When we place a large file into HDFS it chopped up into 64 MB chunks(based on default configuration of blocks), Suppose you have a file of 1GB
and you want to place that file in HDFS, then there will be 1GB/64MB = 16 split/blocks and these block will be distribute across the datanodes
The goal of splitting of file is parallel processing and fail over of data. These blocks/chunk will reside on a different DataNode based on your
cluster configuration.
How mappers get assigned
Number of mappers is determined by the number of splits of your data in the MapReduce job.
In a typical InputFormat, it is directly proportional to the number of files and file sizes.
suppose your HDFS block configuration is configured for 64MB(default size) and you have a files with 100MB size
then there will be 2 split and it will occupy 2 block and then 2 mapper will get assigned based on the blocks but suppose
if you have 2 files with 30MB size(each file) then each file will occupy one block and mapper will get assigned based on that.
So you don't need to split the large file, but If you are dealing with very small files then it worth to combine them.
This link will be helpful to understand the problem with small files.
Please refer below link to get more detail about HDFS design.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Large number of small files Hadoop

Parameters of some machines are measured and uploaded via a web service to HDFS. Parameter values are saved in a file for each measurement, where a measurement has 1000 values in average.
The problem is - there is a large number of files. Only certain number of files is used for MapReduce job (for example, measurements from last month). Because of this I'm not able to merge them all into one large sequence file, since different files are needed in different time.
I understand that is bad to have a large number of small files, since the NameNode contains paths to all of them on HDFS (and keeps it in its memory) and on the other hand, each small file will result in a Mapper creation.
How can I avoid this problem?
A late answer: You can use SeaweedFS https://github.com/chrislusf/seaweedfs (I am working on this). It has special optimization for large number of small files.
HDFS actually has good support to delegate file storage to other file systems. Just add a SeaweedFS hadoop jar. See https://github.com/chrislusf/seaweedfs/wiki/Hadoop-Compatible-File-System
You could concatenate the required files into one temporal file that is deleted once analyzed. I think you can create this very easily in a script.
Anyway, make the numbers: such a big file will be also splited into several pieces whose size will be the blocksize (dfs.blocksize parameter a hdfs-defaul.xml), and each one of these pieces will be assigned to a mapper. I mean, depending on the blocksize and the average "small file" size, maybe the gain is not so great.

how can & where can i edit the InputSplit size in CDH4.7. By default it is 64 MB, but i want to mention it as 1MB

How can and where can i edit the Input Split size in CDH4.7 By default it is 64 MB but i want to mention it as 1MB because my MR job is running slow and i want increase the speed of MR job. i guess need to edit cor-site property IO.file.buffer.size but CDH4.7 is not allowing me to edit as it is read only.
just reapeating the question below the get my question posted
How can and where can i edit the Input Split size in CDH4.7 By default it is 64 MB but i want to mention it as 1MB because my MR job is running slow and i want increase the speed of MR job. i guess need to edit cor-site property IO.file.buffer.size but CDH4.7 is not allowing me to edit as it is read only.
The parameter "mapred.max.split.size" which can be set per job individually is what you looking for.
You don't change "dfs.block.size" because Hadoop Works better with a small number of large files than a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the file is very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of them (one per file), each of which imposes extra bookkeeping overhead. Compare a 1gb file broken into sixteen 64mb blocks, and 10.000 or so 100kb files. The 10.000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.
You can change it directly from within the command using -D mapred.max.split.size=.. from command line and don't necessarily change any file permanently.

Resources