Hadoop smaller input file - hadoop

I am using Hadoop in a slightly different way. In my case the input size is really small, but the computation time is high: I run a complicated algorithm on every line of input. So even though the input is less than 5 MB, the overall computation time is over 10 hours, which is why I am using Hadoop. I use NLineInputFormat to split the file by number of lines rather than by block size. In my initial test I had around 1500 lines (split 200 lines at a time) and saw only a 1.5x improvement on a four-node cluster compared to running the job serially on one machine. I am using VMs. Could that be the issue, or is there simply not much benefit from Hadoop for small inputs? Any insights would be really helpful.
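For reference, here is a minimal sketch (my own, assuming the Hadoop 2.x mapreduce API; the mapper and reducer classes are placeholders for the expensive per-line algorithm) of the kind of NLineInputFormat setup described above, feeding 200 lines to each map task:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineSplitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "expensive-per-line-computation");
        job.setJarByClass(LineSplitJob.class);

        // Feed 200 input lines to each mapper; with ~1500 lines this
        // yields roughly 8 map tasks to spread across the cluster.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 200);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper/reducer classes omitted -- they depend on the algorithm.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}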

To me, your workload resembles a SETI@home workload -- small payloads, but hours of crunching time.
Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.
If you want to keep your workload together:
1) split the workloads into individual files (one workload, one file). If a file is smaller than the block size it will go to one mapper. Typical block sizes are 64 MB or 128 MB.
2) create a wrapper around FileInputFormat and override the isSplitable() method to return false. This makes sure the entire file contents are fed to one mapper, rather than Hadoop trying to split the file line by line (see the sketch after the reference link).
Reference: http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html
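A minimal sketch of option 2 (the class name is mine; assuming the new mapreduce API), wrapping TextInputFormat so that each file becomes exactly one split:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Whole files are fed to a single mapper because isSplitable() is false.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

The job would then use it via job.setInputFormatClass(WholeFileTextInputFormat.class);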

Hadoop is not really good at dealing with tons of small files; hence, it is often desirable to combine a large number of small input files into a smaller number of bigger files so as to reduce the number of mappers.
Input to a Hadoop MapReduce job is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper-bounded by the block size. This means the number of input splits is lower-bounded by the number of input files. This is not an ideal environment for MapReduce when it is dealing with a large number of small files, because the overhead of coordinating the distributed processes far outweighs the small amount of work each mapper does.
The basic parameter which drives the split size is mapred.max.split.size.
Using CombineFileInputFormat and this parameter we can control the number of mappers.
Check out the implementation I had for another answer here.
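Since the linked implementation isn't shown here, a rough equivalent for Hadoop 2.x (where the bundled CombineTextInputFormat can be used directly, and mapred.max.split.size corresponds to mapreduce.input.fileinputformat.split.maxsize) might look like the sketch below; the input path and the 128 MB cap are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap each combined split at ~128 MB so the mapper count is driven
        // by total data volume, not by the number of input files.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}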

Related

Large number of small files Hadoop

Parameters of some machines are measured and uploaded via a web service to HDFS. Parameter values are saved in one file per measurement, where a measurement has about 1000 values on average.
The problem is that there is a large number of files. Only a certain subset of the files is used for a given MapReduce job (for example, the measurements from the last month). Because of this I'm not able to merge them all into one large sequence file, since different files are needed at different times.
I understand that it is bad to have a large number of small files, since the NameNode holds the paths to all of them on HDFS (and keeps them in memory), and on the other hand each small file results in the creation of a mapper.
How can I avoid this problem?
A late answer: you can use SeaweedFS https://github.com/chrislusf/seaweedfs (I am working on this). It has special optimizations for large numbers of small files.
HDFS actually has good support for delegating file storage to other file systems. Just add the SeaweedFS hadoop jar. See https://github.com/chrislusf/seaweedfs/wiki/Hadoop-Compatible-File-System
You could concatenate the required files into one temporary file that is deleted once it has been analyzed. I think you can do this very easily in a script.
Anyway, do the numbers: such a big file will also be split into several pieces whose size will be the block size (the dfs.blocksize parameter in hdfs-default.xml), and each of these pieces will be assigned to a mapper. I mean, depending on the block size and the average "small file" size, maybe the gain is not that great.
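One way to build that temporary file from the Java side (a sketch of my own; the paths are made up, and FileUtil.copyMerge is available up to Hadoop 2.x):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSelectedMeasurements {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: the measurement files selected for this job,
        // and the temporary merged file that the MapReduce job will read.
        Path selected = new Path("/measurements/last-month");
        Path merged = new Path("/tmp/measurements-last-month.merged");

        // Concatenate every file under 'selected' into one HDFS file,
        // keeping the originals (deleteSource = false). Delete 'merged'
        // once the job has finished with it.
        FileUtil.copyMerge(fs, selected, fs, merged, false, conf, null);
    }
}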

Wordcount: More than 1 map task per block, with speculative execution off

In Wordcount, it appears that you can get more than 1 map task per block, with speculative execution off.
Does the jobtracker do some magic under the hood to distribute tasks more than provided by the InputSplits?
Blocks and splits are two different things. You might get more than one mapper for a block if that block contains more than one split.
The answer to this lies in the way that Hadoop InputFormats work:
In HDFS:
Let's take an example where the blocks are 1 MB in size, an input file in HDFS is 10 MB in size, and the minimum split size is > 1 MB.
1) First, a file is added to HDFS.
2) The file is split in to 10 blocks, each of size 1MB.
3) Then, each 1MB block is read by input splitter.
4) Since each 1 MB block is smaller than the MIN_SPLIT_SIZE, HDFS processes 1 MB at a time, with no extra splitting.
The moral of the story: FileInputFormat will actually do extra splitting for you if your splits are below the minimum split size.
I guess I had totally forgotten about this, but looking back, this has been a feature in Hadoop since the beginning. The ability of an input format to arbitrarily split blocks up at runtime is used by many ecosystem tools to distribute load in an application-specific way.
The tricky part is that in toy MapReduce jobs one would expect one split per block in all cases, while in real clusters we overlook the default split size parameters, which don't come into play unless you are using large files.
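For reference, the rule the mapreduce-API FileInputFormat uses to pick the split size boils down to the following (a simplified sketch of its computeSplitSize logic):

// minSize   <- mapreduce.input.fileinputformat.split.minsize
// maxSize   <- mapreduce.input.fileinputformat.split.maxsize
// blockSize <- the HDFS block size of the file
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
// With maxSize smaller than the block size, one block yields several
// splits (more than one map task per block); with minSize larger than
// the block size, a single split spans several blocks.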

Hadoop HDFS - Keep many part files or concat?

After running a map-reduce job in Hadoop, the result is a directory with part files. The number of part files depends on the number of reducers and can reach dozens (80 in my case).
Does keeping multiple part files affect the performance of future map-reduce operations, to the better or worse? Will taking an extra reduction step and merging all the parts improve or worsen the speed of further processing?
Please refer only to map-reduce performance issues. I don't care about splitting or merging these results in any other way.
Running further mapreduce operations on the part directory should have little to no impact on overall performance.
The reason is that the first thing Hadoop does is split the data in the input directory according to its size and place the splits onto the mappers. Since it already splits the data into separate chunks, splitting one file vs. many shouldn't impact performance: the amount of data transferred over the network should be roughly equal, as should the amount of processing and disk time.
There might be some degenerate cases where part files are slower, for example if instead of 1 large file you had thousands or millions of part files. I can also think of situations where having many part files would be faster: for example, if you don't have splittable files (not usually the case unless you are using certain compression schemes), then you would have to put your 1 big file on a single mapper since it is unsplittable, whereas the many part files would be distributed more or less as usual.
It all depends on what the next task needs to do.
If you have analytics data and you have 80 files per (partially processed) input day then you have a huge performance problem if the next job needs to combine the data over the last two years.
If, however, you only have those 80, then I wouldn't worry about it.
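Since the number of part files is simply the number of reducers (as noted in the question), the easiest way to end up with fewer parts is to lower the reducer count of the job that writes them, at the cost of less reduce-side parallelism. A minimal sketch:

// Produce a single part file instead of 80.
job.setNumReduceTasks(1);

Alternatively, hadoop fs -getmerge <dir> <localfile> concatenates the parts outside of MapReduce.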

Is InputSplit size or number of map tasks affected by the number of input files

Would it make a difference to the number of map tasks spawned by a job if I have a lot of small files (~HDFS block size) vs. a few large files?
It depends on which InputFormat you use, because that is what determines the input split computation, and thus the number of map tasks.
If you use the default TextInputFormat, each file will have at least 1 split, so there will be at least 1 mapper per file, even if the files are only a few kB. Each mapper then does very little work, and this introduces a lot of overhead for the MapReduce framework. That said, if you have a guarantee that these "small" files will be close to the block size, it probably doesn't matter too much.
If you have no control over your files and they might get really small, I would advise using a different InputFormat called CombineFileInputFormat, which combines several input files into the same split; the number of maps in this case will depend only on the overall amount of data, regardless of the number of files. An implementation can be found here.

How can I work with a large number of small files in Hadoop?

I am new to Hadoop and I'm working with a large number of small files in the wordcount example.
It takes a lot of map tasks, and that slows down my execution.
How can I reduce the number of map tasks?
If the best solution to my problem is concatenating the small files into a larger file, how can I concatenate them?
If you're using something like TextInputFormat, the problem is that each file has at least 1 split, so the upper bound on the number of maps is the number of files; in your case, where you have many very small files, you will end up with many mappers each processing very little data.
To remedy that, you should use CombineFileInputFormat, which will pack multiple files into the same split (I think up to the block size limit), so with that format the number of mappers will be independent of the number of files; it will simply depend on the amount of data.
You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined, let's call it CombinedInputFormat as in the link, you can tell your job to use it by doing:
job.setInputFormatClass(CombinedInputFormat.class);
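Since the linked implementation isn't reproduced here, the following is a sketch of what such a CombinedInputFormat can look like; it essentially mirrors what Hadoop 2.x ships as CombineTextInputFormat (many small text files per split, with the actual line reading delegated to TextInputFormat's reader):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // One reader per combined split; it iterates over the packed files.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, TextRecordReaderWrapper.class);
    }

    // Reuses TextInputFormat's line reader for each file inside the split.
    private static class TextRecordReaderWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextRecordReaderWrapper(CombineFileSplit split,
                TaskAttemptContext context, Integer idx)
                throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
        }
    }
}

Note that without a maximum split size (e.g. mapreduce.input.fileinputformat.split.maxsize), CombineFileInputFormat may pack a very large amount of data into each split.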
Cloudera posted a blog on the small files problem some time back. It's an old entry, but the suggested method still applies.
