Slow MapReduce performance when using Custom Input Format - hadoop

I am having a performance issue with MapReduce. I have to read multiple CSV files.
Each CSV file produces a single output row.
I cannot split the CSV files in my custom input format, because the rows within a CSV file are not all in the same format. For example:
row 1 contains A, B, C
row 2 contains D, E, F
my output value should be like A, B, D, F
I have 1100 CSV files, so 1100 splits are created and hence 1100 mappers. The mappers are very simple and shouldn't take much time to run.
But the 1100 input files take a lot of time to process.
Can anyone please guide me on what I should look at, or whether I am doing anything wrong with this approach?

Hadoop performs better with a small number of large files, as opposed to a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block.)
The technical reasons for this are well explained in this Cloudera blog post:

Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
You can refer to this link for methods to solve this issue.

Related

HDFS behavior on lots of small files and 128 Mb block size

I have lots (up to hundreds of thousands) of small files, each 10-100 KB. The HDFS block size is 128 MB and the replication factor is 1.
Are there any drawbacks to allocating an HDFS block per small file?
I've seen pretty contradictory answers:
An answer which said the smallest file takes up a whole block
An answer which said that HDFS is clever enough, and a small file will take up small_file_size + 300 bytes of metadata
I made a test like in this answer, and it proves that the second option is correct: HDFS doesn't allocate a whole block for small files.
But what about a batch read of 10,000 small files from HDFS? Will it be slowed down because of the 10,000 blocks and their metadata? Is there any reason to keep multiple small files within a single block?
Update: my use case
I have only one use case for small files, from 1,000 up to 500,000 of them. I compute the files once, store them, and then read them all at once.
1) As I understand it, the NameNode space problem is not a problem for me. 500,000 is an absolute maximum; I will never have more. If each small file takes 150 bytes on the NN, then the absolute maximum for me is 71.52 MB, which is acceptable.
2) Does Apache Spark eliminate the MapReduce problem? Will sequence files or HAR help me solve the issue? As I understand it, Spark shouldn't depend on Hadoop MR, but it's still too slow: 490 files take 38 seconds to read, and 3,420 files take 266 seconds.
Dataset<SmallFileWrapper> ds = sparkSession
        .read()
        .parquet(pathsToSmallFilesCollection)
        .as(Encoders.kryo(SmallFileWrapper.class))
        .coalesce(numPartitions);
As you have noticed already, an HDFS file does not take any more space than it needs, but there are other drawbacks to having small files in an HDFS cluster. Let's first go through the problems without taking batching into consideration:
NameNode (NN) memory consumption. I am not aware of the situation in Hadoop 3 (which is currently under development), but in previous versions the NN is a single point of failure (you can add a secondary NN, but in the end it will not replace or enhance the primary NN). The NN is responsible for maintaining the file-system structure in memory and on disk, and it has limited resources. Each file-system object maintained by the NN is believed to take about 150 bytes (check this blog post). More files = more RAM consumed by the NN.
The MapReduce paradigm (and as far as I know, Spark suffers from the same symptoms). In Hadoop, mappers are allocated per split (which by default corresponds to a block). This means that for every small file you have out there, a new mapper needs to be started to process its contents. The problem is that for small files it actually takes Hadoop longer to start the mapper than to process the file content. Basically, your system will be doing the unnecessary work of starting and stopping mappers instead of actually processing the data. This is the reason Hadoop processes one 128 MB file (with a 128 MB block size) much faster than 128 files of 1 MB each (with the same block size).
Now, if we talk about batching, there are a few options out there: HAR, sequence files, Avro schemas, etc. It depends on the use case to give precise answers to your questions. Let's assume you do not want to merge the files; in this case you might use HAR files (or any other solution featuring efficient archiving and indexing). The NN problem is then solved, but the number of mappers will still be equal to the number of splits. If merging the files into a large one is an option, you can use a sequence file, which basically aggregates small files into bigger ones, solving both problems to some extent. In both scenarios, though, you cannot really update or delete the information directly the way you could with separate small files, so more sophisticated mechanisms are required for managing those structures.
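If you go the sequence-file route, a rough one-time packing job could look like the sketch below: it copies each small file into a single SequenceFile keyed by its original name. The input directory, output path, and whole-directory listing are illustrative assumptions, not something from the question.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path("/data/small-files");   // hypothetical input directory
    Path packed = new Path("/data/packed.seq");      // hypothetical output file

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(packed),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        // Assumes each file is small enough to buffer in memory.
        byte[] content = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          IOUtils.readFully(in, content, 0, content.length);
        }
        writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
      }
    }
  }
}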
In general, if the main reason for maintaining many small files is to get fast reads, I would suggest taking a look at different systems like HBase, which were created for fast data access rather than batch processing.

Large number of small files Hadoop

Parameters of some machines are measured and uploaded via a web service to HDFS. The parameter values are saved in one file per measurement, where a measurement has 1000 values on average.
The problem is that there is a large number of files. Only a certain subset of the files is used for a given MapReduce job (for example, the measurements from the last month). Because of this I'm not able to merge them all into one large sequence file, since different files are needed at different times.
I understand that it is bad to have a large number of small files, since the NameNode stores the paths of all of them on HDFS (and keeps them in memory), and, on the other hand, each small file results in the creation of a mapper.
How can I avoid this problem?
A late answer: You can use SeaweedFS https://github.com/chrislusf/seaweedfs (I am working on this). It has special optimizations for large numbers of small files.
HDFS actually has good support for delegating file storage to other file systems. Just add the SeaweedFS hadoop jar. See https://github.com/chrislusf/seaweedfs/wiki/Hadoop-Compatible-File-System
You could concatenate the required files into one temporary file that is deleted once it has been analyzed. I think you can do this very easily in a script.
Anyway, do the math: such a big file will also be split into several pieces whose size is the block size (the dfs.blocksize parameter in hdfs-default.xml), and each of these pieces will be assigned to a mapper. I mean, depending on the block size and the average "small file" size, maybe the gain is not that great.
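A minimal sketch of that concatenation idea using the HDFS FileSystem API (the measurement paths and the temporary file name are purely illustrative):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatForJob {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Only the files needed for this run (hypothetical paths).
    List<Path> needed = Arrays.asList(
        new Path("/measurements/2017-02-01.txt"),
        new Path("/measurements/2017-02-02.txt"));
    Path tmp = new Path("/tmp/measurements-last-month.txt");

    try (FSDataOutputStream out = fs.create(tmp, true)) {
      for (Path p : needed) {
        try (FSDataInputStream in = fs.open(p)) {
          IOUtils.copyBytes(in, out, conf, false); // false: keep 'out' open between files
        }
      }
    }
    // Run the MapReduce job with 'tmp' as the input path, then:
    // fs.delete(tmp, false);
  }
}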

is there any way to control inputsplit in map reduce

I have lots of small (150-300 KB) text files, about 9000 per hour, and I need to process them through MapReduce. I created a simple MR job that processes all the files and creates a single output file. When I ran this job on one hour of data, it took 45 minutes. When I started digging into the reasons for the poor performance, I found that the job creates as many input splits as there are files, which I am guessing is one reason for the poor performance.
Is there any way to control the input splits, so that I can say, for example, that 1000 files should be handled by one input split/mapper?
Hadoop is designed for a small number of huge files, not the other way around. There are some ways to get around it, like preprocessing the data or using CombineFileInputFormat.
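For example, a driver sketch using the stock CombineTextInputFormat (the class name, paths, and the 128 MB cap are assumptions for illustration; note that files are grouped into splits by total bytes, not by file count):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "small-files-job");
    job.setJarByClass(CombinedSmallFilesDriver.class);

    // Pack many small text files into ~128 MB splits instead of one split per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // job.setMapperClass(...); job.setReducerClass(...);  // the existing simple MR logic
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With a 128 MB cap, roughly 440-870 of the 150-300 KB files would end up in each split instead of one file per mapper.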

Writing RCFile - how many reducers?

I have a MapReduce implementation for processing certain logfiles directly into GZip Compressed RCFile, for easy loading into Hive (via external table projections).
In any event, I have code that successfully and correctly runs, emitting data as BytesRefArrayWritable into RCFileOutputFormat.
Currently, I am running this as a Map-only job, meaning that for N input splits, I get N output files. For example, for 50 input splits, I will get 50 files of .rc extension. Hive can interpret these files together without issue, but my question is as follows:
Is it optimal to have 50 (or N, as it were) RCFiles in a single directory, or is it optimal to have a single RCFile containing all the data? I know that RCFile is a columnar format, so IO is optimized for queries such as filtering on a particular column's value.
In the example I mentioned above with 50 input splits, in the first case, MapReduce will need to open 50 files and seek to the location of a column in question. It will also be able to parallelize this operation, given that these 50 files will be spread across HDFS. In the second case (all data in one RCFile), I would imagine MapReduce would sequentially stream the column values in the single RCFile and not have to stitch together 50 different results...
Is there a good way to reason about this? Is it a function of HDFS blocksize and the aggregate size of the Hive table?
Please let me know if I can clarify anything -- thanks in advance
Is it a function of HDFS blocksize
Primarily, yes. Adjust the number of reducers so as not to create partitions smaller than a block. I would consider this the main driving factor; a rough sizing sketch follows this answer.
Other than that, a smaller number of files is healthier for the NameNode. You also get some administrative benefits from not having 50 times more partitions than you really need on a Hive table (think of operations like removing obsolete partitions).
And I must reiterate the point of trying to move to the arguably superior ORC format.
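Here is the promised sketch of that sizing rule, as one might write it in the job driver (the input path is a placeholder; since the RCFiles are GZip-compressed, an estimate based on input bytes is an upper bound on the reducer count you actually need):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSizingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/logs/raw");                        // hypothetical input directory

    long totalBytes = fs.getContentSummary(input).getLength(); // total input size
    long blockSize = fs.getDefaultBlockSize(input);            // e.g. 128 MB

    // Aim for roughly one HDFS block (or more) per output RCFile.
    int reducers = (int) Math.max(1, totalBytes / blockSize);

    Job job = Job.getInstance(conf, "logs-to-rcfile");
    job.setNumReduceTasks(reducers);
    // ... mapper, reducer, RCFileOutputFormat, and paths as in the existing job ...
  }
}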

Hadoop smaller input file

I am using Hadoop in a slightly different way. In my case, the input size is really small, but the computation time is long. I have a complicated algorithm that I run on every line of input, so even though the input size is less than 5 MB, the overall computation time is over 10 hours. That is why I am using Hadoop here. I am using NLineInputFormat to split the file by number of lines rather than by block size. In my initial testing, I had around 1500 lines (splitting by 200 lines) and I saw only a 1.5x improvement on a four-node cluster compared to running it serially on one machine. I am using VMs. Could that be the issue, or is it that for smaller inputs there won't be much benefit from Hadoop? Any insights would be really helpful.
To me, your workload resembles the SETI@home workload: small payloads but hours of crunching time.
Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce, the processing framework you are using.
If you want to keep your workload together:
1) Split them into individual files (one workload, one file). If a file is smaller than the block size, it will go to one mapper. Typical block sizes are 64 MB or 128 MB.
2) Create a wrapper around FileInputFormat and override the isSplitable() method to return false. This makes sure the entire file contents are fed to one mapper, rather than Hadoop trying to split it line by line; a minimal sketch follows the reference below.
reference : http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html
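A minimal sketch of option 2), here extending TextInputFormat (a FileInputFormat subclass) so the usual line-based record reader is reused; the class name is illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Never split an input file: each file goes, in its entirety, to one mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

Register it in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class).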
Hadoop is not really good at dealing with tons of small files; hence, it is often desirable to combine a large number of small input files into a smaller number of bigger files, so as to reduce the number of mappers.
Input to the Hadoop MapReduce process is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper-bounded by the block size. This means the number of input splits is lower-bounded by the number of input files. This is not an ideal environment for the MapReduce process when it is dealing with a large number of small files, because the overhead of coordinating the distributed processes is far greater than it would be with a relatively small number of larger files.
The basic parameter which drives the split size is mapred.max.split.size.
Using CombineFileInputFormat and this parameter, we can control the number of mappers.
Check out the implementation I had for another answer here.
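For reference, a short driver sketch of setting that knob directly on the job configuration (the 256 MB value is just an example; in Hadoop 2 the property was renamed, with mapred.max.split.size kept as a deprecated alias):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SplitSizeConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hadoop 1 property name (still honored as a deprecated alias):
    conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);
    // Hadoop 2+ name for the same setting:
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);

    Job job = Job.getInstance(conf, "combined-input-job");
    job.setInputFormatClass(CombineTextInputFormat.class); // or a custom CombineFileInputFormat
    // ... mapper, reducer, input/output paths as usual ...
  }
}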

Resources