Hadoop HDFS - Keep many part files or concat?

Hadoop HDFS - Keep many part files or concat? - hadoop

After running a map-reduce job in Hadoop, the result is a directory with part files. The number of part files depend on the number of reducers, and can reach dozens (80 in my case).
Does keeping multiple part files affect the performance of future map-reduce operations, to the better or worse? Will taking an extra reduction step and merging all the parts improve or worsen the speed of further processing?
Please refer only to map-reduce performance issues. I don't care about splitting or merging these results in any other way.

Running further mapreduce operations on the part directory should have little to no impact on overall performance.
The reason is the first step Hadoop does is split the data in the input directory according to the size and places the split data onto the Mappers. Since it's already splitting the data into separate chunks, splitting one file vs many shouldn't impact performance, the amount of data being transferred over the network should be roughly equal, as should the amount of processing and disk time.
There might be some degenerate cases where part files will be slower. For example instead of 1 large file you had thousands/millions of part files. I also can think of situations where having many part files would be faster. For example, if you don't have splittable files (not usually the case unless you are using certain compression schemes), then you would have to put your 1 big file on a single mapper since its unsplittable, where the many part files would be distributed more or less as normal.

It all depends on what the next task needs to do.
If you have analytics data and you have 80 files per (partially processed) input day then you have a huge performance problem if the next job needs to combine the data over the last two years.
If however you have only those 80 then I wouldn't worry about it.

Related

HDFS behavior on lots of small files and 128 Mb block size

I have lots (up to hundreds of thousands) of small files, each 10-100 Kb. I have HDFS block size equal 128 MB. I have replication factor equal 1.
Is there any drawbacks of allocating HDFS block per small file?
I've seen pretty contradictory answers:
Answer which said the smallest file takes the whole block
Answer which said that HDFS is clever enough, and small file will take small_file_size + 300 bytes of metadata
I made a test like in this answer, and it proves that the 2nd option is correct - HDFS doesn't allocate the whole block for small files.
But, how about batch read of 10.000 small files from HDFS? Does it will be slow down because of 10.000 blocks and metadatas? Is there any reason to keep multiple small files within single block?
Update: my use case
I have only one use case for small files, from 1.000 up to 500.000. I calculate that files once, store it, and than read them all at once.
1) As I understand, NameNode space problem is not a problem for me. 500.000 is an absolute maximum, I will never have more. If each small file takes 150 bytes on NN, than the absolute maximum for me is - 71.52 MB, which is acceptable.
2) Does Apache Spark eliminate MapReduce problem? Will sequence files or HAR help me to solve the issue? As I understand, Spark shouldn't depend on Hadoop MR, but it's still too slow. 490 files takes 38 seconds to read, 3420 files - 266 seconds.
sparkSession
.read()
.parquet(pathsToSmallFilesCollection)
.as(Encoders.kryo(SmallFileWrapper.class))
.coalesce(numPartitions);

As you have noticed already, the HDFS file does not take anymore space than it needs, but there are other drawbacks of having the small files in the HDFS cluster. Let's go first through the problems without taking into consideration batching:
NameNode(NN) memory consumption. I am not aware about Hadoop 3 (which is being currently under development) but in previous versions NN is a single point of failure (you can add secondary NN, but it will not replace or enhance the primary NN at the end). NN is responsible for maintaining the file-system structure in memory and on the disk and has limited resources. Each entry in file-system object maintained by NN is believed to be 150 bytes (check this blog post). More files = more RAM consumed by the NN.
MapReduce paradigm (and as far as I know Spark suffers from the same symptoms). In Hadoop Mappers are being allocated per split (which by default corresponds to the block), this means, that for every small file you have out there a new Mapper will need to be started to process its contents. The problem is that for small files it actually takes much more for Hadoop to start the Mapper than process the file content. Basically, you system will be doing unnecessary work of starting/stopping Mappers instead of actually processing the data. This is the reason Hadoop processes much fast 1 128MBytes file (with 128MBytes blocks size) rather than 128 1MBytes files (with same block size).
Now, if we talk about batching, there are few options you have out there: HAR, Sequence File, Avro schemas, etc. It depends on the use case to give the precise answers to your questions. Let's assume you do not want to merge files, in this case you might be using HAR files (or any other solution featuring efficient archiving and indexing). In this case the NN problem is solved, but the number of Mappers still will be equal to the number of splits. In case merging files into large one is an option, you can use Sequence File, which basically aggregates small files into bigger ones, solving to some extend both problems. In both scenarios though you cannot really update/delete the information directly like you would be able to do with small files, thus more sophisticated mechanisms are required for managing those structures.
In general, in the main reason for maintaining many small files is an attempt to make fast reads, I would suggest to take a look to different systems like HBase, which were created for fast data access, rather than batch processing.

Processing HUGE number of small files independently

The task is to process HUGE (around 10,000,000) number of small files (each around 1MB) independently (i.e. the result of processing file F1, is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that processing files in my case, are independent. As far as I understand MR, it works best when the output is dependent on many individual files (for example counting the frequency of each word, given many documents, since a word might be included in any document in the input file). But in my case, I just need a lot of independent CPUs/Cores.
I was wondering if you have any advice on this.
Side Notes: There is another issue which is that MR works best for "huge files rather than huge number of small size". Although there seems to be solutions for that. So I am ignoring it for now.

It is possible to use map reduce for your needs. In MapReduce, there are two phases Map and Reduce, however, the reduce phase is not a must, just for your situation, you could write a map-only MapReduce job, and all the calculations on a single file should be put into a customised Map function.
However, I haven't process such huge num of files in a single job, no idea on its efficiency. Try it yourself, and share with us :)

This is quite easy to do. In such cases - the data for MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names - which is order of a couple of gigs max.
One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Then each mapper gets a list of files. It can process one file at a time and generate the output.
(fwiw - I would suggest Qubole (where I am a founder) instead of EMR cause it would save you a ton of money with auto-scaling and spot integration).

Large number of small files Hadoop

Parameters of some machines are measured and uploaded via a web service to HDFS. Parameter values are saved in a file for each measurement, where a measurement has 1000 values in average.
The problem is - there is a large number of files. Only certain number of files is used for MapReduce job (for example, measurements from last month). Because of this I'm not able to merge them all into one large sequence file, since different files are needed in different time.
I understand that is bad to have a large number of small files, since the NameNode contains paths to all of them on HDFS (and keeps it in its memory) and on the other hand, each small file will result in a Mapper creation.
How can I avoid this problem?

A late answer: You can use SeaweedFS https://github.com/chrislusf/seaweedfs (I am working on this). It has special optimization for large number of small files.
HDFS actually has good support to delegate file storage to other file systems. Just add a SeaweedFS hadoop jar. See https://github.com/chrislusf/seaweedfs/wiki/Hadoop-Compatible-File-System

You could concatenate the required files into one temporal file that is deleted once analyzed. I think you can create this very easily in a script.
Anyway, make the numbers: such a big file will be also splited into several pieces whose size will be the blocksize (dfs.blocksize parameter a hdfs-defaul.xml), and each one of these pieces will be assigned to a mapper. I mean, depending on the blocksize and the average "small file" size, maybe the gain is not so great.

improving performance when you have many small input files using Pig Latin

Currently I'm working with approximately 19 gigabytes of log data,
and they are much seperated so that the nubmer of input files is 145258(pig stat).
Between executing application and starting mapreduce job in web UI,
enormous time is wasted to get prepared(about 3hours?) and then the mapreduce job starts.
and also mapreduce job itself(through Pig script) is pretty slow, it takes about an hour.
mapreduce logic is not that complex, just like a group by operation.
I have 3 datanodes and 1 namenode, 1 secondary namenode.
How can I optimize configuration to improve mapreduce performance?

You should set pig.maxCombinedSplitSize to a reasonable size and make sure that pig.splitCombination is set to its default true.
Where is your data? on HDFS? on S3? If the data is on S3, you should merge the data into larger files once and then execute your pig scripts on it, otherwise, it's going to take a long time anyway - S3 returns object lists with pagination and it takes a long time to fetch the list (also if you have more objects in the bucket and you're not searching for your files with a prefix only pattern, hadoop will list all of the objects (because there's no other option in S3).

Try a hadoop fs -ls /path/to/files | wc -l and look at how long that takes to come back - you have two problems:
Discovering the files to process - the above ls will probably take a good number of minutes to complete. Each file then has to be queried for its block size to determine whether it can be split / processed by multiple mappers
Retaining all the information from the above is most probably going to push the JVM limits of your client, you'll probably see a huge amount of GC trying to assign, allocate and grow the collection used to store the split information for the at minimum 145k splits.
So as already suggested, try to combine your files into more sensible file sizes (somewhere near you block size, or a multiple thereof). Maybe you can combine all files for the same hour into a single concatenated file (or to day, depends on your processing use case).

Looks like the problem is more of Hadoop than Pig. You might want to try to combine all the small files into a Hadoop Archive and see if it improves the performance. For details refer to this link
Another approach you can try is run a separate Pig job which periodically UNIONs all the log files into one "big" log file. This should help in reducing the processing time for your main job.

Hadoop smaller input file

I am using hadoop in a little different way. In my case, input size is really small. However, computation time is more. I have some complicated algorithm which I will be running on every line of input. So even though the input size is less than 5mb, the overall computation time is over 10hrs. So I am using hadoop here. I am using NLineInputFormat to split the file by number of lines rather than block size. In my initial testing, I had around 1500 lines (Splitting by 200 lines) and I saw only a improvement of 1.5 times in a four node cluster compared to that of running it serially on one machine. I am using VM's. Could that be the issue or for smaller size input there wont be much benefits with hadoop? Any insights will be really helpful.

To me, your workload resembles SETI#Home work load -- small payloads but hours of crunching time.
Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.
If you want to keep your workload together:
1) split them into individual files (one workload, one file) if the file is less than block size then it will go to one mapper. Typical block sizes are 64MB or 128MB
2) create a wrapper for FileInputFormat, and override the 'isSplitable()' method to false. This will make sure entire file contents are fed to one mapper, rather than hadoop trying to split it line by line
reference : http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html

Hadoop is not really good at dealing with tons of small files, hence, it is often desired to combine a large number of smaller input files into less number of bigger files so as to reduce number of mappers.
As Input to Hadoop MapReduce process is abstracted by InputFormat. FileInputFormat is a default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits typically upper bounded by block size. This means the number of input splits is lower bounded by number of input files. This is not an ideal environment for MapReduce process when it’s dealing with large number of small files, because overhead of coordinating distributed processes is far greater than when there is relatively large number of small files.
The basic parameter which drives the spit size is mapred.max.split.size.
Using CombineFileInputFormat and this parameter we can control the number of mappers.
Checkout the implementation I had for another answer here.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio