HDFS concat operation: Does it lead to increased seek time? - hadoop

I was trying to go through how HDFS implements the concat operation and drilled down to the following piece of code.
From this implementation it seems to me that concat is only a metadata operation on the inode of the target file and the actual blocks are not moved. I was wondering whether this would lead to fragmentation and increased seek time, since the different blocks would sit at different locations on disk (considering a magnetic disk). Is this assumption correct? If so, can we avoid it?
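For reference, here is a minimal sketch of the client-side call that triggers this metadata-only merge (the paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Target file: the sources' blocks are appended to its block list.
        Path target = new Path("/data/part-00000");
        // Source files: their blocks are moved to the target's inode and the
        // source inodes are removed. No block data is copied or relocated.
        Path[] sources = { new Path("/data/part-00001"), new Path("/data/part-00002") };

        fs.concat(target, sources);
    }
}

Note that HDFS imposes restrictions on concat, for example the sources and the target must live in the same directory.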

After a few experiments I found the answer to my own question. With very frequent file concat operations (around 1k per minute), the datanode started complaining about too many blocks within about a day, which led me to believe that this indeed leads to fragmentation and an increased number of blocks on disk. The solution I used was to write a separate job that concatenates (and, in my case, compresses) these files into a single splittable archive (note that gzip is not splittable!).
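For anyone looking for a starting point, below is a rough sketch of that kind of compaction job, assuming plain-text input and hypothetical paths (your formats and paths will differ); bzip2 is chosen here because, unlike gzip, it is splittable:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class CompactSmallFiles {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("compact-small-files")
                .getOrCreate();

        // Read the many small files produced by the frequent concat/ingest step.
        Dataset<String> lines = spark.read().textFile("/data/incoming/*");

        // Rewrite them as a handful of larger, splittable bzip2-compressed files.
        lines.coalesce(8)
             .write()
             .option("compression", "bzip2")
             .text("/data/compacted");

        spark.stop();
    }
}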

Related

Spark not ignoring empty partitions

I am trying to read a subset of a dataset by using pushdown predicate.
My input dataset consists of 1.2 TB spread over 43436 parquet files stored on S3. With the pushdown predicate I am supposed to read 1/4 of the data.
Looking at the Spark UI, I see that the job actually reads 1/4 of the data (300 GB), but there are still 43436 partitions in the first stage of the job. However, only 1/4 of these partitions have data; the other 3/4 are empty (check the median input data in the attached screenshots).
I was expecting Spark to create partitions only for the non-empty splits. I see a 20% performance overhead when reading the whole dataset with the pushdown predicate compared to reading the pre-filtered dataset (1/4 of the data) produced by another job directly. I suspect that this overhead is due to the huge number of empty partitions/tasks in my first stage, so I have two questions:
Is there any workaround to avoid these empty partitions?
Can you think of any other reason for the overhead? Maybe the pushdown filter execution is naturally a little slow?
Thank you in advance
Using S3 Select, you can retrieve only a subset of data.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
Otherwise, S3 acts as a plain object store, in which case an entire object has to be read. In your case you have to read the full content of all files and filter them on the client side.
There is actually a very similar question, where testing shows that:
The input size was always the same as the Spark job that processed all of the data
You can also see this question about optimizing data read from s3 of parquet files.
It seems your files are rather small: 1.2 TB / 43436 ≈ 30 MB each. So you may want to look at increasing spark.sql.files.maxPartitionBytes to see if it reduces the total number of partitions. I don't have much experience with S3, so I'm not sure whether it's going to help, given this note in its description:
The maximum number of bytes to pack into a single partition when
reading files. This configuration is effective only when using
file-based sources such as Parquet, JSON and ORC.
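If you want to experiment with it, a minimal sketch follows; the S3 path, the filter column and the 1 GB value are purely illustrative:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionPackingExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("tune-partition-packing")
                // Pack up to ~1 GB of file data into each input partition
                // instead of the default 128 MB.
                .config("spark.sql.files.maxPartitionBytes", 1024L * 1024 * 1024)
                .getOrCreate();

        // Hypothetical dataset and pushdown filter, standing in for the real job.
        Dataset<Row> df = spark.read()
                .parquet("s3://bucket/dataset/")
                .filter("partition_key = 'subset'");

        System.out.println("input partitions: " + df.rdd().getNumPartitions());
    }
}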
Empty partitions: It seems that Spark (2.4.5) really tries to create partitions of size ≈ spark.sql.files.maxPartitionBytes (default 128 MB) by packing many files into one partition; see the source code here.
However, it does this work before running the job, so it can't know that 3/4 of the files will produce no data once the pushed-down predicate is applied. For the partitions that happen to contain only files whose rows are all filtered out, I end up with empty partitions. This also explains why my maximum partition size is 44 MB and not 128 MB: no partition happened to consist entirely of files that passed the pushdown filter.
20% overhead: Finally, this is not due to the empty partitions. I managed to get far fewer empty partitions by setting spark.sql.files.maxPartitionBytes to 1 GB, but it didn't improve reading. I think the overhead is due to opening many files and reading their metadata.
Spark estimates that opening a file is equivalent to reading 4 MB (spark.sql.files.openCostInBytes). So the cost of opening many files, even ones that, thanks to the filter, won't actually be read, is not negligible.
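For what it's worth, the packing target Spark computes before launching the job can be approximated as below; the formula mirrors what the linked source code does, the defaultParallelism value is an assumption, and the other numbers are taken from this question:

public class SplitSizeEstimate {
    public static void main(String[] args) {
        long maxPartitionBytes = 128L * 1024 * 1024;  // spark.sql.files.maxPartitionBytes
        long openCostInBytes = 4L * 1024 * 1024;      // spark.sql.files.openCostInBytes
        long defaultParallelism = 400;                // assumed cluster parallelism
        long totalBytes = 1_200_000_000_000L;         // ~1.2 TB of input
        long numFiles = 43436;

        // Each file is charged an extra "open cost" before being packed into partitions.
        long bytesPerCore = (totalBytes + numFiles * openCostInBytes) / defaultParallelism;
        long maxSplitBytes = Math.min(maxPartitionBytes, Math.max(openCostInBytes, bytesPerCore));

        System.out.println("target split size ≈ " + maxSplitBytes / (1024 * 1024) + " MB");
    }
}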

HDFS behavior on lots of small files and 128 Mb block size

I have lots (up to hundreds of thousands) of small files, each 10-100 KB. My HDFS block size is 128 MB and my replication factor is 1.
Are there any drawbacks to allocating an HDFS block per small file?
I've seen pretty contradictory answers:
An answer which said the smallest file takes up the whole block
An answer which said that HDFS is clever enough and a small file will take up small_file_size + 300 bytes of metadata
I made a test like the one in this answer, and it proves that the 2nd option is correct - HDFS doesn't allocate the whole block for small files.
But what about a batch read of 10,000 small files from HDFS? Will it be slowed down because of the 10,000 blocks and their metadata? Is there any reason to keep multiple small files within a single block?
Update: my use case
I have only one use case for small files, from 1,000 up to 500,000 of them. I compute the files once, store them, and then read them all at once.
1) As I understand it, NameNode memory is not a problem for me. 500,000 files is an absolute maximum, I will never have more. If each small file takes 150 bytes on the NN, then the absolute maximum for me is 71.52 MB, which is acceptable.
2) Does Apache Spark eliminate the MapReduce problem? Will sequence files or HAR help me solve the issue? As I understand it, Spark shouldn't depend on Hadoop MR, but it's still too slow: 490 files take 38 seconds to read, 3420 files take 266 seconds.
// Read all the small parquet files in one pass, then reduce the number
// of partitions before further processing.
Dataset<SmallFileWrapper> smallFiles = sparkSession
    .read()
    .parquet(pathsToSmallFilesCollection)
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);
As you have noticed already, an HDFS file does not take up any more space than it needs, but there are other drawbacks to having small files in an HDFS cluster. Let's first go through the problems without taking batching into consideration:
NameNode (NN) memory consumption. I can't speak for Hadoop 3 (which is currently under development), but in previous versions the NN is a single point of failure (you can add a secondary NN, but in the end it will not replace or enhance the primary NN). The NN is responsible for maintaining the file-system structure in memory and on disk and has limited resources. Each file-system object maintained by the NN is believed to take about 150 bytes (check this blog post). More files = more RAM consumed by the NN.
MapReduce paradigm (and as far as I know Spark suffers from the same symptoms). In Hadoop, Mappers are allocated per split (which by default corresponds to a block). This means that for every small file you have out there, a new Mapper needs to be started to process its contents. The problem is that for small files it actually takes Hadoop much longer to start the Mapper than to process the file content. Basically, your system will be doing the unnecessary work of starting/stopping Mappers instead of actually processing the data. This is the reason Hadoop processes one 128 MB file (with a 128 MB block size) much faster than 128 files of 1 MB each (with the same block size).
Now, if we talk about batching, there are a few options out there: HAR, Sequence File, Avro schemas, etc. The precise answers to your questions depend on the use case. Let's assume you do not want to merge files; in this case you might use HAR files (or any other solution featuring efficient archiving and indexing). Then the NN problem is solved, but the number of Mappers will still be equal to the number of splits. If merging files into a large one is an option, you can use Sequence File, which basically aggregates small files into bigger ones, solving both problems to some extent. In both scenarios, though, you cannot really update/delete the information directly as you would be able to do with small files, so more sophisticated mechanisms are required for managing those structures.
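As a rough illustration of the Sequence File route (paths are hypothetical), the usual pattern is to pack the small files into one big file as key/value pairs of file name to raw content:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackIntoSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/data/small-files");
        Path output = new Path("/data/packed.seq");

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(input)) {
                byte[] content = new byte[(int) status.getLen()];
                try (InputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Key = original file name, value = raw file content.
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        }
    }
}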
In general, if the main reason for maintaining many small files is to make reads fast, I would suggest taking a look at different systems like HBase, which were created for fast data access rather than batch processing.

A better alternative to chunking a file line by line

The closest question I could find that resembles what I am asking is here.
Linux shell command to read/print file chunk by chunk
My system conditions
A cluster with a shared filesystem served over NFS
Disk capacity = 20T
File Description
Standard FASTQ files used in large scale genomics analysis
A file containing n lines, i.e. n/4 records.
Typical file size is 100 - 200 G
I keep them bzip2-compressed with compression level -9 (the value passed to bzip2)
When analyzing these files, I use SGE for my jobs, so I analyze them in chunks of 1M or 10M records.
So when dividing the file I use
<(bzcat [options] filename) > Some_Numbered_Chunk
to divide these files up into smaller chunks for efficient processing over SGE.
Problems
When dividing these files up, this chunking step represents a significant amount of computation time.
i. Because there are a lot of records to sift through.
ii. Because NFS I/O is not as fast as the bzcat pipe I am using for chunking, so NFS limits the speed at which a file can be chunked.
Many times I have to analyze 10-20 of these files together, and unpacked they aggregate to nearly 1-2 T of data. So on a shared system this is a very big limiting step and causes space crunches, as others have to wait for me to go back and delete these files. (No, I cannot delete all of these files as soon as the process has finished, because I need to manually make sure that all processes completed successfully.)
So how can I optimize this, using other methods, to lower the computation time and also so that these chunks use up less hard disk space?
Several options spring to mind:
Increase the bandwidth of your storage (add more physical links).
Store your data in smaller files.
Increase your storage capacity so you can reduce your compression ratio.
Do your analysis off your shared storage (get the file over NFS, write to a local disk).

How to set data block size in Hadoop ? Is it advantage to change it?

If the data block size in Hadoop can be changed, please let me know how to do that.
Is it advantageous to change the block size? If yes, then let me know why and how. If not, then let me know why.
You can change the block size at any time unless the dfs.blocksize parameter is defined as final in hdfs-site.xml.
To change block size
while running a hadoop fs command you can run hadoop fs -Ddfs.blocksize=67108864 -put <local_file> <hdfs_path>. This command will save the file with a 64 MB block size
while running a hadoop jar command - hadoop jar <jar_file> <class> -Ddfs.blocksize=<desired_block_size> <other_args>. The reducer will use the defined block size when storing the output in HDFS
as part of the MapReduce program, you can set the value on the job configuration (see the sketch below)
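A rough sketch of that programmatic route (the values are illustrative); the per-file FileSystem.create() overload is shown as an additional option:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        // Job-level: output files written to HDFS by this job use a 64 MB block size.
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
        Job job = Job.getInstance(conf, "custom-block-size-job");
        // ... configure mappers/reducers on 'job' as usual.

        // Per file: pass the desired block size directly when creating a file.
        FileSystem fs = FileSystem.get(conf);
        short replication = 3;
        long blockSize = 256L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(
                new Path("/data/output/file.bin"), true, 4096, replication, blockSize)) {
            out.write(new byte[] {1, 2, 3});
        }
    }
}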
Criteria for changing block size:
Typically 128 MB for uncompressed files works well
You can consider reducing the block size for compressed files. If the compression ratio is very high, then a larger block size might slow down processing. If the compression codec is not splittable, it will aggravate the issue.
As long as the file size is larger than the block size, you need not change the block size. If the number of mappers needed to process the data is very high, you can reduce it by increasing the split size. For example, if you have 1 TB of data with a 128 MB block size, by default it will take about 8000 mappers. Instead of changing the block size, you can consider changing the split size to 512 MB or even 1 GB, and far fewer mappers will be needed to process the data (see the sketch below).
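A minimal sketch of changing the split size rather than the block size, assuming a standard FileInputFormat-based job (the 512 MB value is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "large-split-job");

        // Ask for ~512 MB splits instead of one split per 128 MB block,
        // cutting the number of mappers roughly fourfold.
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
    }
}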
I have covered most of this in parts 2 and 3 of this performance tuning playlist.
There seems to be much confusion about this topic and also wrong advice going around. To lift the confusion it helps to think about how HDFS is actually implemented:
HDFS is an abstraction over distributed disk-based file systems. So the words "block" and "blocksize" have a different meaning than generally understood. For HDFS a "file" is just a collection of blocks, and each "block" in turn is stored as an actual file on a datanode. In fact the same block is stored on several datanodes, according to the replication factor. The blocksize of these individual files and their other performance characteristics in turn depend on the underlying filesystems of the individual datanodes.
The mapping between an HDFS file and the individual files on the datanodes is maintained by the namenode. But the namenode doesn't expect a specific blocksize; it just stores the mappings which were created during the creation of the HDFS file, which is usually split according to the default dfs.blocksize (but this can be individually overridden).
This means, for example, that if you have a 1 MB file with a replication of 3 and a blocksize of 64 MB, you don't lose 63 MB * 3 = 189 MB, since physically just three 1 MB files are stored with the standard blocksize of the underlying filesystems (e.g. ext4).
So the question becomes what a good dfs.blocksize is and if it's advisable to change it.
Let me first list the aspects speaking for a bigger blocksize:
Namenode pressure: As mentioned, the namenode has to maintain the mappings between dfs files and their blocks to physical files on datanodes. So the fewer blocks per file, the less memory pressure and communication overhead it has.
Disk throughput: Files are written by a single process in hadoop, which usually results in data written sequentially to disk. This is especially advantageous for rotational disks because it avoids costly seeks. If the data is written that way, it can also be read that way, so it becomes an advantage for reads and writes. In fact this optimization in combination with data locality (i.e. do the processing where the data is) is one of the main ideas of mapreduce.
Network throughput: Data locality is the more important optimization, but in a distributed system this can not always be achieved, so sometimes it's necessary to copy data between nodes. Normally one file (dfs block) is transferred via one persistent TCP connection which can reach a higher throughput when big files are transferred.
Bigger default splits: even though the splitsize can be configured at job level, most people don't consider this and just go with the default, which is usually the blocksize. If your splitsize is too small though, you can end up with too many mappers which don't have much work to do, which in turn can lead to even smaller output files, unnecessary overhead and many occupied containers which can starve other jobs. This also has an adverse effect on the reduce phase, since the results must be fetched from all mappers.
Of course the ideal splitsize heavily depends on the kind of work you have to do. But you can always set a lower splitsize when necessary, whereas when you set a higher splitsize than the blocksize you might lose some data locality.
The latter aspect is less of an issue than one would think though, because the rule for block placement in HDFS is: the first replica is written on the datanode where the process creating the file runs, the second one on a node in another rack, and the third one on a different node in the same rack as the second. So one replica of each block of a file can usually be found on a single datanode, so data locality can still be achieved even when one mapper is reading several blocks due to a splitsize which is a multiple of the blocksize. Still, in this case the mapred framework can only select one node instead of the usual three to achieve data locality, so an effect can't be denied.
But ultimately this point for a bigger blocksize is probably the weakest of all, since one can set the splitsize independently if necessary.
But there also have to be arguments for a smaller blocksize otherwise we should just set it to infinity…
Parallelism/Distribution: If your input data lies on just a few nodes even a big cluster doesn't help to achieve parallel processing, at least if you want to maintain some data locality. As a rule I would say a good blocksize should match what you also can accept as a splitsize for your default workload.
Fault tolerance and latency: If a network connection breaks, the cost of retransmitting a smaller file is lower. TCP throughput might be important, but individual connections shouldn't take forever either.
Weighting these factors against each other depends on your kind of data, cluster, workload etc. But in general I think the default blocksize 128 MB is already a little low for typical usecases. 512 MB or even 1 GB might be worth considering.
But before you even dig into that you should first check the size of your input files. If most of your files are small and don't even reach the default blocksize, your blocksize is basically always the filesize, and increasing the default blocksize wouldn't help at all. There are workarounds like using an input combiner to avoid spawning too many mappers, but ultimately you need to ensure your input files are big enough to take advantage of a big blocksize.
And if your files are already small don't compound the problem by making the blocksize even smaller.
It depends on the input data. The number of mappers is directly proportional to the number of input splits, which depends on the DFS block size.
If you want to maximize throughput for a very large input file, using very large blocks (128MB or even 256MB) is best.
If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256M or even 512M so that the number of tasks will be smaller.
For smaller files, using a smaller block size is better.
Have a look at this article
If you have small files that are smaller than the DFS block size, you can use alternatives like HAR or SequenceFiles.
Have a look at this cloudera blog

Hadoop HDFS - Keep many part files or concat?

After running a map-reduce job in Hadoop, the result is a directory with part files. The number of part files depends on the number of reducers and can reach dozens (80 in my case).
Does keeping multiple part files affect the performance of future map-reduce operations, for better or worse? Will taking an extra reduction step and merging all the parts improve or worsen the speed of further processing?
Please refer only to map-reduce performance issues. I don't care about splitting or merging these results in any other way.
Running further mapreduce operations on the part directory should have little to no impact on overall performance.
The reason is that the first thing Hadoop does is split the data in the input directory according to size and place the splits onto the Mappers. Since it already splits the data into separate chunks, splitting one file vs. many shouldn't impact performance; the amount of data transferred over the network should be roughly equal, as should the amount of processing and disk time.
There might be some degenerate cases where part files will be slower, for example if instead of 1 large file you had thousands/millions of part files. I can also think of situations where having many part files would be faster. For example, if you don't have splittable files (not usually the case unless you are using certain compression schemes), then you would have to put your 1 big file on a single mapper since it's unsplittable, whereas the many part files would be distributed more or less as normal.
It all depends on what the next task needs to do.
If you have analytics data and you have 80 files per (partially processed) input day then you have a huge performance problem if the next job needs to combine the data over the last two years.
If, however, you only have those 80, then I wouldn't worry about it.
