A better alternative to chunking a file line by line

The closest question I found that resembles what I am asking is "Linux shell command to read/print file chunk by chunk".
My system conditions
A cluster with a shared filesystem served over NFS
Disk capacity = 20T
File Description
Standard FASTQ files used in large scale genomics analysis
A File containing n lines or n/4 records.
Typical file size is 100 - 200 G
I keep them compressed with bzip2 at compression level -9.
When analyzing these files I run my jobs through SGE, so I analyze them in chunks of 1M or 10M records.
So when dividing the file I use
<(bzcat [options] filename) > Some_Numbered_Chunk
to divide these files up into smaller chunks for efficient processing over SGE.
Problems
Dividing these files up takes a significant amount of computation time:
i. Because there are a lot of records to sift through.
ii. Because NFS IO is not as fast as the bzcat pipe I use for chunking, so NFS limits the speed at which a file can be chunked.
I often have to analyze 10-20 of these files together, and unpacked they add up to nearly 1-2T of data. On a shared system this is a very big limiting step and causes space crunches, as others have to wait for me to go back and delete these files. (No, I cannot delete all of these files as soon as the process has finished, because I need to manually make sure that all processes completed successfully.)
So how can I optimize this with other methods to lower the computation time, and so that these chunks use less disk space?

Several options spring to mind:
Increase the bandwidth of your storage (add more physical links).
Store your data in smaller files.
Increase your storage capacity so you can reduce your compression ratio.
Do your analysis off your shared storage (get the file over NFS, write to a local disk).
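
A minimal sketch of how the "smaller files" and "local disk" suggestions could be combined, assuming Python is available on the compute nodes. It streams a bzip2-compressed FASTQ file once and writes each chunk of records back out compressed, so chunks never sit uncompressed on shared storage. The function name chunk_fastq, the chunk file naming, and the idea of pointing out_dir at node-local scratch space are illustrative assumptions, not something from the question.

```python
import bz2
import itertools
from pathlib import Path

LINES_PER_RECORD = 4  # a FASTQ record is four lines


def chunk_fastq(src, out_dir, records_per_chunk):
    """Stream a bzip2-compressed FASTQ file into bzip2-compressed chunks.

    Hypothetical helper: returns the list of chunk paths it wrote.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = records_per_chunk * LINES_PER_RECORD
    chunks = []
    with bz2.open(src, "rb") as reader:
        for index in itertools.count():
            # Lazily take the next block of lines; nothing is read up front.
            block = itertools.islice(reader, lines_per_chunk)
            first = next(block, None)
            if first is None:
                break  # input exhausted
            path = out_dir / f"chunk_{index:05d}.fastq.bz2"
            with bz2.open(path, "wb", compresslevel=9) as writer:
                writer.write(first)
                writer.writelines(block)
            chunks.append(path)
    return chunks


# Example call (paths are placeholders):
# chunk_fastq("sample.fastq.bz2", "/local/scratch/chunks", 1_000_000)
```

Each SGE task would then decompress only its own chunk. Whether this is faster than the original pipeline depends on how CPU-bound the bzip2 recompression is compared with the NFS write cost, which this sketch does not measure.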

Related

HDFS behavior on lots of small files and 128 Mb block size

I have lots (up to hundreds of thousands) of small files, each 10-100 KB. My HDFS block size is 128 MB and my replication factor is 1.
Are there any drawbacks to allocating an HDFS block per small file?
I've seen pretty contradictory answers:
Answer which said the smallest file takes the whole block
Answer which said that HDFS is clever enough, and small file will take small_file_size + 300 bytes of metadata
I made a test like in this answer, and it proves that the 2nd option is correct - HDFS doesn't allocate the whole block for small files.
But what about a batch read of 10.000 small files from HDFS? Will it be slowed down by the 10.000 blocks and their metadata? Is there any reason to keep multiple small files within a single block?
Update: my use case
I have only one use case for small files, from 1.000 up to 500.000 of them. I calculate those files once, store them, and then read them all at once.
1) As I understand it, NameNode space is not a problem for me. 500.000 is an absolute maximum; I will never have more. If each small file takes 150 bytes on the NN, then the absolute maximum for me is 71.52 MB, which is acceptable.
2) Does Apache Spark eliminate the MapReduce problem? Will sequence files or HAR help me solve the issue? As I understand it, Spark shouldn't depend on Hadoop MR, but it's still too slow: 490 files take 38 seconds to read, and 3420 files take 266 seconds.
sparkSession
    .read()
    .parquet(pathsToSmallFilesCollection)
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);

As you have already noticed, an HDFS file does not take any more space than it needs, but there are other drawbacks to having small files in an HDFS cluster. Let's first go through the problems without taking batching into consideration:
NameNode (NN) memory consumption. I don't know about Hadoop 3 (which is currently under development), but in previous versions the NN is a single point of failure (you can add a secondary NN, but in the end it will not replace or enhance the primary NN). The NN is responsible for maintaining the file-system structure in memory and on disk, and it has limited resources. Each file-system object maintained by the NN is believed to take about 150 bytes (check this blog post). More files = more RAM consumed by the NN.
MapReduce paradigm (and as far as I know Spark suffers from the same symptoms). In Hadoop, Mappers are allocated per split (which by default corresponds to a block). This means that for every small file a new Mapper has to be started to process its contents. The problem is that for small files it takes Hadoop much longer to start the Mapper than to process the file content. Basically, your system will be doing the unnecessary work of starting and stopping Mappers instead of actually processing data. This is why Hadoop processes one 128 MB file (with a 128 MB block size) much faster than 128 files of 1 MB each (with the same block size).
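
The two costs above can be put into rough numbers. This is a back-of-the-envelope sketch, not a description of Hadoop internals: it assumes the 150-byte rule of thumb applies to every file object and every block object, and it uses a simplified split rule (one split per block, at least one per non-empty file). The function names are made up for illustration.

```python
import math

BYTES_PER_NN_OBJECT = 150  # rule-of-thumb figure quoted above, not a measurement
MB = 1024 * 1024


def namenode_bytes(num_files, blocks_per_file=1):
    """Rough NameNode memory: one object per file plus one per block (assumption)."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_NN_OBJECT


def map_tasks(file_sizes, block_size):
    """Simplified split count: every non-empty file yields at least one split."""
    return sum(math.ceil(size / block_size) for size in file_sizes if size > 0)


# 500,000 single-block files: 500,000 * 2 objects * 150 bytes
print(namenode_bytes(500_000))               # 150000000
# One 128 MB file versus 128 files of 1 MB, both with 128 MB blocks
print(map_tasks([128 * MB], 128 * MB))       # 1
print(map_tasks([1 * MB] * 128, 128 * MB))   # 128
```

Under these assumptions the NameNode cost stays modest for a few hundred thousand files, while the number of map tasks grows linearly with the file count, which is where the startup overhead comes from.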
Now, if we talk about batching, there are a few options available: HAR, Sequence File, Avro schemas, etc. Precise answers to your questions depend on the use case. Let's assume you do not want to merge files; in this case you might use HAR files (or any other solution featuring efficient archiving and indexing). The NN problem is then solved, but the number of Mappers will still be equal to the number of splits. If merging files into a large one is an option, you can use a Sequence File, which basically aggregates small files into bigger ones, solving both problems to some extent. In both scenarios, though, you cannot update or delete the information directly as you could with small files, so more sophisticated mechanisms are required for managing those structures.
In general, if the main reason for maintaining many small files is an attempt to make reads fast, I would suggest taking a look at different systems like HBase, which were created for fast data access rather than batch processing.

HDFS concat operation: Does it lead to increased seek time?

I was trying to go through how HDFS implements the concat operation and drilled down to the following piece of code.
From this implementation it seems to me that concat is only a metadata operation on the INode of the target file and that the actual blocks are not moved. I was wondering whether this would lead to fragmentation and increased seek time, as different blocks would be at different locations on the disk (considering a magnetic disk). Is this assumption correct? If yes, can we avoid this?

After a few experiments I found the answer to my own question. After very frequent file concat operations (around 1k per minute), the data node started complaining about too many blocks within about a day, which led me to believe that this does indeed lead to fragmentation and an increased number of blocks on disk. The solution I used is a separate job that concatenates (and, in my case, compresses) these files into a single splittable archive (note that gzip is not splittable!).

What if I set the HDFS block size to 1 GB?

What if I set the HDFS block size to 1 GB and upload a file of almost 1 GB? Would MapReduce process it faster? I think that with a larger block size there will be fewer container requests to the resource manager (one per map task) than with the default. That should decrease the latency of initializing containers and also decrease network latency.
So, what do you all think?
Thanks

There are a number of things that this impacts. Most obviously, a file will have fewer blocks if the block size is larger. This can potentially make it possible for client to read/write more data without interacting with the Namenode, and it also reduces the metadata size of the Namenode, reducing Namenode load (this can be an important consideration for extremely large file systems).
With fewer blocks, the file may potentially be stored on fewer nodes in total; this can reduce total throughput for parallel access, and make it more difficult for the MapReduce scheduler to schedule data-local tasks.
When using such a file as input for MapReduce (and not constraining the maximum split size to be smaller than the block size), it will reduce the number of tasks which can decrease overhead. But having fewer, longer tasks also means you may not gain maximum parallelism (if there are fewer tasks than your cluster can run simultaneously), increase the chance of stragglers, and if a task fails, more work needs to be redone. Increasing the amount of data processed per task can also cause additional read/write operations (for example, if a map task changes from having only one spill to having multiple and thus needing a merge at the end).
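
The parallelism trade-off can be illustrated with a small calculation. This is only a sketch under simplifying assumptions: one map task per block, every task takes the same time, and the cluster can run a fixed number of map tasks at once. The name task_plan and the figure of 8 parallel slots are hypothetical.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def task_plan(file_size, block_size, parallel_slots):
    """Estimate map tasks, scheduling waves, and idle slots for one input file."""
    tasks = math.ceil(file_size / block_size)      # one task per block (assumption)
    waves = math.ceil(tasks / parallel_slots)      # rounds needed to run all tasks
    idle_slots = max(parallel_slots - tasks, 0)    # capacity left unused
    return {"tasks": tasks, "waves": waves, "idle_slots": idle_slots}


# A 1 GB file on a cluster that can run 8 map tasks at once
print(task_plan(1 * GB, 128 * MB, 8))  # {'tasks': 8, 'waves': 1, 'idle_slots': 0}
print(task_plan(1 * GB, 1 * GB, 8))    # {'tasks': 1, 'waves': 1, 'idle_slots': 7}
```

With a 1 GB block the single task saves container startups, but seven of the eight slots sit idle and a failure means redoing the whole gigabyte.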
Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (128MB or even 256MB) is best. For smaller files, using a smaller block size is better. Note that you can have files with different block sizes on the same file system by setting the dfs.block.size parameter (named dfs.blocksize in Hadoop 2 and later) when the file is written, e.g. when uploading with the command line tools: "hdfs dfs -D dfs.block.size=xxxxxxx -put localpath dfspath" (generic options such as -D have to come before the command's own arguments).
Source: http://channel9.msdn.com/Forums/TechOff/Impact-of-changing-block-size-in-Hadoop-HDFS
Useful link to read:
Change block size of dfs file
How Mappers get assigned.

The answer above is right. You cannot judge how well a Hadoop system performs just by adjusting the block size.
But according to my tests with different block sizes in Hadoop, 256M is a good choice.

Hadoop HDFS - Keep many part files or concat?

After running a map-reduce job in Hadoop, the result is a directory with part files. The number of part files depends on the number of reducers, and can reach dozens (80 in my case).
Does keeping multiple part files affect the performance of future map-reduce operations, to the better or worse? Will taking an extra reduction step and merging all the parts improve or worsen the speed of further processing?
Please refer only to map-reduce performance issues. I don't care about splitting or merging these results in any other way.

Running further mapreduce operations on the part directory should have little to no impact on overall performance.
The reason is that the first thing Hadoop does is split the data in the input directory according to size and place the splits onto the Mappers. Since it is already splitting the data into separate chunks, splitting one file versus many shouldn't affect performance: the amount of data transferred over the network should be roughly equal, as should the amount of processing and disk time.
There might be some degenerate cases where part files will be slower, for example if instead of 1 large file you had thousands or millions of part files. I can also think of situations where having many part files would be faster. For example, if your files are not splittable (not usually the case unless you are using certain compression schemes), you would have to put your 1 big file on a single mapper since it is unsplittable, whereas the many part files would be distributed more or less as normal.

It all depends on what the next task needs to do.
If you have analytics data and you have 80 files per (partially processed) input day then you have a huge performance problem if the next job needs to combine the data over the last two years.
If however you have only those 80 then I wouldn't worry about it.

Serve static files from Hadoop

My job is to design a distributed system for static image/video files. The size of the data is about tens of terabytes. It's mostly for HTTP access (thus no processing of the data, or only simple processing such as resizing; however, that's not important because it can be done directly in the application).
To be a little more clear, it's a system that:
Must be distributed (horizontal scale), because the total size of data is very big.
Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
Generally, no requirement on processing the data (thus MapReduce is not needed)
Setting HTTP access on the data could be done easily.
(Should have) good throughput.
I am considering:
Native network file system: But it seems not feasible because the data can not fit into one machine.
Hadoop filesystem. I worked with Hadoop mapreduce before, but I have no experience using Hadoop as a static file repository for HTTP requests. So I don't know if it's possible or if it's a recommended way.
MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.
Any suggestions, please?

I am the author of Weed-FS. For your requirement, Weed-FS is ideal. Hadoop cannot handle many small files: in addition to the reasons you gave, each file needs to have an entry in the master. If the number of files is big, the HDFS master node cannot scale.
Weed-FS is getting faster when compiled with the latest Golang releases.
Many new improvements have been made to Weed-FS recently. Now you can test and compare very easily with the built-in upload tool. This one uploads all files recursively under a directory.
weed upload -dir=/some/directory
Now you can compare by running "du -k /some/directory" to see the original disk usage, and "ls -l /your/weed/volume/directory" to see the Weed-FS disk usage.
And I suppose you would need replication that is data center and rack aware, etc. Those features are in now!

Hadoop is optimized for large files, e.g. its default block size is 64M. A lot of small files are both wasteful and hard to manage on Hadoop.
You can take a look at other distributed file systems e.g. GlusterFS

Hadoop has a REST API for accessing files. See this entry in the documentation. I feel that Hadoop is not meant for storing a large number of small files.
HDFS is not geared up for efficiently accessing small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes. The default block size is 64 MB. A 10 KB file does not consume a full 64 MB on disk, but it still needs its own block, and therefore its own file and block objects in the namenode, so many small files waste namenode memory rather than disk space.
If the files are very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 blocks of 64 MB with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job can be tens or hundreds of times slower than the equivalent one with a single input file.
At Hadoop Summit 2011 there was a talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages etc.) on HDFS, but they don't use the same infrastructure for images and videos. They have their own system named Haystack for images. It's not open source, but they shared the abstract design-level details about it.
This brings me to weed-fs: an open source project inspired by Haystack's design. It's tailor-made for storing files. I have not used it so far, but it seems worth a shot.

If you are able to batch the files and have no requirement to update a batch after adding to HDFS, then you could compile multiple small files into a single larger binary sequence file. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).
This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
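
To illustrate the idea of packing many small files into one large container, here is a minimal sketch in plain Python on a local filesystem. It is not the Hadoop SequenceFile on-disk format; it only shows the same key/value principle (file name as key, file bytes as value) with a length-prefixed record layout invented for this example. The names pack, read_all and read_one are hypothetical.

```python
import struct
from pathlib import Path

# Record header: 2-byte name length, 8-byte payload length (big-endian).
HEADER = struct.Struct(">HQ")


def pack(paths, container):
    """Append each small file to one container; return {name: record offset}."""
    index = {}
    with open(container, "wb") as out:
        for path in map(Path, paths):
            name = path.name.encode("utf-8")
            data = path.read_bytes()
            index[path.name] = out.tell()
            out.write(HEADER.pack(len(name), len(data)))
            out.write(name)
            out.write(data)
    return index


def read_all(container):
    """Yield (name, data) pairs in one sequential pass over the container."""
    with open(container, "rb") as src:
        while True:
            header = src.read(HEADER.size)
            if not header:
                break  # end of container
            name_len, data_len = HEADER.unpack(header)
            name = src.read(name_len).decode("utf-8")
            yield name, src.read(data_len)


def read_one(container, offset):
    """Fetch a single record by the offset stored in the index."""
    with open(container, "rb") as src:
        src.seek(offset)
        name_len, data_len = HEADER.unpack(src.read(HEADER.size))
        name = src.read(name_len).decode("utf-8")
        return name, src.read(data_len)
```

A batch job would use read_all for one long streaming read, while the index returned by pack allows an occasional lookup of a single item without scanning the whole container.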
Hope this helps!
G
