Performance of accessing files moved vs copied

I was casually wondering whether there is a difference in read/write performance between files that are copied into a folder and files that are moved there (via mv).
I imagine that performing some serial operation over several files located in a contiguous region of the disk would be faster than over files scattered across a hard drive. That, I assume, is roughly what you get when you copy files somewhere versus moving them there from disparate origins. So: is there a performance difference between files moved and files copied to the same directory, how significant is it, and does it depend on the storage technology (HDD, SSD)?
Note: I am not asking whether mv or cp itself is faster. Please don't respond with a description of the difference between the commands. Thanks!

The way that move and copy work will have some (limited) bearing on this, assuming source and destination are located on the same physical volume.
However, assuming source and destination are not on the same volume, both behave the same in terms of writing the destination data. If the destination volume is completely empty and freshly formatted, then you 'probably' stand a good chance of the data being written to a similar location. If data is or has ever been written to the volume, there is no guarantee the file system won't simply scatter the new data anyway.
The file system will ultimately decide where the data is to be stored on the actual storage medium, and it may decide that neighbouring blocks are not the best solution. Copy or Move is irrelevant, as both will require the file system to store the data.
Giving those files their own mount point (a dedicated volume) is possibly the best way of ensuring they reside within a similar region of storage.
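If you want to see what the filesystem actually did in a particular case, one way on Linux (for filesystems that support the FIEMAP ioctl, such as ext4 or XFS) is to inspect the extent layout with filefrag. This is only a rough sketch and the paths are placeholders:

    # Copy and move a file into the same directory, then compare their on-disk extents.
    cp /data/other/big.iso /data/target/big-copied.iso   # filesystem allocates fresh blocks
    mv /data/other/big.iso /data/target/big-moved.iso    # same volume: only metadata changes, data blocks stay put

    filefrag -v /data/target/big-copied.iso
    filefrag -v /data/target/big-moved.iso               # extent count and placement hint at fragmentation

The number and placement of extents reported is entirely up to the filesystem, which is the point above: whether the file arrived by copy or by move does not let you control it.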
HTH

Related

How does the scratch space differ from the normal disk space on the home node?

I am new to HPC and I am struggling with setting up scratch space. On the cluster I am working with, I need to set up scratch space using the SLURM workload manager, and I am struggling with the following questions:
How does the scratch space differ from the normal disk space on the home node?
Does the procedure for setting up scratch space differ from cluster to cluster?
Is it possible to copy files from the scratch space to the home node while the simulation is still in progress? And is it possible to transfer files from the scratch space to my external hard disk without first copying them to my home node disk space? Or do these things differ from cluster to cluster? I ask because I tried a simulation with scratch: using SLURM, I initially copied my input files to the scratch folder, the timestep output files were directed to the scratch folder, and once the simulation completed, the timestep output files were copied to the home node disk space. While the simulation was in progress, I tried to access the timestep output files in the scratch folder, but I couldn't see them anywhere in the scratch space. Once the simulation was over, however, I was able to see the files on the home node. I am really confused about this.
Sorry if these questions sound silly. I am just completely new to HPC. Please feel free to ask any questions.
Thanks
Ram
When maintaining a large shared cluster, a frequently occurring problem is that people tend to store lots of data and do not take the effort to clean up after themselves. One way to address this is to limit the amount of data people can store in their home folder (e.g. 500 GB). This has a very clear downside: when you are dealing with larger amounts of data, you cannot use the cluster. Generally this is solved with a so-called scratch space. On the scratch space users can typically store large amounts of data (e.g. 8 TB), but the maintainers of the server might have some rules set up there (for instance, files automatically get deleted after two weeks).
So the scratch space differs in that files might be removed by the admins after some time, and sometimes the scratch space has better hardware, making I/O slightly faster there.
The scratch space is usually already set up and can be found, for instance, at /scratch.
The (usually) recommended way is to write all your output to the scratch space (also because I/O can be faster there), and when everything is done, copy the final results from scratch to your home folder. To copy from one place to another, take a look at the scp or rsync docs; yes, it should be possible. I don't know why you couldn't see your files.
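As a very rough sketch (every path below is hypothetical, and the scratch location, quotas and cleanup policy are cluster-specific, so check your site's documentation), a typical SLURM job script stages data into scratch, runs there, and copies the results back at the end:

    #!/bin/bash
    #SBATCH --job-name=sim
    #SBATCH --time=24:00:00

    # Hypothetical layout: a per-job directory on a shared scratch filesystem.
    SCRATCH_DIR=/scratch/$USER/$SLURM_JOB_ID
    mkdir -p "$SCRATCH_DIR"

    # Stage input files from home to scratch.
    cp ~/project/input/* "$SCRATCH_DIR"/

    # Run the simulation, writing timestep output into scratch.
    cd "$SCRATCH_DIR"
    srun ./my_simulation

    # Copy the results back to the home folder when the run is done.
    rsync -av "$SCRATCH_DIR"/ ~/project/results/

While the job is running you can normally list and copy files from the scratch directory on a login node, provided scratch is a shared filesystem; if it is node-local storage you would have to look on the compute node itself, which may be why you could not see the timestep files. Another common cause is output buffering: some applications only flush their output to disk periodically or at the end of the run.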

Merge multiple files into one on hadoop

A rather stupid question, but how do I combine multiple files in a folder into one file without copying them to the local machine? I do not care about the order.
I thought hadoop fs -getmerge would do the job but I have since found out that it copies the data into your local machine.
I would do it in my original spark application but adding coalesce is increasing my runtime by a large amount.
I am on Hadoop 2.4 if that matters.
how do I combine multiple files in a folder into one file without copying them to the local machine?
You have to either copy the files to the local node or to one of the computation nodes.
HDFS is a file system. It doesn't care about your file format. If your files are raw text/binary, you can try the concatenation API, which only manipulates metadata in the NameNode without copying data. But if your files are parquet/gzip/lzo or similar, they cannot simply be concatenated; you have to download them from HDFS, merge them into one, and upload the merged file. Spark's coalesce(1) does the same thing, except it is done on an executor node instead of your local node.
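For plain text or other raw byte formats there is also a workaround that never stores anything on local disk, although (unlike the metadata-only concat API) the bytes still stream through the client machine. Paths here are placeholders:

    # Concatenate all part files into one new HDFS file without writing to local disk.
    hadoop fs -cat '/data/input_dir/part-*' | hadoop fs -put - /data/merged/all_parts.txt

This only makes sense for formats where simple byte concatenation produces a valid file, as noted above.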
If you have many folders with files that need to be merged, Spark/MR is definitely the right choice. One reason is parallelism. The other reason is that if your files are in a format like gzip that doesn't support splitting, one huge gzip file may slow down your job. With some arithmetic you can merge the small files into relatively large ones (file size equal to or slightly smaller than the block size). That is very easy with the coalesce(n) API.
I suggest you merge the small files. But as cricket_007 mentioned in the comments, merging doesn't always bring a benefit.

How to handle unsplittable 500 MB+ input files in hadoop?

I am writing a hadoop MapReduce job that is running over all source code files of a complete Debian mirror (≈ 40 GB). Since the Debian mirror data is on a separate machine and not in the hadoop cluster, the first step is to download the data.
My first implementation downloads a file and outputs key=$debian_package, value=$file_contents. The various values (typically 4) per key should then be reduced to a single entry. The next MapReduce job will then operate on debian packages as keys and all their files as values.
However, I noticed that hadoop works very poorly with output values that can sometimes be really big (700 MB is the biggest I’ve seen). In various places in the MapReduce framework, entire files are stored in memory, sometimes twice or even three times. I frequently encounter out of memory errors, even with a java heap size of 6 GB.
Now I wonder how I could split the data so that it better matches hadoop’s 64 MB block size.
I cannot simply split the big files into multiple pieces, because they are compressed (tar/bz2, tar/xz, tar/gz, perhaps others in the future). Until I shell out to dpkg-source on them to extract the package as a whole (necessary!), the files need to keep their full size.
One idea that came to my mind was to store the files on hdfs in the first MapReduce and only pass the paths to them to the second MapReduce. However, then I am circumventing hadoop’s support for data locality, or is there a way to fix that?
Are there any other techniques that I have been missing? What do you recommend?
You are correct. This is NOT a good case for Hadoop internals. Lots of copying... There are two obvious solutions, assuming you can't just untar it somewhere:
Break up the tarballs using any of several libraries that will allow you to recursively read compressed and archive files (Apache VFS has limited capability for this, but the Apache compression library has more capability).
NFS-mount a bunch of data nodes' local space to your master node, then fetch and untar into that directory structure... then use forqlift or a similar utility to load the small files into HDFS.
Another option is to write a utility to do this. I have done this for a client: Apache VFS and the compression library, TrueZIP, then the Hadoop libraries to write (since I built a general-purpose utility I used a LOT of other libraries, but this is the basic flow).
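As a rough sketch of the second option (the paths, mirror layout and package name are all hypothetical placeholders, and the final packing step with forqlift, a SequenceFile, or a Hadoop archive is left out since it depends on the tooling you choose):

    # Stage one source package outside of Hadoop first.
    STAGE=/mnt/shared/staging/hello-2.10
    mkdir -p "$STAGE" && cd "$STAGE"
    cp /mirror/pool/main/h/hello/hello_2.10-1.dsc \
       /mirror/pool/main/h/hello/hello_2.10.orig.tar.gz \
       /mirror/pool/main/h/hello/hello_2.10-1.debian.tar.xz .
    dpkg-source -x hello_2.10-1.dsc extracted

    # Load the unpacked tree into HDFS so a later MapReduce job can read it;
    # pack the many small files afterwards so they don't overwhelm the namenode.
    hadoop fs -mkdir -p /debian/hello-2.10
    hadoop fs -put extracted /debian/hello-2.10/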

Serve static files from Hadoop

My job is to design a distributed system for static image/video files. The size of the data is in the tens of terabytes. It's mostly for HTTP access (thus no processing of the data, or only simple processing such as resizing; however, that's not important because it can be done directly in the application).
To be a little more clear, it's a system that:
Must be distributed (horizontal scale), because the total size of data is very big.
Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
Generally, no requirement on processing the data (thus MapReduce is not needed)
Setting up HTTP access to the data can be done easily.
(Should have) good throughput.
I am considering:
A native network file system: but it seems infeasible because the data cannot fit on a single machine.
The Hadoop filesystem. I have worked with Hadoop MapReduce before, but I have no experience using Hadoop as a static file repository for HTTP requests. So I don't know whether it's possible or whether it's a recommended approach.
MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.
Any suggestion please?
I am the author of Weed-FS. For your requirement, WeedFS is ideal. Hadoop cannot handle many small files: in addition to your reasons, each file needs to have an entry in the master, and if the number of files is large, the HDFS master node cannot scale.
Weed-FS is getting faster when compiled with the latest Golang releases.
Many new improvements have been made to Weed-FS recently. Now you can test and compare very easily with the built-in upload tool. It uploads all files recursively under a directory.
weed upload -dir=/some/directory
Now you can compare with "du -k /some/directory" to see the local disk usage, and "ls -l /your/weed/volume/directory" to see the Weed-FS disk usage.
And I suppose you would need replication with data-center and rack awareness, etc. They are in now!
Hadoop is optimized for large files, e.g. its default block size is 64 MB. A lot of small files are both wasteful and hard to manage on Hadoop.
You can take a look at other distributed file systems, e.g. GlusterFS.
Hadoop has a REST API for accessing files. See this entry in the documentation. I feel that Hadoop is not meant for storing a large number of small files.
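For what it's worth, the WebHDFS REST interface lets any HTTP client fetch a file directly. As a rough illustration (the hostname, port and path are placeholders; 50070 is the default namenode HTTP port on many 1.x/2.x setups, and dfs.webhdfs.enabled must be on):

    # -L follows the redirect from the namenode to the datanode that actually serves the bytes.
    curl -L "http://namenode.example.com:50070/webhdfs/v1/images/photo-0001.jpg?op=OPEN" -o photo-0001.jpg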
HDFS is not geared up for efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes. So even a 10 KB file costs a full file-plus-block entry in the namenode while using only a sliver of a 64 MB block; with very many small files, that metadata overhead (rather than raw disk space) is what becomes the bottleneck.
If the files are very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 files of 64 MB blocks with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
In "Hadoop Summit 2011", there was this talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages etc) over HDFS but they dont use the same infra for images and videos. They have their own system named Haystack for images. Its not open source but they shared the abstract design level details about it.
This brings me to weed-fs: an open source project for inspired by Haystacks' design. Its tailor made for storing files. I have not used it till now but seems worth a shot.
If you are able to batch the files and have no requirement to update a batch after adding to HDFS, then you could compile multiple small files into a single larger binary sequence file. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).
This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
Hope this helps!
G

How do I determine the degree in which a file is fragmented?

I would like to provide a way to recognize when a large file is fragmented to a certain extent, and alert the user when they should perform a defragmentation. In addition, I'd like to show them a visual display demonstrating how the file is actually broken into pieces across the disk.
I don't need to know how to calculate how fragmented it is, or how to make the visual display. What I need to know is two things: 1) how to identify the specific clusters on any disk which contain pieces of any particular given file, and 2) how to identify the total number of clusters on that disk. I would essentially need a list of all the clusters which contain pieces of this file, and where on the disk each of those clusters is located.
Most defragmentation utilities have a visual display showing how the files are spread across the disk. My display will show how one particular file is split up into different areas of a disk. I just need to know how I can retrieve the necessary data to tell me where the file's clusters/sectors are located on the disk, so I can further determine how fragmented it is.
You can use the DeviceIoControl function with the FSCTL_GET_RETRIEVAL_POINTERS control code.
The FSCTL_GET_RETRIEVAL_POINTERS operation retrieves a variably sized data structure that describes the allocation and location on disk of a specific file. The structure describes the mapping between virtual cluster numbers (VCN offsets within the file or stream space) and logical cluster numbers (LCN offsets within the volume space).
