I would like to provide a way to recognize when a large file is fragmented to a certain extent, and alert the user when they should perform a defragmentation. In addition, I'd like to show them a visual display demonstrating how the file is actually broken into pieces across the disk.
I don't need to know how to calculate how fragmented it is, or how to make the visual display. What I need to know is two things: 1) how to identify the specific clusters on any disk which contain pieces of any particular given file, and 2) how to identify the total number of clusters on that disk. I would essentially need a list of all the clusters which contain pieces of this file, and where on the disk each of those clusters is located.
Most defragmentation utilities have a visual display showing how the files are spread across the disk. My display will show how one particular file is split up into different areas of a disk. I just need to know how I can retrieve the necessary data to tell me where the file's clusters/sectors are located on the disk, so I can further determine how fragmented it is.
You can use the DeviceIoControl function with the FSCTL_GET_RETRIEVAL_POINTERS control code.
The FSCTL_GET_RETRIEVAL_POINTERS operation retrieves a variably sized data structure that describes the allocation and location on disk of a specific file. The structure describes the mapping between virtual cluster numbers (VCN offsets within the file or stream space) and logical cluster numbers (LCN offsets within the volume space).
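For illustration, here is a minimal sketch of walking those VCN-to-LCN extents. This version is Scala on top of JNA (an assumption on my part; the same call pattern applies from C/C++ or C#). The control code 0x00090073 is FSCTL_GET_RETRIEVAL_POINTERS from winioctl.h, and the buffer sizing and error handling are deliberately simplified.

import com.sun.jna.Memory
import com.sun.jna.platform.win32.{Kernel32, WinBase, WinNT}
import com.sun.jna.ptr.IntByReference

object FileExtents {
  // FSCTL_GET_RETRIEVAL_POINTERS control code, as defined in winioctl.h
  val FSCTL_GET_RETRIEVAL_POINTERS = 0x00090073

  def main(args: Array[String]): Unit = {
    val k32 = Kernel32.INSTANCE
    // Open the file whose extents we want to enumerate
    val handle = k32.CreateFile(args(0), WinNT.GENERIC_READ,
      WinNT.FILE_SHARE_READ | WinNT.FILE_SHARE_WRITE, null,
      WinNT.OPEN_EXISTING, 0, null)
    require(!WinBase.INVALID_HANDLE_VALUE.equals(handle),
      s"CreateFile failed: ${k32.GetLastError()}")
    try {
      // Input: STARTING_VCN_INPUT_BUFFER { LARGE_INTEGER StartingVcn }
      val in = new Memory(8)
      in.setLong(0, 0L) // start at VCN 0, the beginning of the file
      // Output: RETRIEVAL_POINTERS_BUFFER { DWORD ExtentCount; LARGE_INTEGER StartingVcn;
      //   { LARGE_INTEGER NextVcn; LARGE_INTEGER Lcn } Extents[] }
      val out = new Memory(64 * 1024)
      val bytesReturned = new IntByReference()
      val ok = k32.DeviceIoControl(handle, FSCTL_GET_RETRIEVAL_POINTERS,
        in, in.size().toInt, out, out.size().toInt, bytesReturned, null)
      // ERROR_MORE_DATA (234) means the buffer holds only part of the extent list;
      // a full implementation would loop, restarting at the last NextVcn returned.
      require(ok || k32.GetLastError() == 234,
        s"DeviceIoControl failed: ${k32.GetLastError()}")

      val extentCount = out.getInt(0)
      var vcn = out.getLong(8) // StartingVcn (LARGE_INTEGER is 8-byte aligned)
      var offset = 16L         // first entry of Extents[]
      for (_ <- 0 until extentCount) {
        val nextVcn = out.getLong(offset)
        val lcn = out.getLong(offset + 8) // an LCN of -1 marks a hole in sparse/compressed files
        println(s"VCN $vcn..${nextVcn - 1} -> LCN $lcn (${nextVcn - vcn} clusters)")
        vcn = nextVcn
        offset += 16
      }
    } finally k32.CloseHandle(handle)
  }
}

The number of extents relative to the file's size is what you would feed into your fragmentation measure, and the LCN ranges are what you would plot on the disk map.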
I was casually wondering if there was a difference in read/write performance for files that are copied to the same folder as opposed to those moved (via mv).
I imagine that performing some serial operation on several files located in a contiguous block of storage would be faster than on files scattered across a hard drive. That would be the case (I guess?) if you copy files vs. move them from disparate origins. So... is there a performance difference between files moved vs. copied to the same directory, how significant is it, and does it depend on the storage technology (HDD, SSD)?
Note, I am not wondering whether mv vs cp is faster. Please don't respond with a description of the difference between the commands. Thanks!
The way that move and copy work has some (limited) bearing on this, assuming source and destination are located on the same physical volume.
However, assuming source and destination are not the same volume, both behave the same in terms of writing the destination data. If the destination volume is completely empty and freshly formatted then you 'probably' stand a good chance of the data being written to a similar location. If data is or has been written to the volume, then there is no guarantee the file system won't simply scatter the data anyway.
The file system ultimately decides where the data is stored on the actual storage medium, and it may decide that neighbouring blocks are not the best solution. Copy or move is irrelevant, as both require the file system to store the data.
Grouping those files by mount point is possibly the best way of ensuring they reside within a similar region of storage.
HTH
Parameters of some machines are measured and uploaded via a web service to HDFS. The values are saved in one file per measurement, and a measurement contains about 1,000 values on average.
The problem is that there is a large number of files. Only a certain subset of the files is used for any given MapReduce job (for example, the measurements from the last month). Because of this I'm not able to merge them all into one large sequence file, since different files are needed at different times.
I understand that it is bad to have a large number of small files, since the NameNode holds the paths to all of them on HDFS (and keeps them in its memory) and, on the other hand, each small file results in its own mapper being created.
How can I avoid this problem?
A late answer: you can use SeaweedFS https://github.com/chrislusf/seaweedfs (I am working on this). It has special optimizations for large numbers of small files.
Hadoop actually has good support for delegating file storage to other file systems: just add the SeaweedFS hadoop jar. See https://github.com/chrislusf/seaweedfs/wiki/Hadoop-Compatible-File-System
You could concatenate the required files into one temporary file that is deleted once it has been analyzed. I think you can do this very easily in a script.
Anyway, do the math: such a big file will also be split into several pieces whose size is the block size (the dfs.blocksize parameter in hdfs-default.xml), and each of these pieces will be assigned to a mapper. So, depending on the block size and the average "small file" size, maybe the gain is not so great.
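For what it's worth, a minimal sketch of that concatenation idea with the Hadoop FileSystem API (the glob pattern, the month and the temporary path are hypothetical; a shell script around hadoop fs -cat / -put would do the same job):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Merge last month's measurement files into one temporary file,
// run the job over it, then delete it.
val conf = new Configuration()
val fs = FileSystem.get(conf)
val selected = fs.globStatus(new Path("/measurements/2014-06-*")).map(_.getPath)

val merged = new Path("/tmp/measurements-2014-06.merged")
val out = fs.create(merged)
try {
  for (p <- selected) {
    val in = fs.open(p)
    try IOUtils.copyBytes(in, out, conf, false) // false = leave the output stream open
    finally in.close()
  }
} finally out.close()
// ...submit the job with merged as the input, then fs.delete(merged, false)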
Currently, I'm working on image classification with Spark. I need to read all the images into memory as an RDD, and my approach is as follows:
val images = spark.wholeTextFiles("hdfs://imag-dir/")
imag-dir is the directory on HDFS where the images are stored. With this method, all the images are loaded into memory, and every image is organized as an "image name, image content" pair. However, I find this process time-consuming. Is there any better way to load a large image data set into Spark?
I suspect that may be because you have a lot of small files on HDFS, which is a problem in itself (the 'small files problem'). Here you'll find a few suggestions for addressing the issue.
You may also want to set the number of partitions (the minPartitions argument of wholeTextFiles) to a reasonable number: at least 2x the number of cores in your cluster (look there for details).
But in sum, apart from the 2 ideas above, the way you're loading those is ok and not where your problem lies (assuming spark is your Spark context).
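For example, a small sketch of the minPartitions tweak, assuming a hypothetical 40-core cluster and that spark in the question is a SparkContext (called sc below):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("image-loading"))
// At least 2x the number of cores in the cluster (40 cores assumed here)
val minPartitions = 2 * 40
val images = sc.wholeTextFiles("hdfs://imag-dir/", minPartitions)

// If the images are binary formats (JPEG, PNG, ...), sc.binaryFiles avoids
// decoding them as text and may be a better fit:
val rawImages = sc.binaryFiles("hdfs://imag-dir/", minPartitions)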
My job is to design a distributed system for static image/video files. The size of the data is about tens of terabytes. It's mostly for HTTP access (thus no processing of the data, or only simple processing such as resizing; however, that's not important because it can be done directly in the application).
To be a little more clear, it's a system that:
Must be distributed (horizontal scale), because the total size of data is very big.
Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
Generally, no requirement to process the data (thus MapReduce is not needed).
HTTP access to the data should be easy to set up.
(Should have) good throughput.
I am considering:
A native network file system: but it seems not feasible because the data cannot fit on one machine.
The Hadoop filesystem. I worked with Hadoop MapReduce before, but I have no experience using Hadoop as a static file repository for HTTP requests, so I don't know if it's possible or whether it's a recommended approach.
MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.
Any suggestion please?
I am the author of Weed-FS. For your requirements, Weed-FS is ideal. Hadoop cannot handle many small files: in addition to the reasons you give, each file needs an entry in the master, and if the number of files is big, the HDFS master node cannot scale.
Weed-FS gets faster when compiled with the latest Golang releases.
Many new improvements have been made to Weed-FS recently. Now you can test and compare very easily with the built-in upload tool, which uploads all files under a directory recursively.
weed upload -dir=/some/directory
Now you can compare: "du -k /some/directory" shows the original disk usage, and "ls -l /your/weed/volume/directory" shows the Weed-FS disk usage.
And I suppose you would need replication that is data-center and rack aware, etc. They are in now!
Hadoop is optimized for large files, e.g. its default block size is 64 MB. A lot of small files are both wasteful and hard to manage on Hadoop.
You can take a look at other distributed file systems, e.g. GlusterFS.
Hadoop has a REST API for accessing files. See this entry in the documentation. I feel that Hadoop is not meant for storing a large number of small files.
HDFS is not geared toward efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes. The default block size is 64 MB, but a 10 KB file does not actually fill a 64 MB block on disk (blocks are not padded); the waste is that each such file still costs a full set of metadata objects in the namenode's memory.
If the files are very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file stored as 16 blocks of 64 MB with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent job with a single input file.
In "Hadoop Summit 2011", there was this talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages etc) over HDFS but they dont use the same infra for images and videos. They have their own system named Haystack for images. Its not open source but they shared the abstract design level details about it.
This brings me to weed-fs: an open source project for inspired by Haystacks' design. Its tailor made for storing files. I have not used it till now but seems worth a shot.
If you are able to batch the files and have no requirement to update a batch after adding to HDFS, then you could compile multiple small files into a single larger binary sequence file. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).
This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
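As a rough sketch of that packing step, using the Hadoop 2 SequenceFile.createWriter option API (the local directory and the HDFS path here are hypothetical):

import java.io.File
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

// Pack every image in a local directory into one SequenceFile on HDFS,
// keyed by file name, with the raw bytes as the value.
val conf = new Configuration()
val writer = SequenceFile.createWriter(conf,
  SequenceFile.Writer.file(new Path("hdfs:///data/images.seq")),
  SequenceFile.Writer.keyClass(classOf[Text]),
  SequenceFile.Writer.valueClass(classOf[BytesWritable]))
try {
  for (f <- new File("/local/images").listFiles() if f.isFile) {
    val bytes = Files.readAllBytes(f.toPath)
    writer.append(new Text(f.getName), new BytesWritable(bytes))
  }
} finally writer.close()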
Hope this helps!
G
I know there have been similar posts on here but I can't find one that really has a solid answer.
We have a Hadoop cluster loaded with binary files. These files can range anywhere in size from a few hundred KB to hundreds of MB.
We are currently processing these files using a custom record reader that reads the entire contents of the file into each map. From there we extract the appropriate metadata we want and serialize it into JSON.
The problem we are foreseeing is that we might eventually reach a size that our namenode can't handle. There is only so much memory to go around and having a namenode with a couple terabytes of memory seems ridiculous.
Is there a graceful way to process large binary files like this, especially ones that can't be split, since we don't know what order the reducer would put them back together in?
So, not an answer as such, but I have so many questions that a list of comments would be more difficult to convey, so here goes:
You say you read the entire contents into memory for each map; are you able to elaborate on the actual binary input format of these files:
Do they contain logical records i.e. does a single input file represent a single record, or does it contain many records?
Are the files compressed (after-the-fact or some internal compression mechanism)?
How are you currently processing this file-at-once; what's your overall ETL logic to convert it to JSON?
Do you actually need the entire file read into memory before processing can begin, or can you process it once you have a buffer of some size populated (DOM vs. SAX XML parsing, for example)?
My guess is that you can migrate some of your mapper logic to the record reader, and possibly even find a way to 'split' the file between multiple mappers. This would then allow you to address your scalability concerns.
To address some points in your question:
The NameNode only requires memory to store metadata about the files and blocks (names, and per block: size, length, locations). Assuming you assign it a decent memory footprint (GBs), there is no reason you can't have a cluster that holds petabytes of data in HDFS storage (assuming you have enough physical storage). As a rough rule of thumb, each file, directory and block object costs on the order of 150 bytes of heap, so it is the number of files and blocks, not the total bytes stored, that determines how much NameNode memory you need.
The NameNode doesn't store the data itself or do any processing; you should concentrate on your DataNodes and TaskTrackers instead. Also, I am not sure whether you are trying to address the storage or the processing of your files here. If you are dealing with lots of binary files, it is worth having a look at Hadoop SequenceFile. A SequenceFile is a flat file consisting of binary key/value pairs, hence extensively used in MapReduce as an input/output format. For a detailed explanation you can visit this page -
http://wiki.apache.org/hadoop/SequenceFile
When you have large binary files, use the SequenceFile format as the input format and set the mapred input split size accordingly. You can set the number of mappers based on the total input size and the split size you set; Hadoop will take care of splitting the input data.
If your binary files are compressed with a non-splittable format, Hadoop cannot do this split, which is another reason the binary format should be SequenceFile (it supports splittable compression).
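For example, a minimal job-setup sketch with the new org.apache.hadoop.mapreduce API (the input path, job name and 128 MB split cap below are hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, SequenceFileInputFormat}

// Read the packed SequenceFile and cap each split at 128 MB, so the number
// of mappers is roughly (total input size) / (128 MB).
val job = Job.getInstance(new Configuration(), "binary-to-json")
job.setInputFormatClass(classOf[SequenceFileInputFormat[Text, BytesWritable]])
FileInputFormat.addInputPath(job, new Path("hdfs:///data/images.seq"))
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024)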