I'm using Windows 8.1.
There appears to be an inconsistency. Windows states adding hardlinks to a file doesn't use much disk space, and this makes sense since you're only creating a pointer.
However, the file system doesn't reflect this. If I create a hardlink, it lists the disk space usage for that file as doubled.
If creating a hardlink only adds a pointer but the file system reports the usage as doubled, then the space the file actually occupies doesn't matter when calculating remaining disk space, since only what the file system reports would be taken into consideration.
So what gives? Which is it? Which is being considered when calculating remaining disk space? Appreciated!
According to Harry Johnston:
"Explorer isn't the file system. The fact that Explorer doesn't take hard links into account when calculating the total size of a group of files doesn't affect the amount of disk space available. (If you look at the properties of the drive rather than of a particular set of files, Explorer asks the file system for the actual amount of space used and available on the volume. Those figures are correct.)"
This was the answer I was looking for. Thanks!
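For anyone who wants to see this for themselves, here's a rough sketch (the paths are placeholders and error handling is minimal) that creates a hard link with CreateHardLink and compares the volume's free space, as reported by GetDiskFreeSpaceEx, before and after. The difference should be negligible even for a large file:

```cpp
// Sketch: create a hard link and compare the volume's free space before and
// after. The paths below are placeholders -- adjust them for your system.
#include <windows.h>
#include <iostream>

int main() {
    ULARGE_INTEGER freeBefore{}, freeAfter{}, total{}, totalFree{};

    // Free space as the file system reports it, before linking.
    GetDiskFreeSpaceExW(L"C:\\", &freeBefore, &total, &totalFree);

    // Add a second directory entry (hard link) pointing at the same file data.
    if (!CreateHardLinkW(L"C:\\temp\\link.bin", L"C:\\temp\\original.bin", nullptr)) {
        std::cerr << "CreateHardLink failed: " << GetLastError() << "\n";
        return 1;
    }

    GetDiskFreeSpaceExW(L"C:\\", &freeAfter, &total, &totalFree);

    // Only a little metadata was written, so the difference is tiny even if
    // original.bin is huge -- Explorer's per-file totals are the misleading part.
    std::cout << "Free bytes before: " << freeBefore.QuadPart << "\n"
              << "Free bytes after:  " << freeAfter.QuadPart << "\n";
    return 0;
}
```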
Related
I want to parallelize some disk I/O, but this will hurt performance badly if my files are on the same disk.
Note that there may be various filter drivers in between the drivers and the actual disk -- e.g. two files might be on different virtual disks (VHDs), but on the same physical disk.
How do I detect if two HANDLEs refer to files (or partitions, or whatever) on the same physical disk?
If this is not possible -- what is the next-best/closest alternative?
This is rather convoluted, but the first step is getting the volume name from the handle. You can do that on Vista or above using the GetFinalPathNameByHandle function. If you need to support earlier Windows versions, you can use this example.
Now that you have the file name, you can parse out the drive letter or volume name, and use the method explained in the accepted answer to this question.
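To tie the two steps together, here's a hedged sketch for Vista and later. The helper name is mine, and I'm assuming the linked answer boils down to asking the volume for its disk extents (IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS); error handling is minimal:

```cpp
// Sketch: map a file HANDLE to the physical disk number backing it.
// Assumes a simple volume (one extent); spanned/striped volumes return more.
#include <windows.h>
#include <winioctl.h>
#include <string>
#include <vector>

// Returns the physical disk number for the volume containing hFile,
// or -1 on failure. (Vista+ because of GetFinalPathNameByHandle.)
int PhysicalDiskNumberFromHandle(HANDLE hFile) {
    wchar_t path[MAX_PATH];
    if (GetFinalPathNameByHandleW(hFile, path, MAX_PATH,
                                  FILE_NAME_NORMALIZED | VOLUME_NAME_DOS) == 0)
        return -1;

    // path looks like \\?\C:\dir\file.txt -- pull out the drive letter.
    std::wstring volume = L"\\\\.\\";
    volume += std::wstring(path).substr(4, 2);   // "C:"

    HANDLE hVolume = CreateFileW(volume.c_str(), 0,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE,
                                 nullptr, OPEN_EXISTING, 0, nullptr);
    if (hVolume == INVALID_HANDLE_VALUE)
        return -1;

    // Room for a few extents in case the volume spans disks.
    std::vector<char> buf(sizeof(VOLUME_DISK_EXTENTS) + 8 * sizeof(DISK_EXTENT));
    DWORD bytes = 0;
    BOOL ok = DeviceIoControl(hVolume, IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS,
                              nullptr, 0, buf.data(), (DWORD)buf.size(),
                              &bytes, nullptr);
    CloseHandle(hVolume);
    if (!ok)
        return -1;

    auto* extents = reinterpret_cast<VOLUME_DISK_EXTENTS*>(buf.data());
    return (int)extents->Extents[0].DiskNumber;   // first (usually only) extent
}
```

Two handles that resolve to the same disk number are on the same physical disk, at least as far as the storage stack reports it; as noted above, VHDs and other filter drivers can still hide the real physical layout.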
One can, of course, use fopen or any of a large number of other APIs available on the Mac to read a file, but what I need to do is open and read every file on the disk, and do so as efficiently as possible.
So, my thought was to use /dev/rdisk* (?) or /dev/(?) to start with the files at the beginning of the device. I would do my best to read the files in order as they appear on the disk, minimize the amount of seeking across the device since files may be fragmented, and read in large blocks of data into RAM where it can be processed very quickly.
So, the primary question I have is when reading the data from my device directly, how can I determine exactly what data belongs with what files?
I assume I could start by reading a catalog of the files, and that there would be a way to determine the start and stop locations of files or file fragments on the disk, but I am not sure where to find information on how to obtain it...?
I am running Mac OS X 10.6.x and one can assume a standard setup for the drive. I might assume the same information would apply to a standard, read-only, uncompressed .dmg created by Disk Utility as well.
Any information on this topic or articles to read would be of interest.
I don't imagine what I want to do is particularly difficult once the format and layout of the files on disk is understood.
thank you
As mentioned in the comments, you need to look at the file system format. However, by reading the raw disk sequentially, (1) you are not guaranteed that subsequent blocks belong to the same file, so you may have to seek anyway, losing the advantage you had from reading directly from /dev/device, and (2) if your disk is only 50% full, you may still end up reading 100% of the disk, since you will be reading the unallocated space as well as the space allocated to files; so reading directly from /dev/device may be inefficient as well.
That said, fsck and similar tools do perform this kind of operation, but they do it selectively, based on the possible errors they are looking for when repairing file systems.
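If the goal is just to find out where a given file's data physically sits, without parsing the HFS+ structures yourself, one option worth mentioning (as a sketch, not a full solution) is the F_LOG2PHYS fcntl on Mac OS X, which maps the file's current offset to a device offset; the path below is a placeholder:

```cpp
// Sketch: ask the kernel where a file's data lives on the device, using the
// F_LOG2PHYS fcntl available on Mac OS X. The path is a placeholder.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/path/to/some/file", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // Map logical offset 0 of the file to a physical device offset.
    lseek(fd, 0, SEEK_SET);
    struct log2phys l2p;
    if (fcntl(fd, F_LOG2PHYS, &l2p) < 0) { perror("fcntl"); close(fd); return 1; }

    // l2p_devoffset is the byte offset of this part of the file on the device;
    // repeating the call at other file offsets reveals fragmentation.
    printf("device offset: %lld\n", (long long)l2p.l2p_devoffset);
    close(fd);
    return 0;
}
```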
I wonder what kind of reliability guarantees NTFS provides about the data stored on it? For example, suppose I'm opening a file, appending to the end, then closing it, and the power goes out at a random time during this operation. Could I find the file completely corrupted?
I'm asking because I just had a system lock-up and found two of the files that were being appended to completely zeroed out. That is, they were the right size, but consisted entirely of zero bytes. I thought this isn't supposed to happen on NTFS, even when things fail.
NTFS is a transactional file system, so it guarantees integrity - but only for the metadata (MFT), not the (file) content.
The short answer is that NTFS does metadata journaling, which assures valid metadata.
Other modifications (to the body of a file) are not journaled, so they're not guaranteed.
There are file systems that do journaling of all writes (e.g., AIX has one, if memory serves), but with them, you tend to get a tradeoff between disk utilization and write speed. IOW, you need a lot of "free" space to get decent performance -- they basically just do all writes to free space, and link that new data into the right spots in the file. Then they go through and clean out the garbage (i.e., free up parts that have since been overwritten, and usually coalesce the pieces of a file together as well). This can get slow if they have to do it very often though.
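One practical consequence of the above: if you need the file contents themselves to survive a power cut, you have to ask for that explicitly. Here's a hedged sketch of an append that forces the data (not just the journaled metadata) to disk; the helper name is mine:

```cpp
// Sketch: append to a file and force the data (not just the metadata) to disk
// before considering the write durable.
#include <windows.h>

bool DurableAppend(const wchar_t* path, const void* data, DWORD size) {
    HANDLE h = CreateFileW(path, FILE_APPEND_DATA, FILE_SHARE_READ, nullptr,
                           OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    DWORD written = 0;
    bool ok = WriteFile(h, data, size, &written, nullptr) && written == size;

    // Without this (or FILE_FLAG_WRITE_THROUGH at open time), the new bytes may
    // still be sitting in the cache when the power goes out, even though the
    // journaled metadata already says the file is longer -- which is how you
    // can end up with a correctly sized file full of zeroes.
    if (ok)
        ok = FlushFileBuffers(h) != 0;

    CloseHandle(h);
    return ok;
}
```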
I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all its data can be flushed to disk at any time - user switches task, screen saver kicks in, battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2 GB minus the memory used by the application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box; it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain it's not possible to be really helpful.
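As a rough illustration of the "asynchronous access" idea on Win32 (file name and buffer size are placeholders): issue the read with overlapped I/O, keep working, and collect the result later.

```cpp
// Sketch: read a block without stalling the application, using Win32
// overlapped I/O. File name and buffer size are placeholders.
#include <windows.h>
#include <iostream>

int main() {
    HANDLE h = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    char buffer[64 * 1024];
    OVERLAPPED ov{};
    ov.Offset = 0;                                        // read from the start
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);

    // Kick off the read; ERROR_IO_PENDING just means it is in flight.
    if (!ReadFile(h, buffer, sizeof(buffer), nullptr, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        CloseHandle(ov.hEvent);
        CloseHandle(h);
        return 1;
    }

    // ... do useful work here while the disk is busy ...

    DWORD bytesRead = 0;
    GetOverlappedResult(h, &ov, &bytesRead, TRUE);   // TRUE = wait for completion
    std::cout << "read " << bytesRead << " bytes\n";

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}
```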
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency (see the sketch after this list).
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
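Here's a minimal sketch of the first option -- fixed-size records kept sorted on disk by a key field and located with a binary search over the file. The record layout and file name are assumptions:

```cpp
// Sketch of option 1: fixed-size records sorted on disk by `key`, located with
// a binary search over the file. The Record layout and file name are made up.
#include <cstdint>
#include <cstdio>

struct Record {
    uint64_t key;
    char     payload[1016];   // pad the record out to a 1 KB block
};

// Returns true and fills `out` if a record with `key` exists in `path`.
bool FindRecord(const char* path, uint64_t key, Record* out) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;

    std::fseek(f, 0, SEEK_END);
    long count = std::ftell(f) / (long)sizeof(Record);

    long lo = 0, hi = count - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        std::fseek(f, mid * (long)sizeof(Record), SEEK_SET);
        Record r;
        if (std::fread(&r, sizeof(Record), 1, f) != 1) break;

        if (r.key == key) { *out = r; std::fclose(f); return true; }
        if (r.key < key)  lo = mid + 1;
        else              hi = mid - 1;
    }
    std::fclose(f);
    return false;
}
```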
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
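A minimal sketch of that pattern using POSIX mmap (on Windows the equivalents are CreateFileMapping/MapViewOfFile); the file name and block size are placeholders:

```cpp
// Sketch: memory-map a file of 1 KB blocks and let the OS page cache handle
// caching. POSIX shown here; on Windows use CreateFileMapping/MapViewOfFile.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const size_t kBlockSize = 1024;               // assumed block size
    int fd = open("objects.dat", O_RDWR);         // placeholder file name
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Map the whole file; reads hit the page cache, writes are flushed lazily.
    char* base = (char*)mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // Accessing block n is now just pointer arithmetic -- no explicit seek/read.
    char* block42 = base + 42 * kBlockSize;
    block42[0] = 'x';

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```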
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
The most simple way to solve this is to use an OS which solves that for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects in RAM and it will try to keep as many of them in the cache as possible reducing the load time to 0. The recent server versions of Windows might work, too (some of them didn't for me, that's why I'm mentioning this).
If this is a no go, try this algorithm (a rough sketch of the reorder step follows these steps):
Create a very big file on the hard disk. It is very important that you write this in one go so the OS will allocate a contiguous region on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of each chunk). Use an empty hard disk or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When it is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and which has a lesser access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them and do the reorder over night when there is little load on your machine. Or you can do the reorder on a completely different machine and swap the file (and the offset table) when that's done.
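Here's a rough sketch of the bookkeeping and swap step described above; the chunk size, struct, and function names are made up, and in practice you would run this as the offline reorder pass:

```cpp
// Sketch of the reorder step: equal-size chunks in one big file, an in-memory
// table of access counts, and a swap that bubbles hot chunks toward the front.
// Chunk size and names are assumptions.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

const size_t kChunkSize = 1024;

struct ChunkInfo {
    long     fileIndex;     // which slot of the big file the chunk lives in
    uint64_t accessCount;   // how often it has been accessed
};

// Swap the on-disk slots of a hot chunk and a colder chunk nearer the start,
// and update the offset table so lookups stay consistent.
void SwapChunks(FILE* f, ChunkInfo& hot, ChunkInfo& cold) {
    std::vector<char> a(kChunkSize), b(kChunkSize);

    std::fseek(f, hot.fileIndex * (long)kChunkSize, SEEK_SET);
    std::fread(a.data(), 1, kChunkSize, f);
    std::fseek(f, cold.fileIndex * (long)kChunkSize, SEEK_SET);
    std::fread(b.data(), 1, kChunkSize, f);

    std::fseek(f, hot.fileIndex * (long)kChunkSize, SEEK_SET);
    std::fwrite(b.data(), 1, kChunkSize, f);
    std::fseek(f, cold.fileIndex * (long)kChunkSize, SEEK_SET);
    std::fwrite(a.data(), 1, kChunkSize, f);

    std::swap(hot.fileIndex, cold.fileIndex);
}
```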
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: place the most-accessed items at the center of your disk (or unfragmented file), not at the start or end. That way, seeks to lesser-used data will be shorter on average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.
The Windows XP Disk Defragmenter report shows a constant gap in disk usage on a number of disk partitions on my system. I'm not referring to the little transitory gaps that occur. In disk D below, the gap in question is the one under the word "defragmentation". In disk P below, the gap is the one under "usage before def", but a bigger one. The C partition doesn't have this anomaly. The size and placement pattern isn't obvious. It is as though there were an area, a no-man's land, that both the file system and the defragmenter avoid. These gaps survive daily use and defragmentation. I don't believe this is a residue from a paging file -- it should show up in green, anyway. Recycle bin is empty.
Any ideas?
Disk D (20 Gig):
Disk P (40 Gig):
That is probably the space reserved for the MFT, which will only be used for files if the disk gets really full. This empty space allows it to grow for a while without getting fragmented.
References:
How NTFS reserves space for its Master File Table (MFT)
No idea what's causing this, but the defragger that comes with Win XP is Diskeeper Lite, which is not very good. A better defragger might get rid of the gap if nothing in particular is causing it. I personally use O&O Defrag; it's not free, but there's a 30-day trial.
Defragging to the point that there are absolutely no gaps is not necessarily a good thing. Some OSs/FileSystems try to pack files in as tightly as possible and fill without gaps where possible.
The problem with this is that if any of the earlier files get changed or appended to, then you either leave an early gap (which will tend to cause fragments) or force the extra bit to be written at the next gap (creating a fragment again).
Defrag when you start getting weird behaviour (quite often it helps... even though it is not supposed to); however, you don't need to do it every day, nor is a totally defragmented drive a sign of a particularly healthy drive.
Like the poster above said, that's most likely the reserved zone for the MFT. When the drive is formatted, about 12.5% of the partition is reserved for the MFT, and this can grow as needed to accommodate new records if the initial allocation is used up. Mind you, the MFT can also fragment if the adjacent contiguous free space is not large enough to accommodate the expansion.
Regarding defragging: instead of defragging manually on a regular basis, save yourself the trouble and get Diskeeper. The newest version, i.e. 2008 Professional, is fully automatic and defrags in the background using idle resources. There is also a manual/scheduled defrag mode, but I don't see any reason to waste my time; it does a fine job running on automatic on my systems.