Windows XP's Disk Defragmenter report shows a constant gap in disk usage on a number of disk partitions on my system. I'm not referring to the little transitory gaps that come and go. In disk D below, the gap in question is the one under the word "defragmentation". In disk P below, it is the one under "usage before defragmentation", only bigger. The C partition doesn't have this anomaly. The size and placement pattern isn't obvious. It is as though there were an area, a no-man's land, that both the file system and the defragmenter avoid. These gaps survive daily use and defragmentation. I don't believe this is residue from a paging file -- that would show up in green anyway. The Recycle Bin is empty.
Any ideas?
Disk D (20 Gig):
Disk P (40 Gig):
That is probably the space reserved for the MFT, which will only be used for ordinary files if the disk gets really full. This empty space allows the MFT to grow for a while without becoming fragmented.
References:
How NTFS reserves space for its Master File Table (MFT)
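If you're curious, you can actually see where that reserved zone sits: NTFS reports the MFT zone boundaries through FSCTL_GET_NTFS_VOLUME_DATA. Here's a rough sketch (it needs to run elevated, and the D: volume letter is just an example):

    // Sketch: print the NTFS MFT zone of volume D: (volume letter is an example; run elevated).
    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>

    int main() {
        HANDLE hVol = CreateFileW(L"\\\\.\\D:", GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE,
                                  nullptr, OPEN_EXISTING, 0, nullptr);
        if (hVol == INVALID_HANDLE_VALUE) {
            printf("open failed: %lu\n", GetLastError());
            return 1;
        }

        NTFS_VOLUME_DATA_BUFFER data = {};
        DWORD bytes = 0;
        if (DeviceIoControl(hVol, FSCTL_GET_NTFS_VOLUME_DATA, nullptr, 0,
                            &data, sizeof(data), &bytes, nullptr)) {
            // The zone boundaries are reported in clusters.
            printf("Cluster size     : %lu bytes\n", data.BytesPerCluster);
            printf("MFT zone clusters: %lld - %lld\n",
                   data.MftZoneStart.QuadPart, data.MftZoneEnd.QuadPart);
            printf("MFT valid length : %lld bytes\n", data.MftValidDataLength.QuadPart);
        } else {
            printf("FSCTL_GET_NTFS_VOLUME_DATA failed: %lu\n", GetLastError());
        }
        CloseHandle(hVol);
        return 0;
    }

If the zone it prints lines up with the empty band in the defragmenter's map, that confirms the MFT reservation explanation.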
No idea what's causing this, but the defragmenter that comes with Windows XP is a stripped-down Diskeeper Lite, which is not very good. A better defragmenter might get rid of the gap if nothing is deliberately reserving that space. I personally use O&O Defrag; it's not free, but there's a 30-day trial.
Defragging to the point that there are absolutely no gaps is not necessarily a good thing. Some OSes/file systems try to pack files in as tightly as possible, filling every gap they can.
The problem with this is that if any of the earlier files get changed or appended to, you either leave an early gap (which will tend to cause fragments) or force the extra data to be written at the next gap (again creating a fragment).
Defrag when you start getting weird behaviour (quite often it helps, even though it is not supposed to); however, you don't need to do it every day, nor is a totally defragmented drive a sign of a particularly healthy drive.
Like the poster above said, that's most likely the reserved zone for the MFT. When the drive is formatted, about 12.5% of the partition is reserved for the MFT, and this can grow as needed to accommodate new records if the initial allocation is used up. Mind you, the MFT can also fragment if the adjacent contiguous free space is not large enough to accommodate the expansion.
Regarding defragging: instead of defragging manually on a regular basis, save yourself the trouble and get Diskeeper. The newest version (2008 Professional) is fully automatic and defrags in the background using idle resources. There is also a manual/scheduled defrag mode, but I don't see any reason to waste my time; it does a fine job running on automatic on my systems.
I'm using Windows 8.1.
There appears to be an inconsistency. Windows states that adding hard links to a file doesn't use much disk space, which makes sense since you're only creating a pointer.
However, the file system doesn't seem to reflect this. If I create a hard link, the disk space usage listed for that file is doubled.
If doing this only adds a pointer but the file system thinks the usage has doubled, then it doesn't matter how much space the file is actually using when calculating the remaining disk space -- only what the file system thinks would count.
So what gives? Which is it? Which is being considered when calculating remaining disk space? Appreciated!
According to Harry Johnston:
"Explorer isn't the file system. The fact that Explorer doesn't take hard links into account when calculating the total size of a group of files doesn't affect the amount of disk space available. (If you look at the properties of the drive rather than of a particular set of files, Explorer asks the file system for the actual amount of space used and available on the volume. Those figures are correct.)"
This was the answer I was looking for. Thanks!
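For anyone who wants to verify this themselves, here is a small sketch (the C:\temp paths are placeholders): create a hard link with CreateHardLink, then compare the volume's free space before and after with GetDiskFreeSpaceEx. The free-byte count barely moves, even though Explorer will count the file's size once per name.

    // Sketch: a hard link should not consume extra disk space.
    // The C:\temp paths are placeholders - point them at a large file on an NTFS volume.
    #include <windows.h>
    #include <cstdio>

    int main() {
        ULARGE_INTEGER avail = {}, total = {}, freeBefore = {}, freeAfter = {};

        GetDiskFreeSpaceExW(L"C:\\", &avail, &total, &freeBefore);

        // Second name for the same data; nothing is copied, only an extra MFT entry is created.
        if (!CreateHardLinkW(L"C:\\temp\\bigfile-link.bin",   // new link name (placeholder)
                             L"C:\\temp\\bigfile.bin",        // existing file (placeholder)
                             nullptr)) {
            printf("CreateHardLink failed: %lu\n", GetLastError());
            return 1;
        }

        GetDiskFreeSpaceExW(L"C:\\", &avail, &total, &freeAfter);

        printf("Free bytes before: %llu\n", freeBefore.QuadPart);
        printf("Free bytes after : %llu\n", freeAfter.QuadPart);
        return 0;
    }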
I wonder what kind of reliability guarantees NTFS provides for the data stored on it. For example, suppose I'm opening a file, appending to the end, then closing it, and the power goes out at a random time during this operation. Could I find the file completely corrupted?
I'm asking because I just had a system lock-up and found two of the files that were being appended to completely zeroed out. That is, they were the right size, but consisted entirely of zero bytes. I thought this wasn't supposed to happen on NTFS, even when things fail.
NTFS is a transactional file system, so it guarantees integrity - but only for the metadata (MFT), not the (file) content.
The short answer is that NTFS does metadata journaling, which assures valid metadata.
Other modifications (to the body of a file) are not journaled, so they're not guaranteed.
There are file systems that do journaling of all writes (e.g., AIX has one, if memory serves), but with them, you tend to get a tradeoff between disk utilization and write speed. IOW, you need a lot of "free" space to get decent performance -- they basically just do all writes to free space, and link that new data into the right spots in the file. Then they go through and clean out the garbage (i.e., free up parts that have since been overwritten, and usually coalesce the pieces of a file together as well). This can get slow if they have to do it very often though.
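If your application needs stronger guarantees for the file contents than NTFS gives by default, the usual approach is to ask for write-through and/or flush explicitly. A hedged sketch of what such an append might look like (the path is a placeholder, error handling trimmed):

    // Sketch: append with write-through so the new data isn't sitting only in the
    // cache when the power goes out. This narrows the window, it doesn't remove it.
    #include <windows.h>
    #include <cstring>

    int main() {
        HANDLE h = CreateFileW(L"C:\\temp\\log.txt",        // placeholder path
                               FILE_APPEND_DATA, FILE_SHARE_READ, nullptr,
                               OPEN_ALWAYS,
                               FILE_FLAG_WRITE_THROUGH,      // don't rely on lazy writeback
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        const char line[] = "another record\r\n";
        DWORD written = 0;
        WriteFile(h, line, (DWORD)strlen(line), &written, nullptr);

        FlushFileBuffers(h);   // force both data and metadata out to the disk
        CloseHandle(h);
        return 0;
    }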
The purpose of the VirtualLock WinAPI call is to lock pages into the working set of a process. However, the WorkingSet64 API inexplicably doesn't count those pages.
Possibly as a result of this, neither Process Explorer nor the standard Task Manager count locked pages in their per-process memory usage statistics.
What's up with this? Could someone intimately familiar with virtual memory in WinNT shed some light on this inconsistency, which can cause gigabytes of used RAM to go essentially undetected? (think of SQL Server or VirtualBox)
Ah, that is easily explained: you're using the wrong API. GetProcessWorkingSetSize queries the minimum and maximum working set sizes. Those are quotas, not actual values.
The minimum working set size is what Windows will guarantee to keep locked in RAM as long as the world does not end. The maximum working set size is the amount of memory that Windows will allow your process before pages are moved into the pool (they are not necessarily gone, but accessing them causes a fault and re-mapping).
You want GetProcessMemoryInfo.
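For reference, a minimal sketch that prints the actual working set next to the quotas, so you can see they are different numbers:

    // Sketch: actual working set vs. the working set quotas, for the current process.
    // Link against psapi.lib.
    #include <windows.h>
    #include <psapi.h>
    #include <cstdio>

    int main() {
        PROCESS_MEMORY_COUNTERS pmc = {};
        pmc.cb = sizeof(pmc);
        GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));

        SIZE_T minWs = 0, maxWs = 0;
        GetProcessWorkingSetSize(GetCurrentProcess(), &minWs, &maxWs);

        printf("Actual working set : %zu KB\n", pmc.WorkingSetSize / 1024);
        printf("Working set quotas : min %zu KB, max %zu KB\n", minWs / 1024, maxWs / 1024);
        return 0;
    }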
EDIT:
Since it is now clear that you were not using the wrong API (you only named the wrong function), I've done some testing (VirtualAlloc and memory-mapped files, both in combination with VirtualLock) on my XP system. At first sight, it looked like you were totally right. Allocating 512MB, or memory-mapping 512MB out of a 650MB file, added 512MB to the virtual size but did not increase the working set. Following up with a VirtualLock(512MB) did not affect the working set at all!
Then it occurred to me that VirtualLock took exactly zero time in every case, which did not seem plausible e.g. for having to fetch half a gigabyte from disk. So, I checked the return code and guess what. Windows doesn't think that locking 512MB is a good idea, and will refuse to do it.
Repeated the experiment with only 64MB, and behold, the working set immediately went up by 64MB, just as it should. So, in one word: "works for me".
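For anyone who wants to repeat the experiment, this is roughly what I did; the 64MB size and the quota padding are for illustration only, and raising the quotas first (with sufficient privileges) is what makes the lock succeed:

    // Sketch of the experiment: raise the quotas, allocate, lock, and watch the
    // working set. The 64MB size and quota padding are for illustration only.
    #include <windows.h>
    #include <psapi.h>
    #include <cstdio>

    static void PrintWorkingSet(const char* label) {
        PROCESS_MEMORY_COUNTERS pmc = {};
        pmc.cb = sizeof(pmc);
        GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));
        printf("%s: working set = %zu MB\n", label, pmc.WorkingSetSize >> 20);
    }

    int main() {
        const SIZE_T size = 64u << 20;   // 64MB

        // VirtualLock only succeeds if the minimum working set quota can hold the
        // locked pages plus a small overhead, so raise the quotas first.
        SetProcessWorkingSetSize(GetCurrentProcess(), size + (16u << 20), size + (32u << 20));

        void* p = VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        PrintWorkingSet("after VirtualAlloc");

        if (!VirtualLock(p, size))       // always check the return code
            printf("VirtualLock failed: %lu\n", GetLastError());
        PrintWorkingSet("after VirtualLock");
        return 0;
    }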
Just to be sure, you did check the return code?
On a second look, this behaviour is even well-defined and well-documented. The docs to VirtualLock state explicitly:
"The maximum number of pages that a process can lock is equal to the number of pages in its minimum working set minus a small overhead."
With and without locking, after appropriately setting the WS quotas:
VirtualBox is a different matter: what you see in Task Manager is only the working set of the "Interface" program and the "Manager" frontend, both of which maintain working sets below 64MB at all times. Though I'm not sure what memory it may allocate in its drivers, or whether they lock memory at all.
I'm currently running 2 virtual machines with 1.6GB of main memory each. Seeing how my 32-bit Windows only sees 3.25GB, that would leave a mere 50MB for everything else if the memory belonging to the VMs were locked. Besides, Process Explorer tells me that Firefox alone has a working set of 474MB, going up while I'm typing this (holy...?!!). That does not make it likely that all the memory in the virtual machines is really locked, because such figures would be entirely impossible then.
As requested, here's a shot of VMMap:
The figures are admittedly funny... the VM has 1.6GB in total, of which, according to VMMap, 821MiB are reserved and 772MiB are committed, while Process Explorer only shows 163MiB and 54MiB, respectively. Something is definitely fishy there, but I suspect this is probably some obscure VirtualBox hackery rather than a Windows issue.
I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all its data can be flushed to disk at any time - the user switches task, the screen saver kicks in, the battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory-mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2GB minus the memory used by the application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which isn't the best thing to be doing.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your file system as an immutable black box; it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and the first step in any optimisation effort) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain, it's not possible to be really helpful.
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency (see the sketch after this list).
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
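As a rough illustration of the first option (sorted, fixed-size records plus a binary search over the file), here is a sketch. The 1 KB record size, the key layout and the FindRecord helper are all assumptions for illustration:

    // Sketch: binary search over a file of fixed-size records sorted by a 64-bit
    // key stored in the first 8 bytes of each record. The 1 KB record size, the
    // key layout and FindRecord itself are assumptions for illustration.
    #include <cstdint>
    #include <cstring>
    #include <fstream>
    #include <optional>
    #include <vector>

    constexpr std::int64_t kRecordSize = 1024;

    // 'file' must be opened in binary mode.
    std::optional<std::vector<char>> FindRecord(std::ifstream& file, std::uint64_t wantedKey) {
        file.seekg(0, std::ios::end);
        std::int64_t count = static_cast<std::int64_t>(file.tellg()) / kRecordSize;

        std::vector<char> buf(kRecordSize);
        std::int64_t lo = 0, hi = count - 1;

        while (lo <= hi) {
            std::int64_t mid = lo + (hi - lo) / 2;
            file.seekg(mid * kRecordSize);
            file.read(buf.data(), kRecordSize);

            std::uint64_t key;
            std::memcpy(&key, buf.data(), sizeof(key));   // key lives in the first 8 bytes

            if (key == wantedKey) return buf;             // O(log n) seeks instead of a scan
            if (key < wantedKey)  lo = mid + 1;
            else                  hi = mid - 1;
        }
        return std::nullopt;
    }

An external B-tree index works the same way in spirit, except the search walks the index and the final seek jumps straight to the record's file offset.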
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
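For completeness, here is roughly what the memory-mapped approach looks like on Windows (the file name and block size are placeholders; on Unix you'd use mmap instead):

    // Sketch: map the data file and touch a block directly; the OS pages it in on
    // first access and keeps it cached. File name and block size are placeholders.
    #include <windows.h>
    #include <cstdio>

    int main() {
        HANDLE file = CreateFileW(L"data.bin", GENERIC_READ | GENERIC_WRITE, 0,
                                  nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE, 0, 0, nullptr);
        char* base = (char*)MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);

        const size_t kBlock = 1024;
        size_t blockIndex = 42;                        // whichever block you need next
        char* block = base + blockIndex * kBlock;      // first touch may fault the page in,
        printf("first byte of block: %d\n", block[0]); // later touches hit the cache

        UnmapViewOfFile(base);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }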
The simplest way to solve this is to use an OS which solves it for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects in RAM and it will try to keep as many of them in the cache as possible, reducing the load time to zero. The recent server versions of Windows might work too (some of them didn't for me, which is why I'm mentioning this).
If this is a no-go, try this algorithm:
Create a very big file on the hard disk. It is very important that you write it in one go so the OS will allocate contiguous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of each chunk). Use an empty hard disk, or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When a chunk is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and has a lower access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them, and do the reordering overnight when there is little load on your machine. Or you can do the reordering on a completely different machine and swap in the file (and the offset table) when that's done.
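A rough sketch of the bookkeeping part of that algorithm; the chunk size, the Slot/SwapSlots names and the single bubble pass are all assumptions for illustration:

    // Sketch of the bookkeeping: slot i of the big file holds some chunk plus an
    // access counter; a maintenance pass swaps hot chunks toward the front.
    // Slot, SwapSlots and the single bubble pass are illustration only.
    #include <cstdint>
    #include <fstream>
    #include <utility>
    #include <vector>

    constexpr std::int64_t kChunkSize = 1024;

    struct Slot {
        std::uint64_t chunkId;   // which logical chunk currently lives in this slot
        std::uint64_t hits;      // how often it was accessed since the last reorder
    };

    void SwapSlots(std::fstream& file, std::vector<Slot>& slots, std::size_t i, std::size_t j) {
        std::vector<char> bufI(kChunkSize), bufJ(kChunkSize);
        file.seekg(i * kChunkSize); file.read(bufI.data(), kChunkSize);
        file.seekg(j * kChunkSize); file.read(bufJ.data(), kChunkSize);
        file.seekp(i * kChunkSize); file.write(bufJ.data(), kChunkSize);
        file.seekp(j * kChunkSize); file.write(bufI.data(), kChunkSize);
        std::swap(slots[i], slots[j]);   // keep the offset table in sync with the disk
    }

    // Maintenance pass (run it overnight): bubble frequently used chunks forward.
    void Reorder(std::fstream& file, std::vector<Slot>& slots) {
        for (std::size_t i = 1; i < slots.size(); ++i)
            if (slots[i].hits > slots[i - 1].hits)
                SwapSlots(file, slots, i, i - 1);
    }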
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: place the most-accessed items at the center of your disk (or unfragmented file), not at the start or end. That way, the average seek to lesser-used data will be shorter. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.
I always wondered what different methods Google Desktop Search uses so that it needs the least CPU and memory while indexing a computer containing more than 100,000 files on average.
In just a few hours it had indexed the whole system, and I did not see it eating up my CPU, memory, etc.
If any of you have done some research, please do share.
The trick is simple: it starts to work, then very soon stops and just sits there in memory, doing nothing. Of course it's then totally useless, but at least it stays light and fast. Sorry, couldn't resist :-) I switched to Windows Search 4.0 and I'm much happier with it.
It doesn't...
I installed it on one computer and quickly removed it because it was intrusive (although this can probably be configured) and hungry (particularly on a low-end PC).
It is installed on a laptop near me right now, and if I compare it to a couple of small utilities I run permanently (SlickRun, CLCL, my AutoHotkey script...), it uses more than 10 times their CPU and 5 to 20 times their memory. Times two, since for some reason I have one instance running alongside another, plus the ToolbarNotifier (less hungry).
Even Trend Micro anti-virus uses less memory and CPU.
Perhaps I will try it again when I get a more modern PC with lots of memory, but right now I am happy enough with some grep utilities, even if they are slower.
Take a look at disk usage. If you build many keys/indexes you will use lots of disk space and the searches will be fast.
For example:
A 30GB drive, 75% used, with 3.6GB used for 2 instances of Google Desktop. (Roaming profiles suck.)
Once it has done the initial index and written it to disk, it doesn't need to do anything.
Searching using the index requires very few resources; the only thing that does is indexing new or modified files.