Given:
NTFS volume
folder path
some date/time value - let's call it $date
What is the fastest way to search for all files with
("last modification date" > $date) or ("creation date" > $date)
Put simply, I want to search for all added or modified files.
For performance reasons I don't want to do a recursive crawl of all subfolders, reading every file's attributes.
For technical reasons (i.e. UAC, NTFS documentation) I would like to avoid parsing the \\.\$mft file.
Is there some Windows API that will allow me to do search in that way?
Edit: One more constraint: for maintenance reasons I don't want to be dependent on the indexing service.
I can say with certainty that there is no other realistic option given the problem as stated. FindFirst et al do not have a filtering mechanism. If you were keeping up with the USN journal, there might be some leeway, but otherwise, no.
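A minimal sketch of the USN-journal idea, assuming the volume has an active change journal and the process can open a volume handle (which normally requires elevation, so it may clash with the UAC constraint). The drive letter, reason mask and the cutoff parameter (your $date converted to a FILETIME-style 64-bit value) are illustrative choices, not a definitive implementation:

#include <windows.h>
#include <winioctl.h>
#include <cstdio>

// Print names of files with change-journal records newer than cutoff
// (cutoff = $date as 100-nanosecond intervals since 1601, like a FILETIME).
void PrintChangesSince(LONGLONG cutoff)
{
    HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                             FILE_SHARE_READ | FILE_SHARE_WRITE,
                             NULL, OPEN_EXISTING, 0, NULL);
    if (vol == INVALID_HANDLE_VALUE) return;

    USN_JOURNAL_DATA_V0 journal = {};
    DWORD bytes = 0;
    if (!DeviceIoControl(vol, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                         &journal, sizeof(journal), &bytes, NULL)) {
        CloseHandle(vol);
        return;
    }

    READ_USN_JOURNAL_DATA_V0 read = {};
    read.StartUsn     = journal.FirstUsn;       // or a USN checkpoint saved earlier
    read.ReasonMask   = USN_REASON_FILE_CREATE | USN_REASON_DATA_OVERWRITE |
                        USN_REASON_DATA_EXTEND | USN_REASON_DATA_TRUNCATION;
    read.UsnJournalID = journal.UsnJournalID;

    BYTE buffer[64 * 1024];
    while (DeviceIoControl(vol, FSCTL_READ_USN_JOURNAL, &read, sizeof(read),
                           buffer, sizeof(buffer), &bytes, NULL) && bytes > sizeof(USN))
    {
        DWORD offset = sizeof(USN);             // buffer starts with the next USN to read
        while (offset < bytes) {
            USN_RECORD* rec = (USN_RECORD*)(buffer + offset);
            if (rec->TimeStamp.QuadPart > cutoff)
                wprintf(L"%.*s\n", (int)(rec->FileNameLength / sizeof(WCHAR)),
                        (WCHAR*)((BYTE*)rec + rec->FileNameOffset));
            offset += rec->RecordLength;
        }
        read.StartUsn = *(USN*)buffer;          // continue from where this batch ended
    }
    CloseHandle(vol);
}

Note that journal records carry only the filename and the parent directory's file reference number, so building full paths and restricting the search to one folder still needs extra bookkeeping on your side.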
If the Windows indexing service is turned on and the files you want are indexed, you can quickly find your files using its query API.
Related
My Mac app keeps a collection of objects (with Core Data), each of which has a cover image, and to which I assign a UUID upon creation. I had originally been storing the cover images as a field in my Core Data store, but recently started storing them on disk in the file system, instead.
Initially, I'm storing the covers in a flat directory, using the UUID to name the file, as below. This gives me O(1) fetching, as I know exactly where to look.
...
/.../Covers/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
...
I've looked at the way other applications handle this task, though, and noticed a multi-level scheme, as below (for instance). This could still be implemented in O(1) time.
...
/.../Covers/A/B/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/C/D/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
...
What might be the reason to do it this way? Does OS X limit the number of files in a directory? Is it in some way faster to retrieve them from disk? It would make the code used to calculate the file's name more complicated, so I want to find out if there is a good reason to do it that way.
On certain file systems (and I believe HFS+ too), having too many files in the same directory will cause performance issues.
I used to work at an ISP where they would break up the home directories (they had 90k+ of them) using a multi-directory scheme. You can partition your directories by using, say, the first two characters of the UUID, then the next two, e.g.:
/.../Covers/3B/72/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/6B/EC/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
That way you don't need to calculate any extra characters or codes; just use the ones you already have to break it up. Since your UUIDs will be different every time, this should suffice.
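A small sketch of that partitioning scheme, assuming C++17's std::filesystem; the base directory ("Covers"), the extension and the function name are just illustrative:

#include <filesystem>
#include <string>

// Build /.../Covers/3B/72/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
// from the UUID's first four characters.
std::filesystem::path CoverPath(const std::filesystem::path& base,
                                const std::string& uuid)
{
    return base / uuid.substr(0, 2) / uuid.substr(2, 2) / (uuid + ".jpg");
}

The lookup stays O(1): the path is computed purely from the UUID, with no directory scanning.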
The main reason is that with the latter scheme, as you've mentioned, disk retrieval is faster because each directory is smaller (so the file system looks up a file in a smaller table).
As others mentioned, on some file systems it takes longer for the OS to open the file, because one directory with many files takes longer to read than a couple of small directories.
However, you should perform measurements on your particular file system and for your particular usage scenario. I did this for NTFS on Windows XP and was surprised to discover that the flat directory performed better than the hierarchical structure in all kinds of tests.
How can I get files in the same order as the Windows file system shows them? There are so many sort criteria (name, size, last modified date/time, tag (Win 7), rating (Win 7)) that simulating the sorting behaviour of the Windows file system with the CFileFind API is quite difficult. So how can I get the files in the same sequence as the Windows file system?
I'm not sure what CFileFind does, but FindFirstFile and friends return files in the order they exist in the NTFS directory.
I'm not sure why that would be the most desirable order, though; it's not exactly "intuitive" by anyone's definition...
Raymond Chen did a pretty detailed article on "Why do NTFS and Explorer disagree on filename sorting?"
However, note that FindFirstFile() and its relatives don't actually sort the results - they just give the files back to you in whatever order the filesystem hands them up. NTFS has an ordering for its own purposes (and I'm not sure that that ordering is specified - that it appears ordered to you is probably just a happy coincidence). FAT file systems and network filesystems will have their own ordering (or no ordering - the files might just be in the directory in whatever order they happened to be created - I think FAT systems are like that).
If you need a particular order for files returned by FindFirstFile() and friends, you'll need to do that sorting yourself.
From the FindFirstFile() docs: "FindFirstFile does no sorting of the search results. For additional information, see FindNextFile."
And from the docs for FindNextFile(): "The order in which the search returns the files, such as alphabetical order, is not guaranteed, and is dependent on the file system. If the data must be sorted, the application must do the ordering after obtaining all the results."
CFileFind() makes no promises about the order of the filenames returned - I'd be astonished if it did any sorting either (since it would have to get all possible files from the destination directory before returning the first one to be able to pull it off).
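If you do need Explorer-like ordering, a hedged sketch is to enumerate with FindFirstFile and then sort the names yourself with StrCmpLogicalW, the "natural" comparison Explorer uses for name sorting (the function name and directory handling below are illustrative):

#include <windows.h>
#include <shlwapi.h>
#include <algorithm>
#include <cwchar>
#include <string>
#include <vector>
#pragma comment(lib, "shlwapi.lib")

std::vector<std::wstring> ListSortedLikeExplorer(const std::wstring& dir)
{
    std::vector<std::wstring> names;
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return names;
    do {
        if (wcscmp(fd.cFileName, L".") != 0 && wcscmp(fd.cFileName, L"..") != 0)
            names.push_back(fd.cFileName);      // order here is filesystem-defined
    } while (FindNextFileW(h, &fd));
    FindClose(h);

    // Sort the way Explorer displays names: case-insensitive "natural" order.
    std::sort(names.begin(), names.end(),
              [](const std::wstring& a, const std::wstring& b) {
                  return StrCmpLogicalW(a.c_str(), b.c_str()) < 0;
              });
    return names;
}

This only covers sorting by name; size, date, tag and rating sorts would each need their own comparison over data from WIN32_FIND_DATA or the shell property system.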
I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time; in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500GB and was formatted using the default operating system options.
I'm finding that the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or from the command line, the whole machine effectively locks up for a number of minutes before it recovers enough to do anything again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below I was amazed how much of a difference this made. Cheers :-)
There are several things you could/should do:
Disable automatic NTFS short file name generation (google it)
Or restrict file names to use 8.3 pattern (e.g. i0000001.jpg, ...)
In any case try making the first six characters of the filename as unique/different as possible
If you use the same folder over and over (say adding files, removing files, re-adding files, ...)
Use contig to keep the directory's index file as unfragmented as possible (check this for an explanation)
Especially when removing many files, consider using the folder remove trick to reduce the directory index file size
As already posted, consider splitting the files up into multiple directories (a sketch of one way to do this follows the example below).
e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
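A sketch of the kind of splitter described here and in the question's update: scan the flat output folder and move each file into nested directories derived from its name (e.g. abcdefg.gif -> a/b/c/d/e/f/g.gif). The depth and naming are illustrative, and it assumes C++17's std::filesystem:

#include <filesystem>
#include <string>

namespace fs = std::filesystem;

void SpreadIntoSubdirs(const fs::path& flatDir, const fs::path& targetRoot)
{
    for (const auto& entry : fs::directory_iterator(flatDir)) {
        if (!entry.is_regular_file()) continue;
        const std::string stem = entry.path().stem().string();        // "abcdefg"
        const std::string ext  = entry.path().extension().string();   // ".gif"
        if (stem.empty()) continue;

        fs::path dest = targetRoot;
        for (size_t i = 0; i + 1 < stem.size(); ++i)   // one directory per character,
            dest /= std::string(1, stem[i]);           // except the last one
        fs::create_directories(dest);

        dest /= std::string(1, stem.back()) + ext;     // last character + extension
        fs::rename(entry.path(), dest);                // cheap move on the same volume
    }
}

Run it periodically (or from a directory-change notification) alongside the renderer so the flat folder never grows large.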
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the number of entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem---Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and relieve the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the metadata and random reads required to fetch a file are quite costly and offer no value for a data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.
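A hedged, much-simplified sketch of the haystack idea in C++ (the class and method names are illustrative, not Facebook's implementation): append every image to one big store file and keep an in-memory index of offsets, so a fetch is a single seek and read with no per-file directory metadata:

#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

class HaystackStore {
public:
    explicit HaystackStore(const std::string& path)
        : file_(path, std::ios::in | std::ios::out | std::ios::binary | std::ios::app) {}

    // Append a blob and remember where it landed.
    void Put(const std::string& key, const std::vector<char>& blob) {
        file_.seekp(0, std::ios::end);
        Entry e{static_cast<uint64_t>(file_.tellp()), blob.size()};
        file_.write(blob.data(), static_cast<std::streamsize>(blob.size()));
        file_.flush();
        index_[key] = e;
    }

    // Fetch a blob with one seek + one read.
    bool Get(const std::string& key, std::vector<char>& out) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        out.resize(it->second.size);
        file_.seekg(static_cast<std::streamoff>(it->second.offset));
        file_.read(out.data(), static_cast<std::streamsize>(out.size()));
        return static_cast<bool>(file_);
    }

private:
    struct Entry { uint64_t offset; size_t size; };
    std::fstream file_;
    std::unordered_map<std::string, Entry> index_;
};

A real store would also persist the index and handle deletion and compaction; the point is simply that millions of tiles no longer mean millions of directory entries.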
I need to get a list of all the files on a drive. I am using a recursive solution, but it is taking a lot of time. I was wondering: is it possible to get the names and locations of all the files on an NTFS drive from its Master File Table? I think it would be very fast. Any suggestions?
There is a tool that searches the MFT directly; it's called ndff. I have used it before and it is very fast.
Presumably it is possible to do what you want - there is another tool called "Everything" which I guess does the same thing - it also uses the USN change journal to update its index.
When you get a list of all the files on an NTFS-formatted drive using a recursive solution, you are getting them from the MFT. There should be little disk IO outside of the MFT when simply retrieving a list of filenames and directories.
Before going down the path of determining the format of the MFT (which is available from a variety of places on the Internet) and writing code to read it directly, you should probably profile your code and determine that you aren't already CPU or IO bound.
I have the impression you're imagining some kind of list-like structure in the MFT which you can read in one go with no or minimal seeking.
This is not the case. The MFT uses a type of b-tree to store pathnames. When you scan the directory structure on your disk, you are in fact walking the MFT b-tree; you are doing what you would have to do if you accessed the MFT directly.
Yes there is, and the program I just open-sourced does exactly this.
You can read the source to find out how it works, but basically, it just looks for FILE_NAME attributes inside the $MFT and then uses the ParentDirectory field to get the parent of every file.
That way it can completely avoid reading the contents of any directory.
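Not that program's code, but a related sketch using the documented FSCTL_ENUM_USN_DATA control code, which streams one record per MFT entry (filename plus the parent directory's file reference number) without reading any directory contents. Opening the volume handle normally requires administrator rights, and the drive letter is illustrative:

#include <windows.h>
#include <winioctl.h>
#include <cstdio>

int wmain()
{
    HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                             FILE_SHARE_READ | FILE_SHARE_WRITE,
                             NULL, OPEN_EXISTING, 0, NULL);
    if (vol == INVALID_HANDLE_VALUE) return 1;

    MFT_ENUM_DATA_V0 enumData = {};   // StartFileReferenceNumber = 0
    enumData.LowUsn  = 0;
    enumData.HighUsn = MAXLONGLONG;

    BYTE buffer[64 * 1024];
    DWORD bytes = 0;
    while (DeviceIoControl(vol, FSCTL_ENUM_USN_DATA, &enumData, sizeof(enumData),
                           buffer, sizeof(buffer), &bytes, NULL))
    {
        // The first 8 bytes of the output are the next StartFileReferenceNumber.
        DWORD offset = sizeof(DWORDLONG);
        while (offset < bytes) {
            USN_RECORD* rec = (USN_RECORD*)(buffer + offset);
            wprintf(L"parent=%llu name=%.*s\n",
                    (unsigned long long)rec->ParentFileReferenceNumber,
                    (int)(rec->FileNameLength / sizeof(WCHAR)),
                    (WCHAR*)((BYTE*)rec + rec->FileNameOffset));
            offset += rec->RecordLength;
        }
        enumData.StartFileReferenceNumber = *(DWORDLONG*)buffer;
    }
    CloseHandle(vol);
    return 0;
}

Joining names to their parents to reconstruct full paths is then an in-memory exercise over the file reference numbers.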
I have a program that uses save files. It needs to load the newest save file, but fall back on the next newest if that one is unavailable or corrupted. Can I use the Windows file creation timestamp to tell the order in which they were created, or is that unreliable? I am asking because the "changed" timestamps seem unreliable. I can embed the creation time/date in the name if I have to, but it would be easier to use the file system dates if possible.
If you have a directory full of arbitrary and randomly named files and 'time' is the only factor, it may make more sense to use a filename that encodes the timestamp, eliminating the need for tools to inspect it.
2008_12_31_24_60_60_1000
Would be my recommendation for a flatfile system.
Sometimes if you have a lot of files, you may want to group them, ie:
2008/
2008/12/
2008/12/31
2008/12/31/00-12/
2008/12/31/13-24/24_60_60_1000
or something larger
2008/
2008/12_31/
etc etc etc.
(Moreover, if you're not embedding the time, what is your other distinguishing characteristic? You can't have a null file name, and creating monotonically increasing sequences is much harder. Need more info.)
What do you mean by "reliable"? When you create a file, it gets a timestamp, and that works. Now, the resolution of that timestamp is not necessarily high - on FAT the last-write time has a 2-second resolution, while NTFS stores timestamps in 100-nanosecond units. So if you are saving your files at a rate of less than one per second, you should be fine there. Keep in mind that the user can change the timestamp value arbitrarily. If you are concerned about that, you'll have to embed the timestamp in the file itself (although in my opinion that would be overkill).
Of course if the user of the machine is an administrator, they can set the current time to whatever they want it to be, and the system will happily timestamp files with that time.
So it all depends on what you're trying to do with the information.
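If creation time is deemed good enough, a minimal sketch of picking the newer of two save files (the helper name and error handling are illustrative):

#include <windows.h>

// Returns true if pathA was created after pathB.
bool CreatedAfter(const wchar_t* pathA, const wchar_t* pathB)
{
    WIN32_FILE_ATTRIBUTE_DATA a = {}, b = {};
    if (!GetFileAttributesExW(pathA, GetFileExInfoStandard, &a) ||
        !GetFileAttributesExW(pathB, GetFileExInfoStandard, &b))
        return false;                 // treat unreadable files as "not newer"
    return CompareFileTime(&a.ftCreationTime, &b.ftCreationTime) > 0;
}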
Windows (NTFS) stores timestamps in UTC, so when your timezone changes (i.e. when daylight saving starts or ends) the stored value stays the same and only the displayed local time moves forward/back an hour. Apart from that, and the limited resolution noted above, there is no reason to think the timestamps are invalid, and it's certainly OK to use them. But I think it's bad practice when you can simply put the timestamp in the name, or even in the file itself.
What if the system time is changed for some reason? It seems handy, but perhaps a simple incrementing version number would be better.
Added: A similar question, but with databases, here.
I faced some issues with the creation time of a file after deleting and recreating it under the same name.
Something similar to this comment in the GetFileAttributesEx docs:
Problem getting correct Creation Time after file was recreated
I tried to use GetFileAttributesEx and then get the ftCreationTime field of the resulting WIN32_FILE_ATTRIBUTE_DATA structure. It works just fine at first, but after I delete the file and recreate it again, it keeps giving me the original, already incorrect value until I restart the process again. The same problem happens with the FindFirstFile API as well. I use Windows 2003.
This is said to be related to something called file tunnelling.
Try using this when you want to rename the file:
Path.Combine(ArchivedPath, currentDate + " " + fileInfo.Name)