I have to collect certain attributes of files (modification date and so on). But there are many small files to analyze.
My question is: would it be faster to read, say, 3 or 4 files at the same time? When you access a file on the web this helps, since you have to wait for the server to respond. But what about a hard disk? Is the concurrent strategy faster if the files are already cached by the hard disk?
You seem to be accessing metadata (mtime), which is stored in the file's inode and therefore in the file system itself, not in the file contents. Your limiting factor should (in UNIX terms) be the syscall that fetches the stat information, which could profit from parallelization.
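A minimal sketch of that idea in Python, assuming a small thread pool is acceptable. The function name collect_mtimes and the default of 4 workers are my own choices for illustration, not something from the question. Whether this actually beats a plain loop depends on the disk and on how much of the metadata is already cached, so measure both.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def collect_mtimes(paths, workers=4):
    """Return {path: mtime}, issuing the stat calls from a small thread pool."""

    def stat_one(path):
        # os.stat only reads metadata; the file contents are never opened.
        return path, os.stat(path).st_mtime

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order and runs stat_one concurrently.
        return dict(pool.map(stat_one, paths))
```

Threads are enough here because the work is waiting on the operating system rather than computing in Python.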
Everything is a file searching program. As its author hasn't released the source code, I am wondering how it works.
How could it index files so efficiently?
What data structures does it use for file searching?
How can its file searching be so fast?
To quote its FAQ,
"Everything" only indexes file and folder names and generally takes a
few seconds to build its database. A fresh install of Windows 10
(about 120,000 files) will take about 1 second to index. 1,000,000
files will take about 1 minute.
If it takes only one second to index the whole Windows 10, and takes only 1 minute to index one million files, does this mean it can index 120,000 files per second?
To make the search fast, there must be a special data structure. In most cases a file-name search matches not only from the start of the name but also from the middle. This makes some widely used indexing structures, such as the trie and the red–black tree, ineffective.
The FAQ clarifies further.
Does "Everything" hog my system resources?
No, "Everything" uses very little system resources. A fresh install of
Windows 10 (about 120,000 files) will use about 14 MB of ram and less
than 9 MB of disk space. 1,000,000 files will use about 75 MB of ram
and 45 MB of disk space.
Short Answer: MFT (Master File Table)
Getting the data
Many search engines used to walk the entire directory structure recursively to find all the files, so indexing took a long time even when file contents were not indexed. If contents were also indexed, it took a lot longer.
From my analysis of Everything, it does not recurse at all. Judging by the speed, it indexed an entire 1 TB drive (SSD) in about 5 seconds. If it had to recurse it would take longer, since there are thousands of small files, each with its own file size, date etc., spread all across the disk.
Instead, Everything does not even touch the actual data; it reads and parses the 'index' of the hard drive. On NTFS, the MFT stores all the file names and their physical locations (similar to the concept of inodes in Linux). So all of this information is present in one small contiguous area (a file), and the search indexer does not have to waste time finding where the info about the next file is; it does not have to seek. The MFT is contiguous by design (the rare exception is when there are many more files and the MFT for some reason fills up or is corrupt; it then links to a new one, which costs a seek, but that edge case is very rare).
The MFT is not plain text; it needs to be parsed. The folks at Everything have designed a super-fast parser and decoder for NTFS, and hence all is well.
FSCTL_GET_NTFS_VOLUME_DATA (declared in winioctl.h) will get you the cluster locations for the MFT. Alternatively, you could use NTFSInfo (Microsoft SysInternals - NTFSInfo v1.2), which reports for example:
MFT zone clusters : 90400352 - 90451584
Storing and retrieving
The .db file for my index is at C:\Users\XXX\AppData\Local\Everything. I assume this is a regular NoSQL file-based database. Since it uses a DB and not a flat file, that contributes to the speed. Also, when the program starts it loads this .db file into RAM, so queries never hit the DB on disk, only RAM. All this combined makes it slick.
How could it index files so efficiently?
First, it indexes only file/directory names, not contents.
I don't know if it's efficient enough for your needs, but the ordinary way is with the FindFirstFile function. Write a simple C program to list folders/files recursively, and see if it's fast enough for you. The second optimization step would be running threads in parallel, but disk access may be the bottleneck, in which case multiple threads would add little benefit.
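If you want to try the recursive listing before writing any C, here is a rough Python equivalent as a sketch. On Windows, os.scandir is built on FindFirstFileW/FindNextFileW, so it exercises the same API. The name list_tree is made up for this example.

```python
import os


def list_tree(root):
    """Yield (path, size, mtime) for every regular file below root."""
    stack = [root]  # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        info = entry.stat(follow_symlinks=False)
                        yield entry.path, info.st_size, info.st_mtime
        except OSError:
            # Unreadable directory or entry (permissions etc.): skip the
            # rest of this directory and carry on with the others.
            continue
```

Timing sum(1 for _ in list_tree("C:\\")) gives a baseline before trying threads or the native API.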
If this is not enough, finally you could try to dig into the even lower Native API functions; I have no experience with these, so I can't give you further advice. You'd be pretty close to the metal, and maybe the Linux NTFS project has some concepts you need to learn.
What data structures does it use for file searching?
How can its file searching be so fast?
Well, you know there are many different data structures designed for fast searching... probably the author ran a lot of benchmarks.
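Since Everything is closed source, nobody outside can say what it really uses. Just to show that "match anywhere in the name" does not strictly require an exotic structure, here is a sketch of the simplest possible approach: keep all names in RAM and scan them linearly. This is my own illustration, not Everything's algorithm, and build_index and search are hypothetical names.

```python
import ntpath  # splits on both "\\" and "/", whatever OS this runs on


def build_index(paths):
    """Keep (lowercased file name, full path) pairs in a plain in-memory list."""
    return [(ntpath.basename(p).lower(), p) for p in paths]


def search(index, needle):
    """Return every path whose file name contains needle, case-insensitively."""
    needle = needle.lower()
    return [path for name, path in index if needle in name]
```

Usage: search(build_index(all_paths), "note") returns every path whose file name contains "note" anywhere, not only as a prefix. Whether a scan like this is fast enough at a million names is something you would have to benchmark.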
Is there any limit on the number of files in one directory (on any host)?
If I have a directory with 30k files (named from 1 to 30k) and another one with only 10, is there a major difference in performance when fetching a specific file?
thanx
It depends on your file system type. You will find the answer in the specification of your current file system.
Arch Linux Wiki performance optimizing page
Summary:
XFS: Excellent performance with large files. Low speed with small files. A good choice for /home.
Reiserfs: Excellent performance with small files. A good choice for /var.
Ext3: Average performance, reliable.
Ext4: Great overall performance, reliable, has performance issues with sqlite and some other databases.
JFS: Good overall performance, very low CPU usage, extremely fast resume after power failure.
Btrfs: Probably best overall performance (with compression) and lots of features. Still in heavy development and fully supported, but considered as unstable. Do not use this filesystem yet unless you know what you are doing and are prepared for potential data loss.
fsck time vs Inode Count
I'd say the max number of files is OS specific and file system specific. But having a huge number of files in one directory can drastically hurt your performance when accessing a file.
I cannot give you any numbers for any specific OS/FS, but here is a possible solution if you have performance issues:
In the MediaWiki software (that's the software Wikipedia runs on) they use subdirectories to counter that problem. This is how they store media files:
md5-hash the name of the file
take the first digit of the md5hash as subdirectory of the files dir
take the first 2 digits of the md5hash as name of a subsubdirectory to that subdirectory
store the file there
This way they can find the file by name alone, without relying on a good OS/FS to handle zillions of files. It results in something like this:
http://upload.wikimedia.org/wikipedia/commons/7/74/Flag_of_Hamburg.svg is the path for Flag_of_Hamburg.svg
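The four steps above can be sketched in a few lines of Python. This follows the scheme as described here (MD5 of the file name, first hex digit, then first two hex digits); the function name hashed_path is made up, and encoding the name as UTF-8 before hashing is my assumption.

```python
import hashlib


def hashed_path(filename):
    """Build 'd/dd/filename' from the MD5 hex digest of the file name."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # First hex digit names the subdirectory, first two the sub-subdirectory.
    return f"{digest[0]}/{digest[:2]}/{filename}"
```

With hex digits this spreads files over 16 directories of 16 subdirectories each, i.e. 256 buckets, and the location can always be recomputed from the name alone.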
I have a Delphi app that references a datafile of 28-byte records. The file is written sequentially but read randomly. The datafile is split into N physical files which are rolled over at 10 megs or so to provide some insurance against disk problems, and because we are only ever writing to the most recent one, and I found it became slower and slower to write to if it were allowed to grow too big. On startup I read the entire file set and build an index so that I can quickly know which file to seek into given a virtual record number.
As part of the splitting into N files I implemented a read cache. I realise now that Windows does a fair amount of caching on its own, and I wonder if I'm gaining anything by sticking another cache between myself and the disk files.
Any thoughts appreciated.
No. Use file mappings to use the existing file cache more efficiently.
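To illustrate what a file mapping buys you, here is a small sketch in Python rather than Delphi (in a native Windows app the corresponding calls would be CreateFileMapping and MapViewOfFile). The 28-byte record size comes from the question; read_record is a hypothetical helper, and a real program would keep the mapping open instead of re-creating it per read.

```python
import mmap

RECORD_SIZE = 28  # record length taken from the question


def read_record(path, record_number):
    """Return one fixed-size record through a read-only memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are served by the OS file cache.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            start = record_number * RECORD_SIZE
            return view[start:start + RECORD_SIZE]
```

Random reads then become plain memory slices, and the caching is left entirely to the operating system, so a second cache in the application adds little.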
I have the following loop:
for fileName in fileList:
    # open each file once and close it again as soon as it has been read
    with open(fileName) as f:
        txt = f.read()
    analyze(txt)
The fileList is a list of more than 1 million small files. Empirically, I have found that the call to open(fileName) takes more than 90% of the loop running time. What would you do in order to optimize this loop? This is a "software only" question; buying new hardware is not an option.
Some information about this file collection:
Each file name is a 9-13 digit ID. The files are arranged in subfolders according to the first 4 digits of the ID. The files are stored on an NTFS disk and I'd rather not change the disk format for reasons I won't get into, unless someone here has a strong belief that such a change will make a huge difference.
Solution
Thank you all for the answers.
My solution was to pass over all the files, parsing them and putting the results in an SQLite database. Now the analyses that I perform on the data (select several entries, do the math) take only seconds. As already mentioned, the reading part took about 90% of the time, so parsing the XML files in advance had little effect on the performance, compared to the effect of not having to read the actual files from the disk.
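For anyone wanting to do the same, a rough sketch of the one-time loading pass. The table layout (results with name and value columns) and the parse callback are placeholders I made up; the real schema depends on what your analysis needs.

```python
import sqlite3


def load_into_sqlite(db_path, paths, parse):
    """Parse every file once and store the result keyed by file path."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS results (name TEXT PRIMARY KEY, value TEXT)"
    )
    with con:  # a single transaction around all the inserts
        for path in paths:
            with open(path, encoding="utf-8") as f:
                value = parse(f.read())
            con.execute(
                "INSERT OR REPLACE INTO results VALUES (?, ?)", (path, value)
            )
    return con
```

Later runs only query the database, e.g. con.execute("SELECT value FROM results WHERE name = ?", (some_path,)), and never touch the million small files again.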
Hardware solution
You should really benefit from using a solid state drive (SSD). These are a lot faster than traditional hard disk drives, because they don't have any hardware components that need to spin and move around.
Software solution
Are these files under your control, or are they coming from an external system? If you're in control, I'd suggest you use a database to store the information.
If a database is too much of a hassle for you, try to store the information in a single file and read from that. If the file isn't fragmented too much, you'll have much better performance compared to having millions of small files.
If opening and closing files is taking most of your time, a good idea would be to use a database or data store for your storage rather than a collection of flat files.
To address your final point:
unless someone here has a strong belief that such a change will make a huge difference
If we're really talking about 1 million small files, merging them into one large file (or a small number of files) will almost certainly make a huge difference. Try it as an experiment.
Store the files in a single .zip archive and read them from that. You are just reading these files, right?
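A sketch of what that could look like with Python's zipfile module, assuming the files are text and read-only. pack and iter_texts are names I made up, and storing entries under their base name assumes the names are unique (true here, since each name is an ID).

```python
import os
import zipfile


def pack(paths, archive_path):
    """Copy all files into one compressed archive, stored under their base names."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=os.path.basename(p))


def iter_texts(archive_path):
    """Yield (name, text) for every entry, opening the archive only once."""
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            yield name, zf.read(name).decode("utf-8")
```

The loop then becomes: for name, txt in iter_texts("all.zip"): analyze(txt), with a single open() on the file system instead of a million.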
So, let's get this straight: you have sound empirical data showing that your bottleneck is the filesystem, but you don't want to change your file structure? Look up Amdahl's law. If opening the files takes 90% of the time, then without changing that part of the program you will not be able to cut the running time by more than 10%.
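Amdahl's law in one function, in case the arithmetic helps. It is the standard formula: overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of the run time you improve and s is how much faster that fraction becomes. The function name is my own.

```python
def overall_speedup(fraction_improved, local_speedup):
    """Amdahl's law: total speedup when only part of the run time gets faster."""
    remaining = 1.0 - fraction_improved          # part left untouched
    improved = fraction_improved / local_speedup  # part after the speedup
    return 1.0 / (remaining + improved)
```

With opening files at 90% of the time, making everything else infinitely fast gives overall_speedup(0.1, float("inf")), which is 1/0.9, roughly 1.11x. Making the file access itself 10 times faster gives overall_speedup(0.9, 10), roughly 5.26x.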
Take a look at the properties dialog box for the directory containing all those files. I'd imagine the "size on disk" value is much larger than the total size of the files, because of the overhead of the filesystem (things like per-file metadata that is probably very redundant, and files being stored with an integer number of 4k blocks).
Since what you have here is essentially a large hash table, you should store it in a file format that is better suited to that kind of usage. Depending on whether you will need to modify these files and whether the data set will fit in RAM, you should look into using a full-fledged database, a lightweight embeddable one like sqlite, your language's hash table/dictionary serialization format, a tar archive, or a key-value store program that has good persistence support.
I'm looking to build a server with lots of tiny files delivered by an XML API. It won't be doing a whole lot of iterating over directories or blocks of sequential files--we're talking lots and lots of seeks for discontinuous data.
Will seek time on BSD UFS degrade over time for requests for individual files? I understand that the filesystem's inode limit is based on the size of the partition/slice, but the hard drive has to step through the inode table for every file request before it can discover the location of the data. What filesystem yields the best performance for seek time?
The alternative is to set up 2-4GB "blob" files and have a separate system for seeking to a file contained in them from within the software. The software's "inode table" could be optimized for delivery based on the currently logged in user, etc. These "inode tables" would likely be cached in RAM and would only relate to the users currently logged in, so that fewer resources are wasted.
How do these two solutions rate from a scalability and maintenance standpoint? What sort of performance gains, if any, could I expect by using the second solution?
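To make the "blob" idea concrete, here is a minimal sketch of what I have in mind, with the software-side "inode table" reduced to a dictionary of offsets and lengths. The function names are made up, and a real version would also need to persist the index and handle deletes and updates.

```python
def build_blob(paths, blob_path):
    """Append every file to one blob and return {path: (offset, length)}."""
    index = {}
    with open(blob_path, "wb") as blob:
        for p in paths:
            with open(p, "rb") as f:
                data = f.read()
            index[p] = (blob.tell(), len(data))  # where this file starts
            blob.write(data)
    return index


def read_from_blob(blob, index, name):
    """Fetch one stored file with a single seek on an already open blob."""
    offset, length = index[name]
    blob.seek(offset)
    return blob.read(length)
```

Serving a request is then one dictionary lookup in RAM plus one seek in a file that stays open, instead of a path lookup through the file system for every tiny file.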
The most obvious and time-proven mitigation technique is to use a good hierarchical design for directories (and pathname search strategies), and have more directories with fewer files in each.
For recent FreeBSD versions with dirhash and softupdates I have seen no problems with a few tens of thousands of files per directory. You probably don't want to go north of 500.000 files or so. E.g. deleting a directory with 2.500.000 files took me three days.
I'm not sure I understand your question correctly, but if you want to seek over lots of files, why not use a partitioned MySQL table laid out on a RAID0 or VFS filesystem?
Edit: as far as I know, lots of files in one folder will degrade the speed of any FS, as it has to maintain bigger lists of files, permissions and names; a database is designed to keep lists of data in memory and seek through them in a very optimized way.
More details of your situation would be helpful: do the files already exist, or would they be created by your application? If you need a way to store arbitrary data without the structure of a relational database, have you looked at object databases?
Another option, if your objects should or can be accessed via HTTP, is to use a Varnish cache in front of a small web server. Initially objects would be stored on disk, but Varnish would store and serve objects from memory after the first access to a given object.