Optimizing file reading from HD - performance

I have the following loop:
for fileName in fileList:
    with open(fileName) as f:
        txt = f.read()
    analyze(txt)
The fileList is a list of more than 1 million small files. Empirically, I have found that the call to open(fileName) takes more than 90% of the loop's running time. What would you do in order to optimize this loop? This is a "software only" question; buying new hardware is not an option.
Some information about this file collection:
Each file name is a 9-13 digit ID. The files are arranged in subfolders according to the first 4 digits of the ID. The files are stored on an NTFS disk, and I would rather not change the disk format for reasons I won't get into, unless someone here has a strong belief that such a change will make a huge difference.
Solution
Thank you all for the answers.
My solution was to pass over all the files, parsing them and putting the results in an SQLite database. Now the analyses that I perform on the data (select several entries, do the math) take only seconds. As already said, the reading part took about 90% of the time, so parsing the XML files in advance had little effect on the performance, compared to the effect of not having to read the actual files from the disk.
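A minimal sketch of that one-time pass, assuming Python's built-in sqlite3 module. The table name docs, its columns, and the helper names build_index and load are illustrative assumptions. The sketch stores the raw file text, whereas the solution above stored parsed results.

```python
import os
import sqlite3


def build_index(root_dir, db_path):
    """Read every file under root_dir once and store its text in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, body TEXT)"
    )
    with conn:  # one commit for the whole pass instead of one per file
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:
                    body = f.read()
                conn.execute(
                    "INSERT OR REPLACE INTO docs (id, body) VALUES (?, ?)",
                    (name, body),  # the file name is the ID (assumption)
                )
    conn.close()


def load(db_path, doc_id):
    """Fetch one stored document by its ID, or None if it is missing."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT body FROM docs WHERE id = ?", (doc_id,)
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None
```

The slow pass over a million files is paid once; later analyses query the single database file instead of opening each small file again.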

Hardware solution
You should really benefit from using a solid state drive (SSD). These are a lot faster than traditional hard disk drives, because they don't have any hardware components that need to spin and move around.
Software solution
Are these files under your control, or are they coming from an external system? If you're in control, I'd suggest you use a database to store the information.
If a database is too much of a hassle for you, try to store the information in a single file and read from that. If the file isn't fragmented too much, you'll have much better performance compared to having millions of small files.

If opening and closing files is taking most of your time, a good idea would be to use a database or data store for your storage rather than a collection of flat files.

To address your final point:
unless someone here has a strong belief that such a change will make a huge difference
If we're really talking about 1 million small files, merging them into one large file (or a small number of files) will almost certainly make a huge difference. Try it as an experiment.

Store the files in a single .zip archive and read them from that. You are just reading these files, right?
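A rough sketch of the archive idea, assuming Python's standard zipfile module. The helper names pack and read_all are hypothetical, and storing each member under its base name assumes the IDs are unique across subfolders.

```python
import os
import zipfile


def pack(file_paths, archive_path):
    """Copy many small files into one uncompressed .zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as zf:
        for path in file_paths:
            # Store each member under its base name (assumes unique IDs).
            zf.write(path, arcname=os.path.basename(path))


def read_all(archive_path):
    """Yield (name, text) for every member, opening the archive only once."""
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            yield name, zf.read(name).decode("utf-8")
```

With this layout the original loop could become something like `for name, txt in read_all("all.zip"): analyze(txt)`, so the filesystem is asked to open one file rather than a million.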

So, let's get this straight: you have sound empirical data that shows that your bottleneck is the filesystem, but you don't want to change your file structure? Look up Amdahl's law. If opening the files takes 90% of the time, then without changing that part of the program, you will not be able to cut the running time by more than 10% (a speedup of at most about 1.11x).
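To make the Amdahl's law argument concrete, here is a small worked sketch. The function name amdahl_speedup is an illustrative assumption; the formula is the standard one, with the improved fraction of the run time divided by its speedup factor.

```python
def amdahl_speedup(fraction_improved, factor):
    """Overall speedup when `fraction_improved` of the run time becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)


# Making the other 10% of the loop infinitely fast barely helps:
best_without_io = amdahl_speedup(0.1, float("inf"))  # 1 / 0.9, about 1.11x

# Making the 90% spent opening files 10 times faster helps a lot:
with_faster_io = amdahl_speedup(0.9, 10)  # 1 / 0.19, about 5.26x
```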
Take a look at the properties dialog box for the directory containing all those files. I'd imagine the "size on disk" value is much larger than the total size of the files, because of the overhead of the filesystem (things like per-file metadata that is probably very redundant, and files being stored with an integer number of 4k blocks).
Since what you have here is essentially a large hash table, you should store it in a file format that is more suited to that kind of usage. Depending on whether you will need to modify these files and whether the data set will fit in RAM, you should look into using a full-fledged database, a lightweight embeddable one like SQLite, your language's hash table/dictionary serialization format, a tar archive, or a key-value store program that has good persistence support.

Related

What makes Everything's file search and index so efficient?

Everything is a file searching program. As its author hasn't released the source code, I am wondering how it works.
How could it index files so efficiently?
What data structures does it use for file searching?
How can its file searching be so fast?
To quote its FAQ,
"Everything" only indexes file and folder names and generally takes a
few seconds to build its database. A fresh install of Windows 10
(about 120,000 files) will take about 1 second to index. 1,000,000
files will take about 1 minute.
If it takes only one second to index the whole Windows 10, and takes only 1 minute to index one million files, does this mean it can index 120,000 files per second?
To make the search fast, there must be a special data structure. Searching by file name doesn't only match from the start of the name, but in most cases also from the middle. This makes some widely used indexing structures, such as tries and red-black trees, ineffective.
The FAQ clarifies further.
Does "Everything" hog my system resources?
No, "Everything" uses very little system resources. A fresh install of
Windows 10 (about 120,000 files) will use about 14 MB of ram and less
than 9 MB of disk space. 1,000,000 files will use about 75 MB of ram
and 45 MB of disk space.
Short Answer: MFT (Master File Table)
Getting the data
Many search engines used to walk recursively through the entire disk structure in order to find all the files. That is why the indexing process used to take a long time (even when contents were not indexed). If contents were also indexed, it would take a lot longer.
From my analysis of Everything, it does not recurse at all. Judging by the speed, it indexed an entire 1 TB drive (SSD) in about 5 seconds. If it had to recurse, it would take longer, since there are thousands of small files, each with its own file size, date, etc., spread all across the disk.
Instead, Everything does not even touch the actual data; it reads and parses the 'index' of the hard drive. On NTFS, the MFT stores all the file names and their physical locations (similar to the concept of inodes in Linux). So all of this data is present in one small contiguous area (a file), and the search indexer does not have to waste time finding where the information about the next file is; it does not have to seek. The MFT is contiguous by design (the rare exception is when there are many more files and the MFT for some reason fills up or is corrupt, in which case it links to a new one, which costs a seek, but that edge case is very rare).
The MFT is not plain text; it needs to be parsed. The folks at Everything have designed a superfast parser and decoder for NTFS, and hence all is well.
FSCTL_GET_NTFS_VOLUME_DATA (declared in winioctl.h) will get you the cluster locations for the MFT. Alternatively, you could use NTFSInfo (Microsoft SysInternals - NTFSInfo v1.2).
MFT zone clusters : 90400352 - 90451584
Storing and retrieving
The .db file from my index is at C:\Users\XXX\AppData\Local\Everything. I assume this is a regular NoSQL file-based database. Since it uses a DB and not a flat file, that contributes to the speed. Also, at program start, it loads this .db file into RAM, so queries do not look up the DB on disk but in RAM. All this combined makes it slick.
How could it index files so efficiently?
First, it indexes only file/directory names, not contents.
I don't know if it's efficient enough for your needs, but the ordinary way is with FindFirstFile function. Write a simple C program to list folders/files recursively, and see if it's fast enough for you. The second step through optimization would be running threads in parallel, but maybe disk access would be the bottleneck, if so multiple threads would add little benefit.
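As a rough baseline for the "ordinary way" described above, here is a sketch that lists names recursively. It swaps in Python's os.scandir in place of a C program calling FindFirstFile directly, and the function name list_names is an illustrative assumption. Timing it on your own disk shows how far plain directory walking gets you.

```python
import os


def list_names(root):
    """Return the names of all files and folders under root, without reading contents."""
    names = []
    stack = [root]  # explicit stack instead of recursive calls
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    names.append(entry.name)
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except PermissionError:
            continue  # skip folders we are not allowed to read
    return names
```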
If this is not enough, finally you could try to dig into the even lower Native API functions; I have no experience with these, so I can't give you further advice. You'd be pretty close to the metal, and maybe the Linux NTFS project has some concepts you need to learn.
What data structures does it use for file searching?
How can its file searching be so fast?
Well, you know there are many different data structures designed for fast searching... probably the author ran a lot of benchmarks.

The max number of files in one directory?

Is there any limitation on number of files in one directory (in any host) ?
If I have a directory with 30k files (named from 1 to 30k) and another one with only 10, is there a major difference in performance when fetching a specific file?
thanx
It depends on your file system type. The answer to this question can be found in the specification of your current file system.
Arch Linux wiki performance optimization page
Summary:
XFS: Excellent performance with large files. Low speed with small files. A good choice for /home.
Reiserfs: Excellent performance with small files. A good choice for /var.
Ext3: Average performance, reliable.
Ext4: Great overall performance, reliable, but has performance issues with SQLite and some other databases.
JFS: Good overall performance, very low CPU usage, extremely fast resume after power failure.
Btrfs: Probably the best overall performance (with compression) and lots of features. Still in heavy development and fully supported, but considered unstable. Do not use this filesystem yet unless you know what you are doing and are prepared for potential data loss.
fsck time vs Inode Count
I'd say the max number of files is OS specific and file system specific. But having a huge number of files in one directory can drastically hurt your performance when accessing a file.
I can not give you any numbers for any specific os/fs, but maybe a solution if you have performance issues:
In the MediaWiki software (that's the software Wikipedia runs on), they use subdirectories to counter that problem. This is how they store media files:
md5-hash the name of the file
take the first digit of the md5hash as subdirectory of the files dir
take the first 2 digits of the md5hash as name of a subsubdirectory to that subdirectory
store the file there
This way they can find the file by its name only, but don't need to rely on a good OS/FS for zillions of files. It results in something like this:
http://upload.wikimedia.org/wikipedia/commons/7/74/Flag_of_Hamburg.svg is the path for Flag_of_Hamburg.svg
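The steps above can be sketched as follows, assuming Python's hashlib. The function name hashed_path is hypothetical, and this follows the description in this answer, not MediaWiki's actual source code.

```python
import hashlib


def hashed_path(file_name):
    """Return 'x/xy/file_name', where x and xy are the first 1 and 2 hex digits of the name's MD5."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{file_name}"
```

Because the path is derived from the name alone, no lookup table is needed, and the files spread roughly evenly over 16 first-level and 256 second-level directories.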

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, the list would have the Write Once Read Many (WORM) characteristics?
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL Database but is it suitable for WORM (UPDATE: using VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you could optimize the datastructure for your most common operations (in this case a lookup) at the cost of memory space. For persistence, you could store the data to a flat file, and read the data during startup.
SQL Databases are great for storing and reading relational data. For instance storing Names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data. Constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant-time lookups), or write a simple indexed file accessor class that stores the data in a search-optimized order (worst case a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to care about file system storage, probably SQL is your best shot. You can optimize your table indexes but also will add another external dependency on your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (extra app dependency, design expertise); fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list on it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy an USB stick which has a "write protect" switch but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work? Just store the numbers in a text file and read them into a hash set on application startup. On my computer, reading in 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
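A minimal sketch of that "simplest thing", assuming one entry per line in a text file. The function name load_numbers is an illustrative assumption. Entries are kept as strings, in line with the question's update about using VARCHAR in place of INT.

```python
def load_numbers(path):
    """Read one entry per line into a set for constant-time membership tests."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines and strip the trailing newline from each entry.
        return {line.strip() for line in f if line.strip()}
```

After `numbers = load_numbers("numbers.txt")` at startup (the file name is a placeholder), each user input is checked with `user_input in numbers`, with no disk access per request.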

Performance of one huge unix directory VS a directory tree?

My PHP project will use thousands of pictures, and each needs only a single number for its storage name.
My initial idea was to put all of the pictures in a single directory and name the files "0.jpg", "1.jpg", "2.jpg", and all the way to "4294967295.jpg" .
Would it be better performance-wise to create a directory tree structure and name the files something like "429 / 496 / 7295.jpg"?
If the answer is yes, then the follow up question would be: what is the optimal amount of subdirs or files per level of depth? And what effect does the chosen filesystem have on this?
Each file will have a corresponding MySQL entry with an UNSIGNED LONGINT id-number.
Thank you.
Yes, hard-to-say, quite a bit, perhaps you should use a database
The conventional wisdom is "use a database", but using the filesystem is a reasonable plan for larger objects like images.
Some filesystems have limits on the number of directory entries. Some filesystems do not have any sort of data structure for filename lookups, but just do a linear scan of the directory.
Optimizations like you are discussing are restricted to specific environmental profiles. Do you even know right now what future hardware your application will run on? Might it be a good idea to not stress the filesystem and make a nice, hierarchical directory structure? If you do that it will run well on any filesystem or storage server.
It depends on which filesystem is being used. ext{2,3,4} have a dir_index option that can be set when they are created, which makes storing thousands or even millions of files in a single directory reasonably fast.
btrfs is not yet production ready, but it implicitly supports this idea at a very basic level.
But if you're using the ext series without dir_index or most other Unix filesystems you will need to go for the more complex scheme of having several levels of directories. I would suggest you avoid that if you can. It just adds a lot of extra complication for something filesystems ought to be handling reasonably for you.
If you do use the more complex scheme, I would suggest encoding the number in hex and having 256 files/directories at each level. Filesystems that aren't designed to handle large numbers of files in each directory typically do linear scans. The goal is to approximate a B-Tree type structure on your own. 2 hex digits at each level gives you about half a 4kiB (a common size) disk block per level with common means of encoding directories. That's about as good as you're going to get without a really complicated scheme like encoding your numbers in base 23 or base 24.
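The hex scheme suggested above could look roughly like this. It is sketched in Python rather than the question's PHP, and the function name id_to_path and the 32-bit range (taken from the question's largest file name, 4294967295) are assumptions.

```python
def id_to_path(file_id, extension=".jpg"):
    """Map a 32-bit ID to 'aa/bb/cc/dd.jpg', using two hex digits (256 entries) per level."""
    if not 0 <= file_id <= 0xFFFFFFFF:
        raise ValueError("file_id must fit in 32 bits")
    hex_id = f"{file_id:08x}"  # zero-padded to 8 hex digits
    parts = [hex_id[i:i + 2] for i in range(0, 8, 2)]
    return "/".join(parts[:-1]) + "/" + parts[-1] + extension
```

For example, `id_to_path(255)` gives `00/00/00/ff.jpg` and `id_to_path(4294967295)` gives `ff/ff/ff/ff.jpg`, so no directory ever holds more than 256 entries.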
Having several thousands files in one directory will slow things down considerably. I'd say a safe number is up to 1024 files per directory, 512 even better.
The answer, of course, is: It depends.
In particular, it depends on which file system you use. For example, the ext2 and ext3 file systems have limits on the number of files per directory. Those file systems would not be able to put all of your pictures in one directory!
You might look into something other than a file system. In the company I work for, because we needed to store lots of material, we moved from file-based storage to a database-based storage run on Apache Jackrabbit.

Filesystem seek performance with lots of tiny files

I'm looking to build a server with lots of tiny files delivered by an XML API. It won't be doing a whole lot of iterating over directories or blocks of sequential files--we're talking lots and lots of seeks for discontinuous data.
Will seek time on BSD UFS degrade over time for requests for individual files? I understand that the filesystem's inode limit is based on the size of the partition/slice, but the hard drive has to step through the inode table for every file request before it can discover the location of the data. What filesystem yields the best performance for seek time?
The alternative is to set up 2-4GB "blob" files and have a separate system within the software for seeking a file contained in them. The software's "inode table" could be optimized for delivery based on the currently logged-in user, etc. These "inode tables" would likely be cached in RAM and would only relate to the users currently logged in, so that there are fewer wasted resources.
Where do these two solutions rate on a scalability and maintenance standpoint? What sort of performance gains, if any, could I expect by using the second solution?
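For illustration, the "blob" alternative could be sketched as below. The helper names pack_blob and read_from_blob and the in-memory dict standing in for the software's "inode table" are assumptions; a real system would also need to persist the index and handle updates and deletions.

```python
def pack_blob(items, blob_path):
    """Write each (name, data) pair into one blob file; return {name: (offset, length)}."""
    index = {}
    with open(blob_path, "wb") as blob:
        for name, data in items:
            index[name] = (blob.tell(), len(data))  # where this record starts
            blob.write(data)
    return index


def read_from_blob(blob, index, name):
    """Seek to a stored record in an already open blob file and return its bytes."""
    offset, length = index[name]
    blob.seek(offset)
    return blob.read(length)
```

The blob file is opened once and kept open, so serving a tiny file costs a dictionary lookup plus one seek and read, with no per-file filesystem lookup.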
The most obvious and time-proven mitigation technique is to use a good hierarchical design for directories (and pathname search strategies), and have more directories with fewer files in each.
For recent FreeBSD versions with dirhash and softupdates, I have seen no problems with a few tens of thousands of files per directory. You probably don't want to go north of 500,000 files or so. E.g. deleting a directory with 2,500,000 files took me three days.
I'm not sure I understand your question correctly, but if you want to seek over lots of files, why not use a partitioned MySQL table laid out on a RAID0 or VFS filesystem?
Edit: as far as I know, lots of files in one folder will degrade the speed of any FS, as it has to maintain bigger lists of files, permissions and names. A database is designed to keep lists of data in memory and to seek through them in a very optimized way.
More details of your situation would be helpful. Are the files existing, or would they be created by your application? If you need a way to store arbitrary data without the structure of a relational database, have you looked at object databases?
Another option, if your objects should or can be accessed via HTTP, is to use a Varnish cache in front of a small web server. Initially objects would be stored on disk, but Varnish would store and serve objects from memory after the first access to a given object.

Resources