I am defining the criteria for an optimal folder structure for cache files on the filesystem.
The aim is to create the best-performing file hierarchy from a filesystem perspective.
The cached files are mainly HTML pages, so they are small, but from what I have found so far it is not the size of the files that stresses a filesystem's index, it is the number of files inside a directory.
Having settled on roughly 200 entries per directory as the ideal, I thought of creating 10 subdirectories and 180 cache files per directory, replicating this pattern in every subdirectory, but I admit this is a fairly arbitrary attempt, not driven by any real methodology for calculating it.
Can anyone suggest which characteristics to evaluate in this decision, or share an authoritative resource that covers filesystem optimization at an academic level, ideally applied to large file trees?
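For reference, here is a minimal sketch (in Python, with made-up names) of the kind of layout I have in mind: hash the cache key and spread files over a fixed fan-out of 10 subdirectories per level.

    import hashlib
    import os

    LEVELS = 2    # depth of the subdirectory tree
    FANOUT = 10   # subdirectories per level, matching the 10-subdirectory idea

    def cache_path(root, key):
        # Hash the cache key (e.g. the page URL) so files spread evenly.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        n = int(digest, 16)
        parts = []
        for _ in range(LEVELS):
            parts.append(str(n % FANOUT))
            n //= FANOUT
        return os.path.join(root, *parts, digest + ".html")

    print(cache_path("cache", "https://example.com/index.html"))
    # -> cache/<digit>/<digit>/<md5 of the URL>.html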
Related
My application needs to keep a large amount of fairly small files (10-100k) that are usually accessed with some 'locality' in the filename's string expression.
E.g. if file_5_5 is accessed, files like file_4_5 or file_5_6 may be accessed too in a short while.
I've seen that web browser file caches are often organized in a tree-like fashion that follows the lexical order of the filename, which is a kind of hash. E.g. sadisadji would reside at s/a/d/i/sadisadji. I guess that is optimized for fast random access to any of these files.
Would such a tree structure be useful for my case too? Or would a flat folder keeping all files in one location do equally well?
A tree structure would be better, because many filesystems have trouble listing a single directory with 100,000 files or more in it.
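As a rough illustration (Python; names are placeholders), the prefix-tree layout from your question just nests each file under one directory per leading character, and neighbouring names such as file_5_5 and file_5_6 end up in the same directory, which also preserves the access locality you describe.

    import os

    DEPTH = 4  # number of single-character directory levels

    def nested_path(root, name):
        # Use the first DEPTH characters of the name as nested directories.
        return os.path.join(root, *list(name[:DEPTH]), name)

    print(nested_path("cache", "sadisadji"))  # cache/s/a/d/i/sadisadji
    print(nested_path("cache", "file_5_5"))   # cache/f/i/l/e/file_5_5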
One approach, taken by the .mbtiles file format, which stores a large number of image files for use with maps, is to store all of the files in an SQLite database, circumventing the problems caused by having thousands of files in a directory. Their reasoning and implementation are described here:
https://www.mapbox.com/developers/mbtiles/
Which type of filesystem is beneficial for storing images for a social-networking website of around 50 thousand users?
What I mean is: how should the directories be created? What should the hierarchy of folders for storing images be (for example, by album or by user)?
I know Facebook uses Haystack now, but before that it used plain NFS. What was the hierarchy on NFS?
There is no "best" way to do this from a filesystems perspective -- NFS, for example, doesn't have any set "hierarchy" other than the directories that you create in the NFS share where you're writing the photos.
Each underlying filesystem type (not NFS, I mean the server-side filesystem that you would use NFS to serve files from) has its own distinct performance characteristics, but probably all of them will have a relatively fast (O(1) or at least O(log(n))) way to look up files in a directory. For that reason, you could basically do any directory structure you want and get "not terrible" performance. Therefore, you should make the decision based on what makes writing and maintaining your application the easiest, especially since you have a relatively small number of users right now.
That said, if I were trying to solve this problem and wanted to use a relatively simple solution, I would probably give each photo a long random number in hex (like b16eabce1f694f9bb754f3d84ba4b73e) or use a checksum of the photo (such as the output from running md5/md5sum on the photo file, like 5983392e6eaaf5fb7d7ec95357cf0480), and then split that into a "directory" prefix and a "filename" suffix, like 5983392e6/eaaf5fb7d7ec95357cf0480.jpg. Choosing how far into the number to create the split will determine how many files you'll end up with in each directory. Then I'd store the number/checksum as a column in the database table you're using to keep track of the photos that have been uploaded.
The tradeoffs between these two approaches are mostly performance-related: creating random numbers is much faster than computing checksums, but checksums let you notice that multiple copies of the same photo have been uploaded and save storage (if that's likely to be common on your website, which I have no idea about :-) ). Cryptographically secure checksums also produce very well-distributed values, so you can be certain that you won't end up with an artificially high number of photos in one particular directory (even if a hacker knows which checksum algorithm you're using).
If you ever find that the exact splitting point you chose can no longer scale because it requires too many files per directory, you can simply add another level of directory nesting, for instance by switching from 5983392e6/eaaf5fb7d7ec95357cf0480.jpg to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg. Also, if your single NFS server can't handle the load by itself anymore, you could use the prefix to distribute the photos across multiple NFS servers instead of simply across multiple directories.
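A rough sketch of the checksum-and-split idea (Python; the file paths are hypothetical and the 9-character prefix simply mirrors the example above, not a recommendation):

    import hashlib
    import os

    PREFIX_LEN = 9  # split point: 5983392e6/eaaf5fb7d7ec95357cf0480.jpg

    def photo_path(root, photo_bytes):
        # Checksum the photo so identical uploads map to the same path.
        digest = hashlib.md5(photo_bytes).hexdigest()
        prefix, suffix = digest[:PREFIX_LEN], digest[PREFIX_LEN:]
        return os.path.join(root, prefix, suffix + ".jpg")

    with open("party.jpg", "rb") as f:   # hypothetical uploaded file
        print(photo_path("/srv/photos", f.read()))

Moving to a deeper nesting later only changes how the digest is sliced; the digest you store in the database stays the same.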
I did a bit of googling on this as I was sure this question must have been answered, but I found nothing concise. I realize that it depends very much on the type of filesystem used. But are there any general statements one can make?
Is it generally faster to have, say, 10,000 files in a single folder, or 100 folders containing 100 files each?
It really depends on context, and what you're trying to do with those files. I usually keep my Windows folders below 4k files (4096), because Explorer tends to bog down when displaying them.
However, in *nix-based OSes, I've had 10k+ files in a folder with no discernible performance loss, given that I knew which files I was looking for.
Obviously, if you're going to do any iterating through a folder, which is an O(n) operation, it will take longer the more files you have.
It's faster for the operating system to reach a file when you have 100 folders with 100 files each; I saw a big improvement when I split up one directory that had more than 20,000 files in it.
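If you ever need to break up an existing flat directory, a hedged sketch of one way to do it (Python; directory names are hypothetical) is to bucket files by the first couple of characters of their names:

    import os
    import shutil

    SRC = "flat_dir"    # the overgrown directory
    DST = "split_dir"   # destination root for the buckets

    for name in os.listdir(SRC):
        bucket = name[:2].lower() or "_"            # first two characters
        target_dir = os.path.join(DST, bucket)
        os.makedirs(target_dir, exist_ok=True)      # create the bucket on demand
        shutil.move(os.path.join(SRC, name), os.path.join(target_dir, name))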
I'm starting to develop a website that will have 500,000+ images. Besides needing a lot of disk space, do you think the performance of the site will be affected if I use a basic folder schema, for example putting all the images under /images?
Will it be too slow if a user requests /images/img13452.jpg?
If performance decreases in proportion to the number of images in the same folder, which schema/architecture would you recommend?
Thanks in advance,
Juan
It depends on the file system, and on many other things. One common approach, though, is to hash the filenames and then create subdirectories from the hash; this limits the number of files per directory and therefore improves performance (again, depending on the FS).
For example:
ab\
    ab.png
    lm\
        ablm.png
cd\
    cd.png
    xo\
        xo.png
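A small sketch of that approach (Python; using MD5 of the original filename and two 2-character levels is just one possible choice):

    import hashlib
    import os

    def hashed_path(root, filename):
        # Hash the filename and use the first two pairs of hex characters as
        # directories, so each level holds at most 256 entries.
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return os.path.join(root, digest[:2], digest[2:4], digest + ".png")

    print(hashed_path("images", "party_photo.png"))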
You may also want to search SO for more on that topic:
https://stackoverflow.com/search?q=filesystem+performance
That's going to depend on the OS and filesystem but generally speaking you won't see a performance hit if you reference the file by the full path. Where you might have problems is with maintenance scripts and such having to read a giant directory listing. I always find it better in the long run to organize a large amount of files into some kind of logical directory structure.
I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. Another example: for storing files, let's say our file is named IMG_PARTY.JPG; a common way is to put it in a directory path like:
files/i/m/IMG_PARTY.JPG
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems doing linear lookups find files faster when there are fewer of them in a directory; such a structure spreads the files thinly.
To avoid upsetting *nix utilities like rm, which take a finite number of arguments; deleting a large number of files at once tends to be hacky (having to pass them through find, etc.).
What's the real reason? What is a "good" cache directory structure and why?
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
Just use dates. Since you will remove by date. :)
If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear structures.
So even if the FS is capable of coping with incredibly large directory sizes, there are good reasons not to have large flat structures (they're also a pig to back up).
I've benchmarked GFS2 (clustered) with 32,000 files either in a single directory or arranged in a tree structure: recursive listings of the tree were around 300 times faster than getting a listing when the files were all in a flat structure (which could take up to 10 minutes).
EXT4 showed similar ratios, but as the flat listing only took a couple of seconds, most people wouldn't notice.
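For anyone who wants to run a similar comparison on their own filesystem, here is a rough sketch (Python; the file counts and paths are arbitrary, and the absolute numbers will vary wildly with filesystem type and cache state):

    import os
    import time

    N = 32000
    FLAT, TREE = "bench_flat", "bench_tree"

    # Create N empty files in one flat directory, and the same number
    # spread across 100 subdirectories.
    os.makedirs(FLAT, exist_ok=True)
    for i in range(N):
        open(os.path.join(FLAT, "f%05d" % i), "w").close()

    for i in range(N):
        sub = os.path.join(TREE, "%02d" % (i % 100))
        os.makedirs(sub, exist_ok=True)
        open(os.path.join(sub, "f%05d" % i), "w").close()

    start = time.time()
    flat_count = len(os.listdir(FLAT))
    print("flat listing:      %d files in %.3fs" % (flat_count, time.time() - start))

    start = time.time()
    tree_count = sum(len(files) for _, _, files in os.walk(TREE))
    print("recursive listing: %d files in %.3fs" % (tree_count, time.time() - start))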