My application needs to keep a large number of fairly small files (10-100k) that are usually accessed with some 'locality' in the filename's string representation.
Eg. if file_5_5 is accessed, files like file_4_5 or file_5_6 may be accessed too in a short while.
I've seen that web browser file caches are often sorted in a tree-like fashion resembling the lexical order of the filename, which is a kind of hash. For example, sadisadji would reside at s/a/d/i/sadisadji. I guess that is optimized for fast random access to any of these files.
Would such a tree structure be useful for my case too? Or would a flat folder keeping all files in one location do equally well?
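To make the layout concrete, here is a minimal sketch of what I have in mind (the depth of four single-character levels is just my assumption, mirroring the browser-cache example):

    import os

    def prefix_path(root, name, depth=4):
        # Nest the file under one single-character directory per leading
        # character, e.g. 'sadisadji' -> root/s/a/d/i/sadisadji
        return os.path.join(root, *name[:depth], name)

    path = prefix_path("cache", "sadisadji")   # cache/s/a/d/i/sadisadji
    os.makedirs(os.path.dirname(path), exist_ok=True)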
A tree structure would be better, because many filesystems have trouble listing a single directory with 100,000 or more files in it.
One approach, taken by the .mbtiles file format (which stores a large number of image files for use with maps), is to store all of the files in an SQLite database, circumventing the problems caused by having thousands of files in a directory. Their reasoning and implementation are described here:
https://www.mapbox.com/developers/mbtiles/
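As a rough illustration of the same idea (this is not the actual MBTiles schema, just a minimal sketch of keeping small files as blobs in a single SQLite file):

    import sqlite3

    conn = sqlite3.connect("cache.db")
    conn.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)")

    def put(name, data):
        # One row per file; the single .db file replaces thousands of entries on disk.
        conn.execute("INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)", (name, data))
        conn.commit()

    def get(name):
        row = conn.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
        return row[0] if row else None

    put("file_5_5", b"...contents...")
    data = get("file_5_5")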
Related
I've heard that too many images in a single folder can cause performance issues, but do lots of directories create a performance issue too? I'm running a website that creates a folder per uploaded image. Down the road I expect between 1 million and a few million photos to be uploaded, which means 1-3 million folders. In each folder, 6 images are stored at various sizes.
If this is problematic, one idea is to have one folder per album, which on average would hold between 30 and 90 actual image files (each photo is stored in 6 sizes, so the photo count is multiplied by 6). It's just an idea; what I really want is to follow best practices for image storage.
So my two options for storage are:
site/images/folder-id/id-size-file-name.jpg (single folder per album)
site/images/folder-id/photo-id/size-file-name.jpg (single folder per image)
Any insights on folder performance would be appreciated.
The performance of filesystems tends to degrade with the number of entries in a directory, be they files, directories, symbolic links, or other kinds of entries. This is inherent in most methods of storing the entries; the filesystem has to search through them somehow, though it's possible the search algorithm used runs in O(log n) time.
The usual way of dealing with this (used by MediaWiki, at least) is to have some sort of uniformly distributed identifier (often a cryptographic hash) and store images in a structure based on prefixes of the hashes. For example, if an image had a hash of 0123456789abcdef, one might store the image in 01/0123/image.jpg. You can, of course, tweak it so there are more or fewer than 256 entries at each level, add more levels, or make other tweaks.
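A minimal sketch of that kind of scheme, following the 01/0123 example above (hashing the filename with md5 and the prefix lengths of 2 and 4 are just one possible choice):

    import hashlib
    import os

    def hashed_path(root, filename):
        # Directories come from prefixes of a hash of the name,
        # e.g. hash 0123456789abcdef... -> root/01/0123/filename
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return os.path.join(root, digest[:2], digest[:4], filename)

    path = hashed_path("images", "image.jpg")
    os.makedirs(os.path.dirname(path), exist_ok=True)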
I am defining the criteria to create an optimal folder structure for cache files on the filesystem.
The aim is to create the best-performing file hierarchy from a filesystem perspective.
The cached files are mainly HTML pages, so they are small files, but so far my research suggests that it is not the size of the files that stresses a filesystem's index, it is the number of files inside a directory.
Having identified the ideal number as 200 entries per directory, I thought of creating 10 subdirectories and 180 cache files in each directory, replicating this pattern in every subdirectory, but I admit this is a purely arbitrary attempt, not driven by any real knowledge of a good methodology for calculating it.
Can anyone suggest which characteristics to evaluate in this decision, or share an authoritative resource that discusses filesystem optimization at an academic level, ideally applied to large file trees?
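For reference, this is how I would estimate the capacity of the pattern I described at a given depth (purely illustrative, using the 10 subdirectories and 180 files mentioned above):

    def capacity(depth, subdirs=10, files_per_dir=180):
        # Total files held if every directory, down to `depth` levels of
        # subdirectories, contains `files_per_dir` files and `subdirs` children.
        dirs = sum(subdirs ** level for level in range(depth + 1))
        return dirs * files_per_dir

    for depth in range(4):
        print(depth, capacity(depth))   # 180, 1980, 19980, 199980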
Which type of filesystem is beneficial for storing images for a social-networking website of around 50 thousand users?
What I mean is: how should the directories be created? What should the hierarchy of folders for storing images be (such as by album or by user)?
I know Facebook uses Haystack now, but before that it used plain NFS. What was the hierarchy on NFS?
There is no "best" way to do this from a filesystems perspective -- NFS, for example, doesn't have any set "hierarchy" other than the directories that you create in the NFS share where you're writing the photos.
Each underlying filesystem type (not NFS, I mean the server-side filesystem that you would use NFS to serve files from) has its own distinct performance characteristics, but probably all of them will have a relatively fast (O(1) or at least O(log(n))) way to look up files in a directory. For that reason, you could basically do any directory structure you want and get "not terrible" performance. Therefore, you should make the decision based on what makes writing and maintaining your application the easiest, especially since you have a relatively small number of users right now.
That said, if I were trying to solve this problem and wanted to use a relatively simple solution, I would probably give each photo a long random number in hex (like b16eabce1f694f9bb754f3d84ba4b73e) or use a checksum of the photo (such as the output from running md5/md5sum on the photo file, like 5983392e6eaaf5fb7d7ec95357cf0480), and then split that into a "directory" prefix and a "filename" suffix, like 5983392e6/eaaf5fb7d7ec95357cf0480.jpg. Choosing how far into the number to create the split will determine how many files you'll end up with in each directory. Then I'd store the number/checksum as a column in the database table you're using to keep track of the photos that have been uploaded.
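A minimal sketch of that naming scheme, assuming md5 (or a random hex string) and a split nine characters in, matching the example above:

    import hashlib
    import os
    import uuid

    def photo_path(root, photo_bytes, split=9, use_checksum=True):
        # Name the photo by a checksum or a random hex string, then split the
        # name into a directory prefix and a filename suffix,
        # e.g. 5983392e6/eaaf5fb7d7ec95357cf0480.jpg
        name = hashlib.md5(photo_bytes).hexdigest() if use_checksum else uuid.uuid4().hex
        return name, os.path.join(root, name[:split], name[split:] + ".jpg")

    name, path = photo_path("photos", b"...jpeg bytes...")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # store `name` in the photos table alongside the rest of the metadata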
The tradeoffs between these two approaches are mostly performance-related: creating random numbers is much faster than computing checksums, but checksums allow you to notice that multiple copies of the same photo have been uploaded and save storage (if that's likely to be common on your website, which I have no idea about :-) ). Cryptographically secure checksums also create very well-distributed values, so you can be certain that you won't end up with an artificially high number of photos in one particular directory (even if a hacker knows what checksum algorithm you're using).
If you ever find that the exact splitting point you chose can no longer scale because it requires too many files per directory, you can simply add another level of directory nesting, for instance by switching from 5983392e6/eaaf5fb7d7ec95357cf0480.jpg to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg. Also, if your single NFS server can't handle the load by itself anymore, you could use the prefix to distribute the photos across multiple NFS servers instead of simply across multiple directories.
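And a hedged sketch of that last idea, reusing the same prefix to pick a server (the hostnames here are made up):

    # Illustrative only: the directory split and the target server are both
    # derived from the same hex name, so placement stays deterministic.
    SERVERS = ["nfs1.example.com", "nfs2.example.com", "nfs3.example.com"]

    def locate(name, split=9):
        prefix = name[:split]
        server = SERVERS[int(prefix, 16) % len(SERVERS)]
        return server, prefix + "/" + name[split:] + ".jpg"

    print(locate("5983392e6eaaf5fb7d7ec95357cf0480"))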
I'm starting to develop a website that will have 500,000+ images. Besides needing a lot of disk space, do you think that the performance of the site will be affected if I use a basic folder scheme, for example all the images under /images?
Will it be too slow if a user requests /images/img13452.jpg?
If the performance decreases in proportion to the number of images in the same folder, which scheme/architecture do you recommend?
Thanks in advance,
Juan
Depends on the file system, and on many other things. One common approach, though, is to hash the filenames and then create subdirectories; this limits the number of files per directory and therefore will improve performance (again, depending on the FS).
Example given:
ab\
    ab.png
lm\
    ablm.png
cd\
    cd.png
xo\
    xo.png
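If you already have a flat folder, a one-off migration along these lines would do it (a sketch only; taking a two-character bucket from an md5 of the filename is just one reasonable choice):

    import hashlib
    import os
    import shutil

    def bucket(filename):
        # Two-character bucket derived from a hash of the filename.
        return hashlib.md5(filename.encode("utf-8")).hexdigest()[:2]

    def move_into_buckets(flat_dir):
        # One-off migration of an existing flat image folder.
        for name in os.listdir(flat_dir):
            src = os.path.join(flat_dir, name)
            if not os.path.isfile(src):
                continue
            dest_dir = os.path.join(flat_dir, bucket(name))
            os.makedirs(dest_dir, exist_ok=True)
            shutil.move(src, os.path.join(dest_dir, name))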
You may also want to search SO for more on that topic:
https://stackoverflow.com/search?q=filesystem+performance
That's going to depend on the OS and filesystem, but generally speaking you won't see a performance hit if you reference the file by its full path. Where you might have problems is with maintenance scripts and the like having to read a giant directory listing. I always find it better in the long run to organize a large number of files into some kind of logical directory structure.
I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. Another example: say our file is named IMG_PARTY.JPG; a common way of storing it is at:
files/i/m/IMG_PARTY.JPG
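i.e. something along these lines, just to make the scheme concrete (my own illustration, not taken from any particular cache implementation):

    import os

    def letter_path(root, filename):
        # One directory level per leading character of the lower-cased name,
        # e.g. IMG_PARTY.JPG -> files/i/m/IMG_PARTY.JPG
        name = filename.lower()
        return os.path.join(root, name[0], name[1], filename)

    print(letter_path("files", "IMG_PARTY.JPG"))   # files/i/m/IMG_PARTY.JPG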
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems that do linear lookups find files faster when there are fewer of them in a directory, and such a structure spreads the files thinly.
It avoids upsetting *nix utilities like rm, which take a finite number of arguments; deleting a large number of files at once tends to be hacky (you end up having to pass things through find, etc.).
What's the real reason? What is a "good" cache directory structure and why?
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
Just use dates, since you will remove by date. :)
If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear structures.
So even if the FS is capable of coping with incredibly large directories, there are good reasons not to have large flat structures (they're also a pig to back up).
I've benchmarked GFS2 (clustered) with 32,000 files either in a single directory or arranged in a tree structure; recursive listings of the tree were around 300 times faster than getting a listing when the files were all in a flat structure (which could take up to 10 minutes).
EXT4 showed similar ratios but as the end point was only a couple of seconds most people wouldn't notice.