I've heard that too many images in a single folder can cause performance issues, but do lots of directories create a performance problem as well? I'm running a website that creates a folder per uploaded image. Down the road I expect between 1 million and a few million photos to be uploaded, which means 1-3 million folders. Each folder stores 6 images at various sizes.
If this is problematic, one idea is to have one folder per album, which on average would hold between 30 and 90 actual files (each photo is stored in 6 sizes, so the count is multiplied by 6). It's just an idea; what I really want is to follow best practices for image storage.
So my two options for storage are:
site/images/folder-id/id-size-file-name.jpg (single folder per album)
site/images/folder-id/photo-id/size-file-name.jpg (single folder per image)
Any insights on folder performance would be appreciated.
The performance of filesystems tends to degrade with the number of entries in a directory, whether those entries are files, directories, symbolic links, or anything else. This is inherent in most ways of storing the entries; the filesystem has to search through them somehow, though the search algorithm used may well run in O(log n) time.
The usual way of dealing with this (used by MediaWiki, at least) is to have some sort of uniformly distributed identifier (often a cryptographic hash) and store images in a structure based on prefixes of the hashes. For example, if an image had a hash of 0123456789abcdef, one might store the image in 01/0123/image.jpg. You can, of course, tweak it so there are more or fewer than 256 entries at each level, or add more levels.
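As a rough illustration, here is a minimal Python sketch of that prefix scheme; the choice of SHA-1, the two prefix lengths, the "site/images" root, and the placeholder filename are my own assumptions, not part of MediaWiki's actual code:

import hashlib
import os

def hashed_path(image_bytes, root="site/images", ext=".jpg"):
    # Hash the image contents and use prefixes of the hex digest as
    # directory names, e.g. 01/0123/0123456789abcdef....jpg
    digest = hashlib.sha1(image_bytes).hexdigest()
    return os.path.join(root, digest[:2], digest[:4], digest + ext)

# Hypothetical usage: store an uploaded image under its hash-based path.
with open("upload.jpg", "rb") as f:   # "upload.jpg" is just a placeholder name
    data = f.read()
path = hashed_path(data)
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "wb") as f:
    f.write(data)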
Related
My application needs to keep a large number of fairly small files (10-100 KB) that are usually accessed with some 'locality' in the filename's string representation.
E.g. if file_5_5 is accessed, files like file_4_5 or file_5_6 may be accessed shortly afterwards.
I've seen that web browser file caches are often organized in a tree-like fashion following the lexical order of the filename, which is a kind of hash. E.g. sadisadji would reside at s/a/d/i/sadisadji. I guess that is optimized for fast random access to any of these files.
Would such a tree structure be useful in my case too? Or would a flat folder keeping all the files in one place do equally well?
A tree structure would be better, because many filesystems have trouble listing a single directory with 100,000 files or more in it.
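If it helps, a small Python sketch of that kind of lexical tree (the "cache" root and the depth of four are assumptions on my part):

import os

def lexical_path(filename, root="cache", depth=4):
    # Use the first `depth` characters of the name as nested directories,
    # e.g. "sadisadji" -> cache/s/a/d/i/sadisadji
    return os.path.join(root, *filename[:depth], filename)

print(lexical_path("sadisadji"))   # cache/s/a/d/i/sadisadji
print(lexical_path("file_5_5"))    # cache/f/i/l/e/file_5_5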
One approach, taken by the .mbtiles file format, which stores a large number of image files for use with maps, is to store all of the files in an SQLite database, circumventing the problems caused by having thousands of files in a directory. The reasoning and implementation are described here:
https://www.mapbox.com/developers/mbtiles/
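For a sense of what that looks like in practice, here is a minimal Python/SQLite sketch of a blob store in that spirit; the "images.db" filename and the table layout are my own, only loosely modelled on the idea and not on the actual mbtiles schema:

import sqlite3

conn = sqlite3.connect("images.db")
conn.execute("CREATE TABLE IF NOT EXISTS images (id TEXT PRIMARY KEY, data BLOB)")

def put_image(image_id, data):
    # Store (or replace) one image blob under a caller-chosen key.
    conn.execute("INSERT OR REPLACE INTO images (id, data) VALUES (?, ?)",
                 (image_id, sqlite3.Binary(data)))
    conn.commit()

def get_image(image_id):
    row = conn.execute("SELECT data FROM images WHERE id = ?",
                       (image_id,)).fetchone()
    return row[0] if row else None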
Which type of filesystem is best suited for storing images for a social-networking website of around 50 thousand users?
I mean, how should the directories be created? What should the folder hierarchy be for storing images (for example, by album or by user)?
I know Facebook uses Haystack now, but before that it used plain NFS. What was the hierarchy on NFS?
There is no "best" way to do this from a filesystems perspective -- NFS, for example, doesn't have any set "hierarchy" other than the directories that you create in the NFS share where you're writing the photos.
Each underlying filesystem type (not NFS, I mean the server-side filesystem that you would use NFS to serve files from) has its own distinct performance characteristics, but probably all of them will have a relatively fast (O(1) or at least O(log(n))) way to look up files in a directory. For that reason, you could basically do any directory structure you want and get "not terrible" performance. Therefore, you should make the decision based on what makes writing and maintaining your application the easiest, especially since you have a relatively small number of users right now.
That said, if I were trying to solve this problem and wanted to use a relatively simple solution, I would probably give each photo a long random number in hex (like b16eabce1f694f9bb754f3d84ba4b73e) or use a checksum of the photo (such as the output from running md5/md5sum on the photo file, like 5983392e6eaaf5fb7d7ec95357cf0480), and then split that into a "directory" prefix and a "filename" suffix, like 5983392e6/eaaf5fb7d7ec95357cf0480.jpg. Choosing how far into the number to create the split will determine how many files you'll end up with in each directory. Then I'd store the number/checksum as a column in the database table you're using to keep track of the photos that have been uploaded.
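A quick Python sketch of that checksum-and-split idea (the "photos" root, the nine-character prefix, and the .jpg extension are just examples, as above):

import hashlib
import os

def checksum_path(photo_path, root="photos", split=9):
    # MD5 the file in chunks, then split the hex digest into a
    # directory prefix and a filename, e.g. 5983392e6/eaaf...0480.jpg
    h = hashlib.md5()
    with open(photo_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    return os.path.join(root, digest[:split], digest[split:] + ".jpg")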
The tradeoffs between these two approaches are mostly performance-related: creating random numbers is much faster than computing checksums, but checksums let you notice when multiple copies of the same photo have been uploaded and save storage (if that's likely to be common on your website, which I have no idea about :-) ). Cryptographically secure checksums also produce very well-distributed values, so you can be certain that you won't end up with an artificially high number of photos in one particular directory (even if a hacker knows which checksum algorithm you're using).
If you ever find that the exact splitting point you chose can no longer scale because it requires too many files per directory, you can simply add another level of directory nesting, for instance by switching from 5983392e6/eaaf5fb7d7ec95357cf0480.jpg to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg. Also, if your single NFS server can't handle the load by itself anymore, you could use the prefix to distribute the photos across multiple NFS servers instead of simply across multiple directories.
I'm starting to develop a website that will have 500,000+ images. Besides needing a lot of disk space, do you think the performance of the site will be affected if I use a basic folder scheme, for example all the images under /images?
Will it be too slow if a user requests /images/img13452.jpg?
If performance decreases in proportion to the number of images in the same folder, which scheme/architecture would you recommend?
Thanks in advance,
Juan
It depends on the file system and on many other things. One common approach, though, is to hash the filenames and then create subdirectories; this limits the number of files per directory and therefore improves performance (again, depending on the FS).
Example given:
ab\
    ab.png
    lm\
        ablm.png
cd\
    cd.png
xo\
    xo.png
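A small Python sketch that would generate nested paths in this spirit (the "images" root, the two-character slices, and the nesting depth are my assumptions):

import os

def nested_path(name, root="images", chars=2, depth=2):
    # Nest a file under directories taken from successive two-character
    # slices of its (hashed) base name, e.g. "ablm.png" -> images/ab/lm/ablm.png
    stem = os.path.splitext(name)[0]
    slices = [stem[i:i + chars]
              for i in range(0, min(len(stem), chars * depth), chars)]
    return os.path.join(root, *slices, name)

print(nested_path("ab.png"))     # images/ab/ab.png
print(nested_path("ablm.png"))   # images/ab/lm/ablm.png
print(nested_path("xo.png"))     # images/xo/xo.png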
You may also want to search SO for more on that topic:
https://stackoverflow.com/search?q=filesystem+performance
That's going to depend on the OS and filesystem, but generally speaking you won't see a performance hit if you reference a file by its full path. Where you might have problems is with maintenance scripts and the like having to read a giant directory listing. I always find it better in the long run to organize a large number of files into some kind of logical directory structure.
For an open source project of mine, I am writing an abstraction layer on top of the filesystem.
This layer allows me to attach metadata and relationships to each file.
I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed / moved or copied.
To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.
So I have been thinking of an algorithm that, though not 100% correct, will be right the vast majority of the time and is cheap.
One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.
Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.
I need this algorithm to work for very large files (1 GB+) as well as tiny files (5 KB).
EDIT
I need this algorithm to work on NTFS and on all SMB shares (Linux- or Windows-based), and I would like it to support situations where a file is copied from one spot to another (two physical copies exist but are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed), so I may end up with an identity provider per filetype.
EDIT 2
Related question: Algorithm for determining a file’s identity (Optimisation)
Bucketing with multiple layers of comparison should be fastest and most scalable across the range of files you're discussing.
First level of indexing is just the length of the file.
The second level is a hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling:
To avoid hitting regularly spaced headers, which may be highly similar or identical across files, step through the file in a non-conforming stride, e.g. multiples of a prime or successive primes.
Avoid strides that might keep landing on regular record headers; if you are getting the same values from your sample bytes despite different locations, try adjusting the stride by another prime.
Cope with anomalous files that contain large stretches of identical values, either because they are unencoded images or are just filled with nulls.
Hash the first 128 KB, another 128 KB at the 1 MB mark, another at the 10 MB mark, another at the 100 MB mark, another at the 1000 MB mark, and so on. As the file sizes get larger, and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128 KB is covered completely.
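A Python sketch of that scheme; mixing the file size into the hash and using SHA-1 are my own additions:

import hashlib
import os

CHUNK = 128 * 1024
OFFSETS = [0, 1 << 20, 10 << 20, 100 << 20, 1000 << 20]  # 0, 1MB, 10MB, 100MB, 1000MB

def sampled_hash(path):
    # Hash the file size plus a 128 KB chunk at each offset that exists.
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        for offset in OFFSETS:
            if offset >= size:
                break
            f.seek(offset)
            h.update(f.read(CHUNK))
    return h.hexdigest()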
Believe it or not, I use the ticks of the file's last-write time. It is as cheap as it gets, and I have yet to see a clash between different files.
If you can drop the Linux-share requirement and confine yourself to NTFS, then NTFS Alternate Data Streams are a perfect solution that:
doesn't require any kind of hashing;
survives renames; and
survives moves (even between different NTFS volumes).
You can read more about them here. Basically you just append a colon and a name for your stream (e.g. ":meta") to the path and write whatever you like to it. So if you have a directory "D:\Movies\Terminator", write your metadata using normal file I/O to "D:\Movies\Terminator:meta". You can do the same if you want to save metadata for a specific file (as opposed to a whole folder).
If you'd prefer to store your metadata somewhere else and just be able to detect moves/renames on the same NTFS volume, you can use the GetFileInformationByHandle API call (see MSDN /en-us/library/aa364952(VS.85).aspx) to get the unique ID of the file or folder (combine the VolumeSerialNumber and FileIndex members). This ID will not change if the file/folder is moved or renamed on the same volume.
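A rough Python sketch of both ideas, assuming a Windows build of Python on NTFS, where an alternate data stream can be opened by appending ":name" to the path and os.stat exposes the volume-stable file ID as st_ino; the ":meta" stream name is just an example:

import os

def write_meta(path, text):
    # Write metadata into an alternate data stream named "meta".
    with open(path + ":meta", "w", encoding="utf-8") as stream:
        stream.write(text)

def read_meta(path):
    with open(path + ":meta", "r", encoding="utf-8") as stream:
        return stream.read()

def file_id(path):
    # On Windows, st_ino is derived from the same file index that
    # GetFileInformationByHandle returns; it survives renames and
    # moves within a single volume.
    st = os.stat(path)
    return (st.st_dev, st.st_ino)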
How about storing some random integers r_i and looking up the bytes at offsets (r_i mod n), where n is the size of the file? For files with headers, you can skip the header first and then run this process on the remaining bytes.
If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes will notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice the difference with probability (0.99)^100 ≈ 1/e ≈ 37%; increasing the number of bytes you look at makes this probability drop exponentially.
The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).
Some more advice:
Instead of grabbing single bytes, grab larger chunks to justify the cost of seeking.
I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but it still requires reading the entire file. The upside is that it will notice single-byte differences.
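Putting the random-sampling idea and the advice above together, a hedged Python sketch; a fixed RNG seed stands in for storing the random offsets, and the chunk size and sample count are arbitrary choices of mine:

import hashlib
import os
import random
import zlib

def random_sample_hash(path, samples=100, chunk=4096, seed=0):
    # Hash the file size, the first chunk (header), and `samples` chunks
    # read at reproducible pseudo-random offsets.
    size = os.path.getsize(path)
    rng = random.Random(seed)
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        h.update(f.read(chunk))
        for _ in range(samples):
            f.seek(rng.randrange(max(size, 1)))
            h.update(f.read(chunk))
    return h.hexdigest()

def full_crc(path):
    # Cheaper than a cryptographic hash, but still reads the whole file.
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(block, crc)
    return crc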
Well, first you need to look more deeply into how filesystems work. Which filesystems will you be working with? Most filesystems support things like hard links and soft links, and therefore "filename" information is not necessarily stored in the metadata of the file itself.
Actually, this is the whole point of a stackable layered filesystem: you can extend it in various ways, say to support compression or encryption. This is what "vnodes" are all about. You could do this in several ways. Some of it is very dependent on the platform you are targeting. It is much simpler on UNIX/Linux systems that use a VFS concept. You could implement your own layer on top of ext3, for instance, or whatever you have.
After reading your edits, a couple more things. Filesystems already do this, as mentioned before, using things like inodes. Hashing is probably going to be a bad idea, not just because it is expensive but because two or more preimages can share the same image; that is to say, two entirely different files can have the same hash value. I think what you really want to do is exploit the metadata that the filesystem already exposes. This would be simpler on an open source system, of course. :)
Which bytes should I choose for the sample?
I think I would try sampling at offsets given by a progression like the Fibonacci numbers. They are easy to calculate and they have a diminishing density: small files get a higher sample ratio than big files, and the samples still cover spots throughout the whole file.
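For example, a sketch of that sampling; the 64-byte samples and the use of SHA-1 are my own choices:

import hashlib
import os

def fibonacci_offsets(size):
    # Fibonacci numbers below `size`: dense near the start of the file,
    # increasingly sparse towards the end.
    a, b = 1, 2
    while a < size:
        yield a
        a, b = b, a + b

def fib_sample_hash(path, chunk=64):
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        for offset in fibonacci_offsets(size):
            f.seek(offset)
            h.update(f.read(chunk))
    return h.hexdigest()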
This sounds like it could be implemented more effectively at the filesystem level, or with some loose approximation of a version control system (or both?).
To address the original question, you could keep a database of (file size, bytes hashed, hash) for each file and try to minimize the number of bytes hashed for each file size. Whenever you detect a collision, you either have an identical file or you increase the hash length to go just past the first difference.
There are undoubtedly optimizations to be made, and CPU vs. I/O tradeoffs as well, but it's a good start for something that won't have false positives.
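A simplified Python sketch of that bookkeeping, assuming an in-memory index, a 4 KB starting prefix, and ignoring some re-bucketing corner cases:

import hashlib
import os
from collections import defaultdict

# Maps (file size, bytes hashed) -> {hash: path}.
index = defaultdict(dict)

def prefix_hash(path, nbytes):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(nbytes)).hexdigest()

def register(path, start=4096):
    size = os.path.getsize(path)
    nbytes = min(start, size) or 1
    while True:
        digest = prefix_hash(path, nbytes)
        bucket = index[(size, nbytes)]
        other = bucket.get(digest)
        if other is None or other == path:
            bucket[digest] = path
            return (size, nbytes, digest)
        if nbytes >= size:
            # Whole file hashed and still equal: treat as an identical file.
            return (size, nbytes, digest)
        # Collision with a different file: re-bucket both files under a
        # longer prefix and try again.
        del bucket[digest]
        index[(size, nbytes * 2)][prefix_hash(other, nbytes * 2)] = other
        nbytes *= 2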
I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. As another example of storing files, let's say our file is named IMG_PARTY.JPG; a common way is to put it in a directory named:
files/i/m/IMG_PARTY.JPG
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems that do linear lookups find files faster when there are fewer of them in a directory; such a structure spreads the files thin.
It avoids tripping up *nix utilities like rm, which take a finite number of arguments; deleting a large number of files at once tends to be hacky (you end up having to pass everything through find, etc.).
What's the real reason? What is a "good" cache directory structure and why?
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
Just use dates. Since you will remove by date. :)
If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear structures.
So even if the FS is capable of coping with incredibly large directories, there are good reasons not to have large flat structures (they're also a pig to back up).
I've benchmarked GFS2 (clustered) with 32,000 files either in a single directory or arranged in a tree structure - recursive listings of the tree were around 300 times faster than getting a listing of the flat structure (which could take up to 10 minutes).
EXT4 showed similar ratios, but since the flat listing there only took a couple of seconds, most people wouldn't notice.