I'm starting to develop a website that will have 500,000+ images. Besides needing a lot of disk space, do you think the site's performance will be affected if I use a basic folder scheme, for example all the images under /images?
Will it be too slow if a user requests /images/img13452.jpg?
If performance decreases in proportion to the number of images in the same folder, which scheme/architecture do you recommend?
Thanks in advance,
Juan
It depends on the file system, and on many other things. One common approach, though, is to hash the filenames and create subdirectories from the hash prefixes; this limits the number of files per directory and can therefore improve performance (again, depending on the FS).
Example:
ab\
    ab.png
    lm\
        ablm.png
cd\
    cd.png
xo\
    xo.png
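A minimal sketch of that hashing idea in Python (the hash choice, the two-character prefixes, and the nesting depth are assumptions for illustration, not something the answer prescribes):

```python
import hashlib
from pathlib import PurePosixPath

def hashed_path(filename: str, levels: int = 2, chars: int = 2) -> PurePosixPath:
    """Derive a nested storage path from a hash of the filename.

    Each level consumes `chars` hex characters of the digest, which
    caps the fan-out at 16**chars directory names per level.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * chars:(i + 1) * chars] for i in range(levels)]
    return PurePosixPath(*parts, filename)

print(hashed_path("img13452.jpg"))
```

Because the path is derived deterministically from the name, the application never has to scan directories to find a file; it recomputes the path on every request.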
You may also want to search SO for more on that topic:
https://stackoverflow.com/search?q=filesystem+performance
That's going to depend on the OS and filesystem, but generally speaking you won't see a performance hit if you reference the file by its full path. Where you might have problems is with maintenance scripts and the like having to read a giant directory listing. I've always found it better in the long run to organize a large number of files into some kind of logical directory structure.
Related
I have a question I'd like to discuss with you. I'm a fresh graduate and just got a job as an IT programmer. My company is making a game; the images or graphics used inside the game live in one folder, but as many separate image files. They gave me the task of working out how we can combine those separate image files into one file that the program can still access. If you have any ideas, please share them. Thanks.
I'm not really sure what the advantage of this approach is for a game that runs on the desktop, but if you've already carefully considered that and decided that having a single file is important, then it's certainly possible to do so.
Since the question, as Oded points out, shows very little research or other effort on your part, I won't provide a complete solution. And even if I wanted to, I'm not sure I could, because you don't tell us what programming language and UI framework you're using; Visual Studio 2010 supports a lot of different ones.
Anyway, the trick involves creating a sprite. This is a fairly common technique for web design, where it actually is helpful to reduce load times by using only a single image, and you can find plenty of explanation and examples by searching the web. For example, here.
Basically, what you do is make one large image that contains all of your smaller images, offset from each other by a certain number of pixels. Then, you load that single large image and access the individual images by specifying the offset coordinates of each image.
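The offset bookkeeping can be sketched in a few lines (pure Python, no image library; the fixed-size grid layout and the 64-pixel tile size are assumptions for illustration):

```python
def sprite_offset(index: int, tile_w: int, tile_h: int, columns: int) -> tuple[int, int]:
    """Return the (x, y) pixel offset of sub-image `index` in a sprite
    sheet laid out as a grid of fixed-size tiles, `columns` per row."""
    col = index % columns
    row = index // columns
    return (col * tile_w, row * tile_h)

# e.g. sub-image 5 in a sheet with 4 tiles of 64x64 per row:
x, y = sprite_offset(5, 64, 64, 4)
print(x, y)  # 64 64
```

With irregularly sized sub-images you would instead store an explicit table of (x, y, width, height) per image, which is what most sprite-packing tools emit.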
I do not, however, recommend doing as Jan recommends and compressing the image directory (into a ZIP file or any other format), because then you'll just have to pay the cost of uncompressing it each time you want to use one of the images. That also buys you extremely little; disk storage is cheap nowadays.
I am defining the criteria to create an optimal folder structure for cache files on the filesystem.
The aim is to create the best-performing file hierarchy from a filesystem perspective.
The cached files are mainly HTML pages, so they are small, but so far my research suggests that what stresses a filesystem's index is not the size of the files but the number of files inside a directory.
Taking 200 files per directory as the ideal, I thought of creating 10 subdirectories and 180 cache files at each level, replicating this pattern in every subdirectory, but I admit this is a purely arbitrary guess, not driven by any real knowledge of a good methodology for calculating it.
Can anyone suggest which characteristics to evaluate in this decision, or point to an authoritative resource that covers filesystem optimization at an academic level, ideally as applied to large file trees?
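For what it's worth, the capacity of the pattern described above (180 files plus 10 subdirectories in every directory) can be worked out directly; a quick sketch:

```python
def capacity(files_per_dir: int = 180, subdirs_per_dir: int = 10, depth: int = 3) -> int:
    """Total cache files a tree can hold when every directory down to
    `depth` levels holds `files_per_dir` files and `subdirs_per_dir`
    subdirectories: the sum over levels of files * subdirs**level."""
    return sum(files_per_dir * subdirs_per_dir ** level for level in range(depth + 1))

print(capacity(depth=0))  # 180
print(capacity(depth=3))  # 199980
```

So three levels of nesting under the root already hold nearly 200,000 files while keeping every directory at 190 entries.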
Which type of file system is best for storing images for a social-networking website with around 50 thousand users?
What I mean is: how should the directories be created? What should the folder hierarchy be for storing images (for example, by album or by user)?
I know Facebook uses Haystack now, but before that it used plain NFS. What was the hierarchy on NFS?
There is no "best" way to do this from a filesystems perspective -- NFS, for example, doesn't have any set "hierarchy" other than the directories that you create in the NFS share where you're writing the photos.
Each underlying filesystem type (not NFS, I mean the server-side filesystem that you would use NFS to serve files from) has its own distinct performance characteristics, but probably all of them will have a relatively fast (O(1) or at least O(log(n))) way to look up files in a directory. For that reason, you could basically do any directory structure you want and get "not terrible" performance. Therefore, you should make the decision based on what makes writing and maintaining your application the easiest, especially since you have a relatively small number of users right now.
That said, if I were trying to solve this problem and wanted to use a relatively simple solution, I would probably give each photo a long random number in hex (like b16eabce1f694f9bb754f3d84ba4b73e) or use a checksum of the photo (such as the output from running md5/md5sum on the photo file, like 5983392e6eaaf5fb7d7ec95357cf0480), and then split that into a "directory" prefix and a "filename" suffix, like 5983392e6/eaaf5fb7d7ec95357cf0480.jpg. Choosing how far into the number to create the split will determine how many files you'll end up with in each directory. Then I'd store the number/checksum as a column in the database table you're using to keep track of the photos that have been uploaded.
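That split can be sketched in a few lines of Python (the 9-character prefix just matches the example above; where you store the result is up to your schema):

```python
import hashlib

def photo_path(photo_bytes: bytes, split: int = 9, ext: str = "jpg") -> str:
    """Checksum a photo and split the hex digest into a directory
    prefix and a filename suffix, as in 5983392e6/eaaf5fb7d7ec95357cf0480.jpg."""
    digest = hashlib.md5(photo_bytes).hexdigest()  # 32 hex characters
    return f"{digest[:split]}/{digest[split:]}.{ext}"

print(photo_path(b"example photo contents"))
```

With a 9-character prefix there are 16**9 possible directories, so in practice you would pick a much shorter prefix; the point is only that the split position controls the files-per-directory ratio.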
The tradeoffs between these two approaches are mostly performance-related: creating random numbers is much faster than doing checksums, but checksums allow you to notice that multiple of the same photo have been uploaded and save storage (if that's likely to be common on your website, which I have no idea about :-) ). Cryptographically secure checksums also create very well-distributed values, so you can be certain that you won't end up with an artificially high number of photos in one particular directory (even if a hacker knows what checksum algorithm you're using).
If you ever find that the exact splitting point you chose can no longer scale because it requires too many files per directory, you can simply add another level of directory nesting, for instance by switching from 5983392e6/eaaf5fb7d7ec95357cf0480.jpg to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg. Also, if your single NFS server can't handle the load by itself anymore, you could use the prefix to distribute the photos across multiple NFS servers instead of simply across multiple directories.
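Both adjustments, deeper nesting and prefix-based server selection, reduce to string slicing and modular arithmetic; a sketch (the server names are hypothetical):

```python
def nested_path(hex_id: str, splits: tuple[int, ...] = (9, 17)) -> str:
    """Turn a hex id into a nested path; splits=(9, 17) maps
    5983392e6eaaf5fb7d7ec95357cf0480 to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg."""
    parts, prev = [], 0
    for s in splits:
        parts.append(hex_id[prev:s])
        prev = s
    parts.append(hex_id[prev:] + ".jpg")
    return "/".join(parts)

def pick_server(hex_id: str, servers: list[str]) -> str:
    """Choose a storage server deterministically from the id's prefix."""
    return servers[int(hex_id[:9], 16) % len(servers)]

hid = "5983392e6eaaf5fb7d7ec95357cf0480"
print(nested_path(hid))  # 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg
print(pick_server(hid, ["nfs1", "nfs2", "nfs3"]))
```

Note that simple modulo sharding reshuffles almost everything when you add a server; if that matters, consistent hashing is the usual refinement.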
I did a bit of googling on this as I was sure this question must have been answered, but I found nothing concise. I realize that it depends very much on the type of filesystem used. But are there any general statements one can make?
Is it generally faster to have, say, 10,000 files in a single folder, or 100 folders containing 100 files each?
It really depends on context, and what you're trying to do with those files. I usually keep my Windows folders below 4k files (4096), because Explorer tends to bog down when displaying them.
However, on *nix-based OSes I've had 10k+ files in a folder with no discernible performance loss, provided I knew which files I was looking for.
Obviously, if you're going to do any iterating through a folder, which is an O(n) operation, it will take longer the more files you have.
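A rough way to see the difference yourself: this sketch creates the same 10,000 files flat and as 100 folders of 100, then times a full listing of each (the absolute numbers are meaningless, only the ratio matters, and on a modern filesystem it may well be small):

```python
import os
import tempfile
import time

def make_flat(root, n):
    for i in range(n):
        open(os.path.join(root, f"f{i}.txt"), "w").close()

def make_nested(root, dirs, per_dir):
    for d in range(dirs):
        sub = os.path.join(root, f"d{d}")
        os.mkdir(sub)
        for i in range(per_dir):
            open(os.path.join(sub, f"f{i}.txt"), "w").close()

def count_all(root):
    # Walk the tree and count every file found.
    return sum(len(files) for _, _, files in os.walk(root))

with tempfile.TemporaryDirectory() as flat, tempfile.TemporaryDirectory() as nested:
    make_flat(flat, 10_000)
    make_nested(nested, 100, 100)
    t0 = time.perf_counter(); n1 = count_all(flat); t1 = time.perf_counter()
    n2 = count_all(nested); t2 = time.perf_counter()
    print(n1, n2)  # 10000 10000
    print(f"flat {t1 - t0:.4f}s  nested {t2 - t1:.4f}s")
```

Caches (both the OS page cache and the filesystem's own structures) distort repeated runs, so treat one-off timings like this as a rough indication only.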
It's faster for the operating system to reach a file when you have 100 folders with 100 files each. I saw a big improvement when I split up one directory that had more than 20,000 files in it.
I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. Another example for storing files, let's say our file is named IMG_PARTY.JPG, a common way is to put it in a directory named:
files/i/m/IMG_PARTY.JPG
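A sketch of that naming convention (the two-letter depth is just what the example shows):

```python
def letter_path(filename: str, depth: int = 2) -> str:
    """Nest a file under directories named after its first letters,
    e.g. IMG_PARTY.JPG -> files/i/m/IMG_PARTY.JPG."""
    letters = [c.lower() for c in filename[:depth]]
    return "/".join(["files", *letters, filename])

print(letter_path("IMG_PARTY.JPG"))  # files/i/m/IMG_PARTY.JPG
```

One caveat worth noticing: real filenames cluster (a huge fraction of camera photos start with IMG_), so first-letter buckets can fill very unevenly, which is one reason hash-prefix schemes distribute better.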
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems that do linear lookups find files faster when there are fewer of them in a directory, and such a structure spreads the files thin.
It avoids tripping up *nix utilities like rm, which take a finite number of arguments; deleting a large number of files at once tends to be hacky (you end up piping through find, etc.).
What's the real reason? What is a "good" cache directory structure and why?
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
Just use dates, since you will remove by date. :)
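If expiry is time-based, a date-derived path makes pruning a single recursive delete per day; a sketch (the exact layout is an assumption):

```python
from datetime import datetime, timezone

def dated_path(filename: str, when: datetime) -> str:
    """Place a cache file under year/month/day directories so that
    expiring a whole day is one recursive directory removal."""
    return when.strftime("cache/%Y/%m/%d/") + filename

print(dated_path("page123.html", datetime(2012, 5, 1, tzinfo=timezone.utc)))
# cache/2012/05/01/page123.html
```

The tradeoff is that lookups now need the date as well as the name, so you either store the full path in your index or derive it from the entry's creation time.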
If you do ls -l, all the files need to be stat()ed to get their details, which adds considerably to the listing time; this happens whether the FS uses hashed or linear structures.
So even if the FS is capable of coping with incredibly large directories, there are good reasons not to have large flat structures (they're also a pig to back up).
I've benchmarked GFS2 (clustered) with 32,000 files either in a single directory or arranged in a tree structure: recursive listings were around 300 times faster than listing the flat structure (which could take up to 10 minutes to produce a directory listing).
EXT4 showed similar ratios, but as the slower case still took only a couple of seconds, most people wouldn't notice.