I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. Another example is storing files: say our file is named IMG_PARTY.JPG; a common way is to put it under a path like:
files/i/m/IMG_PARTY.JPG
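For illustration, here is a minimal Perl sketch of how such a fan-out path might be derived, assuming the first two characters of the lowercased name pick the subdirectories (the helper name is made up):

    use strict;
    use warnings;
    use File::Spec;

    # Hypothetical helper: map a filename onto a two-level fan-out path
    # based on the first two characters of its lowercased name, e.g.
    # IMG_PARTY.JPG -> files/i/m/IMG_PARTY.JPG
    sub fanout_path {
        my ($root, $filename) = @_;
        my $key = lc $filename;
        my ($c1, $c2) = (substr($key, 0, 1), substr($key, 1, 1));
        return File::Spec->catfile($root, $c1, $c2, $filename);
    }

    print fanout_path('files', 'IMG_PARTY.JPG'), "\n";   # files/i/m/IMG_PARTY.JPG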
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems that do linear lookups find files faster when there are fewer of them in a directory, and such a structure spreads the files thinly.
To avoid upsetting *nix utilities like rm, which take a finite number of arguments; deleting a large number of files at once tends to be hacky (you end up having to pass the list through find, etc.).
What's the real reason? What is a "good" cache directory structure and why?
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
Just use dates, since you will remove by date. :)
If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear structures.
So even if the FS is capable of coping with incredibly large directories, there are good reasons not to have large flat structures (they're also a pig to back up).
I've benchmarked GFS2 (clustered) with 32,000 files either in a single directory or arranged in a tree structure: recursive listings of the tree were around 300 times faster than listing the flat structure (which could take up to 10 minutes to produce a directory listing).
EXT4 showed similar ratios, but since even the flat case took only a couple of seconds, most people wouldn't notice.
Related
My application needs to keep a large number of fairly small files (10-100k) that are usually accessed with some 'locality' in the filename's string expression.
E.g. if file_5_5 is accessed, files like file_4_5 or file_5_6 may be accessed too in a short while.
I've seen that web browser file caches are often arranged in a tree-like fashion resembling the lexical order of the filename, which is a kind of hash. E.g. sadisadji would reside at s/a/d/i/sadisadji, for example. I guess that is optimized for fast random access to any of these files.
Would such a tree structure be useful for my case too? Or would a flat folder keeping all files in one location do equally well?
A tree structure would be better, because many filesystems have trouble listing a single directory with 100,000 files or more in it.
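To make the locality point concrete, here is a rough Perl sketch of lexical bucketing by a fixed-length name prefix, so that neighbouring names such as file_5_5 and file_5_6 end up in the same directory (the prefix length of 6 is just an assumption to tune):

    use strict;
    use warnings;

    # Bucket each file under a directory named after a fixed-length
    # prefix of its name; lexically close names share a bucket.
    sub bucket_path {
        my ($root, $name, $prefix_len) = @_;
        $prefix_len = 6 unless defined $prefix_len;
        my $prefix = substr($name, 0, $prefix_len);
        return "$root/$prefix/$name";
    }

    print bucket_path('cache', 'file_5_5'), "\n";   # cache/file_5/file_5_5
    print bucket_path('cache', 'file_5_6'), "\n";   # cache/file_5/file_5_6
    print bucket_path('cache', 'file_4_5'), "\n";   # cache/file_4/file_4_5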
One approach taken by the .mbtiles file format, which stores a large number of image files for use with maps, is to store all of the files in an SQLite database, circumventing the problems caused by having thousands of files in a directory. Their reasoning and implementation are described here:
https://www.mapbox.com/developers/mbtiles/
I have a problem where my current code uses a naive linear search to retrieve data from several data files by matching strings.
It is something like this (pseudo code):
while count < total number of files
    open current file
    extract line from this file
    build an arrayofStrings from this line
    foreach string in arrayofStrings
        foreach file in arrayofDataReferenceFiles
            search in these files
    close file
    increment count
For a large real life job, a process can take about 6 hours to complete.
Basically I have a large set of strings that the program uses to search through the same set of files (for example, 10 files in one run and perhaps 3 in the next). Since the reference data files can change, I do not think it is smart to build a permanent index of these files.
I'm pretty much a beginner and am not aware of any faster techniques for unsorted data.
I was thinking: since the search gets repetitive after a while, is it possible to prebuild an index of the locations of specific lines in the data reference files, without using any external Perl libraries, once the file array gets built (the files are known)? This script is going to be ported onto a server that probably has only standard Perl installed.
I figured it might be worth spending 3-5 minutes building some sort of index for a search before processing the job.
Is there a specific concept of indexing/searching that applies to my situation?
Thanks everyone!
It is difficult to understand exactly what you're trying to achieve.
I assume the data set does not fit in RAM.
If you are trying to match each line in many files against a set of patterns, it may be better to read each line in once, then match it against all the patterns while it's in memory before moving on. This will reduce IO over looping for each pattern.
On the other hand, if the matching is what's taking the time you're probably better off using a library which can simultaneously match lots of patterns.
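As a rough illustration of the "read each line once" idea in core Perl (the file names and search strings below are placeholders), you could combine the search strings into a single alternation and test each line against it while it is in memory:

    use strict;
    use warnings;

    # Read each reference file once, and test every line against all
    # search strings at the same time (literal substring matches here).
    my @search_strings  = ('foo', 'bar', 'baz');      # your strings
    my @reference_files = glob 'reference_*.dat';     # your files

    my $pattern = join '|', map { quotemeta } @search_strings;
    my $re = qr/($pattern)/;

    for my $file (@reference_files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            if ($line =~ $re) {
                print "$file: matched '$1': $line";
            }
        }
        close $fh;
    }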
You could probably replace this:
foreach file in arrayofDataReferenceFiles
    search in these files
with a preprocessing step to build a DBM file (i.e. an on-disk hash) as a reverse index which maps each word in your reference files to a list of the files containing that word (or whatever you need). The Perl core includes DBM support:
dbmopen HASH,DBNAME,MASK
This binds a dbm(3), ndbm(3), sdbm(3), gdbm(3), or Berkeley DB file to a hash.
You'd normally access this stuff through tie, but that's not important; every Perl installation should have some support for at least one hash-on-disk library without needing non-core packages installed.
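Here is a minimal sketch of that preprocessing step using only core modules (SDBM_File ships with Perl). The file names and whitespace tokenisation are assumptions, and note that SDBM imposes a roughly 1 KB limit per key/value pair, so real data may need one of the other DBM backends:

    use strict;
    use warnings;
    use Fcntl;         # for O_RDWR, O_CREAT
    use SDBM_File;     # hash-on-disk backend shipped with core Perl

    my @reference_files = glob 'reference_*.dat';   # your files

    my %index;
    tie %index, 'SDBM_File', 'word_index', O_RDWR | O_CREAT, 0644
        or die "Couldn't tie SDBM file: $!";
    %index = ();       # (re)build the index from scratch

    for my $file (@reference_files) {
        my %words;
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            $words{$_} = 1 for split ' ', $line;
        }
        close $fh;

        # Record this file once against every word it contains.
        for my $word (keys %words) {
            $index{$word} = defined $index{$word} ? "$index{$word} $file" : $file;
        }
    }

    # A lookup is now a single hash access instead of a file scan:
    my $word = 'example';
    print "Files containing '$word': ", ($index{$word} || 'none'), "\n";

    untie %index;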
As MarkR said, you want to read each line from each file no more than one time. The pseudocode you posted looks like you're reading each line of each file multiple times (once for each word that is searched for), which will slow things down considerably, especially on large searches. Reversing the order of the two innermost loops should (judging by the posted pseudocode) fix this.
But, also, you said, "Since the reference data files can change, I do not think it is smart to build a permanent index of these files." This is, most likely, incorrect. If performance is a concern (if you're getting 6-hour runtimes, I'd say that probably makes it a concern) and, on average, each file gets read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very smart thing to do. Disk space is very cheap these days; time that people spend waiting for results is not.
Even if files frequently undergo multiple changes without being read, on-demand indexing (when you want to check a file, first look to see whether an index exists and, if not, build one before doing the search) would be an excellent approach - when a file gets searched more than once, you benefit from the index; when it doesn't, building the index first and then searching off the index will be slower than a linear search by such a small margin as to be largely irrelevant.
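To sketch the on-demand part in Perl (the .idx naming and the empty build_index() body are placeholders for whatever index format you choose):

    use strict;
    use warnings;

    # Rebuild an index only when its data file has changed since the
    # index was written; otherwise reuse it.
    sub index_is_fresh {
        my ($data_file, $index_file) = @_;
        return 0 unless -e $index_file;
        # -M returns file age in days, so a smaller value means newer.
        return (-M $index_file) <= (-M $data_file);
    }

    sub build_index {
        my ($data_file, $index_file) = @_;
        # ... scan $data_file once and write whatever index you need ...
        open my $out, '>', $index_file or die "Can't write $index_file: $!";
        close $out;
    }

    for my $data_file (glob 'reference_*.dat') {
        my $index_file = "$data_file.idx";
        build_index($data_file, $index_file)
            unless index_is_fresh($data_file, $index_file);
        # ... search against $index_file instead of scanning $data_file ...
    }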
Which type of filesystem is good for storing images on a social-networking website with around 50 thousand users?
I mean, how should the directories be created? What should the hierarchy of folders for storing images be (such as by album or by user)?
I know Facebook uses Haystack now, but before that it used simple NFS. What is the hierarchy of NFS?
There is no "best" way to do this from a filesystems perspective -- NFS, for example, doesn't have any set "hierarchy" other than the directories that you create in the NFS share where you're writing the photos.
Each underlying filesystem type (not NFS, I mean the server-side filesystem that you would use NFS to serve files from) has its own distinct performance characteristics, but probably all of them will have a relatively fast (O(1) or at least O(log(n))) way to look up files in a directory. For that reason, you could basically do any directory structure you want and get "not terrible" performance. Therefore, you should make the decision based on what makes writing and maintaining your application the easiest, especially since you have a relatively small number of users right now.
That said, if I were trying to solve this problem and wanted to use a relatively simple solution, I would probably give each photo a long random number in hex (like b16eabce1f694f9bb754f3d84ba4b73e) or use a checksum of the photo (such as the output from running md5/md5sum on the photo file, like 5983392e6eaaf5fb7d7ec95357cf0480), and then split that into a "directory" prefix and a "filename" suffix, like 5983392e6/eaaf5fb7d7ec95357cf0480.jpg. Choosing how far into the number to create the split will determine how many files you'll end up with in each directory. Then I'd store the number/checksum as a column in the database table you're using to keep track of the photos that have been uploaded.
The tradeoffs between these two approaches are mostly performance-related: creating random numbers is much faster than computing checksums, but checksums let you notice that multiple copies of the same photo have been uploaded and save storage (if that's likely to be common on your website, which I have no idea about :-) ). Cryptographically secure checksums also produce very well-distributed values, so you can be certain that you won't end up with an artificially high number of photos in one particular directory (even if a hacker knows what checksum algorithm you're using).
If you ever find that the exact splitting point you chose can no longer scale because it requires too many files per directory, you can simply add another level of directory nesting, for instance by switching from 5983392e6/eaaf5fb7d7ec95357cf0480.jpg to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg. Also, if your single NFS server can't handle the load by itself anymore, you could use the prefix to distribute the photos across multiple NFS servers instead of simply across multiple directories.
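To make the checksum variant concrete, here is a short Perl sketch using the core Digest::MD5 module; the photos/ root, the .jpg suffix, and the 9-character split point (matching the example above) are all assumptions:

    use strict;
    use warnings;
    use Digest::MD5;     # core module
    use File::Spec;

    # Hash the photo's bytes, then split the hex digest into a
    # directory prefix and a filename suffix.
    sub photo_path {
        my ($root, $photo_file) = @_;
        open my $fh, '<:raw', $photo_file or die "Can't open $photo_file: $!";
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        my ($prefix, $suffix) = (substr($digest, 0, 9), substr($digest, 9));
        return File::Spec->catfile($root, $prefix, "$suffix.jpg");
    }

    # e.g. photos/5983392e6/eaaf5fb7d7ec95357cf0480.jpg
    print photo_path('photos', 'party.jpg'), "\n";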
I did a bit of googling on this as I was sure this question must have been answered, but I found nothing concise. I realize that it depends very much on the type of filesystem used. But are there any general statements one can make?
Is it generally faster to have, say, 10,000 files in a single folder, or 100 folders containing 100 files each?
It really depends on context, and what you're trying to do with those files. I usually keep my Windows folders below 4k files (4096), because Explorer tends to bog down when displaying them.
However, on *nix-based OSes, I've had 10k+ files in a folder with no discernible performance loss, given that I knew which files I was looking for.
Obviously, if you're going to do any iterating through a folder, which is an O(n) operation, it will take longer the more files you have.
It's faster for the operating system to reach a file when you have 100 folders with 100 files each; I saw a big improvement when I split up one directory that had more than 20,000 files in it.
I'm starting to develop a website that will have 500,000+ images. Besides needing a lot of disk space, do you think the performance of the site will be affected if I use a basic folder schema, for example all the images under /images?
Will it be too slow if a user requests /images/img13452.jpg?
If the performance decreases in proportion to the number of images in the same folder, which schema/architecture would you recommend?
Thanks in advance,
Juan
It depends on the file system and on many other things. One common approach, though, is to hash the filenames and then create subdirectories; this limits the files per directory and therefore will improve performance (again, depending on the FS).
Example given:
ab\
    ab.png
lm\
    ablm.png
cd\
    cd.png
xo\
    xo.png
You may also want to search SO for more on that topic:
https://stackoverflow.com/search?q=filesystem+performance
That's going to depend on the OS and filesystem, but generally speaking you won't see a performance hit if you reference the file by its full path. Where you might have problems is with maintenance scripts and the like having to read a giant directory listing. I always find it better in the long run to organize a large number of files into some kind of logical directory structure.