SHA-1 hash for storing files - Ruby

After reading this, it sounds like a great idea to store files using their SHA-1 hash for the directory.
I have no idea what this means, however; all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this Ruby script, and I change the file's content (which changes the hash), how do I know where the file is stored then?
My question is then, what are the basics of implementing a SHA-1/file-storage system?
If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?
I'm just thinking about how to create a generic file storing system like Google Docs, Flickr, YouTube, Dropbox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll do file storing 99% of the time from now on", so I can stop thinking about building a solid/consistent way to store files and get on to some real problems.

First of all, if the contents of the files are changing, the filename-from-SHA-digest approach is not very suitable, because the name and location of the file in the filesystem must change whenever the contents of the file change.
Basically, you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.
When you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and filename from it. For example, you split the first few characters of the digest into a directory structure and use the rest of the characters as the file name. For example:
00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
This way you only need to store the SHA-1 digest of the file in the database. You can then always derive the right location and name of the file.
Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also ensures that every directory holds about the same number of files, so you won't end up in a situation where all your files are in the same directory.
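A minimal Ruby sketch of that mapping (the base directory, input file, and .txt extension are just placeholders):

require 'digest'
require 'fileutils'

# Compute the SHA-1 digest of the file's contents.
digest = Digest::SHA1.file("input.txt").hexdigest
# e.g. "00e4f56c0de1c61fdb926e79e8a0a65bd12930c9"

# The first six characters become three directory levels, the rest is the name.
dir  = File.join("some/path", digest[0, 2], digest[2, 2], digest[4, 2])
path = File.join(dir, digest[6..-1] + ".txt")

FileUtils.mkdir_p(dir)
FileUtils.cp("input.txt", path)
# Only the digest needs to go into the database; the path can always be rebuilt from it.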

The idea is not to change the file content, but rather its name (and path), by using a hash value.
Changing the content with a hash would be disastrous since a hash is normally not reversible.
I'm not sure of the motivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash approach:
the file names on the disk are uniform
the upper or lower parts of the hash value can be used to name the directories and hence distribute the files relatively uniformly
the name becomes a code, making it difficult for someone to
a) guess a file name
b) categorize pictures (should someone steal the hard drive contents)
the filename and location can be retrieved from the file contents itself (assuming the hash comes from such content; not quite sure which use case would involve this... a bit contrived...)
The general interest of using a hash is that unlike a file name, a hash is meaningless, and therefore one would require the database to relate images and "bibliographic" type data (name of uploader, date of upload, tags, ...)
In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...
Furthermore... some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful, making the file names longer than they need to be, and hence putting more stress on the file system (bigger directories...)

One advantage I see with storing files using their hash is that the file data only needs to be stored once and can then be referenced multiple times within your database. This will save you space if you have different users uploading the exact same file.
However, the downside is that when a user deletes what they think is their file from your app, you can't just physically delete the file from disk, because other users who uploaded the same exact file may still be using it.
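A common way to handle this is reference counting. Here is a minimal sketch, with a plain Ruby Hash standing in for whatever database table you would actually use to track how many uploads point at each stored blob:

refcounts = Hash.new(0)   # digest => number of uploads referencing the blob

def add_upload(refcounts, digest)
  refcounts[digest] += 1
end

def remove_upload(refcounts, digest, path)
  refcounts[digest] -= 1
  if refcounts[digest] <= 0
    refcounts.delete(digest)
    File.delete(path) if File.exist?(path)  # last reference gone, safe to delete the blob
  end
end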

The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.
So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.
This has the additional benefit of detecting duplicate uploads.

Related

Hash in Laravel

Hi there. I'm preparing a Laravel test, and there's a question that I think is not correct.
When should you use a hash?
The available answers are:
When you want to compress the contents of a file.
When you want to securely store credit card information so you can use it later.
When you want to securely send a password over email.
When you want to identify the contents of a file without storing the entire file.
Since hashing is meant for things like storing passwords (not for sending them over email), none of these answers seems correct to me. What do you think?
Option 4. Identifying contents of a file.
A hash is a function which is supposed to return a fixed-length output for every input. Another property of hash functions is that for any input a they always return the same value b; that means if you provide file a and store its hash b, then whenever you supply file a again you're going to get hash b. The last property is that for different inputs c and d and hash function f, f(c) should differ from f(d) (or the chances of the outputs being equal should be near 0).
In a real-world scenario, you often find hashes when downloading software and want to verify that the file you've downloaded is correct. The developer puts a hash of the executable on their site. You download the file and calculate a checksum (hash) just to compare it with the one from the website. If it matches, then you know it's the same (as long as the hash algorithm is not known to have collisions...).
It is quite a good approach to comparing file contents, because hashes take much less space than the actual files.
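In Ruby, that verification step could look roughly like this (the file name and expected digest are just placeholders):

require 'digest'

expected = "00e4f56c0de1c61fdb926e79e8a0a65bd12930c9"   # digest published on the download page
actual   = Digest::SHA1.file("installer.tar.gz").hexdigest

if actual == expected
  puts "Checksum matches; the download is intact."
else
  puts "Checksum mismatch; the file is corrupt or was tampered with."
end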

Algorithm that can encode a string (with known maximum length) to a fixed-length string?

I'm downloading a lot of files whose URLs are listed in a text file.
When saving a file to disk, I use the MD5 checksum of its URL as the new filename. This is to avoid file name conflicts and invalid characters in the original file name.
But I also need a way to find the original URL from a downloaded file name; if I use MD5, I'll have to keep a mapping that's very large.
Is there any algorithm I can use instead that allow me to just decode the original URL from the file name?
Note that I also don't want the length of the file names to vary too much.
You can use base62, which is filesystem-friendly and can be encoded and decoded. But you can't avoid file name collisions with that alone. If you want to avoid them too, you could append an MD5 of the file to the encoded filename, and strip the MD5 off when decoding.
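Ruby's standard library has no base62 encoder, but URL-safe Base64 gives the same kind of reversible, filesystem-friendly name; a rough sketch of the idea (file names are placeholders):

require 'base64'
require 'digest'

url  = "http://example.com/some/page?id=42"
name = Base64.urlsafe_encode64(url, padding: false)   # reversible, no "/" or "+" in the name

# Optionally append a digest of the downloaded file to tell apart different
# snapshots of the same URL; strip it again before decoding.
name_with_md5 = name + "." + Digest::MD5.file("snapshot.html").hexdigest

original_url = Base64.urlsafe_decode64(name_with_md5.split(".").first)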
If you want a generic solution, look for short-string compression algorithms. Here's a previously answered question about it: An efficient compression algorithm for short text strings.
There's no way to guarantee that you get equal-length strings, because some of them will compress better than others.
Since you are dealing only with HTML, you can use that to store some data. For example, you can simply put the original URL in front of the leading HTML tag or after the closing HTML tag, or add a special tag or attribute to the file to store this information. Then you can keep the MD5 as the file name, but if you need the URL you open the file and look for it there. This should allow you to store the data without affecting any use of the file and without having to keep a large mapping table.
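A rough sketch of that idea in Ruby (the comment format and file names are made up for illustration):

require 'digest'

url      = "http://example.com/some/page"
filename = Digest::MD5.hexdigest(url) + ".html"

# After saving the downloaded HTML under the MD5 name, append the source URL
# as a trailing comment.
File.open(filename, "a") { |f| f.puts "<!-- source-url: #{url} -->" }

# Later, recover the URL from the file itself, without a mapping table.
recovered = File.read(filename)[/<!-- source-url: (.*?) -->/, 1]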

What is the purpose of weirdly named subdirectories in cache directories?

When looking at the "Storage" directory of Spotify's cache I realized there are a lot of subdirectories, named with 2-digit hexadecimal names. Each of them contains one or more weirdly named files.
I've come across similar directory structures created by other programs in the past, and I have always wondered what the reason for such a naming/storing scheme is.
So why would you do such a thing? What benefits does this concept hold?
I hope you are still interested in the answer.
Usually you cache things like images or content. Say we cache an image from another URL onto our server for performance/stability reasons. How do you name that image? You could name it after the URL, but you cannot include slashes or other special characters in the name, and the length is limited too.
Therefore you compute a so-called hash of that URL and use it as the file name. Hashes are usually written in hexadecimal, so there are no illegal characters, and their length is always the same. If you now need the image for a URL, you compute its hash and check whether you find it in the cache.
The reason you don't store all cached files in one directory is size limitations. You usually group the cached files into subdirectories based on their first characters. See this answer: softwareengineering.stackexchange.com/a/301401.
For example, let's say we want to cache https://example.com/favicon.ico. The MD5 hash of that URL is f54403d0da4a57aa79bdf459897f08bd. You now have three different options:
If the cache is expected to remain small, store it as /cache/f54403d0da4a57aa79bdf459897f08bd.ico (you usually want to preserve file extensions for a number of reasons).
For medium caches you could do /cache/f/f54403d0da4a57aa79bdf459897f08bd.ico, or remove duplicate information by trimming the first character, as it already exists in the directory name: /cache/f/54403d0da4a57aa79bdf459897f08bd.ico
For very large caches you can subdivide even more, like /cache/f/5/4403d0da4a57aa79bdf459897f08bd.ico
These are just a few examples, but the basic principle stays the same.
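A small Ruby sketch of those three layouts, using the example URL from above:

require 'digest'

url    = "https://example.com/favicon.ico"
digest = Digest::MD5.hexdigest(url)   # per the example above: f54403d0da4a57aa79bdf459897f08bd
ext    = File.extname(url)            # ".ico", preserved as discussed

small  = File.join("cache", digest + ext)
medium = File.join("cache", digest[0, 1], digest[1..-1] + ext)
large  = File.join("cache", digest[0, 1], digest[1, 1], digest[2..-1] + ext)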

What is a sensible data-structure for allowing efficient synchronisation between two root paths?

I am working on an application that involves maintaining consistency between two local directories. Specifically, the directories should be identical, with the exception that all files in one of the directories are modified in some particular way (this part is not important to my question).
While running, my application runs two processes that listen for changes occurring under each of the paths, and performs relevant operations to bring them back in sync when necessary.
In terms of my specific question: I'm looking for advice on the trickier situation of when one starts the application. At this point, each process needs to check all files/folders under the path that it is looking after, to see if anything has changed in any way while the application was not running. (Let us assume that the application cannot be notified by the OS of anything that happened while it was shut down, and thus will need to directly check every file/folder.)
Each process will have access to (and maintain) a persistent data-structure of all files/folders under its designated path. I was thinking that the following should be held within the data-structure for each of the files and folders (see the sketch after this list):
File/folder name;
File hash (CRC32);
File/folder last modified date; and
File/folder size.
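For concreteness, a small Ruby sketch of how such a record could be computed (CRC32 via the standard Zlib module; this is only an illustration, not my actual code):

require 'zlib'

# One record per file, mirroring the fields above; folders would keep only
# name and children.
FileRecord = Struct.new(:name, :crc32, :mtime, :size)

def record_for(path)
  FileRecord.new(
    File.basename(path),
    Zlib.crc32(File.binread(path)),
    File.mtime(path),
    File.size(path)
  )
end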
These pieces of information will obviously help to check for any changes to files/folder, but what is the best way to store them?
It seems to me that one sensible way to approach the situation of an application start is for each process to recursively scan through all files/folders under its designated path, and compare the metadata for each file scanned to the metadata stored in its data-structure. Then the processes should also iterate through the data-structures to look for things that have been removed from the paths. Some cases that may be encountered during this process are:
file modified (file name found in data-structure, but hash differs);
file added (no identical filename or hash found in data-structure);
file renamed (file with same hash exists in data-structure, but not with same filename);
folder added (no folder name in data-structure);
folder removed (folder name in data-structure, but not under path);
folder renamed (tricky one).
So, what's the best data-structure to use for this task? In my head I'm thinking of some form of sorted associative array, e.g., a red-black tree, which stores file and folder objects. Each file object contains name, hash and mod-date attributes, while each folder object contains name and children attributes, where children stores another associative array with everything underneath. Given the path to an arbitrary file, e.g., /foo/bar/file.txt, you begin at the root (foo), check for bar, and so on until you get to file.txt's parent object.
Another alternative I can think of is to merely store everything flatly, such that there is one red-black tree where each key is the full path to each file/folder, and the value is the file/folder object. This would probably be quicker for retrieval, but it won't be possible to detect renamed files/folders without iterating through all values anyway, which sounds expensive. In the first approach, identifying a rename may only involve checking a portion of the data-structure rather than all of it.
Sorry the above ideas aren't terribly well thought out. What's the state of the art in this area, and are there any well-trodden approaches to these types of problems?
You're modelling a filesystem, so it's quite natural to use a hierarchical data structure. After all, you don't need to compare the file at dir1\dir2\foo.txt to dir3\bar.txt, right? You didn't mention file moves between directories as something you're tracking.
So, the data structure could be:
interface IFSEntry
{
    string Name { get; set; }
    DateTime CreationDate { get; set; }
    bool Compare(IFSEntry other);
    void UpdateFrom(IFSEntry other);
    bool WasRenamed(Dictionary<string, IFSEntry> possibleOriginals, out string oldName);
    // ...
}

class File : IFSEntry
{
    // ...
}

class Directory : IFSEntry
{
    private Dictionary<string, IFSEntry> children;
    // ...
}
The Directory implementations of UpdateFrom and Compare would recurse down their children.
File renames would be relatively easy to detect by comparing CRCs. You'd miss files that changed in both places and were renamed. You could add a CRC dictionary to the Directory class if the time to run the comparisons proves to be a performance problem.
For directory moves, if the child files also changed, then you've got a fuzzy logic situation. It would be best to have a merge tool that the user would operate for that situation.
If a file changes in both places, you also need a user-facing merge strategy if conflicting changes occur. I'd argue that is always a good idea, just to let the user eyeball that the document didn't lose coherence.
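A rough Ruby sketch of that CRC-based rename detection, assuming old and new are maps from full path to content hash captured before and after the scan:

def detect_renames(old, new)
  removed = old.reject { |path, _| new.key?(path) }   # paths no longer on disk
  added   = new.reject { |path, _| old.key?(path) }   # paths not seen before

  added.filter_map do |new_path, hash|
    old_path, _ = removed.find { |_, h| h == hash }   # same content, different name
    [old_path, new_path] if old_path
  end
end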

How to generate a unique hash for a URL?

Given these two images from Twitter:
http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg
I want to download them to the local filesystem and store them in a single directory.
How shall I overcome name conflicts?
In the example above, I cannot store them as lowres_profilepic.jpg.
My design idea is to treat the URLs as opaque strings, except for the last segment.
What algorithms (implemented as f) can I use to hash the prefixes into unique strings?
f( "http://a3.twimg.com/profile_images/130500759/" ) = 6tgjsdjfjdhgf
f( "http://a1.twimg.com/profile_images/58079916/" ) = iuhd87ysdfhdk
That way, I can save the files as:-
6tgjsdjfjdhgf_lowres_profilepic.jpg
iuhd87ysdfhdk_lowres_profilepic.jpg
I don't want a cryptographic algorithm, as this needs to be a performant operation.
Irrespective of how you do it (hashing, encoding, database lookup), I recommend that you don't try to map a huge number of URLs to files in one big flat directory.
The reason is that file lookup on most file systems involves a linear scan through the filenames in a directory. So if all N of your files are in one directory, a lookup will involve N/2 comparisons on average, i.e. O(N). (Note that ReiserFS organizes the names in a directory as a B-tree. However, ReiserFS seems to be the exception rather than the rule.)
Instead of one big flat directory, it would be better to map the URIs to a tree of directories. Depending on the shape of the tree, lookup can be as good as O(log N). For example, if you organized the tree so that it had 3 levels of directory with at most 100 entries in each directory, you could accommodate 1 million URLs. If you designed the mapping to use 2-character filenames, each directory should easily fit into a single disk block, and a pathname lookup (assuming that the required directories are already cached) should take a few microseconds.
It seems what you really want is to have a legal filename that won't collide with others.
Any encoding of the URL will work, even base64: e.g. filename = base64(url)
A crypto hash will give you what you want - although you claim this will be a performance bottleneck, don't be sure of that until you've benchmarked it.
A very simple approach:
f( "http://a3.twimg.com/profile_images/130500759/" ) = a3_130500759.jpg
f( "http://a1.twimg.com/profile_images/58079916/" ) = a1_58079916.jpg
As the other parts of this URL are constant, you can use the subdomain and the last part of the path as a unique filename.
I don't know what the problem with this solution could be.
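A quick Ruby sketch of pulling out those two parts, assuming the URL shape shown above:

require 'uri'

url  = "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"
uri  = URI(url)
sub  = uri.host.split(".").first                 # "a3"
id   = uri.path.split("/")[-2]                   # "130500759"
name = "#{sub}_#{id}#{File.extname(uri.path)}"   # "a3_130500759.jpg"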
The nature of a hash is that it may result in collisions. How about one of these alternatives:
Use a directory tree. Literally create subdirectories for each component of the URL.
Generate a unique id. The problem here is how to keep the mapping between the real name and the saved id. You could use a database that maps a URL to a generated unique id: simply insert a record into a table that generates unique ids, and then use that id as the filename.
One of the key concepts of a URL is that it is unique. Why not use it?
Every algorithm that shortens the info can produce collisions. Maybe unlikely, but possible nevertheless.
While CRC32 produces a maximum 2^32 values regardless of your input and so will not avoid conflicts, it is still a viable option for this scenario.
It is fast, so if you generate a filename that conflicts, just add/change a character in your URL and simply re-calc the CRC.
4.3 billion possible checksums mean the likelihood of a filename conflict, when combined with the original filename, is going to be so low as to be unimportant in normal situations.
I've used this approach myself for something similar and was pleased with the performance.
See Fast CRC32 in Software.
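In Ruby, the whole thing is a couple of lines (just a sketch, using Zlib's CRC32 from the standard library):

require 'zlib'

url      = "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"
checksum = Zlib.crc32(url).to_s(16)              # at most 8 hex characters
filename = "#{checksum}_#{File.basename(url)}"   # checksum prefix avoids name clashes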
You can use the UUID class in Java to generate a UUID from any bytes; it is unique, and you won't have a problem with file lookup:
String url = "http://www.google.com";
String shortUrl = UUID.nameUUIDFromBytes(url.getBytes()).toString();
I see your question is about which hash algorithm is best for this. You might want to check this: Best hashing algorithm in terms of hash collisions and performance for strings.
The Git content management system is based on SHA-1 because it has a very minimal chance of collision.
If it's good for Git, it will be good for you too.
I'm playing with thumbalizr using a modified version of their caching script, and it has a few good solutions, I think. The code is at github.com/mptre/thumbalizr, but the short version is that it uses MD5 to build the file names, and it takes the first two characters of the filename and uses them to create a folder with exactly that name. This means that it is easy to split up the folders, and fast to find the corresponding folder without a database. Kind of blew my mind with its simplicity.
It generates file names like this
http://pappmaskin.no/opensource/delicious_snapcasa/mptre-thumbalizr/cache/fc/fcc3a328e0f4c1b51bf5e13747614e7a_1280_1024_8_90_250.png
the last part, _1280_1024_8_90_250, matches the different settings that the script uses when talking to the thumbalizr API, but I guess fcc3a328e0f4c1b51bf5e13747614e7a is a straight MD5 of the URL, in this case thumbalizr.com
I tried changing the config to generate images 200px wide, and that image goes in the same folder, but instead of _250.png it is called _200.png
I haven't had time to dig that much into the code, but I'm sure it could be pulled apart from the thumbalizr logic and made more generic.
You said:
I don't want a cryptographic algorithm, as this needs to be a performant operation.
Well, I understand your need for speed, but I think you need to consider the drawbacks of your approach. If you just need to create hashes for URLs, you should stick with an existing one and not write a new algorithm, where you would have to deal with collisions, for instance.
So you could have a Dictionary<string, string> to work as a cache for your URLs. When you get a new address, you first do a lookup in that list and, if you don't find a match, hash it and store it for future usage.
Following this line, you could give MD5 a try:
using System;
using System.Security.Cryptography;
using System.Text;

class Program
{
    public static void Main(string[] args)
    {
        foreach (string url in new string[] {
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg",
            "http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg" })
        {
            Console.WriteLine(HashIt(url));
        }
    }

    // Hash the URL prefix (everything up to the last path segment, resolved
    // via ".") with MD5 and return it Base64-encoded.
    private static string HashIt(string url)
    {
        Uri path = new Uri(new Uri(url), ".");
        MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider();
        byte[] data = md5.ComputeHash(
            Encoding.ASCII.GetBytes(path.OriginalString));
        return Convert.ToBase64String(data);
    }
}
You'll get:
rEoztCAXVyy0AP/6H7w3TQ==
0idVyXLs6sCP/XLBXwtCXA==
It appears that the numerical part of twimg.com URLs is already a unique value for each image. My research indicates that the number is sequential (i.e. the example URL below is for the 433,484,366th profile image ever uploaded, which just happens to be mine). Thus, this number is unique. My solution would be to simply use the numerical part of the filename as the "hash value", with no fear of ever finding a non-unique value.
URL: http://a2.twimg.com/profile_images/433484366/terrorbite-industries-256.png
Filename: 433484366.terrorbite-industries-256.png
Unique ID: 433484366
I already use this system for a Python script that displays notifications for new tweets, and as part of its operation it caches profile image thumbnails to reduce unnecessary downloads.
P.S. It makes no difference what subdomain the image is downloaded from, all images are available from all subdomains.
