I am calculating an MD5 sum for a file to compare it with values supplied in a text file. I use the following code to create the checksum:
cksum = File.open(File.join(File.dirname(path), file), 'rb') do |f|
  MD5.hexdigest(f.read)
end
Every once in a while I get one that does not match, but running md5 manually at the system level shows that the file has the correct MD5.
Does anyone see any issue with the process I am using to calculate the MD5 value or have any idea why they sometimes do not match when calculated by this ruby method?
For followers: there's also a method that digests a file directly:
Digest::MD5.file('filename').hexdigest
At this point MD5 is a well-exercised message digest with an extensive suite of test vectors. It is extremely unlikely that there is an issue with Ruby's implementation of it.
There is almost certainly another explanation, for example that the file has not yet been fully written (e.g. by another process) when your checksum code runs. In troubleshooting, it may be helpful to note the length of the result from f.read and verify that against the file size. You could even save the read contents to a separate file for later comparison when you discover a discrepancy. That could offer a clue.
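The size check suggested above could be sketched roughly as follows. The helper name checksum_with_size_check and the shape of its return value are made up for illustration; only Digest::MD5, File.size and File.binread are standard Ruby.

```ruby
require "digest"

# Hypothetical helper: reads the file once and reports the digest together
# with the number of bytes actually read and the size reported by the
# filesystem, so a partially written file shows up as a mismatch.
def checksum_with_size_check(path)
  size_before = File.size(path)
  data = File.binread(path)            # binary read, like mode 'rb'
  {
    md5: Digest::MD5.hexdigest(data),
    bytes_read: data.bytesize,
    size_on_disk: size_before,
    # false if the file grew or shrank while we were reading it
    size_matches: data.bytesize == size_before && File.size(path) == size_before
  }
end
```

If size_matches comes back false for the files whose checksum is wrong, that points toward the file still being written rather than toward the digest code.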
You're correctly opening the file in binary mode, so that is good.
Hi there. I'm preparing for a Laravel test, and there's a question that I think is not correct.
When should you use a hash?
The available answers are:
When you want to compress the contents of a file.
When you want to securely store credit card information so you can use it later.
When you want to securely send a password over email.
When you want to identify the contents of a file without storing the entire file.
Since hashing is for encrypting passwords (not for sending them over email), none of these answers seems to be correct. What do you think?
Option 4. Identifying contents of a file.
A hash is a function that returns a fixed-length output for every input. Another property of hash functions is that they are deterministic: for any input a, the function always returns the same value b. So if you provide file a and store its hash b, then whenever you supply file a again you are going to get hash b. The last property is that for different inputs c and d and a hash function f, f(c) should differ from f(d) (or the chance of the outputs being equal should be near 0).
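As a small illustration of these properties, here is a sketch using Ruby's standard Digest library. SHA-256 is picked only as an example of a hash function; the inputs are arbitrary.

```ruby
require "digest"

short = Digest::SHA256.hexdigest("a")
long  = Digest::SHA256.hexdigest("a" * 10_000)

# Fixed-length output: both digests are 64 hex characters long,
# regardless of how long the input was.
puts short.length
puts long.length

# Same input, same output.
puts short == Digest::SHA256.hexdigest("a")

# Different inputs give different outputs.
puts short == long
```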
In a real-world scenario, you often find hashes when you download software and want to verify that the file you downloaded is correct. The developer puts a hash of the executable on their site. You download the file and calculate its checksum (hash) to compare it with the one from the website. If it matches, then you know it's the same file (as long as the hash algorithm is not known to have collisions...).
It is quite a good approach to comparing file contents, because hashes take much less space than the actual files.
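The download check described above might look like this in Ruby. This is only a sketch: the helper name download_intact? is invented, and SHA-256 is assumed to be the algorithm the developer published; use whichever Digest class matches the published value.

```ruby
require "digest"

# Returns true when the file's SHA-256 digest equals the published one.
# `expected_hex` stands for the value copied from the developer's site.
def download_intact?(path, expected_hex)
  actual = Digest::SHA256.file(path).hexdigest
  actual == expected_hex.strip.downcase
end
```

Only the short hex string has to be published and compared, not the file itself.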
I am looking for a way to prevent duplicates of the same PDF document in one of my applications.
I know this is easy to do by document name, but I don't want to detect duplicate PDFs by name.
Here is the challenge: I want to detect duplicates of an uploaded document based on its contents, not on the name given by the end user.
I have never faced such a scenario before and would like to know whether someone has a way to solve my problem.
Your solutions or tricks will be really helpful.
Thanks in advance; I'm looking forward to a solution.
I think the best way is to generate a checksum from the uploaded file, store it in the database (or some other place), and then check whether the checksum of each newly uploaded file is already present in the database.
In Ruby you can use the Digest module to do that:
require "digest"
data = File.read("some_file_path")
checksum = Digest::MD5.hexdigest(data)
You don't have to check for filename, just use this checksum.
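To make the "check whether the checksum is already present" step concrete, here is a minimal sketch. The UploadRegistry class is hypothetical, and its in-memory Set merely stands in for the database column of stored checksums.

```ruby
require "digest"
require "set"

# Hypothetical in-memory stand-in for the database table of known checksums.
class UploadRegistry
  def initialize
    @seen = Set.new
  end

  # Returns true if the content is new and was recorded, false if a file
  # with identical bytes was registered before (whatever its name).
  def register(path)
    checksum = Digest::MD5.file(path).hexdigest
    @seen.add?(checksum) ? true : false   # Set#add? returns nil for duplicates
  end
end
```

In a real application the lookup would be a query against a uniquely indexed checksum column rather than a Set.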
One simple way is to compare MD5 checksums. Instead of reading or parsing files line by line, generate MD5 digests for them and compare those. Files with the same MD5 value have the same contents (barring deliberately crafted collisions).
How do you generate the MD5 of a file in Ruby?
require 'digest'
Digest::MD5.file("path/to/pdf").hexdigest
# => the MD5 digest as a hex string
Possible Duplicate:
How to read a file from bottom to top in Ruby?
In the course of working on my Ruby program, I had the Eureka Moment that it would be much simpler to write if I were able to parse the text files backwards, rather than forward.
It seems like it would be simple to simply read the text file, line by line, into an array, then write the lines backwards into a text file, parse this temp file forwards (which would now effectively be going backwards) make any necessary changes, re-catalog the resulting lines into an array, and write them backwards a second time, restoring the original direction, before saving the modifications as a new file.
While feasible in theory, I see several problems with it in practice, the biggest of which is that if the size of the text file is very large, a single array will not be able to hold the entirety of the document at once.
Is there a more elegant way to accomplish reading a text file backwards?
If you are not using lots of UTF-8 characters, you can use the Elif library, which works just like File.open. Just load Elif and replace File.open with Elif.open:
require 'elif'
Elif.open('read.txt', "r").each_line { |s|
  puts s
}
This is a great library; the only problem I am experiencing right now is that it has several problems with line endings in UTF-8. I now have to rethink how I iterate over my files.
Additional Details
While googling for a way to read a UTF-8 file in reverse, I found an approach that the File class already supports:
To read a file backward, you can try the following code:
File.readlines('manga_search.test.txt').reverse_each { |s|
  puts s
}
This does a good job as well.
There's no software limit on the size of a Ruby array. There are memory limitations though; see: Array size too big - ruby
Your approach would work much faster if you can read everything into memory, operate there and write it back to disk. Assuming the file fits in memory of course.
Let's say your lines are 80 chars wide on average, and you want to read 100 lines. If you want it efficient (as opposed to implemented with the least amount of code), then go back 80*100 bytes from the end (using seek with the "relative to end" option), then read ONE line (this is likely a partial one, so throw it away). Remember your current position via tell, then read everything up until the end.
You now have either more or fewer than 100 lines in memory. If fewer, go back (100+1.5*no_of_missing_lines)*80 bytes and repeat the above steps, but only read lines until you reach your remembered position from before. Rinse and repeat.
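A simplified sketch of this seek-from-the-end idea follows. It is not the exact procedure above: instead of remembering the previous position and using the 1.5 correction factor, it simply doubles the window and re-reads it when the guess was too small. The method name last_lines and the avg_line_bytes parameter are made up for illustration.

```ruby
# Returns the last `count` lines of the file at `path`.
# `avg_line_bytes` is only an initial guess (80, as in the text above).
def last_lines(path, count, avg_line_bytes: 80)
  File.open(path, "rb") do |f|
    size = f.size
    window = count * avg_line_bytes
    loop do
      start = [size - window, 0].max
      f.seek(start)
      f.gets if start > 0          # most likely a partial line: throw it away
      lines = f.readlines          # everything from here to the end of file
      return lines.last(count) if lines.size >= count || start.zero?
      window *= 2                  # guessed too small: look further back
    end
  end
end
```

For example, last_lines("some.log", 100) would return the final 100 lines (or all lines, if the file has fewer) without reading the whole file when the estimate is reasonable.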
How about just going to the end of the file and iterating backwards over each char until you reach a newline, read the line, and so on? Not elegant, but certainly effective.
Example: https://gist.github.com/1117141
I can't think of an elegant way to do something so unusual as this, but you could probably do it using the file-tail library. It uses random access files in Ruby to read it backwards (and you could even do it yourself, look for random access at this link).
You could go through the file once forward, storing only the byte offset of each \n instead of storing the full string for each line. Then you traverse your offset array backward and use ios.sysseek and ios.sysread to get lines out of the file. Unless your file is truly enormous, that should alleviate the memory issue.
Admittedly, this absolutely fails the elegance test.
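The offset-index approach could be sketched like this. Note one substitution: the sketch uses the buffered seek and read rather than sysseek and sysread, because the forward pass here goes through buffered each_line. The method name each_line_reversed is made up.

```ruby
# Yields each line of the file at `path`, last line first, keeping only
# integer byte offsets in memory rather than the lines themselves.
def each_line_reversed(path)
  File.open(path, "rb") do |f|
    offsets = [0]
    f.each_line { offsets << f.pos }   # offset just past the end of each line
    (offsets.size - 2).downto(0) do |i|
      f.seek(offsets[i])
      yield f.read(offsets[i + 1] - offsets[i])
    end
  end
end
```

Usage would be along the lines of each_line_reversed("read.txt") { |line| puts line }. The memory cost is one integer per line instead of one string per line.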
I need to edit some cfg files for an application, but the application won't start if I do, since the files must match. I don't have the sources of the application.
I guess that if the hash doesn't match the hash of the exe, it exits.
Could you bypass this somehow?
Actually, there is a way:
while (hash of malicious config file does not match original)
{
    make a random, non-functional change to the malicious config file
}
This might take a while.
With certain hash algorithms, you can append data to the end of a file (inside comment tags, say, if it is an XML file). But it's probably more trouble than it's worth. E.g., http://www.schneier.com/blog/archives/2005/06/more_md5_collis.html
If the program uses a good hash, it will be difficult to change without breaking the hash. Some applications use relatively poor hashes. It's relatively easy, for example, to edit a file without affecting a CRC-32 if you can afford to set 32 bits of the file to arbitrary values. Any idea what sort of hash function is used?
You can have the app quit checking, but no, there is no way to duplicate a crypto hash of an existing file. That's the point.
Does a file exist having your desired settings and with the same hash? Possibly.
Will you be able to find it? Almost certainly not.
It's time to break out your disassembler and pull apart the application to get rid of the hash check I'm afraid. No other solution will do what you want in a timely manner.
This kind of validation is intentionally difficult to circumvent. Hashes generally work such that small changes in the input produce widely varied output. The check in this case is doing its duty, unfortunately for your situation.
Although in theory there are other inputs that hash to the same thing, they'll be very different from your input, not just a little different. Finding these inputs will also be as time-consuming and difficult as hacking encrypted data. So basically, no.
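The sensitivity to small changes mentioned above is easy to see for yourself. The sketch below uses SHA-256 purely as an example, since the hash the application actually uses is unknown, and the config lines are invented.

```ruby
require "digest"

original = "timeout=30\n"
edited   = "timeout=31\n"   # a single character changed

a = Digest::SHA256.hexdigest(original)
b = Digest::SHA256.hexdigest(edited)

# Count the hex positions at which the two digests differ.
differing = a.chars.zip(b.chars).count { |x, y| x != y }

puts a
puts b
puts "#{differing} of #{a.length} hex characters differ"
```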
As some other posts have mentioned, if you are adventurous and life and death are at stake, you could disassemble the application binary and actually remove the machine language check for the hash. This is varsity-ninja work though.
After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.
I have no idea what this means, however; all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this Ruby script, and I change the file's content (which changes the hash), how do I know where the file is stored then?
My question is then, what are the basics of implementing a SHA-1/file-storage system?
If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?
I'm just thinking about how to create a generic file storing system like GoogleDocs, Flickr, Youtube, DropBox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll 99% of the time do file storing from now on", so I can stop thinking about building a solid/consistent way to store files and get onto some real problems.
First of all, if the contents of the files are changing, the approach of deriving the filename from a SHA digest is not very suitable, because the name and location of the file in the filesystem must change whenever the contents of the file change.
Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.
When you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and filename from it. For example, you turn the first few characters of the digest into a directory structure and use the rest of the characters as the file name:
00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
This way you only need to store the SHA-1 digest of the file in the database. You can then always work out the right location and name of the file.
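The mapping from digest to path shown above can be sketched as a small helper. The name path_for_digest and its keyword parameters are made up; the defaults reproduce the some/path/00/e4/f5/... example.

```ruby
require "digest"

# Builds a storage path from a hex digest: the first `levels` pairs of
# characters become nested directories, the remainder becomes the file name.
def path_for_digest(digest, root: "some/path", ext: ".txt", levels: 3)
  prefix_len = levels * 2
  dirs = digest[0, prefix_len].scan(/../)          # e.g. ["00", "e4", "f5"]
  File.join(root, *dirs, digest[prefix_len..-1] + ext)
end

puts path_for_digest("00e4f56c0de1c61fdb926e79e8a0a65bd12930c9")
# prints some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
```

For a real file, the digest argument would come from something like Digest::SHA1.file(path).hexdigest.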
Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also ensures that every directory has about the same number of files, so you won't get into a situation where all your files are in the same directory.
The idea is not to change the file content, but rather its name (and path), by using a hash value.
Changing the content with a hash would be disastrous since a hash is normally not reversible.
I'm not sure of the motivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash approach:
the file names on the disk are uniform
the upper or lower parts of the hash value can be used to name the directories and hence distribute the files relatively uniformly
the name becomes a code, making it difficult for someone to
a) guess a file name
b) categorize pictures (should someone steal the hard drive content)
the filename and location can be retrieved from the file contents themselves, assuming the hash comes from such content (not quite sure which use case would involve this... a bit contrived...)
The general interest of using a hash is that unlike a file name, a hash is meaningless, and therefore one would require the database to relate images and "bibliographic" type data (name of uploader, date of upload, tags, ...)
In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...
Furthermore... hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful, since it makes the file names longer than they need to be and hence puts more stress on the file system (bigger directories...)
One advantage I see in storing files under their hash is that the file data only needs to be stored once and can then be referenced multiple times within your database. This will save you space if different users upload the exact same file.
However, the downside is that when a user deletes what they think is their file from your app, you can't just physically delete the file from disk, because other users who uploaded the exact same file may still be using it.
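One common way to handle the deletion problem is to count references per hash. This is an assumption on my part, not something the answers above prescribe, and the DedupStore class below is a hypothetical in-memory sketch in which a Hash stands in for both the disk and the database.

```ruby
require "digest"

# Hypothetical in-memory store: data is keyed by its SHA-1 digest, with a
# count of how many uploads currently reference each digest.
class DedupStore
  def initialize
    @blobs = {}
    @refs = Hash.new(0)
  end

  # Stores the data once per distinct content and returns its key.
  def add(data)
    key = Digest::SHA1.hexdigest(data)
    @blobs[key] ||= data
    @refs[key] += 1
    key
  end

  # Drops one reference; the data is removed only when nobody uses it anymore.
  def release(key)
    return unless @refs.key?(key)
    @refs[key] -= 1
    return if @refs[key] > 0
    @refs.delete(key)
    @blobs.delete(key)
  end

  def stored?(key)
    @blobs.key?(key)
  end
end
```

With this, one user deleting "their" copy only decrements the counter; the bytes disappear when the last reference is released.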
The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.
So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.
This has the additional benefit of detecting duplicate uploads.