Match duplication of the uploaded PDF document? - ruby

I am looking for a way to prevent the same PDF document from being uploaded twice in one of my applications.
I know this is easy to do by comparing document names, but I don't want to detect duplicates by name.
Here is the challenge: I want to detect a duplicate upload based on the document's contents, not on the file name chosen by the end user.
I have never dealt with this scenario before, and I would like to know whether someone has a way to solve it.
Your solutions or tricks will be really helpful.
Thanks in advance.

I think the best way is to generate a checksum from the uploaded file, store it in the database (or some other place), and then check whether the checksum of each newly uploaded file is already present in the database.
In Ruby you can use the Digest module to do that:
require "digest"
# Read in binary mode so the bytes are hashed exactly as stored on disk.
data = File.binread("some_file_path")
checksum = Digest::MD5.hexdigest(data)
You don't have to check the filename; just use this checksum.
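The store-and-compare step described above could be sketched as follows. The `seen_checksums` set stands in for the database table, and `duplicate_upload?` is a hypothetical helper name; neither comes from the original answer.

```ruby
require "digest"
require "set"

# Stand-in for a database column with a unique index (assumption).
seen_checksums = Set.new

# Hypothetical helper: returns true if a file with identical bytes was
# already registered; otherwise records its checksum and returns false.
def duplicate_upload?(path, seen)
  # Digest::MD5.file streams the file instead of loading it into memory.
  checksum = Digest::MD5.file(path).hexdigest
  # Set#add? returns nil when the element is already present.
  seen.add?(checksum).nil?
end
```

Keep in mind that this only catches byte-identical files. Two PDFs that look the same but carry different metadata will produce different checksums.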

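One simple way is to compare MD5 checksums. Instead of reading or parsing files line by line, generate MD5 digests for them and compare. Files with the same MD5 value are, for practical purposes, the same file (MD5 collisions can be crafted deliberately, so prefer a stronger digest such as SHA-256 if uploads may be adversarial).
How to generate MD5 for a file in Ruby?
require 'digest'
Digest::MD5.file("path/to/pdf").hexdigest
# => MD5 hex string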

Related

Ruby miscalculation of MD5 for file

I am calculating an MD5 sum for a file to compare it with values supplied in a text file. I use the following code to create the checksum:
require 'digest'
# Open in binary mode so no newline or encoding conversion alters the bytes.
cksum = File.open(File.join(File.dirname(path), file), 'rb') do |f|
  Digest::MD5.hexdigest(f.read)
end
Every once in a while I get one that does not match, but running md5 manually at the system level shows that the file has the correct MD5.
Does anyone see any issue with the process I am using to calculate the MD5 value, or have any idea why the values sometimes do not match when calculated by this Ruby method?
For followers, there's also a method that hashes a file directly: Digest::MD5.file('filename').hexdigest
At this point MD5 is a well-exercised message digest with an extensive suite of test vectors. It is extremely unlikely that there is an issue with Ruby's implementation of it.
There is almost certainly another explanation, such as the file not yet being fully written (e.g. by another process) when your checksum code runs. In troubleshooting, it may be helpful to note the length of the result from f.read and verify it against the file size. You could even save the read contents to a separate file for later comparison when you discover a discrepancy. That could offer a clue.
You're correctly opening the file in binary mode, so that is good.
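The length check suggested above might look like this. `checksum_with_size_check` is a hypothetical helper name, and the returned hash layout is just one convenient way to log the values side by side.

```ruby
require "digest"

# Hypothetical helper: reads the file in binary mode and reports the
# number of bytes read alongside the size on disk, so a short read
# (for example, a file still being written) becomes visible.
def checksum_with_size_check(path)
  data = File.binread(path)
  {
    md5: Digest::MD5.hexdigest(data),
    bytes_read: data.bytesize,
    size_on_disk: File.size(path)
  }
end
```

If `bytes_read` and `size_on_disk` differ, or if `size_on_disk` changes between two calls, the file was probably still being modified when it was hashed.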

Algorithm that can encode a string (with known maximum length) to a fixed-length string?

I'm downloading a lot of files whose URLs are listed in a text file.
When saving a file to disk, I use the MD5 checksum of its URL as the new filename. This is to avoid file name conflicts and invalid characters in the original file name.
But I also need a way to find the original URL from a downloaded file name. If I use MD5, I'll have to keep a very large mapping.
Is there any algorithm I can use instead that allows me to decode the original URL from the file name?
Note that I also don't want the length of the file names to vary too much.
You can use base62, which is file-system friendly and can be encoded and decoded. But you can't avoid file name collisions. If you want to avoid them too, you could append an MD5 of the file to the encoded filename, and remove the MD5 when decoding.
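A minimal sketch of a reversible base62 encoding, treating the URL as raw bytes. Ruby has no built-in base62, so `base62_encode`, `base62_decode`, and `BASE62_ALPHABET` are hypothetical names written for illustration. Note that the encoded name is roughly a third longer than the URL, so its length still varies with the input.

```ruby
BASE62_ALPHABET = [*"0".."9", *"A".."Z", *"a".."z"].join.freeze

# Interprets the string's bytes as one big integer and writes that
# integer in base 62. Leading NUL bytes are not preserved (not an
# issue for URLs).
def base62_encode(str)
  return "" if str.empty?
  n = str.unpack1("H*").to_i(16)
  out = +""
  while n > 0
    n, r = n.divmod(62)
    out << BASE62_ALPHABET[r]
  end
  out.reverse
end

# Reverses base62_encode: base 62 digits -> integer -> hex -> bytes.
def base62_decode(name)
  return "" if name.empty?
  n = name.each_char.reduce(0) { |acc, c| acc * 62 + BASE62_ALPHABET.index(c) }
  hex = n.to_s(16)
  hex = "0#{hex}" if hex.length.odd? # restore a dropped leading zero nibble
  [hex].pack("H*").force_encoding(Encoding::UTF_8)
end
```

The result contains only digits and ASCII letters, so it is safe as a file name on common file systems, although very long URLs can exceed the file system's name length limit.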
If you want a generic solution, look for short-string compression algorithms. Here's a previously answered question about it: An efficient compression algorithm for short text strings.
There's no way to guarantee that you get equal-length strings, because some of them will compress better than others.
Since you are dealing only with HTML, you can use the files themselves to store some data. For example, you can simply put the original URL in front of the leading HTML tag or after the closing HTML tag. Or add a special tag or attribute to the file to store this information. Then you can keep MD5 as the file name, but if you need the URL you would open the file and look for it there. This should allow you to store the data without affecting any use of the file and without having to store a large mapping table.

codeigniter upload encrypt_name uniqueness

1. Does the encrypt_name option of CodeIgniter's upload library check for uniqueness?
I know that the overwrite option is important. If overwrite is TRUE, it would overwrite, and if it's FALSE, it would rename the file by adding a number at the end of the name.
The question is: Will it regenerate the encrypted name until it finds a unique name, even if overwrite is TRUE? I ask this because it's obvious that when we want an encrypted name, we don't want to overwrite.
The problem with renaming by adding a number is that it makes file name lengths inconsistent. Most files will have a 32-character file name, and some might have a 33-character file name, which breaks the uniformity.
2. Is it possible for it to ever generate a duplicate result at all?
Since CodeIgniter is using md5(uniqid(mt_rand())) to generate the encrypted file name, I'd guess that you'll find your answer in the PHP docs for uniqid.
Short answer (for 2.) would be: maybe, but probably not.
And to answer your first question: no, CI doesn't generate a new encrypted filename, if it already exists. It adds a number to the end of the name.
A short glance at the source code of /libraries/Upload.php, line 415, helps.
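For comparison, the "regenerate until unique" behaviour the question asks about could be sketched like this in Ruby. This is not CodeIgniter's code; `unique_hex_name` is a hypothetical helper, and `SecureRandom.hex` is swapped in for the md5(uniqid(mt_rand())) call.

```ruby
require "securerandom"

# Hypothetical helper: draws 32-hex-character names until one is unused
# in `dir`, so every stored name keeps the same length.
def unique_hex_name(dir, ext)
  loop do
    name = "#{SecureRandom.hex(16)}#{ext}"
    return name unless File.exist?(File.join(dir, name))
  end
end
```

There is still a small gap between the existence check and the moment the file is created. Opening the file with File::CREAT | File::EXCL and retrying on Errno::EEXIST would close it.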

SHA-1 hash for storing Files

After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.
However, I have no idea what this means; all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this Ruby script, and I change the file's content (which changes the hash), how do I know where the file is stored then?
My question is then, what are the basics of implementing a SHA-1/file-storage system?
If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?
I'm just thinking about how to create a generic file storing system like GoogleDocs, Flickr, Youtube, DropBox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll 99% of the time do file storing from now on", so I can stop thinking about building a solid/consistent way to store files and get onto some real problems.
First of all, if the contents of the files are changing, the filename-from-SHA-digest approach is not very suitable, because the name and location of the file in the filesystem must change whenever the contents of the file change.
Basically, you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.
When you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and filename from it. For example, you turn the first few characters of the digest into a directory structure and use the rest of the characters as the file name:
00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
This way you only need to store the SHA-1 digest of the file in the database. You can then always work out the right location and name of the file.
Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also ensures that every directory has about the same number of files, so you won't end up with all your files in the same directory.
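The digest-to-path mapping described above can be sketched as follows. `sharded_path` is a hypothetical helper; the root directory, the three two-character levels, and the .txt extension simply mirror the example path in this answer.

```ruby
require "digest"

# Hypothetical helper: turns a hex digest into a sharded relative path,
# using `levels` two-character directory names taken from the front.
def sharded_path(digest, root: "some/path", levels: 3, ext: ".txt")
  prefix_len = levels * 2
  dirs = digest[0, prefix_len].scan(/../)
  File.join(root, *dirs, digest[prefix_len..-1] + ext)
end

# The digest would normally come from the file itself, e.g.
# Digest::SHA1.file(path).hexdigest
puts sharded_path("00e4f56c0de1c61fdb926e79e8a0a65bd12930c9")
# => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
```

Because the path is a pure function of the digest, only the digest has to be stored. The location can be recomputed whenever the file is needed.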
The idea is not to change the file content, but rather its name (and path), by using a hash value.
Changing the content with a hash would be disastrous since a hash is normally not reversible.
I'm not sure of the motivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash approach:
the file names on the disk are uniform
the upper or lower parts of the hash value can be used to name the directories and hence distribute the files relatively uniformly
the name becomes a code, making it difficult for someone to
a) guess a file name
b) categorize pictures (should someone steal the hard drive content)
the filename and location can be retrieved from the file contents itself (assuming the hash is computed from that content; I'm not quite sure which use case would involve this, it seems a bit contrived)
The general interest of using a hash is that unlike a file name, a hash is meaningless, and therefore one would require the database to relate images and "bibliographic" type data (name of uploader, date of upload, tags, ...)
Thinking about it and re-reading the referenced SO response, I don't really see much of an advantage of a hash compared to, say, a random number...
Furthermore, some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful: it makes the file names longer than they need to be, and hence puts more stress on the file system (bigger directories...).
One advantage I see with storing files using their hash is that the file data only needs to be stored once and can then be referenced multiple times within your database. This will save you space if you have different users uploading the exact same file.
However, the downside is that when a user deletes what they think is their file from your app, you can't just physically delete the file from disk, because other users who uploaded the exact same file may still be using it.
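One common way to handle that downside is to count references per digest and delete the bytes only when the last reference goes away. The sketch below assumes this approach; `BlobRefs` is a hypothetical in-memory stand-in for what would normally be a database table.

```ruby
# Hypothetical stand-in for a table mapping digest => number of user
# records that reference the stored file.
class BlobRefs
  def initialize
    @counts = Hash.new(0)
  end

  # Called when a user uploads a file with this digest.
  # Returns true if this is the first reference, i.e. the bytes still
  # need to be written to disk.
  def add(digest)
    @counts[digest] += 1
    @counts[digest] == 1
  end

  # Called when a user deletes their copy.
  # Returns true only when no references remain and the file on disk
  # can be removed.
  def release(digest)
    return false unless @counts.key?(digest)
    @counts[digest] -= 1
    return false if @counts[digest] > 0
    @counts.delete(digest)
    true
  end
end
```

In a real application the increment and decrement would need to run inside a database transaction so that concurrent uploads and deletes cannot miscount.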
The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.
So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.
This has the additional benefit of detecting duplicate uploads.

Image Upload Service - Image locations and Image Identifiers

I would like to create an image uploading service (yes, I am aware of imageshack, photobucket, flickr, etc.) :)
Imageshack is the only one I have seen that shows the directory names ("img294", "1646") where the image is located, and I would like to do the same.
http://img294.imageshack.us/img294/1646/**jquerykd5**.jpg
1) Are there any security issues I should be aware of if I use this approach?
2) How do these sites come up with short unique identifiers ("kd5")?
Thanks all for any advice and help.
Well, for starters, unless you would like the directory listings to be public, put dummy index.html files in those directories or just restrict public access to them.
As for the unique identifiers there are many ways of going about this... some of my favourite chunks of information to use:
UNIX time (if running a unix based server)
chunks of the md5 of the file
pseudo random numbers
piece of the original filename
With these and many other pieces of information at your fingertips, it should be easy to prevent image names from conflicting on your server: gather as many as you like and concatenate them into a string for the filename. The MD5 can also be stored in a database to help detect duplicate images, which could save you disk space as well.
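As an illustration, three of the ingredients listed above can be concatenated like this. `image_identifier` is a hypothetical helper, and the base-36 formatting and part lengths are arbitrary choices for the sketch, not how any of the named sites actually build their identifiers.

```ruby
require "digest"
require "securerandom"

# Hypothetical helper combining UNIX time, a chunk of the file's MD5,
# and a random number into one identifier.
def image_identifier(path)
  time_part   = Time.now.to_i.to_s(36)                 # UNIX time in base 36
  md5_part    = Digest::MD5.file(path).hexdigest[0, 6] # first 6 hex characters
  random_part = SecureRandom.random_number(36**3).to_s(36).rjust(3, "0")
  "#{time_part}-#{md5_part}-#{random_part}"
end
```

The full MD5 would still be stored in the database for duplicate detection, since a 6-character chunk alone is too short to rule out accidental matches.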
I can promise you they all use URL rewriting. This will help with security issues, too.
