1. Does the encrypt_name option of CodeIgniter's Upload library check that the generated name is unique?
I know that the overwrite option is important. If overwrite is TRUE, it would overwrite and if it's FALSE, it would rename the file by adding a number at the end of the name.
The question is: will it regenerate the encrypted name until it finds a unique one, even if overwrite is TRUE? I ask because when we want an encrypted name, we obviously don't want to overwrite.
The problem with renaming by appending a number is that it breaks the uniform length of the file names. Most files will have a 32-character filename, but some might end up with 33 characters, which breaks consistency.
2. Is it possible for it to generate a duplicate name at all?
Since CodeIgniter uses md5(uniqid(mt_rand())) to generate the encrypted file name, I'd guess that you'll find your answer in the PHP docs for uniqid.
Short answer (for 2.) would be: maybe, but probably not.
And to answer your first question: no, CI doesn't generate a new encrypted filename if the name already exists. It adds a number to the end of the name.
A short glance at the source code of /libraries/Upload.php, line 415, helps.
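If fixed-length names matter, one option is to generate the name yourself and retry on a collision. The following is a minimal sketch in Ruby rather than PHP, and it is not what CodeIgniter does; the helper name unique_hex_name is hypothetical, and SecureRandom is used in place of md5(uniqid(mt_rand())).

```ruby
require "securerandom"

# Hypothetical helper: returns a name made of 32 hex characters plus ext
# that does not yet exist in dir, retrying on the unlikely collision.
def unique_hex_name(dir, ext)
  loop do
    name = SecureRandom.hex(16) + ext # 16 random bytes -> 32 hex characters
    return name unless File.exist?(File.join(dir, name))
  end
end
```

Note that checking and then creating the file is not atomic, so two simultaneous uploads could in principle still pick the same name.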
I am looking for a way to prevent the same PDF document from being uploaded twice in one of my applications.
I know this is easy to do by document name, but I don't want to detect duplicates by name.
Here is the challenge: I want to detect duplicate uploads based on their contents, not on the name the end user gave the document.
I have never dealt with this scenario before and would like to know whether someone has a way to solve it.
Your solutions or tricks will be really helpful.
Thanks in advance.
I think the best way is to generate a checksum from the uploaded file, store it in the database (or some other place), and then check whether the checksum of each newly uploaded file is already present in the database.
In Ruby you can use the Digest module to do that:
require "digest"
data = File.read("some_file_path")
checksum = Digest::MD5.hexdigest(data)
You don't have to check for filename, just use this checksum.
One simple way is to compare MD5 checksums. Instead of reading or parsing files line by line, generate MD5 digests for them and compare those. Files with the same MD5 value are, barring a hash collision, the same file.
How to generate MD5 for a file in Ruby?
require 'digest'
Digest::MD5.file("path/to/pdf").hexdigest
# md5 string
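The checksum lookup described in these answers could be sketched as follows. This is only an illustration: the UploadIndex class is hypothetical, it keeps checksums in memory instead of a database, and it uses SHA-256 rather than MD5 because MD5 collisions can be crafted deliberately.

```ruby
require "digest"
require "set"

# Hypothetical in-memory index; a real application would more likely keep
# the checksums in a database column with a unique index.
class UploadIndex
  def initialize
    @seen = Set.new
  end

  # Records the file's checksum. Returns true if the contents are new,
  # false if a file with identical contents was added before.
  def add(path)
    checksum = Digest::SHA256.file(path).hexdigest
    !@seen.add?(checksum).nil?
  end
end
```

A second upload with the same bytes returns false from add regardless of its file name.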
I'm downloading a lot of files whose URLs are listed in a text file.
When saving a file to disk, I use the MD5 checksum of its URL as the new filename. This is to avoid file name conflicts and invalid characters in the original file name.
But I also need a way to find the original URL from a downloaded file name. If I use MD5, I'll have to keep a very large mapping.
Is there any algorithm I can use instead that allows me to just decode the original URL from the file name?
Note that I also don't want the length of the file names to vary too much.
You can use base62, which is file system friendly and can be encoded and decoded. But you can't avoid file name collisions. If you want to avoid them too, you could append an MD5 of the file to the encoded filename, and remove the MD5 when decoding.
If you want a generic solution, look for short string compression algorithms. Here's a previously answered question about it: An efficient compression algorithm for short text strings.
There's no way to guarantee that you get equal-length strings, because some of them will compress better than others.
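A reversible base62 encoding can be sketched by treating the URL's bytes as one large integer. The Base62 module below is hypothetical, not a standard library, and it assumes a non-empty input that does not start with a NUL byte (which holds for URLs).

```ruby
# Hypothetical module: reversible base62 encoding of a string.
module Base62
  ALPHABET = [*("0".."9"), *("A".."Z"), *("a".."z")].freeze

  module_function

  # Interprets the string's bytes as a big integer and writes it in base 62.
  def encode(str)
    n = str.unpack1("H*").to_i(16)
    return ALPHABET[0] if n.zero?
    out = +""
    while n > 0
      n, r = n.divmod(62)
      out << ALPHABET[r]
    end
    out.reverse
  end

  # Reverses encode: base62 digits -> integer -> hex -> original bytes.
  def decode(encoded)
    n = encoded.each_char.reduce(0) { |acc, c| acc * 62 + ALPHABET.index(c) }
    hex = n.to_s(16)
    hex = "0" + hex if hex.length.odd?
    [hex].pack("H*").force_encoding("UTF-8")
  end
end
```

The encoded name is longer than the URL itself, so the file name length still varies with the URL length.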
Since you are dealing with only html you can use that to store some data. For example you can simply put the original URL in front of the leading HTML tag or after the closing HTML tag. Or add a special tag or attribute to the file to store this information. Then you can keep MD5 as the file name, but if you need the url you would open the file and look for it there. This should allow you to store the data without affecting any use of the file and without having to store a large mapping table.
Seems like it must be easy, but I just can't figure it out. How do you delete the very last character of a file using Ruby IO?
I took a look at the answer for deleting the last line of a file with Ruby but didn't fully understand it, and there must be a simpler way.
Any help?
There is File.truncate:
truncate(file_name, integer) → 0
Truncates the file file_name to be at most integer bytes long. Not available on all platforms.
So you can say things like:
File.truncate(file_name, File.size(file_name) - 1)
That should truncate the file with a single system call to adjust the file's size in the file system without copying anything.
Note the "not available on all platforms" caveat, though. File.truncate should be available on anything unixy (such as Linux or OSX); I can't say anything useful about Windows support.
I assume you are referring to a text file. The usual way of changing such is to read it, make the changes, then write a new file:
text = File.read(in_fname)
File.write(out_fname, text[0..-2])
Insert the name of the file you are reading from for in_fname and the name of the file you are writing to for out_fname. They can be the same file, but if that's the intent it's safer to write to a temporary file, copy the temporary file to the original file, then delete the temporary file. That way, if something goes wrong before the operations are completed, you will probably still have either the original or the temporary file. text[0..-2] is a string consisting of all characters read except for the last one. You could alternatively do this:
File.write(out_fname, File.read(in_fname, File.stat(in_fname).size-1))
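The temporary-file approach for the same-file case might look like the sketch below. The helper name chop_last_char and the ".tmp" suffix are hypothetical, the file is assumed to be UTF-8 text, and File.rename is used instead of copy-then-delete because on POSIX systems the rename replaces the original in one step.

```ruby
# Hypothetical helper: rewrites the file at path without its last character.
def chop_last_char(path)
  tmp_path = "#{path}.tmp" # assumed not to clash with an existing file
  text = File.read(path, encoding: "UTF-8")
  File.write(tmp_path, text[0..-2]) # everything except the last character
  File.rename(tmp_path, path)       # swap the new file into place
end
```

Unlike File.truncate, which drops the last byte, this drops the last character, which matters when that character is multi-byte.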
I'm trying to accomplish something that will let a user download a file from a web application onto their system. The file will contain a unique five digit code. Using this unique five digit code the users can search for a file in their file system.
I'm wondering where is the best place to put this five digit code in a file so that users can easily search for the file. The simplest approach would be to put it in the name of the file, however, users can change the name of the file easily.
I'm looking for a field where I can put the code so that users won't be able to modify it but will still be able to search for it. Is this possible?
When you say "file", what kind of file format do you mean? I'm asking because a file is just a pile of bytes, and if it is your own file format you can put your five digit code anywhere in the file. But if you tell us which file format you use, there are probably some fields that can be used to search for it. For example, TIFF has many tags, and other image formats have their own metadata.
After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.
I have no idea what this means however, all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this ruby script, and I change the file's content (which changes the hash), how do I know where the file is stored then?
My question is then, what are the basics of implementing a SHA-1/file-storage system?
If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?
I'm just thinking about how to create a generic file storing system like GoogleDocs, Flickr, Youtube, DropBox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll 99% of the time do file storing from now on", so I can stop thinking about building a solid/consistent way to store files and get onto some real problems.
First of all, if the contents of the files are changing, the approach of deriving the filename from a SHA digest is not very suitable, because the name and location of the file in the filesystem must change whenever the contents of the file change.
Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.
When you have a digest, for example, 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and filename from the digest. For example, you split the first few characters from the digest to directory structure and rest of the characters to file name. For example:
00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
This way you only need to store the SHA-1 digest of the file in the database. You can then always work out the right location and name of the file.
Directories usually also have a maximum number of files they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also makes sure that every directory has about the same number of files, so you won't end up with all your files in the same directory.
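The digest-to-path mapping described above can be sketched like this. The helper names path_for_digest and storage_path are hypothetical; the split into three two-character directory levels follows the example path in this answer.

```ruby
require "digest"

# Hypothetical helper: maps a hex digest to root/aa/bb/cc/<rest><ext>.
def path_for_digest(digest, root, ext = "")
  File.join(root, digest[0, 2], digest[2, 2], digest[4, 2], digest[6..-1] + ext)
end

# Hypothetical helper: computes the SHA-1 of a file's contents and returns
# the location where that file would be stored under root.
def storage_path(file, root)
  path_for_digest(Digest::SHA1.file(file).hexdigest, root, File.extname(file))
end
```

For the digest in the example, path_for_digest("00e4f56c0de1c61fdb926e79e8a0a65bd12930c9", "some/path", ".txt") gives some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt.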
The idea is not to change the file content, but rather its name (and path), by using a hash value.
Changing the content with a hash would be disastrous since a hash is normally not reversible.
I'm not sure of the motivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash approach:
the file names on the disk are uniform
the upper or lower parts of the hash value can be used to name the directories and hence distribute the files relatively uniformly
the name becomes a code, making it difficult for someone to
a) guess a file name
b) categorize pictures (should someone steal the hard drive content)
the filename and location can be retrieved from the file contents itself, assuming the hash comes from such content (not quite sure which use case would involve this... a bit contrived...)
The general interest of using a hash is that unlike a file name, a hash is meaningless, and therefore one would require the database to relate images and "bibliographic" type data (name of uploader, date of upload, tags, ...)
In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...
Furthermore... some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful, by making the file names longer than they need to be, and hence putting more stress on the file system (bigger directories...)
One advantage I see with storing files using their hash is that the file data only needs to be stored once and can then be referenced multiple times within your database. This will save you space if you have different users uploading the exact same file.
However, the downside is that when a user deletes what they think is their file from your app, you can't just physically delete the file from disk, because other users who uploaded the exact same file may still be using it.
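One way to handle that deletion problem is to count references per digest and remove the file from disk only when the last reference is released. The sketch below is an assumption about how that could look, not something from the referenced answer; the BlobRefs class is hypothetical and keeps its counts in memory rather than in a database.

```ruby
# Hypothetical reference counter for content-addressed files.
class BlobRefs
  def initialize
    @counts = Hash.new(0)
  end

  # Called whenever a user uploads a file whose contents hash to digest.
  def add(digest)
    @counts[digest] += 1
  end

  # Called when a user deletes their copy. Returns true only when no
  # references remain, meaning the file on disk can be removed.
  def release(digest)
    return false unless @counts.key?(digest)
    @counts[digest] -= 1
    return false if @counts[digest] > 0
    @counts.delete(digest)
    true
  end
end
```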
The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.
So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.
This has the additional benefit of detecting duplicate uploads.