Users will be able to upload images, and each file will be renamed so it doesn't clash with another file. Using a simple convention like calling them 1.jpg, 2.jpg, 3.jpg and so on will mean other users can simply type in 4.jpg and see someone else's image.
Is there a way or a convention for naming images differently while still ensuring that guessing an image name is hard?
You could just write a PHP script to generate some pseudo random image name each time, like 21412adfs.jpg.
Better yet, take the name of the file being uploaded and prepend a random number to it, say, 19353--toy--car.png. You could even replace the random number with a number representing the date the image was uploaded.
Naming conventions can be in any form you want really, whatever works best for your setup and archive purposes. Including the date in the image name can be good, as you could easily sort images into different upload folders depending on their dates, etc.
Your best bet is to use a hashing function to generate a random string that you can use. For example in PHP you could use the following.
$filename = md5('SOME RANDOM STRING'.rand(0,200000).time());
MD5 can occasionally produce the same string twice, which would cause a filename collision, but the likelihood of this happening is quite small. If it does happen, all you have to do is run the name generation again; a collision twice in a row is extremely unlikely.
Make sure you change 'SOME RANDOM STRING' to something that only you know and use on your site. This is what's known as a "site salt": it means that outsiders will have a much harder job guessing your generated filenames, because they won't be able to predict and reverse engineer everything that you've put into the mix to generate them.
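For anyone not on PHP, here is a rough Python sketch of the same idea, including the "just run it again" step on a collision. The names SITE_SALT and generate_name are made up for illustration, and the directory check is only one possible way to detect a collision.

```python
import hashlib
import os
import secrets
import time

SITE_SALT = "SOME RANDOM STRING"  # assumption: replace with a secret value of your own


def generate_name(extension, upload_dir):
    """Return a hashed file name that does not yet exist in upload_dir."""
    while True:
        # Mix the salt, a random number and the current time, as in the PHP line above.
        seed = f"{SITE_SALT}{secrets.randbelow(200001)}{time.time()}"
        name = hashlib.md5(seed.encode("utf-8")).hexdigest() + extension
        # On the rare collision, simply generate another name.
        if not os.path.exists(os.path.join(upload_dir, name)):
            return name
```

Calling generate_name(".jpg", "/path/to/uploads") returns a 32-character hex name plus the extension. Note that two simultaneous uploads could still race between the existence check and the actual write, so a real service would want to create the file atomically.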
Hope that helps?
This is a question about organising lots of images in a web project. Say you had the following two icons in a web project that represented, for example, a product selected or a product not selected:
What would you name them?
Seems a simple question, but I suspect naming images is something of an art.
For example:
star_active.png and star_inactive.png: Seems fair enough, but what if you want to replace the star at a later date with, say, a circle? Then the name is misleading, so you would have to rename it and then update all your CSS etc.
product_selected.png and product_unselected.png: Great when used for the specific action of selecting a product, but what if I wanted to use the same image for a different purpose? Then the name is confusing and too specific.
Should the image size be part of the image name? eg. someImage_16.png
What is the best naming convention you have found for naming images?
You're asking for a naming convention that predicts future attributes and applications of the file so that you never have to update the file name. That is impossible. You have to rely on your own intuition when you initially name the files.
There is no way around it. If you end up changing either a file or its application so drastically that the file name no longer accurately reflects its use, then you will either need to keep the misleading name or replace it throughout your files.
Most decent text editors should be able to easily do the latter across multiple files.
The only alternative is to assign names which are not descriptive from the start, which is obviously not a good idea.
Listen to Kobi and look into sprites, or if you're averse to sprites, do it the way Arvin said for the reasons given.
What's the best-practice method of storing a user's uploaded pictures and their corresponding thumbnails? I noticed Flickr uses filename distinctions like: http://farm5.static.flickr.com/1234/789456123a_s.jpg where _s.jpg describes the size of the image (_s.jpg = small, _m.jpg = medium...). However, does storing images like the following make sense?
/images/123456.jpg
/images/small/123456.jpg
/images/medium/123456.jpg
...therefore, it's easy to access different sizes by simply pre-pending the folder-size name
Whatever works for you - pick a scheme and stick to it. As long as you're consistent, and document what you do, you should be fine.
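If you go with the folder-per-size layout from the question, it helps to build every path in one place so the scheme stays consistent. A minimal Python sketch, where image_path, SIZES and the .jpg extension are assumptions taken from the example paths:

```python
import posixpath

SIZES = ("small", "medium")  # assumption: the size folders from the question


def image_path(image_id, size=None, root="/images"):
    """Build /images/<size>/<id>.jpg, or /images/<id>.jpg for the original."""
    if size is None:
        return posixpath.join(root, f"{image_id}.jpg")
    if size not in SIZES:
        raise ValueError(f"unknown size: {size}")
    return posixpath.join(root, size, f"{image_id}.jpg")
```

Here image_path(123456) gives /images/123456.jpg and image_path(123456, "small") gives /images/small/123456.jpg.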
I would like to create an image uploading service (yes, I am aware of imageshack, photobucket, flickr...etc) :)
I have seen only imageshack show the directory names ("img294", "1646") of where the image is located, and I would like to do the same.
http://img294.imageshack.us/img294/1646/**jquerykd5**.jpg
1) Are there any security issues I should be aware of if I take this approach?
2) How do these sites come up with short unique identifiers ("kd5")?
Thanks all for any advice and help.
Well for starters, unless you would like the directory to be public, put dummy index.html files in there or just restrict access to public users for those directories.
As for the unique identifiers there are many ways of going about this... some of my favourite chunks of information to use:
UNIX time (if running a unix based server)
chunks of the md5 of the file
pseudo random numbers
piece of the original filename
With these and many other pieces of information at your fingertips it should be easy to prevent duplicate image names conflicting on your server as well, you can gather as many as you like and concatenate them into a string for the filename. The md5 can be placed in a database as well to aid in a method of duplicate image detection, which could save you disk space as well.
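As a rough illustration of concatenating those pieces, and of using the md5 for duplicate detection, here is a Python sketch. The seen_hashes dictionary is a hypothetical stand-in for the database table, and store_name is an invented helper name.

```python
import hashlib
import os
import secrets
import time

seen_hashes = {}  # hypothetical stand-in for a database table: md5 -> stored name


def store_name(original_name, data):
    """Return (file_name, is_duplicate) for the uploaded bytes."""
    digest = hashlib.md5(data).hexdigest()
    if digest in seen_hashes:
        # The same bytes were uploaded before: reuse the stored file.
        return seen_hashes[digest], True
    stem, ext = os.path.splitext(os.path.basename(original_name))
    # Keep only letters and digits from the original name to avoid unsafe characters.
    safe_stem = "".join(c for c in stem if c.isalnum())[:12]
    parts = [
        str(int(time.time())),  # UNIX time
        digest[:8],             # chunk of the file's md5
        secrets.token_hex(2),   # pseudo-random component
        safe_stem,              # piece of the original file name
    ]
    name = "_".join(parts) + ext.lower()
    seen_hashes[digest] = name
    return name, False
```

The second upload of identical bytes gets the first file's name back with is_duplicate set to True, so nothing new needs to be written to disk.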
I can promise you they all use URL rewriting. This will help with security issues, too.
Given these two images from twitter.
http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg
I want to download them to local filesystem & store them in a single directory.
How shall I overcome name conflicts ?
In the example above, I cannot store them as lowres_profilepic.jpg.
My design idea is treat the URLs as opaque strings except for the last segment.
What algorithms (implemented as f) can I use to hash the prefixes into unique strings?
f( "http://a3.twimg.com/profile_images/130500759/" ) = 6tgjsdjfjdhgf
f( "http://a1.twimg.com/profile_images/58079916/" ) = iuhd87ysdfhdk
That way, I can save the files as:-
6tgjsdjfjdhgf_lowres_profilepic.jpg
iuhd87ysdfhdk_lowres_profilepic.jpg
I don't want a cryptographic algorithm as this needs to be a performant operation.
Irrespective of the how you do it (hashing, encoding, database lookup) I recommend that you don't try to map a huge number of URLs to files in a big flat directory.
The reason is that file lookup for most file systems involves a linear scan through the filenames in a directory. So if all N of your files are in one directory, a lookup will involve N/2 comparisons on average, i.e. O(N). (Note that ReiserFS organizes the names in a directory as a B-tree. However, ReiserFS seems to be the exception rather than the rule.)
Instead of one big flat directory, it would be better to map the URIs to a tree of directories. Depending on the shape of the tree, lookup can be as good as O(logN). For example, if you organized the tree so that it had 3 levels of directory with at most 100 entries in each directory, you could accommodate 1 million URLs. If you designed the mapping to use 2 character filenames, each directory should easily fit into a single disk block, and a pathname lookup (assuming that the required directories are already cached) should take a few microseconds.
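One way to get such a tree without a lookup table is to derive the directory names from a hash of the URL. This is only a sketch under that assumption (the answer above does not prescribe how the mapping is built), and sharded_path is an invented name. With two hex characters per level each directory holds at most 256 subdirectories rather than 100.

```python
import hashlib
import posixpath


def sharded_path(url, root="cache", levels=3):
    """Map a URL to root/aa/bb/cc/<full hash>, using 2 hex characters per level."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    shards = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return posixpath.join(root, *shards, digest)
```

The same URL always maps to the same path, so finding a stored file is just a matter of recomputing the hash.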
It seems what you really want is to have a legal filename that won't collide with others.
Any encoding of the URL will work, even base64: e.g. filename = base64(url)
A crypto hash will give you what you want. Although you claim this will be a performance bottleneck, don't be sure until you've benchmarked.
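A small Python sketch of the encoding option. It uses the URL-safe Base64 alphabet, because standard Base64 can emit '/', which is not legal inside a filename; the helper names are made up.

```python
import base64


def url_to_filename(url):
    """Encode a URL into a reversible, file-name-safe string."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def filename_to_url(name):
    """Recover the original URL from a name produced by url_to_filename."""
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")
```

The upside is that the mapping is reversible and cannot collide. The downside is that the name is about a third longer than the URL, so long URLs can exceed the 255-byte name limit that many file systems impose.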
A very simple approach:
f( "http://a3.twimg.com/profile_images/130500759/" ) = a3_130500759.jpg
f( "http://a1.twimg.com/profile_images/58079916/" ) = a1_58079916.jpg
As the other parts of this URL are constant, you can combine the subdomain and the last directory segment of the path into a unique filename.
Don't know what could be a problem with this solution
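In Python the simple approach might look like the sketch below. It assumes the full image URL is passed in (not just the prefix) and that the URL always has the twimg shape shown in the question; simple_name is an invented name.

```python
import posixpath
from urllib.parse import urlsplit


def simple_name(url):
    """Build '<subdomain>_<numeric id><extension>' from a twimg-style URL."""
    parts = urlsplit(url)
    subdomain = parts.hostname.split(".")[0]
    segments = [s for s in parts.path.split("/") if s]
    # The second-to-last segment is the numeric id, the last one the file name.
    image_id, original = segments[-2], segments[-1]
    extension = posixpath.splitext(original)[1]
    return f"{subdomain}_{image_id}{extension}"
```

For http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg this returns a3_130500759.jpg.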
The nature of a hash is that it may result in collisions. How about one of these alternatives:
use a directory tree. Literally create sub directories for each component of the URL.
Generate a unique id. The problem here is how to keep the mapping between the real name and the saved id. You could use a database which maps between a URL and a generated unique id: simply insert a record into a table that generates unique ids, and then use that id as the filename.
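The database option can be sketched with SQLite from the Python standard library. The table name, column names and the in-memory database are assumptions for illustration only.

```python
import sqlite3

# Assumption: a real service would use a file-backed or server database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE NOT NULL)"
)


def id_for_url(url):
    """Return the stable integer id for url, inserting a row on first sight."""
    conn.execute("INSERT OR IGNORE INTO images (url) VALUES (?)", (url,))
    conn.commit()
    row = conn.execute("SELECT id FROM images WHERE url = ?", (url,)).fetchone()
    return row[0]


def filename_for_url(url, extension=".jpg"):
    """Use the database-generated id as the file name."""
    return f"{id_for_url(url)}{extension}"
```

Because the id comes from the database rather than from a hash, two different URLs can never end up with the same filename, at the cost of one lookup per file.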
One of the key concepts of a URL is that it is unique. Why not use it?
Every algorithm that shortens the information can produce collisions. They may be unlikely, but they are possible nevertheless.
While CRC32 produces a maximum 2^32 values regardless of your input and so will not avoid conflicts, it is still a viable option for this scenario.
It is fast, so if you generate filename that conflicts, just add/change a character to your URL and simply re-calc the CRC.
With 4.3 billion possible checksums, the likelihood of a filename conflict, when combined with the original filename, is going to be so low as to be unimportant in normal situations.
I've used this approach myself for something similar and was pleased with the performance.
See Fast CRC32 in Software.
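A quick Python sketch of the CRC32 approach, using zlib.crc32 from the standard library. The salt parameter is my own addition to illustrate the "add/change a character and re-calc" step; crc_name is an invented name.

```python
import zlib


def crc_name(url, salt=""):
    """Prefix the last URL segment with the CRC32 of everything before it."""
    prefix, _, filename = url.rpartition("/")
    # Pass a different salt to get a new checksum if a name is already taken.
    checksum = zlib.crc32((prefix + "/" + salt).encode("utf-8"))
    return f"{checksum:08x}_{filename}"
```

Each name is eight hex characters, an underscore, and the original filename, so the two example URLs from the question end up with different prefixes but the same lowres_profilepic.jpg suffix.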
You can use the UUID class in Java to turn any bytes into a name-based UUID. The same input always gives the same UUID, and you won't have a problem with file lookup.
String url = "http://www.google.com";
String shortUrl = UUID.nameUUIDFromBytes(url.getBytes()).toString();
I see your question is about the best hash algorithm for this purpose. You might want to check the question "Best hashing algorithm in terms of hash collisions and performance for strings".
The git version control system is based on SHA1 because it has a very small chance of collision.
If it is good enough for git, it will be good enough for you.
I'm playing with thumbalizr using a modified version of their caching script, and it has a few good solutions, I think. The code is on github.com/mptre/thumbalizr, but the short version is that it uses md5 to build the file names, then takes the first two characters of the filename and creates a folder with exactly that name. This means that it is easy to break the folders up, and fast to find the corresponding folder without a database. It kind of blew my mind with its simplicity.
It generates file names like this
http://pappmaskin.no/opensource/delicious_snapcasa/mptre-thumbalizr/cache/fc/fcc3a328e0f4c1b51bf5e13747614e7a_1280_1024_8_90_250.png
the last part, _1280_1024_8_90_250, matches the different settings that the script uses when talking to the thumbalizr api, but I guess fcc3a328e0f4c1b51bf5e13747614e7a is a straight md5 of the url, in this case for thumbalizr.com
I tried changing the config to generate images 200px wide, and that image goes in the same folder, but instead of _250.png it is called _200.png
I haven't had time to dig that much in the code, but I'm sure it could be pulled apart from the thumbalizr logic and made more generic.
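Pulled apart from the thumbalizr logic, the naming scheme as I understand it could be sketched in Python like this. This is my reading of the generated file names, not the script's actual code, and cache_path and its parameters are invented.

```python
import hashlib
import posixpath


def cache_path(url, settings, root="cache", extension=".png"):
    """Return root/<first two md5 chars>/<md5>_<settings...><extension>."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    suffix = "_".join(str(value) for value in settings)
    return posixpath.join(root, digest[:2], f"{digest}_{suffix}{extension}")
```

Calling it with the same URL but a different last setting (250 versus 200) gives two files in the same two-character folder, which matches what I saw when changing the config.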
You said:
I don't want a cryptographic algorithm as this needs to be a performant operation.
Well, I understand your need for speed, but I think you need to consider the drawbacks of your approach. If you just need to create a hash for URLs, you should stick with an existing algorithm rather than write a new one, where you'd need to deal with collisions yourself, for instance.
So you could have a Dictionary<string, string> to work as a cache for your URLs. When you get a new address, you first do a lookup in that dictionary and, if you don't find a match, hash it and store the result for future use.
Following this line, you could give MD5 a try:
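using System;
using System.Security.Cryptography;
using System.Text;

public static class Program
{
    public static void Main(string[] args)
    {
        foreach (string url in new string[]{
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg",
            "http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg" })
        {
            Console.WriteLine(HashIt(url));
        }
    }

    // Hashes the URL up to its last '/' (everything except the file name) with MD5.
    private static string HashIt(string url)
    {
        Uri path = new Uri(new Uri(url), ".");
        using (MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider())
        {
            byte[] data = md5.ComputeHash(
                Encoding.ASCII.GetBytes(path.OriginalString));
            // Base64 output can contain '/' and '+'; replace them before using it as a file name.
            return Convert.ToBase64String(data);
        }
    }
}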
You'll get:
rEoztCAXVyy0AP/6H7w3TQ==
0idVyXLs6sCP/XLBXwtCXA==
It appears that the numerical part of a twimg.com URL is already a unique value for each image. My research indicates that the number is sequential (i.e. the example URL below is for the 433,484,366th profile image ever uploaded, which just happens to be mine). My solution would be to simply use the numerical part of the URL as the "hash value", with no fear of ever finding a non-unique value.
URL: http://a2.twimg.com/profile_images/433484366/terrorbite-industries-256.png
Filename: 433484366.terrorbite-industries-256.png
Unique ID: 433484366
I already use this system in a Python script that displays notifications for new tweets; as part of its operation it caches profile image thumbnails to reduce unnecessary downloads.
P.S. It makes no difference what subdomain the image is downloaded from, all images are available from all subdomains.
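For reference, the naming step boils down to something like the sketch below. This is a simplified illustration rather than the actual script, and local_name, needs_download and ID_PATTERN are names chosen for this example.

```python
import os
import re

ID_PATTERN = re.compile(r"/profile_images/(\d+)/([^/]+)$")


def local_name(url):
    """Return '<numeric id>.<original file name>' for a twimg profile image URL."""
    match = ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"not a profile image URL: {url}")
    image_id, filename = match.groups()
    return f"{image_id}.{filename}"


def needs_download(url, cache_dir):
    """True when the image is not cached yet, so it only gets fetched once."""
    return not os.path.exists(os.path.join(cache_dir, local_name(url)))
```

For the example URL above, local_name returns 433484366.terrorbite-industries-256.png, and the result is the same whichever subdomain the URL uses.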