Custom hash for unique ID - algorithm

I want to do something similar to what YouTube does. For example, this is a valid YouTube video ID didzxUkrtS0
Right now I am storing users' IDs as integers, and I want to translate those numbers into an 8-character alphanumeric identifier. For example: FZ3EY1IC (not hexadecimal)
I already know that I can implement it with MD5 and then take the first 8 hex digits, but that doesn't cover the entire alphabet.
What should I do to create a unique pattern using integers, that should never repeat?

Pad your integer to five 8-bit bytes (for example, by adding a byte with a random value if your integer is 32-bit); that gives 40 bits of data.
Encrypt the 5 bytes of your integer using some key, which you probably want to keep private.
Slice the 40 encrypted bits into eight 5-bit parts. Encode each part using 32 alphanumeric characters. You may choose a different set of 32 characters (out of the total 36) for each part.
The reverse operation is trivial.
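A minimal sketch of the steps above, under some loud assumptions: the answer does not name a cipher, so a small 4-round Feistel network with an HMAC-SHA256 round function stands in for "encrypt 40 bits" (a toy permutation, not a vetted cipher); `SECRET_KEY` and `ALPHABET` are hypothetical; and the top byte is left as zero instead of random so the mapping is reproducible.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"      # hypothetical key
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"   # hypothetical 32-character set
ROUNDS = 4
HALF_BITS = 20
HALF_MASK = (1 << HALF_BITS) - 1


def _round(value: int, round_no: int) -> int:
    """Keyed round function: HMAC-SHA256 truncated to 20 bits."""
    msg = bytes([round_no]) + value.to_bytes(3, "big")
    digest = hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big") & HALF_MASK


def encrypt40(n: int) -> int:
    """Permute a 40-bit integer with a balanced Feistel network."""
    left, right = n >> HALF_BITS, n & HALF_MASK
    for i in range(ROUNDS):
        left, right = right, left ^ _round(right, i)
    return (left << HALF_BITS) | right


def decrypt40(c: int) -> int:
    """Undo encrypt40 by running the rounds in reverse order."""
    left, right = c >> HALF_BITS, c & HALF_MASK
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round(left, i), left
    return (left << HALF_BITS) | right


def encode(user_id: int) -> str:
    """Encrypt the ID, then emit eight 5-bit groups as characters."""
    if not 0 <= user_id < 1 << 40:
        raise ValueError("ID must fit in 40 bits")
    c = encrypt40(user_id)
    return "".join(ALPHABET[(c >> shift) & 31] for shift in range(35, -1, -5))


def decode(code: str) -> int:
    """Rebuild the 40-bit value from 8 characters and decrypt it."""
    c = 0
    for ch in code:
        c = (c << 5) | ALPHABET.index(ch)
    return decrypt40(c)
```

Because a Feistel network is a permutation for any round function, distinct integers always give distinct 8-character codes, and `decode(encode(n)) == n`.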

I'm pretty sure what sites like YouTube and Bitly do is store a big table in the database that translates the alphanumeric identifier for each link to the internal ID of what it refers to (either that or it's stored in the row). When the site needs a new identifier, it computes a random one and stores it. The reason you need to do this is so that an attacker cannot predict the ID of the next piece of content to be added.

Related

How can I generate a set of (short) unique identifiers which are easy to check but hard to spoof?

For example, such as the license keys much software uses. I had thought of cryptographically signing a sequence, so I could have maybe 4 bytes for the ID and say 8 bytes for the signature, but I can't find a suitable algorithm.
What I need is something that an attacker can't readily generate, but which is stored in less than approx 20 ASCII bytes. I also need to be confident of uniqueness. This doesn't need to be completely secure, only secure against a casual attack.
Note: I'm doing this in Java on App Engine.
Just generate a GUID for each ID and keep track of the ones you've generated in a database. The universe of GUIDs is so large that each will be unique. It's not cryptographic, so there's a possibility that anyone who has a large enough population of your generated ones could produce a match, but I think the odds are still minuscule.
A GUID is 128 bits (16 bytes), which Base64 encodes in 24 characters with padding, or 22 characters once the two trailing "=" are dropped.
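To check the encoded length, here is a small sketch in Python (the question targets Java, so treat this purely as an illustration). It uses the URL-safe Base64 alphabet and strips the padding, which is an assumption about how you would want to store the value.

```python
import base64
import uuid

guid = uuid.uuid4()  # 128 random-ish bits, i.e. 16 bytes

# 16 bytes encode to 24 Base64 characters, the last two being "=" padding.
encoded = base64.urlsafe_b64encode(guid.bytes).rstrip(b"=").decode("ascii")

# Re-add the padding to decode back to the same GUID.
decoded = uuid.UUID(bytes=base64.urlsafe_b64decode(encoded + "=="))
```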
Sounds like HMAC. You will probably need to ensure uniqueness manually though.
Concatenate the values, including the ID, into a string and compute a byte-based HMAC over it with a secret key, truncated to your maximum length. Just make sure the values being signed include a unique part, such as the server time or some other ID. You will need to test that the resulting length stays within your 20-character requirement.
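A rough sketch of the HMAC idea, matching the "4 bytes for the ID, 8 bytes for the signature" split from the question. It is in Python rather than Java, and `SECRET_KEY`, `make_token`, and `check_token` are hypothetical names. Truncating the tag to 8 bytes is a trade-off that is only reasonable against the casual attacker the question describes.

```python
import base64
import hashlib
import hmac
import struct

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical; keep it server-side


def make_token(serial: int) -> str:
    """4-byte ID followed by an 8-byte truncated HMAC-SHA256 tag."""
    body = struct.pack(">I", serial)
    tag = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()[:8]
    # 12 bytes encode to exactly 16 Base64 characters, with no padding.
    return base64.urlsafe_b64encode(body + tag).decode("ascii")


def check_token(token: str):
    """Return the ID if the tag verifies, otherwise None."""
    try:
        raw = base64.urlsafe_b64decode(token.encode("ascii"))
    except (ValueError, UnicodeEncodeError):
        return None
    if len(raw) != 12:
        return None
    body, tag = raw[:4], raw[4:]
    expected = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        return None
    return struct.unpack(">I", body)[0]
```

Uniqueness comes from the 4-byte ID itself (for example a counter), not from the HMAC, so the counter still has to be tracked somewhere.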
Encryption is reversible, so the output is guaranteed unique for unique inputs. Just encrypt 0, 1, 2, 3, 4, 5 etc. using the same key every time. For 128 bit output use AES and 128 bit numbers in ECB mode. Other modes will need identical IV/Nonces as well. For 64 bit numbers use DES. For other size numbers either use Hasty Pudding cypher or roll your own simple Feistel cypher for the size you want.
ECB is not the most secure mode, but I do not get the impression that you are looking for very high levels of security here.

Generating/Compression unique key

At work I have many users, and each user has a set of files in their home directory. Due to some predefined rules I have given each file a UID (unique identifier), based on the file's content and its creation time. I have now learned that the number of files in a user account cannot exceed, say, 1 million. The current UID is about 32 characters long. Is there any way to bring the UID down to about 6 characters (ideally), or at most 10-12 characters, since the current UID takes up a lot of space in my NoSQL database?
The current UID looks like
timestamp.process_which_created_it.size
EDIT
Let me rephrase the problem. What I actually need is a compression algorithm.
For example:
I have a list of 1,000,000 strings, each unique and each 32 characters long. I need a compression function F such that F(string) = s2, where s2 is 10 characters long and every input string maps to a distinct s2.
Sort your UIDs and replace each old UID with a new UID equal to its index in the sorted array of old UIDs.
Simplified pseudocode would look like this:
sorted <- sort(UIDs)
for each file:
    file.UID <- sorted.indexOf(file.UID)
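The pseudocode above could be sketched in Python roughly as follows. The sample UIDs are made up for illustration, and a dictionary replaces the repeated `indexOf` lookups so that the whole mapping is built in one pass after sorting.

```python
# Hypothetical old UIDs in the timestamp.process.size style.
old_uids = [
    "1331000000.imapd.2048",
    "1329000000.smtpd.512",
    "1330000000.imapd.1024",
]

# Map each old UID to its position in sorted order.
short_id = {uid: index for index, uid in enumerate(sorted(old_uids))}

# Keep the sorted list if you need to go back from short ID to old UID.
old_uid_by_index = sorted(old_uids)
```

With at most 1,000,000 files per account the index is below 10^6, so it fits in 6 decimal digits. Note that the mapping only stays stable if new UIDs are appended to the list instead of re-sorting it.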
It is very difficult to take a unique ID, compress it, and keep it unique. You tend to run into collisions.
@amit's suggestion really is the best one. Perhaps his implementation was a bit glib though.
How about creating a table with an AUTO INCREMENTING INTEGER "ID" column and a string/varchar "OldGUID" column? INSERT all your old/current GUIDs into the table and you have a 1-to-1 match between each GUID and a shorter/compressed "ID". As you create new GUIDs, just INSERT them into the table and you'll keep the 1-to-1 match, so you can switch back and forth between the long and short versions.
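Here is one way the lookup table might look, sketched with Python's built-in `sqlite3` and an in-memory database. The table and function names (`uid_map`, `short_id_for`, `old_guid_for`) are hypothetical, and the question's NoSQL store would need its own equivalent of an auto-incrementing counter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uid_map ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " old_guid TEXT NOT NULL UNIQUE)"
)


def short_id_for(old_guid: str) -> int:
    """Insert the long GUID if it is new, then return its short integer ID."""
    conn.execute("INSERT OR IGNORE INTO uid_map (old_guid) VALUES (?)", (old_guid,))
    row = conn.execute(
        "SELECT id FROM uid_map WHERE old_guid = ?", (old_guid,)
    ).fetchone()
    return row[0]


def old_guid_for(short_id: int) -> str:
    """Translate a short ID back to the original long GUID."""
    row = conn.execute(
        "SELECT old_guid FROM uid_map WHERE id = ?", (short_id,)
    ).fetchone()
    return row[0]
```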
If you only need a Unique Identifier, then my first thought goes to UUID.
However, a generic UUID takes 16 bytes and is a binary format. It does not meet your requirement of 6 characters. Compared to your current method using 32 characters, it "only" saves 50% of the space.
Therefore, a milder scheme would be to use a 64-bit UID (8 bytes) produced by a general hash function. With a good hash, the probability of a collision stays small as long as the total number of UIDs generated is below 100 million (about 0.03% at 100 million, by the birthday bound). If that seems acceptable, then 8 bytes is pretty close to your space requirement.

10-char hash of an email

How reliable is it to use a 10-char hash to identify email addresses?
MailChimp has 10-character alphanumeric IDs for email addresses.
10 characters at 4 bits each gives 40 bits, a bit over one trillion values. Maybe for an enterprise the size of MailChimp this gives reasonable headroom for a unique index space, and they have a single table with all possible emails, indexed by a 40-bit number.
I'd love to use the same style of hashes or coded IDs in links. To decide whether to go for indexes or hashes, I need to estimate the probability of two valid email addresses leading to the same 10-character hash.
Any hints on evaluating that for a custom hash function, other than raw testing?
You don't explicitly say what you mean by "reliable", but I presume you're trying to avoid collisions. As wildplasser says, for random identifiers it's all about the birthday paradox, and the chance of a collision in an identifier space with 2^n IDs reaches 50% when 2^(n/2) IDs are in use.
The Wikipedia page on Birthday Attacks has a great table illustrating probabilities for collisions under various parameters; for instance with 64 bits and a desired maximum collision probability of 1 in 1 million, you can have about 6 million identifiers.
Bear in mind that there are a lot more efficient ways to represent data in characters than hex; base64, for instance, gives you 3 bytes per 4 characters, meaning 10 characters gives you 60 bits, instead of 40 with hex.
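The numbers above can be checked with the usual birthday approximation, p ≈ 1 - exp(-k(k-1) / (2N)) for k identifiers drawn uniformly from N = 2^n values. This sketch assumes the hash behaves like a uniform random function, which a custom hash may not; the function names are hypothetical.

```python
import math


def collision_probability(num_ids: int, bits: int) -> float:
    """Approximate chance of at least one collision among num_ids random IDs."""
    space = 2 ** bits
    return -math.expm1(-num_ids * (num_ids - 1) / (2 * space))


def ids_for_probability(p: float, bits: int) -> float:
    """Approximate number of random IDs at which the collision chance reaches p."""
    return math.sqrt(2 * (2 ** bits) * -math.log1p(-p))


p_40_bits = collision_probability(1_000_000, 40)   # hex: 10 chars * 4 bits
p_60_bits = collision_probability(1_000_000, 60)   # base64: 10 chars * 6 bits
half_point_40 = ids_for_probability(0.5, 40)       # on the order of 2**20
```

For a 40-bit space the 50% point sits at a little over a million identifiers, so a 10-character hex hash of email addresses would start colliding at a fairly modest scale.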

I'm brainstorming for a serial number scheme. Am I doing it wrong?

serial number format:
24 hex characters (12 octets), plus hyphens for readability
e.g. D429-A7C5-9C15-8516-D15D-3A1C
0-15: {email+master hash}
16-19: {id}
20-23: {timestamp}
email+master hash algorithm:
generate md5 hash of user's email (32 hex characters)
generate md5 hash of undisclosed master key
xor these two hashes
remove the characters at odd positions, reducing the length to 16
e.g. D429A7C59C158516D15D3A1CB00488ED --> D2AC9181D531B08E
id:
initially 0x00000000, then incremented with each licence sold
timestamp:
timestamp generated when license is purchased
validation:
in order to register product, user must enter 1) email address and 2) serial number
generate email+master hash and verify that it matches 0-15 of serial
extract timestamp from serial and verify that it is < current timestamp and >= date first license is sold
I'm no expert on this, but there are a few things that might be problematic with this approach:
Using MD5 doesn't seem like a good idea. MD5 has known security weaknesses and someone with enough time on their hands could easily come up with some sort of hash collision. Depending on how you use the serial number, someone could easily forge a serial number that looks like it matches some other serial number. Using something from the SHA family might prevent this.
Your XOR of the user email hash with a master key isn't particularly secure - I could recover the hash of the master key easily by XORing the serial number with a hash of my own email.
Dropping every odd byte out of a secure hash breaks the guarantee that the hash is secure. In particular, any hash function with a good security guarantee usually requires that all of the bytes in the resulting hash be there in the output. As an example, I could trivially construct a secure hash function from any existing secure hash function by taking the output of that first hash, interspersing 0s in-between all the old bytes, then outputting the result. It's secure because if you could break any of the security properties of my new hash, it would be equivalent to breaking security properties of the original hash. However, if you drop all the even-numbered bytes from the new hash, you get all zeros, which isn't at all secure.
Is four bytes enough for the id? That only gives you 2^32 different ids.
Some points to add to templatetypedef's reply:
If you must combine hashes for the email and your master key, hash the concatenation of both. Even better, hash email+key+id for even "better" security in case someone purchases two or more licenses and sees the pattern.
Use a hash function that gives you only 16 bytes. If you must use MD5, any truncation is equally bad, so just take the first 16 bytes.
Your id is never used in the validation.
You will not be protected from key sharing (e.g. warez sites).
A serial number protects you from very few attacks. It's probably not worth your time and effort.
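As a sketch of the "hash email+key+id" point, the 16-character part of the serial could be computed with a keyed hash. HMAC-SHA256 is swapped in here for the plain concatenated hash the answer mentions (that is my substitution, not the answer's wording), and `MASTER_KEY`, `email_tag`, and `tag_matches` are hypothetical names.

```python
import hashlib
import hmac

MASTER_KEY = b"undisclosed-master-key"  # hypothetical; never ship it in the product


def email_tag(email: str, licence_id: int) -> str:
    """16 hex characters binding the email and the licence ID to the key."""
    msg = f"{email.strip().lower()}|{licence_id}".encode("utf-8")
    return hmac.new(MASTER_KEY, msg, hashlib.sha256).hexdigest()[:16].upper()


def tag_matches(email: str, licence_id: int, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = email_tag(email, licence_id).encode("utf-8")
    return hmac.compare_digest(expected, tag.upper().encode("utf-8"))
```

Because the licence ID is part of the signed message, it now takes part in validation, and two licences bought with the same email no longer share the first 16 characters. Validation in this form has to happen where the key lives (for example on a server), since the key is needed to recompute the tag.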

YouTube URL algorithm?

How would you go about generating the unique video URL's that YouTube uses?
Example:
http://www.youtube.com/watch?v=CvUN8qg9lsk
YouTube uses Base64 encoding to generate IDs for each video. The characters used in the IDs are
(A-Z) + (a-z) + (0-9) + (-) + (_) (64 characters).
Using Base64 encoding and only 11 characters, they can generate more than 73 quintillion unique IDs. How large a pool of IDs is that?
Well, it's enough for everyone on earth to produce a video every single minute for about 18,000 years.
And they achieve such a huge number using only 11 characters (64*64*64*64*64*64*64*64*64*64*64). If they need more IDs, they just have to add one more character to their IDs.
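The arithmetic is easy to verify. The population figure below is my own round assumption, not something stated in the answer.

```python
pool = 64 ** 11                      # 11 characters, 64 choices each
people = 8_000_000_000               # assumed world population
minutes_per_year = 60 * 24 * 365

# Years until the pool runs out if every person uploads one video per minute.
years = pool / (people * minutes_per_year)
```

`pool` comes out to 73,786,976,294,838,206,464 (which is 2^66), and `years` lands in the 17,000 to 18,000 range with this population figure.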
So when a video is uploaded to YouTube, they basically pick randomly from the 73+ quintillion possibilities and check whether that ID is already taken. If not, they use it; otherwise they look for another one.
Refer to this video for detailed explanation.
Using some non-trivial hashing function. The probability of collision is very low, depending on the function, the parameters and the input domain. Keep in mind that cryptographic hashes were specifically designed to have very low collision rates for non-random input (i.e. completely different hashes for two close-but-unequal inputs).
This post by Jeff Atwood is a nice overview of the topic.
And here is an online hash calculator you can play with.
There is no need to use a hash. It is probably just a quasi-random 64 bit value passed through base64 or some equivalent.
By quasi-random, I mean it is just a one-to-one mapping with the counting integers, just shuffled.
For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, reducing the product modulo 2^64, then base64 the result. (Any odd multiplier is invertible modulo 2^64, so the mapping stays one-to-one.) If you did not want people to be able to guess, you might choose a more complex mapping or just pick a random number that is not in the database yet.
Normal base64 would add an equals at the end, but in this case it is implied because the size is known. The character mapping could easily be something besides the standard.
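A minimal sketch of that shuffle, assuming the product is reduced modulo 2^64. `MULTIPLIER` is a hypothetical constant chosen only because it is odd (I have not checked whether it is prime; oddness is what makes it invertible here). The modular inverse via `pow(x, -1, m)` needs Python 3.8 or later.

```python
import base64

MODULUS = 2 ** 64
MULTIPLIER = 0x9E3779B97F4A7C15           # hypothetical odd constant
INVERSE = pow(MULTIPLIER, -1, MODULUS)    # exists because MULTIPLIER is odd


def id_to_token(n: int) -> str:
    """Shuffle a 64-bit counter and encode it as 11 URL-safe characters."""
    shuffled = (n * MULTIPLIER) % MODULUS
    raw = shuffled.to_bytes(8, "big")
    # 8 bytes give 12 Base64 characters, the last being "="; drop it.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def token_to_id(token: str) -> int:
    """Restore the implied padding, decode, and undo the multiplication."""
    raw = base64.urlsafe_b64decode(token + "=")
    return (int.from_bytes(raw, "big") * INVERSE) % MODULUS
```

This hides the sequence from a casual observer only; anyone who learns or guesses the multiplier can invert it, which is why the answer suggests a more complex mapping or random IDs when guessing matters.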
Eli's link to Jeff's article is, in my opinion, irrelevant. URL shortening is not the same thing as presenting an ID to the world. Instead, a nicer way would be to convert your existing integer ID to a different radix.
An example in PHP:
$id = 9999;
//$url_id = base_convert($id, 10, 26+26+10); // PHP doesn't like this
$url_id = base_convert($id, 10, 26+10); // Works, but only digits + lowercase
Sadly, PHP only supports up to base 36 (digits + alphabet). Base 62 would support alphabet in both upper-case and lower-case.
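For comparison, base 62 is only a few lines when written by hand. This sketch is in Python rather than PHP, and the digit order (digits, then lowercase, then uppercase) is an arbitrary choice of mine.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 symbols.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def to_base62(n: int) -> str:
    """Convert a non-negative integer to a base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def from_base62(s: str) -> int:
    """Convert a base-62 string back to an integer."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n


url_id = to_base62(9999)  # "2Bh", since 2*62**2 + 37*62 + 17 == 9999
```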
People are talking about these other systems:
Random number/letters - Why? If you want people to not see the next video (id+1), then just make it private. On a website like youtube, where it actively shows any video it has, why bother with random ids?
Hashing an ID - This design concept really stinks. Think about it: you have an ID guaranteed by your DBMS to be unique, and you hash it (introducing a collision factor)? Give me one reason to even consider this idea.
Using the ID in URL - To be honest, I don't see any problems with this either, though it will grow to be large when in fact you can express the same number with fewer letters (hence my solution).
Using Base64 - Base64 expects bytes of data, literally anything from nulls to spaces. Why use this function when your data consists of a number (ie, a mix of 10 different characters, instead of 256)?
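You can use a library, and some languages, such as Python, provide this in the standard library.
Example:
import secrets
# token_urlsafe takes a number of random bytes, not a number of characters
num_bytes = 12
# 12 random bytes become a 16-character URL-safe string
random_video_id = secrets.token_urlsafe(num_bytes)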
You could generate a GUID and have that as the ID for the video.
GUIDs are very unlikely to collide.
Your best bet is probably to simply generate random strings, and keep track (in a DB for example) of which strings you've already used so you don't duplicate. This is very easy to implement and it cannot fail if properly implemented (no duplicates, etc).
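A sketch of the generate-and-check loop, with an in-memory set standing in for the database. In a real system the check would be a unique index on the ID column, with a retry when the insert fails; the names here are hypothetical.

```python
import secrets
import string

# 64 URL-safe characters, matching the set seen in YouTube-style IDs.
ALPHABET = string.ascii_letters + string.digits + "-_"

used_ids = set()  # stand-in for a database table with a unique index


def new_video_id(length: int = 11) -> str:
    """Draw random IDs until one is found that has not been used yet."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in used_ids:
            used_ids.add(candidate)
            return candidate
```

With 64^11 possible values the loop almost never runs more than once, so each pick is effectively constant time.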
I don't think that the URL v parameter has anything to do with the content (video properties, title, description etc).
It's a randomly generated string of fixed length and contains a very specific set of characters. No duplicates are allowed.
I suggest using a perfect hash function:
Perfect Hash Function for Human Readable Order Codes
As the accepted answer indicates, take a number, then apply a sequence of "bijective" (or reversible) operations on the number to get a hashed number.
The input numbers should be in sequence: 0, 1, 2, 3, and so on.
Typically you're hiding a numeric identifier in the form of something that doesn't look numeric. One simple method is something like base-36 encoding the number. You should be able to pull that off with one or another variant of itoa() in the language of your choice.
Just pick random values until you have one never seen before.
Randomly picking and exhausting all values from a set runs in expected time O(n log n): What is O value for naive random selection from finite set?
In your case you wouldn't exhaust the set, so you should get constant time picks. Just use a fast data structure to do the duplication lookups.

Resources