How reliable is it to use a 10-char hash to identify email addresses?
MailChimp has 10-character alphanumeric IDs for email addresses.
Ten characters at 4 bits each gives 40 bits, a bit over one trillion values. Maybe for an enterprise the size of MailChimp that gives reasonable headroom for a unique index space, and they have a single table with all the emails, indexed by a 40-bit number.
I'd love to use the same style of hash or coded ID in my own links. To decide whether to go with indexes or hashes, I need to estimate the probability of two valid email addresses producing the same 10-character hash.
Any hints on how to evaluate that for a custom hash function, other than brute-force testing?
You don't explicitly say what you mean by "reliable", but I presume you're trying to avoid collisions. As wildplasser says, for random identifiers it's all about the birthday paradox, and the chance of a collision in an identifier space with 2^n IDs reaches 50% when 2^(n/2) IDs are in use.
The Wikipedia page on Birthday Attacks has a great table illustrating probabilities for collisions under various parameters; for instance with 64 bits and a desired maximum collision probability of 1 in 1 million, you can have about 6 million identifiers.
Bear in mind that there are a lot more efficient ways to represent data in characters than hex; base64, for instance, gives you 3 bytes per 4 characters, meaning 10 characters gives you 60 bits, instead of 40 with hex.
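As a quick sanity check on those numbers, the standard birthday approximation p ≈ 1 − e^(−k(k−1)/2N) is easy to evaluate directly. A minimal Python sketch (the function names are mine, purely illustrative):

```python
import math

def collision_probability(k, bits):
    """Approximate birthday-bound probability of at least one collision
    among k uniformly random IDs drawn from a space of 2**bits values."""
    n = 2 ** bits
    return 1 - math.exp(-k * (k - 1) / (2 * n))

def ids_for_probability(p, bits):
    """Roughly how many IDs you can issue before the collision
    probability reaches p (inverting the approximation above)."""
    n = 2 ** bits
    return math.sqrt(2 * n * math.log(1 / (1 - p)))

# The figure from the Wikipedia table: ~6 million 64-bit IDs for p = 1e-6.
print(ids_for_probability(1e-6, 64))         # ~6.1e6
# 10 base64 characters = 60 bits.
print(ids_for_probability(1e-6, 60))         # ~1.5e6
# 40 bits (the hex interpretation) is much tighter.
print(collision_probability(1_000_000, 40))  # ~0.37 for a million IDs
```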
I have about 100 million simple key-value pairs (it's legacy data that never needs updating, and the keys are random strings), and I want to store them in Redis for querying.
My thought was to use the first four characters of each key as a hash key and store the pairs in Redis hash types, so there would be about a million hash keys in Redis, each with roughly 1000 sub-keys.
But things didn't go as planned: I found that some hash keys have only one sub-key, while others have more than 500,000 sub-keys, which may not be encoded in memory very efficiently.
So I'd like to know: is there a simple, understandable algorithm that can divide my 100 million strings evenly into 100 thousand buckets (as integers), so that when I pick up a string I can find its bucket by applying the same algorithm?
Thanks!
Using only a small portion of the string to compute the hash function can be a problem because your strings could, for example, all share the same prefix.
There are descriptions of string hash functions that take the entire string into account at http://www.javamex.com/tutorials/collections/hash_function_technical_2.shtml and in "Good Hash Function for Strings" (they actually give two different descriptions of the same function).
One way to look at this is that it regards the characters of a string as the coefficients A, B, C, ... of a polynomial of the form A + Bx + Cx^2 + Dx^3 + ..., where in this case x is 31 and the arithmetic is modulo 2^32. If x is well chosen, this is a scheme with which there is a lot of experience, and some maths applies that gives it good properties. Even better is to do the arithmetic modulo the size of the hash table, and to choose the size of the hash table to be a prime. If your data is static, it might be worth trying a few different primes of around your preferred table size and a few different values of x, and picking the combination that gives you the most evenly populated table.
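For illustration, here is a minimal Python sketch of that polynomial hash reduced into a prime number of buckets (the prime 99,991 and the function names are my choices, not part of the linked description):

```python
def poly_hash(s, x=31, mod=2**32):
    """Polynomial string hash: treats the characters of s as coefficients
    of a polynomial in x, evaluated with arithmetic modulo `mod`."""
    h = 0
    for ch in s:
        h = (h * x + ord(ch)) % mod
    return h

# 99,991 is a prime near the desired 100,000 buckets; a prime table size
# tends to spread the keys more evenly.
NUM_BUCKETS = 99991

def bucket_for(key):
    return poly_hash(key) % NUM_BUCKETS

print(bucket_for("some-legacy-key"))   # same key always lands in the same bucket
```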
I'm writing a compression algorithm (mostly for fun) in C, and I need to be able to store a list of numbers in binary. Each element of this list will be in the form of two digits, both under 10 (like (5,5), (3,6), (9,2)). I'll potentially be storing thousands of these pairs (one pair is made for each character in a string in my compression algorithm).
Obviously the simplest way to do this would be to concatenate each pair (-> 55, 36, 92) to make a 2-digit number (since they're just one digit each), then store each pair as a 7-bit number (since 99 is the highest). Unfortunately, this isn't so space-efficient (7 bits per pair).
Then I thought perhaps if I concatenate each pair, then concatenate that (553692), I'd be able to store that as a plain number in binary form (10000111001011011100, which for three pairs is already smaller than storing each number separately), and keep a count of the number of bits used for the binary number. The only problem is that this approach requires a bigint library and could be slow because of it. As the number gets bigger (+2 digits per character in the string), the memory usage and slowdown grow as well.
So here's my question: Is there a better storage-efficient way to store a list of numbers like I'm doing, or should I just go with the bignum or 7-bit approach?
The information-theoretic minimum for storing 100 different values is log2(100), which is about 6.644 bits. In other words, the possible compression from 7 bits is a hair more than 5% (log2(100) / 7 is 94.91%).
If these pairs are simply for temporary storage during the algorithm, then it's almost certainly not worth going to a lot of effort to save 5% of storage, even if you managed to do that.
If the pairs form part of your compressed output, then your compression cannot be great (a character is only eight bits, and presumably the pairs are additional to any compressed character data). Nonetheless, the easy compression technique is to store up to 6 pairs in 40 bits (5 bytes), which can be done without a bigint package assuming a 64-bit machine. (Alternatively, store up to 3 pairs in 20 bits and then pack two 20-bit sequences into five bytes.) That gives you 99.66% of the maximum compression for the values.
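Sketched in Python for brevity (the same arithmetic fits comfortably in a uint64_t in C); the function names are mine, and each group of up to six pairs is treated as a base-100 number that fits in 40 bits:

```python
def pack_pairs(pairs):
    """Pack digit pairs (a, b), each digit 0-9, six pairs per 40-bit group
    (5 bytes), by treating each group as a base-100 number."""
    out = bytearray()
    for i in range(0, len(pairs), 6):
        value = 0
        for a, b in pairs[i:i + 6]:
            value = value * 100 + (a * 10 + b)   # 100**6 < 2**40, so it fits
        out += value.to_bytes(5, "big")
    return bytes(out), len(pairs)   # keep the count to recover a short final group

def unpack_pairs(data, count):
    pairs = []
    for i in range(0, len(data), 5):
        value = int.from_bytes(data[i:i + 5], "big")
        group = []
        for _ in range(min(6, count - len(pairs))):
            group.append(divmod(value % 100, 10))   # (tens, units) back into a pair
            value //= 100
        pairs.extend(reversed(group))
    return pairs

packed, n = pack_pairs([(5, 5), (3, 6), (9, 2)])
print(unpack_pairs(packed, n))   # [(5, 5), (3, 6), (9, 2)]
```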
All of the above assumes that the 100 possible values are equally distributed. If the distribution is not even and it is possible to predict the frequencies, then you can use Huffman encoding to improve compression. Even so, I wouldn't recommend it for temporary storage.
A certain system is supposed to spawn objects with unique IDs. That system will run on different computers without any connection between them, yet no ID collision can happen. The only way to implement this is generating random numbers. How wide should those numbers be so you can consider it virtually impossible for a collision to ever happen?
This is basically a generalization of the birthday problem.
This probability table can help you figure out how many bits you are going to need in order to achieve the probability you desire, based on p (the desired collision probability) and the number of elements that are going to be "hashed" (generated).
In your question you mentioned:
The only way to implement this is generating random numbers
No, this is NOT the only way to do this. In fact this is one of the ways NOT to do it.
There is already a well known and widely used method for doing something like this that you yourself are using right now: adding a prefix (or postfix, doesn't matter). The prefix is called many things by many systems: Ethernet and WiFi call it vendor id. In TCP/IP it's called a subnet (technically it's called a "network").
The idea is simple. Say for example you want to use a 32 bit number for your global id. Reserve something like 8 bits to identify which system it's on and the rest can simply be sequential numbers within each system.
Stealing syntax from IPv4 for a moment: say system 1 has an id of 1 and system 2 has an id of 2. IDs from system 1 will then be in the range 1.0.0.0 - 1.255.255.255 and IDs from system 2 will be in the range 2.0.0.0 - 2.255.255.255.
That's just an example. There's nothing that forces you to waste so many bits for system id. In fact, IPv4 is itself no longer organized by byte boundaries. You can instead use 4 bits as system id and 28 bits for individual ids. You can also use 64 bits if you need more ids or go the IPv6 route and use 128 bits (in which case you can definitely afford to waste a byte or two for system id).
Because no system can generate an id that another system generates, no collision will ever occur before the ids overflow.
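A minimal sketch of the idea, using the 4-bit system id / 28-bit counter split mentioned above (constant and function names are mine):

```python
SYSTEM_ID_BITS = 4                       # 4-bit system id...
SEQUENCE_BITS = 32 - SYSTEM_ID_BITS      # ...plus a 28-bit per-system counter

def make_id(system_id, sequence):
    """Compose a globally unique 32-bit id from a system id and a
    per-system sequential counter."""
    assert 0 <= system_id < (1 << SYSTEM_ID_BITS)
    assert 0 <= sequence < (1 << SEQUENCE_BITS)
    return (system_id << SEQUENCE_BITS) | sequence

def split_id(global_id):
    """Recover (system_id, sequence) from a composed id."""
    return global_id >> SEQUENCE_BITS, global_id & ((1 << SEQUENCE_BITS) - 1)

print(hex(make_id(2, 1000)))      # the 1000th object spawned on system 2
print(split_id(make_id(2, 1000))) # (2, 1000)
```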
If you need the ids to look "random", use a hashing algorithm. CRC is a linear, invertible transform for fixed-size inputs no wider than the checksum, so CRC32 will never map two distinct 32-bit ids to the same value; a cryptographic hash such as SHA-1 (160 bits) has no known collisions for inputs shorter than its output, although that is not a mathematical guarantee. The caveat is that you must use all of the output bits: truncating the hash will cause collisions. For 32-bit ids CRC32 is a perfect fit, and there's also CRC64 if you want to generate 64-bit ids.
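For the CRC approach, something like this (a rough Python sketch; the function name is mine) turns a sequential 32-bit id into a scrambled-looking one with no risk of collision:

```python
import zlib

def obscure_id(sequential_id):
    """Map a 32-bit sequential id to a 'random-looking' 32-bit value.
    CRC32 over 4-byte inputs is invertible, so distinct ids never collide."""
    return zlib.crc32(sequential_id.to_bytes(4, "big"))

# Consecutive ids come out looking unrelated, but the mapping is 1-to-1.
print(hex(obscure_id(1)), hex(obscure_id(2)))
```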
GUIDs use 128 bits (a space of 2^128 values) and the likelihood of collision is negligible.
I want to do something similar to what YouTube does. For example, this is a valid YouTube video ID: didzxUkrtS0
Right now I am storing users' IDs as integer numbers, and I want to translate those numbers into an 8-character alphanumeric identifier. For example: FZ3EY1IC (not hexadecimal).
I already know that I could implement it with MD5 and then take the first 8 hex digits, but that doesn't cover the entire alphabet.
What should I do to create a unique pattern using integers, that should never repeat?
Make your integer 5 8-bit bytes long (by adding a byte with a random value if your integer is 32-bit); that's 40 bits of data.
Cryptographically encrypt the 5 bytes of your integer using some key, which you probably want to keep private.
Slice the 40 encrypted bits into 8 5-bit parts. Encode each part using 32 alphanumeric characters. You may even choose a different set of 32 (out of the 36 total) characters for each part.
The reverse operation is trivial.
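A rough sketch of the slicing and encoding steps in Python (the particular 32-character alphabet is my choice, and the encryption step described above is assumed to have already been applied to the 40-bit value):

```python
# A 32-character subset of the 36 alphanumerics (4 characters are simply dropped).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"

def encode40(value):
    """Encode a 40-bit integer as 8 characters, 5 bits per character."""
    assert 0 <= value < (1 << 40)
    return "".join(ALPHABET[(value >> shift) & 0x1F]
                   for shift in range(35, -1, -5))

def decode40(text):
    """Reverse operation: 8 characters back to the 40-bit integer."""
    value = 0
    for ch in text:
        value = (value << 5) | ALPHABET.index(ch)
    return value

encoded = encode40(123456789)
print(encoded, decode40(encoded) == 123456789)   # 8 characters, True
```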
I'm pretty sure what sites like YouTube and Bitly do is store a big table in the database that translates the alphanumeric identifier for each link into the internal ID of whatever it refers to (either that or it's stored in the same row). When a new identifier is needed, they compute a random one and store it. The reason you need to do this is so that an attacker cannot predict the ID of the next piece of content to be added.
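A minimal sketch of that approach, with an in-memory set standing in for the database table (the names are illustrative):

```python
import secrets

ALPHANUMERIC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
taken = set()   # stand-in for the table mapping public id -> internal id

def new_public_id(length=8):
    """Generate a random public identifier and retry on the (rare) collision,
    mimicking the 'compute a random one and store it' approach."""
    while True:
        candidate = "".join(secrets.choice(ALPHANUMERIC) for _ in range(length))
        if candidate not in taken:
            taken.add(candidate)
            return candidate

print(new_public_id())   # e.g. an unpredictable 8-character id
```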
The problem seems simple at first: just assign an id and represent that in binary.
The issue arises because the user is capable of changing any number of 0 bits to 1 bits. To clarify, the hash could go from 0011 to 0111 or 1111, but never to 1010. Each bit has an equal chance of being changed, independently of the other changes.
What would you have to store in order to go from hash -> user, assuming a low percentage of bit tampering by the user? I also accept failure in some cases, so the correct solution just needs an acceptable error rate.
I would estimate the maximum number of bits tampered with to be about 30% of the total set.
I guess the acceptable error rate would depend on the number of hashes needed and the number of bits being set per hash.
I'm worried that with enough manipulation the ID cannot be reconstructed from the hash. I guess the question I'm asking is: what safeguards or encoding schemes can I use to make sure it still can be?
Your question isn't entirely clear to me.
Are you saying that you want to validate a user based on a hash of the user ID, but are concerned that the user might change some of the bits in the hash?
If that is the question, then as long as you are using a proven hash algorithm (such as MD5), there is very low risk of a user manipulating the bits of their hash to get another user's ID.
If that's not what you are after, could you clarify your question?
EDIT
After reading your clarification, it looks like you might be after Forward Error Correction, a family of algorithms that allow you to reconstruct altered data.
Essentially, the simplest form of FEC is a repetition code: you encode each bit as a series of 3 bits and apply the "majority wins" principle when decoding. When encoding, you represent "1" as "111" and "0" as "000". When decoding, if most of the encoded 3 bits are zero, you decode that to mean zero; if most of the encoded 3 bits are 1, you decode that to mean 1.
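A minimal sketch of that repetition code in Python (the function names are mine):

```python
def fec_encode(bits):
    """Triple-repetition encoding: each bit becomes three copies of itself."""
    return [b for bit in bits for b in (bit, bit, bit)]

def fec_decode(encoded):
    """Majority-wins decoding of the triple-repetition code."""
    return [1 if sum(encoded[i:i + 3]) >= 2 else 0
            for i in range(0, len(encoded), 3)]

original = [1, 0, 1, 1]
sent = fec_encode(original)          # [1,1,1, 0,0,0, 1,1,1, 1,1,1]
sent[4] = 1                          # a single 0 flipped to 1, as in the question
print(fec_decode(sent) == original)  # True -- the majority still wins
```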
Assign each user an ID with the same number of bits set.
This way you can detect immediately if any tampering has occurred. If you additionally make the Hamming distance between any two IDs at least 2n, then you'll be able to reconstruct the original ID whenever fewer than n bits have been set.
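As a sketch of the detection part: with constant-weight IDs, a single popcount is enough to notice that something was set (the weight of 8 here is just an example):

```python
ID_WEIGHT = 8   # example: every legitimate ID is issued with exactly 8 bits set

def is_tampered(received_id):
    """Constant-weight check: any 0->1 flip raises the popcount,
    so tampering is detectable by counting the set bits."""
    return bin(received_id).count("1") != ID_WEIGHT

legit = 0b0000111100001111            # 8 bits set
print(is_tampered(legit))             # False
print(is_tampered(legit | (1 << 4)))  # True -- an extra bit was set
```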
So you're trying to assign a "unique id" that will still remain a unique id even if it's changed to something else?
If the only "tampering" is changing 0's to 1's (but not vice versa), which seems fairly contrived, then you could get an effective ID by assigning each user a particular bit position, setting that bit to zero in that user's ID and to one in every other user's ID.
Thus any fiddling by the user will result in corrupting their own id, but not allow impersonation of anyone else.
The distance between two IDs (the number of bits you have to change to get from one word to the other) is called the Hamming distance. Error-correcting codes can correct errors affecting up to half this distance and still give you the original word. If you assume that 30% of the bits can be tampered with, the distance between any two valid IDs should be at least 60% of the bits, which leaves roughly 40% of the bit budget for distinguishing IDs. As long as you generate no more IDs than that budget allows for a given number of bits (counting the error-correcting part), you should be able to recover the original ID.
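A brute-force sketch of that recovery step in Python (the ID width, count, and function names are illustrative; a real error-correcting code would decode far more efficiently):

```python
import random

def hamming(a, b):
    """Number of bit positions in which two IDs differ."""
    return bin(a ^ b).count("1")

def recover_id(received, valid_ids):
    """Decode a (possibly tampered) value to the nearest issued ID by
    Hamming distance -- a brute-force stand-in for a real decoder."""
    return min(valid_ids, key=lambda vid: hamming(vid, received))

# Toy demo with 32-bit IDs; random IDs are typically far apart in Hamming distance.
random.seed(1)
issued = [random.getrandbits(32) for _ in range(16)]
original = issued[3]
tampered = original | (1 << 5) | (1 << 20)        # a couple of 0->1 flips
print(recover_id(tampered, issued) == original)   # True while the IDs stay far apart
```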