String to unique int algorithm

We are trying to implement the following case. We have an invoice table with a column that holds an email address. We want to generate a unique int value from this email address and store it in a separate column. This will be used as an FK and indexed. So what I am looking for is an algorithm for generating ints from strings (please note that the same email string should always produce the same int, so that each email address has a unique int representation). We can use a bigint as well.

The simplest solution is to put the email address into its own table along with an identity/auto_increment type column. Then you can simply carry around that identity field (a standard int), and you don't run into any issues with potential hash collisions or hashing overhead.
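A minimal sketch of that lookup-table idea, using Python's built-in sqlite3 module as a stand-in for whatever RDBMS you actually run. The table name email, the column names, and the helper get_email_id are made up for illustration, and lowercasing the address is an assumption about how you want to treat case.

```python
import sqlite3

def get_email_id(conn: sqlite3.Connection, email: str) -> int:
    """Return the integer id for an email, creating a row the first time it is seen."""
    normalized = email.strip().lower()  # assumption: addresses compare case-insensitively
    # INSERT OR IGNORE leaves an existing row (and its id) untouched.
    conn.execute("INSERT OR IGNORE INTO email (address) VALUES (?)", (normalized,))
    row = conn.execute(
        "SELECT id FROM email WHERE address = ?", (normalized,)
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " address TEXT NOT NULL UNIQUE)"
)

first = get_email_id(conn, "user@example.com")
again = get_email_id(conn, "USER@example.com")     # same address, same id
other = get_email_id(conn, "someone@example.com")  # new address, new id
```

The invoice table would then store the returned id as its foreign key instead of the address itself.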

It seems a simple hashcode (MD5, SHA1, ...) should fit your needs; depending on your RDBMS, you might be able to use built-in packages (e.g. Oracle's dbms_crypto) or have to compute them externally.
Some things to keep in mind:
convert everything to lower/uppercase before computing the hashcode (so USER@DOMAIN.COM gets the same hashcode as user@domain.com)
apparently, you have a denormalized schema. It would make more sense to have a separate customer table containing the e-mail address; invoice should then contain only a foreign key customer_fk
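If you compute the hash externally, it could look roughly like this in Python. This is only a sketch: the function name email_to_int64 is made up, and truncating the digest to 64 bits so that it fits a bigint column is my assumption. A truncated hash makes collisions very unlikely, but it cannot guarantee uniqueness.

```python
import hashlib

def email_to_int64(email: str) -> int:
    """Map an email address to a signed 64-bit integer derived from its SHA-1 digest."""
    normalized = email.strip().lower()  # same hash regardless of case
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()
    # Keep only the first 8 bytes so the value fits a signed bigint column.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

same_a = email_to_int64("USER@DOMAIN.COM")
same_b = email_to_int64("user@domain.com")
```

Both calls return the same value because the address is lowercased before hashing.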

MD5 - gives you a 128-bit integer. (Admittedly, this is bigger than the int datatype in most languages, but you won't get anywhere near guaranteed uniqueness with just 32 bits.)

I don't know if you can get away with a 64-bit int: the max length of an email address is 254 characters and, in this case where you need to preserve the uniqueness of each, hashing will not do it.
So it seems you are stuck with having to get over this 254-character hurdle. My approach (always the brute force approach for me) would be to take the alphabet of allowable characters in an email address, map those to 6-bit values, and use the map to pack them into a series of words.
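A rough sketch of that packing approach in Python, whose arbitrary-size ints stand in for the "series of words". The alphabet below is a deliberately reduced, hypothetical one (lowercase letters, digits and a few punctuation marks); the real set of characters allowed in an email address is larger, so treat it as an assumption. Code 0 is reserved so that a leading character is never lost when unpacking.

```python
# Hypothetical reduced alphabet; real addresses allow more characters than this.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789@._-+"
CODE = {ch: index + 1 for index, ch in enumerate(ALPHABET)}  # codes 1..41 fit in 6 bits
SYMBOL = {code: ch for ch, code in CODE.items()}

def pack(email: str) -> int:
    """Pack each character into 6 bits of one big integer (reversible, collision free)."""
    value = 0
    for ch in email.lower():
        value = (value << 6) | CODE[ch]  # raises KeyError for characters outside ALPHABET
    return value

def unpack(value: int) -> str:
    """Recover the address from the packed integer."""
    chars = []
    while value:
        chars.append(SYMBOL[value & 0x3F])
        value >>= 6
    return "".join(reversed(chars))

packed = pack("user@domain.com")
restored = unpack(packed)
```

A 254-character address needs 254 x 6 = 1524 bits, i.e. 24 64-bit words, so the result is unique but nowhere near a single int or bigint.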
Take a look at rfc3696 which deals with email addresses in a way that's actually comprehensible.
Sorry to be of so little help.

Related

How can I generate a unique identifier that is apparently not progressive [duplicate]

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question: they were simply too big and too difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
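For readers who want to see the shape of this scheme, here is a sketch in Python. It is not our original code: the 16/32/16 bit split, the function names, and the exact 54-character alphabet (A-Z, a-z, 2-9 without I, L, O, i, l, o) are assumptions chosen for illustration.

```python
import secrets

# Hypothetical 54-symbol alphabet: A-Z, a-z, 2-9 minus I, L, O, i, l, o.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"

def base54_encode(value: int) -> str:
    """Encode a non-negative integer using ALPHABET, most significant digit first."""
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value:
        value, remainder = divmod(value, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def base54_decode(code: str) -> int:
    """Inverse of base54_encode."""
    value = 0
    for ch in code:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value

def make_code(sequential_id: int) -> str:
    """Place a 32-bit sequential id between 16 random high bits and 16 random low bits."""
    if not 0 <= sequential_id < 2**32:
        raise ValueError("sequential_id must fit in 32 bits")
    high = secrets.randbits(16)
    low = secrets.randbits(16)
    return base54_encode((high << 48) | (sequential_id << 16) | low)

def extract_id(code: str) -> int:
    """Recover the sequential id from the center bits of a code."""
    return (base54_decode(code) >> 16) & 0xFFFFFFFF

code = make_code(123456)
recovered = extract_id(code)
```

Because the sequential id sits in the middle untouched, uniqueness comes from the database sequence and the random bits only disguise it.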
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list and concatenate them and add a random 3 digits number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
the case needn't be camel casing, you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to make sure that my algorithms should never collide. If the collision rate was high, then I'd play with the parameters (amount of nouns used, amount of verbs used, length of random number, total number of words, different kinds of casings etc.)
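Something like the following Python sketch shows the generator and the collision test together. The word lists are tiny placeholders I made up, not a real frequency-filtered dictionary, and the function names are hypothetical.

```python
import secrets

# Placeholder word lists; a real list would be filtered from a dictionary as described above.
VERBS = ["eat", "pick", "ride", "paint", "carry"]
NOUNS = ["Cake", "Basket", "Flyer", "Window", "Garden"]

def make_code() -> str:
    """Concatenate a random verb, a random noun and a random 3-digit number."""
    verb = secrets.choice(VERBS)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(900) + 100  # 100..999
    return f"{verb}{noun}{number}"

def collision_rate(samples: int) -> float:
    """Generate `samples` codes and return the fraction that repeated an earlier code."""
    seen = set()
    collisions = 0
    for _ in range(samples):
        code = make_code()
        if code in seen:
            collisions += 1
        seen.add(code)
    return collisions / samples

sample_code = make_code()
rate = collision_rate(1000)
```

With these placeholder lists there are only 5 x 5 x 900 = 22,500 possible codes, so the measured rate is noticeably above zero; that is the signal to enlarge the word lists or the number range.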
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes);
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use Base64-encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.

Hash code, why is string and numeric key not good for memory address

In the data structure lecture (and still happening right now), our lecturer explained that hash codes are useful for memory addresses.
That made sense, but then he added "except for numeric and string keys – Why?"
I thought the reason was that we could then no longer apply hash functions, but according to him that is not true, since we can either implement a different hash function for strings or use the integer representation of the memory address.
He claimed the reason is that strings are arrays, and numeric values can be array types as well, so applying the hash function would only allocate part of the characters to the 'bucket array'.
The thing is, our lecturer isn't the person who wrote the lecture notes (he uses the previous lecturer's notes from last year), and I don't think what he said today is correct. Can someone enlighten me on this, please?
These lecture notes you refer to come directly from Goodrich & Tamassia’s book (both the Algorithm design one and Data Structures). It discusses a variety of hash code functions - such as using the memory address of the object, using an integer cast, component sum, or polynomial accumulation. It notes that using the memory address of an object is “good in general, except for numeric and string keys”.
There are times when the hash code that maps an object to an integer based on its memory address is sufficient, even if that object is a string. However, two objects with equal value (a=‘hello’, b=‘hello’) would not have the same hash code using this method, since they have different memory addresses. The same applies for other objects such as numeric keys (a=10, b=10 are equal in value but not in memory address).
Consider a simple system which stores a password for a user as a hash code. If the user enters in a password, the string they entered in is hashed according to the same hash function and compared against the one which is stored. These two passwords have the same value (the one the user first created and the one they use to log in), so they should produce the same hash to successfully log in. Therefore, we would not want to use the memory address to map the string to an integer in this scenario.
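To make the difference concrete, here is a small Python sketch. The class AddressKey is a made-up wrapper that keeps Python's default identity-based equality and hashing (in CPython the default hash is derived from the object's address), and polynomial_hash is a simple polynomial accumulation over the characters; the base 33 and the 32-bit modulus are arbitrary choices of mine.

```python
def polynomial_hash(text: str, base: int = 33, modulus: int = 2**32) -> int:
    """Value-based hash: polynomial accumulation over the characters of the string."""
    h = 0
    for ch in text:
        h = (h * base + ord(ch)) % modulus
    return h

class AddressKey:
    """Wrapper with the default identity-based __eq__ and __hash__."""
    def __init__(self, value: str):
        self.value = value

table = {AddressKey("hello"): "found"}

# A second object with an equal value is a different object, so the lookup misses.
lookup_by_address = AddressKey("hello") in table

# Hashing by value gives equal strings the same hash code.
lookup_by_value = polynomial_hash("hello") == polynomial_hash("hello")
```

Here lookup_by_address is False and lookup_by_value is True, which is the reason address-based hash codes are a poor fit for string and numeric keys.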

Is r.uuid() guaranteed to be unique?

Is r.uuid() guaranteed to be unique?
Return a UUID (universally unique identifier), a string that can be used as a unique ID.
How universal is r.uuid()? Is it scoped to a table/database/instance of RethinkDB? Or is it simply computing the hash of a random byte sequence (e.g. /dev/rand)? Or does it hash nano-unix time?
You can check the answers to a related question in here.
UUIDs are supposed to be unique because the probability of a collision is very low. In theory they may not be unique, since the algorithm that generates them is random, but in practice you will hardly ever generate a duplicate.
According to Wikipedia, for 68,719,476,736 generated UUIDs (a very large number for a typical application) the probability of an accidental clash is 0.0000000000000004. It's almost impossible.
UUID means universally unique identifier. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size, it is possible for two differing items to share the same identifier. This is a form of hash collision.
Anyone can create a UUID and use it to identify something with reasonable confidence that the same identifier will never be unintentionally created by anyone to identify something else.
A UUID is simply a 128-bit value.
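The collision figure quoted above can be checked with the usual birthday approximation. The sketch below uses Python's uuid.uuid4() purely as a stand-in and assumes random (version 4) UUIDs with 122 random bits; it says nothing about how r.uuid() is implemented internally.

```python
import uuid

RANDOM_BITS = 122  # version-4 UUID: 128 bits minus 6 fixed version/variant bits

def collision_probability(count: int) -> float:
    """Birthday approximation p ~= n^2 / (2 * 2^122), valid while p is small."""
    return count * count / (2 * 2**RANDOM_BITS)

example = uuid.uuid4()                       # a random 128-bit identifier
p = collision_probability(68_719_476_736)    # 2^36 generated UUIDs
```

For 2^36 UUIDs this gives 2^-51, about 4.4e-16, which matches the 0.0000000000000004 figure.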

How can you hash an email address into a zero or one with relatively even distribution?

This may be a very stupid question - apologies in advance.
I'm wondering if it's possible to generate a random number from an email address. I'm imagining something similar to how you can generate an md5 hash of an email address (or pretty much any string for that matter).
So basically such a function would allow you to generate the same random number from the same email address every time you ran it.
The application that I have in mind is to slot email addresses into an A/B test randomly. Normally the way that you would implement such a thing would be to just generate a random number for each email address and store that along with the email address in order to tag a given email as belonging to A or B.
The nice thing about a function that could generate a random number from an email is that you wouldn't have to store that association anywhere. You could run it on the fly to determine at any given time which bucket the email should fall into.
UPDATE: What I'm looking for is a hash, not a random number. So it's just a matter of figuring out how to go from something like an MD5 hash to an integer with a value of 0 or 1.
UPDATE 2: Thanks for the answers and nudging me in the right direction. So one solution in MYSQL is simply:
ASCII(SUBSTR(MD5(CONCAT(customer_email, 'salt')), 1, 1)) % 2
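For comparison, the same calculation outside the database could look like this in Python. It mirrors the MySQL expression step by step; the function name ab_bucket is made up, and 'salt' is just the placeholder value from the query.

```python
import hashlib

def ab_bucket(email: str, salt: str = "salt") -> int:
    """Return 0 or 1 from the parity of the ASCII code of the first MD5 hex digit."""
    digest_hex = hashlib.md5((email + salt).encode("utf-8")).hexdigest()
    return ord(digest_hex[0]) % 2

bucket = ab_bucket("customer@example.com")
same_bucket = ab_bucket("customer@example.com")  # always equal to bucket
```

Of the sixteen possible hex digits, eight have an even ASCII code and eight an odd one, so the two buckets should fill about evenly.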
Yes, a hash by definition does this (or appears to): it creates a somewhat random value from a given string. But note that it's not really random. To deal with this we use a salted hash, which is a hash computed with a random number appended to the input; you then store the random number alongside the salted hash. It will give you the same result every time, as long as you retrieve the random number that was stored with the email.
When the generated random number is the same every time, it is no longer a random number. You could use the ASCII codes of the characters in the email to build your number, for example by adding them up. But there is a catch here: abc@xyz.com will then give the same result as cba@xyz.com, so you have to take care of this somehow. Things become more complex if special characters such as _ or a dot (.) are used. Why can't we use the email itself as the key?

Hashing and encryption technique for a huge data set containing phone numbers

Description of problem:
I'm working with a highly sensitive data set that contains people's phone numbers as one of the columns. I need to apply an encryption or hash function to convert them to encoded values and then do my analysis. It can be a one-way hash, i.e., after processing the encrypted data we won't be converting it back to the original phone numbers. Essentially, I am looking for an anonymizer that takes phone numbers and converts them to some random value on which I can do my processing. Please suggest the best way to go about this. Recommendations on the best algorithms to use are welcome.
Update: size of the dataset
My dataset is really huge in the size of hundreds of GB.
Update: Sensitive
By sensitive, I meant that the phone numbers should not be part of our analysis. So, basically I would need a one-way hashing function but without redundancy - each phone number should map to a unique value - two phone numbers should not map to the same value.
Update: Implementation ?
Thanks for your answers. I am looking for a more elaborate implementation. I was going through Python's hashlib library for hashing. Does it perform the same set of steps that you suggested? Here is the link
Can you give me some example code to achieve the process, preferably in Python?
Generate a key for your data set (16 or 32 bytes) and keep it secret. Use HMAC-SHA1 on your data with this key, then Base64 encode the result, and you have a random-looking unique string per phone number that isn't reversible (without the key).
Example (HMAC-SHA1 with a 256-bit key) using Keyczar:
Create random secret key:
$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary
Anonymize phone number:
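from keyczar import keyczar
def anonymize(phone_num):
    # Load the secret key set created with keyczart above.
    signer = keyczar.Signer.Read("path_to_key_set")
    # Sign the phone number with the secret key and return the signature.
    return signer.Sign(phone_num)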
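If you would rather stay with the standard library that the question mentions, roughly the same thing can be sketched with hmac and hashlib. This is a sketch under assumptions: the helper names are made up, the key is generated in memory here (in practice you would store it somewhere safe, or destroy it once the analysis is done), and phone numbers are hashed exactly as given, without normalizing their format.

```python
import base64
import hashlib
import hmac
import secrets

def make_key(num_bytes: int = 32) -> bytes:
    """Generate a random secret key (32 bytes = 256 bits)."""
    return secrets.token_bytes(num_bytes)

def anonymize(phone_number: str, key: bytes) -> str:
    """Return the Base64-encoded HMAC-SHA1 of the phone number under the secret key."""
    mac = hmac.new(key, phone_number.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(mac).decode("ascii")

key = make_key()
token_a = anonymize("+15550100", key)
token_b = anonymize("+15550101", key)
```

The same number always maps to the same token under the same key, so joins and group-bys on the anonymized column still work.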
If you're going to use cryptography, you want to apply a pseudorandom function to each phone number and throw away the key. Collision-resistant hashes such as SHA-256 do not provide the right security guarantees. Really, though, are there that many different phone numbers that you can't just construct incrementally a map representing an actually random function?
sort your data by the respective column and start counting distinct values ... replace the actual values with their respective counter value ... collision free ... one way ...
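In Python the counting idea might look like the sketch below. It assumes the distinct phone numbers fit in memory, which may not hold for hundreds of GB without doing the sort and distinct step in the database or in an external sort; the function name is hypothetical. Note also that counters assigned in sorted order preserve the ordering of the original numbers.

```python
def counter_mapping(values):
    """Map each distinct value to its 1-based position in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)), start=1)}

phones = ["555-0102", "555-0101", "555-0102", "555-0103"]
mapping = counter_mapping(phones)
anonymized = [mapping[phone] for phone in phones]  # [2, 1, 2, 3]
```

Once the mapping has been applied and discarded, the counters cannot be turned back into phone numbers, and two different numbers never share a counter.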
"So, basically I would need a one-way hashing function but without redundancy - each phone number should map to a unique value - two phone numbers should not map to the same value."
This screams for a solution based on a cryptographic hash function. MD5 and SHA-1 are the best known examples, and work wonderfully for this. You will read that "MD5 has been cracked", but for your purpose that doesn't matter.

Resources