How can you hash an email address into a zero or one with relatively even distribution? - random

This may be a very stupid question - apologies in advance.
I'm wondering if it's possible to generate a random number from an email address. I'm imagining something similar to how you can generate an md5 hash of an email address (or pretty much any string for that matter).
So basically such a function would allow you to generate the same random number from the same email address every time you ran it.
The application that I have in mind is to slot email addresses into an A/B test randomly. Normally the way that you would implement such a thing would be to just generate a random number for each email address and store that along with the email address in order to tag a given email as belonging to A or B.
The nice thing about a function that could generate a random number from an email is that you wouldn't have to store that association anywhere. You could run it on the fly to determine at any given time which bucket the email should fall into.
UPDATE: What I'm looking for is a hash, not a random number. So it's just a matter of figuring out how to go from something like an MD5 hash to an integer with a value of 0 or 1.
UPDATE 2: Thanks for the answers and for nudging me in the right direction. One solution in MySQL is simply:
ASCII(SUBSTR(MD5(CONCAT(customer_email, 'salt')), 1, 1)) % 2
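For comparison, here is a minimal Python sketch of the same idea. It assumes the bucket is taken from the lowest bit of the MD5 digest rather than from the ASCII code of the first hex character, so it will not necessarily assign the same bucket as the MySQL expression above. The function name `ab_bucket`, the default salt, and the lower-casing step are illustrative assumptions, not part of the question.

```python
import hashlib


def ab_bucket(email: str, salt: str = "salt") -> int:
    """Map an email address to 0 or 1, the same way on every call."""
    # Assumption: trim and lower-case so differently typed copies of one
    # address land in the same bucket.
    data = (email.strip().lower() + salt).encode("utf-8")
    digest = hashlib.md5(data).digest()
    # Use the lowest bit of the last digest byte as the bucket.
    return digest[-1] & 1
```

Changing the salt reshuffles the assignment, which may be useful when a new test should not reuse the split from an earlier one.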

Yes, a hash by definition does this (or appears to): it creates a somewhat random value from a given string. Note that it's not really random, though. To deal with this we use a salted hash: append a random number (the salt) to the input before hashing, then store the salted hash together with that random number. It will give you the same result every time, as long as you retrieve the random number that was stored with the email.

When the generated number is the same every time, it is no longer a random number. You could build the number from the ASCII codes of the characters in the email, for example by summing them. But there is a catch: abc#xyz.com would then give the same result as cba#xyz.com, so you have to take care of this somehow. Things get more complex when special characters such as _ or a dot (.) are involved. Why can't we use the email itself as the key?

Related

How to select nth random integer from a range of integers without repetition or storage? [duplicate]

Let's say my system needs to provide a unique integer id regularly, between 1 and 10^20, from a function like --
function getNextRandomUniqueId(index:BigInt, min:BigInt, max:BigInt, seed:BigInt): BigInt { ? }
id = getNextRandomUniqueId(index=42, min=1, max=10^20, seed=0)
These ids need to be provided in random order as the index increases, not sequentially. Once an id has been provided, it cannot be provided again, as long as the index increases. My system cannot store a random list of all the numbers to be issued, or all the numbers issued, there's too many. I also don't want to rely on something like a random UUID, which is exceedingly unlikely to have a collision, but not guaranteed to.
How can this be done? To have a deterministic mathematical way to iterate randomly through a set of sequential integers without repetition and without storage?
EDIT: Fixed 1^20 to 10^20
This can be done, assuming you are allowed to store an encryption key and counter. Encryption is a one-to-one mapping so by encrypting all the numbers in a given range you will get back all those same numbers in a randomized order. Different keys will give a different order. Encrypt the numbers 0, 1, 2, 3, ... in order, using the key and keeping track of how far you have got.
Depending on the range of numbers, you may need to use some form of Format Preserving encryption to keep the outputs within the required range.
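One way to sketch this in Python, assuming a balanced Feistel network with HMAC-SHA256 as the round function, plus cycle-walking to keep the output inside the range. This illustrates the idea only and is not a vetted format-preserving encryption scheme; the names `unique_id` and `_feistel` and the choice of 4 rounds are assumptions, and the index here is 0-based.

```python
import hashlib
import hmac


def _round(key: bytes, rnd: int, value: int, half_bits: int) -> int:
    """Keyed round function: HMAC of (round number, half-block), truncated."""
    msg = bytes([rnd]) + value.to_bytes((half_bits + 7) // 8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") & ((1 << half_bits) - 1)


def _feistel(key: bytes, x: int, half_bits: int, rounds: int = 4) -> int:
    """Permute the integers in [0, 2**(2 * half_bits)) one-to-one."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right, half_bits)
    return (left << half_bits) | right


def unique_id(index: int, lo: int, hi: int, key: bytes) -> int:
    """Return the index-th id of a keyed shuffle of lo..hi (inclusive)."""
    size = hi - lo + 1
    if not 0 <= index < size:
        raise ValueError("index out of range")
    half_bits = max(1, ((size - 1).bit_length() + 1) // 2)
    x = _feistel(key, index, half_bits)
    # Cycle-walking: re-encrypt until the value falls back inside the range.
    while x >= size:
        x = _feistel(key, x, half_bits)
    return lo + x
```

For example, `unique_id(42, 1, 10**20, b"my secret key")` always returns the same value for the same key, and distinct indexes give distinct ids because every step is a one-to-one mapping. Only the key and the current index need to be stored.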
You cannot guarantee that the same id does not also appear in the sequence generated from another seed.
Most languages seed the generator from the current time when you do not provide a seed yourself. You have set your seed to zero, so each time you restart your program you will get the same ids. This is most likely not your intent :-)
But even if you change that, there is still a chance of hitting the same id:
1 in 100,000,000,000,000,000,000.
The reason you can get the same id is that it is RANDOM.
I would go with a GUID:
1 in 340,280,000,000,000,000,000,000,000,000,000,000,000.

How can I generate a unique identifier that is apparently not progressive [duplicate]

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and too difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
I was never satisfied with this particular implementation. What would you guys have done?
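The scheme described in the question can be sketched roughly as follows in Python. The exact 54-character table is an assumption (A-Z and a-z without I, L, O and their lower-case forms, plus 2-9), as are the 16/32/16 bit split and the function names; the original implementation may have differed.

```python
import secrets

# Assumed lookup table: 23 upper-case + 23 lower-case letters + 8 digits = 54.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"
BASE = len(ALPHABET)


def encode(value: int) -> str:
    """Write a non-negative integer in base 54 using ALPHABET."""
    digits = []
    while True:
        value, rem = divmod(value, BASE)
        digits.append(ALPHABET[rem])
        if value == 0:
            break
    return "".join(reversed(digits))


def decode(code: str) -> int:
    """Inverse of encode()."""
    value = 0
    for ch in code:
        value = value * BASE + ALPHABET.index(ch)
    return value


def make_code(seq_id: int) -> str:
    """Place a sequential 32-bit id between 16 random high and low bits."""
    if not 0 <= seq_id < 2**32:
        raise ValueError("seq_id must fit in 32 bits")
    high = secrets.randbits(16)
    low = secrets.randbits(16)
    return encode((high << 48) | (seq_id << 16) | low)


def extract_id(code: str) -> int:
    """Recover the sequential id from the center bits of a code."""
    return (decode(code) >> 16) & 0xFFFFFFFF
```

Because the sequential id can be read back out with `extract_id`, uniqueness comes from the database sequence and the random bits only disguise it.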
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
The casing needn't be camel case; you can randomize that as well. You can also randomize the placement of the number and of the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to check how often my algorithm produces collisions. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.)
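A small Python sketch of this word-based approach. The two word lists are tiny placeholders standing in for a filtered dictionary, and `word_code` and `keyspace_size` are hypothetical names.

```python
import secrets

# Placeholder lists; a real list would come from a filtered dictionary.
VERBS = ["eat", "pick", "ride", "find", "make"]
NOUNS = ["cake", "basket", "flyer", "garden", "river"]


def word_code() -> str:
    """Build a code such as eatCake778: verb + Noun + three digits."""
    verb = secrets.choice(VERBS)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(1000)
    return f"{verb}{noun.capitalize()}{number:03d}"


def keyspace_size() -> int:
    """Number of distinct codes this generator can produce."""
    return len(VERBS) * len(NOUNS) * 1000
```

The size of the keyspace drives the collision rate: with N possible codes, duplicates become likely once roughly the square root of N codes have been issued, so each generated code still needs a uniqueness check against the ones already handed out.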
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes);
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If, though, you're looking for a way to encode a random code in a URL, which is an issue I've dealt with for a while, then what I have done is use base64-encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.

Hashing and encryption technique for a huge data set containing phone numbers

Description of problem:
I'm working with a highly sensitive data set that contains people's phone numbers as one of the columns. I need to apply an encryption or hash function to convert them to encoded values before doing my analysis. It can be a one-way hash, i.e., after processing the encrypted data we won't be converting it back to the original phone numbers. Essentially, I am looking for an anonymizer that takes phone numbers and converts them to some random value on which I can do my processing. Please suggest the best way to go about this. Recommendations on the best algorithms to use are welcome.
Update: size of the dataset
My dataset is really huge: hundreds of GB in size.
Update: Sensitive
By sensitive, I meant that the phone numbers should not be a part of our analysis. So, basically, I need a one-way hashing function but without redundancy: each phone number should map to a unique value, and two phone numbers should not map to the same value.
Update: Implementation ?
Thanks for your answers. I am looking for a more detailed implementation. I was going through Python's hashlib library for hashing. Does it necessarily do the same set of steps that you suggested? Here is the link
Can you give me some example code to achieve this, preferably in Python?
Generate a key for your data set (16 or 32 bytes) and keep it secret. Use HMAC-SHA1 on your data with this key, base64-encode the result, and you have a random-looking unique string per phone number that isn't reversible (without the key).
Example (HMAC-SHA1 with 256-bit key) using Keyczar:
Create random secret key:
$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary
Anonymize phone number:
from keyczar import keyczar

def anonymize(phone_num):
    signer = keyczar.Signer.Read("path_to_key_set")
    return signer.Sign(phone_num)
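A roughly equivalent sketch using only the Python standard library, since the question asks about hashlib. It assumes HMAC-SHA256 in place of HMAC-SHA1 and a key from `secrets.token_bytes`; the names `new_key` and `anonymize_phone` and the digits-only normalisation are illustrative assumptions.

```python
import base64
import hashlib
import hmac
import secrets
import string


def new_key() -> bytes:
    """Generate a random 32-byte secret key for the data set."""
    return secrets.token_bytes(32)


def anonymize_phone(phone: str, key: bytes) -> str:
    """Return a keyed, non-reversible token for a phone number."""
    # Assumption: keep only ASCII digits so that formatting differences
    # ("+1 (555) ..." vs "1555...") do not produce different tokens.
    digits = "".join(ch for ch in phone if ch in string.digits)
    mac = hmac.new(key, digits.encode("ascii"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(mac).decode("ascii")
```

The key matters here: phone numbers come from a small enough space that an unkeyed hash could be reversed by hashing every possible number. The function works row by row, so it can be streamed over a data set of any size. Two different numbers mapping to the same 256-bit token is possible in principle but astronomically unlikely.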
If you're going to use cryptography, you want to apply a pseudorandom function to each phone number and throw away the key. Collision-resistant hashes such as SHA-256 do not provide the right security guarantees. Really, though, are there that many different phone numbers that you can't just construct incrementally a map representing an actually random function?
Sort your data by the respective column and start counting distinct values. Replace the actual values with their respective counter value: collision free and one way.
"So, basically I would need a one-way hashing function but without redundancy - Each phone number should map to unique value --Two phones numbers should not map to a same value."
This screams for a solution based on a cryptographic hash function. MD5 and SHA-1 are the best known examples, and work wonderfully for this. You will read that "MD5 has been cracked", but for your purpose that doesn't matter.

String to unique int algorithm

We are trying to implement the following case. We have an invoice table and there is a column which has an email address. We want to somehow generate a unique int value from this email address and store that in a separate column. This will be used as a FK and indexed. So what I am looking for is an algorithm for generating ints from strings (please note that the same email string should always output the same int, so each email address has a unique int representation). We can use a bigint as well.
The simplest solution is to put the email address into its own table along with an identity/auto_increment type column. Then you can simply carry around that identity field (a standard int), and you don't run into any issues with potential hash collisions, and there is no hashing overhead.
It seems a simple hashcode (MD5, SHA1, ...) should fit your needs; depending on your RDBMS, you might be able to use built-in packages (e.g. Oracle's dbms_crypto) or have to compute them externally.
Some things to keep in mind:
convert everything to lower/uppercase before computing the hashcode (so USER#DOMAIN.COM gets the same hashcode as user#domain.com)
apparently, you have a denormalized schema. It would make more sense to have a separate customer table containing the email address; invoice should then contain only a foreign key customer_fk
MD5 - gives you a 128-bit integer. (Admittedly, this is bigger than the int datatype in most languages, but you won't get anywhere near guaranteed uniqueness with just 32 bits.)
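As a sketch of what that looks like in Python, the following turns an email into the full 128-bit MD5 integer, and alternatively into a signed 64-bit value that would fit a BIGINT column. The function names and the trim and lower-case normalisation are assumptions for illustration.

```python
import hashlib


def _md5_digest(email: str) -> bytes:
    """MD5 of the trimmed, lower-cased address."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).digest()


def email_to_int128(email: str) -> int:
    """Interpret the whole 16-byte digest as an unsigned integer."""
    return int.from_bytes(_md5_digest(email), "big")


def email_to_int64(email: str) -> int:
    """Keep only the first 8 bytes, as a signed 64-bit integer."""
    return int.from_bytes(_md5_digest(email)[:8], "big", signed=True)
```

Truncation trades size for collision risk. By the birthday bound, a 32-bit value is likely to collide after tens of thousands of distinct emails and a 64-bit value after a few billion, so a truncated hash used as a key still needs a uniqueness check.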
I don't know if you can get away with a 64-bit int: the max length of an email address is 254 characters and, in this case where you need to preserve the uniqueness of each, hashing will not do it.
So it seems you are stuck with having to get over this 254-character hurdle. My approach (always the brute force approach for me) would be to take the alphabet of allowable characters in an email address, map those to 6-bit values, and use the map to pack them into a series of words.
Take a look at rfc3696 which deals with email addresses in a way that's actually comprehensible.
Sorry to be of so little help.

I'm brainstorming for a serial number scheme. Am I doing it wrong?

serial number format:
24 octets represented by 24 hex
characters plus hyphens for
readability
e.g. D429-A7C5-9C15-8516-D15D-3A1C
0-15: {email+master hash}
16-19: {id}
20-23: {timestamp}
email+master hash algorithm:
generate md5 hash of user's email (32 bytes)
generate md5 hash of undisclosed master key
xor these two hashes
remove odd bytes, reducing size to 16
e.g. D429A7C59C158516D15D3A1CB00488ED --> D2AC9181D531B08E
id:
initially 0x00000000, then incremented with each license sold
timestamp:
timestamp generated when license is purchased
validation:
in order to register product, user must enter 1) email address and 2) serial number
generate email+master hash and verify that it matches 0-15 of serial
extract timestamp from serial and verify that it is < current timestamp and >= date first license is sold
I'm no expert on this, but there are a few things that might be problematic with this approach:
Using MD5 doesn't seem like a good idea. MD5 has known security weaknesses and someone with enough time on their hands could easily come up with some sort of hash collision. Depending on how you use the serial number, someone could easily forge a serial number that looks like it matches some other serial number. Using something from the SHA family might prevent this.
Your XOR of the user email hash with a master key isn't particularly secure - I could recover the hash of the master key easily by XORing the serial number with a hash of my own email.
Dropping every odd byte out of a secure hash breaks the guarantee that the hash is secure. In particular, any hash function with a good security guarantee usually requires that all of the bytes in the resulting hash be there in the output. As an example, I could trivially construct a secure hash function from any existing secure hash function by taking the output of that first hash, interspersing 0s in-between all the old bytes, then outputting the result. It's secure because if you could break any of the security properties of my new hash, it would be equivalent to breaking security properties of the original hash. However, if you drop all the even-numbered bytes from the new hash, you get all zeros, which isn't at all secure.
Is four bytes enough for the id? That only gives you 2^32 different ids.
Some points to add to templatetypedef's reply:
If you must combine hashes for the email and your master key, hash the concatenation of both. Even better, hash email+key+id for even "better" security in case someone purchases two or more licenses and sees the pattern.
Use a hash function that gives you only 16 bytes. If you must use MD5, any truncation is equally bad, so just take the first 16 bytes.
Your id is never used in the validation.
You will not be protected from key sharing (e.g. warez sites).
A serial number protects you from very few attacks. It's probably not worth your time and effort.
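The first points above (hash email, key and id together, truncate by taking a prefix, and actually use the id in validation) might look like this in Python. The sketch swaps in HMAC-SHA256 keyed with the master key instead of a plain hash of the concatenation, drops the timestamp field for brevity, and gives the id 8 hex digits; `MASTER_KEY`, `make_serial` and `check_serial` are hypothetical names.

```python
import hashlib
import hmac

MASTER_KEY = b"replace-with-secret-key"  # hypothetical placeholder


def _tag(email: str, license_id: int) -> str:
    """First 16 hex characters of HMAC-SHA256 over email and id."""
    msg = f"{email.strip().lower()}|{license_id:08X}".encode("utf-8")
    return hmac.new(MASTER_KEY, msg, hashlib.sha256).hexdigest()[:16].upper()


def make_serial(email: str, license_id: int) -> str:
    """Build a 24-hex-character serial: 16 for the tag, 8 for the id."""
    if not 0 <= license_id < 2**32:
        raise ValueError("license_id must fit in 32 bits")
    raw = _tag(email, license_id) + f"{license_id:08X}"
    return "-".join(raw[i:i + 4] for i in range(0, 24, 4))


def check_serial(email: str, serial: str) -> bool:
    """Recompute the tag from the email and embedded id, then compare."""
    raw = serial.replace("-", "").upper()
    if len(raw) != 24:
        return False
    try:
        license_id = int(raw[16:], 16)
    except ValueError:
        return False
    expected = _tag(email, license_id).encode("utf-8")
    return hmac.compare_digest(raw[:16].encode("utf-8"), expected)
```

Because the id is part of the keyed input, two licenses bought with the same email get unrelated tags, and changing the id in a serial invalidates it. As the last two points say, none of this prevents a valid email and serial pair from being shared.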

Resources