I'm brainstorming for a serial number scheme. Am I doing it wrong? - software-distribution

serial number format:
24 octets represented by 24 hex characters plus hyphens for readability
e.g. D429-A7C5-9C15-8516-D15D-3A1C
0-15: {email+master hash}
16-19: {id}
20-23: {timestamp}
email+master hash algorithm:
generate md5 hash of user's email (16 bytes, i.e. 32 hex characters)
generate md5 hash of undisclosed master key
xor these two hashes
remove every odd-indexed hex character, reducing size to 16 characters
e.g. D429A7C59C158516D15D3A1CB00488ED --> D2AC9181D531B08E
id:
initially 0x00000000, then incremented with each licence sold
timestamp:
timestamp generated when license is purchased
validation:
in order to register product, user must enter 1) email address and 2) serial number
generate email+master hash and verify that it matches 0-15 of serial
extract timestamp from serial and verify that it is < current timestamp and >= date first license is sold

I'm no expert on this, but there are a few things that might be problematic with this approach:
Using MD5 doesn't seem like a good idea. MD5 has known security weaknesses and someone with enough time on their hands could easily come up with some sort of hash collision. Depending on how you use the serial number, someone could easily forge a serial number that looks like it matches some other serial number. Using something from the SHA family might prevent this.
Your XOR of the user email hash with a master key isn't particularly secure - I could recover the hash of the master key easily by XORing the serial number with a hash of my own email.
Dropping every odd byte out of a secure hash breaks the guarantee that the hash is secure. In particular, any hash function with a good security guarantee usually requires that all of the bytes in the resulting hash be there in the output. As an example, I could trivially construct a secure hash function from any existing secure hash function by taking the output of that first hash, interspersing 0s in-between all the old bytes, then outputting the result. It's secure because if you could break any of the security properties of my new hash, it would be equivalent to breaking security properties of the original hash. However, if you drop all the even-numbered bytes from the new hash, you get all zeros, which isn't at all secure.
Is four bytes enough for the id? That only gives you 2^32 different ids.
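To make the XOR point concrete, here is a minimal Python sketch of that attack. It assumes the "email+master hash" field is built as the example in the question suggests (XOR of the two MD5 hex digests, keeping every other hex character); the function names and the sample key are hypothetical. Because XOR works bit by bit, dropping characters does not hide anything the attacker needs.

```python
import hashlib


def email_master_field(email: str, master_key: bytes) -> str:
    """Vendor side: the 16-character field as described in the question (assumed reading)."""
    e = hashlib.md5(email.encode("utf-8")).hexdigest()
    m = hashlib.md5(master_key).hexdigest()
    xored = f"{int(e, 16) ^ int(m, 16):032x}"
    return xored[::2].upper()  # keep characters 0, 2, 4, ... as in the example


def recover_master_nibbles(own_email: str, own_field: str) -> str:
    """Attacker side: XOR own field with own email hash to expose the master part."""
    e = hashlib.md5(own_email.encode("utf-8")).hexdigest()[::2]
    return f"{int(e, 16) ^ int(own_field, 16):016x}"


def forge_field(other_email: str, master_nibbles: str) -> str:
    """Attacker side: build a valid field for any other email, without the master key."""
    e = hashlib.md5(other_email.encode("utf-8")).hexdigest()[::2]
    return f"{int(e, 16) ^ int(master_nibbles, 16):016X}"
```

A single legitimately purchased serial is enough input for recover_master_nibbles, after which forge_field works for any email address.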

Some points to add to templatetypedef's reply:
If you must combine hashes for the email and your master key, hash the concatenation of both. Even better, hash email+key+id for even "better" security in case someone purchases two or more licenses and sees the pattern.
Use a hash function that gives you only 16 bytes. If you must use MD5, any truncation is equally bad, so just take the first 16 bytes.
Your id is never used in the validation.
You will not be protected from key sharing (e.g. warez sites).
A serial number protects you from very few attacks. It's probably not worth your time and effort.
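As a rough illustration of the first and third points, the sketch below binds the email, id and timestamp together in one keyed hash and checks all of them on validation. It swaps in HMAC-SHA256 for a plain hash of the concatenation, which is a choice made here rather than something the answer prescribes. The 4-character id and timestamp fields, the day-count timestamp, the key and all names are assumptions for illustration.

```python
import hashlib
import hmac

MASTER_KEY = b"hypothetical-master-key"  # assumption: known only to the vendor


def make_serial(email: str, licence_id: int, day_stamp: int) -> str:
    """Return XXXX-XXXX-XXXX-XXXX-XXXX-XXXX: 16 chars of MAC, 4 of id, 4 of timestamp."""
    if not (0 <= licence_id <= 0xFFFF and 0 <= day_stamp <= 0xFFFF):
        raise ValueError("id and timestamp must each fit in 4 hex characters")
    tail = f"{licence_id:04X}{day_stamp:04X}"
    # The MAC covers the email AND the id/timestamp, so none can be altered alone.
    msg = email.strip().lower().encode("utf-8") + b"|" + tail.encode("ascii")
    mac = hmac.new(MASTER_KEY, msg, hashlib.sha256).hexdigest()[:16].upper()
    raw = mac + tail
    return "-".join(raw[i:i + 4] for i in range(0, 24, 4))


def check_serial(email: str, serial: str) -> bool:
    """Recompute the serial from the embedded id/timestamp and compare."""
    raw = serial.replace("-", "").upper()
    if len(raw) != 24:
        return False
    try:
        licence_id = int(raw[16:20], 16)
        day_stamp = int(raw[20:24], 16)
        expected = make_serial(email, licence_id, day_stamp).replace("-", "")
    except ValueError:
        return False
    return hmac.compare_digest(expected.encode("utf-8"), raw.encode("utf-8"))
```

Note that the key still has to ship inside the product for offline validation, so this does nothing against the key-sharing and reverse-engineering concerns raised above.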

Related

How can you hash an email address into a zero or one with relatively even distribution?

This may be a very stupid question - apologies in advance.
I'm wondering if it's possible to generate a random number from an email address. I'm imagining something similar to how you can generate an md5 hash of an email address (or pretty much any string for that matter).
So basically such a function would allow you to generate the same random number from the same email address every time you ran it.
The application that I have in mind is to slot email addresses into an A/B test randomly. Normally the way that you would implement such a thing would be to just generate a random number for each email address and store that along with the email address in order to tag a given email as belonging to A or B.
The nice thing about a function that could generate a random number from an email is that you wouldn't have to store that association anywhere. You could run it on the fly to determine at any given time which bucket the email should fall into.
UPDATE: What I'm looking for is a hash, not a random number. So it's just a matter of figuring out how to go from something like an MD5 hash to an integer with a value of 0 or 1.
UPDATE 2: Thanks for the answers and for nudging me in the right direction. So one solution in MySQL is simply:
ASCII(SUBSTR(MD5(CONCAT(customer_email, 'salt')), 1, 1)) % 2
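For comparison, the same idea can be sketched in Python. This mirrors the MySQL expression above under the assumption that the email is encoded as UTF-8; the function name and the default salt value are placeholders.

```python
import hashlib


def ab_bucket(email: str, salt: str = "salt") -> int:
    """Map an email to bucket 0 or 1, the same way every time it is called."""
    hex_digest = hashlib.md5((email + salt).encode("utf-8")).hexdigest()
    # Parity of the ASCII code of the first hex character, as in the SQL version.
    # Of the 16 possible characters (0-9, a-f), 8 have even codes and 8 have odd.
    return ord(hex_digest[0]) % 2
```

Changing the salt reshuffles the assignment, which is useful when a new test should not reuse the split from an earlier one.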
Yes, a hash by definition does this (or appears to): it creates a somewhat random value from a given string. But note that it's not really random. To deal with this we use a salted hash: append a random number (the salt) to the input before hashing, then store the salt alongside the resulting hash. It will give you the same result every time, as long as you retrieve the salt that was stored with the email.
When the generated number is the same every time, it is no longer a random number. You could use the ASCII codes of the characters in the email as your number, but there is a catch: abc#xyz.com will give the same value as cba#xyz.com, so you have to take care of this somehow. Things become more complex if special characters such as _ or a dot (.) are used. Why can't we use the email itself as the key?

Custom hash for unique ID

I want to do something similar to what YouTube does. For example, this is a valid YouTube video ID didzxUkrtS0
Right now I am storing users' IDs as integers, and I want to translate those numbers to an 8-character alphanumeric identifier. For example: FZ3EY1IC (not hexadecimal)
I already know that I can implement it with MD5 and then take the first 8 hex numbers, but it doesn't cover the entire alphabet.
What should I do to create a unique pattern using integers, that should never repeat?
Make your integer 5 8-bit bytes long (by adding a byte with a random value if your integer is 32-bit), that's 40 bits of data.
Cryptographically encrypt the 5 bytes of your integer using some key, which you probably want to keep private.
Slice the 40 encrypted bits into 8 5-bit parts. Encode each part using 32 alphanumeric characters. You may choose to use different 32 (out of the total 36) characters for each part.
The reverse operation is trivial.
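The steps above can be sketched in Python as follows. This is only an illustration under several assumptions: the integer is zero-padded to 40 bits rather than extended with a random byte, the "encryption" is a small hand-rolled 4-round Feistel network keyed with HMAC-SHA256 (not a vetted cipher), and the key, alphabet and function names are made up for the example.

```python
import hashlib
import hmac

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # 32 symbols, one per 5-bit group
KEY = b"hypothetical-secret-key"  # assumption: kept private
ROUNDS = 4
HALF_BITS = 20
HALF_MASK = (1 << HALF_BITS) - 1


def _round(value: int, rnd: int) -> int:
    """Keyed round function: 20 bits in, 20 bits out."""
    msg = bytes([rnd]) + value.to_bytes(3, "big")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big") & HALF_MASK


def encrypt40(n: int) -> int:
    """Permute a 40-bit integer with a Feistel network (a bijection by construction)."""
    left, right = n >> HALF_BITS, n & HALF_MASK
    for rnd in range(ROUNDS):
        left, right = right, left ^ _round(right, rnd)
    return (left << HALF_BITS) | right


def decrypt40(c: int) -> int:
    """Undo encrypt40 by running the rounds in reverse order."""
    left, right = c >> HALF_BITS, c & HALF_MASK
    for rnd in reversed(range(ROUNDS)):
        left, right = right ^ _round(left, rnd), left
    return (left << HALF_BITS) | right


def encode_id(n: int) -> str:
    """Integer user ID -> 8-character identifier."""
    if not 0 <= n < (1 << 40):
        raise ValueError("ID must fit in 40 bits")
    c = encrypt40(n)
    return "".join(ALPHABET[(c >> shift) & 31] for shift in range(35, -1, -5))


def decode_id(s: str) -> int:
    """8-character identifier -> integer user ID."""
    c = 0
    for ch in s:
        c = (c << 5) | ALPHABET.index(ch)
    return decrypt40(c)
```

Because the mapping is a permutation of the 40-bit space, two different integers can never produce the same identifier, so no lookup table is needed to guarantee uniqueness.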
I'm pretty sure what sites like YouTube and Bitly do is store a big table in the database that translates the alphanumeric identifier for each link to the internal ID of what it's for (either that or it's stored in the row). When a new identifier is needed, they compute a random one and store it. The reason to do this is so that an attacker cannot predict the ID of the next piece of content to be added.

Hashing and encryption technique for a huge data set containing phone numbers

Description of problem:
I'm working with a highly sensitive dataset that contains people's phone numbers in one of its columns. I need to apply an encryption or hash function to convert them to encoded values before doing my analysis. It can be a one-way hash, i.e. after processing the encoded data we won't be converting it back to the original phone numbers. Essentially, I'm looking for an anonymizer that takes phone numbers and converts them to some random value on which I can do my processing. Please suggest the best way to go about this. Recommendations on the best algorithms to use are welcome.
Update: size of the dataset
My dataset is really huge: hundreds of GB in size.
Update: Sensitive
By sensitive, I meant that the phone numbers themselves should not be part of our analysis. So, basically, I need a one-way hashing function without redundancy: each phone number should map to a unique value, and two phone numbers should never map to the same value.
Update: Implementation ?
Thanks for your answers. I am looking for a detailed implementation. I was going through Python's hashlib library for hashing. Does it perform the same set of steps that you suggested? Here is the link
Can you give me some example code to achieve this, preferably in Python?
Generate a key for your data set (16 or 32 bytes) and keep it secret. Use HMAC-SHA1 on your data with this key, then Base64-encode the result, and you have a random-looking unique string per phone number that isn't reversible (without the key).
Example (HMAC-SHA1 with a 256-bit key) using Keyczar:
Create random secret key:
$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary
Anonymize phone number:
from keyczar import keyczar
def anonymize(phone_num):
    # Load the key set created above and compute the keyed signature (HMAC).
    signer = keyczar.Signer.Read("path_to_key_set")
    return signer.Sign(phone_num)
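Since the question asks about hashlib, here is a minimal sketch of the same keyed-hash idea using only Python's standard library (hmac, hashlib, secrets, base64). It uses HMAC-SHA256 instead of HMAC-SHA1, and the digit-only normalization step and function names are assumptions added for the example.

```python
import base64
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Generate a random 32-byte key; keep it secret, or discard it after the run."""
    return secrets.token_bytes(32)


def anonymize(phone_num: str, key: bytes) -> str:
    """Return a keyed, non-reversible token for a phone number."""
    # Assumption: formatting differences should not produce different tokens,
    # so only the ASCII digits are kept before hashing.
    normalized = "".join(ch for ch in phone_num if ch in "0123456789")
    tag = hmac.new(key, normalized.encode("ascii"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag).decode("ascii")
```

The same key must be used for the whole dataset so that equal phone numbers keep mapping to equal tokens. A plain unkeyed hash would be easy to reverse here, because the space of valid phone numbers is small enough to enumerate.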
If you're going to use cryptography, you want to apply a pseudorandom function to each phone number and throw away the key. Collision-resistant hashes such as SHA-256 do not provide the right security guarantees. Really, though, are there that many different phone numbers that you can't just construct incrementally a map representing an actually random function?
Sort your data by the respective column and count the distinct values, then replace each actual value with its counter value. This is collision-free and one-way.
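The counter idea can be sketched in a few lines of Python. This version numbers values in order of first appearance rather than sorting first, and it keeps the whole mapping in memory, which is an assumption that may not hold for hundreds of GB of input. The function name is hypothetical.

```python
def pseudonymize(values):
    """Replace each distinct value with a small integer, assigned on first sight."""
    mapping = {}
    result = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # next unused counter value
        result.append(mapping[value])
    return result, mapping
```

For example, pseudonymize(["555-0100", "555-0199", "555-0100"]) yields [1, 2, 1] together with the mapping, which should be deleted afterwards if the result must stay one-way.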
"So, basically I would need a one-way hashing function but without redundancy - Each phone number should map to unique value --Two phones numbers should not map to a same value."
This screams for a solution based on a cryptographic hash function. MD5 and SHA-1 are the best known examples, and work wonderfully for this. You will read that "MD5 has been cracked", but for your purpose that doesn't matter.

How can I generate a set of (short) unique identifiers which are easy to check but hard to spoof?

For example, such as the license keys much software uses. I had thought of cryptographically signing a sequence, so I could have maybe 4 bytes for the ID and say 8 bytes for the signature, but I can't find a suitable algorithm.
What I need is something that an attacker can't readily generate, but which is stored in less than approx 20 ASCII bytes. I also need to be confident of uniqueness. This doesn't need to be completely secure, only secure against a casual attack.
Note: I'm doing this in Java on App Engine.
Just generate a GUID for each ID and keep track of the ones you've generated in a database. The universe of GUIDs is so large that each will be unique. It's not cryptographic so there's a possibility that anyone who has a large enough population of your generated ones could produce a match, but I think the odds are still minuscule.
A GUID is 128 bits (16 bytes), which can be encoded in 22 characters using Base64 (24 with padding).
Sounds like HMAC. You will probably need to ensure uniqueness manually though.
Combine the values, including the ID, into a string and compute a byte-based HMAC over it with a secret key, truncated to a maximum length. Just make sure the values being signed include a unique part, such as the server time or some other ID. You will need to test that the resulting length stays within your 20-character requirement.
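A minimal sketch of the HMAC approach, written in Python for brevity even though the question targets Java: a 4-byte sequential ID followed by an 8-byte truncated HMAC-SHA256, Base64-encoded into 16 characters. The secret, the field sizes and the function names are assumptions chosen to match the sizes mentioned in the question.

```python
import base64
import hashlib
import hmac
import struct

SECRET = b"hypothetical-server-secret"  # assumption: never leaves the server


def make_key(seq: int) -> str:
    """Build a 16-character key from a sequential 32-bit ID."""
    body = struct.pack(">I", seq)  # 4-byte big-endian ID
    sig = hmac.new(SECRET, body, hashlib.sha256).digest()[:8]  # truncated MAC
    return base64.urlsafe_b64encode(body + sig).decode("ascii")  # 12 bytes -> 16 chars


def verify_key(key: str):
    """Return the embedded ID if the signature checks out, otherwise None."""
    try:
        raw = base64.urlsafe_b64decode(key.encode("ascii"))
    except ValueError:  # covers bad padding and non-ASCII input
        return None
    if len(raw) != 12:
        return None
    body, sig = raw[:4], raw[4:]
    expected = hmac.new(SECRET, body, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(sig, expected):
        return None
    return struct.unpack(">I", body)[0]
```

Uniqueness comes from the embedded ID rather than from the MAC, so it holds as long as each sequence number is issued only once. An 8-byte tag leaves a 1 in 2^64 chance per guess, which fits the "casual attack" threat model in the question.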
Encryption is reversible, so the output is guaranteed unique for unique inputs. Just encrypt 0, 1, 2, 3, 4, 5 etc. using the same key every time. For 128 bit output use AES and 128 bit numbers in ECB mode. Other modes will need identical IV/Nonces as well. For 64 bit numbers use DES. For other size numbers either use Hasty Pudding cypher or roll your own simple Feistel cypher for the size you want.
ECB is not the most secure mode, but I do not get the impression that you are looking for very high levels of security here.

10-char hash of an email

How reliable is it to use a 10-char hash to identify email addresses?
MailChimp has 10-character alphanumeric IDs for email addresses.
10 hex characters at 4 bits each give 40 bits, a bit over one trillion values. Maybe for an enterprise the size of MailChimp this gives reasonable headroom for a unique index space, and they have a single table with all possible emails, indexed by a 40-bit number.
I'd love to use the same style of hashes or coded IDs in links. To decide whether to go for indexes or hashes, I need to estimate the probability of two valid email addresses leading to the same 10-char hash.
Any hints on evaluating that for a custom hash function, other than raw testing?
You don't explicitly say what you mean by "reliable", but I presume you're trying to avoid collisions. As wildplasser says, for random identifiers it's all about the birthday paradox, and the chance of a collision in an identifier space with 2^n IDs reaches 50% when 2^(n/2) IDs are in use.
The Wikipedia page on Birthday Attacks has a great table illustrating probabilities for collisions under various parameters; for instance with 64 bits and a desired maximum collision probability of 1 in 1 million, you can have about 6 million identifiers.
Bear in mind that there are a lot more efficient ways to represent data in characters than hex; base64, for instance, gives you 3 bytes per 4 characters, meaning 10 characters gives you 60 bits, instead of 40 with hex.
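The figures above can be checked with the usual birthday approximation, p ≈ 1 − exp(−k(k−1)/2N) for k identifiers drawn uniformly from a space of N = 2^bits values. The sketch below assumes the hash output behaves like a uniform random value; the function names are made up for the example.

```python
import math


def collision_probability(num_ids: int, bits: int) -> float:
    """Approximate chance of at least one collision among num_ids random IDs."""
    space = 2.0 ** bits
    return -math.expm1(-num_ids * (num_ids - 1) / (2.0 * space))


def max_ids(bits: int, max_probability: float) -> int:
    """Approximate number of IDs that keeps the collision chance under the limit."""
    space = 2.0 ** bits
    return int(math.sqrt(2.0 * space * -math.log1p(-max_probability)))
```

With this approximation, max_ids(64, 1e-6) comes out at roughly 6 million, in line with the table cited above, while collision_probability(2**20, 40) is a little under 40%. The exact 50% point sits at about 1.18 × 2^(n/2) identifiers, so the 2^(n/2) figure is a rule of thumb.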
