Related
I was reading a post in Troy Hunt's blog (https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/), about a feature called "Pwned Passwords" that checks if your password is in a database with more than 1 billion leaked passwords.
To do this check without passing your password, the client code hash it and pass just the first five chars of this hash, the backend returns all the sha1 hashes of the passwords that starts with the prefix that you passed. Then, to check if the hash of your password is in the database or not, the comparison is made on client code.
And he put some info about the data of these hashed passwords...
Every hash prefix from 00000 to FFFFF is populated with data (16^5 combinations)
The average number of hashes returned is 478
The smallest is 381 (hash prefixes "E0812" and "E613D")
The largest is 584 (hash prefixes "00000" and "4A4E8")
In the comments, people was wondering if the presence of this "00000" is a coincidence or is math...
Could someone that understands the SHA1 algorithm explain it to us?
Well, since the passwords originally come from data breaches, my best guess is that the password table in one of the breached systems was sorted or clustered by the (unsalted -- those are the kind of folks who get their passwords stolen) SHA1 hash of the password. When the system was breached, the attackers started with the "00000" hashes and just didn't make it all the way through...
Or maybe the list that Troy used includes the first part of an SHA1 rainbow table (https://en.wikipedia.org/wiki/Rainbow_table)...
Or something like that. The basic idea is that the SHA1 hash of the passwords was part of the password selection process.
It's either a coincidence, or (less likely) an artifact/error in acquiring or assembling the results for publication.
Not that it looks like a significant outlier. The spread that's described (381 min, 478 average, 584 max) seems like an even spread for the sample size. A graph of the entire corpus would probably look pretty random.
Like any reasonably constructed hashing algorithm, character frequency in SHA1 results should be randomly distributed. (If SHA1 had some kind of bias, this would be major news in the math and cryptography/cryptology community!)
someone would need to check my guess against the sha1 algorithm (and troy may have already debunked it since as per his blog answer he "took a peak at the [plain text] passwords) but since passwords are just alpha/numeric and limited symbols as depicted in ASCII creating a hash will ALWAYS start working with a first bit of ZERO (ascii is 0-255 but letters numbers and symbols used are in 32-98 range i believe, so first bit of every 8 bits always zero) and while it is the function of a hash to gloss over this, I suspect predictable bit positioning isn't as easy to obfuscate as one expects. while it ties with 4, 0 is 00000000 in bit form and 4 is 00000100 so both have first FIVE bits as 0,
also note that the two least frequent hash headers both start with E, WHICH IS 11111110 in binary, so they are almost exact opposite in construction (1's vs 0's) AND frequency (low vs high) implying the presence of zero bits may be a side effect of either the algorithm outright (doubtful) or a function of the algorithm on a limited subset skewed by convention, in other words, letters and digits occupy only 1/3rd - 1/4th of the full range depicted by ASCII which is most probable
of course we could go "tin foil hat" with this convo, but I'd just bet coincidence and ASCII are more to blame than that man on the grassy knoll
I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address then compress those values into a 14 character string. I'm not sure if this is possible, but I'd like to hear if anyone has any suggestions.
Thank you
Your question is way too vague to result in a detailed answer.
Given your restriction of a 14 character string output, you won't be able to use "real" compression (like zip), due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation.
If by "string" you mean "printable string", i.e. only about 62 or so values are usable in a character (depending on the exact printable set you choose), then you have an additional space constraint.
A handy trick you could use with the MAC address part is, since it belongs to an Apple device, you already know that the first three values (AA:BB:CC) are one of 297 combinations, so you could save 6 characters (plus 2 for the colons) worth of information into 2+ characters (depending on your output character set, see above).
The remaining three MAC address values are base-16 (0-9, A-F), so you could "compress" this information slightly as well.
A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).
The effort to do this in bash would be disproportionate though. I'd highly recommend a C (or other programming language) approach.
Cheating answer
Get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look it up in the database whenever you need the serial number. The 64-bit MAC address can easily be stored in 12 characters with standard base64 encoding.
Frustrating answer
You have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.
Simplifying assumptions
Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
The "new" Mac serial number format in this article from
2010 is the only format Apple has used or ever will use.
Core concepts of encoding
You're taking something which could have n possible values and you're converting it into something else with n possible values.
There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code.
There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.
Abstract integer encoding
Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes)
Make groups for the first 3 bytes and last 5 bytes of the MAC address.
Translate each group into a number from 0 to n-1 where n is the number of possible values for something in the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3, and n_C=36^4. The first 3 MAC bytes has 297 values and the last 5 have 2^(8*5)=2^40 values.
Set a variable, i, to the value of the first group's number.
For each remaining group's number, multiply i by the number of values possible for the group, and then add the number to i.
Base n encoding
Make a list of n characters that you want to use in your final output.
Print the character in your list at index i%n.
Subtract the modulus from the integer encoding and divide by n.
Repeat 1 and 2 until the integer becomes 0.
Result
This results in a total of 36^3 * 20 * 27 * 36 * 7 * 297 * 2^40 ~= 2 * 10^24 combinations. If you let n=64 for a custom base64 encoding
(without any padding characters), then you can barely fit that into ceiling(log(2 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(2 * 10^24) / log(95)) = 13 characters.
Fixing the assumptions
If you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
Do the same analysis on every other serial number format you may care about. You might want to see if there's any redundant information between the serial and MAC numbers.
Figure out a way to detect between serial number formats. Adding an extra thing at the end of the abstract number encoding can enable you to track which version it uses.
Think long and careful about the format you're making. It's a lot easier to make changes before you're stuck with backwards compatibility.
If you can, use a language that's well suited for mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.
[Background Story]
I am working with a 5 year old user identification system, and I am trying to add IDs to the database. The problem I have is that the system that reads the ID numbers requires some sort of checksum, and no-one working here now has ever worked with it, so no-one knows how it works.
I have access to the list of existing IDs, which already have correct checksums. Also, as the checksum only has 16 possible values, I can create any ID I want and run it through the authentication system up to 16 times until I get the correct checksum (but this is quite time consuming)
[Question]
What methods can I use to help guess the checksum algorithm of used for some data?
I have tried a few simple methods such as XORing and summing, but these have not worked.
So my question is: if I have data (in hexadecimal) like this:
data checksum
00029921 1
00013481 B
00026001 3
00004541 8
What methods can I use work out what sort of checksum is used?
i.e. should I try sequential numbers such as 00029921,00029922,00029923,... or 00029911,00029921,00029931,... If I do this what patterns should I look for in the changing checksum?
Similarly, would comparing swapped digits tell me anything useful about the checksum?
i.e. 00013481 and 00031481
Is there anything else that could tell me something useful? What about inverting one bit, or maybe one hex digit?
I am assuming that this will be a common checksum algorithm, but I don't know where to start in testing it.
I have read the following links, but I am not sure if I can apply any of this to my case, as I don't think mine is a CRC.
stackoverflow.com/questions/149617/how-could-i-guess-a-checksum-algorithm
stackoverflow.com/questions/2896753/find-the-algorithm-that-generates-the-checksum
cosc.canterbury.ac.nz/greg.ewing/essays/CRC-Reverse-Engineering.html
[ANSWER]
I have now downloaded a much larger list of data, and it turned out to be simpler than I was expecting, but for completeness, here is what I did.
data:
00024901 A
00024911 B
00024921 C
00024931 D
00042811 A
00042871 0
00042881 1
00042891 2
00042901 A
00042921 C
00042961 0
00042971 1
00042981 2
00043021 4
00043031 5
00043041 6
00043051 7
00043061 8
00043071 9
00043081 A
00043101 3
00043111 4
00043121 5
00043141 7
00043151 8
00043161 9
00043171 A
00044291 E
From these, I could see that when just one value was increased by a value, the checksum was also increased by the same value as in:
00024901 A
00024911 B
Also, two digits swapped did not change the checksum:
00024901 A
00042901 A
This means that the polynomial value (for these two positions at least) must be the same
Finally, the checksum for 00000000 was A, so I calculated the sum of digits plus A mod 16:
( (Σxi) +0xA )mod16
And this matched for all the values I had. Just to check that there was nothing sneaky going on with the first 3 digits that never changed in my data, I made up and tested some numbers as Eric suggested, and those all worked with this too!
Many checksums I've seen use simple weighted values based on the position of the digits. For example, if the weights are 3,5,7 the checksum might be 3*c[0] + 5*c[1] + 7*c[2], then mod 10 for the result. (In your case, mod 16, since you have 4 bit checksum)
To check if this might be the case, I suggest that you feed some simple values into your system to get an answer:
1000000 = ?
0100000 = ?
0010000 = ?
... etc. If there are simple weights based on position, this may reveal it. Even if the algorithm is something different, feeding in nice, simple values and looking for patterns may be enlightening. As Matti suggested, you/we will likely need to see more samples before decoding the pattern.
When dealing with a series of numbers, and wanting to use hash results for security reasons, what would be the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers, or bank account numbers. Preferred output would be a single unsigned integer to assist in matching purposes.
My feeling is that most of the string implementations appear to have low entropy when run against such a short range of characters and because of that, the collision rate might be higher than when run against a larger sample.
The target language is Delphi, however answers from other languages are welcome if they can provide a mathmatical basis which can lead to an optimal solution.
The purpose of this routine will be to determine if a previously received card/account was previously processed or not. The input file could have multiple records against a database of multiple records so performance is a factor.
With security questions all the answers lay on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure, and one that is very convenient. Given that and the explanation of each you can choose the best solution for your system.
You stated that your objective was to store this value in lieu of the actual credit card so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Inclusion of the CCV, expiration date, name, etc. would render it useless since it the value could be different with the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value that will remain uniform for all entries.
The convenient solution is to use a FNV (As Zebrabox and Nick suggested). This will produce a 32 bit number that will index quickly for searches. The downside of course is that it only allows for at max 4 billion different numbers, and in practice will produce collisions much quicker then that. Because it has such a high collision rate a brute force attack will probably generate enough invalid results as to make it of little use.
The secure solution is to rely on SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack speed is the enemy. If you want to be secure then you want it to be SLOW. SHA is designed to not have collisions for any size of input. If a collision is found then the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.
Now if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K) and then storing the full hash in one column, and then take the first 32 bits and storing it in a different column, with the index on the second column. Perform your look-up on the 32 bit value first. If that produces no matches then you have no matches. If it does produce a match then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison (hashes are actually binary, but only represented as strings for easy human reading and for transfer in text based protocols) on a much smaller set.
If you are really concerned about speed then you can reduce the number of iterations. Frankly it will still be fast even with 1000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and other factors (communication speed, hardware response, load, etc.) that may effect the duration. You may find that your optimizing the fastest point in the process, which will have little to no actual impact.
Also, I would recommend that you benchmark the look-up on the full hash vs. the 32 bit subset. Most modern database system are fairly fast and contain a number of optimizations and frequently optimize for us doing things the easy way. When we try to get smart we sometimes just slow it down. What is that quote about premature optimization . . . ?
This seems to be a case for key derivation functions. Have a look at PBKDF2.
Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked using brute force because this hash algorithms are usually designed to be as fast as possible.
UPDATE
Okay, security is no concern for your task. Because you have already a numerical input, you could just use this (account) number modulo your hash table size. If you process it as string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.
Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest to try highly non-linear hash function to spread this clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5. Just split the 128 bit hash in four groups of 32 bits, combine them using XOR, and interpret the result as a 32 bit integer.
While not directly related, you may also have a look at Benford's law - it provides some insight why numbers are usually not evenly distributed.
If you need security, use a cryptographically secure hash, such as SHA-256.
I needed to look deeply into hash functions a few months ago. Here are some things I found.
You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but could be 16 or 64-bits.) You want every character of the input to have and equally large effect on the output.
ALL the simple hashes (like ELF or PJW) that simply loop through the string and xor in each byte with a shift or a mod will fail that criteria for a simple reason: The last characters added have the most effect.
But there are some really good algorithms available in Delphi and asm. Here are some references:
See 1997 Dr. Dobbs article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c
SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html
You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his
hashtable. I suggested the old but nicely performing elf-hash, but also noted
a much better hash function I recently found. It was called SuperFastHash (SFH)
and was created by Paul Hsieh to overcome his 'problems' with the hash functions
from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm.
A few people worked on a basm implementation and posted it."
The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies c++ code MurmurrHash2.cpp and aligned-read-only implementation -
MurmurHashAligned2.cpp
//========================================================================
// Here is Landman's MurmurHash2 in C#
//2009-02-25 Davy Landman does C# implimentations of SuperFashHash and MurmurHash2
//landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
//
//Landman impliments both SuperFastHash and MurmurHash2 4 ways in C#:
//1: Managed Code 2: Inline Bit Converter 3: Int Hack 4: Unsafe Pointers
//SuperFastHash 1: 281 2: 780 3: 1204 4: 1308 MB/s
//MurmurHash2 1: 486 2: 759 3: 1430 4: 2196
Sorry if the above turns out to look like a mess. I had to just cut&paste it.
At least one of the references above gives you the option of getting out a 64-bit hash, which would certainly have no collisions in the space of credit card numbers, and could be easily stored in a bigint field in MySQL.
You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.
If performance is a factor I suggest to take a look at a CodeCentral entry of Peter Below. It performs very well for large number of items.
By default it uses P.J. Weinberger ELF hashing function. But others are also provided.
By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.
So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.
For a non cryptographic approach you could take a look at the FNV hash it's fast with a low collision rate.
As a very fast alternative, I've also used this algorithm for a few years and had few collision issues however I can't give you a mathematical analysis of it's inherent soundness but for what it's worth here it is
=Edit - My code sample was incorrect - now fixed =
In c/c++
unsigned int Hash(const char *s)
{
int hash = 0;
while (*s != 0)
{
hash *= 37;
hash += *s;
s++;
}
return hash;
}
Note that '37' is a magic number, so chosen because it's prime
Best hash function for the natural numbers let
f(n)=n
No conflicts ;)
I am no way experienced in this type of thing so I am not even sure of the keywords (hence the title).
Basically I need a two way function
encrypt(w,x,y) = z
decrypt(z) = w, x, y
Where w = integer
x = string (username)
y = unix timestamp
and z = is an 8 digit number (possibly including letters, spec isn't there yet.)
I would like z to be not easily guessable and easily verifiable. Speed isn't a huge concern, security isn't either. Tracking one-to-one relationship is the main requirement.
Any resources or direction would be appreciated.
EDIT
Thanks for the answers, learning a lot. So to clarify, 8 characters is the only hard requirement, along with the ability to link W <-> Z. The username (Y) and timestamp (Z) would be considered icing on the cake.
I would like to do this mathematically rather than doing some database looks up, if possible.
If i had to finish this tonight, I could just find a fitting hash algorithm and use a look up table. I am simply trying to expand my understanding of this type of thing and see if I could do it mathematically.
Encryption vs. Hashing
This is an encryption problem, since the original information needs to be recovered. The quality of a cryptographic hash is judged by how difficult it is to reverse the hash and recover the original information, so hashing is not applicable here.
To perform encryption, some key material is needed. There are many encryption algorithms, but they fall into two main groups: symmetric and asymmetric.
Application
The application here isn't clear. But if you are "encrypting" some information and sending it somewhere, then later getting it back and doing something with it, symmetric encryption is the way to go. For example, say you want to encode a user name, an IP address, and some identifier from your application in a parameter that you include in a link in some HTML. When the user clicks the link, that parameter is passed back to your application and you decode it to recover the original information. That's a great fit for symmetric encryption, because the sender and the recipient are the same party, and key exchange is a no-op.
Background
In symmetric encryption, the sender and recipient need to know the same key, but keep it secret from everyone else. As a simple example, two people could meet in person, and decide on a password. Later on, they could use that password to keep their email to each other private. However, anyone who overhears the password exchange will be able to spy on them; the exchange has to happen over a secure channel... but if you had a secure channel to begin with, you wouldn't need to exchange a new password.
In asymmetric encryption, each party creates a pair of keys. One is public, and can be freely distributed to anyone who wants to send a private message. The other is private. Only the message recipient knows that private key.
A big advantage to symmetric encryption is that it is fast. All well-designed protocols use a symmetric algorithm to encrypt large amounts of data. The downside is that it can be difficult to exchange keys securely—what if you can't "meet up" (virtually or physically) in a secure place to agree on a password?
Since public keys can be freely shared, two people can exchange a private message over an insecure channel without having previously agreed on a key. However, asymmetric encryption is much slower, so its usually used to encrypt a symmetric key or perform "key agreement" for a symmetric cipher. SSL and most cryptographic protocols go through a handshake where asymmetric encryption is used to set up a symmetric key, which is used to protect the rest of the conversation.
You just need to encrypt a serialization of (w, x, y) with a private key. Use the same private key to decrypt it.
In any case, the size of z cannot be simply bounded like you did, since it depends on the size of the serialization (since it needs to be two way, there's a bound on the compression you can do, depending on the entropy).
And you are not looking for a hash function, since it would obviously lose some information and you wouldn't be able to reverse it.
EDIT: Since the size of z is a hard limit, you need to restrict the input to 8 bytes, and choose a encryption technique that use 64 bits (or less) block size. Blowfish and Triple DES use 64 bits blocks, but remember that those algorithms didn't receive the same scrutiny as AES.
If you want something really simple and quite unsecure, just xor your input with a secret key.
You probably can't.
Let's say that w is 32 bits, x supports at least 8 case-insensitive ASCII chars, so at least 37 bits, and y is 32 bits (gets you to 2038, and 31 bits doesn't even get you to now).
So, that's a total of at least 101 bits of data. You're trying to store it in an 8 digit number. It's mathematically impossible to create an invertible function from a larger set to a smaller set, so you'd need to store more than 12.5 bits per "digit".
Of course if you go to more than 8 characters, or if your characters are 16 bit unicode, then you're at least in with a chance.
Let's formalize your problem, to better study it.
Let k be a key from the set K of possible keys, and (w, x, y) a piece of information, from a set I, that we need to crypt. Let's define the set of "crypted-messages" as A8, where A is the alphabet from which we extract the characters to our crypted message (A = {0, 1, ..., 9, a, b, ..., z, ... }, depending on your specs, as you said).
We define the two functions:
crypt: I * K --> A^8.
decrypt A^8 * K --> I
The problem here is that the size of the set A^8, of crypted-messages, might be smaller than the set of pieces of information (w, x, y). If this is so, it is simply impossible to achieve what you are looking for, unless we try something different...
Let's say that only YOU (or your server, or your application on your server) have to be able to calculate (w, x, y) from z. That is, you might send z to someone, and you don't care that they will not be able to decrypt it.
In this case, what you can do is use a database on your server. You will crypt the information using a well-known algorithm, than you generate a random number z. You define the table:
Id: char[8]
CryptedInformation: byte[]
You will then store z on the Id column, and the crypted information on the corresponding column.
When you need to decrypt the information, someone will give you z, the index of the crypted information, and then you can proceed to decryption.
However, if this works for you, you might not even need to crypt the information, you could have a table:
Id: char[8]
Integer: int
Username: char[]
Timestamp: DateTime
And use the same method, without crypting anything.
This can be applied to an "e-mail verification system" on a subscription process, for example. The link you would send to the user by mail would contain z.
Hope this helps.
I can't tell if you are trying to set this up a way to store passwords, but if you are, you should not use a two way hash function.
If you really want to do what you described, you should just concatenate the string and the timestamp (fill in extra spaces with underscores or something). Take that resulting string, convert it to ASCII or UTF-8 or something, and find its value modulo the largest prime less than 10^8.
Encryption or no encryption, I do not think it is possible to pack that much information into an 8 digit number in such a way that you will ever be able to get it out again.
An integer is 4 bytes. Let's assume your username is limited to 8 characters, and that characters are bytes. Then the timestamp is at least another 4 bytes. That's 16 bytes right there. In hex, that will take 32 digits. Base36 or something will be less, but it's not going to be anywhere near 8.
Hashes by definition are one way only, once hashed, it is very difficult to get the original value back again.
For 2 way encryption i would look at TripleDES which .net has baked right in with TripleDESCryptoServiceProvider.
A fairly straight forward implementation article.
EDIT
It has been mentioned below that you can not cram a lot of information into a small encrypted value. However, for many (not all) situations this is exactly what Bit Masks exist to solve.