Related
So I've read the Wikipedia page on Hash functions as I'm currently playing with some.
Both on that page and other sources I've read mention that the distribution of the data affects the hash function.
Despite some explanations it is still unclear to me what exactly those effects are and perhaps why. So my question:
Just to make sure I've got it right, when they mention
distribution is this the frequency of each word in the input data
set?
What effect does the distribution of input data have on hash
functions? Of particular interest is, the performance of the hash
function, in terms of both speed and uniformity of the output produced by the hash algorithm.
EDIT 1:
I'm thinking specifically of the Wikipedia English corpus vs data from a more dynamic source, Twitter's tweets for example.
Usually you do not have as many input datasets as you have possible inputs. The distribution is therefore more of a propability, that a certain input with certain features will be picked. (essentially the same as you said, but with p<1 for every word instead of some count n>1) E.g. if you know, that the first bit of the input will always be 1, then the data is not uniformly distributed.
If your hash were very simple, eg. by only taking the first byte as 'hash', then this non-uniform distribution would lead to more collisions than anticipated. (only 128 values are possible even though you expected to get 256 different values)
Most (cryptographic) hash functions that you might know by name are good enough so that you do not have to care about this. For cryptography it is even an explicit condition: you must not be able to tell how many bits in the input changed just by looking at the difference of the hashes. That does not mean that it is impossible though. I can vaguely remember a paper stating an increased collision rate for md5 when only ascii letters and digits were hashed. I cannot find it right now, so enjoy this piece of information with care - but even if i have mixed up something, such a scenario is easily possible. And no matter whether it is md5 or some other algorithm, if you actually have such a relation, then certainly your distribution of input datasets is relevant again.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
how to check if a string looks randomized, or human generated and pronouncable?
Is there any way to detect strings like putjbtghguhjjjanika?
Is there any algorithm that is able to detect how random a online nickname is? There are many situations where it would come in useful.
Given any alphanumeric name, the algorithm should be able to give it a "randomness" value. If the randomness value is too high, the application could then force the user to choose another name.
For example, "Mikel" would pass the test and be allowed to be used, while "Agslj" would not and be forced to choose another name
If there isn't an algorithm already available, how would I be able to create an algorithm for this?
You would probably want to look at the usage pattern / frequency of English letters, Letter Frequency Analysis is a good website for the very basics.
This page will also help you in your hunt.
Basically what you want to do is use similar techniques to what cryptographers use when trying to crack codes.
If the work entered into the text box matches the known usage pattern within a lose threshold then you can allow it, whereas if the entry does not match any frequency / usage pattern even a little bit then you can safely discard it.
I would defiantly recommend you looking into such techniques before attempting any algorithm of this sort.
However on such short input I cannot guarantee your accuracy...
It seems hard to implement an algorithm like this but you could try something :
If you check english based words (or any Latin derived language), there isn't many words with more than 3 continuous consonants. The same for vowels.
Also, you can count how many numbers are present, at which place, etc.
I have a project that needs to do validation on the frontend for an American Social Security Number (format ddd-dd-dddd). One suggestion would be to use a hash algorithm, but given the tiny character set used ([0-9]), this would be disastrous. It would be acceptable to validate with some high probability that a number is correct and allow the backend to do a final == check, but I need to do far better than "has nine digits" etc etc.
In my search for better alternatives, I came upon the validation checksums for ISBN numbers and UPC. These look like a great alternative with a high probability of success on the frontend.
Given those constraints, I have three questions:
Is there a way to prove that an algorithm like ISBN13 will work with a different category of data like SSN, or whether it is more or less fit to the purpose from a security perspective? The checksum seems reasonable for my quite large sample of one real SSN, but I'd hate to find out that they aren't generally applicable for some reason.
Is this a solved problem somewhere, so that I can simply use a pre-existing validation scheme to take care of the problem?
Are there any such algorithms that would also easily accommodate validating the last 4 digits of an SSN without giving up too much extra information?
Thanks as always,
Joe
UPDATE:
In response to a question below, a little more detail. I have the customer's SSN as previously entered, stored securely on the backend of the app. What I need to do is verification (to the maximum extent possible) that the customer has entered that same value again on this page. The issue is that I need to prevent the information from being incidentally revealed to the frontend in case some non-authorized person is able to access the page.
That is why an MD5/SHA1 hash is inappropriate: namely that it can be used to derive the complete SSN without much difficulty. A checksum (say, modulo 11) provides nearly no information to the frontend while still allowing a high degree of accuracy for the field validation. However, as stated above I have concerns over its general applicability.
Wikipedia is not the best source for this kind of thing, but given that caveat, http://en.wikipedia.org/wiki/Social_Security_number says
Unlike many similar numbers, no check digit is included.
But before that it mentions some widely used filters:
The SSA publishes the last group number used for each area number. Since group numbers are allocated in a regular (if unusual) pattern, it is possible to identify an unissued SSN that contains an invalid group number. Despite these measures, many fraudulent SSNs cannot easily be detected using only publicly available information. In order to do so there are many online services that provide SSN validation.
Restating your basic requirements:
A reasonably strong checksum to protect against simple human errors.
"Expected" checksum is sent from server -> client, allowing client-side validation.
Checksum must not reveal too much information about SSN, so as to minimize leakage of sensitive information.
I might propose using a cryptographic has (SHA-1, etc), but do not send the complete hash value to the client. For example, send only the lowest 4 bits of the 160 bit hash result[1]. By sending 4 bits of checksum, your chance of detecting a data entry error are 15/16-- meaning that you'll detect mistakes 93% of the time. The flip side, though, is that you have "leaked" enough info to reduce their SSN to 1/16 of search space. It's up to you to decide if the convenience of client-side validation is worth this leakage.
By tuning the number of "checksum" bits sent, you can adjust between convenience to the user (i.e. detecting mistakes) and information leakage.
Finally, given your requirements, I suspect this convenience / leakage tradeoff is an inherent problem: Certainly, you could use a more sophisticated crypto challenge / response algorithm (as Nick ODell astutely suggests). However, doing so would require a separate round-trip request-- something you said you were trying to avoid in the first place.
[1] In a good crypto hash function, all output digits are well randomized due to avalanche effect, so the specific digits you choose don't particularly matter-- they're all effectively random.
Simple solution. Take the number mod 100001 as your checksum. There is 1/100_000 chance that you'll accidentally get the checksum right with the wrong number (and it will be very resistant to one or two digit mistakes canceling out), and 10,000 possible SSNs that it could be so you have not revealed the SSN to an attacker.
The only drawback is that the 10,000 possible other SSNs are easy to figure out. If the person can get the last 4 of the SSN from elsewhere, then they can probably figure out the SSN. If you are concerned about this then you should take the user's SSN number, add a salt, and hash it. And deliberately use an expensive hash algorithm to do so. (You can just iterate a cheaper algorithm, like MD5, a fixed number of times to increase the cost.) Then use only a certain number of bits. The point here being that while someone can certainly go through all billion possible SSNs to come up with a limited list of possibilities, it will cost them more to do so. Hopefully enough that they don't bother.
I'm wondering how does one go about reversing an algorithm such as one for storing logins or pin codes.
Lets say I have an amount of data where:
7262627 -> ? -> 8172
5353773 -> ? -> 1132
etc. This is just an example. Or say a hex string that is tansformed into another.
&h8712 -> &h1283 or something like that.
How do I go about starting to figure out what that algorithm is? Where does one start?
Would you start trying different shifts, xors and hope something stands out? I'm sure there's a better way as this seems like stabbing in the dark.
Is it even practically possible to reverse engineer this kind of algorithm?
Sorry if this is a stupid question. Thanks for your help / pointers.
There are a few things people try:
Get the source code, or disassemble an executable.
Guess, based on the hash functions other people use. For example, a hash consisting of 32 hex digits might well be one or more repetitions of MD5, and if you can get a single input/output pair then it is quite easy to confirm or refute this (although see "salt", below).
Statistically analyze a large number of pairs of inputs and outputs, looking for any kind of pattern or correlations, and relate those correlations to properties of known hash functions and/or possible operations that the designer of the system might have used. This is beyond the scope of a single technique, and into the realms of general cryptanalysis.
Ask the author. Secure systems don't usually rely on the secrecy of the hash algorithms they use (and don't usually stay secure long if they do). The examples you give are quite small, though, and secure hashing of passwords would always involve a salt, which yours apparently don't. So we might not be talking about the kind of system where the author is confident to do that.
In the case of a hash where the output is only 4 decimal digits, you can attack it simply by building a table of every possible 7 digit input, together with its hashed value. You can then reverse the table and you have your (one-to-many) de-hashing operation. You never need to know how the hash is actually calculated. How do you get the input/output pairs? Well, if an outsider can somehow specify a value to be hashed, and see the result, then you have what's called a "chosen plaintext", and an attack relying on that is a "chosen plaintext attack". So a 7 digit -> 4 digit hash would be very weak indeed if it was used in a way which allowed chosen plaintext attacks to generate a lot of input/output pairs. I realise that's just one example, but it's also just one example of a technique to reverse it.
Note that reverse engineering the hash, and actually reversing it, are two different things. You could figure out that I'm using SHA-256, but that wouldn't help you reverse it (i.e., given an output, work out the input value). Nobody knows how to fully reverse SHA-256, although of course there are always rainbow tables (see "salt", above) <conspiracy>At least nobody admits they do, so it's no use to you or me.</conspiracy>
Probably, you can't. Suppose the transformation function is known, something like
function hash(text):
return sha1("secret salt"+text)
But the "secret salt" is not known, and is cryptographically strong (a very large, random integer). You could never brute force the secret salt from even a very large number of plain-text, crypttext pairs.
In fact, if the precise hash function used was known to be one of two equally strong functions, you could never even get a good guess between which one was being used.
Stabbing in the dark will drive you to insanity. There are some algorithms that, given current understanding, you couldn't hope to deduce the inner workings of between now and the [predicted] end of the universe without knowing the exact details (potentially including private keys or internal state). Of course, some of these algorithms are the foundations of modern cryptography.
If you know in advance that there's a pattern to be discovered though, there are sometimes ways of approaching this. For instance, if the dataset contains several input values that differ by 1, compare the corresponding output values:
7262627 -> 8172
7262628 -> 819
7262629 -> 1732
...
7262631 -> 3558
Here it's fairly clear (given a few minutes and a calculator) that when the input increases by 1, the output increases by 913 modulo 8266 (i.e. a simple linear congruential generator).
Differential cryptanalysis is a relatively modern technique used to analyse the strength of cryptographic block ciphers, relying on a similar but more complex idea for where the cipher algorithm is known, but it's assumed the private key isn't. Input blocks differing from each other by a single bit are considered and the effect of that bit is traced through the cipher to deduce how likely each output bit is to "flip" as a result.
Other ways of approaching this kind of problem would be to look at the extremes (maximum, minimum values), distribution (leading to frequency analysis), direction (do the numbers always increase? decrease?) and (if this is allowed) consider the context in which the data sets were found. For instance, some types of PIN codes always contain a repeated digit to make them easier to remember (I'm not saying a PIN code can necessarily be deduced from anything else - just that a repeated digit is one less digit to worry about!).
Is it even practically possible to reverse engineer this kind of algorithm?
It is possible with a flawed algorithm and enough encrypted/unencrypted pairs, but a well designed algorithm can eliminate that possibility of doing it at all.
I want a program that does what I said in the title.
I realize that this is a pretty vague problem. I also realize that figuring out how to transform any input into any output is nearly impossible, but it seems like handling some simple cases should be feasible.
To provide a concrete example (in Python):
>>> def find_transform(start, desired):
>>> # Insert magic here
>>> find_trasform([1,2,3], [3,2,1])
"reverse"
>>> find_trasform([1,2,3,4], [[1,2], [3,4]])
"divide 2"
I suspect there's an official word for this sort of thing, but I don't know what it is.
Well, the term is called Data Mapping. It's a vast field that can serve a few purposes, including yours.
The tools for such a task are hard to master, so don't expect this to be easy. For this specific case, you will be looking for Data-Driven Mapping methodologies. These involve a combination of heuristics and statistics to find the relevant relationships.
Thankfully your examples are well expressed mathematically and will remain true for every element of the data sets. So you can start tackling this problem by trying to analyze mathematical relationships between paired elements and trying to validate any found relationships on the remaining pairs.
EDIT: The last example adds a new dimension. So this must be observable too. If you stick to a progressive relationship between sets as is the case there (first establish relationship between set 1 and 2, then between set 2 and 3) all will be well. It may even help to more quickly prove a relationship since you don't need to recurse so often. But more complex relationships between sets may force you into a much more complex problem to handle. Try to keep it simple.