How does this ETSFS algorithm Shifting work?

I am following the ETSFS encryption algorithm here.
To give some context, ETSFS is an encryption algorithm that encrypts a 4x4 data matrix by putting it through a series of data-changing functions.
The step examples provided (with I/O) are quite clear about the before and after, but the shifting step of the algorithm is somewhat confusing to me (see page 5/10). Here is the shifting I/O example from that page:
Please note that the allowed symbols in this algorithm are, in this order:
* - . / : # _
['*','-','.','/',':','#','_']
a-z and A-Z are labeled 0-25
It says that the shifting is based on the position of the element in the array. Unfortunately I cannot draw much information out of the given image. In the third line, it is not clear to me how v became s when I cannot see a correlation between 3 and 2. The last line in the image above, where 4 becomes 2, is especially puzzling.
How does this shifting work?

As far as I can tell, the information you're seeking (i.e. the content of the arrayAlpha, arrayNumber and arraySymbol arrays used in the shifting step) is simply not specified anywhere in the paper you cite, nor do they appear to be derived from the key. If you wanted to know what those arrays should contain, you'd have to contact the authors and ask.
In any case, I'd advise you not to bother. The paper you cite does not appear to be written by professional cryptographers, and the algorithm it describes does not seem to be a secure encryption scheme in the modern sense. It clearly does not provide semantic security, much less non-malleability.
Certainly, it would never have been published in any actual crypto conference. The authors appear to have fallen into the common amateur cryptographer's trap of designing a cipher just complex enough that they themselves can't think of a way to break it. Also, they don't appear to be particularly experienced at breaking (or designing) ciphers.
I'd suggest using the time you would've spent working on this algorithm to instead familiarize yourself with the actual state of the art and best practices in modern cryptography. For that, I'd recommend any decent introductory crypto book, such as Katz & Lindell's Introduction to Modern Cryptography or Ferguson, Schneier & Kohno's Cryptography Engineering.
Oh, and if you can avoid it, don't try to write your own low-level crypto code (except as a learning exercise). Instead, find an existing reputable crypto library written and reviewed by professional cryptographers, such as NaCl, and use it.
In fact, the "ETSFS" encryption scheme described in the paper looks pretty weak even for an amateur cipher. As far as I can tell, the whole thing seems to amount to nothing more than a polyalphabetic substitution cipher with 16 distinct key-dependent alphabets, combined with a key-independent(!) transposition cipher. (That's not how it's specified, but that's what it appears to work out to, if you trace out how each data character is affected by the iterated steps.)
As such, given that the transposition part of the cipher is fixed and publicly known, a few dozen 16-character chosen plaintext / ciphertext pairs (specifically, one for each character of the alphabet) encrypted with the same key should be sufficient to fully determine the substitution part, thus allowing decryption of any data encrypted with that particular key. If a chosen-plaintext attack is not possible, a slightly larger sample of known plaintext should be sufficient to recover most of the substitution table, if not all of it.
Also, given that there appears to be no mixing between the ciphertext letters, even if the transposition part was made key-dependent, breaking it would still only require 16 more chosen 16-character plaintext / ciphertext pairs. (Exercise: Figure out a set of 16 plaintexts to choose so that, given the corresponding ciphertexts, you can fully determine the transposition part of the cipher without any prior knowledge of the substitution part. Then figure out which additional plaintexts you need to also determine the substitution part.)
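To make the chosen-plaintext attack on the fixed-transposition case concrete, here is a minimal sketch against a toy cipher of the same general shape: 16 key-dependent substitution alphabets followed by a fixed public transposition. This is not the ETSFS algorithm itself. The alphabet (lowercase letters only), the permutation PERM and all function names are assumptions made up for illustration.

```python
import random
import string

ALPHABET = string.ascii_lowercase   # assumption: toy alphabet, not the ETSFS symbol set
BLOCK = 16
# Assumed fixed, publicly known transposition:
# ciphertext position j takes its character from position PERM[j].
PERM = [(5 * j + 3) % BLOCK for j in range(BLOCK)]


def make_key(seed):
    """Build 16 secret substitution alphabets, one per block position."""
    rng = random.Random(seed)
    tables = []
    for _ in range(BLOCK):
        shuffled = list(ALPHABET)
        rng.shuffle(shuffled)
        tables.append(dict(zip(ALPHABET, shuffled)))
    return tables


def encrypt(key, block):
    """Substitute each character by position, then apply the fixed transposition."""
    substituted = [key[i][ch] for i, ch in enumerate(block)]
    return "".join(substituted[PERM[j]] for j in range(BLOCK))


def recover_decryption_tables(oracle):
    """Chosen-plaintext attack: one query per alphabet character."""
    inverse = [dict() for _ in range(BLOCK)]
    for ch in ALPHABET:
        ciphertext = oracle(ch * BLOCK)          # e.g. "aaaaaaaaaaaaaaaa"
        for j, out in enumerate(ciphertext):
            inverse[PERM[j]][out] = ch           # position PERM[j] maps ch -> out
    return inverse


def decrypt_with_tables(inverse, ciphertext):
    """Undo the transposition and substitution using only the recovered tables."""
    plaintext = [None] * BLOCK
    for j, out in enumerate(ciphertext):
        i = PERM[j]
        plaintext[i] = inverse[i][out]
    return "".join(plaintext)
```

With 26 queries the attacker can decrypt any block under that key without ever seeing the key. For a larger alphabet the number of queries grows with the alphabet size, which is where the "few dozen" estimate comes from.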

Related

What is the output of a fingerprint scanner? Is there any deterministic identifying information?

I am planning on generating a set of public/private keys from a deterministic identifying piece of information from a person and was planning on using fingerprints.
My question, therefore, is: what is the output of a fingerprint scanner? Is there any deterministic output I could use, or is it always going to be a matter of "confidence level"? i.e. Do I always get a "number" which, if matched exactly to the database, will allow access, or do I rather get a number which, if "close enough" to the stored value on the database, allows access, based on a high degree of confidence, rather than an exact match?
I am quite sure the second option is the answer but just wanted to double-check. Is there any way to get some sort of deterministic output? My hope was to re-generate keys every time rather than actually storing fingerprint data. That way a wrong fingerprint would simply generate a new and useless key.
Any suggestions?
Thanks in advance.
I would advise against it for several reasons.
Fingerprints are not entirely deterministic. As suggested in ImSimplyAnna's answer, you might 'round' the results in order to have a better chance of obtaining a deterministic result. But that would significantly reduce the number of possible/plausible fingerprints, and thus not meet the search space size requirement for a cryptographic algorithm. On top of that, I suspect the entropy of such a result would be rather low compared to the requirements of modern algorithms, which are always based on high-quality random numbers.
Fingerprints are not secret: we expose them to everyone all the time, and they can be revealed to an attacker at any time and captured in a picture using a simple camera. A key must be a secret, and the only place we know we can store secrets without exposing them is our brain (which is why we use passwords).
An important feature of cryptographic keys is the possibility of generating new ones if there is a reason to believe the current ones might be compromised. This is not possible with fingerprints.
That is why I would advise against it. More generally, I discourage anyone (myself included) from writing their own cryptographic algorithm, because it is so easy to get wrong. It might be the easiest thing to get wrong, out of all the things you could write, because attackers are so vicious!
The only good approach, if you're not a skilled specialist, is to use widely used libraries, because they've been written by experts on the matter and they've been subject to many attacks and attempts to break them, so the ones still standing will offer much better levels of protection than anything a non-specialist could write (or basically anything a single human could write).
You can also have a look at this question on the crypto Stack Exchange. They also discourage the OP from using anything other than a battle-hardened algorithm or protocol.
Edit:
I am planning on generating a set of public/private keys from a
deterministic identifying piece of information
Actually, it did not strike me at first (it should have), but keys MUST NOT be generated from anything which is not random. NEVER.
You have to generate them randomly. If you don't, you already give more information to the attacker than he/she wants. Being a programmer does not make you a cryptographer. Your users' information is at stake, so do not take any chances (and if you're not a cryptographer, you actually don't stand any).
A fingerprint scanner looks for features where the lines on the fingerprint either split or end. It then calculates the distances and angles between such features in an attempt to find a match.
Here's some more reading on the subject:
https://www.explainthatstuff.com/fingerprintscanners.html
in the section "How fingerprints are stored and compared".
The source is the best explanation I can find, but looking around some more it seems that all fingerprint scanners use some variety of that algorithm to generate data that can be matched.
Storing raw fingerprints would not only take up way more space on a database but also be a pretty significant security risk if that information was ever leaked, so it's not really done unless absolutely necessary.
Judging by that algorithm, I would assume that there is always some "confidence level". The angles and distances will never be 100% equal between scans, so there has to be some leeway to make sure a match is still found even if the finger is pressed against the scanner a bit harder or the finger is at a slightly different angle.
Based on this, I'd assume that generating a key pair based on a fingerprint would be possible, if you can figure out a way to make similar scans result in the same information. Simply rounding the angles and distances may work, but may introduce cases where two different people generate the same key pairs, or cases where different scans of the same fingerprint have a high chance of generating several different keys.
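A minimal sketch of why simple rounding is fragile, assuming a scan can be reduced to a short list of angle/distance measurements. The feature values and the derive_key helper are hypothetical and not any real scanner's output format.

```python
import hashlib


def derive_key(features):
    """Hypothetical: round each measurement, then hash the result into a key."""
    quantized = [round(value) for value in features]
    encoded = ",".join(str(q) for q in quantized).encode()
    return hashlib.sha256(encoded).hexdigest()


# Two scans of the same finger that land on the same side of a rounding boundary
scan_a = [30.1, 12.1]
scan_b = [30.4, 12.1]
# A third scan, only 0.2 away from scan_b, that crosses the boundary at 30.5
scan_c = [30.6, 12.1]

same_key = derive_key(scan_a) == derive_key(scan_b)
different_key = derive_key(scan_b) != derive_key(scan_c)
```

Here scan_a and scan_b give the same key, but scan_c gives a completely different one even though it is closer to scan_b than scan_a is. Coarser rounding makes such boundary crossings rarer but also makes it more likely that two different people end up with the same key.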

What hard problems do we need to solve to break SHA, MD, DES?

I seek a STRICTLY mathematical (and not pragmatic) answer.
We know that the hard problem behind RSA is integer factorization. If one were to solve that problem, one could easily break any RSA encryption. We already know that quantum computers may hold the key to solving integer factorization.
My question is whether one can be formulated, and if yes, then which hard mathematical problem is behind (providing the one-wayness (is there such a word?) of) SHA, MD-x, (and although not a hash algorithm, DES, which is known to have been broken, although maybe not in a mathematical way). In the case of the hash functions, breaking it would mean generating (all) messages m that have the hash value h.
With that information I would like to be able to assess the long-term (let's say multi-decade long) security (ha-ha, right?) of these algorithms, in a strictly mathematical sense (side-channel attacks ignored).
This is actually a good question, because for a long time the research community itself could not even say what it means for a hash function to be secure. I mean, if we talk in terms of Turing machines, then none of them are secure: there always exists a Turing machine that can output a collision in O(1) time for anything SHA or MD related. It was not until our good friend Professor Phillip Rogaway came along that this was addressed formally; his take is that the security is based upon nothing but human ignorance: https://eprint.iacr.org/2006/281.pdf . So in other words, there is no mathematical foundation to their security.
There are actually hash functions out there that are more formally defined (such as VSH or SWIFFT), but it is not something you see in practice.
As for DES, it does have fixed size inputs and keys, so unless it is generalised to some design that scales, it also cannot be founded upon complexity theory.
One last point: yes, RSA is based upon factoring, but it is not proven that you need to factor to break it. Much of cryptographic history involves trying to formalise the padding for RSA to make it provably secure in some sense of the definition of provable security. To my knowledge, they never got past proving it secure in the so-called random oracle model (see OAEP), which many theorists are not satisfied with.

Understanding the effect the distribution of data has on hashing

So I've read the Wikipedia page on Hash functions as I'm currently playing with some.
Both on that page and other sources I've read mention that the distribution of the data affects the hash function.
Despite some explanations it is still unclear to me what exactly those effects are and perhaps why. So my question:
Just to make sure I've got it right, when they mention
distribution is this the frequency of each word in the input data
set?
What effect does the distribution of input data have on hash
functions? Of particular interest is, the performance of the hash
function, in terms of both speed and uniformity of the output produced by the hash algorithm.
EDIT 1:
I'm thinking specifically of the Wikipedia English corpus vs data from a more dynamic source, Twitter's tweets for example.
Usually you do not have as many input datasets as you have possible inputs. The distribution is therefore more of a probability that a certain input with certain features will be picked (essentially the same as you said, but with p<1 for every word instead of some count n>1). E.g. if you know that the first bit of the input will always be 1, then the data is not uniformly distributed.
If your hash were very simple, e.g. only taking the first byte as the 'hash', then this non-uniform distribution would lead to more collisions than anticipated (only 128 values are possible even though you expected to get 256 different values).
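The first-byte example can be checked directly. This is a small sketch with made-up inputs; first_byte_hash is the deliberately bad "hash" described above.

```python
def first_byte_hash(data: bytes) -> int:
    """Deliberately poor hash: just the first byte of the input."""
    return data[0]


def distinct_hashes(inputs):
    """Count how many different hash values a set of inputs produces."""
    return len({first_byte_hash(x) for x in inputs})


# Inputs whose first byte covers all 256 values
uniform = [bytes([b]) + b"payload" for b in range(256)]
# Inputs whose first bit is always 1 (first byte is always >= 128)
skewed = [bytes([b | 0x80]) + b"payload" for b in range(256)]

uniform_count = distinct_hashes(uniform)   # 256
skewed_count = distinct_hashes(skewed)     # 128
```

Both lists contain 256 inputs, but the skewed one can only ever reach half of the output range, so its collision rate is higher than the hash's output size would suggest.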
Most (cryptographic) hash functions that you might know by name are good enough that you do not have to care about this. For cryptography it is even an explicit condition: you must not be able to tell how many bits in the input changed just by looking at the difference of the hashes. That does not mean that it is impossible though. I can vaguely remember a paper stating an increased collision rate for MD5 when only ASCII letters and digits were hashed. I cannot find it right now, so treat this piece of information with care, but even if I have mixed something up, such a scenario is easily possible. And no matter whether it is MD5 or some other algorithm, if you actually have such a relation, then certainly your distribution of input datasets is relevant again.

How does one go about reverse engineering an algorithm?

I'm wondering how does one go about reversing an algorithm such as one for storing logins or pin codes.
Let's say I have an amount of data where:
7262627 -> ? -> 8172
5353773 -> ? -> 1132
etc. This is just an example. Or say a hex string that is transformed into another.
&h8712 -> &h1283 or something like that.
How do I go about starting to figure out what that algorithm is? Where does one start?
Would you start trying different shifts, xors and hope something stands out? I'm sure there's a better way as this seems like stabbing in the dark.
Is it even practically possible to reverse engineer this kind of algorithm?
Sorry if this is a stupid question. Thanks for your help / pointers.
There are a few things people try:
Get the source code, or disassemble an executable.
Guess, based on the hash functions other people use. For example, a hash consisting of 32 hex digits might well be one or more repetitions of MD5, and if you can get a single input/output pair then it is quite easy to confirm or refute this (although see "salt", below).
Statistically analyze a large number of pairs of inputs and outputs, looking for any kind of pattern or correlations, and relate those correlations to properties of known hash functions and/or possible operations that the designer of the system might have used. This is beyond the scope of a single technique, and into the realms of general cryptanalysis.
Ask the author. Secure systems don't usually rely on the secrecy of the hash algorithms they use (and don't usually stay secure long if they do). The examples you give are quite small, though, and secure hashing of passwords would always involve a salt, which yours apparently don't. So we might not be talking about the kind of system where the author is confident enough to tell you.
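The "guess a well-known hash" approach from the list above can be sketched with Python's hashlib, assuming you have one known input/output pair and that no salt is involved. The candidate list and the function name are just illustrative choices.

```python
import hashlib


def matching_algorithms(known_input: bytes, known_digest_hex: str):
    """Return the names of common hash functions that reproduce the known pair."""
    candidates = ("md5", "sha1", "sha256", "sha512")
    matches = []
    for name in candidates:
        digest = hashlib.new(name, known_input).hexdigest()
        if digest == known_digest_hex.lower():
            matches.append(name)
    return matches
```

An empty result does not prove the system uses something exotic: a salt, repeated hashing or a different input encoding would all make a plain comparison like this fail.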
In the case of a hash where the output is only 4 decimal digits, you can attack it simply by building a table of every possible 7 digit input, together with its hashed value. You can then reverse the table and you have your (one-to-many) de-hashing operation. You never need to know how the hash is actually calculated. How do you get the input/output pairs? Well, if an outsider can somehow specify a value to be hashed, and see the result, then you have what's called a "chosen plaintext", and an attack relying on that is a "chosen plaintext attack". So a 7 digit -> 4 digit hash would be very weak indeed if it was used in a way which allowed chosen plaintext attacks to generate a lot of input/output pairs. I realise that's just one example, but it's also just one example of a technique to reverse it.
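A rough sketch of the table-building attack, assuming you can evaluate the hash as a black box. toy_hash is a hypothetical stand-in (the real function is unknown), and the table only ever calls it, never inspects it.

```python
from collections import defaultdict


def toy_hash(n: int) -> int:
    """Hypothetical stand-in for the unknown function; output is 4 decimal digits."""
    return (n * 913 + 17) % 10000


def build_reverse_table(hash_fn, n_digits):
    """Map every hash output to the list of inputs that produce it."""
    table = defaultdict(list)
    for n in range(10 ** n_digits):
        table[hash_fn(n)].append(n)
    return table


# Small demonstration with 5-digit inputs; use n_digits=7 for the case above
table = build_reverse_table(toy_hash, 5)
candidates = table[toy_hash(12345)]   # all inputs sharing that hash value
```

For 7-digit inputs the table has ten million entries, which is still small by modern standards. Each 4-digit output then maps to roughly a thousand candidate inputs, which is the "one-to-many" part.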
Note that reverse engineering the hash, and actually reversing it, are two different things. You could figure out that I'm using SHA-256, but that wouldn't help you reverse it (i.e., given an output, work out the input value). Nobody knows how to fully reverse SHA-256, although of course there are always rainbow tables (see "salt", above). <conspiracy>At least nobody admits they do, so it's no use to you or me.</conspiracy>
Probably, you can't. Suppose the transformation function is known, something like
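import hashlib
def hash(text):
    # "secret salt" stands in for a value known only to the system's author
    return hashlib.sha1(("secret salt" + text).encode()).hexdigest()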
But the "secret salt" is not known, and is cryptographically strong (a very large, random integer). You could never brute force the secret salt from even a very large number of plain-text, crypttext pairs.
In fact, if the precise hash function used was known to be one of two equally strong functions, you could never even get a good guess between which one was being used.
Stabbing in the dark will drive you to insanity. There are some algorithms that, given current understanding, you couldn't hope to deduce the inner workings of between now and the [predicted] end of the universe without knowing the exact details (potentially including private keys or internal state). Of course, some of these algorithms are the foundations of modern cryptography.
If you know in advance that there's a pattern to be discovered though, there are sometimes ways of approaching this. For instance, if the dataset contains several input values that differ by 1, compare the corresponding output values:
7262627 -> 8172
7262628 -> 819
7262629 -> 1732
...
7262631 -> 3558
Here it's fairly clear (given a few minutes and a calculator) that when the input increases by 1, the output increases by 913 modulo 8266 (i.e. a simple linear congruential generator).
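The "few minutes and a calculator" step can be sketched as follows. It assumes the outputs really follow output = (c + step * input) mod modulus, and that the sample contains at least one consecutive pair that wraps around the modulus and one that does not. The function names are made up for illustration.

```python
def fit_linear_mod(pairs):
    """Estimate (step, modulus) from (input, output) pairs sorted by input."""
    diffs = [
        out2 - out1
        for (in1, out1), (in2, out2) in zip(pairs, pairs[1:])
        if in2 - in1 == 1                 # only inputs that differ by exactly 1
    ]
    step = max(diffs)                     # difference with no wrap-around
    wrapped = min(diffs)                  # difference that wrapped (step - modulus)
    return step, step - wrapped


def predict(pairs, step, modulus, x):
    """Predict the output for input x from the first known pair."""
    x0, y0 = pairs[0]
    return (y0 + step * (x - x0)) % modulus


samples = [(7262627, 8172), (7262628, 819), (7262629, 1732), (7262631, 3558)]
step, modulus = fit_linear_mod(samples)   # 913 and 8266 for this data
```

Once step and modulus are recovered, the last sample (7262631 -> 3558) serves as a check: predict reproduces it even though it was not used in the fit.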
Differential cryptanalysis is a relatively modern technique used to analyse the strength of cryptographic block ciphers, relying on a similar but more complex idea for where the cipher algorithm is known, but it's assumed the private key isn't. Input blocks differing from each other by a single bit are considered and the effect of that bit is traced through the cipher to deduce how likely each output bit is to "flip" as a result.
Other ways of approaching this kind of problem would be to look at the extremes (maximum, minimum values), distribution (leading to frequency analysis), direction (do the numbers always increase? decrease?) and (if this is allowed) consider the context in which the data sets were found. For instance, some types of PIN codes always contain a repeated digit to make them easier to remember (I'm not saying a PIN code can necessarily be deduced from anything else - just that a repeated digit is one less digit to worry about!).
Is it even practically possible to reverse engineer this kind of algorithm?
It is possible with a flawed algorithm and enough encrypted/unencrypted pairs, but a well-designed algorithm can rule it out entirely.

Why do you need lots of randomness for effective encryption?

I've seen it mentioned in many places that randomness is important for generating keys for symmetric and asymmetric cryptography and when using the keys to encrypt messages.
Can someone provide an explanation of how security could be compromised if there isn't enough randomness?
Randomness means unguessable input. If the input is guessable, then the output can be easily calculated. That is bad.
For example, Debian had a long-standing bug in its SSL implementation that failed to gather enough randomness when creating a key. This resulted in the software generating one of only 32k possible keys. It is thus easy to decrypt anything encrypted with such a key by trying all 32k possibilities, which is very fast given today's processor speeds.
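As a toy model of that situation (not the actual OpenSSL code), suppose a key generator draws a 128-bit key from a PRNG whose seed can take only 32768 values. The generator and all names below are assumptions for illustration.

```python
import random

SEED_SPACE = 32768   # toy stand-in for the "32k possible keys" mentioned above


def weak_keygen(seed):
    """Hypothetical flawed generator: the 128-bit key depends only on a small seed."""
    return random.Random(seed).getrandbits(128)


def recover_seed(target_key):
    """Brute force: regenerate every possible key until one matches."""
    for seed in range(SEED_SPACE):
        if weak_keygen(seed) == target_key:
            return seed
    return None


victim_key = weak_keygen(31337)
found = recover_seed(victim_key)
```

The key looks like 128 bits of randomness, but the search only has to cover 32768 candidates, so it finishes almost instantly.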
The important feature of most cryptographic operations is that they are easy to perform if you have the right information (e.g. a key) and infeasible to perform if you don't have that information.
For example, symmetric cryptography: if you have the key, encrypting and decrypting is easy. If you don't have the key (and don't know anything about its construction) then you must embark on something expensive like an exhaustive search of the key space, or a more-efficient cryptanalysis of the cipher which will nonetheless require some extremely large number of samples.
On the other hand, if you have any information on likely values of the key, your exhaustive search of the keyspace is much easier (or the number of samples you need for your cryptanalysis is much lower). For example, it is (currently) infeasible to perform 2^128 trial decryptions to discover what a 128-bit key actually is. If you know the key material came out of a time value that you know within a billion ticks, then your search just became 340282366920938463463374607431 times easier.
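The figure above is just the ratio of the two search spaces, which is easy to check with Python's arbitrary-precision integers:

```python
full_keyspace = 2 ** 128        # every possible 128-bit key
time_window = 10 ** 9           # a billion candidate tick values

# How many times smaller the search becomes (integer part of the ratio)
speedup = full_keyspace // time_window
```

The remaining search of a billion candidates corresponds to roughly 30 bits of uncertainty instead of 128.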
To decrypt a message, you need to know the right key.
The more possible keys you have to try, the harder it is to decrypt the message.
Taking an extreme example, let's say there's no randomness at all. When I generate a key to use in encrypting my messages, I'll always end up with the exact same key. No matter where or when I run the keygen program, it'll always give me the same key.
That means anyone who has access to the program I used to generate the key can trivially decrypt my messages. After all, they just have to ask it to generate a key too, and they get one identical to the one I used.
So we need some randomness to make it unpredictable which key you end up using. As David Schmitt mentions, Debian had a bug which made it generate only a small number of unique keys, which means that to decrypt a message encrypted by the default OpenSSL implementation on Debian, I just have to try this smaller number of possible keys. I can ignore the vast number of other valid keys, because Debian's SSL implementation will never generate those.
On the other hand, if there was enough randomness in the key generation, it's impossible to guess anything about the key. You have to try every possible bit pattern. (and for a 128-bit key, that's a lot of combinations.)
It has to do with some of the basic reasons for cryptography:
Make sure a message isn't altered in transit (Immutable)
Make sure a message isn't read in transit (Secure)
Make sure the message is from who it says it's from (Authentic)
Make sure the message isn't the same as one previously sent (No Replay)
etc
There are a few things you need to include, then, to make sure that the above is true. One of the important things is a random value.
For instance, if I encrypt "Too many secrets" with a key, it might come out with "dWua3hTOeVzO2d9w"
There are a few problems with this. An attacker might be able to break the encryption more easily since I'm using a very limited set of characters. Further, if I send the same message again, it's going to come out exactly the same. Lastly, an attacker could record it and send the message again, and the recipient wouldn't know that I didn't send it, even if the attacker didn't break it.
If I add some random garbage to the string each time I encrypt it, then not only does it make it harder to crack, but the encrypted message is different each time.
The other features of cryptography in the bullets above are fixed using means other than randomness (seed values, two way authentication, etc) but the randomness takes care of a few problems, and helps out on other problems.
A bad source of randomness limits the character set again, so it's easier to break, and if it's easy to guess, or otherwise limited, then the attacker has fewer paths to try when doing a brute force attack.
-Adam
A common pattern in cryptography is the following (sending text from Alice to Bob):
Take plaintext p
Generate random k
Encrypt p with k using symmetric encryption, producing crypttext c
Encrypt k with Bob's public key, using asymmetric encryption, producing x
Send c+x to Bob
Bob reverses the process, decrypting x using his private key to obtain k, then decrypting c with k to obtain p
The reason for this pattern is that symmetric encryption is much faster than asymmetric encryption. Of course, it depends on a good random number generator to produce k, otherwise the bad guys can just guess it.
Here's a "card game" analogy: Suppose we play several rounds of a game with the same deck of cards. The shuffling of the deck between rounds is the primary source of randomness. If we didn't shuffle properly, you could beat the game by predicting cards.
When you use a poor source of randomness to generate an encryption key, you significantly reduce the entropy (or uncertainty) of the key value. This could compromise the encryption because it makes a brute-force search over the key space much easier.
Work out this problem from Project Euler, and it will really drive home what "lots of randomness" will do for you. When I saw this question, that was the first thing that popped into my mind.
Using the method he talks about there, you can easily see what "more randomness" would gain you.
A pretty good paper that outlines why not being careful with randomness can lead to insecurity:
http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html
This describes how, back in 1995, the Netscape browser's SSL implementation was vulnerable to guessing the SSL keys because of a problem seeding the PRNG.

Resources