Decoding Permutated English Strings - algorithm

A coworker was recently asked this when trying to land a (different) research job:
Given 10 128-character strings which have been permutated in exactly the same way, decode the strings. The original strings are English text with spaces, numbers, punctuation and other non-alpha characters removed.
He was given a few days to think about it before an answer was expected. How would you do this? You can use any computer resource, including character/word level language models.

This is a basic transposition cipher. My question above was simply to determine if it was a transposition cipher or a substitution cipher. Cryptanalysis of such systems is fairly straightforward. Others have already alluded to basic methods. Optimal approaches will attempt to place the hardest and rarest letters first, as these will tend to uniquely identify the letters around them, which greatly reduces the subsequent search space. Simply finding a place to place an "a" (no pun intended) is not hard, but finding a location for a "q", "z", or "x" is a bit more work.
The overarching measure of an algorithm's quality isn't whether it deciphers the text (that can be done by methods better than brute force), nor simply whether it is fast; it is how quickly it eliminates possibilities.
Since you can use multiple strings simultaneously, attempting to create words from the rarest characters is going to allow you to test dictionary attacks in parallel. Finding the correct placement of the rarest terms in each string as quickly as possible will decipher that ciphertext PLUS all of the others at the same time.
If you search for cryptanalysis of transposition ciphers, you'll find a bunch with genetic algorithms. These are meant to advance the research cred of people working in GA, as these are not really optimal in practice. Instead, you should look at some basic optimization methods, such as branch and bound, A*, and a variety of statistical methods. (How deep you should go depends on your level of expertise in algorithms and statistics. :) I would switch between deterministic methods and statistical optimization methods several times.)
In any case, the calculations should be dirt cheap and fast, because the scale of initial guesses could be quite large. It's best to have a cheap way to filter out a LOT of possible placements first, then spend more CPU time on sifting through the better candidates. To that end, it's good to have a way of describing the stages of processing and the computational effort for each stage. (At least that's what I would expect if I gave this as an interview question.)
You can even buy a fairly credible reference book on deciphering double transposition ciphers.
Update 1: Take a look at these slides for more ideas on iterative improvements. It's not a great reference set of slides, but it's readily accessible. What's more, although the slides are about GA and simulated annealing (methods that come up a lot in search results for transposition cipher cryptanalysis), the author advocates against such methods when you can use A* or other methods. :)

first, you'd need a test for the correct ordering. something fairly simple like being able to break the majority of texts into words using a dictionary ordered by frequency of use without backtracking.
once you have that, you can play with various approaches. two i would try are:
using a genetic algorithm, with scoring based on 2 and 3-letter tuples (which you can either get from somewhere or generate yourself). the hard part of genetic algorithms is finding a good description of the process that can be fragmented and recomposed. i would guess that something like "move fragment x to after fragment y" would be a good approach, where the indices are positions in the original text (and so change as the "dna" is read). also, you might need to extend the scoring with something that gets you closer to "real" text near the end - something like the length over which the verification algorithm runs, or complete words found.
using a graph approach. you would need to find a consistent path through the graph of letter positions, perhaps with a beam-width search, using the weights obtained from the pair frequencies. i'm not sure how you'd handle reaching the end of the string and restarting, though. perhaps 10 sentences is sufficient to identify with strong probability good starting candidates (from letter frequency) - wouldn't surprise me.
this is a nice problem :o) i suspect 10 sentences is a strong constraint (for every step you have a good chance of common letter pairs in several strings - you probably want to combine probabilities by discarding the most unlikely, unless you include word start/end pairs) so i think the graph approach would be most efficient.
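A minimal sketch of the pair-frequency idea, in Python. It assumes a hypothetical `bigram_logp` table of log-probabilities (the toy values below are invented for illustration, as is the `DEFAULT_LOGP` penalty) and a known starting column, which is exactly the part the answer says is hard. Each candidate successor column is scored across all strings at once, which is where having ten strings helps.

```python
from typing import Dict, List

DEFAULT_LOGP = -5.0  # assumed penalty for bigrams missing from the table


def pair_score(strings: List[str], i: int, j: int,
               bigram_logp: Dict[str, float]) -> float:
    """Sum, over all strings, the log-probability that column j follows column i."""
    return sum(bigram_logp.get(s[i] + s[j], DEFAULT_LOGP) for s in strings)


def greedy_chain(strings: List[str], start: int,
                 bigram_logp: Dict[str, float]) -> List[int]:
    """Extend a column ordering from `start`, always taking the best unused successor."""
    n = len(strings[0])
    order = [start]
    unused = set(range(n)) - {start}
    while unused:
        last = order[-1]
        best = max(sorted(unused),
                   key=lambda j: pair_score(strings, last, j, bigram_logp))
        order.append(best)
        unused.remove(best)
    return order


# Toy data: "the" and "she" with their columns permuted identically.
toy_logp = {"th": -1.0, "he": -1.0, "sh": -1.5}
cipher = ["eth", "esh"]
order = greedy_chain(cipher, start=1, bigram_logp=toy_logp)
decoded = ["".join(s[k] for k in order) for s in cipher]
```

A beam search would keep several partial orderings alive instead of only the single best one; the greedy version is just the beam-width-1 case.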

Frequency analysis would drastically prune the search space. The most-common letters in English prose are well-known.
Count the letters in your encrypted input, and put them in most-common order. Matching most-counted to most-counted, translate the ciphertext back into an attempted plaintext. It will be close to right, but likely not exactly. By hand, iteratively tune your permutation until plaintext emerges (typically only a few iterations are needed).
If you find checking by hand odious, run attempted plain texts through a spell checker and minimize violation counts.

First you need a scoring function that increases as the likelihood of a correct permutation increases. One approach is to precalculate the frequencies of triplets in standard English (get some data from Project Gutenberg) and add up the frequencies of all the triplets in all ten strings. You may find that quadruplets give a better outcome than triplets.
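The scoring function described above might look something like this in Python. This is only a sketch: the one-sentence corpus is a stand-in for a large body of English text, and the names `ngram_counts` and `score` are made up for the example.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(corpus: str, n: int = 3) -> Counter:
    """Count every run of n consecutive letters in the corpus (letters only, lowercased)."""
    letters = "".join(ch for ch in corpus.lower() if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def score(strings: Iterable[str], counts: Counter, n: int = 3) -> int:
    """Add up the corpus frequency of every n-gram found in every candidate string."""
    total = 0
    for s in strings:
        for i in range(len(s) - n + 1):
            total += counts[s[i:i + n]]  # a Counter returns 0 for unseen n-grams
    return total


# Stand-in corpus; in practice this would be a large body of English text.
counts = ngram_counts("the quick brown fox jumps over the lazy dog and the cat")
good = score(["thecat"], counts)
bad = score(["tcehat"], counts)
```

Passing `n=4` to both functions gives the quadruplet variant. Summing log-frequencies rather than raw counts is a common alternative, but the version here follows the description literally.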
Second you need a way to produce permutations. One approach, known as hill-climbing, takes the ten strings and enters a loop. Pick two random integers from 1 to 128 and swap the associated letters in all ten strings. Compute the score of the new permutation and compare it to the old permutation. If the new permutation is an improvement, keep it and loop, otherwise keep the old permutation and loop. Stop when the number of improvements slows below some predetermined threshold. Present the outcome to the user, who may accept it as given, accept it and make changes manually, or reject it, in which case you start again from the original set of strings at a different point in the random number generator.
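A rough Python sketch of that loop. The stopping rule here (give up after `max_stale` consecutive non-improving swaps) is an assumption standing in for "the number of improvements slows below some threshold", positions are 0-based, and `score` is whatever scoring function you settled on.

```python
import random
from typing import Callable, List, Tuple


def swap_columns(strings: List[str], i: int, j: int) -> List[str]:
    """Swap positions i and j in every string, so all strings stay permuted identically."""
    out = []
    for s in strings:
        chars = list(s)
        chars[i], chars[j] = chars[j], chars[i]
        out.append("".join(chars))
    return out


def hill_climb(strings: List[str], score: Callable[[List[str]], float],
               max_stale: int = 2000, seed: int = 0) -> Tuple[List[str], float]:
    """Keep a random swap only if it improves the score; stop after max_stale failures in a row."""
    rng = random.Random(seed)
    n = len(strings[0])
    best, best_score = list(strings), score(strings)
    stale = 0
    while stale < max_stale:
        i, j = rng.sample(range(n), 2)  # two distinct positions
        candidate = swap_columns(best, i, j)
        cand_score = score(candidate)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
            stale = 0
        else:
            stale += 1
    return best, best_score
```

Restarting "at a different point in the random number generator" corresponds to calling `hill_climb` again with a different `seed`.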
Instead of hill-climbing, you might try simulated annealing. I'll refer you to Google for details, but the idea is that instead of always keeping the better of the two permutations, sometimes you keep the lesser of the two permutations, in the hope that it leads to a better overall outcome. This is done to defeat the tendency of hill-climbing to get stuck at a local maximum in the search space.
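For what it's worth, the only part that changes relative to hill-climbing is the acceptance test. A sketch of the usual Metropolis-style rule in Python; the geometric cooling schedule and its default values are assumptions, not something the answer specifies.

```python
import math
import random


def accept(old_score: float, new_score: float, temperature: float,
           rng: random.Random) -> bool:
    """Always keep an improvement; keep a worse candidate with probability exp(delta / T)."""
    delta = new_score - old_score
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)


def temperature_at(step: int, start: float = 10.0, decay: float = 0.999) -> float:
    """Geometric cooling schedule (an assumed, commonly used choice)."""
    return start * decay ** step
```

Early on, when the temperature is high, worse permutations are accepted fairly often; as it cools, the rule behaves more and more like plain hill-climbing.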
By the way, it's "permuted" rather than "permutated."

Related

Genetic algorithm - new generations getting worse

I have implemented a simple Genetic Algorithm to generate short stories based on Aesop's fables.
Here are the parameters I'm using:
Mutation: Single-word swap mutation, tested with a rate of 0.01.
Crossover: Swap the story sentences at given point. rate - 0.7
Selection: Roulette wheel selection - https://stackoverflow.com/a/5315710/536474
Fitness function: 3 different functions. The highest score of each is 1.0, so the highest total fitness score is 3.0.
Population size: Since I'm using 86 Aesop fables, I tested with a population size of 50.
Initial population: The sentence orders of all 86 fables are shuffled to make complete nonsense. My goal is to generate something meaningful (at least at a certain level) from these fables that have lost their structure.
Stop Condition: 3000 generations.
And the results are below:
However, this still did not produce a favorable result. I was expecting a plot that goes up over the generations. Any ideas as to why my GA is producing worse results?
Update: As all of you suggested, I've employed elitism by 10% of current generation copied to next generation. Result still remains the same:
Probably I should use tournament selection.
All of the above responses are great and I'd look into them. I'll add my thoughts.
Mutation
Your mutation rate seems fine although with Genetic Algorithms mutation rate can cause a lot of issues if it's not right. I'd make sure you test a lot of other values to be sure.
With mutation I'd maybe use two types of mutation. One that replaces words with others from your dictionary, and one that swaps two words within a sentence. This would encourage diversifying the population as a whole, and shuffling words.
Crossover
I don't know exactly how you've implemented this but one-point crossover doesn't seem like it'll be that effective in this situation. I'd try to implement an n-point crossover, which will do a much better job of shuffling your sentences. Again, I'm not sure how it's implemented but just swapping may not be the best solution. For example, if a word is at the first point, is there ever any way for it to move to another position, or will it always be the first word if it's chosen by selection?
If word order is important for your chosen problem simple crossover may not be ideal.
Selection
Again, this seems fine but I'd make sure you test other options. In the past I've found rank based roulette selection to be a lot more successful.
Fitness
This is always the most important thing to consider in any genetic algorithm and with the complexity of problem you have I'd make doubly sure it works. Have you tested that it works with 'known' problems?
Population Size
Your value seems small but I have seen genetic algorithms work successfully with small populations. Again though, I'd experiment with much larger populations to see if your results are any better.
The most popular suggestion so far is to implement elitism and I'd definitely recommend it. It doesn't have to be much, even just the best couple of chromosomes every generation (although as with everything else I'd try different values).
Another sometimes useful operator to implement is culling. Destroy a portion of your weakest chromosomes, or ones that are similar to others (or both) and replace them with new chromosomes. This should help to stop your population going 'stale', which, from your graph looks like it might be happening. Mutation only does so much to diversify the population.
You may be losing the best combinations; you should keep the best of each generation without crossing (elitism). Also, your function seems to be quite stable; try other types of mutation, which should improve things.
Set aside 5% to 10% of your population as the elite, so that you don't lose the best you have.
Make sure your selection process is well set up, if bad candidates are passing through very often it'll ruin your evolution.
You might also be stuck in a local optimum; you might need to introduce other stuff into your genome, otherwise you won't move far.
Moving sentences and words around will probably not get you very far; introducing new sentences or words might be interesting.
If you think of a story as a point x,y and your evaluation function as f(x,y), and you're trying to find the max of f(x,y), but your mutation and cross-over are limited to x -> y, y -> y, it makes sense that you won't move far. Granted, in your problem there are a lot more variables, but without introducing something new, I don't think you can avoid locality.
As @GettnDer said, elitism might help a lot.
What I would suggest is to use a different selection strategy. The roulette wheel selection has one big problem: imagine that the best individual's fitness is e.g. 90% of the sum of all fitnesses. Then the roulette wheel is not likely to select the other individuals (see e.g. here). The selection strategy I like the most is the tournament selection. It is much more robust to big differences in fitness values and the selection pressure can be controlled very easily.
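A small Python sketch of tournament selection, in case it helps. The function names and the default tournament size `k=3` are assumptions for illustration; `k` is the knob that controls selection pressure.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(population: Sequence[T], fitness: Callable[[T], float],
                      k: int, rng: random.Random) -> T:
    """Draw k distinct individuals at random and return the fittest of them."""
    contestants = rng.sample(list(population), k)
    return max(contestants, key=fitness)


def select_parents(population: Sequence[T], fitness: Callable[[T], float],
                   count: int, k: int = 3, seed: int = 0) -> List[T]:
    """Run `count` independent tournaments of size k."""
    rng = random.Random(seed)
    return [tournament_select(population, fitness, k, rng) for _ in range(count)]
```

Because only the ranking inside each tournament matters, an individual holding 90% of the total fitness is no more dominant than one that is merely slightly better than the rest. A larger `k` means stronger pressure towards the best individuals.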
Novelty Search
I would also give a try to Novelty Search. It's a relatively new approach in evolutionary computation, where you don't do the selection based on the actual fitness but rather based on novelty, which is supposed to be some metric of how an individual is different in its behaviour from the others (but you still compute the fitness to catch the good ones). Of special interest might be combinations of classical fitness-driven algorithms and novelty-driven ones, like this one by J.-B. Mouret.
When working with genetic algorithms, it is good practice to structure your chromosome so that it reflects the actual knowledge of the process under optimization.
In your case, since you intend to generate stories, which are made of sentences, it could improve your results if you transformed your chromosomes into structured phrases, like <adjectives>* <subject> <verb> <object>* <adverbs>* (huge simplification here).
Each word could then be assigned a class. For instance, Fox=subject , looks=verb , grapes=object and then your crossover operator would exchange elements from the same category between chromosomes. Besides, your mutation operator could only insert new elements of a proper category (for instance, an adjective before the subject) or replace a word for a random word in the same category.
This way you would minimize the number of nonsensical chromosomes (like Fox beautiful grape day sky) and improve the discourse generation power for your GA.
Besides, I agree with all previous comments: if you are using elitism and the best performance decreases, then you are implementing it wrong (notice that in a pathological situation it may remain constant for a long period of time).
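To make the elitism point concrete, here is a minimal Python sketch. The `breed` callback is a hypothetical stand-in for your own selection, crossover and mutation; the only thing the sketch guarantees is that the elite are copied over untouched.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def next_generation(population: List[T], fitness: Callable[[T], float],
                    breed: Callable[[List[T], int], List[T]],
                    elite_fraction: float = 0.1) -> List[T]:
    """Copy the top elite_fraction unchanged, then fill the rest with bred offspring."""
    n_elite = max(1, int(len(population) * elite_fraction))
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:n_elite]  # never mutated, never crossed
    offspring = breed(population, len(population) - n_elite)
    return elite + offspring
```

With this structure, the best fitness in the new generation can never be lower than in the old one, provided the fitness function is deterministic. If your plot of the best score still goes down, the elite are probably being mutated or re-evaluated somewhere along the way.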
I hope it helps.

Pseudocode for script to check transcription accuracy / edit distances

I need to write a script, probably in Ruby, that will take one block of text and compare a number of transcriptions of recordings of that text to the original to check for accuracy. If that's just completely confusing, I'll try explaining another way...
I have recordings of several different people reading a script that is a few sentences long. These recordings have all been transcribed back to text a number of times by other people. I need to take all of the transcriptions (hundreds) and compare them against the original script for accuracy.
I'm having trouble even conceptualising the pseudocode, and wondering if someone can point me in the right direction. Is there an established algorithm I should be considering? The Levenshtein distance has been suggested to me, but this seems like it wouldn't cope well with longer strings, considering differences in punctuation choices, whitespace, etc.--missing the first word would wreck the entire algorithm, even if every other word were perfect. I'm open to anything--thank you!
Edit:
Thanks for the tips, psyho. One of my biggest concerns, however, is a situation like this:
Original Text:
I would've taken that course if I'd known it was available!
Transcription
I would have taken that course if I'd known it was available!
Even with a word-wise comparison of tokens, this transcription will be marked as quite errant, even though it's almost perfect, and this is hardly an edge-case! "would've" and "would have" are commonly pronounced extremely similarly, especially in this part of the world. Is there a way to make the approach you suggest robust enough to deal with this? I've thought about running a word-wise comparison both forward and backward and building a sort of composite score, but this would fall apart with a transcription like this:
I would have taken that course if I had known it was available!
Any ideas?
Simple version:
Tokenize your input into words (convert a string containing words, punctuation, etc. into an array of lowercase words, without punctuation).
Use the Levenshtein distance (wordwise) to compare the original array with the transcription arrays.
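The two steps above can be sketched as follows. The asker mentioned Ruby; this sketch is in Python, and the tokenizer's rule of keeping apostrophes inside words is an assumption you may want to change.

```python
import re
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep only runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance over any two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


original = "I would've taken that course if I'd known it was available!"
transcript = "I would have taken that course if I'd known it was available!"
distance = levenshtein(tokenize(original), tokenize(transcript))
```

Here `distance` comes out as 2 (one substitution plus one insertion) out of 11 words. A transcription that drops the first word costs exactly one edit, because the alignment simply shifts; it does not wreck the rest of the comparison.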
Possible improvements:
You could introduce tokens for punctuation (or replace them all with a simple token like '.').
The Levenshtein distance algorithm can be modified so that mistyping a character as one that is close to it on the keyboard generates a smaller distance. You could potentially apply the same idea here: when comparing individual words, use the Levenshtein distance (normalized so that its value ranges from 0 to 1, for example by dividing it by the length of the longer of the two words), and then use that value in the "outer" distance calculation.
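One way the nested "inner"/"outer" distance could look, sketched in Python. The names `edit_distance`, `word_cost` and `transcript_distance` are invented for this example, and insertions and deletions are assumed to keep a cost of 1.

```python
from typing import Callable, Sequence


def edit_distance(a: Sequence, b: Sequence,
                  sub_cost: Callable[[object, object], float]) -> float:
    """Edit distance with unit insert/delete cost and a caller-supplied substitution cost."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [float(i)]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1.0,
                               current[j - 1] + 1.0,
                               previous[j - 1] + sub_cost(x, y)))
        previous = current
    return previous[-1]


def char_cost(x: str, y: str) -> float:
    """Plain 0/1 cost for comparing two characters."""
    return 0.0 if x == y else 1.0


def word_cost(w1: str, w2: str) -> float:
    """Character-level distance divided by the longer word's length, so it lies in [0, 1]."""
    longest = max(len(w1), len(w2))
    return edit_distance(w1, w2, char_cost) / longest if longest else 0.0


def transcript_distance(original: Sequence[str], transcript: Sequence[str]) -> float:
    """Word-level distance in which near-miss words cost less than unrelated ones."""
    return edit_distance(original, transcript, word_cost)
```

With this, transcribing "course" as "coarse" costs 1/6 of an edit rather than a full one, while swapping in an unrelated word of the same length still costs 1.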
It's hard to say what algorithm will work best with your data. My tip is: make sure you have some automated way of visualizing or testing your solution. This way you can quickly iterate and experiment with your solution and see how your changes affect the end result.
EDIT:
In response to your concerns:
The easiest way would be to start with normalizing the shorter forms (using gsub):
str.gsub("n't", ' not').gsub("'d", " had").gsub("'re", " are")
Note, that you can even expand "'s" to " is", even if it's not grammatically correct, because if John's means "John is", then you will get it right, and if it means "owned by John", then most likely both texts will contain the same form, so you will not further the distance by expanding both "incorrectly". The other case is when it should mean "John has", but then after "'s" there probably will be "got", so you can handle that easily as well.
You will probably also want to deal with numeric values (1st = first, etc.). Generally you can probably improve the result by doing some preprocessing. Don't worry if it's not always 100% correct, it should just be correct enough:)
Since you're ultimately trying to compare how different transcribers have dealt with the way the passage sounds, you might try comparing using a phonetic algorithm such as Metaphone.
After experimenting with the issues I noted in this question, I found that the Levenshtein Distance actually takes these problems into account. I don't fully understand how or why, but can see after experimentation that this is the case.

How does one go about reverse engineering an algorithm?

I'm wondering how does one go about reversing an algorithm such as one for storing logins or pin codes.
Let's say I have an amount of data where:
7262627 -> ? -> 8172
5353773 -> ? -> 1132
etc. This is just an example. Or say a hex string that is transformed into another.
&h8712 -> &h1283 or something like that.
How do I go about starting to figure out what that algorithm is? Where does one start?
Would you start trying different shifts, xors and hope something stands out? I'm sure there's a better way as this seems like stabbing in the dark.
Is it even practically possible to reverse engineer this kind of algorithm?
Sorry if this is a stupid question. Thanks for your help / pointers.
There are a few things people try:
Get the source code, or disassemble an executable.
Guess, based on the hash functions other people use. For example, a hash consisting of 32 hex digits might well be one or more repetitions of MD5, and if you can get a single input/output pair then it is quite easy to confirm or refute this (although see "salt", below).
Statistically analyze a large number of pairs of inputs and outputs, looking for any kind of pattern or correlations, and relate those correlations to properties of known hash functions and/or possible operations that the designer of the system might have used. This is beyond the scope of a single technique, and into the realms of general cryptanalysis.
Ask the author. Secure systems don't usually rely on the secrecy of the hash algorithms they use (and don't usually stay secure long if they do). The examples you give are quite small, though, and secure hashing of passwords would always involve a salt, which yours apparently don't. So we might not be talking about the kind of system where the author is confident to do that.
In the case of a hash where the output is only 4 decimal digits, you can attack it simply by building a table of every possible 7 digit input, together with its hashed value. You can then reverse the table and you have your (one-to-many) de-hashing operation. You never need to know how the hash is actually calculated. How do you get the input/output pairs? Well, if an outsider can somehow specify a value to be hashed, and see the result, then you have what's called a "chosen plaintext", and an attack relying on that is a "chosen plaintext attack". So a 7 digit -> 4 digit hash would be very weak indeed if it was used in a way which allowed chosen plaintext attacks to generate a lot of input/output pairs. I realise that's just one example, but it's also just one example of a technique to reverse it.
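The table-building attack can be sketched in a few lines of Python. `toy_hash` is a hypothetical stand-in for the unknown function (in a real chosen-plaintext attack you would be calling the system itself), and the input range is cut down so the example runs quickly.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def build_reverse_table(hash_fn: Callable[[int], int],
                        inputs: Iterable[int]) -> Dict[int, List[int]]:
    """Map each observed output to every input that produces it (one-to-many)."""
    table: Dict[int, List[int]] = defaultdict(list)
    for value in inputs:
        table[hash_fn(value)].append(value)
    return table


def toy_hash(value: int) -> int:
    """Hypothetical stand-in for the unknown function being attacked."""
    return (value * 913 + 7) % 10_000


# A real attack on 7-digit inputs would iterate range(10_000_000) instead.
table = build_reverse_table(toy_hash, range(100_000))
preimages = table[toy_hash(12_345)]
```

Note that the table never needs to know how `toy_hash` works internally; it only needs to be able to call it.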
Note that reverse engineering the hash, and actually reversing it, are two different things. You could figure out that I'm using SHA-256, but that wouldn't help you reverse it (i.e., given an output, work out the input value). Nobody knows how to fully reverse SHA-256, although of course there are always rainbow tables (see "salt", above) <conspiracy>At least nobody admits they do, so it's no use to you or me.</conspiracy>
Probably, you can't. Suppose the transformation function is known, something like
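import hashlib
def hash(text):
    # "secret salt" stands in for a value known only to the system's author
    return hashlib.sha1(("secret salt" + text).encode()).hexdigest()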
But the "secret salt" is not known, and is cryptographically strong (a very large, random integer). You could never brute force the secret salt from even a very large number of plain-text, crypttext pairs.
In fact, if the precise hash function used was known to be one of two equally strong functions, you could never even get a good guess between which one was being used.
Stabbing in the dark will drive you to insanity. There are some algorithms that, given current understanding, you couldn't hope to deduce the inner workings of between now and the [predicted] end of the universe without knowing the exact details (potentially including private keys or internal state). Of course, some of these algorithms are the foundations of modern cryptography.
If you know in advance that there's a pattern to be discovered though, there are sometimes ways of approaching this. For instance, if the dataset contains several input values that differ by 1, compare the corresponding output values:
7262627 -> 8172
7262628 -> 819
7262629 -> 1732
...
7262631 -> 3558
Here it's fairly clear (given a few minutes and a calculator) that when the input increases by 1, the output increases by 913 modulo 8266 (i.e. a simple linear congruential generator).
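The calculator work can be written down as a short Python sketch. It assumes the outputs belong to consecutive inputs, that the step is smaller than the modulus, and that the sample contains at least one difference that wraps around and one that does not.

```python
from typing import List, Tuple


def recover_step_and_modulus(outputs: List[int]) -> Tuple[int, int]:
    """Given outputs for consecutive inputs, recover (step, modulus) of y -> (y + step) mod modulus."""
    diffs = [b - a for a, b in zip(outputs, outputs[1:])]
    step = next(d for d in diffs if d > 0)     # a difference with no wrap-around
    wrapped = next(d for d in diffs if d < 0)  # equals step - modulus
    return step, step - wrapped


step, modulus = recover_step_and_modulus([8172, 819, 1732])
predicted = (1732 + 2 * step) % modulus  # two inputs further on, i.e. 7262631
```

On the three outputs listed above this gives a step of 913 and a modulus of 8266, and the prediction for 7262631 is 3558, which matches the last line of the example data.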
Differential cryptanalysis is a relatively modern technique used to analyse the strength of cryptographic block ciphers, relying on a similar but more complex idea for where the cipher algorithm is known, but it's assumed the private key isn't. Input blocks differing from each other by a single bit are considered and the effect of that bit is traced through the cipher to deduce how likely each output bit is to "flip" as a result.
Other ways of approaching this kind of problem would be to look at the extremes (maximum, minimum values), distribution (leading to frequency analysis), direction (do the numbers always increase? decrease?) and (if this is allowed) consider the context in which the data sets were found. For instance, some types of PIN codes always contain a repeated digit to make them easier to remember (I'm not saying a PIN code can necessarily be deduced from anything else - just that a repeated digit is one less digit to worry about!).
Is it even practically possible to reverse engineer this kind of algorithm?
It is possible with a flawed algorithm and enough encrypted/unencrypted pairs, but a well-designed algorithm can eliminate that possibility altogether.

What is the empirically found best value for n in n-gram model?

I am implementing a variation of spell checker. After taking various routes (for improving the time efficiency) I am planning to try out a component which would involve use of n-gram model. So essentially I want to prune the list of likely candidates for further processing. Would you guys happen to know if using one value of n (say 2) will be better over other (say 3)?
According to this website, the average word length in English is 5.10 letters. I would assume that people are more likely to misspell longer words than shorter words, so I'd lean towards going around maybe 3-5 letters forward, if possible, as a gut feeling.
When you say n-grams, I'm going to assume you're talking about letters in a word, rather than words in a sentence (which probably is the most common usage). In this case, I would agree with Mark Rushakoff in that you could prune the candidates list to words whose length is within 3-5 characters of the word you're checking.
Another option would be to implement the Levenshtein algorithm to find the edit distance between two words. This can be done quite efficiently: Firstly, through only checking against your pruned list. Secondly, through ending the distance calculation of a word prematurely once the edit distance exceeds some sort of limit (e.g. 3-5).
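A sketch of the early-termination idea in Python. The names `bounded_levenshtein` and `candidates` and the default limit of 2 are assumptions for illustration.

```python
from typing import List, Optional


def bounded_levenshtein(a: str, b: str, limit: int) -> Optional[int]:
    """Return the edit distance if it is <= limit, otherwise None, stopping early when possible."""
    if abs(len(a) - len(b)) > limit:
        return None  # the length gap alone already exceeds the limit
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        if min(current) > limit:
            return None  # no cell in this row can lead to a result within the limit
        previous = current
    return previous[-1] if previous[-1] <= limit else None


def candidates(word: str, dictionary: List[str], limit: int = 2) -> List[str]:
    """Keep only dictionary words within `limit` edits of `word`."""
    return [w for w in dictionary if bounded_levenshtein(word, w, limit) is not None]
```

The length check at the top is the cheap pruning step; the row-minimum check is what ends the calculation prematurely. Note that plain Levenshtein counts a swap of two adjacent letters ("teh" for "the") as two edits, so a limit of 1 would miss those.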
As a side note, I disagree with Mark on that you should ignore short words, as they are less frequently misspelt. A large portion of the misspelt words will be short words (such as "and" - "nad", "the" - "teh", "you" - "yuo"), simply because they are much more frequent.
Hope this helps!
If you have sufficient text for training, 3 is a good start. On the other hand, such a model will be quite big and bloat your spell checker.
You could also compare different settings based on perplexity.

Algorithm for Comparing Words (Not Alphabetically)

I need to code a solution for a certain requirement, and I wanted to know if anyone is either familiar with an off-the-shelf library that can achieve it, or can direct me at the best practice. Description:
The user inputs a word that is supposed to be one of several fixed options (I hold the options in a list). I know the input must be a member of the list, but since it is user input, he/she may have made a mistake. I'm looking for an algorithm that will tell me which word the user most probably meant. I don't have any context and I can’t force the user to choose from a list (i.e. he must be able to input the word freely and manually).
For example, say the list contains the words "water", “quarter”, "beer", “beet”, “hell”, “hello” and "aardvark".
The solution must account for different types of "normal" errors:
Speed typos (e.g. doubling characters, dropping characters etc)
Keyboard adjacent-character typos (e.g. "qater" for “water”)
Non-native English typos (e.g. "quater" for “quarter”)
And so on...
The obvious solution is to compare letter-by-letter and give "penalty weights" to each different letter, extra letter and missing letter. But this solution ignores thousands of "standard" errors I'm sure are listed somewhere. I'm sure there are heuristics out there that deal with all the cases, both specific and general, probably using a large database of standard mismatches (I’m open to data-heavy solutions).
I'm coding in Python but I consider this question language-agnostic.
Any recommendations/thoughts?
You want to read how google does this: http://norvig.com/spell-correct.html
Edit: Some people have mentioned algorithms that define a metric between a user given word and a candidate word (levenshtein, soundex). This is however not a complete solution to the problem, since one would also need a datastructure to efficiently perform a non-euclidean nearest neighbour search. This can be done e.g. with the Cover Tree: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
A common solution is to calculate the Levenshtein distance between the input and your fixed texts. The Levenshtein distance of two strings is just the number of simple operations - insertions, deletions, and substitutions of a single character - required to turn one of the strings into the other.
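For a list as small as the one in the question, a brute-force comparison is probably enough. A minimal Python sketch using the question's word list; the helper name `closest` and its tie-breaking rule (first option in the list wins) are assumptions.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def closest(word: str, options: List[str]) -> str:
    """Return the option with the smallest edit distance to `word` (first one wins ties)."""
    return min(options, key=lambda option: levenshtein(word.lower(), option))


options = ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]
```

With this, `closest("qater", options)` gives "water" and `closest("quater", options)` gives "quarter". Inputs that are equally close to two options, such as "helo" for "hell" and "hello", are where you would need weights or extra heuristics.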
Have you considered algorithms that compare by phonetic sounds, such as soundex? It shouldn't be too hard to produce soundex representations of your list of words, store them, and then get a soundex of the user input and find the closest match there.
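For reference, a compact Python sketch of the standard American Soundex rules (first letter kept, following consonants mapped to digits, repeated codes collapsed, padded to four characters). Treat it as an illustration rather than a vetted library.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word, or '' if it has no letters."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            previous = code
    return (result + "000")[:4]
```

One caveat when applying it to the question's examples: "quater" encodes to Q360 but "quarter" to Q636, and "qater" (Q360) differs from "water" (W360) in the first letter, so Soundex on its own would miss both of those typos. It is better suited to sound-alike spellings than to keyboard slips.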
Look for the Bitap algorithm. It qualifies well for what you want to do, and even comes with a source code example in Wikipedia.
If your data set is really small, simply comparing the Levenshtein distance on all items independently ought to suffice. If it's larger, though, you'll need to use a BK-Tree or similar indexing system. The article I linked to describes how to find matches within a given Levenshtein distance, but it's fairly straightforward to adapt to do nearest-neighbor searches (and left as an exercise to the reader ;).
Though it may not solve the entire problem, you may want to consider using the soundex algorithm as part of the solution. A quick google search of "soundex" and "python" showed some python implementations of the algorithm.
Try searching for "Levenshtein distance" or "edit distance". It counts the number of edit operations (delete, insert, change letter) you need to transform one word into another. It's a common algorithm, but depending on the problem you might need something special with different weights for the different types of typos.

Resources