reverse deterministic shuffle -> derive key - algorithm

I'm looking for an algorithm that can derive a key from a shuffle that has already happened.
Assume we've got the string "hello", which was shuffled:
"hello" -> "loelh"
Now I would like to derive a key k from it which I could use to undo the shuffling. So if we use k as an input parameter for a deterministic shuffling algorithm such as Fisher-Yates and shuffle "loelh" again, we would restore the initial string "hello".
What I do not mean is simply using one and the same deterministic shuffling algorithm to shuffle and de-shuffle. That's because in my case the first string would not really have been shuffled in the classical sense. Actually, there would be two sets of data (byte or bit arrays) which are just given, and we want to get from the first to the second one with just a key which has been derived beforehand.
I hope it's clear what I want to achieve and I would appreciate all hints or proposed solutions.
Regards,
Merrit
UPDATE:
Another attempt:
Basically, one could also call it a deterministic transformation of a bunch of data, e.g. a byte array, but I will stick with the "hello" string example.
Assume we've got a transformation-algorithm transform(data, "unknown seed") where data is "hello" and unknown seed is what we are looking for. The result of transform is "loelh". We are looking for this "unknown seed" which we could use to reverse the process. At the time of the "unknown seed"-generation, both, the input data AND the result are known of course.
Later on I want to use the "unknown seed" (which should be known already ;-) to get the original string again: so this transform("loelh", seed) should lead to "hello" again.
So you could also see it as a form of equation like data*["unknown value"]=resultdata and we are trying to find the unknown value (the operator * could be any kind of operation).

First of all, let's simplify the problem greatly. Instead of permuting "hello", let's assume that you are always permuting "abcde", as that will make it easier to understand.
A shuffle is the random generation of a permutation. How the shuffle generates the permutation is irrelevant; shuffles generate permutations, that's all we need to know.
Let's state a permutation as a string containing the numbers 1 through 5. Suppose the shuffle produces permutation "21453". That is, we take the first letter and put it in position 2: _a___. We take the second letter and put it in position 1: ba___. We take the third letter and put it in position 4: ba_c_. We take the fourth letter and put it in position 5: ba_cd. And we take the fifth letter and put it in position 3: baecd.
Now you wish to deduce a "key" which allows you to "unpermute" the permutation. Well, that's just another permutation, called the inverse permutation. To compute the inverse permutation of "21453" you do the following:
Find "1". It's in the 2nd spot.
Find "2". It's in the 1st spot.
Find "3". It's in the 5th spot.
Find "4". It's in the 3rd spot.
Find "5". It's in the 4th spot.
And now read down the second column; the inverse permutation of "21453" is "21534". We are unpermuting "baecd". We put the first letter in position 2: _b___. We put the second letter in position 1: ab___. We put the third letter in position 5: ab__e. We put the fourth letter in position 3: abc_e. And we put the fifth letter in position 4: abcde.
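For the situation in the question, where both the input and the shuffled output are known, the permutation can be read off directly and then inverted. Here is a minimal Python sketch of that idea. The function names (derive_permutation, apply_permutation, invert) are made up for illustration, positions are 0-based rather than 1-based, and repeated symbols (the two l's in "hello") are paired off in order of appearance, which is one arbitrary but valid choice.

```python
from collections import defaultdict, deque


def derive_permutation(src, dst):
    """Return p such that the symbol at src[i] lands at position p[i] of dst."""
    if sorted(src) != sorted(dst):
        raise ValueError("dst is not a rearrangement of src")
    # Queue up the positions of every symbol in dst, so that repeated
    # symbols are paired off in order of appearance.
    slots = defaultdict(deque)
    for j, symbol in enumerate(dst):
        slots[symbol].append(j)
    return [slots[symbol].popleft() for symbol in src]


def apply_permutation(p, seq):
    """Move seq[i] to position p[i] and return the result as a list."""
    out = [None] * len(seq)
    for i, target in enumerate(p):
        out[target] = seq[i]
    return out


def invert(p):
    """Inverse permutation: if p sends i to j, the inverse sends j to i."""
    inv = [0] * len(p)
    for i, target in enumerate(p):
        inv[target] = i
    return inv


# The "key" is the inverse of the permutation that turned "hello" into "loelh".
key = invert(derive_permutation("hello", "loelh"))
print("".join(apply_permutation(key, "loelh")))  # hello
```

The key here is as long as the data itself. That is unavoidable in general, because an arbitrary rearrangement of n distinct symbols is one of n! possibilities.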

Shuffling is just creating a random permutation of a given sequence. The typical way to do that is something like the Fisher-Yates Shuffle that you pointed out. The problem is that the shuffle program generates multiple random numbers based on a seed, and unless you implement the random number generator there's no easy way to reverse the sequence of random numbers.
There is another way to do it. What if you could generate the nth permutation of a sequence directly? That is, given the string "Fast", you define the first few permutations as:
0 Fast
1 Fats
2 Fsat
3 Fsta
... etc. for all 24 permutations
You want a random permutation of those four characters. Select a random number from 0 to 23 and then call a function to generate that permutation.
If you know the key, you can call a different function, again passing that key, to have it reverse the permutation back to the original.
In the fourth article in his series on permutations, Eric Lippert showed how to generate the nth permutation without having to generate all of the permutations that come before it. He doesn't show how to reverse the process, but doing so shouldn't be difficult if you understand how the generator works. It's well worth the time to study the entire series of articles.
If you don't know what the key (i.e. the random number used) is, then deriving the sequence of swaps required to get to the original order is expensive.
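To make the "nth permutation" idea concrete, here is a small Python sketch. It is not taken from the article series mentioned above; it assumes the permutations are numbered from 0 in the order shown in the "Fast" list (lexicographic by original position), and the names nth_permutation and unpermute are hypothetical.

```python
from math import factorial


def nth_permutation(items, n):
    """Return the n-th permutation (0-based, lexicographic by position)."""
    pool = list(items)
    if not 0 <= n < factorial(len(pool)):
        raise ValueError("n out of range")
    result = []
    for remaining in range(len(pool) - 1, -1, -1):
        # Each choice for the next slot accounts for remaining! permutations.
        index, n = divmod(n, factorial(remaining))
        result.append(pool.pop(index))
    return result


def unpermute(permuted, n):
    """Undo nth_permutation(original, n) given only the result and the key n."""
    # Permute the index list the same way to learn where each item came from.
    sources = nth_permutation(range(len(permuted)), n)
    original = [None] * len(permuted)
    for position, source in enumerate(sources):
        original[source] = permuted[position]
    return original


print("".join(nth_permutation("Fast", 3)))  # Fsta
print("".join(unpermute("Fsta", 3)))        # Fast
```

With this numbering the key is a single integer between 0 and n!-1, so shuffling means picking a random key and calling nth_permutation, and unshuffling means calling unpermute with the same key.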
Edit
Upon reflection, it just might be possible to derive the key if you're given the original sequence and the transformed sequence. Since you know how far each symbol has moved, you should be able to derive the key. Consider the possible permutations of two letters:
0. ab 1. ba
Now, assign the letter b the value of 0, and the letter a the value of 1. What permutation number is ba? Find a in the string, swap to the left until it gets to the proper position, and multiply the number of swaps by one.
That's too easy. Consider the next one:
0. abc 1. acb 2. bac
3. cab 4. bca 5. cba
a is now 2, b is 1, and c is 0. Given cab:
swap a left one space. 1x2 = 2. Result is `acb`
swap b left one space. 1x1 = 1. Result is `abc`
So cab is permutation #3.
This does assume that your permutation generator numbers the permutations in the same way. It's also not a terribly efficient way of doing things. The worst case requires n(n-1)/2 swaps, where n is the length of the sequence. You can optimize the swaps by moving things in an array, but it's still an O(n^2) algorithm. Not terrible for 100 or maybe even 1,000 items. Pretty bad after that, though.
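The swap-counting scheme from the edit can be sketched in Python as follows. One assumption goes beyond the text: for longer sequences the weights are taken to be factorials ((n-1)!, (n-2)!, ...), which reproduces the values a=2, b=1 for three letters and a=1 for two letters. The function name permutation_number is made up, and the symbols are assumed to be distinct.

```python
from math import factorial


def permutation_number(permuted, original):
    """Number a permutation by counting weighted left-swaps per symbol."""
    work = list(permuted)
    n = len(original)
    number = 0
    for target, symbol in enumerate(original):
        current = work.index(symbol, target)  # where the symbol sits now
        swaps = current - target              # adjacent swaps to move it left
        # Assumed weight: (n-1)! for the first symbol, (n-2)! for the next, ...
        # The last symbol never has to move, so its weight does not matter.
        number += swaps * factorial(n - 1 - target)
        # Moving the symbol in one step has the same effect as the swaps.
        work.insert(target, work.pop(current))
    return number


print(permutation_number("cab", "abc"))  # 3
```

Note that the resulting numbering follows the table above (cab is #3, bca is #4), which differs from plain lexicographic order.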

Related

Check if string includes part of Fibonacci Sequence

How should I go about creating an algorithm to find out whether a Fibonacci sequence exists in a given string?
The string includes only digits with no whitespace, and there may be more than one sequence; I need to find all of them.
If, as your comment says, the first number must have fewer than 6 digits, you can simply search for all positions where one of the 25 Fibonacci numbers occurs (there are only 25 with fewer than 6 digits) and then try to expand this one-number sequence in both directions.
After your update:
You can even speed things up when you are only looking for sequences of at least 3 numbers.
Prebuild all 25 three-number strings that start with one of the first 25 Fibonacci numbers; this should give far fewer matches than the search for single Fibonacci numbers I suggested above.
Then search for them (as described above) and try to expand the three-number sequences you find.
Here's how I would approach this.
The main algorithm could search for triplets then try to extend them to as long a sequence as possible.
This leaves us with the subproblem of finding triplets. So if you are scanning through a string to look for fibonacci numbers, one thing you can take advantage of is that the next number must have the same number of digits or one more digit.
e.g. if you have the string "987159725844" and are considering "[987]159725844" then the next thing you need to look at is "987[159]725844" and "987[1597]25844". Then the next part you would find is "[2584]4" or "[25844]".
Once you have the 3 numbers A, B and C, you can check whether they satisfy the Fibonacci recurrence C == A + B (equivalently, C - B == A). If they do, you can check whether they come from the Fibonacci sequence itself by seeing if the ratio of consecutive numbers is roughly 1.6 and then running the Fibonacci iteration backwards down to the initial conditions 1,1.
The overall algorithm would then work by scanning through looking for all triples starting with width 1, then width 2, width 3 up to 6.
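The triple test can be sketched in a few lines of Python. This version checks the recurrence C = A + B and then steps the recurrence backwards until it either reaches 1,1 or leaves the sequence; the ratio check is skipped because the backward iteration already decides the question. The function name is_fibonacci_triple is made up for this example.

```python
def is_fibonacci_triple(a, b, c):
    """True if a, b, c are three consecutive terms of 1, 1, 2, 3, 5, ..."""
    if c != a + b:
        return False
    # Step the recurrence backwards: the pair (a, b) came from (b - a, a).
    while (a, b) != (1, 1):
        if a <= 0 or a > b:
            return False  # fell off the sequence before reaching 1, 1
        a, b = b - a, a
    return True


print(is_fibonacci_triple(987, 1597, 2584))  # True
print(is_fibonacci_triple(2, 4, 6))          # False
```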
I'd say you should first find all interesting Fibonacci items (which, having 6 or less digits, are no more than 30) and store them into an array.
Then, loop every position in your input string, and try to find upon there the longest possible Fibonacci number (that is, you must browse the array backwards).
If some Fib number is found, then you must bifurcate to a secondary algorithm, consisting of merely going through the array from current position to the end, trying to match every item in the following substring. When the matching ends, you must get back to the main algorithm to keep searching in the input string from the current position.
Neither of these two algorithms is recursive, nor too expensive.
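A rough Python sketch of this table-driven scan is shown below. The names fib_pairs and find_fib_runs are hypothetical, max_digits=5 reflects the "fewer than 6 digits" limit on the first number mentioned earlier in the thread, and min_len=3 is an assumed minimum run length. Runs that are the tail of a longer run are reported as well, since they start at a different position.

```python
def fib_pairs(max_digits):
    """Yield (F(k), F(k+1)) for every Fibonacci number F(k) up to max_digits long."""
    a, b = 1, 1
    while len(str(a)) <= max_digits:
        yield a, b
        a, b = b, a + b


def find_fib_runs(text, min_len=3, max_digits=5):
    """Return (start, numbers) for each run of consecutive Fibonacci numbers."""
    pairs = list(fib_pairs(max_digits))
    runs = []
    for start in range(len(text)):
        for a, b in pairs:
            position, run = start, []
            # Secondary loop: keep matching the next Fibonacci number in place.
            while text.startswith(str(a), position):
                run.append(a)
                position += len(str(a))
                a, b = b, a + b
            if len(run) >= min_len:
                runs.append((start, run))
    return runs


print(find_fib_runs("987159725844"))  # [(0, [987, 1597, 2584])]
```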
update
OK. If no tables are allowed, you could still use this approach by replacing, in the first loop, the way to get the next Fibonacci number: instead of indexing, apply your formula.

Conditional Randomization

Imagine there is a list of elements as follow:
1a, 2a, 3a, 4a, 5b, 6b, 7b, 8b
Now we need to randomize it such that no more than 2 "a"s or 2 "b"s end up next to each other. For instance, the following list is not allowed because of the second, third and fourth elements:
3a, 7b, 8b, 5b, 2a, 1a, 6b, 4a
How can we write efficient code for this without generating many random sequences and doing many triad comparisons?
Create two bins, one for the a's and one for the b's. Pick from a random bin and record the bin. Pick a second number from a random bin. If the bin is not the same as before just record the bin. If the bin is the same as before then force the next pick to be from the other bin. Carry on forward, only forcing a bin when you have two picks in succession from the same bin.
I'm going to assume that:
There are only two kinds of element, a and b, and
There aren't "too many" of either kind (say, less than 30) or that you're willing to use a bignum package.
The basic idea is to (conceptually) first construct a valid sequence of as and bs, and then randomly assign the actual elements to the as and bs in the sequence. In practice, you could do both of these steps in parallel; every time you add an a to the sequence, you select a random a element from the set of such elements not yet assigned, and similarly with b elements.
The (slightly) complicated part is constructing the valid sequence without bias, and that's what I'm going to focus on.
As is often the case, the key is to be able to count the number of possible sequences, in a way which leads to an enumeration. We don't actually enumerate the possibilities -- that would take really a long time for even moderately long sequences -- but we do need to know for every prefix how to enumerate the sequences starting with that prefix.
Rather than produce the sequence element by element, we'll produce it in chunks of one or two elements of the same kind. Since we don't allow more than two consecutive elements of the same kind, the final sequence must be a series of alternating chunks. In effect, at every point except the very beginning, the choice is whether to select one or two of the "other" kind. At the beginning, we must select one or two of either kind, so we must first choose the starting kind, after which all the kinds are fixed; we merely need a sequence of 1's and 2's -- representing one element or two elements of the same kind -- with the kind alternating at each step. The sequence of 1s and 2s is constrained by the fact that we know how many elements there are of each kind, which corresponds to the sum of the numbers in the even and odd positions of the {1,2}-sequence.
Now, let's define f(m,n) as the count of sequences whose even and odd sums are m and n. (Using CS rather than maths rules, we'll assume that the first position is 0 (even) but it actually makes absolutely no difference.) Suppose that we have 6 as and 4 bs. There are then f(6,4) sequences which start with an a, and f(4,6) sequences which start with a b, so that the total count of valid sequences is f(6,4)+f(4,6).
Now, suppose we need to compute f(m,n). Assuming m is large enough, we have exactly two options: choose one of the m elements of the even kind or choose two of the m elements of the even kind. After that, we will swap even and odd because the next choice applies to the other kind.
That rather directly leads to the recursion
f(m, n) = f(n, m-1) + f(n, m-2)
which we might think of as a kind of two-dimensional Fibonacci recursion. (Recall that fib(m) = fib(m-1) + fib(m-2); the difference here is the second argument, and the fact that the argument order flip-flops at each recursion.)
As with Fibonacci numbers, computing the values naively without memoization leads to exponential blow-up of recursive calls, and a more efficient strategy is to compute the entire table starting from f(0,0) (which has the value 1, obviously); in essence, a dynamic programming approach. We could also just do the recursive computation with memoization, which is slightly less efficient but possibly easier to read.
For now, let's just assume that we've arranged for the computation of f(m,n) to be suitably fast, either because we've prebuilt the entire array of possibilities up to the largest values of m and n we will need, or because we're using a memoizing recursive solution so that we only need to do the slow computation once for any given m,n. Now let's construct the random sequence.
Suppose there are na a-elements and nb b-elements. Since we don't know whether the random sequence will start with an a or a b, we need to first make that decision. We know there are f(na,nb) valid sequences which start a and f(nb,na) valid sequences starting with a b, so we start by generating a random non-negative integer less than f(na,nb) + f(nb,na). If the random is less than f(na,nb) then we'll start with a-elements; otherwise we'll start with b elements.
Having made that decision, we'll proceed as follows. We know what the next element kind is and how many elements remain of each kind, so we only need to know whether to select one or two elements of the correct kind. To make that choice, we generate a non-negative random integer less than f(m, n); if it is less than f(n, m-1) then we select one element; otherwise we select two elements. Then we swap the element sets, fix the counts, and continue until m and n are both 0.
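The counting and sampling steps can be sketched in Python like this. It uses the memoizing recursive variant of f; the helper names random_kinds and constrained_shuffle are made up, and the convention that f is 0 for negative arguments is an assumption that handles the "m is large enough" caveat. Python integers are arbitrary precision, so no separate bignum package is needed.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def f(m, n):
    """Count chunk sequences using m of the current kind and n of the other."""
    if m < 0 or n < 0:
        return 0
    if m == 0 and n == 0:
        return 1
    # Take one or two of the current kind, then the kinds swap roles.
    return f(n, m - 1) + f(n, m - 2)


def random_kinds(na, nb, rng=random):
    """Uniformly pick a valid pattern such as 'aabbab' (no 3 equal in a row)."""
    total = f(na, nb) + f(nb, na)
    if total == 0:
        raise ValueError("no valid arrangement exists")
    if rng.randrange(total) < f(na, nb):
        kinds, m, n = "ab", na, nb
    else:
        kinds, m, n = "ba", nb, na
    pattern = []
    while m or n:
        take = 1 if rng.randrange(f(m, n)) < f(n, m - 1) else 2
        pattern.append(kinds[0] * take)
        kinds = kinds[::-1]   # the other kind is next
        m, n = n, m - take    # swap the counts and fix them up
    return "".join(pattern)


def constrained_shuffle(a_items, b_items, rng=random):
    """Assign randomly ordered a- and b-elements to a random valid pattern."""
    pools = {"a": list(a_items), "b": list(b_items)}
    for pool in pools.values():
        rng.shuffle(pool)
    pattern = random_kinds(len(pools["a"]), len(pools["b"]), rng)
    return [pools[kind].pop() for kind in pattern]


print(constrained_shuffle(["1a", "2a", "3a", "4a"], ["5b", "6b", "7b", "8b"]))
```

Because every valid pattern is equally likely and the elements of each kind are shuffled independently, every valid arrangement should come out with the same probability.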

Where to start: what set of N letters makes the most words?

I'm having trouble coming up with a non-brute force approach to solve this problem I've been wondering: what set of N letters can be used to make the most words from a given dictionary? Letters can be used any number of times.
For example, for N=3, we can have EST to give words like TEST and SEE, etc...
Searching online, I found some answers (such as listed above for EST), but no description of the approach.
My question is: what well-known problems are similar to this, or what principles should I use to tackle this problem?
NOTE: I know it's not necessarily true that if EST is the best for N=3, then ESTx is the best for N=4. That is to say, you can't just append a letter to the previous solution.
In case you're wondering, this question came to mind because I was wondering what set of 4 ingredients could make the most cocktails, and I started searching for that. Then I realized my question was specific, and so I figured this letter question is the same type of problem, and started searching for it as well.
For each word in the dictionary, sort its letters and remove duplicates. Call the result the skeleton of the word. For each skeleton, count how many words reduce to it; call this its frequency. Ignore all skeletons whose size is greater than N.
Let a subskeleton be any possible removals of 1 or more letters from the skeleton, i.e. EST has subskeletons of E,S,T,ES,ET,ST. For each skeleton of size N, add the count of this skeleton and all its subskeletons. Select the skeleton with maximal sum.
You need O(2**N*D) operations, where D is size of the dictionary.
Correction: we need to take into account all skeletons of size up to N (not only those of words), and the number of operations will be O(2**N*C(L,N)), where L is the number of letters (26 in English).
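A brute-force Python sketch of the corrected skeleton approach might look like this. The function name best_letter_set and the default lowercase English alphabet are assumptions, and ties are resolved in favour of the alphabetically first candidate.

```python
from collections import Counter
from itertools import combinations
from string import ascii_lowercase


def best_letter_set(words, n, alphabet=ascii_lowercase):
    """Return (letters, word_count) for the n-letter set spelling most words."""
    # Skeleton = set of distinct letters of a word; count words per skeleton.
    frequency = Counter(frozenset(word.lower()) for word in words if word)
    best, best_score = None, -1
    for candidate in combinations(alphabet, n):
        # A word can be spelled iff its skeleton is a subset of the candidate,
        # so add up the frequencies of all non-empty subsets (2**n - 1 of them).
        score = sum(frequency[frozenset(subset)]
                    for size in range(1, n + 1)
                    for subset in combinations(candidate, size))
        if score > best_score:
            best, best_score = candidate, score
    return "".join(best), best_score


print(best_letter_set(["test", "see", "set", "tees", "at", "tat"], 3))  # ('est', 4)
```

For N=3 and 26 letters this evaluates 2,600 candidates with 7 subsets each, which is cheap; the C(L,N) factor grows quickly for larger N.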
So I coded up a solution to this problem that uses a hash table to get things done. I had to deal with a few problems along the way too!
Let N be the size of the group of letters you are looking for that can make the most words. Let L be the length of the dictionary.
1) Convert each word in the dictionary into a set of letters: 'test' -> {'e','s','t'}
2) For each number 1 to N inclusive, create a cut list that contains the words you can make with exactly that many letters.
3) Make a hash table for each number 1 to N inclusive, then go through the corresponding cut list and use the set as a key, and increment by 1 for each member of the cut list.
4) This was the part that gave me trouble! Create a set out of your cut list (unique_cut_list) for N. This is essentially all the populated key-value pairs for the hash table for N.
5) For each set in unique_cut_list, generate all subsets, and check the corresponding hash table (the size of the subset) to see if there is a value. If there is, add that value to the hash table for N with the key of the original set.
6) Finally, go through the hash table and find the max value. The corresponding key is the group of letters you're after.
You go through the dictionary 1+2N times for steps 1-5, step 6 goes through a version of the dictionary and check (2^N)-1 subsets each time (ignore null set). That gives O(2NL + L*2^N) which should approach O(L*2^N). Not bad, since N will not be too big in most applications!

Word Prediction algorithm

I'm sure there is a post on this, but I couldn't find one asking this exact question. Consider the following:
We have a word dictionary available
We are fed many paragraphs of words, and I wish to be able to predict the next word in a sentence given this input.
Say we have a few sentences such as "Hello my name is Tom", "His name is jerry", "He goes where there is no water". We check a hash table if a word exists. If it does not, we assign it a unique id and put it in the hash table. This way, instead of storing a "chain" of words as a bunch of strings, we can just have a list of uniqueID's.
Above, we would have for instance (0, 1, 2, 3, 4), (5, 2, 3, 6), and (7, 8, 9, 10, 3, 11, 12). Note that 3 is "is" and we added new unique IDs as we discovered new words. So say we are given a sentence "her name is", this would be (13, 2, 3). We want to know, given this context, what the next word should be. This is the algorithm I thought of, but I don't think it's efficient:
We have a list of N chains (observed sentences) where a chain may be ex. 3,6,2,7,8.
Each chain is on average size M, where M is the average sentence length
We are given a new chain of size S, ex. 13, 2, 3, and we wish to know what is the most probable next word?
Algorithm:
1) First scan the entire list of chains for those that contain the full S input (13,2,3 in this example). Since we have to scan N chains, each of length M, and compare S words at a time, it's O(N*M*S).
2) If no chain in our scan contains the full S, scan again after removing the least significant word (i.e. the first one, so remove 13). Now scan for (2,3) as in step 1, in worst case O(N*M*S), where S is really S-1.
3) Continue scanning this way until we get results > 0 (if ever).
4) Tally the next words in all of the remaining chains we have gathered. We can use a hash table which counts every time we add, and keeps track of the most added word. O(N) worst case build, O(1) to find the max word.
5) The max word found is the most likely, so return it.
Each scan takes O(M*N*S) worst case. This is because there are N chains, each chain has M numbers, and we must check S numbers when overlaying a match. We scan S times worst case (13,2,3, then 2,3, then 3, for 3 scans = S). Thus, the total complexity is O(S^2 * M * N).
So if we have 100,000 chains and an average sentence length of 10 words, we're looking at 1,000,000*S^2 to get the optimal word. Clearly, N >> M, since sentence length does not scale with number of observed sentences in general, so M can be a constant. We can then reduce the complexity to O(S^2 * N). O(S^2 * M * N) may be more helpful for analysis though, since M can be a sizeable "constant".
This could be the completely wrong approach to take for this type of problem, but I wanted to share my thoughts instead of just blatantly asking for assistance. The reason I'm scanning the way I do is that I only want to scan as much as I have to. If nothing has the full S, just keep pruning S until some chains match. If they never match, we have no idea what to predict as the next word! Any suggestions on a less time/space complex solution? Thanks!
This is the problem of language modeling. For a baseline approach, the only thing you need is a hash table mapping fixed-length chains of words, say of length k, to the most probable following word.(*)
At training time, you break the input into (k+1)-grams using a sliding window. So if you encounter
The wrath sing, goddess, of Peleus' son, Achilles
you generate, for k=2,
START START the
START the wrath
the wrath sing
wrath sing goddess
sing goddess of
goddess of peleus
of peleus son
peleus son achilles
This can be done in linear time. For each 3-gram, tally (in a hash table) how often the third word follows the first two.
Finally, loop through the hash table and for each key (2-gram) keep only the most commonly occurring third word. Linear time.
At prediction time, look only at the k (2) last words and predict the next word. This takes only constant time since it's just a hash table lookup.
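A minimal Python sketch of this baseline follows. The crude tokenizer (lowercase letter runs only) and the names train and predict are assumptions made for the example; a real system would tokenize more carefully.

```python
import re
from collections import Counter, defaultdict

START = "START"


def tokenize(text):
    """Lowercase the text and keep runs of letters as words (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def train(sentences, k=2):
    """Map every k-word context to its most frequent following word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [START] * k + tokenize(sentence)
        # Sliding window: k context words followed by the word to predict.
        for i in range(k, len(tokens)):
            counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return {context: followers.most_common(1)[0][0]
            for context, followers in counts.items()}


def predict(model, text, k=2):
    """Look up the last k words of text; None if the context was never seen."""
    context = tuple(([START] * k + tokenize(text))[-k:])
    return model.get(context)


model = train(["Hello my name is Tom", "His name is Jerry", "Her name is Tom"])
print(predict(model, "their name is"))  # tom
```

Training is a single pass over the input and prediction is one dictionary lookup, which is where the linear and constant time bounds come from.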
If you're wondering why you should keep only short subchains instead of full chains, then look into the theory of Markov windows. If your model were to remember all the chains of words that it has seen in its input, then it would badly overfit its training data and only reproduce its input at prediction time. How badly depends on the training set (more data is better), but for k>4 you'd really need smoothing in your model.
(*) Or to a probability distribution, but this is not needed for your simple example use case.
Yee Whye Teh also has some recent interesting work that addresses this problem. The "Sequence Memoizer" extends the traditional prediction-by-partial-matching scheme to take into account arbitrarily long histories.
Here is a link to the original paper: http://www.stats.ox.ac.uk/~teh/research/compling/WooGasArc2011a.pdf
It is also worth reading some of the background work, which can be found in the paper "A Bayesian Interpretation of Interpolated Kneser-Ney"

string transposition algorithm

Suppose two strings are given:
String s1= "MARTHA"
String s2= "MARHTA"
Here we exchanged the positions of T and H. I am interested in writing code which counts how many changes are necessary to transform one string into the other.
There are several edit distance algorithms; the given Wikipedia link has links to a few.
Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers all the alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 to s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broken up into permutation cycles in linear time. The cycles in the example are (2,1), (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps in order to "fix it". Therefore, the number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps=N-C needs a proof... I didn't find one on the web.
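Under the stated assumption that no character is repeated, the two steps can be sketched in Python as follows. The function name min_swaps is made up, and the sketch deliberately refuses strings with repeated characters (such as "MARTHA") rather than guessing at the adjustment mentioned in 1).

```python
from collections import Counter


def min_swaps(s1, s2):
    """Minimum number of swaps turning s2 into s1, via the N - C formula."""
    # Step 1: both strings must contain exactly the same characters.
    if Counter(s1) != Counter(s2):
        raise ValueError("s2 is not a permutation of s1")
    if len(set(s1)) != len(s1):
        raise ValueError("repeated characters are not handled by this sketch")
    # Step 2: build the permutation and count its cycles.
    position_in_s1 = {ch: i for i, ch in enumerate(s1)}
    perm = [position_in_s1[ch] for ch in s2]
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycles += 1
        j = start
        while not seen[j]:  # walk the cycle containing start
            seen[j] = True
            j = perm[j]
    return len(perm) - cycles


print(min_swaps("abcde", "badce"))  # 2
```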
Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (the number of mismatched positions) between the two strings. Also, what you are really looking for is not just any count but the smallest count (or at least I suppose so); otherwise there are infinitely many ways to swap one string into the other.
You should first check which characters are already in place; then, for every character that is not, look for a pair that can be swapped so that the distance between the strings is reduced. Then iterate until you finish the process.
If you don't want to actually perform the swaps but just count them, use a bit array in which you have 1 for every well-placed character and 0 otherwise. You are finished when every bit is 1.
