string transposition algorithm - algorithm

Suppose there is given two String:
String s1= "MARTHA"
String s2= "MARHTA"
here we exchange positions of T and H. I am interested to write code which counts how many changes are necessary to transform from one String to another String.

There are several edit distance algorithms, the given Wikipeida link has links to a few.

Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers all the alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 to s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broke up into permutation cycles in linear time. The cycles in the example are (2,1) (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps in order to "fix it". Therefore, The number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps=N-C needs a proof... I didn't find it in the web.

Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (in equality) between these two strings. Then actually you look for the count but you should look for the smallest count (or at least I suppose), otherwise there exists infinite ways to swap a string to obtain another one.
You should first check which charaters are already in place, then for every character that is not look if there is a couple that can be swapped so that the next distance between strings is reduced. Then iterate over until you finish the process.
If you don't want to effectively do it but just count the number of swaps use a bit array in which you have 1 for every well-placed character and 0 otherwise. You will finish when every bit is 1.

Related

Toggling bits pairs in an array to maximize its dot product with another array

Suppose two arrays are given A and B. A consists of integers and the second one consists of 0 and 1.
Now an operation is given - You can choose any adjacent bits in array B and you can toggle these two bits (for example - 00->11, 01->10, 10->01, 11->00) and you can perform this operation any number of times.
The output should be the sum of A[0]*B[0]+A[1]*B[1]+....+A[N-1]*B[N-1] such that the sum is maximum.
During the interview, my approach to this problem was to get the maximum number of 1's in array B in order to maximize the sum.
So to do that, I first calculated the total number of 1's in O(n) time in B. Let count = No. Of 1's=x.
Then I started traversing the array and toggle only if count becomes greater than x or based on the elements of array A (for example: Let B[i]=0 and B[i+1]=1 & A[i]=51 and A[i+1]=50
So I will toggle B[i] B[i+1] because A[i]>A[i+1])
But the interviewer was not quite satisfied with my approach and was asking me further to develop a less time complex algorithm.
Can anyone suggest a better approach with lesser time complexity?
You can create any B-vector with an even number of flipped bits just by repeatedly flipping the first bit that is in the wrong state.
So, pick all the positive numbers in A, and then drop the smallest one if you ended up with an a count that has a different oddness than the number of 1s in B. If you can't do that, because B has an odd number of 1s and A is all negative, then just pick the negative number closest to 0.
Then turn on all the bits corresponding to the numbers you chose, and turn off the other ones.

Check if string includes part of Fibonacci Sequence

Which way should I follow to create an algorithm to find out whether fibonacci sequence exists in a given string ?
The string includes only digits with no whitespaces and there may be more than one sequence, I need to find all of them.
If as your comment says the first number must have less than 6 digits, you can simply search for all positions there one of the 25 fibonacci numbers (there are only 25 with less than 6 digits) and than try to expand this 1 number sequence in both directions.
After your update:
You can even speed things up when you are only looking for sequences of at least 3 numbers.
Prebuild all 25 3-number-Strings that start with one of the 25 first fibonnaci-numbers this should give much less matches than the search for the single fibonacci-numbers I suggested above.
Than search for them (like described above and try to expand the found 3-number-sequences).
here's how I would approach this.
The main algorithm could search for triplets then try to extend them to as long a sequence as possible.
This leaves us with the subproblem of finding triplets. So if you are scanning through a string to look for fibonacci numbers, one thing you can take advantage of is that the next number must have the same number of digits or one more digit.
e.g. if you have the string "987159725844" and are considering "[987]159725844" then the next thing you need to look at is "987[159]725844" and "987[1597]25844". Then the next part you would find is "[2584]4" or "[25844]".
Once you have the 3 numbers you can check if they form an arithmetic progression with C - B == B - A. If they do you can now check if they are from the fibonacci sequence by seeing if the ratio is roughly 1.6 and then running the fibonacci iteration backwards down to the initial conditions 1,1.
The overall algorithm would then work by scanning through looking for all triples starting with width 1, then width 2, width 3 up to 6.
I'd say you should first find all interesting Fibonacci items (which, having 6 or less digits, are no more than 30) and store them into an array.
Then, loop every position in your input string, and try to find upon there the longest possible Fibonacci number (that is, you must browse the array backwards).
If some Fib number is found, then you must bifurcate to a secondary algorithm, consisting of merely going through the array from current position to the end, trying to match every item in the following substring. When the matching ends, you must get back to the main algorithm to keep searching in the input string from the current position.
None of these two algorithms is recursive, nor too expensive.
update
Ok. If no tables are allowed, you could still use this approach replacing in the first loop the way to get the bext Fibo number: Instead of indexing, apply your formula.

Conditional Randomization

Imagine there is a list of elements as follow:
1a, 2a, 3a, 4a, 5b, 6b, 7b, 8b
Now we need to randomize it such that not more than 2 "a"s or 2 "b"s get next to each other. For instance the following list is not allowed because of the 2nd, third and fourth elements:
3a, 7b, 8b, 5b, 2a, 1a, 5b, 4a
How can we write write an efficient code without generating many random sequences and many triad comparisons?
Create two bins, one for the a's and one for the b's. Pick from a random bin and record the bin. Pick a second number from a random bin. If the bin is not the same as before just record the bin. If the bin is the same as before then force the next pick to be from the other bin. Carry on forward, only forcing a bin when you have two picks in succession from the same bin.
I'm going to assume that:
There are only two kinds of element, a and b, and
There aren't "too many" of either kind (say, less than 30) or that you're willing to use a bignum package.
The basic idea is to (conceptually) first construct a valid sequence of as and bs, and then randomly assign the actual elements to the as and bs in the sequence. In practice, you could do both of these steps in parallel; every time you add an a to the sequence, you select a random a element from the set of such elements not yet assigned, and similarly with b elements.
The (slightly) complicated part is constructing the valid sequence without bias, and that's what I'm going to focus on.
As is often the case, the key is to be able to count the number of possible sequences, in a way which leads to an enumeration. We don't actually enumerate the possibilities -- that would take really a long time for even moderately long sequences -- but we do need to know for every prefix how to enumerate the sequences starting with that prefix.
Rather than produce the sequence element by element, we'll produce it in chunks of one or two elements of the same kind. Since we don't allow more than two consecutive elements of the same kind, the final sequence must be a series of alternating chunks. In effect, at every point except the very beginning, the choice is whether to select one or two of the "other" kind. At the beginning, we must select one or two of either kind, so we must first choose the starting kind, after which all the kinds are fixed; we merely need a sequence of 1's and 2's -- representing one element or two elements of the same kind -- with the kind alternating at each step. The sequence of 1s and 2s is constrained by the fact that we know how many elements there are of each kind, which corresponds to the sum of the numbers in the even and odd positions of the {1,2}-sequence.
Now, let's define f(m,n) as the count of sequences whose even and odd sums are m and n. (Using CS rather than maths rules, we'll assume that the first position is 0 (even) but it actually makes absolutely no difference.) Suppose that we have 6 as and 4 bs. There are then f(6,4) sequences which start with an a, and f(4,6) sequences which start with a b, so that the total count of valid sequences is f(6,4)+f(4,6).
Now, suppose we need to compute f(m,n). Assuming m is large enough, we have exactly two options: choose one of the m elements of the even kind or choose two of the m elements of the even kind. After that, we will swap even and odd because the next choice applies to the other kind.
That rather directly leads to the recursion
f(m, n) = f(n, m-1) + f(n, m-2)
which we might think of as a kind of two-dimensional fibonacci recursion. (Recall that fib(m) = fib(m-1) + fib(m-2); the difference here is the second argument, and the fact that the argument order flip-flops at each recursion.
As with Fibonacci numbers, computing the values naively without memoization leads to exponential blow-up of recursive calls, and a more efficient strategy is to compute the entire table starting from f(0,0) (which has the value 1, obviously); in essence, a dynamic programming approach. We could also just do the recursive computation with memoization, which is slightly less efficient but possibly easier to read.
For now, let's just assume that we've arranged for the computation of f(m,n) to be suitably fast, either because we've prebuilt the entire array of possibilities up to the largest values of m and n we will need, or because we're using a memoizing recursive solution so that we only need to do the slow computation once for any given m,n. Now let's construct the random sequence.
Suppose there are na a-elements and nb b-elements. Since we don't know whether the random sequence will start with an a or a b, we need to first make that decision. We know there are f(na,nb) valid sequences which start a and f(nb,na) valid sequences starting with a b, so we start by generating a random non-negative integer less than f(na,nb) + f(nb,na). If the random is less than f(na,nb) then we'll start with a-elements; otherwise we'll start with b elements.
Having made that decision, we'll proceed as follows. We know what the next element kind is and how many elements remain of each kind, so we only need to know whether to select one or two elements of the correct kind. To make that choice, we generate a non-negative random integer less than f(m, n); if it is less than f(n, m-1) then we select one element; otherwise we select two elements. Then we swap the element sets, fix the counts, and continue until m and n are both 0.

Weighted unordered string edit distance

I need an efficient way of calculating the minimum edit distance between two unordered collections of symbols. Like in the Levenshtein distance, which only works for sequences, I require insertions, deletions, and substitutions with different per-symbol costs. I'm also interested in recovering the edit script.
Since what I'm trying to accomplish is very similar to calculating string edit distance, I figured it might be called unordered string edit distance or maybe just set edit distance. However, Google doesn't turn up anything with those search terms, so I'm interested to learn if the problem is known by another name?
To clarify, the problem would be solved by
def unordered_edit_distance(target, source):
return min(edit_distance(target, source_perm)
for source_perm in permuations(source))
So for instance, the unordered_edit_distance('abc', 'cba') would be 0, whereas edit_distance('abc', 'cba') is 2. Unfortunately, the number of permutations grows large very quickly and is not practical even for moderately sized inputs.
EDIT Make it clearer that operations are associated with different costs.
Sort them (not necessary), then remove items which are same (and in equal numbers!) in both sets.
Then if the sets are equal in size, you need that numer of substitutions; if one is greater, then you also need some insertions or deletions. Anyway you need the number of operations equal the size of the greater set remaining after the first phase.
Although your observation is kind of correct, but you are actually make a simple problem more complex.
Since source can be any permutation of the original source, you first need check the difference in character level.
Have two map each map count the number of individual characters in your target and source string:
for example:
a: 2
c: 1
d: 100
Now compare two map, if you missing any character of course you need to insert it, and if you have extra character you delete it. Thats it.
Let's ignore substitutions for a moment.
Now it becomes a fairly trivial problem of determining the elements only in the first set (which would count as deletions) and those only in the second set (which would count as insertions). This can easily be done by either:
Sorting the sets and iterating through both at the same time, or
Inserting each element from the first set into a hash table, then removing each element from the second set from the hash table, with each element not found being an insertion and each element remaining in the hash table after we're done being a deletion
Now, to include substitutions, all that remains is finding the optimal pairing of inserted elements to deleted elements. This is actually the stable marriage problem:
The stable marriage problem (SMP) is the problem of finding a stable matching between two sets of elements given a set of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:
Some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and
B also prefers A over the element to which B is already matched
Which can be solved with the Gale-Shapley algorithm:
The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"). In the first round, first a) each unengaged man proposes to the woman he prefers most, and then b) each woman replies "maybe" to her suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her. In each subsequent round, first a) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged), and then b) each woman replies "maybe" to her suitor she most prefers (whether her existing provisional partner or someone else) and rejects the rest (again, perhaps including her current provisional partner). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).
We just need to get the cost correct. To pair an insertion and deletion, making it a substitution, we'll lose both the cost of the insertion and the deletion, and gain the cost of the substitution, so the net cost of the pairing would be substitutionCost - insertionCost - deletionCost.
Now the above algorithm guarantees that all insertion or deletions gets paired - we don't necessarily want this, but there's an easy fix - just create a bunch of "stay-as-is" elements (on both the insertion and deletion side) - any insertion or deletion paired with a "stay-as-is" element would have a cost of 0 and would result in it remaining an insertion or deletion and nothing would happen for two "stay-as-is" elements ending up paired.
Key observation: you are only concerned with how many 'a's, 'b's, ..., 'z's or other alphabet characters are in your strings, since you can reorder all the characters in each string.
So, the problem boils down to the following: having s['a'] characters 'a', s['b'] characters 'b', ..., s['z'] characters 'z', transform them into t['a'] characters 'a', t['b'] characters 'b', ..., t['z'] characters 'z'. If your alphabet is short, s[] and t[] can be arrays; generally, they are mappings from the alphabet to integers, like map <char, int> in C++, dict in Python, etc.
Now, for each character c, you know s[c] and t[c]. If s[c] > t[c], you must remove s[c] - t[c] characters c from the first unordered string (s). If s[c] < t[c], you must add t[c] - s[c] characters c to the second unordered string (t).
Take X, the sum of s[c] - t[c] for all c such that s[c] > t[c], and you will get the number of characters you have to remove from s in total. Take Y, the sum of t[c] - s[c] for all c such that s[c] < t[c], and you will get the number of characters you have to remove from t in total.
Now, let Z = min (X, Y). We can have Z substitutions, and what's left is X - Z insertions and Y - Z deletions. Thus the total number of operations is Z + (X - Z) + (Y - Z), or X + Y - min (X, Y).

Sort a given array based on parent array using only swap function

It is a coding interview question. We are given an array say random_arr and we need to sort it using only the swap function.
Also the number of swaps for each element in random_arr are limited. For this you are given an array parent_arr, containing number of swaps for each element of random_arr.
Constraints:
You should use swap function.
Every element may repeat minimum 5 times and maximum 26 times.
You cannot make elements of given array to 0.
You should not write helper functions.
Now I will explain how parent_arr is declared. If parent_arr is like:
parent_arr[] = {a,b,c,d,...,z} then
a can be swapped at most one time.
b can be swapped at most two times.
if parent_arr[] = {c,b,a,....,z} then
c can be swapped at most one time.
b can be swapped at most two times.
a can be swapped at most three times
My solution:
For each element in random_arr[] store that how many elements are below it, if it is sorted. Now select element having minimum swap count from parent_arr[] and check whether it exist in random_arr[]. If yes and it if has occurred more than one time then it will have more than one location where it can be placed. Now choose the position(rather element at that position, preciously) with maximum swap count and swap it. Now decrease the swap count for that element and sort the parent_arr[] and repeat the process.
But it is quite inefficient and its correctness can't be proved. Any ideas?
First, let's simplify your algorithm; then let's informally prove its correctness.
Modified algorithm
Observe that once you computed the number of elements below each number in the sorted sequence, you have enough information to determine for each group of equal elements x their places in the sorted array. For example, if c is repeated 7 times and has 21 elements ahead of it, then cs will occupy the range [21..27] (all indexes are zero-based; the range is inclusive of its ends).
Go through the parent_arr in the order of increasing number of swaps. For each element x, find the beginning of its target range rb; also note the end of its target range re. Now go through the elements of random_arr outside of the [rb..re] range. If you see x, swap it into the range. After swapping, increment rb. If you see that random_arr[rb] is equal to x, continue incrementing: these xs are already in the right spot, you wouldn't need to swap them.
Informal proof of correctness
Now lets prove the correctness of the above. Observe that once an element is swapped into its place, it is never moved again. When you reach an element x in the parent_arr, all elements with lower number of swaps are already processed. By construction of the algorithm this means that these elements are already in place. Suppose that x has k number of allowed swaps. When you swap it into its place, you move another element out.
This replaced element cannot be x, because the algorithm skips xs when looking for a destination in the target range [rb..re]. Moreover, the replaced element cannot be one of elements below x in the parent_arr, because all elements below x are in their places already, and therefore cannot move. This means that the swap count of the replaced element is necessarily k+1 or more. Since by the time that we finish processing x we have exhausted at most k swaps on any element (which is easy to prove by induction), any element that we swap out to make room for x will have at least one remaining swap that would allow us to swap it in place when we get to it in the order dictated by the parent_arr.

Resources