anagram string edit distance algorithm/code?

There are two anagram strings S and P. There are two basic operations:
Swap two adjacent letters, e.g., swap "A" and "C" in BCCAB; the cost is 1.
Swap the first letter and the last letter in the string; the cost is 1.
Question: Design an efficient algorithm that minimizes the cost of changing S to P.
I tried a greedy algorithm, but I found counterexamples, so I think it is incorrect. I know the famous edit distance DP problem, but I could not derive a recurrence for this one.
Can anyone help? An idea and pseudocode would be great.

I wonder if http://en.wikipedia.org/wiki/A*_search_algorithm would count as efficient? For a heuristic, look for the smallest distance each character has to go, treating the string as a circle, and divide the sum of these distances by two. On the circle, each character needs to participate in enough swaps to move it, one step at a time, to its destination, and each swap affects only two characters, so this heuristic should be a lower bound to the number of swaps required.
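A minimal sketch of that heuristic (the function name is mine; for simplicity it assumes all characters are distinct, so each character has a unique destination):

```python
def circular_lower_bound(s, p):
    """Lower bound on the swap cost: every move (adjacent swap or
    ends swap) shifts exactly two characters one step around the
    circle, so half the total circular travel distance is a bound."""
    n = len(s)
    pos_in_p = {c: i for i, c in enumerate(p)}
    total = 0
    for i, c in enumerate(s):
        d = abs(pos_in_p[c] - i)
        total += min(d, n - d)   # distance measured around the circle
    return (total + 1) // 2      # swaps are integral, so round up
```

This is admissible in the A* sense, so the search never overestimates the remaining cost.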

Without the ends-swap the answer is simple: you have to get the first and last letter right, and there's no way to "save" by doing it later; hence for the word a_i with 0 <= i < n you'd "bubble" the correct a_0 and a_{n-1} into place, then repeat on the subword a_i with 1 <= i < n-1 until you're left with 0 or 1 letters.
With the ends-swap option you're left with a much harder problem, since there are two directions from which each letter can arrive at the correct place. You'd basically have a bipartite graph between the source and target words, and you'd want to find a matching that minimizes the sum of distances. Even that is not really an algorithm, since each swap moves two of the letters, not just one.
Bottom line is, you may have to do a search, but at least you can bound the search with the no-ends-swap distance.
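The no-ends-swap bound from the bubbling idea can be sketched like this (my own helper, assuming the strings are anagrams with distinct characters; the greedy bubble count then equals the inversion count, which is minimal for adjacent swaps):

```python
def adjacent_swap_cost(s, p):
    """Minimum adjacent-swap cost to turn s into p when the ends-swap
    is NOT allowed: bubble each target letter into place, left to
    right. Assumes s and p are anagrams with distinct characters."""
    s = list(s)
    cost = 0
    for i, target in enumerate(p):
        j = s.index(target, i)       # where the needed letter sits now
        while j > i:                 # bubble it left one step at a time
            s[j - 1], s[j] = s[j], s[j - 1]
            cost += 1
            j -= 1
    return cost
```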

Algorithm for random sampling under multiple no-repeat conditions

I ran into the following issue:
So, I got an array of 100-1000 objects (size varies), e.g. something like
[{one:1,two:'A',three: 'a'}, {one:1,two:'A',three: 'b'}, {one:1,two:'A',three: 'c'}, {one:1,two:'A',three: 'd'},
{one:1,two:'B',three: 'a'},{one:2,two:'B',three: 'b'},{one:1,two:'B',three: 'c'}, {one:1,two:'B',three: 'd'},
{one:1,two:'C',three: 'a'},{one:1,two:'C',three: 'b'},{one:1,two:'C',three: 'c'}, {one:2,two:'C',three: 'd'},
{one:1,two:'D',three: 'a'},{one:1,two:'D',three: 'b'},{one:2,two:'D',three: 'c'}, {one:1,two:'D',three: 'd'},...]
The value for 'one' is pretty much arbitrary. 'two' and 'three' have to be balanced in a certain way: basically, in the above there is some n (here n=4) such that each of 'A', 'B', 'C', 'D' and each of 'a', 'b', 'c', 'd' occurs exactly n times - and such an n exists in any variant of this problem. It is just not clear in advance what n is, and the combinations themselves can also vary (e.g. if we only had As and Bs, [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] as well as [{1,A,a},{1,A,b},{1,B,a},{1,B,b}] would both be possible arrays with n=2).
What I am trying to do now is randomise the original array under the condition that there are no close-order repeats for some keys, i.e. the values of 'two' and 'three' for the object at index i-1 cannot equal the values of the same attributes for the object at index i (and that should hold for all objects, or as many as possible), e.g. [{1,B,a},{1,A,a},{1,C,b}] would not be allowed, while [{1,B,a},{1,C,b},{1,A,a}] would be.
I tried a brute-force method (randomise everything, then push wrong indexes to the back) that works occasionally, but mostly it just loops infinitely over the whole array because it never ends up repeat-free. I am not sure whether this is because it is mathematically impossible for some original arrays, or just because my solution is bad.
By now, I've been looking for over a week, and I am not even sure how to approach this.
It would be great if someone knew a solution to this problem, or at least a reason why it isn't possible. Any help is greatly appreciated!
First, let us dissect the problem.
Forget about one for now, and separate two and three into two independent sequences (assuming they are indeed independent, and not tied to each other).
The underlying problem is then as follows.
Given is a collection of c1 As, c2 Bs, c3 Cs, and so on. Place them randomly in such a way that no two consecutive letters are the same.
The trivial approach is as follows.
Suppose we already placed some letters, and are left with d1 As, d2 Bs, d3 Cs, and so on.
What is the condition when it is impossible to place the remaining letters?
It is when the count for one of the letters, say dk, is greater than one plus the sum of all other counts, 1 + d1 + d2 + ... excluding dk.
Otherwise, we can place them as K . K . K . K ..., where K is the k-th letter, and dots correspond to any letter except the k-th.
We can proceed at least as long as dk is still the greatest of the remaining quantities of letters.
So, on each step, if there is a dk equal to 1 + d1 + d2 + ... excluding dk, we should place the k-th letter right now.
Otherwise, we can place any other letter and still be able to place all others.
If there is no immediate danger of not being able to continue, adjust the probabilities to your liking, for example, weigh placing k-th letter as dk (instead of uniform probabilities for all remaining letters).
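The rule above can be sketched as follows (a helper of my own, working on a single sequence of letters with multiplicities; it is a sketch for the reduced one-attribute problem, not a drop-in solution for the full two-attribute one):

```python
import random

def no_repeat_shuffle(counts, rng=random):
    """Place letters so no two consecutive ones are equal: if some
    letter's remaining count equals 1 + (sum of all the others), it
    MUST be placed now; otherwise pick a count-weighted random letter
    different from the previous one. `counts` maps letter ->
    multiplicity. Returns the sequence, or None if impossible."""
    counts = dict(counts)
    out, prev = [], None
    while counts:
        remaining = sum(counts.values())
        # dead position: some letter outnumbers all the others + 1
        if any(d > remaining - d + 1 for d in counts.values()):
            return None
        # a letter with d == 1 + (sum of the others) must go now
        forced = [c for c, d in counts.items() if d == remaining - d + 1]
        if forced and forced[0] != prev:
            choice = forced[0]
        elif forced:
            return None                 # the forced letter equals prev
        else:
            candidates = [c for c in counts if c != prev]
            if not candidates:
                return None
            choice = rng.choices(candidates,
                                 weights=[counts[c] for c in candidates])[0]
        out.append(choice)
        counts[choice] -= 1
        if not counts[choice]:
            del counts[choice]
        prev = choice
    return out
```

Note that at most one letter can be forced at a time, since two counts each equal to one plus the sum of the others would together exceed the total.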
This problem smells like NP-completeness and lots of hard combinatorial optimization problems.
Just to find a solution, I'd always place next the remaining element that can legally be placed and that has the fewest remaining elements that can be placed beside it. In other words, try to get the hardest elements out of the way first: if they run into a problem, you're stuck; if that works, you're golden. (There are data structures, such as a heap, which can find those fairly efficiently.)
Now, armed with a "good enough" solver, I'd suggest picking the first element randomly, checking that the solver can solve the rest, and repeating. If at any point it takes too many guesses, just go with what the solver did last time. That way you know at every step that there IS a solution, even though you are trying to do things randomly at each step.
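A rough sketch of that "hardest first" greedy, under the assumption that two objects may be adjacent only when both two and three differ (items are modelled as (two, three) pairs; a linear scan stands in for the heap):

```python
from collections import Counter

def hardest_first_order(items):
    """Greedy: repeatedly place the compatible remaining item that
    itself has the fewest compatible remaining partners. Returns an
    ordering, or None if the greedy gets stuck."""
    def compatible(a, b):
        return a[0] != b[0] and a[1] != b[1]

    remaining = Counter(items)
    out, prev = [], None
    while remaining:
        candidates = [x for x in remaining
                      if prev is None or compatible(prev, x)]
        if not candidates:
            return None
        # degree = how many remaining items could sit next to x
        def degree(x):
            return sum(cnt for y, cnt in remaining.items()
                       if y != x and compatible(x, y))
        choice = min(candidates, key=degree)
        out.append(choice)
        remaining[choice] -= 1
        if not remaining[choice]:
            del remaining[choice]
        prev = choice
    return out
```

On the two small examples from the question, this finds an ordering for the first and correctly gets stuck on the second, which (as the graph answer below independently observes for its example) admits no valid ordering at all.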
Graph
As I understand it, one does not play a role in the constraints, so I'll label {one:1,two:'A',three: 'a'} as Aa. Think of the objects as vertices and place them in a graph, with an edge whenever the two respective vertices can sit beside each other. For [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] it would be,
and for [{1,A,a},{1,A,b},{1,B,a},{1,B,b}],
The problem becomes: select a random Hamiltonian path, if one exists. For the loop, that is any path along the circuit [Aa, Bb, Aa, Bb] or its reverse. For the disconnected lines, it is not possible.
Possible algorithm
I think that, to be uniformly random, we would have to enumerate all the possibilities and choose one at random. This is probably infeasible, even at 100 vertices.
A naïve algorithm that relaxes the uniformity criterion, I think, would be to select (a), a random vertex that does not split the graph in two. Then select (b), a random neighbour of (a) that does not split the graph in two. Move (a) to the solution and set (a) = (b). Keep going until the end, backtracking when there are no moves, (if possible.) There may be further heuristics that could cut down the branching factor.
Example
There are no vertices whose removal would disconnect the graph, so we choose Ab uniformly at random.
The neighbours of Ab are {Ca, Bc, Ba, Cc}, of which Ca is chosen randomly.
Removing Ab has split the graph, so we must choose Bc.
The only choice left is which of Cc and Ba comes first. We might end up with: [Ab, Ca, Bc, Ab, Ba, Cc].

Frechet distance in O(n)

I have seen in a number of articles that the Fréchet distance algorithm's complexity is O(n^2).
Suppose the paths are represented as arrays Q and P, each of size n.
What if I start from Q[0], P[0], check all the possibilities, and choose the minimal:
STP_i,j = min(|Q[i] - P[j+1]|, |Q[i+1] - P[j+1]|, |Q[i+1] - P[j]|)
and advance i and j accordingly?
Then I can get the answer in O(n).
Am I wrong?
Consider the next example:
Take the dots marked in black as the beginnings of the lines. In the first step, your algorithm would advance one point on both lines. However, the Fréchet distance in this case is the distance between the first red point and the third blue point; since your algorithm has already moved away from the first point, it will give you a larger value.
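For reference, the quadratic algorithm the question mentions is the standard discrete Fréchet dynamic program of Eiter and Mannila; a compact version (points as (x, y) tuples):

```python
from math import dist  # Euclidean distance, Python 3.8+

def discrete_frechet(Q, P):
    """O(n*m) dynamic program for the discrete Fréchet distance
    (Eiter & Mannila, 1994). ca[i][j] holds the coupling distance
    for the prefixes Q[0..i] and P[0..j]."""
    n, m = len(Q), len(P)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(Q[i], P[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(d, min(ca[i - 1][j],
                                      ca[i - 1][j - 1],
                                      ca[i][j - 1]))
    return ca[-1][-1]
```

The min over the three predecessors is what the greedy O(n) walk tries to shortcut; the counterexample above shows why a locally best step can be globally wrong.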

finding the count of cells in a given 2d array satisfying the given constraints

Given a 2-D array starting at (0,0) and extending to infinity along the positive x and y axes, and a number k > 0, find the number of cells reachable from (0,0) such that at every moment sum_of_digits(x) + sum_of_digits(y) <= k. Moves can be up, down, left or right, with x, y >= 0. DFS gives the answer, but it is not sufficient for large values of k. Can anyone help me with a better algorithm?
I think they asked you to calculate the number of cells (x,y) reachable with x + y <= k. If x = 1, for example, then y can take any value between 0 and k-1, and the sum would be <= k. The total number of possibilities can be calculated by
sum(sum(1,y=0..k-x),x=0..k) = 1/2*k²+3/2*k+1
That should be able to do the trick for large k.
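A quick sanity check of that closed form against brute force (note that 1/2*k² + 3/2*k + 1 factors as (k+1)(k+2)/2):

```python
def count_cells_closed_form(k):
    # 1/2*k^2 + 3/2*k + 1  ==  (k + 1) * (k + 2) / 2
    return (k + 1) * (k + 2) // 2

def count_cells_brute(k):
    # count lattice cells (x, y), x, y >= 0, with x + y <= k
    return sum(1 for x in range(k + 1) for y in range(k + 1 - x))
```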
I am somewhat confused by the "digits" in your question. The digits make up the index, like three 9s make 999. The digit sum for the cell (999,888) would be 51. If you allowed the digit sum to be up to 10^9, then an index could potentially have on the order of 10^8 digits, resulting in something around 10^(10^8) entries, well beyond normal table sizes. I am therefore assuming my first interpretation. If that's not correct, could you explain it a bit more?
EDIT:
Okay, so my answer is not going to solve it. I'm afraid I don't see a nice formula. I would approach it as a colouring/marking problem: mark all valid cells, then use some other technique to make sure all the parts are connected and to count them.
I have tried to come up with something, but it's too messy. Basically I would try to mark large parts at once based on the index and k. If k=20, you can mark the cell range (0, 0..299) at once (as any lower index will have a lower digit sum) and continue to check the rest of the range. I get 299 by fixing the last 2 digits to their maximum value and looking for the maximum value of the first digit. Then continue that process for the remaining hundreds (300-999), fixing only the last digit, to end up with 300..389 and 390..398. However, you can already see that it's a mess... (nevertheless I wanted to give it to you; you might get a better idea)
Another thing you can see immediately is that your problem is symmetric in the indices: any valid cell (x,y) tells you there's another valid cell (y,x). In a marking scheme / DFS / BFS this can be exploited.
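For completeness, the straightforward BFS over the digit-sum constraint looks like this; it is exactly the approach the question says is too slow for large k, and the x <-> y symmetry above could halve the work:

```python
from collections import deque

def digit_sum(n):
    return sum(int(d) for d in str(n))

def reachable_cells(k):
    """Count cells reachable from (0, 0) by up/down/left/right moves
    while digit_sum(x) + digit_sum(y) <= k holds at every cell."""
    seen = {(0, 0)}
    queue = deque(seen)
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (nx >= 0 and ny >= 0 and (nx, ny) not in seen
                    and digit_sum(nx) + digit_sum(ny) <= k):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return len(seen)
```

Note that the reachable region is smaller than the valid region: (10, 0) has digit sum 1, yet for k=1 it is walled off from the origin by the high-digit-sum cells in between.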

Revisit: 2D Array Sorted Along X and Y Axis

So, this is a common interview question. There's already a topic on it, which I have read, but it's dead and no answer was ever accepted. On top of that, my interest lies in a slightly more constrained form of the question, with a couple of practical applications.
Given a two dimensional array such that:
Elements are unique.
Elements are sorted along the x-axis and the y-axis.
Neither sort predominates, so neither sort is a secondary sorting parameter.
As a result, the diagonal is also sorted.
All of the sorts can be thought of as moving in the same direction. That is to say that they are all ascending, or that they are all descending.
Technically, I think as long as you have a >/=/< comparator, any total ordering should work.
Elements are numeric types, with a single-cycle comparator.
Thus, memory operations are the dominating factor in a big-O analysis.
How do you find an element? Only worst case analysis matters.
Solutions I am aware of:
A variety of approaches that are:
O(n log n), where you approach each row separately.
O(n log n) with strong best- and average-case performance.
One that is O(n+m):
Start in a non-extreme corner, which (for ascending sorts) we will assume is the top right.
Let the target be J. The current position is M.
If M is greater than J, move left.
If M is less than J, move down.
If you can do neither, you are done, and J is not present.
If M is equal to J, you are done.
Originally found elsewhere, most recently stolen from here.
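A sketch of that walk for ascending sorts, starting at the top-right corner (stepping left when the current value is too big, down when it is too small):

```python
def saddleback_search(matrix, target):
    """O(n + m) search in a matrix whose rows and columns are both
    sorted ascending. Each comparison discards a whole row or a
    whole column, so at most n + m steps are taken."""
    if not matrix or not matrix[0]:
        return None
    i, j = 0, len(matrix[0]) - 1        # top-right corner
    while i < len(matrix) and j >= 0:
        if matrix[i][j] == target:
            return (i, j)
        if matrix[i][j] > target:
            j -= 1    # everything below in this column is larger too
        else:
            i += 1    # everything left in this row is smaller too
    return None       # ran off the edge: target absent
```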
And I believe I've seen one with a worst case of O(n+m) but an optimal case of nearly O(log(n)).
What I am curious about:
Right now, I have proved to my satisfaction that a naive partitioning attack always devolves to O(n log n). Partitioning attacks in general appear to have an optimal worst case of O(n+m), and most do not terminate early when the element is absent. I was also wondering, as a result, whether an interpolation probe might be better than a binary probe, and thus it occurred to me that one might think of this as a set-intersection problem with a weak interaction between sets. My mind cast immediately towards Baeza-Yates intersection, but I haven't had time to draft an adaptation of that approach. However, given my suspicion that the optimality of an O(N+M) worst case is provable, I thought I'd just go ahead and ask here, to see if anyone could bash together a counter-argument, or pull together a recurrence relation for interpolation search.
Here's a proof that it has to be at least Omega(min(n,m)). Let n >= m. Then consider the matrix which has all 0s at (i,j) where i+j < m, all 2s where i+j >= m, except for a single (i,j) with i+j = m which has a 1. This is a valid input matrix, and there are m possible placements for the 1. No query into the array (other than the actual location of the 1) can distinguish among those m possible placements. So you'll have to check all m locations in the worst case, and at least m/2 expected locations for any randomized algorithm.
One of your assumptions was that matrix elements have to be unique, and I didn't honor that. It is easy to fix, however: just pick a big number X = n*m, replace all 0s with distinct numbers less than X, all 2s with distinct numbers greater than X, and the 1 with X.
And because the problem is also Omega(lg n) (by a counting argument), it is Omega(m + lg n) where n >= m.
An optimal O(m+n) solution is to start at the top-left corner, which has the minimal value. Move diagonally down and to the right until you hit an element whose value is >= the value of the given element. If the element's value equals that of the given element, return found as true.
Otherwise, from here we can proceed in two ways.
Strategy 1:
Move up in the column and search for the given element until we reach the end. If found, return found as true
Move left in the row and search for the given element until we reach the end. If found, return found as true
return found as false
Strategy 2:
Let i denote the row index and j the column index of the diagonal element we have stopped at (here i = j, BTW), and let k = 1.
Repeat the steps below while i-k >= 0:
Check whether a[i-k][j] equals the given element. If yes, return found as true.
Check whether a[i][j-k] equals the given element. If yes, return found as true.
Increment k.
1 2 4 5 6
2 3 5 7 8
4 6 8 9 10
5 8 9 10 11

string transposition algorithm

Suppose we are given two Strings:
String s1 = "MARTHA";
String s2 = "MARHTA";
Here we exchange the positions of T and H. I am interested in writing code which counts how many changes are necessary to transform one String into the other.
There are several edit distance algorithms; the given Wikipedia link points to a few.
Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers the whole alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 to s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broken up into permutation cycles in linear time. The cycles in the example are (2,1), (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps to "fix" it. Therefore, the number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps = N-C needs a proof... I didn't find one on the web.
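For what it's worth, here is the N-C computation under the stated no-repeated-characters assumption (helper names are mine):

```python
def min_swaps(s1, s2):
    """Minimum number of (arbitrary-position) swaps turning s2 into
    s1, via the cycle argument: answer = N - C, where C is the number
    of cycles of the permutation mapping s1 to s2. Assumes no
    character repeats; with repeats, the cheapest matching between
    equal characters would have to be chosen first."""
    if sorted(s1) != sorted(s2):
        raise ValueError("not anagrams")
    n = len(s1)
    pos_in_s2 = {c: i for i, c in enumerate(s2)}
    perm = [pos_in_s2[c] for c in s1]   # where s1[i]'s character sits in s2
    seen = [False] * n
    cycles = 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:          # walk this cycle to its end
                seen[j] = True
                j = perm[j]
    return n - cycles
```

On the s1="abcde", s2="badce" example above this returns 2, matching the hand count.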
Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (in equality) between the two strings. And you should look not just for a count but for the smallest count (or at least I suppose so), since otherwise there are infinitely many swap sequences turning one string into the other.
You should first check which characters are already in place; then, for every character that is not, look whether there is a pair that can be swapped so that the distance between the strings is reduced. Then iterate until you finish the process.
If you don't want to actually perform the swaps but just count them, use a bit array with 1 for every well-placed character and 0 otherwise. You are finished when every bit is 1.
