I'm looking for an algorithm similar to the longest common subsequence algorithms, but with an alphabet-letter similarity metric. What I mean is that the known algorithms treat all letters of the alphabet as completely distinct; in my use case some letters are easier to edit into other letters, so the diffing algorithm should treat them as similar.
As usage example you may think about diffing algorithm working on lines of text where some lines are more similar to other lines.
The paper An O(ND) Difference Algorithm and Its Variations states on page 4: "Consider adding a weight or cost to every edge. Give diagonal edges weight 0 and non-diagonal edges weight 1." I'd like to have the option of assigning any weight from the [0, 1] interval.
The Longest Common Subsequence (LCS) problem is usually solved by dynamic programming, and you can tweak the existing methods to apply them to your use case.
In this example explaining how LCS works (from Wikipedia), https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Example, you should think about tweaking the algorithm so that:
instead of scoring:
score_j = score_i + 1, for j = i + 1 (that is to say, you add 1 when a new common item is added to the LCS), you should score:
score_j = F(score_i, p(letter_i, letter_j)) no matter what.
p(letter_i, letter_j) is the probability of changing from letter_i to letter_j (that is, the weight in [0, 1] you were talking about).
F is an aggregation function used to go from score_i to score_j given that probability p.
For instance F can be defined as the operator +. It would then yield:
score_j = score_i + p(letter_i, letter_j)
or more precisely:
score_j = score_i + p(letter_i, letter_j) * 1 (read the "* 1" as "times one character")
and that would give you the maximum similarity (in characters) of the two subsequences, which you can recover by backtracking at the end of the algorithm.
You can find your own function F to yield better results.
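To make this concrete, here is a minimal Python sketch of the weighted DP with F = '+' (my own code and naming, not taken from the paper or the linked article); the similarity function passed in plays the role of p:

def weighted_lcs_score(a, b, p):
    """dp[i][j] = best total similarity between a[:i] and b[:j].

    p(x, y) is the similarity weight in [0, 1]; p(x, x) should be 1.
    With p(x, y) = 1 if x == y else 0 this reduces to the classic LCS length.
    """
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],                                # skip a[i-1]
                dp[i][j - 1],                                # skip b[j-1]
                dp[i - 1][j - 1] + p(a[i - 1], b[j - 1]),    # pair them, weighted
            )
    return dp[n][m]

# Example: treat 'l' and '1' as 90% similar, everything else exact-match only.
sim = lambda x, y: 1.0 if x == y else (0.9 if {x, y} == {'l', '1'} else 0.0)
print(weighted_lcs_score("hello", "he11o", sim))   # 4.8 (up to floating-point rounding)

Backtracking is left out for brevity; it works exactly as in the unweighted version, by walking back through dp and recording which of the three cases produced each cell.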
I am looking for an efficient algorithm to do the following.
Given a table/dictionary of letter counts
counts = {'a': 3, 'b': 2, 'd': 2, 'e': 2, 'k':2, 'r': 2 }
and a list of M words
words = ['break', 'add', 'build', ... ]
write a function that finds N words from the list (repetition allowed) that together use up all the letters
>>> find_words(counts, words, 3)
['break', 'break', 'add']
The naive approach of going over all the words N times is too slow (I think it's O(m^n)).
This problem has a strongly NP-complete feeling to it. I therefore would expect there to be no truly efficient solution.
I would therefore suggest that you try reducing it to a known problem and solve that instead. In particular you can convert this to an integer linear programming problem. Solvers for those are often surprisingly good. And per Python Mixed Integer Linear Programming there are a number that are available to you relatively easily.
The specific problem that you are trying to solve is this: find a vector of m integers x_i such that 0 ≤ x_i, sum(x_i) ≤ n, and, for each letter l, the total number of times l is used across the chosen words does not exceed your count of l. The objective function to maximize is the sum of x_i * (1 + len(w_i)), where w_i is the i-th word.
This will always produce an answer. The answer that you get represents a solution to your problem if and only if the optimal objective value equals n plus the total count of letters available. If it is less than that, then no solution exists.
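As a hedged sketch of that formulation, here is roughly how it could look with the PuLP library (just one of several MILP front-ends available in Python); the modelling choices and names are mine:

from collections import Counter
import pulp

def find_words(counts, words, n):
    word_counts = [Counter(w) for w in words]
    prob = pulp.LpProblem("word_selection", pulp.LpMaximize)
    # x[i] = how many times word i is used
    x = [pulp.LpVariable(f"x_{i}", lowBound=0, cat="Integer") for i in range(len(words))]
    # objective: sum of x_i * (1 + len(w_i))
    prob += pulp.lpSum(xi * (1 + len(w)) for xi, w in zip(x, words))
    # at most n words in total
    prob += pulp.lpSum(x) <= n
    # each available letter may not be used more often than its count
    for letter, available in counts.items():
        prob += pulp.lpSum(xi * wc[letter] for xi, wc in zip(x, word_counts)) <= available
    # letters we do not have at all may not be used
    for letter in set().union(*word_counts) - set(counts):
        prob += pulp.lpSum(xi * wc[letter] for xi, wc in zip(x, word_counts)) == 0
    prob.solve()
    if round(pulp.value(prob.objective)) != n + sum(counts.values()):
        return None   # objective fell short, so no exact solution exists
    chosen = []
    for xi, w in zip(x, words):
        chosen.extend([w] * int(xi.value()))
    return chosen

print(find_words({'a': 3, 'b': 2, 'd': 2, 'e': 2, 'k': 2, 'r': 2},
                 ['break', 'add', 'build'], 3))   # e.g. ['break', 'break', 'add']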
I came across a programming challenge a few days back which is over now. The question said: given a string S of lowercase English letters, find the minimum count of characters that need to be changed in S so that it contains the given word W as a substring.
Also, on the next line, print the positions of the characters you need to change, in ascending order. Since there can be multiple outputs, find the one in which the position of the first character to change is minimal.
I tried using LCS but could only get the count of characters that need to be changed. How do I find the positions of the characters?
I might be missing something, please help. It might take some other algorithm to solve it.
The obvious solution is to shift the reference word W over the input string S and count the differences. However, this will become inefficient for very long strings. So, how can we improve this?
The idea is to target the search at places in S where it is very likely that we have a good match with W. Finding these spots is the critical part. We cannot find them both efficiently and accurately without performing the naive algorithm. So, we use a heuristic H that gives us a lower bound on the number of changes that we have to perform. We calculate this lower bound for every position of S. Then, we start at the position of lowest H and check the actual difference in S and W at that position. If the next-higher H is higher than the current difference, we are already done. If it is not, we check the next position. The outline of the algorithm looks as follows:
input:
    W of length LW
    S of length LS

H := list of length LS - LW + 1 with tuples [index, mincost]
for i from 0 to LS - LW
    H[i] = [i, heuristic lower bound for the length-LW window of S starting at i]
sort H by mincost

actualcost   = infinity
nextEntryInH = 0
while nextEntryInH < length(H) && actualcost >= H[nextEntryInH].mincost
    calculate the actual cost for the length-LW window of S starting at H[nextEntryInH].index
    update actualcost if we found a lesser cost, or an equal cost with an earlier difference
    nextEntryInH++
Now, back to the heuristic. We need to find something that allows us to approximate the difference for a given position (and we need to guarantee that it is a lower bound), while at the same time being easy to calculate. Since our alphabet is limited, we can use a histogram of the letters to do this. So, let's assume the example from the comments: W = worldcup and the part of S that we are interested in is worstcap. The histograms for these two parts are (omitting letters that do not occur):
           a  c  d  l  o  p  r  s  t  u  w
worldcup   0  1  1  1  1  1  1  0  0  1  1
worstcap   1  1  0  0  1  1  1  1  1  0  1
           --------------------------------
abs diff   1  0  1  1  0  0  0  1  1  1  0   (sum = 6)
We can see that half of the sum of absolute differences is a proper lower bound for the number of letters that we need to change (because every letter change decreases the sum by 2). In this case, the bound is even tight because it equals the actual cost of 3 changes. However, our heuristic does not consider the order of letters. But in the end, this is what makes it efficient to calculate.
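As a quick sanity check of that bound, the heuristic for the example above is only a couple of lines of Python (my own snippet, not part of the original answer):

from collections import Counter

def histogram_lower_bound(w, window):
    cw, cs = Counter(w), Counter(window)
    diff = sum(abs(cw[c] - cs[c]) for c in set(cw) | set(cs))
    return diff // 2   # half the sum of absolute differences

print(histogram_lower_bound("worldcup", "worstcap"))   # 3, equal to the true cost here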
Ok, our heuristic is the sum of absolute differences for the histograms. Now, how can we calculate this efficiently? Luckily, we can calculate both the histograms and the sum incrementally. We start at position 0 and calculate the full histograms and the sum of absolute differences (note that the histogram of W will never change throughout the rest of the runtime). With this information, we can already set H(0).
To calculate the rest of H, we slide our window across S. When we slide our window by one letter to the right, we only need to update our histogram and sum slightly: There is exactly one new letter in our window (add to the histogram) and one letter leaves the window (remove from the histogram). For the two (or one) corresponding letters, calculate the resulting change for the sum of absolute differences and update it. Then, set H accordingly.
With this approach, we can calculate our heuristic in linear time for the entire string S. The heuristic gives us an indication where we should look for matches. Once we have it, we proceed with the remaining algorithm as outlined at the beginning of this answer (start the accurate cost calculation at places with low heuristic and continue until the actual cost exceeds the next-higher heuristic value).
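Putting the pieces together, here is a hedged Python sketch of the whole approach (incremental histograms, sorted lower bounds, then exact verification); the names are mine, and the original task's tie-breaking rule is only approximated by preferring the leftmost window among equal costs:

from collections import Counter

def min_changes_to_contain(S, W):
    """Return (cost, positions) for turning some window of S into W; assumes len(S) >= len(W)."""
    LW, LS = len(W), len(S)
    hist_w = Counter(W)
    hist_s = Counter(S[:LW])
    diff = sum(abs(hist_w[c] - hist_s[c]) for c in set(hist_w) | set(hist_s))

    # H[i] = (lower bound on changes for the window starting at i, i), maintained incrementally
    H = [(diff // 2, 0)]
    for i in range(1, LS - LW + 1):
        for c, delta in ((S[i - 1], -1), (S[i + LW - 1], +1)):   # letter leaving / entering
            diff -= abs(hist_w[c] - hist_s[c])
            hist_s[c] += delta
            diff += abs(hist_w[c] - hist_s[c])
        H.append((diff // 2, i))
    H.sort()

    # verify windows in order of increasing lower bound
    best_cost, best_pos = float('inf'), None
    for bound, i in H:
        if bound > best_cost:       # no remaining window can beat or match the best cost
            break
        cost = sum(a != b for a, b in zip(S[i:i + LW], W))
        if cost < best_cost or (cost == best_cost and i < best_pos):
            best_cost, best_pos = cost, i

    positions = [best_pos + k for k in range(LW) if S[best_pos + k] != W[k]]
    return best_cost, positions

print(min_changes_to_contain("helloworstcapfans", "worldcup"))   # (3, [8, 9, 11])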
LCS (= longest common subsequence) will not work because the common letters in W and S need to have matching positions, since you are only allowed to update characters, not remove or insert them.
If you were allowed to remove/insert, the Levenshtein distance could be used:
https://en.wikipedia.org/wiki/Levenshtein_distance
In your case an obvious brute-force solution is to match W against S at every position, with complexity O(N*M) (N = size of S, M = size of W).
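For reference, that brute force is only a few lines (my own sketch; ties between equal-cost windows are broken by taking the leftmost one):

def min_changes_bruteforce(S, W):
    best_cost, best_start = len(W) + 1, 0
    for i in range(len(S) - len(W) + 1):                            # O(N) window positions
        cost = sum(a != b for a, b in zip(S[i:i + len(W)], W))      # O(M) per window
        if cost < best_cost:
            best_cost, best_start = cost, i
    positions = [best_start + k for k in range(len(W)) if S[best_start + k] != W[k]]
    return best_cost, positions

print(min_changes_bruteforce("helloworstcapfans", "worldcup"))   # (3, [8, 9, 11])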
I have to implement an algorithm that solves the multi-selection problem.
The multiselection problem is:
Given a set S of n elements drawn from a linearly ordered set, and a set K = {k1, k2,...,kr} of positive integers between 1 and n, the multiselection problem is to select the ki-th smallest element for all values of i, 1 <= i <= r
I need to solve the average case in Θ(n log r)
I've found a paper that implements the solution I need, but it assumes that there are no repeated numbers on the set S. The problem is that I can't assume that and I don't know how to adapt the algorithm of that paper to support repeated numbers.
The paper is here: http://www.ccse.kfupm.edu.sa/~suwaiyel/publications/multiselection_parCom.pdf
and the algorithm is on the second page. Any tips are welcome!
For posterity: the algorithm to which Ivan refers is to sort K, then solve the problem recursively as follows. Use QuickSelect to find the ki-th smallest element x where i is ceil(r/2), then recurse on the smaller halves of K and S, and the larger halves of K and S, splitting K about i and S about x.
Finding algorithms that work in the presence of degeneracy (here, equal elements) is often not a high priority for authors of theoretical works, because it makes the presentation of the common case more difficult and doesn't often play a role in determining the computational complexity of the problem. This is essentially a one-dimensional problem, and the black box solution is easy; replace the i-th element of the input, y_i, by the pair (y_i, i) and break ties in the comparisons using the second component.
In practice, we can do better. Instead of recursing on {y : y in S, y < x} and {y : y in S, y > x}, use a three-way partitioning algorithm about x (see, e.g., every sufficiently complete treatment of QuickSort), then divide the array S by index instead of value.
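A hedged Python sketch of this recursion (my own code and naming): the requested ranks are split about their median rank, the corresponding order statistic is found with a quickselect that uses three-way partitioning, and the ranks falling into the block of equal elements are answered directly:

import random

def multiselect(S, K):
    """Return {k: k-th smallest element of S} for every 1-based rank k in K."""
    answers = {}

    def quickselect(arr, k):
        # k-th smallest (1-based) of arr; three-way partition so duplicates are fine
        pivot = random.choice(arr)
        less = [x for x in arr if x < pivot]
        equal = [x for x in arr if x == pivot]
        if k <= len(less):
            return quickselect(less, k)
        if k <= len(less) + len(equal):
            return pivot
        return quickselect([x for x in arr if x > pivot], k - len(less) - len(equal))

    def solve(arr, ranks, offset):
        # arr holds the elements with original ranks offset+1 .. offset+len(arr)
        if not ranks or not arr:
            return
        mid = ranks[len(ranks) // 2]
        x = quickselect(arr, mid - offset)
        answers[mid] = x
        less = [y for y in arr if y < x]
        more = [y for y in arr if y > x]
        n_le = len(arr) - len(more)                       # number of elements <= x
        for r in ranks:
            if offset + len(less) < r <= offset + n_le:   # these ranks all hit copies of x
                answers[r] = x
        solve(less, [r for r in ranks if r <= offset + len(less)], offset)
        solve(more, [r for r in ranks if r > offset + n_le], offset + n_le)

    solve(list(S), sorted(K), 0)
    return answers

print(multiselect([5, 1, 4, 4, 2, 4, 3], [2, 4, 6]))   # ranks 2, 4, 6 -> 2, 4, 4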
I am thinking about an algorithm for the following problem (found on careercup):
Given a polygon with N vertices and N edges. There is an integer (possibly negative) on every vertex and an operation from the set {*, +} on every edge. Each time, we remove an edge E from the polygon and merge the two vertices linked by that edge (V1, V2) into a new vertex with value V1 op(E) V2. The last case will be two vertices with two edges; the result is the bigger of the two remaining values.
Return the max result value that can be obtained from a given polygon.
I think we can use just a greedy approach, i.e. for a polygon with k edges, find the pair (p, q) of adjacent vertices which produces the maximum value when collapsed: (p, q) = argmax { p op q : p, q adjacent vertices }
Then just call a recursion on polygons:
1. Let CollapseMaxPair( P(k) ) be a function that takes a polygon with k edges and returns the 'collapsed' polygon with k-1 edges
2. Then our recursion:

   P = P(N)
   repeat until two edges are left:
       P = CollapseMaxPair( P )
   maxvalue = max( the two remaining values )
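Here is a small Python sketch of that greedy idea, just to make it concrete (my own encoding of the polygon as two parallel circular lists of vertex values and edge operations):

import operator

OPS = {'+': operator.add, '*': operator.mul}

def greedy_collapse(values, ops):
    # edge ops[i] joins values[i] and values[(i + 1) % n]
    values, ops = list(values), list(ops)
    while len(values) > 2:
        n = len(values)
        # greedily pick the edge whose merge yields the largest value right now
        i = max(range(n), key=lambda e: OPS[ops[e]](values[e], values[(e + 1) % n]))
        j = (i + 1) % n
        merged = OPS[ops[i]](values[i], values[j])
        if j == 0:                 # wrap-around: the merged vertex takes slot 0
            values[0] = merged
            del values[i]
        else:
            values[i] = merged
            del values[j]
        del ops[i]
    return max(values)             # two vertices with two edges left: the bigger one wins

# the square from the answers below, assuming '+' on every edge (my assumption)
print(greedy_collapse([-1, -1, -1, 1000000], ['+', '+', '+', '+']))   # 999998, not the optimal 1000000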
What do you think?
I have answered this question here: Google Interview : Find the maximum sum of a polygon and it was pointed out to me that that question is a duplicate of this one. Since no one has answered this question fully yet, I have decided to add this answer here as well.
As you have identified (tagged) correctly, this is indeed very similar to the matrix multiplication problem (in what order do I multiply matrices in order to do it quickly).
This can be solved in polynomial time using dynamic programming.
I'm going to instead solve a similar, more classic (and essentially equivalent) problem: given a formula with numbers, additions and multiplications, which way of parenthesizing it gives the maximal value? For example,
6+1*2 becomes (6+1)*2, which is more than 6+(1*2).
Let us denote our input a_1, ..., a_n (real numbers) and o_1, ..., o_(n-1) (each either * or +). Our approach works as follows: we consider the subproblem F(i,j), which represents the maximal value (after parenthesizing) of the sub-formula a_i ... a_j. We create a table of such subproblems and observe that F(1,n) is exactly the result we are looking for.
Define F(i,j):
- If i > j, return 0                  // no sub-formula of negative length
- If i = j, return a_i                // the maximal formula for one number is the number
- If i < j, return the maximal value, over all m with i <= m < j, of:
      F(i,m) o_m F(m+1,j)             // check all places where the outermost parenthesis split can go
This goes through all possible options. The proof of correctness is done by induction on the size n = j - i and is pretty trivial.
Let's go through the runtime analysis:
If we do not save the values of smaller subproblems, this runs pretty slowly; however, by memoizing them we can make the algorithm run in O(n^3).
We create an n*n table T in which the cell at index (i,j) contains F(i,j). Filling F(i,i), and F(i,j) for j smaller than i, is done in O(1) per cell, since we can calculate these values directly. Then we fill the table diagonal by diagonal: every cell on a diagonal can be computed quickly because all the smaller subproblems it refers to are already filled in. Filling one cell takes O(n) (we try every split point m), and there are O(n^2) cells, so the whole table is filled in O(n^3). After filling the table we obviously know F(1,n), which is the solution to your problem.
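Here is a hedged Python sketch of that table filling (my own code, not the answer's): because the numbers may be negative and '*' can turn two very negative sub-results into a large positive one, the sketch tracks both the minimum and the maximum achievable value of every sub-formula, a small extension of the max-only F(i,j) recurrence described above:

import operator

OPS = {'+': operator.add, '*': operator.mul}

def best_formula_value(nums, ops):
    """Max value of nums[0] ops[0] nums[1] ... ops[n-2] nums[n-1] over all parenthesizations."""
    n = len(nums)
    lo = [[0] * n for _ in range(n)]   # minimum achievable value of nums[i..j]
    hi = [[0] * n for _ in range(n)]   # maximum achievable value of nums[i..j]
    for i in range(n):
        lo[i][i] = hi[i][i] = nums[i]
    for length in range(2, n + 1):                 # fill diagonals of increasing length
        for i in range(n - length + 1):
            j = i + length - 1
            best_lo, best_hi = float('inf'), float('-inf')
            for m in range(i, j):                  # position of the outermost split
                f = OPS[ops[m]]
                for left in (lo[i][m], hi[i][m]):
                    for right in (lo[m + 1][j], hi[m + 1][j]):
                        v = f(left, right)
                        best_lo, best_hi = min(best_lo, v), max(best_hi, v)
            lo[i][j], hi[i][j] = best_lo, best_hi
    return hi[0][n - 1]

print(best_formula_value([6, 1, 2], ['+', '*']))             # 14 = (6+1)*2
print(best_formula_value([-2, 1, -1, 3], ['*', '+', '*']))   # 4 = -2*(1+(-1*3))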
Now back to your problem
If you translate the polygon into n different formulas (one starting at each vertex), run the formula-value algorithm on each of them, and take the maximum, you get exactly the value you want.
Here's a case where your greedy algorithm fails:
Imagine your polygon is a square with vertices A, B, C, D (top left, top right, bottom right, bottom left). This gives us edges (A,B), (A,D), (B,C), and (C, D).
Let the weights be A=-1, B=-1, C=-1, and D=1,000,000.
A (-1) ---------- B (-1)
|                 |
|                 |
|                 |
|                 |
D (1000000) ----- C (-1)
Clearly, the best strategy is to collapse (A,B), and then (B,C), so that you may end up with D by itself. Your algorithm, however, will start with either (A,D) or (D,C), which will not be optimal.
A greedy algorithm that combines the min sums has a similar weakness, so we need to think of something else.
I'm starting to see how we want to try to get all positive numbers together on one side and all negatives on the other.
If we think about the initial polygon entirely as a state, then we can imagine all the possible child states to be the subsequent graphs where an edge is collapsed. This creates a tree-like structure. A BFS or DFS would eventually give us an optimal solution, but at the cost of traversing the entire tree in the worst case, which is probably not as efficient as you'd like.
What you are looking for is a greedy best-first approach to search down this tree that is provably optimal. Perhaps you could create an A*-like search through it, although I'm not sure what your admissible heuristic would be.
I don't think the greedy algorithm works. Let the vertices be A = 0, B = 1, C = 2, and the edges be AB = a - 5b, BC = b + c, CA = -20. The greedy algorithm selects BC to evaluate first, value 3. Then AB, value -15. However, there is a better sequence to use. Evaluate AB first, value -5. Then evaluate BC, value -3. I don't know of a better algorithm though.
I am going to implement a Farey fraction approximation for converting limited-precision user input into possibly-repeating rationals.
http://mathworld.wolfram.com/FareySequence.html
I can easily locate the closest Farey fraction in a sequence, and I can find Fn by recursively searching for mediant fractions by building the Stern-Brocot tree.
http://mathworld.wolfram.com/Stern-BrocotTree.html
However, the method I've come up with for finding the fractions in the sequence Fn seems very inefficient:
(roughly, in Python)

from fractions import Fraction

def farey(n):
    # start from the endpoints 0/1 and 1/1 and repeatedly insert mediants
    seq = [Fraction(0, 1), Fraction(1, 1)]
    added = True
    while added:
        added = False
        result = []
        for left, right in zip(seq, seq[1:]):
            result.append(left)
            den = left.denominator + right.denominator
            if den <= n:   # the mediant belongs to F_n when its denominator is at most n
                # Fraction reduces itself; mediants of Farey neighbours are already in lowest terms
                result.append(Fraction(left.numerator + right.numerator, den))
                added = True
        result.append(seq[-1])
        seq = result
    return seq
I will almost always be defining the sequence Fn where n = 10^m, with m > 1.
So perhaps it might be best to build the sequence once and cache it... but it still seems like there should be a better way to derive it.
EDIT:
This paper has a promising algorithm:
http://www.math.harvard.edu/~corina/publications/farey.pdf
I will try to implement.
The trouble is that their "most efficient" algorithm requires knowing the prior two elements. I know element one of any sequence is 1/n but finding the second element seems a challenge...
EDIT2:
I'm not sure how I overlooked this:
Given F0 = 1/n,
if n > 2 then
F1 = 1/(n-1).
Therefore, for all n > 2, the first two fractions will always be
1/n, 1/(n-1), and I can implement the solution from Patrascu.
So now, the answer to this question should prove whether or not this solution is optimal, using benchmarks.
Why do you need the Farey series at all? Using continued fractions would give you the same approximation online without precalculating the series.
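For example, a small sketch of that idea in Python (my own code): expand the input's continued fraction and keep the last convergent whose denominator fits the bound. The strictly best approximation can occasionally be a semiconvergent between the last two convergents, but nothing like the full Farey sequence ever has to be precalculated:

from fractions import Fraction

def approximate(x, max_den):
    """Last continued-fraction convergent of x with denominator <= max_den."""
    x = Fraction(x).limit_denominator(10**12)   # treat the limited-precision input exactly
    p0, q0, p1, q1 = 0, 1, 1, 0                 # previous and current convergents
    while True:
        a = x.numerator // x.denominator        # next continued-fraction term
        p2, q2 = a * p1 + p0, a * q1 + q0
        if q2 > max_den:
            return Fraction(p1, q1)
        p0, q0, p1, q1 = p1, q1, p2, q2
        frac = x - a
        if frac == 0:
            return Fraction(p1, q1)             # x was hit exactly
        x = 1 / frac

print(approximate("3.14159", 100))   # 22/7 (the semiconvergent 311/99 is slightly closer)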
Neighboring fractions in Farey sequences are described in Sec. 3 of Neighboring Fractions in Farey Subsequences, http://arxiv.org/abs/0801.1981 .