Algorithm explanation for common strings - algorithm

The Problem definition:
Given two strings a and b of equal length, what is the longest string (S) that can be constructed such that S is a child of both a and b?
String x is said to be a child of string y if x can be formed by deleting 0 or more characters from y
Input format
Two strings a and b with a newline separating them
Constraints
All characters are upper-cased and lie between ASCII values 65-90.
The maximum length of the strings is 5000.
Output format
Length of the string S
Sample Input #0
HARRY
SALLY
Sample Output #0
2
The longest string that can be obtained by deleting zero or more characters from both HARRY and SALLY is AY, whose length is 2.
The solution:
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Solution {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        char[] a = in.readLine().toCharArray();
        char[] b = in.readLine().toCharArray();
        // dp[i][j] = length of the longest common child of a[0..i-1] and b[0..j-1].
        // Java zero-initializes the array, so row 0 and column 0 (empty prefix) are already 0.
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                if (a[i] == b[j])
                    dp[i + 1][j + 1] = dp[i][j] + 1;
                else
                    dp[i + 1][j + 1] = Math.max(dp[i][j + 1], dp[i + 1][j]);
        System.out.println(dp[a.length][b.length]);
    }
}
Has anyone encountered this problem and solved it with a solution like this? I solved it in a different way. I find this solution elegant, but I can't make sense of it so far. Could anyone help explain it a little?

This algorithm uses dynamic programming. The key to understanding dynamic programming is the recursive step, which in this case is the if-else statement. The matrix has size (a.length+1) * (b.length+1), and the element dp[i+1][j+1] holds the length of the longest child common to the prefixes a[0..i] and b[0..j] (both ends inclusive), i.e. the answer if we only compare those two prefixes.
To understand the recursive step, let's look at the example of "HARRY" and "SALLY". Say I am at the step of calculating dp[5][5]; in this case I will be looking at the last character 'Y':
A. If a[4] and b[4] are equal, in this case "Y" = "Y", then I know the optimal solution is: 1) find the child of "HARR" and "SALL" that has the most characters (let's say n characters), and then 2) add 1 to n.
B. If a[4] and b[4] are not equal, then the optimal solution is either the child of "HARR" and "SALLY" or the child of "HARRY" and "SALL", which translates to Max(dp[i+1][j], dp[i][j+1]) in the code.
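The two cases above can be sketched in a few lines of Python. This is a minimal illustration rather than the original submission; the function name lcs_length is an assumption made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common child (subsequence) of a and b."""
    # dp[i][j] = answer for the prefixes a[:i] and b[:j]; row 0 and column 0 stay 0.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                # Case A: the matching character extends the best child of both shorter prefixes.
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                # Case B: drop the last character of one string or the other.
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


print(lcs_length("HARRY", "SALLY"))  # prints 2
```

The table has (len(a)+1) * (len(b)+1) cells, so both time and memory are quadratic in the string length.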

Related

What is wrong with the recursive algorithm developed for the below problem?

I have tried to solve an algorithmic problem. I have come up with a recursive algorithm to solve the same. This is the link to the problem:
https://codeforces.com/problemset/problem/1178/B
This problem is not from any contest that is currently going on.
I have coded my algorithm and had run it on a few test cases, it turns out that it is counting more than the correct amount. I went through my thought process again and again but could not find any mistake. I have written my algorithm (not the code, but just the recursive function I have thought of) below. Can I please know where had I gone wrong -- what was the mistake in my thought process?
Let my recursive function be called as count, it takes any of the below three forms as the algorithm proceeds.
count(i,'o',0) = count(i+1,'o',0) [+ count(i+1,'w',1) --> iff the (i)th element of the string is 'o']
count(i,'w',0) = count(i+1,'w',0) [+ count(i+2,'o',0) --> iff (i)th and (i+1)th elements are both equal to 'v']
count(i,'w',1) = count(i+1,'w',1) [+ 1 + count(i+2,'w',0) --> iff (i)th and (i+1)th elements are both equal to 'v']
Note: The recursive function calls inside the square brackets [.] are made iff the condition after the arrow is satisfied.
Explanation: The main idea behind the recursive function developed is to count the number of occurrences of the given sequence. The count function takes 3 arguments:
argument 1: The index of the string on which we are currently located.
argument 2: The pattern we are looking for (if this argument is 'o' it means that we are looking for the letter 'o' -- i.e. at which index it is there. If it is 'w' it means that we are looking for the pattern 'vv' -- i.e. we are looking for 2 consecutive indices where this pattern occurs.)
argument 3: This can be either 1 or 0. If it is 1, it means that we are looking for the 'vv' pattern having already found the 'o', i.e. we are looking for the trailing 'vv' of vvovv. If it is 0, it means that we are searching for the 'vv' pattern which will be the beginning of the pattern vvovv (the leading 'vv').
I will initiate the algorithm with count(0,'w',0) -- it means, we are at the 0th index of the string, we are looking for the pattern 'vv', and this 'vv' will be the prefix of the 'vvovv' pattern we wish to find.
So, the output of count(0,'w',0) should be my answer. Now comes the trouble, for the following input: "vvovooovovvovoovoovvvvovo" (say input1), my program (which is based on the above algorithm) gives the expected answer(= 50). But, when I just append "vv" to the above input to get a new input: "vvovooovovvovoovoovvvvovovv" (say input2) and run my algorithm again, I get 135 as the answer, while the correct answer is 75 (this is the answer the solution code returns). Why is this happening? Where had I made an error?
Also, one more doubt is if the output for the input1 is 50, then the output for the input2 should be at least twice right -- because all of the subsequences which were present in the input1, will be present in the input2 too and all of those subsequences can also form a new subsequence with the appended 'vv' -- this means we have at least 100 favourable subsequences right?
P.S. This is the link to the solution code https://codeforces.com/blog/entry/68534
This question doesn't need recursion or dynamic programming.
The basic idea is to count how many ws we have before and after each o.
If you have X consecutive vs (X >= 1), it means you have X - 1 ws.
Let's use vvvovvv as an example. We know that before and after the o we have 3 vs, which means 2 ws. To evaluate the answer, just multiply 2x2 = 4.
For each o we find, we just need to multiply the ws before and after it, sum it all and this is our answer.
We can find how many ws there are before and after each o in linear time.
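#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
// a run of v_count consecutive 'v's contains v_count - 1 overlapping "vv" pairs
long long convert_v_to_w(long long v_count){
return max(0LL, v_count - 1);
}
int main(){
string s = "vvovooovovvovoovoovvvvovovvvov";
int n = s.size();
// vectors instead of variable-length arrays; 64-bit counts because products grow quickly on long inputs
vector<long long> wBefore(n), wAfter(n);
long long v_count = 0, wb = 0, wa = 0;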
//counting ws before each o
int i = 0;
while(i < n){
v_count = 0;
while(i < n && s[i] == 'v'){
v_count++;
i++;
}
wb += convert_v_to_w(v_count);
if(i < n && s[i] == 'o'){
wBefore[i] = wb;
}
i++;
}
//counting ws after each o
i = n - 1;
while(i >= 0){
v_count = 0;
while(i >= 0 && s[i] == 'v'){
v_count++;
i--;
}
wa += convert_v_to_w(v_count);
if(i >= 0 && s[i] == 'o'){
wAfter[i] = wa;
}
i--;
}
//evaluating answer by multiplying ws before and after each o
long long ans = 0; // 64-bit: the sum can exceed the int range for long inputs
for(int i = 0; i < n; i++){
if(s[i] == 'o') ans += wBefore[i] * wAfter[i];
}
cout<<ans<<endl;
}
output: 100
complexity: O(n) time and space
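As a cross-check of the counting idea, here is a short Python sketch that uses a single prefix array instead of two passes. It is only an illustration of the same technique; the name wow_factor is an assumption for this example. It reproduces the figures quoted in the question (50 and 75) as well as the 100 above.

```python
def wow_factor(s: str) -> int:
    n = len(s)
    # w_before[i] = number of "vv" pairs lying entirely inside s[:i]
    w_before = [0] * (n + 1)
    for i in range(1, n + 1):
        pair = 1 if i >= 2 and s[i - 2] == "v" and s[i - 1] == "v" else 0
        w_before[i] = w_before[i - 1] + pair
    total_w = w_before[n]
    # An 'o' at index i cannot be part of a pair, so every pair is either before or after it.
    return sum(
        w_before[i] * (total_w - w_before[i])
        for i, c in enumerate(s)
        if c == "o"
    )


print(wow_factor("vvovooovovvovoovoovvvvovo"))    # prints 50
print(wow_factor("vvovooovovvovoovoovvvvovovv"))  # prints 75
```

Appending "vv" adds exactly one w after every o, so each o contributes its "ws before" count once more; that is why the answer grows from 50 to 75 rather than doubling.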

How to find the following type of set with computation time less than O(n)?

Here 5 different sets are shown. S1 contains 1. Next set S2 is calculated from S1 considering the following logic:
Suppose Sn contains {a1,a2,a3,a4.....,an} and middle element of Sn is b.
Then the set Sn+1 contains the elements {b,b+a1,b+a2,......,b+an}, a total of (n+1) elements. If a set contains an even number of elements, then the middle element is the (n/2 + 1)th one.
Now, if n is given as input then we have to display all the elements of set Sn.
Clearly it is possible to solve the problem in O(n) time.
We can compute each middle element as (2^(n-1) - middle element of the previous set + 1), where S1 = {1} is the base case. In this way we get all the middle elements up to the (n-1)th set in O(n) time. The middle element of the (n-1)th set is the first element of the nth set, and (middle element of the (n-1)th set + middle element of the (n-2)th set) is the second element of the nth set. Continuing in this way we get all the elements of the nth set.
So it needs O(n) time.
Here is the complete Java code I have written:
public class SpecialSubset {
private static Scanner inp;
public static void main(String[] args) {
// TODO Auto-generated method stub
int N,fst,mid,con=0;
inp = new Scanner(System.in);
N=inp.nextInt();
int[] setarr=new int[N];
int[] midarr=new int[N];
fst=1;
mid=1;
midarr[0]=1;
for(int i=1;i<N;i++)
{
midarr[i]=(int) (Math.pow(2, i)-midarr[i-1]+1);
}
setarr[0]=midarr[N-2];
System.out.print(setarr[0]);
System.out.print(" ");
for(int i=1,j=N-3;i<N-1;i++,j--)
{
setarr[i]=setarr[i-1]+midarr[j];
System.out.print(setarr[i]);
System.out.print(" ");
}
setarr[N-1]=setarr[N-2]+1;
System.out.print(setarr[N-1]);
}
}
Here is the link of the Question:
https://www.hackerrank.com/contests/projecteuler/challenges/euler103
Is it possible to solve the problem in less than O(n) time?
@Paul Boddington has given an answer that relies on the sequence of first numbers of these sets being the Narayana-Zidek-Capell numbers and has checked it for some small-ish values. However, no proof of the conjecture was given. This answer is in addition to the above, to make it complete. I'm no HTML/CSS/Markdown guru, so you'll have to excuse the bad positioning of subscripts (if anyone can improve those, be my guest).
Notation:
Let $a_i^j$ be the i-th number in the j-th set.
I'll also define $b_j$ as the first number of the (j-2)-th set. This is the sequence the proof is about. The -2 is to account for the first and second 1 in the Narayana-Zidek-Capell sequence.
Generating rules:
The problem statement didn't clarify what the "center number" is for an even-length set (a list really, but whatever), but it seems they meant the "center right" in that case. I'll cite the rule numbers in parentheses when I use them below.
(1) $a_1^1 = 1$
(2) $a_1^n = a_{\lceil (n+1)/2 \rceil}^{n-1}$
(3) $a_i^n = a_1^n + a_{i-1}^{n-1}$
(4) $b_n = a_1^{n-2}$
Proof:
First step is to make a slightly more involved formula for $a_i^n$ by unwinding the recursion a bit more and substituting b:
(5) $a_i^n = \sum_{j=0}^{i-1} a_1^{n-j} = \sum_{j=0}^{i-1} b_{n-j+2}$
Next, we consider two cases for $b_n$: one where n is odd, one where n is even.
Even case:
$b_{2n+2} = a_1^{2n}$
$= a_{\lceil (2n+1)/2 \rceil}^{2n-1} = a_{n+1}^{2n-1}$ (by 2)
$= a_1^{2n-1} + a_n^{2n-2}$ (by 3)
$= b_{2n+1} + a_1^{2n-1}$ (by 2, 4)
$= 2 b_{2n+1}$ (by 5)
Odd case:
$b_{2n+1} = a_1^{2n-1}$
$= a_{\lceil 2n/2 \rceil}^{2n-2} = a_n^{2n-2}$ (by 2)
$= a_1^{2n-2} + a_{n-1}^{2n-3}$ (by 3)
$= 2 b_{2n} + (a_{n-1}^{2n-3} - a_1^{2n-2})$ (by 4)
$= 2 b_{2n} + (a_{n-1}^{2n-3} - a_n^{2n-3})$ (by 2)
$= 2 b_{2n} - b_n$ (by 5)
These rules are the exact sequence definition, and provide a way to generate the nth set in linear time (as opposed to quadratic when generating each set in turn)
The smallest numbers in the sets appear to be the Narayana-Zidek-Capell numbers
1, 1, 2, 3, 6, 11, 22, ...
The other numbers are obtained from the first number by repeatedly adding these numbers in reverse.
For example,
S6 = {11, 17, 20, 22, 23, 24}
+6 +3 +2 +1 +1
Using a recurrence for the Narayana-Zidek-Capell sequence found in that link, I have managed to produce a solution for this problem that runs in O(n) time. Here is a solution in Java. It only works for n <= 32 due to int overflow, but it could be written using BigInteger to work for higher values.
static Set<Integer> set(int n) {
int[] a = new int[n + 2];
for (int i = 1; i < n + 2; i++) {
if (i <= 2)
a[i] = 1;
else if (i % 2 == 0)
a[i] = 2 * a[i - 1];
else
a[i] = 2 * a[i - 1] - a[i / 2];
}
Set<Integer> set = new HashSet<>();
int sum = 0;
for (int i = n + 1; i >= 2; i--) {
sum += a[i];
set.add(sum);
}
return set;
}
I'm not able to justify right now why this is the same as the set in the question, but I'm working on it. However I have checked for all n <= 32 that this algorithm gives the same set as the "obvious" algorithm, so I'm reasonably sure it's correct.
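The check described above (recurrence versus the "obvious" construction) can be reproduced with a small Python sketch. The function names naive_set and recurrence_set are assumptions made for this illustration; Python integers do not overflow, so the n <= 32 limit of the int-based Java version does not apply here.

```python
def naive_set(n):
    """Build S_n directly from the definition in the question (quadratic overall)."""
    s = [1]
    for _ in range(n - 1):
        b = s[len(s) // 2]  # middle element; right of centre for even lengths
        s = [b] + [b + x for x in s]
    return s


def recurrence_set(n):
    """Build S_n in O(n) from the Narayana-Zidek-Capell recurrence."""
    a = [0] * (n + 2)
    for i in range(1, n + 2):
        if i <= 2:
            a[i] = 1
        elif i % 2 == 0:
            a[i] = 2 * a[i - 1]
        else:
            a[i] = 2 * a[i - 1] - a[i // 2]
    result, total = [], 0
    for i in range(n + 1, 1, -1):  # add the sequence values in reverse order
        total += a[i]
        result.append(total)
    return result


print(recurrence_set(6))  # prints [11, 17, 20, 22, 23, 24]
```

Both functions return the elements in increasing order, so the two lists can be compared directly for a range of n.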

Longest common subsequence (LCS) brute force algorithm

I want to create a brute force algorithm to find the longest common subsequence between 2 strings, but I'm struggling to enumerate all possibilities in the form of an algorithm.
I don't want a dynamic programming answer as oddly enough I manage to figure this one out (You would think the brute force method would be easier). Please use pseudo code, as I prefer to understand it and write it up myself.
It's pretty much the same as DP minus the memoization part.
LCS(s1, s2, i, j):
if(i == -1 || j == -1)
return 0
if(s1[i] == s2[j])
return 1 + LCS(s1, s2, i-1, j-1)
return max(LCS(s1, s2, i-1, j), LCS(s1, s2, i, j-1))
The idea is if we have two strings s1 and s2 where s1 ends at i and s2 ends at j, then the LCS is:
If either string is empty, then the length of the longest common subsequence is 0.
If the last character (index i) of string 1 is the same as the last one in string 2 (index j), then the answer is 1 plus the LCS of s1 and s2 ending at i-1 and j-1, respectively. Because it's obvious that those two indices contribute to the LCS, so it's optimal to count them.
If the last characters don't match, then we try to remove one of the characters. So we try finding LCS between s1 (ending at i-1) and s2 (ending at j) and the LCS between s1 (ending at i) and s2 (ending at j-1), then take the max of both.
The time complexity is obviously exponential.
I like @turingcomplete's answer but it's not really brute-force since it doesn't actually enumerate all candidate solutions. For example, if the strings are "ABCDE" and "XBCDY", the recursive approach won't test for "ABC" versus "XBC" because the test for "A" versus "X" will have already failed. It's kind of a matter of opinion whether you want to count that as a unique candidate though. In fact, you could argue that "ABC" versus "ABCDY" is a valid candidate as well, and just immediately fails due to length difference. You could add separate LA and LB to the code below to fully enumerate those candidates though.
for L = min(A.length, B.length) down to 1
{
    for iA = 0 to A.length - L
    {
        for iB = 0 to B.length - L
        {
            match = true
            for i = 0 to L - 1
            {
                if(A[iA + i] != B[iB + i])
                {
                    match = false;
                }
            }
            if match, then
                A[iA..iA+L-1] and B[iB..iB+L-1] are a maximal common substring; stop
        }
    }
}
no common substring
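Replace s1 and s2 with your strings. Note that this is a greedy scan from every start position of s1 rather than an exhaustive search, so it is not guaranteed to return the longest common subsequence: for "ADBC" and "ABCD" it returns a result of length 2, while "ABC" has length 3.
public class HelloWorld {
    public static void main(String[] args) {
        String s1 = "GXXAYB";
        String s2 = "AGGTAB";
        String ans = "";
        for (int k = 0; k < s1.length(); k++) {
            StringBuilder res = new StringBuilder();
            int m = 0; // next unused position in s2
            for (int i = k; i < s1.length(); i++) {
                for (int j = m; j < s2.length(); j++) {
                    if (s1.charAt(i) == s2.charAt(j)) {
                        res.append(s2.charAt(j));
                        m = j + 1; // keep the matches in increasing order in s2
                        break;
                    }
                }
            }
            if (res.length() > ans.length()) {
                ans = res.toString();
            }
        }
        System.out.println(ans);
    }
}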
Here is a Java method which stores/lists out all the subsequences of given string in an ArrayList.
Find all the subsequences of given 2 strings
Find common ones between them
Longest one among them is the answer
We already know that each character may either
1) appear
or
2) not appear in any subsequence.
So, we keep all the strings in the ArrayList untouched(case-2 in above).
Also, we add(to the ArrayList) strings which are results of
concatenation of already existing strings in ArrayList with the
next character of the string(case-1 above).
This covers(solves) both the above cases of our problem.
We repeat this for every letter in the string.
ArrayList<String> subseq(String s)
{
ArrayList<String> h = new ArrayList<String>();
h.add("");
int n = s.length();
int l;
for(int i=0;i<n;i++)
{
l = h.size();
for(int j=0;j<l;j++)
h.add( h.get(j) + s.charAt(i));
}
return h;
}
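The three steps listed above (enumerate all subsequences of both strings, intersect, take the longest) can be put together as follows. This is a sketch of that brute-force idea only, with hypothetical function names; it needs memory exponential in the string length, so it is usable only for short inputs.

```python
def subsequences(s):
    """All subsequences of s, built by the same doubling step as the ArrayList version."""
    subs = [""]
    for ch in s:
        # Keep every existing string (character absent) and add a copy with ch appended.
        subs += [prefix + ch for prefix in subs]
    return subs


def brute_force_lcs(s1, s2):
    common = set(subsequences(s1)) & set(subsequences(s2))
    # The empty string is always common, so max() never sees an empty set.
    return max(common, key=len)


print(brute_force_lcs("GXXAYB", "AGGTAB"))  # prints GAB
```

If several common subsequences share the maximum length, which one is returned is unspecified; only the length is well defined in that case.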

Google Combinatorial Optimization interview problem

I got asked this question in an interview for Google a couple of weeks ago. I didn't quite get the answer and I was wondering if anyone here could help me out.
You have an array with n elements. The elements are either 0 or 1.
You want to split the array into k contiguous subarrays. The size of each subarray can vary between ceil(n/2k) and floor(3n/2k). You can assume that k << n.
After you split the array into k subarrays. One element of each subarray will be randomly selected.
Devise an algorithm for maximizing the sum of the randomly selected elements from the k subarrays.
Basically means that we will want to split the array in such way such that the sum of all the expected values for the elements selected from each subarray is maximum.
You can assume that n is a power of 2.
Example:
Array: [0,0,1,1,0,0,1,1,0,1,1,0]
n = 12
k = 3
Size of subarrays can be: 2,3,4,5,6
Possible subarrays [0,0,1] [1,0,0,1] [1,0,1,1,0]
Expected Value of the sum of the elements randomly selected from the subarrays: 1/3 + 2/4 + 3/5 = 43/30 ~ 1.4333333
Optimal split: [0,0,1,1,0,0][1,1][0,1,1,0]
Expected value of optimal split: 1/3 + 1 + 1/2 = 11/6 ~ 1.83333333
I think we can solve this problem using dynamic programming.
Basically, we have:
f(i,j) is defined as the maximum sum of all expected values chosen from an array of size i and split into j subarrays. Therefore the solution should be f(n,k).
The recursive equation is:
f(i,j) = max over x of [ f(i-x,j-1) + sum(i-x+1,i)/x ] where ceil(n/2k) <= x <= floor(3n/2k)
I don't know if this is still an open question or not, but it seems like the OP has managed to add enough clarifications that this should be straightforward to solve. At any rate, if I am understanding what you are saying this seems like a fair thing to ask in an interview environment for a software development position.
Here is the basic O(n^2 * k) solution, which should be adequate for small k (as the interviewer specified):
from math import ceil, floor

def best_val(arr, K):
    n = len(arr)
    lo, hi = ceil(n / (2 * K)), floor((3 * n) / (2 * K))
    psum = [0.0]
    for x in arr:
        psum.append(psum[-1] + x)
    # tab[s] = best value for splitting arr[s:] into the number of boxes handled so far
    tab = [float("-inf")] * n + [0.0]
    for k in range(K):
        # ascending s only reads tab[t] with t > s, i.e. values from the previous round
        for s in range(n + 1):
            terms = range(s + lo, min(s + hi, n) + 1)
            tab[s] = max(((psum[t] - psum[s]) / (t - s) + tab[t] for t in terms),
                         default=float("-inf"))
    return tab[0]
This version takes ceil/floor from the math module, but you basically get the idea. The only `tricks' in this version are that it does windowing to reduce the memory overhead to just O(n) instead of O(n * k), and that it precalculates the partial sums to make computing the expected value for a box a constant time operation (thus saving a factor of O(n) from the inner loop). Positions from which no valid split exists are kept at minus infinity so that they can never be chosen.
I don't know if anyone is still interested to see the solution for this problem. Just stumbled upon this question half an hour ago and thought of posting my solution(Java). The complexity for this is O(n*K^log10). The proof is a little convoluted so I would rather provide runtime numbers:
n k time(ms)
48 4 25
48 8 265
24 4 20
24 8 33
96 4 51
192 4 143
192 8 343919
The solution is the same old recursive one where given an array, choose the first partition of size ceil(n/2k) and find the best solution recursively for the rest with number of partitions = k -1, then take ceil(n/2k) + 1 and so on.
Code:
import java.util.Date;

public class PartitionOptimization {
public static void main(String[] args) {
PartitionOptimization p = new PartitionOptimization();
int[] input = { 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0};
int splitNum = 3;
int lowerLim = (int) Math.ceil(input.length / (2.0 * splitNum));
int upperLim = (int) Math.floor((3.0 * input.length) / (2.0 * splitNum));
System.out.println(input.length + " " + lowerLim + " " + upperLim + " " +
splitNum);
Date currDate = new Date();
System.out.println(currDate);
System.out.println(p.getMaxPartExpt(input, lowerLim, upperLim,
splitNum, 0));
System.out.println(new Date().getTime() - currDate.getTime());
}
public double getMaxPartExpt(int[] input, int lowerLim, int upperLim,
int splitNum, int startIndex) {
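if (splitNum <= 1) {
int size = input.length - startIndex;
// The last part must respect the size limits too.
if (size < lowerLim || size > upperLim)
return -1;
return findExpectation(input, startIndex, input.length - 1);
}
if (!((input.length - startIndex) / lowerLim >= splitNum))
return -1;
// -1 marks "no valid split found", so an infeasible branch is not mistaken for a value of 0
double maxExpt = -1;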
double curMax = 0;
int bestI=0;
for (int i = startIndex + lowerLim - 1; i < Math.min(startIndex
+ upperLim, input.length); i++) {
double curExpect = findExpectation(input, startIndex, i);
double splitExpect = getMaxPartExpt(input, lowerLim, upperLim,
splitNum - 1, i + 1);
if (splitExpect>=0 && (curExpect + splitExpect > maxExpt)){
bestI = i;
curMax = curExpect;
maxExpt = curExpect + splitExpect;
}
}
return maxExpt;
}
public double findExpectation(int[] input, int startIndex, int endIndex) {
double expectation = 0;
for (int i = startIndex; i <= endIndex; i++) {
expectation = expectation + input[i];
}
expectation = (expectation / (endIndex - startIndex + 1));
return expectation;
}
}
Not sure I understand, the algorithm is to split the array in groups, right? The maximum value the sum can have is the number of ones. So split the array in "n" groups of 1 element each and the addition will be the maximum value possible. But it must be something else and I did not understand the problem, that seems too silly.
I think this can be solved with dynamic programming. At each possible split location, get the maximum sum if you split at that location and if you don't split at that point. A recursive function and a table to store history might be useful.
sum_i = max{ NumOnesNewPart/NumZerosNewPart * sum(NewPart) + sum(A_i+1, A_end),
sum(A_0,A_i+1) + sum(A_i+1, A_end)
}
This might lead to something...
I think it's a bad interview question, but it is also an easy problem to solve.
Every integer contributes to the expected value with weight 1/s where s is the size of the set where it has been placed. Therefore, if you guess the sizes of the sets in your partition, you just need to fill the sets with ones starting from the smallest set, and then fill the remaining largest set with zeroes.
You can easily see then that if you have a partition, filled as above, where the sizes of the sets are S_1, ..., S_k and you do a transformation where you remove one item from set S_i and move it to set S_i+1, you have the following cases:
Both S_i and S_i+1 were filled with ones; then the expected value does not change
Both them were filled with zeroes; then the expected value does not change
S_i contained both 1's and 0's and S_i+1 contains only zeroes; moving 0 to S_i+1 increases the expected value because the expected value of S_i increases
S_i contained 1's and S_i+1 contains both 1's and 0's; moving 1 to S_i+1 increases the expected value because the expected value of S_i+1 increases and S_i remains intact
In all these cases, you can shift an element from S_i to S_i+1, maintaining the filling rule of filling smallest sets with 1's, so that the expected value increases. This leads to the simple algorithm:
Create a partitioning where there is a maximal number of maximum-size arrays and maximal number of minimum-size arrays
Fill the arrays starting from smallest one with 1's
Fill the remaining slots with 0's
How about a recursive function:
double BestValue(Array A, int numSplits)
// Returns the best value that would be obtained by splitting
// into numSplits partitions.
This in turn uses a helper:
// The additional argument is an array of the valid split sizes which
// is the same for each call.
double BestValueHelper(Array A, int numSplits, Array splitSizes)
{
double result = 0;
for splitSize in splitSizes
// a split of splitSize elements covers indices 0 .. splitSize-1
double splitResult = ExpectedValue(A, 0, splitSize - 1) +
BestValueHelper(A+splitSize, numSplits-1, splitSizes);
if splitResult > result
result = splitResult;
return result;
}
ExpectedValue(Array A, int l, int m) computes the expected value of a split of A that goes from l to m i.e. (A[l] + A[l+1] + ... A[m]) / (m-l+1).
BestValue calls BestValueHelper after computing the array of valid split sizes between ceil(n/2k) and floor(3n/2k).
I have omitted error handling and some end conditions but those should not be too difficult to add.
Let
a[] = given array of length n
from = inclusive index of array a
k = number of required splits
minSize = minimum size of a split
maxSize = maximum size of a split
d = maxSize - minSize
expectation(a, from, to) = average of all element of array a from "from" to "to"
Optimal(a[], from, k) = MAX[ for(j>=minSize-1 to <=maxSize-1) { expectation(a, from, from+j) + Optimal(a, from+j+1, k-1)} ]
Runtime (assuming memoization or dp) = O(n*k*d)
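A memoized version of this recurrence could look like the Python sketch below. The names optimal_split and optimal are assumptions for this illustration, and the size limits ceil(n/2k) and floor(3n/2k) are taken from the question; infeasible states are given minus infinity so they are never selected.

```python
from functools import lru_cache


def optimal_split(a, k):
    n = len(a)
    min_size = -(-n // (2 * k))      # ceil(n / 2k) in integer arithmetic
    max_size = (3 * n) // (2 * k)    # floor(3n / 2k)
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)

    @lru_cache(maxsize=None)
    def optimal(start, parts):
        # Best sum of expectations when a[start:] is cut into exactly `parts` pieces.
        if parts == 0:
            return 0.0 if start == n else float("-inf")
        best = float("-inf")
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > n:
                break
            expectation = (prefix[end] - prefix[start]) / size
            best = max(best, expectation + optimal(end, parts - 1))
        return best

    return optimal(0, k)


print(optimal_split([0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0], 3))  # about 1.8333 (11/6)
```

There are at most (n+1) * (k+1) states and each tries d+1 sizes, which matches the O(n*k*d) bound stated above.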

What string similarity algorithms are there?

I need to compare 2 strings and calculate their similarity, to filter down a list of the most similar strings.
e.g. searching for "dog" would return
dog
doggone
bog
fog
foggy
e.g. searching for "crack" would return
crack
wisecrack
rack
jack
quack
I have come across:
QuickSilver
LiquidMetal
What other string similarity algorithms are there?
The Levenshtein distance is the algorithm I would recommend. It calculates the minimum number of operations you must do to change 1 string into another. The fewer changes means the strings are more similar...
It seems you need some kind of fuzzy matching. Here is a Java implementation of a set of similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is a more detailed explanation of string metrics: http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf. Which one to pick depends on how fuzzy and how fast your implementation must be.
If the focus is on performance, I would implement an algorithm based on a trie structure
(works well to find words in a text, or to help correct a word, but in your case you can find quickly all words containing a given word or all but one letter, for instance).
Please follow first the wikipedia link above.Tries is the fastest words sorting method (n words, search s, O(n) to create the trie, O(1) to search s (or if you prefer, if a is the average length, O(an) for the trie and O(s) for the search)).
A fast and easy implementation (to be optimized) of your problem (similar words) consists of
Make the trie with the list of words, having all letters indexed front and back (see example below)
To search s, iterate from s[0] to find the word in the trie, then s[1] etc...
In the trie, if the number of letters found is len(s)-k the word is displayed, where k is the tolerance (1 letter missing, 2...).
The algorithm may be extended to the words in the list (see below)
Example, with the words car, vars.
Building the trie (big letter means a word end here, while another may continue). The > is post-index (go forward) and < is pre-index (go backward). In another example we may have to indicate also the starting letter, it is not presented here for clarity.
The < and > in C++ for instance would be Mystruct *previous,*next, meaning from a > c < r, you can go directly from a to c, and reversely, also from a to R.
1. c < a < R
2. a > c < R
3. > v < r < S
4. R > a > c
5. > v < S
6. v < a < r < S
7. S > r > a > v
Looking strictly for car the trie gives you access from 1., and you find car (you would have found also everything starting with car, but also anything with car inside - it is not in the example - but vicar for instance would have been found from c > i > v < a < R).
To search while allowing 1-letter wrong/missing tolerance, you iterate from each letter of s, and, count the number of consecutive - or by skipping 1 letter - letters you get from s in the trie.
looking for car,
c: searching the trie for c < a and c < r (missing letter in s). To accept a wrong letter in a word w, try to jump at each iteration the wrong letter to see if ar is behind, this is O(w). With two letters, O(w²) etc... but another level of index could be added to the trie to take into account the jump over letters - making the trie complex, and greedy regarding memory.
a, then r: same as above, but searching backwards as well
This is just to provide an idea about the principle - the example above may have some glitches (I'll check again tomorrow).
You could do this:
Foreach string in haystack Do
offset := -1;
matchedCharacters := 0;
Foreach char in needle Do
offset := PositionInString(string, char, offset+1);
If offset = -1 Then
Break;
End;
matchedCharacters := matchedCharacters + 1;
End;
If matchedCharacters > 0 Then
// (partial) match found
End;
End;
With matchedCharacters you can determine the “degree” of the match. If it is equal to the length of needle, all characters in needle are also in string. If you also store the offset of the first matched character, you can also sort the result by the “density” of the matched characters by subtracting the offset of the first matched character from the offset of the last matched character offset; the lower the difference, the more dense the match.
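The pseudocode and the "density" idea can be sketched in Python like this. The function name match_degree and the returned (count, span) pair are assumptions for this illustration; str.find plays the role of PositionInString.

```python
def match_degree(needle, haystack):
    """Return (matched character count, span between first and last match)."""
    offset = -1
    positions = []
    for ch in needle:
        # Look for the next needle character strictly after the previous match.
        offset = haystack.find(ch, offset + 1)
        if offset == -1:
            break
        positions.append(offset)
    # A smaller span means the matched characters sit closer together.
    span = positions[-1] - positions[0] if positions else None
    return len(positions), span


print(match_degree("dog", "doggone"))      # prints (3, 2)
print(match_degree("crack", "wisecrack"))  # prints (5, 4)
```

As in the pseudocode, matching stops at the first needle character that cannot be found, so "dog" against "bog" reports zero matched characters.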
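using System;

class Program {
static int ComputeLevenshteinDistance(string source, string target) {
// Treat null as the empty string.
if (source == null) source = "";
if (target == null) target = "";
// The distance to an empty string is the length of the other string.
if (source.Length == 0) return target.Length;
if (target.Length == 0) return source.Length;
if (source == target) return 0;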
int sourceWordCount = source.Length;
int targetWordCount = target.Length;
int[,] distance = new int[sourceWordCount + 1, targetWordCount + 1];
// Step 2
for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);
for (int j = 0; j <= targetWordCount; distance[0, j] = j++);
for (int i = 1; i <= sourceWordCount; i++) {
for (int j = 1; j <= targetWordCount; j++) {
// Step 3
int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;
// Step 4
distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);
}
}
return distance[sourceWordCount, targetWordCount];
}
static void Main(string[] args){
Console.WriteLine(ComputeLevenshteinDistance ("Stackoverflow","StuckOverflow"));
Console.ReadKey();
}
}
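To connect the distance back to the filtering task in the question, here is a small Python sketch that ranks candidate words by edit distance. It is an illustration under assumptions: the function names are hypothetical, and plain Levenshtein distance is only one possible ranking (it favours "bog" over "doggone" for the query "dog", which may or may not be what you want).

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions between two strings."""
    prev = list(range(len(target) + 1))  # distances from the empty prefix of source
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_by_similarity(query, words):
    """Sort words from most to least similar to query."""
    return sorted(words, key=lambda w: levenshtein(query, w))


print(levenshtein("Stackoverflow", "StuckOverflow"))      # prints 2
print(rank_by_similarity("dog", ["foggy", "dog", "bog"]))  # prints ['dog', 'bog', 'foggy']
```

Only two rows of the table are kept at a time, so memory is linear in the length of the target string.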
