What string similarity algorithms are there? - algorithm

I need to compare 2 strings and calculate their similarity, to filter down a list of the most similar strings.
e.g. searching for "dog" would return
dog
doggone
bog
fog
foggy
e.g. searching for "crack" would return
crack
wisecrack
rack
jack
quack
I have come across:
QuickSilver
LiquidMetal
What other string similarity algorithms are there?

The Levenshtein distance is the algorithm I would recommend. It calculates the minimum number of operations you must do to change 1 string into another. The fewer changes means the strings are more similar...

It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.

If the focus is on performance, I would implement an algorithm based on a trie structure
(works well to find words in a text, or to help correct a word, but in your case you can find quickly all words containing a given word or all but one letter, for instance).
Please follow first the wikipedia link above.Tries is the fastest words sorting method (n words, search s, O(n) to create the trie, O(1) to search s (or if you prefer, if a is the average length, O(an) for the trie and O(s) for the search)).
A fast and easy implementation (to be optimized) of your problem (similar words) consists of
Make the trie with the list of words, having all letters indexed front and back (see example below)
To search s, iterate from s[0] to find the word in the trie, then s[1] etc...
In the trie, if the number of letters found is len(s)-k the word is displayed, where k is the tolerance (1 letter missing, 2...).
The algorithm may be extended to the words in the list (see below)
Example, with the words car, vars.
Building the trie (big letter means a word end here, while another may continue). The > is post-index (go forward) and < is pre-index (go backward). In another example we may have to indicate also the starting letter, it is not presented here for clarity.
The < and > in C++ for instance would be Mystruct *previous,*next, meaning from a > c < r, you can go directly from a to c, and reversely, also from a to R.
1. c < a < R
2. a > c < R
3. > v < r < S
4. R > a > c
5. > v < S
6. v < a < r < S
7. S > r > a > v
Looking strictly for car the trie gives you access from 1., and you find car (you would have found also everything starting with car, but also anything with car inside - it is not in the example - but vicar for instance would have been found from c > i > v < a < R).
To search while allowing 1-letter wrong/missing tolerance, you iterate from each letter of s, and, count the number of consecutive - or by skipping 1 letter - letters you get from s in the trie.
looking for car,
c: searching the trie for c < a and c < r (missing letter in s). To accept a wrong letter in a word w, try to jump at each iteration the wrong letter to see if ar is behind, this is O(w). With two letters, O(w²) etc... but another level of index could be added to the trie to take into account the jump over letters - making the trie complex, and greedy regarding memory.
a, then r: same as above, but searching backwards as well
This is just to provide an idea about the principle - the example above may have some glitches (I'll check again tomorrow).

You could do this:
Foreach string in haystack Do
offset := -1;
matchedCharacters := 0;
Foreach char in needle Do
offset := PositionInString(string, char, offset+1);
If offset = -1 Then
Break;
End;
matchedCharacters := matchedCharacters + 1;
End;
If matchedCharacters > 0 Then
// (partial) match found
End;
End;
With matchedCharacters you can determine the “degree” of the match. If it is equal to the length of needle, all characters in needle are also in string. If you also store the offset of the first matched character, you can also sort the result by the “density” of the matched characters by subtracting the offset of the first matched character from the offset of the last matched character offset; the lower the difference, the more dense the match.

class Program {
static int ComputeLevenshteinDistance(string source, string target) {
if ((source == null) || (target == null)) return 0;
if ((source.Length == 0) || (target.Length == 0)) return 0;
if (source == target) return source.Length;
int sourceWordCount = source.Length;
int targetWordCount = target.Length;
int[,] distance = new int[sourceWordCount + 1, targetWordCount + 1];
// Step 2
for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);
for (int j = 0; j <= targetWordCount; distance[0, j] = j++);
for (int i = 1; i <= sourceWordCount; i++) {
for (int j = 1; j <= targetWordCount; j++) {
// Step 3
int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;
// Step 4
distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);
}
}
return distance[sourceWordCount, targetWordCount];
}
static void Main(string[] args){
Console.WriteLine(ComputeLevenshteinDistance ("Stackoverflow","StuckOverflow"));
Console.ReadKey();
}
}

Related

What is wrong with the recursive algorithm developed for the below problem?

I have tried to solve an algorithmic problem. I have come up with a recursive algorithm to solve the same. This is the link to the problem:
https://codeforces.com/problemset/problem/1178/B
This problem is not from any contest that is currently going on.
I have coded my algorithm and had run it on a few test cases, it turns out that it is counting more than the correct amount. I went through my thought process again and again but could not find any mistake. I have written my algorithm (not the code, but just the recursive function I have thought of) below. Can I please know where had I gone wrong -- what was the mistake in my thought process?
Let my recursive function be called as count, it takes any of the below three forms as the algorithm proceeds.
count(i,'o',0) = count(i+1,'o',0) [+ count(i+1,'w',1) --> iff (i)th
element of the string is 'o']
count(i,'w',0) = count(i+1,'w',0) [+ count(i+2,'o',0) --> iff (i)th and (i+1)th elements are both equal to 'v']
count(i,'w',1) = count(i+1,'w',1) [+ 1 + count(i+2,'w',0) --> iff (i)th and (i+1)th elements are both equal to 'v']
Note: The recursive function calls present inside the [.] (square brackets) will be called iff the conditions mentioned after the arrows are satisfied.)
Explanation: The main idea behind the recursive function developed is to count the number of occurrences of the given sequence. The count function takes 3 arguments:
argument 1: The index of the string on which we are currently located.
argument 2: The pattern we are looking for (if this argument is 'o' it means that we are looking for the letter 'o' -- i.e. at which index it is there. If it is 'w' it means that we are looking for the pattern 'vv' -- i.e. we are looking for 2 consecutive indices where this pattern occurs.)
argument 3: This can be either 1 or 0. If it is 1 it means that we are looking for the 'vv' pattern, having already found the 'o' i.e. we are looking for the 'vv' pattern shown in bold: vvovv. If it is 0, it means that we are searching for the 'vv' pattern which will be the
beginning of the pattern vvovv (shown in bold.)
I will initiate the algorithm with count(0,'w',0) -- it means, we are at the 0th index of the string, we are looking for the pattern 'vv', and this 'vv' will be the prefix of the 'vvovv' pattern we wish to find.
So, the output of count(0,'w',0) should be my answer. Now comes the trouble, for the following input: "vvovooovovvovoovoovvvvovo" (say input1), my program (which is based on the above algorithm) gives the expected answer(= 50). But, when I just append "vv" to the above input to get a new input: "vvovooovovvovoovoovvvvovovv" (say input2) and run my algorithm again, I get 135 as the answer, while the correct answer is 75 (this is the answer the solution code returns). Why is this happening? Where had I made an error?
Also, one more doubt is if the output for the input1 is 50, then the output for the input2 should be at least twice right -- because all of the subsequences which were present in the input1, will be present in the input2 too and all of those subsequences can also form a new subsequence with the appended 'vv' -- this means we have at least 100 favourable subsequences right?
P.S. This is the link to the solution code https://codeforces.com/blog/entry/68534
This question doesn't need recursion or dynamic programming.
The basic idea is to count how many ws we have before and after each o.
If you have X vs, it means you have X - 1 ws.
Let's use vvvovvv as an example. We know that before and after the o we have 3 vs, which means 2 ws. To evaluate the answer, just multiply 2x2 = 4.
For each o we find, we just need to multiply the ws before and after it, sum it all and this is our answer.
We can find how many ws there are before and after each o in linear time.
#include <iostream>
using namespace std;
int convert_v_to_w(int v_count){
return max(0, v_count - 1);
}
int main(){
string s = "vvovooovovvovoovoovvvvovovvvov";
int n = s.size();
int wBefore[n];
int wAfter[n];
int v_count = 0, wb = 0, wa = 0;
//counting ws before each o
int i = 0;
while(i < n){
v_count = 0;
while(i < n && s[i] == 'v'){
v_count++;
i++;
}
wb += convert_v_to_w(v_count);
if(i < n && s[i] == 'o'){
wBefore[i] = wb;
}
i++;
}
//counting ws after each o
i = n - 1;
while(i >= 0){
v_count = 0;
while(i >= 0 && s[i] == 'v'){
v_count++;
i--;
}
wa += convert_v_to_w(v_count);
if(i >= 0 && s[i] == 'o'){
wAfter[i] = wa;
}
i--;
}
//evaluating answer by multiplying ws before and after each o
int ans = 0;
for(int i = 0; i < n; i++){
if(s[i] == 'o') ans += wBefore[i] * wAfter[i];
}
cout<<ans<<endl;
}
output: 100
complexity: O(n) time and space

how to write iterative algorithm for generate all subsets of a set?

I wrote recursive backtracking algorithm for finding all subsets of a given set.
void backtracke(int* a, int k, int n)
{
if (k == n)
{
for(int i = 1; i <=k; ++i)
{
if (a[i] == true)
{
std::cout << i << " ";
}
}
std::cout << std::endl;
return;
}
bool c[2];
c[0] = false;
c[1] = true;
++k;
for(int i = 0; i < 2; ++i)
{
a[k] = c[i];
backtracke(a, k, n);
a[k] = INT_MAX;
}
}
now we have to write the same algorithm but in an iterative form, how to do it ?
You can use the binary counter approach. Any unique binary string of length n represents a unique subset of a set of n elements. If you start with 0 and end with 2^n-1, you cover all possible subsets. The counter can be easily implemented in an iterative manner.
The code in Java:
public static void printAllSubsets(int[] arr) {
byte[] counter = new byte[arr.length];
while (true) {
// Print combination
for (int i = 0; i < counter.length; i++) {
if (counter[i] != 0)
System.out.print(arr[i] + " ");
}
System.out.println();
// Increment counter
int i = 0;
while (i < counter.length && counter[i] == 1)
counter[i++] = 0;
if (i == counter.length)
break;
counter[i] = 1;
}
}
Note that in Java one can use BitSet, which makes the code really shorter, but I used a byte array to illustrate the process better.
There are a few ways to write an iterative algorithm for this problem. The most commonly suggested would be to:
Count (i.e. a simply for-loop) from 0 to 2numberOfElements - 1
If we look at the variable used above for counting in binary, the digit at each position could be thought of a flag indicating whether or not the element at the corresponding index in the set should be included in this subset. Simply loop over each bit (by taking the remainder by 2, then dividing by 2), including the corresponding elements in our output.
Example:
Input: {1,2,3,4,5}.
We'd start counting at 0, which is 00000 in binary, which means no flags are set, so no elements are included (this would obviously be skipped if you don't want the empty subset) - output {}.
Then 1 = 00001, indicating that only the last element would be included - output {5}.
Then 2 = 00010, indicating that only the second last element would be included - output {4}.
Then 3 = 00011, indicating that the last two elements would be included - output {4,5}.
And so on, all the way up to 31 = 11111, indicating that all the elements would be included - output {1,2,3,4,5}.
* Actually code-wise, it would be simpler to turn this on its head - output {1} for 00001, considering that the first remainder by 2 will then correspond to the flag of the 0th element, the second remainder, the 1st element, etc., but the above is simpler for illustrative purposes.
More generally, any recursive algorithm could be changed to an iterative one as follows:
Create a loop consisting of parts (think switch-statement), with each part consisting of the code between any two recursive calls in your function
Create a stack where each element contains each necessary local variable in the function, and an indication of which part we're busy with
The loop would pop elements from the stack, executing the appropriate section of code
Each recursive call would be replaced by first adding it's own state to the stack, and then the called state
Replace return with appropriate break statements
A little Python implementation of George's algorithm. Perhaps it will help someone.
def subsets(S):
l = len(S)
for x in range(2**l):
yield {s for i,s in enumerate(S) if ((x / 2**i) % 2) // 1 == 1}
Basically what you want is P(S) = S_0 U S_1 U ... U S_n where S_i is a set of all sets contained by taking i elements from S. In other words if S= {a, b, c} then S_0 = {{}}, S_1 = {{a},{b},{c}}, S_2 = {{a, b}, {a, c}, {b, c}} and S_3 = {a, b, c}.
The algorithm we have so far is
set P(set S) {
PS = {}
for i in [0..|S|]
PS = PS U Combination(S, i)
return PS
}
We know that |S_i| = nCi where |S| = n. So basically we know that we will be looping nCi times. You may use this information to optimize the algorithm later on. To generate combinations of size i the algorithm that I present is as follows:
Suppose S = {a, b, c} then you can map 0 to a, 1 to b and 2 to c. And perumtations to these are (if i=2) 0-0, 0-1, 0-2, 1-0, 1-1, 1-2, 2-0, 2-1, 2-2. To check if a sequence is a combination you check if the numbers are all unique and that if you permute the digits the sequence doesn't appear elsewhere, this will filter the above sequence to just 0-1, 0-2 and 1-2 which are later mapped back to {a,b},{a,c},{b,c}. How to generate the long sequence above you can follow this algorithm
set Combination(set S, integer l) {
CS = {}
for x in [0..2^l] {
n = {}
for i in [0..l] {
n = n U {floor(x / |S|^i) mod |S|} // get the i-th digit in x base |S|
}
CS = CS U {S[n]}
}
return filter(CS) // filtering described above
}

Add the least amount of characters to make a palindrome

The question:
Given any string, add the least amount of characters possible to make it a palindrome in linear time.
I'm only able to come up with a O(N2) solution.
Can someone help me with an O(N) solution?
Revert the string
Use a modified Knuth-Morris-Pratt to find the latest match (simplest modification would be to just append the original string to the reverted string and ignore matches after len(string).
Append the unmatched rest of the reverted string to the original.
1 and 3 are obviously linear and 2 is linear beacause Knuth-Morris-Pratt is.
If only appending is allowed
A Scala solution:
def isPalindrome(s: String) = s.view.reverse == s.view
def makePalindrome(s: String) =
s + s.take((0 to s.length).find(i => isPalindrome(s.substring(i))).get).reverse
If you're allowed to insert characters anywhere
Every palindrome can be viewed as a set of nested letter pairs.
a n n a b o b
| | | | | * |
| -- | | |
--------- -----
If the palindrome length n is even, we'll have n/2 pairs. If it is odd, we'll have n/2 full pairs and one single letter in the middle (let's call it a degenerated pair).
Let's represent them by pairs of string indexes - the left index counted from the left end of the string, and the right index counted from the right end of the string, both ends starting with index 0.
Now let's write pairs starting from the outer to the inner. So in our example:
anna: (0, 0) (1, 1)
bob: (0, 0) (1, 1)
In order to make any string a palindrome, we will go from both ends of the string one character at a time, and with every step, we'll eventually add a character to produce a correct pair of identical characters.
Example:
Assume the input word is "blob"
Pair (0, 0) is (b, b) ok, nothing to do, this pair is fine. Let's increase the counter.
Pair (1, 1) is (l, o). Doesn't match. So let's add "o" at position 1 from the left. Now our word became "bolob".
Pair (2, 2). We don't need to look even at the characters, because we're pointing at the same index in the string. Done.
Wait a moment, but we have a problem here: in point 2. we arbitrarily chose to add a character on the left. But we could as well add a character "l" on the right. That would produce "blolb", also a valid palindrome. So does it matter? Unfortunately it does because the choice in earlier steps may affect how many pairs we'll have to fix and therefore how many characters we'll have to add in the future steps.
Easy algorithm: search all the possiblities. That would give us a O(2^n) algorithm.
Better algorithm: use Dynamic Programming approach and prune the search space.
In order to keep things simpler, now we decouple inserting of new characters from just finding the right sequence of nested pairs (outer to inner) and fixing their alignment later. So for the word "blob" we have the following possibilities, both ending with a degenerated pair:
(0, 0) (1, 2)
(0, 0) (2, 1)
The more such pairs we find, the less characters we will have to add to fix the original string. Every full pair found gives us two characters we can reuse. Every degenerated pair gives us one character to reuse.
The main loop of the algorithm will iteratively evaluate pair sequences in such a way, that in step 1 all valid pair sequences of length 1 are found. The next step will evaluate sequences of length 2, the third sequences of length 3 etc. When at some step we find no possibilities, this means the previous step contains the solution with the highest number of pairs.
After each step, we will remove the pareto-suboptimal sequences. A sequence is suboptimal compared to another sequence of the same length, if its last pair is dominated by the last pair of the other sequence. E.g. sequence (0, 0)(1, 3) is worse than (0, 0)(1, 2). The latter gives us more room to find nested pairs and we're guaranteed to find at least all the pairs that we'd find for the former. However sequence (0, 0)(1, 2) is neither worse nor better than (0, 0)(2, 1). The one minor detail we have to beware of is that a sequence ending with a degenerated pair is always worse than a sequence ending with a full pair.
After bringing it all together:
def makePalindrome(str: String): String = {
/** Finds the pareto-minimum subset of a set of points (here pair of indices).
* Could be done in linear time, without sorting, but O(n log n) is not that bad ;) */
def paretoMin(points: Iterable[(Int, Int)]): List[(Int, Int)] = {
val sorted = points.toSeq.sortBy(identity)
(List.empty[(Int, Int)] /: sorted) { (result, e) =>
if (result.isEmpty || e._2 <= result.head._2)
e :: result
else
result
}
}
/** Find all pairs directly nested within a given pair.
* For performance reasons tries to not include suboptimal pairs (pairs nested in any of the pairs also in the result)
* although it wouldn't break anything as prune takes care of this. */
def pairs(left: Int, right: Int): Iterable[(Int, Int)] = {
val builder = List.newBuilder[(Int, Int)]
var rightMax = str.length
for (i <- left until (str.length - right)) {
rightMax = math.min(str.length - left, rightMax)
val subPairs =
for (j <- right until rightMax if str(i) == str(str.length - j - 1)) yield (i, j)
subPairs.headOption match {
case Some((a, b)) => rightMax = b; builder += ((a, b))
case None =>
}
}
builder.result()
}
/** Builds sequences of size n+1 from sequence of size n */
def extend(path: List[(Int, Int)]): Iterable[List[(Int, Int)]] =
for (p <- pairs(path.head._1 + 1, path.head._2 + 1)) yield p :: path
/** Whether full or degenerated. Full-pairs save us 2 characters, degenerated save us only 1. */
def isFullPair(pair: (Int, Int)) =
pair._1 + pair._2 < str.length - 1
/** Removes pareto-suboptimal sequences */
def prune(sequences: List[List[(Int, Int)]]): List[List[(Int, Int)]] = {
val allowedHeads = paretoMin(sequences.map(_.head)).toSet
val containsFullPair = allowedHeads.exists(isFullPair)
sequences.filter(s => allowedHeads.contains(s.head) && (isFullPair(s.head) || !containsFullPair))
}
/** Dynamic-Programming step */
#tailrec
def search(sequences: List[List[(Int, Int)]]): List[List[(Int, Int)]] = {
val nextStage = prune(sequences.flatMap(extend))
nextStage match {
case List() => sequences
case x => search(nextStage)
}
}
/** Converts a sequence of nested pairs to a palindrome */
def sequenceToString(sequence: List[(Int, Int)]): String = {
val lStr = str
val rStr = str.reverse
val half =
(for (List(start, end) <- sequence.reverse.sliding(2)) yield
lStr.substring(start._1 + 1, end._1) + rStr.substring(start._2 + 1, end._2) + lStr(end._1)).mkString
if (isFullPair(sequence.head))
half + half.reverse
else
half + half.reverse.substring(1)
}
sequenceToString(search(List(List((-1, -1)))).head)
}
Note: The code does not list all the palindromes, but gives only one example, and it is guaranteed it has the minimum length. There usually are more palindromes possible with the same minimum length (O(2^n) worst case, so you probably don't want to enumerate them all).
O(n) time solution.
Algorithm:
Need to find the longest palindrome within the given string that contains the last character. Then add all the character that are not part of the palindrome to the back of the string in reverse order.
Key point:
In this problem, the longest palindrome in the given string MUST contain the last character.
ex:
input: abacac
output: abacacaba
Here the longest palindrome in the input that contains the last letter is "cac". Therefore add all the letter before "cac" to the back in reverse order to make the entire string a palindrome.
written in c# with a few test cases commented out
static public void makePalindrome()
{
//string word = "aababaa";
//string word = "abacbaa";
//string word = "abcbd";
//string word = "abacac";
//string word = "aBxyxBxBxyxB";
//string word = "Malayal";
string word = "abccadac";
int j = word.Length - 1;
int mark = j;
bool found = false;
for (int i = 0; i < j; i++)
{
char cI = word[i];
char cJ = word[j];
if (cI == cJ)
{
found = true;
j--;
if(mark > i)
mark = i;
}
else
{
if (found)
{
found = false;
i--;
}
j = word.Length - 1;
mark = j;
}
}
for (int i = mark-1; i >=0; i--)
word += word[i];
Console.Write(word);
}
}
Note that this code will give you the solution for least amount of letter to APPEND TO THE BACK to make the string a palindrome. If you want to append to the front, just have a 2nd loop that goes the other way. This will make the algorithm O(n) + O(n) = O(n). If you want a way to insert letters anywhere in the string to make it a palindrome, then this code will not work for that case.
I believe #Chronical's answer is wrong, as it seems to be for best case scenario, not worst case which is used to compute big-O complexity. I welcome the proof, but the "solution" doesn't actually describe a valid answer.
KMP finds a matching substring in O(n * 2k) time, where n is the length of the input string, and k substring we're searching for, but does not in O(n) time tell you what the longest palindrome in the input string is.
To solve this problem, we need to find the longest palindrome at the end of the string. If this longest suffix palindrome is of length x, the minimum number of characters to add is n - x. E.g. the string aaba's longest suffix substring is aba of length 3, thus our answer is 1. The algorithm to find out if a string is a palindrome takes O(n) time, whether using KMP or the more efficient and simple algorithm (O(n/2)):
Take two pointers, one at the first character and one at the last character
Compare the characters at the pointers, if they're equal, move each pointer inward, otherwise return false
When the pointers point to the same index (odd string length), or have overlapped (even string length), return true
Using the simple algorithm, we start from the entire string and check if it's a palindrome. If it is, we return 0, and if not, we check the string string[1...end], string[2...end] until we have reached a single character and return n - 1. This results in a runtime of O(n^2).
Splitting up the KMP algorithm into
Build table
Search for longest suffix palindrome
Building the table takes O(n) time, and then each check of "are you a palindrome" for each substring from string[0...end], string[1...end], ..., string[end - 2...end] each takes O(n) time. k in this case is the same factor of n that the simple algorithm takes to check each substring, because it starts as k = n, then goes through k = n - 1, k = n - 2... just the same as the simple algorithm did.
TL; DR:
KMP can tell you if a string is a palindrome in O(n) time, but that supply an answer to the question, because you have to check if all substrings string[0...end], string[1...end], ..., string[end - 2...end] are palindromes, resulting in the same (but actually worse) runtime as a simple palindrome-check algorithm.
#include<iostream>
#include<string>
using std::cout;
using std::endl;
using std::cin;
int main() {
std::string word, left("");
cin >> word;
size_t start, end;
for (start = 0, end = word.length()-1; start < end; end--) {
if (word[start] != word[end]) {
left.append(word.begin()+end, 1 + word.begin()+end);
continue;
}
left.append(word.begin()+start, 1 + word.begin()+start), start++;
}
cout << left << ( start == end ? std::string(word.begin()+end, 1 + word.begin()+end) : "" )
<< std::string(left.rbegin(), left.rend()) << endl;
return 0;
}
Don't know if it appends the minimum number, but it produces palindromes
Explained:
We will start at both ends of the given string and iterate inwards towards the center.
At each iteration, we check if each letter is the same, i.e. word[start] == word[end]?.
If they are the same, we append a copy of the variable word[start] to another string called left which as it name suggests will serve as the left hand side of the new palindrome string when iteration is complete. Then we increment both variables (start)++ and (end)-- towards the center
In the case that they are not the same, we append a copy of of the variable word[end] to the same string left
And this is the basics of the algorithm until the loop is done.
When the loop is finished, one last check is done to make sure that if we got an odd length palindrome, we append the middle character to the middle of the new palindrome formed.
Note that if you decide to append the oppoosite characters to the string left, the opposite about everything in the code becomes true; i.e. which index is incremented at each iteration and which is incremented when a match is found, order of printing the palindrome, etc. I don't want to have to go through it again but you can try it and see.
The running complexity of this code should be O(N) assuming that append method of the std::string class runs in constant time.
If some wants to solve this in ruby, The solution can be very simple
str = 'xcbc' # Any string that you want.
arr1 = str.split('')
arr2 = arr1.reverse
count = 0
while(str != str.reverse)
count += 1
arr1.insert(count-1, arr2[count-1])
str = arr1.join('')
end
puts str
puts str.length - arr2.count
I am assuming that you cannot replace or remove any existing characters?
A good start would be reversing one of the strings and finding the longest-common-substring (LCS) between the reversed string and the other string. Since it sounds like this is a homework or interview question, I'll leave the rest up to you.
Here see this solution
This is better than O(N^2)
Problem is sub divided in to many other sub problems
ex:
original "tostotor"
reversed "rototsot"
Here 2nd position is 'o' so dividing in to two problems by breaking in to "t" and "ostot" from the original string
For 't':solution is 1
For 'ostot':solution is 2 because LCS is "tot" and characters need to be added are "os"
so total is 2+1 = 3
def shortPalin( S):
k=0
lis=len(S)
for i in range(len(S)/2):
if S[i]==S[lis-1-i]:
k=k+1
else :break
S=S[k:lis-k]
lis=len(S)
prev=0
w=len(S)
tot=0
for i in range(len(S)):
if i>=w:
break;
elif S[i]==S[lis-1-i]:
tot=tot+lcs(S[prev:i])
prev=i
w=lis-1-i
tot=tot+lcs(S[prev:i])
return tot
def lcs( S):
if (len(S)==1):
return 1
li=len(S)
X=[0 for x in xrange(len(S)+1)]
Y=[0 for l in xrange(len(S)+1)]
for i in range(len(S)-1,-1,-1):
for j in range(len(S)-1,-1,-1):
if S[i]==S[li-1-j]:
X[j]=1+Y[j+1]
else:
X[j]=max(Y[j],X[j+1])
Y=X
return li-X[0]
print shortPalin("tostotor")
Using Recursion
#include <iostream>
using namespace std;
int length( char str[])
{ int l=0;
for( int i=0; str[i]!='\0'; i++, l++);
return l;
}
int palin(char str[],int len)
{ static int cnt;
int s=0;
int e=len-1;
while(s<e){
if(str[s]!=str[e]) {
cnt++;
return palin(str+1,len-1);}
else{
s++;
e--;
}
}
return cnt;
}
int main() {
char str[100];
cin.getline(str,100);
int len = length(str);
cout<<palin(str,len);
}
Solution with O(n) time complexity
public static void main(String[] args) {
String givenStr = "abtb";
String palindromeStr = covertToPalindrome(givenStr);
System.out.println(palindromeStr);
}
private static String covertToPalindrome(String str) {
char[] strArray = str.toCharArray();
int low = 0;
int high = strArray.length - 1;
int subStrIndex = -1;
while (low < high) {
if (strArray[low] == strArray[high]) {
high--;
} else {
high = strArray.length - 1;
subStrIndex = low;
}
low++;
}
return str + (new StringBuilder(str.substring(0, subStrIndex+1))).reverse().toString();
}
// string to append to convert it to a palindrome
public static void main(String args[])
{
String s=input();
System.out.println(min_operations(s));
}
static String min_operations(String str)
{
int i=0;
int j=str.length()-1;
String ans="";
while(i<j)
{
if(str.charAt(i)!=str.charAt(j))
{
ans=ans+str.charAt(i);
}
if(str.charAt(i)==str.charAt(j))
{
j--;
}
i++;
}
StringBuffer sd=new StringBuffer(ans);
sd.reverse();
return (sd.toString());
}

Represent a word with an alphabet

This is an interview question:
Imagine an alphabet of words. Example:
a ==> 1
b ==> 2
c ==> 3
.
z ==> 26
ab ==> 27
ac ==> 28
.
az ==> 51
bc ==> 52
and so on.
Such that the sequence of characters need to be in ascending order only (ab is valid but ba is not). Given any word print its index if valid and 0 if not.
Input Output
ab 27
ba 0
aez 441
Note: Brute-force is not allowed. Here is the link to the question: http://www.careercup.com/question?id=21117662
I can understand that solution as:
The total words is 2^26 -1.
For a given word, the words with small size occurs first.
Let n be the length of the word,
Total number of words with size less than n is C(26, 1) + C(26, 2) + ...+ C(26, n -1)
Then calculate how many words with the same size prior to the given word
The sum of two numbers plusing one is the result
Reference: sites.google.com/site/spaceofjameschen/annnocements/printtheindexofawordwithlettersinascendingorder--microsoft
In the sample solution, I understood how the author calculated number of words with size less than word.size(). But, in the code, I am not too sure about how to find number of words of the same size as word.size() that occur before 'word'.
Precisely, this bit:
char desirableStart;
i = 0;
while( i < str.size()){
desirableStart = (i == 0) ? 'a' : str[i - 1] + 1;
for(int j = desirableStart; j < str[i]; ++j){
index += NChooseK('z' - j, str.size() - i - 1); // Choose str.size() - i - 1 in the available charset
}
i ++;
}
Can someone help me understand this bit? Thanks.
First of all (you probably got this part, but for completeness sake), the NChooseK function calculates the binomial coefficient, i.e. the number of ways to choose k elements from a set of n elements. This function is referred to as C(n, k) in your comments, so I will use the same notation.
Since the letters are sorted and not repeating, this is exactly the number of ways one can create the n-letter words described in the problem, so this is why the first part of the function is getting you at the right position:
int index = 0;
int i = 1;
while(i < str.size()) {
// choose *i* letters out of 26
index += NChooseK(26, i);
i++;
}
For example, if your input was aez, this would get the index of the word yz, which is the last possible 2-letter combination: C(26, 1) + C(26, 2) = 351.
At this point, you have the initial index of your n-letter word, and need to see how many combinations of n-letter words you need to skip to get to the end of the word. To do this, you have to check each individual letter and count all possible combinations of letters starting with one letter after the previous one (the desirableStart variable in your code), and ending with the letter being examined.
For example, for aez you would proceed as following:
Get the last 2-letter word index (yz).
Increase index by one (this is actually done at the end of your code, but it makes more sense to do it here to keep the correct positions): now we are at index of abc.
First letter is a, no need to increase. You are still at abc.
Second letter is e, count combinations for 2nd letter from b to e. This will get you to aef (note that f is the first valid 3rd character in this example, and desirableStart takes care of that).
Third letter is z, count combinations for 3rd letter, from f to z. This will get you to aez.
That's what the last part of your code does:
// get to str.size() initial index (moved this line up)
index ++;
i = 0;
while( i < str.size()) {
// if previous letter was `e`, we need to start with `f`
desirableStart = (i == 0) ? 'a' : str[i - 1] + 1;
// add all combinations until you get to the current letter
for (int j = desirableStart; j < str[i]; ++j) {
char validLettersRemaining = 'z' - j;
int numberOfLettersToChoose = str.size() - i - 1;
index += NChooseK(validLettersRemaining, numberOfLettersToChoose);
}
i++;
}
return index;
there is no difference between the computation of the number of words of the same size and the counterpart for shorter words.
you may be led astray by the indexing of arrays in c which starts at 0. thus though i < str.size() might suggest otherwise, the last iteration of this loop actually counts words of the same size as that of the word whose index is computed.

Algorithm explanation for common strings

The Problem definition:
Given two strings a and b of equal length, what’s the longest string (S) that can be constructed such that S is a child to both a and b.
String x is said to be a child of string y if x can be formed by deleting 0 or more characters from y
Input format
Two strings a and b with a newline separating them
Constraints
All characters are upper-cased and lie between ascii values 65-90 The maximum length of the strings is 5000
Output format
Length of the string S
Sample Input #0
HARRY
SALLY
Sample Output #0
2
The longest possible subset of characters that is possible by deleting zero or more characters from HARRY and SALLY is AY, whose length is 2.
The solution:
public class Solution {
public static void main(String[] args) throws Exception {
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
char[] a = in.readLine().toCharArray();
char[] b = in.readLine().toCharArray();
int[][] dp = new int[a.length + 1][b.length + 1];
dp[0][0] = 1;
for (int i = 0; i < a.length; i++)
for (int j = 0; j < b.length; j++)
if (a[i] == b[j])
dp[i + 1][j + 1] = dp[i][j] + 1;
else
dp[i + 1][j + 1] = Math.max(dp[i][j + 1], dp[i + 1][j]);
System.out.println(dp[a.length][b.length]);
}
}
Anyone has encountered this problem and solved using the solution like this? I solved it in a different way. Only found this solution is elegant, But can not make sense of it so far. Could anyone help explaining it little bit.
This algorithm uses Dynamic Programming. The key point in understanding dynamic programming is to understand the recursive step which in this case is within the if-else statement. My understanding about the matrix of size (a.length+1) * (b.length +1) is that for a given element in the matrix dp[i +1, j +1] it represents that if the we only compare string a[0:i] and b[0:j] what will be the child of both a[0:i] and b[0:j] that has most characters.
So to understand the recursive step, let's look at the example of "HARRY" and "SALLY", say if I am on the step of calculating dp[5][5], in this case, I will be looking at the last character 'Y':
A. if a[4] and b[4] are equal, in this case "Y" = "Y", then i know the optimal solution is: 1) Find out what is the child of "HARR" and "SALL" that has most characters (let's say n characters) and then 2) add 1 to n.
B. if a[4] and b[4] are not equal, then the optimal solution is either Child of "HARR" and "SALLY" or Child of "HARRY" and "SALL" which will translate to Max(dp[i+1][j] and dp[i][j+1]) in the code.

Resources