Longest Common Substring

Longest Common Substring - algorithm

We have two strings, a and b, where a is at least as long as b. We have to find their longest common substring. If there are multiple answers, we have to output the one that occurs earliest in b (that is, the one whose starting index in b is smallest).
Note: The lengths of a and b can be up to 10^6.
I tried to find the longest common substring using a suffix array (sorting the suffixes with quicksort). To handle multiple answers, I pushed every common substring whose length equals that of the longest common substring onto a stack.
Is there a faster way to do this?

Build a suffix tree of the string a$b, that is, a concatenated with a separator character such as $ that occurs in neither string, then concatenated with b. A (compressed) suffix tree can be built in O(|a|+|b|) time and memory, and has O(|a|+|b|) nodes.
Now, for each node, we know its depth (the length of the string obtained by starting from the root and traversing the tree down to that node). We can also keep track of two boolean quantities: whether this node was visited during the build phase corresponding to a, and whether it was visited during the build phase corresponding to b (for example, we might as well build the two trees separately and then merge them using pre-order traversal). Now, the task boils down to finding the deepest vertex which was visited during both phases, which can be done by a single pre-order traversal. To handle multiple answers, also store at each node the smallest starting position in b among the suffixes of b below it, and break ties between equally deep nodes by that value.
This Wikipedia page contains another (brief) overview of the technique.
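A full suffix tree is lengthy to write out, so the following is only a minimal sketch of the same linear-time idea using a suffix automaton of a instead (a related structure, swapped in here for brevity; it is not the suffix tree described above). The class and method names are made up for illustration. It scans b once, tracks the longest suffix of the scanned part that occurs in a, and keeps the first position where the maximum is reached, which gives the earliest match in b.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LcsSketch {
    // Returns the longest common substring of a and b; on ties, the one starting earliest in b.
    static String longestCommonSubstring(String a, String b) {
        int cap = Math.max(2, 2 * a.length()); // a suffix automaton has fewer than 2*|a| states
        int[] len = new int[cap];              // longest string recognised at each state
        int[] link = new int[cap];             // suffix links
        List<Map<Character, Integer>> next = new ArrayList<>();
        next.add(new HashMap<>());
        link[0] = -1;
        int size = 1, last = 0;

        // Standard online construction, one character of a at a time.
        for (int i = 0; i < a.length(); i++) {
            char c = a.charAt(i);
            int cur = size++;
            next.add(new HashMap<>());
            len[cur] = len[last] + 1;
            int p = last;
            while (p != -1 && !next.get(p).containsKey(c)) {
                next.get(p).put(c, cur);
                p = link[p];
            }
            if (p == -1) {
                link[cur] = 0;
            } else {
                int q = next.get(p).get(c);
                if (len[p] + 1 == len[q]) {
                    link[cur] = q;
                } else {
                    int clone = size++;
                    next.add(new HashMap<>(next.get(q)));
                    len[clone] = len[p] + 1;
                    link[clone] = link[q];
                    while (p != -1 && Integer.valueOf(q).equals(next.get(p).get(c))) {
                        next.get(p).put(c, clone);
                        p = link[p];
                    }
                    link[q] = clone;
                    link[cur] = clone;
                }
            }
            last = cur;
        }

        // Scan b, maintaining the longest suffix of b[0..j] that occurs in a.
        int state = 0, matched = 0, best = 0, bestEnd = 0;
        for (int j = 0; j < b.length(); j++) {
            char c = b.charAt(j);
            while (state != 0 && !next.get(state).containsKey(c)) {
                state = link[state];
                matched = len[state];
            }
            Integer t = next.get(state).get(c);
            if (t != null) {
                state = t;
                matched++;
            } else {
                matched = 0; // state is the root and has no transition on c
            }
            if (matched > best) { // strict '>' keeps the earliest occurrence in b
                best = matched;
                bestEnd = j + 1;
            }
        }
        return b.substring(bestEnd - best, bestEnd);
    }
}
```

With hash-map transitions this runs in expected O(|a|+|b|) time; for a small fixed alphabet, array transitions would be faster but use more memory.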

This problem is about the longest substring; are you looking for the variant with or without repeated characters?
Please go through this link; it might be helpful.
http://www.programcreek.com/2013/02/leetcode-longest-substring-without-repeating-characters-java/

import java.util.Scanner;

public class JavaApplication8 {
    // Returns the length of the longest common substring of s1 and s2.
    public static int find(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        int ans = 0;
        int[] a = new int[m]; // current row: a[j] = longest common suffix ending at s1[i] and s2[j]
        int[] b = new int[m]; // previous row
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                if (s1.charAt(i) == s2.charAt(j)) {
                    a[j] = (i == 0 || j == 0) ? 1 : b[j - 1] + 1;
                    ans = Math.max(ans, a[j]);
                } else {
                    a[j] = 0; // the rows are reused, so stale values must be cleared
                }
            }
            int[] c = a;
            a = b;
            b = c;
        }
        return ans;
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        String s1 = sc.next();
        String s2 = sc.next();
        System.out.println(find(s1, s2));
    }
}
Time complexity: O(n * m), where n and m are the lengths of the two strings
Space complexity: O(m)

package main

import "fmt"

func main() {
	fmt.Println(lcs("CLCL", "LCLC")) // prints: 3 CLC
}

// lcs returns the length of the longest common substring of s1 and s2,
// together with the first such substring found while scanning s1.
// It compares bytes, so it assumes single-byte (ASCII) characters.
func lcs(s1, s2 string) (max int, str string) {
	// dp[i][j] = length of the longest common suffix of s1[:i] and s2[:j].
	dp := make([][]int, len(s1)+1)
	for i := range dp {
		dp[i] = make([]int, len(s2)+1)
	}
	for i := 1; i <= len(s1); i++ {
		for j := 1; j <= len(s2); j++ {
			if s1[i-1] == s2[j-1] {
				dp[i][j] = dp[i-1][j-1] + 1
				if dp[i][j] > max {
					max = dp[i][j]
					str = s1[i-max : i]
				}
			}
		}
	}
	return max, str
}

Related

Word ladder complexity analysis

I'd like to make sure that I am doing the time complexity analysis correctly. There seem to be many different analyses.
In case people don't know the problem, here is the problem description.
Given two words (beginWord and endWord), and a dictionary's word list, find the length of shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
For example,
Given:
beginWord = "hit"
endWord = "cog"
wordList = ["hot","dot","dog","lot","log","cog"]
As one shortest transformation is "hit" -> "hot" -> "dot" -> "dog" -> "cog",
return its length 5.
And this is a simple BFS algorithm.
static int ladderLength(String beginWord, String endWord, List<String> wordList) {
    int level = 1;
    Deque<String> queue = new LinkedList<>();
    queue.add(beginWord);
    queue.add(null);
    Set<String> visited = new HashSet<>();
    // worst case we can add all dictionary thus N (len(dict)) computation
    while (!queue.isEmpty()) {
        String word = queue.removeFirst();
        if (word != null) {
            if (word.equals(endWord)) {
                return level;
            }
            // m * 26 * log N
            for (int i = 0; i < word.length(); i++) {
                char[] chars = word.toCharArray();
                for (char c = 'a'; c <= 'z'; c++) {
                    chars[i] = c;
                    String newStr = new String(chars);
                    if (!visited.contains(newStr) && wordList.contains(newStr)) {
                        queue.add(newStr);
                        visited.add(newStr);
                    }
                }
            }
        } else {
            level++;
            if (!queue.isEmpty()) {
                queue.add(null);
            }
        }
    }
    return 0;
}
wordList (the dictionary) contains N elements and the length of beginWord is m.
In the worst case, the queue would hold every element of the word list, so the outer while loop runs O(N) times.
For each word (of length m), it tries 26 characters (a to z) at every position, so the nested for loops run O(26*m) times, and inside the inner loop it calls wordList.contains, which I assume is O(log N).
So overall it's O(N*m*26*log N) => O(N*m*log N).
Is this correct?

A List does not automatically sort its elements; it "faithfully" keeps them in the order they were added. So wordList.contains is in fact O(N) comparisons, not O(log N). For a HashSet such as visited, however, this operation takes O(1) expected time, so consider switching to that.
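As a rough sketch of that suggestion (not the asker's code; the class name and the unvisited set are illustrative choices), the dictionary can be copied into a HashSet once, and removing a word from that set can double as the visited check:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WordLadderSketch {
    // Returns the number of words in a shortest ladder, or 0 if none exists.
    static int ladderLength(String beginWord, String endWord, List<String> wordList) {
        Set<String> unvisited = new HashSet<>(wordList); // expected O(1) lookups, plus O(m) to hash a word
        unvisited.remove(beginWord);
        Deque<String> queue = new ArrayDeque<>();
        queue.add(beginWord);
        int level = 1;
        while (!queue.isEmpty()) {
            // Process exactly one BFS level per iteration of the outer loop.
            for (int k = queue.size(); k > 0; k--) {
                String word = queue.remove();
                if (word.equals(endWord)) return level;
                char[] chars = word.toCharArray();
                for (int i = 0; i < chars.length; i++) {
                    char original = chars[i];
                    for (char c = 'a'; c <= 'z'; c++) {
                        chars[i] = c;
                        String candidate = new String(chars);
                        // remove() returns true only for dictionary words not seen before
                        if (unvisited.remove(candidate)) queue.add(candidate);
                    }
                    chars[i] = original;
                }
            }
            level++;
        }
        return 0;
    }
}
```

Each of the up to N dequeued words generates 26*m candidates, and building and hashing each candidate costs O(m), so this sketch runs in roughly O(26 * N * m^2) expected time.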

Algorithm to list unique permutations of string with duplicate letters

For example, string "AAABBB" will have permutations:
"ABAABB",
"BBAABA",
"ABABAB",
etc
What's a good algorithm for generating the permutations? (And what's its time complexity?)

For a multiset, you can solve it recursively by position (JavaScript code):
function f(multiset, counters, result) {
  // All letters used up: result is one complete permutation.
  if (counters.every(x => x === 0)) {
    console.log(result);
    return;
  }
  for (let i = 0; i < counters.length; i++) {
    if (counters[i] > 0) {
      // Copy the counters so sibling branches are unaffected.
      const _counters = counters.slice();
      _counters[i]--;
      f(multiset, _counters, result + multiset[i]);
    }
  }
}
f(['A', 'B'], [3, 3], '');

This is not a full answer, just an idea.
If your strings contain a fixed number of only two letters, I'd go with a binary tree and a good recursive function.
Each node is an object whose name is its parent's name plus a suffix A or B; it also stores the number of A and B letters in its name.
The node constructor gets the parent's name and the parent's counts of A and B, so it only needs to add 1 to the count of A or B and append one letter to the name.
It doesn't construct the next node if that would give more than three A's (in the case of an A node) or three B's respectively, or once their sum equals the length of the starting string.
Now you can collect the leaves of the two trees (their names) and you have all the permutations that you need.
Scala or some functional language (with object-like features) would be perfect for implementing this algorithm. Hope this helps or just sparks some ideas.
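The idea above can be sketched without explicit node objects: the recursion below carries the name built so far and the number of A's and B's still available, which plays the role of the per-node counters. This is only an illustration under that reading; the class and method names are invented here.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoLetterPermutations {
    // Extends prefix with 'A' or 'B' as long as letters of that kind remain.
    static void build(String prefix, int aLeft, int bLeft, List<String> out) {
        if (aLeft == 0 && bLeft == 0) { // a leaf: every letter has been placed
            out.add(prefix);
            return;
        }
        if (aLeft > 0) build(prefix + 'A', aLeft - 1, bLeft, out);
        if (bLeft > 0) build(prefix + 'B', aLeft, bLeft - 1, out);
    }

    static List<String> permutations(int aCount, int bCount) {
        List<String> out = new ArrayList<>();
        build("", aCount, bCount, out);
        return out;
    }
}
```

Calling permutations(3, 3) collects the 20 distinct arrangements of "AAABBB", starting with "AAABBB" itself.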

Since you actually want to generate the permutations instead of just counting them, the best complexity you can hope for is O(size_of_output).
Here's a good solution in Java that meets that bound and runs very quickly, while consuming negligible space. It first sorts the letters to find the lexicographically smallest permutation, and then generates all permutations in lexicographic order.
It's known as the Pandita algorithm: https://en.wikipedia.org/wiki/Permutation#Generation_in_lexicographic_order
import java.util.Arrays;
import java.util.function.Consumer;

public class UniquePermutations
{
    static void generateUniquePermutations(String s, Consumer<String> consumer)
    {
        char[] array = s.toCharArray();
        Arrays.sort(array);
        for (;;)
        {
            consumer.accept(String.valueOf(array));
            int changePos = array.length - 2;
            while (changePos >= 0 && array[changePos] >= array[changePos + 1])
                --changePos;
            if (changePos < 0)
                break; // all done
            int swapPos = changePos + 1;
            while (swapPos + 1 < array.length && array[swapPos + 1] > array[changePos])
                ++swapPos;
            char t = array[changePos];
            array[changePos] = array[swapPos];
            array[swapPos] = t;
            for (int i = changePos + 1, j = array.length - 1; i < j; ++i, --j)
            {
                t = array[i];
                array[i] = array[j];
                array[j] = t;
            }
        }
    }

    public static void main(String[] args) throws java.lang.Exception
    {
        StringBuilder line = new StringBuilder();
        generateUniquePermutations("banana", s -> {
            if (line.length() > 0)
            {
                if (line.length() + s.length() >= 75)
                {
                    System.out.println(line.toString());
                    line.setLength(0);
                }
                else
                    line.append(" ");
            }
            line.append(s);
        });
        System.out.println(line);
    }
}
Here is the output:
aaabnn aaanbn aaannb aabann aabnan aabnna aanabn aananb aanban aanbna
aannab aannba abaann abanan abanna abnaan abnana abnnaa anaabn anaanb
anaban anabna ananab ananba anbaan anbana anbnaa annaab annaba annbaa
baaann baanan baanna banaan banana bannaa bnaaan bnaana bnanaa bnnaaa
naaabn naaanb naaban naabna naanab naanba nabaan nabana nabnaa nanaab
nanaba nanbaa nbaaan nbaana nbanaa nbnaaa nnaaab nnaaba nnabaa nnbaaa
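As a sanity check on the size of that output, the number of distinct permutations of a string of length n whose letters occur k_1, ..., k_r times is given by the multinomial coefficient. The symbols n, k_i and r are introduced here for illustration only.

```latex
\[
  \#\text{permutations} \;=\; \frac{n!}{k_1!\,k_2!\cdots k_r!},
  \qquad
  \text{``banana''}:\ \frac{6!}{3!\,2!\,1!} \;=\; \frac{720}{12} \;=\; 60 .
\]
```

This matches the 60 strings listed above (6 rows of 10), and since each string has length n, the total output size, and hence the best achievable running time, is proportional to n times this count.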

Trying to understand the space complexity of this algorithm

I see a lot of articles online explaining Time complexity but haven't found anything good that explains space complexity well. I was trying to solve the following interview question
You have two numbers represented by a linked list, where each node
contains a single digit. The digits are stored in reverse order, such
that the 1's digit is at the head of the list. Write a function that
adds the two numbers and returns the sum as a linked list.
EXAMPLE
Input: (7 -> 1 -> 6) + (5 -> 9 -> 2). That is, 617 + 295.
Output: 2 -> 1 -> 9. That is, 912.
My solution for it is the following:
private Node addLists(Node head1, Node head2) {
    Node summationHead = null;
    Node summationIterator = null;
    int num1 = extractNumber(head1);
    int num2 = extractNumber(head2);
    int sum = num1 + num2;
    StringBuilder strValue = new StringBuilder();
    strValue.append(sum);
    String value = strValue.reverse().toString();
    char[] valueArray = value.toCharArray();
    for (char charValue : valueArray) {
        Node node = createNode(Character.getNumericValue(charValue));
        if (summationHead == null) {
            summationHead = node;
            summationIterator = summationHead;
        } else {
            summationIterator.next = node;
            summationIterator = node;
        }
    }
    return summationHead;
}

private Node createNode(int value) {
    Node node = new Node(value);
    node.element = value;
    node.next = null;
    return node;
}

private int extractNumber(Node head) {
    Node iterator = head;
    StringBuilder strNum = new StringBuilder();
    while (iterator != null) {
        int value = iterator.element;
        strNum.append(value);
        iterator = iterator.next;
    }
    String reversedString = strNum.reverse().toString();
    return Integer.parseInt(reversedString);
}
Can someone please deduce the space complexity for this? Thanks.

Space complexity means: how does the amount of space required to run this algorithm change asymptotically as the inputs get larger?
So you have two lists of lengths N and M. The resulting list will have length max(N, M), possibly +1 if there's a carry. But that +1 is a constant, and we don't consider it part of the Big-O, as the larger of N and M will dominate.
Also note this algorithm is pretty straightforward. The intermediate StringBuilder, String and char[] values are each linear in the number of digits, so no intermediate calculation requires larger-than-linear space.
The space complexity is O(max(N, M)).
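For comparison, here is a sketch of a digit-by-digit variant (not the asker's method). It assumes a Node class with an int element and a next reference, as the question's code suggests; the Node definition below is a stand-in. It still allocates O(max(N, M)) nodes for the result, but needs only O(1) extra working space and does not go through int, so it is not limited by Integer.parseInt on long lists.

```java
final class ListSum {
    // Stand-in for the question's Node type (an assumption, not the original class).
    static final class Node {
        int element;
        Node next;
        Node(int element) { this.element = element; }
    }

    static Node addLists(Node head1, Node head2) {
        Node dummy = new Node(0); // placeholder head to simplify appending
        Node tail = dummy;
        int carry = 0;
        // Walk both lists from the 1's digit upwards, carrying as in manual addition.
        while (head1 != null || head2 != null || carry != 0) {
            int sum = carry;
            if (head1 != null) { sum += head1.element; head1 = head1.next; }
            if (head2 != null) { sum += head2.element; head2 = head2.next; }
            tail.next = new Node(sum % 10);
            tail = tail.next;
            carry = sum / 10;
        }
        return dummy.next;
    }
}
```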

Longest common prefix for n string

Given n strings of maximum length m, how can we find the longest common prefix shared by at least two of them?
Example: ['flower', 'flow', 'hello', 'fleet']
Answer: fl
I was thinking of building a trie for all the strings and then checking the deepest node (satisfies longest) that branches out to two or more substrings (satisfies commonality). This takes O(n*m) time and space. Is there a better way to do this?

Why use a trie (which takes O(mn) time and O(mn) space)? Just use the basic brute-force way to find the prefix common to all the strings. The first loop finds the shortest string, minStr, which takes O(n) time. The second loop compares each string with minStr character by character and keeps a variable, end, marking how far the common prefix of minStr extends; this loop takes O(mn), where m is the length of the shortest string. The code is below:
public String longestCommonPrefix(String[] strs) {
    if (strs.length == 0) return "";
    String minStr = strs[0];
    for (int i = 1; i < strs.length; i++) {
        if (strs[i].length() < minStr.length())
            minStr = strs[i];
    }
    int end = minStr.length();
    for (int i = 0; i < strs.length; i++) {
        int j;
        for (j = 0; j < end; j++) {
            if (minStr.charAt(j) != strs[i].charAt(j))
                break;
        }
        if (j < end)
            end = j;
    }
    return minStr.substring(0, end);
}

There is an O(|S|*n) solution to this problem, using a trie. [n is the number of strings, |S| is the length of the longest string]
(1) put all strings in a trie
(2) do a DFS in the trie, until you find the first vertex with more than 1 "edge".
(3) the path from the root to the node you found at (2) is the longest common prefix.
There is no faster solution than this [in terms of big-O notation]: in the worst case, all your strings are identical, and you need to read all of them to know it.
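For the "shared by at least two strings" reading in the question, one possible variant of the trie idea is to count how many strings pass through each node and remember the deepest node reached by two or more. The following is a sketch under that interpretation, not the exact DFS described in steps (1) to (3); the names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

final class TrieLcp {
    private static final class TrieNode {
        final Map<Character, TrieNode> children = new HashMap<>();
        int count; // number of inserted strings that pass through this node
    }

    // Longest prefix shared by at least two of the given strings ("" if none).
    static String longestSharedPrefix(String[] strs) {
        TrieNode root = new TrieNode();
        String best = "";
        for (String s : strs) {
            TrieNode node = root;
            for (int i = 0; i < s.length(); i++) {
                node = node.children.computeIfAbsent(s.charAt(i), k -> new TrieNode());
                node.count++;
                // A node seen by two or more strings marks a shared prefix of length i + 1.
                if (node.count >= 2 && i + 1 > best.length()) {
                    best = s.substring(0, i + 1);
                }
            }
        }
        return best;
    }
}
```

Insertion touches every character once, so the time and space are O(n*m) in the worst case, in line with the bound stated above.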

I would sort them, which takes O(n lg n) string comparisons. Then any strings with common prefixes will be right next to each other. In fact, you should be able to keep a pointer to the index you're currently looking at and work your way down for a pretty speedy computation.
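A minimal sketch of the sorting idea, assuming plain String.compareTo ordering (the class and method names are made up): after sorting, the pair of strings with the longest common prefix is always adjacent, so one pass over neighbouring pairs is enough.

```java
import java.util.Arrays;

final class SortedLcp {
    // Longest prefix shared by at least two of the given strings ("" if none).
    static String longestSharedPrefix(String[] strs) {
        String[] sorted = strs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        String best = "";
        for (int i = 1; i < sorted.length; i++) {
            String p = sorted[i - 1], q = sorted[i];
            int k = 0;
            while (k < p.length() && k < q.length() && p.charAt(k) == q.charAt(k)) k++;
            if (k > best.length()) best = p.substring(0, k);
        }
        return best;
    }
}
```

Each comparison during the sort can inspect up to m characters, so the overall cost is O(n * m * lg n) in the worst case.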

As a completely different answer from my other answer...
You can, with one pass, bucket every string based on its first letter.
With another pass you can sort each bucket based on its second letter. (This is known as radix sort, which is O(n*m) overall and O(n) for each pass.) This gives you a baseline prefix of length 2.
You can safely remove from your dataset any elements that do not share a prefix of length 2.
You can continue the radix sort, removing elements without a shared prefix of length p, as p approaches m.
This will give you the same O(n*m) time that the trie approach does, but will always be faster than the trie, since the trie must look at every character in every string (as it enters the structure), while this approach is only guaranteed to look at 2 characters per string, at which point it culls much of the dataset.
The worst case is still that every string is identical, which is why it shares the same big-O notation, but it will be faster in all other cases, as it is guaranteed to use fewer comparisons: in any non-worst case there are characters that never need to be visited.

public String longestCommonPrefix(String[] strs) {
    if (strs == null || strs.length == 0)
        return "";
    char[] c_list = strs[0].toCharArray();
    int len = c_list.length;
    int j = 0;
    for (int i = 1; i < strs.length; i++) {
        for (j = 0; j < len && j < strs[i].length(); j++)
            if (c_list[j] != strs[i].charAt(j))
                break;
        len = j;
    }
    return new String(c_list).substring(0, len);
}

It happens that the bucket sort (radix sort) described by corsiKa can be extended such that every string is eventually placed alone in a bucket, and at that point the LCP for such a lonely string is known. Further, the shustring (shortest unique substring) of each string is also known; it is one character longer than the LCP. The bucket sort is de facto the construction of a suffix array, but only partially so. Those comparisons that are not performed (as described by corsiKa) represent those portions of the suffix strings that are not added to the suffix array. Finally, this method allows determination of not just the LCP and shustrings; one may also easily find those subsequences that are not present within the string.

Since the world is obviously begging for an answer in Swift, here's mine ;)
func longestCommonPrefix(strings:[String]) -> String {
    var commonPrefix = ""
    var indices = strings.map { $0.startIndex}
    outerLoop:
    while true {
        var toMatch: Character = "_"
        for (whichString, f) in strings.enumerate() {
            let cursor = indices[whichString]
            if cursor == f.endIndex { break outerLoop }
            indices[whichString] = cursor.successor()
            if whichString == 0 { toMatch = f[cursor] }
            if toMatch != f[cursor] { break outerLoop }
        }
        commonPrefix.append(toMatch)
    }
    return commonPrefix
}
Swift 3 Update:
func longestCommonPrefix(strings:[String]) -> String {
    var commonPrefix = ""
    var indices = strings.map { $0.startIndex}
    outerLoop:
    while true {
        var toMatch: Character = "_"
        for (whichString, f) in strings.enumerated() {
            let cursor = indices[whichString]
            if cursor == f.endIndex { break outerLoop }
            indices[whichString] = f.characters.index(after: cursor)
            if whichString == 0 { toMatch = f[cursor] }
            if toMatch != f[cursor] { break outerLoop }
        }
        commonPrefix.append(toMatch)
    }
    return commonPrefix
}
What's interesting to note:
this runs in O(n x m), where n is the number of strings and m is the length of the shortest one.
this uses the String.Index data type and thus deals with grapheme clusters, which the Character type represents.
And here is the function I needed to write in the first place:
/// Takes an array of Strings representing file system objects absolute
/// paths and turn it into a new array with the minimum number of common
/// ancestors, possibly pushing the root of the tree as many level downwards
/// as necessary
///
/// In other words, we compute the longest common prefix and remove it
func reify(fullPaths:[String]) -> [String] {
    let lcp = longestCommonPrefix(fullPaths)
    return fullPaths.map {
        return $0[lcp.endIndex ..< $0.endIndex]
    }
}
here is a minimal unit test:
func testReifySimple() {
    let samplePaths:[String] = [
        "/root/some/file"
        , "/root/some/other/file"
        , "/root/another/file"
        , "/root/direct.file"
    ]
    let expectedPaths:[String] = [
        "some/file"
        , "some/other/file"
        , "another/file"
        , "direct.file"
    ]
    let reified = PathUtilities().reify(samplePaths)
    for (index, expected) in expectedPaths.enumerate(){
        XCTAssert(expected == reified[index], "failed match, \(expected) != \(reified[index])")
    }
}

Perhaps a more intuitive solution: feed the prefix found in the earlier iteration, together with the next string, into the next iteration: [[[w1, w2], w3], w4] and so on, where [x, y] denotes the LCP of two strings.
public String findPrefixBetweenTwo(String A, String B) {
    for (int i = 0; i < A.length() && i < B.length(); i++) {
        if (A.charAt(i) != B.charAt(i)) {
            return A.substring(0, i);
        }
    }
    // One string is a prefix of the other, or they are equal.
    return (A.length() > B.length()) ? B : A;
}

public String longestCommonPrefix(ArrayList<String> A) {
    if (A.isEmpty()) return ""; // guard: A.get(0) would throw on an empty list
    String prefix = A.get(0);
    for (int i = 1; i < A.size(); i++) {
        prefix = findPrefixBetweenTwo(prefix, A.get(i)); // chain the earlier prefix
    }
    return prefix;
}

How can I compute the number of characters required to turn a string into a palindrome?

I recently found a contest problem that asks you to compute the minimum number of characters that must be inserted (anywhere) in a string to turn it into a palindrome.
For example, given the string: "abcbd" we can turn it into a palindrome by inserting just two characters: one after "a" and another after "d": "adbcbda".
This seems to be a generalization of a similar problem that asks for the same thing, except characters can only be added at the end - this has a pretty simple solution in O(N) using hash tables.
I have been trying to modify the Levenshtein distance algorithm to solve this problem, but haven't been successful. Any help on how to solve this (it doesn't necessarily have to be efficient, I'm just interested in any DP solution) would be appreciated.

Note: This is just a curiosity. Dav proposed an algorithm which can easily be modified into a DP algorithm that runs in O(n^2) time and O(n^2) space (and perhaps O(n) space with better bookkeeping).
Of course, this 'naive' algorithm might actually come in handy if you decide to change the allowed operations.
Here is a 'naive'ish algorithm, which can probably be made faster with clever bookkeeping.
Given a string, we guess the middle of the resulting palindrome and then try to compute the number of inserts required to make the string a palindrome around that middle.
If the string is of length n, there are 2n+1 possible middles (each character, between two characters, just before and just after the string).
Suppose we consider a middle which gives us two strings L and R (one to the left and one to the right).
If we are using inserts, I believe the Longest Common Subsequence algorithm (which is a DP algorithm) can now be used to create a 'super' string which contains both L and the reverse of R; see Shortest common supersequence.
Pick the middle which gives you the smallest number of inserts.
This is O(n^3), I believe, since there are 2n+1 middles and each LCS computation is O(n^2). (Note: I haven't tried proving that it is true.)
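For reference, the O(n^2) DP mentioned in the note is commonly written as an interval DP over substrings. The sketch below shows that textbook formulation; it is not necessarily the exact algorithm Dav proposed, and the names are illustrative. Here dp[i][j] is the minimum number of insertions needed to turn s[i..j] into a palindrome.

```java
final class PalindromeInsertions {
    static int minInsertions(String s) {
        int n = s.length();
        if (n == 0) return 0;
        int[][] dp = new int[n][n]; // single characters need 0 insertions
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                if (s.charAt(i) == s.charAt(j)) {
                    // Matching ends cost nothing; solve the inside.
                    dp[i][j] = (len == 2) ? 0 : dp[i + 1][j - 1];
                } else {
                    // Mirror one of the two ends with an inserted character.
                    dp[i][j] = 1 + Math.min(dp[i + 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[0][n - 1];
    }
}
```

For the question's example, minInsertions("abcbd") returns 2, in agreement with "adbcbda".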

My C# solution looks for repeated characters in a string and uses them to reduce the number of insertions. In a word like program, I use the 'r' characters as a boundary. Inside of the 'r's, I make that a palindrome (recursively). Outside of the 'r's, I mirror the characters on the left and the right.
Some inputs have more than one shortest output: output can be toutptuot or outuputuo. My solution only selects one of the possibilities.
Some example runs:
radar -> radar, 0 insertions
esystem -> metsystem, 2 insertions
message -> megassagem, 3 insertions
stackexchange -> stegnahckexekchangets, 8 insertions
First I need to check if an input is already a palindrome:
public static bool IsPalindrome(string str)
{
    for (int left = 0, right = str.Length - 1; left < right; left++, right--)
    {
        if (str[left] != str[right])
            return false;
    }
    return true;
}
Then I need to find any repeated characters in the input. There may be more than one. The word message has two most-repeated characters ('e' and 's'):
private static bool TryFindMostRepeatedChar(string str, out List<char> chs)
{
    chs = new List<char>();
    int maxCount = 1;
    var dict = new Dictionary<char, int>();
    foreach (var item in str)
    {
        int temp;
        if (dict.TryGetValue(item, out temp))
        {
            dict[item] = temp + 1;
            // Keep the running maximum, not just the most recent count.
            maxCount = Math.Max(maxCount, temp + 1);
        }
        else
            dict.Add(item, 1);
    }
    foreach (var item in dict)
    {
        if (item.Value == maxCount)
            chs.Add(item.Key);
    }
    return maxCount > 1;
}
My algorithm is here:
public static string MakePalindrome(string str)
{
    List<char> repeatedList;
    if (string.IsNullOrWhiteSpace(str) || IsPalindrome(str))
    {
        return str;
    }
    //If an input has repeated characters,
    // use them to reduce the number of insertions
    else if (TryFindMostRepeatedChar(str, out repeatedList))
    {
        string shortestResult = null;
        foreach (var ch in repeatedList) //"program" -> { 'r' }
        {
            //find boundaries
            int iLeft = str.IndexOf(ch); // "program" -> 1
            int iRight = str.LastIndexOf(ch); // "program" -> 4
            //make a palindrome of the inside chars
            string inside = str.Substring(iLeft + 1, iRight - iLeft - 1); // "program" -> "og"
            string insidePal = MakePalindrome(inside); // "og" -> "ogo"
            string right = str.Substring(iRight + 1); // "program" -> "am"
            string rightRev = Reverse(right); // "program" -> "ma"
            string left = str.Substring(0, iLeft); // "program" -> "p"
            string leftRev = Reverse(left); // "p" -> "p"
            //Shave off extra chars in rightRev and leftRev
            // When input = "message", this loop converts "meegassageem" to "megassagem",
            // ("ee" to "e"), as long as the extra 'e' is an inserted char
            while (left.Length > 0 && rightRev.Length > 0 &&
                   left[left.Length - 1] == rightRev[0])
            {
                rightRev = rightRev.Substring(1);
                leftRev = leftRev.Substring(1);
            }
            //piece together the result
            string result = left + rightRev + ch + insidePal + ch + right + leftRev;
            //find the shortest result for inputs that have multiple repeated characters
            if (shortestResult == null || result.Length < shortestResult.Length)
                shortestResult = result;
        }
        return shortestResult;
    }
    else
    {
        //For inputs that have no repeated characters,
        // just mirror the characters using the last character as the pivot.
        for (int i = str.Length - 2; i >= 0; i--)
        {
            str += str[i];
        }
        return str;
    }
}
Note that you need a Reverse function:
public static string Reverse(string str)
{
    string result = "";
    for (int i = str.Length - 1; i >= 0; i--)
    {
        result += str[i];
    }
    return result;
}

C# recursive solution, adding to the end of the string:
There are two base cases: when the length is 1, and when the length is 2 with both characters equal. Recursive case: if the extremes are equal, make a palindrome of the inner string without the extremes and return that wrapped in the extremes.
If the extremes are not equal, add the first character to the end and make a palindrome of the inner string, including the previous last character. Return that.
public static string ConvertToPalindrome(string str) // By only adding characters at the end
{
    if (str.Length <= 1) return str; // base case 1 (also covers the empty string)
    if (str.Length == 2 && str[0] == str[1]) return str; // base case 2
    if (str[0] == str[str.Length - 1]) // keep the extremes and recurse
        return str[0] + ConvertToPalindrome(str.Substring(1, str.Length - 2)) + str[str.Length - 1];
    // Add the first character at the end and recurse
    return str[0] + ConvertToPalindrome(str.Substring(1, str.Length - 1)) + str[0];
}
