efficient way to find matches against two strings - algorithm

I need to find all common substrings of two strings. I tried using a suffix tree to look up substrings; it is fast, but it consumes too much memory for my task.
Any other ideas?

Aho-Corasick is a good algorithm for matching any number of strings at once with little performance overhead. Did you try that?

You could use a sliding window, which uses less memory but takes more time.
The smallest substring is one character (actually, the empty word is one, but let's leave that aside).
Take character 1 of string 1 and save the positions of that character in string 2 in some sort of data structure, like a map or an array.
Then you take the next one (character 2 of string 1) and do the same thing.
Once you've reached the end of string 1, you start over, but this time you take two characters of string 1 at a time, always advancing by one character and checking for all positions in string 2.
You repeat this with longer and longer substrings until the substring you're checking is as long as string 1, meaning you compare string 1 and string 2 as a whole.
Keep in mind: when string 2 is longer than string 1, you need to slide the whole of string 1 along string 2 one character at a time, since string 1 might be a substring of string 2.
If string 1 is longer than string 2, you can stop checking once your substring is longer than string 2; all other substrings will have been checked by then. Ideally, you end up with a map (which in its simplest form is a two-dimensional array) that holds the positions of each substring of string 1 in string 2.
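The steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `common_substrings` and the choice of a dict from substring to a list of start positions are mine, not part of the answer.

```python
def common_substrings(s1, s2):
    """Map every substring of s1 that occurs in s2 to its start positions in s2."""
    result = {}
    max_len = min(len(s1), len(s2))  # longer windows cannot fit into both strings
    for length in range(1, max_len + 1):
        found_any = False
        for i in range(len(s1) - length + 1):
            sub = s1[i:i + length]  # current window over string 1
            if sub in result:
                found_any = True
                continue  # this substring was already looked up
            positions = []
            start = s2.find(sub)
            while start != -1:  # collect every (possibly overlapping) match
                positions.append(start)
                start = s2.find(sub, start + 1)
            if positions:
                result[sub] = positions
                found_any = True
        if not found_any:
            break  # no match of this length means no longer match exists either
    return result
```

Note that the resulting map can itself grow large when the strings share many substrings, so if memory is the main constraint you may prefer to process each window as it is found instead of storing all of them.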

Why do you say that a suffix tree is too memory consuming? If implemented properly, it consumes only O(n) memory.

Related

Trie data structure

Given N strings, each containing only lowercase letters from a−j (both inclusive). The set of N strings is said to be a GOOD SET if no string is a prefix of another string; otherwise it is a BAD SET.
For example, aab, abcde, aabcd is a BAD SET because aab is a prefix of aabcd.
Print GOOD SET if it satisfies the problem requirement.
Else, print BAD SET and the first string for which the condition fails.
Input Format:
First line contains N, the number of strings in the set.
Then next N lines follow, where ith line contains ith string.
Constraints:
1 ≤ N ≤ 10^5
1 ≤ Length of the string ≤60
Output Format:
Output GOOD SET if the set is valid.
Else, output BAD SET followed by the first string for which the condition fails.
Can anyone suggest an approach?
Construct a trie, inserting the strings one by one and recording a pointer to the node where each inserted string ends. Once done, scan the node pointers of all strings. If any of them ends at an internal node, it is a BAD SET. Otherwise (all strings end at distinct leaves), it is a GOOD SET. Time and space complexity are both linear in the total length of all strings.
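Here is a minimal sketch of the trie idea in Python. It deviates slightly from the description above: instead of scanning node pointers afterwards, it checks during insertion, because the problem asks for the first string at which the condition fails. The function name `first_bad_string`, the nested-dict trie and the `"$"` end marker are my own assumptions (the marker is safe here only because the input alphabet is a to j).

```python
END = "$"  # end-of-string marker; assumed not to occur in the input strings


def first_bad_string(words):
    """Return the first word that breaks the prefix-free property, or None."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            if END in node:
                # An earlier string ends here, so it is a prefix of this word.
                return word
            node = node.setdefault(ch, {})
        if node:
            # The final node already has children or an end marker:
            # this word is a prefix of (or equal to) an earlier string.
            return word
        node[END] = True
    return None
```

For the input format in the question you would print GOOD SET when the function returns None, and BAD SET followed by the returned string otherwise.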

Convert string to perfect number

Given a string, we need to find the largest square which can be obtained by replacing its characters with digits (leading zeros are not allowed), where the same characters always map to the same digits and different characters always map to different digits. If there is no solution, return -1.
Consider the string "ab". If we replace character a with 8 and b with 1, we get 81, which is a square.
How can I find it for a given string? It is given that the string length can be at most 11.
Please help me find a suitable and efficient way.
Sorry can't comment, not enough reputation for it so I'll answer here.
@mat7 about what you said in your question comments: no, you don't have to do it for every letter from a to z. You only have to do it for the letters present in your string (so at most 12 letters, not 26).
The first thing I would check is how many different letters you have; if there are 11 or 12 different letters you can directly return -1, since different letters can't map to the same digit.
Now, supposing the input string is "fdsadrtas", you take a new array with only the distinct letters => "fdsart"
With this array you try all possibilities (excluding the obviously mismatching options: if you set 'f' to 4 and 'd' to 5, 's' can only be one of 1, 2, 3, 6, 7, 8, 9, 0, and 'f' can never be 0). This way you exclude lots of possibilities, with a worst case of 10! instead of 10^12 (actually 9*9! with the check that the first digit is never 0, but that's close enough).
EDIT 2 : +1 samgak nice idea !
The last digit can only be 0, 1, 4, 5, 6 or 9, so the worst-case number of tests drops even further, to 9*6*8!
10! is easily small enough to brute force: keep the highest square value you find and you are done.
EDIT :
Actually it would work (in a reasonable, finite amount of time), but it is the wrong approach now that I have thought about it.
You will spend less time by looking at all the square numbers that could be a solution for your string. Using the example I gave above, it's a string of length 9, so you check each square of length 9 to see whether it can be successfully mapped onto the string.
For a string of length 12 (the worst case) you will have to check the squares of 316'228 to 999'999, which is far fewer than the >2 million checks of the previous proposal. The other proposal might become faster if you start accepting longer strings, but with only 12 you are faster this way.
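A possible sketch of this second approach in Python, walking the square roots downwards so that the first match is the largest square. The names `largest_square_for` and `fits` are hypothetical, and I am assuming that a one-letter string may not map to 0.

```python
from math import isqrt


def fits(pattern, digits):
    """True if pattern and digits are related by a one-to-one letter/digit mapping."""
    letter_to_digit, digit_to_letter = {}, {}
    for ch, d in zip(pattern, digits):
        # setdefault records the first pairing and returns the stored one afterwards
        if letter_to_digit.setdefault(ch, d) != d:
            return False  # same letter would need two different digits
        if digit_to_letter.setdefault(d, ch) != ch:
            return False  # two different letters would share a digit
    return True


def largest_square_for(pattern):
    n = len(pattern)
    if n == 0 or len(set(pattern)) > 10:
        return -1  # more distinct letters than digits
    smallest = 10 ** (n - 1)  # smallest n-digit number, so no leading zero
    root = isqrt(10 ** n - 1)  # largest root whose square has at most n digits
    while root * root >= smallest:
        square = root * root
        if fits(pattern, str(square)):
            return square
        root -= 1
    return -1
```

For the example in the question, `largest_square_for("ab")` gives 81.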

Make palindrome from given word

I am given a word like abca. I want to know how many letters I need to add to make it a palindrome.
In this case it's 1, because if I add b, I get abcba.
First, let's consider an inefficient recursive solution:
Suppose the string is of the form aSb, where a and b are letters and S is a substring.
If a==b, then f(aSb) = f(S).
If a!=b, then you need to add a letter: either add an a at the end, or add a b in the front. We need to try both and see which is better. So in this case, f(aSb) = 1 + min(f(aS), f(Sb)).
This can be implemented with a recursive function which will take exponential time to run.
To improve performance, note that this function will only be called with substrings of the original string. There are only O(n^2) such substrings. So by memoizing the results of this function, we reduce the time taken to O(n^2), at the cost of O(n^2) space.
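A memoized version of this recurrence might look like the following Python sketch. The function name `min_insertions` and the use of index pairs (i, j) to stand for the substring s[i..j] are my own choices.

```python
from functools import lru_cache


def min_insertions(s):
    """Minimum number of letters to insert so that s becomes a palindrome."""

    @lru_cache(maxsize=None)
    def f(i, j):
        # f(i, j) answers the question for the substring s[i..j], inclusive.
        if i >= j:
            return 0  # empty or single-letter strings are already palindromes
        if s[i] == s[j]:
            return f(i + 1, j - 1)  # outer letters match: f(aSa) = f(S)
        # Outer letters differ: f(aSb) = 1 + min(f(Sb), f(aS))
        return 1 + min(f(i + 1, j), f(i, j - 1))

    return f(0, len(s) - 1)
```

Because this is recursive, very long inputs may hit Python's recursion limit; an iterative table over substring lengths computes the same values in that case.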
The basic algorithm would look like this:
Iterate over half the string and check whether the same character exists at the corresponding position at the other end (i.e., if you have abca then the first character is an a and the string also ends with a).
If they match, then proceed to the next character.
If they don't match, then note that a character needs to be added.
Note that you can only move backwards from the end when the characters match. For example, if the string is abcdeffeda then the outer characters match. We then need to consider bcdeffed. The outer characters don't match, so a b needs to be added. But we don't want to continue with cdeffe (i.e., removing/ignoring both outer characters); we simply remove b and continue by looking at cdeffed. Similarly for c, so our algorithm returns 2 string modifications and not more. In general, which end you drop on a mismatch can change the count (for bba, dropping the leading b twice gives 2, while dropping the trailing a gives 1), so trying both ends and keeping the smaller result is the safe choice.

minimal cyclic sub string in a bigger cyclic string

I am trying to find an algorithm that could return the length of the shortest cyclic substring in a larger cyclic string.
A cyclic string is defined as a concatenation of two or more identical strings, e.g. "abababab" or "aaaa".
For example, in a given string T = "abbcabbcabbcabbc" there is a cycle of the pattern "abbc", but the shortest cyclic substring would be "bb".
If you're just looking for a substring that appears more than once:
Build a Suffix tree from the string.
While creating the suffix tree, you can count the occurrences of every substring and store the number of occurrences on each node.
Then just do a BFS on the tree (which gives you a layered search, from shorter to longer strings) and find the first substring longer than 1 that occurs more than once.
Total complexity: O(n) where n is the length of the string
Edit:
"The paths from the root to the leaves have a one-to-one relationship with the suffixes of S"
You can implement the tree so that each node contains one letter; that gives you finer granularity and lets you see all the substrings by length.
Here's a suffix tree of banana where every node contains one letter; you can see that all the substrings are there.
If you look at the applications section on suffix trees, you'll see that they are used for exactly this kind of task: finding things out about substrings.
Look at the image from the root; you can see that ALL the substrings start from the root (BFS list):
b
a
n
ba
an
na
ban
ana
nan
bana
anan
nana
banan
anana
banana
Let me call "abbc" the generator in your example, i.e. the string that you repeat in order to get the bigger string.
The very first observation is that the smallest cyclic string should be made by repeating some substring exactly twice.
It's clear that the smallest string is no longer than the generator repeated twice (2*generator), because 2*generator is cyclic.
Now note that you only need to consider the string obtained by taking the generator 3 times when searching for a smaller cyclic string. Indeed, if the smallest is not there, but it is in 4*generator, then it must span at least two generators, but then it wouldn't be the smallest.
So now let's assume the bigger string is 3*generator (or 2*generator).
Also it's clear that if the generator has only distinct characters, then the answer is 2*generator. If not, then you just need to find all pairs of identical characters in the bigger string, say at positions i and j, and check whether the string starting at i, which is 2*(j-i) long, is cyclic. If you try them in order of increasing j-i, then you can stop after the first success.
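The last step, trying candidate lengths in increasing order and stopping at the first success, could be sketched in Python like this. It is a plain brute-force illustration that scans the whole input rather than only 3*generator; the function name `shortest_cyclic_substring` is an assumption of mine.

```python
def shortest_cyclic_substring(t):
    """Return the shortest substring of t of the form w + w, or None if there is none."""
    n = len(t)
    # 'half' plays the role of j - i in the description: the length of the repeated part.
    for half in range(1, n // 2 + 1):
        for i in range(n - 2 * half + 1):
            if t[i:i + half] == t[i + half:i + 2 * half]:
                return t[i:i + 2 * half]  # first hit is the shortest, since half only grows
    return None
```

For the string from the question, `shortest_cyclic_substring("abbcabbcabbcabbc")` returns "bb", so the requested length is 2.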

Find the prefix substring which gives best compression

Problem:
Given a list of strings, find the substring which, if subtracted from the beginning of all strings where it matches and replaced by an escape byte, gives the shortest total length.
Example:
"foo", "fool", "bar"
The result is: "foo" as the base string with the strings "\0", "\0l", "bar" and a total length of 9 bytes. "\0" is the escape byte. The sum of the length of the original strings is 10, so in this case we only saved one byte.
A naive algorithm would look like:
for string in list
    for i = 1, i < length of string
        calculate total length based on prefix of string[0..i]
        if better than last best, save it
return the best prefix
That will give us the answer, but it's something like O((n*m)^2), which is too expensive.
Use a forest of prefix trees (trie)...
  f_2     b_1
   |       |
  o_2     a_1
   |       |
  o_2     r_1
   |
  l_1
Then we can find the best result, and guarantee it, by maximizing (depth * frequency), which is what will be replaced with your escape character. You can optimize the search by doing a branch-and-bound depth-first search for the maximum.
On the complexity: O(C), as mentioned in a comment, for building it; for finding the optimum, it depends. If you order the first elements by frequency (O(A), where A is the size of the language's alphabet), then you'll be able to cut out more branches, and have a good chance of getting sub-linear time.
I think this is clear, so I am not going to write it up. What is this, a homework assignment? ;)
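For readers who want something concrete, here is a rough Python sketch of the trie idea. It visits every node instead of doing branch and bound, and it scores a prefix by the exact number of bytes saved rather than by depth * frequency. I am assuming the accounting used in the question's example, where the base string itself is counted once and each escaped string keeps one escape byte. The name `best_prefix` and the node layout are hypothetical.

```python
def best_prefix(strings):
    """Return (prefix, bytes_saved) for the prefix that shortens the total the most."""
    root = {"count": 0, "children": {}}
    for s in strings:
        node = root
        for ch in s:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1  # number of strings that share this prefix

    best, best_saving = "", 0
    stack = [(root, "")]
    while stack:  # iterative depth-first walk over all trie nodes
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            candidate = prefix + ch
            depth = len(candidate)
            # Each matching string shrinks by depth - 1 bytes (prefix becomes
            # one escape byte); storing the base string costs depth bytes.
            saving = child["count"] * (depth - 1) - depth
            if saving > best_saving:
                best, best_saving = candidate, saving
            stack.append((child, candidate))
    return best, best_saving
```

With the example from the question, `best_prefix(["foo", "fool", "bar"])` returns `("foo", 1)`, matching the one byte saved there.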
I would start by sorting the list. Then you simply go from string to string, comparing the first character to the next string's first character. Once you have a match, you look at the next character. You would need to devise a way to track the best result so far.
Well, the first step would be to sort the list. Then make one pass through the list, comparing each element with the previous one and keeping track of the longest runs of shared 2-character, 3-character, 4-character etc. prefixes. Then figure out whether, say, the 20 3-character prefixes are better than the 15 4-character prefixes.

Resources