string algorithm homework/interview like question - algorithm

Lets say we have to strings, A and B. The task is to insert any needed letters in the string B in order to end up with the string A.
For example:
A - This is just a simple test
B - is t a sim te
So if we look at the string A like this:
--is -- ---t a sim--- te--
or:
---- is ---t a sim--- te--
it is clear that we can build string A from the string B, and the output should be in the above written format (both answers are correct).
Can you think of an algorithm that will solve this in the reasonable time? It is quite easy to come up with brute force solution, but I need something a bit more sophisticated than that.

You could take Levenshtein distance algorithm as a base and extend it to also remember the characters that were added/deleted/substituted. That would run in linear time.

You can just find first occurrence of characters of B in A, just start finding occurrence after last index found in A, for example in your case:
A - This is just a simple test
B - is t a sim te
i: 3rd place in A,
s: 4th place in A,
' ': 5th place in A,
t: 12th place in A, (because last index was 4)
' ': ...
a: ....
That's O(|A|+|B|) and because |A| > |B| it's O(|A|).
After finding indices it's easy to convert B to A, just by adding characters of A between 2 indices to B.
Edit: Also if there is no match this algorithm, works fine (there will be some characters in B which are not in A, from last index).

I would think you could determine if its possible with a nice reg expression. as there may or may not be 1 or more characters between any of the ones suggested in B.

I actually think brute force will be the quickest but you have to mutate string A to do it fastests in one pass... just iterate over A change each letter to "-" that doesn't match the letter you are looking for unless it is a space type of deal... that will be constant time and require no extra storage... basically your time will be N. Of course you have to deal with the tokens from "B" so once you find "i" you need to make sure the next letter is "s" so that you find "is" but is still should be quick just don't go back to the I position if the next letter isn't s... that should point you in the correct direction without giving you the solution to your homework...

I think you could do it in linear time and space, like this (pseudocode, completely untested):
String f( String a, String b ) {
String result;
// Iterate over 'a' and 'b' in parallel. If both have the same
// character, add it to the result. Otherwise, if 'a' has a space add a space to the result, otherwise add a dash.
int idxB = 0;
for (int idxA = 0; idxA < a.length(); ++idxA ) {
if (a[idxA] == b[idxB]) {
result.append(a[idxA]);
++idxB;
} else if (a[idxA] == ' ') {
result.append(' ');
} else {
result.append('-');
}
}
// Tack on dashes to the result until it's as long as 'a'.
while (result.length() < a.length()) {
result.append('-');
}
return result;
}

Related

number of letters to be deleted from a string so that it is divisible by another string

I am doing this problem https://www.spoj.com/problems/DIVSTR/
We are given two strings S and T.
S is divisible by string T if there is some non-negative integer k, which satisfies the equation S=k*T
What is the minimum number of characters which should be removed from S, so that S is divisible by T?
The main idea was to match T with S using a pointer and count the number of instances of T occurring in S when the count is done, bring the pointer to the start of T and if there's a mismatch, compare T's first letter with S's present letter.
This code is working totally fine with test cases they provided and custom test cases I gave, but it could not get through hidden test cases.
this is the code
def no_of_letters(string1,string2):
# print(len(string1),len(string2))
count = 0
pointer = 0
if len(string1)<len(string2):
return len(string1)
if (len(string1)==len(string2)) and (string1!=string2):
return len(string1)
for j in range(len(string1)):
if (string1[j]==string2[pointer]) and pointer<(len(string2)-1):
pointer+=1
elif (string1[j]==string2[pointer]) and pointer == (len(string2)-1):
count+=1
pointer=0
elif (string1[j]!=string2[pointer]):
if string1[j]==string2[0]:
pointer=1
else:
pointer = 0
return len(string1)-len(string2)*count
One place where I think there should be confusion is when same letters can be parts of two counts, but it should not be a problem, because our answer doesn't need to take overlapping into account.
for example, S = 'akaka' T= 'aka' will give the output 2, irrespective of considering first 'aka',ka as count or second ak,'aka'.
I believe that the solution is much more straightforward that you make it. You're simply trying to find how many times the characters of T appear, in order, in S. Everything else is the characters you remove. For instance, given RobertBaron's example of S="akbaabka" and T="aka", you would write your routine to locate the characters a, k, a, in that order, from the start of S:
akbaabka
ak a^
# with some pointer, ptr, now at position 4, marked with a caret above
With that done, you can now recur on the remainder of the string:
find_chars(S[ptr:], T)
With each call, you look for T in S; if you find it, count 1 repetition and recur on the remainder of S; if not, return 0 (base case). As you crawl back up your recursion stack, accumulate all the 1 counts, and there is your value of k.
The quantity of chars to remove is len(s) - k*len(T).
Can you take it from there?

Reconstructing a string of words using a dictionary into an English sentence

I am completely stumped. The question is: given you have a string like "thisisasentence" and a function isWord() that returns true if it is an English word, I would get stuck on "this is a sent"
How can I recursively return and keep track of where I am each time?
You need backtracking, which is easily achievable using recursion. Key observation is that you do not need to keep track of where you are past the moment when you are ready to return a solution.
You have a valid "split" when one of the following is true:
The string w is empty (base case), or
You can split non-empty w into substrings p and s, such that p+s=w, p is a word, and s can be split into a sentence (recursive call).
An implementation can return a list of words when successful split is found, or null when it cannot be found. Base case will always return an empty list; recursive case will, upon finding a p, s split that results in non-null return for s, construct a list with p prefixed to the list returned from the recursive call.
The recursive case will have a loop in it, trying all possible prefixes of w. To speed things up a bit, the loop could terminate upon reaching the prefix that is equal in length to the longest word in the dictionary. For example, if the longest word has 12 characters, you know that trying prefixes 13 characters or longer will not result in a match, so you could cut enumeration short.
Just adding to the answer above.
According to my experience, many people understand recursion better when they see a «linearized» version of a recursive algorithm, which means «implemented as a loop over a stack». Linearization is applicable to any recursive task.
Assuming that isWord() has two parameters (1st: string to test; 2nd: its length) and returns a boolean-compatible value, a C implementation of backtracking is as follows:
void doSmth(char *phrase, int *words, int total) {
int i;
for (i = 0; i < total; ++i)
printf("%.*s ", words[i + 1] - words[i], phrase + words[i]);
printf("\n");
}
void parse(char *phrase) {
int current, length, *words;
if (phrase) {
words = (int*)calloc((length = strlen(phrase)) + 2, sizeof(int));
current = 1;
while (current) {
for (++words[current]; words[current] <= length; ++words[current])
if (isWord(phrase + words[current - 1],
words[current] - words[current - 1])) {
words[current + 1] = words[current];
current++;
}
if (words[--current] == length)
doSmth(phrase, words, current); /** parse successful! **/
}
free(words);
}
}
As can be seen, for each word, a pair of stack values are used, the first of which being an offset to the current word`s first character, whereas the second is a potential offset of a character exactly after the current word`s last one (thus being the next word`s first character). The second value of the current word (the one whose pair is at the top of our «stack») is iterated through all characters left in the phrase.
When a word is accepted, a new second value (equalling the current, to only look at positions after it) is pushed to the stack, making the former second the first in a new pair. If the current word (the one just found) completes the phrase, something useful is performed; see doSmth().
If there are no more acceptable words in the remaining part of our phrase, the current word is considered unsuitable, and its second value is discarded from the stack, effectively repeating a search for words at a previous starting location, while the ending location is now farther than the word previously accepted there.

Given a dictionary and a list of letters find all valid words that can be built with the letters

The brute force way can solve the problem in O(n!), basically calculating all the permutations and checking the results in a dictionary. I am looking for ways to improve the complexity. I can think of building a tree out of the dictionary but still checking all letters permutations is O(n!). Are there better ways to solve this problem?
Letters can have duplicates.
The api for the function looks like this:
List<String> findValidWords(Dict dict, char letters[])
Assume that letters only contains letters from a to z.
Use an integer array to count the number of occurrence of a character in letters.
For each word in the dictionary, check if there is a specific character in the word that appears more than allowed, if not, add this word into result.
List<String> findValidWords(List<String> dict, char letters[]){
int []avail = new int[26];
for(char c : letters){
int index = c - 'a';
avail[index]++;
}
List<String> result = new ArrayList();
for(String word: dict){
int []count = new int[26];
boolean ok = true;
for(char c : word.toCharArray()){
int index = c - 'a';
count[index]++;
if(count[index] > avail[index]){
ok = false;
break;
}
}
if(ok){
result.add(word);
}
}
return result;
}
So we can see that the time complexity is O(m*k) with m is number of word in the dictionary and k is the maximum total of characters in a word
You can sort each word in your dictionary so that the letters appear in the same order as they do in the alphabet, and then build a trie out of your sorted words. (where each node contains a list of all words that can be made out of the letters). (linear time in total letter length of dictionary) Then, given a set of query letters, sort the letters the same way and proceed through the trie using depth first search in all possible directions that use a subset of your letters from left to right. Any time you reach a node in the trie that contains words, output those words. Each path you explore can be charged to at least one word in the dictionary, so the worst case complexity to find all nodes that contain words you can make is O(kn) where n is the number of words in the dictionary and k is the maximum number of letters in a word. However for somewhat restricted sets of query letters, the running time should be much faster per query.
Here is the algorithm that will find all words that can be formed from a set of letters in O(1). We will represent words with their spectra and store them in a prefix tree (aka trie).
General Description
The spectrum of a word W is an array S of size N, such that S(i) is the number of occurrences (aka frequency) of an A(i) letter in the word W, where A(i) is the i-th letter of a chosen alphabet and N is its size.
For example, in the English alphabet, A(0) is A, A(1) is B, ... , A(25) is Z. A spectrum of the word aha is <2,0,0,0,0,0,0,1,0,...,0>.
We will store the dictionary in a prefix trie, using spectrum as a key. The first token of a key is the frequency of letter A, the second is the frequency of letter B and so on. (From here and below we will use the English alphabet as an example).
Once formed, our dictionary will be a tree with the height 26 and width that varies with each level, depending on a popularity of the letter. Basically, each layer will have a number of subtrees that is equal to the maximum word frequency of this letter in the provided dictionary.
Since our task is not only to decide whether we can build a word from the provided set of characters but also to find these words (a search problem), then we need to attach the words to their spectra (as spectral transformation is not invertible, consider spectra of words read and dear). We will attach a word to the end of each path that represents its spectrum.
To find whether we can build a word from a provided set we will build a spectrum of the set, and find all paths in the prefix trie with the frequencies bounded by the corresponding frequencies of the set's spectrum. (Note, we are not forcing to use all letters from the set, so if a word uses fewer letters, then we can build it. Basically, our requirement is that for all letters in the word the frequency of a letter should be less than or equal than a frequency of the same letter in the provided set).
The complexity of the search procedure doesn't depend on the length of the dictionary or the length of the provided set. On average, it is equal to 26 times the average frequency of a letter. Given the English alphabet, it is a quite small constant factor. For other alphabets, it might not be the case.
Reference implementation
I will provide a reference implementation of an algorithm in OCaml.
The dictionary data type is recursive:
type t = {
dict : t Int.Map.t;
data : string list;
}
(Note: it is not the best representation, probably it is better to represent it is a sum type, e.g., type t = Dict of t Int.Map.t | Data of string list, but I found it easier to implement it with the above representation).
We can generalize the algorithm by a spectrum function, either using a functor, or by just storing the spectrum function in the dictionary, but for the simplicity, we will just hardcode the English alphabet in the ASCII representation,
let spectrum word =
let index c = Char.(to_int (uppercase c) - to_int 'A') in
let letters = Char.(to_int 'Z' - to_int 'A' + 1) in
Array.init letters ~f:(fun i ->
String.count word ~f:(fun c -> index c = i))
Next, we will define the add_word function of type dict -> string -> dict, that will add a new path to our dictionary, by decomposing a word to its spectrum, and adding each constituent. Each addition will require exactly 26 iterations, not including the spectrum computation. Note, the implementation is purely functional, and doesn't use any imperative features. Every time the function add_word returns a new data structure.
let add_word dict word =
let count = spectrum word in
let rec add {dict; data} i =
if i < Array.length count then {
data;
dict = Map.update dict count.(i) ~f:(function
| None -> add empty (i+1)
| Some sub -> add sub (i+1))
} else {empty with data = word :: data} in
add dict 0
We are using the following definition of the empty value in the add function:
let empty = {dict = Int.Map.empty; data=[]}
Now let's define the is_buildable function of type dict -> string -> bool that will decide whether the given set of characters can be used to build any word in the dictionary. Although we can express it via the search, by checking the size of the found set, we would still prefer to have a specialized implementation, as it is more efficient and easier to understand. The definition of the function follows closely the general description provided above. Basically, for every character in the alphabet, we check whether there is an entry in the dictionary with the frequency that is less or equal than the frequency in the building set. If we checked all letters, then we proved, that we can build at least one word with the given set.
let is_buildable dict set =
let count = spectrum set in
let rec find {dict} i =
i >= Array.length count ||
Sequence.range 0 count.(i) ~stop:`inclusive |>
Sequence.exists ~f:(fun cnt -> match Map.find dict cnt with
| None -> false
| Some dict -> find dict (i+1)) in
find dict 0
Now, let's actually find the set of all words, that are buildable from the provided set:
let build dict set =
let count = spectrum set in
let rec find {dict; data} i =
if i < Array.length count then
Sequence.range 0 count.(i) ~stop:`inclusive |>
Sequence.concat_map ~f:(fun cnt -> match Map.find dict cnt with
| None -> Sequence.empty
| Some dict -> find dict (i+1))
else Sequence.of_list data in
find dict 0
We will basically follow the structure of the is_buildable function, except that instead of proving that such a frequency exists for each letter, we will collect all the proofs by reaching the end of the path and grabbing the set of word attached to it.
Testing and example
For the sake of completeness, we will test it by creating a small program, that will read a dictionary, with each word on a separate line, and interact with a user, by asking for a set and printing the resultion set of words, that can be built from it.
module Test = struct
let run () =
let dict =
In_channel.(with_file Sys.argv.(1)
~f:(fold_lines ~init:empty ~f:add_word)) in
let prompt () =
printf "Enter characters and hit enter (or Ctrl-D to stop): %!" in
prompt ();
In_channel.iter_lines stdin ~f:(fun set ->
build dict set |> Sequence.iter ~f:print_endline;
prompt ())
end
Here comes and example of interaction, that uses /usr/share/dict/american-english dictionary available on my machine (Ubunty Trusty).
./scrabble.native /usr/share/dict/american-english
Enter characters and hit enter (or Ctrl-D to stop): read
r
R
e
E
re
Re
Er
d
D
Rd
Dr
Ed
red
Red
a
A
Ra
Ar
era
ear
are
Rae
ad
read
dear
dare
Dare
Enter characters and hit enter (or Ctrl-D to stop):
(Yep, the dictionary contains words, that like r and d that are probably not true English words. In fact, for each letter the dictionary has a word, so, we can basically build a word from each non-empty set of alphabet letters).
The full implementation along with the building instructions can be found on Gist
A better way to do this is to loop through all the words in the dictionary and see if the word can be built with the letters in the array.
"Sign" the letters available by sorting them in order; that's O(m log m), where m is the number of letters.
"Sign" each word in the dictionary by sorting the letters of the word in order; that's O(k log k), where k is the length of the word.
Compare the letter signature to each word signature; that's O(min(m, k) * n), where n is the number of words in the dictionary. Output any word that matches.
Assuming an English word list of approximately a quarter-million words, and no more than about half a dozen, that should be nearly instantaneous.
I was recently asked the same question in BankBazaar interview. I was given the option to (he said that in a very subtle manner) pre-process the dictionary in any way I want.
My first thought was to arrange the dictionary in a trie or ternary search tree, and make all the words from the letters given. In any optimization way, that would take n! + n-1! + n-2! n-3! + ..... + n word checks(n being the number of letters) in worst case, which was not acceptable.
The other way could be to check all the dictionary words whether they can be made from the given letters. This again in any optimized way would take noOfDictionaryWords(m) * average size of dictionary words(k) at worst case, which was again not acceptable.
Now I have n! + n-1! + n-2! + .... + N words, which I have to check in the dictionary, and I don't want to check them all, so what are the situations that I have to check only a subset of them, and how to group them.
If I have to check only combination and not permutation, the result gets to 2^n.
so I have to pre-process the dictionary words in such a way that if I pass a combination, all the anagrams would be printed.
A ds something like this : http://1.bp.blogspot.com/-9Usl9unQJpY/Vg6IIO3gpsI/AAAAAAAAAbM/oTuhRDWelhQ/s1600/hashmapArrayForthElement.png
A hashvalue made by the letters(irrespective of its positions and permutation), pointing to list containing all the words made by those letters, then we only need to check that hashvalue.
I gave the answer to make the hash value by assigning a prime value to all the alphabets and while calculating the hash value of a word, multiply all the assigned values. This will create a problem of having really big hash values given that 26th prime is 101, and many null values in the map taking space. We could optimize it a bit by rather than starting lexicographically with a = 2, b = 3, c = 5, d = 7.... z = 101, we search for the most used alphabets and assign them small values, like vowels, and 's', 't' etc.
The interviewer accepted it, but was not expecting the answer, so there is definitely another answer, for better or worse but there is.
Swift 4
func findValidWords(in dictionary: [String], with letters: [Character]) -> [String] {
var validWords = [String]()
for word in dictionary {
var temp = word
for char in letters {
temp = temp.filter({ $0 != char })
if temp.isEmpty {
validWords.append(word)
break
}
}
}
return validWords
}
print(findValidWords(in: ["ape", "apples", "orange", "elapse", "lap", "soap", "bar", "sole"], with: ["a","p","l","e","s","o"]))
Output => ["ape", "apples", "elapse", "lap", "soap", "sole"]
My English is not good so try to understand.
My approach is using bit/bitwise to increase speed. Still bruteforce, though.
FIRST STEP
We only consider distinct character in each word and mark its existence. English has 26 characters, so we need 26 bits. Integer is 32 bits. That's enough.
Now encode each words in dictionary to an integer number.
abcdddffg -> 123444667 -> 123467 (only distinct characters) -> 1111011 (bits) -> 123 (decimal number)
So 2,000,000 words will be converted into 2,000,000 integer numbers.
Now let say you have this set of letters: a,b,c,d,e
abcde -> 12345 -> 1111100 (bits)
Do AND operation and we have:
1111100 (abcde)
&
1111011 (abcdddffg, no e)
=
1111000 (result) => result != abcdddffg => word cannot be created
Other example with a,b,c,d,e,f,g,h:
11111111 (abcdefgh)
&
11110110 (abcdddffg, no e and h)
=
11110110 (result) => result == abcdddffg => word can be created
SECOND STEP
While converting word to number, store the letter count also. If we found a match in first step, we continue to check if the number of letters is enough too.
Depend on the requirement, you might not need this second step.
COMPLEXITY
O(n) to convert word to number and store letters count. Only need to do this once.
O(n) for each search query.
Following is more efficient way :-
1.Use count sort to count all letters appearing in the a word in dictionary.
2.Do count sort on the collection of letter that you are given.
3.Compare if the counts are same then the word can be made.
4. Do this for all words in dictionary.
This will be inefficient for multiple such queries so you can do following :-
1. make a tupple for each word using count sort.
2. put the tupple in a Tree or hashmap with count entries.
3. When query is given do count sort and lookup tupple in hashmap
.
Time complexity :-
The above method gives O(1) time complexity for a query and O(N) time complexity for hash table construction where N is no of words in dictionary
(cf. anagram search, e.g. using primes looks cleaner for a signature based approach - collect for all non-equivalent "substrings of letters"])
Given the incentive, I'd (pre)order Dict by (set of characters that make up each word, increasing length) and loop over the subsets from letters checking validity of each word until too long.
Alternatively, finding the set of words from dict out of chars from letters can be considered a multi-dimensional range query: with "eeaspl" specifying letters, valid words have zero to two "e", one or none of a, s, p, l, and no other characters at all - bounds on word length (no longer than letters, lower bound to taste) blend in nicely.
Then again, data structures like k-d-trees do well with few, selective dimensions.
(Would-be comment: you do not mention alphabet cardinality, whether "valid" depends on capitalisation or diacritics, "complexity" includes programmer effort or preprocessing of dict - the latter may be difficult to amortise if dict is immutable.)
Swift 3
func findValidWords(wordsList: [String] , string: String) -> [String]{
let charCountsDictInTextPassed = getCharactersCountIn(string: string)
var wordsArrayResult: [String] = []
for word in wordsList {
var canBeProduced = true
let currentWordCharsCount = getCharactersCountIn(string: word)
for (char, count) in currentWordCharsCount {
if let charCountInTextPassed = charCountsDictInTextPassed[char], charCountInTextPassed >= count {
continue
}else{
canBeProduced = false
break
}
}// end for
if canBeProduced {
wordsArrayResult.append(word)
}//end if
}//end for
return wordsArrayResult
}
// Get the count of each character in the string
func getCharactersCountIn(string: String) -> [String: Int]{
var charDictCount:[String: Int] = [:]
for char in string.characters {
if let count = charDictCount[String(char)] {
charDictCount[String(char)] = count + 1
}else{
charDictCount[String(char)] = 1
}
}//end for
return charDictCount
}
If letters can be repeated, that means that a word can be infinitely long. You would obviously cap this at the length of the longest word in the dictionary, but there are still too many words to check. Like nmore suggested, you'd rather iterate over the dictionary to do this.
List<String> findAllValidWords(Set<String> dict, char[] letters) {
List<String> result = new LinkedList<>();
Set<Character> charSet = new HashSet<>();
for (char letter : letters) {
charSet.add(letter);
}
for (String word : dict) {
if (isPossible(word, charSet)) {
result.add(word);
}
}
return result;
}
boolean isPossible(String word, Set<Character> charSet) {
// A word is possible if all its letters are contained in the given letter set
for (int i = 0; i < word.length(); i++) {
if (!charSet.contains(word.charAt(i))) {
return false;
}
}
return true;
}

If a word is made up of two valid words

Given a dictionary find out if given word can be made by two words in dictionary. For eg. given "newspaper" you have to find if it can be made by two words. (news and paper in this case). Only thing i can think of is starting from beginning and checking if current string is a word. In this case checking n, ne, new, news..... check for the remaining part if current string is a valid word.
Also how do you generalize it for k(means if a word is made up of k words) ? Any thoughts?
Starting your split at the center may yield results faster. For example, for newspaper, you would first try splitting at 'news paper' or 'newsp aper'. As you can see, for this example, you would find your result on the first or second try. If you do not find a result, just search outwards. See the example for 'crossbow' below:
cros sbow
cro ssbow
cross bow
For the case with two words, the problem can be solved by just considering all possible ways of splitting the word into two, then checking each half to see if it's a valid word. If the input string has length n, then there are only O(n) different ways of splitting the string. If you store the strings in a structure supporting fast lookup (say, a trie, or hash table).
The more interesting case is when you have k > 2 words to split the word into. For this, we can use a really elegant recursive formulation:
A word can be split into k words if it can be split into a word followed by a word splittable into k - 1 words.
The recursive base case would be that a word can be split into zero words only if it's the empty string, which is trivially true.
To use this recursive insight, we'll modify the original algorithm by considering all possible splits of the word into two parts. Once we have that split, we can check if the first part of the split is a word and if the second part of the split can be broken apart into k - 1 words. As an optimization, we don't recurse on all possible splits, but rather just on those where we know the first word is valid. Here's some sample code written in Java:
public static boolean isSplittable(String word, int k, Set<String> dictionary) {
/* Base case: If the string is empty, we can only split into k words and vice-
* versa.
*/
if (word.isEmpty() || k == 0)
return word.isEmpty() && k == 0;
/* Generate all possible non-empty splits of the word into two parts, recursing on
* problems where the first word is known to be valid.
*
* This loop is structured so that we always try pulling off at least one letter
* from the input string so that we don't try splitting the word into k pieces
* of which some are empty.
*/
for (int i = 1; i <= word.length(); ++i) {
String first = word.substring(0, i), last = word.substring(i);
if (dictionary.contains(first) &&
isSplittable(last, k - 1, dictionary)
return true;
}
/* If we're here, then no possible split works in this case and we should signal
* that no solution exists.
*/
return false;
}
}
This code, in the worst case, runs in time O(nk) because it tries to generate all possible partitions of the string into k different pieces. Of course, it's unlikely to hit this worst-case behavior because most possible splits won't end up forming any words.
I'd first loop through the dictionary using a strpos(-like) function to check if it occurs at all. Then try if you can find a match with the results.
So it would do something like this:
Loop through the dictionary strpos-ing every word in the dictionary and saving results into an array, let's say it gives me the results 'new', 'paper', and 'news'.
Check if new+paper==newspaper, check if new+news==newspaper, etc, untill you get to paper+news==newspaper which returns.
Not sure if it is a good method though, but it seems more efficient than checking a word letter for letter (more iterations) and you didn't explain how you'd check when the second word started.
Don't know what you mean by 'how do you generalize it for k'.

String Tiling Algorithm

I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD, CDE, ABC, A, and the resulting tiled string should be ABCDE, because BCD aligns with CDE yielding BCDE, which is then aligned with ABC yielding the final ABCDE.
Currently, I'm using a slightly naïve algorithm, that works as follows. Starting with a random pair of strings, say BCD and CDE, I use the following (in Java):
public static String tile(String first, String second) {
for (int i = 0; i < first.length() || i < second.length(); i++) {
// "right" tile (e.g., "BCD" and "CDE")
String firstTile = first.substring(i);
// "left" tile (e.g., "CDE" and "BCD")
String secondTile = second.substring(i);
if (second.contains(firstTile)) {
return first.substring(0, i) + second;
} else if (first.contains(secondTile)) {
return second.substring(0, i) + first;
}
}
return EMPTY;
}
System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ
Although this works, it's not very efficient, as it iterates over the same characters over and over again.
So, does anybody know a better (more efficient) algorithm to do this ? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) are very much welcome. Also note that I'm not looking for an alignment, but a tiling, because I require a full overlap of one of the strings over the other.
I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.
Thanks in advance.
For situations where there is ambiguity -- e.g., {ABC, CBA} which could result in ABCBA or CBABC --, any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}, which are manipulated so that the aforementioned algorithm works.
Similar question: Efficient Algorithm for String Concatenation with Overlap
Order the strings by the first character, then length (smallest to largest), and then apply the adaptation to KMP found in this question about concatenating overlapping strings.
I think this should work for the tiling of two strings, and be more efficient than your current implementation using substring and contains. Conceptually I loop across the characters in the 'left' string and compare them to a character in the 'right' string. If the two characters match, I move to the next character in the right string. Depending on which string the end is first reached of, and if the last compared characters match or not, one of the possible tiling cases is identified.
I haven't thought of anything to improve the time complexity of tiling more than two strings. As a small note for multiple strings, this algorithm below is easily extended to checking the tiling of a single 'left' string with multiple 'right' strings at once, which might prevent extra looping over the strings a bit if you're trying to find out whether to do ("ABC", "BCX", "XYZ") or ("ABC", "XYZ", BCX") by just trying all the possibilities. A bit.
string Tile(string a, string b)
{
// Try both orderings of a and b,
// since TileLeftToRight is not commutative.
string ab = TileLeftToRight(a, b);
if (ab != "")
return ab;
return TileLeftToRight(b, a);
// Alternatively you could return whichever
// of the two results is longest, for cases
// like ("ABC" "BCABC").
}
string TileLeftToRight(string left, string right)
{
int i = 0;
int j = 0;
while (true)
{
if (left[i] != right[j])
{
i++;
if (i >= left.Length)
return "";
}
else
{
i++;
j++;
if (i >= left.Length)
return left + right.Substring(j);
if (j >= right.Length)
return left;
}
}
}
If Open Source code is acceptable, then you should check the genome benchmarks in Stanford's STAMP benchmark suite: it does pretty much exactly what you're looking for. Starting with a bunch of strings ("genes"), it looks for the shortest string that incorporates all the genes. So for example if you have ATGC and GCAA, it'll find ATGCAA. There's nothing about the algorithm that limits it to a 4-character alphabet, so this should be able to help you.
The first thing to ask is if you want to find the tilling of {CDB, CDA}? There is no single tilling.
Interesting problem. You need some kind of backtracking. For example if you have:
ABC, BCD, DBC
Combining DBC with BCD results in:
ABC, DBCD
Which is not solvable. But combining ABC with BCD results in:
ABCD, DBC
Which can be combined to:
ABCDBC.

Resources