Reconstructing a string of words using a dictionary into an English sentence - algorithm

I am completely stumped. The question is: given a string like "thisisasentence" and a function isWord() that returns true if its argument is an English word, split the string into a sentence. I get stuck on "this is a sent".
How can I recursively return and keep track of where I am each time?

You need backtracking, which is easily achieved using recursion. The key observation is that you do not need to keep track of where you are past the moment when you are ready to return a solution.
You have a valid "split" when one of the following is true:
The string w is empty (base case), or
You can split non-empty w into substrings p and s, such that p+s=w, p is a word, and s can be split into a sentence (recursive call).
An implementation can return a list of words when a successful split is found, or null when none can be found. The base case always returns an empty list; the recursive case, upon finding a p, s split that results in a non-null return for s, constructs a list with p prefixed to the list returned from the recursive call.
The recursive case will have a loop in it, trying all possible prefixes of w. To speed things up a bit, the loop could terminate upon reaching the prefix that is equal in length to the longest word in the dictionary. For example, if the longest word has 12 characters, you know that trying prefixes 13 characters or longer will not result in a match, so you could cut enumeration short.
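The description above could be sketched roughly as follows in Python. The names split_sentence, is_word and max_len are illustrative, not from the question; is_word stands in for isWord(), and max_len is the optional longest-word cutoff.

```python
def split_sentence(text, is_word, max_len=None):
    """Return a list of words whose concatenation is text, or None."""
    # Base case: the empty string splits into an empty sentence.
    if text == "":
        return []
    # Optionally stop at the length of the longest dictionary word.
    limit = len(text) if max_len is None else min(max_len, len(text))
    for i in range(1, limit + 1):
        prefix, rest = text[:i], text[i:]
        if is_word(prefix):
            tail = split_sentence(rest, is_word, max_len)
            if tail is not None:
                return [prefix] + tail
    # No prefix worked: signal failure so the caller backtracks.
    return None


words = {"this", "is", "a", "sent", "sentence"}
print(split_sentence("thisisasentence", words.__contains__, max_len=8))
# ['this', 'is', 'a', 'sentence']
```

Note how "sent" is tried first, fails on the leftover "ence", and the loop simply moves on to longer prefixes.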

Just adding to the answer above.
In my experience, many people understand recursion better when they see a «linearized» version of a recursive algorithm, which means «implemented as a loop over a stack». Linearization is applicable to any recursive task.
Assuming that isWord() has two parameters (1st: string to test; 2nd: its length) and returns a boolean-compatible value, a C implementation of backtracking is as follows:
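#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Assumed to be provided elsewhere: nonzero if the first len chars of s form a word. */
int isWord(const char *s, int len);
/* Print the words delimited by the offsets words[0..total]. */
void doSmth(char *phrase, int *words, int total) {
    int i;
    for (i = 0; i < total; ++i)
        printf("%.*s ", words[i + 1] - words[i], phrase + words[i]);
    printf("\n");
}
void parse(char *phrase) {
    int current, length, *words;
    if (phrase) {
        length = (int)strlen(phrase);
        words = (int*)calloc(length + 2, sizeof(int));
        if (!words)
            return;
        current = 1;
        while (current) {
            /* Advance the end offset of the current word. */
            for (++words[current]; words[current] <= length; ++words[current])
                if (isWord(phrase + words[current - 1],
                           words[current] - words[current - 1])) {
                    /* Word accepted: push a new end offset. */
                    words[current + 1] = words[current];
                    current++;
                }
            /* No more candidates at this level: pop and backtrack. */
            if (words[--current] == length)
                doSmth(phrase, words, current); /** parse successful! **/
        }
        free(words);
    }
}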
As can be seen, each word uses a pair of stack values: the first is the offset of the current word's first character, and the second is a candidate offset of the character just past the current word's last one (which is therefore the next word's first character). The second value of the current word (the one whose pair is at the top of our «stack») is iterated through all characters left in the phrase.
When a word is accepted, a new second value (equal to the current one, so that only positions after it are considered) is pushed onto the stack, making the former second value the first in a new pair. If the current word (the one just found) completes the phrase, something useful is performed; see doSmth().
If there are no more acceptable words in the remaining part of the phrase, the current word is considered unsuitable, and its second value is discarded from the stack. This effectively resumes the search for words at the previous starting location, with the ending location now farther along than that of the word previously accepted there.

Related

Number of letters to be deleted from a string so that it is divisible by another string

I am working on this problem: https://www.spoj.com/problems/DIVSTR/
We are given two strings S and T.
S is divisible by string T if there is some non-negative integer k that satisfies the equation S=k*T.
What is the minimum number of characters which should be removed from S, so that S is divisible by T?
My main idea was to match T against S using a pointer and count the number of instances of T occurring in S. When an instance is counted, the pointer goes back to the start of T; on a mismatch, I compare T's first letter with the current letter of S.
This code works fine with the test cases they provided and with my own custom test cases, but it fails the hidden test cases.
This is the code:
def no_of_letters(string1,string2):
    # print(len(string1),len(string2))
    count = 0
    pointer = 0
    if len(string1)<len(string2):
        return len(string1)
    if (len(string1)==len(string2)) and (string1!=string2):
        return len(string1)
    for j in range(len(string1)):
        if (string1[j]==string2[pointer]) and pointer<(len(string2)-1):
            pointer+=1
        elif (string1[j]==string2[pointer]) and pointer == (len(string2)-1):
            count+=1
            pointer=0
        elif (string1[j]!=string2[pointer]):
            if string1[j]==string2[0]:
                pointer=1
            else:
                pointer = 0
    return len(string1)-len(string2)*count
One place where I think there might be confusion is when the same letters can be part of two counts, but that should not be a problem, because the answer doesn't need to take overlapping into account.
For example, S = 'akaka', T = 'aka' gives the output 2, regardless of whether we count the first 'aka' (leaving 'ka') or the second 'aka' (leaving 'ak').
I believe that the solution is much more straightforward than you make it. You're simply trying to find how many times the characters of T appear, in order, in S. Everything else is the characters you remove. For instance, given RobertBaron's example of S="akbaabka" and T="aka", you would write your routine to locate the characters a, k, a, in that order, from the start of S:
akbaabka
ak a^
# with some pointer, ptr, now at position 4, marked with a caret above
With that done, you can now recur on the remainder of the string:
find_chars(S[ptr:], T)
With each call, you look for T in S; if you find it, count 1 repetition and recur on the remainder of S; if not, return 0 (base case). As you crawl back up your recursion stack, accumulate all the 1 counts, and there is your value of k.
The number of chars to remove is len(S) - k*len(T).
Can you take it from there?
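One possible way to finish the idea, written as a single left-to-right scan instead of recursion (the scan does the same work as repeatedly calling find_chars on the remainder). The name min_deletions is made up for this sketch, and treating an empty T as "delete everything" is an assumption.

```python
def min_deletions(s, t):
    """Characters to remove from s so that what remains is t repeated k times."""
    if not t:
        # Assumption: with an empty T, only the empty string is divisible.
        return len(s)
    ptr = 0  # index of the next character of t we are looking for
    k = 0    # complete in-order copies of t found so far
    for ch in s:
        if ch == t[ptr]:
            ptr += 1
            if ptr == len(t):
                k += 1
                ptr = 0
    return len(s) - k * len(t)


print(min_deletions("akbaabka", "aka"))  # 2
print(min_deletions("akaka", "aka"))     # 2
```

Unlike the code in the question, a mismatch here never resets the pointer: a non-matching character is simply one of the characters that gets removed.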

Valid Permutations of a String

I was asked this question in a recent Amazon technical interview. It goes as follows:
Given a string, e.g. "where am i", and a dictionary of valid words, you have to list all valid distinct permutations of the string. A valid string consists of words that exist in the dictionary. For example, "we are him" and "whim aree" are valid strings, assuming the words (whim, aree) are part of the dictionary. There is also a condition that a mere rearrangement of words is not a valid string, i.e. "i am where" is not a valid combination.
The task is to find all such strings in the optimal way.
As you have said, spaces don't count, so the input can be viewed as just a list of chars. The output is a permutation of words, so an obvious way to do it is to find all valid words and then permute them.
Now the problem becomes dividing a list of chars into subsets that each form a word, for which you can find some answers here; the following is my version for solving this sub-problem.
If the dictionary is not large, we can iterate over the dictionary to:
- find the min_len/max_len of words, to estimate how many words we may have, i.e. how deep we recurse;
- convert each word into a map (character to count) to accelerate the search;
- filter out the words that contain an impossible char (i.e. a char our input doesn't have);
- if a word is a subset of our input, search for further words recursively.
The following is pseudocode:
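// Upper bound on recursion depth: every word uses at least min_len chars.
int maxDepth = input.length / min_len;
// subset() and removeSubset() are helpers left undefined in this pseudocode.
void findWord(List<Map<Character, Integer>> filteredDict, Map<Character, Integer> input, List<Map<Character, Integer>> subsets, int level) {
    if (level < maxDepth) {
        for (Map<Character, Integer> word : filteredDict) {
            // A word fits if the remaining input has enough of each of its chars.
            if (subset(input, word)) {
                subsets.add(word);
                findWord(filteredDict, removeSubset(input, word), subsets, level + 1);
            }
        }
    }
}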
And then you can easily permute the words in a recursive function.
Technically speaking, this solution can be O(n**d) -- where n is dictionary size and d is max depth. But if the input is not large and complex, we can still solve it in feasible time.
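A runnable sketch of the sub-problem in Python, under the assumption that a result must use every letter of the input exactly once. The name word_groupings is invented here, and collections.Counter plays the role of the character-count map. Candidates are visited in sorted order so that each group of words is produced only once rather than in every word order.

```python
from collections import Counter


def word_groupings(text, dictionary):
    """Return every group of dictionary words that uses exactly the letters of text."""
    letters = Counter(text.replace(" ", ""))
    # Drop empty words and words that need letters the input does not have.
    candidates = sorted(
        w for w in set(dictionary) if w and not (Counter(w) - letters)
    )
    results = []

    def search(start, remaining, chosen):
        if not remaining:  # every letter has been used
            results.append(list(chosen))
            return
        for i in range(start, len(candidates)):
            need = Counter(candidates[i])
            if not (need - remaining):  # the word fits in what is left
                chosen.append(candidates[i])
                search(i, remaining - need, chosen)
                chosen.pop()

    search(0, letters, [])
    return results


vocab = ["we", "are", "him", "whim", "aree", "where", "am", "i"]
print(word_groupings("where am i", vocab))
# [['am', 'i', 'where'], ['are', 'him', 'we'], ['aree', 'whim']]
```

Each returned group can then be permuted to produce the actual strings, filtering out whatever orderings the problem statement disallows.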

Find the lexicographically largest unique string

I need an algorithm to find the largest unique (no duplicate characters) substring from a string by removing characters (no rearranging).
String A is greater than String B if it satisfies these two conditions.
1. Has more characters than String B
Or
2. Is lexicographically greater than String B if equal length
For example, if the input string is dedede, then the possible unique combinations are de, ed, d, and e.
Of these combinations, the largest one is therefore ed since it has more characters than d and e and is lexicographically greater than de.
The algorithm must be more efficient than generating all possible unique strings and sorting them to find the largest one.
Note: this is not a homework assignment.
How about this
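#include <string>
using std::string;
string getLargest(string s)
{
    string result = "";
    if (s.length() == 1) return s;
    for (int i = 0; i < s.length();)
    {
        // Find the first occurrence of the largest character in s[i..].
        int largest_char_pos = i;
        for (int j = i + 1; j < s.length(); j++)
        {
            if (s[largest_char_pos] < s[j]) largest_char_pos = j;
        }
        result += s[largest_char_pos];
        // Continue searching to the right of the character just taken.
        i = largest_char_pos + 1;
    }
    return result;
}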
This code snippet just gives you the lexicographically largest string. If you don't want duplicates, you can keep track of the characters already added.
Let me state the rules for ordering in a way that I think is more clear.
String A is greater than string B if
- A is longer than B
OR
- A and B are the same length and A is lexicographically greater than B
If my restatement of the rules is correct then I believe I have a solution that runs in O(n^2) time and O(n) space. My solution is a greedy algorithm based on the observation that there are as many characters in the longest valid subsequence as there are unique characters in the input string. I wrote this in Go, and hopefully the comments are sufficient to describe the algorithm.
func findIt(str string) string {
    // exc keeps track of characters that we cannot use because they have
    // already been used in an earlier part of the subsequence
    exc := make(map[byte]bool)
    // ret is where we will store the characters of the final solution as we
    // find them
    var ret []byte
    for len(str) > 0 {
        // inc keeps track of unique characters as we scan from right to left so
        // that we don't take a character until we know that we can still make the
        // longest possible subsequence.
        inc := make(map[byte]bool, len(str))
        fmt.Printf("-%s\n", str)
        // best is the largest character we have found that can also get us the
        // longest possible subsequence.
        var best byte
        // best_pos is the lowest index that we were able to find best at, we
        // always want the lowest index so that we keep as many options open to us
        // later if we take this character.
        best_pos := -1
        // Scan through the input string from right to left
        for i := len(str) - 1; i >= 0; i-- {
            // Ignore characters we've already used
            if _, ok := exc[str[i]]; ok { continue }
            if _, ok := inc[str[i]]; !ok {
                // If we haven't seen this character already then it means that we can
                // make a longer subsequence by including it, so it must be our best
                // option so far
                inc[str[i]] = true
                best = str[i]
                best_pos = i
            } else {
                // If we've already seen this character it might still be our best
                // option if it is a lexicographically larger or equal to our current
                // best. If it is equal we want it because it is at a lower index,
                // which keeps more options open in the future.
                if str[i] >= best {
                    best = str[i]
                    best_pos = i
                }
            }
        }
        if best_pos == -1 {
            // If we didn't find any valid characters on this pass then we are done
            break
        } else {
            // include our best character in our solution, and exclude it for
            // consideration in any future passes.
            ret = append(ret, best)
            exc[best] = true
            // run the same algorithm again on the substring that is to the right of
            // best_pos
            str = str[best_pos+1:]
        }
    }
    return string(ret)
}
I am fairly certain you can do this in O(n) time, but I wasn't sure of my solution so I posted this one instead.
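For what it's worth, one candidate for a linear-time version is the well-known monotonic-stack technique. This is a different algorithm from the Go code above, sketched here in Python under the same reading of the ordering rules; the function name largest_unique is made up.

```python
from collections import Counter


def largest_unique(s):
    """Longest duplicate-free subsequence of s, lexicographically largest among those."""
    remaining = Counter(s)  # occurrences not yet scanned
    stack = []              # characters chosen so far, in order
    on_stack = set()
    for ch in s:
        remaining[ch] -= 1
        if ch in on_stack:
            continue
        # Drop smaller characters that can still be picked up later.
        while stack and stack[-1] < ch and remaining[stack[-1]] > 0:
            on_stack.remove(stack.pop())
        stack.append(ch)
        on_stack.add(ch)
    return "".join(stack)


print(largest_unique("dedede"))  # ed
print(largest_unique("bcabc"))   # cab
```

Each character is pushed and popped at most once, so the scan does work proportional to the length of the input.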

Algorithm for finding first repeated substring of length k

I have a homework assignment and I need help. I should write a program to find the first substring of length k that is repeated in the string at least twice.
For example, in the string "banana" there are two repeated substrings of length 2: "an" and "na". In this case, the answer is "an" because it appears sooner than "na".
Note that the simple O(n^2) algorithm is not useful, since there is a limit on the execution time of the program, so I guess it should run in linear time.
There is a hint too: Use Hash table.
I don't want the code. I just want you to give me a clue because I have no idea how to do this using a hash table. Should I use a specific data structure too?
Iterate over the character indexes of the string (0, 1, 2, ...) up to and including the index of the second-from-last character (i.e. up to strlen(str) - 2). For each iteration, do the following:
1. Extract the 2-char substring starting at the character index.
2. Check whether your hashtable contains the 2-char substring. If it does, you've got your answer.
3. Insert the 2-char substring into the hashtable.
This is easily modifiable to cope with substrings of length k.
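For illustration, here is the same loop generalized to length k in Python, with a set as the hash table. The function name is invented, and returning None when nothing repeats is an assumption.

```python
def first_repeated_substring(s, k):
    """Return the first length-k substring seen a second time, or None."""
    if k <= 0:
        return None
    seen = set()
    # The last valid start index is len(s) - k.
    for i in range(len(s) - k + 1):
        sub = s[i:i + k]
        if sub in seen:
            return sub
        seen.add(sub)
    return None


print(first_repeated_substring("banana", 2))  # an
```

Note that this reports the substring whose repeat is reached first while scanning, and that extracting and hashing each substring costs O(k), so the total is O(n*k) rather than strictly linear.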
Combine Will A's algorithm with a rolling hash to get a linear-time algorithm.
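A rough sketch of what that combination might look like in Python, using a polynomial rolling hash. The base and mod values are arbitrary choices made for this example, and the function name is made up. Candidate matches are confirmed by direct comparison so that a hash collision cannot produce a wrong answer.

```python
def first_repeated_rolling(s, k, base=256, mod=(1 << 61) - 1):
    """Return the first length-k substring seen a second time, or None."""
    n = len(s)
    if k <= 0 or k > n:
        return None
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod
    seen = {h: [0]}  # hash value -> start indexes having that hash
    for i in range(1, n - k + 1):
        # Slide the window: drop s[i - 1], append s[i + k - 1].
        h = ((h - ord(s[i - 1]) * high) * base + ord(s[i + k - 1])) % mod
        for j in seen.get(h, []):
            if s[j:j + k] == s[i:i + k]:  # confirm, in case of a collision
                return s[i:i + k]
        seen.setdefault(h, []).append(i)
    return None


print(first_repeated_rolling("banana", 2))  # an
```

Updating the hash takes constant time per position, so the substring itself is only touched when a hash has been seen before.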
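You can use a LinkedHashMap.
import java.util.LinkedHashMap;
import java.util.Map;
public static String findRepeated(String s, int k) {
    // A LinkedHashMap keeps keys in insertion order, i.e. by first occurrence.
    Map<String, Integer> map = new LinkedHashMap<String, Integer>();
    // Use <= so that the final substring (starting at s.length() - k) is counted.
    for (int i = 0; i <= s.length() - k; i++) {
        String temp = s.substring(i, i + k);
        if (!map.containsKey(temp)) {
            map.put(temp, 1);
        }
        else {
            map.put(temp, map.get(temp) + 1);
        }
    }
    // Return the earliest-occurring substring that appears more than once.
    for (Map.Entry<String, Integer> entry : map.entrySet()) {
        if (entry.getValue() > 1) {
            return entry.getKey();
        }
    }
    return "no such value";
}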

If a word is made up of two valid words

Given a dictionary, find out whether a given word can be made from two words in the dictionary. For example, given "newspaper" you have to find out whether it can be made from two words (news and paper in this case). The only thing I can think of is starting from the beginning and checking whether the current string is a word, in this case checking n, ne, new, news, ..., and then checking the remaining part whenever the current string is a valid word.
Also, how do you generalize it to k (i.e. checking whether a word is made up of k words)? Any thoughts?
Starting your split at the center may yield results faster. For example, for newspaper, you would first try splitting at 'news paper' or 'newsp aper'. As you can see, for this example, you would find your result on the first or second try. If you do not find a result, just search outwards. See the example for 'crossbow' below:
cros sbow
cro ssbow
cross bow
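A small sketch of the center-outward search in Python. The function name is invented, and whether the left or right neighbour is tried first at each distance is an arbitrary choice.

```python
def split_in_two(word, dictionary):
    """Return (first, second) if word is two dictionary words joined, else None."""
    n = len(word)
    mid = n // 2
    for offset in range(n):
        # Try the cut points at this distance from the center.
        cuts = (mid + offset, mid - offset) if offset else (mid,)
        for cut in cuts:
            # Both halves must be non-empty dictionary words.
            if 0 < cut < n and word[:cut] in dictionary and word[cut:] in dictionary:
                return word[:cut], word[cut:]
    return None


print(split_in_two("newspaper", {"news", "paper", "new"}))  # ('news', 'paper')
print(split_in_two("crossbow", {"cross", "bow"}))           # ('cross', 'bow')
```

The worst case still examines every cut point; the ordering only helps when a valid split happens to lie near the middle.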
For the case with two words, the problem can be solved by just considering all possible ways of splitting the word into two, then checking each half to see if it's a valid word. If the input string has length n, then there are only O(n) different ways of splitting the string. If you store the strings in a structure supporting fast lookup (say, a trie or hash table), each of those checks is cheap.
The more interesting case is when you have k > 2 words to split the word into. For this, we can use a really elegant recursive formulation:
A word can be split into k words if it can be split into a word followed by a word splittable into k - 1 words.
The recursive base case would be that a word can be split into zero words only if it's the empty string, which is trivially true.
To use this recursive insight, we'll modify the original algorithm by considering all possible splits of the word into two parts. Once we have that split, we can check if the first part of the split is a word and if the second part of the split can be broken apart into k - 1 words. As an optimization, we don't recurse on all possible splits, but rather just on those where we know the first word is valid. Here's some sample code written in Java:
import java.util.Set;
public static boolean isSplittable(String word, int k, Set<String> dictionary) {
    /* Base case: the empty string can be split only into zero words, and
     * zero words can form only the empty string.
     */
    if (word.isEmpty() || k == 0)
        return word.isEmpty() && k == 0;
    /* Generate all possible non-empty splits of the word into two parts, recursing on
     * problems where the first word is known to be valid.
     *
     * This loop is structured so that we always try pulling off at least one letter
     * from the input string so that we don't try splitting the word into k pieces
     * of which some are empty.
     */
    for (int i = 1; i <= word.length(); ++i) {
        String first = word.substring(0, i), last = word.substring(i);
        if (dictionary.contains(first) &&
            isSplittable(last, k - 1, dictionary))
            return true;
    }
    /* If we're here, then no possible split works in this case and we should signal
     * that no solution exists.
     */
    return false;
}
This code, in the worst case, runs in time O(n^k) because it tries to generate all possible partitions of the string into k different pieces. Of course, it's unlikely to hit this worst-case behavior because most possible splits won't end up forming any words.
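If the worst case is a concern, one option (not part of the answer above, just a common refinement) is to cache results per (start index, pieces left), so that each subproblem is solved once. A Python sketch, with illustrative names:

```python
from functools import lru_cache


def is_splittable(word, k, dictionary):
    """True if word is the concatenation of exactly k dictionary words."""
    words = frozenset(dictionary)
    n = len(word)

    @lru_cache(maxsize=None)
    def can_split(start, parts):
        # Base case: nothing left to split, or no pieces left to use.
        if start == n or parts == 0:
            return start == n and parts == 0
        # Pull off at least one letter, recursing only after a valid first word.
        return any(
            word[start:end] in words and can_split(end, parts - 1)
            for end in range(start + 1, n + 1)
        )

    return can_split(0, k)


print(is_splittable("newspaper", 2, {"news", "paper", "new"}))  # True
print(is_splittable("newspaper", 3, {"news", "paper", "new"}))  # False
```

There are at most (n + 1) * (k + 1) distinct subproblems, so the running time becomes polynomial in n and k.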
I'd first loop through the dictionary using a strpos(-like) function to check whether each word occurs in the input at all. Then try to find a match among the results.
So it would do something like this:
Loop through the dictionary, strpos-ing every word in the dictionary and saving the results into an array; let's say it gives me the results 'new', 'paper', and 'news'.
Check if new+paper==newspaper, check if new+news==newspaper, etc., until you get to news+paper==newspaper, which returns.
Not sure if it is a good method though, but it seems more efficient than checking a word letter for letter (more iterations) and you didn't explain how you'd check when the second word started.
Don't know what you mean by 'how do you generalize it for k'.

Resources