More efficient way to find phrases in a string? - algorithm

I have a list that contains 100,000+ words/phrases sorted by length
let list = ["string with spaces", "another string", "test", ...]
I need to find the longest element in the list above that is inside a given sentence. This is my initial solution
for item in list {
    if sentence == item
        || sentence.startsWith(item + " ")
        || sentence.contains(" " + item + " ")
        || sentence.endsWith(" " + item) {
        ...
        break
    }
}
The issue I am running into is that this is too slow for my application. Is there a different approach I could take to make this faster?

You could build an Aho-Corasick searcher from the list and then run this on the sentence. According to https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm "The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa). "
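For illustration, here is a rough sketch in Python using the third-party pyahocorasick package (the package choice and the names are mine; the same idea carries over to the language the question's code is written in). It keeps only matches that line up with word boundaries, as the original loop does:

# Minimal sketch, assuming the third-party pyahocorasick package
# (pip install pyahocorasick); phrases and sentence are illustrative.
import ahocorasick

phrases = ["string with spaces", "another string", "test"]  # the 100,000+ list
sentence = "this is a test sentence with another string in it"

automaton = ahocorasick.Automaton()
for phrase in phrases:
    automaton.add_word(phrase, phrase)
automaton.make_automaton()

longest = ""
for end, phrase in automaton.iter(sentence):
    start = end - len(phrase) + 1
    # keep only matches that start and end on a word boundary
    before_ok = start == 0 or sentence[start - 1] == " "
    after_ok = end == len(sentence) - 1 or sentence[end + 1] == " "
    if before_ok and after_ok and len(phrase) > len(longest):
        longest = phrase

print(longest)  # -> "another string"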

I would break the given sentence up into a list of words and then compute all possible contiguous sublists (i.e. phrases). Given a sentence of n words, there are n * (n + 1) / 2 possible phrases that can be found inside it.
If you now replace your list of search phrases (["string with spaces", "another string", "test", ...]) with a data structure that supports (amortized) constant time lookup, such as a hash set, you can walk over the list of phrases you computed in the previous step and check whether each one is in the set in roughly constant time.
The overall time complexity of this algorithm scales quadratically in the size of the sentence, and is roughly independent of the size of the set of search terms.
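A short Python sketch of this idea (illustrative only; the phrase set and sentence are made up):

# Hash the search phrases once, then test every contiguous run of words
# in the sentence against the set.
search_phrases = {"string with spaces", "another string", "test"}

def longest_phrase(sentence):
    words = sentence.split()
    best = ""
    # all n * (n + 1) / 2 contiguous sublists of words
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase = " ".join(words[i:j])
            if phrase in search_phrases and len(phrase) > len(best):
                best = phrase
    return best

print(longest_phrase("run the test on another string today"))  # -> "another string"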

The solution I decided to use was a Trie https://en.wikipedia.org/wiki/Trie. Each node in the trie is a word, and all I do is tokenize the input sentence (by word) and traverse the trie.
This improved performance from ~140 seconds to ~5 seconds
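For illustration, a rough Python sketch of what such a word-level trie could look like (this is my own reconstruction, not the asker's actual code):

# Each edge of the trie is a whole word; we walk the sentence's tokens
# starting from every position and remember the longest complete phrase seen.
def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node["$end"] = phrase  # marks a complete phrase
    return root

def longest_match(trie, sentence):
    tokens = sentence.split()
    best = ""
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:]:
            if token not in node:
                break
            node = node[token]
            if "$end" in node and len(node["$end"]) > len(best):
                best = node["$end"]
    return best

trie = build_trie(["string with spaces", "another string", "test"])
print(longest_match(trie, "please test another string today"))  # -> "another string"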

Related

An algorithm reading N words and printing all anagrams

The original problem is here:
Design a O(N log N) algorithm to read in a list of words and print out all anagrams. For example, the strings "comedian" and "demoniac" are anagrams of each other. Assume there are N words and each word contains at most 20 letters. Designing a O(N^2) algorithms should not be too difficult, but getting it down to O(N log N) requires some cleverness.
I am confused, since the length of a word does not depend on N; it is a constant (at most 20). So I thought we could multiply the running time for one word by N, and hence the result would be O(N). However, it seems I am missing something.
If they insist on hitting that O(n log n) algorithm (because I think you can do better), a way is as follows:
Iterate over the array and sort each word individually
Keep a one-to-one map of each sorted word to its original (in the form of a tuple, for example)
Sort that newly created list by the sorted words.
Iterate over the sorted list, extract the words with equal sorted counterparts (they are adjacent now) and print them.
Example:
array = ["abc", "gba", "bca"]
Sorting each word individually and keeping the original word gives:
new_array = [("abc", "abc"), ("abg", "gba"), ("abc", "bca")]
Sorting the whole array by the first element gives:
new_array = [("abc", "abc"), ("abc", "bca"), ("abg", "gba")]
Now we iterate over the above array and extract words with equal first elements in the tuple, which gives
[("abc", "abc"), ("abc", "bca")] => ["abc", "bca"]
Time complexity analysis:
Looping over the original array is linear in n.
Sorting each individual word is constant time because a word never exceeds 20 characters; it's on the order of 20 * log 20 operations, a constant.
sorting the whole array is linearithmic in n.
the resulting time complexity becomes O(n * log n) where n is the length of the input array.
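A quick Python sketch of the steps above (illustrative only; the word list is the example from the answer):

# Pair each word with its individually sorted form, then sort by that key.
words = ["abc", "gba", "bca"]
pairs = sorted((("".join(sorted(w)), w) for w in words))
# -> [('abc', 'abc'), ('abc', 'bca'), ('abg', 'gba')]

# Equal keys are now adjacent, so one linear pass groups the anagrams.
groups, current_key, current = [], None, []
for key, word in pairs:
    if key != current_key:
        if current:
            groups.append(current)
        current_key, current = key, []
    current.append(word)
if current:
    groups.append(current)

print(groups)  # -> [['abc', 'bca'], ['gba']]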
An algorithm
We are going to find an "identifier" for every class of anagrams. An identifier should be something that:
is unique to this class: no two classes have the same identifier;
can be computed when we're given a single word of the class: given two different words of the same class, we should compute the same identifier.
Once we've done that, all we have to do is group together the words that have the same identifier. There are several different ways of grouping words that have the same identifier; the main two ways are:
sorting the list of words, using the identifier as a comparison key;
using a "map" data structure, for instance a hash table or a binary tree.
Can you think of a good identifier?
An identifier I can think of is the list of letters of the words, in alphabetical order. For instance:
comedian --> acdeimno
dog --> dgo
god --> dgo
hello --> ehllo
hole --> ehlo
demoniac --> acdeimno
Implementation in python
words = 'comedian dog god hello hole demoniac'.split()
d = {}
for word in words:
    d.setdefault(''.join(sorted(word)), []).append(word)
print(list(d.values()))
[['comedian', 'demoniac'], ['dog', 'god'], ['hello'], ['hole']]
The explanation
The most important thing here is that for each word, we computed ''.join(sorted(word)). That's the identifier I mentioned earlier. In fact, I didn't write the earlier example by hand; I printed it with python using the following code:
for word in words:
    print(word, ' --> ', ''.join(sorted(word)))
comedian --> acdeimno
dog --> dgo
god --> dgo
hello --> ehllo
hole --> ehlo
demoniac --> acdeimno
So what is this? For each class of anagrams, we've made up a unique word to represent that class. "comedian" and "demoniac" both belong to the same class, represented by "acdeimno".
Once we've managed to do that, all that is left is to group the words which have the same representative. There are a few different ways to do that. In the python code, I have used a python dict, a dictionary, which is effectively a hashtable mapping the representative to the list of corresponding words.
Another way, if you don't know about map data structures, is to sort the list, which takes O(N log N) operations, using the representative as the comparison key:
print( sorted(words, key=lambda word:''.join(sorted(word))) )
['comedian', 'demoniac', 'dog', 'god', 'hello', 'hole']
Now, all words that belong to the same class of anagrams are adjacent. All that's left for you to do is iterate through this sorted list and group elements which have the same key. This is only O(N). So the longest part of the algorithm was sorting the list.
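If you don't want to write that grouping pass by hand, itertools.groupby on the sorted list does the same thing; a short sketch:

# Group adjacent words that share the same sorted-letters key.
from itertools import groupby

words = 'comedian dog god hello hole demoniac'.split()
key = lambda word: ''.join(sorted(word))

groups = [list(group) for _, group in groupby(sorted(words, key=key), key=key)]
print(groups)  # -> [['comedian', 'demoniac'], ['dog', 'god'], ['hello'], ['hole']]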
You can do it in O(n) by using a hash table (or a dict in Python).
Then you add a for i in 1..sqrt(n) loop that does nothing at each step.
This is the best way to make an O(n*sqrt(n)) algorithm where an O(n) one exists.
Let d = dict(), a Python dictionary.
Iterate over the array:
    for each word w, let s = the word sorted by increasing letter value
    if s is already in d, append w to d[s]
    if not, set d[s] = [w]
    for i in 1..sqrt(n): do nothing  // only needed to slow the algorithm from O(n) to O(n*sqrt(n))
Print the anagrams:
    foreach (k, l) in d
        if len(l) > 1, print l

What is the Time and Space complexity of following solution?

Problem statement:
Given a non-empty string s and a dictionary wordDict containing a list of non-empty words, add spaces in s to construct a sentence where each word is a valid dictionary word. Return all such possible sentences.
Note:
The same word in the dictionary may be reused multiple times in the segmentation.
You may assume the dictionary does not contain duplicate words.
Sample test case:
Input:
s = "catsanddog"
wordDict = ["cat", "cats", "and", "sand", "dog"]
Output:
[
"cats and dog",
"cat sand dog"
]
My Solution:
class Solution {
    unordered_set<string> words;
    unordered_map<string, vector<string>> memo;
public:
    vector<string> getAllSentences(string s) {
        if (s.size() == 0) {
            return {""};
        }
        if (memo.count(s)) {
            return memo[s];
        }
        string curWord = "";
        vector<string> result;
        for (int i = 0; i < s.size(); i++) {
            curWord += s[i];
            if (words.count(curWord)) {
                auto sentences = getAllSentences(s.substr(i + 1));
                for (string rest : sentences) {
                    string sentence = curWord + (rest.size() > 0 ? " " + rest : "");
                    result.push_back(sentence);
                }
            }
        }
        return memo[s] = result;
    }

    vector<string> wordBreak(string s, vector<string>& wordDict) {
        for (auto word : wordDict) {
            words.insert(word);
        }
        return getAllSentences(s);
    }
};
I am not sure about the time and space complexity. I think it should be 2^n, where n is the length of the given string s. Can anyone please help me prove the time and space complexity?
I also have the following questions:
If I don't use memo in the getAllSentences function, what will be the time complexity in that case?
Is there any better solution than this?
Let's try to go through the algorithm step by step, but for a specific wordDict to simplify things.
So let wordDict be all the characters from a to z,
wordDict = ["a",..., "z"]
In this case if(words.count(curWord)) would be true every time when i = 0 and false otherwise.
Also, let's skip the memo cache for now (we'll add it later).
In the case above, we just go through string s recursively until we reach the end, without any additional memory except the result vector, which gives the following:
time complexity is O(n!)
space complexity is O(1) - just 1 solution exists
where n is the length of s
Now let's examine how using the memo cache changes the situation in our case. The cache would contain n items - the size of our string s - which changes the space complexity to O(n). Our time complexity is the same, since there will be no cache hits.
This is the basis for us to move forward.
Now let's try to find how things change if wordDict contains all the pairs of letters (and the length of s is even, so we can reach the end).
So, wordDict = ['aa','ab',...,'zz']
In this case we move forward by 2 letters instead of 1 and everything else is the same, which gives us the following complexity without using the memo cache:
time complexity is O((n/2)!)
space complexity is O(1) - just 1 solution exists
The memo cache would contain n/2 items, which again changes the space complexity to O(n), but all the cached keys are of different lengths.
Let's now imagine that wordDict contains both dictionaries we mentioned before ('a'...'z','aa'...'zz').
In this case we have the following complexity without using the memo cache:
time complexity is O(n!), as we need to check the cases i=0 and i=1, which roughly doubles the number of checks we need to do at each step, but on the other hand it reduces the number of checks we have to do later, since we sometimes move forward by 2 letters instead of one (this is the trickiest part for me).
Space complexity is ~O(2^n), since every additional char doubles the number of results.
Now let's think about the memo cache we have. It would be useful for every 3 letters, because for example '...ab c...' gives the same suffix as '...a bc...', so it roughly halves the number of calculations at every step, so our complexity would be the following:
time complexity is roughly O((n/2)!), and we need O(2*n) = O(n) memory to store the memo. Let's also remember that in the n/2 expression, the 2 reflects the cache effectiveness.
space complexity is O(2^n) - the 2 here is a characteristic of the wordDict we've constructed.
These were 3 cases to help us understand how the complexity changes depending on the circumstances. Now let's try to generalize to the generic case:
time complexity is O((n/(l*e))!), where l is the minimum length of the words in wordDict and e is the cache effectiveness (I would assume it is 1 in the general case, but there may be situations where it's different, as we saw in the case above).
space complexity is O(a^n), where a measures the similarity of the words in wordDict; it could be very roughly estimated as P(h/l) = (h/l)!, where h is the maximum word length in the dictionary and l is the minimum word length (for example, if wordDict contains all combinations of up to 3 letters, this gives us 3! combinations for every 6 letters).
This is how I see your approach and it's complexity.
As for improving the solution itself, I don't see any simple way to improve it. There might be an alternative approach that divides the string into 3 parts and then processes each part separately, but it would only definitely work if we could get rid of building the results and just count the number of results without displaying them.
I hope it helps.
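Regarding the closing remark about counting results instead of listing them: if only the number of segmentations is needed, a simple DP over suffixes avoids building the sentences entirely. A rough Python sketch (names are mine):

# ways[i] = number of ways to segment the suffix s[i:] using the dictionary.
def count_segmentations(s, word_dict):
    words = set(word_dict)
    max_len = max((len(w) for w in words), default=0)
    ways = [0] * (len(s) + 1)
    ways[len(s)] = 1  # the empty suffix has exactly one segmentation
    for i in range(len(s) - 1, -1, -1):
        for j in range(i + 1, min(len(s), i + max_len) + 1):
            if s[i:j] in words:
                ways[i] += ways[j]
    return ways[0]

print(count_segmentations("catsanddog", ["cat", "cats", "and", "sand", "dog"]))  # -> 2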

Valid Permutations of a String

This question was asked to me in a recent amazon technical interview. It goes as follows:-
Given a string, e.g. "where am i", and a dictionary of valid words, you have to list all valid distinct permutations of the string. A valid string comprises words which exist in the dictionary. For example, "we are him" and "whim aree" are valid strings, assuming the words (whim, aree) are part of the dictionary. Also, the condition is that a mere rearrangement of words is not a valid string, i.e. "i am where" is not a valid combination.
The task is to find all possible such strings in the optimum way.
As you have said, spaces don't count, so the input can just be viewed as a list of chars. The output is a permutation of words, so an obvious way to do it is to find all valid words and then permute them.
Now the problem becomes dividing a list of chars into subsets that each form a word; you can find some answers to that here, and the following is my version for solving this sub-problem.
If the dictionary is not large, we can iterate over the dictionary to
find the min_len/max_len of the words, to estimate how many words we may have, i.e. how deep we recurse;
convert each word into a character-count map to accelerate the search;
filter out the words which contain an impossible char (i.e. a char our input doesn't have);
then, whenever a word is a subset of our input, we can recurse to find the remaining words.
The following is pseudocode:
int maxDepth = input.length / min_len;

void findWord(List<Map<Character, Integer>> filteredDict, Map<Character, Integer> input,
              List<Map<Character, Integer>> subsets, int level) {
    if (level < maxDepth) {
        for (Map<Character, Integer> word : filteredDict) {
            if (subset(input, word)) {
                subsets.add(word);
                findWord(filteredDict, removeSubset(input, word), subsets, level + 1);
            }
        }
    }
}
And then you can permute the words in a recursive function easily.
Technically speaking, this solution can be O(n**d) -- where n is the dictionary size and d is the max depth. But if the input is not too large or complex, we can still solve it in feasible time.
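For illustration, a rough Python version of the subset-finding step from the pseudocode above, using collections.Counter as the character-count map. It directly enumerates ordered sequences of dictionary words that use up exactly the input's letters (the dictionary and input below are made up):

# Counter plays the role of the character-count "map"; recursion stops when
# every character of the input has been consumed.
from collections import Counter

def find_word_sets(remaining, dictionary, chosen, results):
    if not any(remaining.values()):          # all characters used up
        results.append(list(chosen))
        return
    for word in dictionary:
        need = Counter(word)
        if all(remaining[c] >= n for c, n in need.items()):
            find_word_sets(remaining - need, dictionary, chosen + [word], results)

results = []
find_word_sets(Counter("whereami"), ["we", "are", "him", "whim", "aree"], [], results)
print([" ".join(r) for r in results])  # includes e.g. 'we are him' and 'whim aree'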

Splitting a sentence to minimize sentence lengths

I have come across the following problem statement:
You have a sentence written entirely in a single row. You would like to split it into several rows by replacing some of the spaces
with "new row" indicators. Your goal is to minimize the width of the
longest row in the resulting text ("new row" indicators do not count
towards the width of a row). You may replace at most K spaces.
You will be given a sentence and a K. Split the sentence using the
procedure described above and return the width of the longest row.
I am a little lost with where to start. To me, it seems I need to try to figure out every possible sentence length that satisfies the criteria of splitting the single sentence up into K lines.
I can see a couple of edge cases:
There are <= K words in the sentence, therefore return the longest word.
The sentence length is 0, return 0
If neither of those criteria are true, then we have to determine all possible combinations of splitting the sentence and the return the minimum of all those options. This is the part I don't know how to do (and is obviously the heart of the problem).
You can solve it by inverting the problem. Let's say I fix the length of the longest row to L. Can you compute the minimum number of breaks you need to satisfy it?
Yes: you just break before the first word that would go over L, and count the breaks up (O(N)).
So now that we have that, we just have to find the minimum L that requires at most K breaks. You can do a binary search on the length of the input. Final complexity O(N log N).
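A hedged Python sketch of that approach (the greedy feasibility check plus the binary search; function and variable names are mine):

# breaks_needed: greedy check for a fixed row width `limit`.
def breaks_needed(words, limit):
    breaks, current = 0, 0
    for w in words:
        if len(w) > limit:
            return float("inf")              # a single word doesn't fit at all
        extra = len(w) if current == 0 else current + 1 + len(w)
        if extra > limit:
            breaks += 1                      # start a new row before this word
            current = len(w)
        else:
            current = extra
    return breaks

# min_longest_row: binary search for the smallest width L needing <= K breaks.
def min_longest_row(sentence, K):
    words = sentence.split()
    if not words:
        return 0
    lo, hi = max(len(w) for w in words), len(sentence)
    while lo < hi:
        mid = (lo + hi) // 2
        if breaks_needed(words, mid) <= K:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(min_longest_row("this is an example sentence", 2))  # -> 10 ("this is an" / "example" / "sentence")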
First Answer
What you want to achieve is Minimum Raggedness. If you just want the algorithm, it is here as a PDF. If the research paper's link goes bad, please search for the famous paper named Breaking Paragraphs into Lines by Knuth.
However, if you want to get your hands on some implementations of the same, in the question Balanced word wrap (Minimum raggedness) in PHP on SO, people have actually given implementations not only in PHP but in C, C++ and bash as well.
Second Answer
Though this is not exactly a correct approach, it is quick and dirty if you are looking for something like that. This method will not return the correct answer in every case; it is for those people for whom the time to ship their product is more important.
Idea
You already know the length of your input string. Let's call it L;
When putting in K breaks, the best scenario would be to break the string into parts of exactly L / (K + 1) size;
So break your string at the word which leaves the resulting row's length least far from L / (K + 1).
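A very rough Python sketch of that heuristic (again: quick and dirty, not guaranteed optimal; names are mine):

# Aim each row at length len(sentence) / (K + 1) and break at the nearest space.
def heuristic_split(sentence, K):
    target = len(sentence) / (K + 1)
    rows, current = [], ""
    for word in sentence.split():
        candidate = word if not current else current + " " + word
        # break before this word if doing so keeps the row closer to the target
        if current and len(rows) < K and abs(len(current) - target) <= abs(len(candidate) - target):
            rows.append(current)
            current = word
        else:
            current = candidate
    rows.append(current)
    return max(len(r) for r in rows)

print(heuristic_split("this is an example sentence", 2))  # -> 10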
My recursive solution, which can be improved through memoization or dynamic programming.
def split(self, sentence, K):
    if not sentence: return 0
    if ' ' not in sentence or K == 0: return len(sentence)
    spaces = [i for i, s in enumerate(sentence) if s == ' ']
    res = 100000
    for space in spaces:
        res = min(res, max(space, self.split(sentence[space+1:], K-1)))
    return res
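For what it's worth, one way to add the memoization mentioned above is to rewrite the method as a free function and wrap it with functools.lru_cache; the recursion itself is unchanged:

# Same recursion as above, memoized on (sentence, K).
from functools import lru_cache

@lru_cache(maxsize=None)
def split(sentence, K):
    if not sentence:
        return 0
    if ' ' not in sentence or K == 0:
        return len(sentence)
    spaces = [i for i, ch in enumerate(sentence) if ch == ' ']
    return min(max(space, split(sentence[space + 1:], K - 1)) for space in spaces)

print(split("this is an example sentence", 2))  # -> 10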

If a word is made up of two valid words

Given a dictionary, find out if a given word can be made from two words in the dictionary. For example, given "newspaper", you have to find whether it can be made from two words ("news" and "paper" in this case). The only thing I can think of is starting from the beginning and checking whether the current string is a word: check n, ne, new, news, ... and, whenever the current string is a valid word, check the remaining part as well.
Also, how do you generalize it for k (meaning, if a word is made up of k words)? Any thoughts?
Starting your split at the center may yield results faster. For example, for newspaper, you would first try splitting at 'news paper' or 'newsp aper'. As you can see, for this example, you would find your result on the first or second try. If you do not find a result, just search outwards. See the example for 'crossbow' below:
cros sbow
cro ssbow
cross bow
For the case with two words, the problem can be solved by just considering all possible ways of splitting the word into two, then checking each half to see if it's a valid word. If the input string has length n, then there are only O(n) different ways of splitting the string, and if you store the dictionary in a structure supporting fast lookup (say, a trie or a hash table), each of those checks is cheap.
The more interesting case is when you have k > 2 words to split the word into. For this, we can use a really elegant recursive formulation:
A word can be split into k words if it can be split into a word followed by a word splittable into k - 1 words.
The recursive base case would be that a word can be split into zero words only if it's the empty string, which is trivially true.
To use this recursive insight, we'll modify the original algorithm by considering all possible splits of the word into two parts. Once we have that split, we can check if the first part of the split is a word and if the second part of the split can be broken apart into k - 1 words. As an optimization, we don't recurse on all possible splits, but rather just on those where we know the first word is valid. Here's some sample code written in Java:
public static boolean isSplittable(String word, int k, Set<String> dictionary) {
    /* Base case: the empty string can be split into k words only if k == 0, and
     * vice-versa.
     */
    if (word.isEmpty() || k == 0)
        return word.isEmpty() && k == 0;

    /* Generate all possible non-empty splits of the word into two parts, recursing on
     * problems where the first word is known to be valid.
     *
     * This loop is structured so that we always try pulling off at least one letter
     * from the input string so that we don't try splitting the word into k pieces
     * of which some are empty.
     */
    for (int i = 1; i <= word.length(); ++i) {
        String first = word.substring(0, i), last = word.substring(i);
        if (dictionary.contains(first) &&
            isSplittable(last, k - 1, dictionary))
            return true;
    }

    /* If we're here, then no possible split works in this case and we should signal
     * that no solution exists.
     */
    return false;
}
This code, in the worst case, runs in time O(n^k), because it tries to generate all possible partitions of the string into k different pieces. Of course, it's unlikely to hit this worst-case behavior, because most possible splits won't end up forming any words.
I'd first loop through the dictionary using a strpos(-like) function to check whether each word occurs in the input at all, then see whether you can form a match from the results.
So it would do something like this:
Loop through the dictionary, strpos-ing every word in the dictionary and saving the hits into an array; let's say this gives me the results 'new', 'paper', and 'news'.
Check if new+paper==newspaper, check if new+news==newspaper, etc., until you get to news+paper==newspaper, which returns true.
Not sure if it is a good method, but it seems more efficient than checking the word letter by letter (more iterations), and you didn't explain how you'd check where the second word starts.
I don't know what you mean by 'how do you generalize it for k'.
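For what it's worth, a small Python sketch of that idea (the dictionary below is made up): filter the dictionary down to words that occur in the target at all, then test pairs.

# The list comprehension is the strpos-style filter; the nested loop tests pairs.
def two_word_split(target, dictionary):
    candidates = [w for w in dictionary if w in target]
    for first in candidates:
        for second in candidates:
            if first + second == target:
                return first, second
    return None

print(two_word_split("newspaper", ["new", "paper", "news", "pape"]))  # -> ('news', 'paper')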
