If a word is made up of two valid words - algorithm

Given a dictionary, find out whether a given word can be made from two words in the dictionary. For example, given "newspaper", you have to find whether it can be made from two words ("news" and "paper" in this case). The only thing I can think of is starting from the beginning and checking whether the current prefix is a word: check n, ne, new, news, ... and, whenever the prefix is a valid word, check whether the remaining part is also a valid word.
Also, how do you generalize it for k (that is, check whether a word is made up of k words)? Any thoughts?

Starting your split at the center may yield results faster. For example, for newspaper, you would first try splitting at 'news paper' or 'newsp aper'. As you can see, for this example, you would find your result on the first or second try. If you do not find a result, just search outwards. See the example for 'crossbow' below:
cros sbow
cro ssbow
cross bow

For the case with two words, the problem can be solved by just considering all possible ways of splitting the word into two, then checking each half to see if it's a valid word. If the input string has length n, there are only O(n) different ways of splitting the string, and if you store the dictionary in a structure supporting fast lookup (say, a trie or a hash table), each of those checks is cheap.
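For concreteness, here is a minimal sketch of that two-word check in Java, assuming the dictionary is already loaded into a HashSet (the dictionary contents below are made up for illustration):

import java.util.Set;

public class TwoWordSplit {
    // Returns true if word = first + second for two dictionary words.
    // There are O(n) split points, each checked with two expected O(1)
    // lookups (ignoring the cost of hashing the substrings themselves).
    static boolean isTwoWords(String word, Set<String> dictionary) {
        for (int i = 1; i < word.length(); i++) {
            if (dictionary.contains(word.substring(0, i)) &&
                dictionary.contains(word.substring(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("news", "paper", "new", "spa");
        System.out.println(isTwoWords("newspaper", dict)); // true ("news" + "paper")
    }
}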
The more interesting case is when you have k > 2 words to split the word into. For this, we can use a really elegant recursive formulation:
A word can be split into k words if it can be split into a word followed by a word splittable into k - 1 words.
The recursive base case would be that a word can be split into zero words only if it's the empty string, which is trivially true.
To use this recursive insight, we'll modify the original algorithm by considering all possible splits of the word into two parts. Once we have that split, we can check if the first part of the split is a word and if the second part of the split can be broken apart into k - 1 words. As an optimization, we don't recurse on all possible splits, but rather just on those where we know the first word is valid. Here's some sample code written in Java:
public static boolean isSplittable(String word, int k, Set<String> dictionary) {
    /* Base case: the empty string splits into exactly zero words, so the
     * recursion succeeds only when the string is empty and k is zero at
     * the same time.
     */
    if (word.isEmpty() || k == 0)
        return word.isEmpty() && k == 0;

    /* Generate all possible non-empty splits of the word into two parts,
     * recursing on problems where the first word is known to be valid.
     *
     * This loop is structured so that we always try pulling off at least one
     * letter from the input string so that we don't try splitting the word
     * into k pieces of which some are empty.
     */
    for (int i = 1; i <= word.length(); ++i) {
        String first = word.substring(0, i), last = word.substring(i);
        if (dictionary.contains(first) &&
            isSplittable(last, k - 1, dictionary))
            return true;
    }

    /* If we're here, then no possible split works in this case and we should
     * signal that no solution exists.
     */
    return false;
}
This code, in the worst case, runs in time O(n^k), because it tries to generate all possible partitions of the string into k different pieces. Of course, it's unlikely to hit this worst-case behavior, because most possible splits won't end up forming any words.

I'd first loop through the dictionary using a strpos(-like) function to check whether each word occurs in the input at all, then try to find a match among the results.
So it would do something like this:
Loop through the dictionary, strpos-ing every word and saving the hits into an array; let's say for "newspaper" that gives me the results 'new', 'paper', and 'news'.
Check if new+paper==newspaper, check if new+news==newspaper, etc., until you get to news+paper==newspaper, which matches.
I'm not sure it is a good method, but it seems more efficient than checking the word letter by letter (more iterations), and you didn't explain how you'd determine where the second word starts.
I don't know what you mean by 'how do you generalize it for k'.
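For what it's worth, a rough sketch of this substring-filter idea in Java, with indexOf playing the role of strpos (the dictionary contents are made up, and only the two-word case is handled):

import java.util.ArrayList;
import java.util.List;

public class SubstringFilter {
    static boolean isTwoWords(String target, List<String> dictionary) {
        // Pass 1: keep only dictionary words that occur in the target at all.
        List<String> candidates = new ArrayList<>();
        for (String w : dictionary) {
            if (target.indexOf(w) >= 0) {
                candidates.add(w);
            }
        }
        // Pass 2: try every ordered pair of candidates.
        for (String a : candidates) {
            for (String b : candidates) {
                if ((a + b).equals(target)) {
                    return true; // e.g. "news" + "paper" == "newspaper"
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("new", "news", "paper", "pape");
        System.out.println(isTwoWords("newspaper", dict)); // true
    }
}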

Related

Valid Permutations of a String

This question was asked to me in a recent Amazon technical interview. It goes as follows:
Given a string, e.g. "where am i", and a dictionary of valid words, you have to list all valid distinct permutations of the string. A valid string is composed of words that exist in the dictionary; for example, "we are him" and "whim aree" are valid strings, assuming the words (whim, aree) are part of the dictionary. A further condition is that a mere rearrangement of the original words is not a valid string, i.e. "i am where" is not a valid combination.
The task is to find all possible such strings in the optimum way.
As you have said, spaces don't count, so the input can be viewed as just a list of chars. The output is a permutation of words, so an obvious way is to find all valid words and then permute them.
The problem now becomes dividing a list of chars into subsets, each of which forms a word. You can find some answers here; the following is my version of a solution to this sub-problem.
If the dictionary is not large, we can iterate over it to
find the min_len/max_len of its words, to estimate how many words we may have, i.e. how deep we recurse;
convert each word into a char-count map to accelerate searching;
filter out the words containing an impossible char (i.e. a char our input doesn't have);
then, whenever a word's char bag is a subset of our input's, recurse on the remainder to find the next word.
The following is pseudocode:
int maxDepth = input.length / min_len;

void findWord(List<Map<Character, Integer>> filteredDict,
              Map<Character, Integer> input,
              List<Map<Character, Integer>> subsets, int level) {
    if (level < maxDepth) {
        for (Map<Character, Integer> word : filteredDict) {
            if (subset(input, word)) {
                subsets.add(word);
                findWord(filteredDict, removeSubset(input, word), subsets, level + 1);
                subsets.remove(subsets.size() - 1); // backtrack before the next candidate
            }
        }
    }
}
And then you can permute the words in a recursive function easily.
Technically speaking, this solution can be O(n^d), where n is the dictionary size and d is the max depth. But if the input is not large and complex, we can still solve it in feasible time.

Reconstructing a string of words using a dictionary into an English sentence

I am completely stumped. The question is: given a string like "thisisasentence" and a function isWord() that returns true if its argument is an English word, split the string into valid words. Committing to the first valid words I find, I would get stuck on "this is a sent".
How can I recursively return and keep track of where I am each time?
You need backtracking, which is easily achievable using recursion. The key observation is that you do not need to keep track of where you are past the moment when you are ready to return a solution.
You have a valid "split" when one of the following is true:
The string w is empty (base case), or
You can split non-empty w into substrings p and s, such that p+s=w, p is a word, and s can be split into a sentence (recursive call).
An implementation can return a list of words when a successful split is found, or null when one cannot be found. The base case always returns an empty list; the recursive case, upon finding a p, s split with a non-null result for s, constructs a list with p prefixed to the list returned from the recursive call.
The recursive case will have a loop in it, trying all possible prefixes of w. To speed things up a bit, the loop could terminate upon reaching the prefix that is equal in length to the longest word in the dictionary. For example, if the longest word has 12 characters, you know that trying prefixes 13 characters or longer will not result in a match, so you could cut enumeration short.
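Here is a sketch of this recursion in Java. isWord() is assumed to be given; the tiny word set below is only a stub for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SentenceSplit {
    static final Set<String> WORDS = Set.of("this", "is", "a", "sent", "sentence");

    static boolean isWord(String s) { return WORDS.contains(s); }

    // Returns a list of words whose concatenation is s, or null if no split
    // exists. The base case (empty string) returns an empty list.
    static List<String> split(String s) {
        if (s.isEmpty()) {
            return new ArrayList<>();
        }
        for (int i = 1; i <= s.length(); i++) {
            String p = s.substring(0, i);                  // candidate prefix
            if (isWord(p)) {
                List<String> rest = split(s.substring(i)); // try the suffix
                if (rest != null) {                        // suffix split worked
                    rest.add(0, p);
                    return rest;
                }
                // otherwise backtrack: e.g. "sent" is a word but "ence" is a
                // dead end, so we fall through and try a longer prefix
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(split("thisisasentence")); // [this, is, a, sentence]
    }
}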
Just adding to the answer above.
According to my experience, many people understand recursion better when they see a «linearized» version of a recursive algorithm, which means «implemented as a loop over a stack». Linearization is applicable to any recursive task.
Assuming that isWord() has two parameters (1st: string to test; 2nd: its length) and returns a boolean-compatible value, a C implementation of backtracking is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern int isWord(const char *s, int len); /* assumed to be provided */

void doSmth(char *phrase, int *words, int total) {
    int i;
    for (i = 0; i < total; ++i)
        printf("%.*s ", words[i + 1] - words[i], phrase + words[i]);
    printf("\n");
}

void parse(char *phrase) {
    int current, length, *words;
    if (phrase) {
        words = (int*)calloc((length = strlen(phrase)) + 2, sizeof(int));
        current = 1;
        while (current) {
            for (++words[current]; words[current] <= length; ++words[current])
                if (isWord(phrase + words[current - 1],
                           words[current] - words[current - 1])) {
                    words[current + 1] = words[current];
                    current++;
                }
            if (words[--current] == length)
                doSmth(phrase, words, current); /** parse successful! **/
        }
        free(words);
    }
}
As can be seen, for each word a pair of stack values is used: the first is the offset of the current word's first character, and the second is a potential offset of the character just past the current word's last one (thus the first character of the next word). The second value of the current word (the one whose pair is at the top of our «stack») is iterated through all characters left in the phrase.
When a word is accepted, a new second value (equal to the current one, so as to only look at positions after it) is pushed onto the stack, making the former second value the first of a new pair. If the current word (the one just found) completes the phrase, something useful is performed; see doSmth().
If there are no more acceptable words in the remaining part of our phrase, the current word is considered unsuitable, and its second value is discarded from the stack, effectively resuming the search at the previous starting location, with the ending location now farther along than the word previously accepted there.

Splitting a sentence to minimize sentence lengths

I have come across the following problem statement:
You have a sentence written entirely in a single row. You would like to split it into several rows by replacing some of the spaces with "new row" indicators. Your goal is to minimize the width of the longest row in the resulting text ("new row" indicators do not count towards the width of a row). You may replace at most K spaces.
You will be given a sentence and a K. Split the sentence using the procedure described above and return the width of the longest row.
I am a little lost as to where to start. It seems I need to consider every possible way of splitting the sentence with up to K breaks and take the best resulting width.
I can see a couple of edge cases:
There are <= K words in the sentence, therefore return the longest word.
The sentence length is 0, return 0
If neither of those criteria holds, then we have to determine all possible ways of splitting the sentence and return the minimum over all those options. This is the part I don't know how to do (and it is obviously the heart of the problem).
You can solve it by inverting the problem. Let's say I fix the length of the longest split to L. Can you compute the minimum number of breaks you need to satisfy it?
Yes, you just break before the first word that would go over L and count them up (O(N)).
So now that we have that, we just have to find the minimum L that requires at most K breaks. You can binary search over the length of the input, for a final complexity of O(N log N).
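A sketch of this approach in Java: a greedy feasibility check plus a binary search over the answer (the example sentence and K in main are illustrative):

public class MinWidth {
    // Number of breaks needed so that no row exceeds width L, or
    // Integer.MAX_VALUE if some single word is already longer than L.
    static int breaksNeeded(String[] words, int L) {
        int breaks = 0, row = -1; // row = width of the current row; -1 = empty
        for (String w : words) {
            if (w.length() > L) return Integer.MAX_VALUE;
            if (row < 0) {
                row = w.length();          // first word on the row
            } else if (row + 1 + w.length() <= L) {
                row += 1 + w.length();     // word fits after a space
            } else {
                breaks++;                  // break before this word
                row = w.length();
            }
        }
        return breaks;
    }

    static int minLongestRow(String sentence, int k) {
        if (sentence.isEmpty()) return 0;
        String[] words = sentence.split(" ");
        int lo = 1, hi = sentence.length();
        while (lo < hi) {                  // binary search the smallest feasible L
            int mid = (lo + hi) / 2;
            if (breaksNeeded(words, mid) <= k) hi = mid;
            else lo = mid + 1;
        }
        return lo;
    }

    public static void main(String[] args) {
        System.out.println(minLongestRow("this is a sentence", 1)); // 9
    }
}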
First Answer
What you want to achieve is Minimum Raggedness. If you just want the algorithm, it is here as a PDF. If the research paper's link goes bad, please search for the famous paper named Breaking Paragraphs into Lines by Knuth.
However, if you want to get your hands on some implementations of the same, see the question Balanced word wrap (Minimum raggedness) in PHP on SO, where people have given implementations not only in PHP but in C, C++ and Bash as well.
Second Answer
Though this is not exactly a correct approach, it is quick and dirty if you are looking for something like that. This method will not return the correct answer for every case; it is for those people for whom time to ship their product is more important.
Idea
You already know the length of your input string. Let's call it L;
When putting in K breaks, the best-case scenario would be to break the string into parts of exactly L / (K + 1) size;
So break your string at the word boundary that leaves each resulting row's length closest to L / (K + 1). A rough sketch follows below.
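Here is that heuristic sketched in Java (the names and the example input are mine; as noted above, the result is not guaranteed to be optimal):

import java.util.ArrayList;
import java.util.List;

public class GreedySplit {
    // Aim each row at target = L / (K + 1), breaking at whichever space
    // leaves the row length closest to the target.
    static List<String> greedyRows(String sentence, int k) {
        int target = sentence.length() / (k + 1);
        List<String> rows = new ArrayList<>();
        String rest = sentence;
        for (int breaks = 0; breaks < k && rest.contains(" "); breaks++) {
            int best = -1;
            for (int i = rest.indexOf(' '); i >= 0; i = rest.indexOf(' ', i + 1)) {
                if (best < 0 || Math.abs(i - target) < Math.abs(best - target)) {
                    best = i; // space whose prefix length is closest to target
                }
            }
            rows.add(rest.substring(0, best));
            rest = rest.substring(best + 1);
        }
        rows.add(rest);
        return rows;
    }

    public static void main(String[] args) {
        // [this is a, longer example, sentence]
        System.out.println(greedyRows("this is a longer example sentence", 2));
    }
}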
My recursive solution, which can be improved through memoization or dynamic programming:
def split(sentence, K):
    # No text left: the longest row has width 0.
    if not sentence:
        return 0
    # Nowhere to break, or no breaks left: the rest is a single row.
    if ' ' not in sentence or K == 0:
        return len(sentence)
    spaces = [i for i, s in enumerate(sentence) if s == ' ']
    res = float('inf')
    for space in spaces:
        # Breaking here leaves a prefix of width `space`; recurse on the rest.
        res = min(res, max(space, split(sentence[space + 1:], K - 1)))
    return res

clustering words based on their char set

Say there is a word set, and I would like to cluster the words based on their char bag (multiset). For example,
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.
To make it efficient, a naive way I can think of is to convert each word into a char-count key; for example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. Then I can build a dictionary and group words with the same key.
Two issues here: first, I have to sort the chars to build the key; second, the string key looks awkward, and its performance is not as good as a char/int key.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers: A -> 2, B -> 3, C -> 5, etc. will give "abba" == "aabb" == 36 (though a different letter-to-prime mapping may work better).
See my answer here.
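For illustration, a minimal version of the prime-product scheme in Java (this uses the naive a -> 2, b -> 3, ... mapping; note the product overflows a long for longer words, at which point you would switch to BigInteger):

public class PrimeHash {
    // One prime per letter. Mapping the most frequent letters to the
    // smallest primes keeps the products smaller.
    static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    static long hash(String word) {
        long h = 1;
        for (char c : word.toCharArray()) {
            h *= PRIMES[c - 'a']; // anagrams multiply the same primes
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abba"));               // 36
        System.out.println(hash("aabb"));               // 36
        System.out.println(hash("tea") == hash("eat")); // true
    }
}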
Since you are going to sort words, I assume all character ASCII values are in the range 0-255. Then you can do a counting sort over the words.
The counting sort takes time proportional to the length of the input word, and reconstructing the sorted string from it takes O(wordLen). You cannot make this step less than O(wordLen), because you have to iterate over the string at least once: there is no predefined order, and you cannot make any assumption about a word without looking at all of its characters. Traditional (comparison-based) sorting implementations give you O(n lg n); non-comparison ones give you O(n).
Iterate over all the words of the list and sort each one with this counting sort. Keep a map from each sorted word to the list of known words that map to it. Adding an element to a list takes constant time, so overall the complexity of the algorithm is O(n * avgWordLength).
Here is a sample implementation:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterGen {

    static String sortWord(String w) {
        int freq[] = new int[256];
        for (char c : w.toCharArray()) {
            freq[c]++;
        }
        StringBuilder sortedWord = new StringBuilder();
        // It is at most O(n)
        for (int i = 0; i < freq.length; ++i) {
            for (int j = 0; j < freq[i]; ++j) {
                sortedWord.append((char) i);
            }
        }
        return sortedWord.toString();
    }

    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> allClusters = new HashMap<String, List<String>>();
        for (String word : words) {
            String sortedWord = sortWord(word);
            List<String> cluster = allClusters.get(sortedWord);
            if (cluster == null) {
                cluster = new ArrayList<String>();
            }
            cluster.add(word);
            allClusters.put(sortedWord, cluster);
        }
        return allClusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
        System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));
    }
}
Returns
{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}
Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another occurrence of the current letter; a value of 0 means to move on to the next letter. Here's an example showing how this works:
Let's say we have a 7-letter alphabet (abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100
The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.
Here is the hash for "dada": 11000110000
It's worth noting that because there is a one-to-one correspondence between possible hashes and possible anagrams, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.
I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time(given you know the values of x and y) then you'll be able to solve the entire grouping problem in O(n).
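A sketch of this hash in Java, assuming a lowercase alphabet starting at 'a' and x + y <= 64 so the hash fits in a long:

public class UnaryHash {
    // Builds the (x + y)-bit hash described above: for each letter in
    // order, one '1' bit per occurrence, then a '0' bit as a separator.
    static long hash(String word, int alphabetSize) {
        int[] count = new int[alphabetSize];
        for (char c : word.toCharArray()) count[c - 'a']++;
        long h = 0;
        for (int letter = 0; letter < alphabetSize; letter++) {
            for (int k = 0; k < count[letter]; k++) {
                h = (h << 1) | 1; // another occurrence of this letter
            }
            h <<= 1;              // move on to the next letter
        }
        return h;
    }

    public static void main(String[] args) {
        // 7-letter alphabet (a-g), max word length 4:
        System.out.println(Long.toBinaryString(hash("fade", 7))); // 10001010100
        System.out.println(Long.toBinaryString(hash("dada", 7))); // 11000110000
    }
}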
I would do this in two steps. First, sort all your words according to their length and work on each length class separately (this avoids lots of collisions later).
The next step is harder, and there are many ways to do it. One of the simplest would be to assign every letter a number (a = 1, b = 2, etc., for example) and add up all the values for each word, thereby assigning each word an integer. Then you can sort the words according to this integer value, which drastically cuts the number you have to compare.
Depending on your data set you may still have a lot of collisions ("bad" and "cac" would generate the same integer hash), so you may want to set a threshold: if you have too many words in one bucket, repeat the previous step with another hash (just assigning different numbers to the letters). Unless someone has looked at your code and designed a word list to mess you up, this should cut the collisions to almost none.
Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.
One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".
Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.
So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words make a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new words goes into, or find the bucket an existing word is in.
Count the frequency of characters in each of the strings, then build a hash table keyed on the frequency table. So, for example, for the strings aczda and aacdz we get the key 20110000000000000000000001. Using the hash table we can partition all these strings into buckets in O(N).
26-bit integer as a hash function
If your alphabet isn't too large, for instance, just lower case English letters, you can define this particular hash function for each word: a 26 bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash.
Then just add them to a hash table. It will automatically be clustered by hash collisions.
It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words)
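A minimal sketch of such a char-set hash in Java (the names are mine). Keep in mind it clusters by character set, not char bag, so words like "rod" and "door" collide; exact char-bag grouping needs a second pass within each bucket:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CharSetClusters {
    // 26-bit mask: bit i is set iff letter ('a' + i) occurs in the word.
    static int charSetHash(String word) {
        int mask = 0;
        for (char c : word.toCharArray()) {
            mask |= 1 << (c - 'a');
        }
        return mask;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> clusters = new HashMap<>();
        for (String w : List.of("tea", "eat", "ate", "abba", "aabb", "hello")) {
            clusters.computeIfAbsent(charSetHash(w), k -> new ArrayList<>()).add(w);
        }
        // e.g. [[abba, aabb], [tea, eat, ate], [hello]] (order unspecified)
        System.out.println(clusters.values());
    }
}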

Palindrome detection efficiency

I got curious about Jon Limjap's interview mishap and started to look for efficient ways to do palindrome detection. I checked the palindrome golf answers, and it seems to me the answers contain only two algorithms: reversing the string, and checking from tail and head.
def palindrome_short(s):
    length = len(s)
    for i in xrange(0, length / 2):
        if s[i] != s[(length - 1) - i]: return False
    return True

def palindrome_reverse(s):
    return s == s[::-1]
I don't think either of these methods is used for the detection of exact palindromes in huge DNA sequences. I looked around a bit and didn't find any freely available article about what an ultra-efficient way for this might be.
A good way might be parallelizing the first version in a divide-and-conquer approach, assigning a pair of char arrays 1..n and length-1-n..length-1 to each thread or processor.
What would be a better way?
Do you know any?
Given only one palindrome, you will have to do it in O(N), yes. You can get more efficiency with multi-processors by splitting the string as you said.
Now say you want to do exact DNA matching. These strings are thousands of characters long, and they are very repetitive. This gives us the opportunity to optimize.
Say you split a 1000-char long string into 5 pairs of 100,100. The code will look like this:
isPal(w[0:100], w[-100:]) and isPal(w[100:200], w[-200:-100]) ...
etc... The first time you do these matches, you will have to process them. However, you can add all results you've done into a hashtable mapping pairs to booleans:
isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}
etc... this will take way too much memory, though. For pairs of 100,100, the hash map will have 2*4^100 elements. Say that you only store two 32-bit hashes of strings as the key, you will need something like 10^55 megabytes, which is ridiculous.
Maybe if you use smaller strings, the problem can be tractable. Then you'll have a huge hashmap, but at least palindrome for let's say 10x10 pairs will take O(1), so checking if a 1000 string is a palindrome will take 100 lookups instead of 500 compares. It's still O(N), though...
Another variant of your second function: we don't need to compare the whole string against its reverse, only the first half against the reversed second half.
def palindrome_reverse(s):
    l = len(s) / 2                 # Python 2 integer division
    return s[:l] == s[:-l - 1:-1]  # first half vs. reversed second half
Obviously, you're not going to be able to get better than O(n) asymptotic efficiency, since each character must be examined at least once. You can get better multiplicative constants, though.
For a single thread, you can get a speedup using assembly. You can also do better by examining data in chunks larger than a byte at a time, but this may be tricky due to alignment considerations. You'll do even better to use SIMD, if you can examine chunks as large as 16 bytes at a time.
If you wanted to parallelize it, you could divide the string of length L into N chunks, and have processor i compare the segment [i*L/(2N), (i+1)*L/(2N)) with the segment [L-(i+1)*L/(2N), L-i*L/(2N)); a sketch follows below.
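For instance, with Java streams the per-index comparisons can be spread across cores with very little code (a sketch, not a tuned implementation):

import java.util.stream.IntStream;

public class ParallelPalindrome {
    static boolean isPalindrome(String s) {
        int n = s.length();
        // Each comparison is independent, so the index range can be split
        // into parallel chunks; allMatch short-circuits on a mismatch.
        return IntStream.range(0, n / 2)
                        .parallel()
                        .allMatch(i -> s.charAt(i) == s.charAt(n - 1 - i));
    }

    public static void main(String[] args) {
        System.out.println(isPalindrome("steponnopets")); // true
        System.out.println(isPalindrome("newspaper"));    // false
    }
}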
There isn't, unless you do fuzzy matching. Which is probably what they do in DNA (I've done EST searching in DNA with Smith-Waterman, but that is obviously much harder than matching for a palindrome or a reverse complement in a sequence).
They are both O(N), so I don't think there is any particular efficiency problem with either of these solutions. Maybe I am not creative enough, but I can't see how it would be possible to compare N elements in fewer than N steps, so something like O(log N) is definitely not possible IMHO.
Parallelism might help, but it still wouldn't change the big-O rank of the algorithm, since it is equivalent to running it on a faster machine.
Comparing from the center is always much more efficient, since you can bail out early on a miss; it also allows you to do faster maximal-palindrome searches, regardless of whether you are looking for the maximal radius or for all non-overlapping palindromes.
The only real parallelization is if you have multiple independent strings to process. Splitting into chunks will waste a lot of work for every miss, and there are always many more misses than hits.
On top of what others said, I'd also add a few pre-check criteria for really large inputs:
quickly check whether the tail character matches the head character;
if NOT, just exit early by returning Boolean-False.
if (input-length < 4) {
    # the quick check just now already confirmed it's a palindrome
    return Boolean-True
} else if (200 < input-length) {
    # adjust this parameter to your preferences,
    # e.g. make it 20 for inputs longer than 8000, etc.,
    # or make it scale to input size, either logarithmically
    # or at a fixed ratio like 2.5%
    reverse the last (N = 4) characters/bytes of the input
    if that **DOES NOT** match the first N chars/bytes {
        return Boolean-False  # early exit: no point reversing the rest
                              # when head and tail don't even match
    } else {
        if N was substantial,
            trim the confirmed head and tail off the input to avoid
            duplicated work; remember to also update the variable(s)
            in which you've elected to track the input size
    }
    [optional 1: if the substring of N characters you've just checked
        happened to consist entirely of one repeated character, call it C,
        then locate the index position P of the first character that isn't C;
        if P == input-size, you've already proven the entire string is a
        nonstop repeat of a single character, which, by definition, must be
        a palindrome, so just return Boolean-True;
        but if P is past the half-way point, you've also proven it cannot
        possibly be a palindrome, because the second half contains a
        component that doesn't exist in the first half, so just return
        Boolean-False]
    [optional 2: for extremely long inputs, like over 200,000 chars,
        take the N chars right at the middle, and see if the reversed
        chunk matches; if that fails, exit early and save time]
}
if all pre-checks passed, then simply do it BAU style:
    reverse the second half of it,
    and see if it's the same as the first half
With Python, short code can be faster, since it pushes the load into the faster internals of the VM (and then there is the whole cache and other such things):
def ispalin(x):
    return all(x[a] == x[-a - 1] for a in xrange(len(x) >> 1))
You can use a hashtable to put the characters in, with a counter variable whose value increases every time you find an element that is not yet in the table/map, and decreases when you find an element that already is. For an odd-lettered string the counter should end back at 1, and for an even-lettered one it should hit 0. I hope this approach is right; note that it only inspects character counts, so at best it tests whether some rearrangement of the string could be a palindrome, not whether the string itself is one.
See the snippet below, where s refers to the string, e.g. String s = "abbcaddc";
Hashtable<Character, Integer> textMap = new Hashtable<Character, Integer>();
char charA[] = s.toCharArray();
int length = charA.length;
int count = 0;
for (int i = 0; i < charA.length; i++) {
    if (!textMap.containsKey(charA[i])) {
        textMap.put(charA[i], ++count);
    } else {
        textMap.put(charA[i], --count);
    }
}
if (length % 2 != 0) {
    if (count == 1)
        System.out.println("(odd case: palindrome)");
    else
        System.out.println("(odd case: not palindrome)");
} else {
    if (count == 0)
        System.out.println("(even case: palindrome)");
    else
        System.out.println("(even case: not palindrome)");
}
public class Palindrome {
    private static boolean isPalindrome(String s) {
        if (s == null)
            return false;   // uninitialized String? return false
        if (s.isEmpty())    // an empty String is a palindrome
            return true;
        // we want to check characters on opposite sides of the string,
        // stopping in the middle <divide and conquer>
        int left = 0;                // left-most char
        int right = s.length() - 1;  // right-most char
        while (left < right) {       // this elegantly handles odd-length strings
            if (s.charAt(left) != s.charAt(right)) // left char must equal right,
                return false;                      // else it's not a palindrome
            left++;   // converge left to right
            right--;  // converge right to left
        }
        return true;  // the while loop exited normally: it's a palindrome
    }
}
Though we are doing n/2 comparisons, it's still O(n).
This could also be done using threads, but the calculations get messy; it's best to avoid that. This code doesn't test for special characters and is case-sensitive. I have code that handles that, but this code can be modified to handle it easily.
