I have come across the following problem statement:
You have a sentence written entirely in a single row. You would like to split it into several rows by replacing some of the spaces
with "new row" indicators. Your goal is to minimize the width of the
longest row in the resulting text ("new row" indicators do not count
towards the width of a row). You may replace at most K spaces.
You will be given a sentence and a K. Split the sentence using the
procedure described above and return the width of the longest row.
I am a little lost about where to start. It seems I need to consider every possible way of splitting the sentence into at most K+1 rows and figure out which one minimizes the width of the longest row.
I can see a couple of edge cases:
There are K+1 or fewer words in the sentence (so at most K spaces): return the length of the longest word.
The sentence length is 0: return 0.
If neither of those applies, then we have to consider the possible ways of splitting the sentence and return the minimum over all those options. This is the part I don't know how to do (and is obviously the heart of the problem).
You can solve it by inverting the problem. Let's say I fix the length of the longest split to L. Can you compute the minimum number of breaks you need to satisfy it?
Yes, you just break before the first word that would go over L and count them up (O(N)).
So now that we have that, we just have to find the minimum L that requires at most K breaks. You can binary search on L over the length of the input. Final complexity O(N log N).
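To make this concrete, here is a minimal Python sketch of the binary search plus the greedy feasibility check described above (assuming words are separated by single spaces; the function names are mine):

    def max_row_width(sentence, K):
        words = [len(w) for w in sentence.split(' ')]

        def breaks_needed(L):
            # Greedily pack words into rows of width <= L and count breaks used.
            if max(words) > L:
                return float('inf')  # some single word doesn't fit at all
            breaks, row = 0, words[0]
            for w in words[1:]:
                if row + 1 + w <= L:      # +1 for the space kept inside the row
                    row += 1 + w
                else:                     # replace this space with a row break
                    breaks += 1
                    row = w
            return breaks

        lo, hi = max(words), len(sentence)
        while lo < hi:                    # smallest L needing at most K breaks
            mid = (lo + hi) // 2
            if breaks_needed(mid) <= K:
                hi = mid
            else:
                lo = mid + 1
        return lo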
First Answer
What you want to achieve is called Minimum Raggedness. If you just want the algorithm, it is here as a PDF. If the research paper's link goes bad, search for the famous paper Breaking Paragraphs into Lines by Knuth and Plass.
However, if you want to get your hands on some implementations, see the question Balanced word wrap (Minimum raggedness) in PHP on SO, where people have given implementations not only in PHP but in C, C++ and bash as well.
Second Answer
Though this is not an exact approach, it is quick and dirty if that is what you are looking for. This method will not return the correct answer for every case; it is for those for whom time to ship the product matters more than exactness.
Idea
You already know the length of your input string. Let's call it L;
When putting in K breaks, the best scenario would be to break the string into parts of exactly L / (K + 1) size;
So break your string at the word that makes the resulting part's length closest to L / (K + 1) (a sketch follows below).
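Here is one possible Python reading of that idea, greedily growing each row toward the L / (K + 1) target. This is a quick-and-dirty sketch in the same spirit as the answer, not an exact algorithm; the trailing merge loop is just a guard so that no more than K breaks are used:

    def quick_split_width(sentence, K):
        target = len(sentence) / (K + 1)   # ideal row width
        rows, row = [], ""
        for word in sentence.split(' '):
            candidate = word if not row else row + " " + word
            # keep growing the row while that brings it no further from the target
            if not row or abs(len(candidate) - target) <= abs(len(row) - target):
                row = candidate
            else:
                rows.append(row)
                row = word
        rows.append(row)
        while len(rows) > K + 1:           # never use more than K breaks
            rows[-2] = rows[-2] + " " + rows.pop()
        return max(len(r) for r in rows)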
My recursive solution, which can be improved through memoization or dynamic programming.
    def split(self, sentence, K):
        # Empty string: width 0.
        if not sentence:
            return 0
        # No space left to break at, or no breaks left: the rest is one row.
        if ' ' not in sentence or K == 0:
            return len(sentence)
        spaces = [i for i, s in enumerate(sentence) if s == ' ']
        res = len(sentence)  # upper bound: use no break at all
        # Try the first break at each space; the prefix then has width `space`.
        for space in spaces:
            res = min(res, max(space, self.split(sentence[space + 1:], K - 1)))
        return res
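Note that memoizing this recursion on the pair (start of the remaining suffix, K) gives only O(N·K) distinct states, each doing O(N) work, which brings the exponential running time down to O(N²·K).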
Related
I'm working on the LCS problem using dynamic programming. I'm having trouble deriving the DP solution myself without looking at the solution.
I currently reason that given two strings, P and Q:
We can enumerate through all subsequences of P, which is of size 2^n.
We can also enumerate through all subsequences of Q, which is of size 2^m.
So, if we want to check for shared subsequences, the run time would be O(2^n * 2^m) or O(2^(n+m)).
I don't understand how we can go from this brute-force solution to the dynamic programming solution. What's the logic for jumping straight to the subsolution table from this point?
I understand that we need to identify overlapping subsolutions, but I can't find a good explanation of how to identify them and then move on to the subsolution table.
Let me know if this question makes sense.
Here's the basic idea that creates the magic for this algorithm.
Consider 2 strings S1 and S2,
S1 = c1 c2 c3 ... cm, length = m
and
S2 = b1 b2 b3 ... bn, length = n
Say you have a function LCS(arg1,arg2), where
arg1 = (S1, m), the string S1 of length m
and
arg2 = (S2, n), the string S2 of length n
and LCS(arg1,arg2) will give us the length of longest common subsequence for the 2 arguments.
Now suppose that the last character of both strings is same.
bn = cm
And suppose no other pair of characters matches. This means that:
LCS(arg1,arg2) = 1 (last character) + 0 (remaining strings)
Now if you have understood the above equation, it's clear that if, instead of nothing else matching, we do have something matching in the remaining strings, then:
LCS(arg1,arg2) = 1 (last character) + LCS(arg1 - cm, arg2 - bn) (remaining strings)
But if the last two characters do not match, then we have to try dropping the last character of one string or the other, and that's why we have the following when the last characters do not match:
LCS(arg1,arg2) = max(LCS(arg1 - cm,arg2) , LCS(arg1 ,arg2 - bn))
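These two equations translate almost literally into code. Below is a Python sketch of the recurrence, memoized so that overlapping subproblems (which are the whole point of the DP) are computed only once; lcs(m, n) stands for the LCS length of the first m characters of s1 and the first n characters of s2:

    from functools import lru_cache

    def lcs_length(s1, s2):
        @lru_cache(maxsize=None)
        def lcs(m, n):
            if m == 0 or n == 0:
                return 0
            if s1[m - 1] == s2[n - 1]:                 # last characters match
                return 1 + lcs(m - 1, n - 1)
            return max(lcs(m - 1, n), lcs(m, n - 1))   # drop one last character
        return lcs(len(s1), len(s2))

    print(lcs_length("AGGTAB", "AXTXAYB"))  # 4, for the subsequence "GTAB"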
In the Longest Common Subsequence (LCS) problem, why do we match the last characters of the strings? For example:
Consider the input strings “AGGTAB” and “AXTXAYB”. Last characters match for the strings. So length of LCS can be written as:
L(“AGGTAB”, “AXTXAYB”) = 1 + L(“AGGTA”, “AXTXAY”)
Wouldn't the algorithm still produce the optimal result if we matched the first characters of the strings? For example:
Consider the input strings “AGGTAB” and “AXTXAYB”. First characters match for the strings. So length of LCS can be written as:
L(“AGGTAB”, “AXTXAYB”) = 1 + L(“GGTAB”, “XTXAYB”)
LCS problem : Longest Common Subsequence Problem
Yes, this is the same thing.
Computing the LCS of the two reversed sequences gives the reverse of the LCS of the original sequences. In other words,
REVERSE(LCS(A,B)) = LCS(REVERSE(A), REVERSE(B))
Assuming the LCS recursion reduces from the end, the operation on the right reduces from the opposite end but achieves the same result.
That's why you can work with prefixes in the same way that they work with suffixes in the explanation: you would get the same kind of recursive reduction in the process.
Moreover, you can do reductions on both ends if you wish. However, this would complicate the algorithm a lot without giving you any speed up in return.
Well, it turns out that you can directly use the length variables (say M, N) provided by the user in the recursion if you perform LCS from the end. On the other hand, you have to create extra index variables if you start from the front. That's why the former method is considered standard; otherwise there is no complexity difference and everything is the same.
LCS(M, N)
{
    if (M == 0 || N == 0)
        return 0;
    else if (a[M] != b[N])
        return max(LCS(M, N-1), LCS(M-1, N));
    else
        return 1 + LCS(M-1, N-1);
}
Yes, you could do that; it will not change the time complexity. Starting from the end is just a matter of convention. A sketch of the front-index variant follows.
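For comparison, here is a Python sketch of the variant that reduces from the front; i and j are exactly the extra index variables the answer above mentions:

    def lcs_from_front(a, b, i=0, j=0):
        # Same recurrence, reducing from the front instead of the back.
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + lcs_from_front(a, b, i + 1, j + 1)
        return max(lcs_from_front(a, b, i + 1, j), lcs_from_front(a, b, i, j + 1))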
A rotated palindrome is a string like "1234321" or its rotation "3432112".
The naive method is to cut the string at each position, concatenate the two pieces back in swapped order, and check whether the result is a palindrome.
That takes O(n^2), since there are n cuts and each palindrome check costs O(n).
I'm wondering if there's a better solution than this.
I guess so; please advise.
Thanks!
According to the Wikipedia article below, for each string S of length n it is possible to compute in O(n) time an array A of the same size, such that:
A[i] == 1 iff the prefix of S of length i is a palindrome.
http://en.wikipedia.org/wiki/Longest_palindromic_substring
The algorithm should be possible to find in:
Manacher, Glenn (1975), "A new linear-time "on-line" algorithm for
finding the smallest initial palindrome of a string"
In other words we can check which prefixes of the string are palindromes in linear time. We will use this result to solve the proposed problem.
Each (non-rotated) palindrome S has the following form: S = p s x s^R p^R,
where "x" is the center of the palindrome (either the empty string or a one-letter string),
"p" and "s" are (possibly empty) strings, and "s^R" means the string "s" reversed.
Each rotating palindrome created from this string has one of the two following forms (for some p):
sxs^Rp^Rp
p^Rpsxs^R
This is true because you can choose whether to cut some substring before or after the middle of the palindrome and then paste it on the other end.
As one can see, the substrings "p^Rp" and "sxs^R" are both palindromes; "p^Rp" always has even length, while "sxs^R" has odd length iff S has odd length.
We can use the algorithm mentioned in the Wikipedia link to create two arrays A and B: A is created by checking which prefixes are palindromes, B by checking suffixes. Then we search for an index i such that the prefix of length i and the remaining suffix are both palindromes, with at least one of the two parts of even length. Such an index exists iff the proposed string is a rotated palindrome; the even part is the "p^Rp" substring, so we can easily recover the original palindrome by moving half of it to the other end of the string. A sketch follows.
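To illustrate the search (not the linear-time version), here is a small Python sketch; the naive is_pal check stands in for the Manacher-based arrays A and B, so this sketch runs in O(n^2) rather than O(n):

    def is_rotated_palindrome(s):
        is_pal = lambda t: t == t[::-1]   # stand-in for the precomputed arrays
        for i in range(len(s) + 1):
            prefix, suffix = s[:i], s[i:]
            # one of the two parts must have even length (it plays the p^R p role)
            if (is_pal(prefix) and is_pal(suffix)
                    and (len(prefix) % 2 == 0 or len(suffix) % 2 == 0)):
                return True
        return False

    print(is_rotated_palindrome("3432112"))  # True: "343" + "2112"
    print(is_rotated_palindrome("1121"))     # False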
One remark on the solution by rks: it doesn't work as stated, since for the string S = 1121 it creates the string 11211121, which contains a palindrome of length greater than or equal to len(S), yet S is not a rotated palindrome. If we change the solution to check whether there exists a palindrome of length exactly equal to len(S), it would work, but I don't see a direct way to change the longest-palindromic-substring search so that it looks for a palindromic substring of a fixed length (len(S)).
(I didn't write this as a comment under the solution because I'm new to Stack Overflow and don't have enough reputation to do so.)
Second remark: I'm sorry not to have included Manacher's algorithm; if someone has a link to either the idea of the algorithm or an implementation, please include it in the comments.
Concatenate the string to itself, then do the classical palindrome search in the new string. If you find a palindrome whose length is greater than or equal to the length of your original string, you know your string is a rotated palindrome.
For your example, you would search in 34321123432112, finding 21123432112, which is longer than your initial string, so it's a rotated palindrome.
EDIT: as Richard Stefanec noted, my algorithm fails on 1121; he proposed replacing the >= test on the length with =.
EDIT2: it should be noted that finding a palindrome of a given size isn't obviously easy. Read the discussion under Richard Stefanec's post for more information.
#Given a string, check if it is a rotation of a palindrome.
#For example the function should return True for "aab", as it is a rotation of "aba".
def check_palindrome(string1):
    # Try every rotation, not just the rotation by one place: rotating only
    # once fails on inputs such as "aba", which is already a palindrome.
    for shift in range(len(string1)):
        rotated = string1[shift:] + string1[:shift]
        if rotated == rotated[::-1]:
            return True
    return False

string1 = input("Enter the first string: ")
if check_palindrome(string1):
    print("Some rotation of the string is a palindrome")
else:
    print("No rotation of the string is a palindrome")
I would like to propose one simple solution, using only conventional algorithms. It will not solve any harder problem, but it should be sufficient for your task. It is somewhat similar to the other two proposed solutions, but none of them seems to be concise enough for me to read carefully.
First step: concatenate the string to itself (abvc - > abvcabvc) as in all other proposed solutions.
Second step: do the Rabin-Karp precalculation (which uses a rolling hash) on the newly obtained string and on its reverse.
Third step: let the string have length n. For each index i in 0...n-1, check in constant time whether the substring [i, i + n - 1] of the doubled string is a palindrome, using the Rabin-Karp precalculations (basically, the hash values obtained for the substring in the forward and the reversed direction should be equal).
Conclusion: if third step found any palindrome - then the string is rotated palindrome. If not - then it is not.
PS: Rabin-Karp uses hashes, and collisions are possible even for non-identical strings. Thus it is a good idea to verify with a brute-force equality check whenever the hashes match. If the hash functions used in Rabin-Karp are good, the amortized running time of the solution remains O(n).
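A possible Python sketch of this approach follows. It uses a single large modulus with an explicit verification on hash hits, as the PS suggests (two independent hash functions would be an alternative); the base and modulus values are my own choices:

    def is_rotated_palindrome_rk(s, base=257, mod=(1 << 61) - 1):
        n = len(s)
        if n == 0:
            return True
        d = s + s        # doubled string: every rotation is a window of length n
        r = d[::-1]

        def prefix_hashes(t):
            h = [0]
            for ch in t:
                h.append((h[-1] * base + ord(ch)) % mod)
            return h

        hf, hr = prefix_hashes(d), prefix_hashes(r)
        pw = pow(base, n, mod)

        def window(h, i):    # rolling hash of t[i : i + n]
            return (h[i + n] - h[i] * pw) % mod

        for i in range(n):
            # the reverse of d[i : i + n] is r[n - i : 2n - i]
            if window(hf, i) == window(hr, n - i):
                rot = d[i:i + n]
                if rot == rot[::-1]:   # guard against hash collisions
                    return True
        return False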
You can append a copy of the pattern to the end of the original pattern. For example, if the pattern is 1234321, appending it to itself gives 12343211234321. After doing this, you can use KMP or another substring-matching algorithm to look for the string you want; if it matches, return true. (The same caveat as in the answer above applies: unless the match is required to have length exactly len(S), this test can report false positives.)
Given a dictionary, find out if a given word can be made by concatenating two words from the dictionary. For example, given "newspaper" you have to find whether it can be made from two words ("news" and "paper" in this case). The only thing I can think of is starting from the beginning and checking whether the current prefix is a word: check "n", "ne", "new", "news", ..., and for each valid prefix check whether the remaining part is also a valid word.
Also, how do you generalize it to k (i.e., a word made up of k words)? Any thoughts?
Starting your split at the center may yield results faster. For example, for newspaper, you would first try splitting at 'news paper' or 'newsp aper'. As you can see, for this example, you would find your result on the first or second try. If you do not find a result, just search outwards. See the example for 'crossbow' below:
cros sbow
cro ssbow
cross bow
For the case with two words, the problem can be solved by considering all possible ways of splitting the word into two, then checking each half to see if it's a valid word. If the input string has length n, there are only O(n) different splits. If you store the dictionary words in a structure supporting fast lookup (say, a trie or a hash table), each of those checks is cheap.
The more interesting case is when you have k > 2 words to split the word into. For this, we can use a really elegant recursive formulation:
A word can be split into k words if it can be split into a word followed by a word splittable into k - 1 words.
The recursive base case would be that a word can be split into zero words only if it's the empty string, which is trivially true.
To use this recursive insight, we'll modify the original algorithm by considering all possible splits of the word into two parts. Once we have that split, we can check if the first part of the split is a word and if the second part of the split can be broken apart into k - 1 words. As an optimization, we don't recurse on all possible splits, but rather just on those where we know the first word is valid. Here's some sample code written in Java:
public static boolean isSplittable(String word, int k, Set<String> dictionary) {
    /* Base case: the empty string splits into exactly zero words, so we
     * succeed iff the string is empty and k is zero at the same time.
     */
    if (word.isEmpty() || k == 0)
        return word.isEmpty() && k == 0;

    /* Generate all possible non-empty splits of the word into two parts,
     * recursing on problems where the first word is known to be valid.
     *
     * This loop is structured so that we always try pulling off at least one
     * letter from the input string so that we don't try splitting the word
     * into k pieces of which some are empty.
     */
    for (int i = 1; i <= word.length(); ++i) {
        String first = word.substring(0, i), last = word.substring(i);
        if (dictionary.contains(first) &&
            isSplittable(last, k - 1, dictionary))
            return true;
    }

    /* If we're here, then no possible split works in this case and we should
     * signal that no solution exists.
     */
    return false;
}
This code, in the worst case, runs in time O(n^k), because it can end up trying all possible partitions of the string into k pieces. Of course, it's unlikely to hit this worst-case behavior, because most possible splits won't form any valid words.
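The exponential worst case disappears if you memoize on the pair (start position in the word, remaining k), since there are only O(n·k) such states. Here is a Python sketch of the same recursion with memoization (the function names are mine):

    from functools import lru_cache

    def is_splittable(word, k, dictionary):
        words = frozenset(dictionary)

        @lru_cache(maxsize=None)
        def rec(start, k):
            # Success iff we consumed the whole word with exactly k pieces.
            if start == len(word) or k == 0:
                return start == len(word) and k == 0
            return any(word[start:i] in words and rec(i, k - 1)
                       for i in range(start + 1, len(word) + 1))

        return rec(0, k)

    print(is_splittable("newspaper", 2, {"new", "news", "paper"}))  # True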
I'd first loop through the dictionary using a strpos(-like) function to check whether each word occurs in the target at all. Then try to find a pair of matches that covers it.
So it would do something like this:
Loop through the dictionary, strpos-ing every word and saving the hits into an array; let's say that gives me the results 'new', 'paper', and 'news'.
Check if new+paper == newspaper, check if new+news == newspaper, etc., until you get to news+paper == newspaper, which matches.
Not sure if it is a good method, but it seems more efficient than checking letter by letter (more iterations), and you didn't explain how you'd detect where the second word starts.
I don't know what you mean by 'how do you generalize it for k'.
Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,
minHamming(ABCDEFGHIJ, CDEFGG) == 1.
minHamming(ABCDEFGHIJ, BCDGHI) == 3.
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to beat that bound? I search the same L with several different S strings, so doing some complicated preprocessing of L once is acceptable.
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve the corresponding vectors to determine the match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using pointwise multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT, each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite-field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.
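For reference, here is a small numpy-based Python sketch of the indicator-vector idea (numpy's floating-point FFT is used for convenience here; Benson's paper uses finite-field FFTs instead):

    import numpy as np

    def min_hamming_fft(L, S, alphabet="ACGT"):
        n, m = len(L), len(S)
        size = 1
        while size < n + m:          # FFT length with room for full convolution
            size *= 2
        matches = np.zeros(n - m + 1)
        for ch in alphabet:
            a = np.array([1.0 if c == ch else 0.0 for c in L])
            # reverse the pattern's indicator so convolution computes correlation
            b = np.array([1.0 if c == ch else 0.0 for c in reversed(S)])
            conv = np.fft.irfft(np.fft.rfft(a, size) * np.fft.rfft(b, size), size)
            # conv[i + m - 1] counts positions where S matches L at offset i
            matches += conv[m - 1:n]
        return int(round(m - matches.max()))

    print(min_hamming_fft("ABCDEFGHIJ", "CDEFGG", alphabet="ABCDEFGHIJ"))  # 1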
Modified Boyer-Moore
I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:
current_dist = 0
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break
    pattern_pos -= 1
    text_pos -= 1

smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift
    pattern_pos = first_mismatch
    text_pos = tp
...
If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.
The whole implementation is rather long (~150LOC), but I can post it on request. The core idea is outlined above, everything else is standard Boyer-Moore.
Preprocessing on the Text
Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.
import sys
from collections import defaultdict
import bisect

def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)
This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
The comparison loop is more or less unchanged compared to the naive O(mn) approach, apart from the fact that we do not advance the starting position by 1 each time, but based on the character positions:
def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)
    i = find_next_pos(pattern, pos, 0)
    while i <= len(text) - len(pattern):   # include the final alignment
        dist = 0
        for c in range(len(pattern)):
            if text[i + c] != pattern[c]:
                dist += 1
                if dist == best:
                    break
        else:
            if dist == 0:
                return 0
        best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)
    return best
The actual improvement is in find_next_pos:
def find_next_pos(pattern, pos, i):
    smallest = sys.maxsize   # sys.maxint in Python 2
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest
For each new position, we find the lowest index at which a character from S occurs in L. If there is no such index any more, the algorithm will terminate.
find_next_pos is certainly complex, and one could try to improve it by only using the first several characters of the pattern S, or use a set to make sure characters from the pattern are not checked twice.
Discussion
Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger will be the jumps. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.
DNA
If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
You're stuck as far as big-O is concerned: at a fundamental level, you need to test whether every letter in the target matches each eligible letter in the substring.
Luckily, this is easily parallelized.
One optimization you can apply is to keep a running count of mismatches for the current position. If it exceeds the lowest Hamming distance found so far, you can skip to the next possibility.
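In Python, that running-count optimization over the naive scan might look like the following sketch:

    def min_hamming_naive(L, S):
        best = len(S)
        for i in range(len(L) - len(S) + 1):
            dist = 0
            for a, b in zip(L[i:i + len(S)], S):
                if a != b:
                    dist += 1
                    if dist >= best:   # can't beat the best so far; bail out
                        break
            best = min(best, dist)
            if best == 0:              # exact match found, can't do better
                return 0
        return best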