Time Complexity of Text Justification with Dynamic Programming

I've been working on a dynamic programming problem involving the justification of text. I believe that I have found a working solution, but I am confused regarding this algorithm's runtime.
The research I have done thus far describes dynamic programming solutions to this problem as O(N^2), with N the length of the text being justified. To me, this feels incorrect: I can see that O(N) calls must be made, because there are N suffixes to check. However, for any given prefix we will never consider placing the newline (or 'split_point') beyond the maximum line length L, so for any given piece of text there are at most L positions for the split point (assuming the worst case, where each word is exactly one character long). Because of this, isn't this algorithm more accurately described as O(LN)?
import math
import sys

@memoize  # assumes a memoize decorator is defined elsewhere
def justify(text, line_length):
    # If the text fits within the line length, do not split
    if len(' '.join(text)) < line_length:
        return [], math.pow(line_length - len(' '.join(text)), 3)
    best_cost, best_splits = sys.maxsize, []
    # Iterate over the text and consider putting a split between each word
    for split_point in range(1, len(text)):
        length = len(' '.join(text[:split_point]))
        # This split exceeded the maximum line length: all later split points are unacceptable
        if length > line_length:
            break
        # Recursively compute the best split points of the text after this point
        future_splits, future_cost = justify(text[split_point:], line_length)
        cost = math.pow(line_length - length, 3) + future_cost
        if cost < best_cost:
            best_cost = cost
            best_splits = [split_point] + [split_point + n for n in future_splits]
    return best_splits, best_cost
Thanks in advance for your help,
Ethan

First of all, your implementation is going to be far, far from the theoretical efficiency that you want. You are memoizing on a string of length N in your call, which means that looking up a cached copy of your data is potentially O(N). Now start making multiple cached calls and you've blown your complexity budget.
This is fixable by moving the text outside of the function call and just passing around the index of the starting position along with the length L. You are also doing a join inside of your loop, which is an O(L) operation; with some care you can make that an O(1) operation instead.
With that done, you would be doing O(N*L) operations. For exactly the reasons you thought.
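A sketch of that refactoring, assuming Python with functools.lru_cache for the memoization and a prefix-sum array to make line widths O(1) (the names here are my own illustration, not from the question):

```python
import sys
from functools import lru_cache

def justify(words, line_length):
    # Prefix sums of word lengths, so any line's width
    # (words plus single spaces) is an O(1) lookup instead of a join.
    prefix = [0]
    for w in words:
        prefix.append(prefix[-1] + len(w))

    def width(i, j):
        # Width of words[i:j] joined by single spaces.
        return (prefix[j] - prefix[i]) + (j - i - 1)

    @lru_cache(maxsize=None)
    def best(i):
        # Returns (cost, split indices) for the suffix words[i:].
        if width(i, len(words)) <= line_length:
            # The whole tail fits on one line: no split needed.
            return (line_length - width(i, len(words))) ** 3, ()
        best_cost, best_splits = sys.maxsize, ()
        for j in range(i + 1, len(words)):
            w = width(i, j)
            if w > line_length:
                break  # every later split point is too long as well
            future_cost, future_splits = best(j)
            cost = (line_length - w) ** 3 + future_cost
            if cost < best_cost:
                best_cost, best_splits = cost, (j,) + future_splits
        return best_cost, best_splits

    cost, splits = best(0)
    return list(splits), cost
```

Because the cache is keyed on a single integer index, lookups are O(1), and each of the N states scans at most L split points.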

Related

Creating a Non-greedy LZW algorithm

Basically, I'm doing an IB Extended Essay for Computer Science, and was thinking of using a non-greedy implementation of the LZW algorithm. I found the following links:
https://pdfs.semanticscholar.org/4e86/59917a0cbc2ac033aced4a48948943c42246.pdf
http://theory.stanford.edu/~matias/papers/wae98.pdf
And have been operating under the assumption that the algorithm described in paper 1 and the LZW-FP in paper 2 are essentially the same. Either way, tracing the pseudocode in paper 1 has been a painful experience that has yielded nothing, and in the words of my teacher "is incredibly difficult to understand." If anyone can figure out how to trace it, or happens to have studied the algorithm before and knows how it works, that'd be a great help.
Note: I refer to what you call "paper 1" as Horspool 1995 and "paper 2" as Matias et al 1998. I only looked at the LZW algorithm in Horspool 1995, so if you were referring to the LZSS algorithm this won't help you much.
My understanding is that Horspool's algorithm is what the authors of Matias et al 1998 call "LZW-FPA", which is different from what they call "LZW-FP"; the difference has to do with the way the algorithm decides which substrings to add to the dictionary. Since "LZW-FP" adds exactly the same substrings to the dictionary as LZW would add, LZW-FP cannot produce a longer compressed sequence for any string. LZW-FPA (and Horspool's algorithm) add the successor string of the greedy match at each output cycle. That's not the same substring (because the greedy match doesn't start at the same point as it would in LZW) and therefore it is theoretically possible that it will produce a longer compressed sequence than LZW.
Horspool's algorithm is actually quite simple, but it suffers from the fact that there are several silly errors in the provided pseudo-code. Implementing the algorithm is a good way of detecting and fixing these errors; I put an annotated version of the pseudocode below.
LZW-like algorithms decompose the input into a sequence of blocks. The compressor maintains a dictionary of available blocks (with associated codewords). Initially, the dictionary contains all single-character strings. It then steps through the input, at each point finding the longest prefix at that point which is in its dictionary. Having found that block, it outputs its codeword, and adds to the dictionary the block with the next input character appended. (Since the block found was the longest prefix in the dictionary, the block plus the next character cannot be in the dictionary.) It then advances over the block, and continues at the next input point (which is the last character of the string it just added to the dictionary).
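In code, the greedy loop just described might look like this (my own minimal sketch with integer dictionary indices as codewords; plain LZW, not Horspool's variant):

```python
def lzw_compress(data):
    # Dictionary initially holds every single character of the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    output = []
    pos = 0
    while pos < len(data):
        # Greedily extend the block while the extension is in the dictionary.
        block = data[pos]
        while (pos + len(block) < len(data)
               and block + data[pos + len(block)] in dictionary):
            block += data[pos + len(block)]
        output.append(dictionary[block])
        # Add block + next character: it cannot already be present,
        # since block was the longest match.
        if pos + len(block) < len(data):
            dictionary[block + data[pos + len(block)]] = len(dictionary)
        pos += len(block)
    return output
```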
Horspool's modification also finds the longest prefix at each point, and also adds that prefix extended by one character into the dictionary. But it does not immediately output that block. Instead, it considers prefixes of the greedy match, and for each one works out what the next greedy match would be. That gives it a candidate extent of two blocks; it chooses the extent with the best advance. In order to avoid using up too much time in this search, the algorithm is parameterised by the number of prefixes it will test, on the assumption that much shorter prefixes are unlikely to yield longer extents. (And Horspool provides some evidence for this heuristic, although you might want to verify that with your own experimentation.)
In Horspool's pseudocode, α is what I call the "candidate match" -- that is, the greedy match found at the previous step -- and βj is the greedy successor match for the input point after the jth prefix of α. (Counting from the end, so β0 is precisely the greedy successor match of α, with the result that setting K to 0 will yield the LZW algorithm. I think Horspool mentions this fact somewhere.) L is just the length of α. The algorithm will end up using some prefix of α, possibly (usually) all of it.
Here's Horspool's pseudocode from Figure 2 with my annotations:
initialize dictionary D with all strings of length 1;
set α = the string in D that matches the first
        symbol of the input;
set L = length(α);
while more than L symbols of input remain do
begin
  // The new string α++head(β0) must be added to D here, rather
  // than where Horspool adds it. Otherwise, it is not available for the
  // search for a successor match. Of course, head(β0) is not meaningful here
  // because β0 doesn't exist yet, but it's just the symbol following α in
  // the input.
  for j := 0 to max(L-1,K) do
    // The above should be min(L - 1, K), not max.
    // (Otherwise, K would be almost irrelevant.)
    find βj, the longest string in D that matches
        the input starting L-j symbols ahead;
  add the new string α++head(β0) to D;
  // See above; the new string must be added before the search
  set j = value of j in range 0 to max(L-1,K)
          such that L - j + length(βj) is a maximum;
  // Again, min rather than max
  output the index in D of the string prefix(α,j);
  // Here Horspool forgets that j is the number of characters removed
  // from the end of α, not the number of characters in the desired prefix.
  // So j should be replaced with L - j
  advance j symbols through the input;
  // Again, the advance should be L - j, not j
  set α = βj;
  set L = length(α);
end;
output the index in D of string α;

Calculating the hash of any substring in logarithmic time

Question came up in relation to this article:
https://threads-iiith.quora.com/String-Hashing-for-competitive-programming
The author presents this algorithm for hashing a string:
F(n) = ∑_{i=0 to n} S_i ∗ p^i
where S is our string, S_i is the character at index i, and p is a prime number we've chosen.
He then presents the problem of determining whether a substring of a given string is a palindrome and claims it can be done in logarithmic time through hashing.
He makes the point that we can calculate the hash from the beginning of our whole string to the right edge R of our substring:
F(R) = ∑_{i=0 to R} S_i ∗ p^i
and observes that if we calculate the hash from the beginning to the left edge of our substring (F(L-1)), the difference between this and our hash to the right edge is basically the hash of our substring:
F(R) − F(L−1) = ∑_{i=L to R} S_i ∗ p^i
This is all fine, and I think I follow it so far. But he then immediately makes the claim that this allows us to calculate our hash (and thus determine if our substring is a palindrome by comparing this hash with the one generated by moving through our substring in reverse order) in logarithmic time.
I feel like I'm probably missing something obvious but how does this allow us to calculate the hash in logarithmic time?
You already know that you can calculate the difference in constant time. Let me restate the difference (leaving out the modulo for clarity):
diff = ∑_{i=L to R} S_i ∗ p^i
Note that this is not the hash of the substring because the powers of p are offset by a constant. Instead, this is (as stated in the article)
diff = Hash(S[L,R])∗p^L
To derive the hash of the substring, you have to multiply the difference with p^-L. Assuming that you already know p^-1 (this can be done in a preprocessing step), you need to calculate (p^-1)^L. With the square-and-multiply method, this takes O(log L) operations, which is probably what the author refers to.
This may become more efficient if your queries are sorted by L. In this case, you could calculate p^-L incrementally.
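A sketch of the whole scheme in Python (the base and modulus are my own choices, since the article leaves them open; pow(P, -L, MOD) requires Python 3.8+ and performs exactly the square-and-multiply modular inverse mentioned above):

```python
MOD = 10**9 + 7  # a large prime modulus (an assumption for this sketch)
P = 31           # the chosen prime base

def prefix_hashes(s):
    # hashes[n] = F(n) = sum of s[i] * P^i for i <= n, mod MOD
    hashes, h, power = [], 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % MOD
        hashes.append(h)
        power = power * P % MOD
    return hashes

def substring_hash(hashes, left, right):
    # diff = Hash(S[L,R]) * P^L; multiply by P^-L to normalize.
    diff = hashes[right] - (hashes[left - 1] if left > 0 else 0)
    # pow(P, -left, MOD) uses square-and-multiply internally: O(log L).
    return diff * pow(P, -left, MOD) % MOD
```

For example, the normalized hash of the substring at [1, 3] of "abcba" equals the hash of "bcb" computed from scratch.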

Complexity of binary search on a string

I have a sorted array of strings, e.g. ["bar", "foo", "top", "zebra"], and I want to search whether an input word is present in the array or not.
eg:
search(String[] str, String word) {
    // binary search implemented + string comparison.
}
Now binary search will account for O(log n) complexity, where n is the length of the array. So far so good.
But at some point we need to do a string compare, which can be done in linear time.
The input array can contain words of different sizes. So when I am calculating the final complexity, will the answer be O(m*log n), where m is the size of the word we want to search for, in our case "zebra"?
Yes, both your thinking and your proposed solution are correct. You need to consider the length of the longest String in the overall complexity of String searching too.
A trivial String compare is an O(m) operation, where m is the length of the larger of the two strings.
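A sketch of this baseline in Python, with the character comparison written out so the O(m) cost per probe is visible (my own illustration, giving O(m*log n) overall):

```python
def compare(a, b):
    # Character-by-character comparison: O(min(len(a), len(b))) work.
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared characters match; the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))

def binary_search(words, target):
    lo, hi = 0, len(words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = compare(words[mid], target)
        if c == 0:
            return mid
        if c < 0:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1  # not present
```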
But, we can improve a lot, given that the array is sorted. As user "doynax" suggests,
Complexity can be improved by keeping track of how many characters got matched during
the string comparisons, and store the present count for the lower and
upper bounds during the search. Since the array is sorted we know that
the prefix of the middle entry to be tested next must match up to at
least the minimum of the two depths, and therefore we can skip
comparing that prefix. In effect we're always either making progress
or stopping the incremental comparisons immediately on a mismatch, and
thereby never needing to keep going over old ground.
So, overall, at most m character comparisons have to be done until the end of the string if the word is found, or even fewer than that (if a comparison fails at an early stage).
So, the overall complexity would be O(m + log n).
I was under the impression that what the original poster said was correct: the time complexity is O(m*log n).
If you use the suggested enhancement of tracking previously matched characters to improve the time complexity (to get O(m + log n)), I believe the inputs below would break it.
arr = ["abc", "def", "ghi", "nlj", "pfypfy", "xyz"]
target = "nljpfy"
I expect this would incorrectly match on "pfypfy". Perhaps one of the original posters can weigh in on this. I'm definitely curious to better understand what was proposed; it sounds like the matched number of characters is skipped in the next comparison.

Where can I use a technique from Majority Vote algorithm

As seen in the answers to Linear time majority algorithm?, it is possible to compute the majority of an array of elements in linear time and log(n) space.
Everyone who sees this algorithm seems to believe that it is a cool technique. But does the idea generalize to new algorithms?
It seems the hidden power of this algorithm is in keeping a counter that plays a complex role -- such as "(count of majority element so far) - (count of second majority so far)". Are there other algorithms based on the same idea?
Umh, let's first try to understand why the algorithm works, in order to "isolate" the ideas in it.
The point of the algorithm is that if you have a majority element, then you can pair each occurrence of it with "another" element, and still have some occurrences to spare.
So, we just keep a counter of the "spare" occurrences of our current guess.
If it reaches 0, then the guess isn't a majority element of the subsequence running from where we "elected" the current element as the guessed majority element to the "current" position.
Also, since our guessed element matches every other element occurrence in the considered subsequence, there is no majority element in the considered subsequence at all.
Now, since:
our algorithm gives a correct answer only if there is a majority element, and
if there is a majority element, then it is still a majority element if we ignore the "current" subsequence whenever the counter drops to zero,
it is easy to see by contradiction that, if a majority element exists, then there is a suffix of the whole sequence on which the counter never reaches zero.
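This reasoning is exactly the Boyer-Moore majority vote algorithm; a minimal Python sketch, including the verification pass needed when a majority element is not guaranteed to exist:

```python
def majority_element(seq):
    # First pass: pair off occurrences; the survivor is the only candidate.
    candidate, spare = None, 0
    for x in seq:
        if spare == 0:
            candidate, spare = x, 1
        elif x == candidate:
            spare += 1
        else:
            spare -= 1
    # Second pass: verify the candidate actually is a majority.
    if seq and sum(1 for x in seq if x == candidate) > len(seq) // 2:
        return candidate
    return None
```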
Now: what's the idea that can be exploited in new O(1)-space, O(n)-time algorithms?
To me, you can apply this technique whenever you have to compute a property P of a sequence of elements which:
can be extended from seq[n, m] to seq[n, m+1] in O(1) time if Q(seq[n, m+1]) doesn't hold
P(seq[n, m]) can be computed in O(1) time and space from P(seq[n, j]) and P(seq[j, m]) if Q(seq[n, j]) holds
In our case, P is the number of "spare" occurrences of our "elected" majority element and Q is "P is zero".
If you see things in that way, longest common subsequence exploits the same idea (dunno about its "coolness factor" ;))
Jayadev Misra and David Gries have a paper called Finding Repeated Elements (ACM page) which generalizes this to elements repeating more than n/k times (k=2 gives the majority problem).
Of course, this is probably very similar to the original problem, and you are probably looking for 'different' algorithms.
Here is an example which is possibly different.
Give an algorithm which will detect if a string of parentheses ( '(' and ')') is well formed.
I believe the standard solution is to maintain a counter.
Side note:
As for answers which claim this cannot be done in constant space etc., ask them for their model of computation. In the word RAM model, for instance, integers and array indices are assumed to take O(1) space.
A lot of folks incorrectly mix and match models. For instance, they will happily take the input array of n integers to be O(n) space and an array index to be O(1) space, but consider a counter to be Omega(log n) bits, which is inconsistent. If they want to count sizes in bits, then the input itself is Omega(n log n) bits.
For people who want to understand what this algorithm does and why it works: look at my detailed answer.
Here I will describe a natural extension of this algorithm (or a generalization). In the standard majority voting algorithm you have to find an element which appears at least n/2 times in the stream, where n is the size of the stream. You can do this in O(n) time (with a tiny constant) and O(log(n)) space (the worst case for the counter, and highly unlikely to be reached).
The generalized algorithm allows you to find the k most frequent items, where each one appears at least n/(k+1) times in the original stream. Note that for k=1 you end up with the original problem.
The solution is really similar to the original one, except that instead of one counter and one candidate element you maintain k counters and k candidate elements. The logic goes in a similar way: you iterate through the array; if the element is among the candidates, you increase its counter; otherwise, if fewer than k candidates are being tracked, you add it as a new candidate; otherwise you decrease all the counters, dropping any candidate whose counter reaches zero.
As with the original majority voting algorithm, you need a guarantee that these k majority elements actually exist; otherwise you have to do another pass over the array to verify that the candidates you found are correct. Here is my Python attempt (I have not tested it thoroughly).
from collections import defaultdict

def majority_element_general(arr, k=1):
    counter, i = defaultdict(int), 0
    # Seed the first k candidate elements.
    while len(counter) < k and i < len(arr):
        counter[arr[i]] += 1
        i += 1
    for x in arr[i:]:
        if x in counter:
            counter[x] += 1
        elif len(counter) < k:
            counter[x] = 1
        else:
            # Decrement every counter; drop candidates that reach zero.
            fields_to_remove = []
            for el in counter:
                if counter[el] > 1:
                    counter[el] -= 1
                else:
                    fields_to_remove.append(el)
            for el in fields_to_remove:
                del counter[el]
    potential_elements = list(counter.keys())
    # might want to check that they are really frequent.
    return potential_elements

Algorithm to find length of longest sequence of blanks in a given string

Looking for an algorithm to find the length of the longest sequence of blanks in a given string, examining as few characters as possible.
Hint: your program should become faster as the length of the sequence of blanks increases.
I know the O(n) solution, but I'm looking for something more optimal.
You won't be able to find a solution with complexity smaller than O(n), because in the worst case you need to look at every character: for example, when the string contains runs of at most 0 or 1 consecutive blanks, or is entirely blank.
You can do some optimizations though, but it'll still be considered O(n).
For example:
Let M be the length of the longest run found so far as you go through the list. Also assume you can access input elements in O(1), for example you have an array as input.
When you see a non-blank character at position current, you can skip ahead if the character at current + M is also non-blank: surely no blank run longer than M can be contained in between.
And when you see a blank character, if the character at current + M - 1 is not blank, you know this run cannot be the longest one, so you can skip ahead in that case as well.
But in the worst case (for instance, when all characters are blank) you have to examine every character, so it can't be better than O(n) in complexity.
Rationale: assume the whole string is blank, and suppose your algorithm outputs n without having examined every character. If any unexamined character were not blank, your answer would be wrong; so for this particular input you have to examine the whole string.
There's no way to make it faster than O(N) in the worst case. However, here are a few optimizations, assuming 0-based indexing.
If you already have a complete sequence of L blanks (by complete I mean a sequence that is not a subsequence of a larger sequence), and L is at least as large as half the size of your string, you can stop.
If you have a complete sequence of L blanks, once you hit a space at position i check if the character at position i + L is also a space. If it is, continue scanning from position i forwards as you might find a larger sequence - however, if you encounter a non-space until position i + L, then you can skip directly to i + L + 1. If it isn't a space, there's no way you can build a larger sequence starting at i, so scan forwards starting from i + L + 1.
If you have a complete sequence of blanks of length L, and you are at position i and you have k positions left to examine, and k <= L, you can stop your search, as obviously there's no way you'll be able to find anything better anymore.
To prove that you can't make it faster than O(N), consider a string that contains no spaces. You will have to access each character once, so it's O(N). Same with a string that contains nothing but spaces.
The obvious idea: you can jump ahead by K+1 positions (where K is the length of the longest blank sequence found so far) and scan backwards if you land on a blank.
This way you check about (n + n/M)/2 = n(M+1)/2M positions.
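A sketch of that jump-and-scan idea in Python (my own illustration: probe ahead by the current best + 1, and measure a run exactly whenever a probe lands on a blank):

```python
def longest_blank_run(s):
    n, best = len(s), 0
    i = 0
    while i < n:
        if s[i] != ' ':
            # No run longer than `best` fits strictly between two
            # probes spaced best + 1 apart.
            i += best + 1
        else:
            # Landed inside a run: scan back to its start and forward
            # to its end to measure it exactly.
            start = i
            while start > 0 and s[start - 1] == ' ':
                start -= 1
            end = i
            while end + 1 < n and s[end + 1] == ' ':
                end += 1
            best = max(best, end - start + 1)
            i = end + best + 1  # resume probing past this run
    return best
```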
Edit:
Another idea would be to apply a kind of binary search. It works as follows: for a given k, you make a procedure that checks whether there is a sequence of blanks with length >= k; this can be achieved in O(n/k) steps (probe every k-th character and scan around it on a hit). Then, you try to find the maximal k with binary search.
Edit:
During subsequent searches, you can use the knowledge that a sequence of some length k already exists, and start with jumps of k from the very beginning.
Whatever you do, the worst case will always be O(n): for instance, when the blanks are in the last part of the string (or the last "checked" part of the string).
