Other Ways of Verifying Balanced Parentheses? - algorithm

A classical example of why stacks are important is the problem of verifying whether a string of parentheses is balanced. You start with an empty stack, push and pop elements as you scan the string, and at the end check whether the stack is empty; if so, the string is indeed balanced.
However, I am looking for other, less efficient approaches to this problem. I want to show my students the usefulness of the stack data structure by first presenting an exponential or otherwise non-linear algorithm that solves the problem, and then introducing the stack solution. Is anyone familiar with methods other than the stack-based approach?

Find the last opening parenthesis, check whether it is closed, and check that no parenthesis of another type appears between it and its closing parenthesis.
If it is properly closed, delete the pair and repeat the process until the string is empty.
If the string is not empty at the end of the process, or you find a parenthesis of a different kind, the string is not balanced.
example:
([[{}]])
The last opening is {, so look for }; after you find it, delete the pair from the string and continue with:
([[]])
etc.
If the string looks like this:
([[{]}])
then after you find the last open ({), you see that there is a parenthesis of a different kind (]) before the matching closing parenthesis, so the string is not balanced.
Worst-case complexity: O(n^2)
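A rough sketch of this approach in Python (my own illustration, assuming the input contains only the bracket characters ()[]{}):

def is_balanced(s):
    # Repeatedly find the rightmost opening bracket; the character right after it
    # must be its matching closer, otherwise the string is not balanced.
    pairs = {'(': ')', '[': ']', '{': '}'}
    chars = list(s)
    while chars:
        i = len(chars) - 1
        while i >= 0 and chars[i] not in pairs:
            i -= 1
        if i == -1:
            return False                              # only closing brackets left over
        if i + 1 >= len(chars) or chars[i + 1] != pairs[chars[i]]:
            return False                              # wrong bracket follows, or nothing follows
        del chars[i:i + 2]                            # delete the pair and repeat
    return True

Each deletion is O(n) and there can be up to n/2 of them, which gives the O(n^2) behaviour mentioned above.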

I assume that, for pedagogical purposes, it would be best to show a simple algorithm that they might actually have come up with themselves? If so, then I think a very intuitive algorithm is to just remove occurrences of () until there aren't any more to remove:
boolean isBalancedParens(String s) {
    while (s.contains("()")) {
        s = s.replace("()", "");
    }
    return s.isEmpty();
}
Under reasonable assumptions about the performance of the various methods called, this takes worst-case O(n^2) time and O(n) extra space.

This problem raises a number of interesting questions in algorithm analysis which are possibly at too high a level for your class, but were fun to think about. I sketch the worst-case and expected runtimes for all the algorithms, which are somewhere between log-linear and quadratic.
The only exponential time algorithm I could think of was the equivalent of Bogosort: generate all possible balanced strings until you find one which matches. That seemed too weird even for a class exercise. Even weirder would be the modified Bogocheck, which only generates all ()-balanced strings and uses some cleverness to figure out which actual parentheses to use in the comparison. (If you're interested, I could expand on this possibility.)
In most of the algorithms presented here, I use a procedure called "scan maintaining paren depth". This procedure examines characters one at a time in the order specified (forwards or backwards) maintaining a total count of observed open parentheses (of all types) less observed close parentheses (again, of all types). When scanning backwards, the meaning of "open" and "close" are reversed. If the count ever becomes negative, the string is not balanced and the entire procedure can immediately return failure.
Here are two algorithms which use constant space, both of which are worst-case quadratic in string length.
Algorithm 1: Find matching paren
Scan left-to-right. For each close encountered, scan backwards starting with the close maintaining paren depth. When the paren depth reaches zero, compare the character which caused the depth to reach 0 with the close which started the backwards scan; if they don't match, immediately fail. Also fail if the backwards scan hits the beginning of the string without the paren depth reaching zero.
If the end of the string is reached without failure being detected and the forward scan's paren depth has returned to zero, the string is balanced.
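A possible rendering of Algorithm 1 in Python (my own sketch, assuming the input consists only of bracket characters):

def balanced_algo1(s):
    opens, closes = "([{", ")]}"
    match = {')': '(', ']': '[', '}': '{'}
    depth = 0
    for i, c in enumerate(s):
        if c in opens:
            depth += 1
            continue
        depth -= 1
        if depth < 0:
            return False                      # more closes than opens so far
        # backwards scan: closes count +1, opens count -1
        count = 0
        for j in range(i, -1, -1):
            count += 1 if s[j] in closes else -1
            if count == 0:
                if s[j] != match[c]:
                    return False              # the matching position holds the wrong kind of open
                break
        else:
            return False                      # hit the start of the string without the count reaching zero
    return depth == 0                         # every open must have been closed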
Algorithm 2: Depthwise scan
Set depth to 1.
LOOP: Scan left-to-right from the first character, maintaining paren depth. If an open is encountered and the paren depth is incremented to depth, remember the open. If the paren depth is depth and a close is encountered, check to see if it matches the remembered open; if it does not, fail immediately.
If the end of the string is reached before any open is remembered, report success. If the end of the string is reached and the last remembered open was never matched by a close, report failure. Otherwise, increment depth and repeat the LOOP.
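Algorithm 2 might look roughly like this in Python (again my own sketch, not a definitive implementation):

def balanced_algo2(s):
    opens = "([{"
    match = {'(': ')', '[': ']', '{': '}'}
    target = 1
    while True:
        depth = 0
        remembered = None                     # open waiting for its close at this depth
        seen_open_at_target = False
        for c in s:
            if c in opens:
                depth += 1
                if depth == target:
                    remembered = c
                    seen_open_at_target = True
            else:
                if depth == target:
                    if remembered is None or c != match[remembered]:
                        return False
                    remembered = None
                depth -= 1
                if depth < 0:
                    return False
        if depth != 0:
            return False                      # unequal numbers of opens and closes
        if not seen_open_at_target:
            return True                       # nothing nested this deep: all pairs have been checked
        if remembered is not None:
            return False                      # an open at this depth was never closed
        target += 1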
Both of the above have worst case (quadratic) performance on a completely nested string ((…()…)). However, the average time complexity is trickier to compute.
Each loop in Algorithm 2 takes precisely Θ(N) time. If the total paren depth of the string is not 0 or there is any point in the string where the cumulative paren depth is negative, then failure will be reported in the first scan, taking linear time. That accounts for the vast majority of strings if the inputs are randomly selected from among all strings containing parenthesis characters. Of the strings which are not trivially rejected -- that is, the strings which would match if all opens were replaced with ( and all closes with ), including strings which are correctly balanced -- the expected number of scans is the expected maximum parenthesis depth of the string, which is Θ(log N) (proving this is an interesting exercise, but I think it's not too difficult), so the total expected time is Θ(N log N).
Algorithm 1 is rather more difficult to analyse in the general case, but for completely random strings it seems safe to guess that the first mismatch will be found in expected linear time. I don't have a proof for this, though. If the string is actually balanced, success will be reported at the termination of the scan, and the work performed is the sum of the span lengths of each pair of balanced parentheses. I believe this is approximately Θ(N log N), but I'd like to do some analysis before committing to this fact.
Here is an algorithm which is guaranteed to be O(N log N) for any input, but which requires Θ(N) additional space:
Algorithm 3: Sort matching pairs
Create an auxiliary vector of length N, whose ith element is the 2-tuple consisting of the cumulative paren depth of the character at position i, and the index i itself. The paren depth of an open is defined as the paren depth just before the open is counted, and the paren depth of a close is the paren depth just after the close is counted; the consequence is that matching open and close have the same paren depth.
Now sort the auxiliary vector in ascending order using lexicographic comparison of the tuples. Any O(N log N) sorting algorithm can be used; note that a stable sort is not necessary because all the tuples are distinct. [Note 1].
Finally iterate over the sorted vector, selecting two elements at a time. Reject the string if the two elements do not have the same depth, or are not a matching pair of open and close (using the index in the tuple to look up the character in the original string).
If the entire sorted vector can be scanned without failure, then the string was balanced.
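Here is one way Algorithm 3 could be sketched in Python (my own illustration); the tuples are the (paren depth, index) pairs defined above:

def balanced_algo3(s):
    opens = "([{"
    match = {'(': ')', '[': ']', '{': '}'}
    aux, depth = [], 0
    for i, c in enumerate(s):
        if c in opens:
            aux.append((depth, i))            # depth just before counting the open
            depth += 1
        else:
            depth -= 1
            if depth < 0:
                return False
            aux.append((depth, i))            # depth just after counting the close
    if depth != 0:
        return False
    aux.sort()                                # lexicographic sort of the (depth, index) tuples
    for k in range(0, len(aux), 2):
        (d1, i1), (d2, i2) = aux[k], aux[k + 1]
        if d1 != d2 or s[i1] not in opens or s[i2] != match[s[i1]]:
            return False
    return True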
Finally, a regex-based solution, because everyone loves regexes. :) This algorithm destroys the input string (unless a copy is made), but requires only constant additional storage.
Algorithm 4: Regex to the rescue!
Do the following search and replace until the search fails to find anything (I wrote it for sed using POSIX BREs, but in case that's too obscure, the pattern consists precisely of an alternation of each possible matched open-close pair):
s/()\|\[]\|{}//g
When the above loop terminates, if the string is not empty then it was not originally balanced; if it is empty, it was.
Note the g, which means that the search-and-replace is performed across the entire string on each pass. Each pass will take time proportional to the remaining length of the string at the beginning of the pass, but for simplicity we can say that the cost of a pass is O(N). The number of passes performed is the maximum paren depth of the string, which is Θ(N) in the worst case, but has an expected value of Θ(log N). So in the worst case, the execution time is Θ(N^2) but the expected time is Θ(N log N).
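For readers who prefer not to decode sed, a rough Python equivalent of the same loop (my own sketch) is:

import re

def balanced_algo4(s):
    pair = re.compile(r"\(\)|\[\]|\{\}")
    while True:
        s, count = pair.subn("", s)           # delete every directly matched pair
        if count == 0:
            return s == ""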
Notes
An O(N) stable counting sort on the paren depth is possible. In that case, the total algorithm would be O(N) instead of O(N log N), but that wasn't what you wanted, right? You could also use a stable sort just on the paren depth, in which case you could replace the second element of the tuple with the character itself. That would still be O(N log N), if the sort was O(N log N).

If your students are already familiar with recursion, here's a simple idea: look at the first parenthesis, find all matching closing parentheses, and for each of these pairs, recurse with the substring inside them and the substring after them; e.g.:
input:         "{(){[]}()}[]"
option 1:        ^     ^
recurse with:  "(){[]" and "()}[]"

               "{(){[]}()}[]"
option 2:        ^        ^
recurse with:  "(){[]}()" and "[]"
If the input is an empty string, return true. If the input starts with a closing parenthesis, or if the input does not contain a closing parenthesis matching the first parenthesis, return false.
function balanced(input) {
    var opening = "{([", closing = "})]";
    if (input.length == 0)
        return true;
    var type = opening.indexOf(input.charAt(0));
    if (type == -1)
        return false;
    for (var pos = 1; pos < input.length; pos++) { // forward search
        if (closing.indexOf(input.charAt(pos)) == type) {
            var inside = input.slice(1, pos);
            var after = input.slice(pos + 1);
            if (balanced(inside) && balanced(after))
                return true;
        }
    }
    return false;
}
document.write(balanced("{(({[][]}()[{}])({[[]]}()[{}]))}"));
Using forward search is better for concatenations of short balanced substrings; using backward search is better for deeply nested strings. But the worst case for both is O(n^2).

Related

Time Complexity of Text Justification with Dynamic Programming

I've been working on a dynamic programming problem involving the justification of text. I believe that I have found a working solution, but I am confused regarding this algorithm's runtime.
The research I have done thus far has described dynamic programming solutions to this problem as O(N^2), with N being the length of the text which is being justified. To me, this feels incorrect: I can see that O(N) calls must be made because there are N suffixes to check; however, for any given prefix we will never consider placing the newline (or 'split_point') beyond the maximum line length L. Therefore, for any given piece of text, there are at most L positions to place the split point (this assumes the worst case: that each word is exactly one character long). Because of this realization, isn't this algorithm more accurately described as O(LN)?
import math
import sys

def memoize(f):
    # Minimal memoization helper; the original decorator was not shown in the question.
    cache = {}
    def wrapper(text, line_length):
        key = (tuple(text), line_length)
        if key not in cache:
            cache[key] = f(text, line_length)
        return cache[key]
    return wrapper

@memoize
def justify(text, line_length):
    # If the text is shorter than the line length, do not split
    if len(' '.join(text)) < line_length:
        return [], math.pow(line_length - len(' '.join(text)), 3)
    best_cost, best_splits = sys.maxsize, []
    # Iterate over the text and consider putting a split between each pair of words
    for split_point in range(1, len(text)):
        length = len(' '.join(text[:split_point]))
        # This split exceeded the maximum line length: all later split points are unacceptable
        if length > line_length:
            break
        # Recursively compute the best split points of the text after this point
        future_splits, future_cost = justify(text[split_point:], line_length)
        cost = math.pow(line_length - length, 3) + future_cost
        if cost < best_cost:
            best_cost = cost
            best_splits = [split_point] + [split_point + n for n in future_splits]
    return best_splits, best_cost
Thanks in advance for your help,
Ethan
First of all your implementation is going to be far, far from the theoretical efficiency that you want. You are memoizing a string of length N in your call, which means that looking for a cached copy of your data is potentially O(N). Now start making multiple cached calls and you've blown your complexity budget.
This is fixable by moving the text outside of the function call and just passing around the index of the starting position and the length L. You are also doing a join inside of your loop, which is an O(L) operation. With some care you can make that an O(1) operation instead.
With that done, you would be doing O(N*L) operations. For exactly the reasons you thought.
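To illustrate, here is a sketch of that refactoring (my own code, not the answerer's; names like justify_fast are made up). The recursion is over the index of the first remaining word, and prefix sums of word lengths make each candidate line length an O(1) computation:

import sys
from functools import lru_cache

def justify_fast(words, line_length):
    # prefix[j] is the total length of the first j words, so the length of
    # words[i:j] joined by single spaces is prefix[j] - prefix[i] + (j - i - 1).
    prefix = [0]
    for w in words:
        prefix.append(prefix[-1] + len(w))

    def line_len(i, j):
        return prefix[j] - prefix[i] + (j - i - 1)

    @lru_cache(maxsize=None)
    def best(i):
        # Best split indices (absolute word indices) and cost for justifying words[i:].
        if line_len(i, len(words)) < line_length:
            return [], (line_length - line_len(i, len(words))) ** 3
        best_cost, best_splits = sys.maxsize, []
        for j in range(i + 1, len(words)):
            length = line_len(i, j)
            if length > line_length:
                break
            future_splits, future_cost = best(j)
            cost = (line_length - length) ** 3 + future_cost
            if cost < best_cost:
                best_cost, best_splits = cost, [j] + future_splits
        return best_splits, best_cost

    return best(0)

There are O(N) memoized states, each examining at most L candidate splits in O(1) each, which gives the O(N*L) bound.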

Complexity of binary search on a string

I have a sorted array of strings, e.g. ["bar", "foo", "top", "zebra"], and I want to check whether an input word is present in the array or not.
For example:
search(String[] str, String word) {
    // binary search implemented + string comparison.
}
Now binary search accounts for O(log n) complexity, where n is the length of the array. So far so good.
But at some point we need to do a string compare, which can be done in linear time. Now the input array can contain words of different sizes. So when I am calculating the final complexity, will the final answer be O(m*logn), where m is the size of the word we want to search for, which in our case is "zebra"?
Yes, your thinking as well as your proposed solution are both correct. You need to consider the length of the longest String too in the overall complexity of String searching.
A trivial String compare is an O(m) operation, where m is the length of the larger of the two strings.
But we can improve a lot, given that the array is sorted. As user "doynax" suggests:
Complexity can be improved by keeping track of how many characters got matched during the string comparisons, and storing the present count for the lower and upper bounds during the search. Since the array is sorted we know that the prefix of the middle entry to be tested next must match up to at least the minimum of the two depths, and therefore we can skip comparing that prefix. In effect we're always either making progress or stopping the incremental comparisons immediately on a mismatch, and thereby never needing to keep going over old ground.
So, overall, at most m character comparisons have to be done across the whole search if the word is found, or even fewer if the search fails at an early stage.
So, the overall complexity would be O(m + log n).
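For concreteness, here is one way the prefix-tracking idea might be coded (a sketch of my own, in Python rather than Java; it is not code from the answer above):

def search(arr, word):
    # Binary search over a sorted list of strings. lo_match / hi_match record how many
    # leading characters of 'word' are known to match the current lower / upper bound
    # entries; because the array is sorted, every entry in between shares at least
    # min(lo_match, hi_match) leading characters with 'word', so we never re-compare them.
    lo, hi = 0, len(arr) - 1
    lo_match = hi_match = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        s = arr[mid]
        i = min(lo_match, hi_match)           # skip the prefix known to match
        limit = min(len(s), len(word))
        while i < limit and s[i] == word[i]:
            i += 1
        if i == len(s) == len(word):
            return mid                        # exact match
        if i == len(s) or (i < len(word) and s[i] < word[i]):
            lo, lo_match = mid + 1, i         # arr[mid] < word: go right
        else:
            hi, hi_match = mid - 1, i         # arr[mid] > word: go left
    return -1

With the example from the question, search(["bar", "foo", "top", "zebra"], "zebra") returns 3.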
I was under the impression that what the original poster said was correct: the time complexity is O(m*logn).
If you use the suggested enhancement to improve the time complexity (to get O(m + logn)) by tracking previously matched letters, I believe the inputs below would break it.
arr = ["abc", "def", "ghi", "nlj", "pfypfy", "xyz"]
target = "nljpfy"
I expect this would incorrectly match on "pfypfy". Perhaps one of the original posters can weigh in on this. I am definitely curious to better understand what was proposed. It sounds like the number of already-matched letters is skipped in the next comparison.

Balanced Parenthesis Order number

Suppose you consider the case of length-six strings; the order would be: "()()()", "()(())", "(())()", "(()())", "((()))".
In the above example, we see that the strings in which the first opening parenthesis is closed the earliest come first, and if that is the same for two strings, the rule is applied recursively to the next opening parenthesis.
If a particular balanced parenthesis sequence is given, how do I find its order number? For example, "()(())" --> output is 2. I would like to do this in O(n), where n is the number of parenthesis pairs, i.e. 3 in the above case. The input can contain around 100000 parenthesis pairs.
First, let g(n,k) be the number of strings of length 2n + k consisting of n pairs of balanced parentheses plus k extra closing parentheses. Can we calculate g(n,k)?
Let's try recursion. For that we first need a base case. It is clear that if there are no balanced parentheses, then we can only have one possibility - only closing parentheses. So g(0,k) = 1. There is our base case.
Next, the recursive case. The first character is either an opening parenthesis or a closing parenthesis. If it is an opening parenthesis, then there are g(n-1, k+1) ways to finish. If it is a closing parenthesis, then there are g(n, k-1) ways to finish. But we can't have a negative number of open parentheses left to close, so g(n, -1) = 0. Putting it together:
g(0, k) = 1
g(n, -1) = 0
g(n, k) = g(n-1, k+1) + g(n, k-1)
This lets us calculate g but is not efficient - we are effectively going to list every possible string in the recursive calls. However there is a trick, memoize the results. Meaning that every time you call g(n, k) see if you've ever called it before, and if you have just return that answer. Otherwise you calculate the answer, cache it, and then return it. (For more on this trick, and alternate strategies, look up dynamic programming.)
OK, so now we can generate counts of something related, but how can we use this to get your answer?
Well, note the following. Suppose that partway through your string you find an open parenthesis at a position where there logically could be a close parenthesis instead. Suppose that at that point there are n pairs of parentheses still needed and k currently open parentheses. Then there are g(n, k-1) possible strings that are the same as yours until then, have a close parenthesis at that position (so they come before yours), and do whatever afterwards. So summing g(n, k-1) over all such positions gives you the number of strings before yours. Adding one to that gives you your position.
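A compact Python sketch of this counting approach (my own illustration, assuming a single parenthesis type; it memoizes g, so it is not the O(n) solution asked for, but it shows the ranking idea):

from functools import lru_cache

def rank(s):
    # 1-based position of the balanced string s in the ordering described above.
    n_pairs = len(s) // 2

    @lru_cache(maxsize=None)
    def g(n, k):
        # strings made of n balanced pairs plus k extra closing parentheses
        if k < 0:
            return 0
        if n == 0:
            return 1
        return g(n - 1, k + 1) + g(n, k - 1)

    pairs_left = n_pairs      # opening parentheses not yet placed
    open_now = 0              # currently unmatched opening parentheses
    position = 1
    for c in s:
        if c == '(':
            if open_now > 0:
                # a ')' could have gone here instead; every completion of that
                # alternative comes before s in the ordering
                position += g(pairs_left, open_now - 1)
            pairs_left -= 1
            open_now += 1
        else:
            open_now -= 1
    return position

For example, rank("()(())") returns 2 and rank("((()))") returns 5, matching the ordering above.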
I got the answer from the Ruskey thesis. The algorithm described there is about the ranking and unranking of binary trees.
http://webhome.cs.uvic.ca/~ruskey/Publications/Thesis/ThesisPage16.png

Tokenize valid words from a long string

Suppose you have a dictionary that contains valid words.
Given an input string with all spaces removed, determine whether the string is composed of valid words or not.
You can assume the dictionary is a hashtable that provides O(1) lookup.
Some examples:
helloworld-> hello world (valid)
isitniceinhere-> is it nice in here (valid)
zxyy-> invalid
If a string has multiple possible parsings, just returning true is sufficient.
The string can be very long, so think of an algorithm that is both space- and time-efficient.
I think the set of all strings that occur as the concatenation of valid words (words taken from a finite dictionary) forms a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; computation time is O(n).
For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: states are denoted by 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g); where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input sign, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can do with O(number of words) with the hint for construction above.
For another example, with the words {bag, bat, bun, but}, the automaton would have states 0, 1, 2, 3 and edges (0,1,b), (1,2,a), (1,3,u), (2,0,g), (2,0,t), (3,0,n), (3,0,t), with 0 again the only accepting state.
Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) we now argue that the time to decide whether a string is accepted by the automaton is O(n) where n is the length of the input string.
More formally, our algorithm is as follows:
1. Let S be a set of states, initially containing the starting state.
2. Read the next input character; let us denote it by a.
3. For each element s in S, determine the state that we move into from s on reading a; that is, the state r such that, with the notation above, (s,r,a) is an edge. Let us denote the set of these states by R. That is, R = {r | s in S, (s,r,a) is an edge}.
4. (If R is empty, the string is not accepted and the algorithm halts.)
5. If there are no more input symbols, check whether any of the accepting states is in R. (In our case, there is only one accepting state, the starting state.) If so, the string is accepted; if not, the string is not accepted.
6. Otherwise, take S := R and go to 2.
Now, there are as many executions of this cycle as there are input symbols. The only thing we have to check is that steps 3 and 5 take constant time. Given that the sizes of S and R are not greater than the number of states in the automaton, which is constant, and that we can store edges in a way such that lookup time is constant, this follows. (Note that we of course lose the multiple 'parsings', but that was not a requirement either.)
I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.
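As a concrete illustration of the construction (my own sketch, not the answerer's exact code): the dictionary is stored as a trie whose nodes act as automaton states, with a '$' end-of-word marker instead of looping the last edge back to the start state; this accepts the same strings as long as '$' never occurs in a word.

def tokenizable(s, dictionary):
    # Build a trie of the dictionary; the root is the start (and only accepting) state,
    # and finishing a word allows a jump back to the root.
    root = {}
    for word in dictionary:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True                      # end-of-word marker (assumes '$' not used in words)

    # Subset simulation: keep the set of states reachable after each character.
    # Dicts are not hashable, so states are keyed by id().
    states = {id(root): root}
    for ch in s:
        next_states = {}
        for node in states.values():
            child = node.get(ch)
            if child is None:
                continue
            next_states[id(child)] = child    # keep reading the current word
            if '$' in child:                  # a word just ended: a new word may start
                next_states[id(root)] = root
        if not next_states:
            return False                      # no state reachable: reject early
        states = next_states
    return id(root) in states                 # accept iff we end exactly at a word boundary

print(tokenizable("isitniceinhere", {"is", "it", "nice", "in", "here"}))  # True
print(tokenizable("zxyy", {"is", "it", "nice", "in", "here"}))            # False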
I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, with input being the string and result either true or false depending on whether the entire string can be tokenized correctly.
Works like this:
If input is the empty string, return true.
Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the suffix of input. If that returns true, return true as well.
If the length-one prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (already at the end of the string), return false.
Rinse and repeat.
For dictionaries with low to moderate amount of ambiguous prefixes, this should fetch a pretty good running time in practice (O(n) in the average case, I'd say), though in theory, pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better since we need backtracking anyways, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.
EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.
Space complexity would be only stack space, so O(n) even in the worst-case.
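A direct transcription of this recursion into Python might look like this (my own sketch; the dictionary is assumed to be a set with O(1) lookup):

def f(s, dictionary):
    # Try every prefix of s that is a dictionary word; recurse on the rest.
    if s == "":
        return True
    for i in range(1, len(s) + 1):
        if s[:i] in dictionary and f(s[i:], dictionary):
            return True
    return False

print(f("helloworld", {"hello", "world"}))   # True
print(f("zxyy", {"hello", "world"}))         # False

The slicing keeps the sketch short; an index-based version would avoid copying the suffixes on each call.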

Algorithm to find length of longest sequence of blanks in a given string

Looking for an algorithm to find the length of the longest sequence of blanks in a given string, examining as few characters as possible.
Hint: your program should become faster as the length of the sequence of blanks increases.
I know the O(n) solution, but I am looking for a more optimal one.
You won't be able to find a solution with smaller complexity than O(n), because in the worst case you need to pass through every character, for example with an input string that has no run of more than one blank, or one that is entirely blanks.
You can do some optimizations though, but it'll still be considered O(n).
For example:
Let M be the current longest match so far as you go through your list. Also assume you can access input elements in O(1), for example you have an array as input.
When you see a non-whitespace character, you can skip M elements if the character at current + M is also non-whitespace: surely no whitespace run longer than M can be contained in between.
And when you see a whitespace character, if the character at current + M - 1 is not whitespace, you know you don't have a longer run there, so you can skip in that case as well.
But in the worst case (when all characters are blank) you have to examine every character. So it can't be better than O(n) in complexity.
Rationale: assume the whole string is blank and your algorithm outputs N without having examined all N characters. Then if any non-examined character were not blank, your answer would be wrong. So for this particular input you have to examine the whole string.
There's no way to make it faster than O(N) in the worst case. However, here are a few optimizations, assuming 0-based indexing.
If you already have a complete sequence of L blanks (by complete I mean a sequence that is not a subsequence of a larger sequence), and L is at least as large as half the size of your string, you can stop.
If you have a complete sequence of L blanks, once you hit a space at position i check if the character at position i + L is also a space. If it is, continue scanning from position i forwards as you might find a larger sequence - however, if you encounter a non-space until position i + L, then you can skip directly to i + L + 1. If it isn't a space, there's no way you can build a larger sequence starting at i, so scan forwards starting from i + L + 1.
If you have a complete sequence of blanks of length L, and you are at position i and you have k positions left to examine, and k <= L, you can stop your search, as obviously there's no way you'll be able to find anything better anymore.
To prove that you can't make it faster than O(N), consider a string that contains no spaces. You will have to access each character once, so it's O(N). Same with a string that contains nothing but spaces.
The obvious idea: you can jump ahead by K+1 places (where K is the current longest space sequence) and scan back if you find a space.
This way you have something about (n + n/M)/2 = n(M+1)/2M positions checked.
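A sketch of this skip-and-scan-back idea in Python (my own illustration):

def longest_blank_run(s):
    n, best, i = len(s), 0, 0
    while i < n:
        if s[i] != ' ':
            i += best + 1              # a run longer than best cannot fit entirely in the skipped gap
        else:
            lo = i                     # scan back to find where this run really starts
            while lo > 0 and s[lo - 1] == ' ':
                lo -= 1
            hi = i                     # and forward to find where it ends
            while hi + 1 < n and s[hi + 1] == ' ':
                hi += 1
            best = max(best, hi - lo + 1)
            i = hi + best + 2          # s[hi + 1] is not blank, so jump past it as well
    return best

The longer the best run found so far, the bigger the jumps, which matches the hint in the question; any longer run that straddles a skipped gap is still found, because the backward scan from the next probe recovers its true start.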
Edit:
Another idea would be to apply a kind of binary search. It works as follows: for a given k, you make a procedure that checks whether there is a sequence of spaces with length >= k. This can be achieved in O(n/k) steps. Then, you try to find the maximal k with binary search.
Edit:
During the subsequent searches, you can use the knowledge that a sequence of some length k already exists, and start skipping by k from the very beginning.
Whatever you do, the worst case will always be O(n), for instance if the blanks are in the last part of the string (or the last "checked" part of the string).
