Complexity of binary search on a string - algorithm

I have a sorted array of strings, e.g. ["bar", "foo", "top", "zebra"], and I want to search whether an input word is present in the array or not.
eg:
search(String[] str, String word) {
    // binary search implemented + string comparison.
}
Now binary search will account for O(log n) complexity, where n is the length of the array. So far so good.
But at some point we need to do a string comparison, which can be done in linear time.
The input array can contain words of different sizes. So, when calculating the final complexity, will the answer be O(m*log n), where m is the size of the word we want to search for, which in our case is "zebra"?

Yes, your thinking as well as your proposed solution are correct. You need to consider the length of the longest string too in the overall complexity of string searching.
A trivial String compare is an O(m) operation, where m is the length of the larger of the two strings.
But, we can improve a lot, given that the array is sorted. As user "doynax" suggests,
Complexity can be improved by keeping track of how many characters got matched during
the string comparisons, and store the present count for the lower and
upper bounds during the search. Since the array is sorted we know that
the prefix of the middle entry to be tested next must match up to at
least the minimum of the two depths, and therefore we can skip
comparing that prefix. In effect we're always either making progress
or stopping the incremental comparisons immediately on a mismatch, and
thereby never needing to keep going over old ground.
So, overall, at most m character comparisons have to be done across the whole search if the word is found, or even fewer than that if a comparison fails at an early stage.
So, the overall complexity would be O(m + log n).
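For concreteness, here is a minimal sketch in Java of what that optimization might look like (my own code, not doynax's; the names loMatch/hiMatch are only illustrative). The invariant is that word is known to share at least loMatch leading characters with the current lower bound and at least hiMatch with the current upper bound, so the comparison against arr[mid] can safely start at the smaller of the two.

static int search(String[] arr, String word) {
    int lo = 0, hi = arr.length - 1;
    int loMatch = 0, hiMatch = 0;              // chars of word known to match at the bounds
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        String cand = arr[mid];
        int i = Math.min(loMatch, hiMatch);    // prefix guaranteed to match cand as well
        int limit = Math.min(cand.length(), word.length());
        while (i < limit && cand.charAt(i) == word.charAt(i)) i++;
        int cmp = (i < limit) ? Character.compare(cand.charAt(i), word.charAt(i))
                              : Integer.compare(cand.length(), word.length());
        if (cmp == 0) return mid;              // found
        if (cmp < 0) { lo = mid + 1; loMatch = i; }
        else         { hi = mid - 1; hiMatch = i; }
    }
    return -1;                                 // not found
}

As far as I can tell, on the inputs discussed in the comment below (searching for "nljpfy"), this version does not mis-match on "pfypfy", because it only ever skips min(loMatch, hiMatch) characters.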

I was under the impression that what the original poster said was correct, i.e. that the time complexity is O(m*log n).
If you use the suggested enhancement to improve the time complexity (to get O(m + log n)) by tracking previously matched letters, I believe the inputs below would break it.
arr = ["abc", "def", "ghi", "nlj", "pfypfy", "xyz"]
target = "nljpfy"
I expect this would incorrectly match on "pfypfy". Perhaps one of the original posters can weigh in on this. I am definitely curious to better understand what was proposed. It sounds like the number of matched letters is skipped in the next comparison.

Related

Other Ways of Verifying Balanced Parenthesis?

A classical example of why stacks are quite important is the problem of verifying whether a string of parentheses is balanced. You start with an empty stack and keep pushing/popping elements on the stack; at the end, you check whether your stack is empty, and if so you report that the string is indeed balanced.
However, I am looking for other, less efficient approaches to solving this problem. I want to show my students the usefulness of the stack data structure by first coming up with an exponential/non-linear algorithm that solves the problem, and then introducing the stack solution. Is anyone familiar with methods other than the stack-based approach?
Find the last opening parenthesis, and look whether it is closed by the matching closing parenthesis with no parenthesis of another type in between.
If it is, delete the pair and repeat the process until the string is empty.
If the string is not empty at the end of the process, or you find a closing parenthesis of a different kind, the string is not balanced.
example:
([[{}]])
the last opening is {, so look for }; after you find it, delete the pair from the string and continue with:
([[]])
etc.
If the string looks like this:
([[{]}])
then after you find the last opening parenthesis ({) you see there is a parenthesis of a different kind (]) before the closing parenthesis, so it is not balanced.
worst case complexity: O(n^2)
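A rough Java sketch of this idea (my own code, assuming the input consists only of parenthesis characters): repeatedly locate the last opening parenthesis, require that the very next character is the matching closing one, and delete the pair.

static boolean isBalanced(String s) {
    String opens = "([{", closes = ")]}";
    StringBuilder sb = new StringBuilder(s);
    while (sb.length() > 0) {
        int last = -1;                                              // position of the last opening parenthesis
        for (int i = 0; i < sb.length(); i++)
            if (opens.indexOf(sb.charAt(i)) >= 0) last = i;
        if (last == -1 || last + 1 == sb.length()) return false;    // no opener left, or nothing follows it
        int type = opens.indexOf(sb.charAt(last));
        if (sb.charAt(last + 1) != closes.charAt(type)) return false; // a parenthesis of a different kind follows
        sb.delete(last, last + 2);                                  // delete the matched pair and repeat
    }
    return true;
}

Each pass scans the remaining string once and there are at most n/2 passes, which gives the O(n^2) worst case mentioned above.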
I assume that, for pedagogical purposes, it would be best to show a simple algorithm that they might actually have come up with themselves? If so, then I think a very intuitive algorithm is to just remove occurrences of () until there aren't any more to remove:
boolean isBalancedParens(String s) {
    while (s.contains("()")) {
        s = s.replace("()", "");
    }
    return s.isEmpty();
}
Under reasonable assumptions about the performance of the various methods called, this takes worst-case O(n²) time and O(n) extra space.
This problem raises a number of interesting questions in algorithm analysis which are possibly at too high a level for your class, but were fun to think about. I sketch the worst-case and expected runtimes for all the algorithms, which are somewhere between log-linear and quadratic.
The only exponential time algorithm I could think of was the equivalent of Bogosort: generate all possible balanced strings until you find one which matches. That seemed too weird even for a class exercise. Even weirder would be the modified Bogocheck, which only generates all ()-balanced strings and uses some cleverness to figure out which actual parenthesis to use in the comparison. (If you're interested, I could expand on this possibility.)
In most of the algorithms presented here, I use a procedure called "scan maintaining paren depth". This procedure examines characters one at a time in the order specified (forwards or backwards) maintaining a total count of observed open parentheses (of all types) less observed close parentheses (again, of all types). When scanning backwards, the meaning of "open" and "close" are reversed. If the count ever becomes negative, the string is not balanced and the entire procedure can immediately return failure.
Here are two algorithms which use constant space, both of which are worst-case quadratic in string length.
Algorithm 1: Find matching paren
Scan left-to-right. For each close encountered, scan backwards starting with the close maintaining paren depth. When the paren depth reaches zero, compare the character which caused the depth to reach 0 with the close which started the backwards scan; if they don't match, immediately fail. Also fail if the backwards scan hits the beginning of the string without the paren depth reaching zero.
If the end of the string is reached without failure being detected, the string is balanced.
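A small Java sketch of Algorithm 1 as I read it (my own code; the final depth == 0 test is my addition, to reject leftover unmatched opens, and the input is assumed to consist only of parenthesis characters):

static boolean isBalancedScan(String s) {
    String opens = "([{", closes = ")]}";
    int depth = 0;                                    // forward scan maintaining paren depth
    for (int i = 0; i < s.length(); i++) {
        if (opens.indexOf(s.charAt(i)) >= 0) { depth++; continue; }
        int type = closes.indexOf(s.charAt(i));
        if (--depth < 0) return false;                // close with no unmatched open to its left
        int back = 0, j = i;                          // backwards scan: open/close meanings reversed
        while (j >= 0) {
            back += (closes.indexOf(s.charAt(j)) >= 0) ? 1 : -1;
            if (back == 0) break;
            j--;
        }
        if (j < 0) return false;                      // hit the beginning without the depth reaching zero
        if (opens.indexOf(s.charAt(j)) != type) return false;  // the open that closed the depth doesn't match
    }
    return depth == 0;
}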
Algorithm 2: Depthwise scan
Set depth to 1.
LOOP: Scan left-to-right from the first character, maintaining paren depth. If an open is encountered and the paren depth is incremented to depth, remember the open. If the paren depth is depth and a close is encountered, check to see if it matches the remembered open; if it does not, fail immediately.
If the end of the string is reached before any open is remembered, report success. If the end of the string is reached and the last remembered open was never matched by a close, report failure. Otherwise, increment depth and repeat the LOOP.
Both of the above have worst case (quadratic) performance on a completely nested string ((…()…)). However, the average time complexity is trickier to compute.
Each loop in Algorithm 2 takes precisely Θ(N) time. If the total paren depth of the string is not 0 or there is any point in the string where the cumulative paren depth is negative, then failure will be reported in the first scan, taking linear time. That accounts for the vast majority of strings if the inputs are randomly selected from among all strings containing parenthesis characters. Of the strings which are not trivially rejected -- that is, the strings which would match if all opens were replaced with ( and all closes with ), including strings which are correctly balanced -- the expected number of scans is the expected maximum parenthesis depth of the string, which is Θ(log N) (proving this is an interesting exercise, but I think it's not too difficult), so the total expected time is Θ(N log N).
Algorithm 1 is rather more difficult to analyse in the general case, but for completely random strings it seems safe to guess that the first mismatch will be found in expected linear time. I don't have a proof for this, though. If the string is actually balanced, success will be reported at the termination of the scan, and the work performed is the sum of the span lengths of each pair of balanced parentheses. I believe this is approximately Θ(N log N), but I'd like to do some analysis before committing to this fact.
Here is an algorithm which is guaranteed to be O(N log N) for any input, but which requires Θ(N) additional space:
Algorithm 3: Sort matching pairs
Create an auxiliary vector of length N, whose ith element is the 2-tuple consisting of the cumulative paren depth of the character at position i, and the index i itself. The paren depth of an open is defined as the paren depth just before the open is counted, and the paren depth of a close is the paren depth just after the close is counted; the consequence is that matching open and close have the same paren depth.
Now sort the auxiliary vector in ascending order using lexicographic comparison of the tuples. Any O(N log N) sorting algorithm can be used; note that a stable sort is not necessary because all the tuples are distinct. [Note 1].
Finally iterate over the sorted vector, selecting two elements at a time. Reject the string if the two elements do not have the same depth, or are not a matching pair of open and close (using the index in the tuple to look up the character in the original string).
If the entire sorted vector can be scanned without failure, then the string was balanced.
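One possible Java rendering of Algorithm 3 (my own sketch; the early negative-depth and final zero-depth checks come from the "scan maintaining paren depth" procedure above, and the input is assumed to consist only of parenthesis characters):

// uses java.util.Arrays and java.util.Comparator
static boolean isBalancedSorted(String s) {
    String opens = "([{", closes = ")]}";
    int n = s.length();
    int[][] tuples = new int[n][2];                   // (paren depth, index)
    int depth = 0;
    for (int i = 0; i < n; i++) {
        if (opens.indexOf(s.charAt(i)) >= 0) {
            tuples[i][0] = depth;                     // depth just before counting the open
            tuples[i][1] = i;
            depth++;
        } else {
            depth--;
            if (depth < 0) return false;              // more closes than opens so far
            tuples[i][0] = depth;                     // depth just after counting the close
            tuples[i][1] = i;
        }
    }
    if (depth != 0) return false;                     // unmatched opens remain
    Arrays.sort(tuples, Comparator.<int[]>comparingInt(t -> t[0]).thenComparingInt(t -> t[1]));
    for (int i = 0; i < n; i += 2) {                  // select two elements at a time
        if (tuples[i][0] != tuples[i + 1][0]) return false;           // depths differ
        char a = s.charAt(tuples[i][1]), b = s.charAt(tuples[i + 1][1]);
        int type = opens.indexOf(a);
        if (type < 0 || closes.charAt(type) != b) return false;       // not a matching open/close pair
    }
    return true;
}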
Finally, a regex-based solution, because everyone loves regexes. :) This algorithm destroys the input string (unless a copy is made), but requires only constant additional storage.
Algorithm 4: Regex to the rescue!
Do the following search and replace until the search fails to find anything: (I wrote it for sed using Posix BREs, but in case that's too obscure, the pattern consists precisely of an alternation of each possible matched open-close pair.)
s/()\|\[\]\|{}//g
When the above loop terminates, if the string is not empty then it was not originally balanced; if it is empty, it was.
Note the g, which means that the search-and-replace is performed across the entire string on each pass. Each pass will take time proportional to the remaining length of the string at the beginning of the pass, but for simplicity we can say that the cost of a pass is O(N). The number of passes performed is the maximum paren depth of the string, which is Θ(N) in the worst case, but has an expected value of Θ(log N). So in the worst case, the execution time is Θ(N²) but the expected time is Θ(N log N).
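For readers who would rather not decode the sed, here is a rough Java equivalent of the same loop (my own sketch, not part of the original answer):

static boolean isBalancedRegex(String s) {
    while (true) {
        String next = s.replaceAll("\\(\\)|\\[\\]|\\{\\}", "");  // delete every adjacent matched pair
        if (next.equals(s)) break;                                // the search found nothing on this pass
        s = next;
    }
    return s.isEmpty();
}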
Notes
An O(N) stable counting sort on the paren depth is possible. In that case, the total algorithm would be O(N) instead of O(N log N), but that wasn't what you wanted, right? You could also use a stable sort just on the paren depth, in which case you could replace the second element of the tuple with the character itself. That would still be O(N log N), if the sort was O(N log N).
If your students are already familiar with recursion, here's a simple idea: look at the first parenthesis, find all matching closing parentheses, and for each of these pairs, recurse with the substring inside them and the substring after them; e.g.:
input: "{(){[]}()}[]"
option 1: ^ ^
recurse with: "(){[]" and "()}[]"
"{(){[]}()}[]"
option 2: ^ ^
recurse with: "(){[]}()" and "[]"
If the input is an empty string, return true. If the input starts with a closing parenthesis, or if the input does not contain a closing parenthesis matching the first parenthesis, return false.
function balanced(input) {
    var opening = "{([", closing = "})]";
    if (input.length == 0)
        return true;
    var type = opening.indexOf(input.charAt(0));
    if (type == -1)
        return false;
    for (var pos = 1; pos < input.length; pos++) { // forward search
        if (closing.indexOf(input.charAt(pos)) == type) {
            var inside = input.slice(1, pos);
            var after = input.slice(pos + 1);
            if (balanced(inside) && balanced(after))
                return true;
        }
    }
    return false;
}
document.write(balanced("{(({[][]}()[{}])({[[]]}()[{}]))}"));
Using forward search is better for concatenations of short balanced substrings; using backward search is better for deeply nested strings. But the worst case for both is O(n²).

Calculating the hash of any substring in logarithmic time

Question came up in relation to this article:
https://threads-iiith.quora.com/String-Hashing-for-competitive-programming
The author presents this algorithm for hashing a string:
Hash(S) = ∑_{i=0 to n-1} S_i ∗ p^i
where S is our string of length n, S_i is the character at index i, and p is a prime number we've chosen.
He then presents the problem of determining whether a substring of a given string is a palindrome and claims it can be done in logarithmic time through hashing.
He makes the point that we can calculate the hash from the beginning of our whole string up to the right edge R of our substring:
F(R) = ∑_{i=0 to R} S_i ∗ p^i
and observes that if we calculate the hash from the beginning up to just before the left edge of our substring (F(L-1)), the difference between this and our hash to the right edge is basically the hash of our substring:
F(R) - F(L-1) = Hash(S[L,R]) ∗ p^L
This is all fine, and I think I follow it so far. But he then immediately makes the claim that this allows us to calculate our hash (and thus determine if our substring is a palindrome by comparing this hash with the one generated by moving through our substring in reverse order) in logarithmic time.
I feel like I'm probably missing something obvious but how does this allow us to calculate the hash in logarithmic time?
You already know that you can calculate the difference in constant time. Let me restate the difference (I'll leave the modulo away for clarity):
diff = ∑_{i=L to R} S_i ∗ p^i
Note that this is not the hash of the substring because the powers of p are offset by a constant. Instead, this is (as stated in the article)
diff = Hash(S[L,R])∗p^L
To derive the hash of the substring, you have to multiply the difference by p^-L. Assuming that you already know p^-1 (this can be computed in a preprocessing step), you need to calculate (p^-1)^L. With the square-and-multiply method, this takes O(log L) operations, which is probably what the author refers to.
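A small Java sketch of that square-and-multiply step (my own code; mod is assumed to be the hashing modulus and pInv the precomputed modular inverse of p):

static long modPow(long base, long exp, long mod) {
    long result = 1;
    base %= mod;
    while (exp > 0) {                            // O(log exp) iterations
        if ((exp & 1) == 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

static long substringHash(long diff, long pInv, long L, long mod) {
    return diff * modPow(pInv, L, mod) % mod;    // Hash(S[L,R]) = diff * (p^-1)^L
}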
This may become more efficient if your queries are sorted by L. In this case, you could calculate p^-L incrementally.

Discovering Consecutive Repetitive Patterns in a String

I am trying to search for the maximal number of substring repetitions inside a string, here are some few examples:
"AQMQMB" => QM (2x)
"AQMPQMB" => <nothing>
"AACABABCABCABCP" => A (2x), AB (2x), ABC (3x)
As you can see, I am searching for consecutive substrings only, and this seems to be a problem because all compression algorithms (at least those I am aware of) either don't care about consecutiveness (LZ*) or are too simple to handle consecutive patterns rather than single data items (RLE). I think using suffix-tree-related algorithms is also not useful, for the same reason.
I think there are some bio-informatics algorithms that can do this, does anyone have an idea about such algorithm?
Edit
In the second example there might be multiple possibilities of consecutive patterns (thanks to Eugen Rieck for the notice, read comments below), however in my use case any of these possibilities is actually acceptable.
This is what I used for a similar problem:
<?php
$input="AACABABCABCABCP";
//Prepare index array (A..Z) - adapt to your character range
$idx=array();
for ($i="A"; strlen($i)==1; $i++) $idx[$i]=array();
//Prepare hits array
$hits=array();
//Loop
$len=strlen($input);
for ($i=0; $i<$len; $i++) {
    //Current character
    $current=$input[$i];
    //Cycle past occurrences of character
    foreach ($idx[$current] as $offset) {
        //Check if substring from past occurrence to now matches oncoming
        $matchlen=$i-$offset;
        $match=substr($input,$offset,$matchlen);
        if ($match==substr($input,$i,$matchlen)) {
            //match found - store it
            if (isset($hits[$match])) $hits[$match][]=$i;
            else $hits[$match]=array($offset,$i);
        }
    }
    //Store current character in index
    $idx[$current][]=$i;
}
print_r($hits);
?>
I suspect it to be O(N*N/M) time with N being string length and M being the width of the character range.
It outputs what I think are the correct answers for your example.
Edit:
This algorithm has the advantage of keeping valid scores while running, so it is usable for streams, as long as you can look ahead via some buffering. It pays for this with efficiency.
Edit 2:
If one were to allow a maximum length for repetition detection, this would decrease space and time usage: expelling past occurrences that are too "early" via something like if ($matchlen>MAX_MATCH_LEN) ... limits both the index size and the string comparison length.
Suffix tree related algorithms are useful here.
One is described in Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 9.6). It uses a combination of divide-and-conquer approach and suffix trees and has time complexity O(N log N + Z) where Z is the number of substring repetitions.
The same book describes a simpler O(N²) algorithm for this problem, also using suffix trees.

Fastest algorithm to find a string in an array of strings?

This question is merely about algorithm.
In pseudo code it looks like this:
A = Array of strings; //let's say count(A) = N
S = String to find; //let's say length(S) = M
for (Index=0; Index<count(A); Index++)
    if (A[Index]==S) {
        print "First occurrence at index " + Index;
        break;
    }
This for loop requires string comparison N times (or byte comparison N*M times, O(N*M)). This is bad when array A has lots of items, or when string S is too long.
Any better method to find the first occurrence? An algorithm at O(K*log K) is OK, but preferably O(K), or best of all O(log K), where K is either N or M.
I don't mind adding in some other structures or doing some data processing before the comparison loop.
You could convert the whole array of strings to a finite state machine, where the transitions are the characters of the strings, and store in each state the smallest index of the strings that produced that state. This takes a lot of time, and may be considered indexing.
Putting the strings into a hash-based set, and testing whether a given string is contained in the set, should give you more or less constant performance once the set is built.
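Since the question asks for the index of the first occurrence rather than plain membership, a map from string to first index works just as well. A minimal sketch (my own code, with assumed names):

// uses java.util.HashMap and java.util.Map; build the map once up front if there are many queries
static Integer firstOccurrence(String[] A, String S) {
    Map<String, Integer> firstIndex = new HashMap<>();
    for (int i = 0; i < A.length; i++) {
        firstIndex.putIfAbsent(A[i], i);         // keep only the first occurrence of each string
    }
    return firstIndex.get(S);                    // null if S is not present
}

Building the map costs O(N*M); each lookup afterwards costs roughly O(M), since hashing and comparing the query string dominates.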
You can first sort the array of strings, which will take O(m*n log n) time. After A is sorted, you can do a binary search instead of a linear search, which reduces the running time of each search to O(m*log n).
The advantage of this method is that it's quite easy to implement. For example, in Java you can do this with just 2 lines of code:
Arrays.sort(A);
int index = Arrays.binarySearch(A, S);
You could use a Self-balancing binary search tree. Most implementations have O(log(n)) to insert, and O(log(n)) to search.
If your set is not very big and you have a good hash function for your values, the hash-based set is a better solution, because in that case you will have O(1) insert and O(1) search. But if your hash function is bad or your set is too big, it will be O(n) to insert and O(n) to search.
The best way to search as fast as possible is to have the array sorted.
As you describe it, there is no a priori information which would allow for heuristics or constraints in the search.
Sort the array first (Quicksort, for example: O(N log N)),
and do a binary search next: O(log N).

Using Rabin-Karp to search for multiple patterns in a string

According to the wikipedia entry on Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns with differing length simultaneously. Can someone please shed some light on this?
Edit (December 2011):
The wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).
I'm not sure if this is the correct answer, but anyway:
While constructing the hash value, we can check the current hash value for a match in the set of string hashes. The hash function is usually implemented as a loop, and inside that loop we can insert our quick lookup.
Of course, we must pick m to be the maximum string length in the set of strings.
Update: From Wikipedia,
[...]
for i from 1 to n-m+1
    if hs ∈ hsubs
        if s[i..i+m-1] = a substring with hash hs
            return i
    hs := hash(s[i+1..i+m]) // <---- calculating current hash
[...]
We calculate the current hash in m steps. At each step there is a temporary hash value that we can look up (O(1) complexity) in the set of hashes. All hashes have the same size, i.e. 32 bits.
Update 2: an amortized (average) O(n) time complexity ?
Above I said that m must be the maximum string length. It turns out that we can exploit the opposite.
With hashing for shifting substring search and a fixed m size we can achieve O(n) complexity.
If we have variable-length strings, we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with its first m characters.
Now, while searching the text we check if the current hash is in the hash set and we examine the associated strings for a match.
This technique will increase the false alarms but on average it has O(n) time complexity.
It's because the hash values of the substrings are related mathematically. Computing the hash H(S,j) (the hash of the characters starting from the jth position of string S) takes O(m) time on a string of length m. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).
O(m) for the first position plus O(1) for each subsequent one, i.e. linear time overall.
Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")
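A tiny Java sketch of that constant-time update (my own illustration, with an assumed base B and modulus MOD; the leftmost character of the window carries the highest power):

static final long B = 256, MOD = 1_000_000_007L;

// Hash of s[j+1 .. j+m] computed from the hash of s[j .. j+m-1] in O(1).
// powBm1 must be B^(m-1) mod MOD, precomputed once.
static long roll(long prevHash, char outgoing, char incoming, long powBm1) {
    long h = (prevHash - outgoing * powBm1 % MOD + MOD) % MOD;   // remove the outgoing character s[j]
    return (h * B + incoming) % MOD;                             // append the incoming character s[j+m]
}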
