Question came up in relation to this article:
The author presents this algorithm for hashing a string:
where S is our string, S_i is the character at index i, and p is a prime number we've chosen.
He then presents the problem of determining whether a substring of a given string is a palindrome and claims it can be done in logarithmic time through hashing.
He makes the point we can calculate from the beginning of our whole string to the right edge of our substring:
and observes that if we calculate the hash from the beginning to the left edge of our substring (F(L-1)), the difference between this and our hash to our right edge is basically the hash of our substring:
This is all fine, and I think I follow it so far. But he then immediately makes the claim that this allows us to calculate our hash (and thus determine if our substring is a palindrome by comparing this hash with the one generated by moving through our substring in reverse order) in logarithmic time.
I feel like I'm probably missing something obvious but how does this allow us to calculate the hash in logarithmic time?

You already know that you can calculate the difference in constant time. Let me restate the difference (I'll omit the modulo for clarity):
diff = ∑_{i=L to R} S_i ∗ p^i
Note that this is not the hash of the substring because the powers of p are offset by a constant. Instead, this is (as stated in the article)
diff = Hash(S[L,R])∗p^L
To derive the hash of the substring, you have to multiply the difference by p^-L. Assuming that you already know p^-1, the modular inverse of p (it can be computed in a preprocessing step), you need to calculate (p^-1)^L. With the square-and-multiply method, this takes O(log L) operations, which is probably what the author refers to.
This may become more efficient if your queries are sorted by L. In this case, you could calculate p^-L incrementally.
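Here is a minimal Python sketch of the idea, under assumptions of my own: the modulus MOD and the base P are illustrative picks (the article's actual values are not known to me), and all function names are hypothetical. The three-argument pow does the square-and-multiply step, which is where the O(log L) comes from.

```python
MOD = (1 << 61) - 1   # assumed prime modulus (a Mersenne prime)
P = 131               # assumed prime base
P_INV = pow(P, MOD - 2, MOD)  # p^-1 via Fermat's little theorem (preprocessing)


def prefix_hashes(s):
    """Return F where F[i] = sum of ord(s[j]) * P^j for j <= i, modulo MOD."""
    hashes, acc, power = [], 0, 1
    for ch in s:
        acc = (acc + ord(ch) * power) % MOD
        hashes.append(acc)
        power = power * P % MOD
    return hashes


def substring_hash(prefix, left, right):
    """Hash of s[left..right] (inclusive), normalised so powers start at P^0."""
    diff = prefix[right] - (prefix[left - 1] if left > 0 else 0)
    # diff = Hash(S[L,R]) * P^L, so multiply by (P^-1)^L: O(log L) operations.
    return diff * pow(P_INV, left, MOD) % MOD


def is_palindrome(fwd, bwd, left, right):
    """Compare the substring hash with the hash of the same span read backwards.

    fwd holds the prefix hashes of s, bwd those of s reversed.
    """
    n = len(fwd)
    return substring_hash(fwd, left, right) == substring_hash(bwd, n - 1 - right, n - 1 - left)
```

Usage: build fwd = prefix_hashes(s) and bwd = prefix_hashes(s[::-1]) once, then each query is_palindrome(fwd, bwd, L, R) costs one modular exponentiation per side. As with any hash comparison, equal hashes only indicate a palindrome with high probability.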


Repeated DNA sequence

The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach that uses a rolling hash function: for each sequence of length k, a hash is computed and stored in a map. To check if the current sequence is a repetition, we compute its hash and check whether the hash already exists in the hash map. If yes, we include this sequence in our result; otherwise we add it to the hash map.
Rolling hash here means, when moving on to the next sequence by sliding the window by one, we use the hash of previous sequence in a way that we remove the contribution of the first character of previous sequence and add the contribution of the newly added char i.e. the last character of the new sequence.
and k=10
This algorithm looks perfect, but I can't go about making a perfect hash function so that collisions are avoided. It would be a great help if somebody can explain how to make a perfect hash under any circumstance and most importantly in this case.
This is actually a research problem.
Let's come to terms with some facts
Let the input be N and its length |N|.
You have to move a sliding window of size k (here k=10) over the input. Therefore you must live with O(|N|) time or more.
Your rolling hash is a form of locality-sensitive deterministic hashing. The downside of deterministic hashing is that its benefit diminishes greatly: the more often you encounter similar strings, the harder it is to hash them without collisions.
The longer your input the less effective hashing will be
Given these facts "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
So what alternatives do you have?
Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes give false positives, but this can be mitigated by using several filters.
Cuckoo hashes. They are similar to Bloom filters, but use less memory and have locality-sensitive "hashing" and worst-case constant lookup time.
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has at least 2 children with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward but memory consumption is less.
My favorite: the FM-index. In my opinion the cleanest solution uses the Burrows-Wheeler Transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence into an integer by bit manipulation.
If your k is relatively small, say around 10, then you can encode your DNA sequence in an int via bit manipulation. Since there are only 4 possibilities for each character in the sequence (A, C, G, T), you can simply make your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string with 10 characters as the hash key. Instead, you need only 20 bits of an integer to represent that key string.
Then when you do your rolling hash, you left-shift the integer that stores your previous sequence by 2 bits, then use a bit operation like |= to set the last two bits to your new character. Remember to clear the 2 leftmost bits of the window that you just shifted out (for example by masking to the low 2k bits), meaning you are removing the oldest character from your sliding window.
By doing this, a string can be stored in an integer, and using that integer as the hash key might be nicer and cheaper in terms of the complexity of the hash function computation. If k is somewhat larger than 16, you may be able to use a long value. Otherwise, you might be able to use a bitset or a bit array, but hashing those becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
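A minimal Python sketch of the 2-bit trick, assuming the input contains only the letters A, C, G and T. The function name repeated_sequences and the helper sets are my own naming. Python integers are unbounded, so the mask is what keeps the key to 2k bits here.

```python
def repeated_sequences(dna, k=10):
    """Return each length-k substring of dna that occurs more than once."""
    code = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    mask = (1 << (2 * k)) - 1      # keeps only the low 2k bits (the window)
    seen, reported, result = set(), set(), []
    key = 0
    for i, ch in enumerate(dna):
        # Shift the window left by one letter (2 bits), append the new letter,
        # and drop the letter that fell out of the window via the mask.
        key = ((key << 2) | code[ch]) & mask
        if i < k - 1:
            continue               # window not full yet
        if key in seen:
            if key not in reported:
                reported.add(key)
                result.append(dna[i - k + 1 : i + 1])
        else:
            seen.add(key)
    return result
```

Because the mapping from a length-k sequence to its 2k-bit integer is one-to-one, no two different sequences share a key, so for small k this sidesteps the collision question entirely.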
You can build the suffix array and the LCP array. Iterate through the LCP array, and every time you see a value greater than or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater than or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both the suffix array and the LCP array can be done in linear time. So overall the solution is linear with respect to the size of the input plus output.
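A small Python sketch of the scan described above. Note the assumption: it builds the suffix array by plain sorting and computes each LCP value by direct comparison, which is not the linear-time construction the answer refers to. It only illustrates the reporting logic, and the function name is hypothetical.

```python
def repeated_substrings(text, k):
    """Report every length-k substring that occurs at least twice in text."""
    # Naive suffix array: start positions sorted by the suffix they begin.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    result = []
    in_run = False  # True while consecutive LCP values stay >= k
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        # LCP of two neighbouring suffixes, capped at k (all we need to know).
        lcp = 0
        while (lcp < k and prev + lcp < len(text) and cur + lcp < len(text)
               and text[prev + lcp] == text[cur + lcp]):
            lcp += 1
        if lcp >= k:
            if not in_run:          # first hit of a run: report once
                result.append(text[cur : cur + k])
                in_run = True
        else:
            in_run = False          # run ended, next hit is a new substring
    return result
```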
What you could do is use the Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size 10^18, more or less. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all. As my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than that a collision will happen, since you can drive the collision probability to be as small as you like.
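A rough Python sketch of a rolling hash kept under several moduli at once, using the three moduli mentioned above. The base BASE, the letter codes and the function name window_hashes are assumptions of mine. Comparing the tuples is equivalent to comparing one hash modulo the product of the moduli.

```python
MODULI = (1_000_003, 1_000_033, 1_000_037)  # the three moduli from the text
BASE = 7                                    # assumed base for the polynomial hash
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}     # assumed letter encoding


def window_hashes(dna, k):
    """Yield (start, hash_tuple) for every length-k window of dna."""
    if len(dna) < k:
        return
    # BASE^(k-1) per modulus: the weight of the character leaving the window.
    top = [pow(BASE, k - 1, m) for m in MODULI]
    h = [0] * len(MODULI)
    for ch in dna[:k]:
        h = [(x * BASE + CODE[ch]) % m for x, m in zip(h, MODULI)]
    yield 0, tuple(h)
    for i in range(k, len(dna)):
        out, inc = CODE[dna[i - k]], CODE[dna[i]]
        # Remove the outgoing character, shift, add the incoming one.
        h = [((x - out * t) * BASE + inc) % m
             for x, t, m in zip(h, top, MODULI)]
        yield i - k + 1, tuple(h)
```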

Complexity of binary search on a string

I have a sorted array of strings, e.g. ["bar", "foo", "top", "zebra"], and I want to check whether an input word is present in the array or not.
search(String[] str, String word) {
    // binary search implemented + string comparison
}
Binary search accounts for a complexity of O(logn), where n is the length of the array. So far so good.
But at some point we need to do a string comparison, which can be done in linear time.
The input array can contain words of different sizes. So when I calculate the final complexity, will the answer be O(m*logn), where m is the size of the word we want to search for in the array (in our case "zebra")?
Yes, both your thinking and your proposed solution are correct. You need to consider the length of the longest String too in the overall complexity of String searching.
A trivial String compare is an O(m) operation, where m is the length of the larger of the two strings.
But we can improve a lot, given that the array is sorted. As user "doynax" suggests,
Complexity can be improved by keeping track of how many characters got matched during the string comparisons, and store the present count for the lower and upper bounds during the search. Since the array is sorted we know that the prefix of the middle entry to be tested next must match up to at least the minimum of the two depths, and therefore we can skip comparing that prefix. In effect we're always either making progress or stopping the incremental comparisons immediately on a mismatch, and thereby never needing to keep going over old ground.
So overall, at most m character comparisons have to be done to reach the end of the string if it is found, or fewer if the comparison fails at an early stage.
So, the overall complexity would be O(m + log n).
I was under the impression that the original poster was correct in saying the time complexity is O(m*logn).
If you use the suggested enhancement to improve the time complexity (to get O(m + logn)) by tracking previously matched letters, I believe the inputs below would break it.
arr = ["abc", "def", "ghi", "nlj", "pfypfy", "xyz"]
target = "nljpfy"
I expect this would incorrectly match on "pfypfy". Perhaps one of the original posters can weigh in on this. I'm definitely curious to better understand what was proposed. It sounds like the matched number of letters is skipped in the next comparison.
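For what it's worth, here is a minimal Python sketch of the prefix-skipping search as I read the quoted description. The function name contains and the variable names are my own, not from either poster. It skips only the first min(lo_match, hi_match) characters, i.e. the prefix that both current bounds are known to share with the target, so on the inputs above it does not match "pfypfy".

```python
def contains(sorted_words, target):
    """Binary search over sorted strings, skipping already-matched prefixes."""
    lo, hi = 0, len(sorted_words) - 1
    # Number of leading characters of target known to match the entries
    # just below lo and just above hi (0 while no such entry was compared).
    lo_match = hi_match = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        word = sorted_words[mid]
        # Every entry between the two bounds shares this many characters
        # with target, so the comparison can start there.
        i = min(lo_match, hi_match)
        while i < len(word) and i < len(target) and word[i] == target[i]:
            i += 1
        if i == len(word) and i == len(target):
            return True
        if i == len(target) or (i < len(word) and target[i] < word[i]):
            hi, hi_match = mid - 1, i   # target sorts before word
        else:
            lo, lo_match = mid + 1, i   # target sorts after word
    return False
```

With the array and target from the post, the search compares against "ghi", "pfypfy" and "nlj" and ends with lo > hi, so it reports that the target is absent.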

searching through a vast collection of potential solutions

I have a quite difficult problem (perhaps even an NP-hard problem ^^) with looking for a solution in a massive collection of results. Perhaps there is an algorithm for it.
The exercise below is artificial but is a perfect example to illustrate my issue.
There is a big array of integers. Let's say it has 100.000 elements.
int numbers[] = {-123,32,4,-234564,23,5,....}
I want to check in a relatively quick way whether the sum of any 2 numbers from this array is equal to 0. In other words, if the array has "-123", I want to find out whether there is also a "123".
The easiest solution would be brute force: check everything against everything. That gives 100.000 x 100.000, a big number ;-) Obviously the brute force method can be optimised: order the numbers and check negatives against positives only. My question is: is there something better than optimised brute force to find a solution?
First, sort the array by magnitude of the value.
Then, if the data contains a pair which satisfies the conditions you're after, it contains such a pair adjacent in the array. So just sweep through looking for adjacent pairs whose sum is 0.
Overall time complexity is O(n log n) for the sort, could be O(n) if you use "cheating" sorts not based solely on comparisons. Clearly it can't be done in less than linear time, because in the worst case you can't do it without looking at all the elements. I think n log n is probably optimal in the decision tree model of computing, but only because it "feels a bit like" the element uniqueness problem.
Alternative approach:
Add the elements one at a time to a hash-based or tree-based container. Before adding each element, check whether its negative is present. If so, stop.
This is likely to be faster in the case where there are lots of suitable pairs, because you save the cost of sorting the whole data. That said, you could write a modified sort that exits early by checking for adjacent pairs as soon as any subset of the data is in its final order, but that's effort.
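Both approaches are short enough to sketch in Python. The function names are hypothetical, and a pair is taken to mean two distinct array elements, so a single 0 does not count but two zeros do.

```python
def has_zero_pair_sorted(numbers):
    """Sort by magnitude, then look for adjacent elements that sum to 0."""
    ordered = sorted(numbers, key=abs)
    return any(a + b == 0 for a, b in zip(ordered, ordered[1:]))


def has_zero_pair_hashed(numbers):
    """Add elements to a set one at a time, checking for the negative first."""
    seen = set()
    for x in numbers:
        if -x in seen:
            return True   # early exit: no need to look at the rest
        seen.add(x)
    return False
```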
Brute force would be an O(n^2) solution. You can certainly do better.
Off the top of my head: first sort it. Heap sort has a complexity of O(nlogn).
Now, for each element a, you know you need to find an element b such that a+b = 0. This can be found using binary search (since your array is now sorted). Binary search has a complexity of O(logn), so the n searches together cost O(nlogn).
This gives you an overall solution of O(nlogn) complexity.
The example you provided can be brute-force solved in O(n^2) time.
You can start by ordering the numbers (O(n·logn)) from smallest to largest. If you place one pointer at the beginning (the "most negative" number) and another at the end (the "most positive"), you can check whether there is such a pair of numbers in an additional O(n) steps with the following procedure:
If the numbers at both pointers have the same absolute value, you have the solution.
If not, move the pointer of the number with the bigger absolute value towards "zero" (that is, increase it if it is the pointer on the negative side, decrease it if it is the positive-side one).
Repeat until you find a solution or the pointers cross.
Total complexity is O(n·logn)+O(n) = O(n·logn).
Sort your array in ascending order using Quicksort. After that, use two indexes, let's call them negative and positive.
negative <- 0
positive <- size - 1
while ((negative < positive) and (array[negative] < 0) and (array[positive] > 0)) do
    delta <- array[negative] + array[positive]
    if (delta = 0) then
        return true
    else if (delta < 0) then
        // the negative value dominates, so move to a smaller magnitude
        negative <- negative + 1
    else
        // the positive value dominates, so move to a smaller magnitude
        positive <- positive - 1
    end if
end while
return (array[negative] * array[positive] = 0)
You didn't say what the algorithm should do if 0 is part of the array; I've supposed that in this case true should be returned. The pseudocode assumes the array is not empty.

Tokenize valid words from a long string

Suppose you have a dictionary that contains valid words.
Given an input string with all spaces removed, determine whether the string is composed of valid words or not.
You can assume the dictionary is a hashtable that provides O(1) lookup.
Some examples:
helloworld-> hello world (valid)
isitniceinhere-> is it nice in here (valid)
zxyy-> invalid
If a string has multiple possible parsings, returning true is sufficient.
The string can be very long, so think of an algorithm that is both space and time efficient.
I think the set of all strings that occur as the concatenation of valid words (words taken from a finite dictionary) form a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; computation time is O(n).
For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: states are denoted by 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g); where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input sign, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can do with O(number of words) with the hint for construction above.
For another example, with the words {bag, bat, bun, but} the automaton would look like this:
Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) we now argue that the time to decide whether a string is accepted by the automaton is O(n) where n is the length of the input string.
More formally, our algorithm is as follows:
1. Let S be a set of states, initially containing the starting state.
2. Read the next input character, let us denote it by a.
3. For each element s in S, determine the state that we move into from s on reading a; that is, the state r such that with the notation above (s,r,a) is an edge. Let us denote the set of these states by R. That is, R = {r | s in S, (s,r,a) is an edge}. (If R is empty, the string is not accepted and the algorithm halts.)
4. If there are no more input symbols, check whether any of the accepting states is in R. (In our case, there is only one accepting state, the starting state.) If so, the string is accepted, if not, the string is not accepted.
5. Otherwise, take S := R and go to 2.
Now, there are as many executions of this cycle as there are input symbols. The only thing we have to examine is that steps 3 and 5 take constant time. Given that the size of S and R is not greater than the number of states in the automaton, which is constant and that we can store edges in a way such that lookup time is constant, this follows. (Note that we of course lose multiple 'parsings', but that was not a requirement either.)
I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.
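To make the state-set simulation concrete, here is a minimal Python sketch. It assumes the automaton is represented as a trie over the dictionary in which finishing a word also leads back to the start state 0. That representation, and the names build_trie and can_tokenize, are my own choices and not necessarily the construction intended above.

```python
def build_trie(words):
    """Return (children, is_word): per-state edge maps and end-of-word flags."""
    children, is_word = [{}], [False]   # state 0 is the start state
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                children[node][ch] = len(children)
                children.append({})
                is_word.append(False)
            node = children[node][ch]
        is_word[node] = True
    return children, is_word


def can_tokenize(text, words):
    """Simulate the automaton, tracking the set S of reachable states."""
    children, is_word = build_trie(words)
    states = {0}
    for ch in text:
        reached = set()                 # the set R from the steps above
        for s in states:
            child = children[s].get(ch)
            if child is not None:
                reached.add(child)
                if is_word[child]:
                    reached.add(0)      # a word just ended: back to the start
        if not reached:
            return False                # R is empty, reject immediately
        states = reached
    return 0 in states                  # accept only in the start state
```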
I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, with input being the string, result either true or false depending if the entire string can be tokenized correctly.
Works like this:
1. If input is the empty string, return true.
2. Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the suffix of input. If that returns true, return true as well.
3. If the length-one prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (already at the end of the string), return false.
Rinse and repeat.
For dictionaries with a low to moderate number of ambiguous prefixes, this should fetch a pretty good running time in practice (O(n) in the average case, I'd say), though in theory, pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better since we need backtracking anyways, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.
EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.
Space complexity would be only stack space, so O(n) even in the worst-case.
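A direct Python transcription of the steps above, as a sketch. It passes a start index instead of slicing off the suffix, which is my own simplification, and the function name is hypothetical. For very long inputs Python's recursion limit would be hit before the stack space argument matters.

```python
def can_tokenize_recursive(text, dictionary, start=0):
    """Return True if text[start:] can be split into dictionary words."""
    if start == len(text):
        return True                      # empty remainder: success
    # Try prefixes of increasing length; recurse on the rest when one matches.
    for end in range(start + 1, len(text) + 1):
        if text[start:end] in dictionary and can_tokenize_recursive(text, dictionary, end):
            return True
    return False                         # no prefix worked: backtrack
```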

Using Rabin-Karp to search for multiple patterns in a string

According to the wikipedia entry on Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns with differing length simultaneously. Can someone please shed some light on this?
Edit (December 2011):
The wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).
I'm not sure if this is the correct answer, but anyway:
While constructing the hash value, we can look up the current (partial) hash value in the set of pattern hashes. The hash function is usually implemented as a loop, and inside that loop we can insert our quick lookup.
Of course, we must pick m to be the maximum string length in the set of strings.
Update: From Wikipedia,
for i from 1 to n-m+1
    if hs ∈ hsubs
        if s[i..i+m-1] = a substring with hash hs
            return i
    hs := hash(s[i+1..i+m]) // <---- calculating current hash
We calculate the current hash in m steps. On each step there is a temporary hash value that we can look up (O(1) complexity) in the set of hashes. All hashes will have the same size, i.e. 32 bits.
Update 2: an amortized (average) O(n) time complexity?
Above I said that m must be the maximum string length. It turns out that we can exploit the opposite.
With hashing for shifting substring search and a fixed m size we can achieve O(n) complexity.
If we have variable-length strings we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with its first m characters.
Now, while searching the text we check if the current hash is in the hash set and we examine the associated strings for a match.
This technique will increase the false alarms but on average it has O(n) time complexity.
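A minimal Python sketch of that variant, assuming non-empty patterns. The base, the modulus and the name find_any are my own choices. Each pattern is filed under the hash of its first m characters, and a window whose hash hits a bucket is verified by a direct string comparison.

```python
BASE = 256
MOD = (1 << 61) - 1   # assumed large prime modulus


def _hash(s):
    """Polynomial hash of s, most significant character first."""
    value = 0
    for ch in s:
        value = (value * BASE + ord(ch)) % MOD
    return value


def find_any(text, patterns):
    """Return (index, pattern) for the first match found, or None."""
    m = min(len(p) for p in patterns)        # minimum pattern length
    buckets = {}
    for p in patterns:
        buckets.setdefault(_hash(p[:m]), []).append(p)
    if len(text) < m:
        return None
    top = pow(BASE, m - 1, MOD)              # weight of the outgoing character
    hs = _hash(text[:m])
    for i in range(len(text) - m + 1):
        for p in buckets.get(hs, ()):        # candidates sharing this prefix hash
            if text.startswith(p, i):        # verify, since hashes may collide
                return i, p
        if i + m < len(text):
            # Roll the window: drop text[i], append text[i + m].
            hs = ((hs - ord(text[i]) * top) * BASE + ord(text[i + m])) % MOD
    return None
```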
It's because the hash values of the substrings are related mathematically. Computing the hash H(S,j) (the hash of the characters starting from the jth position of string S) takes O(m) time on a string of length m. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).
O(m) + O(1) => O(m), i.e. linear time.
Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")
