Repeated DNA sequence - algorithm

The problem is to find out all the sequences of length k in a given DNA sequence which occur more than once. I found a approach of using a rolling hash function, where for each sequence of length k, hash is computed and is stored in a map. To check if the current sequence is a repetition, we compute it's hash and check if the hash already exist in the hash map. If yes, then we include this sequence in our result, otherwise add it to the hash map.
Rolling hash here means, when moving on to the next sequence by sliding the window by one, we use the hash of previous sequence in a way that we remove the contribution of the first character of previous sequence and add the contribution of the newly added char i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't go about making a perfect hash function so that collisions are avoided. It would be a great help if somebody can explain how to make a perfect hash under any circumstance and most importantly in this case.

This is actually a research problem.
Let's come to terms with some facts
Input = N, Input length = |N|
You have to move a size k, here k=10, sliding window over the input. Therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality sensitive deterministic hashing, the downside of deterministic hashing is the benefit of hashing is greatly diminished as the more often you encounter similar strings the harder it will be to hash
The longer your input the less effective hashing will be
Given these facts "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
SO what alternatives do you have?
Bloom Filters. They are much more robust than simple hashing. The downside is sometimes they have a false positives. But this can be mitigated by using several filters.
Cuckoo Hashes similar to bloom filters, but use less memory and have locality sensitive "hashing" and worst case constant lookup time
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has atleast 2 children with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward but memory consumption is less.
My favorite the FM-Index. In my opinion the cleanest solution uses the Burrows Wheeler Transform. This technique is also used in industryu tools like Bowtie and BWA

Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encrypt the sequence into an integer by bit manipulation.
If your input k is relatively small, let's say around 10. Then you can encrypt your DNA sequence in an int via bit manipulation. Since for each character in the sequence, there are only 4 possibilities, A, C, G, T. You can simply make your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string with 10 characters as hash key. Instead, you can only use 20 bits in an integer to represent the previous key string.
Then when you do your rolling hash, you left shift the integer that stores your previous sequence for 2 bits, then use any bit operations like |= to set the last two bits with your new character. And remember to clear the 2 left most bits that you just shifted, meaning you are removing them from your sliding window.
By doing this, a string could be stored in an integer, and using that integer as hash key might be nicer and cheaper in terms of the complexity of the hash function computation. If your input length k is slightly longer than 16, you may be able to use a long value. Otherwise, you might be able to use a bitset or a bitarray. But to hash them becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.

You can build the suffix array and the LCP array. Iterate through the LCP array, every time you see a value greater or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both, the suffix array and the LCP, can be done in linear time. So overall the solution is linear with respect to the size of the input plus output.

What you could do is use Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size 10^18 more or less. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all---as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than a collision to happen, since you can drive that collision probability to be as arbitrarily small as you like.

Related

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and that you have a record of where these people are, exactly M of these records to be exact.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place with 'person 50' three times. Here M = 3 obviously since there's only 3 lines. My question is given M of these lines, and a threshold value (i.e person A and B have been at the same place more than threshold times), what do you suggest the most efficient way of returning these co-occurrences?
So far I've built an N by N table, and looped through each row, incrementing table(N,M) every time N co occurs with M in a row. Obviously this is an awful approach and takes 0(n^2) to O(n^3) depending on how you implent. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
answer = []
for S in sets:
for (i, j) in pairs from S:
count[(i,j)]++
if threshold == count[(i,j)]:
answer.append((i,j))
If you have M sets of size of size K the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
The known concept for your case is Market Basket analysis. In this context, there are different algorithms. For example Apriori algorithm can be using for your case in a specific case for sets of size 2.
Moreover, in these cases to finding association rules with specific supports and conditions (which for your case is the threshold value) using from LSH and min-hash too.
you could use probability to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed up. Then double check any pairs that make it close enough to 1/50th of M.
To double check any pairs, you can either go through the whole list again, or you could double check more efficiently if you do some clever kind of reverse indexing as you go. e.g. encode each persons row indices into 64 bit integers, you could use binary search / merge sort type techniques to see which 64 bit integers to compare, and use bit operations to compare 64 bit integers for matches. Other things to look up could be reverse indexing, binary indexed range trees / fenwick trees.

How good is hash function that is linear combination of values?

I was reading text about hashing , I found out that naive hash code of char string can be implemented as polynomial hash function
h(S0,S1,S2,...SN-1) = S0*A^N-1 + S1*A^N-2 + S2*A^N-3 ..... SN-1*A^0. Where Si is character at index i and A is some integer.
But cannot we straightaway sum as
h(S0,S1,S2,...SN-1) = S0*(N)+S1*(N-1)+S2*(N-2) ...... SN-1*1.
I see this function also as good since two values 2*S0+S1 != 2*S1+S0 (which are reverse) are not hashed to same values. But nowhere i find this type of hash function
Suppose we work with strings of 30 characters. That's not long, but it's not so short that problems with the hash should arise purely because the strings are too short.
The sum of the weights is 465 (1+2+...+30), with printable ASCII characters that makes the maximum hash 58590, attained by "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~". There are a lot more possible printable ASCII strings of 30 characters than that (9530 ≈ 2E59), but they all hash into the range of 0 to 58590. Naturally you cannot actually have that many strings at the same time, but you could have a lot more than 58590, and that would guarantee collisions just based on counting (it is very likely to happen much sooner of course).
The maximum hash grows only slowly, you'd need strings of 34 million characters before the entire range of a 32bit integer is used.
The other way, multiplying by powers of A (this can be evaluated with Horner's scheme so no powers needs to be calculated explicitly, it still only costs an addition and a multiplication per character, though the naive way is not the fastest way to compute that hash), does not have this problem. The powers of A quickly get big (and start wrapping, which is fine as long as A is odd), so strings with 30 characters stand a good chance to cover the entire range of whatever integer type you're using.
The problem with a linear hash function is that it's much easier to generate collisions.
Consider a string with 3 chars: S0, S1, S2.
The proposed hash code would be 3 * S0 + 2 * S1 + S2.
Every time we decrease char S2 by two (e.g. e --> c), and increase char S1 by one (e.g. m --> n), we obtain the same hash code.
Even only the fact that it is possible to describe an operation preserving hash so easily would be an alarm (because some algorithm might process the string exactly in that manner). As a more extreme case consider just summing the characters. In this situation all the anagrams of the original string would generate the same hash code (thus this hash would be useless in an application processing anagrams).

Upper bound on 4 digit sequences in pi

If this is not the right SE site for this question, please let me know.
A friend shared this interview question he received over the phone, which I have tried to solve myself. I will paraphrase:
The value of pi up to n digits as a string is given.
How can I find all duplicate 4 digit sequences in this string?
This part seems fairly straight forward. Add 4 character sequences to a hash table, incrementing one character at a time. Check if the current 4 character sequence already exists before insertion into the hash table. If so, then you have found a duplicate. Store this somewhere, and repeat the process. I was told this was more or less correct.
The issue I have is on the second question:
What is the upper bound?
n = 10,000,000 was an example.
My algorithm background is admittedly very rusty. My first thought is that the upper bound must be related to n somehow, but I was told it is not.
How do I calculate this?
EDIT:
I would also be open to a solution that disregards the restraint that the upper bound is not related to n. Either is acceptable.
There are only 10,000 possible sequences of four digits (0000 to 9999), so at some point you will have found that every sequence has been duplicated, and there's no need to process further digits.
If you assume that pi is a perfectly uniform random number generator, then each new digit that's processes results in a new sequence, and after about 20,000 digits, you will have found duplicates for all 10,000 sequences. Given that pi is not perfect, you may need significantly more digits before you duplicate all sequences, but 100,000 would be a reasonable guess at the upper bound.
Also, since there are only 10,000 possibilities, you don't really need a hash table. You can simply use an array of 10000 counters, (int count[10000]), and increment the count for each sequence you find.
The upper bound of your solution is the size of the hash table that you can fit into memory.
An alternate technique is to generate all the sequences and sort them. Then the duplicates will be adjacent and easy to detect. You can generally fit more into a linear data structure than you can a hash table, and if you still exhaust memory you can sort to/from disk.
Edit: unless "upper bound" means the O(n) of the algorithm, which should be easy to figure out.

First pair of numbers adding to a specific value in a stream

There are a stream of integers coming through. The problem is to find the first pair of numbers from the stream that adds to a specific value (say, k).
With static arrays, one can use either of the below approaches:
Approach (1): Sort the array, use two pointers to beginning and end of array and compare.
Approach (2): Use hashing, i.e. if A[i]+A[j]=k, then A[j]=k-A[i]. Search for A[j] in the hash table.
But neither of these approaches scale well for streams. Any thoughts on efficiently solving this?
I believe that there is no way to do this that doesn't use at least O(n) memory, where n is the number of elements that appear before the first pair that sums to k. I'm assuming that we are using a RAM machine, but not a machine that permits awful bitwise hackery (in other words, we can't do anything fancy with bit packing.)
The proof sketch is as follows. Suppose that we don't store all of the n elements that appear before the first pair that sums to k. Then when we see the nth element, which sums with some previous value to get k, there is a chance that we will have discarded the previous element that it pairs with and thus won't know that the sum of k has been reached. More formally, suppose that an adversary could watch what values we were storing in memory as we looked at the first n - 1 elements and noted that we didn't store some element x. Then the adversary could set the next element of the stream to be k - x and we would incorrectly report that the sum had not yet been reached, since we wouldn't remember seeing x.
Given that we need to store all the elements we've seen, without knowing more about the numbers in the stream, a very good approach would be to use a hash table that contains all of the elements we've seen so far. Given a good hash table, this would take expected O(n) memory and O(n) time to complete.
I am not sure whether there is a more clever strategy for solving this problem if you make stronger assumptions about the sorts of numbers in the stream, but I am fairly confident that this is asymptotically ideal in terms of time and space.
Hope this helps!

Best data structure to store lots one bit data

I want to store lots of data so that
they can be accessed by an index,
each data is just yes and no (so probably one bit is enough for each)
I am looking for the data structure which has the highest performance and occupy least space.
probably storing data in a flat memory, one bit per data is not a good choice on the other hand using different type of tree structures still use lots of memory (e.g. pointers in each node are required to make these tree even though each node has just one bit of data).
Does anyone have any Idea?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient) ?
Well in Java the BitSet might be a good choice http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly you should store them in an unsigned integer where you assign each value to a bit of the integer (flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then return the result of arr[n / n_bits] & (1 << (n % n_bits))
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-): Quote from the above wiki page.
An empty Bloom filter is a bit array
of m bits, all set to 0. There must
also be k different hash functions
defined, each of which maps or hashes
some set element to one of the m array
positions with a uniform random
distribution. To add an element, feed
it to each of the k hash functions to
get k array positions. Set the bits at
all these positions to 1. To query for
an element (test whether it is in the
set), feed it to each of the k hash
functions to get k array positions. If
any of the bits at these positions are
0, the element is not in the set – if
it were, then all the bits would have
been set to 1 when it was inserted. If
all are 1, then either the element is
in the set, or the bits have been set
to 1 during the insertion of other
elements. The requirement of designing
k different independent hash functions
can be prohibitive for large k. For a
good hash function with a wide output,
there should be little if any
correlation between different
bit-fields of such a hash, so this
type of hash can be used to generate
multiple "different" hash functions by
slicing its output into multiple bit
fields. Alternatively, one can pass k
different initial values (such as 0,
1, ..., k − 1) to a hash function that
takes an initial value; or add (or
append) these values to the key. For
larger m and/or k, independence among
the hash functions can be relaxed with
negligible increase in false positive
rate (Dillinger & Manolios (2004a),
Kirsch & Mitzenmacher (2006)).
Specifically, Dillinger & Manolios
(2004b) show the effectiveness of
using enhanced double hashing or
triple hashing, variants of double
hashing, to derive the k indices using
simple arithmetic on two or three
indices computed with independent hash
functions. Removing an element from
this simple Bloom filter is
impossible. The element maps to k
bits, and although setting any one of
these k bits to zero suffices to
remove it, this has the side effect of
removing any other elements that map
onto that bit, and we have no way of
determining whether any such elements
have been added. Such removal would
introduce a possibility for false
negatives, which are not allowed.
One-time removal of an element from a
Bloom filter can be simulated by
having a second Bloom filter that
contains items that have been removed.
However, false positives in the second
filter become false negatives in the
composite filter, which are not
permitted. In this approach re-adding
a previously removed item is not
possible, as one would have to remove
it from the "removed" filter. However,
it is often the case that all the keys
are available but are expensive to
enumerate (for example, requiring many
disk reads). When the false positive
rate gets too high, the filter can be
regenerated; this should be a
relatively rare event.

Resources