constructing binary sequences with unique n-bit windows - algorithm

A question that was asked during a job interview (which I pretty much failed) and
sadly, something I still cannot figure out.
Let's assume that you're given some positive integer, n.
Assume that you construct a sequence consisting of only 1 and 0, and
you want to construct a sequence of length 2^n + n-1 such that
every sequence of length n consisting of adjacent numbers is unique.
for instance
00110 (00, 01, 11, 10) for n=2
How would one construct such a sequence?
I think one should start with 0000..0 (n zeroes) and
do something about it.
If there is a constructive way of doing it, maybe
I could extend that method to constructing
a sequence consisting of only 0, 1, ..., k-1, and having
length k^n + n-1 such that
every sequence of length n consisting of adjacent numbers is unique
(or maybe not..)
(Sorry, my sequence for n=3 was wrong, so I deleted it.
Also, I'd never heard of de Bruijn sequences. I know them now!
Thanks for all the answers and comments.)

This strikes me as a very ambitious interview question; if you don't know the answer, you're unlikely to get it in a few minutes.
As mentioned in comments, this is really just the derivation of a de Bruijn sequence, only unwrapped. You can read the Wikipedia article linked above for more information, but the algorithms it proposes, while efficient, are not exactly easy to derive. There is a much simpler (but rather more storage-intensive) algorithm which I think is folkloric; at least, I don't know of a name attached to it. It's at least simple to describe:
Start with n 0s
As long as possible:
If you can add a 1 without repeating a previously-seen n-sequence, do so.
If not but you can add a 0 without repeating a previously-seen n-sequence, do so.
Otherwise, done.
This requires you to either search the entire string on each iteration, requiring exponential time, or maintain a boolean array of all seen sequences (coded as binary numbers, presumably), requiring exponential space. The "concatenate all Lyndon words in lexicographical order" solution is much more efficient, but leaves open the question of generating all Lyndon words in lexicographical order.
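For concreteness, here is a minimal Python sketch of that folkloric algorithm (the function name is mine; the set of seen windows plays the role of the boolean array mentioned above):

```python
def debruijn_unwrapped(n):
    # Greedy construction: start with n zeros, then repeatedly append
    # a 1 if that doesn't repeat an n-bit window, else a 0, else stop.
    seq = [0] * n
    seen = {tuple(seq)}               # every n-bit window observed so far
    while True:
        for bit in (1, 0):            # prefer 1 over 0
            window = tuple(seq[len(seq) - n + 1:]) + (bit,)
            if window not in seen:
                seen.add(window)
                seq.append(bit)
                break
        else:                         # neither bit extends the sequence
            return ''.join(map(str, seq))

print(debruijn_unwrapped(2))          # 00110
print(len(debruijn_unwrapped(4)))     # 19 == 2**4 + 4 - 1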

Related

Repeated DNA sequence

The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach using a rolling hash function: for each sequence of length k, a hash is computed and stored in a map. To check whether the current sequence is a repetition, we compute its hash and check whether that hash already exists in the hash map. If it does, we include the sequence in our result; otherwise we add the hash to the map.
Rolling hash here means that when moving on to the next sequence by sliding the window by one, we reuse the hash of the previous sequence: we remove the contribution of the first character of the previous sequence and add the contribution of the newly added character, i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't work out how to build a perfect hash function so that collisions are avoided. It would be a great help if somebody could explain how to make a perfect hash under any circumstances and, most importantly, in this case.
This is actually a research problem.
Let's come to terms with some facts
Input = N, Input length = |N|
You have to move a size-k (here k=10) sliding window over the input. Therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality-sensitive deterministic hashing. The downside of deterministic hashing is that its benefit is greatly diminished: the more often you encounter similar strings, the harder they become to hash apart.
The longer your input, the less effective hashing will be.
Given these facts, "rolling hashes" will soon fail. You cannot design a rolling hash that will work for even a tenth of a chromosome.
So what alternatives do you have?
Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes report false positives, but this can be mitigated by using several filters.
Cuckoo hashing: similar to Bloom filters, but it uses less memory, offers locality-sensitive "hashing", and has worst-case constant lookup time.
Just stick every suffix into a suffix trie. Once this is done, output every string at depth 10 whose subtree contains at least two suffixes, i.e. every 10-mer that occurs more than once.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward, but memory consumption is lower.
My favorite: the FM-index. In my opinion the cleanest solution, it uses the Burrows-Wheeler transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence as an integer via bit manipulation.
If your input k is relatively small, say around 10, then you can encode your DNA sequence in an int via bit manipulation. Since each character in the sequence has only 4 possibilities, A, C, G, T, you can simply define your own mapping that uses 2 bits per letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
This way, if k is 10, you won't need a 10-character string as the hash key. Instead, you can use just 20 bits of an integer to represent the previous key string.
Then, when you roll the hash, you left-shift the integer that stores your previous sequence by 2 bits and use a bit operation such as |= to set the lowest two bits to the new character. Remember to clear the 2 bits that were shifted out past position 2k-1, which removes the departing character from your sliding window.
By doing this, a string can be stored in an integer, and using that integer as the hash key is nicer and cheaper in terms of hash-function computation. If your input length k is slightly larger than 16, you may be able to use a long value (a 32-bit int holds only 16 two-bit characters). Otherwise, you might use a bitset or a bit array, but hashing those becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
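A minimal Python sketch of this trick (the mapping and function name are mine; Python's int is unbounded, so the 2k-bit mask does the work that fitting into an int/long does in Java or C). Note that the encoding is exact, with no collisions, so for small k this is effectively the perfect hash the question asked for:

```python
def repeated_sequences(dna, k=10):
    # Encode A/C/G/T as 2 bits each; a k-mer then fits in 2k bits.
    code = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
    mask = (1 << (2 * k)) - 1          # keep only the last k characters
    window, seen, repeated = 0, set(), set()
    for i, ch in enumerate(dna):
        # Shift in the new character; the mask drops the oldest one.
        window = ((window << 2) | code[ch]) & mask
        if i >= k - 1:
            if window in seen:
                repeated.add(dna[i - k + 1:i + 1])
            seen.add(window)
    return repeated

print(repeated_sequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# {'AAAAACCCCC', 'CCCCCAAAAA'}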
You can build the suffix array and the LCP array. Iterate through the LCP array; every time you see a value greater than or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater than or equal to k, skip the following values until reaching one that is less than k (this avoids reporting the same substring repeatedly).
The construction of both the suffix array and the LCP array can be done in linear time, so overall the solution is linear in the size of the input plus output.
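A short Python sketch of the idea. To stay compact it sorts the suffixes directly (O(n^2 log n)) instead of using a linear-time construction, and it compares k-length prefixes of adjacent suffixes rather than building an explicit LCP array, but the reporting logic is the same; the set handles the deduplication described above:

```python
def repeated_kmers(s, k):
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # suffix array
    found = set()
    for a, b in zip(sa, sa[1:]):
        # Adjacent suffixes sharing a prefix of length >= k
        # mean the k-mer s[a:a+k] occurs at least twice.
        if len(s) - a >= k and len(s) - b >= k and s[a:a + k] == s[b:b + k]:
            found.add(s[a:a + k])
    return found

print(repeated_kmers("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10))
# {'AAAAACCCCC', 'CCCCCAAAAA'}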
What you could do is use the Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have the three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size roughly 10^18. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all; as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than for a collision to happen, since you can drive the collision probability to be arbitrarily small.
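A sketch of that idea as a double rolling hash (the function name is mine; the moduli are two of the primes mentioned above). A window is reported only if it collides in every modulus simultaneously, which by CRT behaves like a single modulus near 10^12:

```python
def find_repeats(s, k=10, mods=(10**6 + 3, 10**6 + 33)):
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    base = 4
    top = [pow(base, k, m) for m in mods]   # weight of the outgoing char
    h = [0] * len(mods)
    seen, repeats = set(), set()
    for i, ch in enumerate(s):
        for j, m in enumerate(mods):
            h[j] = (h[j] * base + code[ch]) % m
            if i >= k:                      # drop the char leaving the window
                h[j] = (h[j] - code[s[i - k]] * top[j]) % m
        if i >= k - 1:
            key = tuple(h)                  # the pair of residues is the key
            if key in seen:
                repeats.add(s[i - k + 1:i + 1])
            seen.add(key)
    return repeats

print(find_repeats("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# {'AAAAACCCCC', 'CCCCCAAAAA'}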

Dynamic algorithm to multiply elements in a sequence two at a time and find the total

I am trying to find a dynamic programming approach to multiply each element in a linear sequence by the following element, do the same with the next pair of elements, and so on, and find the sum of all of the products. Note that not just any two elements can be multiplied: it must be the first with the second, the third with the fourth, and so on. All I know about the linear sequence is that it has an even number of elements.
I assume I have to store the numbers being multiplied and their product each time, then check some other "multipliable" pair of elements to see whether the product has already been calculated (perhaps they have opposite signs compared to the current pair).
However, by my understanding of a linear sequence, the values must increase or decrease by the same amount each time. But since there is an even number of elements, I don't believe it is possible for two "multipliable" pairs to give the same product (even allowing for opposite signs), due to the issue shown in the following example:
Sequence: { -2, -1, 0, 1, 2, 3 }
Pairs: -2*-1, 0*1, 2*3
Clearly, since there is an even number of pairs, the only case in which the same multiplication may occur more than once is if the elements increase/decrease by 0 each time.
I fail to see how this is a dynamic programming question; if anyone could clarify, it would be greatly appreciated!
A quick google for define linear sequence gave
A number pattern which increases (or decreases) by the same amount each time is called a linear sequence. The amount it increases or decreases by is known as the common difference.
In your case the common difference is 1. And you are not considering any other case.
The same multiplication may occur in the following sequence
Sequence = {-3, -1, 1, 3}
Pairs = -3 * -1 , 1 * 3
with a common difference of 2.
However, this does not necessarily have to be solved by dynamic programming. You can just iterate over the numbers, store the product of each pair in a set (as a set contains only unique values), and then find the sum.
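For instance, a minimal Python sketch of that iterate-and-dedupe approach (the function name is mine):

```python
def sum_distinct_pair_products(seq):
    # Multiply consecutive pairs; the set keeps each product only once.
    products = {seq[i] * seq[i + 1] for i in range(0, len(seq), 2)}
    return sum(products)

print(sum_distinct_pair_products([-3, -1, 1, 3]))  # both pairs give 3 -> 3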
Probably not what you are looking for, but I've found a closed solution for the problem.
Suppose we observe the first two numbers. Denote the first number by a and the common difference by d, and let the whole sequence contain 2n numbers in total. Then the sum you defined is:
sum = n*a^2 + n(2n-1)*a*d + n(4n^2 - 3n - 1)*d^2/3
That aside, I also fail to see how this is a dynamic programming problem, or at least one where a dynamic programming approach does much. It is not likely that the sequence will cross from negative to positive at all, and even then, the chance of seeing repeated products decreases the bigger the difference between two numbers is. Furthermore, multiplication is so fast that the overhead of fetching cached products from a data structure might cost more than recomputing them (a mul instruction is probably faster than a lw).
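A quick numeric check of the closed form against the direct loop (function names are mine):

```python
def direct_sum(a, d, n):
    # 2n-term arithmetic sequence; multiply consecutive pairs and sum.
    seq = [a + i * d for i in range(2 * n)]
    return sum(seq[2 * i] * seq[2 * i + 1] for i in range(n))

def closed_form(a, d, n):
    # The closed form quoted above.
    return n * a**2 + n * (2 * n - 1) * a * d + n * (4 * n**2 - 3 * n - 1) * d**2 / 3

# The example sequence {-2, -1, 0, 1, 2, 3} is a = -2, d = 1, n = 3:
print(direct_sum(-2, 1, 3), closed_form(-2, 1, 3))  # 8 8.0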

Upper bound on 4 digit sequences in pi

If this is not the right SE site for this question, please let me know.
A friend shared this interview question he received over the phone, which I have tried to solve myself. I will paraphrase:
The value of pi up to n digits as a string is given.
How can I find all duplicate 4 digit sequences in this string?
This part seems fairly straightforward. Add 4-character sequences to a hash table, advancing one character at a time. Check whether the current 4-character sequence already exists before inserting it into the hash table. If it does, you have found a duplicate. Store it somewhere and repeat the process. I was told this was more or less correct.
The issue I have is on the second question:
What is the upper bound?
n = 10,000,000 was an example.
My algorithm background is admittedly very rusty. My first thought is that the upper bound must be related to n somehow, but I was told it is not.
How do I calculate this?
EDIT:
I would also be open to a solution that disregards the constraint that the upper bound is not related to n. Either is acceptable.
There are only 10,000 possible sequences of four digits (0000 to 9999), so at some point you will have found that every sequence has been duplicated, and there's no need to process further digits.
If you assume that pi is a perfectly uniform random number generator, then each new digit that's processed yields a new 4-digit sequence, and after about 20,000 digits every sequence will have occurred twice on average. Given that pi is not perfect, you may need significantly more digits before you duplicate all sequences, but 100,000 would be a reasonable guess at the upper bound.
Also, since there are only 10,000 possibilities, you don't really need a hash table. You can simply use an array of 10000 counters, (int count[10000]), and increment the count for each sequence you find.
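In Python the same counting scheme looks like this (a toy input stands in for the digits of pi):

```python
def duplicated_windows(digits):
    # One counter per possible 4-digit window, 0000..9999.
    count = [0] * 10000
    dups = []
    for i in range(len(digits) - 3):
        w = int(digits[i:i + 4])
        count[w] += 1
        if count[w] == 2:      # report each window the first time it repeats
            dups.append(digits[i:i + 4])
    return dups

print(duplicated_windows("00120012"))  # ['0012']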
The upper bound of your solution is the size of the hash table that you can fit into memory.
An alternate technique is to generate all the sequences and sort them. Then the duplicates will be adjacent and easy to detect. You can generally fit more into a linear data structure than you can a hash table, and if you still exhaust memory you can sort to/from disk.
Edit: unless "upper bound" means the O(n) of the algorithm, which should be easy to figure out.
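A sketch of the sort-based variant above (in Python the windows are materialized in memory, whereas the point is that the same idea extends to an external, disk-based sort):

```python
def duplicates_by_sorting(digits, k=4):
    # Generate every k-digit window, sort, and scan adjacent entries.
    windows = sorted(digits[i:i + k] for i in range(len(digits) - k + 1))
    return sorted({a for a, b in zip(windows, windows[1:]) if a == b})

print(duplicates_by_sorting("00120012"))  # ['0012']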

Linear algorithm on binary strings

I'm going through some old midterms to study. (None of the solutions are given)
I've come across this problem which I'm stuck on
Let n = 2^ℓ − 1 for some positive integer ℓ. Suppose someone claims to hold an array A[1 .. n] of
distinct ℓ-bit strings; thus, exactly one ℓ-bit string does not appear in A. Suppose further that the
only way we can access A is by calling the function FetchBit(i, j), which returns the jth bit of the string A[i] in O(1) time.
Describe an algorithm to find the missing string in A using only O(n) calls to FetchBit.
The only thing I can think of is to go through each string, convert it to base 10, sort them all, and then see which value is missing. But that's certainly not O(n).
Proof it's not homework... http://web.engr.illinois.edu/~jeffe/teaching/algorithms/hwex/f12/midterm1.pdf
You can do it in 2n operations.
First, look at the first bit of every string. Obviously, you will get 2^(ℓ-1) zeros and 2^(ℓ-1) - 1 ones, or vice versa (because exactly one string is missing). If there are 2^(ℓ-1) - 1 ones, then you know that the first bit of the missing string is one; otherwise it is zero.
Now you know the first bit of the missing string. Look at all the strings which share that first bit (there are 2^(ℓ-1) - 1 of them) and repeat the same procedure on their second bit. This determines the second bit of the missing string, and so on.
The total number of FetchBit calls will be (2^ℓ - 1) + (2^(ℓ-1) - 1) + ... + (2^1 - 1) <= 2^(ℓ+1) = 2n + 2 = O(n).
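A Python sketch of the procedure (FetchBit is modeled as a callback, and indices are 0-based here, unlike the 1-based A[1..n] in the problem statement):

```python
def find_missing(n_strings, ell, fetch_bit):
    # candidates: indices of A still consistent with the bits of the
    # missing string found so far.  Total calls: n + n/2 + ... = O(n).
    candidates = list(range(n_strings))
    bits = []
    for j in range(ell):
        zeros, ones = [], []
        for i in candidates:          # one FetchBit call per candidate
            (zeros if fetch_bit(i, j) == 0 else ones).append(i)
        # The strictly smaller half is the one missing a string.
        if len(zeros) < len(ones):
            bits.append('0')
            candidates = zeros
        else:
            bits.append('1')
            candidates = ones
    return ''.join(bits)

A = ['00', '01', '11']                # '10' is missing
print(find_missing(len(A), 2, lambda i, j: int(A[i][j])))  # 10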

Word Prediction algorithm

I'm sure there is a post on this, but I couldn't find one asking this exact question. Consider the following:
We have a word dictionary available
We are fed many paragraphs of words, and I wish to be able to predict the next word in a sentence given this input.
Say we have a few sentences such as "Hello my name is Tom", "His name is jerry", "He goes where there is no water". We check a hash table to see whether each word exists; if it does not, we assign it a unique ID and put it in the hash table. This way, instead of storing a "chain" of words as a bunch of strings, we can just keep a list of unique IDs.
Above, we would have for instance (0, 1, 2, 3, 4), (5, 2, 3, 6), and (7, 8, 9, 10, 3, 11, 12). Note that 3 is "is", and we added new unique IDs as we discovered new words. So say we are given the sentence "her name is"; this would be (13, 2, 3). We want to know: given this context, what should the next word be? This is the algorithm I thought of, but I don't think it's efficient:
We have a list of N chains (observed sentences) where a chain may be ex. 3,6,2,7,8.
Each chain is on average size M, where M is the average sentence length
We are given a new chain of size S, ex. 13, 2, 3, and we wish to know what is the most probable next word?
Algorithm:
First scan the entire list of chains for those which contain the full S-word input ((13, 2, 3) in this example). Since we have to scan N chains, each of length M, comparing S numbers at a time, this is O(N*M*S).
If no chains in our scan contain the full S, scan again after removing the least significant word (i.e. the first one, so remove 13). Now scan for (2, 3) as in step 1, again worst case O(N*M*S), where the effective S is really S-1.
Continue scanning this way until we get results > 0 (if ever).
Tally the next words in all of the remaining chains we have gathered. We can use a hash table which counts every time we add, and keeps track of the most added word. O(N) worst case build, O(1) to find max word.
The max word found is the most likely, so return it.
Each scan takes O(M*N*S) worst case. This is because there are N chains, each chain has M numbers, and we must compare S numbers to test for a match. Worst case we scan S times ((13,2,3), then (2,3), then (3): 3 = S scans). Thus, the total complexity is O(S^2 * M * N).
So if we have 100,000 chains and an average sentence length of 10 words, we're looking at 1,000,000*S^2 to get the optimal word. Clearly, N >> M, since sentence length does not scale with number of observed sentences in general, so M can be a constant. We can then reduce the complexity to O(S^2 * N). O(S^2 * M * N) may be more helpful for analysis though, since M can be a sizeable "constant".
This could be the completely wrong approach for this type of problem, but I wanted to share my thoughts instead of just blatantly asking for assistance. The reason I'm scanning the way I do is that I only want to scan as much as I have to: if nothing contains the full S, just keep pruning S until some chains match. If they never match, we have no idea what to predict as the next word! Any suggestions on a less time/space-complex solution? Thanks!
This is the problem of language modeling. For a baseline approach, the only thing you need is a hash table mapping fixed-length chains of words, say of length k, to the most probable following word. (*)
At training time, you break the input into (k+1)-grams using a sliding window. So if you encounter
The wrath sing, goddess, of Peleus' son, Achilles
you generate, for k=2,
START START the
START the wrath
the wrath sing
wrath sing goddess
sing goddess of
goddess of peleus
of peleus son
peleus son achilles
This can be done in linear time. For each 3-gram, tally (in a hash table) how often the third word follows the first two.
Finally, loop through the hash table and for each key (2-gram) keep only the most commonly occurring third word. Linear time.
At prediction time, look only at the k (2) last words and predict the next word. This takes only constant time since it's just a hash table lookup.
If you're wondering why you should keep only short subchains instead of full chains, then look into the theory of Markov windows. If your model were to remember all the chains of words that it has seen in its input, then it would badly overfit its training data and only reproduce its input at prediction time. How badly depends on the training set (more data is better), but for k>4 you'd really need smoothing in your model.
(*) Or to a probability distribution, but this is not needed for your simple example use case.
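A compact Python sketch of this baseline (names are mine; ties between equally frequent followers fall to whichever was seen first):

```python
from collections import defaultdict, Counter

def train(sentences, k=2):
    # Map each k-gram to a tally of the words that follow it.
    follow = defaultdict(Counter)
    for sent in sentences:
        words = ['START'] * k + sent.lower().split()
        for i in range(len(words) - k):
            follow[tuple(words[i:i + k])][words[i + k]] += 1
    # Keep only the most frequent follower per k-gram.
    return {gram: c.most_common(1)[0][0] for gram, c in follow.items()}

def predict(model, context, k=2):
    # Constant time: one hash lookup on the last k words.
    words = (['START'] * k + context.lower().split())[-k:]
    return model.get(tuple(words))

model = train(["Hello my name is Tom", "His name is jerry"])
print(predict(model, "her name is"))   # 'tom' ('tom' and 'jerry' tie; 'tom' was first)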
Yee Whye Teh also has some interesting recent work that addresses this problem. The "Sequence Memoizer" extends the traditional prediction-by-partial-matching scheme to take into account arbitrarily long histories.
Here is a link to the original paper: http://www.stats.ox.ac.uk/~teh/research/compling/WooGasArc2011a.pdf
It is also worth reading some of the background work, which can be found in the paper "A Bayesian Interpretation of Interpolated Kneser-Ney"
