Upper bound on 4 digit sequences in pi - algorithm

If this is not the right SE site for this question, please let me know.
A friend shared this interview question he received over the phone, which I have tried to solve myself. I will paraphrase:
The value of pi up to n digits as a string is given.
How can I find all duplicate 4 digit sequences in this string?
This part seems fairly straightforward. Add 4-character sequences to a hash table, advancing one character at a time. Before inserting each sequence, check whether it already exists in the hash table. If so, you have found a duplicate. Store this somewhere, and repeat the process. I was told this was more or less correct.
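A minimal sketch of that sliding-window idea, assuming the digits arrive as a plain string without the decimal point. The function name `duplicate_sequences` and the `width` parameter are made up for illustration, and a Python `set` stands in for the hash table:

```python
def duplicate_sequences(digits, width=4):
    """Return the set of width-digit substrings that occur more than once."""
    seen = set()        # every window encountered so far
    duplicates = set()  # windows encountered at least twice
    for i in range(len(digits) - width + 1):
        window = digits[i:i + width]
        if window in seen:
            duplicates.add(window)
        else:
            seen.add(window)
    return duplicates


if __name__ == "__main__":
    # Tiny hand-made input; "1234" is the only window that appears twice.
    print(duplicate_sequences("123412345"))
```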
The issue I have is on the second question:
What is the upper bound?
n = 10,000,000 was an example.
My algorithm background is admittedly very rusty. My first thought is that the upper bound must be related to n somehow, but I was told it is not.
How do I calculate this?
EDIT:
I would also be open to a solution that disregards the constraint that the upper bound is not related to n. Either is acceptable.

There are only 10,000 possible sequences of four digits (0000 to 9999), so at some point you will have found that every sequence has been duplicated, and there's no need to process further digits.
If you assume that the digits of pi behave like a perfectly uniform random number generator, then each new digit that's processed results in a new sequence, so you need at least about 20,000 digits before all 10,000 sequences can have been duplicated. Since the sequences will not arrive perfectly evenly, you may need significantly more digits before you duplicate all sequences, but 100,000 would be a reasonable guess at the upper bound.
Also, since there are only 10,000 possibilities, you don't really need a hash table. You can simply use an array of 10000 counters, (int count[10000]), and increment the count for each sequence you find.
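Here is a rough sketch of the counter-array idea with an early exit once every sequence has been seen twice. A Python list plays the role of `int count[10000]`; the function name `digits_until_all_duplicated` and the `width` parameter are assumptions for illustration (with `width` exposed mainly so the logic can be tried on tiny inputs):

```python
def digits_until_all_duplicated(digits, width=4):
    """Scan `digits`, counting each width-digit window in a flat array.

    Returns how many digits were consumed when every possible window had
    been seen at least twice, or None if the input ends before that.
    """
    total = 10 ** width
    counts = [0] * total   # counts[1415] holds the occurrences of "1415"
    duplicated = 0         # number of windows whose count has reached 2
    value = 0              # current window, kept as an integer
    for i, ch in enumerate(digits):
        # Slide the window: drop the oldest digit, append the new one.
        value = (value * 10 + int(ch)) % total
        if i < width - 1:
            continue       # window is not full yet
        counts[value] += 1
        if counts[value] == 2:
            duplicated += 1
            if duplicated == total:
                return i + 1   # nothing new can be learned past this point
    return None
```

The work done is bounded by wherever that early exit fires, which depends on the digits themselves rather than on n.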

The upper bound of your solution is the size of the hash table that you can fit into memory.
An alternate technique is to generate all the sequences and sort them. Then the duplicates will be adjacent and easy to detect. You can generally fit more into a linear data structure than you can a hash table, and if you still exhaust memory you can sort to/from disk.
Edit: unless "upper bound" means the O(n) of the algorithm, which should be easy to figure out.

Related

How to find the winner(s) of Brazil's Mega Sena lottery most efficiently?

Mega Sena is Brazil's most famous lottery. The number set ranges from 1 to 60 and a single bet can contain from 6 to 15 selected numbers (the more numbers selected, the more expensive the bet). Prizes are given, proportionally, to those who correctly guess 4, 5 or 6 numbers, the latter being the jackpot.
Sometimes the total number of bets reaches the hundreds of millions. Drawings are televised live so, as a player, you can know almost instantaneously whether you won. However, the lottery takes several hours to announce the number of winners and which cities they are from.
Given the rules above, how would you implement the computational logic that finds the winners? What data structures would you choose to store the data? Which sorting algorithms, if any, would you select? What's the Big O notation of the best solution you can think of?
I would allocate a 64-bit integer to represent which numbers have been chosen.
The first step would be to expand all the 7-15 number tickets into tickets having only 6 bits set, which could be done offline before the drawing. I don't know whether the 15-number ticket contains all the (15 choose 6) = 5005 6-element tickets, or whether they use another system. But since it's done offline, the complexity is delegated elsewhere.
There's even an algorithm (or bithack) called lexicographically next permutation, which is able to generate all those n choose k bit patterns efficiently, if it needs to be done in real time.
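As a sketch of that bithack, assuming a multi-number ticket really does stand for all of its 6-number subsets (which the answer itself is unsure about). The names `next_bit_permutation` and `expand_ticket` are invented here, and bit n of a mask stands for lottery number n:

```python
def next_bit_permutation(v):
    """Return the next larger integer with the same number of set bits as v."""
    lowest = v & -v        # isolate the lowest set bit
    ripple = v + lowest    # carry through the lowest run of ones
    return ripple | (((v ^ ripple) >> 2) // lowest)


def expand_ticket(numbers, k=6):
    """Yield every k-number sub-ticket of `numbers` as an integer bit mask."""
    m = len(numbers)
    pattern = (1 << k) - 1             # start with the lowest k positions chosen
    while pattern < (1 << m):
        mask = 0
        for pos in range(m):
            if (pattern >> pos) & 1:   # position `pos` is part of this subset
                mask |= 1 << numbers[pos]
        yield mask
        pattern = next_bit_permutation(pattern)


if __name__ == "__main__":
    # A 15-number ticket should expand to (15 choose 6) sub-tickets.
    print(len(list(expand_ticket(list(range(1, 16))))))
```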
Mask all those tickets with the bit pattern of the winning row, and count the bits that remain set. This should be extremely efficient, taking on the order of one second for a billion tickets on a modern computer that has the popcount instruction.
The other issue is the validation, integrity and confidentiality of the data, when associated with the ticket holders. I would guess that this is the real issue and is probably addressed by implementing the whole thing as a database query.
You can hold the selection of each person within a single 64-bit word with each bit representing a selected number. The entire dataset could fit in memory in one long integer array.
If the array was sorted in the same order as your database, e.g., by ticket ID, then you could retrieve an associated record simply from knowing that the position in the array would be the same as the rownum of your query.
If you performed a bitwise AND operation of each value with the 64-bit representation of the winning numbers, you could count the bits and store the offsets of any matching 4, 5, or 6 bits into their own respective lists.
The whole operation would be pretty obviously linear O(n).
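A small sketch of the mask-and-count pass, under the assumption that tickets are already stored as integers with bit n set for number n. `to_mask` and `find_winners` are hypothetical names, and `bin(...).count("1")` is used as a portable stand-in for a hardware popcount:

```python
def to_mask(numbers):
    """Pack a collection of lottery numbers (1..60) into one integer."""
    mask = 0
    for n in numbers:
        mask |= 1 << n
    return mask


def find_winners(ticket_masks, winning_mask):
    """Return {hits: [offsets]} for tickets matching 4, 5 or 6 numbers."""
    winners = {4: [], 5: [], 6: []}
    for offset, mask in enumerate(ticket_masks):
        hits = bin(mask & winning_mask).count("1")   # population count
        if hits in winners:
            winners[hits].append(offset)
    return winners


if __name__ == "__main__":
    drawn = to_mask([1, 2, 3, 4, 5, 6])
    tickets = [to_mask([1, 2, 3, 4, 5, 6]), to_mask([1, 2, 3, 4, 59, 60])]
    print(find_winners(tickets, drawn))
```

The offsets are positions in the array, so they can be mapped back to ticket records if the array follows the database order, as described above.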

Repeated DNA sequence

The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach using a rolling hash function, where for each sequence of length k, a hash is computed and stored in a map. To check if the current sequence is a repetition, we compute its hash and check if the hash already exists in the hash map. If yes, then we include this sequence in our result, otherwise we add it to the hash map.
Rolling hash here means, when moving on to the next sequence by sliding the window by one, we use the hash of previous sequence in a way that we remove the contribution of the first character of previous sequence and add the contribution of the newly added char i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't go about making a perfect hash function so that collisions are avoided. It would be a great help if somebody can explain how to make a perfect hash under any circumstance and most importantly in this case.
This is actually a research problem.
Let's come to terms with some facts
Input = N, Input length = |N|
You have to move a size k, here k=10, sliding window over the input. Therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality-sensitive deterministic hashing. The downside of deterministic hashing is that its benefit is greatly diminished: the more often you encounter similar strings, the harder they are to hash
The longer your input the less effective hashing will be
Given these facts "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
So what alternatives do you have?
Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes give false positives, but this can be mitigated by using several filters.
Cuckoo hashes. Similar to Bloom filters, but they use less memory and have locality-sensitive "hashing" and worst-case constant lookup time.
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has at least 2 children with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward but memory consumption is less.
My favorite: the FM-index. In my opinion the cleanest solution uses the Burrows-Wheeler Transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence into an integer by bit manipulation.
If your input k is relatively small, let's say around 10, then you can encode your DNA sequence in an int via bit manipulation. Since there are only 4 possibilities for each character in the sequence (A, C, G, T), you can simply make your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string with 10 characters as the hash key. Instead, you need only 20 bits of an integer to represent the same key string.
Then when you do your rolling hash, you left-shift the integer that stores your previous sequence by 2 bits, then use a bit operation like |= to set the last two bits to your new character. And remember to clear the 2 leftmost bits that you just shifted out of the 2k-bit window, meaning you are removing the oldest character from your sliding window.
By doing this, a string could be stored in an integer, and using that integer as hash key might be nicer and cheaper in terms of the complexity of the hash function computation. If your input length k is slightly longer than 16, you may be able to use a long value. Otherwise, you might be able to use a bitset or a bitarray. But to hash them becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
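To make the trick concrete, here is a sketch using the 2-bit mapping from above. The names `CODE` and `repeated_kmers` are made up for this example, and the input is assumed to contain only A, C, G and T:

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def repeated_kmers(dna, k):
    """Return the set of length-k substrings that occur more than once."""
    mask = (1 << (2 * k)) - 1   # keeps only the lowest 2*k bits
    seen = set()                # integer keys of windows seen so far
    repeated = set()
    key = 0
    for i, base in enumerate(dna):
        # Shift in the new letter and drop the one that left the window.
        key = ((key << 2) | CODE[base]) & mask
        if i < k - 1:
            continue            # window is not full yet
        if key in seen:
            repeated.add(dna[i - k + 1:i + 1])
        else:
            seen.add(key)
    return repeated


if __name__ == "__main__":
    print(repeated_kmers("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10))
```

Because the key is an exact encoding of the window rather than a lossy hash, two different windows can never share a key, so there are no collisions to worry about.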
You can build the suffix array and the LCP array. Iterate through the LCP array, and every time you see a value greater than or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater than or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both, the suffix array and the LCP, can be done in linear time. So overall the solution is linear with respect to the size of the input plus output.
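A sketch of the LCP scan follows. Note the loud assumption: the suffix array here is built by naively sorting suffixes, which is NOT linear time and is only meant to keep the example short; the LCP array uses Kasai's algorithm. All function names are invented for illustration:

```python
def suffix_array(s):
    """Naive construction by sorting suffixes (simple, but not linear time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] is the common prefix length of the
    suffixes at sa[r - 1] and sa[r]; lcp[0] is 0."""
    n = len(s)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_substrings(s, k):
    """Report each length-k substring that occurs at least twice, once."""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    result = []
    in_run = False   # True while consecutive LCP values stay >= k
    for r in range(1, len(s)):
        if lcp[r] >= k:
            if not in_run:
                result.append(s[sa[r]:sa[r] + k])
                in_run = True
        else:
            in_run = False
    return result
```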
What you could do is use the Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size 10^18 more or less. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all. As my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than that a collision will happen, since you can drive the collision probability as low as you like.
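One way this might look in practice is a polynomial rolling hash kept separately under each of the three moduli, with the tuple of residues used as the key. The base 5, the letter values 1 to 4, and the name `repeated_by_rolling_hash` are assumptions made for this sketch, not something the answer specifies:

```python
MODULI = (10**6 + 3, 10**6 + 33, 10**6 + 37)
BASE = 5                                    # assumed base, larger than any letter value
VALUE = {"A": 1, "C": 2, "G": 3, "T": 4}    # assumed letter values


def repeated_by_rolling_hash(dna, k):
    """Return length-k substrings whose hash tuple was seen before."""
    # Weight of the oldest character in the window, per modulus.
    top = [pow(BASE, k - 1, m) for m in MODULI]
    h = [0] * len(MODULI)
    seen, repeated = set(), set()
    for i, ch in enumerate(dna):
        if i >= k:
            # Remove the contribution of the character leaving the window.
            out = VALUE[dna[i - k]]
            h = [(x - out * t) % m for x, t, m in zip(h, top, MODULI)]
        # Shift the window and add the new character.
        h = [(x * BASE + VALUE[ch]) % m for x, m in zip(h, MODULI)]
        if i >= k - 1:
            key = tuple(h)
            if key in seen:
                repeated.add(dna[i - k + 1:i + 1])
            else:
                seen.add(key)
    return repeated
```

As long as BASE^k stays below the product of the moduli, the tuple identifies the window exactly; for larger k a collision becomes possible but very unlikely, and a careful implementation could still compare the actual substrings before reporting.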

constructing binary sequences with unique n-bit

A question that was asked during a job interview (which I pretty much failed) and
sadly, something I still cannot figure out.
Let's assume that you're given some positive integer, n.
Assume that you construct a sequence consisting of only 1 and 0, and
you want to construct a sequence of length 2^n + n-1 such that
every sequence of length n consisting of adjacent numbers is unique.
for instance
00110 (00, 01, 11, 10) for n=2
How would one construct such a sequence?
I think one should start with 0000..0 (n zeroes) and
do something about it.
If there is a constructive way of doing it, maybe
I could extend that method to constructing
a sequence consisting of only 0, 1, ..., k-1, and having
length k^n + n-1 such that
every sequence of length n consisting of adjacent numbers is unique
(or maybe not..)
(sorry, my sequence for n=3 is wrong, so I deleted it.
also, I've never heard of the de Bruijn sequence. I know it now!
thanks for all the answers and comments).
This strikes me as a very ambitious interview question; if you don't know the answer, you're unlikely to get it in a few minutes.
As mentioned in comments, this is really just the derivation of a de Bruijn sequence, only unwrapped. You can read the Wikipedia article linked above for more information, but the algorithms it proposes, while efficient, are not exactly easy to derive. There is a much simpler (but rather more storage-intensive) algorithm which I think is folkloric; at least, I don't know of a name attached to it. It's at least simple to describe:
Start with n 0s
As long as possible:
    If you can add a 1 without repeating a previously-seen n-sequence, do so.
    If not, but you can add a 0 without repeating a previously-seen n-sequence, do so.
    Otherwise, done.
This requires you to either search the entire string on each iteration, requiring exponential time, or maintain a boolean array of all seen sequences (coded as binary numbers, presumably), requiring exponential space. The "concatenate all Lyndon words in lexicographical order" solution is much more efficient, but leaves open the question of generating all Lyndon words in lexicographical order.
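The greedy procedure can be sketched as follows, using the boolean-array variant with each n-bit window coded as a binary number. The function name `unique_window_sequence` is made up for this example:

```python
def unique_window_sequence(n):
    """Build a 0/1 string in which every n-bit window is distinct,
    by always trying to append a 1 before trying a 0."""
    bits = [0] * n                 # start with n zeros
    mask = (1 << n) - 1
    seen = [False] * (1 << n)      # seen[w] is True once window w was used
    window = 0                     # the current last n bits, as an integer
    seen[window] = True
    while True:
        for bit in (1, 0):         # prefer 1, fall back to 0
            candidate = ((window << 1) | bit) & mask
            if not seen[candidate]:
                seen[candidate] = True
                window = candidate
                bits.append(bit)
                break
        else:
            # Neither bit gives an unseen window, so we are done.
            return "".join(map(str, bits))


if __name__ == "__main__":
    print(unique_window_sequence(2))   # the n=2 example from the question: 00110
```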

First pair of numbers adding to a specific value in a stream

There is a stream of integers coming through. The problem is to find the first pair of numbers from the stream that adds to a specific value (say, k).
With static arrays, one can use either of the below approaches:
Approach (1): Sort the array, use two pointers to beginning and end of array and compare.
Approach (2): Use hashing, i.e. if A[i]+A[j]=k, then A[j]=k-A[i]. Search for A[j] in the hash table.
But neither of these approaches scales well for streams. Any thoughts on efficiently solving this?
I believe that there is no way to do this that doesn't use at least O(n) memory, where n is the number of elements that appear before the first pair that sums to k. I'm assuming that we are using a RAM machine, but not a machine that permits awful bitwise hackery (in other words, we can't do anything fancy with bit packing.)
The proof sketch is as follows. Suppose that we don't store all of the n elements that appear before the first pair that sums to k. Then when we see the nth element, which sums with some previous value to get k, there is a chance that we will have discarded the previous element that it pairs with and thus won't know that the sum of k has been reached. More formally, suppose that an adversary could watch what values we were storing in memory as we looked at the first n - 1 elements and noted that we didn't store some element x. Then the adversary could set the next element of the stream to be k - x and we would incorrectly report that the sum had not yet been reached, since we wouldn't remember seeing x.
Given that we need to store all the elements we've seen, without knowing more about the numbers in the stream, a very good approach would be to use a hash table that contains all of the elements we've seen so far. Given a good hash table, this would take expected O(n) memory and O(n) time to complete.
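A minimal sketch of that hash-table approach, where a Python `set` stands in for the hash table and "first pair" is taken to mean the pair that is completed earliest in the stream. The name `first_pair_with_sum` is hypothetical:

```python
def first_pair_with_sum(stream, k):
    """Consume `stream` until some element pairs with an earlier one to sum to k.

    Returns (earlier, current), or None if the stream ends first.
    """
    seen = set()               # every element seen so far
    for x in stream:
        if k - x in seen:      # the partner showed up earlier in the stream
            return (k - x, x)
        seen.add(x)
    return None


if __name__ == "__main__":
    print(first_pair_with_sum(iter([1, 5, 3, 7, 4]), 8))   # (5, 3)
```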
I am not sure whether there is a more clever strategy for solving this problem if you make stronger assumptions about the sorts of numbers in the stream, but I am fairly confident that this is asymptotically ideal in terms of time and space.
Hope this helps!

Finding the repeated element

In an array with integers between 1 and 1,000,000 (or some much larger value), a single value occurs twice. How do you determine which one?
I think we can use a bitmap to mark the elements, and then traverse the array all over again to find the repeated element. But I think that is a process with high complexity. Is there any better way?
This sounds like homework or an interview question ... so rather than giving away the answer, here's a hint.
What calculations can you do on a range of integers whose answer you can determine ahead of time?
Once you realize the answer to this, you should be able to figure it out .... if you still can't figure it out ... (and it's not homework) I'll post the solution :)
EDIT: Ok. So here's the elegant solution ... if the list contains ALL of the integers within the range.
We know that all of the values between 1 and N must exist in the list. Using Gauss's formula we can quickly compute the expected sum of a range of integers:
Sum(1..N) = 1/2 * (1 + N) * Count(1..N).
Since we know the expected sum, all we have to do is loop through all the values and sum them. The difference between this sum and the expected sum is the duplicate value.
EDIT: As others have commented, the question doesn't state that the range contains all of the integers ... in this case, you have to decide whether you want to optimize for memory or time.
If you want to perform the operation using O(1) storage, you can perform an in-place sort of the list. As you're sorting you have to check adjacent elements. Once you see a duplicate, you know you can stop. Optimal sorting is an O(n log n) operation on average, which establishes an upper bound for finding the duplicate in this manner.
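As a quick sketch of the summation trick, assuming the list holds every integer from 1 to N exactly once plus one extra copy of a single value (so it has N + 1 entries). The function name is made up:

```python
def find_duplicate_by_sum(values):
    """Return the duplicated value in a list containing 1..N plus one repeat."""
    n = len(values) - 1              # N, since exactly one value is repeated
    expected = n * (n + 1) // 2      # Gauss's formula for 1 + 2 + ... + N
    return sum(values) - expected


if __name__ == "__main__":
    print(find_duplicate_by_sum([1, 2, 3, 4, 3]))   # 3
```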
If you want to optimize for speed, you can use an additional O(n) storage. Using a HashSet (or similar structure), insert values from your list until you determine you are inserting a duplicate into the HashSet. Inserting n items into a HashSet is an O(n) operation on average, which establishes that as an upper bound for this method.
you may try to use bits as a hash map:
1 at position k means that number k occurred before
0 at position k means that number k did not occur before
pseudocode:
0. assume that your array is A
1. initialize a bitarray (there is a nice class in C# for this) of length 1000001 (so that index 1000000 is valid), filled with zeros
2. for each num in A:
       if bitarray[num]
           return num
       else
           bitarray[num] = 1
       end
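For a runnable take on the pseudocode above, here is a sketch that packs the flags into a `bytearray`, one bit per possible value. The name `first_repeated` and the `max_value` parameter are assumptions for illustration:

```python
def first_repeated(values, max_value=1_000_000):
    """Return the first value seen twice, or None if nothing repeats.

    Assumes every value lies in the range 0..max_value.
    """
    bits = bytearray(max_value // 8 + 1)    # one bit per possible value
    for num in values:
        byte, bit = divmod(num, 8)
        if bits[byte] & (1 << bit):         # bit already set: seen before
            return num
        bits[byte] |= 1 << bit              # mark this value as seen
    return None


if __name__ == "__main__":
    print(first_repeated([3, 1, 4, 1, 5]))   # 1
```

With the default range this uses about 125 KB regardless of how long the input is.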
The time complexity of the bitmap solution is O(n) and it doesn't seem like you could do better than that. However it will take up a lot of memory for a generic list of numbers. Sorting the numbers is an obvious way to detect duplicates and doesn't require extra space if you don't mind the current order changing.
Assuming the array is of length n < N (i.e. not ALL integers are present -- in this case LBushkin's trick is the answer to this homework problem), there is no way to solve this problem using less than O(n) memory using an algorithm that just takes a single pass through the array. This is by reduction to the set disjointness problem.
Suppose I made the problem easier, and I promised you that the duplicate elements were in the array such that the first one was in the first n/2 elements, and the second one was in the last n/2 elements. Now we can think of playing a game in which two people each hold a string of n/2 elements, and want to know how many messages they have to send to be sure that none of their elements are the same. Since the first player could simulate the run of any algorithm that takes a pass through the array, and send the contents of its memory to the second player, a lower bound on the number of messages they need to send implies a lower bound on the memory requirements of any algorithm.
But it's easy to see in this simple game that they need to send n/2 messages to be sure that they don't hold any of the same elements, which yields the lower bound.
Edit: This generalizes to show that for algorithms that make k passes through the array and use memory m, that m*k = Omega(n). And it is easy to see that you can in fact trade off memory for time in this way.
Of course, if you are willing to use algorithms that don't simply take passes through the array, you can do better as suggested already: sort the array, then take 1 pass through. This takes time O(n log n) and space O(1). But note, curiously, that this proves that any sorting algorithm that just makes passes through the array must take time Omega(n^2)! Sorting algorithms that break the n^2 bound must make random accesses.
