Getting i'th prefix without computing others - algorithm

I read this post which is quite close to the problem I'm having, but couldn't generalize it.
I'm trying to solve the Traveling Salesman Problem by searching all paths using multiple CPUs.
What I need is a way to encode a path prefix as an integer and distribute it to each CPU, so each CPU knows which paths it's supposed to scan.
For example, if the number of cities is 10, one possible 3-prefix (suppose the prefix length is fixed and known) is 4-10-3 (there are 10*9*8 such prefixes), so the CPU that receives it would search all paths that begin with 4-10-3.
Since the number of cities is quite large, I can't compute n!, so I can't use the approach from the post above.

The standard representation of a permutation as a number uses Lehmer codes represented in the factorial number system. The idea is that every permutation of n elements can be mapped to a sequence of n numbers, the first of which is in the range 0 to (n - 1), the second of which is in the range 0 to (n - 2), etc. This sequence of numbers can then be represented as a single integer in the factorial number system.
I believe that it should be possible to adapt this trick to work with prefixes of permutations rather than entire permutations. Suppose that you have n elements and want to choose a permutation of k of them. To do this, start off by computing the Lehmer code for the partial permutation. Instead of getting a sequence of n numbers, you'll get back a sequence of k numbers. For example, given the partial permutation c a d drawn from a b c d e f g, your Lehmer code would be found as follows:
c is the second (zero-indexed) element of a b c d e f g
a is the zeroth (zero-indexed) element of a b d e f g
d is the first (zero-indexed) element of b d e f g
So the Lehmer code would be (2, 0, 1).
Once you have this Lehmer code, you can try to encode it as a single integer. To do this, you can use a modified factorial number system encoding. Specifically, you can try doing the following. If you have n elements and want a permutation of k of them, then there will be a total of (n - k + 1) possible choices for the very last element. There are a total of (n - k + 2) possible choices for the second-to-last element, (n - k + 3) possible choices for the third-to-last element, etc. Consequently, you could take your Lehmer code and do the following:
Keep the final digit unchanged.
Multiply the second-to-last element by (n - k + 1).
Multiply the third-to-last element by (n - k + 1)(n - k + 2)
...
Multiply the first element by (n - k + 1)(n - k + 2)...(n - 1)
Sum up these values.
This produces a unique integer code for the permutation.
For example, our Lehmer code was (2, 0, 1), n = 7, and k = 3. Therefore, we'd compute
1 + 0 × (7 - 3 + 1) + 2 × (7 - 3 + 1)(7 - 3 + 2)
= 1 + 0 × 5 + 2 × (5 × 6)
= 1 + 2 × 30
= 61
To invert this process, you can take the integer and run it backwards through this procedure to recover the partial Lehmer code. To do this, start off by taking the number and dividing by (n - k + 1)(n - k + 2)...(n - 1) to get back the very first digit of the Lehmer code. Then, mod the number by (n - k + 1)(n - k + 2)...(n - 1) to drop off the first digit. Then, divide the number by (n - k + 1)(n - k + 2)...(n - 2) to get back the second digit of the Lehmer code, then mod by (n - k + 1)(n - k + 2)...(n - 2) to drop off the second digit. Repeat this until all the digits of the Lehmer code have been reconstructed.
For example, given the code 61, n = 7, and k = 3, we would start off by dividing 61 by 5 × 6 = 30. This gives 2, remainder 1. Thus the first digit of the Lehmer code is 2. Modding by 30, we get back the number 1. Next, we divide by 5. This gives 0, remainder 1. Thus the second digit is 0. Finally, we read off the remaining number, which gives the last digit of the Lehmer code, 1. We have recovered our Lehmer code (2, 0, 1), from which we can easily reconstruct the permutation.
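For concreteness, here's a minimal sketch of both directions in Python, with the elements represented as 0-based indices (encode_prefix and decode_prefix are my own names, not from the original post):

def encode_prefix(prefix, n):
    """Map a k-prefix of a permutation of range(n) to a unique integer."""
    k = len(prefix)
    remaining = list(range(n))
    code = 0
    for i, x in enumerate(prefix):
        digit = remaining.index(x)        # the i-th Lehmer digit
        remaining.pop(digit)
        mult = 1                          # multiplier (n - k + 1) * ... * (n - 1 - i)
        for m in range(n - k + 1, n - i):
            mult *= m
        code += digit * mult
    return code

def decode_prefix(code, n, k):
    """Inverse of encode_prefix: recover the k-prefix from its code."""
    remaining = list(range(n))
    prefix = []
    for i in range(k):
        mult = 1
        for m in range(n - k + 1, n - i):
            mult *= m
        digit, code = divmod(code, mult)
        prefix.append(remaining.pop(digit))
    return prefix

# "c a d" drawn from "a b c d e f g" is the prefix (2, 0, 3) with n = 7:
print(encode_prefix([2, 0, 3], 7))   # 61
print(decode_prefix(61, 7, 3))       # [2, 0, 3]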
Hope this helps!

The easiest way here is to map the prefix without treating it as part of a permutation. Don't map the prefix to [0, 10*9*8 - 1], but rather to [0, 10*10*10 - 1], so the prefix 0,4,5 is mapped to the number 45, and the prefix 4,1,9 is mapped to the number 419 (assuming there are 10 cities overall, of course).
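In other words, read the prefix as a k-digit number in base n. A tiny sketch of this (my own illustration); the range is larger than strictly necessary, since it also covers prefixes with repeated cities, but encoding and decoding become trivial:

def encode_simple(prefix, n):
    code = 0
    for city in prefix:
        code = code * n + city    # append one base-n digit
    return code

def decode_simple(code, n, k):
    prefix = []
    for _ in range(k):
        code, digit = divmod(code, n)
        prefix.append(digit)
    return prefix[::-1]

print(encode_simple([4, 1, 9], 10))   # 419
print(decode_simple(419, 10, 3))      # [4, 1, 9]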

Related

How to get the intuition behind the solution?

I was solving the below problem from USACO training. I found a really fast solution that I am unable to fully absorb.
Problem: Consider an ordered set S of strings of N (1 <= N <= 31) bits. Bits, of course, are either 0 or 1.
This set of strings is interesting because it is ordered and contains all possible strings of length N that have L (1 <= L <= N) or fewer bits that are '1'.
Your task is to read a number I (1 <= I <= sizeof(S)) from the input and print the Ith element of the ordered set for N bits with no more than L bits that are '1'.
sample input: 5 3 19
output: 10011
The two solutions I could think of:
First, the brute-force solution, which goes through all possible combinations of bits, selects and stores the strings whose count of '1's is less than or equal to L, and returns the Ith string.
Second, we can generate all placements of '1's over the N positions with counts ranging from 0 to L, sort the strings in increasing order, and return the Ith string.
The best solution:
The OP who posted the solution used combinations instead of permutations. According to him, the total number of possible strings is 5C0 + 5C1 + 5C2 + 5C3.
So at every position i of the string, we decide whether to set the ith bit of our output, based on the total number of ways we have to build the rest of the string. Below is a dry run of the entire approach for the above input.
N = 5, L = 3, I = 19
00000
at i = 0, for the remaining string, we have 4C0 + 4C1 + 4C2 + 4C3 = 15
It says that there are 15 numbers possible whose first bit is 0. As 15 is less than 19, our first bit has to be set, and I becomes 19 - 15 = 4.
N = 5, L = 2, I = 4
10000
at i = 1, we have 3C0 + 3C1 + 3C2 (as we have used 1 from L) = 7
as 7 is at least 4, we cannot set this bit.
N = 5, L = 2, I = 4
10000
at i = 2, we have 2C0 + 2C1 + 2C2 = 4
as 4 is at least 4, we cannot set this bit either.
N = 5, L = 2, I = 4
10000
at i = 3, we have 1C0 + 1C1 = 2
as 2 is less than 4, we take this bit in our output, and I becomes 4 - 2 = 2.
N = 5, L = 1, I = 2
10010
at i = 4, we have 0C0 = 1
as 1 is less than 2, we take this bit as well, leaving I = 1 with the string complete, and 10011 is our answer. I was amazed to find this solution. However, I am finding it difficult to get the intuition behind this solution.
How does this solution sort-of zero in directly to the Ith number in the set?
Why does the order of the bits not matter in the combinations of set bits?
Suppose we have precomputed the number of strings of length n with k or fewer bits set. Call that S(n, k).
Now suppose we want the i'th string (in lexicographic order) of length N with L or fewer bits set.
All the strings with the most significant bit zero come before those with the most significant bit 1. There's S(N-1, L) strings with the most significant bit zero, and S(N-1, L-1) strings with the most significant bit 1. So if we want the i'th string, if i<=S(N-1, L), then it must have the top bit zero and the remainder must be the i'th string of length N-1 with at most L bits set, and otherwise it must have the top bit one, and the remainder must be the (i-S(N-1, L))'th string of length N-1 with at most L-1 bits set.
All that remains to code is to precompute S(n, k), and to handle the base cases.
You can figure out a combinatorial solution to S(n, k) as your friend did, but it's more practical to use a recurrence relation: S(n, k) = S(n-1, k) + S(n-1, k-1), and S(0, k) = S(n, 0) = 1.
Here's code that does all that and, as an example, prints out all 8-bit numbers with 3 or fewer bits set, in lexicographic order. If i is out of range, it raises an IndexError exception, although your question assumes i is always in range, so perhaps that's not necessary.
S = [[1] * 32 for _ in range(32)]
for n in range(1, 32):
    for k in range(1, 32):
        S[n][k] = S[n-1][k] + S[n-1][k-1]

def ith_string(n, k, i):
    if n == 0:
        if i != 1:
            raise IndexError
        return ''
    elif i <= S[n-1][k]:
        return "0" + ith_string(n-1, k, i)
    elif k == 0:
        raise IndexError
    else:
        return "1" + ith_string(n-1, k-1, i - S[n-1][k])

print([ith_string(8, 3, i) for i in range(1, 94)])

Finding the number of combinations of three numbers in a sequence that fulfill a specific requirement

The question: given a number D and a sequence of N numbers, find the number of combinations of three numbers whose largest pairwise difference does not exceed D. For example:
D = 3, N = 4
Sequence of numbers: 1 2 3 4
Possible combinations: 1 2 3 (3-1 = 2 <= D), 1 2 4 (4 - 1 = 3 <= D), 1 3 4, 2 3 4.
Output: 4
What I've done: link
My concept: iterate through the sequence of numbers; for each compared number, find the smallest later number whose difference with it exceeds D. Then count the combinations between those two positions with the currently compared number fixed (which means a combination of the numbers between the two, taken 2 at a time). If even the biggest number in the sequence minus the currently compared number does not exceed D, use a combination of all the remaining elements taken 3 at a time.
N can be as big as 10^5 with the smallest being 1 and D can be as big as 10^9 with the smallest being 1 too.
Problem with my algorithm: overflow occurs when I compute the combination spanning the 1st element and the 10^5th element. How can I fix this? Is there a way to calculate such a large number of combinations without actually computing the factorials?
EDIT:
Overflow occurs in the worst case: the currently compared number is still at index 0 while every other number, after subtracting the compared number, is still smaller than D. For example, the value at index 0 is 1, the value at index 10^5 is 10^5 + 1, and D is 10^9. My algorithm will then attempt to calculate the factorial of 10^5, which overflows. The factorial is used to calculate the combination of 10^5 taken 3.
When you search for items within value range D in the sorted list and get an index difference M, you should calculate C(M, 3).
But for such a combination number you don't need huge factorials:
C(M,3) = M! / (6 * (M-3)!) = M * (M-1) * (M-2) / 6
To diminish intermediate results even more:
A = (M - 1) * (M - 2) / 2
A = (A * M) / 3
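As a sketch in Python (choose3 is a made-up name), both divisions below are exact, because one of two consecutive integers is even and the product of three consecutive integers is divisible by 6:

def choose3(m):
    # C(m, 3) without computing any factorial
    if m < 3:
        return 0
    a = (m - 1) * (m - 2) // 2    # exact: one of two consecutive numbers is even
    return a * m // 3             # exact: m * (m - 1) * (m - 2) is divisible by 6

print(choose3(10 ** 5))   # 166661666700000, with no huge intermediate values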
You didn't add the C++ tag to your question, so let me write the answer in Python 3 (it should be easy to translate it to C++):
N = int(input("N = "))
D = int(input("D = "))
# the sequence is assumed to be sorted in non-decreasing order
v = [int(input("v[{}] = ".format(i))) for i in range(0, N)]
count = 0
i, j = 0, 1
while j + 1 < N:
    j += 1
    # drag i up until the window [i, j] fits within difference D
    while v[j] - v[i] > D:
        i += 1
    d = j - i
    if d >= 2:
        count += (d - 1) * d // 2  # // is the integer division
print(count)
The idea is to move up the upper index j of the triples while dragging the lower index i at the greatest distance j - i = d where v[j] - v[i] <= D. For each such pair, there are 1 + 2 + ... + (d - 1) possible triples keeping j fixed, i.e., (d - 1) * d / 2.

Number of different binary sequences of length n generated using exactly k flip operations

Consider a binary sequence b of length N. Initially, all the bits are set to 0. We define a flip operation with 2 arguments, flip(L,R), such that:
All bits with indices between L and R are "flipped", meaning a bit with value 1 becomes a bit with value 0 and vice-versa. More exactly, for all i in range [L,R]: b[i] = !b[i].
Nothing happens to bits outside the specified range.
You are asked to determine the number of possible different sequences that can be obtained using exactly K flip operations modulo an arbitrary given number, let's call it MOD.
More specifically, each test contains on the first line a number T, the number of queries to be given. Then there are T queries, each one being of the form N, K, MOD with the meaning from above.
1 ≤ N, K ≤ 300 000
T ≤ 250
2 ≤ MOD ≤ 1 000 000 007
Sum of all N-s in a test is ≤ 600 000
time limit: 2 seconds
memory limit: 65536 kbytes
Example :
Input :
1
2 1 1000
Output :
3
Explanation :
There is a single query. The initial sequence is 00. We can do the following operations :
flip(1,1) ⇒ 10
flip(2,2) ⇒ 01
flip(1,2) ⇒ 11
So there are 3 possible sequences that can be generated using exactly 1 flip.
Some quick observations that I've made, although I'm not sure they are totally correct :
If K is big enough, that is, if we have a big enough number of flips at our disposal, we should be able to obtain 2^n sequences.
If K=1, then the result we're looking for is N(N+1)/2. It's also C(n,1)+C(n,2), where C is the binomial coefficient.
Currently trying a brute force approach to see if I can spot a rule of some kind. I think this is a sum of some binomial coefficients, but I'm not sure.
I've also come across a somewhat simpler variant of this problem, where the flip operation only flips a single specified bit. In that case, the result is
C(n,k) + C(n,k-2) + C(n,k-4) + ... + C(n, 1 or 0). Of course, there's the special case where k > n, but it's not a huge difference. Anyway, it's pretty easy to understand why that happens. I guess it's worth noting.
Here are a few ideas:
We may assume that no flip operation occurs twice (otherwise, we can assume that it did not happen). It does affect the number of operations, but I'll talk about it later.
We may assume that no two segments intersect. Indeed, if L1 < L2 < R1 < R2, we can just do the (L1, L2 - 1) and (R1 + 1, R2) flips instead. The case when one segment is inside the other is handled similarly.
We may also assume that no two segments touch each other. Otherwise, we can glue them together and reduce the number of operations.
These observations give the following formula for the number of different sequences one can obtain by flipping exactly k segments without "redundant" flips: C(n + 1, 2 * k) (we choose 2 * k ends of segments. They are always different. The left end is exclusive).
If we were allowed to perform no more than K flips, the answer would be the sum over k = 0...K of C(n + 1, 2 * k).
Intuitively, it seems that it's possible to transform a sequence of no more than K flips into a sequence of exactly K flips (for instance, we can flip the same segment two more times and add 2 operations; we can also split a segment of more than two elements into two segments and add one operation).
Running a brute-force search (I know that it's not a real proof, but it looks correct combined with the observations mentioned above) suggests that the answer is this sum minus 1 if n or k is equal to 1, and exactly the sum otherwise.
That is, the result is C(n + 1, 0) + C(n + 1, 2) + ... + C(n + 1, 2 * K) - d, where d = 1 if n = 1 or k = 1 and 0 otherwise.
Here is code I used to look for patterns running a brute force search and to verify that the formula is correct for small n and k:
reachable = set()
was = set()

def other(c):
    """
    Returns '1' if c == '0' and '0' otherwise.
    """
    return '0' if c == '1' else '1'

def flipped(s, l, r):
    """
    Flips the [l, r] segment of the string s and returns the result.
    """
    res = s[:l]
    for i in range(l, r + 1):
        res += other(s[i])
    res += s[r + 1:]
    return res

def go(xs, k):
    """
    Exhaustive search. was is used to speed up the search to avoid checking the
    same string with the same number of remaining operations twice.
    """
    p = (xs, k)
    if p in was:
        return
    was.add(p)
    if k == 0:
        reachable.add(xs)
        return
    for l in range(len(xs)):
        for r in range(l, len(xs)):
            go(flipped(xs, l, r), k - 1)

def calc_naive(n, k):
    """
    Counts the number of reachable sequences by running an exhaustive search.
    """
    xs = '0' * n
    global reachable
    global was
    was = set()
    reachable = set()
    go(xs, k)
    return len(reachable)

def fact(n):
    return 1 if n == 0 else n * fact(n - 1)

def cnk(n, k):
    if k > n:
        return 0
    return fact(n) // fact(k) // fact(n - k)

def solve(n, k):
    """
    Uses the formula shown above to compute the answer.
    """
    res = 0
    for i in range(k + 1):
        res += cnk(n + 1, 2 * i)
    if k == 1 or n == 1:
        res -= 1
    return res

if __name__ == '__main__':
    # Checks that the formula gives the right answer for small values of n and k
    for n in range(1, 11):
        for k in range(1, 11):
            assert calc_naive(n, k) == solve(n, k)
This solution is much better than the exhaustive search. For instance, it can run in O(N * K) time per test case if we compute the coefficients using Pascal's triangle. Unfortunately, it is not fast enough. I know how to solve it more efficiently when MOD is prime (using Lucas' theorem), but I do not have a solution for the general case.
Multiplicative modular inverses can't solve this problem immediately, as k! or (n - k)! may not have an inverse modulo MOD.
Note: I assumed that C(n, m) is defined for all non-negative n and m and is equal to 0 if n < m.
I think I know how to solve it for an arbitrary MOD now.
Let's factorize MOD into prime factors p1^a1 * p2^a2 * ... * pn^an. Now we can solve this problem for each prime factor independently and combine the results using the Chinese remainder theorem.
Let's fix a prime p. Let's assume that p^a|MOD (that is, we need to get the result modulo p^a). We can precompute all p-free parts of the factorial and the maximum power of p that divides the factorial for all 0 <= n <= N in linear time using something like this:
powers = [0] * (N + 1)
p_free = list(range(N + 1))
p_free[0] = 1
cur_p = p
while cur_p <= N:              # iterate over the powers of p that are <= N
    i = cur_p
    while i <= N:
        powers[i] += 1
        p_free[i] //= p
        i += cur_p
    cur_p *= p
Now the p-free part of the factorial is the product of p_free[i] for all i <= n and the power of p that divides n! is the prefix sum of the powers.
Now we can divide two factorials: the p-free part is coprime with p^a so it always has an inverse. The powers of p are just subtracted.
We're almost there. One more observation: we can precompute the inverses of p-free parts in linear time. Let's compute the inverse for the p-free part of N! using Euclid's algorithm. Now we can iterate over all i from N to 0. The inverse of the p-free part of i! is the inverse for i + 1 times p_free[i] (it's easy to prove it if we rewrite the inverse of the p-free part as a product using the fact that elements coprime with p^a form an abelian group under multiplication).
This algorithm runs in O(N * number_of_prime_factors + the time to solve the system using the Chinese remainder theorem + sqrt(MOD)) time per test case. Now it looks good enough.
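Putting the single-prime-power pieces together, a sketch might look like this (my own illustration of the idea above; it rebuilds the tables on every call, and uses Python's pow(x, -1, m), available since 3.8, in place of an explicit extended Euclid):

def comb_mod_prime_power(n, k, p, a):
    """C(n, k) modulo p**a via p-free factorial parts."""
    if k < 0 or k > n:
        return 0
    mod = p ** a
    e = [0] * (n + 1)   # e[i] = exponent of p in i!
    F = [1] * (n + 1)   # F[i] = p-free part of i! modulo p**a
    for i in range(1, n + 1):
        x, cnt = i, 0
        while x % p == 0:
            x //= p
            cnt += 1
        e[i] = e[i - 1] + cnt
        F[i] = F[i - 1] * x % mod
    exp = e[n] - e[k] - e[n - k]          # exponent of p in C(n, k)
    if exp >= a:
        return 0
    # the p-free parts are coprime with p**a, so their inverses exist
    m = F[n] * pow(F[k], -1, mod) % mod * pow(F[n - k], -1, mod) % mod
    return pow(p, exp, mod) * m % mod

print(comb_mod_prime_power(10, 5, 2, 3))   # C(10, 5) = 252, and 252 % 8 == 4

The per-prime results are then combined with the Chinese remainder theorem, as described above.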
You're on a good path with binomial coefficients already. There are several factors to consider:
Think of your number as a binary string of length n. Now we can create another array counting the number of times each bit will be flipped:
[0, 1, 0, 0, 1] number
[a, b, c, d, e] number of flips.
But even numbers of flips of a bit all lead to the same result, and so do all odd numbers of flips. So basically the relevant part of the distribution can be represented mod 2.
Logical next question: how many different combinations of even and odd values are available? We'll take care of the ordering later on; for now, just assume the flipping array is ordered descending for simplicity. We start off with k as the only flipping number in the array. Now we want to add a flip. Since the whole flipping array is used mod 2, we need to remove two from the value of k and insert them into the array separately. E.g.:
[5, 0, 0, 0]   mod 2   [1, 0, 0, 0]
[3, 1, 1, 0]           [1, 1, 1, 0]
[4, 1, 0, 0]           [0, 1, 0, 0]
As the last example shows (remember we're operating modulo 2 in the final result), moving a single 1 doesn't change the number of flips in the final outcome. Thus we always have to flip an even number of bits in the flipping array. If k is even, so will the number of flipped bits be, and vice versa, no matter what the value of n is.
So now the question is of course: how many different ways of filling the array are available? For simplicity we'll work mod 2 right away.
Obviously we start with 1 flipped bit if k is odd, otherwise with 0, and we always add 2 flipped bits at a time. We can continue with this until we either have flipped all n bits (or at least as many as we can flip)
v = (k % 2 == n % 2) ? n : n - 1
or we can't spread k further over the array.
v = k
Putting this together:
noOfAvailableFlips:
    if k < n:
        return k
    else:
        return (k % 2 == n % 2) ? n : n - 1
So far so good; there are always v / 2 flipping arrays (mod 2) that differ in the number of flipped bits. Now we come to the next part, permuting these arrays. This is just a simple permutation function (permutation with repetition, to be precise):
flipArrayNo(flippedbits):
    return factorial(n) / (factorial(flippedbits) * factorial(n - flippedbits))
Putting it all together:
solutionsByFlipping(n, k):
    res = 0
    for i in [k % 2, noOfAvailableFlips(), step=2]:
        res += flipArrayNo(i)
    return res
This also shows that for sufficiently large numbers we can't obtain 2^n sequences, for the simple reason that we cannot arrange operations as we please. The number of flips that actually affect the outcome will always be either even or odd, depending on k. There's no way around this. The best result one can get is 2^(n-1) sequences.
For completeness, here's a dynamic program. It can deal easily with an arbitrary modulus since it is based on sums, but unfortunately I haven't found a way to speed it up beyond O(n * k).
Let a[n][k] be the number of binary strings of length n with k non-adjacent blocks of contiguous 1s that end in 1. Let b[n][k] be the number of binary strings of length n with k non-adjacent blocks of contiguous 1s that end in 0.
Then:
# we can append 1 to any arrangement of k non-adjacent blocks of contiguous 1's
# that ends in 1, or to any arrangement of (k-1) non-adjacent blocks of contiguous
# 1's that ends in 0:
a[n][k] = a[n - 1][k] + b[n - 1][k - 1]
# we can append 0 to any arrangement of k non-adjacent blocks of contiguous 1's
# that ends in either 0 or 1:
b[n][k] = b[n - 1][k] + a[n - 1][k]
# complete answer would be sum (a[n][i] + b[n][i]) for i = 0 to k
I wonder if the following observations might be useful: (1) a[n][k] and b[n][k] are zero when n < 2*k - 1, and (2) on the flip side, for values of k greater than ⌊(n + 1) / 2⌋ the overall answer seems to be identical.
Python code (full matrices are defined for simplicity, but I think only one row of each would actually be needed, space-wise, for a bottom-up method):
a = [[0] * 11 for i in range(0, 11)]
b = [([1] + [0] * 10) for i in range(0, 11)]

def f(n, k):
    return fa(n, k) + fb(n, k)

def fa(n, k):
    global a
    if a[n][k] or n == 0 or k == 0:
        return a[n][k]
    elif n == 2*k - 1:
        a[n][k] = 1
        return 1
    else:
        a[n][k] = fb(n-1, k-1) + fa(n-1, k)
        return a[n][k]

def fb(n, k):
    global b
    if b[n][k] or n == 0 or n == 2*k - 1:
        return b[n][k]
    else:
        b[n][k] = fb(n-1, k) + fa(n-1, k)
        return b[n][k]

def g(n, k):
    return sum([f(n, i) for i in range(0, k+1)])

# example
print(g(10, 10))
for i in range(0, 11):
    print(a[i])
print()
for i in range(0, 11):
    print(b[i])

Reversed Huffman coding

Suppose I have a collection of words with a predefined binary prefix code. Given a very large random binary chunk of data, I can parse this chunk into words using the prefix code.
I want to determine, at least approximately (for random chunks of very large lengths) the expectation values of number of hits for each word (how many times it is mentioned in the decoded text).
At first glance, the problem appears trivial: the probability of each word being scanned from the random pool of bits is completely determined by its length (since each bit can be either 0 or 1). But I suspect this is an incorrect answer to the problem above, since the words have different lengths, and thus this probability is not the same as the expected number of hits (divided by the length of the data chunk).
UPD: I was asked (in comments below) to state this problem mathematically, so here it goes.
Let w be a list of words written with only zeros and ones (our alphabet consists of only two letters). Furthermore, no word in w is a prefix of any other word. Thus w forms a legitimate binary prefix code. I want to know (at least approximately) the mean value of hits, for each word in w, averaged over all possible binary chunks of data with fixed size n. n can be taken very large, much much larger than any of the lengths of our words. However, words have different lengths and this can not be neglected.
I would appreciate any references to attempts to solve this.
My brief answer: the expected number of hits (or rather the expected proportion of hits) can be calculated for every given list of words.
I will not describe the full algorithm, but just do the following example in detail for illustration: let us fix the following very simple list of three words: 0, 10, 11.
For every n, there are 2^n different data chunks of length n (I mean n bits), each occur with the same probability 2^(-n).
The first observation is that not all data chunks can be decoded exactly; e.g. for the chunk 0101, after decoding there remains a single 1 at the end.
Let us write U(n) for the number of length n data chunks that CAN be decoded exactly, and write V(n) for the others (i.e. those with an extra 1 in the end). The following recurrence relations are clear:
U(n) + V(n) = 2^n
V(n) = U(n - 1)
with the initial values U(0) = 1 and V(0) = 0.
A simple calculation then yields:
U(n) = (2^(n + 1) + (- 1)^n) / 3.
Now let A(n) (resp. B(n), C(n)) be the sum of the number of hits on the word 0 (resp. 10, 11) for all the U(n) exact data chunks, and let a(n) (resp. b(n), c(n)) be the same sum for all the V(n) inexact data chunks (the last 1 does not count in this case).
Then we have the following relations:
a(n) = A(n - 1), b(n) = B(n - 1), c(n) = C(n - 1)
A(n) = A(n - 1) + U(n - 1) + A(n - 2) + A(n - 2)
B(n) = B(n - 1) + B(n - 2) + U(n - 2) + B(n - 2)
C(n) = C(n - 1) + C(n - 2) + C(n - 2) + U(n - 2)
Explanation for relations 2, 3 and 4:
If D is an exact data chunk of length n, then there are three possibilities:
D ends with 0, and deleting this 0 yields an exact data chunk of length n - 1;
D ends with 10, and deleting this 10 yields an exact data chunk of length n - 2;
D ends with 11, and deleting this 11 yields an exact data chunk of length n - 2.
Thus, for example, when we sum up all the hit numbers for 0 in all exact data chunks of length n, the contributions of the three cases are respectively A(n - 1) + U(n - 1), A(n - 2), A(n - 2). Similarly for the other two equalities.
Now, solving these recurrence relations, we get:
A(n) = 2/9 * n * 2^n + (smaller terms)
B(n) = C(n) = 1/9 * n * 2^n + (smaller terms)
Since U(n) = 2/3 * 2^n + (smaller terms), our conclusion is that there are approximately n/3 hits on 0, n/6 hits on 10, n/6 hits on 11.
Note that the same proportions hold if we take also the V(n) inexact data chunks into account, because of the relations between A(n), B(n), C(n), U(n) and a(n), b(n), c(n), V(n).
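Before generalizing, here is a quick numeric check of these recurrences (my own sketch; the ratios confirm roughly n/3 hits on 0 and n/6 on 10):

U = [1, 1]    # exactly decodable chunks: U(n) = U(n-1) + 2 * U(n-2)
A = [0, 1]    # total hits of the word 0 over all exact chunks of length n
B = [0, 0]    # total hits of the word 10 (11 behaves identically)
for n in range(2, 41):
    U.append(U[n - 1] + 2 * U[n - 2])
    A.append(A[n - 1] + U[n - 1] + 2 * A[n - 2])
    B.append(B[n - 1] + 2 * B[n - 2] + U[n - 2])
n = 40
print(A[n] / (n * U[n]))   # tends to 1/3
print(B[n] / (n * U[n]))   # tends to 1/6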
This method generalizes to any list of words. It's the same idea as solving this problem with dynamic programming: create states, find the recurrence relation, and establish the transition matrix.
To go further
I think the following might also be true, which will simplify the answer further.
Let w_1, ..., w_k be the words in the list, and let l_1, ..., l_k be their lengths.
For every i = 1, ..., k, let a_i be the proportion of hits of w_i, i.e. for length n data chunks the expected number of hits for w_i is a_i * n + (smaller terms).
Then, my feeling (conjecture) is that a_i * 2^(l_i) is the same for all i, i.e. if one word is one bit longer than another, then its hit number is a half of that of the other.
This conjecture, if correct, is probably not very difficult to prove. But I'm too lazy to think now...
If this is true, then we can calculate those a_i very easily, because we have the identity:
sum (a_i * l_i) = 1.
Let me illustrate this with the above example.
We have w_1 = 0, w_2 = 10, w_3 = 11, hence l_1 = 1, l_2 = l_3 = 2.
According to the conjecture, we should have a_1 = 2 * a_2 = 2 * a_3. Thus a_2 = a_3 = x and a_1 = 2x. The above equality becomes:
2x * 1 + x * 2 + x * 2 = 1
Hence x = 1 / 6, and we have a_1 = 1 / 3, a_2 = a_3 = 1 / 6, as can be verified by the above calculation.
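If the conjecture holds, the rates follow directly from the word lengths. A tiny sketch (my own; hit_rates is a made-up name):

def hit_rates(lengths):
    # a_i proportional to 2 ** (-l_i), normalized so that sum(a_i * l_i) == 1
    weights = [2.0 ** -l for l in lengths]
    norm = sum(l * w for l, w in zip(lengths, weights))
    return [w / norm for w in weights]

print(hit_rates([1, 2, 2]))   # [0.333..., 0.1666..., 0.1666...]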
Let's make a simple machine that can recognize words: a DFA with an accepting state for each word. To construct this DFA, start with a binary tree with each left-child-edge labeled 0 and each right-child-edge labeled 1. Each leaf is either a word-accepter (if the path to that leaf down the tree is the word's spelling) or is garbage (a string of letters that isn't a prefix for any valid word). We wire up "restart" edges from the leaves back to the root of the tree*.
Let's find out what the frequency of matching each word would be, if we had a string of infinite length. To do this, treat the graph of the DFA as a Markov state transition diagram, initialize the starting state to be at the root with probability 1 and all other states 0, and find the steady state distribution (by finding the dominant eigenvector of the transition diagram's corresponding matrix).
Our string is not of infinite length, but since n is large, I expect "edge effects" not to matter much. We can approximate the number of matches per word by taking the per-word matching rate and multiplying by n. If we want to be more precise, instead of taking the eigenvector we could take the transition matrix to the nth power and multiply it by the starting distribution to get the resulting distribution after n letters.
*This isn't quite precise, because this Markov system would spend some nonzero amount of time at the root, whereas after recognizing a word or skipping garbage it should immediately go to the 0-child or 1-child, depending on the next bit. So we don't actually wire up our "restart" edges to the root: from a word-accepting node we wire up two restart edges (one to the 0-child and one to the 1-child of the root); we replace garbage nodes that are left-children with an edge to the 0-child; and we replace garbage nodes that are right-children with an edge to the 1-child. In fact, if we set our initial state to 0 with probability 0.5 and 1 with probability 0.5, we don't even need the root.
EDIT: To use @WhatsUp's example (the code 0, 10, 11), we start with the DFA built from that binary tree (figure omitted). We rewire it a little bit so that it restarts after a word is accepted, and we get rid of the root node (figure omitted). The corresponding Markov transition matrix, with states ordered [0, 1, 10, 11], is:
0.5   0     0.5   0.5
0.5   0     0.5   0.5
0     0.5   0     0
0     0.5   0     0
whose dominant eigenvector is:
0.333
0.333
0.167
0.167
Which is to say that it spends 1/3 of its time in the 0 node, 1/3 in 1, 1/6 in 10, and 1/6 in 11. This agrees with @WhatsUp's results for that example.
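A quick way to reproduce that steady state numerically, assuming NumPy is available (power iteration converges here because the chain is irreducible and aperiodic):

import numpy as np

# Transition matrix for the code {0, 10, 11}; states ordered [0, 1, 10, 11],
# columns are "from" states and rows are "to" states.
T = np.array([
    [0.5, 0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5, 0.5],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
])

# Start in 0 or 1 with equal probability (no root state needed) and iterate.
v = np.array([0.5, 0.5, 0.0, 0.0])
for _ in range(100):
    v = T @ v
print(v)   # ~ [0.333, 0.333, 0.167, 0.167]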

Counting the strictly increasing sequences

There are N candles aligned from left to right. The ith candle from the left has height Hi and color Ci, an integer ranging from 1 to a given K, the number of colors.
Problem: how many strictly increasing (in height) colorful subsequences are there? A subsequence is colorful if every one of the K colors appears at least once in it.
For example: N = 4, K = 3
H C
1 1
3 2
2 2
4 3
The only two valid subsequences are (1, 2, 4) and (1, 3, 4).
I think this is a Fenwick tree problem. Please suggest an approach for how to proceed with this type of problem.
For a moment, let's forget about the colors. So the problem is simpler: count the number of increasing subsequences. This problem has a standard solution:
1. Map each value to [0...n - 1] range.
2. Let's assume the f[value] is the number of increasing subsequences that have value as their last element.
3. Initially, f is filled with 0.
4. After that, you iterate over all array elements from left to right and perform the following operation: f[value] += 1 + get_sum(0, value - 1)(it means that you add this element to all possible subsequences so that they remain strictly increasing), where value is the current element of the array and get_sum(a, b) returns the sum of f[a] + f[a + 1] + ... + f[b].
5. The answer is f[0] + f[1] + ... + f[n - 1].
Using a binary indexed tree (aka Fenwick tree), it is possible to do the get_sum operation in O(log n), for O(n log n) total time complexity.
Now let's come back to the original problem. To take the colors into account, we can compute f[value, mask] instead of f[value] (that is, the number of increasing subsequences that have value as their last element and whose set of colors is the bitmask mask). Then an update for each element looks like this:
for mask in [0 ... 2^K - 1]:
    f[value, mask or 2^(color[i] - 1)] += get_sum(0, value - 1, mask)
f[value, 2^(color[i] - 1)] += 1  # the element alone; the +1 applies only to the empty mask
The answer is f[0, 2^K - 1] + f[1, 2^K - 1] + ... + f[n - 1, 2^K - 1].
You can maintain 2^K binary indexed trees to achieve O(n * log n * 2^K) time complexity, using the same idea as in the simpler problem.
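Here's a compact sketch of the whole approach (my own illustration; count_colorful_increasing is a made-up name). Heights are compressed to 1..n first, and one Fenwick tree is kept per color mask:

def count_colorful_increasing(H, C, K):
    # 1. Compress heights to 1..size (Fenwick trees are 1-indexed).
    order = {h: i + 1 for i, h in enumerate(sorted(set(H)))}
    size = len(order)
    full = (1 << K) - 1
    trees = [[0] * (size + 1) for _ in range(1 << K)]   # one BIT per mask

    def update(mask, i, delta):
        while i <= size:
            trees[mask][i] += delta
            i += i & (-i)

    def query(mask, i):   # sum of counts for heights 1..i in this mask's tree
        s = 0
        while i > 0:
            s += trees[mask][i]
            i -= i & (-i)
        return s

    for h, c in zip(H, C):
        v = order[h]
        bit = 1 << (c - 1)
        # Extend every subsequence that ends strictly below this height...
        for mask in range(1 << K):
            cnt = query(mask, v - 1)
            if cnt:
                update(mask | bit, v, cnt)
        # ...and start a new subsequence consisting of this element alone.
        update(bit, v, 1)

    return query(full, size)

print(count_colorful_increasing([1, 3, 2, 4], [1, 2, 2, 3], 3))   # 2, as in the example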
