Efficient way to multiply a large set of small numbers - algorithm

This question was asked in an interview.
You have an array of small integers. You have to multiply all of them. You need not worry about overflow you have ample support for that. What can you do to speed up the multiplication on your machine?
Would multiple additions be better in this case?
I suggested multiplying using a divide and conquer approach but the interviewer was not impressed. What could be the best possible solution for this?

Here are some thoughts:
Divide-and-Conquer with Multithreading: Split the input apart into n different blocks of size b and recursively multiply all the numbers in each block together. Then, recursively multiply all n / b blocks back together. If you have multiple cores and can run parts of this in parallel, you could save a lot of time overall.
Word-Level Parallelism: Let's suppose that your numbers are all bounded from above by some number U, which happens to be a power of two. Now, suppose that you want to multiply together a, b, c, and d. Start off by computing (4U2a + b) × (4U2c + d) = 16U4ac + 4U2ad + 4U2bc + bd. Now, notice that this expression mod U2 is just bd. (Since bd < U2, we don't need to worry about the mod U2 step messing it up). This means that if we compute this product and take it mod U2, we get back bd. Since U2 is a power of two, this can be done with a bitmask.
Next, notice that
4U2ad + 4U2bc + bd < 4U4 + 4U4 + U2 < 9U4 < 16U4
This means that if we divide the entire expression by 16U4 and round down, we will end up getting back just ad. This division can be done with a bitshift, since 16U4 is a power of two.
Consequently, with one multiplication, you can get back the values of both ac and bd by applying a subsequent bitshift and bitmask. Once you have ac and bd, you can directly multiply them together to get back the value of abcd. Assuming that bitmasks and bitshifts are faster than multiplies, this reduces the number of multiplications necessary by 33% (two instead of three here).
Hope this helps!

Your divide and conquer suggestion was a good start. It just needed more explanation to impress.
With fast multiplication algorithms used to multiply large numbers (big-ints), it is much more efficient to multiply similar sized multiplicands than a series of mismatched sizes.
Here's an example in Clojure
; Define a vector of 100K random integers between 2 and 100 inclusive
(def xs (vec (repeatedly 100000 #(+ 2 (rand-int 99)))))
; Naive multiplication accumulating linearly through the array
(time (def a1 (apply *' xs)))
"Elapsed time: 7116.960557 msecs"
; Simple Divide and conquer algorithm
(defn d-c [v]
(let [m (quot (count v) 2)]
(if (< m 3)
(reduce *' v)
(*' (d-c (subvec v 0 m)) (d-c (subvec v m))))))
; Takes less than 1/10th the time.
(time (def a2 (d-c xs)))
"Elapsed time: 600.934268 msecs"
(= a1 a2) ;=> true (same result)
Note that this improvement does not rely on a set limit for the size of the integers in the array (100 chosen arbitrarily and to demonstrate the next algorithm), but only that they be similar in size. This is a very simple divide an conquer. As the numbers get larger and more expensive to multiply, it would make sense to invest more time in iteratively grouping them by similar size. Here I am relying on random distribution and chance that the sizes will stay similar, but it is still going to be significantly better than the naive approach even for the worst case.
As suggested by Evgeny Kluev in the comments, for a large number of small integers, there is going to be a lot of duplication, so efficient exponentiation is also better than naive multiplication. This depends a lot more on the relative parameters than the divide and conquer, that is the numbers must be sufficiently small relative to the count for enough duplicates to accumulate to bother, but certainly performs well with these parameters (100K numbers in the range 2-100).
; Hopefully an efficient implementation
(defn pow [x n] (.pow (biginteger x) ^Integer n))
; Perform pow on duplications based on frequencies
(defn exp-reduce [v] (reduce *' (map (partial apply pow) (frequencies v))))
(time (def a3 (exp-reduce xs)))
"Elapsed time: 650.211789 msecs"
Note the very simple divide and conquer performed just a wee better in this trial, but would be even relatively better if fewer duplicates were expected.
Of course we can also combine the two:
(defn exp-d-c [v] (d-c (mapv (partial apply pow) (frequencies v))))
(time (def a4 (exp-d-c xs)))
"Elapsed time: 494.394641 msecs"
(= a1 a2 a3 a4) ;=> true (all equal)
Note there are better ways to combine these two since the result of the exponentiation step is going to result in various sizes of multiplicands. The value of added complexity to do so depends on the expected number of distinct numbers in the input. In this case, there are very few distinct numbers so it wouldn't pay to add much complexity.
Note also that both of these are easily parallelized if multiple cores are available.

If many of the small integers occur multiple times, you could start by counting every unique integer. If c(n) is the number of occurrences of integer n, the product can be computed as
P = 2 ^ c(2) * 3 ^ c(3) * 4 ^ c(4) * ...
For the exponentiation steps, you can use exponentiation by squaring which can reduce the number of multiplications considerably.

If the count of numbers really is large compared to the range, then we have seen two asymptotic solutions presented to reduce the complexity considerably. One was based on successive squaring to compute c^k in O(log k) time for each number c, giving O(C mean(log k)) time if the largest number is C and k gives the exponent for each number between 1 and C. The mean(log k) term is maximized if every number appears an equal number of times, so if you have N numbers then the complexity becomes O(C log(N/C)), which is very weakly dependent on N and essentially just O(C) where C specifies the range of numbers.
The other approach we saw was sorting numbers by the number of times they appear, and keeping track of the product of leading numbers (starting with all numbers) and raising this to a power so that the least frequent number is removed from the array, and then updating the exponents on the remaining element in the array and repeating. If all numbers occur the same number of times K, then this gives O(C + log K) which is an improvement over O(C log K). However, say the kth number appears 2^k times. Then this will still give O(C^2 + C log(N/C)) time which is technically worse than the previous method O(C log(N/C)) if C > log(N/C). Thus, if you don't have good information on how evenly distributed the occurrences of each number are, you should go with the first approach, just take the appropriate power of each distinct number that appears in the product by using successive squaring, and take the product of the results. Total time O(C log (N/C)) if there are C distinct numbers and N total numbers.

To answer this question we need to interpret in some way the assumption from OP: need not worry about overflow. In larger part of this answer it is interpreted as "ignore overflows". But I start with several thoughts about other interpretation ("use multiprecision arithmetic"). In this case process of multiplication may be approximately split into 3 stages:
Multiply together small sets of small numbers to get a large set of not-so-small numbers. Some of the ideas from second part of this answer may be used here.
Multiply together these numbers to get a set of large numbers. Either trivial (quadratic time) algorithm or Toom–Cook/Karatsuba (sub-quadratic time) methods may be used.
Multiply together large numbers. Either Fürer's or Schönhage–Strassen algorithm may be used. Which gives O(N polylog N) time complexity for the whole process.
Binary exponentiation may give some (not very significant) speed improvement, because most (if not every) of complex multiplication algorithms mentioned here do squaring faster than multiplication of two unequal numbers. Also we could factorize every "small" number and use binary exponentiation only for prime factors. For evenly distributed "small" numbers this will decrease number of exponentiations by factor log(number_of_values) and slightly improve balance of squarings/multiplications.
Divide and conquer is OK when numbers are evenly distributed. Otherwise (for example when input array is sorted or when binary exponentiation is used) we could do better by placing all multiplicands into priority queue, ordered (may be approximately ordered) by number length. Then we could multiply two shortest numbers and place the result back to the queue (this process is very similar to Huffman encoding). There is no need to use this queue for squaring. Also we should not use it while numbers are not long enough.
More information on this could be found in the answer by A. Webb.
If overflows may be ignored we could multiply the numbers with linear-time or better algorithms.
Sub-linear time algorithm is possible if input array is sorted or input data is presented as set of tuples {value, number of occurrences}. In latter case we could perform binary exponentiation of each value and multiply results together. Time complexity is O(C log(N/C)), where C is number of different values in the array. (See also this answer).
If input array is sorted, we could use binary search to find positions where value changes. This allows to determine how many times each value occurs in the array. Then we could perform binary exponentiation of each value and multiply results together. Time complexity is O(C log N). We could do better by using one-sided binary search here. In this case time complexity is O(C log(N/C)).
But if input array is not sorted, we have to inspect each element, so O(N) time complexity is the best we can do. Still we could use parallelism (multithreading, SIMD, word-level parallelism) to obtain some speed improvement. Here I compare several such approaches.
To compare these approaches I've chosen very small (3-bit) values, which are pretty tightly packed (one value per 8-bit integer). And implemented them in low-level language (C++11) to get easier access to bit manipulation, specific CPU instructions, and SIMD.
Here are all the algorithms:
accumulate from standard library.
Trivial implementation with 4 accumulators.
Word-level parallelism for multiplication, as described in the answer by templatetypedef. With 64-bit word size this approach allows up to 10-bit values (with only 3 multiplications instead of each 4) or it may be applied twice (and I did it in the tests) with up to 5-bit values (requiring only 5 multiplications out of each 8).
Table lookup. In the tests 7 multiplications out of each 8 are substituted by single table lookup. With values larger than in these tests, number of substituted multiplications decreases, slowing down the algorithm. Values larger than 11-12 bits make this approach useless.
Binary exponentiation (see more details below). Values larger than 4 bits make this approach useless.
SIMD (AVX2). This implementation can use up to 8-bit values.
Here are sources for all tests on Ideone. Note that SIMD test requires AVX2 instruction set from Intel. Table lookup test requires BMI2 instruction set. Other tests do not depend on any particular hardware (I hope). I run these tests on 64-bit Linux, compiled with gcc 4.8.1, optimization level -O2.
Here are some more details for binary exponentiation test:
for (size_t i = 0; i < size / 8; i += 2)
{
auto compr = (qwords[i] << 4) | qwords[i + 1];
constexpr uint64_t lsb = 0x1111111111111111;
if ((compr & lsb) != lsb) // if there is at least one even value
{
auto b = reinterpret_cast<uint8_t*>(qwords + i);
acc1 *= accumulate(b, b + 16, acc1, multiplies<unsigned>{});
if (!acc1)
break;
}
else
{
const auto b2 = compr & 0x2222222222222222;
const auto b4 = compr & 0x4444444444444444;
const auto b24 = b4 & (b2 * 2);
const unsigned c7 = __builtin_popcountll(b24);
acc3 += __builtin_popcountll(b2) - c7;
acc5 += __builtin_popcountll(b4) - c7;
acc7 += c7;
}
}
const auto prod4 = acc1 * bexp<3>(acc3) * bexp<5>(acc5) * bexp<7>(acc7);
This code packs values even more densely than in the input array: two values per byte. Low-order bit of each value is handled differently: since we could stop after 32 zero bits is found here (with result "zero"), this case cannot alter performance very much, so it is handled by simplest (library) algorithm.
Out of 4 remaining values, "1" is not interesting, so we need to count only occurrences of "3", "5" , and "7" with bitwise manipulations and "population count" intrinsic.
Here are the results:
source size: 4 Mb: 400 Mb:
1. accumulate: 0.835392 ns 0.849199 ns
2. accum * 4: 0.290373 ns 0.286915 ns
3. 2 mul in 1: 0.178556 ns 0.182606 ns
4. mult table: 0.130707 ns 0.176102 ns
5. binary exp: 0.128484 ns 0.119241 ns
6. AVX2: 0.0607049 ns 0.0683234 ns
Here we can see that accumulate library algorithm is pretty slow: for some reason gcc could not unroll the loop and use 4 independent accumulators.
It is not too difficult to do this optimization "by hand". The result is not particularly fast. But if we allocate 4 threads for this task, CPU would approximately match memory bandwidth (2 channels, DDR3-1600).
Word-level parallelism for multiplications is almost twice as fast. (We'll need only 3 threads to match memory bandwidth).
Table lookup is even faster. But its performance degrades when input array cannot fit in L3 cache. (We'll need 3 threads to match memory bandwidth).
Binary exponentiation has approximately the same speed. But with larger inputs this performance does not degrade, it even slightly improves because exponentiation itself uses less time compared to value counting. (We'll need only 2 threads to match memory bandwidth).
As expected, SIMD is the fastest. Its performance slightly degrades when input array cannot fit in L3 cache. Which means we are close to memory bandwidth with single thread.

I have one solution. Let us discuss it with other solutions.
The key part of question is how to reduce times of multiply. And integers are small but set is big.
My solution:
use an small array to record how many times each number appears.
Remove number 1 from array. You don't need to count it.
Find the number which appears least times n. Then multiply all numbers and get result K. Then count K^n.
Remove this number (For instance, you can switch it with the last number of array and reduce size of array for 1). So next time you won't consider this number any more. At same time, the appearance times of other numbers need to be reduced with the times of removed number.
Once again get the number which appears least times. Do same thing as step 2.
Repeatedly do step 2-4 and complete counting.
Let me use an example to show how many multiply we need to do: Assume
we have 5 numbers [1, 2, 3, 4, 5]. Number 1 appears 100 times, number
2 appears 150 times, number 3 appears 200 times, number 4 appears 300
times, and number 5 appears 400 times.
method 1: multiply it directly or use divide and conquer
we need 100+150+200+300+400-1 = 1149 multiply to get result.
method 2: we do (1^100)(2^150)(3^200)(4^300)(5^400)
(100-1)+(150-1)+(200-1)+(300-1)+(400-1)+4 = 1149.[same as method 1]
Cause n^m will do m-1 multiply in deed. Plus you need time to go through all numbers, though this time is short.
method in this post:
First, you need time to go through all numbers. It can be discarded compare to time of multiply.
The real counting you are making is:
((2*3*4*5)^150)*((3*4*5)^50)*((4*5)^100)*(5^100)
Then you need to do multiply 3+149+2+49+1+99+99+3 = 405 times

Related

Perfect powers of numbers which can fit in 64 bit size integer (using priority queues)

How can we print out all perfect powers that can be represented as 64-bit long integers: 4, 8, 9, 16, 25, 27, .... A perfect power is a number that can be written as ab for integers a and b ≥ 2.
It's not a homework problem, I found it in job interview questions section of an algorithm design book. Hint, the chapter was based on priority queues.
Most of the ideas I have are quadratic in nature, that keep finding powers until they stop fitting 64 bit but that's not what an interviewer will look for. Also, I'm not able to understand how would PQ's help here.
Using a small priority queue, with one entry per power, is a reasonable way to list the numbers. See following python code.
import Queue # in Python 3 say: queue
pmax, vmax = 10, 150
Q=Queue.PriorityQueue(pmax)
p = 2
for e in range(2,pmax):
p *= 2
Q.put((p,2,e))
print 1,1,2
while not Q.empty():
(v, b, e) = Q.get()
if v < vmax:
print v, b, e
b += 1
Q.put((b**e, b, e))
With pmax, vmax as in the code above, it produces the following output. For the proposed problem, replace pmax and vmax with 64 and 2**64.
1 1 2
4 2 2
8 2 3
9 3 2
16 2 4
16 4 2
25 5 2
27 3 3
32 2 5
36 6 2
49 7 2
64 2 6
64 4 3
64 8 2
81 3 4
81 9 2
100 10 2
121 11 2
125 5 3
128 2 7
144 12 2
The complexity of this method is O(vmax^0.5 * log(pmax)). This is because the number of perfect squares is dominant over the number of perfect cubes, fourth powers, etc., and for each square we do O(log(pmax)) work for get and put queue operations. For higher powers, we do O(log(pmax)) work when computing b**e.
When vmax,pmax =64, 2**64, there will be about 2*(2^32 + 2^21 + 2^16 + 2^12 + ...) queue operations, ie about 2^33 queue ops.
Added note: This note addresses cf16's comment, “one remark only, I don't think "the number of perfect squares is dominant over the number of perfect cubes, fourth powers, etc." they all are infinite. but yes, if we consider finite set”. It is true that in the overall mathematical scheme of things, the cardinalities are the same. That is, if P(j) is the set of all j'th powers of integers, then the cardinality of P(j) == P(k) for all integers j,k > 0. Elements of any two sets of powers can be put into 1-1 correspondence with each other.
Nevertheless, when computing perfect powers in ascending order, no matter how many are computed, finite or not, the work of delivering squares dominates that for any other power. For any given x, the density of perfect kth powers in the region of x declines exponentially as k increases. As x increases, the density of perfect kth powers in the region of x is proportional to (x1/k)/x, hence third powers, fourth powers, etc become vanishingly rare compared to squares as x increases.
As a concrete example, among perfect powers between 1e8 and 1e9 the number of (2; 3; 4; 5; 6)th powers is about (21622; 535; 77; 24; 10). There are more than 30 times as many squares between 1e8 and 1e9 than there are instances of any higher powers than squares. Here are ratios of the number of perfect squares between two numbers, vs the number of higher perfect powers: 10¹⁰–10¹⁵, r≈301; 10¹⁵–10²⁰, r≈2K; 10²⁰–10²⁵, r≈15K; 10²⁵–10³⁰, r≈100K. In short, as x increases, squares dominate more and more when perfect powers are delivered in ascending order.
A priority queue helps, for example, if you want to avoid duplicates in the output, or if you want to list the values particularly sorted.
Priority queues can often be replaced by sorting and vice versa. You could therefore generate all combinations of ab, then sort the results and remove adjacent duplicates. In this application, this approach appears to be slightly but perhaps not drammatically memory-inefficient as witnessed by one of the sister answers.
A priority queue can be superior to sorting, if you manage to remove duplicates as you go; or if you want to avoid storing and processing the whole result to be generated in memory. The other sister answer is an example of the latter but it could easily do both with a slight modification.
Here it makes the difference between an array taking up ~16 GB of RAM and a queue with less than 64 items taking up several kilobytes at worst. Such a huge difference in memory consumption also translates to RAM access time versus cache access time difference, so the memory lean algorithm may end up much faster even if the underlying data structure incurs some overhead by maintaining itself and needs more instructions compared to the naive algorithm that uses sorting.
Because the size of the input is fixed, it is not technically possible that the methods you thought of have been quadratic in nature. Having two nested loops does not make an algorithm quadratic, until you can say that the upper bound of each such loop is proportional to input size, and often not even then). What really matters is how many times the innermost logic actually executes.
In this case the competition is between feasible constants and non-feasible constants.
The only way I can see the priority queue making much sense is that you want to print numbers as they become available, in strictly increasing order, and of course without printing any number twice. So you start off with a prime generator (that uses the sieve of eratosthenes or some smarter technique to generate the sequence 2, 3, 5, 7, 11, ...). You start by putting a triple representing the fact that 2^2 = 4 onto the queue. Then you repeat a process of removing the smallest item (the triple with the smallest exponentiation result) from the queue, printing it, increasing the exponent by one, and putting it back onto the queue (with its priority determined by the result of the new exponentiation). You interleave this process with one that generates new primes as needed (sometime before p^2 is output).
Since the largest exponent base we can possibly have is 2^32 (2^32)^2 = 2^64, the number of elements on the queue shouldn't exceed the number of primes less than 2^32, which is evidently 203,280,221, which I guess is a tractable number.

Compression of sequence of integers providing random access

I have a sequence of n integers in a small range [0,k) and all the integers have the same frequency f (so the size of the sequence is n=f∗k). What I'm trying to do now is to compress this sequence while providing random access (what is the i-th integer). The time to achieve random access doesn't have to be O(1). I'm more interested in achieving high compression at the expense of higher random access times.
I haven't tried with Huffman coding since it assigns codes based on frequencies (and all my frequencies are the same). Perhaps I'm missing some simple encoding for this particular case.
Any help or pointers would be appreciated.
Thanks in advance.
PS: Already asked in cs.stackexchange, but asking here also for better coverage, sorry.
If all your integers have the same frequency, then a fair approximation to optimal compression will be ceil(log2(k)) bits per integer. You can access a bit-array of these in constant time.
If k is painfully small (like 3), the above method may waste a fair amount of space. But, you can combine a fixed number of your small integers into a base-k number, which can fit more efficiently into a fixed number of bits (you may also be able to fit the result conveniently into a standard-sized word). In any case, you can also access this coding in constant time.
If your integers don't have the same frequency, optimal compression may yield variable bit rates from different parts of your input, so the simple array access won't work. In that case, good random-access performance would require an index structure: break your compressed data into convenient sized chunks, which can each be decompressed sequentially, but this time is bounded by the chunk size.
If the frequency of each number is exactly the same, you may be able to save some space by taking advantage of this -- but it may not be enough to be worthwhile.
The entropy of n random numbers in range [0,k) is n log2(k), which is log2(k) bits per number; this is the number of bits it takes to encode your numbers without taking advantage of the exact frequency.
The entropy of distinguishable permutations of f copies each of k elements (where n=f*k) is:
log2( n!/(f!)^k ) = log2(n!) - k * log2(f!)
Applying Stirling's approximation (which is good here only if n and f are large), yields:
~ n log2(n) - n log2(e) - k ( f log2(f) - f log2(e) )
= n log2(n) - n log2(e) - n log2(f) + n log2(e)
= n ( log2(n) - log2(f) )
= n log2(n/f)
= n log2(k)
What this means is that, if n is large and k is small, you will not gain a significant amount of space by taking advantage of the exact frequency of your input.
The total error from the Stirling approximation above is O(log2(n) + k log2(f)), which is O(log2(n)/n + log2(f)/f) per number encoded. This does mean that if your k is so large that your f is small (i.e., each distinct number only has a small number of copies), you may be able to save some space with a clever encoding. However, the question specifies that k is, in fact, small.
If you work out the number of possible different combinations and take its log base 2 you can find the best possible compression, and I don't think it will be that great in your case. With 16 numbers of frequency 1 the number of possible messages is 16! and Excel tells me log base 2 of 16! is 44.25, whereas storing them as 4-bit codes would only take 64 bits. (where there is more than one of each kind you want http://mathworld.wolfram.com/MultinomialCoefficient.html)
I think you will have a problem mixing random access into this because the only information you have is that there are fixed numbers of each type of element - in the whole sequence. That's not a lot of information for the whole sequences, and it says almost nothing about the first half of the sequence in isolation, because you could well have more of some number in the first half and less in the second half.

Segmented Sieve of Atkin, possible?

I am aware of the fact that the Sieve of Eratosthenes can be implemented so that it finds primes continuosly without an upper bound (the segmented sieve).
My question is, could the Sieve of Atkin/Bernstein be implemented in the same way?
Related question: C#: How to make Sieve of Atkin incremental
However the related question has only 1 answer, which says "It's impossible for all sieves", which is obviously incorrect.
Atkin/Bernstein give a segmented version in Section 5 of their original paper. Presumably Bernstein's primegen program uses that method.
In fact, one can implement an unbounded Sieve of Atkin (SoA) not using segmentation at all as I have done here in F#. Note that this is a pure functional version that doesn't even use arrays to combine the solutions of the quadratic equations and the squaresfree filter and thus is considerably slower than a more imperative approach.
Berstein's optimizations using look up tables for optimum 32-bit ranges would make the code extremely complex and not suitable for presentation here, but it would be quite easy to adapt my F# code so that the sequences start at a set lower limit and are used only over a range in order to implement a segmented version, and/or applying the same techniques to a more imperative approach using arrays.
Note that even Berstein's implementation of the SoA isn't really faster than the Sieve of Eratosthenes with all possible optimizations as per Kim Walisch's primesieve but is only faster than an equivalently optimized version of the Sieve of Eratosthenes for the selected range of numbers as per his implementation.
EDIT_ADD: For those who do not want to wade through Berstein's pseudo-code and C code, I am adding to this answer to add a pseudo-code method to use the SoA over a range from LOW to HIGH where the delta from LOW to HIGH + 1 might be constrained to an even modulo 60 in order to use the modulo (and potential bit packing to only the entries on the 2,3,5 wheel) optimizations.
This is based on a possible implementation using the SoA quadratics of (4*x^2 + y^), (3*x^2 + y^2), and (3*x^2 -y^2) to be expressed as sequences of numbers with the x value for each sequence fixed to values between one and SQRT((HIGH - 1) / 4), SQRT((HIGH - 1) / 3), and solving the quadratic for 2*x^2 + 2*x - HIGH - 1 = 0 for x = (SQRT(1 + 2 * (HIGH + 1)) - 1) / 2, respectively, with the sequences expressed in my F# code as per the top link. Optimizations to the sequences there use that when sieving for only odd composites, for the "4x" sequences, the y values need only be odd and that the "3x" sequences need only use odd values of y when x is even and vice versa. Further optimization reduce the number of solutions to the quadratic equations (= elements in the sequences) by observing that the modulo patterns over the above sequences repeat over very small ranges of x and also repeat over ranges of y of only 30, which is used in the Berstein code but not (yet) implemented in my F# code.
I also do not include the well known optimizations that could be applied to the prime "squares free" culling to use wheel factorization and the calculations for the starting segment address as I use in my implementations of a segmented SoE.
So for purposes of calculating the sequence starting segment addresses for the "4x", "3x+", and "3x-" (or with "3x+" and "3x-" combined as I do in the F# code), and having calculated the ranges of x for each as per the above, the pseudo-code is as follows:
Calculate the range LOW - FIRST_ELEMENT, where FIRST_ELEMENT is with the lowest applicable value of y for each given value of x or y = x - 1 for the case of the "3x-" sequence.
For the job of calculating how many elements are in this range, this boils down to the question of how many of (y1)^2 + (y2)^2 + (y3)^2... there are where each y number is separated by two, to produce even or odd 'y's as required. As usual in square sequence analysis, we observe that differences between squares have a constant increasing increment as in delta(9 - 1) is 8, delta(25 - 9) is 16 for an increase of 8, delta (49 - 25) is 24 for a further increase of 8, etcetera. so that for n elements the last increment is 8 * n for this example. Expressing the sequence of elements using this, we get it is one (or whatever one chooses as the first element) plus eight times the sequence of something like (1 + 2 + 3 + ...+ n). Now standard reduction of linear sequences applies where this sum is (n + 1) * n / 2 or n^2/2 + n/2. This we can solve for how many n elements there are in the range by solving the quadratic equation n^2/2 + n/2 - range = 0 or n = (SQRT(8*range + 1) - 1) / 2.
Now, if FIRST_ELEMENT + 4 * (n + 1) * n does not equal LOW as the starting address, add one to n and use FIRST_ELEMENT + 4 * (n + 2) * (n + 1) as the starting address. If one uses further optimizations to apply wheel factorization culling to the sequence pattern, look up table arrays can be used to look up the closest value of used n that satisfies the conditions.
The modulus 12 or 60 of the starting element can be calculated directly or can be produced by use of look up tables based on the repeating nature of the modulo sequences.
Each sequence is then used to toggle the composite states up to the HIGH limit. If the additional logic is added to the sequences to jump values between only the applicable elements per sequence, no further use of modulo conditions is necessary.
The above is done for every "4x" sequence followed by the "3x+" and "3x-" sequences (or combine "3x+" and "3x-" into just one set of "3x" sequences) up to the x limits as calculated earlier or as tested per loop.
And there you have it: given an appropriate method of dividing the sieve range into segments, which is best used as fixed sizes that are related to the CPU cache sizes for best memory access efficiency, a method of segmenting the SoA just as used by Bernstein but somewhat simpler in expression as it mentions but does not combine the modulo operations and bit packing.

Create a random permutation of 1..N in constant space

I am looking to enumerate a random permutation of the numbers 1..N in fixed space. This means that I cannot store all numbers in a list. The reason for that is that N can be very large, more than available memory. I still want to be able to walk through such a permutation of numbers one at a time, visiting each number exactly once.
I know this can be done for certain N: Many random number generators cycle through their whole state space randomly, but entirely. A good random number generator with state size of 32 bit will emit a permutation of the numbers 0..(2^32)-1. Every number exactly once.
I want to get to pick N to be any number at all and not be constrained to powers of 2 for example. Is there an algorithm for this?
The easiest way is probably to just create a full-range PRNG for a larger range than you care about, and when it generates a number larger than you want, just throw it away and get the next one.
Another possibility that's pretty much a variation of the same would be to use a linear feedback shift register (LFSR) to generate the numbers in the first place. This has a couple of advantages: first of all, an LFSR is probably a bit faster than most PRNGs. Second, it is (I believe) a bit easier to engineer an LFSR that produces numbers close to the range you want, and still be sure it cycles through the numbers in its range in (pseudo)random order, without any repetitions.
Without spending a lot of time on the details, the math behind LFSRs has been studied quite thoroughly. Producing one that runs through all the numbers in its range without repetition simply requires choosing a set of "taps" that correspond to an irreducible polynomial. If you don't want to search for that yourself, it's pretty easy to find tables of known ones for almost any reasonable size (e.g., doing a quick look, the wikipedia article lists them for size up to 19 bits).
If memory serves, there's at least one irreducible polynomial of ever possible bit size. That translates to the fact that in the worst case you can create a generator that has roughly twice the range you need, so on average you're throwing away (roughly) every other number you generate. Given the speed an LFSR, I'd guess you can do that and still maintain quite acceptable speed.
One way to do it would be
Find a prime p larger than N, preferably not much larger.
Find a primitive root of unity g modulo p, that is, a number 1 < g < p such that g^k ≡ 1 (mod p) if and only if k is a multiple of p-1.
Go through g^k (mod p) for k = 1, 2, ..., ignoring the values that are larger than N.
For every prime p, there are φ(p-1) primitive roots of unity, so it works. However, it may take a while to find one. Finding a suitable prime is much easier in general.
For finding a primitive root, I know nothing substantially better than trial and error, but one can increase the probability of a fast find by choosing the prime p appropriately.
Since the number of primitive roots is φ(p-1), if one randomly chooses r in the range from 1 to p-1, the expected number of tries until one finds a primitive root is (p-1)/φ(p-1), hence one should choose p so that φ(p-1) is relatively large, that means that p-1 must have few distinct prime divisors (and preferably only large ones, except for the factor 2).
Instead of randomly choosing, one can also try in sequence whether 2, 3, 5, 6, 7, 10, ... is a primitive root, of course skipping perfect powers (or not, they are in general quickly eliminated), that should not affect the number of tries needed greatly.
So it boils down to checking whether a number x is a primitive root modulo p. If p-1 = q^a * r^b * s^c * ... with distinct primes q, r, s, ..., x is a primitive root if and only if
x^((p-1)/q) % p != 1
x^((p-1)/r) % p != 1
x^((p-1)/s) % p != 1
...
thus one needs a decent modular exponentiation (exponentiation by repeated squaring lends itself well for that, reducing by the modulus on each step). And a good method to find the prime factor decomposition of p-1. Note, however, that even naive trial division would be only O(√p), while the generation of the permutation is Θ(p), so it's not paramount that the factorisation is optimal.
Another way to do this is with a block cipher; see this blog post for details.
The blog posts links to the paper Ciphers with Arbitrary Finite Domains which contains a bunch of solutions.
Consider the prime 3. To fully express all possible outputs, think of it this way...
bias + step mod prime
The bias is just an offset bias. step is an accumulator (if it's 1 for example, it would just be 0, 1, 2 in sequence, while 2 would result in 0, 2, 4) and prime is the prime number we want to generate the permutations against.
For example. A simple sequence of 0, 1, 2 would be...
0 + 0 mod 3 = 0
0 + 1 mod 3 = 1
0 + 2 mod 3 = 2
Modifying a couple of those variables for a second, we'll take bias of 1 and step of 2 (just for illustration)...
1 + 2 mod 3 = 0
1 + 4 mod 3 = 2
1 + 6 mod 3 = 1
You'll note that we produced an entirely different sequence. No number within the set repeats itself and all numbers are represented (it's bijective). Each unique combination of offset and bias will result in one of prime! possible permutations of the set. In the case of a prime of 3 you'll see that there are 6 different possible permuations:
0,1,2
0,2,1
1,0,2
1,2,0
2,0,1
2,1,0
If you do the math on the variables above you'll not that it results in the same information requirements...
1/3! = 1/6 = 1.66..
... vs...
1/3 (bias) * 1/2 (step) => 1/6 = 1.66..
Restrictions are simple, bias must be within 0..P-1 and step must be within 1..P-1 (I have been functionally just been using 0..P-2 and adding 1 on arithmetic in my own work). Other than that, it works with all prime numbers no matter how large and will permutate all possible unique sets of them without the need for memory beyond a couple of integers (each technically requiring slightly less bits than the prime itself).
Note carefully that this generator is not meant to be used to generate sets that are not prime in number. It's entirely possible to do so, but not recommended for security sensitive purposes as it would introduce a timing attack.
That said, if you would like to use this method to generate a set sequence that is not a prime, you have two choices.
First (and the simplest/cheapest), pick the prime number just larger than the set size you're looking for and have your generator simply discard anything that doesn't belong. Once more, danger, this is a very bad idea if this is a security sensitive application.
Second (by far the most complicated and costly), you can recognize that all numbers are composed of prime numbers and create multiple generators that then produce a product for each element in the set. In other words, an n of 6 would involve all possible prime generators that could match 6 (in this case, 2 and 3), multiplied in sequence. This is both expensive (although mathematically more elegant) as well as also introducing a timing attack so it's even less recommended.
Lastly, if you need a generator for bias and or step... why don't you use another of the same family :). Suddenly you're extremely close to creating true simple-random-samples (which is not easy usually).
The fundamental weakness of LCGs (x=(x*m+c)%b style generators) is useful here.
If the generator is properly formed then x%f is also a repeating sequence of all values lower than f (provided f if a factor of b).
Since bis usually a power of 2 this means that you can take a 32-bit generator and reduce it to an n-bit generator by masking off the top bits and it will have the same full-range property.
This means that you can reduce the number of discard values to be fewer than N by choosing an appropriate mask.
Unfortunately LCG Is a poor generator for exactly the same reason as given above.
Also, this has exactly the same weakness as I noted in a comment on #JerryCoffin's answer. It will always produce the same sequence and the only thing the seed controls is where to start in that sequence.
Here's some SageMath code that should generate a random permutation the way Daniel Fischer suggested:
def random_safe_prime(lbound):
while True:
q = random_prime(lbound, lbound=lbound // 2)
p = 2 * q + 1
if is_prime(p):
return p, q
def random_permutation(n):
p, q = random_safe_prime(n + 2)
while True:
r = randint(2, p - 1)
if pow(r, 2, p) != 1 and pow(r, q, p) != 1:
i = 1
while True:
x = pow(r, i, p)
if x == 1:
return
if 0 <= x - 2 < n:
yield x - 2
i += 1

Programming problem - Game of Blocks

maybe you would have an idea on how to solve the following problem.
John decided to buy his son Johnny some mathematical toys. One of his most favorite toy is blocks of different colors. John has decided to buy blocks of C different colors. For each color he will buy googol (10^100) blocks. All blocks of same color are of same length. But blocks of different color may vary in length.
Jhonny has decided to use these blocks to make a large 1 x n block. He wonders how many ways he can do this. Two ways are considered different if there is a position where the color differs. The example shows a red block of size 5, blue block of size 3 and green block of size 3. It shows there are 12 ways of making a large block of length 11.
Each test case starts with an integer 1 ≤ C ≤ 100. Next line consists c integers. ith integer 1 ≤ leni ≤ 750 denotes length of ith color. Next line is positive integer N ≤ 10^15.
This problem should be solved in 20 seconds for T <= 25 test cases. The answer should be calculated MOD 100000007 (prime number).
It can be deduced to matrix exponentiation problem, which can be solved relatively efficiently in O(N^2.376*log(max(leni))) using Coppersmith-Winograd algorithm and fast exponentiation. But it seems that a more efficient algorithm is required, as Coppersmith-Winograd implies a large constant factor. Do you have any other ideas? It can possibly be a Number Theory or Divide and Conquer problem
Firstly note the number of blocks of each colour you have is a complete red herring, since 10^100 > N always. So the number of blocks of each colour is practically infinite.
Now notice that at each position, p (if there is a valid configuration, that leaves no spaces, etc.) There must block of a color, c. There are len[c] ways for this block to lie, so that it still lies over this position, p.
My idea is to try all possible colors and positions at a fixed position (N/2 since it halves the range), and then for each case, there are b cells before this fixed coloured block and a after this fixed colour block. So if we define a function ways(i) that returns the number of ways to tile i cells (with ways(0)=1). Then the number of ways to tile a number of cells with a fixed colour block at a position is ways(b)*ways(a). Adding up all possible configurations yields the answer for ways(i).
Now I chose the fixed position to be N/2 since that halves the range and you can halve a range at most ceil(log(N)) times. Now since you are moving a block about N/2 you will have to calculate from N/2-750 to N/2-750, where 750 is the max length a block can have. So you will have to calculate about 750*ceil(log(N)) (a bit more because of the variance) lengths to get the final answer.
So in order to get good performance you have to through in memoisation, since this inherently a recursive algorithm.
So using Python(since I was lazy and didn't want to write a big number class):
T = int(raw_input())
for case in xrange(T):
#read in the data
C = int(raw_input())
lengths = map(int, raw_input().split())
minlength = min(lengths)
n = int(raw_input())
#setup memoisation, note all lengths less than the minimum length are
#set to 0 as the algorithm needs this
memoise = {}
memoise[0] = 1
for length in xrange(1, minlength):
memoise[length] = 0
def solve(n):
global memoise
if n in memoise:
return memoise[n]
ans = 0
for i in xrange(C):
if lengths[i] > n:
continue
if lengths[i] == n:
ans += 1
ans %= 100000007
continue
for j in xrange(0, lengths[i]):
b = n/2-lengths[i]+j
a = n-(n/2+j)
if b < 0 or a < 0:
continue
ans += solve(b)*solve(a)
ans %= 100000007
memoise[n] = ans
return memoise[n]
solve(n)
print "Case %d: %d" % (case+1, memoise[n])
Note I haven't exhaustively tested this, but I'm quite sure it will meet the 20 second time limit, if you translated this algorithm to C++ or somesuch.
EDIT: Running a test with N = 10^15 and a block with length 750 I get that memoise contains about 60000 elements which means non-lookup bit of solve(n) is called about the same number of time.
A word of caution: In the case c=2, len1=1, len2=2, the answer will be the N'th Fibonacci number, and the Fibonacci numbers grow (approximately) exponentially with a growth factor of the golden ratio, phi ~ 1.61803399. For the
huge value N=10^15, the answer will be about phi^(10^15), an enormous number. The answer will have storage
requirements on the order of (ln(phi^(10^15))/ln(2)) / (8 * 2^40) ~ 79 terabytes. Since you can't even access 79
terabytes in 20 seconds, it's unlikely you can meet the speed requirements in this special case.
Your best hope occurs when C is not too large, and leni is large for all i. In such cases, the answer will
still grow exponentially with N, but the growth factor may be much smaller.
I recommend that you first construct the integer matrix M which will compute the (i+1,..., i+k)
terms in your sequence based on the (i, ..., i+k-1) terms. (only row k+1 of this matrix is interesting).
Compute the first k entries "by hand", then calculate M^(10^15) based on the repeated squaring
trick, and apply it to terms (0...k-1).
The (integer) entries of the matrix will grow exponentially, perhaps too fast to handle. If this is the case, do the
very same calculation, but modulo p, for several moderate-sized prime numbers p. This will allow you to obtain
your answer modulo p, for various p, without using a matrix of bigints. After using enough primes so that you know their product
is larger than your answer, you can use the so-called "Chinese remainder theorem" to recover
your answer from your mod-p answers.
I'd like to build on the earlier #JPvdMerwe solution with some improvements. In his answer, #JPvdMerwe uses a Dynamic Programming / memoisation approach, which I agree is the way to go on this problem. Dividing the problem recursively into two smaller problems and remembering previously computed results is quite efficient.
I'd like to suggest several improvements that would speed things up even further:
Instead of going over all the ways the block in the middle can be positioned, you only need to go over the first half, and multiply the solution by 2. This is because the second half of the cases are symmetrical. For odd-length blocks you would still need to take the centered position as a seperate case.
In general, iterative implementations can be several magnitudes faster than recursive ones. This is because a recursive implementation incurs bookkeeping overhead for each function call. It can be a challenge to convert a solution to its iterative cousin, but it is usually possible. The #JPvdMerwe solution can be made iterative by using a stack to store intermediate values.
Modulo operations are expensive, as are multiplications to a lesser extent. The number of multiplications and modulos can be decreased by approximately a factor C=100 by switching the color-loop with the position-loop. This allows you to add the return values of several calls to solve() before doing a multiplication and modulo.
A good way to test the performance of a solution is with a pathological case. The following could be especially daunting: length 10^15, C=100, prime block sizes.
Hope this helps.
In the above answer
ans += 1
ans %= 100000007
could be much faster without general modulo :
ans += 1
if ans == 100000007 then ans = 0
Please see TopCoder thread for a solution. No one was close enough to find the answer in this thread.

Resources