Parallelizing large loops and improving cache accesses

Parallelizing large loops and improving cache accesses - parallel-processing

I have a code like the following which I am using to find prime numbers (using Eratosthenes sieve) within a range, and using OpenMP to parallelize. Before this, I have a preprocessing stage where I am flagging off all even numbers, and multiples of 3 and 5 so that I have to do less work in this stage.
The shared L3 cache of the testbed is 12MB, and the physical memory is 32 GB. I am using 12 threads. The flag array is unsigned char.
#pragma omp parallel for
for (i = 0; i < range; i++)
{
for (j = 5; j < range; j+=2)
{
if( flag[i] == 1 && i*j < range )
if ( flag[i*j] == 1 )
flag[i*j] = 0;
}
}
This program works well for ranges less than 1,000,000...but after that the execution time shoots up for larger ranges; eg, for range = 10,000,000 this program takes around 70 mins (not fitting in cache?). I have modified the above program to incorporate loop tiling so that it could utilize the cache for any loop range, but even the blocking approach seems to be time consuming. Interchanging the loops also do not help for large ranges.
How do I modify the above code to tackle large ranges? And how could I rewrite the code to make it fully parallel (range and flag [since the flag array is quite large so I can't declare it private] is shared)?

Actually, I just noticed a few easy speedups in your code. So I'll mention these before I get into the fast algorithm:
Use a bit-field instead of a char array. You can save a factor of 8 in memory.
Your outer loop is running over all integers. Not just the primes. After each iteration, start from the first number that hasn't been crossed off yet. (that number will be prime)
I'm suggesting this because you mentioned that it take 70 min. on a (pretty powerful) machine to run N = 10,000,000. That didn't look right, since my own trivial implementation can do N = 2^32 in under 20 seconds on a laptop - single-threaded, no source-level optimizations. So then I noticed that you missed a few basic optimizations.
Here's the efficient solution. But it takes some work.
The key is to recognize that the Eratosthenes Sieve only needs to go up to sqrt(N) of your target size. In other words, you only need to run the sieve on all prime numbers up to sqrt(N) before you are done.
So the trick is to first run the algorithm on sqrt(N). Then dump all the primes into a dense data structure. By pre-computing all the needed primes, you break the dependency on the outer-loop.
Now, for the rest of the numbers from sqrt(N) - N, you can cross off all numbers that are divisible by any prime in your pre-computed table. Note that this is independent for all the remaining numbers. So the algorithm is now embarrassingly parallel.
To be efficient, this needs to be done using "mini"-sieves on blocks that fit in cache. To be even more efficient, you should compute and cache the reciprocals of all the primes in the table. This will help you efficiently find the "initial offsets" of each prime when you fill out each "mini-sieve".
The initial step of running the algorithm sequential for sqrt(N) will be very fast since it's only sqrt(N). The rest of the work is completely parallelizable.
In the fully general case, this algorithm can be applied recursively on the initial sieve, but that's generally overkill.

Related

Can the efficiency of an algorithm be modelled as a function between input size and time?

Consider the following algorithm (just as an example as the implementation is obviously inefficient):
def add(n):
for i in range(n):
n += 1
return n
The program adds one number with itself and returns it. Now the efficiency of an algorithm is sometimes modelled as a function between the size of the input and the number of primitive steps the algorithm has to compute. In this case, the input is an integer, n, and as n gets increased the number of steps necessary to complete the algorithm also increase (in this case linearly). But is it true that the size of the input increases? Let's assume that the machine where the program is running is representing integers in 8 bits. So if I increase the hypthetical input 3 for example to 7, the number of bits involved remains the same: 00000011 -> 00000111. However, the steps necessary to compute the algorithm increase. So it seems like that it's not always true that algorithmic efficiency can be modelled as a relation between input size and steps to compute. Could somebody explain to me where I go wrong or if I don't go wrong, why it still makes sense to model the efficiency of an algorithm as a function between the size of the input and the number of primitive steps to be computed?

Let S be the size of the input n. (Normally we'd use n for this size, but since the argument is also called n, that's confusing). For positive n, there's a relation between S and n, namely S = ceil(ln(n)). The program loops n times, and since n < 2^S, it loops at most 2^S times. You can also show it loops at least 1/2 * 2^S times, so the runtime (measured in loop iterations) is Theta(2^S).
This shows there's a way to model the runtime as a function of the size, even if it's not exact.
Whether it makes sense. In your example it doesn't much, but if your input is an array for sorting, taking size as the number of elements in the array does makes sense. (And it's typically what's used for example to model the number of comparisons done by different sort algorithms).

How can I optimize sieve of eratosthenes to just store prime numbers for a very large range?

I have studied the working of Sieve of Eratosthenes for generating the prime numbers up to a given number using iteration and striking off all the composite numbers. And the algorithm needs just to be iterated up to sqrt(n) where n is the upper bound upto which we need to find all the prime numbers. We know that the number of prime numbers up to n=10^9 is very less as compared to the number of composite numbers. So we use all the space to just tell that these numbers are not prime first by marking them composite.
My question is can we modify the algorithm to just store prime numbers since we deal with a very large range (since number of primes are very less)?
Can we just store straight away the prime numbers?

Changing the structure from that of a set (sieve) - one bit per candidate - to storing primes (e.g. in a list, vector or tree structure) actually increases storage requirements.
Example: there are 203.280.221 primes below 2^32. An array of uint32_t of that size requires about 775 MiB whereas the corresponding bitmap (a.k.a. set representation) occupies only 512 MiB (2^32 bits / 8 bits/byte = 2^29 bytes).
The most compact number-based representation with fixed cell size would be storing the halved distance between consecutive odd primes, since up to about 2^40 the halved distance fits into a byte. At 193 MiB for the primes up to 2^32 this is slightly smaller than an odds-only bitmap but it is only efficient for sequential processing. For sieving it is not suitable because, as Anatolijs has pointed out, algorithms like the Sieve of Eratosthenes effectively require a set representation.
The bitmap can be shrunk drastically by leaving out the multiples of small primes. Most famous is the odds-only representation that leaves out the number 2 and its multiples; this halves the space requirement to 256 MiB at virtually no cost in added code complexity. You just need to remember to pull the number 2 out of thin air when needed, since it isn't represented in the sieve.
Even more space can be saved by leaving out multiples of more small primes; this generalisation of the 'odds-only' trick is usually called 'wheeled storage' (see Wheel Factorization in the Wikipedia). However, the gain from adding more small primes to the wheel gets smaller and smaller whereas the wheel modulus ('circumference') increases explosively. Adding 3 removes 1/3rd of the remaining numbers, adding 5 removes a further 1/5th, adding 7 only gets you a further 1/7th and so on.
Here's an overview of what adding another prime to the wheel can get you. 'ratio' is the size of the wheeled/reduced set relative to the full set that represents every number; 'delta' gives the shrinkage compared to the previous step. 'spokes' refers to the number of prime-bearing spokes which need to be represented/stored; the total number of spokes for a wheel is of course equal to its modulus (circumference).
The mod 30 wheel (about 136 MiB for the primes up to 2^32) offers an excellent cost/benefit ratio because it has eight prime-bearing spokes, which means that there is a one-to-one correspondence between wheels and 8-bit bytes. This enables many efficient implementation tricks. However, its cost in added code complexity is considerable despite this fortuitous circumstance, and for many purposes the odds-only sieve ('mod 2 wheel') gives the most bang for buck by far.
There are two additional considerations worth keeping in mind. The first is that data sizes like these often exceed the capacity of memory caches by a wide margin, so that programs can often spend a lot of time waiting for the memory system to deliver the data. This is compounded by the typical access patterns of sieving - striding over the whole range, again and again and again. Speedups of several orders of magnitude are possible by working the data in small batches that fit into the level-1 data cache of the processor (typically 32 KiB); lesser speedups are still possible by keeping within the capacity of the L2 and L3 caches (a few hundred KiB and a few MiB, respectively). The keyword here is 'segmented sieving'.
The second consideration is that many sieving tasks - like the famous SPOJ PRIME1 and its updated version PRINT (with extended bounds and tightened time limit) - require only the small factor primes up to the square root of the upper limit to be permanently available for direct access. That's a comparatively small number: 3512 when sieving up to 2^31 as in the case of PRINT.
Since these primes have already been sieved there's no need for a set representation anymore, and since they are few there are no problems with storage space. This means they are most profitably kept as actual numbers in a vector or list for easy iteration, perhaps with additional, auxiliary data like current working offset and phase. The actual sieving task is then easily accomplished via a technique called 'windowed sieving'. In the case of PRIME1 and PRINT this can be several orders of magnitude faster than sieving the whole range up to the upper limit, since both tasks only asks for a small number of subranges to be sieved.

You can do that (remove numbers that are detected to be non-prime from your array/linked list), but then time complexity of algorithm will degrade to O(N^2/log(N)) or something like that, instead of original O(N*log(N)). This is because you will not be able to say "the numbers 2X, 3X, 4X, ..." are not prime anymore. You will have to loop through your entire compressed list.

You could erase each composite number from the array/vector once you have shown it to be composite. Or when you fill an array of numbers to put through the sieve, remove all even numbers (other than 2) and all numbers ending in 5.

If you have studied the sieve right, you must know we don't have the primes to begin with. We have an array, sizeof which is equal to the range. Now, if you want the range to be 10e9, you want this to be the size of the array. You have mentioned no language, but for each number, you must need a bit to represent whether it is prime or not.
Even that means you need 10^9 bits = 1.125 * 10^8 bytes which is greater than 100 MB of RAM.
Assuming you have all this, most optimized sieve takes O(n * log(log n)) time, which is, if n = 10e9, on a machine that evaluates 10e8 instructions per second, will still take some minutes.
Now, assuming you have all this with you, still, number of primes till 10e9 is q = 50,847,534, to save these will still take q * 4 bytes, which is still greater than 100MB. (more RAM)
Even if you remove the indexes which are multiples of 2, 3 or 5, this removes 21 numbers in every 30. This is not good enough, because in total, you will still need around 140 MB space. (40MB = a third of 10^9 bits + ~100MB for storing prime numbers).
So, since, for storing the primes, you will, in any case require similar amount of memory (of the same order as calculation), your question, IMO has no solution.

You can halve the size of the sieve by only 'storing' the odd numbers. This requires code to explicitly deal with the case of testing even numbers. For odd numbers, bit b of the sieve represents n = 2b + 3. Hence bit 0 represents 3, bit 1 represents 5 and so on. There is a small overhead in converting between the number n and the bit index b.
Whether this technique is any use to you depends on the memory/speed balance you require.

Efficient way to multiply a large set of small numbers

This question was asked in an interview.
You have an array of small integers. You have to multiply all of them. You need not worry about overflow you have ample support for that. What can you do to speed up the multiplication on your machine?
Would multiple additions be better in this case?
I suggested multiplying using a divide and conquer approach but the interviewer was not impressed. What could be the best possible solution for this?

Here are some thoughts:
Divide-and-Conquer with Multithreading: Split the input apart into n different blocks of size b and recursively multiply all the numbers in each block together. Then, recursively multiply all n / b blocks back together. If you have multiple cores and can run parts of this in parallel, you could save a lot of time overall.
Word-Level Parallelism: Let's suppose that your numbers are all bounded from above by some number U, which happens to be a power of two. Now, suppose that you want to multiply together a, b, c, and d. Start off by computing (4U2a + b) × (4U2c + d) = 16U4ac + 4U2ad + 4U2bc + bd. Now, notice that this expression mod U2 is just bd. (Since bd < U2, we don't need to worry about the mod U2 step messing it up). This means that if we compute this product and take it mod U2, we get back bd. Since U2 is a power of two, this can be done with a bitmask.
Next, notice that
4U2ad + 4U2bc + bd < 4U4 + 4U4 + U2 < 9U4 < 16U4
This means that if we divide the entire expression by 16U4 and round down, we will end up getting back just ad. This division can be done with a bitshift, since 16U4 is a power of two.
Consequently, with one multiplication, you can get back the values of both ac and bd by applying a subsequent bitshift and bitmask. Once you have ac and bd, you can directly multiply them together to get back the value of abcd. Assuming that bitmasks and bitshifts are faster than multiplies, this reduces the number of multiplications necessary by 33% (two instead of three here).
Hope this helps!

Your divide and conquer suggestion was a good start. It just needed more explanation to impress.
With fast multiplication algorithms used to multiply large numbers (big-ints), it is much more efficient to multiply similar sized multiplicands than a series of mismatched sizes.
Here's an example in Clojure
; Define a vector of 100K random integers between 2 and 100 inclusive
(def xs (vec (repeatedly 100000 #(+ 2 (rand-int 99)))))
; Naive multiplication accumulating linearly through the array
(time (def a1 (apply *' xs)))
"Elapsed time: 7116.960557 msecs"
; Simple Divide and conquer algorithm
(defn d-c [v]
(let [m (quot (count v) 2)]
(if (< m 3)
(reduce *' v)
(*' (d-c (subvec v 0 m)) (d-c (subvec v m))))))
; Takes less than 1/10th the time.
(time (def a2 (d-c xs)))
"Elapsed time: 600.934268 msecs"
(= a1 a2) ;=> true (same result)
Note that this improvement does not rely on a set limit for the size of the integers in the array (100 chosen arbitrarily and to demonstrate the next algorithm), but only that they be similar in size. This is a very simple divide an conquer. As the numbers get larger and more expensive to multiply, it would make sense to invest more time in iteratively grouping them by similar size. Here I am relying on random distribution and chance that the sizes will stay similar, but it is still going to be significantly better than the naive approach even for the worst case.
As suggested by Evgeny Kluev in the comments, for a large number of small integers, there is going to be a lot of duplication, so efficient exponentiation is also better than naive multiplication. This depends a lot more on the relative parameters than the divide and conquer, that is the numbers must be sufficiently small relative to the count for enough duplicates to accumulate to bother, but certainly performs well with these parameters (100K numbers in the range 2-100).
; Hopefully an efficient implementation
(defn pow [x n] (.pow (biginteger x) ^Integer n))
; Perform pow on duplications based on frequencies
(defn exp-reduce [v] (reduce *' (map (partial apply pow) (frequencies v))))
(time (def a3 (exp-reduce xs)))
"Elapsed time: 650.211789 msecs"
Note the very simple divide and conquer performed just a wee better in this trial, but would be even relatively better if fewer duplicates were expected.
Of course we can also combine the two:
(defn exp-d-c [v] (d-c (mapv (partial apply pow) (frequencies v))))
(time (def a4 (exp-d-c xs)))
"Elapsed time: 494.394641 msecs"
(= a1 a2 a3 a4) ;=> true (all equal)
Note there are better ways to combine these two since the result of the exponentiation step is going to result in various sizes of multiplicands. The value of added complexity to do so depends on the expected number of distinct numbers in the input. In this case, there are very few distinct numbers so it wouldn't pay to add much complexity.
Note also that both of these are easily parallelized if multiple cores are available.

If many of the small integers occur multiple times, you could start by counting every unique integer. If c(n) is the number of occurrences of integer n, the product can be computed as
P = 2 ^ c(2) * 3 ^ c(3) * 4 ^ c(4) * ...
For the exponentiation steps, you can use exponentiation by squaring which can reduce the number of multiplications considerably.

If the count of numbers really is large compared to the range, then we have seen two asymptotic solutions presented to reduce the complexity considerably. One was based on successive squaring to compute c^k in O(log k) time for each number c, giving O(C mean(log k)) time if the largest number is C and k gives the exponent for each number between 1 and C. The mean(log k) term is maximized if every number appears an equal number of times, so if you have N numbers then the complexity becomes O(C log(N/C)), which is very weakly dependent on N and essentially just O(C) where C specifies the range of numbers.
The other approach we saw was sorting numbers by the number of times they appear, and keeping track of the product of leading numbers (starting with all numbers) and raising this to a power so that the least frequent number is removed from the array, and then updating the exponents on the remaining element in the array and repeating. If all numbers occur the same number of times K, then this gives O(C + log K) which is an improvement over O(C log K). However, say the kth number appears 2^k times. Then this will still give O(C^2 + C log(N/C)) time which is technically worse than the previous method O(C log(N/C)) if C > log(N/C). Thus, if you don't have good information on how evenly distributed the occurrences of each number are, you should go with the first approach, just take the appropriate power of each distinct number that appears in the product by using successive squaring, and take the product of the results. Total time O(C log (N/C)) if there are C distinct numbers and N total numbers.

To answer this question we need to interpret in some way the assumption from OP: need not worry about overflow. In larger part of this answer it is interpreted as "ignore overflows". But I start with several thoughts about other interpretation ("use multiprecision arithmetic"). In this case process of multiplication may be approximately split into 3 stages:
Multiply together small sets of small numbers to get a large set of not-so-small numbers. Some of the ideas from second part of this answer may be used here.
Multiply together these numbers to get a set of large numbers. Either trivial (quadratic time) algorithm or Toom–Cook/Karatsuba (sub-quadratic time) methods may be used.
Multiply together large numbers. Either Fürer's or Schönhage–Strassen algorithm may be used. Which gives O(N polylog N) time complexity for the whole process.
Binary exponentiation may give some (not very significant) speed improvement, because most (if not every) of complex multiplication algorithms mentioned here do squaring faster than multiplication of two unequal numbers. Also we could factorize every "small" number and use binary exponentiation only for prime factors. For evenly distributed "small" numbers this will decrease number of exponentiations by factor log(number_of_values) and slightly improve balance of squarings/multiplications.
Divide and conquer is OK when numbers are evenly distributed. Otherwise (for example when input array is sorted or when binary exponentiation is used) we could do better by placing all multiplicands into priority queue, ordered (may be approximately ordered) by number length. Then we could multiply two shortest numbers and place the result back to the queue (this process is very similar to Huffman encoding). There is no need to use this queue for squaring. Also we should not use it while numbers are not long enough.
More information on this could be found in the answer by A. Webb.
If overflows may be ignored we could multiply the numbers with linear-time or better algorithms.
Sub-linear time algorithm is possible if input array is sorted or input data is presented as set of tuples {value, number of occurrences}. In latter case we could perform binary exponentiation of each value and multiply results together. Time complexity is O(C log(N/C)), where C is number of different values in the array. (See also this answer).
If input array is sorted, we could use binary search to find positions where value changes. This allows to determine how many times each value occurs in the array. Then we could perform binary exponentiation of each value and multiply results together. Time complexity is O(C log N). We could do better by using one-sided binary search here. In this case time complexity is O(C log(N/C)).
But if input array is not sorted, we have to inspect each element, so O(N) time complexity is the best we can do. Still we could use parallelism (multithreading, SIMD, word-level parallelism) to obtain some speed improvement. Here I compare several such approaches.
To compare these approaches I've chosen very small (3-bit) values, which are pretty tightly packed (one value per 8-bit integer). And implemented them in low-level language (C++11) to get easier access to bit manipulation, specific CPU instructions, and SIMD.
Here are all the algorithms:
accumulate from standard library.
Trivial implementation with 4 accumulators.
Word-level parallelism for multiplication, as described in the answer by templatetypedef. With 64-bit word size this approach allows up to 10-bit values (with only 3 multiplications instead of each 4) or it may be applied twice (and I did it in the tests) with up to 5-bit values (requiring only 5 multiplications out of each 8).
Table lookup. In the tests 7 multiplications out of each 8 are substituted by single table lookup. With values larger than in these tests, number of substituted multiplications decreases, slowing down the algorithm. Values larger than 11-12 bits make this approach useless.
Binary exponentiation (see more details below). Values larger than 4 bits make this approach useless.
SIMD (AVX2). This implementation can use up to 8-bit values.
Here are sources for all tests on Ideone. Note that SIMD test requires AVX2 instruction set from Intel. Table lookup test requires BMI2 instruction set. Other tests do not depend on any particular hardware (I hope). I run these tests on 64-bit Linux, compiled with gcc 4.8.1, optimization level -O2.
Here are some more details for binary exponentiation test:
for (size_t i = 0; i < size / 8; i += 2)
{
auto compr = (qwords[i] << 4) | qwords[i + 1];
constexpr uint64_t lsb = 0x1111111111111111;
if ((compr & lsb) != lsb) // if there is at least one even value
{
auto b = reinterpret_cast<uint8_t*>(qwords + i);
acc1 *= accumulate(b, b + 16, acc1, multiplies<unsigned>{});
if (!acc1)
break;
}
else
{
const auto b2 = compr & 0x2222222222222222;
const auto b4 = compr & 0x4444444444444444;
const auto b24 = b4 & (b2 * 2);
const unsigned c7 = __builtin_popcountll(b24);
acc3 += __builtin_popcountll(b2) - c7;
acc5 += __builtin_popcountll(b4) - c7;
acc7 += c7;
}
}
const auto prod4 = acc1 * bexp<3>(acc3) * bexp<5>(acc5) * bexp<7>(acc7);
This code packs values even more densely than in the input array: two values per byte. Low-order bit of each value is handled differently: since we could stop after 32 zero bits is found here (with result "zero"), this case cannot alter performance very much, so it is handled by simplest (library) algorithm.
Out of 4 remaining values, "1" is not interesting, so we need to count only occurrences of "3", "5" , and "7" with bitwise manipulations and "population count" intrinsic.
Here are the results:
source size: 4 Mb: 400 Mb:
1. accumulate: 0.835392 ns 0.849199 ns
2. accum * 4: 0.290373 ns 0.286915 ns
3. 2 mul in 1: 0.178556 ns 0.182606 ns
4. mult table: 0.130707 ns 0.176102 ns
5. binary exp: 0.128484 ns 0.119241 ns
6. AVX2: 0.0607049 ns 0.0683234 ns
Here we can see that accumulate library algorithm is pretty slow: for some reason gcc could not unroll the loop and use 4 independent accumulators.
It is not too difficult to do this optimization "by hand". The result is not particularly fast. But if we allocate 4 threads for this task, CPU would approximately match memory bandwidth (2 channels, DDR3-1600).
Word-level parallelism for multiplications is almost twice as fast. (We'll need only 3 threads to match memory bandwidth).
Table lookup is even faster. But its performance degrades when input array cannot fit in L3 cache. (We'll need 3 threads to match memory bandwidth).
Binary exponentiation has approximately the same speed. But with larger inputs this performance does not degrade, it even slightly improves because exponentiation itself uses less time compared to value counting. (We'll need only 2 threads to match memory bandwidth).
As expected, SIMD is the fastest. Its performance slightly degrades when input array cannot fit in L3 cache. Which means we are close to memory bandwidth with single thread.

I have one solution. Let us discuss it with other solutions.
The key part of question is how to reduce times of multiply. And integers are small but set is big.
My solution:
use an small array to record how many times each number appears.
Remove number 1 from array. You don't need to count it.
Find the number which appears least times n. Then multiply all numbers and get result K. Then count K^n.
Remove this number (For instance, you can switch it with the last number of array and reduce size of array for 1). So next time you won't consider this number any more. At same time, the appearance times of other numbers need to be reduced with the times of removed number.
Once again get the number which appears least times. Do same thing as step 2.
Repeatedly do step 2-4 and complete counting.
Let me use an example to show how many multiply we need to do: Assume
we have 5 numbers [1, 2, 3, 4, 5]. Number 1 appears 100 times, number
2 appears 150 times, number 3 appears 200 times, number 4 appears 300
times, and number 5 appears 400 times.
method 1: multiply it directly or use divide and conquer
we need 100+150+200+300+400-1 = 1149 multiply to get result.
method 2: we do (1^100)(2^150)(3^200)(4^300)(5^400)
(100-1)+(150-1)+(200-1)+(300-1)+(400-1)+4 = 1149.[same as method 1]
Cause n^m will do m-1 multiply in deed. Plus you need time to go through all numbers, though this time is short.
method in this post:
First, you need time to go through all numbers. It can be discarded compare to time of multiply.
The real counting you are making is:
((2*3*4*5)^150)*((3*4*5)^50)*((4*5)^100)*(5^100)
Then you need to do multiply 3+149+2+49+1+99+99+3 = 405 times

Programming Pearls: find one integer appears at least twice

It's in the section 2.6 and problem 2, the original problem is like this:
"Given a sequential file containing 4,300,000,000 32-bit integers, how can you find one that appears at least twice?"
My question toward this exercise is that: what is the tricks of the above problem and what kind of general algorithm category this problem is in?

Create a bit array of length 2^32 bits (initialize to zero), that would be about 512MB and will fit into RAM on any modern machine.
Start reading the file, int by int, check bit with the same index as the value of the int, if the bit is set you have found a duplicate, if it is zero, set to one and proceed with the next int from the file.
The trick is to find a suitable data structure and algorithm. In this case everything fits into RAM with a suitable data structure and a simple and efficient algorithm can be used.
If the numbers are int64 you need to find a suitable sorting strategy or make multiple passes, depending on how much additional storage you have available.

The Pigeonhole Principle -- If you have N pigeons in M pigeonholes, and N>M, there are at least 2 pigeons in a hole. The set of 32-bit integers are our 2^32 pigeonholes, the 4.3 billion numbers in our file are the pigeons. Since 4.3x10^9 > 2^32, we know there are duplicates.
You can apply this principle to test if a duplicate we're looking for is in a subset of the numbers at the cost of reading the whole file, without loading more than a little at a time into RAM-- just count the number of times you see a number in your test range, and compare to the total number of integers in that range. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:
int pigeons = 0;
int pigeonholes = 2000000 - 1000000 + 1; // include both fenceposts
for (each number N in file) {
if ( N >= 1000000 && N <= 2000000 ) {
pigeons++
}
}
if (pigeons > pigeonholes) {
// one of the duplicates is between 1,000,000 and 2,000,000
// try again with a narrower range
}
Picking how big of range(s) to check vs. how many times you want to read 16GB of data is up to you :)
As far as a general algorithm category goes, this is a combinatorics (math about counting) problem.

If what do you mean is 32 bit positive integers,
I think this problem doesn't require some special algorithm
or trick to solve. Just a simple observation will lead to the intended solution.
My observation goes like this, the sequential file will contain only
32 bit integers (which is from 0 to 2 ^ 31 - 1). Assume you put all of them
in that file uniquely, you will end up with 2 ^ 31 lines. You can see
that if you put those positive integers once again, you will end up with 2 ^ 31 * 2 lines
and it is smaller than 4,300,000,000.
Thus, the answer is the whole positive integers ranging from 0 to 2 ^ 31 - 1.

Sort the integers and loop through them to see if consecutive integers are duplicates. If you want to do this in memory, it requires 16GB memory that is possible with todays machines. If this is not possible, you could sort the numbers using mergesort and by store intermediate arrays to disk.
My first implementation attempt would be to use sort and uniq commands from unix.

Programming problem - Game of Blocks

maybe you would have an idea on how to solve the following problem.
John decided to buy his son Johnny some mathematical toys. One of his most favorite toy is blocks of different colors. John has decided to buy blocks of C different colors. For each color he will buy googol (10^100) blocks. All blocks of same color are of same length. But blocks of different color may vary in length.
Jhonny has decided to use these blocks to make a large 1 x n block. He wonders how many ways he can do this. Two ways are considered different if there is a position where the color differs. The example shows a red block of size 5, blue block of size 3 and green block of size 3. It shows there are 12 ways of making a large block of length 11.
Each test case starts with an integer 1 ≤ C ≤ 100. Next line consists c integers. ith integer 1 ≤ leni ≤ 750 denotes length of ith color. Next line is positive integer N ≤ 10^15.
This problem should be solved in 20 seconds for T <= 25 test cases. The answer should be calculated MOD 100000007 (prime number).
It can be deduced to matrix exponentiation problem, which can be solved relatively efficiently in O(N^2.376*log(max(leni))) using Coppersmith-Winograd algorithm and fast exponentiation. But it seems that a more efficient algorithm is required, as Coppersmith-Winograd implies a large constant factor. Do you have any other ideas? It can possibly be a Number Theory or Divide and Conquer problem

Firstly note the number of blocks of each colour you have is a complete red herring, since 10^100 > N always. So the number of blocks of each colour is practically infinite.
Now notice that at each position, p (if there is a valid configuration, that leaves no spaces, etc.) There must block of a color, c. There are len[c] ways for this block to lie, so that it still lies over this position, p.
My idea is to try all possible colors and positions at a fixed position (N/2 since it halves the range), and then for each case, there are b cells before this fixed coloured block and a after this fixed colour block. So if we define a function ways(i) that returns the number of ways to tile i cells (with ways(0)=1). Then the number of ways to tile a number of cells with a fixed colour block at a position is ways(b)*ways(a). Adding up all possible configurations yields the answer for ways(i).
Now I chose the fixed position to be N/2 since that halves the range and you can halve a range at most ceil(log(N)) times. Now since you are moving a block about N/2 you will have to calculate from N/2-750 to N/2-750, where 750 is the max length a block can have. So you will have to calculate about 750*ceil(log(N)) (a bit more because of the variance) lengths to get the final answer.
So in order to get good performance you have to through in memoisation, since this inherently a recursive algorithm.
So using Python(since I was lazy and didn't want to write a big number class):
T = int(raw_input())
for case in xrange(T):
#read in the data
C = int(raw_input())
lengths = map(int, raw_input().split())
minlength = min(lengths)
n = int(raw_input())
#setup memoisation, note all lengths less than the minimum length are
#set to 0 as the algorithm needs this
memoise = {}
memoise[0] = 1
for length in xrange(1, minlength):
memoise[length] = 0
def solve(n):
global memoise
if n in memoise:
return memoise[n]
ans = 0
for i in xrange(C):
if lengths[i] > n:
continue
if lengths[i] == n:
ans += 1
ans %= 100000007
continue
for j in xrange(0, lengths[i]):
b = n/2-lengths[i]+j
a = n-(n/2+j)
if b < 0 or a < 0:
continue
ans += solve(b)*solve(a)
ans %= 100000007
memoise[n] = ans
return memoise[n]
solve(n)
print "Case %d: %d" % (case+1, memoise[n])
Note I haven't exhaustively tested this, but I'm quite sure it will meet the 20 second time limit, if you translated this algorithm to C++ or somesuch.
EDIT: Running a test with N = 10^15 and a block with length 750 I get that memoise contains about 60000 elements which means non-lookup bit of solve(n) is called about the same number of time.

A word of caution: In the case c=2, len1=1, len2=2, the answer will be the N'th Fibonacci number, and the Fibonacci numbers grow (approximately) exponentially with a growth factor of the golden ratio, phi ~ 1.61803399. For the
huge value N=10^15, the answer will be about phi^(10^15), an enormous number. The answer will have storage
requirements on the order of (ln(phi^(10^15))/ln(2)) / (8 * 2^40) ~ 79 terabytes. Since you can't even access 79
terabytes in 20 seconds, it's unlikely you can meet the speed requirements in this special case.
Your best hope occurs when C is not too large, and leni is large for all i. In such cases, the answer will
still grow exponentially with N, but the growth factor may be much smaller.
I recommend that you first construct the integer matrix M which will compute the (i+1,..., i+k)
terms in your sequence based on the (i, ..., i+k-1) terms. (only row k+1 of this matrix is interesting).
Compute the first k entries "by hand", then calculate M^(10^15) based on the repeated squaring
trick, and apply it to terms (0...k-1).
The (integer) entries of the matrix will grow exponentially, perhaps too fast to handle. If this is the case, do the
very same calculation, but modulo p, for several moderate-sized prime numbers p. This will allow you to obtain
your answer modulo p, for various p, without using a matrix of bigints. After using enough primes so that you know their product
is larger than your answer, you can use the so-called "Chinese remainder theorem" to recover
your answer from your mod-p answers.

I'd like to build on the earlier #JPvdMerwe solution with some improvements. In his answer, #JPvdMerwe uses a Dynamic Programming / memoisation approach, which I agree is the way to go on this problem. Dividing the problem recursively into two smaller problems and remembering previously computed results is quite efficient.
I'd like to suggest several improvements that would speed things up even further:
Instead of going over all the ways the block in the middle can be positioned, you only need to go over the first half, and multiply the solution by 2. This is because the second half of the cases are symmetrical. For odd-length blocks you would still need to take the centered position as a seperate case.
In general, iterative implementations can be several magnitudes faster than recursive ones. This is because a recursive implementation incurs bookkeeping overhead for each function call. It can be a challenge to convert a solution to its iterative cousin, but it is usually possible. The #JPvdMerwe solution can be made iterative by using a stack to store intermediate values.
Modulo operations are expensive, as are multiplications to a lesser extent. The number of multiplications and modulos can be decreased by approximately a factor C=100 by switching the color-loop with the position-loop. This allows you to add the return values of several calls to solve() before doing a multiplication and modulo.
A good way to test the performance of a solution is with a pathological case. The following could be especially daunting: length 10^15, C=100, prime block sizes.
Hope this helps.

In the above answer
ans += 1
ans %= 100000007
could be much faster without general modulo :
ans += 1
if ans == 100000007 then ans = 0

Please see TopCoder thread for a solution. No one was close enough to find the answer in this thread.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio