I am going through an algorithms and data structures textbook and came across this question:
1-28. Write a function to perform integer division without using
either the / or * operators. Find a fast way to do it.
How can we come up with a fast way to do it?
I like this solution: https://stackoverflow.com/a/34506599/1008519, but I find it somewhat hard to reason about (especially the |-part). This solution makes a little more sense in my head:
var divide = function (dividend, divisor) {
    // Handle 0 divisor
    if (divisor === 0) {
        return NaN;
    }
    // Handle negative numbers
    var isNegative = false;
    if (dividend < 0) {
        // Change sign
        dividend = ~dividend + 1;
        isNegative = !isNegative;
    }
    if (divisor < 0) {
        // Change sign
        divisor = ~divisor + 1;
        isNegative = !isNegative;
    }
    /**
     * Main algorithm
     */
    var result = 1;
    var denominator = divisor;
    // Double denominator value with bitwise shift until bigger than dividend
    while (dividend > denominator) {
        denominator <<= 1;
        result <<= 1;
    }
    // Subtract divisor value until denominator is smaller than dividend
    while (denominator > dividend) {
        denominator -= divisor;
        result -= 1;
    }
    // If one of dividend or divisor was negative, change sign of result
    if (isNegative) {
        result = ~result + 1;
    }
    return result;
}
We initialize our result to 1 (since we are going to double our denominator until it is bigger than the dividend)
Double our denominator (with bitwise shifts) until it is bigger than the dividend
Since we know our denominator is now bigger than our dividend, we can subtract the divisor from it until it is smaller than the dividend again, decrementing the result as we go
Return the result, since the denominator is now as close to the dividend as possible using multiples of the divisor
Here are some test runs:
console.log(divide(-16, 3)); // -5
console.log(divide(16, 3)); // 5
console.log(divide(16, 33)); // 0
console.log(divide(16, 0)); // NaN
console.log(divide(384, 15)); // 25
Here is a gist of the solution: https://gist.github.com/mlunoe/e34f14cff4d5c57dd90a5626266c4130
Typically, when an algorithms textbook says fast it means in terms of computational complexity: the number of operations performed as a function of the input size in bits. In general, they don't care about constants, so if you have an input of n bits, whether it takes two operations per bit or a hundred operations per bit, we say the algorithm takes O(n) time.
Why does this matter? Suppose we have an algorithm that runs in O(n^2) time (polynomial; in this case, quadratic time), and imagine an O(n) algorithm that does 100 operations per bit compared to our algorithm, which may do 1 operation per bit. Once the input size passes 100 bits, the polynomial algorithm starts to slow down really quickly compared to the other one. Essentially, you can picture two curves, y = 100x and y = x^2. Your teacher probably made you do an exercise in algebra (maybe it was calculus?) where you have to say which one is bigger as x approaches infinity; this is actually a key concept in divergence/convergence in calculus, if you have gotten there already in mathematics. Regardless, with a little algebra, you can see the graphs intersect at x = 100, and y = x^2 is larger for all points where x is greater than 100.
As far as most textbooks are concerned, O(n lg n) or better is considered "fast". One example of a really bad algorithm for this problem would be the following:
crappyMultiplicationAlg(int a, int b)
    int product = 0
    while (b > 0)
        product = product + a
        b = b - 1
    return product
This algorithm basically uses b as a counter and just keeps adding a to an accumulator each time b counts down. To calculate how "fast" the algorithm is (in terms of algorithmic complexity) we count how many times its different components run. In this case, we only have a loop and some initialization (which is negligible in this case; ignore it). How many times does the loop run? You may be saying "Hey, guy! It only runs 'b' times! That may not even be half the input. That's way better than O(n) time!"
The trick here is that we are concerned with the size of the input in terms of storage... and we all (should) know that to store a number n, we need about lg n bits. In other words, with x bits we can store any (unsigned) number up to (2^x)-1. As a result, if we are using a standard 4-byte integer, that number could be up to 2^32 - 1, which is a number well into the billions, if my memory serves me right. If you don't trust me, run this algorithm with a number like 10,000,000 and see how long it takes. Still not convinced? Use a long to use a number like 1,000,000,000.
Since you didn't ask for help with the algorithm, I'll leave it for you as a homework exercise (not trying to be a jerk, I am a total geek and love algorithm problems). If you need help with it, feel free to ask! I already typed up some hints by accident since I didn't read your question properly at first.
EDIT: I accidentally wrote a crappy multiplication algorithm. An example of a really terrible division algorithm (I cheated) would be:
AbsolutelyTerribleDivisionAlg(int a, int b)
    int quotient = 0
    while crappyMultiplicationAlg(b, quotient + 1) <= a
        quotient = quotient + 1
    return quotient
This algorithm is bad for a whole bunch of reasons, not the least of which is the use of my crappy multiplication algorithm (which will be called more than once even on a relatively "tame" run). Even if we were allowed to use the * operator though, this is still a really bad algorithm, largely due to the same mechanism used in my awful mult alg.
PS There may be a fence-post error or two in my two algs... I posted them more for conceptual clarity than correctness. No matter how accurate they are at doing multiplication or division, though, never use them. They will give your laptop herpes and then cause it to burn up in a sulfur-y implosion of sadness.
I don't know what you mean by fast... and this seems like a basic question to test your thought process.
A simple function can use a counter and keep subtracting the divisor from the dividend until the dividend would go negative. This is an O(n) process, where n is the quotient.
int divide(int n, int d){
    int c = 0;
    while(1){
        n -= d;
        if(n >= 0)
            c++;
        else
            break;
    }
    return c;
}
Another way is to use the shift operator, which should do it in log(n) steps.
int divide(int n, int d){
    if(d <= 0)
        return -1;
    int k = d;
    int c = 0, index = 1;
    while(n > d){
        d <<= 1;
        index <<= 1;
    }
    while(1){
        if(k > n)
            return c;
        if(n >= d){
            c |= index;
            n -= d;
        }
        index >>= 1;
        d >>= 1;
    }
}
This is just like the integer long division we do in high-school mathematics.
PS: If you need a better explanation, I will provide one. Just post that in the comments.
EDIT: edited the code wrt Erobrere's comment.
The simplest way to perform a division is by successive subtractions: subtract b from a as long as a remains positive. The quotient is the number of subtractions performed.
This can be pretty slow, as you will perform q subtractions and tests.
With a=28 and b=3,
28-3-3-3-3-3-3-3-3-3=1
the quotient is 9 and the remainder 1.
The next idea that comes to mind is to subtract b several times in a single go. We can try with 2b or 4b or 8b... as these numbers are easy to compute with additions. We can go as far as possible as long as the multiple of b does not exceed a.
In the example, 2³·3 is the largest multiple which is possible:
28 ≥ 2³·3
So we subtract 8 times 3 in a single go, getting
28 - 2³·3 = 4
Now we continue to reduce the remainder with the lower multiples, 2², 2 and 1, when possible:
4 - 2²·3 < 0
4 - 2·3 < 0
4 - 1·3 = 1
Then our quotient is 2³ + 1 = 9 and the remainder 1.
As you can check, every multiple of b is tried once only, and the total number of attempts equals the number of doublings required to reach a. This number is just the number of bits required to write q, which is much smaller than q itself.
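Here is that idea as a sketch in C (my own illustration of the answer above, not the poster's code; it assumes unsigned operands and a nonzero divisor):
unsigned div_doubling(unsigned a, unsigned b)
{
    unsigned q = 0, m = b, bit = 1;
    if (a < b)
        return 0;
    /* double b until the next doubling would exceed a */
    while (a - m >= m) {      /* i.e. 2*m <= a, written without overflow */
        m <<= 1;
        bit <<= 1;
    }
    /* try each multiple b*2^k exactly once, from largest to smallest */
    while (bit) {
        if (a >= m) {
            a -= m;           /* subtract b*2^k in a single go */
            q += bit;         /* record 2^k in the quotient */
        }
        m >>= 1;
        bit >>= 1;
    }
    return q;                 /* div_doubling(28, 3) == 9, remainder left in a */
}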
This is not the fastest solution, but I think it's readable enough and works:
def weird_div(dividend, divisor):
    if divisor == 0:
        return None
    dend = abs(dividend)
    dsor = abs(divisor)
    result = 0
    # This is the core algorithm, the rest is just for ensuring it works with negatives and 0
    while dend >= dsor:
        dend -= dsor
        result += 1
    # Let's handle negative numbers too
    if (dividend < 0 and divisor > 0) or (dividend > 0 and divisor < 0):
        return -result
    else:
        return result
# Let's test it:
print("49 divided by 7 is {}".format(weird_div(49,7)))
print("100 divided by 7 is {} (Discards the remainder) ".format(weird_div(100,7)))
print("-49 divided by 7 is {}".format(weird_div(-49,7)))
print("49 divided by -7 is {}".format(weird_div(49,-7)))
print("-49 divided by -7 is {}".format(weird_div(-49,-7)))
print("0 divided by 7 is {}".format(weird_div(0,7)))
print("49 divided by 0 is {}".format(weird_div(49,0)))
It prints the following results:
49 divided by 7 is 7
100 divided by 7 is 14 (Discards the remainder)
-49 divided by 7 is -7
49 divided by -7 is -7
-49 divided by -7 is 7
0 divided by 7 is 0
49 divided by 0 is None
unsigned bitdiv (unsigned a, unsigned d)
{
    unsigned res, c;
    for (c = d; c <= a; c <<= 1) {;}
    for (res = 0; (c >>= 1) >= d; ) {
        res <<= 1;
        if (a >= c) { res++; a -= c; }
    }
    return res;
}
The pseudo code:
count = 0
while (dividend >= divisor)
    dividend -= divisor
    count++
// count is your answer
How to write two functions for generating random numbers that support next and previous?
I mean: how do I write two functions, next_number() and previous_number(), where next_number() generates a new random number and previous_number() returns the previously generated random number?
for example:
int next_number()
{
// ...?
}
int previous_number()
{
// ...?
}
int num;
// Forward random number generating.
// ---> 54, 86, 32, 46, 17
num = next_number(); // num = 54
num = next_number(); // num = 86
num = next_number(); // num = 32
num = next_number(); // num = 46
num = next_number(); // num = 17
// Backward random number generating.
// <--- 17, 46, 32, 86, 54
num = previous_number(); // num = 46
num = previous_number(); // num = 32
num = previous_number(); // num = 86
num = previous_number(); // num = 54
You can trivially do this with a Pseudo-Random Function (PRF).
Such functions take a key and a value, and output a pseudo-random number based on them. You'd select a key from /dev/random that remains the same for the run of the program, and then feed the function an integer that you increment to go forward or decrement to go back.
Here's an example in pseudo-code:
initialize():
    Key = sufficiently many bytes from /dev/random
    N = 0

next_number():
    N = N + 1
    return my_prf(Key, N)

previous_number():
    N = N - 1
    return my_prf(Key, N)
Strong, Pseudo-Random Functions are found in most cryptography libraries. As rici points out, you can also use any encryption function (encryption functions are pseudo-random permutations, a subset of PRFs, and the period is so huge that the difference doesn't matter).
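As a concrete sketch of the counter scheme in C (names are mine; mix64() is the SplitMix64 finalizer standing in for a real PRF, fine for illustration but not cryptographically strong):
#include <stdint.h>

static uint64_t Key;    /* fill once from /dev/random at startup */
static uint64_t N = 0;  /* the counter we move forward and backward */

/* stand-in PRF: a fixed 64-bit mixing function applied to Key ^ N */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x;
}

uint64_t next_number(void)     { return mix64(Key ^ ++N); }
uint64_t previous_number(void) { return mix64(Key ^ --N); }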
Some linear congruential generators (a common but not very good PRNG) are reversible.
They work by next = (a * previous + c) mod m. That's reversible if a has a modular multiplicative inverse mod m. That's often the case, because m is often a power of two and a is usually odd.
For example, for the "MSVC" parameters from the table in the first link:
m = 2^32
a = 214013
c = 2531011
The reverse is:
previous = (current - 2531011) * 0xb9b33155;
with types chosen to make it work modulo 2^32.
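A sketch in C, where uint32_t arithmetic supplies the mod 2^32 for free (constants as above):
#include <stdint.h>

static uint32_t state = 12345;  /* any seed */

uint32_t lcg_next(void)
{
    state = state * 214013u + 2531011u;        /* forward step */
    return state;
}

uint32_t lcg_prev(void)
{
    state = (state - 2531011u) * 0xB9B33155u;  /* inverse step */
    return state;
}
Calling lcg_next() and then lcg_prev() leaves state unchanged, because 214013 * 0xB9B33155 ≡ 1 (mod 2^32).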
Suppose you have a linear congruential sequence S defined by
S[0] = seed
S[i] = (p * S[i-1] + k) % m
for some p, m, k such that gcd(p, m) == 1. Then you can find q such that (p * q) % m == 1 and compute:
S[i-1] = (q * (S[i] - k)) % m
In other words: if you pick suitable p and precompute q, you can traverse your sequence in either order in O(1) time.
A reasonably simple way of generating an indexable pseudo-random sequence -- that is, a sequence which looks random, but can be traversed in either direction -- is to choose some (reasonably good) encryption algorithm and a fixed encryption key, and then define:
sequence(i): encrypt(i, known_key)
You don't need to know the value of i, because you can decrypt it from the number:
next(r): encrypt(decrypt(r, known_key) + 1, known_key)
prev(r): encrypt(decrypt(r, known_key) - 1, known_key)
Consequently, i does not have to be a small integer; since the only arithmetic you need to do on it is addition and subtraction of a small integer, a bignum implementation is trivial. So if you wanted 128-bit pseudorandom numbers, you could set the first i to be a 128-bit random number extracted from /dev/random.
You have to keep the entire value of i in static storage, and the period of the pseudorandom numbers cannot be greater than the range of i. That will be true of any solution to this problem, though: since the next() and prev() operators are required to be functions, every value has a unique successor and predecessor, and thus can only appear once in the cycle of values. That's quite different from the Mersenne twister, for example, whose cycle is much larger than 2^32.
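For illustration, here is a sketch with a toy 4-round Feistel network as the keyed permutation (my own stand-in, not a real cipher; any block cipher over the index space works the same way):
#include <stdint.h>

/* one keyed round function; the constants are arbitrary mixing values */
static uint16_t fround(uint16_t half, uint32_t key, int r)
{
    uint32_t x = half ^ (key + 0x9E3779B9u * (uint32_t)r);
    x *= 0x85EBCA6Bu;
    x ^= x >> 13;
    return (uint16_t)x;
}

static uint32_t encrypt(uint32_t v, uint32_t key)  /* permutes 32-bit values */
{
    uint16_t l = (uint16_t)(v >> 16), r = (uint16_t)v;
    for (int i = 0; i < 4; i++) {
        uint16_t t = r;
        r = l ^ fround(r, key, i);
        l = t;
    }
    return ((uint32_t)l << 16) | r;
}

static uint32_t decrypt(uint32_t v, uint32_t key)  /* runs the rounds backward */
{
    uint16_t l = (uint16_t)(v >> 16), r = (uint16_t)v;
    for (int i = 3; i >= 0; i--) {
        uint16_t t = l;
        l = r ^ fround(l, key, i);
        r = t;
    }
    return ((uint32_t)l << 16) | r;
}

uint32_t next(uint32_t r, uint32_t key) { return encrypt(decrypt(r, key) + 1, key); }
uint32_t prev(uint32_t r, uint32_t key) { return encrypt(decrypt(r, key) - 1, key); }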
I think what you are asking for is a random number generator that is deterministic. This does not quite make sense, because if it is deterministic it's not random. The only solution is to generate a list of random numbers and then step back and forward in this list.
PS! I know that essentially all software PRNGs are deterministic. You can of course use this to create the functionality you need, but don't fool yourself: it has nothing to do with randomness. If your software design requires a deterministic PRNG, then you could probably skip the PRNG part altogether.
I have a 16 bit number which I want to divide by 100. Let's say it's 50000. The goal is to obtain 500. However, I am trying to avoid inferred dividers on my FPGA because they break timing requirements. The result does not have to be accurate; an approximation will do.
I have tried hardware multiplication by 0.01 but real numbers are not supported. I'm looking at pipelined dividers now but I hope it does not come to that.
Conceptually: multiply by 655 (≈ 65536/100) and then shift right by 16 bits. Of course, in hardware, the shift right is free.
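In C terms, the whole answer is one multiply and a (free-in-hardware) shift. A sketch (note 655/65536 is slightly under 1/100, so e.g. 50000 gives 499 rather than 500, which is fine if an approximation will do):
#include <stdint.h>

uint16_t divideBy100(uint16_t input)
{
    return (uint16_t)(((uint32_t)input * 655u) >> 16);  /* approx input/100 */
}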
If you need it to be even faster, you can hardwire the divide as a sum of divisions by powers of two (shifts). E.g.,
1/100 ~= 1/128 = 0.0078125
1/100 ~= 1/128 + 1/256 = 0.01171875
1/100 ~= 1/128 + 1/512 = 0.009765625
1/100 ~= 1/128 + 1/512 + 1/2048 = 0.01025390625
1/100 ~= 1/128 + 1/512 + 1/4096 = 0.010009765625
etc.
In C code the last example above would be:
uint16_t divideBy100 (uint16_t input)
{
    return (input >> 7) + (input >> 9) + (input >> 12);
}
Assuming that
- the integer division is intended to truncate, not round (e.g. 599 / 100 = 5)
- it's OK to have a 16x16 multiplier in the FPGA (with a fixed value on one input)
then you can get exact values by implementing a 16x16 unsigned multiplier where one input is 0xA3D7 and the other input is your 16-bit number. Add 0x8000 to the 32-bit product, and your result is in the upper 10 bits.
In C code, the algorithm looks like this
uint16_t divideBy100( uint16_t input )
{
    uint32_t temp;
    temp = input;
    temp *= 0xA3D7;   // compute the 32-bit product of two 16-bit unsigned numbers
    temp += 0x8000;   // adjust the 32-bit product since 0xA3D7 is actually a little low
    temp >>= 22;      // the upper 10 bits are the answer
    return( (uint16_t)temp );
}
Generally, you can multiply by the inverse and shift. Compilers do this all the time, even for software.
Here is a page that does that for you: http://www.hackersdelight.org/magic.htm
In your case that seems to be multiplication by 0x51EB851F, followed by taking the upper 32 bits of the 64-bit product and shifting them right by 5.
And here is an explanation: Computing the Multiplicative Inverse for Optimizing Integer Division
Multiplying by the reciprocal is often a good approach, but as you have noted, real numbers are not supported. You need to work with fixed point rather than floating point.
Verilog has no built-in notion of fixed point; it just uses a word length, and you decide how many bits are integer and how many fractional.
0.01 rounds to 0.0098876953125 with 13 fractional bits, which in binary is 0_0000001010001. The bigger the word length, the greater the precision.
// 1 integer bit, 13 fractional bits
wire [13:0] ONE_HUNDREDTH = 14'b0_0000001010001;

input [15:0] a;                // integer (no fractional bits)
output reg [15+14:0] result;   // 13 fractional bits inherited from ONE_HUNDREDTH
output reg [15:0] result_int;  // integer result

always @* begin
    result     = ONE_HUNDREDTH * a;
    result_int = result >>> 13;
end
Real to binary conversion done using the ruby gem fixed_point.
A ruby irb session (with fixed_point installed via gem install fixed_point):
require 'fixed_point'
#Unsigned, 1 Integer bit, 13 fractional bits
format = FixedPoint::Format.new(0, 1, 13)
fix_num = FixedPoint::Number.new(0.01, format )
=> 0.0098876953125
fix_num.to_b
=> "0.0000001010001"
int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64-bit result of that. You can demonstrate this to yourself with, e.g.:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The Wikipedia page on Montgomery division is hard to read, but fortunately the compiler guys have done it so you don't have to.
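Reconstructed in C, the compiler's output corresponds to something like the sketch below (the function name is mine; the subtraction of n >> 31 is the sign fix-up that makes the result truncate toward zero for negative n):
#include <stdint.h>

int32_t div3(int32_t n)
{
    /* 0x55555556 == (2^32 + 2) / 3; keep the top 32 bits of the product */
    int32_t hi = (int32_t)(((int64_t)n * 0x55555556LL) >> 32);
    return hi - (n >> 31);   /* n >> 31 is -1 for negative n, 0 otherwise */
}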
This is the fastest option, as the compiler will optimize it if it can, depending on the target processor.
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the ranges of the values: for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a factor and shift it to the right by a power of 2, where the factor is that power of two divided by 3.
e.g.
Range 0 -> 768
You could use a shift of 10 bits, which corresponds to multiplying by 1024; since you want to divide by 3, your multiplier should be 1024 / 3 = 341,
so you can now use (x * 341) >> 10.
(Make sure the shift is an arithmetic shift if you are using signed integers, and that it is actually a shift and not a bit roll.) Note that 341/1024 slightly underestimates 1/3, so exact multiples of 3 can come out one low, e.g. (3 * 341) >> 10 gives 0; if you need exact results over this range, (x * 683) >> 11 works.
This will effectively divide the value by 3, and runs at about 1.6 times the speed of a natural divide by 3 on a standard x86 / x64 CPU.
Of course, the only reason you can make this optimization when the compiler can't is that the compiler does not know the maximum range of x and therefore cannot make this determination, but you as the programmer can.
Sometimes it may even be beneficial to move the value into a larger type and then do the same thing, i.e. if you have an int of full range you could make it a 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing: I needed to find the average of the 3 color channels (red, green, and blue), each with a byte range (0 - 255).
At first I simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 765 and a minimum of 0, since each channel is a byte, 0 - 255, which fits within the 0 -> 768 range above.)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
and that took it down to 22 milliseconds. It's amazing what can be done with a little ingenuity.
This speed-up occurred in C# even though I had optimisations turned on and was running the program natively, without debugging info and not through the IDE.
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and on your C compiler, a native solution like just using
y = x / 3
can be fast or it can be awfully slow (even when division is done entirely in hardware: a DIV instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C, int has no guaranteed size (it can vary by platform and compiler!), so you are better off using C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that your C compiler knows about 64-bit integer numbers (NOTE: even on a 32-bit CPU architecture most C compilers can handle 64-bit integers just fine):
static inline uint32_t divby3 (uint32_t divideMe)
{
    return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, the method above indeed does divide by 3. All it needs is a single 64-bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64-bit application this code will be a lot faster than in a 32-bit application (in a 32-bit application, multiplying two 64-bit numbers takes 3 multiplications and 3 additions on 32-bit values); however, it might still be faster than a division on a 32-bit machine.
On the other hand, if your compiler is a very good one and knows the trick of optimizing integer division by a constant (latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers you cannot rely on or expect that, even though this method is very well documented and mentioned everywhere on the Internet.
The problem is that it only works for constant divisors, not for variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases), and both are different depending on the number you want to divide by. Calculating them on the fly would take too much CPU time (that would be slower than hardware division), but it's easy for a compiler to calculate them at compile time (where an extra second of compile time hardly matters).
For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
    return x * 12297829382473034411ULL;
}
However this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example, if you run it on 11, it returns 6148914691236517209. This looks like garbage, but it's in fact the correct answer: multiply it by 3 and you get back the 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64-bit unsigned arithmetic is arithmetic modulo 2^64.
This means that for each integer coprime with the modulus 2^64 (essentially, all odd numbers) there exists a multiplicative inverse, which you can multiply by instead of dividing. This magic number can be obtained by solving the equation 3*x + 2^64*y = 1 using the Extended Euclidean Algorithm.
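For a power-of-two modulus you don't even need the full Extended Euclidean Algorithm: a few Newton iterations find the inverse of any odd number, since each step doubles the number of correct low bits. A sketch:
#include <stdint.h>

/* multiplicative inverse of an odd a, modulo 2^64 */
uint64_t inverse64(uint64_t a)
{
    uint64_t x = a;              /* a is its own inverse mod 8: 3 bits correct */
    for (int i = 0; i < 5; i++)  /* correct bits: 3 -> 6 -> 12 -> 24 -> 48 -> 96 */
        x *= 2 - a * x;
    return x;                    /* now a * x == 1 (mod 2^64) */
}
inverse64(3) yields 12297829382473034411, the constant used above.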
What if you really don't want to multiply or divide? Here is an approximation I just invented. It works because x/3 = x/4 + x/12. But since x/12 = (x/4)/3, we just have to repeat the process until it's good enough.
#include <stdio.h>

int main(void)
{
    int n = 1000;
    int a, b;
    a = n >> 2;
    b = (a >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    printf("a=%d\n", a);
    return 0;
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster, but if you want to use a bitwise operator to perform binary division you can use the shift-and-subtract method described at this page (a C sketch follows the steps):
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
    If that portion of the dividend above the divisor is greater than or equal to the divisor:
        Then subtract divisor from that portion of the dividend and
        Concatenate 1 to the right hand end of the quotient
    Else concatenate 0 to the right hand end of the quotient
    Shift the divisor one place right
Until dividend is less than the divisor
quotient is correct, dividend is remainder
STOP
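A sketch of those steps in C (my own illustration; unsigned only, and the divisor must be nonzero):
#include <stdint.h>

/* shift-and-subtract division; returns the quotient, *rem gets the remainder */
uint32_t shift_sub_div(uint32_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint32_t quotient = 0;
    int shift = 0;

    /* align: shift the divisor up as far as it can go without exceeding
       the dividend (or running off the top bit) */
    while (divisor <= dividend && !(divisor & 0x80000000u)) {
        divisor <<= 1;
        shift++;
    }

    /* one iteration per alignment position */
    for (; shift >= 0; shift--) {
        quotient <<= 1;
        if (dividend >= divisor) {   /* subtract and append a 1 bit */
            dividend -= divisor;
            quotient |= 1;
        }                            /* else append a 0 bit */
        divisor >>= 1;
    }

    *rem = dividend;
    return quotient;
}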
For really large integer division (e.g. numbers bigger than 64 bits) you can represent your number as an int[] of digits and perform the division quite fast by processing it digit by digit: divide the current value (ten times the running remainder plus the next digit) by 3, and carry the remainder forward.
e.g. for 11004 / 3 you compute
11/3 = 3, remainder = 2 (from 11 - 3*3)
20/3 = 6, remainder = 2 (from 20 - 6*3)
20/3 = 6, remainder = 2 (from 20 - 6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
    int remainder = 0;
    var res = new List<int>();
    for (int i = 0; i < a.Length; i++)
    {
        var val = remainder + a[i];
        var div = val / 3;
        remainder = 10 * (val % 3);
        if (div > 9)
        {
            res.Add(div / 10);
            res.Add(div % 10);
        }
        else
            res.Add(div);
    }
    if (res[0] == 0) res.RemoveAt(0);
    return res;
}
If you really want to dig deeper, see this article on integer division, but it only has academic merit ... it would be interesting to see an application that actually needed, and benefited from, that kind of trick.
Easy computation, using the alternating series x/3 = x/2 - x/4 + x/8 - x/16 + ...: the main loop runs at most n times, where n is your number of bits (the truncated series can drift by a little, hence the fix-up loops at the end):
uint8_t divideby3(uint8_t x)
{
    /* x/3 = x/2 - x/4 + x/8 - x/16 + ... */
    int v = x, answer = 0, sign = 1;
    while (v) {
        v >>= 1;
        answer += sign * v;
        sign = -sign;
    }
    /* the truncated series can be off in either direction */
    while ((answer << 1) + answer > x) answer--;       /* while 3*answer > x */
    while ((answer << 1) + answer + 3 <= x) answer++;  /* while 3*(answer+1) <= x */
    return (uint8_t)answer;
}
A lookup table approach would also be fast on some architectures.
uint8_t DivBy3LU(uint8_t u8Operand)
{
    static const uint8_t ai8Div3[256] = {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, ....};
    return ai8Div3[u8Operand];
}
Say you have 100000000 32-bit floating point values in an array, and each of these floats has a value between 0.0 and 1.0. If you tried to sum them all up like this
result = 0.0;
for (i = 0; i < 100000000; i++) {
result += array[i];
}
you'd run into accuracy problems as result gets much larger than 1.0: each small addend contributes less and less of its precision, and past a certain point it is rounded away entirely.
So what are some of the ways to perform the summation more accurately?
Sounds like you want to use Kahan Summation.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
    var sum = input[1]
    var c = 0.0             // A running compensation for lost low-order bits.
    for i = 2 to input.length
        y = input[i] - c    // So far, so good: c is zero.
        t = sum + y         // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y   // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t             // Algebraically, c should always be zero. Beware eagerly optimising compilers!
    next i                  // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
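In C, a sketch of the same algorithm (one caveat: compile without aggressive float reassociation such as -ffast-math, or the compensation term gets optimized away):
#include <stddef.h>

float kahan_sum(const float *input, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;              /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        float y = input[i] - c;  /* subtract the compensation from the new element */
        float t = sum + y;       /* big + small: low-order digits of y are lost... */
        c = (t - sum) - y;       /* ...and captured here for the next iteration */
        sum = t;
    }
    return sum;
}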
Make result a double, assuming C or C++.
If you can tolerate a little extra space (in Java):
float[] temp = new float[1000000];
float[] temp2 = new float[1000];
float sum = 0.0f;
for (int i = 0; i < 1000000000; i++) temp[i/1000] += array[i];
for (int i = 0; i < 1000000; i++) temp2[i/1000] += temp[i];
for (int i = 0; i < 1000; i++) sum += temp2[i];
Standard divide-and-conquer algorithm, basically. This only works if the numbers are randomly scattered; it won't work if the first half billion numbers are 1e-12 and the second half billion are much larger.
But before doing any of that, one might just accumulate the result in a double. That'll help a lot.
If you're in .NET, you can use the LINQ .Sum() extension method that exists on IEnumerable. Then it would just be:
var result = array.Sum();
The absolutely optimal way is to use a priority queue, in the following way:
PriorityQueue<Float> q = new PriorityQueue<Float>();
for (float x : list) q.add(x);
while (q.size() > 1) q.add(q.poll() + q.poll());
return q.poll();
(this code assumes the numbers are positive; generally the queue should be ordered by absolute value)
Explanation: given a list of numbers, to add them up as precisely as possible you should strive to keep the addends close in magnitude, i.e. eliminate the difference between small and big ones. That's why you want to add up the two smallest numbers: doing so increases the minimal value of the list, decreases the difference between the minimum and maximum in the list, and reduces the problem size by 1.
Unfortunately I have no idea how this can be vectorized, considering that you're using OpenCL. But I am almost sure that it can be. You might take a look at this book on vector algorithms; it is surprising how powerful they actually are: Vector Models for Data-Parallel Computing