OpenCL matrix reduction in two stages - algorithm

I have a 2d array of M x N entries.
A0 A1 A2 A3 A4 A5 ... AN
B0 B1 B2 B3 B4 B5 ... BN
...
I want to reduce this array to a single value in two stages.
Stage 1: Calculate function values for each item in a row, and sum them with some weights.
A = W0*f(A0) + W1*f(A1) + W2*f(A2) + ...
B = W0*f(B0) + W1*f(B1) + W2*f(B2) + ...
Stage 2: Compare results to an input vector and calculate a chi-square value.
CHI_SQ = (A - X)^2/SX^2 + (B - Y)^2/SY^2 + ...
I am trying to do this in parallel using OpenCL. However, I have a hard time figuring out the best strategy for this kind of algorithm. For example, there are many examples out there that would loop over the matrix rows, and many sources that state you should not do this. Could someone be so kind as to outline how this problem can be solved optimally?
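As a reference point for any parallel version, the two stages can be written serially first. This is only a minimal sketch: the function name `chi_square`, the argument layout, and the example `f` are assumptions for illustration and do not come from the question. Each row's weighted sum is independent of the other rows, which is what makes stage 1 a candidate for per-row parallel work, while stage 2 is a plain sum over M values.

```python
def chi_square(matrix, weights, f, expected, sigmas):
    """Serial reference for the two-stage reduction (hypothetical helper).

    matrix   : M rows of N entries
    weights  : N weights W0..W(N-1)
    f        : function applied to every entry
    expected : M comparison values (X, Y, ...)
    sigmas   : M uncertainties (SX, SY, ...)
    """
    # Stage 1: one weighted sum per row; rows do not depend on each other.
    row_values = [sum(w * f(x) for w, x in zip(weights, row)) for row in matrix]
    # Stage 2: reduce the M row values to a single chi-square number.
    return sum(((v - e) / s) ** 2 for v, e, s in zip(row_values, expected, sigmas))


# Example with f(x) = x^2: rows reduce to 9 and 41, giving 1 + 4 = 5.
result = chi_square([[1, 2], [3, 4]], [1, 2], lambda x: x * x, [8, 40], [1, 0.5])
```

A serial version like this is also handy for checking the output of a kernel on small inputs.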

Related

Understanding Modified Baugh-Wooley multiplication algorithm

For the Modified Baugh-Wooley multiplication algorithm, why is it !(A0*B5) instead of just (A0*B5)?
The same question applies to !(A1*B5), !(A2*B5), !(A3*B5), !(A4*B5), !(A5*B4), !(A5*B3), !(A5*B2), !(A5*B1) and !(A5*B0).
Also, why are there two extra '1's?
In signed 6-bit 2s complement notation, the place values of the bits are:
-32 16 8 4 2 1
Notice that the top bit has a negative value. When addition, subtraction, and multiplication are performed mod 64, however, that minus sign makes absolutely no difference to how those operations work, because 32 = -32 mod 64.
Your multiplication is not being performed mod 64, though, so that sign must be taken into account.
One way to think of your multiplication is that the 6-bit numbers are extended to 12 bits, and multiplication is then performed mod 4096. When extending a signed number, the top bit is replicated, so -32 becomes -2048 + 1024 + 512 ... +32, which all together has the same value of -32. So extend the signed numbers and multiply. I'll do it with 3 bits, multiplying mod 64:
Given:        Sign-extended:
A2 A1 A0      A2 A2 A2 A2 A1 A0
B2 B1 B0      B2 B2 B2 B2 B1 B0
Multiply:
A0B2 A0B2 A0B2 A0B2 A0B1 A0B0
A1B2 A1B2 A1B2 A1B1 A1B0
A2B2 A2B2 A2B1 A2B0
A2B2 A2B1 A2B0
A2B1 A2B0
A2B0
Since we replicated the same bits in multiple positions, you'll see the same bit products at multiple positions.
A0B2 appears 4 times with a total place value of 60, or 15<<2, and so on. Let's write the multipliers in:
                        A0B2*15 A0B1    A0B0
                A1B2*7  A1B1    A1B0
        A2B2*5  A2B1*7  A2B0*15
Again, because of modular arithmetic, the *15s and *7s are the same as *-1, and the *5 is the same as *1:
                        -A0B2   A0B1    A0B0
                -A1B2   A1B1    A1B0
        A2B2    -A2B1   -A2B0
That pattern is starting to look familiar. Now, of course -1 is not a bit value, but ~A0B2 = 1-A0B2, so we can translate -A0B2 into ~A0B2 and then subtract the extra 1 we added. If we do this for all the subtracted products:
                        ~A0B2   A0B1    A0B0
                ~A1B2   A1B1    A1B0
        A2B2    ~A2B1   ~A2B0
                -2      -2
If we add up the place values of those -2s and expand them into the equivalent bits, we discover the source of the additional 1s in your diagram:
                        ~A0B2   A0B1    A0B0
                ~A1B2   A1B1    A1B0
        A2B2    ~A2B1   ~A2B0
1               1
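The derivation above can be checked by brute force. The sketch below is an illustration only (the helper names `to_signed` and `baugh_wooley` are made up here): it complements every partial product in which exactly one operand bit is a sign bit, adds the two extra 1s at place n and place 2n-1, and reduces mod 2^(2n). For n = 3 those are the 1s with place values 8 and 32.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley(a, b, n):
    """Multiply two n-bit two's-complement bit patterns; result is mod 2**(2n)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            product = a_bits[i] & b_bits[j]
            # Exactly one of the two bits is a sign bit: use the complement.
            if (i == n - 1) != (j == n - 1):
                product ^= 1
            total += product << (i + j)
    # The two extra 1s that absorb the constants introduced by complementing.
    total += (1 << n) + (1 << (2 * n - 1))
    return total & ((1 << (2 * n)) - 1)


# Example: -3 * 2 with 3-bit operands (0b101 is -3) gives -6.
example = to_signed(baugh_wooley(0b101, 0b010, 3), 6)
```

Running the same check for n = 6 covers the A5/B5 case from the question.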
Why two extra '1's?
See the explanation in Matt Timmermans's answer.
Note: '-2' in two's complement is 110, and this contributes to the carries, hence the two extra '1's.
Why are the values of some of the partial product bits flipped?
It is due to the sign bit in the MSB (A5 and B5).
In addition, with the help of others, I worked out a countermeasure for the modified Baugh-Wooley algorithm in the case of A_WIDTH != B_WIDTH.
I have written hardware Verilog code for this algorithm.
Hopefully, this post helps some readers.
The short answer is that this is because of how 2's-complement representation works: the top bit is effectively a sign bit, so a 1 there means minus. In other words, you have to subtract
A5*(B4 B3 B2 B1 B0) << 5
and
B5*(A4 A3 A2 A1 A0) << 5
from the sum (note that A5*B5 is added again because both have the same minus sign). Those two 1s are the result of replacing those two subtractions with additions of -X.
If you need more details, then you probably just need to re-read how 2's complement works and then the whole math behind the Baugh-Wooley multiplication algorithm. It is not that complicated.

Finding a solution for a linear equation system which has more variables than equations

Let's divide the problem into 2 parts; the second one is optional.
Part 1
I have 3 linear equations with N variables, where N is usually bigger than 3.
x1*a + x2*b + x3*c + x4*d + ... + xN*p = B1
y1*a + y2*b + y3*c + y4*d + ... + yN*p = B2
z1*a + z2*b + z3*c + z4*d + ... + zN*p = B3
I am looking for (a, b, c, d, ..., p); the others are constants.
The standard Gaussian way won't work because the matrix will be wider than it is tall. Of course I can use it to eliminate 2 variables. Do you know an algorithm to find a solution? (I only need one.) More 0s in the solution coefficients are better but not required.
Part 2
The coefficients in the solution must be non-negative.
Requirements:
The algorithm must be fast enough to run in real time (1800 per second on an average PC), so a trial-and-error method is a no-go.
The algorithm will be implemented in C# but feel free to use pseudo language if you want to write code.
Set the extra variables (d through p) to zero. Now we have the matrix equation
A.x = b, where
    x1 x2 x3
A = y1 y2 y3
    z1 z2 z3
x = (a, b, c) and b = (B1, B2, B3), as column vectors.
Now invert A. The solution is:
x = A^-1.b
In Excel, enter matrix formulas with Ctrl+Shift+Enter.
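The idea of zeroing the extra variables and solving the remaining 3x3 system can be sketched as follows. This is an illustration rather than a tuned implementation: the names `det3` and `solve_first_three` are made up, Cramer's rule is used in place of an explicit inverse, and exact fractions are used so the check is not affected by rounding. It assumes the first three columns form a non-singular block and it does not address the non-negativity requirement of Part 2.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_first_three(rows, rhs):
    """Solve for the first three unknowns and set the remaining ones to zero.

    rows: three coefficient lists of length N >= 3; rhs: (B1, B2, B3).
    """
    n = len(rows[0])
    block = [[Fraction(v) for v in row[:3]] for row in rows]
    d = det3(block)
    if d == 0:
        raise ValueError("leading 3x3 block is singular; choose other columns")
    solution = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        replaced = [row[:] for row in block]
        for r in range(3):
            replaced[r][col] = Fraction(rhs[r])
        solution.append(det3(replaced) / d)
    return solution + [Fraction(0)] * (n - 3)


rows = [[2, 1, 1, 3], [1, 3, 2, 1], [1, 0, 0, 4]]
solution = solve_first_three(rows, [4, 5, 6])
```

If the leading block is singular, a different choice of three columns would be needed; the sketch simply raises in that case.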

Maximizing a Trigonometric Function of Many Variables in Mathematica

Just to give some context, my motivation for this programming question is to understand the derivation of the CHSH inequality, which basically entails maximizing the following function:
Abs[c1 Cos[2(a1-b1)]+ c2 Cos[2(a1-b2)] + c3 Cos[2(a2-b1)] + c4 Cos[2(a2-b2)]]
where a1, a2, b1, and b2 are arbitrary angles and c1, c2, c3, c4 can each only be +1 or -1. I want to determine the maximum value of this function along with the combination of angles that leads to this maximum.
Eventually, I also want to repeat the calculation for a1, a2, a3, b1, b2, b3 (which will have a total of nine cosine terms).
When I tried putting the following code into Mathematica, it simply spat the input back at me and did not perform any computation. Can someone help me out? (Note that my code didn't include the c1, c2, c3, c4 parameters; I wasn't quite sure how to incorporate them.)
Maximize[{Abs[Cos[2 (a1 - b1)] - Cos[2 (a1 - b2)] + Cos[2 (a2 - b1)] +
Cos[2 (a2 - b2)]], 0 <= a1 <= 2 \[Pi] , 0 <= b1 <= 2 \[Pi], 0 <= a2 <= 2 \[Pi], 0 <= b2 <= 2 \[Pi]}, {a1, b2, a2, b1}]
The answer is 4. This is because each Cos can be made to equal 1. You have 4 variables a1, a2, b1 and b2, and four cosines, so there are going to be several ways of making the combinations 2(a1-b1), 2(a1-b2), 2(a2-b1) and 2(a2-b2) equal 0 (hence choosing the corresponding c1/c2/c3/c4 to be +1), or equal to pi (hence choosing the corresponding c1/c2/c3/c4 to be -1).
For one set of angles that give the max, the obvious answer is a1=a2=b1=b2=0. For the 9 cosine case, the max will be 9, and one possible answer is a1=a2=a3=b1=b2=b3=0.
Regarding using Mathematica, I think the lesson is that it's always best to think about the maths itself before using tools to help with the maths.
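A coarse numerical check is a cheap way to test a claim like this before trusting it. The sketch below is an assumption-laden illustration (the names `objective` and `grid_max` and the grid step of pi/8 are choices made here, not part of the question). The function has period pi in every angle, so the grid only covers [0, pi). On this grid the all-plus sign pattern reaches 4 at a1=a2=b1=b2=0, while the sign pattern used in the question's Maximize call (+, -, +, +) reaches only 2*sqrt(2), about 2.83, which suggests that the four cosine conditions cannot always be satisfied at the same time and that the maximum depends on the signs.

```python
import itertools
import math


def objective(signs, a1, a2, b1, b2):
    """Absolute value of the signed sum of the four cosine terms."""
    c1, c2, c3, c4 = signs
    return abs(c1 * math.cos(2 * (a1 - b1)) + c2 * math.cos(2 * (a1 - b2))
               + c3 * math.cos(2 * (a2 - b1)) + c4 * math.cos(2 * (a2 - b2)))


def grid_max(signs, steps=8):
    """Best value and angles (a1, a2, b1, b2) on a grid of step pi/steps."""
    angles = [k * math.pi / steps for k in range(steps)]
    best = max(itertools.product(angles, repeat=4),
               key=lambda point: objective(signs, *point))
    return objective(signs, *best), best


all_plus, _ = grid_max((1, 1, 1, 1))
question_pattern, _ = grid_max((1, -1, 1, 1))
```

A grid search only gives a lower bound on the true maximum in general, so a finer grid or a proper optimizer would be needed for the nine-term case.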

How can I make this vector enumeration code faster?

I have three large sets of vectors: A, B1 and B2. These sets are stored in files on disk. For each vector a from A I need to check whether it can be represented as a = b1 + b2, where b1 is from B1 and b2 is from B2. Vectors have 20 components, and all components are non-negative numbers.
How I'm solving this problem now (pseudocode):
foreach a in A
    foreach b1 in B1
        for i = 1 to 20
            bt[i] = a[i] - b1[i]
            if bt[i] < 0 then try next b1
        next i
        foreach b2 in B2
            for i = 1 to 20
                if bt[i] != b2[i] then try next b2
            next i
            num_of_expansions++
        next b2
    next b1
next a
My questions:
1. Any ideas on how to make it faster?
2. How to make it in parallel?
3. Questions 1, 2 for the case when I have B1, B2, ..., Bk, k > 2?
You can sort B1 and B2 by norm. If a = b1 + b2, then ||a|| = ||b1 + b2|| <= ||b1|| + ||b2||, so for any a and b1, you can efficiently eliminate all elements of B2 that have norm < ||a|| - ||b1||. There may also be some way to use the distribution of norms in B1 and B2 to decide whether to switch the roles of the two sets in this. (I don't see off-hand how to do it, but it seems to me that something like this should hold if the distributions of norms in B1 and B2 are significantly different.)
As for making it parallel, it seems that each loop can be turned into a parallel computation, since all computations of one inner iteration are independent of all other iterations.
EDIT
Continuing the analysis: since b2 = a - b1, we also have ||b2|| <= ||a|| + ||b1||. So for any given a and b1, you can restrict the search in B2 to those elements with norms in the range ||a|| ± ||b1||. This suggests that for B1 you should select the set with the smallest average norm.
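The norm window described above can be sketched with a sorted list and binary search. This is only an illustration under some assumptions: vectors are tuples held in memory, the Euclidean norm is used, and a small tolerance `eps` guards the window edges against floating-point rounding. The names `norm` and `count_expansions` are made up for the example.

```python
import bisect
import math


def norm(v):
    """Euclidean norm of a vector given as a sequence of numbers."""
    return math.sqrt(sum(x * x for x in v))


def count_expansions(A, B1, B2, eps=1e-9):
    """Count triples (a, b1, b2) with a = b1 + b2, scanning only a norm window of B2."""
    b2_sorted = sorted(B2, key=norm)
    b2_norms = [norm(v) for v in b2_sorted]
    count = 0
    for a in A:
        norm_a = norm(a)
        for b1 in B1:
            diff = tuple(x - y for x, y in zip(a, b1))
            if any(d < 0 for d in diff):
                continue  # b2 would need a negative component
            norm_b1 = norm(b1)
            # Only b2 with ||a|| - ||b1|| <= ||b2|| <= ||a|| + ||b1|| can match.
            lo = bisect.bisect_left(b2_norms, norm_a - norm_b1 - eps)
            hi = bisect.bisect_right(b2_norms, norm_a + norm_b1 + eps)
            for b2 in b2_sorted[lo:hi]:
                if tuple(b2) == diff:
                    count += 1
    return count


# Two-component vectors keep the example short; the question uses 20.
found = count_expansions([(3, 3)], [(1, 2), (4, 0), (0, 0)], [(2, 1), (3, 3), (1, 1)])
```

The outer loop over A has no shared state apart from the counter, so under these assumptions it could be split across workers and the partial counts added up.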

Reordering of array elements

Given an array
[a1 a2 a3 ... an b1 b2 b3 ... bn c1 c2 c3 ...cn]
without using extra memory how do you reorder into an array
[a1 b1 c1 a2 b2 c2 a3 b3 c3 ... an bn cn]
Your question can also be rephrased as 'How to do an in-place matrix transposition?'. To see why, imagine adding a newline after each subsequence in both of your arrays. This will turn the first array into an NxM matrix, and the second array into an MxN matrix.
Still, it is not trivial for non-square matrices. Please refer to the Wikipedia page on In-place matrix transposition for a comprehensive description of the problem and its solutions.
Assuming you mean O(1) memory (or depending on the model O(log n)) rather than no extra memory, a linear time in-place algorithm exists.
This paper: http://arxiv.org/abs/0805.1598 has an algorithm for the case when you have
a1 ... an b1 ... bn and want to convert to
b1 a1 b2 a2 ... bn an.
The paper also mentions that you can generalize this to other k-way shuffles. In your case, k = 3.
The algorithm in the paper will give the following:
Start with a1 a2 ... an b1 b2 ... bn c1 c2 ... cn and convert to
c1 b1 a1 c2 b2 a2 ... cn bn an
Another pass through this, and you can easily get a1 b1 c1 a2 b2 c2 ... an bn cn.
Now to generalize the algorithm in the paper, we need to pick a prime p, such that k is a primitive root of p^2.
For k = 3, p = 5 will do.
Now to apply the algorithm, first you need to find the largest m < n such that 3m+1 is a power of 5.
Note: this will only happen when 3m+1 is an even power of 5. Thus you can actually work with powers of 25 when trying to find the m. (5^odd - 1 is not divisible by 3).
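Both number-theoretic claims here are easy to check directly. The sketch below uses made-up helper names (`multiplicative_order`, `is_primitive_root_of_square`, `candidate_m_values`): it confirms that 3 is a primitive root of 5^2 = 25 and lists the values of m for which 3m+1 is a power of 5, which shows that only even powers qualify.

```python
from math import gcd


def multiplicative_order(k, modulus):
    """Smallest e > 0 with k**e == 1 (mod modulus); k must be coprime to modulus."""
    if gcd(k, modulus) != 1:
        raise ValueError("k must be coprime to the modulus")
    e, value = 1, k % modulus
    while value != 1:
        value = (value * k) % modulus
        e += 1
    return e


def is_primitive_root_of_square(k, p):
    """True if k generates the multiplicative group mod p**2 (p an odd prime)."""
    return multiplicative_order(k, p * p) == p * (p - 1)


def candidate_m_values(limit):
    """All m <= limit for which 3m + 1 is a power of 5."""
    result, power = [], 1
    while (power - 1) // 3 <= limit:
        if (power - 1) % 3 == 0:
            result.append((power - 1) // 3)
        power *= 5
    return result


k3_ok = is_primitive_root_of_square(3, 5)   # the k = 3, p = 5 case
ms = candidate_m_values(1000)               # m for 5^0, 5^2, 5^4
```

The same check with k = 2 and p = 3 covers the two-way shuffle from the paper.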
Once you find m,
You shuffle the array to be
[a1 a2 ... am b1 b2 ... bm c1 c2 ... cm] [a(m+1) ... an b(m+1) ... bn c(m+1) ... cn]
and then use the follow-the-cycle method (refer to the paper) for the first 3m elements, using the powers of 5 (including 1 = 5^0) as the starting points of the different cycles, and do a tail recursion for the rest.
Now to convert
a1 a2 ... an b1 b2 ... bn c1 c2 ... cn
to
[a1 a2 ... am b1 b2 ... bm c1 c2 ... cm] [a(m+1) ... an b(m+1) ... bn c(m+1) ... cn]
you first do a cyclic shift to get
a1 a2 ... am [b1 b2 ... bm a(m+1) ... an] b(m+1) ... bn c1 c2 ... cn
(the elements in the square brackets are the ones that were shifted)
Then do a cyclic shift to get
a1 a2 ... am b1 b2 ... bm a(m+1) ... an [c1 c2 ... cm b(m+1) ... bn] c(m+1) ... cn
And then a final shift to
a1 a2 ... am b1 b2 ... bm [c1 c2 ... cm a(m+1) ... an] b(m+1) ... bn c(m+1) ... cn
Note that cyclic shift can be done in O(n) time and O(1) space.
So whole algorithm is O(n) time and O(1) space.
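One common way to do a cyclic shift in O(n) time and O(1) space is the three-reversal trick; the paper may use a different method, so treat this as a sketch. The helper names (`reverse`, `rotate_left`, `split_prefix`) are made up, and `split_prefix` applies the three shifts described above to move the first m elements of each group to the front.

```python
def reverse(arr, lo, hi):
    """Reverse arr[lo:hi] in place."""
    hi -= 1
    while lo < hi:
        arr[lo], arr[hi] = arr[hi], arr[lo]
        lo += 1
        hi -= 1


def rotate_left(arr, lo, hi, k):
    """Cyclically shift arr[lo:hi] left by k positions in place."""
    length = hi - lo
    if length == 0:
        return
    k %= length
    reverse(arr, lo, lo + k)
    reverse(arr, lo + k, hi)
    reverse(arr, lo, hi)


def split_prefix(arr, n, m):
    """Turn [a1..an b1..bn c1..cn] into [a1..am b1..bm c1..cm] + the rest."""
    shift = n - m
    rotate_left(arr, m, n + m, shift)          # b1..bm moves ahead of a(m+1)..an
    rotate_left(arr, n + m, 2 * n + m, shift)  # c1..cm moves ahead of b(m+1)..bn
    rotate_left(arr, 2 * m, n + 2 * m, shift)  # c1..cm moves ahead of a(m+1)..an


cards = [g + str(i) for g in "abc" for i in range(1, 4)]
split_prefix(cards, 3, 1)
```

Each element is touched a constant number of times per shift, which is where the linear bound comes from.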
You can calculate each item's target position based on its index.
groupSize = N/3
group = i/groupSize
rank = i - group * groupSize
dest = rank * 3 + group
You can use this calculation with a cycle sort to put each element in its proper place in linear time. The only issue is tracking which items are already in place. All you need for that is N bits. With certain types of data, you can "steal" a bit from the data item itself. For instance, you can use the high bit of ASCII data, or the low bits of word-aligned pointers.
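A minimal sketch of that cycle-following approach is below. It keeps the N bookkeeping bits in a separate list of booleans instead of stealing bits from the data, and it assumes the array length is a multiple of 3. The function name `interleave_with_flags` is made up for the example.

```python
def interleave_with_flags(arr):
    """Reorder [a1..an b1..bn c1..cn] into [a1 b1 c1 a2 b2 c2 ...] in place."""
    n = len(arr)
    if n % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    if n == 0:
        return
    group_size = n // 3
    done = [False] * n  # the N bits that mark items already in place
    for start in range(n):
        if done[start]:
            continue
        i = start
        item = arr[i]
        while True:
            group, rank = divmod(i, group_size)
            dest = rank * 3 + group
            # Drop the carried item at its target and pick up what was there.
            arr[dest], item = item, arr[dest]
            done[dest] = True
            i = dest
            if i == start:
                break


letters = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]
interleave_with_flags(letters)
```

Every position is written exactly once, so the running time is linear in N.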
Alternately, you can do it without any extra bits at the expense of going to polynomial time. Reverse the calculation, so you can find the original source index of each item in the final array.
source = (i % 3) * groupSize + (i / 3) ; //integer division
Now walk forward through the array, swapping every item with the one from the source. The trick is that any time the source index is less than the current position (meaning it has already been swapped out), you need to follow the trail until you find its current location.
getSource(i):
    s = (i % 3) * groupSize + (i / 3)
    while (s < i)
        s = (s % 3) * groupSize + (s / 3)
    return s
shuffle:
    for i in (0..N-1)
        swap(a[i], a[getSource(i)])
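Here is a runnable sketch of the no-extra-bits variant. It assumes the source index is the inverse of the dest formula given earlier, that is (i % 3) * groupSize + i / 3 with integer division, and that the array length is a multiple of 3. The function names are made up for the example.

```python
def get_source(i, group_size):
    """Current location of the item that belongs at final index i."""
    s = (i % 3) * group_size + i // 3
    # A source below i was already swapped away; follow the trail forward.
    while s < i:
        s = (s % 3) * group_size + s // 3
    return s


def interleave_in_place(arr):
    """Reorder [a1..an b1..bn c1..cn] into [a1 b1 c1 a2 b2 c2 ...] in place."""
    n = len(arr)
    if n % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    group_size = n // 3
    for i in range(n):
        s = get_source(i, group_size)
        arr[i], arr[s] = arr[s], arr[i]


six = ["a1", "a2", "b1", "b2", "c1", "c2"]
interleave_in_place(six)
```

The while loop always ends because it walks around the permutation cycle that contains i, and that cycle returns to i itself at the latest.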
You can certainly do this: just take the cards ace, 2, ... 5 in 3 suits and put them in order.
First you take out the a2 card and put it aside.
Then you move the b1 to the a2 position and shift all the cards up.
Then you put the a2 card back in the position that was shifted out.
Then you take out the a3 card and put it aside.
Move the c1 to the a3 position and shift all the cards up.
Then put the a3 card back in the emptied position.
Repeat until done.
The actual calculation of the indices is tricky, but I believe a previous poster has done this.
