OpenCL Cholesky Decomposition

I implemented the following Cholesky decomposition algorithm using OpenCL. The code exhibits random behavior: it matches the CPU output only some of the time. Can someone please help me figure out what is wrong with my implementation?
Here is the algorithm:
procedure CHOLESKY(A)
int i, j, k;
for k := 0 to n − 1 do /* 1st loop */
/* Obtain the square root of the diagonal element. */
A[k, k] := sqrt(A[k, k]);
for j := k + 1 to n − 1 do /* 2nd loop */
/* The division step. */
A[k, j] := A[k, j]/A[k, k];
end for
for i := k + 1 to n − 1 do /* 3rd loop */
for j := i to n − 1 do /* 4th loop */
/* The elimination step. */
A[i, j] := A[i, j] - A[k, i] × A[k, j];
end for
end for
end for
Methodology to parallelize the above algorithm:
From the algorithm, the elimination step is the most expensive. So I keep the outermost loop in the host code and call the kernel inside it. Each work-group of a kernel launch handles a single iteration of the 3rd loop, so I launch (n-1) - (k+1) + 1 work-groups. The number of work-items within a work-group is set to n/2. The 2nd for loop is also computed within the kernel, but I allow only the first work-group to do it.
RELEVANT HOST CODE
// for a 10 X 10 matrix, MATRIX_SIZE = 10
localWorkSize[0] = MATRIX_SIZE/2;
stride = MATRIX_SIZE/2;
cl_event event;
for(k = 0; k < MATRIX_SIZE; k++)
{
int isize = (MATRIX_SIZE-1) - (k+1) + 1;
int num_blocks = isize;
if(num_blocks <= 0)
num_blocks = 1;
globalWorkSize[0] = num_blocks * WA/2;
errcode = clSetKernelArg(clKernel, 0, sizeof(int), (void *)&k);
errcode |= clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void *)&d_A);
errcode |= clSetKernelArg(clKernel, 2, sizeof(int), (void *)&stride);
errcode = clEnqueueNDRangeKernel(clCommandQueue,
clKernel, 1, NULL, globalWorkSize,
localWorkSize, 0, NULL, &event);
OpenCL_CheckError(errcode, "clEnqueueNDRangeKernel");
clFinish(clCommandQueue);
}
KERNEL CODE
__kernel void
batchedCholesky(__global float *U, int k, int stride)
{
int tx = get_global_id(0);
unsigned int j;
unsigned int num_rows = MATRIX_SIZE;
if(tx==0)
{
// Take the square root of the diagonal element
U[k * num_rows + k] = sqrt(U[k * num_rows + k]);
}
barrier(CLK_GLOBAL_MEM_FENCE);
int offset = (k+1); //From original loop
int jstart = get_local_id(0) + offset;
int jstep = stride;
int jtop = num_rows - 1;
int jbottom = (k + 1);
//Do work for this i iteration
//Division step
if(get_group_id(0) == 0)
{
for(j = jstart; (j >= jbottom) && (j <= jtop); j+=jstep)
{
U[k * num_rows + j] /= U[k * num_rows + k]; // Division step
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
j = 0;
int i = get_group_id(0) + (k+1);
offset = i;
jstart = get_local_id(0) + offset;
jbottom = i;
for( j = jstart; j >= jbottom && j <= jtop; j += jstep)
U[i * num_rows + j] -= U[k * num_rows + i] * U[k * num_rows + j];
barrier(CLK_GLOBAL_MEM_FENCE);
}

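Not all of your work-items execute at the same time; they may run in batches. barrier(CLK_GLOBAL_MEM_FENCE) only synchronizes the work-items within a single work-group, so work-groups other than the first can reach the elimination step before the square root and the division step have finished. That is the likely source of the intermittent mismatches.
If you require global synchronization, use multiple kernels: one launch per phase, enqueued in order by the host.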
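The phase ordering that separate kernel launches would enforce can be sketched in plain sequential Python (not OpenCL). The function name `cholesky_in_phases` is hypothetical, and each commented phase stands in for one kernel launch. It is an illustration of the ordering only, under the assumption that the input is symmetric positive definite.

```python
import math

def cholesky_in_phases(a):
    """Upper-triangular factor U with A = U^T U, computed one phase at a time."""
    n = len(a)
    u = [row[:] for row in a]  # work on a copy
    for k in range(n):
        # Phase 1 (would be its own launch): square root of the diagonal element.
        u[k][k] = math.sqrt(u[k][k])
        # Phase 2 (separate launch): division step; every j is independent.
        for j in range(k + 1, n):
            u[k][j] /= u[k][k]
        # Phase 3 (separate launch): elimination; every (i, j) pair is
        # independent, but all of them need phase 2 to be complete.
        for i in range(k + 1, n):
            for j in range(i, n):
                u[i][j] -= u[k][i] * u[k][j]
    # The algorithm never touches the strictly lower triangle, so clear it.
    for i in range(n):
        for j in range(i):
            u[i][j] = 0.0
    return u

if __name__ == "__main__":
    print(cholesky_in_phases([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]))
```

Because phase 3 reads values written in phase 2 by other rows' workers, a work-group barrier inside a single launch is not enough to separate them.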

Related

Largest sum of all increasing subsequences of length k

I am currently stuck with the classic longest increasing subsequence problem, but there is a slight twist to it. Instead of just finding the longest increasing subsequence, I need to find the largest sum of all increasing subsequences that are of length k.
I have the following pseudo code implemented:
input = [4,13,5,14] k = 2
n = size of input
opt = array of size n which stores the highest increasing subsequence sum up to this index
counts = array of size n which stores the amount of values in the subsequence up to this index
highestSum = -1
FOR i in range(0, n)
high = new data object(value = 0, sum = 0, count = 0)
FOR j in range(i-1, 0, -1)
IF high.sum < opt[j] AND opt[j] < opt[i] AND counts[j] < k
high.value = input[j]
high.sum = opt[j]
high.count = counts[j]
opt[i] = high.sum + input[i]
counts[i] = high.count + 1
IF counts[i] == k
highestSum = higher value between (highestSum, opt[i])
return highestSum
This dynamic programming approach works in most cases, but for the list I outlined above it does not return the optimal subsequence sum. The optimal subsequence sum with length 2 should be 27 (13-14), but 18 is returned (4-14). This is due to the opt and counts array looking like this:
k = 2
input: 0 4 13 5 14
opt: 0 4 17 9 18
counts: 0 1 2 2 2
Because 13 already ends the subsequence 4-13, its count value (2) is no longer less than k, so 14 is unable to accept 13 as a valid predecessor.
Are there any suggestions as to what I can change?

You'll need k+1 sorted data structures, one for each possible length of subsequence found so far.
Each structure maps the last entry of a candidate subsequence to the best sum found for it. We only keep subsequences that can still lead to the best possible solution. (Technical note: of those that can lead to the best solution, pick the one whose positions are lexicographically first.) Each structure is sorted by increasing last entry, and after pruning the sums increase along with the last entries, because a candidate with a larger last entry and a smaller or equal sum can never win.
In pseudocode it works like this.
initialize optimal[0..k]
optimal[0][min(sequence) - 1] = 0 # empty set.
for entry in sequence:
    for i in k..1:
        entry_prev = biggest < entry in optimal[i-1]
        if entry_prev is not None:
            this_sum = optimal[i-1][entry_prev] + entry
            entry_smaller = biggest <= entry in optimal[i]
            if entry_smaller is None or optimal[i][entry_smaller] < this_sum:
                delete (e, v) from optimal[i] where entry <= e and v <= this_sum
                insert (entry, this_sum) into optimal[i]
return optimal[k][largest entry in optimal[k]]
But you need this kind of 2-d structure to keep track of what might happen from here.
The total memory needed is O(k n) and running time will be O(k n log(n)).
It is possible to also reconstruct the optimal subsequence, but that requires a more complex data structure.
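As a cross-check for any faster version, here is a minimal sketch of the plain quadratic dynamic programme, not the sorted-structure approach described above. It keeps one table row per subsequence length, which is the 2-d bookkeeping the question's single `opt`/`counts` pair was missing. The function name `max_sum_increasing_k` is hypothetical, and returning `None` when no such subsequence exists is an assumption made for this sketch.

```python
def max_sum_increasing_k(seq, k):
    """Largest sum over strictly increasing subsequences of exactly k elements."""
    n = len(seq)
    neg = float("-inf")
    # best[l][i]: largest sum of an increasing subsequence of length l + 1
    # that ends exactly at seq[i]; neg means "no such subsequence".
    best = [[neg] * n for _ in range(k)]
    for i in range(n):
        best[0][i] = seq[i]
        for l in range(1, k):
            for j in range(i):
                if seq[j] < seq[i] and best[l - 1][j] != neg:
                    best[l][i] = max(best[l][i], best[l - 1][j] + seq[i])
    result = max(best[k - 1], default=neg)
    return None if result == neg else result

if __name__ == "__main__":
    print(max_sum_increasing_k([4, 13, 5, 14], 2))  # the question's example: 27
```

This runs in O(n * n * k) time, so it is only useful as a reference implementation for small inputs.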

Here is a working solution in C++ that runs in O(n * k * log n) time with O(n * k) space. I don't think it can be made faster, but let me know if you find a faster solution. This is a modification of the solution from https://stackoverflow.com/questions/16402854/number-of-increasing-subsequences-of-length-k. The key difference is that we keep track of the maximum sum for each subsequence length instead of accumulating the number of subsequences, and we iterate from the back of the array (since for increasing subsequences longer than k, the best k-length subsequence will be at the end).
Another trick is that we use the array sums to map index + length combinations to maximum sums.
The maxSumIncreasingKLenSeqDP function is the simple dynamic programming solution with O(n * n * k) time complexity.
#include <iostream>
#include <algorithm>
#include <functional> // std::greater
#include <unordered_map>
#include <climits> // INT_MIN
#include <random>
using namespace std;
int maxSumIncreasingKLenSeq(int arr[], size_t n, int k){
// inverse compression: assign N-1, N-2, ... , 1 to smallest, ..., largest
size_t N = 1;
size_t compArr[n];
{
for(size_t i = 0; i<n; ++i)
compArr[i] = arr[i];
// descending order
sort(compArr, compArr + n, greater<int>());
unordered_map<int, size_t> compMap;
for(int val : compArr){
if(compMap.find(val) == compMap.end()){
compMap[val] = N;
++N;
}
}
for(size_t i = 0; i<n; ++i)
compArr[i] = compMap[arr[i]];
}
int sums[n * (k - 1) + n]; // key is combined from index and length by n * (length - 1) + index
for(size_t i = 0; i < n * (k - 1) + n; ++i)
sums[i] = -1;
for(size_t i = 0; i < n; ++i)
sums[i] = arr[i]; // i, 1
int BIT[N];
for(size_t len = 2; len <= k; ++len){
for(size_t i = 0; i<N; ++i)
BIT[i] = INT_MIN;
for(size_t i = 0; i < len - 1; ++i)
sums[n * (len - 1) + i] = INT_MIN;
for(int i = n - len; i >= 0; --i){
int val = sums[n * (len - 2) + i + 1]; // i + 1, len - 1
int idx = compArr[i + 1];
while(idx < (int)N){ // BIT holds N slots (indices 0..N-1), so idx must stay below N
BIT[idx] = max(val, BIT[idx]);
idx += (idx & (-idx));
}
// it does this:
//BIT[compArr[i + 1]] = sums[n * (len - 2) + i + 1];
idx = compArr[i] - 1;
int maxSum = INT_MIN;
while(idx > 0){
maxSum = max(BIT[idx], maxSum);
idx -= (idx & (-idx));
}
sums[n * (len - 1) + i] = maxSum;
// it does this:
//for(int j = 0; j < compArr[i]; ++j)
// sums[n * (len - 1) + i] = max(sums[n * (len - 1) + i], BIT[j]);
if(sums[n * (len - 1) + i] > INT_MIN)
sums[n * (len - 1) + i] += arr[i];
}
}
int maxSum = INT_MIN;
for(int i = n - k; i >= 0; --i)
maxSum = max(maxSum, sums[n * (k - 1) + i]); // i, k
return maxSum;
}
int maxSumIncreasingKLenSeqDP(int arr[], int n, int k){
int sums[n * (k - 1) + n]; // key is combined from index and length by n * (length - 1) + index
for(size_t i = 0; i < n; ++i)
sums[i] = arr[i]; // i, 1
for(int i = 2; i <= k; ++i)
sums[n * (i - 1) + n - 1] = INT_MIN; // n - 1, i
// moving backward since for increasing subsequences it will be the last k items
for(int i = n - 2; i >= 0; --i){
for(size_t len = 2; len <= k; ++len){
int idx = n * (len - 1) + i; // i, length
sums[idx] = INT_MIN;
for(int j = n - 1; j > i; --j){
if(arr[i] < arr[j])
sums[idx] = max(sums[idx], sums[n * (len - 2) + j]); // j, length - 1
}
if(sums[idx] > INT_MIN)
sums[idx] += arr[i];
}
}
int maxSum = INT_MIN;
for(int i = n - k; i >= 0; --i)
maxSum = max(maxSum, sums[n * (k - 1) + i]); // i, k
return maxSum;
}
int main(){
std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> dist(1,10);
for(int len = 3; len < 10; ++len){
for(int i = 0; i < 10000; ++i){
int arr[100];
for(int n = 0; n < 100; ++n)
arr[n] = dist(rng);
int res = maxSumIncreasingKLenSeqDP(arr, 100, len);
int fastRes = maxSumIncreasingKLenSeq(arr, 100, len);
if(res != fastRes)
cout << "failed" << endl;
else
cout << "passed" << endl;
}
}
return 0;
}

Determine the run-time of a pseudo code

I have the following pseudo-code, and I want to determine its run-time T(n).
Can someone please give me the steps I should follow?
Here is the code:
i := 1;
while (i <= n)
j := i;
x := x+A[i];
while (j > 0)
y := x/(2*j);
j = j /2; // Assume here that this returns the floor of the quotient
i = 2 * i;
return y;

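I think it's O((log n)^2), if my calculation is not wrong.
First, we omit the lines that do not affect the number of loop iterations, and assume that random access into the array is O(1).
i := 1;
while (i <= n)
    j := i;
    while (j > 0)
        j = j / 2;
    i = 2 * i;
return y;
Then we assume that all the operations on one line are done in O(1), add them up, and get the result. The outer loop runs floor(log2 n) + 1 times, because i doubles each time. When i = 2^m, the inner loop halves j from 2^m down to 0, which takes m + 1 iterations. The total is 1 + 2 + ... + (floor(log2 n) + 1), which is O((log n)^2).
Besides the mathematical analysis, we can also measure it: run the fragment several times for different n and watch how the time grows (I again omitted the operations that do not affect the number of loop iterations).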
#include <iostream>
#include <ctime> // clock(), clock_t, CLOCKS_PER_SEC
using namespace std;
int main(){
long long i, j, n;
double cost;
clock_t start, finish;
i = 1;
n = 1000000000000;
start = clock();
while (i <= n) {
j = i;
while (j > 0) {
j = j / 2;
}
i = 2 * i;
}
finish = clock();
cout << (double)(finish - start)/CLOCKS_PER_SEC << endl;
}
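Since the loops finish far too quickly to time reliably, counting iterations may be more informative than measuring the clock. The sketch below counts the inner-loop steps and compares them with the closed form L * (L + 1) / 2, where L is the number of outer iterations. The names `count_inner_steps` and `closed_form` are hypothetical, and the sketch assumes n >= 1.

```python
def count_inner_steps(n):
    """Count how many times the inner while-loop body runs for a given n."""
    steps = 0
    i = 1
    while i <= n:
        j = i
        while j > 0:
            steps += 1
            j //= 2  # floor of the quotient, as the pseudo-code assumes
        i *= 2
    return steps

def closed_form(n):
    """Sum 1 + 2 + ... + L, where L = floor(log2 n) + 1 outer iterations."""
    levels = n.bit_length()  # equals floor(log2 n) + 1 for n >= 1
    return levels * (levels + 1) // 2

if __name__ == "__main__":
    for n in (1, 8, 1000, 10**12):
        print(n, count_inner_steps(n), closed_form(n))
```

The count grows with the square of the number of bits in n, which matches a bound quadratic in log n.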

Rabin Karp algorithm for big strings

I wrote a simple step-by-step implementation of Rabin-Karp algorithm for substring search, and it seems to work fine until the hash becomes greater than the modulus, and then it goes wrong...
Here is the code, it's quite simple:
typedef long long ll;
#define B 257
//base
#define M 2147483647
//modulus
//modulus for positive and negative values
ll mod(ll a){
return (a % M + M) % M;
}
//fast way to calculate modular power
ll power(ll n, ll e){
ll r = 1;
for(; e > 0; e >>= 1, n = (n*n) % M)
if(e&1) r = (r * n) % M;
return r;
}
//function to calculate the initial hash
//H(s) = s[0] * B^0 + s[1] * B^1 + ...
ll H(char sub[], int s){
ll h = 0;
for(ll i = 0; i < s; i++)
h = mod(h + mod(power(B, i) * sub[i]));
return h;
}
//brute force comparing when hashes match
bool check(char text[], char sub[], int ini, int s){
int i = 0;
while(text[ini + i] == sub[i] && i < s) i++;
return i == s;
}
//all together here
void RabinKarp(char text[], char sub[]){
int t = strlen(text), s = strlen(sub);
ll hs = H(sub, s), ht = H(text, s);
int lim = t - s;
for(int i = 0; i <= lim; i++){
if(ht == hs)
if(check(text, sub, i, s))
printf("MATCH AT %d\n", i);
ht -= text[i];
ht /= B;
ht = mod(ht + power(B, s - 1) * text[i + s]);
//we had text[i] * B^0 + text[i+1] * B^1 + ... + text[i + len - 1] * B^(len-1)
//then text[i+1] * B^1 + text[i+2] * B^2 + ... + text[i + len - 1] * B^(len-1)
//then text[i+1] * B^0 + text[i+2] * B^1 + ... + text[i + len - 1] * B^(len-2)
//finally we add a new last term text[i + len] * B^(len-1)
//so we moved the hash to the next position
}
}
int main(){
char text[] = "uvauvauvaaauva";
char sub[] = "uva";
char sub2[] = "uvauva";
RabinKarp(text, sub);
printf("----------------------------\n");
RabinKarp(text, sub2);
}
The problem is that after I take the modulus, the hash can become a small number and then, when I add some big factor to it, the hashes may not match even when they should.
For example: abc inside xabc
when I take the hash of abc and xab, suppose both of them are bigger than the modulus, so they get small after the modulus operation.
Then, when I remove 'x' and add the 'c' factor, the sum can be smaller than the modulus but still big, so it won't match.
How can I overcome this problem?

ht /= B;
is not valid. First, you are doing arithmetic mod M, and the modular equivalent of division (multiplying by the modular inverse of B) is not the same as ordinary integer division. Second, you should get the same answer for x and x + M, and with integer division you will not.
You have text[i] * B^0 + text[i+1] * B^1 + ... + text[i + len - 1] * B^(len-1)
If you work with
text[i] * B^(len-1) + text[i+1] * B^(len - 2) + ... + text[i + len - 1] * B^0
you can subtract off text[i] * B^(len-1), multiply by B instead, and then add text[i + len], all mod M, with no division at all.
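A minimal sketch of the reversed polynomial, in Python rather than the question's C++. B and M are the question's constants; the function name `rabin_karp` and the choice to return a list of match positions instead of printing them are assumptions made for illustration.

```python
B = 257          # base, as in the question
M = 2147483647   # modulus, as in the question

def rabin_karp(text, sub):
    """Return every index at which sub occurs in text."""
    t, s = len(text), len(sub)
    if s == 0 or s > t:
        return []
    high = pow(B, s - 1, M)  # weight of the leading character, B^(s-1) mod M
    hs = ht = 0
    for i in range(s):
        # Horner's rule: the first character gets the highest power of B.
        hs = (hs * B + ord(sub[i])) % M
        ht = (ht * B + ord(text[i])) % M
    matches = []
    for i in range(t - s + 1):
        # Confirm a hash hit with a direct comparison to rule out collisions.
        if ht == hs and text[i:i + s] == sub:
            matches.append(i)
        if i + s < t:
            # Drop text[i], shift by multiplying with B, append text[i + s].
            ht = (ht - ord(text[i]) * high) % M
            ht = (ht * B + ord(text[i + s])) % M
    return matches

if __name__ == "__main__":
    print(rabin_karp("uvauvauvaaauva", "uva"))
    print(rabin_karp("uvauvauvaaauva", "uvauva"))
```

Python's `%` already returns a non-negative result for a positive modulus, so the subtraction needs no extra `+ M`; in C++ the question's `mod` helper does that job.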

Gold Rader bit reversal algorithm

I am trying to understand this bit reversal algorithm. I found a lot of sources, but they don't really explain how the pseudo-code works. For example, I found the pseudo-code below in http://www.briangough.com/fftalgorithms.pdf
for i = 0 ... n − 2 do
k = n/2
if i < j then
swap g(i) and g(j)
end if
while k ≤ j do
j ⇐ j − k
k ⇐ k/2
end while
j ⇐ j + k
end for
From looking at this pseudo-code, I don't understand why you would do
swap g(i) and g(j)
when the if statement is true.
Also, what does the while loop do? It would be great if someone could explain this pseudo-code to me.
Below is the C++ code that I found online.
void four1(double data[], int nn, int isign)
{
int n, mmax, m, j, istep, i;
double wtemp, wr, wpr, wpi, wi, theta;
double tempr, tempi;
n = nn << 1;
j = 1;
for (i = 1; i < n; i += 2) {
if (j > i) {
tempr = data[j]; data[j] = data[i]; data[i] = tempr;
tempr = data[j+1]; data[j+1] = data[i+1]; data[i+1] = tempr;
}
m = n >> 1;
while (m >= 2 && j > m) {
j -= m;
m >>= 1;
}
j += m;
}
Here is the full version of the source code that I found that does FFT
/************************************************
* FFT code from the book Numerical Recipes in C *
* Visit www.nr.com for the licence. *
************************************************/
// The following line must be defined before including math.h to correctly define M_PI
#define _USE_MATH_DEFINES
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define PI M_PI /* pi to machine precision, defined in math.h */
#define TWOPI (2.0*PI)
/*
FFT/IFFT routine. (see pages 507-508 of Numerical Recipes in C)
Inputs:
data[] : array of complex* data points of size 2*NFFT+1.
data[0] is unused,
* the n'th complex number x(n), for 0 <= n <= length(x)-1, is stored as:
data[2*n+1] = real(x(n))
data[2*n+2] = imag(x(n))
if length(Nx) < NFFT, the remainder of the array must be padded with zeros
nn : FFT order NFFT. This MUST be a power of 2 and >= length(x).
isign: if set to 1,
computes the forward FFT
if set to -1,
computes Inverse FFT - in this case the output values have
to be manually normalized by multiplying with 1/NFFT.
Outputs:
data[] : The FFT or IFFT results are stored in data, overwriting the input.
*/
void four1(double data[], int nn, int isign)
{
int n, mmax, m, j, istep, i;
double wtemp, wr, wpr, wpi, wi, theta;
double tempr, tempi;
n = nn << 1;
j = 1;
for (i = 1; i < n; i += 2) {
if (j > i) {
//swap the real part
tempr = data[j]; data[j] = data[i]; data[i] = tempr;
//swap the imaginary part
tempr = data[j+1]; data[j+1] = data[i+1]; data[i+1] = tempr;
}
m = n >> 1;
while (m >= 2 && j > m) {
j -= m;
m >>= 1;
}
j += m;
}
mmax = 2;
while (n > mmax) {
istep = 2*mmax;
theta = TWOPI/(isign*mmax);
wtemp = sin(0.5*theta);
wpr = -2.0*wtemp*wtemp;
wpi = sin(theta);
wr = 1.0;
wi = 0.0;
for (m = 1; m < mmax; m += 2) {
for (i = m; i <= n; i += istep) {
j =i + mmax;
tempr = wr*data[j] - wi*data[j+1];
tempi = wr*data[j+1] + wi*data[j];
data[j] = data[i] - tempr;
data[j+1] = data[i+1] - tempi;
data[i] += tempr;
data[i+1] += tempi;
}
wr = (wtemp = wr)*wpr - wi*wpi + wr;
wi = wi*wpr + wtemp*wpi + wi;
}
mmax = istep;
}
}
/********************************************************
* The following is a test routine that generates a ramp *
* with 10 elements, finds their FFT, and then finds the *
* original sequence using inverse FFT *
********************************************************/
int main(int argc, char * argv[])
{
int i;
int Nx;
int NFFT;
double *x;
double *X;
/* generate a ramp with 10 numbers */
Nx = 10;
printf("Nx = %d\n", Nx);
x = (double *) malloc(Nx * sizeof(double));
for(i=0; i<Nx; i++)
{
x[i] = i;
}
/* calculate NFFT as the next higher power of 2 >= Nx */
NFFT = (int)pow(2.0, ceil(log((double)Nx)/log(2.0)));
printf("NFFT = %d\n", NFFT);
/* allocate memory for NFFT complex numbers (note the +1) */
X = (double *) malloc((2*NFFT+1) * sizeof(double));
/* Storing x(n) in a complex array to make it work with four1.
This is needed even though x(n) is purely real in this case. */
for(i=0; i<Nx; i++)
{
X[2*i+1] = x[i];
X[2*i+2] = 0.0;
}
/* pad the remainder of the array with zeros (0 + 0 j) */
for(i=Nx; i<NFFT; i++)
{
X[2*i+1] = 0.0;
X[2*i+2] = 0.0;
}
printf("\nInput complex sequence (padded to next highest power of 2):\n");
for(i=0; i<NFFT; i++)
{
printf("x[%d] = (%.2f + j %.2f)\n", i, X[2*i+1], X[2*i+2]);
}
/* calculate FFT */
four1(X, NFFT, 1);
printf("\nFFT:\n");
for(i=0; i<NFFT; i++)
{
printf("X[%d] = (%.2f + j %.2f)\n", i, X[2*i+1], X[2*i+2]);
}
/* calculate IFFT */
four1(X, NFFT, -1);
/* normalize the IFFT */
for(i=0; i<NFFT; i++)
{
X[2*i+1] /= NFFT;
X[2*i+2] /= NFFT;
}
printf("\nComplex sequence reconstructed by IFFT:\n");
for(i=0; i<NFFT; i++)
{
printf("x[%d] = (%.2f + j %.2f)\n", i, X[2*i+1], X[2*i+2]);
}
getchar();
}
/*
Nx = 10
NFFT = 16
Input complex sequence (padded to next highest power of 2):
x[0] = (0.00 + j 0.00)
x[1] = (1.00 + j 0.00)
x[2] = (2.00 + j 0.00)
x[3] = (3.00 + j 0.00)
x[4] = (4.00 + j 0.00)
x[5] = (5.00 + j 0.00)
x[6] = (6.00 + j 0.00)
x[7] = (7.00 + j 0.00)
x[8] = (8.00 + j 0.00)
x[9] = (9.00 + j 0.00)
x[10] = (0.00 + j 0.00)
x[11] = (0.00 + j 0.00)
x[12] = (0.00 + j 0.00)
x[13] = (0.00 + j 0.00)
x[14] = (0.00 + j 0.00)
x[15] = (0.00 + j 0.00)
FFT:
X[0] = (45.00 + j 0.00)
X[1] = (-25.45 + j 16.67)
X[2] = (10.36 + j -3.29)
X[3] = (-9.06 + j -2.33)
X[4] = (4.00 + j 5.00)
X[5] = (-1.28 + j -5.64)
X[6] = (-2.36 + j 4.71)
X[7] = (3.80 + j -2.65)
X[8] = (-5.00 + j 0.00)
X[9] = (3.80 + j 2.65)
X[10] = (-2.36 + j -4.71)
X[11] = (-1.28 + j 5.64)
X[12] = (4.00 + j -5.00)
X[13] = (-9.06 + j 2.33)
X[14] = (10.36 + j 3.29)
X[15] = (-25.45 + j -16.67)
Complex sequence reconstructed by IFFT:
x[0] = (0.00 + j -0.00)
x[1] = (1.00 + j -0.00)
x[2] = (2.00 + j 0.00)
x[3] = (3.00 + j -0.00)
x[4] = (4.00 + j -0.00)
x[5] = (5.00 + j 0.00)
x[6] = (6.00 + j -0.00)
x[7] = (7.00 + j -0.00)
x[8] = (8.00 + j 0.00)
x[9] = (9.00 + j 0.00)
x[10] = (0.00 + j -0.00)
x[11] = (0.00 + j -0.00)
x[12] = (0.00 + j 0.00)
x[13] = (-0.00 + j -0.00)
x[14] = (0.00 + j 0.00)
x[15] = (0.00 + j 0.00)
*/

A bit-reversal algorithm creates a permutation of a data set by reversing the binary address of each item; so e.g. in a 16-item set the addresses:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
will be changed into:
0000 1000 0100 1100 0010 1010 0110 1110 0001 1001 0101 1101 0011 1011 0111 1111
and the corresponding items are then moved to their new address.
Or in decimal notation:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
becomes
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
What the while loop in the pseudo-code does, is set variable j to this sequence. (Btw, the initial value of j should be 0).
You'll see that the sequence is made up like this:
0
0 1
0 2 1 3
0 4 2 6 1 5 3 7
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
with each sequence being made by multiplying the previous version by 2, and then repeating it with 1 added. Or looking at it another way: by repeating the previous sequence, interlaced with the values + n/2 (this more closely describes what happens in the algorithm).
0
0 1
0 2 1 3
0 4 2 6 1 5 3 7
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15
Items i and j are then swapped in each iteration of the for loop, but only if i < j; otherwise every item would be swapped to its new place (e.g. when i = 3 and j = 12), and then back again (when i = 12 and j = 3).
function bitReversal(data) {
var n = data.length;
var j = 0;
for (var i = 0; i < n - 1; i++) {
var k = n / 2;
if (i < j) {
var temp = data[i]; data[i] = data[j]; data[j] = temp;
}
while (k <= j) {
j -= k;
k /= 2;
}
j += k;
}
return(data);
}
console.log(bitReversal([0,1]));
console.log(bitReversal([0,1,2,3]));
console.log(bitReversal([0,1,2,3,4,5,6,7]));
console.log(bitReversal([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]));
console.log(bitReversal(["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p"]));
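The claim that the loop steps j through the bit-reversed sequence can be cross-checked with a short Python sketch. Both function names are hypothetical. The `0 < k` guard is an addition for this sketch: it keeps the integer halving from looping forever once k reaches 0 on the last element, where the JavaScript version above relies on fractional k instead.

```python
def gold_rader_indices(n):
    """Values taken by j in the Gold-Rader loop for i = 0 .. n-1 (n a power of 2)."""
    seq = []
    j = 0
    for _ in range(n):
        seq.append(j)
        k = n // 2
        # Strip the leading 1-bits of j from the top, as in a reversed increment.
        while 0 < k <= j:
            j -= k
            k //= 2
        j += k  # set the first 0-bit found, counting from the top
    return seq

def reverse_bits(i, bits):
    """Reverse the lowest `bits` bits of i directly, for comparison."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

if __name__ == "__main__":
    print(gold_rader_indices(16))
    print([reverse_bits(i, 4) for i in range(16)])
```

In other words, the while loop adds 1 to j with the carry travelling from the most significant bit downwards, which is what incrementing a bit-reversed counter amounts to.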
The C++ code you found appears to use the symmetry of the sequence to loop through it in double steps. It doesn't produce the correct result though, so either it's a failed attempt, or maybe it's designed to do something different entirely. Here's a version that uses the two-step idea:
function bitReversal2(data) {
var n = data.length;
var j = 0;
for (var i = 0; i < n; i += 2) {
if (i < j) {
var temp = data[i]; data[i] = data[j]; data[j] = temp;
}
else {
var temp = data[n-1 - i]; data[n-1 - i] = data[n-1 - j]; data[n-1 - j] = temp;
}
var k = n / 4;
while (k <= j) {
j -= k;
k /= 2;
}
j += k;
}
return(data);
}
console.log(bitReversal2([0,1]));
console.log(bitReversal2([0,1,2,3]));
console.log(bitReversal2([0,1,2,3,4,5,6,7]));
console.log(bitReversal2([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]));
console.log(bitReversal2(["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p"]));

First off, thanks for everyone's help in answering my question. I was talking to the person who is helping me, and I think I understand my C++ code now. Maybe my question was kind of unclear, but what I am trying to do is implement an FFT in C++. The C++ code I gave in the question is just the first half of the FFT source code that I found online. Essentially, this part of the code works on an array in which each complex input is stored as a real part followed by an imaginary part: the real parts sit at the odd indexes and the imaginary parts at the even indexes (real_0, imag_0, real_1, imag_1, real_2, imag_2, ...), with the array index starting at 1 (data[0] is unused).
The swap operation below swaps the real parts of two inputs, and then their imaginary parts.
tempr = data[j]; data[j] = data[i]; data[i] = tempr;
tempr = data[j+1]; data[j+1] = data[i+1]; data[i+1] = tempr;
For example, say we have 8 inputs (nn=8), so n = 2*nn = 16, and the real and imaginary parts are stored in an array of 16 values. The first time through the for loop of the C++ code, j=1 and i=1, so the if statement is false and the while loop is skipped (j is not greater than m=8), and then j becomes 9. In the second iteration, j=9 and i=3, so the if statement is true: data[9] and data[3] are swapped, and data[9+1] and data[3+1] are swapped. This is the bit reversal, because indexes 3 and 4 hold the real and imaginary parts of input number 1, and indexes 9 and 10 hold the real and imaginary parts of input number 4 (counting inputs from 0).
As a result, 001 = 1 (input number 1)
100 = 4 (input number 4)
are swapped, so this is doing the bit reversal using the indexes.
I don't really understand the while loop yet, but I know from #m69 that it is just a way of generating the sequence needed for the bit reversal. Hopefully my explanation of my own question is clear to people who have the same question. Once again, thanks everyone.

Is there any fast method of matrix exponentiation?

Is there any faster method of matrix exponentiation to calculate M^n (where M is a matrix and n is an integer) than the simple divide and conquer algorithm?

If M is diagonalizable, you could factor the matrix into eigenvalues and eigenvectors. Then you get
M = V * D * V^-1
where V is the eigenvector matrix and D is a diagonal matrix. To raise this to the nth power, you get something like:
M^n = (V * D * V^-1) * (V * D * V^-1) * ... * (V * D * V^-1)
= V * D^n * V^-1
Because all the V and V^-1 terms cancel.
Since D is diagonal, you just have to raise a bunch of scalars (the eigenvalues: real if M is symmetric, possibly complex otherwise) to the nth power, rather than full matrices. You can do that in logarithmic time in n.
Calculating eigenvalues and eigenvectors is O(r^3) (where r is the number of rows/columns of M). Depending on the relative sizes of r and n, this might be faster or not.
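A small worked instance may make the cancellation concrete. The sketch below hard-codes one example matrix, M = [[2, 1], [1, 2]], whose eigenvalues are 1 and 3 with eigenvectors (1, -1) and (1, 1). The matrix and the names `mat_mul`, `power_via_eigen` and `power_naive` are all chosen for illustration; a general implementation would compute the decomposition numerically.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def power_via_eigen(n):
    """M^n for M = [[2, 1], [1, 2]] using M^n = V * D^n * V^-1."""
    v = [[1, 1], [-1, 1]]             # eigenvectors as columns
    d_n = [[1 ** n, 0], [0, 3 ** n]]  # D^n: each eigenvalue raised to n
    v_inv_times_2 = [[1, -1], [1, 1]]  # V^-1 scaled by 2 to stay in integers
    scaled = mat_mul(mat_mul(v, d_n), v_inv_times_2)
    return [[x // 2 for x in row] for row in scaled]  # undo the scaling

def power_naive(n):
    """Reference result by repeated multiplication."""
    m = [[2, 1], [1, 2]]
    result = [[1, 0], [0, 1]]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

if __name__ == "__main__":
    print(power_via_eigen(3), power_naive(3))
```

Only the two scalars on the diagonal are exponentiated; the two surrounding matrix products cost the same regardless of n.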

It's quite simple to use the fast power algorithm (exponentiation by squaring). Use the following algorithm.
#define SIZE 10
//Set a to the identity matrix E:
// 1 0 ... 0
// 0 1 ... 0
// ....
// 0 0 ... 1
void one(long a[SIZE][SIZE])
{
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
a[i][j] = (i == j);
}
//Multiply matrix a by matrix b and store the result in a
void mul(long a[SIZE][SIZE], long b[SIZE][SIZE])
{
long res[SIZE][SIZE] = {{0}};
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
for (int k = 0; k < SIZE; k++)
{
res[i][j] += a[i][k] * b[k][j];
}
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
a[i][j] = res[i][j];
}
//Calculate a^n and store the result in matrix res (note: a is overwritten)
void pow(long a[SIZE][SIZE], long n, long res[SIZE][SIZE])
{
one(res);
while (n > 0) {
if (n % 2 == 0)
{
mul(a, a);
n /= 2;
}
else {
mul(res, a);
n--;
}
}
}
Below is the equivalent for numbers:
long power(long num, long pow)
{
if (pow == 0) return 1;
if (pow % 2 == 0)
return power(num*num, pow / 2);
else
return power(num, pow - 1) * num;
}

Exponentiation by squaring is frequently used to get high powers of matrices.

I would recommend the approach used to calculate the Fibonacci sequence in matrix form. AFAIK, it takes O(log(n)) matrix multiplications.
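A minimal sketch of that matrix form, using exponentiation by squaring on the 2x2 matrix [[1, 1], [1, 0]]. The names `mat_mul`, `mat_pow` and `fib` are hypothetical, and the sketch assumes a square matrix and a non-negative integer exponent.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_pow(m, n):
    """m^n by repeated squaring: O(log n) matrix multiplications."""
    size = len(m)
    result = [[int(i == j) for j in range(size)] for i in range(size)]  # identity
    base = [row[:] for row in m]  # copy so the caller's matrix is untouched
    while n > 0:
        if n & 1:  # this bit of the exponent is set
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

def fib(n):
    """n-th Fibonacci number, read from [[1, 1], [1, 0]]^n (fib(0) = 0)."""
    return mat_pow([[1, 1], [1, 0]], n)[0][1]

if __name__ == "__main__":
    print([fib(n) for n in range(11)])
```

Each multiplication still costs O(r^3) for an r x r matrix, so the overall cost is O(r^3 log n) with the schoolbook product used here.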
