How to realize a 3-level cache with Halide

I want to use Halide to simulate a three-level cache architecture for a CPU, where the memory is partitioned at each higher level, e.g.
(256*256) -> 16 * (16*256) -> 16 * 4 * (4*256) -> 16 * (16*256) -> (256*256)
I use this C++ code:
Var x1, x2;
Func l1, l2, l3, l2_out, l3_out;
l3.store_in(MemoryType::L3);
l2.store_in(MemoryType::L2);
l1.store_in(MemoryType::L1);
l3_out.store_in(MemoryType::L3);
l2_out.store_in(MemoryType::L2);
auto l2_size = 16 * 256, l1_size = 4 * 256;
for (auto i = 0; i < 16; i++) { // 16 times l3->l2
    RDom r_l2(0, l2_size, "l2_reduce");
    l2(x2) = l3(x2);
    l2(r_l2) = l3(i * l2_size + r_l2);
    for (auto j = 0; j < 4; j++) { // 4 times l2->l1
        RDom r_l1(0, l1_size, "l1_reduce");
        l1(x1) = l2(j * l1_size + r_l1);
        l2_out(j * l1_size + r_l1) = l1(r_l1);
    }
    l3_out(i * l2_size + r_l2) = l2_out(r_l2);
}
It seems I can't define a reduction in a pure function definition:
terminate called after throwing an instance of 'Halide::CompileError'
what(): Error: In pure definition of Func "l1$0":
Reduction domain referenced in pure function definition.
Is there any way to make this work?
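For context, Halide only allows an RDom inside update definitions; a pure definition may use only pure Vars. So each Func needs a pure initialization first, with the RDom moved into an update stage. A minimal sketch against the code above (using undef so the initialization writes nothing; the int element type is an assumption):

l1(x1) = undef<int>();             // pure definition: pure Vars only
l1(r_l1) = l2(j * l1_size + r_l1); // update definition: the RDom is legal here

The same pattern applies to l2, l2_out, and l3_out wherever an RDom currently appears in a pure definition.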

Related

Matrix multiplication via std::vector is 10 times slower than numpy

Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now since it is flexible and many existing functions can handle std::vector.
I thought that in small cases the speed difference could be ignored, but it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().
Let A and B be matrices whose size is size x size. Assuming square matrices is just for simplicity. (We don't intend to limit the discussion to the square-matrix case.) We initialize each matrix in a deterministic way, and finally calculate C = A * B.
We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.
Python3 code
import numpy as np
import time
import sys
if (len(sys.argv) != 2):
    print("Pass `size` as an argument.", file=sys.stderr);
    sys.exit(1);
size = int(sys.argv[1]);
A = np.ndarray((size, size));
B = np.ndarray((size, size));
for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j
start = time.time()
C = np.dot(A, B);
print("{:.3e}".format(time.time() - start), file=sys.stderr);
C++ code
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib> // for atoi
using namespace std;
int main(int argc, char **argv) {
    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }
    const unsigned size = atoi(argv[1]);
    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));
    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }
    auto start = chrono::system_clock::now();
    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            for (unsigned k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";
}
C++ code (multithreaded)
We also wrote a multithreaded version of the C++ code, since numpy.dot() is automatically computed in parallel.
You can get all the codes from GitHub.
Result
The C++ version is 10+ times slower than the Python 3 (numpy) version.
matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
C++ (1 thread): 8.66e-03
C++ (2 threads): 4.68e-03
C++ (3 threads): 3.14e-03
C++ (4 threads): 2.43e-03
Python 3: 4.07e-04
-----------------------------------------------
matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
C++ (1 thread): 6.985e-02
C++ (2 threads): 3.647e-02
C++ (3 threads): 2.462e-02
C++ (4 threads): 1.915e-02
Python 3: 1.466e-03
-----------------------------------------------
Question
Is there any way to make the C++ implementation faster?
Optimizations I Tried
1. Swap the calculation order -> at most 3.5 times faster (than the original C++ code, not than the numpy code).
2. Optimization 1, plus partial unrolling -> at most 4.5 times faster. (size does not need to be known in advance for this: as pointed out in this comment, we can simply cap the loop variables of the unrolled loops and process the remaining elements with normal loops. See my implementation for an example.)
3. Optimization 2, plus minimizing calls to C[i][j] by introducing a simple variable sum -> at most 5.2 times faster. The implementation is here. This result implies that std::vector::operator[] is un-ignorably slow.
4. Optimization 3, plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3, of course.)
5. Optimization 4, plus reducing calls to operator[] by introducing a pointer to an element of A, since A's elements are accessed sequentially in the unrolled loop -> at most 6.2 times faster, and a tiny bit faster than Optimization 4. The code is shown below.
6. The g++ -funroll-loops flag to unroll for loops -> no change.
7. The g++-specific #pragma GCC unroll n -> no change.
8. The g++ -flto flag to turn on link-time optimization -> no change.
9. A block algorithm -> no change.
10. Transposing B to avoid cache misses -> no change.
11. A long linear std::vector instead of nested std::vector<std::vector>, with swapped calculation order, a block algorithm, and partial unrolling -> at most 2.2 times faster.
12. Optimization 1, plus PGO (profile-guided optimization) -> 4.7 times faster.
13. Optimization 3, plus PGO -> same as Optimization 3.
14. Optimization 3, plus the g++-specific __builtin_prefetch() -> same as Optimization 3.
Current Status
(originally) 13.06 times slower -> (currently) 2.10 times slower
Again, you can get all the code on GitHub. Let us cite some of it below; all of the following are functions called from the multithreaded version of the C++ code.
Original Code (GitHub)
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned j = 0; j < j_max; ++j) {
            for (unsigned k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
Current Best Code (GitHub)
This is the implementation of Optimization 5 above.
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    static const unsigned num_unroll = 5;
    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();
    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (unsigned j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum  = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        for (unsigned k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (unsigned j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }
}
We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem, and finally reached the point where we have no more ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's will do better, since the cases we handle are not large, and each operation on std::vector is so expensive that, as we've seen, just reducing calls to operator[] improved performance considerably.
We (want to) believe we can still make it better, though.
Matrix multiplication is relatively easy to optimize. However, if you want to reach decent CPU utilization it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:
Use SIMD instructions
Use register blocking and fetch multiple data at once
Optimize for your cache lines (mainly L2 and L3)
Parallelize your code to use multiple threads
A sketch of the cache-blocking step is given below. This link is a very good resource that explains all the nasty details:
https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
If you want more in-depth advice, leave a comment.
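As an illustration of the cache-blocking step, here is a minimal sketch under assumptions, not the answerer's code: matmul_blocked, the flat row-major layout, and the block size BS are all invented for illustration.

#include <algorithm>
#include <cstddef>
#include <vector>

// Multiply two n x n row-major matrices tile by tile, so the three active
// BS x BS tiles stay resident in cache while they are reused.
void matmul_blocked(const std::vector<double> &A, const std::vector<double> &B,
                    std::vector<double> &C, std::size_t n) {
    const std::size_t BS = 64; // tunable: pick so ~3 * BS * BS doubles fit in L2
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // plain i-k-j loops inside one tile; the swapped order keeps
                // the innermost accesses to B and C sequential in memory
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                        const double a = A[i * n + k]; // reused across the whole j loop
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

SIMD and multithreading would then be applied on top of this loop nest; the linked gist walks through those stages in detail.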

C code is running too slow due to nested for loops

My C program runs too slowly (right now it takes around 40 seconds without parallelization). I have tried using OpenMP, which brought the timing down significantly, but I am looking for simple and natural ways to make my code run faster other than using parallel for loops. The basic structure of the code is that it takes some command line arguments as inputs, saves those inputs as variables, and then recursively computes a variable called Rplus1 using the math.h and complex.h libraries. The problem, and where the code spends most of its time, is in the nested for loops at the bottom. My goal is to get the whole code running in under 5 seconds, but as of now it runs in about 40 seconds without parallel for loops. Please help!
#include "time.h"
#include "stdio.h"
#include "stdlib.h"
#include "complex.h"
#include "math.h"
#include "string.h"
#include "unistd.h"
#include "omp.h"
#define PI 3.14159265
int main (int argc, char *argv[]){
if(argc >= 8){
double start1 = omp_get_wtime();
// command line arguments are aligned in the following order: [theta] [number of layers in superlattice] [material_1] [lat const_1] [number of unit cells_1] [material_2] [lat const_2] [number of unit cells_2] .... [material_N] [lat const_N] [number of unit cells_N] [Log/Linear] [number of repeating superlattice layers] [yes/no]
int N;
sscanf(argv[2],"%d",&N); // Number of layers in superlattice specified by second input argument
if(strcmp(argv[argc-1],"yes") == 0) //If the substrate is included then add one more layer to the N variable
{
N = N+1;
}
int total;
sscanf(argv[argc-2],"%d",&total); // Number of repeating superlattice layers specified by second to last argument
double layers[N][6], horizangle[1001], vertangle[1001];
double complex (*F_hkl)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*F_0)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*g)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*g_0)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)),SF_table[10];// this array will hold the unit cell structure factors for all of the materials selected for each wavevector in the beam spectrum
double real, real2, lam, c_light = 299792458, h_pl = 4.135667516e-15,E = 10e3, r_0 = 2.818e-15, Lccd = 1.013;// just a few variables to hold values through calculations and constants, speed of light, plancks const, photon energy, and detector distance from sample
double angle;
double complex z;// just a variable to hold complex numbers throughout calculations
int i,j,m,n,t; // integers to index through arrays
lam = (h_pl*c_light)/E;
sscanf(argv[1],"%lf",&angle); //first argument is the angle of incidence, read it
angle = angle*(PI/180.0);
angle2 = -angle;
double (*table)[10] = malloc(10*9*sizeof(double)); // this array holds all the coefficients to calculate the atomic scattering factor below
double (*table2)[10] = malloc(10*2*sizeof(double));
FILE*datfile1 = fopen("/home/vhosts/xraydev.engr.wisc.edu/data/coef_table.bin","rb"); // read the binary file containg all the coefficients
fread(table,sizeof(double),90,datfile1);
fclose(datfile1);
FILE*datfile2 = fopen("/home/vhosts/xraydev.engr.wisc.edu/data/dispersioncs.bin","rb");
fread(table2,sizeof(double),20,datfile2);
fclose(datfile2);
// Calculate scattering factors for all elements
double a,b;
double k_z = (sin(angle)/lam)*1e-10; // incorporate angular dependence of SF but neglect 0.24 degree divergence because of approximation
for(i = 0;i<10;i++) // for each element...
{
SF_table[i] = 0;
for(j = 0;j<4;j++) // summation
{
a = table[2*j][i];
b = table[2*j+1][i];
SF_table[i] = SF_table[i] + a * exp(-b*k_z*k_z);
}
SF_table[i] = SF_table[i] + table[8][i] + table2[0][i] + table2[1][i]*I;
}
free(table);
double mm = 4.0, (*phi)[1001][1001] = malloc(N*1001*1001*sizeof(double));
for(i = 1; i < N+1; i++) // for each layer of material...
{
sscanf(argv[i*3+1],"%lf",&layers[i-1][1]); // get out of plane lattice constant
sscanf(argv[i*3+2],"%lf",&layers[i-1][2]); // get the number of unit cells in the layer
layers[i-1][1] = layers[i-1][1]*1e-10; // convert lat const input to meters
// Define reciprocal space positions at the incident angle h, k, l
layers[i-1][3] = 0; // h
layers[i-1][4] = 0; // k
double l; // l calculated for each wavevector in the spectrum because l changes with angle of incidence
for (m = 0; m < 1001; m++)
{
for (n = 0; n <1001; n++)
{
l = 4;
phi[i-1][m][n] = 2*PI*layers[i-1][1]*sin(angle)/lam; // Caculate phi for each layer
if(strcmp(argv[i*3],"GaAs") == 0)
{
F_hkl[i-1][m][n] = (2+2*cexp(I*PI*l))*(SF_table[2]+SF_table[3]*cexp(I*PI*l/2));
F_0[i-1][m][n] = 0.5*8.0*(31 + table2[0][2] + table2[1][2]*I) + 0.5*8.0*(33 + table2[0][3] + table2[1][3]*I);
g[i-1][m][n] = 2*r_0*F_hkl[i-1][m][n]/mm/layers[i-1][1]*cos(2*angle[m][n]);
g_0[i-1][m][n] = 2*r_0*F_0[i-1][m][n]/mm/layers[i-1][1];
}
if(strcmp(argv[i*3],"AlGaAs") == 0)
{
F_hkl[i-1][m][n] = (2+2*cexp(I*PI*l))*((0.76*SF_table[2]+ 0.24*SF_table[4])+SF_table[3]*cexp(I*PI*l/2));
F_0[i-1][m][n] = 0.24*4.0*(13 + table2[0][4] + table2[1][4]*I) + 0.76*4.0*(31 + table2[0][2] + table2[1][2]*I) + 4.0*(33 + table2[0][3] + table2[1][3]*I);
g[i-1][m][n] = 2*r_0*F_hkl[i-1][m][n]/mm/layers[i-1][1]*cos(2*angle[m][n]);
g_0[i-1][m][n] = 2*r_0*F_0[i-1][m][n]/mm/layers[i-1][1];
}
}
}
}
double complex (*Rplus1)[1001] = malloc(1001*1001*sizeof(double complex));
for (m = 0; m < 1001; m++)
{
for (n = 0; n <1001; n++)
{
Rplus1[m][n] = 0.0;
}
}
double stop1 = omp_get_wtime();
for(i=1;i<N;i++) // For each layer of the film
{
for(j=0;j<layers[i][2];j++) // For each unit cell
{
for (m = 0; m < 1001; m++) // For each row of the diffraction pattern
{
for (n = 0; n <1001; n++) // For each column of the diffraction pattern
{
Rplus1[m][n] = -I*g[i][m][n] + ((1-I*g_0[i][m][n])*(1-I*g_0[i][m][n]))/(I*g[i][m][n] + (cos(-2*phi[i][m][n])+I*sin(-2*phi[i][m][n]))/Rplus1[m][n]);
}
}
}
}
double stop2 = omp_get_wtime();
double elapsed1 = (double)(stop1 - start1);// Second user defined function to use Durbin and Follis recursive formula
double elapsed2 = (double)(stop2 - start1);// Second user defined function to use Durbin and Follis recursive formula
printf("main() through before diffraction function took %f seconds to run\n\n",elapsed1);
printf("main() through after diffraction function took %f seconds to run\n\n",elapsed2);
}
}
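One optimization worth trying before parallelization (a hedged sketch, not tested against the real data): in the code above, phi[i][m][n] is assigned a value that does not depend on m or n, and, with a scalar angle, neither do g[i][m][n] and g_0[i][m][n]. The per-layer complex factors can therefore be hoisted out of the two inner 1001x1001 loops, which removes the cos/sin calls and most of the complex arithmetic from the innermost loop:

/* Hoisted variant of the hot loop; the [i][0][0] reads rely on the values
   being identical for all m, n, as they are in the code above. */
for (i = 1; i < N; i++) { // for each layer of the film
    const double complex gi = g[i][0][0];
    const double complex g0i = g_0[i][0][0];
    const double complex num = (1 - I*g0i) * (1 - I*g0i);
    const double complex ephi = cexp(-2*I*phi[i][0][0]); // == cos(-2*phi) + I*sin(-2*phi)
    for (j = 0; j < layers[i][2]; j++) { // for each unit cell
        for (m = 0; m < 1001; m++) {
            for (n = 0; n < 1001; n++) {
                Rplus1[m][n] = -I*gi + num / (I*gi + ephi / Rplus1[m][n]);
            }
        }
    }
}

If the full program really does vary these quantities per pixel, the same idea still applies one level down: compute cexp(-2*I*phi[i][m][n]) once per element instead of calling cos and sin separately inside the innermost loop.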

Issue with upc_memget: Caught a fatal signal : SIGSEGV(11) on node 2/4

I'm trying to write matrix multiplication code in UPC. If I don't use b_local and directly use b, it works fine. But when I use b_local via the memget function, it crashes at the "upc_memget" line with the above error.
#include <upc.h>      /* assumed; the excerpt omits includes */
#include <sys/time.h> /* assumed; for gettimeofday */
#define N 10    // Input Matrix A = N*P
#define P 10    // Input Matrix B = P*M
#define M 10    // Result Matrix C = N*M
#define ITER 1  /* assumed; omitted from the excerpt */
shared [N*P/THREADS] double a[N][P], c[N][M];
shared [M/THREADS] double b[P][M];
double b_local[P][M];
int main() {
    int i, j, k, l;                      /* declarations omitted from the excerpt */
    struct timeval start_time, end_time; /* likewise */
    // Initialization
    if (MYTHREAD == 0)
        gettimeofday(&start_time, NULL);
    upc_barrier;
    upc_memget(b_local, b, P*M*sizeof(double));
    for (k = 0; k < ITER; k++) {
        /* UPC_FORALL work-sharing construct for matrix multiplication */
        upc_forall(i = 0; i < N; i++; &a[i][0]) {
            // &a[i][0] determines affinity
            for (j = 0; j < M; j++) {
                c[i][j] = 0;
                for (l = 0; l < P; l++) c[i][j] += a[i][l] * b_local[l][j];
            }
        }
    }
    upc_barrier;
    if (MYTHREAD == 0)
        gettimeofday(&end_time, NULL);
}
upc_memget fetches a contiguous block of memory with affinity to a single thread. Given your declaration of b, only roughly P*M/THREADS elements have affinity to thread 0, yet your call tries to fetch all P*M elements from that one thread. The crash occurs because the copy runs past the end of the elements allocated with affinity to thread 0, into unallocated memory. A sketch of a per-block fix follows.
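One possible fix (a hedged sketch; it assumes M is divisible by THREADS, matching the declared blocksize): fetch b one block at a time, so that each upc_memget stays within memory that has affinity to a single thread. UPC shared-pointer arithmetic walks the array in layout order, so src + off addresses the right block:

size_t off;
size_t block = M / THREADS;                /* elements per block, per b's declared blocksize */
double *dst = &b_local[0][0];
shared [M/THREADS] double *src = &b[0][0];
for (off = 0; off < (size_t)P * M; off += block)
    upc_memget(dst + off, src + off, block * sizeof(double));

Alternatively, a plain element-wise copy (b_local[i][j] = b[i][j] in a double loop) is also correct, just slower, since each read becomes a fine-grained remote get.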

Is this part of a real IFFT process really optimal?

When calculating an (I)FFT it is possible to compute "N*2 real" data points using an ordinary complex (I)FFT of N data points.
Not sure about my terminology here, but this is how I've read it described.
There are several posts about this on stackoverflow already.
This can speed things up a bit when only dealing with such "real" data which is often the case when dealing with for example sound (re-)synthesis.
This increase in speed is offset by the need for a pre-processing step that somehow... uhh... fidaddles? the data to achieve this. Look I'm not even going to try to convince anyone I fully understand this but thanks to previously mentioned threads, I came up with the following routine, which does the job nicely (thank you!).
However, on my microcontroller this costs a bit more than I'd like even though trigonometric functions are already optimized with LUTs.
But the routine itself looks like it should be possible to optimize mathematically to minimize processing. To me it seems similar to a plain 2D rotation. I just can't quite wrap my head around it, but it feels like this could be done with fewer trigonometric calls and fewer arithmetic operations.
I was hoping perhaps someone else might easily see what I don't and provide some insight into how this math may be simplified.
This particular routine is for use with IFFT, before the bit-reversal stage.
pseudo-version:
INPUT
MAG_A/B = 0 TO 1
PHA_A/B = 0 TO 2PI
INDEX = 0 TO PI/2
r = MAG_A * sin(PHA_A)
i = MAG_B * sin(PHA_B)
rsum = r + i
rdif = r - i
r = MAG_A * cos(PHA_A)
i = MAG_B * cos(PHA_B)
isum = r + i
idif = r - i
r = -cos(INDEX)
i = -sin(INDEX)
rtmp = r * isum + i * rdif
itmp = i * isum - r * rdif
OUTPUT rsum + rtmp
OUTPUT itmp + idif
OUTPUT rsum - rtmp
OUTPUT itmp - idif
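For reference, this appears to match the standard split/merge step for computing a length-2N real (I)FFT via a length-N complex (I)FFT. Hedged, since the exact signs depend on FFT direction and normalization, the forward-direction recombination is usually written as

X_k = \frac{1}{2}\left(Z_k + \overline{Z_{N-k}}\right) - \frac{i}{2}\, e^{-i\pi k/N} \left(Z_k - \overline{Z_{N-k}}\right), \qquad k = 0, \dots, N-1,

where Z is the length-N complex FFT of the packed samples z_n = x_{2n} + i\,x_{2n+1}; the routine above implements the inverse of that mapping. The rsum/rdif and isum/idif pairs are the sum and difference terms, and (-cos(INDEX), -sin(INDEX)) is the twiddle factor e^{-i\pi k/N}. Because the twiddle genuinely varies per bin, it is hard to see how to drop below one sin/cos pair per bin pair; the other four trig calls come from converting the polar (MAG, PHA) inputs to rectangular form, which is inherent to the input format rather than to this recombination.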
original working code, if that's your poison:
void fft_nz_set(fft_complex_t complex[], unsigned bits, unsigned index,
                int32_t mag_lo, int32_t pha_lo, int32_t mag_hi, int32_t pha_hi) {
    unsigned size = 1 << bits;
    unsigned shift = SINE_TABLE_BITS - (bits - 1);
    unsigned n = index;        // index for mag_lo, pha_lo
    unsigned z = size - index; // index for mag_hi, pha_hi
    int32_t rsum, rdif, isum, idif, r, i;
    r = smmulr(mag_lo, sine(pha_lo));   // mag_lo * sin(pha_lo)
    i = smmulr(mag_hi, sine(pha_hi));   // mag_hi * sin(pha_hi)
    rsum = r + i; rdif = r - i;
    r = smmulr(mag_lo, cosine(pha_lo)); // mag_lo * cos(pha_lo)
    i = smmulr(mag_hi, cosine(pha_hi)); // mag_hi * cos(pha_hi)
    isum = r + i; idif = r - i;
    r = -sinetable[(1 << SINE_BITS) - (index << shift)]; // cos(pi_c * (index / size) / 2)
    i = -sinetable[index << shift];                      // sin(pi_c * (index / size) / 2)
    int32_t rtmp = smmlar(r, isum, smmulr(i, rdif)) << 1; // r * isum + i * rdif
    int32_t itmp = smmlsr(i, isum, smmulr(r, rdif)) << 1; // i * isum - r * rdif
    complex[n].r = rsum + rtmp;
    complex[n].i = itmp + idif;
    complex[z].r = rsum - rtmp;
    complex[z].i = itmp - idif;
}
// For reference, this would be used as follows to generate a sawtooth (after IFFT)
void synth_sawtooth(fft_complex_t *complex, unsigned fft_bits) {
    unsigned fft_size = 1 << fft_bits;
    fft_sym_dc(complex, 0, 0); // sets dc bin [0]
    for (unsigned n = 1, z = fft_size - 1; n <= fft_size >> 1; n++, z--) {
        // calculation of amplitude/index (sawtooth) for both n and z
        fft_sym_magnitude(complex, fft_bits, n, 0x4000000 / n, 0x4000000 / z);
    }
}

CUDA: Why accessing the same device array is not coalesced?

I am posting drilled-down code for review. I believe it should compile and execute without any problems, but since I excluded all the irrelevant parts, I might have made some mistakes.
struct Users {
    double A[96];
    double B[32];
    double C[32];
};
This is my Users structure with fixed length arrays. Below is given the main function.
#include <cstring>        // assumed; the excerpt omits includes (memset)
#include <cuda_runtime.h> // assumed; for cudaMalloc/cudaMemcpy
__global__ void calc(Users *users, double *Step, int numUsers); // defined below
int main(int argc, char **argv) {
    int numUsers = 10;
    Users *users = new Users[numUsers];
    double Step[96];
    for (int i = 0; i < 32; i++) {
        Step[i] = 0.8;
        Step[i + 32] = 0.8;
        Step[i + 64] = 0.8;
    }
    for (int usr = 0; usr < numUsers; usr++) {
        for (int i = 0; i < 32; i++) {
            users[usr].A[i] = 10;
            users[usr].A[i + 32] = 20;
            users[usr].A[i + 64] = 30;
        }
        memset(users[usr].B, 0, sizeof(double) * 32);
        memset(users[usr].C, 0, sizeof(double) * 32);
    }
    double *d_Step;
    cudaMalloc((void**)&d_Step, sizeof(double) * 96);
    cudaMemcpy(d_Step, Step, sizeof(double) * 96, cudaMemcpyHostToDevice);
    Users *deviceUsers;
    cudaMalloc((void**)&deviceUsers, sizeof(Users) * numUsers);
    cudaMemcpy(deviceUsers, users, sizeof(Users) * numUsers, cudaMemcpyHostToDevice);
    dim3 grid;
    dim3 block;
    grid.x = 1;
    grid.y = 1;
    grid.z = 1;
    block.x = 32;
    block.y = 10;
    block.z = 1;
    calc<<<grid, block>>>(deviceUsers, d_Step, numUsers);
    delete[] users; // was `delete users`; new[] requires delete[]
    return 0;
}
Please note that Step is a 1D array with 96 bins, and that I am spanning 10 warps (32 threads in the x direction, with 10 of these in my block). Each warp will access the same Step array, as can be seen in the kernel below.
__global__ void calc(Users *users, double *Step, int numUsers) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    int uId = threadIdx.y;
    while (uId < numUsers) {
        double mean00 = users[uId].A[tId] * Step[tId];
        double mean01 = users[uId].A[tId + 32] * Step[tId + 32];
        double mean02 = users[uId].A[tId + 64] * Step[tId + 64];
        users[uId].A[tId] = (mean00 == 0 ? 0 : 1 / mean00);
        users[uId].A[tId + 32] = (mean01 == 0 ? 0 : 1 / mean01);
        users[uId].A[tId + 64] = (mean02 == 0 ? 0 : 1 / mean02);
        uId += 10;
    }
}
Now, when I use the NVIDIA Visual Profiler, the coalesced retrievals are 47%. I investigated further and found that the Step array, which is accessed by every warp, causes this problem. If I replace it with some constant, the accesses are 100% coalesced.
Q1) As I understand it, coalesced accesses are tied to aligned memory segments, i.e. the bytes a warp touches have to fall within aligned segments (multiples of 32 bytes), whether the elements are ints or doubles. Why am I not getting coalesced accesses?
As far as I know, whenever CUDA allocates a memory block in device global memory, it assigns an aligned address to it. Thus, as long as a warp accesses the starting point plus 32 consecutive locations, the access should be coalesced. Am I correct?
Hardware
Geforce GTX 470, Compute Capability 2.0
Your kernel reads Step 10 times from global memory. Although the L1 cache can reduce the actual traffic to global memory, the profiler still treats this as an inefficient access pattern.
My profiler calls this metric 'global load efficiency'. It doesn't say whether the access is coalesced or not. One way to avoid the repeated global reads is sketched below.
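Since every warp reads the same 96 Step values, you can stage them once per block in on-chip shared memory (putting Step in __constant__ memory would serve the same purpose). The kernel below is a hedged illustration of that idea against the launch configuration above (grid.x == 1, a 32x10 block), not the original code:

__global__ void calc(Users *users, double *Step, int numUsers) {
    __shared__ double s_step[96];   // one on-chip copy per block
    int tId = threadIdx.x;          // blockIdx.x is always 0 with grid.x == 1
    if (threadIdx.y == 0) {         // let one warp stage all 96 values
        s_step[tId] = Step[tId];
        s_step[tId + 32] = Step[tId + 32];
        s_step[tId + 64] = Step[tId + 64];
    }
    __syncthreads();                // make the staged copy visible to all warps
    for (int uId = threadIdx.y; uId < numUsers; uId += blockDim.y) {
        double mean00 = users[uId].A[tId] * s_step[tId];
        double mean01 = users[uId].A[tId + 32] * s_step[tId + 32];
        double mean02 = users[uId].A[tId + 64] * s_step[tId + 64];
        users[uId].A[tId] = (mean00 == 0 ? 0 : 1 / mean00);
        users[uId].A[tId + 32] = (mean01 == 0 ? 0 : 1 / mean01);
        users[uId].A[tId + 64] = (mean02 == 0 ? 0 : 1 / mean02);
    }
}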
