I am trying to optimize an algorithm that masks an array. The initial code looks like this:
void mask(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *m)
{
unsigned int rep=size_x*size_y;
while (rep--)
{
*(source++) &= *(m++);
}
}
I have tried to do Loop Unrolling + prefetching
void mask_LU4(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{ // in place
unsigned int rep;
rep= size_x* size_y;
rep/= 4 ;
while (rep--)
{
_mm_prefetch(&source[16], _MM_HINT_T0);
_mm_prefetch(&mask[16], _MM_HINT_T0);
source[0] &= mask[0];
source[1] &= mask[1];
source[2] &= mask[2];
source[3] &= mask[3];
source += 4;
mask += 4;
}
}
and use intrinsics
void fmask_SIMD(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{ // in place
unsigned int rep;
__m128i *s,*m ;
s = (__m128i *) source;
m = (__m128i *) mask;
rep= size_x* size_y;
rep/= 4 ;
while (rep--)
{
*s = _mm_and_si128(*s,*m);
source+=4;mask+=4;
s = (__m128i *) source;
m = (__m128i *) mask;
}
}
However the performance is the same. I have tried to perform sw prefetch both to the SIMD and Loop Unrolling version and it I couldn't see any improvement. Any ideas on how I could optimize this algorithm?
P.S.1: I am using gcc 4.8.1 and I compile with -march=native and -Ofast.
P.S.2: I am using an Intel Core i5 3470 #3.2Ghz, Ivy bridge architecture. L1 DCache 4X32KB (8-way), L2 4x256, L3 6MB, RAM-DDR3 4Gb (Dual Channel, DRAM #798,1Mhz)
Your operation is memory bandwidth bound. However, that does not necessarily mean your operation is achieving the maximum memory bandwidth. To get closer to the maximum memory bandwidth you need to use multiple threads. Using OpenMP (add -fopenmp to GCC's options) you can do this:
#pragma omp parallel for
for(int i=0; i<rep; i++) { source[i] &= m[i]; }
If you wanted to not modify the source but use a different destination then you can use stream instructions like this:
#pragma omp parallel for
for(int i=0; i<rep/4; i++) {
__m128i m4 = _mm_load_si128((__m128i*)&m[4*i]);
__m128i s4 = _mm_load_si128((__m128i*)&source[4*i]);
s4 = _mm_and_si128(s4,m4);
_mm_stream_si128((__m128i*i)&dest[4*i], s4);
}
This would not be any faster than using the same destination and source. However, if you already planned to use a destination not equal to the source this would likely be faster (for some value of rep) than using _mm_store_si128.
Your problem might be memory bound, but that does not mean you cannot process more per cycle. Usually when you have a low payload operation (like yous here, it's only an AND after all), it makes sense to combine many loads and stores. In most CPUs most loads will be combined by the L2 cache into a single cache line load (especially if they're consecutive). I'd suggest increasing the loop unrolling to at least 4 SIMD packets here, along with the prefetching. You'd still be memory bound, but you will get fewer cache misses, giving you a slightly better performance.
Related
When I try this I get the wrong result at 'output' even though I am copying the values of 'cum' array to output.
But if I rename the 'cum' array mentioned earlier in the code. I get the correct value of array. Therefore I am unable to reuse the result values.
The device has 8 cores with no shared memory.
Any and all comments/suggestions appreciated.
kernel void histogram(global unsigned int *input,
global unsigned int *output,
global unsigned int *frequency,
global unsigned int *cum,
unsigned int N)
{
int pid = get_global_id(0);
//cumulative sum
for(int i=0; i < 16; i++)
{
cum[(i*16)+(2*pid)+1] = frequency[(i*16)+(2*pid)] + frequency[(i*16)+(2*pid)+1];
}
barrier(CLK_GLOBAL_MEM_FENCE);
for(int i=0; i < 32; i++)
{
output[(i*8)+pid] = cum[(i*8)+pid];
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
Make sure you understand parallel prefix sums. In particular I don't see a downsweep step of the total sum or parts:
Parallel Prefix Sum (Scan) with CUDA
I'd look in the TI's Keystone II SDK you're using in OpenCL device memory read/write issue to see if they have any scan or parallel prefix sum implementations or built in functions.
I am trying to optimize an algorithm I am running on my GPU (AMD HD6850). I counted the number of floating point operations inside my kernel and measured its execution time. I found it to achieve ~20 SP GFLOPS, however according to the GPUs specs I should achieve ~1500 GFLOPS.
To find the bottleneck I created a very simple kernel:
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
float cd;
for (int i=0; i<100000; i++)
{
cd = d*i;
}
if (cd == -1.0f)
{
result[gid] = cd;
}
}
Running this kernel I get ~5*10^5 work_items/sec. I count one floating point operation (not sure if that's right, what about incrementing i and comparing it to 100000?) per iteration of the loop.
==> 5*10^5 work_items/sec * 10^5 FLOPS = 50 GFLOPS.
Even if there are 3 or 4 operations going on in the loop, it's much slower than the what the card should be able to do. What am I doing wrong?
The global work size is big enough (no speed change for 10k vs 100k work items).
Here are a couple of tricks:
GPU doesn't like cycles at all. Use #pragma unroll to unwind them.
Your GPU is good at vector operations. Stick to it, that will allow you to process multiple operands at once.
Use vector load/store whether it's possible.
Measure the memory bandwidth - I'm almost sure that you are bandwidth-limited because of poor access pattern.
In my opinion, kernel should look like this:
typedef union floats{
float16 vector;
float array[16];
} floats;
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
floats cd;
cd.vector = vload16(gid, result);
cd.vector *= d;
#pragma unroll
for (int i=0; i<16; i++)
{
if(cd.array[i] == -1.0f){
result[gid] = cd;
}
}
Make your NDRange bigger to compensate difference between 16 & 1000 in loop condition.
To my knowledge, if atomic operations are performed on same memory address location in a warp, the performance of the warp could be 32 times slower.
But what if atomic operations of threads in a warp are on 32 different memory locations? Is there any performance penalty at all? Or it will be as fast as normal operation?
My use case is that I have 32 different positions, each thread in a warp needs one of these position but which position is data dependent. So each thread could use atomicCAS to scan if the location desired is empty or not. If it is not empty, scan the next position.
If I am lucky, 32 threads could atomicCAS to 32 different memory locations, is there any performance penalty is this case?
I assume Kepler architecture is used
In the code below, I'm adding a constant value to the elements of an array (dev_input). I'm comparing two kernels, one using atomicAdd and one using regular addition. This is an example taken to the extreme in which atomicAdd operates on completely different addresses, so there will be no need for serialization of the operations.
#include <stdio.h>
#define BLOCK_SIZE 1024
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void regular_addition(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) dev_input[i] = dev_input[i] + val;
}
__global__ void atomic_operations(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) atomicAdd(&dev_input[i],val);
}
int main(){
int N = 8192*32;
float* output = (float*)malloc(N*sizeof(float));
float* dev_input; gpuErrchk(cudaMalloc((void**)&dev_input, N*sizeof(float)));
gpuErrchk(cudaMemset(dev_input, 0, N*sizeof(float)));
int NumBlocks = iDivUp(N,BLOCK_SIZE);
float time, timing1 = 0.f, timing2 = 0.f;
cudaEvent_t start, stop;
int niter = 32;
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
atomic_operations<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing1 = timing1 + time;
}
printf("Time for atomic operations: %3.5f ms \n", timing1/(float)niter);
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
regular_addition<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing2 = timing2 + time;
}
printf("Time for regular addition: %3.5f ms \n", timing2/(float)niter);
}
Testing this code on my NVIDIA GeForce GT540M, CUDA 5.5, Windows 7, I obtain approximately the same results for the two kernels, i.e., about 0.7ms.
Now change the instruction
if (i < N) atomicAdd(&dev_input[i],val);
to
if (i < N) atomicAdd(&dev_input[i%32],val);
which is closer to the case of your interest, namely, each atomicAdd operates on different addresses within a warp. The result I obtain is that no performance penalty is observed.
Finally, change the above instruction to
if (i < N) atomicAdd(&dev_input[0],val);
This is the other extreme in which atomicAdd always operates on the same address. In this case, the execution time raises to 5.1ms.
The above tests have been performed on a Fermi architecture. You can try to run the above code on your Kepler card.
My formula for estimating the maximum FLOPs/s of an Intel CPU is
Max SP FLOPs/s = frequencey * 4 SSE(8AVX) * 2 (MAC) * number of cores (not HW threads)
Max DP FLOPs/s = 0.5 * Max SP FLOPs/s
By MAC I mean the CPU can do one SSE(AVX) multiplication and addition at the same time. On the system I'm using the maximum frequency under load is 2.66 GHz. It only has SSE and the number of cores (not Hardware threads) is 4. That gives: Max SP FLOPs/s = 85.12 GFLOPs/s.
The number of FLOPs for matrix multiplication is approxiamelty 2*n*m*k. For a square matrix with n=1000 that's 2*10E9 FLOPs (2 billion FLOPs). Once I know the time I can estimate the FLOPs/s.
However, the best I can get for my own code is about 40 SP GFLOPs/s for example with n=1000. I get about the same result with Eigen. That's about a 45% efficiency. Is my calculation for the maximum wrong? What's the best efficiency for a Intel CPU for large dense matrix multiplication? Does anyone have a paper describing this?
I know that on the GPU the efficiency can be more than 60%.
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3
Edit:
I get similar results for n=500 which easily fits in the 12MB L3 cache of my system so the cache does not seem to be the limiting factor (though maybe I can use it more efficiently).
Edit2:
Eigen Benchmarks show it as good as MKL (for SSE). They use a Intel(R) Core(TM)2 Quad CPU Q9400 # 2.66GHz. So 2.66* 2(DP SSE) *2 MAC * 4 cores = 42.25 DP GFLOPs/s. You can see on the plot they are all getting less that 20. Something on order of 45% like me.
http://eigen.tuxfamily.org/index.php?title=Benchmark
http://ark.intel.com/products/35365/Intel-Core2-Quad-Processor-Q9400-6M-Cache-2_66-GHz-1333-MHz-FSB
Edit3:
Here is my code for anyone that cares. I can get slight better results than this but not much better. I'm using Agner Fog's vectorclass for SEE/AVX. and setting Vec8f to float8 and Vec4d to double4
//SGEMM and AVX call MM_tile<float, float8>(nthreads, a, b, c, n, m, k);
template <typename ftype, typename floatn>
void GEMM_tile(const int nthreads, const ftype*A , const ftype* B, ftype* C, const int N, const int M, const int K) {
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
C[K*i + j] = 0;
}
}
const int nc = 32;
const int kc = 32;
const int mc = 32;
omp_set_num_threads(nthreads);
#pragma omp parallel for if(nthreads>1)
for(int ii=0; ii<N; ii+=nc) {
for(int jj=0; jj<K; jj+=kc)
for(int ll=0; ll<M; ll+=mc) {
const int nb = min(N-ii, nc);
const int kb = min(K-jj, kc);
const int mb = min(M-ll, mc);
MM_block<ftype, floatn>(nb, mb, kb, &A[M*ii+ll], N, &B[K*ll+jj], K, &C[K*ii+jj], K );
}
}
}
template <typename ftype, typename floatn>
void MM_block(int n, int m, int k, const ftype *a, const int stridea,
const ftype *b, const int strideb,
ftype *c, const int stridec ) {
const int vec_size = sizeof(floatn)/sizeof(ftype);
for(int i=0; i<n; i+=4) {
for(int j=0; j<k; j+=vec_size) {
Dot4x4_vec_block<ftype, floatn>(m, &a[strideb*i], &b[j], &c[stridec*i + j], stridea, strideb, stridec);
}
}
template <typename ftype, typename floatn>
inline void Dot4x4_vec_block(const int n, const ftype *a, const ftype *b, ftype *c, const int stridea, const int strideb, const int stridec) {
floatn tmp0, tmp1, tmp2, tmp3;
load(tmp0, &c[stridec*0]);
load(tmp1, &c[stridec*1]);
load(tmp2, &c[stridec*2]);
load(tmp3, &c[stridec*3]);
ftype *a0_ptr = (ftype*)&a[stridea*0];
ftype *a1_ptr = (ftype*)&a[stridea*1];
ftype *a2_ptr = (ftype*)&a[stridea*2];
ftype *a3_ptr = (ftype*)&a[stridea*3];
for(int i=0; i<n; i++) {
floatn breg = floatn().load(&b[i*strideb + 0]);
floatn areg0 = *a0_ptr++;
floatn areg1 = *a1_ptr++;
floatn areg2 = *a2_ptr++;
floatn areg3 = *a3_ptr++;
tmp0 += areg0 * breg;
tmp1 += areg1 * breg;
tmp2 += areg2 * breg;
tmp3 += areg3 * breg;
}
tmp0.store(&c[stridec*0]);
tmp1.store(&c[stridec*1]);
tmp2.store(&c[stridec*2]);
tmp3.store(&c[stridec*3]);
}
Often, the limiting factor for processing throughput is memory bandwidth, especially in cases where your working set doesn't fit into the CPU cache (your 1000-by-1000 matrix of float will take up ~4 MB, whereas your CPU probably has a 2 MB L3 cache). This is a situation where the structure of your algorithm can make a big difference in how it performs, but you will usually hit a wall at some point where you just can't get any faster because you're waiting on values to come from some higher level in the memory hierarchy.
In addition, your theoretical numbers assume that you have sufficient instructions without data dependencies to keep all of the execution units tasked on every cycle. This can be very difficult to do in practice. I'm not sure what the optimum throughput for a general matrix multiply would be, but check out this previous question for information on what you can do to maximize the instruction throughput.
I get the code from wikipedia:
#include <stdio.h>
#include <omp.h>
#define N 100
int main(int argc, char *argv[])
{
float a[N], b[N], c[N];
int i;
omp_set_dynamic(0);
omp_set_num_threads(10);
for (i = 0; i < N; i++)
{
a[i] = i * 1.0;
b[i] = i * 2.0;
}
#pragma omp parallel shared(a, b, c) private(i)
{
#pragma omp for
for (i = 0; i < N; i++)
c[i] = a[i] + b[i];
}
printf ("%f\n", c[10]);
return 0;
}
I tryed to compile and run it in my Ubuntu 11.04 with gcc4.5 (my configuration: Intel C2D T7500M 2.2GHz, 2048Mb RAM) and this program worked in two times slower than single-threaded. Why?
Very simple answer: Increase N. And set the number of threads equal to the number processors you have.
For your machine, 100 is a very low number. Try some orders of magnitudes higher.
Another question is: How are you measuring the computation time? Usually one takes the program time to get comparable results.
I suppose the compiler optimized the for loop in the non-smp case (using SSE instructions, e.g.) and it can't in the OMP variant.
Use gcc -S (or objdump -S) to view the assembly for the different variants.
You might want to watch out with the shared variables anyway, because they need to be synchronized, making things very slow. If you can 'smart' chunks (look at the schedule pragma) you might reduce the contention, but again:
verify the emitted code
profile
don't underestimate the efficiency of singlethreaded code (because of cache locality and lack of context switches)
set the number of threads to the number of CPUs (let openMP decide it for you!); unless your thread-team has a master thread with dedicated tasks, in which case there might be value in allocating ONE extra thread
In all the cases where I tried to apply OMP for parallelization, roughly 70% of the cases are slower. The cases where it is a definite speedup is with
coarse-grained parallellism (your sample is on the fine-grained end of the spectrum)
no shared data
The issue you are facing is false memory sharing. Each thread should have its own private c[i].
Try this: #pragma omp parallel shared(a, b) private(i, c)
Run the code below and see the difference.
1.) OpenMP has an overhead so the runtime has to be more than the overhead to see a benefit.
2.) Don't set the number of threads yourself. In general I use the default threads. However, if your processor has hyper-threading you might get a bit better performance by setting the number of threads equal to the number of cores. With hyper threading the default number of threads will be twice the number of cores. For example on my machine I have four cores and the default number of threads is eight. By setting it to four in some situations I get better results and in other cases I get worse results.
3.) There is some false sharing in c but as long as N is large enough (which it needs to be to overcome the overhead) the false sharing will not cause much of a problem. You can play with the chunk size but I don't think it will be helpful.
4.) Cache issues. You have at least four levels of memory (the values are for my system): L1 (32Kb), L2(256Kb), L3(12Mb), and main memory (>>12Mb). The benefits of parallelism are going to diminish as you move into higher level. However, in the example below I set N to 100 million floats which is 400 million bytes or about 381Mb and it is still significantly faster using multiple threads. Try adjusting N and see what happens. For example try setting N to your cache levels/4 (one float is 4 bytes) (arrays a and b also need to be in the cache so you might need to set N to the cache level/12). However, if N is too small you fight with the OpenMP overhead (which is what the code in your question does).
#include <stdio.h>
#include <omp.h>
#define N 100000000
int main(int argc, char *argv[]) {
float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
int i;
for (i = 0; i < N; i++) {
a[i] = i * 1.0;
b[i] = i * 2.0;
}
double dtime;
dtime = omp_get_wtime();
for (i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
dtime = omp_get_wtime() - dtime;
printf ("time %f, %f\n", dtime, c[10]);
dtime = omp_get_wtime();
#pragma omp parallel for private(i)
for (i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
dtime = omp_get_wtime() - dtime;
printf ("time %f, %f\n", dtime, c[10]);
return 0;
}