The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).
The processing procedure for each element is very simple, and I suspect that the bottleneck may be not the CPU but DRAM bandwidth.
So I decided to study a single-threaded version first.
The system is: Windows 10 64-bit, 32 GB RAM, Intel Core i7-3770S (Ivy Bridge) 1.10 GHz, 4 cores, Hyper-Threading enabled.
Concurrency Analysis
Elapsed Time: 34.425s
CPU Time: 14.908s
    Effective Time: 14.908s
        Idle: 0.005s
        Poor: 14.902s
        Ok: 0s
        Ideal: 0s
        Over: 0s
    Spin Time: 0s
    Overhead Time: 0s
Wait Time: 0.000s
    Idle: 0.000s
    Poor: 0s
    Ok: 0s
    Ideal: 0s
    Over: 0s
Total Thread Count: 2
Paused Time: 18.767s
Memory Access Analysis
Memory Access Analysis reports different CPU times for three consecutive runs on the same amount of data.
The actual execution time was about 23 seconds, according to the Concurrency Analysis.
Run 1:
Elapsed Time: 33.526s
CPU Time: 5.740s
Memory Bound: 38.3%
    L1 Bound: 10.4%
    L2 Bound: 0.0%
    L3 Bound: 0.1%
    DRAM Bound: 0.8%
        Memory Bandwidth: 36.1%
        Memory Latency: 60.4%
Loads: 12,912,960,000
Stores: 7,720,800,000
LLC Miss Count: 420,000
Average Latency (cycles): 15
Total Thread Count: 4
Paused Time: 18.081s

Run 2:
Elapsed Time: 33.011s
CPU Time: 4.501s
Memory Bound: 36.9%
    L1 Bound: 10.6%
    L2 Bound: 0.0%
    L3 Bound: 0.2%
    DRAM Bound: 0.6%
        Memory Bandwidth: 36.5%
        Memory Latency: 62.7%
Loads: 9,836,100,000
Stores: 5,876,400,000
LLC Miss Count: 180,000
Average Latency (cycles): 15
Total Thread Count: 4
Paused Time: 17.913s

Run 3:
Elapsed Time: 33.738s
CPU Time: 5.999s
Memory Bound: 38.5%
    L1 Bound: 10.8%
    L2 Bound: 0.0%
    L3 Bound: 0.1%
    DRAM Bound: 0.9%
        Memory Bandwidth: 57.8%
        Memory Latency: 37.3%
Loads: 13,592,760,000
Stores: 8,125,200,000
LLC Miss Count: 660,000
Average Latency (cycles): 15
Total Thread Count: 4
Paused Time: 18.228s
As far as I understand the Summary page, the situation is not very good.
The paper "Finding your Memory Access performance bottlenecks" says that the reason is so-called false sharing. But I do not use multithreading; all of the processing is performed by just one thread.
On the other hand, according to the Memory Access Analysis Platform page, DRAM bandwidth is not the bottleneck.
So the questions are:
1. Why are the CPU Time values different between the Concurrency Analysis and the Memory Access Analysis?
2. What is the reason for the bad memory metric values, especially for L1 Bound?
The main loop is a lambda function, where:
tasklets: a std::vector of simple structures that contain the coefficients for processing
points: the data itself, an Eigen::Matrix
projections: an Eigen::Matrix that the processing results are written into
The code is:
#include <iostream>
#include <sstream>
#include <future>
#include <random>
#include <Eigen/Dense>
#include <ittnotify.h>

using namespace std;

using Vector3 = Eigen::Matrix<float, 3, 1>;
using Matrix3X = Eigen::Matrix<float, 3, Eigen::Dynamic>;

uniform_real_distribution<float> rnd(0.1f, 100.f);
default_random_engine gen;
class Tasklet {
public:
    Tasklet(int p1, int p2)
        : p1Id(p1), p2Id(p2), Loc0(p1)
    {
        RestDistance = rnd(gen);
        Weight_2 = rnd(gen);
    }

    __forceinline void solve(const Matrix3X& q, Matrix3X& p)
    {
        Vector3 q1 = q.col(p1Id);
        Vector3 q2 = q.col(p2Id);
        // Note: this loop never executes (i < 0).
        for (int i = 0; i < 0; ++i) {
            Vector3 delta = q2 - q1;
            float norm = delta.blueNorm() * delta.hypotNorm();
        }
        Vector3 deltaQ = q2 - q1;
        float dist = deltaQ.norm();
        Vector3 deltaUnitVector = deltaQ / dist;
        p.col(Loc0) = deltaUnitVector * RestDistance * Weight_2;
    }

    int p1Id;
    int p2Id;
    int Loc0;
    float RestDistance;
    float Weight_2;
};
typedef vector<Tasklet*> TaskList;

void
runTest(const Matrix3X& points, Matrix3X& projections, TaskList& tasklets)
{
    size_t num = tasklets.size();
    for (size_t i = 0; i < num; ++i) {
        Tasklet* t = tasklets[i];
        t->solve(points, projections);
    }
}

void
prepareData(Matrix3X& points, Matrix3X& projections, int numPoints, TaskList& tasklets)
{
    points.resize(3, numPoints);
    projections.resize(3, numPoints);
    points.setRandom();
    /*
    for (int i = 0; i < numPoints; ++i) {
        points.col(i) = Vector3(1, 0, 0);
    }
    */
    tasklets.reserve(numPoints - 1);
    for (int i = 1; i < numPoints; ++i) {
        tasklets.push_back(new Tasklet(i - 1, i));
    }
}
int
main(int argc, const char** argv)
{
    // Pause VTune data collection
    __itt_pause();

    cout << "Usage: <exefile> <number of points (in thousands)> <#runs for averaging>" << endl;

    int numPoints = 150 * 1000;
    int numRuns = 1;
    int argNo = 1;
    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numPoints = i * 1000;
        }
    }
    ++argNo;
    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numRuns = i;
        }
    }

    cout
        << "Running test" << endl
        << "\t NumPoints (thousands): " << numPoints / 1000. << endl
        << "\t # of runs for averaging: " << numRuns << endl;

    Matrix3X q, projections;
    TaskList tasklets;

    cout << "Preparing test data" << endl;
    prepareData(q, projections, numPoints, tasklets);

    cout << "Running test" << endl;
    // Resume VTune data collection
    __itt_resume();
    for (int r = 0; r < numRuns; ++r) {
        runTest(q, projections, tasklets);
    }
    // Pause VTune data collection
    __itt_pause();

    for (auto* t : tasklets) {
        delete t;
    }
    return 0;
}
Thank you.
Related
std::this_thread::yield() gives up the current CPU time slice to other threads/processes.
On the Windows platform, each CPU time slice should be about 2~10 ms.
That means the time gap should be larger than 2~10 ms when I use std::this_thread::yield() in a while(1) loop.
I created a test program:
`
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
#include <iostream>
#include <thread>
#include <cstdlib>     // system("pause")
using namespace std;

void thread_function_2() {
    double dt;
    LARGE_INTEGER nFreq;
    QueryPerformanceFrequency(&nFreq);
    LARGE_INTEGER tstart;
    LARGE_INTEGER tend;
    QueryPerformanceCounter(&tstart);
    for (int a = 0; a < 100; a++)
    {
        std::this_thread::yield();
    }
    QueryPerformanceCounter(&tend);
    cout << "-----------------hit 100 times yield cost time-----------------" << endl;
    dt = (tend.QuadPart - tstart.QuadPart) / (double)nFreq.QuadPart;
    cout << "Total time :" << dt * 1000000 << "us" << endl;
}
int main()
{
    std::thread thread_2(thread_function_2);
    thread_2.join();   // join so the thread object is not destroyed while still joinable
    system("pause");
    return 1;
}
`
The output is
-----------------hit 100 times yield cost time-----------------
Total time :9.9us
That means that 100 calls to std::this_thread::yield() cost 9.9 us in total.
My understanding is that each call to std::this_thread::yield() should give up a CPU time slice of at least 2 ms, so the 100 iterations of std::this_thread::yield() should cost at least 200 ms.
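A minimal sketch to check this, timed the same way, comparing yield() with Sleep(1) (the helper below is illustrative only, not part of the test above): std::this_thread::yield() merely offers the remainder of the current slice and returns almost immediately when no other ready thread wants the core, whereas Sleep(1) always deschedules the thread for at least one timer period.
`
#include <windows.h>
#include <iostream>
#include <thread>

// Times 'iterations' calls of fn() in milliseconds using QueryPerformanceCounter.
double measure_ms(void (*fn)(), int iterations)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < iterations; ++i) fn();
    QueryPerformanceCounter(&t1);
    return 1000.0 * (t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

int main()
{
    // yield(): typically microseconds in total when the core is otherwise idle.
    std::cout << "100 x yield():  " << measure_ms([] { std::this_thread::yield(); }, 100) << " ms\n";
    // Sleep(1): at least one timer period per call, so well over 100 ms in total.
    std::cout << "100 x Sleep(1): " << measure_ms([] { Sleep(1); }, 100) << " ms\n";
    return 0;
}
`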
I have a problem that boils down to performing some arithmetic on each element of a set of matrices. I thought this sounded like the kind of computation that could benefit greatly from being shifted onto the GPU. However, I've only succeeded in slowing down the computation by a factor of 10!
Here are the specifics of my test system:
OS: Windows 10
CPU: Core i7-4700MQ @ 2.40 GHz
GPU: GeForce GT 750M (compute capability 3.0)
CUDA SDK: v7.5
The code below performs equivalent calcs to my production code, on the CPU and on the GPU. The latter is consistently ten times slower on my machine (CPU approx. 650ms; GPU approx. 7s).
I've tried changing the grid and block sizes; I've increased and decreased the size of the array passed to the GPU; I've run it through the visual profiler; I've tried integer data rather than doubles, but whatever I do, the GPU version is always significantly slower than the CPU equivalent.
So why is the GPU version so much slower and what changes, that I've not mentioned above, could I try to improve its performance?
Here's my command line: nvcc source.cu -o CPUSpeedTest.exe -arch=sm_30
And here's the contents of source.cu:
#include <iostream>
#include <windows.h>
#include <cuda_runtime_api.h>
void AdjustArrayOnCPU(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
    for (size_t i = 0; i < arrayLength; i++)
    {
        double adjustmentFactor = factor1 * factor2 * factor3 * (curve[i] / denominator);
        array[i] = array[i] * adjustmentFactor;
    }
}

__global__ void CudaKernel(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < arrayLength)
    {
        double adjustmentFactor = factor1 * factor2 * factor3 * (curve[idx] / denominator);
        array[idx] = array[idx] * adjustmentFactor;
    }
}
void AdjustArrayOnGPU(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
    double *dev_row, *dev_curve;
    cudaMalloc((void**)&dev_row, sizeof(double) * arrayLength);
    cudaMalloc((void**)&dev_curve, sizeof(double) * curveLength);
    cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
    CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);
    cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
    cudaFree(dev_curve);
    cudaFree(dev_row);
}

void FillArray(int length, double row[])
{
    for (size_t i = 0; i < length; i++) row[i] = 0.1 + i;
}
int main(void)
{
    const int arrayLength = 10000;
    double arrayForCPU[arrayLength], curve1[arrayLength], arrayForGPU[arrayLength], curve2[arrayLength];
    FillArray(arrayLength, curve1);
    FillArray(arrayLength, curve2);

    ///////////////////////////////////// CPU Version ////////////////////////////////////////
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMilliseconds, Frequency;
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);
    for (size_t iterations = 0; iterations < 10000; iterations++)
    {
        FillArray(arrayLength, arrayForCPU);
        AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, arrayForCPU, 10000, curve1, 10000);
    }
    QueryPerformanceCounter(&EndingTime);
    ElapsedMilliseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMilliseconds.QuadPart *= 1000;
    ElapsedMilliseconds.QuadPart /= Frequency.QuadPart;
    std::cout << "Elapsed Milliseconds: " << ElapsedMilliseconds.QuadPart << std::endl;

    ///////////////////////////////////// GPU Version ////////////////////////////////////////
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (size_t iterations = 0; iterations < 10000; iterations++)
    {
        FillArray(arrayLength, arrayForGPU);
        AdjustArrayOnGPU(arrayForGPU, 10000, 1.0, 1.0, 1.0, 1.0, curve2, 10000);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    std::cout << "CUDA Elapsed Milliseconds: " << elapsedTime << std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
And here is an example of the output of CUDASpeedTest.exe
Elapsed Milliseconds: 565
CUDA Elapsed Milliseconds: 7156.76
What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others - like myself - who are new to the technology.
The GPU code is ten times slower than the CPU equivalent because the GPU code exhibits a perfect storm of performance-wrecking characteristics.
The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very, very simple calculation (that is supremely fast irrespective of the type of processor it's running on) and then copying data back from the device to the host.
As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused. In the code above, this brings the GPU-to-CPU runtime ratio down from 10:1 to 4:1.
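For illustration, here is a minimal sketch of the "allocate once, reuse" approach. The function and buffer names are placeholders rather than the original production code, and it reuses the CudaKernel and headers from the listing above:

// Sketch: device buffers are allocated once up front and reused for every call,
// so the per-call cost is only the two host-to-device copies, the kernel launch,
// and the device-to-host copy.
static double *dev_row = nullptr, *dev_curve = nullptr;

void InitDeviceBuffers(int maxArrayLength, int maxCurveLength)
{
    cudaMalloc((void**)&dev_row, sizeof(double) * maxArrayLength);
    cudaMalloc((void**)&dev_curve, sizeof(double) * maxCurveLength);
}

void AdjustArrayOnGPUReusingBuffers(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
    // Caller must ensure arrayLength/curveLength never exceed the sizes given to InitDeviceBuffers.
    cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
    CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);
    cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
}

void FreeDeviceBuffers()
{
    cudaFree(dev_curve);
    cudaFree(dev_row);
}

Init is called once before the timing loop and Free once after it; each call in between then pays only for the copies and the kernel launch.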
The remaining performance disparity comes down to the fact that the CPU can perform the required calculations, in serial, millions of times in a very short time span because they are so simple. In the code above, the calculation involves reading a value from an array, some multiplication, and finally an assignment to an array element. Something this simple must be performed millions of times before the benefits of doing it in parallel outweigh the unavoidable time penalty of transferring the data to the GPU and back. On my test system, a million array elements is the break-even point, where the GPU and CPU take (approximately) the same amount of time.
#include <vector>
#include <iostream>
#include <omp.h>
// GEMV is assumed to be the poster's wrapper around the BLAS *gemv routine
// (the post mentions Intel MKL); it is not defined in this snippet.

struct xnode
{
    float *mat;
};

void testScaling( )
{
    int N = 1000000;   // total num matrices
    int dim = 10;

    // memory for matrices
    std::vector<xnode> nodeArray(N);
    for( int k = 0; k < N; ++k )
        nodeArray[k].mat = new float [dim*dim];

    // memory for Y
    std::vector<float*> Y(N,0);
    for( int k = 0; k < N; ++k )
        Y[k] = new float [dim];

    // shared X
    float* X = new float [dim];
    for( int i = 0; i < dim; ++i ) X[i] = 1.0;

    // init mats
    for( int k = 0; k < N; ++k )
    {
        for( int i=0; i<dim*dim; ++i )
            nodeArray[k].mat[i] = 0.25+((float)i)/3;
    }

    int NTIMES = 500;

    // gemv args
    char trans = 'N';
    int lda = dim;
    int incx = 1;
    float alpha = 1, beta = 0;

    // threads
    int thr[4];
    thr[0] = 1; thr[1] = 2; thr[2] = 4; thr[3] = 8;
    for( int t = 0; t<4; ++t )   // test for nthreads
    {
        int nthreads = thr[t];
        double t_1 = omp_get_wtime();
        for( int ii = 0; ii < NTIMES; ++ii )   // do matvec NTIMES
        {
            #pragma omp parallel for num_threads(nthreads)
            for( int k=0; k<N; ++k )
            {
                // compute Y[k] = mat[k] * X;
                GEMV(&trans, &dim, &dim, &alpha, nodeArray[k].mat, &lda, X, &incx, &beta, Y[k], &incx);
                //GEMV(&trans, &dim, &dim, &alpha, nodeArray[0].mat, &lda, X, &incx, &beta, Y[k], &incx);
            }
        }
        double t_2 = omp_get_wtime();
        std::cout << "Threads " << nthreads << " time " << (t_2-t_1)/NTIMES << std::endl;
    }

    // clear memory
    for( int k = 0; k < N; ++k )
    {
        delete [] nodeArray[k].mat;
        delete [] Y[k];
    }
    delete [] X;
}
The above code parallelizes the matrix-vector product of N matrices of size dim and stores the results in N output vectors. The average over 500 products is taken as the time per matrix-vector product. The matrix-vector products in the example are all of equal size, so the threads should be perfectly balanced and we should achieve performance scaling close to the ideal 8x. The following observations were made on this machine: Intel Xeon 3.1 GHz, 2 processors with 8 cores each, Hyper-Threading enabled, Windows, VS2012, Intel MKL, Intel OpenMP library.
OBSERVATION 1:
dim=10 N=1000000
Threads 1 - time 0.138068s
Threads 2 - time 0.0729147s
Threads 4 - time 0.0360527s
Threads 8 - time 0.0224268s (6.1x on 8 threads)
OBSERVATION 2 :
dim=20 N=1000000
Threads 1 time 0.326617
Threads 2 time 0.185706
Threads 4 time 0.0886508
Threads 8 time 0.0733666 (4.5x on 8 threads).
Note: I ran VTune on this case. It showed CPU time 267.8 s, overhead time 43 s, and spin time 8 s. The overhead time is all spent in a libiomp function (Intel OpenMP library). The 8-thread/1-thread scaling is poor for such cases.
Next, in the GEMV for loop, we change nodeArray[k].mat to nodeArray[0].mat (see the commented statement), so that only the first matrix is used for all the matrix-vector products.
OBSERVATION 3
dim=20 N=1000000
Threads 1 time 0.152298 (The serial time is halved)
Threads 2 time 0.0769173
Threads 4 time 0.0384086
Threads 8 time 0.019336 (7.87x on 8 threads)
Thus I get almost ideal scaling: why this behavior? VTune says that a significant portion of CPU time is spent in synchronization and thread overhead, yet there seems to be no relation here between load balancing and thread synchronization. As the matrix size is increased, the granularity should increase and the thread overhead should become proportionately smaller, but as we go from size 10 to 20 the scaling weakens. When we use nodeArray[0].mat (only the first matrix) for all the matrix-vector products, the cache is updated only once (since the compiler knows this during optimization) and we get near-ideal scaling. So the synchronization overhead seems to be related to some cache-related issue. I have tried a number of other things, like setting KMP_AFFINITY and varying the load distribution, but that did not buy me anything.
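To put rough numbers on that cache argument, here is a back-of-envelope working-set calculation (a sketch based on the sizes used in the code above; the figures are approximate):

#include <cstdio>

int main()
{
    const double N = 1e6;                 // number of matrices, as in the benchmark
    const int dims[] = { 10, 20 };
    for (int dim : dims) {
        double mats_gb = N * dim * dim * sizeof(float) / 1e9;  // all nodeArray[k].mat
        double vecs_gb = N * dim * sizeof(float) / 1e9;        // all Y[k]
        std::printf("dim=%d: matrices %.2f GB, Y vectors %.2f GB touched per sweep\n",
                    dim, mats_gb, vecs_gb);
    }
    // dim=10 -> ~0.40 GB, dim=20 -> ~1.60 GB of matrix data streamed from DRAM
    // on every one of the NTIMES outer iterations, far larger than any cache level.
    return 0;
}

If those figures are right, every sweep over k streams the matrices from DRAM and all threads share the same memory-bandwidth budget, whereas the single reused nodeArray[0].mat (1.6 KB at dim=20) stays resident in L1.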
My questions are:
1. I don't have a clear idea of how cache performance affects OpenMP thread synchronization. Can someone explain this?
2. Can anything be done to improve the scaling and reduce the overhead?
Thanks
My formula for estimating the maximum FLOPs/s of an Intel CPU is
Max SP FLOPs/s = frequency * 4 (SSE, or 8 for AVX) * 2 (MAC) * number of cores (not HW threads)
Max DP FLOPs/s = 0.5 * Max SP FLOPs/s
By MAC I mean that the CPU can do one SSE (AVX) multiplication and one addition at the same time. On the system I'm using, the maximum frequency under load is 2.66 GHz. It only has SSE, and the number of cores (not hardware threads) is 4. That gives: Max SP FLOPs/s = 85.12 GFLOPs/s.
The number of FLOPs for matrix multiplication is approximately 2*n*m*k. For a square matrix with n = 1000, that's 2*10^9 FLOPs (2 billion FLOPs). Once I know the time, I can estimate the FLOPs/s.
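For example, a quick check that plugs in the figures above (a sketch; the measured time below is a placeholder, not an actual measurement):

#include <cstdio>

int main()
{
    const double freq_ghz   = 2.66;  // max frequency under load
    const double simd_width = 4;     // SP floats per SSE operation (8 for AVX)
    const double mac        = 2;     // one multiply plus one add per cycle
    const double cores      = 4;
    const double peak_sp_gflops = freq_ghz * simd_width * mac * cores;   // 85.12

    const double n = 1000.0;                  // square matrix dimension
    const double flops = 2.0 * n * n * n;     // ~2e9 FLOPs
    const double measured_s = 0.05;           // placeholder timing, substitute a real measurement
    const double achieved_gflops = flops / measured_s * 1e-9;

    std::printf("peak %.2f GFLOPs/s, achieved %.2f GFLOPs/s, efficiency %.1f%%\n",
                peak_sp_gflops, achieved_gflops, 100.0 * achieved_gflops / peak_sp_gflops);
    return 0;
}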
However, the best I can get with my own code is about 40 SP GFLOPs/s, for example with n=1000. I get about the same result with Eigen. That's about 45% efficiency. Is my calculation of the maximum wrong? What's the best efficiency an Intel CPU can achieve for large dense matrix multiplication? Does anyone have a paper describing this?
I know that on the GPU the efficiency can be more than 60%.
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3
Edit:
I get similar results for n=500, which easily fits in the 12 MB L3 cache of my system, so the cache does not seem to be the limiting factor (though maybe I can use it more efficiently).
Edit2:
The Eigen benchmarks show it to be as good as MKL (for SSE). They use an Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66 GHz. So 2.66 * 2 (DP SSE) * 2 (MAC) * 4 cores = 42.56 DP GFLOPs/s. You can see on the plot that they are all getting less than 20, something on the order of 45%, like me.
http://eigen.tuxfamily.org/index.php?title=Benchmark
http://ark.intel.com/products/35365/Intel-Core2-Quad-Processor-Q9400-6M-Cache-2_66-GHz-1333-MHz-FSB
Edit3:
Here is my code for anyone who cares. I can get slightly better results than this, but not much. I'm using Agner Fog's vectorclass for SSE/AVX, and setting Vec8f to float8 and Vec4d to double4.
// The snippet below assumes Agner Fog's vectorclass header and OpenMP; float8/double4
// are the Vec8f/Vec4d typedefs mentioned above, and load() is the poster's thin
// wrapper (not shown) around the vectorclass .load() member.
#include <omp.h>
#include <algorithm>
#include "vectorclass.h"
using std::min;

// SGEMM with AVX calls GEMM_tile<float, float8>(nthreads, a, b, c, n, m, k);
template <typename ftype, typename floatn>
void GEMM_tile(const int nthreads, const ftype *A, const ftype *B, ftype *C, const int N, const int M, const int K) {
    for(int i=0; i<N; i++) {
        for(int j=0; j<K; j++) {
            C[K*i + j] = 0;
        }
    }
    const int nc = 32;
    const int kc = 32;
    const int mc = 32;
    omp_set_num_threads(nthreads);
    #pragma omp parallel for if(nthreads>1)
    for(int ii=0; ii<N; ii+=nc) {
        for(int jj=0; jj<K; jj+=kc)
            for(int ll=0; ll<M; ll+=mc) {
                const int nb = min(N-ii, nc);
                const int kb = min(K-jj, kc);
                const int mb = min(M-ll, mc);
                MM_block<ftype, floatn>(nb, mb, kb, &A[M*ii+ll], N, &B[K*ll+jj], K, &C[K*ii+jj], K);
            }
    }
}
template <typename ftype, typename floatn>
void MM_block(int n, int m, int k, const ftype *a, const int stridea,
              const ftype *b, const int strideb,
              ftype *c, const int stridec) {
    const int vec_size = sizeof(floatn)/sizeof(ftype);
    for(int i=0; i<n; i+=4) {
        for(int j=0; j<k; j+=vec_size) {
            Dot4x4_vec_block<ftype, floatn>(m, &a[strideb*i], &b[j], &c[stridec*i + j], stridea, strideb, stridec);
        }
    }
}
template <typename ftype, typename floatn>
inline void Dot4x4_vec_block(const int n, const ftype *a, const ftype *b, ftype *c, const int stridea, const int strideb, const int stridec) {
    floatn tmp0, tmp1, tmp2, tmp3;
    load(tmp0, &c[stridec*0]);
    load(tmp1, &c[stridec*1]);
    load(tmp2, &c[stridec*2]);
    load(tmp3, &c[stridec*3]);
    ftype *a0_ptr = (ftype*)&a[stridea*0];
    ftype *a1_ptr = (ftype*)&a[stridea*1];
    ftype *a2_ptr = (ftype*)&a[stridea*2];
    ftype *a3_ptr = (ftype*)&a[stridea*3];
    for(int i=0; i<n; i++) {
        floatn breg = floatn().load(&b[i*strideb + 0]);
        floatn areg0 = *a0_ptr++;
        floatn areg1 = *a1_ptr++;
        floatn areg2 = *a2_ptr++;
        floatn areg3 = *a3_ptr++;
        tmp0 += areg0 * breg;
        tmp1 += areg1 * breg;
        tmp2 += areg2 * breg;
        tmp3 += areg3 * breg;
    }
    tmp0.store(&c[stridec*0]);
    tmp1.store(&c[stridec*1]);
    tmp2.store(&c[stridec*2]);
    tmp3.store(&c[stridec*3]);
}
Often, the limiting factor for processing throughput is memory bandwidth, especially in cases where your working set doesn't fit into the CPU cache (your 1000-by-1000 matrix of float will take up ~4 MB, whereas your CPU probably has a 2 MB L3 cache). This is a situation where the structure of your algorithm can make a big difference in how it performs, but you will usually hit a wall at some point where you just can't get any faster because you're waiting on values to come from some higher level in the memory hierarchy.
In addition, your theoretical numbers assume that you have sufficient instructions without data dependencies to keep all of the execution units tasked on every cycle. This can be very difficult to do in practice. I'm not sure what the optimum throughput for a general matrix multiply would be, but check out this previous question for information on what you can do to maximize the instruction throughput.
I am doing a finite difference computation (stencil computation) on a GPU (Fermi) using CUDA. When I tested my code with the CUDA profiler, I found the occupancy was 0.333. After I reordered my computation and increased the occupancy to 0.677, the execution time of the kernel did not decrease but increased. In other words, performance dropped when the occupancy increased by 1/3.
My question is:
Does the performance of the kernel depend on the computation irrespective of the occupancy?
The answer is "it depends", both on the characteristics of your workload and on how you define performance. Generally speaking, if your bottleneck is math throughput you're often fine with a lower occupancy (12.5%-33%), but if your bottleneck is memory then you usually want a higher occupancy (66% or higher). This is just a rule of thumb, not an absolute rule. Most kernels fall somewhere in the middle but there are exceptions at both extremes.
Occupancy is the maximum number of threads of your kernel that can be active at once (limited by register count per thread or other resources) divided by the maximum number of threads the GPU can have active when not limited by other resources. Active means the thread has hardware resources assigned and is available for scheduling, not that it has any instructions executing on a given clock cycle.
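As a concrete illustration, assuming the Fermi (compute 2.x) limits of 32768 registers and at most 1536 resident threads per SM, and an assumed kernel footprint, here is one way a kernel can end up at an occupancy of about 0.333:

#include <cstdio>

int main()
{
    // Assumed Fermi (compute 2.x) per-SM limits: 32768 registers, 1536 resident threads.
    const int regs_per_sm        = 32768;
    const int max_threads_per_sm = 1536;
    // Assumed kernel characteristics (illustrative only).
    const int regs_per_thread    = 63;
    const int block_size         = 256;

    int threads_by_regs  = regs_per_sm / regs_per_thread;      // ~520 threads fit by registers
    int resident_blocks  = threads_by_regs / block_size;       // 2 whole blocks fit
    int resident_threads = resident_blocks * block_size;       // 512 threads
    std::printf("occupancy = %.3f\n",
                (double)resident_threads / max_threads_per_sm); // 0.333
    return 0;
}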
After issuing instruction i for a thread, instruction i+1 for that thread might not be able to run immediately if it depends on the result of instruction i. If that instruction is a math instruction, the result will be available in a few clock cycles. If it's a memory load instruction, it might be hundreds of cycles. Rather than waiting, the GPU will issue instructions from some other thread whose dependencies are satisfied.
So if you're mostly doing math, you only need a few (few in GPU terms; on a CPU it would be considered many) threads to hide the few cycles of latency from math instructions, so you can get away with low occupancy. But if you've got a lot of memory traffic, you need more threads to ensure that some of them are ready to execute on every cycle, since each one spends a lot of time "sleeping" waiting for memory operations to complete.
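A rough Little's-law style calculation makes the same point (all numbers below are illustrative assumptions, not measurements): the data that must be in flight equals latency times bandwidth, and that concurrency can come either from many resident threads or from more independent requests per thread.

#include <cstdio>

int main()
{
    const double latency_cycles  = 400.0;  // assumed DRAM latency as seen by an SM
    const double bytes_per_cycle = 128.0;  // assumed sustainable bandwidth per GPU clock
    const double bytes_in_flight = latency_cycles * bytes_per_cycle;   // Little's law
    const double loads_in_flight = bytes_in_flight / 4.0;              // 4-byte loads

    std::printf("~%.0f outstanding 4-byte loads are needed to keep the memory system busy\n",
                loads_in_flight);
    // Those outstanding loads can come from many threads issuing one load each
    // (high occupancy) or from fewer threads each issuing several independent
    // loads (instruction-level parallelism); the hardware does not care which.
    return 0;
}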
If the algorithmic changes you made to increase occupancy also increased the amount of work done on each thread, and if you already had enough threads to keep the GPU busy, then the change will just slow you down. Increasing occupancy only improves performance up to the point where you have enough threads to keep the GPU busy.
Jesse Hall has already answered your question, so I will limit myself to complement his answer.
Occupancy is not the only figure of merit to take care of when maximizing an algorithm's performance, which most often coincides with the execution time. I suggest taking a look at the instructive GTC 2010 presentation by Vasily Volkov:
Better Performance at Lower Occupancy
Below, I'm providing a simple example, inspired by Part II of the above presentation.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#define BLOCKSIZE 512
//#define DEBUG
/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
/***********************************************/
/* MEMCPY1 - EACH THREAD COPIES ONE FLOAT ONLY */
/***********************************************/
__global__ void memcpy1(float *src, float *dst, unsigned int N)
{
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        float a0 = src[tid];
        dst[tid] = a0;
    }
}

/*******************************************/
/* MEMCPY2 - EACH THREAD COPIES TWO FLOATS */
/*******************************************/
__global__ void memcpy2(float *src, float *dst, unsigned int N)
{
    const int tid = threadIdx.x + blockIdx.x * (2 * blockDim.x);
    if (tid < N) {
        float a0 = src[tid];
        float a1 = src[tid + blockDim.x];
        dst[tid] = a0;
        dst[tid + blockDim.x] = a1;
    }
}

/********************************************/
/* MEMCPY4 - EACH THREAD COPIES FOUR FLOATS */
/********************************************/
__global__ void memcpy4(float *src, float *dst, unsigned int N)
{
    const int tid = threadIdx.x + blockIdx.x * (4 * blockDim.x);
    if (tid < N) {
        float a0 = src[tid];
        float a1 = src[tid + blockDim.x];
        float a2 = src[tid + 2 * blockDim.x];
        float a3 = src[tid + 3 * blockDim.x];
        dst[tid] = a0;
        dst[tid + blockDim.x] = a1;
        dst[tid + 2 * blockDim.x] = a2;
        dst[tid + 3 * blockDim.x] = a3;
    }
}

/************************************************/
/* MEMCPY4_2 - EACH THREAD COPIES FOUR FLOAT2s  */
/************************************************/
__global__ void memcpy4_2(float2 *src, float2 *dst, unsigned int N)
{
    const int tid = threadIdx.x + blockIdx.x * (4 * blockDim.x);
    if (tid < N/2) {
        float2 a0 = src[tid];
        float2 a1 = src[tid + blockDim.x];
        float2 a2 = src[tid + 2 * blockDim.x];
        float2 a3 = src[tid + 3 * blockDim.x];
        dst[tid] = a0;
        dst[tid + blockDim.x] = a1;
        dst[tid + 2 * blockDim.x] = a2;
        dst[tid + 3 * blockDim.x] = a3;
    }
}
/********/
/* MAIN */
/********/
int main()
{
    const int N = 131072;
    const int N_iter = 20;

    // --- Setting host data and memory space for result
    float* h_vect   = (float*)malloc(N*sizeof(float));
    float* h_result = (float*)malloc(N*sizeof(float));
    for (int i=0; i<N; i++) h_vect[i] = i;

    // --- Setting device data and memory space for result
    float* d_src;     gpuErrchk(cudaMalloc((void**)&d_src,     N*sizeof(float)));
    float* d_dest1;   gpuErrchk(cudaMalloc((void**)&d_dest1,   N*sizeof(float)));
    float* d_dest2;   gpuErrchk(cudaMalloc((void**)&d_dest2,   N*sizeof(float)));
    float* d_dest4;   gpuErrchk(cudaMalloc((void**)&d_dest4,   N*sizeof(float)));
    float* d_dest4_2; gpuErrchk(cudaMalloc((void**)&d_dest4_2, N*sizeof(float)));
    gpuErrchk(cudaMemcpy(d_src, h_vect, N*sizeof(float), cudaMemcpyHostToDevice));

    // --- Warmup
    for (int i=0; i<N_iter; i++) memcpy1<<<iDivUp(N,BLOCKSIZE), BLOCKSIZE>>>(d_src, d_dest1, N);

    // --- Creating events for timing
    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /***********/
    /* MEMCPY1 */
    /***********/
    cudaEventRecord(start, 0);
    for (int i=0; i<N_iter; i++) {
        memcpy1<<<iDivUp(N,BLOCKSIZE), BLOCKSIZE>>>(d_src, d_dest1, N);
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("GB/s = %f\n", (1.e-6)*(float)(N*N_iter*sizeof(float))/time);
    gpuErrchk(cudaMemcpy(h_result, d_dest1, N*sizeof(float), cudaMemcpyDeviceToHost));
    for (int i=0; i<N; i++) if(h_result[i] != h_vect[i]) { printf("Error at i=%i! Host = %f; Device = %f\n", i, h_vect[i], h_result[i]); return 1; }

    /***********/
    /* MEMCPY2 */
    /***********/
    cudaEventRecord(start, 0);
    for (int i=0; i<N_iter; i++) {
        memcpy2<<<iDivUp(N/2,BLOCKSIZE), BLOCKSIZE>>>(d_src, d_dest2, N);
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("GB/s = %f\n", (1.e-6)*(float)(N*N_iter*sizeof(float))/time);
    gpuErrchk(cudaMemcpy(h_result, d_dest2, N*sizeof(float), cudaMemcpyDeviceToHost));
    for (int i=0; i<N; i++) if(h_result[i] != h_vect[i]) { printf("Error at i=%i! Host = %f; Device = %f\n", i, h_vect[i], h_result[i]); return 1; }

    /***********/
    /* MEMCPY4 */
    /***********/
    cudaEventRecord(start, 0);
    for (int i=0; i<N_iter; i++) {
        memcpy4<<<iDivUp(N/4,BLOCKSIZE), BLOCKSIZE>>>(d_src, d_dest4, N);
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("GB/s = %f\n", (1.e-6)*(float)(N*N_iter*sizeof(float))/time);
    gpuErrchk(cudaMemcpy(h_result, d_dest4, N*sizeof(float), cudaMemcpyDeviceToHost));
    for (int i=0; i<N; i++) if(h_result[i] != h_vect[i]) { printf("Error at i=%i! Host = %f; Device = %f\n", i, h_vect[i], h_result[i]); return 1; }

    /*************/
    /* MEMCPY4_2 */
    /*************/
    cudaEventRecord(start, 0);
    for (int i=0; i<N_iter; i++) {
        memcpy4_2<<<iDivUp(N/8,BLOCKSIZE), BLOCKSIZE>>>((float2*)d_src, (float2*)d_dest4_2, N);
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("GB/s = %f\n", (1.e-6)*(float)(N*N_iter*sizeof(float))/time);
    gpuErrchk(cudaMemcpy(h_result, d_dest4_2, N*sizeof(float), cudaMemcpyDeviceToHost));
    for (int i=0; i<N; i++) if(h_result[i] != h_vect[i]) { printf("Error at i=%i! Host = %f; Device = %f\n", i, h_vect[i], h_result[i]); return 1; }

    cudaDeviceReset();
    return 0;
}
Below is the performance of the above code when run on a GeForce GT540M, a Kepler K20c, and a Tesla C2050.
BLOCKSIZE 32
GT540M K20c Tesla C2050
memcpy1 2.3GB/s 13% 28.1GB/s 18% 14.9GB/s 12%
memcpy2 4.4GB/s 13% 41.1GB/s 18% 24.8GB/s 13%
memcpy4 7.5GB/s 13% 54.8GB/s 18% 34.6GB/s 13%
memcpy4_2 11.2GB/s 14% 68.8GB/s 18% 44.0GB/s 14%
BLOCKSIZE 64
GT540M K20c Tesla C2050
memcpy1 4.6GB/s 27% 44.1GB/s 36% 26.1GB/s 26%
memcpy2 8.1GB/s 27% 57.1GB/s 36% 35.7GB/s 26%
memcpy4 11.4GB/s 27% 63.2GB/s 36% 43.5GB/s 26%
memcpy4_2 12.6GB/s 27% 72.8GB/s 36% 49.7GB/s 27%
BLOCKSIZE 128
GT540M K20c Tesla C2050
memcpy1 8.0GB/s 52% 60.6GB/s 78% 36.1GB/s 52%
memcpy2 11.6GB/s 52% 61.6GB/s 78% 44.8GB/s 52%
memcpy4 12.4GB/s 52% 62.2GB/s 78% 48.3GB/s 52%
memcpy4_2 12.5GB/s 52% 61.9GB/s 78% 49.5GB/s 52%
BLOCKSIZE 256
GT540M K20c Tesla C2050
memcpy1 10.6GB/s 80% 61.2GB/s 74% 42.0GB/s 77%
memcpy2 12.3GB/s 80% 66.2GB/s 74% 48.2GB/s 77%
memcpy4 12.4GB/s 80% 66.4GB/s 74% 45.5GB/s 77%
memcpy4_2 12.6GB/s 70% 72.6GB/s 74% 50.8GB/s 77%
BLOCKSIZE 512
GT540M K20c Tesla C2050
memcpy1 10.3GB/s 80% 54.5GB/s 75% 41.6GB/s 75%
memcpy2 12.2GB/s 80% 67.1GB/s 75% 47.7GB/s 75%
memcpy4 12.4GB/s 80% 67.9GB/s 75% 46.9GB/s 75%
memcpy4_2 12.5GB/s 55% 70.1GB/s 75% 48.3GB/s 75%
The above results show that you can achieve better performance, e.g. 12 GB/s in the GT540M case, with lower occupancy, e.g. 27%, if you properly exploit Instruction Level Parallelism (ILP) by giving each thread more work to do in order to hide latency.