Windows 10 poor performance compared to Windows 7 (page fault handling is not scalable, severe lock contention when no of threads > 16) - windows

We set up two identical HP Z840 Workstations with the following specs
2 x Xeon E5-2690 v4 # 2.60GHz (Turbo Boost ON, HT OFF, total 28 logical CPUs)
32GB DDR4 2400 Memory, Quad-channel
and installed Windows 7 SP1 (x64) and Windows 10 Creators Update (x64) on each.
Then we ran a small memory benchmark (code below, built with VS2015 Update 3, 64-bit architecture) which performs memory allocation-fill-free simultaneously from multiple threads.
#include <Windows.h>
#include <vector>
#include <ppl.h>
unsigned __int64 ZQueryPerformanceCounter()
{
unsigned __int64 c;
::QueryPerformanceCounter((LARGE_INTEGER *)&c);
return c;
}
unsigned __int64 ZQueryPerformanceFrequency()
{
unsigned __int64 c;
::QueryPerformanceFrequency((LARGE_INTEGER *)&c);
return c;
}
class CZPerfCounter {
public:
CZPerfCounter() : m_st(ZQueryPerformanceCounter()) {};
void reset() { m_st = ZQueryPerformanceCounter(); };
unsigned __int64 elapsedCount() { return ZQueryPerformanceCounter() - m_st; };
unsigned long elapsedMS() { return (unsigned long)(elapsedCount() * 1000 / m_freq); };
unsigned long elapsedMicroSec() { return (unsigned long)(elapsedCount() * 1000 * 1000 / m_freq); };
static unsigned __int64 frequency() { return m_freq; };
private:
unsigned __int64 m_st;
static unsigned __int64 m_freq;
};
unsigned __int64 CZPerfCounter::m_freq = ZQueryPerformanceFrequency();
int main(int argc, char ** argv)
{
SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);
int ncpu = sysinfo.dwNumberOfProcessors;
if (argc == 2) {
ncpu = atoi(argv[1]);
}
{
printf("No of threads %d\n", ncpu);
try {
concurrency::Scheduler::ResetDefaultSchedulerPolicy();
int min_threads = 1;
int max_threads = ncpu;
concurrency::SchedulerPolicy policy
(2 // two entries of policy settings
, concurrency::MinConcurrency, min_threads
, concurrency::MaxConcurrency, max_threads
);
concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
}
catch (concurrency::default_scheduler_exists &) {
printf("Cannot set concurrency runtime scheduler policy (Default scheduler already exists).\n");
}
static int cnt = 100;
static int num_fills = 1;
CZPerfCounter pcTotal;
// malloc/free
printf("malloc/free\n");
{
CZPerfCounter pc;
for (int i = 1 * 1024 * 1024; i <= 8 * 1024 * 1024; i *= 2) {
concurrency::parallel_for(0, 50, [i](size_t x) {
std::vector<void *> ptrs;
ptrs.reserve(cnt);
for (int n = 0; n < cnt; n++) {
auto p = malloc(i);
ptrs.emplace_back(p);
}
for (int x = 0; x < num_fills; x++) {
for (auto p : ptrs) {
memset(p, num_fills, i);
}
}
for (auto p : ptrs) {
free(p);
}
});
printf("size %4d MB, elapsed %8.2f s, \n", i / (1024 * 1024), pc.elapsedMS() / 1000.0);
pc.reset();
}
}
printf("\n");
printf("Total %6.2f s\n", pcTotal.elapsedMS() / 1000.0);
}
return 0;
}
Surprisingly, the result is very bad in Windows 10 CU compared to Windows 7. I plotted the result below for 1MB chunk size and 8MB chunk size, varying the number of threads from 2,4,.., up to 28. While Windows 7 gave slightly worse performance when we increased the number of threads, Windows 10 gave much worse scalability.
We have tried to make sure all Windows update is applied, update drivers, tweak BIOS settings, without success. We also ran the same benchmark on several other hardware platforms, and all gave similar curve for Windows 10. So it seems to be a problem of Windows 10.
Does anyone have similar experience, or maybe know-how about this (maybe we missed something ?). This behavior has made our multithreaded application got significant performance hit.
*** EDITED
Using https://github.com/google/UIforETW (thanks to Bruce Dawson) to analyze the benchmark, we found that most of the time is spent inside kernels KiPageFault. Digging further down the call tree, all leads to ExpWaitForSpinLockExclusiveAndAcquire. Seems that the lock contention is causing this issue.
*** EDITED
Collected Server 2012 R2 data on the same hardware. Server 2012 R2 is also worse than Win7, but still a lot better than Win10 CU.
*** EDITED
It happens in Server 2016 as well. I added the tag windows-server-2016.
*** EDITED
Using info from #Ext3h, I modified the benchmark to use VirtualAlloc and VirtualLock. I can confirmed significant improvement compared to when VirtualLock is not used. Overall Win10 is still 30% to 40% slower than Win7 when both using VirtualAlloc and VirtualLock.

Microsoft seems to have fixed this issue with Windows 10 Fall Creators Update and Windows 10 Pro for Workstation.
Here is the updated graph.
Win 10 FCU and WKS has lower overhead than Win 7. In exchange, the VirtualLock seems to have higher overhead.

Unfortunately not an answer, just some additional insight.
Little experiment with a different allocation strategy:
#include <Windows.h>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <atomic>
#include <iostream>
#include <chrono>
class AllocTest
{
public:
virtual void* Alloc(size_t size) = 0;
virtual void Free(void* allocation) = 0;
};
class BasicAlloc : public AllocTest
{
public:
void* Alloc(size_t size) override {
return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}
void Free(void* allocation) override {
VirtualFree(allocation, NULL, MEM_RELEASE);
}
};
class ThreadAlloc : public AllocTest
{
public:
ThreadAlloc() {
t = std::thread([this]() {
std::unique_lock<std::mutex> qlock(this->qm);
do {
this->qcv.wait(qlock, [this]() {
return shutdown || !q.empty();
});
{
std::unique_lock<std::mutex> rlock(this->rm);
while (!q.empty())
{
q.front()();
q.pop();
}
}
rcv.notify_all();
} while (!shutdown);
});
}
~ThreadAlloc() {
{
std::unique_lock<std::mutex> lock1(this->rm);
std::unique_lock<std::mutex> lock2(this->qm);
shutdown = true;
}
qcv.notify_all();
rcv.notify_all();
t.join();
}
void* Alloc(size_t size) override {
void* target = nullptr;
{
std::unique_lock<std::mutex> lock(this->qm);
q.emplace([this, &target, size]() {
target = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
VirtualLock(target, size);
VirtualUnlock(target, size);
});
}
qcv.notify_one();
{
std::unique_lock<std::mutex> lock(this->rm);
rcv.wait(lock, [&target]() {
return target != nullptr;
});
}
return target;
}
void Free(void* allocation) override {
{
std::unique_lock<std::mutex> lock(this->qm);
q.emplace([allocation]() {
VirtualFree(allocation, NULL, MEM_RELEASE);
});
}
qcv.notify_one();
}
private:
std::queue<std::function<void()>> q;
std::condition_variable qcv;
std::condition_variable rcv;
std::mutex qm;
std::mutex rm;
std::thread t;
std::atomic_bool shutdown = false;
};
int main()
{
SetProcessWorkingSetSize(GetCurrentProcess(), size_t(4) * 1024 * 1024 * 1024, size_t(16) * 1024 * 1024 * 1024);
BasicAlloc alloc1;
ThreadAlloc alloc2;
AllocTest *allocator = &alloc2;
const size_t buffer_size =1*1024*1024;
const size_t buffer_count = 10*1024;
const unsigned int thread_count = 32;
std::vector<void*> buffers;
buffers.resize(buffer_count);
std::vector<std::thread> threads;
threads.resize(thread_count);
void* reference = allocator->Alloc(buffer_size);
std::memset(reference, 0xaa, buffer_size);
auto func = [&buffers, allocator, buffer_size, buffer_count, reference, thread_count](int thread_id) {
for (int i = thread_id; i < buffer_count; i+= thread_count) {
buffers[i] = allocator->Alloc(buffer_size);
std::memcpy(buffers[i], reference, buffer_size);
allocator->Free(buffers[i]);
}
};
for (int i = 0; i < 10; i++)
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int t = 0; t < thread_count; t++) {
threads[t] = std::thread(func, t);
}
for (int t = 0; t < thread_count; t++) {
threads[t].join();
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << duration << std::endl;
}
DebugBreak();
return 0;
}
Under all sane conditions, BasicAlloc is faster, just as it should be. In fact, on a quad core CPU (no HT), there is no constellation in which ThreadAlloc could outperform it. ThreadAlloc is constantly around 30% slower. (Which is actually surprisingly little, and it keeps true even for tiny 1kB allocations!)
However, if the CPU has around 8-12 virtual cores, then it eventually reaches the point where BasicAlloc actually scales negatively, while ThreadAlloc just "stalls" on the base line overhead of soft faults.
If you profile the two different allocation strategies, you can see that for a low thread count, KiPageFault shifts from memcpy on BasicAlloc to VirtualLock on ThreadAlloc.
For higher thread and core counts, eventually ExpWaitForSpinLockExclusiveAndAcquire starts emerging from virtually zero load to up to 50% with BasicAlloc, while ThreadAlloc only maintains the constant overhead from KiPageFault itself.
Well, the stall with ThreadAlloc is also pretty bad. No matter how many cores or nodes in a NUMA system you have, you are currently hard capped to around 5-8GB/s in new allocations, across all processes in the system, solely limited by single thread performance. All the dedicated memory management thread achieves, is not wasting CPU cycles on a contended critical section.
You would have expected that Microsoft had a lock free strategy for assigning pages on different cores, but apparently that's not even remotely the case.
The spin-lock was also already present in the Windows 7 and earlier implementations of KiPageFault. So what did change?
Simple answer: KiPageFault itself became much slower. No clue what exactly caused it to slow down, but the spin-lock simply never became a obvious limit, because 100% contention was never possible before.
If someone whishes to disassemble KiPageFault to find the most expensive part - be my guest.

Related

CUDA unified memory and Windows 10

While using CudaMallocManaged() to allocate an array of structs with arrays inside, I'm getting the error "out of memory" even though I have enough free memory. Here's some code that replicates my problem:
#include <iostream>
#include <cuda.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
#define N 100000
#define ARR_SZ 100
struct Struct
{
float* arr;
};
int main()
{
Struct* struct_arr;
gpuErrchk( cudaMallocManaged((void**)&struct_arr, sizeof(Struct)*N) );
for(int i = 0; i < N; ++i)
gpuErrchk( cudaMallocManaged((void**)&(struct_arr[i].arr), sizeof(float)*ARR_SZ) ); //out of memory...
for(int i = 0; i < N; ++i)
cudaFree(struct_arr[i].arr);
cudaFree(struct_arr);
/*float* f;
gpuErrchk( cudaMallocManaged((void**)&f, sizeof(float)*N*ARR_SZ) ); //this works ok
cudaFree(f);*/
return 0;
}
There doesn't seem to be a problem when I call cudaMallocManaged() once to allocate a single chunk of memory, as I'm showing in the last piece of commented code.
I have a GeForce GTX 1070 Ti, and I'm using Windows 10. A friend tried to compile the same code in a PC with Linux and it worked correctly, while it had the same issue in another PC with Windows 10. WDDM TDR is deactivated.
Any help would be appreciated. Thanks.
There is an allocation granularity.
This means that if you ask for 1 byte, or 400 bytes, what is actually used up is something like 4096 65536 bytes. So a bunch of very small allocations will actually use up memory at a much faster rate than what you would predict based on the requested allocation size. The solution is to not make very small allocations, but instead to allocate in larger chunks.
An alternative strategy here would also be to flatten your allocation, and carve out pieces from it for each of your arrays:
#include <iostream>
#include <cstdio>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
#define N 100000
#define ARR_SZ 100
struct Struct
{
float* arr;
};
int main()
{
Struct* struct_arr;
float* f;
gpuErrchk( cudaMallocManaged((void**)&struct_arr, sizeof(Struct)*N) );
gpuErrchk( cudaMallocManaged((void**)&f, sizeof(float)*N*ARR_SZ) );
for(int i = 0; i < N; ++i)
struct_arr[i].arr = f+i*ARR_SZ;
cudaFree(struct_arr);
cudaFree(f);
return 0;
}
ARR_SZ divisible by 4 means the various created pointers can also be up-cast to larger vector types e.g. float2 or float4, if your use had any intention of doing that.
A possible reason the original code works on linux is because managed memory on linux, in a proper setup, can oversubscribe the GPU physical memory. The result is the actual allocation limit is much higher than what the GPU on-board memory would suggest. It might also be that the linux case has a bit more free memory, or perhaps the allocation granularity on linux is different (smaller).
Based on a question in the comments, I decided to estimate the allocation granularity, using this code:
#include <iostream>
#include <cstdio>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
#define N 100000
#define ARR_SZ 100
struct Struct
{
float* arr;
};
int main()
{
Struct* struct_arr;
//float* f;
gpuErrchk(cudaMallocManaged((void**)& struct_arr, sizeof(Struct) * N));
#if 0
gpuErrchk(cudaMallocManaged((void**)& f, sizeof(float) * N * ARR_SZ));
for (int i = 0; i < N; ++i)
struct_arr[i].arr = f + i * ARR_SZ;
#else
size_t fre, tot;
gpuErrchk(cudaMemGetInfo(&fre, &tot));
std::cout << "Free: " << fre << " total: " << tot << std::endl;
for (int i = 0; i < N; ++i)
gpuErrchk(cudaMallocManaged((void**) & (struct_arr[i].arr), sizeof(float) * ARR_SZ));
gpuErrchk(cudaMemGetInfo(&fre, &tot));
std::cout << "Free: " << fre << " total: " << tot << std::endl;
for (int i = 0; i < N; ++i)
cudaFree(struct_arr[i].arr);
#endif
cudaFree(struct_arr);
//cudaFree(f);
return 0;
}
When I compile a debug project with that code, and run that on a windows 10 desktop with RTX 2070 GPU (8GB memory, same as GTX 1070 Ti) I get the following output:
Microsoft Windows [Version 10.0.17763.973]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\Robert Crovella>cd C:\Users\Robert Crovella\source\repos\test12\x64\Debug
C:\Users\Robert Crovella\source\repos\test12\x64\Debug>test12
Free: 7069866393 total: 8589934592
Free: 516266393 total: 8589934592
C:\Users\Robert Crovella\source\repos\test12\x64\Debug>test12
Free: 7069866393 total: 8589934592
Free: 516266393 total: 8589934592
C:\Users\Robert Crovella\source\repos\test12\x64\Debug>
Note that on my machine there is only 0.5GB of reported free memory left after the 100,000 allocations. So if for any reason your 8GB GPU starts out with less free memory (entirely possible) you may run into an out-of-memory error, even though I did not.
The calculation of the allocation granularity is as follows:
7069866393 - 516266393 / 100000 = 65536 bytes per allocation(!)
So my previous estimate of 4096 bytes per allocation was way off, by at least 1 order of magnitude, on my machine/test setup.
The allocation granularity may vary based on:
windows or linux
WDDM or TCC
x86 or Power9
managed vs ordinary cudaMalloc
possibly other factors (e.g. CUDA version)
so my advice to future readers would not be to assume that it is always 65536 bytes per allocation, minimum.

Why is local memory in this OpenCL algorithm so slow?

I am writing some OpenCL code. My kernel should create a special "accumulator" output based on an input image. I have tried two concepts and both are equally slow, although the second one uses local memory. Could you please help me identify why the local memory version is so slow? The target GPU for the kernels is a AMD Radeon Pro 450.
// version one
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
const unsigned int x = get_global_id(0);
const unsigned int y = get_global_id(1);
int ind;
for(k = SOME_BEGINNING; k <= SOME_END; k++) {
// some pretty wild calculation
// ind is not linear and accesses different areas of the output
ind = ...
if(input[y * WIDTH + x] == 255) {
atomic_inc(&output[ind]);
}
}
}
// variant two
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
const unsigned int x = get_global_id(0);
const unsigned int y = get_global_id(1);
__local int buf[7072];
if(y < 221 && x < 32) {
buf[y * 32 + x] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
int ind;
int k;
for(k = SOME_BEGINNING; k <= SOME_END; k++) {
// some pretty wild calculation
// ind is not linear and access different areas of the output
ind = ...
if(input[y * WIDTH + x] == 255) {
atomic_inc(&buf[ind]);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(get_local_id(0) == get_local_size(0) - 1)
for(k = 0; k < 7072; k++)
output[k] = buf[k];
}
}
I would expect that the second variant is faster than the first one, but it isn't. Sometimes it is even slower.
Local buffer size __local int buf[7072] (28288 bytes) is too big. I don't know how big shared memory for AMD Radeon Pro 450 is but likely that is 32kB or 64kB per computing unit.
32768/28288 = 1, 65536/28288 = 2 means only 1 or maximum 2 wavefronts (64 work items) can run simultaneously only, so occupancy of computing unit is very very low hence poor performance.
Your aim should be to reduce local buffer as much as possible so that more wavefronts can be processed simultaneously.
Use CodeXL to profile your kernel - there are tools to show you all of this.
Alternatively you can have a look at CUDA occupancy calculator excel spreadsheet if you don't want to run the profiler to get a better idea of what that is about.

CUDA performance of atomic operation on different address in warp

To my knowledge, if atomic operations are performed on same memory address location in a warp, the performance of the warp could be 32 times slower.
But what if atomic operations of threads in a warp are on 32 different memory locations? Is there any performance penalty at all? Or it will be as fast as normal operation?
My use case is that I have 32 different positions, each thread in a warp needs one of these position but which position is data dependent. So each thread could use atomicCAS to scan if the location desired is empty or not. If it is not empty, scan the next position.
If I am lucky, 32 threads could atomicCAS to 32 different memory locations, is there any performance penalty is this case?
I assume Kepler architecture is used
In the code below, I'm adding a constant value to the elements of an array (dev_input). I'm comparing two kernels, one using atomicAdd and one using regular addition. This is an example taken to the extreme in which atomicAdd operates on completely different addresses, so there will be no need for serialization of the operations.
#include <stdio.h>
#define BLOCK_SIZE 1024
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void regular_addition(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) dev_input[i] = dev_input[i] + val;
}
__global__ void atomic_operations(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) atomicAdd(&dev_input[i],val);
}
int main(){
int N = 8192*32;
float* output = (float*)malloc(N*sizeof(float));
float* dev_input; gpuErrchk(cudaMalloc((void**)&dev_input, N*sizeof(float)));
gpuErrchk(cudaMemset(dev_input, 0, N*sizeof(float)));
int NumBlocks = iDivUp(N,BLOCK_SIZE);
float time, timing1 = 0.f, timing2 = 0.f;
cudaEvent_t start, stop;
int niter = 32;
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
atomic_operations<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing1 = timing1 + time;
}
printf("Time for atomic operations: %3.5f ms \n", timing1/(float)niter);
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
regular_addition<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing2 = timing2 + time;
}
printf("Time for regular addition: %3.5f ms \n", timing2/(float)niter);
}
Testing this code on my NVIDIA GeForce GT540M, CUDA 5.5, Windows 7, I obtain approximately the same results for the two kernels, i.e., about 0.7ms.
Now change the instruction
if (i < N) atomicAdd(&dev_input[i],val);
to
if (i < N) atomicAdd(&dev_input[i%32],val);
which is closer to the case of your interest, namely, each atomicAdd operates on different addresses within a warp. The result I obtain is that no performance penalty is observed.
Finally, change the above instruction to
if (i < N) atomicAdd(&dev_input[0],val);
This is the other extreme in which atomicAdd always operates on the same address. In this case, the execution time raises to 5.1ms.
The above tests have been performed on a Fermi architecture. You can try to run the above code on your Kepler card.

how to increase memory limit in Visual Studio C++

Need Help.I'm stuck at a problem when running a C++ code on Windows- Visual Studio.
When I run that code in Linux environment, there is no restriction on the memory I am able to allocate dynamically(till the size available in RAM).
But on VS Compiler, it does not let me create an array beyond a limited size.
I've tried /F option and 20-25 of google links to increase memory size but they dont seem to help much.
I am currently able to assign only around 100mb out of 3gb available.
If there is a solution for this in Windows and not in Visual Studio's compiler, I will be glad to hear that as I have a CUDA TeslaC2070 card which is proving to be pretty useless on Windows as I wanted to run my CUDA/C++ code on Windows environment.
Here's my code. it fails when LENGTH>128(no of images 640x480pngs. less than 0.5mb each. I've also calculated the approximate memory size it takes by counting data structures and types used in OpenCV and by me but still it is very less than 2gb). stackoverflow exception. Same with dynamic allocation. I've already maximized the heap and stack sizes.
#include "stdafx.h"
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#define LENGTH 100
#define SIZE1 640
#define SIZE2 480
#include <iostream>
using namespace std;
__global__ void square_array(double *img1_d, long N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
img1_d[idx]= 255.0-img1_d[idx];
}
int _tmain(int argc, _TCHAR* argv[])
{
IplImage *img1[LENGTH];
// Open the file.
for(int i=0;i<LENGTH;i++)
{ img1[i] = cvLoadImage("abstract3.jpg");}
CvMat *mat1[LENGTH];
for(int i=0;i<LENGTH;i++)
{
mat1[i] = cvCreateMat(img1[i]->height,img1[i]->width,CV_32FC3 );
cvConvert( img1[i], mat1[i] );
}
double a[LENGTH][2*SIZE1][SIZE2][3];
for(int m=0;m<LENGTH;m++)
{
for(int i=0;i<SIZE1;i++)
{
for(int j=0;j<SIZE2;j++)
{
CvScalar scal = cvGet2D( mat1[m],j,i);
a[m][i][j][0] = scal.val[0];
a[m][i][j][1] = scal.val[1];
a[m][i][j][2] = scal.val[2];
a[m][i+SIZE1][j][0] = scal.val[0];
a[m][i+SIZE1][j][1] = scal.val[1];
a[m][i+SIZE1][j][2] = scal.val[2];
}
} }
//cuda
double *a_d;
int N=LENGTH*2*SIZE1*SIZE2*3;
cudaMalloc((void **) &a_d, N*sizeof(double));
cudaMemcpy(a_d, a, N*sizeof(double), cudaMemcpyHostToDevice);
int block_size = 370;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
cout<<n_blocks<<block_size;
square_array <<< n_blocks, block_size >>> (a_d, N);
cudaMemcpy(a, a_d, N*sizeof(double), cudaMemcpyDeviceToHost);
//cuda end
char name[]= "Image: 00000";
name[12]='\0';
int x=0,y=0;
for(int m=0;m<LENGTH;m++)
{
for (int i = 0; i < img1[m]->width*img1[m]->height*3; i+=3)
{
img1[m]->imageData[i]= a[m][x][y][0];
img1[m]->imageData[i+1]= a[m][x][y][1];
img1[m]->imageData[i+2]= a[m][x][y][2];
if(x==SIZE1)
{
x=0;
y++;
}
x++;
}
switch(name[11])
{
case '9': switch(name[10])
{
case '9':
switch(name[9])
{
case '9': name[11]='0';name[10]='0';name[9]='0';name[8]++;
break;
default : name[11]='0';
name[10]='0';
name[9]++;
}break;
default : name[11]='0'; name[10]++;break;
}
break;
default : name[11]++;break;
}
// Display the image.
cvNamedWindow(name, CV_WINDOW_AUTOSIZE);
cvShowImage(name,img1);
//cvSaveImage(name ,img1);
}
// Wait for the user to press a key in the GUI window.
cvWaitKey(0);
// Free the resources.
//cvDestroyWindow(x);
//cvReleaseImage(&img1);
//cvDestroyWindow("Image:");
//cvReleaseImage(&img2);
return 0;
}
The problem is that you are allocating a huge multidimensional array on the stack in your main function (double a[..][..][..]). Do not allocate this much memory on the stack. Use malloc/new to allocate on the heap.

How to read performance counters on i5, i7 CPUs

Modern CPUs have quite a lot of performance counters - http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html how to read them?
I'm interested in cache misses and branch mispredictions.
Looks like PAPI has very clean API and works just fine on Ubuntu 11.04.
Once it's installed, following app will do what I wanted:
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>
#define NUM_EVENTS 4
void matmul(const double *A, const double *B,
double *C, int m, int n, int p)
{
int i, j, k;
for (i = 0; i < m; ++i)
for (j = 0; j < p; ++j) {
double sum = 0;
for (k = 0; k < n; ++k)
sum += A[i*n + k] * B[k*p + j];
C[i*p + j] = sum;
}
}
int main(int /* argc */, char ** /* argv[] */)
{
const int size = 300;
double a[size][size];
double b[size][size];
double c[size][size];
int event[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_BR_MSP, PAPI_L1_DCM };
long long values[NUM_EVENTS];
/* Start counting events */
if (PAPI_start_counters(event, NUM_EVENTS) != PAPI_OK) {
fprintf(stderr, "PAPI_start_counters - FAILED\n");
exit(1);
}
matmul((double *)a, (double *)b, (double *)c, size, size, size);
/* Read the counters */
if (PAPI_read_counters(values, NUM_EVENTS) != PAPI_OK) {
fprintf(stderr, "PAPI_read_counters - FAILED\n");
exit(1);
}
printf("Total instructions: %lld\n", values[0]);
printf("Total cycles: %lld\n", values[1]);
printf("Instr per cycle: %2.3f\n", (double)values[0] / (double) values[1]);
printf("Branches mispredicted: %lld\n", values[2]);
printf("L1 Cache misses: %lld\n", values[3]);
/* Stop counting events */
if (PAPI_stop_counters(values, NUM_EVENTS) != PAPI_OK) {
fprintf(stderr, "PAPI_stoped_counters - FAILED\n");
exit(1);
}
return 0;
}
Tested this on Intel Q6600, it supports up to 4 performance events. Your processor may support more or less.
What about perf? perf list hw cache shows 33 different events and the man page shows how to use raw performance counter descriptors.
Performance counters are read with the RDPMC insn.
EDIT: To add a bit more info, reading performance counters is not very easy and it would take pages upon pages if we are to describe it here, besides it involves writes to Model Specific Registers, which require privileged instructions. I would instead advise to use ready profilers - oprofile or Intel VTune, which are built upon performance counters.
I think there is a available library that can be used, called perfmon2, http://perfmon2.sourceforge.net/, and documentations are available at http://www.hpl.hp.com/research/linux/perfmon/perfmon.php4 and http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html, I am recently digging this lib out, I would post example code as soon as I figure it out~

Resources