A question about the details of how blocks are distributed to SMs in CUDA - gpgpu

Let me take hardware with compute capability 1.3 as an example.
30 SMs are available, so at most 240 blocks can be resident at the same time (given the limits on registers and shared memory, the actual limit on the number of resident blocks may be much lower). Blocks beyond those 240 have to wait for hardware resources to become available.
My question is: when are the blocks beyond the first 240 assigned to SMs? As soon as some of the first 240 blocks complete? Or only once all of the first 240 blocks have finished?
I wrote the following piece of code:
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

const int BLOCKNUM = 1024;
const int N = 240;

__global__ void kernel ( volatile int* mark ) {
    if ( blockIdx.x == 0 ) while ( mark[N] == 0 );
    if ( threadIdx.x == 0 ) mark[blockIdx.x] = 1;
}

int main() {
    int * mark;
    cudaMalloc ( ( void** ) &mark, sizeof ( int ) * BLOCKNUM );
    cudaMemset ( mark, 0, sizeof ( int ) * BLOCKNUM );
    kernel <<< BLOCKNUM, 1 >>> ( mark );
    cudaFree ( mark );
    return 0;
}
This code causes a deadlock and fails to terminate. But if I change N from 240 to 239, the code is able to terminate. So I want to know some details about the scheduling of blocks.

On the GT200, it has been demonstrated through micro-benchmarking that new blocks are scheduled whenever an SM has retired all of the blocks it was running. So the answer is: when some blocks are finished, and the scheduling granularity is the SM. There seems to be a consensus that Fermi GPUs have a finer scheduling granularity than previous generations of hardware.

I can't find any reference about this for compute capabilities < 1.3.
Fermi architectures introduce a new block dispatcher called the GigaThread engine.
GigaThread enables immediate replacement of blocks on an SM when one finishes executing, and also enables concurrent kernel execution.
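On current toolkits you can also query the residency limit discussed above instead of computing it by hand. A minimal sketch (this uses the occupancy API, which is far newer than the compute capability 1.3 hardware in the question; the probe kernel is only illustrative):
#include <stdio.h>

__global__ void probe(volatile int *mark) { if (threadIdx.x == 0) mark[blockIdx.x] = 1; }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    // how many 1-thread blocks of "probe" can be resident on one SM at once
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, probe, 1, 0);

    printf("resident blocks at once: %d SMs x %d blocks/SM = %d\n",
           prop.multiProcessorCount, blocksPerSM,
           prop.multiProcessorCount * blocksPerSM);
    return 0;
}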

While there is no official answer to this, you can measure through atomic operations when your blocks begin their work and when they end.
Try playing with the following code:
#include <stdio.h>

const int maxBlocks = 60; // number of blocks of size 512 threads required on the current device to achieve full occupancy

__global__ void emptyKernel() {}

__global__ void myKernel(int *control, int *output) {
    if (threadIdx.x == 1) {
        // register that we enter
        int enter = atomicAdd(control, 1);
        output[blockIdx.x] = enter;

        // some intensive and long task
        int &var = output[blockIdx.x + gridDim.x]; // var references global memory
        var = 1;
        for (int i = 0; i < 12345678; ++i) {
            var += 1 + tanhf(var);
        }

        // register that we quit
        var = atomicAdd(control, 1);
    }
}

int main() {
    int *gpuControl;
    cudaMalloc((void**)&gpuControl, sizeof(int));
    int cpuControl = 0;
    cudaMemcpy(gpuControl, &cpuControl, sizeof(int), cudaMemcpyHostToDevice);

    int *gpuOutput;
    cudaMalloc((void**)&gpuOutput, sizeof(int) * maxBlocks * 2);
    int cpuOutput[maxBlocks * 2];
    for (int i = 0; i < maxBlocks * 2; ++i) // clear the host array just to be on the safe side
        cpuOutput[i] = -1;

    // play with these values
    const int thr = 479;
    const int p = 13;
    const int q = maxBlocks;

    // I found that this may actually affect the scheduler! Try with and without this call.
    emptyKernel<<<p, thr>>>();

    cudaEvent_t timerStart;
    cudaEvent_t timerStop;
    cudaEventCreate(&timerStart);
    cudaEventCreate(&timerStop);
    cudaThreadSynchronize();
    cudaEventRecord(timerStart, 0);

    myKernel<<<q, 512>>>(gpuControl, gpuOutput);

    cudaEventRecord(timerStop, 0);
    cudaEventSynchronize(timerStop);
    cudaMemcpy(cpuOutput, gpuOutput, sizeof(int) * maxBlocks * 2, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();

    float thisTime;
    cudaEventElapsedTime(&thisTime, timerStart, timerStop);
    cudaEventDestroy(timerStart);
    cudaEventDestroy(timerStop);

    printf("Elapsed time: %f\n", thisTime);
    for (int i = 0; i < q; ++i)
        printf("%d: %d-%d\n", i, cpuOutput[i], cpuOutput[i + q]);
}
What you get in the output is the block ID, followed by its enter "time" and exit "time". This way you can learn in which order those events occurred.
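If it helps to read the order more directly, a small host-side helper (my own sketch, not part of the original answer) can sort the blocks by their recorded enter counter:
#include <algorithm>
#include <utility>
#include <vector>
#include <cstdio>

// q is the number of blocks; cpuOutput holds the enter counters in [0,q) and the exit counters in [q,2q)
void printScheduleOrder(const int *cpuOutput, int q)
{
    std::vector<std::pair<int,int> > order;          // (enter counter, block id)
    for (int i = 0; i < q; ++i)
        order.push_back(std::make_pair(cpuOutput[i], i));
    std::sort(order.begin(), order.end());           // earliest entries first
    for (size_t k = 0; k < order.size(); ++k)
        printf("entered #%d: block %d (exit counter %d)\n",
               order[k].first, order[k].second, cpuOutput[order[k].second + q]);
}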

On Fermi, I'm sure that a block is scheduled on an SM as soon as there is room for it. That is, whenever an SM finishes executing one block, it will execute another block if there are any blocks left. (However, the actual order is not deterministic.)
For older versions, I don't know, but you can verify it by using the built-in clock() function.
For example, I used the following OpenCL kernel code (you can easily convert it to CUDA):
__kernel void test(__global uint* start, __global uint* end, __global float* buffer)
{
    int id = get_global_id(0);
    start[id] = clock();
    // __do_something_here;
    end[id] = clock();
}
Then write the output to a file and build a graph from it; you will see the ordering very clearly.
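For reference, a direct CUDA translation could look like this (a minimal sketch; the kernel name and the buffer parameter are just illustrative). Note that clock() returns a per-SM cycle counter, so raw values are only directly comparable between threads that ran on the same SM, and clock64() avoids wrap-around on long kernels:
__global__ void timestamp(unsigned int *start, unsigned int *end, float *buffer)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    start[id] = (unsigned int) clock();   // cycle counter at entry
    // ... do something with buffer[id] here ...
    end[id] = (unsigned int) clock();     // cycle counter at exit
}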

Related

Number of Computing units in OpenCL

__kernel void kmp(__global char pattern[2*4], __global char* string, __global int failure[2*4], __global int ret[2], int g_length, int l_length, int thread_num)
{
    int pattern_num = 2;
    int pattern_size = 4;
    int gid = get_group_id(0);
    int glid = get_global_id(0);
    int lid = get_local_id(0);
    int i, j, x = 0;
    int old = 0;
    __local char tmp_string[32768];
    event_t event;

    event = async_work_group_copy(tmp_string + lid*l_length, string + glid*l_length, l_length, 0);
    wait_group_events(1, &event);

    for(i = 0; i < pattern_num; i++){
        x = i*pattern_size;
        for(j = lid*l_length; j < (lid+1)*l_length; j++){
            while(tmp_string[j] != pattern[x] && x > 0 && x != i*pattern_size){
                x = failure[x-1] + i*pattern_size;
            }
            if(tmp_string[j] == pattern[x]){
                if(x == (i+1)*pattern_size - 1){
                    //ret[i]++;
                    old = atomic_add(&ret[i], 1);
                    x = failure[x] + i*pattern_size;
                }
                else{
                    x++;
                }
            }
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
I need help with this code.
I wrote it to find occurrences of the patterns in the string.
I'm using an AMD Hawaii GPU, which has 44 compute units with 64 cores each (2816 cores in total, I mean).
The problem is that when I try to use more than 44 work-items (more than one per group; e.g. 88 work-items with two per group, or 2816 work-items with 64 per group), it no longer works correctly: it doesn't find the right number of matches.
I checked the string indices, the IDs (glid, gid, lid) and the sizes of all variables, but I couldn't find anything wrong.
Anyone who has some advice, please help!
What exactly goes wrong when you say it doesn't work well? Also, why are you not doing any real work with the async copy - maybe a simple global-to-local assignment would do? And why is there a local barrier at the end of the kernel?
Anyway, the error seems to be the async copy: its arguments are different for each work-item in a group. For it to work correctly, it must be given exactly the same arguments in all work-items of a group. That's why it works with a local size of 1 and not for bigger local groups.
For example, glid is different for all 64 work-items in a group, so it won't work. async_work_group_copy makes all work-items of a group cooperate on the same copy, not on different copies. If you need different copies, you need multiple async copy calls issued one after another, with group-uniform arguments (e.g. computed from get_group_id(0)); they will still run asynchronously if you wait on all of their events at once.

Mandelbrot using OpenMP

#include <stdio.h>
#include <stdlib.h>

// return 1 if in set, 0 otherwise
int inset(double real, double img, int maxiter){
    double z_real = real;
    double z_img = img;
    for(int iters = 0; iters < maxiter; iters++){
        double z2_real = z_real*z_real - z_img*z_img;
        double z2_img = 2.0*z_real*z_img;
        z_real = z2_real + real;
        z_img = z2_img + img;
        if(z_real*z_real + z_img*z_img > 4.0) return 0;
    }
    return 1;
}

// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter){
    int count = 0;
    double real_step = (real_upper - real_lower)/num;
    double img_step = (img_upper - img_lower)/num;
    for(int real = 0; real <= num; real++){
        for(int img = 0; img <= num; img++){
            count += inset(real_lower + real*real_step, img_lower + img*img_step, maxiter);
        }
    }
    return count;
}

// main
int main(int argc, char *argv[]){
    double real_lower;
    double real_upper;
    double img_lower;
    double img_upper;
    int num;
    int maxiter;
    int num_regions = (argc-1)/6;
    for(int region = 0; region < num_regions; region++){
        // scan the arguments
        sscanf(argv[region*6+1], "%lf", &real_lower);
        sscanf(argv[region*6+2], "%lf", &real_upper);
        sscanf(argv[region*6+3], "%lf", &img_lower);
        sscanf(argv[region*6+4], "%lf", &img_upper);
        sscanf(argv[region*6+5], "%i", &num);
        sscanf(argv[region*6+6], "%i", &maxiter);
        printf("%d\n", mandelbrotSetCount(real_lower, real_upper, img_lower, img_upper, num, maxiter));
    }
    return EXIT_SUCCESS;
}
I need to convert the above code to OpenMP. I know how to do it for a single matrix or image, but here I have to process 2 images at the same time.
The arguments are as follows:
$ ./mandelbrot -2.0 1.0 -1.0 1.0 100 10000 -1 1.0 0.0 1.0 100 10000
Any suggestion on how to divide the work into different threads for the two images, and then further divide the work within each image?
Thanks in advance.
If you want to process multiple images at a time, you need to add a #pragma omp parallel for to the loop in the main body, such as:
#pragma omp parallel for private(real_lower, real_upper, img_lower, img_upper, num, maxiter)
for(int region = 0; region < num_regions; region++){
    // scan the arguments
    sscanf(argv[region*6+1], "%lf", &real_lower);
    sscanf(argv[region*6+2], "%lf", &real_upper);
    sscanf(argv[region*6+3], "%lf", &img_lower);
    sscanf(argv[region*6+4], "%lf", &img_upper);
    sscanf(argv[region*6+5], "%i", &num);
    sscanf(argv[region*6+6], "%i", &maxiter);
    printf("%d\n", mandelbrotSetCount(real_lower, real_upper, img_lower, img_upper, num, maxiter));
}
Notice that some variables need to be declared private (i.e. each thread has its own copy).
Now, if you want additional parallelism you need nested OpenMP (see omp_set_nested / OMP_NESTED in the OpenMP specification), as the work will be spawned by OpenMP threads -- but note that nesting does not always give a performance boost.
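A minimal sketch of how nesting could be enabled (my own illustration, not part of the original answer; the two regions are the ones from the command line shown in the question, and it assumes mandelbrotSetCount() contains the inner parallel for shown just below):
#include <omp.h>
#include <stdio.h>

int mandelbrotSetCount(double, double, double, double, int, int); // defined elsewhere in this program

int main()
{
    omp_set_nested(1);   // allow nested parallel regions (or set OMP_NESTED=true)
    int counts[2];

    #pragma omp parallel for num_threads(2)   // outer level: one thread per image
    for (int region = 0; region < 2; ++region)
    {
        // each outer thread handles one image; the parallel for inside
        // mandelbrotSetCount() then creates its own (nested) thread team
        if (region == 0)
            counts[0] = mandelbrotSetCount(-2.0, 1.0, -1.0, 1.0, 100, 10000);
        else
            counts[1] = mandelbrotSetCount(-1.0, 1.0,  0.0, 1.0, 100, 10000);
    }

    for (int region = 0; region < 2; ++region)
        printf("%d\n", counts[region]);
    return 0;
}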
In this case, what about adding a #pragma omp parallel for (with the appropriate reduction clause so that each thread accumulates into count) into the mandelbrotSetCount routine such as
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter)
{
    int count = 0;
    double real_step = (real_upper - real_lower)/num;
    double img_step = (img_upper - img_lower)/num;
    #pragma omp parallel for reduction(+:count)
    for(int real = 0; real <= num; real++){
        for(int img = 0; img <= num; img++){
            count += inset(real_lower + real*real_step, img_lower + img*img_step, maxiter);
        }
    }
    return count;
}
The whole approach splits the images between threads first; then, each time the routine is invoked, the remaining available threads split the loop iterations of this routine among themselves.
EDIT
As user Hristo suggests in the comments, the mandelbrotSetCount routine might be unbalanced across invocations (the most likely reason being that the user simply requests a different maxiter for each region). One way to address this performance issue is to use dynamic thread scheduling in the routine. So rather than having
#pragma omp parallel for reduction(+:count)
we might want to have
#pragma omp parallel for reduction(+:count) schedule(dynamic,N)
and here N should be a relatively small value (and likely larger than 1).

Optimize CUDA kernel execution time

I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program that computes the difference between two pictures, and I compared the execution time of a classic CPU implementation in C against a GPU implementation in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch(computing_type)
{
    case GPU:
        HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
        HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

        float time;
        cudaEvent_t start, stop;
        HANDLE_ERROR( cudaEventCreate(&start) );
        HANDLE_ERROR( cudaEventCreate(&stop) );
        HANDLE_ERROR( cudaEventRecord(start, 0) );

        for(int m = 0; m < nb_loops ; m++)
        {
            diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
        }

        HANDLE_ERROR( cudaEventRecord(stop, 0) );
        HANDLE_ERROR( cudaEventSynchronize(stop) );
        HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
        HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

        printf("Time to generate: %4.4f ms \n", time/nb_loops);
        break;

    case CPU:
        clock_t begin = clock(), diff;
        for (int z = 0; z < nb_loops; z++)
        {
            // Apply the difference between 2 images
            for (int i = 0; i < height; i++)
            {
                tmp = i*imgresult_pitch;
                for (int j = 0; j < width; j++)
                {
                    imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
                }
            }
        }
        diff = clock() - begin;
        float msec = diff*1000/CLOCKS_PER_SEC;
        msec = msec/nb_loops;
        printf("Time taken %4.4f milliseconds", msec);
        break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please bear with me if there are some classic errors.
EDIT1:
Thank you for your feedback. I tried removing the 'if' condition from the kernel, but it didn't change my program's execution time much.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this message, but it seems plausible, because the GPU version is only 5 or 6 times faster than the CPU one. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, that would be helpful...
Flow.
Here are a few things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible" - but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So let each thread process more than a single element - say 4, since each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course, you'll need to adjust your grid and block parameters accordingly.
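A hedged sketch of that idea (my own illustration, not the asker's code): a 1D launch where each thread walks over the image with a grid-stride loop, so one thread handles several pixels. Here size is assumed to be width*height as in the host code above, and the launch configuration differs from the original <<<height, width>>>.
__global__ void diff_multi(const unsigned char *data1,
                           const unsigned char *data2,
                           int *data_res, int size)
{
    // each thread starts at its global index and strides over the whole image
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < size;
         i += gridDim.x * blockDim.x)
    {
        data_res[i] = (int) data2[i] - (int) data1[i];
    }
}
// possible launch: diff_multi<<<256, 256>>>(dev_data1, dev_data2, dev_data_res, size);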
2. "Restrict" your pointers
__restrict__ is not part of standard C++, but it is supported in CUDA. It tells the compiler that accesses through the different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via the non-coherent (read-only) cache. This applies to your kernel, although I haven't measured the effects.
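A minimal sketch of the original kernel with restricted pointers (the buffers come from separate cudaMalloc calls, so they are guaranteed not to alias; MAX_H and MAX_W are the compile-time constants from the original code):
__global__ void diff(const unsigned char * __restrict__ data1,
                     const unsigned char * __restrict__ data2,
                     int * __restrict__ data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;
    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}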
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4 ( unsigned int a, unsigned int b )
which, for each byte position, subtracts the signed byte value in b from the corresponding one in a, with saturation. If you can "live" with that result, rather than expecting a larger int variable, it could save you some work - and it goes very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
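A hedged sketch of that SIMD idea (my own illustration): the byte buffers are reinterpreted as 32-bit words and each thread subtracts four bytes at once. Note that the result is four saturated signed bytes packed into a word, not four ints, so this only applies if that representation is acceptable; size4 is assumed to be size/4.
__global__ void diff_simd(const unsigned int * __restrict__ data1,
                          const unsigned int * __restrict__ data2,
                          unsigned int * __restrict__ data_res,
                          int size4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size4)
        data_res[i] = __vsubss4(data2[i], data1[i]);  // per-byte data2 - data1
}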
I don't think you are measuring times correctly; the memory copies are a time-consuming step on the GPU that you should take into account when measuring your time.
A few details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider passing them with cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between loop iterations.
CUDA works with warps, so the number of threads per block works better as a multiple of 32 (the warp size), and not larger than 512 threads per block unless your hardware supports more. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free your allocated GPU memory and to destroy the timing events you created.
In your kernel function you can compute the index with a single variable: int v = threadIdx.x + blockIdx.x * blockDim.x.
Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with 2D arrays, because of padding.
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
    data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. This means that the if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, the if evaluates to false only at the borders. To avoid the divergence and the needless computation, split your image into two parts:
A central part, where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area; the if is unnecessary there.
The border areas, which will use your existing diff kernel.
Obviously, you'll have to modify the code that calls the kernels.
And on a separate note:
The GPU has a throughput-oriented architecture, not a latency-oriented one like the CPU. That means the CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using large data sets?
The CUDA profiler is a very handy tool that will tell you where your code is not optimal.

realloc in a loop on Win32

Simple C code that creates 10,000,000 numbers (doubles) in memory.
On Mac OS X it runs in about 1 second.
On Win32 (Visual C++ 2008) it takes about 15 minutes.
Both the Mac and the Win32 machine have 2 GB of memory.
Q: Why is realloc on Win32 so much slower than on Mac OS X?
// datrw.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"  // add for MSVC
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>  // add for MSVC

#define POOL 9030000000
#define ARSIZE 10000000

//int main()  // for Mac OS X, compile as: gcc datrw.c
int _tmain(int argc, _TCHAR* argv[])  // add for MSVC
{
    double *data, *temp;
    //----------------------------------------create data
    data = (double *)malloc(sizeof(double));  // add (double *) for MSVC
    double c;  // data to save
    int i;     // cycle variable
    for (i = 0; i < ARSIZE; i++) {
        c = POOL + i;
        data[i] = c;
        temp = (double *)realloc(data, (i+2)*sizeof(double));  // add (double *) for MSVC
        if (temp != NULL) {
            data = temp;
        } else {
            free(data);
            printf("Error allocating memory!\n");
            return 1;
        }
    }
    return 0;
}
If I replace the loop with:
for (i = 0; i < ARSIZE; i++) {
    c = POOL + i;
    //data[i] = c;
    temp = (double *)realloc(data, (i+2)*sizeof(double));  // (double *) for MSVC
    if (temp != NULL) {
        data = temp;
        if (temp == data) {  // add for optimized compilation
            data[i] = c;
        }
    } else {
        free(data);
        printf("Error allocating memory!\n");
        return 1;
    }
}
there is still no improvement :-(
But if I remove realloc from the loop entirely:
//----------------------------------------create data
data = (double *)malloc((ARSIZE+2)*sizeof(double));  // (double *) for MSVC
double c;  // data to save
int i;     // cycle variable
if (data != NULL) {
    for (i = 0; i < ARSIZE; i++) {
        c = POOL + i;
        data[i] = c;
    }
} else {
    free(data);
    printf("Error allocating memory!\n");
    return 1;
}
then it runs fast!
So why is realloc in a loop so bad on Win32, while it is fine on Mac OS X?
Foremost, you're comparing the performance of different C/C++ runtimes (libstdc++ vs msvcrt), not the performance of the OSes.
There are a lot of different memory-allocation strategies. It's difficult to select a strategy that provides maximum utility without excessively penalizing some behavior. For example, some strategies allow you to efficiently allocate millions of small blocks, but they are not efficient at (re)allocating huge memory blocks.
Generally, allocating many small objects is considered inefficient, and developers try to reduce the number of (re)allocations.
Another point is that MSVC performs extra memory checks when a program is running in debug mode, which slows the program down significantly. Check that you're running both versions in release mode.
Moving from theory to practice - always try to reduce the number of (re)allocations:
double *data = (double*) malloc( ARSIZE * sizeof(double) );
for( i = 0; i < ARSIZE; ++i ) {
    data[i] = POOL + i;
    ...
}
Use std::vector like this:
std::vector<double> data;
unsigned int reallocCount = 0;
for ( unsigned int i = 0; i < ARSIZE; ++i )
{
    double * p = i ? &data[0] : 0;
    data.push_back( POOL + i );
    if ( &data[0] != p )
        ++reallocCount;
}
data.push_back( 0 ); // whatever you wish to store the +2 doubles for...
data.push_back( 0 );
It is quite evident that your Mac version reserves a larger memory block than is actually requested, and that is exactly what vector does for you, so C++ is the way to go here. If reallocating ARSIZE times really required ARSIZE mallocs and memcpys, it could not possibly run in about one second.
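To illustrate that over-allocation idea with plain realloc (my own sketch, not part of the original answers): growing the capacity geometrically instead of by one element per iteration reduces the number of real reallocations from ARSIZE to roughly log2(ARSIZE). It assumes the same i, POOL and ARSIZE as in the question's code.
size_t capacity = 1;
double *data = (double *) malloc( capacity * sizeof(double) );
if ( data == NULL ) return 1;
for ( i = 0; i < ARSIZE; i++ ) {
    if ( (size_t)i >= capacity ) {
        capacity *= 2;   /* double the capacity whenever it runs out */
        double *temp = (double *) realloc( data, capacity * sizeof(double) );
        if ( temp == NULL ) { free( data ); printf( "Error allocating memory!\n" ); return 1; }
        data = temp;
    }
    data[i] = POOL + i;
}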
You should also test whether realloc actually moved the block, i.e. whether temp == data:
if ( temp == data )
{
    printf( "Didn't actually reallocate, the pointer is the same\n" );
}
else if ( temp )
{
    data = temp;
}
else
{
    free( data );
    printf( "Error allocating memory!\n" );
    return 1;
}
Hope this helps.
EDIT: to make it even more interesting, I added the reallocCount shown above. On my machine I get 41, as opposed to 10M!

Memory problems with a multi-threaded Win32 service that uses STL on VS2010

I have a multi-threaded Win32 service written in C++ (VS2010) that makes extensive use of the standard template library. The business logic of the program operates properly, but when looking at the task manager (or resource manager) the program leaks memory like a sieve.
I have a test set that averages about 16 simultaneous requests per second. When the program is first started it consumes somewhere in the neighborhood of 1.5 MB of RAM. After a full test run (which takes 12-15 minutes) the memory consumption ends up somewhere near 12 MB. Normally this would not be a problem for a program that runs once and then terminates, but this program is intended to run continuously. Very bad, indeed.
To try and narrow down the problem, I created a very small test application that spins off worker threads at a rate of once every 250ms. The worker thread creates a map and populates it with pseudo-random data, empties the map, and then exits. This program, too, leaks memory in like fashion, so I'm thinking that the problem is with the STL not releasing the memory as expected.
I have tried VLD to search for leaks and it found a couple, which I have fixed, but the problem remains. I have tried integrating Hoard, but that actually made the problem worse (I'm probably not integrating it properly, but I can't see how).
So I would like to pose the following question: is it possible to create a program that uses the STL in a multi-threaded environment that will not leak memory? Over the course of the last week I have made no less than 200 changes to this program. I have plotted the results of the changes and they all have the same basic profile. I don't want to have to remove all of the STL goodness that has made developing this application so much easier. I would earnestly appreciate any suggestions on how I can get this app working without leaking memory like it's going out of style.
Thanks again for any help!
P.S. I'm posting a copy of the memory test for inspection/personal edification.
#include <string>
#include <iostream>
#include <Windows.h>
#include <map>

using namespace std;

#define MAX_THD_COUNT 1000

DWORD WINAPI ClientThread(LPVOID param)
{
    unsigned int thdCount = (unsigned int)param;
    map<int, string> m;
    for (unsigned int x = 0; x < 1000; ++x)
    {
        string s;
        for (unsigned int y = 0; y < (x % (thdCount + 1)); ++y)
        {
            string z = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            unsigned int zs = z.size();
            s += z[(y % zs)];
        }
        m[x] = s;
    }
    m.erase(m.begin(), m.end());
    ExitThread(0);
    return 0;
}

int main(int argc, char ** argv)
{
    // wait for start
    string inputWait;
    cout << "type g and press enter to go: ";
    cin >> inputWait;

    // spawn many memory-consuming threads
    for (unsigned int thdCount = 0; thdCount < MAX_THD_COUNT; ++thdCount)
    {
        CreateThread(NULL, 0, ClientThread, (LPVOID)thdCount, NULL, NULL);
        cout << (int)(MAX_THD_COUNT - thdCount) << endl;
        Sleep(250);
    }

    // wait for end
    cout << "type e and press enter to end: ";
    cin >> inputWait;
    return 0;
}
Use _beginthreadex() when using the standard library (which, as far as MS is concerned, includes the C runtime). Also, you're going to see a certain amount of fragmentation in the runtime's sub-allocator, especially in code like this that is designed to continually favor larger and larger requests.
The MS runtime library has functions that let you debug memory requests and determine whether there is a genuine leak, once you have a sound algorithm and are confident you don't see anything glaringly obvious. See the debug routines for more information.
Finally, I made the following modifications to the test jig you wrote:
Set up the proper _Crt report mode so that any memory leaks are dumped to the debug window after shutdown.
Modified the thread-startup loop to keep the maximum number of threads running constantly at MAXIMUM_WAIT_OBJECTS (currently defined by Win32 as 64 handles).
Threw in a purposely leaked std::string allocation to show the CRT will, in fact, catch it when dumping at program termination.
Eliminated the console keyboard interaction. Just run it.
Hopefully this will make sense when you see the output log. Note: you must compile in Debug mode for this to produce a proper dump.
#include <windows.h>
#include <crtdbg.h>
#include <dbghelp.h>
#include <process.h>
#include <string>
#include <iostream>
#include <map>
#include <vector>

using namespace std;

#define MAX_THD_COUNT 250
#define MAX_THD_LOOPS 250

unsigned int _stdcall ClientThread(void *param)
{
    unsigned int thdCount = (unsigned int)param;
    map<int, string> m;
    for (unsigned int x = 0; x < MAX_THD_LOOPS; ++x)
    {
        string s;
        for (unsigned int y = 0; y < (x % (thdCount + 1)); ++y)
        {
            string z = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            size_t zs = z.size();
            s += z[(y % zs)];
        }
        m[x].assign(s);
    }
    return 0;
}

int main(int argc, char ** argv)
{
    // setup reporting mode for the debug heap. when the program
    // finishes, watch the debug output window for any potential
    // leaked objects. We're leaking one on purpose to show this
    // will catch the leaks.
    int flg = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);
    flg |= _CRTDBG_LEAK_CHECK_DF;
    _CrtSetDbgFlag(flg);
    static char msg[] = "Leaked memory.";
    new std::string(msg);

    // will hold our vector of thread handles. we keep this fully populated
    // with running threads until we finish the startup list, then wait for
    // the last set of threads to expire.
    std::vector<HANDLE> thrds;
    for (unsigned int thdCount = 0; thdCount < MAX_THD_COUNT; ++thdCount)
    {
        cout << (int)(MAX_THD_COUNT - thdCount) << endl;
        thrds.push_back((HANDLE)_beginthreadex(NULL, 0, ClientThread, (void*)thdCount, 0, NULL));
        if (thrds.size() == MAXIMUM_WAIT_OBJECTS)
        {
            // wait for any single thread to terminate. we'll start another one after,
            // cleaning up as we detect terminated threads
            DWORD dwRes = WaitForMultipleObjects(thrds.size(), &thrds[0], FALSE, INFINITE);
            if (dwRes >= WAIT_OBJECT_0 && dwRes < (WAIT_OBJECT_0 + thrds.size()))
            {
                DWORD idx = (dwRes - WAIT_OBJECT_0);
                CloseHandle(thrds[idx]);
                thrds.erase(thrds.begin()+idx, thrds.begin()+idx+1);
            }
        }
    }

    // there will be threads left over. need to wait on those too.
    if (thrds.size() > 0)
    {
        WaitForMultipleObjects(thrds.size(), &thrds[0], TRUE, INFINITE);
        for (std::vector<HANDLE>::iterator it = thrds.begin(); it != thrds.end(); ++it)
            CloseHandle(*it);
    }
    return 0;
}
Output Debug Window
Note: there are two leaks reported. One is the std::string allocation, the other is the buffer within the std::string that held our message copy.
Detected memory leaks!
Dumping objects ->
{80} normal block at 0x008B1CE8, 8 bytes long.
Data: <09 > 30 39 8B 00 00 00 00 00
{79} normal block at 0x008B3930, 32 bytes long.
Data: < Leaked memor> E8 1C 8B 00 4C 65 61 6B 65 64 20 6D 65 6D 6F 72
Object dump complete.
Debugging a large application is not an easy task, and your sample is not the best way to show what is happening; a fragment of your real code would be a better guide.
Since that is not possible, my suggestion is: log as much as you can, including insertion and deletion counters for all of your data structures, and when something looks suspicious dump all of that data to understand what is happening.
Try to write the logging information asynchronously so it has less impact on your application. This is not an easy task, but for anyone who enjoys a challenge and loves programming in C/C++ it will be quite a ride.
Persistence and simplicity should be the goal.
Good luck
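For the counter suggestion above, a minimal sketch (my own illustration, not the answerer's code) of wrapping a suspect container with interlocked insert/erase tallies that can be dumped when memory use looks wrong; note the wrapper does not add any locking of the map itself:
#include <windows.h>
#include <cstdio>
#include <map>

volatile LONG g_inserts = 0, g_erases = 0;

template <typename K, typename V>
class CountedMap {
public:
    void insert(const K &k, const V &v)
    {
        if (m_.insert(std::make_pair(k, v)).second)   // count only new keys
            InterlockedIncrement(&g_inserts);
    }
    void erase(const K &k)
    {
        if (m_.erase(k))                              // count only removed keys
            InterlockedIncrement(&g_erases);
    }
    size_t size() const { return m_.size(); }
private:
    std::map<K, V> m_;
};

void dumpCounters()
{
    printf("inserts=%ld erases=%ld live(approx)=%ld\n",
           (long)g_inserts, (long)g_erases, (long)(g_inserts - g_erases));
}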
