Compiling SSE intrinsics in GCC gives an error - gcc

My SSE code works completely fine on Windows, but when I build and run it on Linux I am facing many issues. One of them is this:
It's just a sample illustration of my code:
int main(int ref, int ref_two)
{
    __m128i a, b;

    a.m128i_u8[0] = ref;
    b.m128i_u8[0] = ref_two;
    ...
}
Error 1:
error : request for member 'm128i_u8' in something not a structure or union
This thread gives the solution: use the appropriate _mm_set_XXX intrinsics instead of the above method, since direct member access only works with the Microsoft compiler:
SSE intrinsics compiling MSDN code with GCC error?
I tried the method suggested in that thread and replaced the direct accesses with set intrinsics, but it is seriously affecting the performance of my application.
My code is massive and would need to be changed in about 2000 places, so I am looking for a better alternative that does not hurt the performance of my app.
Recently I got this link from Intel, which says to use the -fms-dialect option to port code from Windows to Linux:
http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-7A69898B-BDBB-4AA9-9820-E4A590945903.htm
Has anybody tried the above method? Has anybody found the solution to porting large code to Linux?
@Paul, here is my code. I placed a timer to measure the time taken by both methods, and the results were shocking.
Code 1: 115 ms (Microsoft method, accessing the element directly)
Code 2: 151 ms (using a set intrinsic)
Using set cost me an extra 36 ms.
NOTE: that is the cost of replacing a single instruction; imagine the performance degradation if I replace it in 2000 places in my program.
That's the reason I am looking for a better alternative than the set intrinsic.
Code 1:
__m128i array;
unsigned char* temp_src;
unsigned char* temp_dst;

for (i = 0; i < 20; i++)
{
    for (j = 0; j < 1600; j += 16)
    {
        Timerstart(&x);
        array = _mm_loadu_si128((__m128i *)(src));
        array.m128i_u8[0] = 36;
        y += Timerstop(x);
        _mm_store_si128((__m128i *)temp_dst, array);
    }
}
Code2:
__m128i array;
unsigned char* temp_src;
unsigned char* temp_dst;

for (i = 0; i < 20; i++)
{
    for (j = 0; j < 1600; j += 16)
    {
        Timerstart(&x);
        array = _mm_set_epi8(*(src+15), *(src+14), *(src+13), *(src+12),
                             *(src+11), *(src+10), *(src+9),  *(src+8),
                             *(src+7),  *(src+6),  *(src+5),  *(src+4),
                             *(src+3),  *(src+2),  *(src+1),  36);
        y += Timerstop(x);
        _mm_store_si128((__m128i *)temp_dst, array);
    }
}

You're trying to use a non-portable Microsoft-ism. Just stick to the more portable intrinsics, e.g. _mm_set_epi8:
__m128i a = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ref);
This will work on all platforms and compilers.
If you're seeing performance issues then it's probably because you're doing something inefficient inside a loop - without seeing the actual code though it's not possible to make any specific suggestions on making the code more efficient.
EDIT
Often there are much more efficient ways of loading a vector with a combination of values such as in your example, e.g.:
#include "smmintrin.h" // SSE4.1
for (...)
{
for (...)
{
__m128i v = _mm_loadu_si128(0, (__m128i *)src); // load vector from src..src+15
v = _mm_insert_epi8(v, 0, 36); // replace element 0 with constant `36`
_mm_storeu_si128((__m128i *)dst, v); // store vector at dst..dst+15
}
}
This translates to just 3 instructions. (Note: if you can't assume SSE4.1 minimum then the _mm_insert_epi8 can be replaced with two bitwise intrinsics - this will still be much more efficient than using _mm_set_epi8).
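As a rough sketch of that SSE2-only fallback (assuming, as above, that element 0 is being replaced with the constant 36): the two bitwise intrinsics are an AND that clears byte 0 and an OR that merges in the constant, and the mask/constant vectors are loop-invariant, so they can be hoisted out of the loop.
#include <emmintrin.h>  // SSE2

const __m128i clear_lo  = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                       -1, -1, -1, -1, -1, -1, -1,  0);
const __m128i insert_lo = _mm_set_epi8( 0,  0,  0,  0,  0,  0,  0,  0,
                                        0,  0,  0,  0,  0,  0,  0, 36);

__m128i v = _mm_loadu_si128((__m128i *)src);              // load src..src+15
v = _mm_or_si128(_mm_and_si128(v, clear_lo), insert_lo);  // clear byte 0, then insert 36
_mm_storeu_si128((__m128i *)dst, v);                      // store dst..dst+15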

Related

gsl_integration_qag failed with gsl openmp

gsl_integration_qag works with 1 core (with/without OpenMP), but fails with multiple threads (i.e. >1).
Some information that may help...
gsl-2.5
#define _OPENMP 201107
Depending on the number of cores, I can get error reports of:
gsl: qag.c:248: ERROR: roundoff error prevents tolerance from being achieved (comment: usually with a small number of cores)
gsl: qag.c:257: ERROR: maximum number of subdivisions reached (comment: usually with a large number of cores)
A larger maximum number of subdivisions given to gsl_integration_qag only delays the crash.
The integration function is (can be more specific if needed):
double Func(double Param1, ..., double ParamN){
    double result, error;
    gsl_function F;
    gsl_integration_workspace * w
        = gsl_integration_workspace_alloc (1000);

    struct parameters_gsl_int_ parameters_gsl = {
        .Param1 = Param1,
        ...
        .ParamN = ParamN,};

    F.function = &func_integrand;
    F.params = &parameters_gsl;

    gsl_integration_qag (&F, LOWER_LIMIT, UPPER_LIMIT, 0, 0.001,
                         1000, GSL_INTEG_GAUSS61, w, &result, &error);
    gsl_integration_workspace_free (w);

    return result;
}
The OpenMP part that calls the integration is:
void call_Func(int Nbin, double array[], double Param1[], double Param2, ..., double ParamN){
    int i;
    ...
    #pragma omp parallel shared(Nbin, array, Param1, ..., ParamN) private(i)
    {
        #pragma omp for
        for (i=0; i<Nbin; i++)
            array[i] = Func(Param1[i], Param2, ..., ParamN);
    }
    ...
}
I'm new to both GSL and openMP. I hope I am using gsl_integration_qag correctly and the definition of shared or private variables makes sense.
btw, it's the same question as this 2014 one (gsl openmp failed integration), but I couldn't find the solution in this post.
Problem solved...
It was actually due to func_integrand itself containing a term that is also estimated with gsl_integration_qag. That inner calculation used some global variables which I hadn't noticed before, and those globals were being shared across the OpenMP threads.
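For anyone hitting the same thing, here is a minimal sketch of the kind of race described above (the global's name is purely illustrative, not from the actual code): a file-scope variable used inside the integrand is shared by all OpenMP threads by default, so either pass the value through the params struct or mark the variable threadprivate.
#include <omp.h>

static double g_scale = 1.0;        /* file-scope: shared by default -> data race if threads also write it */
#pragma omp threadprivate(g_scale)  /* one fix: give each thread its own copy */

double func_integrand(double x, void *params)
{
    /* the cleaner fix is to read such values from the params struct instead of a global */
    return g_scale * x;
}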

Optimize Cuda Kernel time execution

I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program computing the difference between two pictures, and I compared the execution time of a classic CPU implementation in C against a GPU implementation in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch (computing_type)
{
case GPU:
    HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));

    HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

    float time;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );

    for (int m = 0; m < nb_loops; m++)
    {
        diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
    }

    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );

    HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

    printf("Time to generate: %4.4f ms \n", time/nb_loops);
    break;

case CPU:
    clock_t begin = clock(), diff;
    for (int z=0; z<nb_loops; z++)
    {
        // Apply the difference between 2 images
        for (int i = 0; i < height; i++)
        {
            tmp = i*imgresult_pitch;
            for (int j = 0; j < width; j++)
            {
                imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
            }
        }
    }
    diff = clock() - begin;
    float msec = diff*1000/CLOCKS_PER_SEC;
    msec = msec/nb_loops;
    printf("Time taken %4.4f milliseconds", msec);
    break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times for each one:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please be understanding if I am making some classic errors.
EDIT1:
Thank you for your feedback. I tried to delete the 'if' condition from the kernel, but it didn't change my program's execution time much.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it does seem true, because my application is only 5 or 6 times faster with the GPU than with the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, it would be helpful...
Flow.
Here are a few things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible" - but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via the non-coherent cache. Indeed, this happens with your kernel, although I haven't measured the effects.
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4(unsigned int a, unsigned int b)
which subtracts each signed byte value in b from its corresponding one in a, with saturation. If you can "live" with that result, rather than expecting a larger int per element, it could save you some work - and it goes very well with increasing the number of elements per thread. In fact, it might let you increase that number even further to get to the optimum.
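To make these suggestions concrete, here is a rough sketch of what such a kernel could look like (an assumption-laden illustration, not drop-in code: it treats the image as an array of 32-bit words, assumes the total byte count is a multiple of 4, and produces saturated signed-byte results instead of the original int output):
__global__ void diff4(const unsigned int * __restrict__ data1,
                      const unsigned int * __restrict__ data2,
                      unsigned int * __restrict__ data_res,
                      int n_words)                      // n_words = total bytes / 4
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words)
        data_res[i] = __vsubss4(data2[i], data1[i]);    // 4 saturated byte subtractions at once
}
Each thread now reads two 32-bit words and writes one, so the launch only needs a quarter as many threads.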
I don't think you are measuring times correctly. Memory copies are a time-consuming step on the GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider setting them with cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so the number of threads per block works better as a multiple of the warp size (32), but not larger than 512 threads per block unless your hardware supports more. Here is an example using 128 threads per block: <<<(cols*rows+127)/128, 128>>>.
Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created.
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with 2D arrays, because of padding.
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
    data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, if evaluates to false only at borders. To avoid the divergence and needless computation, split your image in two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have to modify the code that calls the kernels; a sketch of the interior kernel is given below.
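As a rough illustration of that split (assuming the interior is launched with a 2D grid that exactly tiles it, so no bounds check is needed; the original diff kernel would still handle the border areas):
__global__ void diff_interior(const unsigned char *data1,
                              const unsigned char *data2,
                              int *data_res, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // grid exactly tiles the interior
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int v = row * width + col;
    data_res[v] = (int) data2[v] - (int) data1[v];     // no 'if', so no divergence
}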
And on a separate note:
The GPU has a throughput-oriented architecture, not a latency-oriented one like the CPU. That means the CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using larger data sets?
The CUDA profiler is a very handy tool that will tell you where your code is not optimal.

OpenCL slow -- not sure why

I'm teaching myself OpenCL by trying to optimize the mpeg4dst reference audio encoder. I achieved a 3x speedup by using vector instructions on CPU but I figured the GPU could probably do better.
I'm focusing on computing auto-correlation vectors in OpenCL as my first area of improvement. The CPU code is:
for (int i = 0; i < NrOfChannels; i++) {
    for (int shift = 0; shift <= PredOrder[ChannelFilter[i]]; shift++)
        vDSP_dotpr(Signal[i] + shift, 1, Signal[i], 1, &out, NrOfChannelBits - shift);
}
NrOfChannels = 6
PredOrder = 129
NrOfChannelBits = 150528.
On my test file, this function takes approximately 188 ms to complete.
Here's my OpenCL method:
kernel void calculateAutocorrelation(size_t offset,
                                     global const float *input,
                                     global float *output,
                                     size_t size) {
    size_t index = get_global_id(0);
    size_t end = size - index;
    float sum = 0.0;

    for (size_t i = 0; i < end; i++)
        sum += input[i + offset] * input[i + offset + index];

    output[index] = sum;
}
This is how it is called:
gcl_memcpy(gpu_signal_in, Signal, sizeof(float) * NrOfChannels * MAXCHBITS);

for (int i = 0; i < NrOfChannels; i++) {
    size_t sz = PredOrder[ChannelFilter[i]] + 1;
    cl_ndrange range = { 1, { 0, 0, 0 }, { sz, 0, 0 }, { 0, 0, 0 } };
    calculateAutocorrelation_kernel(&range, i * MAXCHBITS, (cl_float *)gpu_signal_in, (cl_float *)gpu_out, NrOfChannelBits);
    gcl_memcpy(out, gpu_out, sizeof(float) * sz);
}
According to Instruments, my OpenCL implementation seems to take about 13ms, with about 54ms of memory copy overhead (gcl_memcpy).
When I use a much larger test file (1 minute of 2-channel music vs. 1 second of 6-channel), the measured performance of the OpenCL code seems to be the same, but the CPU usage falls to about 50% and the whole program takes about 2x longer to run.
I can't find a cause for this in Instruments and I haven't read anything yet that suggests that I should expect very heavy overhead switching in and out of OpenCL.
If I'm reading your kernel code correctly, each work item is iterating over all of the data from its location to the end. This isn't going to be efficient. For one (and this is the primary performance concern), the memory accesses won't be coalesced and so won't run at full memory bandwidth. Secondly, because each work item has a different amount of work, there will be branch divergence within a work group, which will leave some threads idle waiting for others.
This seems like it has a lot in common with a reduction problem and I'd suggest reading up on "parallel reduction" to get some hints about doing an operation like this in parallel.
To see how memory is being read, work out how 16 work items (say, global_id 0 to 15) will be reading data for each step.
Note that if every work item in a work group accesses the same memory, there is a "broadcast" optimization the hardware can make. So just reversing the order of your loop could improve things.
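For reference, here is a generic local-memory reduction sketch of the pattern mentioned above (not tailored to this autocorrelation; the kernel name and launch parameters are illustrative): each work item accumulates a strided partial sum, the work group then combines the partials in local memory and writes one value per group, which the host (or a second small kernel) sums up.
kernel void partial_dot(global const float *a,
                        global const float *b,
                        global float *partial,   // one result per work group
                        local  float *scratch,   // one float per work item
                        uint n)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    float sum = 0.0f;
    for (uint i = gid; i < n; i += get_global_size(0))   // coalesced, strided reads
        sum += a[i] * b[i];
    scratch[lid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {  // tree reduction in local memory
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}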

OpenCL in Xcode/OSX - Can't assign zero in kernel loop

I'm developing an accelerated component in OpenCL, using Xcode 4.5.1 and Grand Central Dispatch, guided by this tutorial.
The full kernel kept failing on the GPU, giving signal SIGABRT. I couldn't make much progress interpreting the error beyond that.
But I broke out aspects of the kernel to test, and I found something very peculiar involving assigning certain values to positions in an array within a loop.
Test scenario: give each thread a fixed range of array indices to initialize.
kernel void zero(size_t num_buckets, size_t positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (size_t i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}
The above kernel fails. However, when I assign 1 instead of 0, the kernel succeeds (and my host code prints out the array of 1's). Based on a handful of tests on various integer values, I've only had problems with 0 and -1.
I've tried to outsmart the compiler with 1-1, (int) 0, etc, with no success. Passing zero in as a kernel argument worked though.
The assignment to zero does work outside of the context of a for loop:
array[bucket_index * positions_per_bucket] = 0;
The findings above were confirmed on two machines with different configurations. (OSX 10.7 + GeForce, OSX 10.8 + Radeon.) Furthermore, the kernel had no trouble when running on CL_DEVICE_TYPE_CPU -- it's just on the GPU.
Clearly, something ridiculous is happening, and it must be on my end, because "zero" can't be broken. Hopefully it's something simple. Thank you for your help.
Host code:
#include <stdio.h>
#include <OpenCL/OpenCL.h>
#include "zero.cl.h"

int main(int argc, const char* argv[]) {
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);

    size_t num_buckets = 64;
    size_t positions_per_bucket = 4;

    cl_int* h_array = malloc(sizeof(cl_int) * num_buckets * positions_per_bucket);
    cl_int* d_array = gcl_malloc(sizeof(cl_int) * num_buckets * positions_per_bucket, NULL, CL_MEM_WRITE_ONLY);

    dispatch_sync(queue, ^{
        cl_ndrange range = { 1, { 0 }, { num_buckets }, { 0 } };
        zero_kernel(&range, num_buckets, positions_per_bucket, d_array);
        gcl_memcpy(h_array, d_array, sizeof(cl_int) * num_buckets * positions_per_bucket);
    });

    for (size_t i = 0; i < num_buckets * positions_per_bucket; i++)
        printf("%d ", h_array[i]);
    printf("\n");
}
Refer to the OpenCL standard, section 6, paragraph 8 "Restrictions", bullet point k (emphasis mine):
6.8 k. Arguments to kernel functions in a program cannot be declared with the built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t. [...]
The fact that your compiler even let you build the kernel at all indicates it is somewhat broken.
So you might want to fix that... but if that doesn't fix it, then it looks like a compiler bug, plain and simple (a bug in CLC, that is, the OpenCL compiler, not in your host code). There is no reason this kernel should work with every constant except 0 and -1. Did you try updating your OpenCL driver? What about trying a different operating system (though I suppose this code is OS X only)?
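A minimal sketch of the signature fix the quoted restriction calls for (keeping the question's names, just swapping the size_t arguments for uint; on the host side the corresponding values would be passed as cl_uint):
kernel void zero(uint num_buckets, uint positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);   // size_t is still fine for locals
    if (bucket_index >= num_buckets) return;
    for (uint i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}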

Memory problems with a multi-threaded Win32 service that uses STL on VS2010

I have a multi-threaded Win32 service written in C++ (VS2010) that makes extensive use of the standard template library. The business logic of the program operates properly, but when looking at the task manager (or resource manager) the program leaks memory like a sieve.
I have a test set that averages about 16 simultaneous requests/second. When the program is first started up it consumes somewhere in the neighborhood of 1.5 MB of RAM. After a full test run (which takes 12-15 minutes) the memory consumption ends up somewhere near 12 MB. Normally, this would not be a problem for a program that runs once and then terminates, but this program is intended to run continuously. Very bad, indeed.
To try and narrow down the problem, I created a very small test application that spins off worker threads at a rate of once every 250ms. The worker thread creates a map and populates it with pseudo-random data, empties the map, and then exits. This program, too, leaks memory in like fashion, so I'm thinking that the problem is with the STL not releasing the memory as expected.
I have tried VLD to search for leaks and it found a couple, which I have remedied, but the problem still remains. I have tried integrating Hoard, but that actually made the problem worse (I'm probably not integrating it properly, but I can't see how).
So I would like to pose the following question: is it possible to create a program that uses the STL in a multi-threaded environment that will not leak memory? Over the course of the last week I have made no less than 200 changes to this program. I have plotted the results of the changes and they all have the same basic profile. I don't want to have to remove all of the STL goodness that has made developing this application so much easier. I would earnestly appreciate any suggestions on how I can get this app working without leaking memory like it's going out of style.
Thanks again for any help!
P.S. I'm posting a copy of the memory test for inspection/personal edification.
#include <string>
#include <iostream>
#include <Windows.h>
#include <map>

using namespace std;

#define MAX_THD_COUNT 1000

DWORD WINAPI ClientThread(LPVOID param)
{
    unsigned int thdCount = (unsigned int)param;

    map<int, string> m;
    for (unsigned int x = 0; x < 1000; ++x)
    {
        string s;
        for (unsigned int y = 0; y < (x % (thdCount + 1)); ++y)
        {
            string z = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            unsigned int zs = z.size();
            s += z[(y % zs)];
        }
        m[x] = s;
    }
    m.erase(m.begin(), m.end());

    ExitThread(0);
    return 0;
}

int main(int argc, char ** argv)
{
    // wait for start
    string inputWait;
    cout << "type g and press enter to go: ";
    cin >> inputWait;

    // spawn many memory-consuming threads
    for (unsigned int thdCount = 0; thdCount < MAX_THD_COUNT; ++thdCount)
    {
        CreateThread(NULL, 0, ClientThread, (LPVOID)thdCount, NULL, NULL);
        cout << (int)(MAX_THD_COUNT - thdCount) << endl;
        Sleep(250);
    }

    // wait for end
    cout << "type e and press enter to end: ";
    cin >> inputWait;
    return 0;
}
Use _beginthreadex() rather than CreateThread() when the thread uses the standard library (which includes the C runtime as far as MS is concerned). Also, you're going to experience a certain amount of fragmentation in the standard runtime's sub-allocator, especially in code designed to continually favor larger and larger requests like this.
The MS runtime library has some functions that allow you to debug memory requests and determine whether there is a solid leak once you have a sound algorithm and are confident you don't see anything glaringly obvious. See the debug routines for more information.
Finally, I made the following modifications to the test jig you wrote:
Set up the proper _Crt report mode so the debug window is spammed with any memory leaks after shutdown.
Modified the thread-startup loop to keep the maximum number of threads running constantly at MAXIMUM_WAIT_OBJECTS (currently defined by Win32 as 64 handles).
Threw in a purposely leaked std::string allocation to show the CRT will, in fact, catch it when dumping at program termination.
Eliminated console keyboard interaction. Just run it.
Hopefully this will make sense when you see the output log. Note: you must compile in Debug mode for this to produce a proper dump for you.
#include <windows.h>
#include <crtdbg.h>
#include <dbghelp.h>
#include <process.h>
#include <string>
#include <iostream>
#include <map>
#include <vector>

using namespace std;

#define MAX_THD_COUNT 250
#define MAX_THD_LOOPS 250

unsigned int _stdcall ClientThread(void *param)
{
    unsigned int thdCount = (unsigned int)param;

    map<int, string> m;
    for (unsigned int x = 0; x < MAX_THD_LOOPS; ++x)
    {
        string s;
        for (unsigned int y = 0; y < (x % (thdCount + 1)); ++y)
        {
            string z = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            size_t zs = z.size();
            s += z[(y % zs)];
        }
        m[x].assign(s);
    }
    return 0;
}

int main(int argc, char ** argv)
{
    // setup reporting mode for the debug heap. when the program
    // finishes watch the debug output window for any potential
    // leaked objects. We're leaking one on purpose to show this
    // will catch the leaks.
    int flg = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);
    flg |= _CRTDBG_LEAK_CHECK_DF;
    _CrtSetDbgFlag(flg);
    static char msg[] = "Leaked memory.";
    new std::string(msg);

    // will hold our vector of thread handles. we keep this fully populated
    // with running threads until we finish the startup list, then wait for
    // the last set of threads to expire.
    std::vector<HANDLE> thrds;
    for (unsigned int thdCount = 0; thdCount < MAX_THD_COUNT; ++thdCount)
    {
        cout << (int)(MAX_THD_COUNT - thdCount) << endl;
        thrds.push_back((HANDLE)_beginthreadex(NULL, 0, ClientThread, (void*)thdCount, 0, NULL));
        if (thrds.size() == MAXIMUM_WAIT_OBJECTS)
        {
            // wait for any single thread to terminate. we'll start another one after,
            // cleaning up as we detect terminated threads
            DWORD dwRes = WaitForMultipleObjects(thrds.size(), &thrds[0], FALSE, INFINITE);
            if (dwRes >= WAIT_OBJECT_0 && dwRes < (WAIT_OBJECT_0 + thrds.size()))
            {
                DWORD idx = (dwRes - WAIT_OBJECT_0);
                CloseHandle(thrds[idx]);
                thrds.erase(thrds.begin()+idx, thrds.begin()+idx+1);
            }
        }
    }

    // there will be threads left over. need to wait on those too.
    if (thrds.size() > 0)
    {
        WaitForMultipleObjects(thrds.size(), &thrds[0], TRUE, INFINITE);
        for (std::vector<HANDLE>::iterator it = thrds.begin(); it != thrds.end(); ++it)
            CloseHandle(*it);
    }
    return 0;
}
Output Debug Window
Note: there are two leaks reported. One is the std::string allocation, the other is the buffer within the std::string that held our message copy.
Detected memory leaks!
Dumping objects ->
{80} normal block at 0x008B1CE8, 8 bytes long.
Data: <09 > 30 39 8B 00 00 00 00 00
{79} normal block at 0x008B3930, 32 bytes long.
Data: < Leaked memor> E8 1C 8B 00 4C 65 61 6B 65 64 20 6D 65 6D 6F 72
Object dump complete.
It is not an easy task to debug large applications.
Your sample is not the best choice to show what is happening; a fragment of your real code would let people make better guesses.
Of course that is not always possible, so my suggestion is: log as much as you can, including insertion and deletion counters on all your structures, and keep totals for this information.
When you suspect something, make a dump of all the data to understand what is happening.
Try to save that information asynchronously so there is less impact on your application. This is not an easy task, but for anyone who enjoys a challenge and loves programming in C/C++ it will be quite a ride.
Persistence and simplicity should be the goal.
Good luck.
