I am currently trying to write a program that allocates the maximum memory available. I came up with a solution that narrows the range of potentially available memory until the lower and upper bounds meet (see listing).
void enforceMemoryLeakage(void** arrayOfAllocMemory)
{
    unsigned int maxMemory = 0x80000000;
    unsigned int minMemory = 0x50000000;
    unsigned int attempAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
    void* pAllocMemory;

    while ((maxMemory - minMemory) > 1)
    {
        pAllocMemory = malloc(attempAllocatedMemory);
        if (pAllocMemory != NULL)
        {
            minMemory = attempAllocatedMemory;
            attempAllocatedMemory += (maxMemory - minMemory) / 2;
            free(pAllocMemory);
        }
        else
        {
            maxMemory = attempAllocatedMemory;
            attempAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
        }
    }

    arrayOfAllocMemory[0] = malloc(maxMemory);

    void* pAllocAdditionalMemory = malloc(100);
    if (pAllocAdditionalMemory == NULL)
        std::cout << "Maximum memory: " << minMemory << "\n";
}
The code displayed above works fine. However, when the following lines are executed:
void* pAllocAdditionalMemory = malloc(100);
if (pAllocAdditionalMemory == NULL)
    std::cout << "Maximum memory: " << minMemory << "\n";
I would have expected that no further memory is available, yet the small allocation still succeeds, which brings me to my actual question: why does the approach shown above not work?
Best regards
Ratbald
You did not specify the OS or platform, so I am completely guessing here; read with extreme prejudice...
Assuming you do not have a bug in your binary-search code... My bet is that you are facing memory fragmentation: as you successfully allocate and free memory during execution, other processes can do the same, so the free memory becomes fragmented. Example:
the OS has a 2 MByte chunk of contiguous free memory
you allocate 1.5 MByte (0.5 MByte left free)
some other process allocates 1 KByte (0.499 MByte left free)
you free your 1.5 MByte (1.5 + 0.499 MByte free, in two fragments)
the binary search now attempts to allocate 1.75 MByte
but the OS has no 1.75 MByte in a single contiguous chunk, so that fails; the search settles back on 1.5 MByte and you allocate that again (0.499 MByte left free)
a further 1 KByte allocation still succeeds (0.498 MByte left free)
Depending on the OS memory-management strategy, you can sometimes fragment memory like this even without another process interfering...
There are, however, other possibilities not related to fragmentation. In the case of emulation or WOW64 the OS will not give you all of the available RAM, and there are also limits on the size of a single contiguous chunk. For example, Win32 will not allow a single allocation of more than ~1.25 GByte, but that does not mean there is only 1.25 GByte of free RAM...
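If the goal is simply to find out how much memory is still available, it is usually more reliable to ask the OS directly than to probe with malloc. A minimal sketch, assuming a Win32 target (the struct and function are the standard Win32 API; the binary-search approach in the question is otherwise OS-agnostic):

#include <windows.h>
#include <iostream>

int main()
{
    MEMORYSTATUSEX status;
    status.dwLength = sizeof(status);   // must be set before the call
    if (GlobalMemoryStatusEx(&status))
    {
        std::cout << "Available physical RAM:          " << status.ullAvailPhys    << " bytes\n";
        std::cout << "Available virtual address space: " << status.ullAvailVirtual << " bytes\n";
    }
    return 0;
}

Note that this reports the totals, not the largest contiguous chunk, so it sidesteps the fragmentation issue described above rather than measuring it.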
My objective
I want to write a RAM bandwidth benchmark
My solution
U64 benchRead(U64 &io_minLoopTicks) // io_minLoopTicks initial value is U64_MAX
{
    const U32 elementsCount = 100'000'000;
    U64 accumulator = 0;
    std::vector<U64> tab;
    tab.resize(elementsCount); // 800 MB

    for (U8 i = 0; i < 10; ++i) // Loop several times to get a reliable value.
    {
        const U64 startTimestamp = profiler.start(); // profiler uses QueryPerformanceCounter under the hood
        for (const U64 j : tab)
            accumulator += j; // Do something with the RAM so the loop is not optimized away.
        const U64 loopTicks = profiler.end() - startTimestamp;
        if (loopTicks < io_minLoopTicks)
            io_minLoopTicks = loopTicks;
    }
    return accumulator;
}
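For reference, converting the smallest tick count into GB/s looks roughly like the sketch below (it assumes the profiler's ticks really are QueryPerformanceCounter ticks, as noted in the comment; the helper name is made up):

#include <windows.h>
#include <cstdint>

// Turn the best (smallest) loop tick count into a bandwidth figure.
double bandwidthGBps(uint64_t minLoopTicks, uint64_t bytesTouched)
{
    LARGE_INTEGER freq;   // QueryPerformanceCounter ticks per second
    QueryPerformanceFrequency(&freq);
    const double seconds = static_cast<double>(minLoopTicks) / static_cast<double>(freq.QuadPart);
    return static_cast<double>(bytesTouched) / seconds / 1e9;   // decimal GB/s
}

// Usage with the benchmark above: bandwidthGBps(minLoopTicks, 100'000'000ull * sizeof(U64))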
My problem
The theory would be 57.6 GB/s (8 bytes (bus width) * 2 (channels) * 3.6 GT/s).
In practice I get 27.7 GB/s (9700K @ 4.8 GHz, 4 single-rank 3600 MT/s DDR4 DIMMs).
So my number seems to reflect the performance of only one channel.
My guess is that, the way I coded it, the array sits in an area of RAM that can only be accessed by one channel. I am not at all familiar with multi-channel memory, so I don't really understand the limits of that technology.
My RAM is definitely dual channel (CPU-Z confirmed it).
Performance analysis
I checked the output of Clang 14 and it uses SSE (-O3 with no -march=native). I could be wrong (I am not an ASM guy), but it seems it unrolled the loop and processes 128 bytes per iteration. One loop iteration takes 4.5 cycles on Coffee Lake according to uiCA, leading to 139.38 GB/s, way above the theoretical RAM speed. So we should be RAM limited.
My questions
Is my array stored in a single bank, and is that why it can only be accessed by one channel?
How do we write code to leverage the benefits of multi-channel RAM (here dual channel)?
References
Godbolt
I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
    __m128i in = _mm_loadu_si128(inptr);
    __m128i out = in; // real code does more than this, but I've simplified it
    _mm_stream_si128(outptr, out);
    inptr  += 12;
    outptr += 16;
}
This code runs about 5 times faster on our older Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.
We upgraded to the latest microcode on the Skylake, and also stuck in vzeroupper instructions to avoid any AVX issues. Neither fix had any effect.
outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses (I put in checks to verify this). inptr is not aligned by design. Commenting out the loads has no effect; the limiting instructions are the stores. outptr and inptr point to different memory regions; there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) Why is there such a big difference between Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?
2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two different processors, an E5-2680 v3 (Haswell) and an 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
//   perf stat ./mre 100000
//
// STORER               time (seconds)
//                      E5-2680    8180
// ---------------------------------------------------
// _mm_stream_si128     1.65       7.29
// _mm_storeu_si128     0.41       0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that out of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from @Mystical made me check that the outputs were all cache-line aligned. The writes to the Tile structures are done in 64-byte chunks, but the Tiles themselves were not 64-byte aligned (only 16-byte aligned).
So I changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER               time (seconds)
//                      E5-2680    8180
// ---------------------------------------------------
// _mm_stream_si128     0.19       0.48
// _mm_storeu_si128     0.25       0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third edit - deliberate misalignment
I tried the test suggested by @HadiBrais. I deliberately allocated my vector's storage aligned to 64 bytes, then added 16 or 32 bytes inside the allocator so that the allocation was either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.
perf stat ./mre1 1000000
To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
// STORER               alignment    time (seconds)
//                      (bytes)      E5-2680    8180
// ---------------------------------------------------
// _mm_storeu_si128     16           3.15       2.69
// _mm_storeu_si128     32           3.16       2.60
// _mm_storeu_si128     64           1.72       1.71
// _mm_stream_si128     16           14.31      72.14
// _mm_stream_si128     32           14.44      72.09
// _mm_stream_si128     64           1.43       3.38
So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):
#include <cstdlib>   // aligned_alloc, std::free
#include <cstdint>   // uint8_t
#include <new>       // std::bad_alloc

template <class T>
struct Mallocator {
    typedef T value_type;
    Mallocator() = default;
    template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
        // One extra element of slack so the deliberate offset below stays inside the allocation.
        uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n + 1) * sizeof(T)));
        if (!p1) throw std::bad_alloc();
        p1 += 32; // misalign on purpose
        return reinterpret_cast<T*>(p1);
    }

    void deallocate(T* p, std::size_t) noexcept {
        uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
        p1 -= 32; // undo the misalignment before freeing
        std::free(p1);
    }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
    // std::cout << std::hex << (void*) ptr << " " << (void*) tile << std::endl;
    __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
    __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
    __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
    __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

    __m128i tileVal0 = value0;
    __m128i tileVal1 = value1;
    __m128i tileVal2 = value2;
    __m128i tileVal3 = value3;

    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

    ptr    += diffBytes * 4;
    count  -= diffBytes * 4;
    tile   += diffPixels * 4;
    ipixel += diffPixels * 4;
    if (ipixel == 32)
    {
        // go to next tile
        ipixel = 0;
        tileIter++;
        tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
    }
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
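For contrast, a sketch of the pattern streaming stores want (my illustration with made-up names, not code from the question): a 64-byte-aligned destination and four back-to-back 16-byte NT stores that fill one whole cache line before moving on.

#include <emmintrin.h>   // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#include <cstdint>

void copyOneLine(uint8_t* dst /* must be 64-byte aligned */, const uint8_t* src)
{
    __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src +  0));
    __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16));
    __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 32));
    __m128i v3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 48));
    // Four consecutive NT stores cover exactly one cache line, so the
    // write-combining buffer is flushed full instead of half-filled.
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst +  0), v0);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 16), v1);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 32), v2);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 48), v3);
}
// After the last streaming store of a buffer, issue _mm_sfence() before
// other threads read the data.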
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.
I wrote a program to determine the cache and cache line size of my computer, but I got results that I can't explain. Could anyone help me explain them?
Here is my program; access_array() traverses the array with different step sizes, and I measure the execution time for each step size.
// Program to determine the L1 cache line size; compile with g++ -O1
#include <iostream>
#include <string>
#include <sys/time.h>
#include <cstdlib>
using namespace std;

#define ARRAY_SIZE (256 * 1024) // arbitrary array size; must be a power of two so the mask below works

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = (idx + steps) & (ARRAY_SIZE - 1); // if % is used instead, the latency is too high to see the gap
    }
}

int main(int argc, char** argv)
{
    double cpu_us_used;
    struct timeval start, end;

    for (int step = 1; step <= ARRAY_SIZE; step *= 2)
    {
        char* arr = new char[ARRAY_SIZE];
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            arr[i] = 0;
        }

        gettimeofday(&start, NULL); // get start clock
        access_array(arr, step);
        gettimeofday(&end, NULL);   // get end clock

        cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
        cout << step << " , " << cpu_us_used << endl;

        delete[] arr;
    }
    return 0;
}
Result
My question is: from 64 to 512, why is the execution time almost the same, and why is there linear growth from 1K to 4K?
Here are my assumptions.
For step = 1, every 64 iterations cause 1 cache line miss. And after 32K iterations, the L1 cache is full, so we have L1 collision & capacity misses every 64 iterations.
For step = 64, every iteration causes 1 cache line miss. And after 512 iterations, the L1 cache is full, so we have L1 collision & capacity misses every iteration.
As a result, there is a gap between step = 32 and 64.
By observing the first gap, I can conclude that the L1 cache line size is 64 bytes.
For step = 512, every iteration causes 1 cache line miss. And after 64 iterations, sets 0, 8, 16, 24, 32, 40, 48 and 56 of the L1 cache are full, so we have L1 collision misses every iteration.
For step = 4K, every iteration causes 1 cache line miss. And after 8 iterations, set 0 of the L1 cache is full, so we have L1 collision misses every iteration.
For the cases from 128 to 4K, they all incur L1 collision misses; the difference is that with a larger step the collision misses start earlier.
The only idea that I can come up with is that there are other mechanisms (maybe paging, TLB, etc.) impacting the execution time.
Here is the cache size & CPU info of my workstation. By the way, I have run this program on my PC as well, and I got similar results.
Platform: Intel Xeon(R) CPU E5-2667 0 @ 2.90GHz
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 262144
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 15728640
LEVEL3_CACHE_ASSOC 20
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0
This CPU probably has:
a hardware cache line prefetcher, that detects linear access patterns within the same physical 4 KiB page and prefetches them before an access is made. This stops prefetching at 4 KiB boundaries (because the physical address is likely to be very different and unknown).
a hardware TLB prefetcher, that detects linear access patterns in TLB usage and prefetches TLB entries.
From 1 to 16 the cache line prefetcher is doing its job, fetching cache lines before you access them, so execution time remains the same (unaffected by cache misses).
At 32, the cache line prefetcher starts to struggle (due to the "stop at 4 KiB page boundary" thing).
From 64 to 512 the TLB prefetcher is doing its job, fetching TLB entries before you access them, so execution time remains the same (unaffected by TLB misses).
From 512 to 4096 the TLB prefetcher is failing to keep up. The CPU stalls waiting for TLB info for every "4096/step" accesses; and these stalls are causing "linear-ish" growth in execution time.
From 4096 to 131072, I'd like to assume that the "new char[ARRAY_SIZE];" allocates so much space that the library and/or OS decided to give you 2 MiB pages and/or 1 GiB pages, eliminating some TLB misses and improving execution time as the number of pages being accessed decreases.
For "larger than 131072", I'd assume you start to see the effects of "1 GiB page TLB miss".
Note that it's probably easier (and less error prone) to get the cache characteristics (size, associativity, how many logical CPUs are sharing it, ..) and cache line size from the CPUID instruction. The approach you're using is more suited to measuring cache latency (how long it takes to get data from one of the caches).
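For example, on Linux/glibc the same numbers as in the listing above can be read programmatically instead of measured (a sketch of my own; sysconf may report 0 for these values on some kernels):

#include <unistd.h>
#include <cstdio>

int main()
{
    long lineSize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l1dSize  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l1dAssoc = sysconf(_SC_LEVEL1_DCACHE_ASSOC);
    std::printf("L1D: %ld bytes, %ld-way, %ld-byte lines\n", l1dSize, l1dAssoc, lineSize);
    return 0;
}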
Also, to reduce TLB interference the OS might allow you to explicitly ask for 1 GiB pages (e.g. mmap(..., MAP_POPULATE | MAP_HUGE_1GB, ...) on Linux); and you can "pre-warm" the TLB by doing a "touch then CLFLUSH" warm-up loop before you start measuring. The hardware cache-line prefetcher can be disabled via a flag in an MSR (if you have permission), or can be defeated by using a "random" (unpredictable) access pattern.
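A sketch of that explicit huge-page request (Linux-specific; MAP_HUGE_1GB needs a recent glibc or an extra #include <linux/mman.h>, and 1 GiB pages must have been reserved by the administrator beforehand):

#include <sys/mman.h>
#include <cstddef>

void* allocOneGiBPages(std::size_t bytes)
{
    // MAP_POPULATE pre-faults the pages so the TLB entries exist before measuring.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE |
                   MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;   // fall back to normal pages on failure
}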
Finally, I found the answer.
I tried setting the array size to 16 KiB and smaller, but it was still slow at step = 4 KiB.
On the other hand, I tried changing the step from multiplying by 2 each iteration to incrementing by 1 each iteration, and it still slowed down at step = 4 KiB.
Code
#define ARRAY_SIZE (4200)

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = idx + steps;
        if (idx >= ARRAY_SIZE)
            idx = 0;
    }
}
for (int step = 4090; step <= 4100; step++)
{
    char* arr = new char[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        arr[i] = 0;
    }

    gettimeofday(&start, NULL); // get start clock
    access_array(arr, step);
    gettimeofday(&end, NULL);   // get end clock

    cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    cout << step << " , " << cpu_us_used << endl;

    delete[] arr;
}
Result
4090 , 48385
4091 , 48497
4092 , 48136
4093 , 48520
4094 , 48090
4095 , 48278
4096 , **51818**
4097 , 48196
4098 , 48600
4099 , 48185
4100 , 63149
As a result, I suspected it was not related to any cache / TLB / prefetch mechanism.
After more googling of the relationship between performance and the magic number 4K, I found the 4K aliasing problem on Intel platforms, which slows down the performance of loads.
This occurs when a load is issued after a store and their memory addresses are offset by a multiple of 4K. When this is processed in the pipeline, the issue of the load will match the previous store (the full address is not used at this point), so the pipeline will try to forward the result of the store and avoid doing the load (this is store forwarding). Later on, when the address of the load is fully resolved, it will not match the store, so the load will have to be re-issued from a later point in the pipe. This has a 5-cycle penalty in the normal case, but could be worse in certain situations, like with unaligned loads that span 2 cache lines.
I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program computing the difference between two pictures, and I compared the execution time of a classic CPU implementation in C with a GPU implementation in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch (computing_type)
{
    case GPU:
        HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));

        HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

        float time;
        cudaEvent_t start, stop;
        HANDLE_ERROR( cudaEventCreate(&start) );
        HANDLE_ERROR( cudaEventCreate(&stop) );
        HANDLE_ERROR( cudaEventRecord(start, 0) );

        for (int m = 0; m < nb_loops; m++)
        {
            diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
        }

        HANDLE_ERROR( cudaEventRecord(stop, 0) );
        HANDLE_ERROR( cudaEventSynchronize(stop) );
        HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );

        HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

        printf("Time to generate: %4.4f ms \n", time/nb_loops);
        break;

    case CPU:
        clock_t begin = clock(), diff;
        for (int z = 0; z < nb_loops; z++)
        {
            // Apply the difference between 2 images
            for (int i = 0; i < height; i++)
            {
                tmp = i*imgresult_pitch;
                for (int j = 0; j < width; j++)
                {
                    imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
                }
            }
        }
        diff = clock() - begin;
        float msec = diff*1000/CLOCKS_PER_SEC;
        msec = msec/nb_loops;
        printf("Time taken %4.4f milliseconds", msec);
        break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times for each one:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please be understanding if I have made some classic errors.
EDIT1:
Thank you for your feedback. I tried deleting the 'if' condition from the kernel, but it didn't change my program's execution time much.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it seems true because my application is only 5 or 6 times faster with the GPU than with the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, it would be helpful...
Flow.
Here are two things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already has a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible", but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
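As an illustration only (the kernel name, the factor of 4 and the launch configuration are mine, not taken from your code), a version where each thread handles four consecutive pixels might look like this:

__global__ void diff4(const unsigned char* data1, const unsigned char* data2,
                      int* data_res, int size)
{
    // Each thread covers 4 consecutive elements.
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    #pragma unroll
    for (int k = 0; k < 4; ++k)
    {
        int v = base + k;
        if (v < size)
            data_res[v] = (int) data2[v] - (int) data1[v];
    }
}

// Launched e.g. as: diff4<<<((size + 3)/4 + 255) / 256, 256>>>(dev_data1, dev_data2, dev_data_res, size);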
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4 ( unsigned int a, unsigned int b )
which subtracts, with signed saturation, each byte of b from the corresponding byte of a. If you can "live" with that packed result, rather than expecting a larger int per pixel, it could save you some work, and it goes very well with increasing the number of elements per thread. In fact, it might let you increase that number even further to get to the optimum.
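Putting points 2 and 3 together, a sketch (unmeasured, with hypothetical names) of a kernel that treats the images as 32-bit words and produces packed per-byte differences:

__global__ void diffPacked(const unsigned int* __restrict__ data1,
                           const unsigned int* __restrict__ data2,
                           unsigned int* __restrict__ data_res, int words)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < words)                                   // words = number of pixels / 4
        data_res[i] = __vsubss4(data2[i], data1[i]); // per-byte saturated data2 - data1
}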
I don't think you are measuring the time correctly; memory copy is a time-consuming step on the GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider putting them in constant memory using cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so blocks and the number of threads per block work better as multiples of 32 (the warp size), but not larger than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created.
In your kernel function you can compute the index with a single variable: int v = threadIdx.x + blockIdx.x * blockDim.x.
Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with 2D arrays, because of padding.
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional statements inside a kernel can result in warp divergence. That means the if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, the if evaluates to false only at the borders. To avoid the divergence and the needless computation, split your image into two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.
Obviously, you'll have to modify the code that calls the kernels; a sketch of the interior kernel follows.
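A sketch of the kernel for the central part (illustrative only; it assumes the launch configuration covers exactly the interior rows and columns, so no bounds check is needed):

__global__ void diffInterior(const unsigned char* data1, const unsigned char* data2,
                             int* data_res)
{
    // Launched as diffInterior<<<MAX_H, MAX_W>>>(...): every (row, col) is in range.
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row * blockDim.x;
    data_res[v] = (int) data2[v] - (int) data1[v];
}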
And on a separate note:
The GPU has a throughput-oriented architecture, not a latency-oriented one like the CPU. That means the CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using large data sets?
The CUDA profiler is a very handy tool that will tell you where your code is not optimal.
I'm using the following simple Go code to allocate a 3D array of size 1024x1024x1024:
grid = make([][][]TColor, 1024)
for x = 0; x < 1024; x++ {
    grid[x] = make([][]TColor, 1024)
    for y = 0; y < 1024; y++ {
        grid[x][y] = make([]TColor, 1024)
    }
}
That TColor struct is a 4-component float64 vector:
type TColor struct { R, G, B, A float64 }
Halfway (x=477 and y=~600ish) through the allocation, the inner-most make() call panics with... runtime: out of memory: cannot allocate 65536-byte block (17179869184 in use)
This works fine at lower grid resolutions, i.e. 256³, 128³, etc. Now, since the size of the struct is 4x4 bytes, that whole grid should require exactly 16 GB of memory. My machine (openSUSE 12.1, 64-bit) has 32 GB of addressable physical (i.e. not virtual) memory. Why can Go (weekly.2012-02-22) not allocate even half of this?
The struct has 4x8 = 32 bytes, not 4x4 = 16, so the full 1024³ grid needs 1024³ × 32 B = 32 GiB just for the color values (plus the slice headers), twice your estimate.
In the current implementation of the Go language, on 64-bit CPUs the Go runtime reserves 16GB of virtual memory from the operating system. This limits the total memory used by a Go program to 16GB.
If you plan to use Go in projects that require large amounts of memory you will need to edit the function runtime·mallocinit in file malloc.goc and increase the value of variable arena_size from 16GB to a bigger value (such as 32GB). After the edit, run
cd $GOROOT/src/pkg/runtime
go tool dist install -v
and then recompile your project.