Different views of memory bandwidth between visual profiler and nsight analysis - visual-studio-2010

I'm using CUDA 5.5 under Windows, with VS2010, Nsight 3.1, and the bundled Visual Profiler.
I have a toy kernel which only does stores, and I see different data from Nsight and the Visual Profiler. Which should I trust, and why do I get different views?
Nsight reports 4.21 MB of stores, while the Visual Profiler reports 71402 transactions, which represents 8.9 MB (assuming all of them are 128 B). Consequently, Nsight says the bandwidth is 277 GB/s and the Visual Profiler says 126.69 GB/s.
The Nsight data looks closer to reality to me, since my dataset is 1024x1024 (with 4 bytes written per thread, that is roughly 4.19 MB, close to Nsight's figure).
EDIT
I have deleted a lot of bad assumptions from my original question. I was thinking in terms of CPUs and cache coherence.
Access pattern:
each thread performs 4 consecutive 1-byte stores, like this (dst is a char*):
for (int i = 0; i < 4; i++) {
    dst[offset + i] = 0;
}
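For illustration, a self-contained store-only kernel with this access pattern might look like the following (the kernel name, launch mapping, and offset computation are illustrative assumptions, not my actual code); the commented-out line shows the equivalent single 4-byte store, which writes the same bytes with fewer store instructions:

__global__ void store_only(char* dst)
{
    // One thread per 4-byte group; offset is the first byte this thread writes.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int offset = tid * 4;

    // Four consecutive 1-byte stores, as described above.
    for (int i = 0; i < 4; i++) {
        dst[offset + i] = 0;
    }

    // Equivalent single 4-byte store (same bytes written, fewer store instructions):
    // reinterpret_cast<int*>(dst)[tid] = 0;
}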

There is a difference between device memory and global memory. The programming guide says that device memory includes "global, local, shared, constant, or texture memory" (see section 5.3.2).
In your first picture, global loads and stores should appear in the first table, named L1/Shared Memory (which is not visible in your capture).

How do I change memory bytes in Visual Studio 2017?

I am trying to change bytes in memory, in a memory window, when debugging a C++ project in my Visual Studio 2017. The memory window is pointing at memory holding code, as I am trying to patch a piece of code quickly (just need to change parameter value) without needing to stop and re-compile.
I also noticed that you cannot change values in the memory window even for data memory.
Is there some hidden configuration setting to let you do that? It was possible to do this in VS6.
I found a kind of work-around, that works even for modifying executable code memory.
Here are the steps:
Define a spare global pointer in your code (you can actually use any memory pointer in your code as long as you do not care that you will change its value):
char* memptr;
Add the pointer to a watch window.
Set the pointer value to the memory address that you want to modify.
Expand the content (by the way, you can use "memptr,100" in your watch window to access more than one byte).
Type the updated value into the expanded byte value cells.
This works even if you set the pointer to executable machine code memory, so you can use it to patch code.
It can be an int pointer or any other type, or you can use a cast in the watch window if you desire to edit any other kinds of objects.
Be careful: this can be dangerous, and modifying memory has to be done with the greatest of care.
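As a concrete illustration, here is a minimal sketch of the spare pointer plus the watch expressions involved (the address shown is just a placeholder for whatever location you want to patch):

// Spare global pointer defined somewhere in the project.
unsigned char* memptr = nullptr;

// Watch window entries, typed while broken into the debugger:
//   memptr            -> overwrite its value with the target address, e.g. 0x00007FF6A1B2C300
//   memptr,100        -> expands to 100 editable byte cells at that address
//   (int*)memptr,25   -> the same memory viewed and edited as 25 ints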
How do I change memory bytes in Visual Studio 2017?
As far as I know, Microsoft does not support changing memory bytes directly in the latest Visual Studio versions, including VS2017, and there is no hidden option to enable it.
In general, during debugging, the Memory window shows the memory space your app is using. And the Memory window isn't limited to displaying data. It displays everything in the memory space, including data, code, and random bits of garbage in unassigned memory.
Besides, the bytes shown in the Memory window change along with the values of variables in debugger windows such as Watch, Autos, Locals, and the QuickWatch dialog; the window is intended for analyzing those changes in memory usage to improve the program, which is why you cannot edit it directly.
In addition, for more info about the Memory window, you can check this official document.

Why are OpenGL and CUDA contexts memory greedy?

I develop software which usually includes both OpenGL and the Nvidia CUDA SDK. Recently, I also started looking for ways to optimize the run-time memory footprint. I noticed the following (Debug and Release builds differ only by 4-7 Mb):
Application startup - Less than 1 Mb total
OpenGL 4.5 context creation ( + GLEW loader init) - 45 Mb total
CUDA 8.0 context (Driver API) creation - 114 Mb total
If I create the OpenGL context in "headless" mode, the GL context uses 3 Mb less, which probably goes to the default framebuffer allocation. That makes sense, as the window size is 640x360 (640 x 360 x 4 bytes is roughly 0.9 Mb per buffer, so ~3 Mb covers a few buffers).
So after the OpenGL and CUDA contexts are up, the process already consumes 114 Mb.
Now, I don't have deep knowledge of the OS-specific stuff that occurs under the hood during GL and CUDA context creation, but 45 Mb for GL and 68 Mb for CUDA seems like a whole lot to me. I know that usually several megabytes go to system frame buffers and function pointers (probably the bulk of the allocations happens on the driver side). But hitting over 100 Mb with just "empty" contexts looks like too much.
I would like to know:
Why GL/CUDA context creation consumes such a considerable amount of memory?
Are there ways to optimize that?
The system setup under test:
Windows 10 64-bit. NVIDIA GTX 960 GPU (driver version 388.31). 8 Gb RAM. Visual Studio 2015, 64-bit C++ console project.
I measure memory consumption using the Visual Studio built-in Diagnostic Tools -> Process Memory section.
UPDATE
I tried Process Explorer, as suggested by datenwolf. Here is a screenshot of what I got (my process at the bottom, marked in yellow):
I would appreciate some explanation of that info. I was always looking at "Private Bytes" in the "VS Diagnostic Tools" window. But here I also see "Working Set", "WS Private", etc. Which one correctly shows how much memory my process currently uses? 281,320 K looks like way too much because, as I said above, the process does nothing at startup except create the CUDA and OpenGL contexts.
Partial answer: This is an OS-specific issue; on Linux, CUDA takes 9.3 MB.
I'm using CUDA (not OpenGL) on GNU/Linux:
CUDA version: 10.2.89
OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
Kernel: Linux 5.2.0
Processor: Intel x86_64
To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):
#include <stdio.h>
#include <cuda.h>
#include <malloc.h>
#include <stdlib.h>
static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");
    CUresult status = cuInit(0);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");
    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");
    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");
    return EXIT_SUCCESS;
}
(Note that malloc_stats() is a glibc-specific function, not part of the standard library.)
Summarizing the results and snipping irrelevant parts:
Point in program                  | Total bytes | In-use  | Max MMAP Regions | Max MMAP bytes
Initially                         |      135168 |    1632 |                0 |              0
After CUDA driver initialization  |      552960 |  439120 |                2 |         307200
After context creation            |     9314304 | 6858208 |                8 |        6643712
After context destruction         |     7016448 |  580688 |                8 |        6643712
So CUDA starts with 0.5 MB and after allocating a context takes up 9.3 MB (going back down to 7.0 MB on destroying the context). 9 MB is still a lot of memory for not having done anything; but - maybe some of it is all-zeros, or uninitialized, or copy-on-write, in which case it doesn't really take up that much memory.
It's possible that memory use improved dramatically over the two years between the driver releases for CUDA 8 and CUDA 10, but I doubt it. So - it looks like your problem is Windows specific.
Also, I should mention I did not create an OpenGL context - which is another part of OP's question - so I haven't estimated how much memory that takes. OP brings up the question of whether the whole is greater than the sum of its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well; I believe this should not be the case, but readers are welcome to try and report...

Directx Texture interface to existing memory

I'm writing a rendering app that communicates with an image processor as a sort of virtual camera, and I'm trying to figure out the fastest way to write the texture data from one process to the awaiting image buffer in the other.
Theoretically I think it should be possible with 1 DirectX copy from VRAM directly to the area of memory I want it in, but I can't figure out how to specify a region of memory for a texture to occupy, and thus must perform an additional memcpy. DX9 or DX11 solutions would be welcome.
So far, the docs here: http://msdn.microsoft.com/en-us/library/windows/desktop/bb174363(v=vs.85).aspx have held the most promise.
"In Windows Vista CreateTexture can create a texture from a system memory pointer allowing the application more flexibility over the use, allocation and deletion of the system memory"
I'm running on Windows 7 with the June 2010 DirectX SDK. However, whenever I try to use the function in the way it specifies, the call fails with an invalid-arguments error code. Here is the call I tried as a test:
static char s_TextureBuffer[640*480*4]; //larger than needed
void* p = (void*)s_TextureBuffer;
HRESULT res = g_D3D9Device->CreateTexture(640,480,1,0, D3DFORMAT::D3DFMT_L8, D3DPOOL::D3DPOOL_SYSTEMMEM, &g_ReadTexture, (void**)p);
I tried several different texture formats, but with no luck. I've begun looking into DX11 solutions, but it's going slowly since I'm used to DX9. Thanks!
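For what it's worth, my reading of that MSDN note is that pSharedHandle is supposed to point at a variable holding the system memory pointer, rather than receiving the buffer pointer itself cast to (void**). A minimal, untested sketch of that interpretation (it may also require a WDDM driver, and possibly a D3D9Ex device - I have not verified that):

static char s_TextureBuffer[640 * 480 * 4];   // backing system memory, larger than needed
IDirect3DTexture9* g_ReadTexture = NULL;

// *pSharedHandle must be the system memory pointer, so pass the address of a
// HANDLE variable that holds it.
HANDLE hSysMem = (HANDLE)s_TextureBuffer;
HRESULT res = g_D3D9Device->CreateTexture(
    640, 480,               // width, height
    1,                      // Levels (1, as in the original call)
    0,                      // Usage
    D3DFMT_L8,
    D3DPOOL_SYSTEMMEM,      // required pool for the user-memory path
    &g_ReadTexture,
    &hSysMem);              // address of the HANDLE holding the memory pointer, not (void**)p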

feature request: an atomicAdd() function included in gwan.h

In the G-WAN KV options, KV_INCR_KEY will use the 1st field as the primary key.
That means there is an atomic increment function already built into the G-WAN core to make this primary index work.
It would be good to open this function up for use by servlets, i.e. to include it in gwan.h.
By doing so, ANSI C newbies like me could benefit from it.
There was ample discussion about this on the old G-WAN forum, and people were invited to share their experiences with atomic operations in order to build a rich list of documented functions, platform by platform.
Atomic operations are not portable because they address the CPU directly. It means that the code for Intel x86 (32-bit) and Intel AMD64 (64-bit) is different. Each platform (ARM, Power7, Cell, Motorola, etc.) has its own atomic instruction sets.
Such a list was not published in the gwan.h file so far because basic operations are easy to find (the GCC compiler offers several atomic intrinsics as C extensions) while more sophisticated operations are less obvious (they need asm skills), and people will build them as they need them - for very specific uses in their code.
Software Engineering is always a balance between what can be made available at the lowest possible cost to entry (like the G-WAN KV store, which uses a small number of functions) and how it actually works (which is far less simple to follow).
So, beyond the obvious (incr/decr, set/get), to learn more about atomic operations, use Google, find CPU instruction sets manuals, and arm yourself with courage!
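For the obvious cases, the GCC __sync builtins mentioned above are enough. A minimal, self-contained sketch (plain C, compiles with GCC; the counter and values are just examples):

#include <stdio.h>

static volatile int counter = 0;

int main(void)
{
    /* Atomic increment, returning the new value - the kind of operation
       the KV primary-key index needs. */
    int new_val = __sync_add_and_fetch(&counter, 1);

    /* Atomic compare-and-swap: only writes 10 if counter still holds 1. */
    int swapped = __sync_bool_compare_and_swap(&counter, 1, 10);

    printf("new_val=%d swapped=%d counter=%d\n", new_val, swapped, counter);
    return 0;
}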
Thanks for Gil's helpful guidance.
Now, I can do it by myself.
I changed the code in persistence.c as below:
First, I changed the definition of val in data to volatile.
//data[0]->val++;
//xbuf_xcat(reply, "Value: %d", data[0]->val);
int new_count, loops = 50000000, time1, time2, time;
time1 = getus();
for (int i = 0; i < loops; i++) {
    new_count = __sync_add_and_fetch(&data[0]->val, 1);
}
time2 = getus();
time = loops / (time2 - time1);   // increments per microsecond
time = time * 1000;               // increments per millisecond
xbuf_xcat(reply, "Value: %d, time: %d incr_ops/msec", new_count, time);
I got 52,000 incr_operations/msec with my old E2180 CPU.
So, with the GCC compiler, I can do it by myself.
Thanks again.

Can ETW (event tracing for windows) be used to gather also memory statistics?

Is it possible using ETW to also get memory statistics of all the processes and the system ?
With memory statistics I mean, e.g., committed bytes, private bytes, paged pool, working set, ...
I cannot find anything about using xperf to get and see memory statistics. It is always about CPU, disk, and network.
One could probably use performance counters to get that kind of information, but how can one overlay the statistics graphically in one chart (how to correlate/sync the timestamps) ?
Your best bet on Windows 8.1 and higher is the Microsoft-Windows-Kernel-Memory provider, which records per-process memory information every 0.5 s. See https://github.com/google/UIforETW/issues/80 for details. UIforETW enables this by default when it is available.
You could also try the MEMINFO provider. It gives a system-wide overview of memory pressure. It shows the Active List (currently in use memory), the Standby List ('useful' pages not currently in use, such as the disk cache), and the Zero and Free lists (genuinely free memory). This at least lets you tell whether a system is running out of memory.
You could also try MEMINFO_WS and CONTMEMGEN but these are undocumented so I really don't know what they do. They show up in xperf -providers k but when I record with them I can't see any new graphs appearing. Apparently Microsoft ships these providers but no way to view them. Sigh...
If you want more memory details on Windows 7 -- such as per-process working sets -- your best bet is to have a process running which periodically queries this data and emits it in custom ETW events. This is available in a prepackaged form in UIforETW which can query the working set of a specified set of processes once a second. See the announcement post for how to get UIforETW:
https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/
UIforETW's Windows 7 working set data shows up in Generic Events under Task Name == WorkingSet. On Windows 8.1 the OS working set data (more detailed, more efficiently recorded) shows up under Memory -> Virtual Memory Snapshots.
You can trace memory usage with the ReferenceSet kernel group. It includes the following trace flags:
PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+FOOTPRINT+VIRT_ALLOC+MEMINFO+VAMAP+SESSION+REFSET+MEMINFO_WS
MEMORY = Memory tracing
FOOTPRINT+REFSET = Support footprint analysis
MEMINFO = Memory list info (the active, standby, and other lists you see in ResMon)
VIRT_ALLOC = Virtual allocation reserve and release
VAMAP = Mapped file information
MEMINFO_WS = Working set info
As you can see, xperf can capture a lot of memory data when you use the right flags.
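As a rough usage sketch (the output file name is arbitrary), a capture with that kernel group looks something like this:

xperf -on ReferenceSet
... run the scenario you want to measure ...
xperf -d memory_trace.etl

Then open memory_trace.etl in Windows Performance Analyzer (WPA) to view the memory graphs.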
