OpenACC shared memory usage - shared-memory

I am working with OpenACC using the PGI compiler. I want to know how I can profile the code's memory usage, especially shared memory, at runtime.
Thank you so much for your help!
Behzad

I'm assuming you mean "shared memory" in the CUDA sense (the fast, per-SM shared memory on NVIDIA GPUs). In this case, you have a few options.
First, if you just want to know how much shared memory is being used, this can be determined at compile-time by adding -Mcuda=ptxinfo.
pgcc -fast -ta=tesla:cc35 laplace2d.c -Mcuda=ptxinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'main_61_gpu' for 'sm_35'
ptxas info : Function properties for main_61_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 26 registers, 368 bytes cmem[0]
ptxas info : Compiling entry function 'main_65_gpu_red' for 'sm_35'
ptxas info : Function properties for main_65_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 368 bytes cmem[0]
ptxas info : Compiling entry function 'main_72_gpu' for 'sm_35'
ptxas info : Function properties for main_72_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 344 bytes cmem[0]
In the above case, it doesn't appear that I'm using any shared memory. (Follow-up: I spoke with a PGI compiler engineer and learned that the shared memory is adjusted dynamically at kernel launch, so it will not show up via ptxinfo.)
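For context, when a PGI-generated OpenACC kernel does use shared memory it is usually because of the cache directive (or compiler-generated reductions and tiling) rather than anything you declare yourself. Here's a minimal, hypothetical sketch of requesting it explicitly; the loop and array names are made up, and whether it actually lands in CUDA shared memory is ultimately up to the compiler:

// Hypothetical sketch: the OpenACC "cache" directive asks the compiler to
// stage a small window of the array in fast on-chip memory, which PGI will
// generally map to CUDA shared memory on NVIDIA targets.
#pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
for (int i = 1; i < n - 1; i++) {
    #pragma acc cache(a[i-1:3])   // stage a 3-element window of a
    b[i] = 0.25f * (a[i-1] + 2.0f*a[i] + a[i+1]);
}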
You can also use the NVIDIA Visual Profiler to get at this information. If you gather a GPU timeline and then click on an instance of a particular kernel, the properties panel should open and show shared memory per block. In my case, ptxinfo showed 0 bytes of shared memory while the Visual Profiler showed some shared memory in use, so I'll need to dig into why.
You can get some info at runtime too. If you're comfortable on the command-line, you can use nvprof:
# Analyze load/store transactions
$ nvprof -m shared_load_transactions,shared_store_transactions ./a.out
# Analyze shared memory efficiency
# This will result in a LOT of kernel replays.
$ nvprof -m shared_efficiency ./a.out
This doesn't show the amount used, but does give you an idea of how it's used. The Visual Profiler's guided analysis will give you some insight into what these metrics mean.
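Finally, if you just want the hardware limits at runtime (as opposed to what a given kernel actually uses), you can query them through the CUDA runtime API. This is only a sketch and not PGI-specific; it assumes the CUDA headers and runtime are available to your build (for example when compiling with -Mcuda or linking against libcudart):

// Sketch: query the device's shared-memory capacity at runtime via the CUDA
// runtime API. This reports the per-block hardware limit, not what an
// individual OpenACC kernel is using.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    struct cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("%s: %zu bytes of shared memory per block\n",
           prop.name, prop.sharedMemPerBlock);
    return 0;
}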

Related

How to enlarge the memory in Microblaze for software applications?

I wrote a C program which is quite large. However, the MicroBlaze by default uses only 64 KB, so I changed the amount of BRAM in the EDK to 512 KB, but when I generate the bitstream I get these errors:
- C:\Users\slim\Desktop\hs\system.mhs line 74
IPNAME: plb_v46, INSTANCE: mb_plb - 2 master(s) : 1 slave(s)
IPNAME: lmb_v10, INSTANCE: ilmb - 1 master(s) : 1 slave(s)
IPNAME: lmb_v10, INSTANCE: dlmb - 1 master(s) : 1 slave(s)
ERROR:EDK:440 - platgen failed with errors!
Done!
On a Spartan-6, EDK's block RAM memory blocks are limited to 64 kB if they are 32 bits wide, 128 kB if they are 64 bits wide, and 256 kB if they are 128 bits wide. Besides, the Spartan-6 LX45 has only 238 kB of block RAM available, so this is impossible altogether on your platform.
The memory the MicroBlaze uses to store its program is limited to 32 bits wide, as far as I know. To use the larger memories you would need to connect them over an AXI port. Otherwise, Xilinx has an answer record on how to use two different 64 kB memories to present 128 kB to the MicroBlaze.
Do you really need that much memory? If the answer is "yes", you should use a MicroBlaze with external memory instead. You will have much more memory available (your board has 128 MB of DDR ready to use) with minimal performance impact if you have sufficient caches.

Windows program has big native heap, much larger than all allocations

We are running a mixed mode process (managed + unmanaged) on Win 7 64 bit.
Our process is using up too much memory (especially VM). Based on our analysis, the majority of the memory is used by a big native heap. Our theory is that the LFH is saving too many free blocks in committed memory, for future allocations. They sum to about 1.2 GB while our actual allocated native memory is only at most 0.6 GB.
These numbers are from a test run of the process. In production it sometimes exceeded 10 GB of VM - with maybe 6 GB unaccounted for by known allocations.
We'd like to know if this theory of excessive committed-but-free-for-allocation segments is true, and how this waste can be reduced.
Here are the details of our analysis.
First we needed to figure out what was allocated and rule out memory leaks. We ran the excellent Heap Inspector by Jelle van der Beek, ruled out a leak, and established that the known allocations total at most 0.6 GB.
We took a full memory dump and opened it in WinDbg.
We ran !heap -stat
It reports a big native heap with 1.83 GB of committed memory. Much more than the sum of our allocations!
_HEAP 000000001b480000
Segments 00000078
Reserved bytes 0000000072980000
Committed bytes 000000006d597000
VirtAllocBlocks 0000001e
VirtAlloc bytes 0000000eb7a60118
Then we ran !heap -stat -h 0000001b480000
heap # 000000001b480000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
c0000 12 - d80000 (10.54)
b0000 d - 8f0000 (6.98)
e0000 a - 8c0000 (6.83)
...
If we add up all 20 reported items, they total 85 MB - much less than the 1.79 GB we're looking for.
We ran !heap -h 1b480000
...
Flags: 00001002
ForceFlags: 00000000
Granularity: 16 bytes
Segment Reserve: 72a70000
Segment Commit: 00002000
DeCommit Block Thres: 00000400
DeCommit Total Thres: 00001000
Total Free Size: 013b60f1
Max. Allocation Size: 000007fffffdefff
Lock Variable at: 000000001b480208
Next TagIndex: 0000
Maximum TagIndex: 0000
Tag Entries: 00000000
PsuedoTag Entries: 00000000
Virtual Alloc List: 1b480118
Unable to read nt!_HEAP_VIRTUAL_ALLOC_ENTRY structure at 000000002acf0000
Uncommitted ranges: 1b4800f8
FreeList[ 00 ] at 000000001b480158: 00000000be940080 . 0000000085828c60 (9451 blocks)
...
When adding up all the segment sizes in the report, we get:
Total Size = 1.83 GB
Segments Marked Busy Size = 1.50 GB
Segments Marked Busy and Internal Size = 1.37 GB
So all the committed bytes in this report do add up to the total commit size. We grouped by block size, and the heaviest allocations come from blocks of size 0x3fff0. These don't correspond to allocations that we know of. There were also mystery blocks of other sizes.
We ran !heap -p -all. This reports the LFH internal segments, but we don't understand the output fully. The 0x3fff0-sized blocks from the previous report appear in the LFH report with an asterisk mark and are sometimes Busy and sometimes Free. Inside them we see many smaller free blocks.
We guess these free blocks are legitimate: they are committed VM that the LFH reserves for future allocations. But why is their total size so much greater than the sum of our actual allocations, and can this be reduced?
Well, I can sort of answer my own question.
We had been doing lots and lots of tiny allocations and deallocations in our program. There was no leak, but this apparently created some form of fragmentation. After consolidating and eliminating most of these allocations, our software runs much better and uses less peak memory. It is still a mystery why the peak committed memory was so much higher than the peak actually-used memory.
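For what it's worth, the usual way to consolidate that kind of allocation pattern is to carve the small objects out of one large block instead of hitting the heap for each one. A rough sketch (the names are illustrative, not from our code):

// Minimal arena sketch: hand out fixed chunks from one large backing
// allocation instead of making thousands of tiny malloc/free calls.
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *base;     // one large backing allocation
    size_t used;     // bytes handed out so far
    size_t capacity; // total size of the backing allocation
} arena_t;

static int arena_init(arena_t *a, size_t capacity)
{
    a->base = malloc(capacity);
    a->used = 0;
    a->capacity = capacity;
    return a->base != NULL;
}

static void *arena_alloc(arena_t *a, size_t size)
{
    size = (size + 15) & ~(size_t)15;   // keep 16-byte alignment
    if (a->used + size > a->capacity)
        return NULL;                    // no growth in this sketch
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

static void arena_release(arena_t *a)
{
    free(a->base);                      // one free for everything
    memset(a, 0, sizeof *a);
}

Everything comes back with a single free(), so the heap never accumulates thousands of tiny free blocks for the LFH to cache.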

gcc has a memory leak?

I've been trying to be more meticulous lately about memory management in my code. Just for a laugh, I wrote a simple C source file containing only one function, and used valgrind to see if the C compiler itself had any leaks. To my great surprise, it did!
valgrind --leak-check=full --show-reachable=yes gcc -c example.c
...bunch of junk...
==4587== LEAK SUMMARY:
==4587== definitely lost: 4,207 bytes in 60 blocks
==4587== indirectly lost: 56 bytes in 5 blocks
==4587== possibly lost: 27 bytes in 2 blocks
==4587== still reachable: 29,048 bytes in 47 blocks
==4587== suppressed: 0 bytes in 0 blocks
Clang had leaks too, though only 68 bytes, all of them still reachable.
I thought that if your code has memory leaks, you get thrown in solitary confinement for every byte you lose. Have I misunderstood the implications of memory leaks? Are they actually tolerable as long as it's not a long-running program? Or is this just valgrind being wrong?
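For reference, valgrind distinguishes between memory that is merely never freed but still reachable at exit, and memory whose last pointer was lost. A toy example (names made up) that produces one block in each category:

// Toy example of the two leak categories valgrind reports.
// Build and run: cc -g leak_demo.c && valgrind --leak-check=full --show-reachable=yes ./a.out
#include <stdlib.h>

static char *still_reachable;  // a global pointer keeps this block reachable at exit

int main(void)
{
    still_reachable = malloc(64);   // reported as "still reachable"
    char *lost = malloc(128);       // last pointer is overwritten below...
    lost = NULL;                    // ...so this is "definitely lost"
    (void)lost;
    return 0;                       // the OS reclaims both at process exit
}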

Linux memory overcommit details

I am developing software for embedded Linux, and I am suffering system hangs because the OOM killer appears from time to time. Before going further, I would like to resolve some confusion about how the Linux kernel allocates dynamic memory, assuming /proc/sys/vm/overcommit_memory is 0, /proc/sys/vm/min_free_kbytes is 712, and there is no swap.
Suppose the physical memory currently available on my embedded Linux system is 5 MB (5 MB of free memory and no usable cached or buffered memory). If I write this piece of code:
.....
#define MEGABYTE (1024*1024)
.....
.....
void *ptr = NULL;
ptr = malloc(6*MEGABYTE);   // reserving 6 MB
if (!ptr)
    exit(1);
memset(ptr, 1, MEGABYTE);   // touching only the first 1 MB
.....
I would like to know whether, when the memset call executes, the kernel will try to commit ~6 MB or ~1 MB (or a multiple of min_free_kbytes) of physical memory.
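One way to check this empirically on the target is to watch the process's resident set size around the malloc and the memset. A quick sketch; the helper and labels are just illustrative:

// Sketch: print this process's resident set size (VmRSS) from /proc/self/status
// before the malloc, after the malloc, and after the memset.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MEGABYTE (1024*1024)

static void print_vmrss(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);   // e.g. "after memset: VmRSS: 1234 kB"
    }
    fclose(f);
}

int main(void)
{
    print_vmrss("before malloc:");
    void *ptr = malloc(6*MEGABYTE);
    if (!ptr)
        exit(1);
    print_vmrss("after malloc: ");
    memset(ptr, 1, MEGABYTE);               // touch only the first megabyte
    print_vmrss("after memset: ");
    free(ptr);
    return 0;
}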
Right now there are about 9 MB free on my embedded device, which has 32 MB of RAM. I check it by doing:
# echo 3 > /proc/sys/vm/drop_caches
# free
total used free shared buffers
Mem: 23732 14184 9548 0 220
Swap: 0 0 0
Total: 23732 14184 9548
Setting aside that last piece of C code, I would like to know whether it is possible for the OOM killer to appear when, for instance, free memory is above 6 MB.
I want to know whether the system is really out of memory when the OOM killer appears, so I think I have two options:
See the VmRSS entries in /proc/pid/status of the suspicious processes.
Set /proc/sys/vm/overcommit_memory = 2 and /proc/sys/vm/overcommit_ratio = 75 and see if any process requires more memory than is physically available.
I think you can read this document. It provides three small C programs that you can use to understand what happens with the different possible values of /proc/sys/vm/overcommit_memory.

Allocating a large DMA buffer

I want to allocate a large DMA buffer, about 40 MB in size. When I use dma_alloc_coherent(), it fails and what I see is:
------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2106 __alloc_pages_nodemask+0x1dc/0x788()
Modules linked in:
[<8004799c>] (unwind_backtrace+0x0/0xf8) from [<80078ae4>] (warn_slowpath_common+0x4c/0x64)
[<80078ae4>] (warn_slowpath_common+0x4c/0x64) from [<80078b18>] (warn_slowpath_null+0x1c/0x24)
[<80078b18>] (warn_slowpath_null+0x1c/0x24) from [<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788)
[<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788) from [<8004a880>] (__dma_alloc+0xa4/0x2fc)
[<8004a880>] (__dma_alloc+0xa4/0x2fc) from [<8004b0b4>] (dma_alloc_coherent+0x54/0x60)
[<8004b0b4>] (dma_alloc_coherent+0x54/0x60) from [<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec)
[<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec) from [<80123b78>] (do_vfs_ioctl+0x80/0x54c)
[<80123b78>] (do_vfs_ioctl+0x80/0x54c) from [<8012407c>] (sys_ioctl+0x38/0x5c)
[<8012407c>] (sys_ioctl+0x38/0x5c) from [<80041f80>] (ret_fast_syscall+0x0/0x30)
---[ end trace 4e0c10ffc7ffc0d8 ]---
I've tried different values, and it looks like dma_alloc_coherent() can't allocate more than 2^25 bytes (32 MB).
How can such a large DMA buffer be allocated?
After the system has booted up, dma_alloc_coherent() is not necessarily reliable for large allocations. This is simply because non-movable pages quickly fill up your physical memory, making large contiguous ranges rare. This has been a problem for a long time.
Conveniently, a relatively recent patch set may help you out: the contiguous memory allocator (CMA), which appeared in kernel 3.5. If you're using a kernel with CMA, you should be able to pass cma=64M on your kernel command line and that much memory will be reserved (only movable pages will be placed there). When you subsequently ask for your 40 MB allocation, it should reliably succeed. Simples!
For more information check out this LWN article:
https://lwn.net/Articles/486301/
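For reference, once the CMA region is reserved the driver-side call doesn't change; it's still an ordinary dma_alloc_coherent(). A minimal sketch, assuming dev is whatever struct device your driver already owns:

/* Sketch only: allocating the 40 MB buffer from a driver once CMA has been
 * reserved with cma=64M. Function names here are illustrative. */
#include <linux/device.h>
#include <linux/dma-mapping.h>

#define BUF_SIZE (40 * 1024 * 1024)

static void *cpu_addr;
static dma_addr_t dma_handle;

static int my_alloc_dma_buffer(struct device *dev)
{
    cpu_addr = dma_alloc_coherent(dev, BUF_SIZE, &dma_handle, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;
    /* hand dma_handle to the hardware, use cpu_addr from the CPU */
    return 0;
}

static void my_free_dma_buffer(struct device *dev)
{
    if (cpu_addr)
        dma_free_coherent(dev, BUF_SIZE, cpu_addr, dma_handle);
}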
