As an abstract concept of parallel computing, Local(shared) memory is allocated per Thread Blocks (CUDA) / Workgroups (OpenCL) and shared between all threads in the same Thread Blocks (CUDA) / Workgroups (OpenCL).
How it is actually allocated ? is it allocated by the first thread of the Block/Group or it is allocated before creating the Blocks by the memory controller ? or something else ?
What OpenCL considers "Local Memory" is:
Memory available only during the kernel execution, that is shared only by elements of the same workgroup. Each workgroup can only see their local memory.
The memory usage is known at compile time and limited.
It is very similar to registers or L1/L2 cache in CPUs / multicore systems. Compilers know about the registers of the target CPU and plan accordingly.
When the scheduler schedules the workgroups to hardware resources will always ensure enough memory is in place for each workgroup.
You can consider local memory inside kernel execution as a pointer to memory that is already allocated, similar to a register or private memory.
Related
I'm a fresher and was asked this question in the Microsoft recruitment process.
I'd read somewhere that the maximum memory allocated to a process can be the maximum physical memory available. So is it that if the RAM is 4GB, that's the answer? If yes, then how? Because some part of the RAM is always occupied by the Operating System, right? If no, then could you tell me the answer and what are the factors it really depends on?
First of all, the base of your question is totally related to Virtual Memory which has already been pointed out by Chris O!
Now,proceeding to your questions step by step :-
I'd read somewhere that the maximum memory allocated to a process can
be the maximum physical memory available. So is it that if the RAM is
4GB, that's the answer?
No, the maximum memory which your process can use can be anything depending on the virtual memory assigned or the swap size. Swap memory is generally taken twice of the physical memory,thought it can always be more or less depending on the requirements!
Also, PAE (Physical Address Extension) allows more memory to be allocated. PAE allows a 32-bit OS to use more RAM, that is, more physical memory. This has nothing whatsoever to do with the 4GB virtual address space limitation that 32-bit OSes have.
A 32-bit OS uses 32-bit virtual addresses. That limits it to 4GB of addressable virtual memory at any one time. If a 32-bit OS also uses 32-bit physical addresses, it is limited to 4GB of physical memory as well. PAE allows a 32-bit OS to use 36-bit physical addresses, which raises the limit to 64GB.
Next, the point which you mentioned is valid for the atomic processes which can't be broken further into threads or So. I doubt one would rarely face that situation in which the size of atomic process is more than that of the physical memory...
If yes, then how?Because some part of the RAM is always occupied by
the Operating System, right?
No.it's not as I already have mentioned above!
If no, then could you tell me the answer and what are the factors it
really depends on?
The memory requirement of a process is not defined earlier. But, you might have heard about this that many programs recommend at least it must have this much of memory to execute this process. This is the minimal requirement of the process without which the process won't even run properly! Because it must have suitable physical memory to handle those events! Next, the term swapping comes into picture whenever we are talking about Virtual memory! All the process which are currently not running are send to disks and the process which are to be executed are sent to the physical memory for execution.So, more than one processes are requested and executed by continuous swapping!
Some other continuous processes which are maintained in main memory are :-
System processes OR daemons
cache memory or cache maintenance
Being asked this in my homework.
I don't know how to proceed.
Without virtual memory, you couldn't have something as simple as fork without having to swap processes in and out on every task switch (because they'd be using the same physical memory addresses). You also couldn't memory map files or fault-load executables. So yes, virtual memory would still be useful even if not being used to permit more virtual memory than physical memory.
I have a query related to memory leak.
A 32 bit Linux based system is running multiple active processes A,B,C,D. All the processes are allocating/deallocating memory from the heap. Now if process A is contionusly leaking a significant amount of memory, could it happen that after a certain amount of time process B cant find any memory to allocate from the heap?
As per my understanding, each process is provided with a unque VM of 2GB from the OS. But there is a mappig between the VM and the physical memory.
Yes, if the total amount of VM (RAM + swap space) is exhausted by process A, then malloc in any of the other processes might fail because of that. Linux hides processes' memory spaces from other processes, but it doesn't magically create extra memory in your machine. (Although it may seem to do so due to its overcommit behavior.)
In addition, Linux may employ its OOM killer when memory is running low.
Linux kernel does memory overcommit by default.
When a process request a memory segment with malloc() the memory is not automatically allocated.
You may have 4 processes malloc()ing 2gb each and not having any problem.
The problem arise when the process make use (initialize, bzero, copy) of the malloc()ed memory.
You may even malloc more memory than the system may reserve for you, without any problem, and malloc() doesn't even return NULL!!
I trying to to make the linux memory management a little bit more clear for tuning and performances purposes.
By reading this very interesting redbook "Linux Performance and Tuning Guidelines" found on the IBM website I came across something I don't fully understand.
On 32-bit architectures such as the IA-32, the Linux kernel can directly address only the first gigabyte of physical memory (896 MB when considering the reserved range). Memory above the so-called ZONE_NORMAL must be mapped into the lower 1 GB. This mapping is completely transparent to applications, but allocating a memory page in ZONE_HIGHMEM causes a small performance degradation.
why the memory above 896 MB has to be mapped into the lower 1GB ?
Why there is an impact on performances by allocating a memory page in ZONE_HIGHMEM ?
what is the ZONE_HIGHMEM used for then ?
why a kernel that is able to recognize up to 4gb ( CONFIG_HIGHMEM=y ) can just use the first gigabyte ?
Thanks in advance
When a user process traps in to the kernel, the page tables are not changed. This means that one linear address space must be able to cover both the memory addresses available to the user process, and the memory addresses available to the kernel.
On IA-32, which allows a 4GB linear address space, usually the first 3GB of the linear address space are allocated to the user process, and the last 1GB of the linear address space is allocated to the kernel.
The kernel must use its 1GB range of addresses to be able to address any part of physical memory it needs to. Memory above 896MB is not "mapped into the low 1GB" - what happens is that physical memory below 896MB is assigned a permanent linear address in the kernel's part of the linear address space, whereas as memory above that limit must be assigned a temporary mapping in the remaining part of the linear address space.
There is no impact on performance when mapping a ZONE_HIGHMEM page into a userspace process - for a userspace process, all physical memory pages are equal. The impact on performance exists when the kernel needs to access a non-user page in ZONE_HIGHMEM - to do so, it must map it into the linear address space if it is not already mapped.
I'm using VMMap from SysInternals to look at memory allocated by my Win32 C++ process on WinXP, and I see a bunch of allocations where portions of the allocated memory are reserved but not committed. As far as I can tell, from my reading and testing, all of the common memory allocators (e.g., malloc, new, LocalAlloc, GlobalAlloc) used in a C++ program always allocate fully committed blocks of memory.
Heaps are a common example of code that reserves memory but doesn't commit it until needed. I suspect that some of these blocks are Windows/CRT heaps, but there appears to be more of these types of blocks than I would expect for heaps. I see on the order of 30 of these blocks in my process, between 64k and 8MB in size, and I know that my code never intentionally calls VirtualAlloc to allocate reserved, uncommitted memory.
Here are a couple of examples from VMMap: http://www.flickr.com/photos/95123032#N00/5280550393/
What else would allocate such blocks of memory, where much of it is reserved but not committed? Would it make sense that my process has 30 heaps? Thanks.
I figured it out - it's the CRT heap that gets allocated by calls to malloc. If you allocate a large chunk of memory (e.g., 2 MB) using malloc, it allocates a single committed block of memory. But if you allocate smaller chunks (say 177kb), then it will reserve a 1 MB chunk of memory, but only commit approximately what you asked for (e.g., 184kb for my 177kb request).
When you free that small chunk, that larger 1 MB chunk is not returned to the OS. Everything but 4k is uncommitted, but the full 1 MB is still reserved. If you then call malloc again, it will attempt to use that 1 MB chunk to satisfy your request. If it can't satisfy your request with the memory that it's already reserved, it will allocate a new chunk of memory that's twice the previous allocation (in my case it went from 1 MB to 2 MB). I'm not sure if this pattern of doubling continues or not.
To actually return your freed memory to the OS, you can call _heapmin. I would think that this would make a future large allocation more likely to succeed, but it would all depend on memory fragmentation, and perhaps heapmin already gets called if an allocation fails (?), I'm not sure. There would also be a performance hit since heapmin would release the memory (taking time) and malloc would then need to re-allocate it from the OS when needed again. This information is for Windows/32 XP, your mileage may vary.
UPDATE: In my testing, heapmin did absolutely nothing. And the malloc heap is only used for blocks that are less than 512kb. Even if there are MBs of contiguous free space in the malloc heap, it will not use it for requests over 512kb. In my case, this freed, unused, yet reserved malloc memory chewed up huge parts of my process' 2GB address space, eventually leading to memory allocation failures. And since heapmin doesn't return the memory to the OS, I haven't found any solution to this problem, other than restarting my process or writing my own memory manager.
Whenever a thread is created in your application a certain (configurable) amount of memory will be reserved in the address space for the call stack of the thread. There's no need to commit all the reserved memory unless your thread is actually going to need all of that memory. So only a portion needs to be committed.
If more than the committed amount of memory is required, it will be possible to obtain more system memory.
The practical consideration is that the reserved memory is a hard limit on the stack size that reduces address space available to the application. However, by only committing a portion of the reserve, we don't have to consume the same amount of memory from the system until needed.
Therefore it is possible for each thread to have a portion of reserved uncommitted memory. I'm unsure what the page type will be in those cases.
Could they be the DLLs loaded into your process? DLLs (and the executable) are memory mapped into the process address space. I believe this initially just reserves space. The space is backed by the files themselves (at least initially) rather than the pagefile.
Only the code that's actually touched gets paged in. If I understand the terminology correctly, that's when it's committed.
You could confirm this by running your application in a debugger and looking at the modules that are loaded and comparing their locations and sizes to what you see in VMMap.