How OS handles stack growth of multiple threads with option "ulimit -s unlimited"?

How OS handles stack growth of multiple threads with option "ulimit -s unlimited"? - memory-management

By default, linux stack size is limited to 8 MB. So in case of multi-threaded environment each thread will get its own 8 MB stack. If any thread wanders off the bottom of a stack into the guard page will be rewarded with a segmentation-fault signal. This way we were preventing stacks to overlap with each other or with other memory regions.
However with the help of “# ulimit -s unlimited” we can allocate as much memory possible to stack (till we are not colliding with heap or other memory regions).
My questions are:
After executing “# ulimit -s unlimited”
Where does linux placed stacks of multiple threads in Virtual memory? It cannot be contiguous allocation otherwise they cannot expand.
How it calculate free space between two stacks in virtual memory?, so that they can get equal opportunity to expand.

After executing “# ulimit -s unlimited”
It will invoke syscall setrlimit to setup current shell process's rlimit(see more details from do_prlimit).
Where does linux placed stacks of multiple threads in Virtual memory? It cannot be contiguous allocation otherwise they cannot expand.
Child processes forked from the current bash also inherit its rlimit:
syscall_clone => kernel_clone => copy_process =>
memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
In Linux, every process has its own view of system memory, you can get it from /proc/pid/maps, which has many regions including stack and each region is represented as a struct vm_area_struct in the kernel, there is no magic. stack is virtual memory continuous, because it is located in a single vm_area_struct for every process, and it never expands or resizes actually. The stack size is determined at exec early time
Let's see how does kernel setup stack each time you call the exec family function:
syscall execv into kernel
=> do_execveat_common
=> call load_binary(for elf, it will invoke `load_elf_binary`)
=> setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
executable_stack);
=> expand_stack(vma, stack_base);
=> expand_downwards(vma, address);
=> anon_vma_interval_tree_pre_update_vma(vma);
=> anon_vma_interval_tree_post_update_vma(vma);

Related

Assembly - How to modify stack size?

I am a newbie in assembly programming and I am using push and pop instructions that use the memory stack.
So, What is the stack default size, How to modify it and What is the limit if its size?

Stack size depends upon a lot of factors.
It depends on where you start the stack, how much memory you have, what CPU you are using etc.
The CPU you are using is not called a "Windows CPU".
If you are specifying what CPU you are using, you specify the name of that CPU in detail and also, very important, the architecture of the CPU. In this case, you are probably using x86 architecture.
Here is a memory map for x86 architecture:
All addresses Before 0X100000 - Free
0x100000 - 0xc0000 - BIOS
0xc0000 - 0xa0000 - Video Memory
0xa0000 - 0x9fc00 - Extended BIOS data area
0x9fC00 - 0x7e00 - Free
0x7e00 - 0x7c00 - Boot loader
0x7c00 - 0x500 - Free
0x500 - 0x400 - BIOS data area
0x400 - 0x00 - Interupt vector table
In x86, stack information is held by two registers:
Base pointer (bp): Holds starting address of the stack
Stack pointer (sp): Holds the address in which next value will be stored
These registers have different names in different modes:
`Base pointer Stack pointer`
16 bit real mode: bp sp
32 bit protected mode: ebp esp
64 bit mode: rbp rsp
When you set up a stack, stack pointer and base pointer gets the same address.
Stack is setup in the address specified in base pointer register.
You can set up your stack anywhere in memory that is free and the stack grows downwards.
Each time you "push" something on to the stack, the value is stored in the address specified by stack pointer (which is same as base pointer at the beginning), and the stack pointer register is decremented.
Each time you "pop" something from the stack, the value stored in address specified by stack pointer register is stored in a register specified by the programmer and the stack pointer register is incremented.
In 16 bit real mode, you "push" and "pop" 16 bits. So each time you "push" or "pop", The stack pointer register is decremented or incremented by 0x02, since each address holds 8 bits..
In 32 bit protected mode, you "push" and "pop" 32 bits. So each time you "push" or "pop", The stack pointer register is decremented or incremented by 0x04, since each address holds 8 bits.
You will have to setup the stack in the right place dpending upon how many values you are going to be "pushing".
If you keep "pushing" your stack keeps growing downwards and at some point of time your stack may overwrite something. So be wise and set up the stack in a address in the memory where there is plenty of room for the stack to grow downwards.
For example:
If you setup your stack at 0x7c00, just below the bootloader and you "push" too many values, your stack might overwrite the BIOS data area at some point of time which causes a lot of errors.
You should have a basic idea of a stack and the size of it by now.

Whatever loaded ("the loader") your program into memory, and passed control to it, determines where in memory the stack is located, and how much space is available for the stack.
It does so by the simple artifice of loading the stack pointer, typically using a MOV ESP, ... instruction before calling/jumping to your code. Your program then uses the stack area supplied.
If your program uses too much, it will write beyond the end of the allocated stack area. This is a program bug, because the memory past the end may be allocated for some other purpose in the application. Writing on that other memory is likely to change the program behavior (e.g., "bug") when that memory gets used, and finding the cause of that bug is likely to be difficult (people assume that stacks don't damage program data and vice versa).
If your application wants to use a larger stack, generally all you have to do is allocate your own area, large enough for your purposes, and do a MOV ESP, ... yourself to set the stack to the chosen location. How you allocate an area depends on the execution environment in which you run. (You need to respect ESP conventions: must be a multiple of 4, should be initialized to the bottom of a cache line, often useful to initialize to the bottom of virtual memory page).
It is generally a good idea when "switching" stacks to save the old value of ESP provided by the loader, and restore ESP to that old value before returning control to the loader/caller/OS. Likewise, you should free the extended stack space no longer being used.
This scheme will work if you know the amount of stack space you need in advance. In practice, this is rather hard to "guess" (and may be impossible if your code has a recursive algorithm that nests deeply). So you can either pick a really huge number bigger than you need (ick) or you can use an organized approach to switch stacks when it is clear to the program that it needs more.
See How does a stackless language work? for more discussion.

Why memset function make the virtual memory so large

I have a process will do much lithography calculation, so I used mmap to alloc some memory for memory pool. When process need a large chunk of memory, I used mmap to alloc a chunk, after use it then put it in the memory pool, if the same chunk memory is needed again in the process, get it from the pool directly, not used memory map again.(not alloc all the need memory and put it in the pool at the beginning of the process). Between mmaps function, there are some memory malloc not used mmap, such as malloc() or new().
Now the question is:
If I used memset() to set all the chunk data to ZERO before putting it in the memory pool, the process will use too much virtual memory as following, format is "mmap(size)=virtual address":
mmap(4198400)=0x2aaab4007000
mmap(4198400)=0x2aaab940c000
mmap(8392704)=0x2aaabd80f000
mmap(8392704)=0x2aaad6883000
mmap(67112960)=0x2aaad7084000
mmap(8392704)=0x2aaadb085000
mmap(2101248)=0x2aaadb886000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae028b000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae108f000
mmap(8392704)=0x2aaae1890000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaeca9c000
mmap(8392704)=0x2aaaec29b000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
mmap(8392704)=0x2aaafd6a7000
mmap(2101248)=0x2aacc5f8c000
The mmap last - first = 0x2aacc5f8c000 - 0x2aaab4007000 = 8.28G
But if I don't call memset before put in the memory pool:
mmap(4198400)=0x2aaab4007000
mmap(8392704)=0x2aaab940c000
mmap(8392704)=0x2aaad2480000
mmap(67112960)=0x2aaad2c81000
mmap(2101248)=0x2aaad6c82000
mmap(4198400)=0x2aaad6e83000
mmap(8392704)=0x2aaadb288000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae0a8c000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae1890000
mmap(8392704)=0x2aaae108f000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaec29b000
mmap(8392704)=0x2aaaec49c000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
The mmap last - first = 0x2aaaed49e000 - 0x2aaab4007000= 916M
So the first process will "out of memory" and killed.
In the process, the mmap memory chunk will not be fully used or not even used although it is alloced, I mean, for example, before calibration, the process mmap 67112960(64M), it will not used(write or read data in this memory region) or just used the first 2M bytes, then put in the memory pool.
I know the mmap just return virtual address, the physical memory used delay alloc, it will be alloced when read or write on these address.
But what made me confused is that, why the virtual address increase so much? I used the centos 5.3, kernel version is 2.6.18, I tried this process both on libhoard and the GLIBC(ptmalloc), both with the same behavior.
Do anyone meet the same issue before, what is the possible root cause?
Thanks.

VMAs (virtual memory areas, AKA memory mappings) do not need to be contiguous. Your first example uses ~256 Mb, the second ~246 Mb.
Common malloc() implementations use mmap() automatically for large allocations (usually larger than 64Kb), freeing the corresponding chunks with munmap(). So you do not need to mmap() manually for large allocations, your malloc() library will take care of that.
When mmap()ing, the kernel returns a COW copy of a special zero page, so it doesn't allocate memory until it's written to. Your zeroing is causing memory to be really allocated, better just return it to the allocator, and request a new memory chunk when you need it.
Conclusion: don't write your own memory management unless the system one has proven inadecuate for your needs, and then use your own memory management only when you have proved it noticeably better for your needs with real life load.

How to view the current stack size of a windows thread when it overflows

I have a process that is overflowing the stack when run from within an IIS process, but works fine when run on its own. I suspect that on its own it gets the default 1MB stack, but within IIS gets somewhat less.
To avoid messing with the IIS worker processes I am using a sub-thread within the IIS process to allocate a bigger stack, but I suspect the stack size argument to Thread creation is being ignored as per the documentation (http://msdn.microsoft.com/en-us/library/ms149581.aspx)
When the stack overflows I can view the halted process in the debugger, but how do I find out how big a stack was actually allocated?

The answer is as follows.
In debugger, add a watch on the pseudo register TIB (http://msdn.microsoft.com/en-us/library/aa232399(v=vs.60).aspx )
Now take this value and display that address in a memory window. Subtract the third 4 byte word from the second 4 byte word, remembering to use little endian byte ordering.
http://en.wikipedia.org/wiki/Win32_Thread_Information_Block

Allocating a large DMA buffer

I want to allocate a large DMA buffer, about 40 MB in size. When I use dma_alloc_coherent(), it fails and what I see is:
------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2106 __alloc_pages_nodemask+0x1dc/0x788()
Modules linked in:
[<8004799c>] (unwind_backtrace+0x0/0xf8) from [<80078ae4>] (warn_slowpath_common+0x4c/0x64)
[<80078ae4>] (warn_slowpath_common+0x4c/0x64) from [<80078b18>] (warn_slowpath_null+0x1c/0x24)
[<80078b18>] (warn_slowpath_null+0x1c/0x24) from [<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788)
[<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788) from [<8004a880>] (__dma_alloc+0xa4/0x2fc)
[<8004a880>] (__dma_alloc+0xa4/0x2fc) from [<8004b0b4>] (dma_alloc_coherent+0x54/0x60)
[<8004b0b4>] (dma_alloc_coherent+0x54/0x60) from [<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec)
[<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec) from [<80123b78>] (do_vfs_ioctl+0x80/0x54c)
[<80123b78>] (do_vfs_ioctl+0x80/0x54c) from [<8012407c>] (sys_ioctl+0x38/0x5c)
[<8012407c>] (sys_ioctl+0x38/0x5c) from [<80041f80>] (ret_fast_syscall+0x0/0x30)
---[ end trace 4e0c10ffc7ffc0d8 ]---
I've tried different values and it looks like dma_alloc_coherent() can't allocate more than 2^25 bytes (32 MB).
How can such a large DMA buffer can be allocated?

After the system has booted up dma_alloc_coherent() is not necessarily reliable for large allocations. This is simply because non-moveable pages quickly fill up your physical memory making large contiguous ranges rare. This has been a problem for a long time.
Conveniently a recent patch-set may help you out, this is the contiguous memory allocator which appeared in kernel 3.5. If you're using a kernel with this then you should be able to pass cma=64M on your kernel command line and that much memory will be reserved (only moveable pages will be placed there). When you subsequently ask for your 40M allocation it should reliably succeed. Simples!
For more information check out this LWN article:
https://lwn.net/Articles/486301/

thread stack size on Windows (Visual C++)

Is there a call to determine the stack size of a running thread? I've been looking in MSDN thread functions documentation, and can't seem to find one.

Whilst there isn't an API to find out stack size directly, contiguous virtual address space must be reserved up to the maximum stack size - it's just that a lot of that space isn't committed yet. You can take advantage of this and make two calls to VirtualQuery.
For the first call, pass it the address of any value on the stack to get the base address and size, in bytes, of the committed stack space. On an x86 machine where the stack grows downwards, subtract the size from the base address and VirtualQuery again: this will give you the size of the space reserved for the stack (assuming you're not precisely on the limit of stack size at the time). Summing the two naturally gives you the total stack size.

You can get the current committed size from the Top and Bottom in the TEB. You can get the process initial reserve and commit sizes from the PE header. But you cannot retrieve the actual sizes passed to CreateThread, nor is there any API to get the remaining size of reserved nor committed from current stack, see Thread Stack Size.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio