Stackoverflow error in Fortran with OpenMP [duplicate]

Stackoverflow error in Fortran with OpenMP [duplicate] - windows

main program:
program main
use omp_lib
use my_module
implicit none
integer, parameter :: nmax = 202000
real(8) :: e_in(nmax) = 0.D0
integer i
call omp_set_num_threads(2)
!$omp parallel default(firstprivate)
!$omp do
do i=1,2
print *, e_in(i)
print *, eTDSE(i)
end do
!$omp end do
!$omp end parallel
end program main
module:
module my_module
implicit none
integer, parameter, private :: ntmax = 202000
double complex :: eTDSE(ntmax) = (0.D0,0.D0)
!$omp threadprivate(eTDSE)
end module my_module
compiled using:
ifort -openmp main.f90 my_module.f90
It gives the Segmentation fault when execution. If remove one of the print commands in the main program, it runs fine. Also if remove the omp function and compile without -openmp option, it runs fine too.

The most probable cause for this behaviour is that your stack size limit is too small (for whatever reason). Since e_in is private to each OpenMP thread, one copy per thread is allocated on the thread stack (even if you have specified -heap-arrays!). 202000 elements of REAL(KIND=8) take 1616 kB (or 1579 KiB).
The stack size limit can be controlled by several mechanisms:
On standard Unix system shells the amount of stack size is controlled by ulimit -s <stacksize in KiB>. This is also the stack size limit for the main OpenMP thread. The value of this limit is also used by the POSIX threads (pthreads) library as the default thread stack size when creating new threads.
OpenMP supports control over the stack size limit of all additional threads via the environment variable OMP_STACKSIZE. Its value is a number with an optional suffix k/K for KiB, m/M ffor MiB, or g/G for GiB. This value does not affect the stack size of the main thread.
The GNU OpenMP run-time (libgomp) recognises the non-standard environment variable GOMP_STACKSIZE. If set it overrides the value of OMP_STACKSIZE.
The Intel OpenMP run-time recognises the non-standard environment variable KMP_STACKSIZE. If set it overrides the value of OMP_STACKSIZE and also overrides the value of GOMP_STACKSIZE if the compatibility OpenMP run-time is used (which is the default as currently the only available Intel OpenMP run-time library is the compat one).
If none of the *_STACKSIZE variables are set, the default for Intel OpenMP run-time is 2m on 32-bit architectures and 4m on 64-bit ones.
On Windows, the stack size of the main thread is part of the PE header and is embedded there by the linker. If using Microsoft's LINK to do the linking, the size is specified using the /STACK:reserve[,commit]. The reserve argument specifies the maximum stack size in bytes while the optional commit argument specifies the initial commit size. Both can be specified as hexadecimal values using the 0x prefix. If re-linking the executable is not an option, the stack size could be modified by editing the PE header with EDITBIN. It takes the same stack-related argument as the linker. Programs compiled with MSVC's whole program optimisation enabled (/GL) cannot be edited.
The GNU linker for Win32 targets supports setting the stack size via the --stack argument. To pass the option directly from GCC, the -Wl,--stack,<size in bytes> can be used.
Note that thread stacks are actually allocated with the size set by *_STACKSIZE (or to the default value), unlike the stack of the main thread, which starts small and then grows on demand up to the set limit. So don't set *_STACKSIZE to an arbitrary large value otherwise you may hit the process virtual memory size limit.
Here are some examples:
$ ifort -openmp my_module.f90 main.f90
Set the main stack size limit to 1 MiB (the additional OpenMP thread would get 4 MiB as per default):
$ ulimit -s 1024
$ ./a.out
zsh: segmentation fault (core dumped) ./a.out
Set the main stack size limit to 1700 KiB:
$ ulimit -s 1700
$ ./a.out
0.000000000000000E+000
(0.000000000000000E+000,0.000000000000000E+000)
0.000000000000000E+000
(0.000000000000000E+000,0.000000000000000E+000)
Set the main stack size limit to 2 MiB and the stack size of the additional thread to 1 MiB:
$ ulimit -s 2048
$ KMP_STACKSIZE=1m ./a.out
zsh: segmentation fault (core dumped) KMP_STACKSIZE=1m ./a.out
On most Unix systems the stack size limit of the main thread is set by PAM or other login mechanism (see /etc/security/limits.conf). The default on Scientific Linux 6.3 is 10 MiB.
Another possible scenario that can lead to an error is if the virtual address space limit is set too low. For example, if the virtual address space limit is 1 GiB and the thread stack size limit is set to 512 MiB, then the OpenMP run-time would try to allocate 512 MiB for each additional thread. With two threads one would have 1 GiB for the stacks only, and when the space for code, shared libraries, heap, etc. is added up, the virtual memory size would grow beyond 1 GiB and an error would occur:
Set the virtual address space limit to 1 GiB and run with two additional threads with 512 MiB stacks (I have commented out the call to omp_set_num_threads()):
$ ulimit -v 1048576
$ KMP_STACKSIZE=512m OMP_NUM_THREADS=3 ./a.out
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
forrtl: error (76): Abort trap signal
... trace omitted ...
zsh: abort (core dumped) OMP_NUM_THREADS=3 KMP_STACKSIZE=512m ./a.out
In this case the OpenMP run-time library would fail to create a new thread and would notify you before it aborts program termination.

Segmentation fault is due to stack memory limit when using OpenMP. Using the solutions from the previous answer did not solve the problem for me on my Windows OS. Using memory allocation into heap rather than stack memory seems to work:
integer, parameter :: nmax = 202000
real(dp), dimension(:), allocatable :: e_in
integer i
allocate(e_in(nmax))
e_in = 0
! rest of code
deallocate(e_in)
Plus this would not involve changing any default environment parameters.
Acknowledgement to and refer to ohm314's solution here: large array using heap memory allocation

Related

GCC: how -pie affects address of file scope variable?

Consider this code:
#include <stdio.h>
int gprs[32];
int main(void)
{
printf("%p\n", (void*)&gprs);
}
being compiled with -pie (seems to be the default) produces:
0x55c183951040
while being compiled with -no-pie produces:
0x404060
Can someone explain how -pie affects address of file scope variable?
Note: Clang seems to have -no-pie by default.

Can someone explain how -pie affects address of file scope variable?
Using -pie, the operating system can load the executable file to any address in memory. Under Windows, this is done using a "base relocation table"; under Linux this is done using "position-independent code".
In this case, many modern OSs load an executable file to any (random) address in memory for security reasons (because it is harder to write a virus accessing the variable gprs if its address is not known).
This means that the difference between the addresses of the (static or global) variables a and b in the following example:
printf("%p, %p\n", &a, &b);
... should be constant but the address of a (and b) may be different every time you run the program.
Using -no-pie, "position-dependent code" is generated under both OSs and no "base relocation table" is generated under Windows.
This means that the executable file can only be loaded into a fixed memory address. And for this reason, the address of a static or global variable (but not necessarily of a non-static local variable) should not change when you run the program multiple times.

Do dynamic libraries have the same virtual memory address in all programs?

When a library is dynamically linked to a program does it have the same address in that program as in any other program?
I my head I imagined each process gets the whole of the address space and then everything in that process (inc. dynamic libraries that are already in memory) gets mapped to semi-random parts of it because of ASLR.
But I did a short experiment that seems to imply that the address of libraries that are in memory are fixed across different processes and thus reusable across programs? Is that correct?
I wrote two short c programs which used the "sleep" function. In one I printed out the address of the sleep function and in the second I assigned a function pointer to that address. I ran them both and the sleep function worked in both.
#include <stdio.h>
#include <unistd.h>
int main()
{
while(1)
{
printf("%s\n", &"hi");
sleep(2);
printf("pointer to sleep: %p\n", sleep);
}
}
#include <stdio.h>
#include <unistd.h>
#define sleepagain ((void (*)(int))0x7fff7652e669) //addr of sleep from first program
int main()
{
while(1)
{
printf("%s\n", &"test");
sleepagain(2);
}
}
I wasn't sure what this would show but what it actually showed was a) the address was the same every time I ran the first program and b) that sleep still functioned when I ran the second.
I think I understand how this works but I am curious if it has to work the way it does and what are the reasons behind it?
Just to reference the answer I got already when I took a look with otool -IvV I got:
a.out:
Indirect symbols for (__TEXT,__stubs) 2 entries
address index name
0x0000000100000f62 2 _printf
0x0000000100000f68 3 _sleep
Indirect symbols for (__DATA,__nl_symbol_ptr) 2 entries
address index name
0x0000000100001000 4 dyld_stub_binder
0x0000000100001008 ABSOLUTE
Indirect symbols for (__DATA,__got) 1 entries
address index name
0x0000000100001010 3 _sleep
Indirect symbols for (__DATA,__la_symbol_ptr) 2 entries
address index name
0x0000000100001018 2 _printf
0x0000000100001020 3 _sleep
Which is also what the indirect address was in lldb. The address was the address of sleep itself:
Process 11209 launched: 'stuff/a.out' (x86_64)
hi
Process 11209 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00007fff7652e669 libsystem_c.dylib`sleep
libsystem_c.dylib`sleep:
-> 0x7fff7652e669 <+0>: push rbp
0x7fff7652e66a <+1>: mov rbp, rsp
0x7fff7652e66d <+4>: push rbx
0x7fff7652e66e <+5>: sub rsp, 0x28
Target 0: (a.out) stopped.
For some additional info:
$ otool -hv a.out
Mach header
magic cputype cpusubtype caps filetype ncmds sizeofcmds flags
MH_MAGIC_64 X86_64 ALL LIB64 EXECUTE 15 1296 NOUNDEFS DYLDLINK TWOLEVEL PIE

On macOS, many system libraries are part of the dyld shared cache. There's a machine-wide mapping. So, those libraries end up at the same address in all processes of the same architecture (32- or 64-bit).
The location of the dyld shared cache is randomized at system boot. So, library addresses will be the same from process to process until you reboot.
Not all system libraries are part of the cache, only the ones that Apple deems to be commonly loaded.
Libraries of your own or from third parties will be loaded at random locations each time they're loaded, assuming they are position-independent.
Try looking at the output from vmmap -v <pid>. Look for the line with "machine-wide VM submap" and those that follow.

GCC for ARM -- ELF output file segment misplaced

Edited to add: I have now cross-posted this to the GNU ARM Embedded Toolchain site, as I am fairly certain that it's a linker bug.
Also, I have noticed that it seems to happen when the first program segment fits into the first page in the ELF file (i.e. its starting offset within its page is >= the number of bytes in the ELF header). In this case the segment erroneously gets extended downwards to the beginning of the file. This would explain why the problem disappears if the in-page offset of the start address is reduced from 0x80 to 0x40.
I am implementing a stand-alone OS for ARM Cortex M0, and I have a weird problem with the linker. Here is my source file OS.c, stripped down to illustrate the problem:
int EntryPoint (void) { return 99 ; }
And here is my linker script file OS.ld, simply assigning all code to the region starting at 0x10080:
MEMORY
{
NVM (rx) : ORIGIN = 0x10080, LENGTH = 0x1000
}
SECTIONS
{
.text 0x10080 :
{
OS.o (.text)
} > NVM
}
I compile and link it:
arm-none-eabi-gcc.exe -march=armv6-m -mthumb -c OS.c
arm-none-eabi-gcc.exe -oOS.elf -Xlinker --script=OS.ld OS.o -nostartfiles -nodefaultlibs
And now when I list the program segments with readelf OS.elf -l, I get:
Elf file type is EXEC (Executable file)
Entry point 0x10080
There are 1 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x00010000 0x00010000 0x0008c 0x0008c R E 0x10000
According to this, the one and only program segment starts at offset 0x000000 in the ELF output file, which is crazy: that region contains ELF header info irrelevant to the OS. And the physical start address is 0x00010000, which doesn't exist in my hardware.
But the weird thing is that if I change both instances of 0x10080 to 0x10040 in the linker script file, it works! I get:
Elf file type is EXEC (Executable file)
Entry point 0x10040
There are 1 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x010040 0x00010040 0x00010040 0x0000c 0x0000c R E 0x10000
Now the program segment is in the right place in the file, and has length 0x0000c instead of 0x0008c. Unfortunately address 0x00010040 doesn't exist in my hardware either, so this is not a solution.
Is this a bug in the GCC ARM compiler? Running it with --version gives:
arm-none-eabi-gcc.exe (GNU Tools for Arm Embedded Processors 7-2018-q2-update) 7.3.1 20180622 (release) [ARM/embedded-7-branch revision 261907]

what you see might not be what you expect, but is nevertheless correct, IMHO.
ELF was created for System V. An OS that supports virtual memory and mmap() (a system call to map the contents of a file into memory).
You are looking at the ELF program header (not the section headers, see below). The program header is information to a (virtual memory capable) operation system's ELF loader about where it is supposed to mmap() the (complete) ELF file into virtual memory it prepared as process image. This OS would then just allocate one (or more) page(s) somewhere, call that (virtual) 0x10000 (for that process), map the file and jump to 0x10080 (the entry point).
For your second example, this would not work as you specified the (virtual) start address before the end of the ELF file's header (ELF header + program header + section headers), sot it cannot just map the file to a page boundary, making it more complicated (or even impossible) to the OS to do it's mmap() trick.
For your bare metal OS (that most likely doesn't support virtual memory, at least not on startup), the ELF program header's information is probably completely irrelevant.
You should probably rather look at the section headers, instead. They describe physical memory.

I had very similar issue with GNU linker for ARM Cortex-M platform (GNU ld (Atmel build: 508) 2.28.0.20170620). I had bootloader and application projects where linker from application was placing ELF headers in flash location where bootloader code is. I'm not an expert but this modification tricked my linker not to put ELF header in memory space before entry point address (will try to show on your example):
redefine NVM space by including first 0x80 bytes
NVM (rx) : ORIGIN = 0x10000, LENGTH = 0x1000+0x80
in sections part add that offset:
SECTIONS
{
.text :
{
. += 0x80;
OS.o (.text)
} > NVM
}
I'm not sure if this can work in your case but perhaps can be used as a hint for others.

GDB: how to call functions with modified parameters during debugging

Consider the following trivial Fortran program that adds two integers via a subroutine and prints the result:
PROGRAM MAIN
INTEGER I, J, SUM
I = 1
J = 1
CALL ADD(I, J, SUM)
WRITE(*,*) SUM
END
SUBROUTINE ADD(I, J, SUM)
INTEGER I, J, SUM
SUM = I + J
END
Compiling via gfortran -g -O0 gdb-mwe.f -o gdb-mwe and running in the GNU Debugger, I want to call ADD from the debugger with modified input arguments right before the write output. Here's what happens:
Reading symbols from gdb-mwe...done.
(gdb) break 10
Breakpoint 1 at 0x4007dd: file gdb-mwe.f, line 10.
(gdb) r
Starting program: /home/username/Documents/Fortran/gdb-mwe
Breakpoint 1, MAIN__ () at gdb-mwe.f:10
10 WRITE(*,*) SUM
(gdb) p j = j+1
$2 = 2
(gdb) call add(i,j,sum)
Program received signal SIGSEGV, Segmentation fault.
0x000000000040079a in add (
i=<error reading variable: Cannot access memory at address 0x1>,
j=<error reading variable: Cannot access memory at address 0x2>,
sum=<error reading variable: Cannot access memory at address 0x2>)
at gdb-mwe.f:18
18 SUM = I + J
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(add) will be abandoned.
When the function is done executing, GDB will silently stop.
How do I get this right?

As pointed out in the comments, the open bugs in gdb prevents doing this currently.
A possible workaround would be to debug a 32-bit version of the code. This results in some differences, but for simple debugging tasks it may be sufficient.
For intel fortran compilers, this requires only adding the -m32 flag (provided 32-bit libraries have been installed).
For gfortran it seems that installing the multilib package first is necessary, as show in this questions.

Is there a method/function to get the code size of a C program compiled using GCC compiler? (may vary when some optimization technique is applied)

Can I measure the code size with the help of an fseek() function and store it to a shell variable?
Is it possible to extract the code size, compilation time and execution time using milepost gcc or a GNU Profiler tool? If yes, how to store them into shell variables?
Since my aim is to find the best set of optimization technique upon the basis of the compilation time, execution time and code size, I will be expecting some function that can return these parameters.
MyPgm=/root/Project/Programs/test.c
gcc -Wall -o1 -fauto-inc-dec $MyPgm -o output
time -f "%e" -o Output.log ./output
while read line;
do
echo -e "$line";
Val=$line
done<Output.log
This will store the execution time to the variable Val. Similarly, I want to get the values of code size as well as compilation time.
I will prefer something that I can do to accomplish this, without using an external program!

for code size on linux, you can use size command on terminal.
$size file-name.out
it will give size of different sections. use text section for code size. you can use data and bss if you want to consider global data size as well.

You can use the size(1) command http://www.linuxmanpages.com/man1/size.1.php
Or open the ELF file, walk over section headers and sum the sizes of all the section with type SHT_PROGBITS and the SHF_EXECINSTR flag set.

On non-Linux / non-GNU-utils systems (where you may have neither GNU size nor readelf), the nm program can be used to dump symbol information (including sizes) from object files (libraries / executables). The syntax is slightly system-dependent:
OpenGroup manpage for nm (the "portable subset")
Linux/BSD manpage for nm (GNU version)
Solaris manpage for nm
AIX manpage for nm
nm usage on HP/UX (this says "PA-RISC" but the utility is present / usable on Itanium)
Windows: Doesn't have nm as such, but see: Microsoft equivalent of the nm command
Unfortunately, while the utility is available almost everywhere, its output format is not as portable as could be, so some system-specific scripting is necessary.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio