fortran netcdf close parallel deadlock - parallel-processing

I am adapting a fortran mpi program from sequential to parallel writing for certain types of files. It uses netcdf 4.3.3.1/hdf5 1.8.9 parallel. I use intel compiler version 14.0.3.174.
When all reads/writes are done it is time to close the files. At this point, the simulations does not continue anymore. So all calls are waiting. When I check the call stack from each processor I can see the master root is different compared to the rest of them.
Mpi Master processor call stack:
__sched_yield, FP=7ffc6aa978b0
opal_progress, FP=7ffc6aa978d0
ompi_request_default_wait_all, FP=7ffc6aa97940
ompi_coll_tuned_sendrecv_actual, FP=7ffc6aa979e0
ompi_coll_tuned_barrier_intra_recursivedoubling, FP=7ffc6aa97a40
PMPI_Barrier, FP=7ffc6aa97a60
H5AC_rsp__dist_md_write__flush, FP=7ffc6aa97af0
H5AC_flush, FP=7ffc6aa97b20
H5F_flush, FP=7ffc6aa97b50
H5F_flush_mounts, FP=7ffc6aa97b80
H5Fflush, FP=7ffc6aa97ba0
NC4_close, FP=7ffc6aa97be0
nc_close, FP=7ffc6aa97c00
restclo, FP=7ffc6aa98660
driver, FP=7ffc6aaa5ef0
main, FP=7ffc6aaa5f90
__libc_start_main, FP=7ffc6aaa6050
_start,
Remaining processors call stack:
__sched_yield, FP=7fffe330cdd0
opal_progress, FP=7fffe330cdf0
ompi_request_default_wait, FP=7fffe330ce50
ompi_coll_tuned_bcast_intra_generic, FP=7fffe330cf30
ompi_coll_tuned_bcast_intra_binomial, FP=7fffe330cf90
ompi_coll_tuned_bcast_intra_dec_fixed, FP=7fffe330cfb0
mca_coll_sync_bcast, FP=7fffe330cff0
PMPI_Bcast, FP=7fffe330d030
mca_io_romio_dist_MPI_File_set_size, FP=7fffe330d080
PMPI_File_set_size, FP=7fffe330d0a0
H5FD_mpio_truncate, FP=7fffe330d0c0
H5FD_truncate, FP=7fffe330d0f0
H5F_dest, FP=7fffe330d110
H5F_try_close, FP=7fffe330d340
H5F_close, FP=7fffe330d360
H5I_dec_ref, FP=7fffe330d370
H5I_dec_app_ref, FP=7fffe330d380
H5Fclose, FP=7fffe330d3a0
NC4_close, FP=7fffe330d3e0
nc_close, FP=7fffe330d400
RESTCOM`restclo, FP=7fffe330de60
driver, FP=7fffe331b6f0
main, FP=7fffe331b7f0
__libc_start_main, FP=7fffe331b8b0
_start,
I do realize one call stack contain bcast an the other a barrier. This might cause a deadlock. Yet I do not foresee how to continue from here. If a mpi call is not properly done (e.g only called in 1 proc), I would expect an error message instead of such behaviour.
Update: the source code is around 100k lines.
The files are opened this way:
cmode = ior(NF90_NOCLOBBER,NF90_NETCDF4)
cmode = ior(cmode, NF90_MPIIO)
CALL ipslnc( NF90_CREATE(fname,cmode=cmode,ncid=ncfid, comm=MPI_COMM, info=MPI_INFO))
And closed as:
iret = NF90_CLOSE(ncfid)

It turns out when writting NF90_PUT_ATT, the root processor has a different value compared to the others. Once solved, the program runs as expected.

Related

why need linker script and startup code?

I've read this tutorial
I could follow the guide and run the code. but I have questions.
1) Why do we need both load-address and run-time address. As I understand it is because we have put .data at flash too; so why we don't run app there, but need start-up code to copy it into RAM?
http://www.bravegnu.org/gnu-eprog/c-startup.html
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
arm-none-eabi-gcc -nostdlib -o sum_array.elf sum_array.c
Many thanks
Your first question was answered in the guide.
When you load a program on an operating system your .data section, basically non-zero globals, are loaded from the "binary" into the right offset in memory for you, so that when your program starts those memory locations that represent your variables have those values.
unsigned int x=5;
unsigned int y;
As a C programmer you write the above code and you expect x to be 5 when you first start using it yes? Well, if are booting from flash, bare metal, you dont have an operating system to copy that value into ram for you, somebody has to do it. Further all of the .data stuff has to be in flash, that number 5 has to be somewhere in flash so that it can be copied to ram. So you need a flash address for it and a ram address for it. Two addresses for the same thing.
And that begins to answer your second question, for every line of C code you write you assume things like for example that any function can call any other function. You would like to be able to call functions yes? And you would like to be able to have local variables, and you would like the variable x above to be 5 and you might assume that y will be zero, although, thankfully, compilers are starting to warn about that. The startup code at a minimum for generic C sets up the stack pointer, which allows you to call other functions and have local variables and have functions more than one or two lines of code long, it zeros the .bss so that the y variable above is zero and it copies the value 5 over to ram so that x is ready to go when the code your entry point C function is run.
If you dont have an operating system then you have to have code to do this, and yes, there are many many many sandboxes and toolchains that are setup for various platforms that already have the startup and linker script so that you can just
gcc -O myprog.elf myprog.c
Now that doesnt mean you can make system calls without a...system...printf, fopen, etc. But if you download one of these toolchains it does mean that you dont actually have to write the linker script nor the bootstrap.
But it is still valuable information, note that the startup code and linker script are required for operating system based programs too, it is just that native compilers for your operating system assume you are going to mostly write programs for that operating system, and as a result they provide a linker script and startup code in that toolchain.
1) The .data section contains variables. Variables are, well, variable -- they change at run time. The variables need to be in RAM so that they can be easily changed at run time. Flash, unlike RAM, is not easily changed at run time. The flash contains the initial values of the variables in the .data section. The startup code copies the .data section from flash to RAM to initialize the run-time variables in RAM.
2) Linker-script: The object code created by your compiler has not been located into the microcontroller's memory map. This is the job of the linker and that is why you need a linker script. The linker script is input to the linker and provides some instructions on the location and extent of the system's memory.
Startup code: Your C program that begins at main does not run in a vacuum but makes some assumptions about the environment. For example, it assumes that the initialized variables are already initialized before main executes. The startup code is necessary to put in place all the things that are assumed to be in place when main executes (i.e., the "run-time environment"). The stack pointer is another example of something that gets initialized in the startup code, before main executes. And if you are using C++ then the constructors of static objects are called from the startup code, before main executes.
1) Why do we need both load-address and run-time address.
While it is in most cases possible to run code from memory mapped ROM, often code will execute faster from RAM. In some cases also there may be a much larger RAM that ROM and application code may compressed in ROM, so the executable code may not simply be copied from ROM also decompressed - allowing a much larger application than the available ROM.
In situations where the code is stored on non-memory mapped mass-storage media such as NAND flash, it cannot be executed directly in any case and must be loaded into RAM by some sort of bootloader.
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
The linker script defines the memory layout of you target and application. Since this tutorial is for bare-metal programming, there is no OS to handle that for you. Similarly the start-up code is required to at least set an initial stack-pointer, initialise static data, and jump to main. On an embedded system it is also necessary to initialise various hardware such as the PLL, memory controllers etc.

Why is ExitProcess necessary under Win32 when you can use a RET?

I've noticed that many assembly language examples built using straight Win32 calls (no C Runtime dependency) illustrate the use of an explicit call to ExitProcess() to end the program at the end of the entry-point code. I'm not talking about using ExitProcess() to exit at some nested location within the program. There are surprisingly fewer examples where the entry-point code simply exits with a RET instruction. One example that comes to mind is the famous TinyPE, where the program variations exit with a RET instruction, because a RET instruction is a single byte. Using either ExitProcess() or a RET both seem to do the job.
A RET from an executable's entry-point returns the value of EAX back to the Windows loader in KERNEL32, which ultimately propagates the exit code back to NtTerminateProcess(), at least on Windows 7. On Windows XP, I think I remember seeing that ExitProcess() was even called directly at the end of the thread-cleanup chain.
Since there are many respected optimizations in assembly language that are chosen purely on generating smaller code, I wonder why more code floating around prefers the explicit call to ExitProcess() rather than RET. Is this habit or is there another reason?
In its purest sense, wouldn't a RET instruction be preferable to a direct call to ExitProcess()? A direct call to ExitProcess() seems akin to exiting your program by killing it from the task manager as this short-circuits the normal flow of returning back to where the Windows loader called your entry-point and thus skipping various thread cleanup operations?
I can't seem to locate any information specific to this issue, so I was hoping someone could shed some light on the topic.
If your main function is being called from the C runtime library, then exiting will result in a call to ExitProcess() and the process will exit.
If your main function is being called directly by Windows, as may well be the case with assembly code, then exiting will only cause the thread to exit. The process will exit if and only if there are no other threads. That's a problem nowadays, because even if you didn't create any threads, Windows may have created one or more on your behalf.
As far as I know this behaviour is not properly documented, but is described in Raymond Chen's blog post, "If you return from the main thread, does the process exit?".
(I have also tested this myself on both Windows 7 and Windows 10 and confirmed that they behaved as Raymond describes.)
Addendum: in recent versions of Windows 10, the process loader is itself multi-threaded, so there will always be additional threads present when the process first starts.

implementing blocking syscalls in Linux

I would like to understand how implementing blocking I/O syscalls is different from non-blocking? Googling it didn't help much, any links or references would be greatly appreciated.
Thanks.
http://faculty.salina.k-state.edu/tim/ossg/Device/blocking.html
Blocking syscall will put the task (calling thread) to sleep (block it from running on CPU), and syscall will return only after event (or timeout). Non-blocking syscall will not block thread, it just checks in-kernel states and immediately returns.
More detailed description: http://www.makelinux.net/ldd3/chp-6-sect-2
one important issue: how does a driver respond if it cannot immediately satisfy the request? A call to read may come when no data is available, but more is expected in the future. Or a process could attempt to write, but your device is not ready to accept the data, because your output buffer is full. The calling process usually does not care about such issues; the programmer simply expects to call read or write and have the call return after the necessary work has been done. So, in such cases, your driver should (by default) block the process, putting it to sleep until the request can proceed. ....
There are several forms of wait_event kernel functions to block the caller thread, check include/linux/wait.h; thread can be waked up by different ways, for example with wake_up/wake_up_interruptible.

Boost Fixed_Managed_Shared_Memory Hangs

I am attempting to do a relatively simple task using boost interprocess semaphore and shared memory. I want to fixed buffer of data shared between two processes, where the first process is a producer and the second process is a consumer. The buffer will consist of 3 parts. The first part will be a boost::interprocess::semaphore, used to coordinate the producer/consumer. The second part will just be an integer, so the consumer knows how many items are on the buffer. The third part will be the actual array of items. I have a very basic implementation started, but the processes hang when attempting to access open the shared memory, and i'm not certain why. I am doing this on 64-bit Centos 6.5 with gcc/g++ 4.8.2. I should note also that the machine has two CPUs, and using process affinity I am ensuring that the producer and the consumer both run on separate CPUs.
The code is at http://pastie.org/9693362. I am experiencing the following issues; with the code as is, the consumer and producer both hang at line 3 (fixed_managed_shared_memory shm(open_only, "SharedMem" );). If i comment that line out, then both end up terminating (no error is caught however) at line 26 (the post/wait on the semaphore). This makes me think that somehow the memory isnt being shared, b/c when i print out the addresses, they seem to be properly formed (as in the offsets seem to be correct), and they are properly passed between processes. Is there something I'm missing on how to properly set this up?

Cuda kernel launch failure

I am trying to call two kernels as shown below
for (t=0; t<=time_total; t++)
{
//kernel calls
kernel1<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
kernel2<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
}
And the structure of the second kernel is
var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**
Now when I execute this code, checkCudaError does not report anything and the code is executed giving some output but visual studio gives the following exception
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
And when I check on Nsight it says kernel 2 is having the following error
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Now the problem is that var array in kernel 2 is giving some of the rows correct some are copies of other row values and some are garbage.
Also when I do this
var[index+0]=3
var[index+1]=3
var[index+2]=3
All the values of var are set to 3
A few side notes:
cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
The fact that nsight is reporting an error on the 2nd kernel launch, but your error checking code is not, leads me to believe your error checking code is broken.
Now, regarding your issue, out of resources is frequently due to a code requesting too many registers (too many registers per thread times the number of threads per threadblock requested.) Try re-compiling your code specifying -Xptxas -v to get verbose output, and then recompiling again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
If this "fixes" your problem, you may then want to consider the following:
See if there is a way you can re-order or restructure your code to reduce the register pressure
If not, then adjust your maxrregcount value upwards to approximately the highest value that will allow your code to compile and run according to the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different levels of this setting, as it can affect occupancy. Usually if you have it set to the highest value that will compile and run, then you are limiting yourself to one threadblock per SM at execution time. This may be OK, or there may be a lower setting that is better, allowing two threadblocks per SM residency, and possibly higher performance. Only benchmarking your code will tell.

Resources