Boost Fixed_Managed_Shared_Memory Hangs

I am attempting to do a relatively simple task using a Boost interprocess semaphore and shared memory. I want a fixed buffer of data shared between two processes, where the first process is a producer and the second is a consumer. The buffer will consist of three parts. The first part will be a boost::interprocess::semaphore, used to coordinate the producer and the consumer. The second part will just be an integer, so the consumer knows how many items are in the buffer. The third part will be the actual array of items. I have a very basic implementation started, but the processes hang when attempting to open the shared memory, and I'm not certain why. I am doing this on 64-bit CentOS 6.5 with gcc/g++ 4.8.2. I should also note that the machine has two CPUs, and I am using process affinity to ensure that the producer and the consumer run on separate CPUs.
The code is at http://pastie.org/9693362. I am experiencing the following issues: with the code as is, the consumer and producer both hang at line 3 (fixed_managed_shared_memory shm(open_only, "SharedMem" );). If I comment that line out, then both end up terminating (no error is caught, however) at line 26 (the post/wait on the semaphore). This makes me think that the memory somehow isn't being shared, even though the addresses I print out seem to be properly formed (the offsets look correct) and are properly passed between the processes. Is there something I'm missing about how to set this up properly?

Related

fortran netcdf close parallel deadlock

I am adapting a Fortran MPI program from sequential to parallel writing for certain types of files. It uses netcdf 4.3.3.1/hdf5 1.8.9 parallel. I use Intel compiler version 14.0.3.174.
When all reads/writes are done, it is time to close the files. At this point, the simulation does not continue; all calls are waiting. When I check the call stack on each processor, I can see that the master (root) processor's stack differs from the rest.
Mpi Master processor call stack:
__sched_yield, FP=7ffc6aa978b0
opal_progress, FP=7ffc6aa978d0
ompi_request_default_wait_all, FP=7ffc6aa97940
ompi_coll_tuned_sendrecv_actual, FP=7ffc6aa979e0
ompi_coll_tuned_barrier_intra_recursivedoubling, FP=7ffc6aa97a40
PMPI_Barrier, FP=7ffc6aa97a60
H5AC_rsp__dist_md_write__flush, FP=7ffc6aa97af0
H5AC_flush, FP=7ffc6aa97b20
H5F_flush, FP=7ffc6aa97b50
H5F_flush_mounts, FP=7ffc6aa97b80
H5Fflush, FP=7ffc6aa97ba0
NC4_close, FP=7ffc6aa97be0
nc_close, FP=7ffc6aa97c00
restclo, FP=7ffc6aa98660
driver, FP=7ffc6aaa5ef0
main, FP=7ffc6aaa5f90
__libc_start_main, FP=7ffc6aaa6050
_start,
Remaining processors call stack:
__sched_yield, FP=7fffe330cdd0
opal_progress, FP=7fffe330cdf0
ompi_request_default_wait, FP=7fffe330ce50
ompi_coll_tuned_bcast_intra_generic, FP=7fffe330cf30
ompi_coll_tuned_bcast_intra_binomial, FP=7fffe330cf90
ompi_coll_tuned_bcast_intra_dec_fixed, FP=7fffe330cfb0
mca_coll_sync_bcast, FP=7fffe330cff0
PMPI_Bcast, FP=7fffe330d030
mca_io_romio_dist_MPI_File_set_size, FP=7fffe330d080
PMPI_File_set_size, FP=7fffe330d0a0
H5FD_mpio_truncate, FP=7fffe330d0c0
H5FD_truncate, FP=7fffe330d0f0
H5F_dest, FP=7fffe330d110
H5F_try_close, FP=7fffe330d340
H5F_close, FP=7fffe330d360
H5I_dec_ref, FP=7fffe330d370
H5I_dec_app_ref, FP=7fffe330d380
H5Fclose, FP=7fffe330d3a0
NC4_close, FP=7fffe330d3e0
nc_close, FP=7fffe330d400
RESTCOM`restclo, FP=7fffe330de60
driver, FP=7fffe331b6f0
main, FP=7fffe331b7f0
__libc_start_main, FP=7fffe331b8b0
_start,
I do realize that one call stack contains a bcast and the other a barrier. This might cause a deadlock. Yet I do not see how to continue from here. If an MPI call is not made properly (e.g. only called on one process), I would expect an error message instead of this behaviour.
Update: the source code is around 100k lines.
The files are opened this way:
cmode = ior(NF90_NOCLOBBER,NF90_NETCDF4)
cmode = ior(cmode, NF90_MPIIO)
CALL ipslnc( NF90_CREATE(fname,cmode=cmode,ncid=ncfid, comm=MPI_COMM, info=MPI_INFO))
And closed as:
iret = NF90_CLOSE(ncfid)

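It turns out that when writing an attribute with NF90_PUT_ATT, the root processor had a different value from the others. Attribute writes change file metadata, which parallel HDF5 expects every rank to modify identically, so the mismatch left the ranks out of step at close time. Once that was fixed, the program ran as expected.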

Does Windows clear memory pages?

I know that Windows has an option to clear the page file when it shuts down.
Does Windows do anything special with the actual physical/virtual memory when it goes in or out of scope?
For instance, let's say I run application A, which writes a recognizable string to a variable in memory, and then I close the application. Then I run application B. It allocates a large chunk of memory, leaves the contents uninitialized, and searches it for the known string written by application A.
Is there ANY possibility that application B will pick up the string written by application A? Or does Windows scrub the memory before making it available?

Windows does "scrub" the freed memory returned by a process before allocating it to other processes. There is a kernel thread specifically for this task alone.
The zero page thread runs at the lowest priority and is responsible for zeroing out free pages before moving them to the zeroed page list[1].
Rather than worrying about retaining sensitive data in the paging file, you should be worried about continuing to retain it in memory (after use) in the first place. Clearing the page-file on shutdown is not the default behavior. Also a system crash dump will contain any sensitive info that you may have in "plain-text" in RAM.
Windows does NOT "scrub" the memory as long as it is allocated to a process (obviously). Rather it is left to the program(mer) to do so. For this very purpose one can use the SecureZeroMemory() function.
This function is defined as the RtlSecureZeroMemory() function (see WinBase.h). The implementation of RtlSecureZeroMemory() is provided inline and can be used on any version of Windows (see WinNT.h).
Use this function instead of ZeroMemory() when you want to ensure that your data will be overwritten promptly, as some C++ compilers can optimize a call to ZeroMemory() by removing it entirely.
WCHAR szPassword[MAX_PATH];
/* Obtain the password */
if (GetPasswordFromUser(szPassword, MAX_PATH))
{
    UsePassword(szPassword);
}
/* Before continuing, clear the password from memory */
SecureZeroMemory(szPassword, sizeof(szPassword));
Don't forget to read this interesting article by Raymond Chen.
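To illustrate why a "secure" zeroing routine survives optimization where a plain memset-style call might not, here is a minimal portable C++ sketch. It assumes a byte loop through a volatile-qualified pointer, which is similar in spirit to the inline RtlSecureZeroMemory() but is not the Windows implementation. The names secure_zero and all_zero are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Zero `len` bytes at `ptr`. Every store goes through a volatile lvalue, so
// the compiler must perform it even if the buffer is never read again.
// Illustration only; on Windows, call SecureZeroMemory() instead.
inline void* secure_zero(void* ptr, std::size_t len) {
    volatile unsigned char* p = static_cast<volatile unsigned char*>(ptr);
    while (len--) {
        *p++ = 0;
    }
    return ptr;
}

// Hypothetical helper for checking the result: true if every byte is zero.
inline bool all_zero(const void* ptr, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(ptr);
    for (std::size_t i = 0; i < len; ++i) {
        if (p[i] != 0) return false;
    }
    return true;
}
```

A caller would use it just like the snippet above: fill the buffer, use it, then call secure_zero(buffer, sizeof(buffer)) before the buffer goes out of scope.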

how does linux kernel implement shared memory between 2 processes

How does the Linux kernel implement the shared memory mechanism between different processes?
To elaborate further, each process has its own address space. For example, an address of 0x1000 in Process A is a different location when compared to an address of 0x1000 in Process B.
So how does the kernel ensure that a piece of memory is shared between different processes that have different address spaces?
Thanks in advance.

Interprocess Communication Mechanisms
Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them but Linux also supports the System V IPC mechanisms named after the Unix release in which they first appeared.
Signals
Signals are one of the oldest inter-process communication methods used by Unix systems. They are used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard interrupt or an error condition such as the process attempting to access a non-existent location in its virtual memory. Signals are also used by the shells to signal job control commands to their child processes.
There are a set of defined signals that the kernel can generate or that can be generated by other processes in the system, provided that they have the correct privileges. You can list a system's set of signals using the kill command (kill -l).
Pipes
The common Linux shells all allow redirection. For example
$ ls | pr | lpr
pipes the output from the ls command listing the directory's files into the standard input of the pr command which paginates them. Finally the standard output from the pr command is piped into the standard input of the lpr command which prints the results on the default printer. Pipes then are unidirectional byte streams which connect the standard output from one process into the standard input of another process. Neither process is aware of this redirection and behaves just as it would normally. It is the shell which sets up these temporary pipes between the processes.
In Linux, a pipe is implemented using two file data structures which both point at the same temporary VFS inode which itself points at a physical page within memory. Figure shows that each file data structure contains pointers to different file operation routine vectors; one for writing to the pipe, the other for reading from the pipe.
Sockets
Message Queues: Message queues allow one or more processes to write messages, which will be read by one or more reading processes. Linux maintains a list of message queues, the msgque vector; each element of which points to a msqid_ds data structure that fully describes the message queue. When message queues are created a new msqid_ds data structure is allocated from system memory and inserted into the vector.
System V IPC Mechanisms: Linux supports three types of interprocess communication mechanisms that first appeared in Unix System V (1983). These are message queues, semaphores and shared memory. These System V IPC mechanisms all share common authentication methods. Processes may access these resources only by passing a unique reference identifier to the kernel via system calls. Access to these System V IPC objects is checked using access permissions, much like accesses to files are checked. The access rights to a System V IPC object are set by the creator of the object via system calls. The object's reference identifier is used by each mechanism as an index into a table of resources. It is not a straightforward index but requires some manipulation to generate the index.
Semaphores: In its simplest form a semaphore is a location in memory whose value can be tested and set by more than one process. The test and set operation is, so far as each process is concerned, uninterruptible or atomic; once started nothing can stop it. The result of the test and set operation is the addition of the current value of the semaphore and the set value, which can be positive or negative. Depending on the result of the test and set operation one process may have to sleep until the semaphore's value is changed by another process. Semaphores can be used to implement critical regions, areas of critical code that only one process at a time should be executing.
Say you had many cooperating processes reading records from and writing records to a single data file. You would want that file access to be strictly coordinated. You could use a semaphore with an initial value of 1 and, around the file operating code, put two semaphore operations, the first to test and decrement the semaphore's value and the second to test and increment it. The first process to access the file would try to decrement the semaphore's value and it would succeed, the semaphore's value now being 0. This process can now go ahead and use the data file but if another process wishing to use it now tries to decrement the semaphore's value it would fail as the result would be -1. That process will be suspended until the first process has finished with the data file. When the first process has finished with the data file it will increment the semaphore's value, making it 1 again. Now the waiting process can be woken and this time its attempt to decrement the semaphore will succeed.
Shared Memory: Shared memory allows one or more processes to communicate via memory that appears in all of their virtual address spaces. The pages of the virtual memory are referenced by page table entries in each of the sharing processes' page tables. It does not have to be at the same address in all of the processes' virtual memory. As with all System V IPC objects, access to shared memory areas is controlled via keys and access rights checking. Once the memory is being shared, there are no checks on how the processes are using it. They must rely on other mechanisms, for example System V semaphores, to synchronize access to the memory.
Quoted from tldp.org.
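The data-file example in the Semaphores passage can be sketched with the System V calls semget(), semctl() and semop(). This is a minimal Linux-only sketch, not code from the quoted text: a shared counter in an anonymous mapping stands in for the data file, and the names sem_add and run_guarded_counter are made up for illustration.

```cpp
#include <cassert>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// On Linux the caller has to define union semun itself (see semctl(2)).
union semun { int val; struct semid_ds* buf; unsigned short* array; };

// Add `delta` to semaphore 0 of the set. The kernel does this atomically and
// puts the caller to sleep while the result would be negative.
static int sem_add(int semid, short delta) {
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = 0;
    return semop(semid, &op, 1);
}

// Hypothetical demo: `workers` child processes each increment a shared counter
// `rounds` times inside the critical region. Returns the final counter value,
// or -1 if setup fails.
long run_guarded_counter(int workers, int rounds) {
    void* mem = mmap(nullptr, sizeof(long), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    volatile long* counter = static_cast<volatile long*>(mem);
    *counter = 0;

    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1) { munmap(mem, sizeof(long)); return -1; }
    union semun arg;
    arg.val = 1;  // initial value 1: the resource is free
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        munmap(mem, sizeof(long));
        return -1;
    }

    std::vector<pid_t> children;
    for (int w = 0; w < workers; ++w) {
        pid_t pid = fork();
        if (pid == 0) {
            for (int i = 0; i < rounds; ++i) {
                sem_add(semid, -1);       // 1 -> 0, or sleep if already 0
                *counter = *counter + 1;  // critical region
                sem_add(semid, +1);       // 0 -> 1, lets a waiter proceed
            }
            _exit(0);
        }
        if (pid > 0) children.push_back(pid);
    }
    for (pid_t pid : children) waitpid(pid, nullptr, 0);

    long result = *counter;
    semctl(semid, 0, IPC_RMID);  // remove the semaphore set
    munmap(mem, sizeof(long));
    return result;
}
```

If every fork() succeeds, the result equals workers * rounds; without the two sem_add() calls the unsynchronized read-modify-write could lose updates.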

There are two kinds of shared memory in Linux.
If A and B are parent and child processes respectively, each of them uses its own page table entries (PTEs) to access the shared memory. The memory is shared through the fork mechanism, so that case is straightforward. (For more details, look at the kernel function copy_one_pte() and related functions.)
If A and B are not parent and child, they use a public key to access the shared memory.
Let's assume that A creates a shared memory segment through System V shmget() with a key. Correspondingly, the kernel creates a file for process A (named "SYSV" followed by the key in hexadecimal) in shmem/tmpfs, which is an internal RAM-based filesystem mounted by the kernel (check shmem_init()). The shared memory region is handled by shmem/tmpfs; basically, its pages are populated by the page fault mechanism when process A accesses the region.
If process B wants to access the shared memory region created by process A, it should call shmget() with the same key that process A used. Process B can then find the file and map it into its own address space.
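The key-based case can be sketched with shmget(), shmat() and shmdt(). This is a minimal Linux sketch under stated assumptions: a forked child plays the role of process B but deliberately finds the segment through the key alone, the key itself is an arbitrary value derived from the process ID, and the function name share_int_by_key is made up for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: "process A" (the caller) creates a segment under a key,
// "process B" (a child) looks it up by that key, attaches it and stores
// `value`. Returns what A then reads through its own mapping, or -1 on error.
int share_int_by_key(int value) {
    // Find an unused key. IPC_EXCL makes shmget() fail with EEXIST if taken.
    key_t key = static_cast<key_t>(getpid()) + 0x5000;  // arbitrary start
    int shmid = shmget(key, sizeof(int), IPC_CREAT | IPC_EXCL | 0600);
    while (shmid == -1 && errno == EEXIST) {
        ++key;
        shmid = shmget(key, sizeof(int), IPC_CREAT | IPC_EXCL | 0600);
    }
    if (shmid == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        shmctl(shmid, IPC_RMID, nullptr);
        return -1;
    }
    if (pid == 0) {
        // Process B: same key, no IPC_CREAT, so this only finds A's segment.
        int id = shmget(key, sizeof(int), 0);
        if (id == -1) _exit(1);
        void* addr = shmat(id, nullptr, 0);  // kernel picks the address
        if (addr == reinterpret_cast<void*>(-1)) _exit(1);
        *static_cast<int*>(addr) = value;
        shmdt(addr);
        _exit(0);
    }

    int status = 0;
    waitpid(pid, &status, 0);

    int result = -1;
    void* addr = shmat(shmid, nullptr, 0);  // process A maps it only now
    if (addr != reinterpret_cast<void*>(-1)) {
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            result = *static_cast<int*>(addr);
        }
        shmdt(addr);
    }
    shmctl(shmid, IPC_RMID, nullptr);  // mark the segment for removal
    return result;
}
```

Each process gets whatever virtual address shmat() chooses for it, so the two mappings need not coincide; what they share is the underlying pages.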

CreateFileMapping and MapViewOfFile with interprocess (un)synchronized multithreaded access?

I use a Shared Memory area to get some data to a second process.
The first process uses CreateFileMapping(INVALID_HANDLE_VALUE, ..., PAGE_READWRITE, ...) and MapViewOfFile( ... FILE_MAP_WRITE).
The second process uses OpenFileMapping(FILE_MAP_WRITE, ...) and MapViewOfFile( ... FILE_MAP_WRITE).
The docs state:
Multiple views of a file mapping object
are coherent if they contain identical data at a specified time.
This occurs if the file views are derived from any file mapping object
that is backed by the same file. (...)
With one important exception, file views derived from any file mapping
object that is backed by the same file are coherent or identical at a
specific time. Coherency is guaranteed for views within a process and
for views that are mapped by different processes.
The exception is related to remote files. (...)
Since I'm just using the Shared Memory as is (backed by the paging file) I would have assumed that some synchronization is needed between processes to see a coherent view of the memory another process has written. I'm unsure however what synchronization would be needed exactly.
The current pattern I have (simplified) is like this:
Process1                    | Process2
...                         | ...
/* write to shared mem, */  | ::WaitForSingleObject(hDataReady); // real code has error handling
/* then: */                 |
::SetEvent(hDataReady);     | /* read from shared mem after wait returns */
...                         | ...
Is this enough synchronization, even for shared memory?
What sync is needed in general between the two processes?
Note that inside of one single process, the call to SetEvent would certainly constitute a full memory barrier, but it isn't completely clear to me whether that holds for shared memory across processes.

I have since come to believe that for memory-access synchronization purposes, it really does not matter if the concurrently accessed memory is shared between processes or just within one process between threads.
That is, for Shared Memory (the one shared between processes) on Windows, the same restrictions and guidelines apply as with "normal" memory within a process that is just shared between the threads of the process.
The reason I believe this is that a process and a thread are somewhat orthogonal on Windows. A process is a "container" for threads, and in order for the process to be able to do anything, it needs at least one thread. So, for memory that is mapped into multiple processes' address spaces, the synchronization requirements on the threads running within these different processes should actually be the same as for threads running within the same process.
So, the answer to my question "Is this enough synchronization, even for shared memory?" is that shared memory requires the same synchronization as "normal" memory. But of course, not all synchronization techniques work across process boundaries, so you are restricted in what you can use. (A Critical Section, for example, cannot be used across processes.)

If both of those code snippets are in a loop, then in addition to the event you'll need a mutex so that Process1 doesn't start writing again while Process2 is still reading. To be more specific, the mutex must be acquired before reading or writing and released after reading or writing. Make sure the mutex has been released before calling WaitForSingleObject() in Process2.

My understanding is that although Windows may guarantee view coherency, it does not guarantee a write is fully completed before the client reads it.
For example, if you were writing "Hello world!" to the view, it could only be partially written when the client reads it, such as "Hello w".
Therefore, the view would be byte coherent, but not message coherent.
Personally, I use a mutex to guarantee thread-safe access.

Using a semaphore should be better than using an event.

Kernel threads vs Timers

I'm writing a kernel module which uses a customized print-on-screen system. Basically each time a print is involved the string is inserted into a linked list.
Every X seconds I need to process the list and perform some operations on the strings before printing them.
Basically I have two choices to implement such a filter:
1) Timer (which restarts itself in the end)
2) Kernel thread which sleeps for X seconds
While the filter is performing its stuff nothing else can use the linked list and, of course, while inserting a string the filter function shall wait.
AFAIK a timer runs in interrupt context, so it cannot sleep, but what about kernel threads? Can they sleep? If yes, is there any reason not to use them in my project? What other solution could be used?
To summarize: my filter function has got only 3 requirements:
1) Must be able to printk
2) When using the list everything else which is trying to access the list must block until the filter function finishes execution
3) Must run every X seconds (not a realtime requirement)

kthreads are allowed to sleep. (However, not all kthreads offer sleepful execution to all clients; ksoftirqd, for example, would not.)
But then again, you could also use spinlocks (and their associated cost) and do without the extra thread (that's basically what the timer approach does, using spin_lock_bh()). It's a tradeoff really.

each time a print is involved the string is inserted into a linked list
I don't really know if you meant print or printk. But if you're talking about printk(), you would need to allocate memory, and that is a problem because printk() may be called in an atomic context. That leaves you the option of using a circular buffer (and thus you should tolerate dropping some strings, because you might not have enough memory to save them all).
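The circular-buffer idea can be sketched in user-space C++ as follows. This is only an illustration of the data structure under stated assumptions: all storage is reserved up front so push() never allocates, the policy chosen here is to overwrite the oldest string when full (dropping the newest would be equally valid), and the class name DropOldestRing is made up. A kernel version would additionally need locking, for example a spinlock, around push() and pop().

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Fixed-capacity ring of fixed-size message slots; push() never allocates.
template <std::size_t Slots, std::size_t SlotSize>
class DropOldestRing {
public:
    // Copy msg (truncated to fit) into the next slot. When the ring is full
    // this overwrites the oldest entry and counts it as dropped.
    void push(const char* msg) {
        char* slot = slots_[head_].data();
        std::strncpy(slot, msg, SlotSize - 1);
        slot[SlotSize - 1] = '\0';
        head_ = (head_ + 1) % Slots;
        if (count_ < Slots) {
            ++count_;
        } else {
            ++dropped_;
        }
    }

    // Copy the oldest message into out; returns false if the ring is empty.
    bool pop(char (&out)[SlotSize]) {
        if (count_ == 0) return false;
        std::size_t tail = (head_ + Slots - count_) % Slots;
        std::memcpy(out, slots_[tail].data(), SlotSize);
        --count_;
        return true;
    }

    std::size_t size() const { return count_; }
    std::size_t dropped() const { return dropped_; }

private:
    std::array<std::array<char, SlotSize>, Slots> slots_{};
    std::size_t head_ = 0;     // next slot to write
    std::size_t count_ = 0;    // number of stored messages
    std::size_t dropped_ = 0;  // messages overwritten before being read
};
```

The periodic filter would then drain the ring with pop() until it returns false, and could report dropped() to show how many strings were lost.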
Every X seconds I need to process the list and perform some operations on the strings before printing them.
In that case, I would not even do a kernel thread: I would do the processing in print() if not too costly.
Otherwise, I would create a new system call:
sys_get_strings() or something, that would dump the whole linked list into userspace (and remove entries from the list when copied).
This way the whole behavior is controlled by userspace. You could create a daemon that would call the syscall every X seconds. You could also do all the costly processing in userspace.
You could also create a new device, say /dev/print-on-screen:
dev_open would allocate the memory, and print() would no longer be a no-op, but feed the data in the device pre-allocated memory (in case print() would be used in atomic context and all).
dev_release would throw everything out
dev_read would get you the strings
dev_write could do something on your print-on-screen system
