If I use MPI, the number of processes is fixed when I launch the main program. However, I would like to start with one process and decide dynamically at runtime if and when I need more, and then fork off more processes. Is that, or something similar, possible?
Otherwise I would have to reinvent MPI which I would very much like to avoid.
It is not possible to use fork(), because the child process will not be able to use MPI functions. MPI has a simple mechanism for creating new processes dynamically: use the MPI_Comm_spawn function or MPI_Comm_spawn_multiple.
OpenMPI doc: http://www.open-mpi.org/doc/v1.4/man3/MPI_Comm_spawn.3.php
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define NUM_SPAWNS 2
int main( int argc, char *argv[] )
{
int np = NUM_SPAWNS;
int errcodes[NUM_SPAWNS];
MPI_Comm parentcomm, intercomm;
MPI_Init( &argc, &argv );
MPI_Comm_get_parent( &parentcomm );
if (parentcomm == MPI_COMM_NULL) {
MPI_Comm_spawn( "spawn_example", MPI_ARGV_NULL, np, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, errcodes );
printf("I'm the parent.\n");
} else {
printf("I'm the spawned.\n");
}
fflush(stdout);
MPI_Finalize();
return 0;
}
I have a program running on Linux that fork()s after a TCP connection was accept()ed. Before the fork, it connects to a message queue via msgget() and happily sends and receives messages. At some point in the program, both the parent and the child will be waiting at the same time on a msgrcv() using the same msgtype. A separate process then sends a message via msgsnd() using this same msgtype.
However, only one of the forked processes returns from msgrcv(), and which one seems to depend on the path the parent and the child took. It is very repeatable. In one case only the parent receives the message, in another case only the child does, leaving the other one waiting forever.
Does anyone have a hint on what could go wrong and why not both the parent and the child always receive the message?
I wrote two little test programs, recv.c and send.c, see below.
It turns out that the parent and the child only receive every other message. It seems to be strictly "every other", not even by chance which of the two receives a message. This would very well explain what's happening to my software.
Is this how message queues are supposed to work? Can I not send a message to multiple recipients?
/* recv.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>
#include <stdbool.h> /* for true in the loops below */
int main(void)
{
int msgid = msgget(247, 0666 | IPC_CREAT);
pid_t cldpid = fork();
struct msgform
{
long mtype;
char mbuf[16];
} msg;
msg.mtype = 1;
if (cldpid == 0)
{
while(true)
{
printf("Child waiting\n");
msgrcv(msgid, &msg, sizeof(msg), 1, 0);
printf("Child done\n");
}
}
while(true)
{
printf("Parent waiting\n");
msgrcv(msgid, &msg, sizeof(msg), 1, 0);
printf("Parent done\n");
}
return 0;
}
and
/* send.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>
int main(void)
{
int msgid = msgget(247, 0666 | IPC_CREAT);
struct msgform {
long mtype;
char mbuf[16];
} msg;
msg.mtype = 1;
/* msgsz counts only the payload (mbuf), not the leading mtype field */
msgsnd(msgid, &msg, sizeof(msg.mbuf), IPC_NOWAIT);
return 0;
}
Thanks
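On the "every other message" observation: a System V message queue hands each message to exactly one receiver, so two processes blocked in msgrcv() on the same type share the messages rather than each getting a copy. One possible workaround, sketched here under assumptions, is to send one copy per recipient and give each recipient its own mtype. The helper names send_to_all and recv_mine are hypothetical, not part of any library, and the struct mirrors the one in recv.c and send.c.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msgform {
    long mtype;      /* recipient tag, must be > 0 */
    char mbuf[16];   /* payload */
};

/* Hypothetical helper: send one copy of text to each type in types[0..n-1].
 * Returns 0 on success, -1 as soon as one msgsnd() fails. */
static int send_to_all(int msgid, const long *types, size_t n, const char *text)
{
    struct msgform msg;

    memset(&msg, 0, sizeof msg);
    strncpy(msg.mbuf, text, sizeof msg.mbuf - 1);   /* stays NUL-terminated */
    for (size_t i = 0; i < n; i++) {
        msg.mtype = types[i];
        /* msgsz is the payload size only, without mtype */
        if (msgsnd(msgid, &msg, sizeof msg.mbuf, 0) == -1)
            return -1;
    }
    return 0;
}

/* Hypothetical helper: block until a message addressed to my_type arrives. */
static ssize_t recv_mine(int msgid, long my_type, struct msgform *out)
{
    return msgrcv(msgid, out, sizeof out->mbuf, my_type, 0);
}
```

With this scheme the parent could wait on type 1 and the child on type 2, and the sender would call send_to_all() with both types. It costs one queue entry per recipient, so it only suits a small, known set of receivers.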
I'm trying to disable core dumps being generated for individual signals in my application.
ulimit -c 0 won't work in my case, since it has to be executed before the application starts and it completely disables core dumps for all signals.
Is it possible to make such an exception for a single signal, or at least to disable core dump generation for a certain amount of time (e.g. while the SIGHUP signal is being sent)?
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>
static sigjmp_buf sigjmp;
static void sighup_handler(int signo) {
siglongjmp(sigjmp, signo); /* sigjmp_buf is an array type, so pass it without & */
}
int main(int argc, char **argv) {
struct sigaction sighup_action = {
.sa_handler = &sighup_handler,
.sa_flags = SA_RESETHAND,
};
sigset_t sigset;
int signo;
sigemptyset(&sighup_action.sa_mask);
sigaddset(&sighup_action.sa_mask, SIGHUP);
sigprocmask(SIG_BLOCK, &sighup_action.sa_mask, &sigset);
sigaction(SIGHUP, &sighup_action, NULL);
signo = sigsetjmp(sigjmp, 1);
if (signo) {
struct rlimit rl = { .rlim_cur = 0, .rlim_max = 0 };
setrlimit(RLIMIT_CORE, &rl);
sigprocmask(SIG_SETMASK, &sigset, NULL);
kill(getpid(), signo);
abort(); /* just in case */
_exit(128 | signo);
}
sigprocmask(SIG_SETMASK, &sigset, NULL);
pause(); /* or whatever the rest of your program does */
}
You can install a signal handler which sets RLIMIT_CORE to 0, then proceeds with the default signal action. If you use SA_RESETHAND, the default signal handler is automatically reinstalled right before the signal handler is run. However, setrlimit is not async-signal-safe, so it should not be called from inside a signal handler. That is why the code uses siglongjmp to return to normal code and calls setrlimit there.
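For the "certain amount of time" variant from the question, a minimal sketch is to lower only the soft RLIMIT_CORE limit and put it back afterwards. The function names core_dumps_off and core_dumps_restore are made up for illustration. The sketch assumes the hard limit is left alone, which is what allows an unprivileged process to raise the soft limit again.

```c
#include <sys/resource.h>

/* Hypothetical helper: set the soft core-size limit to 0 and remember the
 * old limits in *saved. The hard limit is left untouched so that the soft
 * limit can be raised again later. Returns 0 on success, -1 on error. */
static int core_dumps_off(struct rlimit *saved)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, saved) == -1)
        return -1;
    rl = *saved;
    rl.rlim_cur = 0;
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Hypothetical helper: put back the limits captured by core_dumps_off(). */
static int core_dumps_restore(const struct rlimit *saved)
{
    return setrlimit(RLIMIT_CORE, saved);
}
```

Note that this switches core dumps off for every signal during the window, not just for one, and like setrlimit itself it belongs in normal code rather than in a signal handler.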
Just add an empty signal handler for SIGHUP, or ignore it like this:
signal(SIGHUP, SIG_IGN);
In order to build a resilient parallel code for solving large linear systems, I have to simulate MPI failures. The idea is to kill or stop a random process while it is working. Once I have achieved this step, I'll start applying other techniques for fault mitigation.
To kill a process, I had the following idea: I chose a process at random and put it in another communicator while the rest stay in MPI_COMM_WORLD, and then I called MPI_Abort(COMM, 0).
The idea seemed sound, but when I tried it, I got an error.
Here's the sample code I wrote to kill a process:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include<time.h>
int main(int argc, char** argv)
{
int size, rank;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm comm1;
MPI_Group group1, grp_world;
MPI_Comm_group(MPI_COMM_WORLD, &grp_world);
int *ranks = malloc((size-1) * sizeof(rank));
int rand_rank;
srand (time(NULL));
rand_rank = rand()%(size-1)+1;
printf("%d\n",rand_rank);
MPI_Group_incl(grp_world, 1, &rand_rank, &group1);
MPI_Comm_create(MPI_COMM_WORLD, group1, &comm1);
if (rank==0) {
printf("the total number of process before killing %d is %d", rand_rank,size);
MPI_Abort(comm1,911);
printf("the total number of process after killing %d is %d", rand_rank,size);
}
MPI_Finalize();
return 0;
}
And the result:
MPI_ABORT was invoked on rank -2 in communicator MPI_COMM_NULL
with errorcode 911.
If anyone has an idea how to achieve this, please let me know. I've tried everything. Thank you.
I am trying to use a SIGCHLD handler, but for some reason it prints the output of the command I gave infinitely. If I remove the struct act, it works fine.
Can anyone take a look at it? I am not able to understand what the problem is.
Thanks in advance!!
/* Simplest dead child cleanup in a SIGCHLD handler. Prevent zombie processes
but don't actually do anything with the information that a child died. */
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
typedef char *string;
/* SIGCHLD handler. */
static void sigchld_hdl (int sig)
{
/* Wait for all dead processes.
* We use a non-blocking call to be sure this signal handler will not
* block if a child was cleaned up in another part of the program. */
while (waitpid(-1, NULL, WNOHANG) > 0) {
}
}
int main (int argc, char *argv[])
{
struct sigaction act;
int i;
int nbytes = 100;
char my_string[nbytes];
string arg_list[5];
char *str;
memset (&act, 0, sizeof(act));
act.sa_handler = sigchld_hdl;
if (sigaction(SIGCHLD, &act, 0)) {
perror ("sigaction");
return 1;
}
while(1){
printf("myshell>> ");
gets(my_string);
str=strtok(my_string," \n");
arg_list[0]=str;
i =1;
while ( (str=strtok (NULL," \n")) != NULL){
arg_list[i]= str;
i++;
}
/* terminate the argument vector right after the last token */
arg_list[i]=NULL;
pid_t child_pid;
child_pid=fork();
if (child_pid == (pid_t)-1){
printf("ERROR OCCURED");
exit(0);
}
if(child_pid!=0){
printf("this is the parent process id is %d\n", (int) getpid());
printf("the child's process ID is %d\n",(int)child_pid);
}
else{
printf("this is the child process, with id %d\n", (int) getpid());
execvp(arg_list[0],arg_list);
printf("this should not print - ERROR occured");
abort();
}
}
return 0;
}
I haven't run your code, and am merely hypothesizing:
SIGCHLD is arriving and interrupting fgets (I'll just pretend you didn't use gets). fgets returns before actually reading any data, my_string contains the tokenized list that it had on the previous loop, you fork again, enter fgets, which is interrupted before reading any data, and repeat indefinitely.
In other words, check the return value of fgets. If it is NULL and has set errno to EINTR, then call fgets again. (Or set act.sa_flags = SA_RESTART.)
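Both suggestions can be sketched as follows. This is a minimal illustration rather than a drop-in replacement for the shell above: the function names reap_children, install_sigchld_handler and read_line are invented here, and fgets stands in for gets.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Reap every child that has already exited, without blocking. */
static void reap_children(int sig)
{
    int saved_errno = errno;          /* waitpid() may overwrite errno */

    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0) {
    }
    errno = saved_errno;
}

/* Install the handler with SA_RESTART so that an interrupted read is
 * restarted instead of failing with EINTR. Returns 0 on success. */
static int install_sigchld_handler(void)
{
    struct sigaction act;

    memset(&act, 0, sizeof act);
    act.sa_handler = reap_children;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &act, NULL);
}

/* Read one line and retry if a signal interrupted the read.
 * Returns buf on success, NULL on end of file or a real error. */
static char *read_line(char *buf, int size, FILE *fp)
{
    for (;;) {
        errno = 0;
        if (fgets(buf, size, fp) != NULL)
            return buf;
        if (ferror(fp) && errno == EINTR) {
            clearerr(fp);             /* reset the error flag, then retry */
            continue;
        }
        return NULL;
    }
}
```

Either measure should be enough on its own. Doing both keeps the loop correct even if the handler is later installed without SA_RESTART.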
I have a multithreaded application: one thread is responsible for collecting dead children with wait(), and another thread spawns them with fork() upon request.
I found that on one platform with a 2.4 kernel and LinuxThreads, wait() always fails with ECHILD. The problem seems to lie in the non-POSIX-compliant implementation of LinuxThreads on the 2.4 kernel, and the following discussion suggests that there is no way to solve it.
Still, I'd like to be sure that nobody knows of any solution. Even a kernel patch would be acceptable.
Given the application design, I don't think it would be possible to do both fork() and wait() in a single thread (or only with enormous effort).
It seems to me that this (obviously bogus) behavior is a feature of the LinuxThreads implementation.
There really seem to be only two ways out: either switch to NPTL (requires kernel 2.6) or avoid such a multi-threaded fork/wait model. The latter was my solution to the problem, and though it made the architecture a bit more complicated, it was still manageable in a single day.
The following is a bare-bones example of the bogus situation that fails on LinuxThreads.
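The second way out might look roughly like this: the thread that calls fork() is also the one that reaps the child, so wait never runs in a thread that is not the child's parent. This is only a sketch of the idea, not the code from the application. The name fork_and_wait and the exit code 7 are placeholders.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: fork a child and reap it from the SAME thread.
 * arg points to an int that receives the child's exit code,
 * or -1 if fork() or waitpid() failed. */
static void *fork_and_wait(void *arg)
{
    int *exit_code = arg;
    int status;
    pid_t pid;

    *exit_code = -1;
    pid = fork();
    if (pid == -1)
        return NULL;
    if (pid == 0)
        _exit(7);                     /* placeholder for the child's real work */
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        *exit_code = WEXITSTATUS(status);
    return NULL;
}
```

It has the signature pthread_create() expects, so it can be started as a thread with a pointer to an int as its argument, or called directly.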
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h> /* for perror */
void * wait_for_child(void *arg)
{
int s;
pid_t ret;
ret = wait(&s);
if (ret == -1 && errno == ECHILD) perror("Bogus LinuxThreads encountered");
return NULL;
}
int main(int argc, char ** argv)
{
pid_t pid = fork();
if (pid == -1) return 1;
// child waits and then dies
if (pid == 0)
{
sleep(3);
return 0;
}
pthread_t wt;
pthread_create(&wt, NULL, wait_for_child, NULL);
pthread_join(wt, NULL);
return 0;
}
If you're starting to think about kernel patches, then it's time to think about upgrades. 2.4 is very long in the tooth.