What happens at the lower levels after a fork system call?

I know what fork() does at the higher level. What I'd like to know is this:
As soon as there is a fork call, a trap instruction follows and control jumps to execute the fork "handler". Now, how does this handler, which creates the child process by duplicating the parent (creating another address space and process control block), return two values, one to each process?
At what point of execution does fork return its two values?
In short, can anybody please explain the step-by-step events that take place at the lower level after a fork call?

It's not so hard, really: the kernel half of the fork() syscall can tell the difference between the two processes via the process control block, as you mentioned, but you don't even need to do that. The pseudocode looks like:
int fork()
{
    int orig_pid = getpid();
    int new_pid = kernel_do_fork(); // Now there are two processes

    // Remember, orig_pid is the same in both processes
    if (orig_pid == getpid()) {
        return new_pid;   // this copy is the parent
    }

    // Must be the child
    return 0;
}
Edit:
The naive version does just what you describe: it creates a new process context, copies all of the associated thread contexts, copies all of the pages and file mappings, and puts the new process on the "ready to run" list.
I think the part you're confused about is that when these processes resume (i.e. when the parent returns from kernel_do_fork and when the child is scheduled for the first time), each starts in the middle of the function (i.e. executing that first 'if'). The child is an exact copy, so both processes execute the second half of the function.

The value returned to each process is different: the parent/original thread gets the PID of the child process, and the child process gets 0.
The Linux kernel achieves this on x86 by changing the value in the eax register as it copies the current thread into the child process.
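Here's a minimal user-space demo of the two return values (nothing kernel-specific, just standard POSIX fork):
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();   /* one call, two returns */

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* In the child, fork() returned 0. */
        printf("child:  fork() returned 0, my pid is %d\n", (int)getpid());
    } else {
        /* In the parent, fork() returned the child's PID
           (the value the kernel placed in eax for this thread). */
        printf("parent: fork() returned %d\n", (int)pid);
        waitpid(pid, NULL, 0);
    }
    return 0;
}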

Related

LTTng/Perf: Difference between events used for exiting (sched_process_exit) and freeing (sched_process_free) a process

Currently I'm getting into the topic of kernel tracing with LTTng and Perf. I'm especially interested in tracing the different states a process goes through.
I stumbled over the events sched_process_free and sched_process_exit. I'm wondering if my current understanding is correct:
When a process exits, sched_process_exit is written to the trace. However, the process descriptor might still be in memory, which leads to a zombie. When all of the memory connected to the process is freed, sched_process_free is written. This would mean that if I really want to be sure the process is fully terminated and removed from memory, I have to listen for sched_process_free instead of sched_process_exit in the trace. Is this correct?
I found some time to edit my answer to make it clearer. If anything is still unclear, please tell me and we can discuss it further. Let's dive into the end of a task:
There are two system calls, exit_group() and exit(); both of them end up in do_exit(), which does the following things.
set PF_EXITING, which means the task is exiting
remove the task's pending timers with del_timer_sync()
call exit_mm(), exit_sem(), __exit_fs() and others to release the structures of that task
call perf_event_exit_task(tsk);
decrease the ref count
set exit_code to the code passed to _exit()/exit_group(), or to the error code
call exit_notify()
update relationship with parent and child
check exit_signal, send SIGCHLD
if the task is not traced or the return value is -1, set exit_state to EXIT_DEAD and call release_task() to recycle the remaining memory and decrease the ref count
if the task is traced, set exit_state to EXIT_ZOMBIE
set the task flag to PF_DEAD
call schedule()
We need the zombie state because the parent may still need information from the child (for example its exit status), so we cannot delete everything right away. The parent task uses something like wait() to check whether the child is dead. After wait(), it is time for the zombie to be released completely by release_task(), which does the following:
decrease the owners' task number
if the task is traced, delete from the ptrace_children list
call __exit_signal() to delete all pending signals and release the signal_struct descriptor, and exit_itimers() to delete all the timers
call __exit_sighand() to delete the signal handlers
call __unhash_process()
nr_threads--
call detach_pid() to delete the task descriptor from the PIDTYPE_PID and PIDTYPE_TGID hashes
call REMOVE_LINKS to delete the task from the task list
call sched_exit() to give the remaining timeslice back to the parent
call put_task_struct() to decrease the counter and release the memory and the task descriptor
call delayed_put_task_struct()
So we know that the sched_process_exit event is emitted in do_exit(), but that alone does not tell us whether the process has been fully released (release_task(), which triggers sched_process_free, may or may not have been called yet). That is why we need both of the two perf event points.
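If you want to see the gap between the two events in a trace, a small demo like the following (untested sketch) should make it visible, assuming you have sched_process_exit and sched_process_free enabled in a kernel tracing session:
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {
        /* Child: exits immediately, so do_exit() runs and
           sched_process_exit is emitted for this PID. */
        _exit(0);
    }

    /* Parent: do not reap the child yet. During this sleep the child is a
       zombie: sched_process_exit has already fired for it, but
       sched_process_free has not, because release_task() has not run. */
    sleep(30);

    waitpid(pid, NULL, 0);
    /* Shortly after reaping, release_task()/delayed_put_task_struct() free
       the task and sched_process_free appears in the trace. */
    return 0;
}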

MPI: Is there a way to receive variables only when it has changed?

I'm working on a project where I need to implement some sort of termination detection via a variable which changes only in the root process of an MPI program.
I am struggling to understand the concepts of blocking and non-blocking instructions.
In short, only the root process can determine if the task has been completed or not. This is done by implementing a simple Boolean integer variable called "running". This has to be broadcasted to all processes in order for them to know when to exit their while-loops.
All processes run in their own while-loop. At the start, the root process sets the "running" variable to true if necessary.
The root process can then determine if the "running" variable should be set to zero and should broadcast it to all other processes.
Currently, I am using a broadcast to share this variable: whenever the loop reaches its end (or "running" gets set to zero), the root broadcasts the value to all processes. Each process therefore has a broadcast inside its loop to receive the value.
I am either misunderstanding the concept of blocking or my program is not efficient.
Broadcast is blocking, so if the root keeps broadcasting a variable that essentially stays the same (TRUE) for the majority of the running time, each process essentially has to wait for the root to complete its work and block at the broadcast before it can continue with its own work.
The problem is that since this variable only changes once in the root process, there are many unnecessary blocks happening while the program runs. I only want the variable to be broadcast once it has been changed to zero, so that I can tell the other processes to terminate a part of their code without having to wait for the root to broadcast every time.
if (myRank != 0) {
    while (running) {
        doThisFunction(myRank);
        MPI_Bcast(... running ...); // Wait for root to broadcast?
    }
    /* Start doing something else */
} else {
    while (running || ...) {
        /* Do stuff */
        if (...) {
            running = 0; // Somewhere in an if statement
            MPI_Bcast(... running ...); // Now terminate the while
        }
        MPI_Bcast(... running ...); // Unnecessary broadcast?
    }
}
I was thinking that I could use MPI_Iprobe to check whether there is a message to be received, and remove the broadcast from the root's while-loop. If there is a message, the process initiates the receive; if not, it continues as normal.
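Roughly what I have in mind, as an untested sketch (STOP_TAG, doThisFunction and rootHasFinished are just placeholders for my actual code): rank 0 sends a point-to-point message to each worker only when the flag flips, and the workers merely probe for it once per iteration.
#include <mpi.h>

#define STOP_TAG 99   /* placeholder tag for the termination message */

/* Placeholder stubs for the real work and the root's termination check. */
static void doThisFunction(int rank) { (void)rank; }
static int rootHasFinished(void) { return 1; }

int main(int argc, char **argv)
{
    int rank, size, running = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        while (running) {
            doThisFunction(rank);

            /* Only peek for a termination message; never block. */
            int flag = 0;
            MPI_Status status;
            MPI_Iprobe(0, STOP_TAG, MPI_COMM_WORLD, &flag, &status);
            if (flag) {
                MPI_Recv(&running, 1, MPI_INT, 0, STOP_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
        /* Start doing something else */
    } else {
        while (running) {
            /* Do stuff */
            if (rootHasFinished()) {
                running = 0;
                /* Tell every worker exactly once that it can stop. */
                for (int dest = 1; dest < size; dest++) {
                    MPI_Send(&running, 1, MPI_INT, dest, STOP_TAG,
                             MPI_COMM_WORLD);
                }
            }
        }
    }

    MPI_Finalize();
    return 0;
}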
TL;DR:
My program terminates some code in the non-root processes when "running" equals zero. Currently, the value is broadcast in every while iteration, and I think this causes unnecessary blocking. I only want to send/receive the variable when it is changed in the root.
Thanks for the help!
edit: "running" is a global variable.

How does Linux kernel migrate the process among multiple cores?

Context:
Process-1 is executing on core-0. Core-1 is idle.
Now, process-1 uses sched_setaffinity() to change its CPU affinity to core-1.
Question:
Which kernel function(s) migrate the process-1 to execute on core-1?
Here is the call sequence starting from the sched_setaffinity system call entry point in the kernel:
sys_sched_setaffinity.
sched_setaffinity.
__set_cpus_allowed_ptr.
In the last function, there are two cases as shown in the code at line 1101:
if (task_running(rq, p) || p->state == TASK_WAKING) {
    // ...
    stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
    // ...
} else if (task_on_rq_queued(p)) {
    rq = move_queued_task(rq, &rf, p, dest_cpu);
}
If the task to be migrated is currently running or waking up, then it is migrated by calling stop_one_cpu, which calls the following functions in order:
migration_cpu_stop.
__migrate_task.
move_queued_task.
The last function, move_queued_task, is the one that actually moves the task from the current runqueue to the target runqueue. Note that this is the same function that is called from the other branch of __set_cpus_allowed_ptr. That branch handles the case where the task is in any of the other states.
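For reference, a minimal userspace program (not part of the kernel code above) that exercises exactly this path by pinning the calling process to core 1:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(1, &set);   /* allow core 1 only */

    /* A pid of 0 means "the calling process"; this call enters the kernel
       at sys_sched_setaffinity and ends up in __set_cpus_allowed_ptr. */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* By the time this returns, the task has been migrated to core 1. */
    return 0;
}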

Communicating between Ruby processes, loops

I have a Ruby application which must run 24/7 to process information for a web API. Both are running on Google Compute Engine on a Debian instance, and the API is served by Sinatra. When I run this script in a loop, it uses up the single vCPU core. Using a message-queuing system like RabbitMQ to pass messages from the API to the backend script seems to me to skip a learning opportunity for communicating between Ruby scripts natively.
How do I keep a script dormant, i.e. awaiting instruction but not consuming 99% of the CPU? I'm assuming it's not going to be an infinite loop, but I'm stumped on this.
How would it be best to communicate this message from one script to another? I read about Kernel#select and forking of subprocesses, but I haven't encountered any definitive or comprehensible solution.
Forking may indeed be a good solution for you, and you only need to understand three system calls to make good use of it: fork(), waitpid() and exec(). I'm not a Ruby guy, so hopefully my C-like explanation will make enough sense for you to fill in the blanks.
The way fork() works is that the operating system makes a byte-for-byte copy of the calling process's virtual memory space as it was when fork() was called, and carves out new memory to place the copy into. This creates a new process with its parent's exact state, except that the child process's fork() call returns 0 while the parent's returns the PID of the new child process. This allows the child process to know that it is a child, and the parent process to know who its children are.
While fork() copies its caller's process image, the exec() system call replaces its caller's process image with a brand new one, as specified by its arguments.
The waitpid() system call is used by the parent process to wait for a return value from a specific child process (one whose process ID was returned to the parent by the fork() call), and then properly log the process' completion with the OS. Even if you don't need your child process' return value, you should call waitpid() on it anyway so you don't end up accumulating "zombie processes."
Again, I'm not a Ruby guy, so hopefully my C-like pseudocode makes sense. Consider the following server:
while(1) { # an infinite loop
    # Wait for and accept connections from your web API.

    pid = fork(); # fork() returns a process ID number

    # If fork() returns a negative number, something went wrong.
    if(pid < 0) {
        exit(1);
    }
    # If fork() returns 0, this is the child process.
    else if(pid == 0) {
        # Remember that because fork() copies your program's state,
        # you can use variables you assigned before the fork to
        # send to the new process as arguments.
        exec("./processingscript.rb", "processingscript.rb", arg1, arg2, arg3, ...);
    }
    # If fork() returns a number greater than 0 (the PID of the forked
    # child process), this is the parent process.
    else if(pid > 0) {
        childreturnvalue = waitpid(pid); # parent process hangs here until
                                         # the process with the ID number
                                         # pid returns.
    }
}
Written this way, your CPU-intensive script only runs when a connection is received from the web API. It does its processing and then terminates, waiting to be called again. You can also specify a "no hang" option for waitpid() so that you can fork multiple instances of your processing script concurrently without having your server hang every time it needs to wait for an instance of that script to complete.
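The "no hang" option is WNOHANG at the C level (Ruby exposes the same idea via Process.waitpid with Process::WNOHANG). A rough sketch of reaping finished children without blocking:
#include <sys/types.h>
#include <sys/wait.h>

/* Reap any children that have already finished, without blocking.
   Call this periodically from the server loop. */
static void reap_finished_children(void)
{
    int status;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        /* 'pid' has terminated; inspect 'status' here if needed. */
    }
}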
Hope this helps! Perhaps somebody who knows Ruby can edit this to be a bit more idiomatic to the language.

How can I get the PID of a new process before it executes?

So that I can do some injecting and interposing using the inject_and_interpose code, I need a way to get the PID of a newly-launched process (a typical closed-source user application) before it actually executes.
To be clear, I need to do better than just "notice it quickly"--I can't be polling, or receiving some asynchronous notification that means that the process has already been executing for a few milliseconds by the time I take action.
I need to have a chance to do my injecting and interposing before a single statement executes.
I'm open to writing a background process that gets synchronously notified when a process by a particular name comes into existence. I'm also open to writing a launcher application that in turn fires up the target application.
Any solution needs to support 64-bit code, at a minimum, under 10.5 (Leopard) through 10.8 (Mountain Lion).
In case this proves to be painfully simple, I'll go ahead and admit that I'm new to OS X :) Thanks!
I know how to do this on Linux, so maybe it would be the same(-ish) on OSX.
You first call fork() to duplicate your process. The return value of fork() indicates whether you are the parent or child. The parent gets the pid of the child process, and the child gets zero.
So then, the child calls exec() to actually begin executing the new executable. With a pipe created before the call to fork, the child can wait for the parent to do whatever it needs before execing the new executable.
pid_t pid = fork();
if (pid == -1) {
    perror("fork");
    exit(1);
}
if (pid > 0) {
    // I am the parent, and pid is the PID of the child process.
    //TODO: If desired, somehow notify child to proceed with exec
}
else {
    // I am the child.
    //TODO: If desired, wait on notification from parent to continue
    execl("path/to/executable", "executable", "arg1", NULL);
    // Should never get here.
    fprintf(stderr, "ERROR: execl failed!\n");
}
