What is the basic difference between a pthread and fork w.r.t. linux in terms of
implementation differences and how the scheduling varies (does it vary ?)
I ran strace on two similar programs , one using pthreads and another using fork,
both in the end make clone() syscall with different arguments, so I am guessing
the two are essentially the same on a linux system but with pthreads being easier
to handle in code.
Can someone give a deep explanation?
EDIT : see also a related question
In C there are some differences however:
fork()
Purpose is to create a new process, which becomes the child process of the caller
Both processes will execute the next instruction following the fork() system call
Two identical copies of the computer's address space,code, and stack are created one for parent and child.
Thinking of the fork as it was a person; Forking causes a clone of your program (process), that is running the code it copied.
pthread_create()
Purpose is to create a new thread in the program which is given the same process of the caller
Threads within the same process can communicate using shared memory. (Be careful!)
The second thread will share data,open files, signal handlers and signal dispositions, current working directory, user and group ID's. The new thread will get its own stack, thread ID, and registers though.
Continuing the analogy; your program (process) grows a second arm when it creates a new thread, connected to the same brain.
On Linux, the system call clone clones a task, with a configurable level of sharing.
fork() calls clone(least sharing) and pthread_create() calls clone(most sharing).
forking costs a tiny bit more than pthread_createing because of copying tables and creating COW mappings for memory.
You should look at the clone manpage.
In particular, it lists all the possible clone modes and how they affect the process/thread, virtual memory space etc...
You say "threads easier to handle in code": that's very debatable. Writing bug-free, deadlock-free multi-thread code can be quite a challenge. Sometimes having two separate processes makes things much simpler.
Related
Familiarity with how Linux Kernel Tracepoints work is not necessarily required to help with this question, it is just what is motivating this problem. In essence, I am looking for a way to store per-process data for a kernel module, without modifying the Linux source (e.g. struct task_struct), and ideally without using locks. Here is my specific question:
I have a kernel module that hooks into the sys_enter (defined here for x86_64, aarch64) and sys_exit (x86_64, aarch64) tracepoints. For each system call issued, I need to pass some data between the enter probe and the exit probe.
Some things I have considered: I could ...
...use one global variable -- but that will be shared between concurrently executing system calls on different CPUs, creating a race.
...use one global map from PID (of the process issuing the system call) to my data, together with locks -- but that will unnecessarily require synchronization between all CPUs on each system call. I would like to avoid this, since the data is "local" to each issued system call, so I feel like there should be a way to keep it local and not add costly synchronization.
...use a per-CPU global variable -- but (it is my understanding that) a process may move to another CPU during the system call execution, making this approach incorrect.
...kmallocing some memory for my custom data upon each system call entry, then pass the address to that memory by clobbering one of the registers in struct pt_regs (both the entry and exit probe receive a pointer to said struct) -- but then I will have a memory leak for system calls that do not trigger the exit probe (such as sys_exit, which never returns).
I am open to any suggestions how these ideas could be refined to address the problems I listed, or any completely different ideas that I am not thinking of.
I'd use an RCU enabled hashtable, for safety.
The first option isn't actually doable, as you stated.
The third one requires you to track which process is using which CPU, which seems unnecessary.
The leaking problem of the fourth option can probably be solved somehow, but allocating memory on each system call can introduce a serious delay.
Of course that accessing the hashtable will also slow down the system, but It won't trigger a memory allocation for each system call, so I assume it'll be less harmful.
Also, I may be wrong here, but if you assume that only process creation/destruction will introduce changes to table itself (not to the data within each entry, but the location and hash value of each row) than maybe you won't even have to synchronize on each system call, but only on ones that will cause process creation/destruction.
I am creating a Golang program that creates a process and then should be able to suspend it.
To make it more memory efficient, I would need my program to be able to dump the memory of the process to disk and reload it only when needed.
I cannot find any info here on Stack Overflow and also GitHub is not helping.
Any solution?
Attempting to answer this with the limited info..
To make it more memory efficient, I would need my program to be able to dump the memory of the process to disk and reload it only when needed.
This is generally something handled by your operating system (scheduler, memory management) controlling what processes are currently running / suspended / etc. and what memory needs to be paged in / out. Trying to implement the equivalent is quite complex, error prone, and likely to be less performant. Why do you believe you need to implement this yourself?
If you are building a program and want to have explicit control about whether it should be considered runnable or not, you could create a process which forks (creating two total processes), and have the parent process suspend and resume the child process using signals:
https://man7.org/linux/man-pages/man7/signal.7.html
I'm reading Windows Internals (7th Edition), and they write about processes in Chapter 1:
Processes
[...] a Windows process comprises the following:
[...]
At least one thread of execution Although an "empty" process is possible, it is (mostly) not useful.
What does "mostly" mean in this context? What could a process with no threads do, and how would that be useful?
EDIT: Also, in a 2015 talk, Mark Russinovich says that a process has "at least one thread" (19:12). Was that a generalization?
Disclaimer: I work for Microsoft.
I think the answer has come out in the comments. There seem to be at least two scenarios where a threadless process would be useful.
Scenario 1: capturing process snapshots
This is probably the most straightforward one. As RbMm commented, PssCaptureSnapshot can be called with the PSS_CAPTURE_VA_CLONE option to create a threadless (or "empty") process (using ZwCreateProcessEx, presumably to duplicate the target process's memory in kernel mode).
The primary use here would be for debugging, if a developer wanted to inspect a process's memory at a certain point in time.
Notably, Eryk Sun points out that an empty process is not necessary for inspecting handles (even though an empty process holds both its own memory space and handles), since there is already a way to inspect a process's handles without creating a new process or duplicating memory.
Scenario 2: forking processes with specific inherited handles---safely
Raymond Chen explains another use for a threadless process: creating new "real" processes with inherited handles safely.
When a thread wants to create a new process (CreateProcess), there are several ways for it to pass handles to the new process:
Make a handle inheritable and CreateProcess with bInheritHandles = true.
Make a handle inheritable, add it to a PROC_THREAD_ATTRIBUTE_LIST, and pass that list to the CreateProcess call.
However, they offer conflicting guarantees that can cause problems when callers want to create two threads with different handles concurrently. As Raymond puts it in Why do people take a lock around CreateProcess calls?:
In order for a handle to be inherited, you not only have to put it in the PROC_THREAD_ATTRIBUTE_LIST, but you also must make the handle inheritable. This means that if another thread is not on board with the PROC_THREAD_ATTRIBUTE_LIST trick and does a straight CreateProcess with bInheritHandles = true, it will inadvertently inherit your handles.
You can use a threadless process to mitigate this. In general:
Create a threadless process.
DuplicateHandle all of the handles you want to capture into this new threadless process.
CreateProcess your new, real forked process, using the PROC_THREAD_ATTRIBUTE_LIST, but set the nominal parent process of this process to be the threadless process (with PROC_THREAD_ATTRIBUTE_PARENT_PROCESS).
You can now CreateProcess concurrently without worrying about other callers, and you can now close the duplicate handles and the empty process.
My assumptions:
MRI ruby 1.8.X doesn't have native threads but green threads.
The OS is not aware of these green threads.
issuing an IO-heavy operation should suspend the whole process until the proper IO interruption is issued back.
With these I've created a simple ruby program that does the following:
starts a thread that prints "working!" every second.
issues an IO request to copy a large (1gb) file on the "main" thread.
Now one would guess that being the green threads invisible to the OS, it would put the whole process on the "blocked" queue and the "working!" green thread would not execute. Surprisingly, it works :S
Does anyone know what's going on there? Thanks.
There is no atomic kernel file copy operation. It's a lot of fairly short reads and writes that are entering and exiting the kernel.
As a result, the process is constantly getting control back. Signals are delivered.
Green threads work by hooking the Ruby-level thread dispatcher into low-level I/O and signal reception. As long as these hooks catch control periodically the green threads will act quite a bit like more concurrent threads would.
Unix originally had a quite thread-unaware but beautifully simple abstract machine model for the user process environment.
As the years went by support for concurrency in general and threads in particular were added bit-by-bit in two different ways.
Lots of little kludges were added to check if I/O would block, to fail (with later retry) if I/O would block, to interrupt slow tty I/O for signals but then transparently return to it, etc. When the Unix API's were merged each kludge existed in more than one form. Lots of choices.1.
Direct support for threads in the form of multiple kernel-visible processes sharing an address space was also added. These threads are dangerous and untestable but widely supported and used. Mostly, programs don't crash. As time goes on, latent bugs become visible as the hardware supports more true concurrency. I'm not the least bit worried that Ruby doesn't fully support that nightmare.
1. The good thing about standards is that there are so many of them.
When MRI 1.9 initiates, it spawns two native threads. One thread is for the VM, the other is used to handle signals. Rubinis uses this strategy, as does the JVM. Pipes can be used to communicate any info from other processes.
As for the FileUtils module, the cd, pwd, mkdir, rm, ln, cp, mv, chmod, chown, and touch methods are all, to some degree, outsourced to OS native utilities using the internal API of the StreamUtils submodule while the second thread is left to wait for a signal from the an outside process. Since these methods are quite thread-safe, there is no need to lock the interpreter and thus the methods don't block eachother.
Edit:
MRI 1.8.7 is quite smart, and knows that when a Thread is waiting for some external event (such as a browser to send an HTTP request), the Thread can be put to sleep and be woken up when data is detected. - Evan Phoenix from Engine Yard in Ruby, Concurrency, and You
The implementation basic implementation for FileUtils has not changed much sense 1.8.7 from looking at the source. 1.8.7 also uses a sleepy timer thread to wait for a IO response. The main difference in 1.9 is the use of native threads rather than green threads. Also the thread source code is much more refined.
By thread-safe I mean that since there is nothing shared between the processes, there is no reason to lock the global interpreter. There is a misconception that Ruby "blocks" when doing certain tasks. Whenever a thread has to block, i.e. wait without using any cpu, Ruby simply schedules another thread. However in certain situations, like a rack-server using 20% of the CPU waiting for a response, it can be appropriate to unlock the interpreter and allow concurrent threads to handle other requests during the wait. These threads are, in a sense, working in parallel. The GIL is unlocked with the rb_thread_blocking_region API. Here is a good post on this subject.
I am running a program on a machine with a two processors, when I do a fork is the child created as a native thread or it is like a green thread/coroutine. Is the child running concurrently with the parent or it is just parallel?
The working of fork() in general is to generate a new, independent process, duplicate the page table, and mark all pages owned by the process that called fork() as copy-on-write in that process. Then, fork() returns in both processes (the return value lets the respective process know which one it is).
On a system with more than one processor (or processor cores) you can normally (assuming you do have a SMP-enabled system, cpu affinity doesn't prevent it) expect those two processes to use both processors, but you do not strictly have a guarantee.
Threads are generated in the same way on some systems (e.g. Linux) with the exception that the pages owned by the first process are not marked copy-on-write, but are instead owned by both processes afterwards (they use the same page table). On other systems, threads may be implemented differently, e.g. in user land, in which case you will not benefit from multiple cpus with threads.
As a side note, the disadvantage of using fork() and running 2 processes instead of threads is that the processes do not share a common address space, which means that the TLB must be flushed on a context switch.
This depends on the operating system, programming-language, compiler and runtime-library, so i can only give you an example: If you use _beginthread under Windows (no matter if you use MinGW or the MSCRT directly) you use both your processors. Further to explain the semantics of "concurrent" vs. "parallel": they are non-exclusive.