I am running a program on a machine with two processors. When I call fork(), is the child created as a native thread, or is it more like a green thread/coroutine? Does the child run concurrently with the parent, or just in parallel?

In general, fork() creates a new, independent process, duplicates the page table, and marks the private writable pages owned by the calling process as copy-on-write in both processes. fork() then returns in both processes (the return value lets each process know which one it is).
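A minimal sketch of the "returns in both processes" behaviour, assuming a Unix-like system where Python's os.fork is available. The names fork_demo and counter are made up for this illustration.

```python
import os

def fork_demo():
    """Fork once; return (value seen by the parent, value reported by the child)."""
    counter = 0                      # lives in the parent's address space
    pid = os.fork()                  # returns twice: 0 in the child, the child's PID in the parent
    if pid == 0:
        counter += 100               # modifies only the child's copy-on-write copy
        os._exit(counter)            # report the value via the exit status and leave at once
    _, status = os.waitpid(pid, 0)   # parent: wait for the child to terminate
    return counter, os.WEXITSTATUS(status)

if __name__ == "__main__":
    print(fork_demo())
```

The parent still sees its own counter unchanged, because the child's write went to a private copy of the page.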
On a system with more than one processor (or processor core) you can normally expect those two processes to use both processors (assuming you have an SMP-enabled system and CPU affinity doesn't prevent it), but you do not strictly have a guarantee.
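To check whether CPU affinity restricts a process, something like the following can be used. This is only a sketch: os.sched_getaffinity exists on Linux but not on every platform, so the fallback branch is an assumption that no restriction applies. usable_cpus is a hypothetical helper name.

```python
import os

def usable_cpus():
    """Return the set of CPU indices this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):      # available on Linux, not everywhere
        return os.sched_getaffinity(0)        # 0 means "the calling process"
    return set(range(os.cpu_count() or 1))    # fallback: assume no affinity restriction

if __name__ == "__main__":
    print(sorted(usable_cpus()))
```

On Linux a forked child inherits the parent's affinity mask, so if this set contains a single CPU, parent and child cannot run in parallel.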
Threads are created in the same way on some systems (e.g. Linux), with the exception that the pages owned by the first process are not marked copy-on-write, but are instead owned by both afterwards (they use the same page table). On other systems, threads may be implemented differently, e.g. in user land, in which case you will not benefit from multiple CPUs with threads.
As a side note, a disadvantage of using fork() and running two processes instead of threads is that the processes do not share a common address space, which means that the TLB must be flushed on a context switch between them (unless the CPU tags TLB entries with an address-space identifier).

This depends on the operating system, programming language, compiler and runtime library, so I can only give you an example: if you use _beginthread under Windows (no matter whether you use MinGW or the MSVCRT directly), you use both your processors. As for the semantics of "concurrent" vs. "parallel": the two are not mutually exclusive.


What is the use of a process with no threads in Windows?

I'm reading Windows Internals (7th Edition), and they write about processes in Chapter 1:
[...] a Windows process comprises the following:
At least one thread of execution: Although an "empty" process is possible, it is (mostly) not useful.
What does "mostly" mean in this context? What could a process with no threads do, and how would that be useful?
EDIT: Also, in a 2015 talk, Mark Russinovich says that a process has "at least one thread" (19:12). Was that a generalization?
Disclaimer: I work for Microsoft.
I think the answer has come out in the comments. There seem to be at least two scenarios where a threadless process would be useful.
Scenario 1: capturing process snapshots
This is probably the most straightforward one. As RbMm commented, PssCaptureSnapshot can be called with the PSS_CAPTURE_VA_CLONE option to create a threadless (or "empty") process (using ZwCreateProcessEx, presumably to duplicate the target process's memory in kernel mode).
The primary use here would be for debugging, if a developer wanted to inspect a process's memory at a certain point in time.
Notably, Eryk Sun points out that an empty process is not necessary for inspecting handles (even though an empty process holds both its own memory space and handles), since there is already a way to inspect a process's handles without creating a new process or duplicating memory.
Scenario 2: safely forking processes with specific inherited handles
Raymond Chen explains another use for a threadless process: creating new "real" processes with inherited handles safely.
When a thread wants to create a new process (CreateProcess), there are several ways for it to pass handles to the new process:
Make a handle inheritable and CreateProcess with bInheritHandles = true.
Make a handle inheritable, add it to a PROC_THREAD_ATTRIBUTE_LIST, and pass that list to the CreateProcess call.
However, they offer conflicting guarantees that can cause problems when callers want to create two processes with different handles concurrently. As Raymond puts it in Why do people take a lock around CreateProcess calls?:
In order for a handle to be inherited, you not only have to put it in the PROC_THREAD_ATTRIBUTE_LIST, but you also must make the handle inheritable. This means that if another thread is not on board with the PROC_THREAD_ATTRIBUTE_LIST trick and does a straight CreateProcess with bInheritHandles = true, it will inadvertently inherit your handles.
You can use a threadless process to mitigate this. In general:
Create a threadless process.
DuplicateHandle all of the handles you want to capture into this new threadless process.
CreateProcess your new, real forked process, using the PROC_THREAD_ATTRIBUTE_LIST, but set the nominal parent process of this process to be the threadless process (with PROC_THREAD_ATTRIBUTE_PARENT_PROCESS).
You can now CreateProcess concurrently without worrying about other callers, and you can now close the duplicate handles and the empty process.

Can many (similar) processes use a common RAM cache?

As I understand the creation of processes, every process has its own space in RAM for its heap, data, etc., which is allocated upon its creation. Processes can share their data and storage space in some ways. But since terminating a process erases its allocated memory (and so also its caches), I was wondering whether many (similar) processes can share a cache in memory that is not allocated to any specific process, so that it can still be used after these processes are terminated and other ones are created.
This is a theoretical question from a student's perspective, so I am only interested in what an operating system offers in general, without adding more functionality to it to achieve this.
For example I think of a webserver that uses only single-threaded processes (maybe due to lack of multi-threading support), so that most of the processes created do similar jobs, like retrieving a certain page.
There are at least four ways in which what you describe can occur.
First, the system address space is shared by all processes. The operating system can save data there that survives the death of a process.
Second, processes can map logical pages to the same physical page frame. The termination of one process does not cause the page frame to be deallocated while other processes still map it.
Third, some operating systems have support for writable shared libraries.
Fourth, memory-mapped files.
There are probably others as well.
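As an illustration of the memory-mapped file case, here is a small sketch assuming a Unix-like system. A child process writes into a shared file-backed mapping and exits; afterwards the parent maps the same file and still finds the data. The temporary file stands in for a real cache file, and cache_survives_writer is a made-up name.

```python
import mmap
import os
import tempfile

def cache_survives_writer(payload=b"cached page"):
    """A child fills a file-backed mapping and exits; the parent then reads it back."""
    fd, path = tempfile.mkstemp()                 # hypothetical cache file (a temp file here)
    try:
        os.ftruncate(fd, mmap.PAGESIZE)           # a mapping needs a non-empty file
        pid = os.fork()
        if pid == 0:                              # child: the "writer" process
            try:
                with mmap.mmap(fd, mmap.PAGESIZE) as view:   # shared mapping by default on Unix
                    view[:len(payload)] = payload
                    view.flush()
            finally:
                os._exit(0)                       # never fall through into the parent's code
        os.waitpid(pid, 0)                        # the writer has terminated by now
        with mmap.mmap(fd, mmap.PAGESIZE) as view:           # another process maps the same file
            return bytes(view[:len(payload)])
    finally:
        os.close(fd)
        os.unlink(path)

if __name__ == "__main__":
    print(cache_survives_writer())
```

The data outlives the writer because it belongs to the file (and the kernel's page cache), not to the writer's private memory.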
I think so: when a process is terminated, its RAM is cleared. However, you're right that things such as web pages are stored in a cache for when they're recalled. For example:
You open Google, then switch to another tab and close the Google page. The next time you go to Google, it loads faster.
However, what I think you're asking is whether, if the entire program (e.g. Google Chrome or Safari) is closed, the web page you just had open stays in the cache. No: when the program is closed, all of its related data is also discarded in order to close the program fully.

What is a guarded region and how does it differ from a critical region?

There is a "guarded region" concept in Windows that is similar to a critical region. Does anyone know how it differs from a critical region?
Recall that a process on any modern OS is made of several components. The ones we're interested in are:
code: the body of the program being executed,
memory: code, as well as execution related data (heap and stack) are stored there for the duration of a process,
thread: an execution context, i.e. CPU state and memory bits (the stack, for instance) related to that context,
signal slots: a per-thread structure holding signals emitted by threads to each other (i.e., it is one of the memory bits of a thread)
Within a process, there may be several instances of these objects; usually code and memory are accessible by all the existing threads within a process. At any time there may be as many active threads (i.e., executing concurrently) as there are available cores on your CPUs. These threads, if they belong to the same process, might interfere with each other when accessing the same memory or code sections. For that purpose, Windows implements so-called critical sections, which are basically protected code blocks (very similar to the concept of synchronized code blocks in Java).
However, threads may also be diverted from their current execution code path when a signal is triggered or posted to them. On Windows, APCs are one form of that signaling mechanism. Guarded sections are available to make sure that a thread will complete a given code block before being able to handle these signals (APC in this case).
So, while a critical section prevents any thread other than the active one from executing the protected code section concurrently, a guarded section ensures that the current thread will not execute any code other than the guarded code once it has started it.
As a simple analogy, imagine a code section as a flat in a building (the process code). A person (thread) who enters critically protected code locks the flat's main door, thus preventing any other person from entering while she is still inside. If the code is guarded, then the OS locks the person's cellphone, preventing her from answering calls until she actually leaves the flat.
A typical scenario for critical sections is when a specific resource needs to be accessed exclusively (a socket, a file handle, an in-memory data structure). For APCs, similarly, guarded sections prevent a thread from interfering with itself by trying to access one such resource in an APC execution when it was already using it in its current execution code path.
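The "locked door" half of the analogy can be sketched in Python, with threading.Lock standing in for a Windows critical section. This is only an analogue, not the Windows API, and it does not model guarded sections or APCs at all. count_with_lock and door are illustrative names.

```python
import threading

def count_with_lock(workers=4, increments=10_000):
    """Several threads increment one counter; the lock makes each update exclusive."""
    total = 0
    door = threading.Lock()              # the flat's main door from the analogy

    def work():
        nonlocal total
        for _ in range(increments):
            with door:                   # only one thread at a time may be inside
                total += 1               # the exclusively accessed resource

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

if __name__ == "__main__":
    print(count_with_lock())
```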

how come ruby's single os thread doesn't block while copying a file?

My assumptions:
MRI ruby 1.8.X doesn't have native threads but green threads.
The OS is not aware of these green threads.
Issuing an I/O-heavy operation should suspend the whole process until the corresponding I/O interrupt comes back.
With these I've created a simple ruby program that does the following:
starts a thread that prints "working!" every second.
issues an I/O request to copy a large (1 GB) file on the "main" thread.
Now one would guess that, since the green threads are invisible to the OS, it would put the whole process on the "blocked" queue and the "working!" green thread would not execute. Surprisingly, it works :S
Does anyone know what's going on there? Thanks.
There is no atomic kernel file copy operation. It's a lot of fairly short reads and writes that are entering and exiting the kernel.
As a result, the process is constantly getting control back. Signals are delivered.
Green threads work by hooking the Ruby-level thread dispatcher into low-level I/O and signal reception. As long as these hooks catch control periodically, the green threads will act quite a bit like truly concurrent threads would.
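The idea can be sketched with a toy model in Python, where generators play the role of green threads. This is not how MRI's scheduler is actually implemented; it only shows that a copy made of many short reads and writes gives a user-level dispatcher a chance to switch after every chunk. All names here are invented for the illustration.

```python
import io

def copy_in_chunks(src, dst, chunk_size=4):
    """Copy src to dst as many short reads/writes, yielding control after each one."""
    while True:
        chunk = src.read(chunk_size)     # one short read, like one trip into the kernel
        if not chunk:
            return                       # end of file: the copy is finished
        dst.write(chunk)
        yield                            # hook point: the dispatcher may switch here

def ticker(log):
    """Stands in for the green thread that prints "working!"."""
    while True:
        log.append("working!")
        yield

def run_green_threads(src, dst, log):
    """Toy dispatcher: after every copied chunk, give the ticker one turn."""
    copier, worker = copy_in_chunks(src, dst), ticker(log)
    for _ in copier:                     # each iteration is one chunk copied
        next(worker)                     # the other green thread runs in between
    return dst.getvalue()

if __name__ == "__main__":
    log = []
    data = run_green_threads(io.BytesIO(b"0123456789"), io.BytesIO(), log)
    print(data, len(log))
```

With ten bytes and a chunk size of four, the copy takes three chunks, so the ticker runs three times before the copy completes.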
Unix originally had a quite thread-unaware but beautifully simple abstract machine model for the user process environment.
As the years went by, support for concurrency in general and threads in particular was added bit by bit in two different ways.
Lots of little kludges were added to check if I/O would block, to fail (with later retry) if I/O would block, to interrupt slow tty I/O for signals but then transparently return to it, etc. When the Unix APIs were merged, each kludge existed in more than one form. Lots of choices.[1]
Direct support for threads in the form of multiple kernel-visible processes sharing an address space was also added. These threads are dangerous and untestable but widely supported and used. Mostly, programs don't crash. As time goes on, latent bugs become visible as the hardware supports more true concurrency. I'm not the least bit worried that Ruby doesn't fully support that nightmare.
1. The good thing about standards is that there are so many of them.
When MRI 1.9 starts, it spawns two native threads. One thread is for the VM, the other is used to handle signals. Rubinius uses this strategy, as does the JVM. Pipes can be used to communicate any info from other processes.
As for the FileUtils module, the cd, pwd, mkdir, rm, ln, cp, mv, chmod, chown, and touch methods are all, to some degree, outsourced to OS native utilities using the internal API of the StreamUtils submodule, while the second thread is left to wait for a signal from an outside process. Since these methods are quite thread-safe, there is no need to lock the interpreter, and thus the methods don't block each other.
MRI 1.8.7 is quite smart, and knows that when a Thread is waiting for some external event (such as a browser to send an HTTP request), the Thread can be put to sleep and be woken up when data is detected. - Evan Phoenix from Engine Yard in Ruby, Concurrency, and You
Judging from the source, the basic implementation of FileUtils has not changed much since 1.8.7. 1.8.7 also uses a sleepy timer thread to wait for an I/O response. The main difference in 1.9 is the use of native threads rather than green threads. Also, the thread source code is much more refined.
By thread-safe I mean that since there is nothing shared between the processes, there is no reason to hold the global interpreter lock (GIL). There is a misconception that Ruby "blocks" when doing certain tasks. Whenever a thread has to block, i.e. wait without using any CPU, Ruby simply schedules another thread. However, in certain situations, like a rack server using 20% of the CPU while waiting for a response, it can be appropriate to unlock the interpreter and allow concurrent threads to handle other requests during the wait. These threads are, in a sense, working in parallel. The GIL is unlocked with the rb_thread_blocking_region API. Here is a good post on this subject.

Difference between pthread and fork on gnu/Linux

What is the basic difference between a pthread and fork w.r.t. linux in terms of
implementation differences and how the scheduling varies (does it vary ?)
I ran strace on two similar programs, one using pthreads and another using fork,
both in the end make clone() syscall with different arguments, so I am guessing
the two are essentially the same on a linux system but with pthreads being easier
to handle in code.
Can someone give a deep explanation?
In C there are some differences, however:
fork():
The purpose is to create a new process, which becomes the child process of the caller.
Both processes will execute the next instruction following the fork() system call.
Two identical copies of the process's address space, code, and stack exist afterwards, one for the parent and one for the child.
Thinking of the fork as if it were a person: forking creates a clone of your program (process), which runs the code it copied.
pthread_create():
The purpose is to create a new thread in the program, which runs within the same process as the caller.
Threads within the same process can communicate using shared memory. (Be careful!)
The second thread will share data, open files, signal handlers and signal dispositions, current working directory, and user and group IDs. The new thread will get its own stack, thread ID, and registers though.
Continuing the analogy: your program (process) grows a second arm when it creates a new thread, connected to the same brain.
On Linux, the system call clone clones a task, with a configurable level of sharing.
fork() calls clone(least sharing) and pthread_create() calls clone(most sharing).
Forking costs a tiny bit more than calling pthread_create() because of copying page tables and creating COW mappings for memory.
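The difference in sharing can be observed from user space. The sketch below uses Python rather than C, assuming a Unix-like system: a thread and a forked child each append to the same list, and only the thread's change is visible to the parent afterwards. thread_vs_fork is a made-up name.

```python
import os
import threading

def thread_vs_fork():
    """Return what the parent sees after a thread and a forked child each append to a list."""
    shared = []

    t = threading.Thread(target=shared.append, args=("thread",))
    t.start()
    t.join()                     # same address space: the append is visible to the parent

    pid = os.fork()
    if pid == 0:
        shared.append("child")   # changes only the child's copy-on-write copy
        os._exit(0)
    os.waitpid(pid, 0)           # the child's list is gone once it has exited
    return shared

if __name__ == "__main__":
    print(thread_vs_fork())
```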
You should look at the clone manpage.
In particular, it lists all the possible clone modes and how they affect the process/thread, virtual memory space etc...
You say "threads easier to handle in code": that's very debatable. Writing bug-free, deadlock-free multithreaded code can be quite a challenge. Sometimes having two separate processes makes things much simpler.
