I noticed that all the _EPROCESS objects are linked to each other via the ActiveProcessLinks list. What is the purpose of this list? What does the OS use this list of active processes for?
In Windows NT, the schedulable unit is the thread. Processes serve as containers for threads, and also as an abstraction that defines which virtual memory map is active (and some other things).
All operating systems need to keep this information available. At different times, different components of the operating system may need to search for a process that matches a specific characteristic, or may need to examine all active processes.
So, how do we store this information? Why not a gigantic array in memory? Well, how big is that array going to be? Are we comfortable limiting the number of active processes to the size of this array? What happens if we can't grow the array? Are we prepared to reserve all that memory up front to keep track of the processes? In the low process use case, isn't that a lot of wasted memory?
So we can keep them on a linked list.
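Very roughly, the idea looks like this (the structure names below are illustrative only, not the real EPROCESS layout): each process record embeds its own link fields, so the list needs no preallocated storage, and walking it is just pointer chasing.

    /* Illustrative only - not the real EPROCESS layout. The point is that each
       process record embeds its own link fields, so the list can hold any
       number of records without reserving memory up front. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct _LIST_ENTRY {
        struct _LIST_ENTRY *Flink;
        struct _LIST_ENTRY *Blink;
    } LIST_ENTRY;

    typedef struct _FAKE_PROCESS {
        unsigned long ProcessId;
        LIST_ENTRY    ActiveProcessLinks;   /* embedded link - no separate node allocation */
    } FAKE_PROCESS;

    /* Recover the enclosing record from a pointer to its embedded link field. */
    #define CONTAINING_RECORD(addr, type, field) \
        ((type *)((char *)(addr) - offsetof(type, field)))

    /* Walk the circular list: every active process is visited exactly once. */
    static void DumpActiveProcesses(LIST_ENTRY *head)
    {
        for (LIST_ENTRY *e = head->Flink; e != head; e = e->Flink) {
            FAKE_PROCESS *p = CONTAINING_RECORD(e, FAKE_PROCESS, ActiveProcessLinks);
            printf("pid %lu\n", p->ProcessId);
        }
    }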
There are some occasions in NT where we care about process context but not thread context. One of those is I/O completion. When an I/O operation is handled asynchronously by the operating system, the eventual completion of that I/O could be in a process context that is different from the requesting process context. So, we need some records and information about the originating process so that we can "attach" to this process. "Attaching" to the process swaps us into the appropriate context with the appropriate user-mode memory available. We don't care about thread context, we care about process context, so this works.
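As a very rough sketch of what "attaching" looks like from a driver's I/O completion path - IoGetRequestorProcess, KeStackAttachProcess and KeUnstackDetachProcess are real kernel routines, but everything around them here is simplified and invented for illustration:

    /* Sketch only: switch into the requesting process' address space so its
       user-mode memory is visible, then switch back. Real code needs proper
       error handling and the rest of the completion logic. */
    #include <ntifs.h>   /* IoGetRequestorProcess is declared here */

    VOID CompleteInRequestorContext(PIRP Irp)
    {
        PEPROCESS process = IoGetRequestorProcess(Irp);  /* process that issued the IRP */
        KAPC_STATE apcState;

        if (process != NULL) {
            KeStackAttachProcess((PRKPROCESS)process, &apcState);
            /* ... touch the requestor's user-mode buffers here ... */
            KeUnstackDetachProcess(&apcState);
        }
    }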
Related
Whenever a process is moved into the waiting state, I understand that the CPU moves on to another process. But if a process in the waiting state still needs to make a request to another I/O resource, doesn't that work require processing? I'm assuming there is a small part of the processor dedicated to helping with the I/O request, moving data back and forth?
I hope this question makes sense lol.
IO operations are really tasks handed off to peripheral devices. Usually you set up the task by writing data to special areas of memory that belong to the device. The device monitors changes in that small area and starts executing the task. So the CPU does not need to do anything while the operation is in progress and can switch to another program. When the IO completes, an interrupt is usually triggered. This is a special hardware mechanism that pauses the currently executing program at an arbitrary point and switches to a special subroutine, which decides what to do next. There are other designs as well: for example, the device may set a special flag somewhere in its memory region, and the OS must check it from time to time.
The problem is that these IO operations are usually quite small, such as sending 1 byte over a COM port, so the CPU has to be interrupted too often and you can't achieve high speed that way. This is where DMA comes in handy. DMA is a special coprocessor (or part of the peripheral device) which has direct access to RAM and can feed big blocks of memory into devices, so it can move megabytes of data without interrupting the CPU.
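To make the first paragraph a bit more concrete, here is a toy polled-IO sketch; the register addresses and bit layout are entirely made up for the example:

    /* Toy illustration of the "write to device memory, then watch a flag" model.
       DEV_BASE and the register layout are hypothetical. */
    #include <stdint.h>

    #define DEV_BASE      ((volatile uint8_t *)0xFE001000u)  /* made-up MMIO window       */
    #define DEV_REG_DATA  (DEV_BASE + 0)                     /* write a byte to transmit  */
    #define DEV_REG_STAT  (DEV_BASE + 1)                     /* bit 0 = device is busy    */

    static void send_byte_polled(uint8_t b)
    {
        *DEV_REG_DATA = b;                 /* hand the byte to the device           */
        while (*DEV_REG_STAT & 0x01) {     /* spin until the device clears "busy"   */
            /* a real OS would run other work here and rely on an interrupt (or DMA
               completion) instead of burning CPU in this loop */
        }
    }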
I am looking for some scheduling options based on the data accessed by threads. Is there any way to find the cache pages accessed by a specific thread?
If I have two threads from two different processes, is it possible to find the common data/pages accessed by both threads?
Two threads from the same process are potentially sharing the whole process memory space. If the programme does not restrict access to certain regions of memory to the threads, it might be difficult to know exactly which thread should be assigned to which cpu.
For more threads, the problem becomes more difficult, as a thread may share different data with several different threads, and create a network of relationship. If a relation between two threads mandates their affinity to a given cpu core, by transitivity all the threads of their relationship network should be bound to that very same core as well.
Perhaps the number of relations, or some form of clustering analysis (biconnectivity) would help.
Regarding your specific question, if two threads are sharing data but are from different processes, then these processes are necessarily sharing those pages voluntarily by using shm_open (to create a shared memory segment) and mmap (to map that segment into the process memory). It is not possible to share data pages between processes otherwise, except implicitly with the copy-on-write mechanism used by the OS for forked processes, in which case each page remains shared until one process writes to it.
The explicit sharing of pages (by shm_open) may be used to programmatically define the same CPU affinity for both threads - perhaps by convention in both programs to associate the relevant threads with the first core, or through a small handshaking protocol established at some point through the shared memory object (for instance the first byte of the memory segment could be set to the chosen cpu number + 1 by the first thread to access it, 0 meaning no affinity yet).
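A minimal sketch of that handshake - the segment name and size are arbitrary, and the check-then-set on the first byte would need an atomic operation in real code to avoid the obvious race:

    /* Both processes call this; whoever gets there first records its preferred
       CPU (+1, so 0 keeps meaning "not chosen yet"), and both bind to the result. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int pick_shared_cpu(int preferred_cpu)
    {
        int fd = shm_open("/affinity-handshake", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return -1;
        ftruncate(fd, 4096);
        void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (mem == MAP_FAILED)
            return -1;

        volatile uint8_t *flag = mem;
        if (flag[0] == 0)                       /* nobody has chosen yet: claim it */
            flag[0] = (uint8_t)(preferred_cpu + 1);
        return flag[0] - 1;                     /* CPU both sides should bind to   */
    }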
Unfortunately, the POSIX thread API doesn't provide a way to set CPU affinity for threads. You may use the non-portable extension provided on the Linux platform, pthread_attr_setaffinity_np, together with the cpuset family of functions, to configure a thread's affinity.
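A minimal sketch of pinning a newly created thread with that extension (worker() is just a placeholder):

    /* Create a thread restricted to a single CPU using the GNU extension. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stddef.h>

    static void *worker(void *arg) { return arg; }   /* placeholder thread body */

    static int start_pinned_thread(pthread_t *tid, int cpu)
    {
        cpu_set_t set;
        pthread_attr_t attr;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);                          /* allow only this CPU */

        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        int rc = pthread_create(tid, &attr, worker, NULL);
        pthread_attr_destroy(&attr);
        return rc;                                   /* 0 on success */
    }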
references:
cpuset
pthread_attr_setaffinity_np
We have to make our system highly scalable, and it has been developed for the Windows platform using VC++. Say that initially we would like to process 100 requests (from MSMQ) simultaneously. What would be the best approach: a single process with 100 threads, or 2 processes with 50 threads each? What is the gain, apart from process memory, in the second approach? In Windows, is CPU time first allocated to the process and then split between its threads, or does the OS count the number of threads in each process and allocate CPU on the basis of threads rather than processes? We notice that in the first case, CPU utilization is 15-25% and we want to consume more CPU. Note that we would like to get optimal performance, so 100 requests is just an example. We have also noticed that if we increase the number of threads in the process above 120, performance degrades due to context switches.
One more point: our product already supports clustering, but we want to utilize more CPU on a single node.
Any suggestions will be highly appreciated.
You can't process more requests simultaneously than you have CPU cores. "Fast" scalable solutions involve setting up thread pools, where the number of active (not blocked on IO) threads == the number of CPU cores. So creating 100 threads because you want to service 100 MSMQ requests is not good design.
Windows has a thread pooling mechanism called IO Completion Ports.
Using IO Completion ports does push the design to a single process as, in a multi process design, each process would have its own IO Completion Port thread pool that it would manage independently and hence you could get a lot more threads contending for CPU cores.
The "core" idea of an IO Completion Port is that its a kernel mode queue - you can manually post events to the queue, or get asynchronous IO completions posted to it automatically by associating file (file, socket, pipe) handles with the port.
On the other side, the IO Completion Port mechanism automatically dequeues events onto waiting worker threads - but it does NOT dequeue jobs if it detects that the current "active" threads in the thread pool >= the number of CPU cores.
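A rough skeleton of that pattern (simplified; real code would also associate file/socket handles with the port and handle errors and shutdown more carefully):

    /* One completion port, a small pool of workers, and work posted to the queue. */
    #include <windows.h>

    static DWORD WINAPI Worker(LPVOID param)
    {
        HANDLE port = (HANDLE)param;
        DWORD bytes;
        ULONG_PTR key;
        LPOVERLAPPED ov;

        for (;;) {
            /* Blocks until a completion (or manually posted event) is available.
               The port wakes at most "concurrency" threads at any one time. */
            if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE)) {
                if (ov == NULL)
                    break;                /* port closed or hard error        */
                continue;                 /* the underlying IO itself failed  */
            }
            if (key == 0)                 /* made-up "shut down" convention   */
                break;
            /* ... process the request identified by key / ov ... */
        }
        return 0;
    }

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);

        /* Concurrency limit == number of processors. */
        HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0,
                                             si.dwNumberOfProcessors);

        HANDLE threads[8];
        DWORD n = si.dwNumberOfProcessors < 8 ? si.dwNumberOfProcessors : 8;
        for (DWORD i = 0; i < n; i++)
            threads[i] = CreateThread(NULL, 0, Worker, port, 0, NULL);

        /* Hand a unit of work to whichever worker wakes up next. */
        PostQueuedCompletionStatus(port, 0, (ULONG_PTR)42, NULL);

        /* Shut the pool down: post the "quit" key once per worker, then wait. */
        for (DWORD i = 0; i < n; i++)
            PostQueuedCompletionStatus(port, 0, 0, NULL);
        WaitForMultipleObjects(n, threads, TRUE, INFINITE);
        CloseHandle(port);
        return 0;
    }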
Using IO Completion Ports can potentially increase the scalability of a service a lot; usually, however, the gain is a lot smaller than expected, as other factors quickly come into play once all the CPU cores are contending for the service's other resources.
If your services are developed in C++, you might find that serialized access to the heap is a big performance minus - although Windows version 6.1 seems to have implemented a low-contention heap, so this might be less of an issue.
To summarize - theoretically your biggest performance gains would come from a design using thread pools managed in a single process. But you are heavily dependent on the libraries you are using not to serialize access to critical resources, which can quickly lose you all the theoretical performance gains.
If you do have library code serializing your nicely thread-pooled service (as in the case of C++ object creation and destruction being serialized because of heap contention), then you need to change your use of the library, switch to a low-contention version of the library, or just scale out to multiple processes.
The only way to know is to write test cases that stress the server in various ways and measure the results.
The standard approach on Windows is multiple threads. Not saying that is always your best solution, but there is a price to be paid for each thread or process, and on Windows a process is more expensive. As for the scheduler, I'm not sure, but you can set the priority of the process and of its threads. The real benefit of threads is their shared address space and the ability to communicate without IPC; however, synchronization must be carefully maintained.
If your system is already developed, which it appears to be, it is likely to be easier to implement a multiple-process solution, especially if there is a chance that later on more than one machine may be utilized, as IPC between 2 processes on one machine can generally scale to multiple machines. Most attempts at massive parallelization fail because the entire system is not evaluated for bottlenecks. For example, if you implement 100 threads that all write to the same database, you may gain little in actual performance and just wait on your database.
just my .02
In a Windows operating system with 2 physical x86/amd64 processors (P0 + P1), running 2 processes (A + B), each with two threads (T0 + T1), is it possible (or even common) to see the following:
P0:A:T0 running at the same time as P1:B:T0
then, after 1 (or is that 2?) context switch(es?)
P0:B:T1 running at the same time as P1:A:T1
In a nutshell, I'd like to know if - on a multiple processor machine - the operating system is free to schedule any thread from any process at any time, regardless of what other threads from other processes are already running.
EDIT:
To clarify the silly example, imagine that process A's thread A:T0 has affinity to processor P0 (and A:T1 to P1), while process B's thread B:T0 has affinity to processor P1 (and B:T1 to P0). It probably doesn't matter whether these processors are cores or sockets.
Is there a first-class concept of a process context switch? Perfmon shows context switches under the Thread object, but nothing under the Process object.
Yes, it is possible and it happens pretty often. The OS tries not to switch a thread between CPUs (you can make it try harder by setting the thread's preferred processor, or you can even lock it to a single processor via affinity). A Windows process is not an execution unit by itself - from this viewpoint, it's basically just a context for its threads.
EDIT (further clarifications)
There's nothing like a "process context switch". Basically, the OS scheduler assigns the threads via a (very adaptive) round-robin algorithm to any free processor/core (as the affinity allows), if the "previous" processor isn't immediately available, regardless of the processes (which means multi-threaded processes can steal much more CPU power).
This "jumping" may seem expensive, considering at least the L1 (and sometimes L2) caches are per-core (apart from different slot/package processors), but it's still cheaper than delays caused by waiting to the "right" processor and inability to do elaborate load-balancing (which the "jumping" scheme makes possible).This may not apply to the NUMA architecture, but there are much more considerations invoved (e.g. adapting all memory-allocations to be thread- and processor-bound and avoiding as much state/memory sharing as possible).
As for affinity: you can set affinity masks per thread or per process (the process mask supersedes all of its threads' settings), but the OS enforces at least one logical processor per thread (you never end up with a zero mask).
A process' default affinity mask is inherited from its parent process (which allows you to create single-core loaders for problematic legacy executables), and threads inherit the mask from the process they belong to.
You may not set a thread's affinity to a processor outside the process's affinity mask, but you can further restrict it.
By default, any thread will jump between the available logical processors (especially if it yields, calls into the kernel, etc.). It may jump even if it has its preferred processor set, but only if it has to, and it will NOT jump to a processor outside its affinity mask (which may lead to considerable delays).
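For illustration, the usual Win32 calls involved look something like this (a sketch, not a recommendation to hard-pin threads in general):

    /* Per-process mask bounds per-thread masks; the "ideal processor" is only a hint. */
    #include <windows.h>

    void PinCurrentThread(void)
    {
        DWORD_PTR processMask, systemMask;
        GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);

        /* Restrict this thread to the lowest processor the process is allowed on. */
        DWORD_PTR lowest = processMask & (~processMask + 1);   /* lowest set bit */
        SetThreadAffinityMask(GetCurrentThread(), lowest);

        /* A softer alternative: tell the scheduler to prefer (not require) CPU 0. */
        SetThreadIdealProcessor(GetCurrentThread(), 0);
    }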
I'm not sure if the scheduler sees any difference between physical and hyper-threaded processors, but even if it doesn't (which I assume), the consequences are in most cases not a concern; i.e. there should not be much difference between multiple threads sharing physical or logical processors as long as the thread count is the same. Regardless, there are some reports of cache thrashing in this scenario, mainly in high-performance, heavily multithreaded applications like SQL Server or the .NET and Java VMs, which may or may not benefit from HyperThreading being turned off.
I generally agree with the previous answer, however things are more complex.
Although processes are not execution units, threads belonging to the same process should be treated differently. There are two reasons for this:
Same address space. This means that when switching context between such threads there is no need to set up the address translation registers.
Threads of the same process are much more likely to access the same memory.
Point (2) has a great impact on the cache state. If threads read the same memory location, they reuse the L2 cache, hence the whole thing speeds up. There is, however, a drawback: once a thread writes to a memory location, that cache line is invalidated in the other processor's caches, so the other processor has to fetch it again.
So there are pros and cons to running the threads of the same process simultaneously (on different processors). BTW, this situation has a name: "gang scheduling".
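As a small illustration of the invalidation drawback (a sketch that assumes 64-byte cache lines; the commented-out padding is the usual fix for this kind of false sharing):

    /* Two threads each bump their own counter, but the counters share a cache
       line, so every write invalidates the line in the other core's cache.
       The increments aren't atomic - the point is only the cache ping-pong. */
    #include <windows.h>
    #include <stdio.h>

    struct counters {
        volatile LONGLONG a;    /* written by thread 1 */
        /* char pad[56]; */     /* uncomment to push b onto a separate 64-byte line */
        volatile LONGLONG b;    /* written by thread 2 */
    } shared;

    static DWORD WINAPI BumpA(LPVOID p) { (void)p; for (int i = 0; i < 100000000; i++) shared.a++; return 0; }
    static DWORD WINAPI BumpB(LPVOID p) { (void)p; for (int i = 0; i < 100000000; i++) shared.b++; return 0; }

    int main(void)
    {
        HANDLE t[2];
        t[0] = CreateThread(NULL, 0, BumpA, NULL, 0, NULL);
        t[1] = CreateThread(NULL, 0, BumpB, NULL, 0, NULL);
        WaitForMultipleObjects(2, t, TRUE, INFINITE);
        printf("a=%lld b=%lld\n", shared.a, shared.b);
        return 0;
    }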
It's highly likely that there is a limit on how many synchronization objects - semaphores, events, critical sections - one process, and all processes on a given machine, can use. What exactly is this limit?
For Windows, the per-process limit on kernel handles (semaphores, events, mutexes) is 2^24.
From MSDN:
Kernel object handles are process specific. That is, a process must either create the object or open an existing object to obtain a kernel object handle. The per-process limit on kernel handles is 2^24. However, handles are stored in the paged pool, so the actual number of handles you can create is based on available memory. The number of handles that you can create on 32-bit Windows is significantly lower than 2^24.
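If you want a feel for the practical limit on a particular machine, a throwaway probe along these lines works (sketch only - it deliberately burns handle quota, so keep the cap and run it somewhere harmless):

    /* Create unnamed event objects until CreateEvent fails or the cap is hit. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long count = 0;
        for (;;) {
            HANDLE h = CreateEvent(NULL, FALSE, FALSE, NULL);
            if (h == NULL)
                break;                    /* quota or memory exhausted           */
            if (++count >= 1000000UL)
                break;                    /* arbitrary safety cap for the probe  */
        }
        printf("created %lu event handles before stopping\n", count);
        return 0;                         /* process exit releases the handles   */
    }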
It depends on the quota that is available for the process. I think in XP it is set to 10000 per process, but it can grow. I am not sure what the upper limit is.
Just checked it again: the 10000 limit is for GDI handles, not for kernel objects.