Parallel Processing - performance

I have C# application(s) that perform different kinds of processing, such as inserting records, extracting text, printing, etc.
Internally, each one has a different exe to run a given module.
I want to run these application(s) efficiently, according to the machine's configuration.
Example: let's say the machine has 8 GB of RAM.
I can start multiple instances of a single application to improve processing speed.
But my concern is: how can I decide how many parallel instances to run per application, based on the machine's configuration?
Is there any functionality in C# that tells an exe to run within a given memory limit?
Applications running on Windows use virtual memory. For example, a process on a 32-bit system has 2 GB of user-mode virtual address space by default, while on a 64-bit system the limit is measured in terabytes (8 TB or more, depending on the Windows version). How a process's memory is fitted into physical RAM is handled by the OS, which is Windows here, so you don't have control over how the operating system manages physical memory.
I suggest using the Parallel class for parallel processing in C#; the performance will depend on the computer's specifications.
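One rough way to choose the degree of parallelism is to take the smaller of "how many jobs fit in the RAM left after an OS reserve" and "how many cores there are", then run the work through a bounded pool. The sketch below is in Python rather than C#, purely as an illustration; the function names, the 2 GB reserve and the per-job footprint are assumptions, and the real footprint has to be measured on the target machine.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def pick_worker_count(total_ram_mb, per_job_mb, reserve_mb=2048, cpu_count=None):
    """Smaller of the memory-based and the core-based limit, but at least 1."""
    if per_job_mb <= 0:
        raise ValueError("per_job_mb must be positive")
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    by_memory = max(0, total_ram_mb - reserve_mb) // per_job_mb
    return max(1, min(by_memory, cpu_count))

def run_bounded(jobs, workers):
    """Run zero-argument callables with at most `workers` running at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: job(), jobs))

# Hypothetical figures: 8 GB machine, 1.5 GB per job, 8 cores.
workers = pick_worker_count(8192, 1536, reserve_mb=2048, cpu_count=8)
print(workers)  # prints 4
print(run_bounded([lambda: "insert", lambda: "extract", lambda: "print"], workers))
```

In C#, the comparable knob is `ParallelOptions.MaxDegreeOfParallelism` passed to `Parallel.ForEach`; for separate exes, the same number would cap how many processes are started at once.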


Does Docker give RAM extra mileage?

According to countless sources, Docker provides ultra-lightweight virtualization by sharing system resources across containers, instead of allocating copies of those resources per container.
I've even read articles where it is boasted that you could "run dozens, even hundreds of containers on the same VM."
But if my app requires 2GB RAM to run, and the underlying physical machine has only 8GB RAM on it, I would normally only be able to run 3 instances of my app on it (leaving ~2GB for system memory, utilities, etc.).
Does Docker do some kind of magic with RAM, allowing me to actually run dozens of containers, each one allocated 2GB RAM, but somehow sharing unused memory under the hood?
Or are those statements more media hype than anything else?
When people talk about running "dozens or hundreds of containers" they are normally thinking about microservices; small applications that do a specific task. Each of these may have memory usage measured in KBs rather than MBs, and probably not GBs, and as such there is no reason a decent machine couldn't run dozens or hundreds of them.
There is actually a competition (I think it's ongoing) to get as many containers as possible running on a Raspberry Pi. The result currently stands at over a thousand, but admittedly these containers won't be running a real-life application.
Regarding memory, the answer is "it's complicated". If you're using the AUFS or Overlay driver, containers with the same base image should be able to share "memory pages"; meaning shared libraries shouldn't need to get loaded twice for two containers. This isn't something special though; normal processes running on the host will work the same way.
At the end of the day, containers are little more than isolated processes. We can easily run dozens or hundreds of processes on a host, so it's not unfeasible to run dozens or hundreds of containers.
A Docker container only consumes the resources that it needs, as it needs them. So yes, you could literally run hundreds of containers on one box, as long as they are not all actively consuming your resources. That is what makes Docker unique: a container will use what resources it can and then release them, making them available for another container on the same host. It is best practice to let the container and Docker handle allocating resources instead of doing a hard assignment of them.
The alternative would be a virtual machine. Each virtual machine that you run has to run a full Linux kernel, and the host OS will hold a chunk of memory aside for the virtualized environment. This means that you can really only run a couple of VMs on all but the heaviest-duty hardware.
A container does NOT run a kernel; it just runs a single process (plus subprocesses). This means that you can run as many processes in containers as you could if you were running those same processes without containers. Each thinks it is running on a separate machine, but they all just show up as processes on the host kernel.
There is no magic that will make you able to use RAM dozens of times over. But you can pack smaller processes in together a LOT tighter than you could using virtual machines for separation.
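As a back-of-the-envelope illustration of the answers above: the number of containers that fit is roughly the RAM left after a host reserve, minus whatever is shared once, divided by each container's private footprint. The 50 MB microservice and the 100 MB of shared pages below are made-up figures, not measurements.

```python
def containers_that_fit(total_mb, host_reserve_mb, private_mb, shared_mb=0):
    """Rough count of containers that fit in RAM.

    shared_mb is counted once for all containers (e.g. shared library pages);
    private_mb is what each container needs on its own.
    """
    if private_mb <= 0:
        raise ValueError("private_mb must be positive")
    usable_mb = total_mb - host_reserve_mb - shared_mb
    return max(0, usable_mb // private_mb)

# The 2 GB app from the question on an 8 GB host: still only 3 containers.
print(containers_that_fit(8192, 2048, 2048))               # prints 3
# Hypothetical 50 MB microservices sharing 100 MB of pages.
print(containers_that_fit(8192, 2048, 50, shared_mb=100))  # prints 120
```

The "dozens or hundreds" figure comes from the small per-container footprint, not from any reuse of the same RAM.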

Where is my MMF (Memory-Mapped File) memory in Windows Task Manager?

Two applications share memory through an MMF.
A creates the MMF (about 1GB); B opens that MMF by name.
When I look at Windows Task Manager, A has 1GB of memory.
But after closing and launching app B several times
(or after a day? I'm not sure how to reproduce it),
A's memory in Windows Task Manager is below 1K bytes.
My guess is that,
because app A doesn't do anything after creating the MMF,
Windows thinks the MMF belongs to app B. (Just a guess.)
My OS is Windows 2003 Enterprise x64, SP2.
Does anybody know the reason?
A memory-mapped file is still part of your virtual address space. Use perfmon to get reliable counters instead of Task Manager, which changes with each release of Windows. The perfmon counter Process | Virtual Bytes (total VAS) is the most interesting.
My understanding is that 1GB is reserved in the virtual address space, but memory is only actually allocated for pages that are touched. Memory mapped files are implemented parallel to the Virtual Memory API, and both build upon the NT Virtual Memory Manager. See this article and diagram for an explanation.
Did you fill your entire file with data, or did you just allocate 1GB?
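To make the "reserved versus touched" distinction concrete, here is a minimal sketch of the A/B setup. It uses Python's `multiprocessing.shared_memory` as a stand-in for a named file mapping, with a 16 MiB region instead of 1GB, and both "applications" live in one process for brevity; all of that is an assumption for illustration, not the asker's code. Only the first few bytes are ever written. The OS would normally back just the touched pages with physical memory, though this script does not measure that.

```python
from multiprocessing import shared_memory

def share_by_name(size_bytes, payload=b"hello"):
    """Create a named block as 'A', attach to it by name as 'B', read A's data from B."""
    block_a = shared_memory.SharedMemory(create=True, size=size_bytes)
    try:
        # 'A' touches only the start of the region; the rest is never written.
        block_a.buf[:len(payload)] = payload
        # 'B' opens the same region using nothing but its name.
        block_b = shared_memory.SharedMemory(name=block_a.name)
        try:
            return bytes(block_b.buf[:len(payload)])
        finally:
            block_b.close()
    finally:
        block_a.close()
        block_a.unlink()

print(share_by_name(16 * 1024 * 1024))  # prints b'hello'
```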
Which column are you viewing in Task Manager?
The default column, Memory (Private Working Set), represents physically allocated memory.
You can add the column Commit Size to see the total amount of virtual address space allocated to the process.
Here is a summary of the various memory statistics you can see in Task Manager and what they mean.
It's because the process's working set gets minimized (trimmed).

Is there any way of throttling CPU/Memory of a process?

Problem: I have a developer's machine (read: fast, lots of memory), but the user has a user's machine (read: slow, not very much memory).
I can simulate a slow network using Fiddler.
I can look at how CPU is used over time for a process using Process Explorer.
Is there any way I can restrict the amount of CPU a process can have, or the amount of memory a process can have, in order to simulate a user's machine more effectively? (In order to isolate performance problems, for instance.)
I suppose I could use a VM, but I'm looking for something a bit lighter.
I'm using Windows XP, but a solution for any Windows machine would be welcome. Thanks.
The Platform SDK used to come with stress tools for doing just this back in the good old days (STRESS.EXE and CPUSTRESS.EXE in the SDK), and they might still be there (check your Platform SDK and/or Visual Studio installation for these two files; unfortunately I have neither the PSDK nor VS installed on the machine I'm typing from).
Other tools:
memory: performance & reliability (e.g. handling failed memory allocation): can use EatMem
CPU: performance & reliability (e.g. race conditions): can use CPU Burn, Prime95, etc
handles (GDI, User): reliability (e.g. handling failed GDI resource allocation): ??? may have to write your own, but running out of GDI handles (buggy GTK apps would usually eat them all away until all other apps on the system would start falling dead like flies) is a real test for any Windows app
disk: performance & reliability (e.g. handling disk full): DiskFiller, etc.
AppVerifier has a low-resource simulation feature.
You could also try setting the priority of your process to be very low.
You can run MemAlloc to chew up RAM, possibly a few copies at once.
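The idea behind such memory-eating tools can be sketched in a few lines: allocate a fixed amount and write to every page so the memory is really backed, then hold on to it while the application under test runs. This is a generic stand-in, not how EatMem or MemAlloc are actually implemented; the function name and the 4096-byte page size are assumptions.

```python
def eat_memory(mib, page_size=4096):
    """Allocate `mib` MiB and write one byte per page so the pages are really backed."""
    blocks = []
    for _ in range(mib):
        block = bytearray(1024 * 1024)
        for offset in range(0, len(block), page_size):
            block[offset] = 1
        blocks.append(block)
    return blocks  # the caller must keep this reference to keep the pressure on

hog = eat_memory(8)  # hold 8 MiB here; raise this for a real test
print(sum(len(b) for b in hog) // (1024 * 1024), "MiB held")  # prints 8 MiB held
```

Running a few copies of such a script at once, as suggested above, leaves less free RAM for the process being tested.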
I found a related question:
Set Windows process (or user) memory limit
The accepted answer for the question has a link to the Windows API's SetProcessWorkingSetSize, so it's not exactly a tool that can limit the amount of memory that a process can use.
In terms of changing the amount of CPU resources a process can use, if you don't mind the granularity of per-core limiting of resources, Task Manager can change the processor affinity of a process.
In Task Manager, right-click a process and select "Set Affinity...", then select the processor cores that the process can be assigned to.
If the development machine has many cores but the user machine only has one, then, rather than allowing the process to run on all the available cores, set the process' processor affinity to only one core.
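The same restriction can be applied from code. The sketch below builds the bitmask that corresponds to a list of core indexes and, on Windows only, passes it to the Win32 `SetProcessAffinityMask` function through `ctypes`. Choosing that API is an assumption on my part (the answer only describes the Task Manager route), and the function names are hypothetical. On other platforms the helper just returns the mask without applying it.

```python
import ctypes
import sys

def affinity_mask(cores):
    """Bitmask with bit N set for every core index N in `cores`."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indexes must be non-negative")
        mask |= 1 << core
    return mask

def pin_current_process(cores):
    """Restrict the current process to `cores` (applied on Windows only)."""
    mask = affinity_mask(cores)
    if mask == 0:
        raise ValueError("at least one core is required")
    if sys.platform != "win32":
        return mask  # nothing is applied outside Windows in this sketch
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    kernel32.SetProcessAffinityMask.restype = ctypes.c_int
    if not kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), mask):
        raise ctypes.WinError(ctypes.get_last_error())
    return mask

print(affinity_mask([0]))     # prints 1  (first core only)
print(affinity_mask([0, 2]))  # prints 5  (first and third core)
```

Calling `pin_current_process([0])` at startup would mimic a single-core user machine for that process only.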
It has nothing to do with SetProcessWorkingSetSize.
Just use the internal Win32 kernel APIs to restrict CPU usage.

Does Windows Server 2003 SP2 tell the truth about Free System Page Table Entries?

We have some Win32 console applications running on Windows Server 2003 Service Pack 2 that regularly fail with this:
Error 1450 (ERROR_NO_SYSTEM_RESOURCES): "Insufficient system resources exist to complete the requested service."
All the documentation we've found suggests it is linked to the number of Free System Page Table Entries running out. We have 16GB RAM in these machines and use the /3GB Operating System switch to squeeze the Windows kernel into 1GB and allow our processes access to 3GB of address space. This drastically reduces the total number of Free System Page Table Entries, so combined with our heavy use of MapViewOfFile() it is perhaps not surprising that the kernel page table entries are running out.
However, when using Performance Monitor to view the Free System Page Table Entries counter, the value is around 36,000 on reboot and doesn't go down when our application starts. I find it hard to believe that our application, which opens many large memory-mapped files, doesn't have any effect on the kernel page table. If we can't believe the counter, it's much more difficult to test the effect of any system changes we make.
There is a promising Knowledge Base article, The Performance tool does not accurately show the available Free System Page Table entries in Windows Server 2003, but it says the problem has been fixed in Service Pack 1, and we are already on Service Pack 2.
Has anyone else struggled with or solved this issue?
Update: I have checked !sysptes in windbg (debugging the kernel) and the value matches the performance counter, around 36,000. I guess this is most likely to mean that there really are that many free page table entries and Windows is telling the truth. It does leave the question of why we're getting 1450 errors though, if the PTEs are not running out.
Further update: We never did get to the bottom of why the 1450 errors were occurring. However, instead we upgraded the OS on these servers to 64-bit Windows. This allows the existing 32-bit applications (without recompilation) to access a full 4GB of virtual address space, and lets the kernel memory area with those pesky Page Table Entries be as big as it likes too. I don't think we've had a 1450 error since.
Can you try the windbg command "!sysptes" to get System PTE Information? I'm not sure if you can do this with live kernel debug, you may have to get a memory dump.
I'm not sure why you assume that ERROR_NO_SYSTEM_RESOURCES is caused only by running out of free System Page Table Entries. As far as I know, such generic error codes are used for more than one resource type. And in fact, the first Google hit suggests that running out of file cache memory may cause it too. (KB on an XP bug, which tripped this error mode).
In your case, I'd be checking the "Handle Count". Another possible problem is address space fragmentation. If you want to create a 1GB file mapping view, you need 1GB of free address space, and it has to be contiguous. If you map a 1GB file, an 800MB file, and a 1GB file, close the 800MB one and open a 900MB file, the 900MB file may not fit in the hole that's left.
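The fragmentation example can be checked with a small simulation. It assumes a 3072 MB user address space that is filled front to back with nothing else in it, which is a simplification (real processes also have DLLs, heaps and stacks scattered around); the function name is hypothetical.

```python
def largest_free_hole(space_mb, mappings):
    """Return (total_free_mb, largest_hole_mb) for non-overlapping (start, size) mappings."""
    holes = []
    cursor = 0
    for start, size in sorted(mappings):
        if start > cursor:
            holes.append(start - cursor)   # gap before this mapping
        cursor = max(cursor, start + size)
    if cursor < space_mb:
        holes.append(space_mb - cursor)    # gap after the last mapping
    return sum(holes), max(holes, default=0)

views = [(0, 1024), (1024, 800), (1824, 1024)]  # 1GB, 800MB and 1GB views
views.remove((1024, 800))                       # close the 800MB view
total_free, biggest = largest_free_hole(3072, views)
print(total_free, biggest)                      # prints 1024 800
```

There is 1024 MB free in total, but the largest contiguous hole is only 800 MB, so a 900 MB view cannot be mapped.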
MS has 2 ways to allow their 32-bit OS to "deal" with hardware that has 4 GB or more of RAM.
Option 1: is what you did with the /3GB switch in the Boot.ini.
Option 1 Pros and Cons:
(CONS) This option takes 1 GB from the normal 2 GB kernel area, making the OS struggle to meet the demands of both Paged Pool allocations and kernel stack allocations. So a person might think that using the /3GB switch will help them, but really this option is screwing the 32-bit Windows OS into a slow death.
(CONS) "But this gives my app 3GB..." WRONG (hence this is a CON). The catch is that ONLY applications that have been recompiled by the vendor to be "/3GB switch aware" can really use the extra 1 GB. Hence the whole use of the /3GB switch is a really BAD J.O.K.E on everyone.
Read this link for a much better write-up:
Option 2: Use the /PAE switch in the Boot.ini.
Option 2 Pros and Cons:
(PROS) This is really the only option if you have more than 4GB of RAM. It tricks an application by placing the complete application memory footprint in RAM. Normally, only an application's "Working Set" memory is in RAM and the remaining application memory requirements go into the Windows pagefile. What are an application's total memory requirements? That is called its "Virtual Size".
In my world, I have a big fat Java-based IBM product that I deal with. The server that is running the "application" has 16 GB of RAM. I simply add the /PAE switch and watch (thanks to Sysinternals Process Explorer) application paging requests go from 200 KB per sec to up to 4MB per sec.
Question: "Why"?
Answer: The whole application is in RAM.
Question: "Does the application know that it is completely running in RAM?"
Answer: No. It is running the same old way that it always ran, "THINKING" that it has part of itself as the "Working Set" memory living in RAM and the remaining application memory requirements go into the Windows pagefile.
Yes, it is that flipping GOOD.
Please Note: Microsoft has done a poor job of telling anyone about this great Windows OS option. Duh

How to obtain good concurrent read performance from disk

I'd like to ask a question then follow it up with my own answer, but also see what answers other people have.
We have two large files which we'd like to read from two separate threads concurrently. One thread will sequentially read fileA while the other thread will sequentially read fileB. There is no locking or communication between the threads, both are sequentially reading as fast as they can, and both are immediately discarding the data they read.
Our experience with this setup on Windows is very poor. The combined throughput of the two threads is in the order of 2-3 MiB/sec. The drive seems to be spending most of its time seeking backwards and forwards between the two files, presumably reading very little after each seek.
If we disable one of the threads and temporarily look at the performance of a single thread then we get much better bandwidth (~45 MiB/sec for this machine). So clearly the bad two-thread performance is an artefact of the OS disk scheduler.
Is there anything we can do to improve the concurrent thread read performance? Perhaps by using different APIs or by tweaking the OS disk scheduler parameters in some way.
Some details:
The files are in the order of 2 GiB each on a machine with 2GiB of RAM. For the purpose of this question we consider them not to be cached and perfectly defragmented. We have used defrag tools and rebooted to ensure this is the case.
We are using no special APIs to read these files. The behaviour is repeatable across various bog-standard APIs such as Win32's CreateFile, C's fopen, C++'s std::ifstream, Java's FileInputStream, etc.
Each thread spins in a loop making calls to the read function. We have varied the number of bytes requested from the API each iteration from values between 1KiB up to 128MiB. Varying this has had no effect, so clearly the amount the OS is physically reading after each disk seek is not dictated by this number. This is exactly what should be expected.
The dramatic difference between one-thread and two-thread performance is repeatable across Windows 2000, Windows XP (32-bit and 64-bit), Windows Server 2003, and also with and without hardware RAID5.
The problem seems to be in the Windows I/O scheduling policy. According to what I found here there are many ways for an O.S. to schedule disk requests. While Linux and others can choose between different policies, before Vista, Windows was locked into a single policy: a FIFO queue, where all requests were split into 64 KB blocks. I believe that this policy is the cause of the problem you are experiencing: the scheduler will mix requests from the two threads, causing continuous seeking between different areas of the disk.
Now, the good news is that according to here and here, Vista introduced a smarter disk scheduler, where you can set the priority of your requests and also allocate a minimum bandwidth for your process.
The bad news is that I found no way to change the disk policy or buffer sizes in previous versions of Windows. Also, even if raising the disk I/O priority of your process boosts its performance relative to other processes, you still have the problem of your threads competing against each other.
What I can suggest is to modify your software by introducing a self-made disk access policy.
For example, you could use a policy like this in your thread B (similar for Thread A):
if THREAD A is reading from disk then wait for THREAD A to stop reading or wait for X ms
Read for X ms (or Y MB)
Stop reading and check status of thread A again
You could use semaphores for status checking or you could use perfmon counters to get the status of the actual disk queue.
The values of X and/or Y could also be auto-tuned by checking the actual transfer rates and slowly modifying them, thus maximizing the throughput when the application runs on different machines and/or O.S. You may find that cache, memory or RAID levels affect them one way or another, but with auto-tuning you should get close to the best performance in each scenario.
I'd like to add some further notes in my response. All other non-Microsoft operating systems we have tested do not suffer from this problem. Linux, FreeBSD, and Mac OS X (this final one on different hardware) all degrade much more gracefully in terms of aggregate bandwidth when moving from one thread to two. Linux for example degraded from ~45 MiB/sec to ~42 MiB/sec. These other operating systems must be reading larger chunks of the file between each seek, and therefore not spending nearly all their time waiting on the disk to seek.
Our solution for Windows is to pass the FILE_FLAG_NO_BUFFERING flag to CreateFile and use large (~16MiB) reads in each call to ReadFile. This is suboptimal for several reasons:
Files don't get cached when read like this, so there are none of the advantages that caching normally gives.
The constraints when working with this flag are much more complicated than normal reading (alignment of read buffers to page boundaries, etc).
(As a final remark: does this explain why swapping under Windows is so hellish? I.e., Windows is incapable of doing IO to multiple files concurrently with any efficiency, so while swapping, all other IO operations are forced to be disproportionately slow.)
Edit to add some further details for Will Dean:
Of course across these different hardware configurations the raw figures did change (sometimes substantially). The problem however is the consistent degradation in performance that only Windows suffers when moving from one thread to two. Here is a summary of the machines tested:
Several Dell workstations (Intel Xeon) of various ages running Windows 2000, Windows XP (32-bit), and Windows XP (64-bit) with single drive.
A Dell 1U server (Intel Xeon) running Windows Server 2003 (64-bit) with RAID 1+0.
An HP workstation (AMD Opteron) with Windows XP (64-bit), and Windows Server 2003, and hardware RAID 5.
My home unbranded PC (AMD Athlon64) running Windows XP (32-bit), FreeBSD (64-bit), and Linux (64-bit) with single drive.
My home MacBook (Intel Core1) running Mac OS X, single SATA drive.
My home Koolu PC running Linux. Vastly underpowered compared to the other systems but I demonstrated that even this machine can outperform a Windows server with RAID5 when doing multi-threaded disk reads.
CPU usage on all of these systems was very low during the tests and anti-virus was disabled.
I forgot to mention before but we also tried the normal Win32 CreateFile API with the FILE_FLAG_SEQUENTIAL_SCAN flag set. This flag didn't fix the problem.
It does seem a little strange that you see no difference across quite a wide range of Windows versions and nothing between a single drive and hardware RAID 5.
It's only 'gut feel', but that does make me doubtful that this is really a simple seeking problem. Other than the OS X and the RAID 5, was all this tried on the same machine - have you tried another machine? Is your CPU usage basically zero during this test?
What's the shortest app you can write which demonstrates this problem? - I would be interested to try it here.
I would create some kind of in-memory, thread-safe lock. Each thread could wait on the lock until it was free. When the lock becomes free, take the lock and read the file for a defined length of time or a defined amount of data, then release the lock for any other waiting threads.
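A minimal sketch of that turn-taking scheme, written in Python for brevity: each reader holds a shared lock while it reads one burst, then releases it so the other reader can have the disk. The burst and chunk sizes, the function names and the temporary test files are all assumptions for illustration. `threading.Lock` makes no fairness guarantee, so strict alternation is not enforced here.

```python
import os
import tempfile
import threading

disk_lock = threading.Lock()

def read_in_bursts(path, burst_bytes, chunk_bytes, totals):
    """Read `path` sequentially, holding the shared lock for one burst at a time."""
    done = False
    with open(path, "rb") as f:
        while not done:
            with disk_lock:  # only one thread reads from the disk at a time
                read_this_burst = 0
                while read_this_burst < burst_bytes:
                    data = f.read(chunk_bytes)
                    if not data:
                        done = True
                        break
                    read_this_burst += len(data)
                    totals[path] = totals.get(path, 0) + len(data)
            # the lock is released here so a waiting reader can take its turn

def demo_bursts():
    """Create two small temporary files and read them from two threads."""
    paths, totals = [], {}
    for size in (300_000, 500_000):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * size)
        paths.append(path)
    threads = [threading.Thread(target=read_in_bursts,
                                args=(p, 64 * 1024, 16 * 1024, totals))
               for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for p in paths:
        os.remove(p)
    return sorted(totals.values())

print(demo_bursts())  # prints [300000, 500000]
```

For real files, the burst would be tens of megabytes so that each turn amortizes the seek that precedes it.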
Do you use I/O completion ports under Windows? Windows via C++ has an in-depth chapter on this subject and, as luck would have it, it is also available on MSDN.
Paul - saw the update. Very interesting.
It would be interesting to try it on Vista or Win2008, as people seem to be reporting some considerable I/O improvements on these in some circumstances.
My only suggestion about a different API would be to try memory mapping the files - have you tried that? Unfortunately at 2GB per file, you're not going to be able to map multiple whole files on a 32-bit machine, which means this isn't quite as trivial as it might be.
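One way around the address-space limit is to map the file one window at a time instead of all at once. A rough Python sketch follows; the window size and function names are assumptions, and whether this helps the two-thread throughput at all is untested here. Mapping offsets must be multiples of the platform's allocation granularity, which `mmap.ALLOCATIONGRANULARITY` exposes.

```python
import mmap
import os
import tempfile

def scan_in_windows(path, window_bytes):
    """Map `path` one window at a time and return the number of bytes seen."""
    gran = mmap.ALLOCATIONGRANULARITY
    # Round the window down to a multiple of the granularity (at least one unit).
    window = max(gran, window_bytes // gran * gran)
    size = os.path.getsize(path)
    seen = 0
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as view:
                seen += len(view[:])  # copy the window out, then discard it
            offset += length
    return seen

def demo_windows():
    """Scan a small temporary file that spans several windows."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"x" * (3 * mmap.ALLOCATIONGRANULARITY + 123))
        return scan_in_windows(path, mmap.ALLOCATIONGRANULARITY), os.path.getsize(path)
    finally:
        os.remove(path)

seen, size = demo_windows()
print(seen == size)  # prints True
```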
