Soft realtime on windows OS- what to consider? - windows

What consideration should we have (both software and hardware) when we build a soft-realtime application on windows : a task that occurs every XXX milliseconds and that should be completed within YYY milliseconds. (Altough consequences of missing a deadline are bad, the application can still recover from missed deadline - hence the "soft" realtime).
A few questions that already comes to my mind:
Are there registry settings that should be changed, looked at?
Is it better to use external graphic card instead of onboard video?
Example expected answer:
You should read on (and disable) Nagle Algorithm if you use TCP as it can delay packet sending.
(This could maybe be turned in community wiki)

Consider using Multimedia Class Scheduler Service
From the doc
The Multimedia Class Scheduler service (MMCSS) enables multimedia
applications to ensure that their time-sensitive processing receives
prioritized access to CPU resources. This service enables multimedia
applications to utilize as much of the CPU as possible without denying
CPU resources to lower-priority applications
Another option availale to you is to adjust your thread priorities but you need to be very careful not to get to aggressive with this.

Hardware-wise, will this be running on server-class equipment? If so, the usual steps apply. Disable hyperthreading, turbo boost, and CPU C-states. Implement some level of CPU-affinity on your critical processes.

Related

How to profile set of process in freeBSD?

I am trying to debug a service with respect to its performance. The service I am trying to debug, internally spawns instances of the same binary. To improve the through-put, I am planning to increase number of instances of the binary. After a point in number of processes of the binary, through-put is not increasing. Now I am trying to reason-out why this is happening.
I need some help on where to start, tools available for process level profiling. I am using freeBSD platform.
If using more processes doesn't improve output, then your service isn't CPU bound. It might be constrained by e.g. disk or network throughput instead.
Start with systat. Especially systat -vmstat. See man systat.
This will show you several aspects (like memory usage, interrupts, processot usage and disk activity) of how busy your system is.
If your program does a lot of network activity, using systat -tcp might give insight as well.
If your service is a HTTP server, you might want to look at varnish.

Two-way communication between kernel-mode driver and user-mode application?

I need a two-way communication between a kernel-mode WFP driver and a user-mode application. The driver initiates the communication by passing a URL to the application which then does a categorization of that URL (Entertainment, News, Adult, etc.) and passes that category back to the driver. The driver needs to know the category in the filter function because it may block certain web pages based on that information. I had a thread in the application that was making an I/O request that the driver would complete with the URL and a GUID, and then the application would write the category into the registry under that GUID where the driver would pick it up. Unfortunately, as the driver verifier pointed out, this is unstable because the Zw registry functions have to run at PASSIVE_LEVEL. I was thinking about trying the same thing with mapped memory buffers, but I’m not sure what the interrupt requirements are for that. Also, I thought about lowering the interrupt level before the registry function calls, but I don't know what the side effects of that are.
You just need to have two different kinds of I/O request.
If you're using DeviceIoControl to retrieve the URLs (I think this would be the most suitable method) this is as simple as adding a second I/O control code.
If you're using ReadFile or equivalent, things would normally get a bit messier, but as it happens in this specific case you only have two kinds of operations, one of which is a read (driver->application) and the other of which is a write (application->driver). So you could just use WriteFile to send the reply, including of course the GUID so that the driver can match up your reply to the right query.
Another approach (more similar to your original one) would be to use a shared memory buffer. See this answer for more details. The problem with that idea is that you would either need to use a spinlock (at the cost of system performance and power consumption, and of course not being able to work on a single-core system) or to poll (which is both inefficient and not really suitable for time-sensitive operations).
There is nothing unstable about PASSIVE_LEVEL. Access to registry must be at PASSIVE_LEVEL so it's not possible directly if driver is running at higher IRQL. You can do it by offloading to work item, though. Lowering the IRQL is usually not recommended as it contradicts the OS intentions.
Your protocol indeed sounds somewhat cumbersome and doing a direct app-driver communication is probably preferable. You can find useful information about this here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff554436(v=vs.85).aspx
Since the callouts are at DISPATCH, your processing has to be done either in a worker thread or a DPC, which will allow you to use ZwXXX. You should into inverted callbacks for communication purposes, there's a good document on OSR.
I've just started poking around WFP but it looks like even in the samples that they provide, Microsoft reinject the packets. I haven't looked into it that closely but it seems that they drop the packet and re-inject whenever processed. That would be enough for your use mode engine to make the decision. You should also limit the packet capture to a specific port (80 in your case) so that you don't do extra processing that you don't need.

monitoring application (CPU and cache usage) on single Linux box with 80 cores

I am looking for a performance monitoring tool for my application which will collect/visualize in realtime the CPU and cache usage on single Linux box like IBM System or HP ProLiant with typical configuration 8 processors / 80 cores.
Application is the home-grown multithreaded C+ code which uses OpenMP.
This monitoring tool should not run 24 hours per day; it should not do e-mail notification.
I will run this tool just before sending commands to my apps, the apps will execute the command (it may take as a maximum few minutes only). During this time interval I need to analyze:
- usage of cores
- data movement between processors
- usage of L1, L2, L3 caches
- some other metrics (help me here) which can help to find bottleneck in application
performance and resource utilization
I guess that tools like Nagios / Zabbix are too heavy for this task.
From another side using the command-line tools like "top" and "sar" for 80 cores not very convenient and plotting (not necessary real-time) would be nice to have...
While getting the per core usage is rather easy - the other values might prove to be not practical, not at least without running that application within a profiler of some sorts.
Measuring QPI utilization is something highly non-trivial if at all possible. Intel's vTune might be able to acquire such things but only when running instrumented version of your binaries.
Also on x86 there is no way to figure out L1,L2,L3 usage of any kind - you can grab the low level CPU counters to measure cache misses though (but would probably need to use instrumented/profiled binaries and always withan something like vTune or PAPI).
You could "easily" setup something to pull all the lower level performance counters into SNMP and grab the SNMP values via standard SNMP capable monitoring tools but be aware that SNMP pulling is something that you don't want to occur more than 1-2/s max. Or pull that info into something like collectd.
I'm also having the impression that you don't understand the problem domain of monitoring tools. They are not ment to be used as low level analysis probes for finding application level/system bottlenecks - at best you could get some hints which resource (from a 10K feet view) is running under full utilization. Monitoring and alterting tools are something that operations staff needs to use to understand which part of their IT system is currently used and how, to gather historical data and predict future resource utilization and to be alerted when something breaks.
SiteScope, Hyperic or any combination of shell scripts, native OS utilities and a DB to store the results may do the job.

Diagnosing pathological behavior of a piece of cluster software

I'm using a kind of load balancer over a small cluster that is able to achieve >2000rps on zero-duration requests (t.i. ones that are immediately satisfied by the worker nodes).
But as soon as the requests stop being zero-duration and start taking even 1ms, performance immediately drops >10x. The data being transfered in both directions is identical and is about 2kb in size.
This is for sure not related to saturation of the cluster or network throughput, because 200rps of 1ms requests is a very tiny load and the network is 10Gbit. Besides, the CPU load is just some 2-5% both on the load balancer and on the worker nodes.
I wonder whether that might be related to some pathological behavior of the OS scheduler, or the OS network stack (t.i. there is some special case behavior for very short interactions).
How might I diagnose the reason? Which perfcounters to watch? What tools or methodologies to use?
(Just in case someone simply knows the answer to my particular problem, I'm talking about the MS HPC Server 2008 R2's "WCF Broker", running on Windows Server 2008 R2 over Hyper-V)
One thing you can do is use ETW tracing to try and understand what the nodes are doing while your WCF job is running. On HPC server, I sometimes clusrun xperf to collect traces on all or specific nodes. There are a number of tools that you can use for analyzing ETW traces, including xperf itself. I haven't done any serious work using HPC SOA (WCF), but I did write a simple WCF raytracer app and then used xperf to profile it on several of the nodes.
Turned out it was a completely network-unrelated issue having to do with peculiarities of the scheduling mechanism of HPC Server. I resolved the issue by tweaking a configuration option "serviceRequestPrefetchCount" to 0 in the loadBalancing section of the WCF service config file.
I'm assuming that there are some shared resources with some kind of locking system in place? Is locking a bottleneck? It's hard to guess without seeing the system.
Do you have a way to profile the workers? What are they spending most of their time on, especially in the fast vs slow scenarios?

CPU Utiliztion per process in Win32 API

I am doing a project on a centralized LAN management system. I need to know how many CPU cycles is each process of a remote PC consuming(as in a Task Manager )so that the network admin can close few processes,in case the CPU utilization of a system in network goes beyond acceptable rates..
I would like to know if there is a Win32 API for this requirement of mine n if so ,i request you to give me information about it..
thank you in advance..
Win32 API has lots of functions to find all kinds of information about currently running processes and threads, here's a link to the full list of them: http://msdn.microsoft.com/en-us/library/ms683223(VS.85).aspx
Explore the list and you should be able to find the function(s) there that meet your requirements, for example GetProcessTimes() returns structures that contain the amounts of time the process has executed in kernel mode, in user mode, etc.
You need to look at the performance monitor system. You can get the stats from there (in the Process counter).
Here's a (delphi) explanation of it, that's pretty good and simple to understand.
When you understand how it all works, you then need the Performance Counters API to read the data counters.

Resources