Measuring process peak memory usage post mortem - winapi

I'm working on a benchmark tool which, among other things, measures the time and memory used by an external process performing an operation. I'm mostly interested in the peak pageable memory size (a.k.a. the PageFileBytesPeak performance counter / Process.PeakPagedMemorySize64 / peak private bytes). This is a .NET project, so a pure .NET solution would be preferable; however, that is most likely not possible.
The problem is that I won't know the peak memory usage before the process has exited, and I can't read the performance counters for a process that no longer exists. I could instead poll while the process is running.
However, that is not ideal: if I poll too often I will interfere with the time it takes for the process to complete its work, and if I poll too rarely the result won't be accurate (the process will most likely hit its peak memory usage right before it exits). So I'm hoping there is some way to do this reliably that is less hacky than the solutions I have come up with so far:
Inject DLL into process, report value via IPC mechanism on DLL_PROCESS_DETACH.
Patch/Hook ExitProcess in target process, report value via IPC mechanism before executing real ExitProcess.
Pretend to be a debugger, measure value on EXIT_PROCESS_DEBUG_EVENT (the process won't be cleaned up by the kernel before ContinueDebugEvent is called).
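For illustration, here is a rough, untested sketch of that last, debugger-based option (the command line is a placeholder):

#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main()
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    char cmdLine[] = "target.exe";   // placeholder command line

    // Launch the target as a debuggee so we receive its debug events.
    if (!CreateProcessA(NULL, cmdLine, NULL, NULL, FALSE,
                        DEBUG_ONLY_THIS_PROCESS, NULL, NULL, &si, &pi))
        return 1;

    for (;;)
    {
        DEBUG_EVENT ev = {};
        if (!WaitForDebugEvent(&ev, INFINITE))
            break;

        if (ev.dwDebugEventCode == EXIT_PROCESS_DEBUG_EVENT)
        {
            // The kernel keeps the process around until ContinueDebugEvent,
            // so the counters are still readable here.
            PROCESS_MEMORY_COUNTERS pmc = {};
            if (GetProcessMemoryInfo(pi.hProcess, &pmc, sizeof(pmc)))
                std::printf("Peak pagefile usage: %zu\n", (size_t)pmc.PeakPagefileUsage);
            ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
            break;
        }

        // Pass all other events straight on (a real implementation would also
        // close the handles delivered with the create/load-DLL events and use
        // DBG_EXCEPTION_NOT_HANDLED for exception events).
        ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
    }

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}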

Reading extant PerfMon counters should be a very low overhead operation, esp. for system counters like the ones you want to work with, since the counters are typically (possibly always? not sure) implemented using a block of shared memory (mapped file).
I'd implement polling with a runtime configurable interval, and only resort to more complex techniques if you find this is affecting your application materially. If you want to sanity check this first, set up PerfMon to monitor the counter(s) of interest and see if that kills your application when running at a usable refresh interval.
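For example, a rough C++ sketch of such a polling loop, using GetProcessMemoryInfo from PSAPI as a stand-in for the counter API (link against psapi.lib; the process handle needs PROCESS_QUERY_INFORMATION and SYNCHRONIZE access, and the interval is whatever you make configurable):

#include <windows.h>
#include <psapi.h>

// Poll the peak pagefile usage of a running process at a configurable
// interval until it exits, and return the largest value seen.
SIZE_T PollPeakPagefileUsage(HANDLE hProcess, DWORD intervalMs)
{
    SIZE_T peak = 0;
    for (;;)
    {
        PROCESS_MEMORY_COUNTERS pmc = {};
        if (GetProcessMemoryInfo(hProcess, &pmc, sizeof(pmc)) &&
            pmc.PeakPagefileUsage > peak)
        {
            peak = pmc.PeakPagefileUsage;
        }

        // Either the process exits, or the wait times out and we poll again.
        if (WaitForSingleObject(hProcess, intervalMs) == WAIT_OBJECT_0)
            break;
    }
    return peak;
}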

Turns out that GetProcessMemoryInfo works even after the process has exited, as long as you still hold an open handle to it. The virtual memory usage isn't available this way, though, if you happen to need that.
The only caveat is that the size of the values depends on the bitness of the process it's called from, so the values may overflow if a 32-bit process measures the memory usage of a 64-bit process.
Example:
[DllImport("psapi.dll", SetLastError = true)]
static extern bool GetProcessMemoryInfo(IntPtr hProcess, out PROCESS_MEMORY_COUNTERS counters, int size);

[StructLayout(LayoutKind.Sequential)]
private struct PROCESS_MEMORY_COUNTERS
{
    public uint cb;
    public uint PageFaultCount;
    public UIntPtr PeakWorkingSetSize;
    public UIntPtr WorkingSetSize;
    public UIntPtr QuotaPeakPagedPoolUsage;
    public UIntPtr QuotaPagedPoolUsage;
    public UIntPtr QuotaPeakNonPagedPoolUsage;
    public UIntPtr QuotaNonPagedPoolUsage;
    public UIntPtr PagefileUsage;
    public UIntPtr PeakPagefileUsage;
}

public long BenchmarkProcessMemoryUsage(string fileName, string arguments)
{
    ProcessStartInfo startInfo = new ProcessStartInfo(fileName, arguments);
    startInfo.UseShellExecute = false;

    Process process = Process.Start(startInfo);
    process.WaitForExit();

    // The Process object still holds an open handle, so the counters
    // can be read even though the process has exited.
    PROCESS_MEMORY_COUNTERS counters;
    if (!GetProcessMemoryInfo(process.Handle, out counters, Marshal.SizeOf(typeof(PROCESS_MEMORY_COUNTERS))))
        throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());

    return (long)counters.PeakPagefileUsage.ToUInt64();
}

Related

How to check which index in a loop is executing without slow down process?

What is the best way to check which index a loop is currently executing, without slowing the process down too much?
For example, I want to find all long fancy numbers and have a loop like
for (long i = 1; i > 0; i++) {
    //block
}
and I want to see which i is executing in real time.
Several ways I know of are printing i every time in the block, checking if (i % 10000 == 0) and only printing then, or adding a listener.
Which one of these is the fastest? Or what do you do in similar cases? Is there any way to access the value of i manually?
Most of my recent experience is with Java, so I'd write something like this
import java.util.concurrent.atomic.AtomicLong;

public class Example {
    public static void main(String[] args) {
        AtomicLong atomicLong = new AtomicLong(1); // initialize to 1
        LoopMonitor lm = new LoopMonitor(atomicLong);
        Thread t = new Thread(lm);
        t.start(); // start LoopMonitor
        while (atomicLong.get() > 0) {
            long l = atomicLong.getAndIncrement(); // equivalent to long l = atomicLong++ if atomicLong were a primitive
            //block
        }
    }

    private static class LoopMonitor implements Runnable {
        private final AtomicLong atomicLong;

        public LoopMonitor(AtomicLong atomicLong) {
            this.atomicLong = atomicLong;
        }

        public void run() {
            while (true) {
                try {
                    System.out.println(atomicLong.longValue()); // print the current value
                    Thread.sleep(1000); // sleep for one second
                } catch (InterruptedException ex) {}
            }
        }
    }
}
Most AtomicLong implementations can be set in one clock cycle even on 32-bit platforms, which is why I used it here instead of a primitive long (you don't want to inadvertently print a half-set long); look into your compiler / platform details to see if you need something like this, but if you're on a 64-bit platform then you can probably use a primitive long regardless of which language you're using. The modified for loop doesn't take much of an efficiency hit - you've replaced a primitive long with a reference to a long, so all you've added is a pointer dereference.
It won't be easy, but probably the only way to probe the value without affecting the process is to access the loop variable in shared memory with another thread. Threading libraries vary from one system to another, so I can't help much there (on Linux I'd probably use pthreads). The "monitor" thread might do something like probe the value once a minute, sleep()ing in between, and so allowing the first thread to run uninterrupted.
To get reporting at essentially no cost (on multi-CPU machines): make your index a "global" property (class-wide, for instance), and have a separate thread read and report the index value.
This report could be timer-based (5 times per second or so).
Note: you may also need a boolean stating 'are we in the loop?'.
Volatile and Caches
If you're going to be doing this in, say, C / C++ and use a separate monitor thread as previously suggested, then you'll have to make the global/static loop variable volatile. You don't want the compiler deciding to use a register for the loop variable. Some toolchains make that assumption anyway, but there's no harm in being explicit about it.
And then there's the small issue of caches. A separate monitor thread nowadays will end up on a separate core, and that'll mean that the two separate cache subsystems will have to agree on what the value is. That will unavoidably have a small impact on the runtime of the loop.
Real real time constraint?
So that raises the question: just how real-time is your loop anyway? I doubt that your timing constraint is such that you're depending on it running within a specific number of CPU clock cycles, for two reasons: a) no modern OS will ever come close to guaranteeing that (you'd have to be running on bare metal), and b) most CPUs these days vary their own clock rate behind your back, so you can't count on a specific number of clock cycles corresponding to a specific real-time interval.
Feature rich solution
So assuming that your real time requirement is not that constrained, you may wish to do a more capable monitor thread. Have a shared structure protected by a semaphore which your loop occasionally updates, and your monitor thread periodically inspects and reports progress. For best performance the monitor thread would take the semaphore, copy the structure, release the semaphore and then inspect/print the structure, minimising the semaphore locked time.
The only advantage of this approach over that suggested in previous answers is that you could report more than just the loop variable's value. There may be more information from your loop block that you'd like to report too.
Mutex semaphores in, say, C on Linux are pretty fast these days. Unless your loop block is very lightweight the runtime overhead of a single mutex is not likely to be significant, especially if you're updating the shared structure every 1000 loop iterations. A decent OS will put your threads on separate cores, but for the sake of good form you'd make the monitor thread's priority higher than the thread running the loop. This would ensure that the monitoring does actually happen if the two threads do end up on the same core.
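A minimal sketch of that shared-structure idea, using std::thread/std::mutex rather than raw pthreads (the names, update interval, and reporting period are just placeholders):

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

// Whatever the loop block wants to report, gathered in one structure.
struct Progress {
    long index = 0;
    long itemsFound = 0;
};

std::mutex progressMutex;
Progress progress;

void monitor() {
    for (;;) {
        Progress snapshot;
        {
            std::lock_guard<std::mutex> lock(progressMutex);
            snapshot = progress;          // copy, then release the lock quickly
        }
        std::printf("at %ld, found %ld\n", snapshot.index, snapshot.itemsFound);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

int main() {
    std::thread t(monitor);
    t.detach();

    for (long i = 1; i > 0; ++i) {
        // ... loop block ...
        if (i % 1000 == 0) {              // update the shared structure occasionally
            std::lock_guard<std::mutex> lock(progressMutex);
            progress.index = i;
        }
    }
    return 0;
}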

memory pool usage (boost::pool) for variable sized buffers?

The bottleneck of my current project is heap allocation: profiling shows that one critical thread spends about 50% of its time in the new operator.
The application cannot use stack memory here and needs to allocate a lot of one central job structure—a custom job/buffer implementation: small and short-lived but variable in size. The objects themselves live on the heap, are managed via std::shared_ptr/std::weak_ptr, and carry a classic C-array (char*) payload.
Depending on the runtime configuration and the workload in different parts, 300k-500k objects might be created and in use at the same time (though this should usually not happen). Since it's an x64 application, memory fragmentation isn't that big a deal (but it might become one if x86 is also targeted).
To increase speed and packet throughput, and to be safe against memory fragmentation in the future, I was thinking about using some memory management pool, which led me to boost::pool.
Almost all examples use fixed-size objects, but I'm unsure how to deal with a variable-length payload. A simplified object like this could be created using a boost::pool, but I'm unsure what to do with the payload. Is it usable with a boost::pool at all?
class job {
public:
    static std::shared_ptr<job> newObj();

private:
    delegate_t call;
    args_t * args;
    unsigned char * payload;
    size_t payload_size;
};
Usually the objects are destroyed when all references to the shared_ptr go out of scope, and I wouldn't want to change the shared_ptr back to a raw C pointer. Deferred destruction of the objects should also work to increase performance and, from what I read, should work better with a boost::pool. I haven't found whether the pool supports any interaction with the smart pointer. The alternative, quirky way would be to store a reference to each shared_ptr at creation together with the pool and release them in blocks.
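For illustration, the kind of smart pointer interaction I mean would be roughly this (an untested sketch using boost::object_pool with a custom deleter; it still leaves the variable-length payload question open):

#include <boost/pool/object_pool.hpp>
#include <cstddef>
#include <memory>

struct job {
    unsigned char * payload = nullptr;
    std::size_t payload_size = 0;
};

// Hand out pool-backed objects through shared_ptr; the custom deleter
// returns them to the pool instead of calling delete. The pool must
// outlive every shared_ptr created from it.
std::shared_ptr<job> newObj(boost::object_pool<job> & pool)
{
    job * raw = pool.construct();
    return std::shared_ptr<job>(raw, [&pool](job * p) { pool.destroy(p); });
}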
Does anyone have experience with the two: boost::pool usage with variable-sized objects, and smart pointer interaction?
Thank you!

getloadavg() within kernel

Is there an equivalent API to getloadavg() that can be used within the kernel, i.e. from my own driver?
I have a driver that is thrashing and I would like to throttle it, and I am looking for a kernel API to find out about the CPU usage.
Thank you.
You're probably looking for the get_avenrun() function in kernel/sched.c. An example of how to use it is in fs/proc/loadavg.c:
static int loadavg_proc_show(struct seq_file *m, void *v)
{
    unsigned long avnrun[3];

    get_avenrun(avnrun, FIXED_1/200, 0);

    seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
        LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
        LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
        LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
        nr_running(), nr_threads,
        task_active_pid_ns(current)->last_pid);
    return 0;
}
Though I'm a little skeptical of how you can use the load average to modify a driver -- the load average is best treated as a heuristic for system administrators to gauge how their system changes over time, not necessarily how "healthy" it might be at any given moment -- what specifically in the driver is causing troubles? There's probably a better mechanism to make it play nicely with the rest of the system.

How to know what amount of memory I'm using in a process? win32 C++

I'm using Win32 C++ in CodeGear Builder 2009, targeting Windows XP Embedded.
I found the PROCESS_MEMORY_COUNTERS_EX struct and have created a simple function to return the memory consumption of my process:
SIZE_T TForm1::ProcessPrivatBytes( DWORD processID )
{
    SIZE_T lRetval = 0;
    HANDLE hProcess;
    PROCESS_MEMORY_COUNTERS_EX pmc;

    hProcess = OpenProcess( PROCESS_QUERY_INFORMATION |
                            PROCESS_VM_READ,
                            FALSE, processID );
    if (NULL == hProcess)
    {
        lRetval = 1;
    }
    else
    {
        if ( GetProcessMemoryInfo( hProcess, (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc)) )
        {
            lRetval = pmc.WorkingSetSize;
            lRetval = pmc.PrivateUsage;
        }
        CloseHandle( hProcess );
    }
    return lRetval;
}
//---------------------------------------------------------------------------
Do I have to use lRetval = pmc.WorkingSetSize; or lRetval = pmc.PrivateUsage;?
PrivateUsage is what I see in perfmon, but what is WorkingSetSize exactly?
I want to see every byte I allocate reflected in the counter as I allocate it. Is this possible?
Regards,
jvdn
This is a much tougher question than you probably realized. The reason is that Windows shares most executable code (especially the code that makes up most of Windows itself) between processes. For example, there's normally ONE copy of kernel32.dll loaded into memory, but it'll normally be mapped into every process. Do you consider that part of the memory your process is "using" or not?
Private memory is what's unique to that particular process. This can be somewhat misleading too. Since the executable for your process could potentially be shared with another process (i.e. two instances of your program could be run), that's not counted as part of the private memory, even if (as is often the case) there's only one instance of it running.
The working set size is about 99.999% meaningless. What it returns is whatever has been set as the preferred working set size for the process. You can adjust that with SetProcessWorkingSetSize(). Windows has a working set trimmer that attempts to trim down working sets. If memory serves, it uses the working set size to guess at whether it's worth trying to trim the working set of this process -- i.e. if its current working set is larger than the working set size was set to, it tries to trim it down. Otherwise, it (mostly) leaves it alone.
Chances are that nothing you do will show you every byte you allocate as you allocate it, though. Calling Windows to allocate memory is fairly slow, so what's normally done is that the run-time library allocates a fairly big chunk of memory from Windows. When you allocate memory, the run-time library gives you a piece of that big chunk. Only when that chunk is gone does it go back to Windows and ask for more.

What is the win32 API function for private bytes?

What is the win32 API function for private bytes (the ones you can see in perfmon)?
I'd like to avoid the .NET API.
BOOL WINAPI GetProcessMemoryInfo(
__in HANDLE Process,
__out PPROCESS_MEMORY_COUNTERS ppsmemCounters,
__in DWORD cb
);
The ppsmemCounters parameter can point to either a PROCESS_MEMORY_COUNTERS or a PROCESS_MEMORY_COUNTERS_EX structure; just cast the PROCESS_MEMORY_COUNTERS_EX pointer to PROCESS_MEMORY_COUNTERS.
PROCESS_MEMORY_COUNTERS_EX.PrivateUsage is what you're looking for.
More info here and here
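For example, a minimal call for the current process might look like this (error handling trimmed; link against psapi.lib):

#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main()
{
    PROCESS_MEMORY_COUNTERS_EX pmc = {};
    pmc.cb = sizeof(pmc);

    // Pass the EX structure through the PROCESS_MEMORY_COUNTERS* parameter.
    if (GetProcessMemoryInfo(GetCurrentProcess(),
                             (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc)))
    {
        std::printf("Private bytes: %zu\n", (size_t)pmc.PrivateUsage);
    }
    return 0;
}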
You can collect the same data perfmon shows using the performance counters API.
You need to clarify what you are trying to do. These are internal figures whose value is not really controlled by any API.
Technically, Private Bytes is the commit charge: the amount of memory allocated in the swap file to hold the contents of the application's private memory should it be swapped out.
Generally private bytes = amount of dynamically allocated memory + some extra.
