We have some code that relies on extensive use of fork. We have started to hit performance problems, and one of our hypotheses is that we are losing a lot of speed to copy-on-write in the forked processes.
Is there a way to detect specifically when and how copy-on-write happens, to get detailed insight into this process?
My platform is OS X, but more general information is also appreciated.
There are a few ways to get this info on OS X. If you're satisfied with just watching information about copy-on-write behavior from the command-line, you can use the vm_stat tool with an interval. E.g., vm_stat 0.5 will print full statistics twice per second. One of the columns is the number of copy-on-write faults.
If you'd like to gather specific information in a more detailed way, but still from outside the actual running process, you can use the Instruments application that comes with OS X. This includes a set of tools for gathering information about a running process, the most useful of which for your case are likely to be the VM Tracker, Virtual Memory Trace, or Shared Memory instruments. These capture lots of useful information over the lifetime of a process. The application is not super intuitive, but it will do what you need.
If you'd like detailed information in-process, I think you'll need to use the (poorly documented) VM statistics API. You can request that the kernel fill a vm_statistics struct using the host_statistics routine. For example, running this code:
    #include <mach/mach.h>   /* vm_statistics_data_t, host_statistics() */

    mach_msg_type_number_t count = HOST_VM_INFO_COUNT;
    vm_statistics_data_t vmstats;
    kern_return_t kr = host_statistics(mach_host_self(), HOST_VM_INFO,
                                       (host_info_t)&vmstats, &count);
will fill the vmstats structure with information such as cow_faults, which gives the number of faults triggered by copy-on-write behavior. Check out the headers /usr/include/mach/vm_*, which declare the types and routines for gathering this information.
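If it helps, here is a minimal sketch of how you might use that from inside your program, sampling the counter around a fork-heavy section. Note that HOST_VM_INFO statistics are machine-wide rather than per-process, so the difference between two samples is only meaningful on a reasonably quiet machine:

    #include <stdio.h>
    #include <mach/mach.h>

    /* Return the current machine-wide copy-on-write fault count. */
    static natural_t cow_fault_count(void) {
        vm_statistics_data_t vmstats;
        mach_msg_type_number_t count = HOST_VM_INFO_COUNT;
        if (host_statistics(mach_host_self(), HOST_VM_INFO,
                            (host_info_t)&vmstats, &count) != KERN_SUCCESS)
            return 0;
        return vmstats.cow_faults;
    }

    int main(void) {
        natural_t before = cow_fault_count();
        /* ... fork() and run the workload you want to measure ... */
        natural_t after = cow_fault_count();
        printf("copy-on-write faults during workload: %u\n", after - before);
        return 0;
    }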
I'm doing this as a personal project: I want to make a visualizer for this data, but the first step is getting the data.
My current plan is to:
have my program attach to the target process as a debugger and single-step through it,
at each step, record EIP from every thread's context within the target process,
reconstruct the memory address the instruction uses from the context and store it.
Is there an easier or built in way to do this?
Have a look at Intel Pin for dynamic binary instrumentation / running a hook for every load / store instruction.
Instead of actually single-stepping in a debugger (extremely slow), it does binary-to-binary JIT to add calls to your hooks.
https://software.intel.com/sites/landingpage/pintool/docs/81205/Pin/html/index.html
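For reference, a memory-tracing Pin tool looks roughly like the sketch below. It is modeled closely on the pinatrace example from Pin's manual, so treat it as a starting point rather than a finished tool:

    #include <stdio.h>
    #include "pin.H"

    FILE* trace;

    // Called before every memory read / write, with the instruction pointer
    // and the effective address of the access.
    VOID RecordMemRead(VOID* ip, VOID* addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
    VOID RecordMemWrite(VOID* ip, VOID* addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

    // Instrumentation callback: runs once per instruction when it is first JITted.
    VOID Instruction(INS ins, VOID* v)
    {
        UINT32 memOperands = INS_MemoryOperandCount(ins);
        for (UINT32 memOp = 0; memOp < memOperands; memOp++)
        {
            if (INS_MemoryOperandIsRead(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
            if (INS_MemoryOperandIsWritten(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
        }
    }

    VOID Fini(INT32 code, VOID* v) { fclose(trace); }

    int main(int argc, char* argv[])
    {
        PIN_Init(argc, argv);
        trace = fopen("pinatrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }

You build it against the Pin kit and run it as roughly pin -t yourtool.so -- your-program; every read/write address (plus the instruction pointer) ends up in pinatrace.out.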
Honestly the best way to do this is probably instrumentation, as Peter suggested, depending on your goals. Have you ever run a script that stepped through code in a debugger? Even automated, it's incredibly slow.

The only other alternative I see is page faults, which would also be slow but should still be faster than single-stepping. Basically you make every page outside the currently executing section inaccessible. Any read/write access outside of executing code will then trigger an exception, where you can log the details and handle it. Of course this has a lot of flaws: you can't detect accesses within the current page, it's still going to be slow, and it gets complicated when you handle execution moving to another page, multiple threads, and so on.

The final possible solution would be a timer interrupt that checks the read/write (accessed/dirty) bits for each page. This would be very fast and, although it provides no specific addresses, would give you an aggregate of pages written to and read from. I'm not entirely sure off the top of my head whether Windows already exposes that information, and I'm also not sure there's a reliable way to guarantee your timer fires before the kernel clears those bits.
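To make the page-fault idea concrete, here is a minimal Windows sketch of just the core mechanism: protect a page, catch the access violation in a vectored exception handler, log the address and access type, then restore access and resume. A real tool would also need to re-protect pages after each access (e.g. via single-step), handle multiple threads, and leave the pages being executed alone:

    #include <windows.h>
    #include <stdio.h>

    // Log the access, then make the page accessible so the faulting instruction can retry.
    static LONG CALLBACK FaultLogger(EXCEPTION_POINTERS* info)
    {
        if (info->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION)
        {
            ULONG_PTR isWrite = info->ExceptionRecord->ExceptionInformation[0]; // 0 = read, 1 = write
            ULONG_PTR addr    = info->ExceptionRecord->ExceptionInformation[1]; // faulting address
            printf("%s of %p from IP %p\n", isWrite ? "write" : "read",
                   (void*)addr, info->ExceptionRecord->ExceptionAddress);

            DWORD old;
            VirtualProtect((LPVOID)addr, 1, PAGE_READWRITE, &old);
            return EXCEPTION_CONTINUE_EXECUTION;
        }
        return EXCEPTION_CONTINUE_SEARCH;
    }

    int main(void)
    {
        AddVectoredExceptionHandler(1, FaultLogger);

        // Allocate a page with no access; the first touch is logged by the handler.
        char* page = (char*)VirtualAlloc(NULL, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_NOACCESS);
        page[10] = 42;                    // triggers the handler, which logs and unprotects
        printf("value: %d\n", page[10]);  // no fault: the page is now readable/writable
        return 0;
    }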
When writing an AIX kernel extension, xmalloc can be used only in the process environment.
What memory allocation functions can be called from the interrupt environment in AIX?
Thanks.
The network memory allocation routines. Look in /usr/include/net/net_malloc.h. The lowest level is net_malloc and net_free.
I don't see much documentation in IBM's pubs nor the internet. There are a few examples in various header files.
There is no public prototype that I can find for these.
If you look in net_malloc.h, you will see MALLOC and NET_MALLOC macros defined that call it. Then if you grep in all the files under /usr/include, you will see uses of these macros. From these uses, you can deduce the arguments to the macros and thus deduce the arguments to net_malloc itself. I would make one routine that is a pass through to net_malloc that you controlled the interface to.
On your target system, do "netstat -m". The last bucket size you see will be the largest size you can call net_malloc with the M_NOWAIT flag. M_WAIT can be used only at process time and waits for netm to allocate more memory if necessary. M_NOWAIT returns with a 0 if there is not enough memory pinned. At interrupt time, you must use M_NOWAIT.
There is no real checking for the "type" but it is good to pick an appropriate type for debugging purposes later on. The netm output from kdb shows the type.
In a similar fashion, you can figure out how to call net_free.
It's sad that IBM has chosen not to document this. An alternative to get this information officially is to pay for an "ISV" question. If you are doing serious AIX development, you want to become an ISV / Partner; it will save you a lot of heartbreak. I don't know the cost, but it is within reach of small companies and even individuals.
This book is nice to have too.
I'd like my Windows C++ program to be able to read the number of hard page faults it has caused. The program isn't running as administrator. Edited to add: to be clear, I'm not as interested in the aggregate page-fault count of the whole system.
It looks like ETW might export counters for this, but I'm having a lot of difficulty figuring out the API, and it's not clear what's accessible by regular users as compared to administrators.
Does anyone have an example of this functionality lying around? Or is it simply not possible on Windows?
(OT, but isn't it sad how much easier this is on *nix? getrusage() and you're done.)
As far as I can tell, the only way to do this would be to use ETW (Event Tracing for Windows) to monitor kernel Hard Page Fault events. The event payload has a thread ID that you might be able to correlate with an existing process (this is going to be non-trivial, by the way) to produce a running per-process count. I don't see any way to get historical information per process.
I can guarantee you that this is A Hard Problem because Process Explorer supports only Page Faults (soft or hard) in its per-process display.
http://msdn.microsoft.com/en-us/magazine/ee412263.aspx
A page fault occurs when a sought-out page table entry is invalid. If the requested page needs to be brought in from disk, it is called a hard page fault (a very expensive operation), and all other types are considered soft page faults (a less expensive operation). A Page Fault event payload contains the virtual memory address for which a page fault happened and the instruction pointer that caused it. A hard page fault requires disk access to occur, which could be the first access to contents in a file or accesses to memory blocks that were paged out. Enabling Page Fault events causes a hard page fault to be logged as a page fault with a type Hard Page Fault. However, a hard fault typically has a considerably larger impact on performance, so a separate event is available just for a hard fault that can be enabled independently. A Hard Fault event payload has more data, such as file key, offset and thread ID, compared with a Page Fault event.
I think you can use GetProcessMemoryInfo() - Please refer to http://msdn.microsoft.com/en-us/library/ms683219(v=vs.85).aspx for more information.
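As a sketch of what that looks like (keep in mind that, as far as I can tell, PROCESS_MEMORY_COUNTERS::PageFaultCount lumps soft and hard faults together, so it doesn't quite answer the hard-faults-only part of the question):

    #include <windows.h>
    #include <psapi.h>     // GetProcessMemoryInfo; link with psapi.lib
    #include <stdio.h>

    int main(void)
    {
        PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
            printf("page faults (soft + hard) so far: %lu\n", pmc.PageFaultCount);
        return 0;
    }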
Yes, quite sad. Or you could just not assume Windows is so gimp that it doesn't even provide a page fault counter and look it up: Win32_PerfFormattedData_PerfOS_Memory.
There is a C/C++ sample on Microsoft's site that explains how to read performance counters: INFO: PDH Sample Code to Enumerate Performance Counters and Instances
You can copy/paste it; I think you're interested in the "Memory" / "Page Reads/sec" counter, as stated in this interesting article: The Basics of Page Faults
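If it saves you some digging, a minimal PDH query for that counter might look something like the following; rate counters need two samples, and \Memory\Page Reads/sec is system-wide rather than per-process:

    #include <windows.h>
    #include <pdh.h>       // link with pdh.lib
    #include <stdio.h>

    int main(void)
    {
        PDH_HQUERY query;
        PDH_HCOUNTER counter;
        PdhOpenQueryW(NULL, 0, &query);
        PdhAddEnglishCounterW(query, L"\\Memory\\Page Reads/sec", 0, &counter);

        PdhCollectQueryData(query);   // first sample
        Sleep(1000);
        PdhCollectQueryData(query);   // second sample, one second later

        PDH_FMT_COUNTERVALUE value;
        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
        wprintf(L"Page Reads/sec: %.2f\n", value.doubleValue);

        PdhCloseQuery(query);
        return 0;
    }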
This is done with performance counters in Windows. It's been a while since I've done anything with them. I don't recall whether or not you need to run as administrator to query them.
[Edit]
I don't have example code to provide but according to this page, you can get this information for a particular process:
Process: Page Faults/sec. This is an indication of the number of page faults that occurred due to requests from this particular process. Excessive page faults from a particular process are an indication usually of bad coding practices. Either the functions and DLLs are not organized correctly, or the data set that the application is using is being called in a less than efficient manner.
I don't think you need administrative credentials to enumerate the performance counters. There is a sample at CodeProject: Performance Counters Enumerator
Has anyone tried to create a log file of interprocess communications? Could someone give me a little advice on the best way to achieve this?
The question is not quite clear, and comments make it less clear, but anyway...
The two things to try first are ipcs and strace -e trace=ipc.
If you want to log all IPC (which seems very intensive), you should consider instrumentation.
There are a lot of good tools for this; check out Pin in particular, and this section of the manual:
In this example, we show how to do more selective instrumentation by examining the instructions. This tool generates a trace of all memory addresses referenced by a program. This is also useful for debugging and for simulating a data cache in a processor.
If you're doing some heavyweight tuning and analysis, check out TAU (Tuning and Analysis Utilities).
Communication to a kernel driver can take many forms. There is usually a special device file for communication, or there can be a special socket type, like NETLINK. If you are lucky, there's a character device to which read() and write() are the sole means of interaction - if that's the case then those calls are easy to intercept with a variety of methods. If you are unlucky, many things are done with ioctls or something even more difficult.
However, running 'strace' on the program using the kernel driver to communicate can reveal just about all it does - though 'ltrace' might be more readable if there happens to be libraries the program uses for communication. By tuning the arguments to 'strace', you can probably get a dump which contains just the information you need:
First, just eyeball the calls and try to figure out the means of kernel communication
Then, add filters to the strace call to log only the kernel communication calls
Finally, make sure strace logs the full strings of all calls (e.g. with -s and a large value), so you don't have to deal with truncated data
The answers which point to IPC debugging are probably not relevant, as communicating with the kernel almost never has anything to do with IPC (at least not the various UNIX IPC facilities).
Is there any way to set a system-wide memory limit a process can use in Windows XP? I have a couple of unstable apps which work fine most of the time but can hit a bug that eats the whole memory in a matter of seconds (or at least I suppose that's what happens). This results in a hard reset, as Windows becomes totally unresponsive and I lose my work.
I would like to be able to do something like the /etc/limits on Linux - setting M90, for instance (to set 90% max memory for a single user to allocate). So the system gets the remaining 10% no matter what.
Use Windows Job Objects. Jobs are like process groups and can limit memory usage and process priority.
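As an illustrative sketch (the target name unstable_app.exe and the 256 MB figure are placeholders), creating a job, setting a per-process committed-memory limit, and launching the app inside it might look like this:

    #include <windows.h>

    int main(void)
    {
        HANDLE job = CreateJobObject(NULL, NULL);

        // Cap each process in the job at 256 MB of committed memory (example value).
        JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
        limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
        limits.ProcessMemoryLimit = 256 * 1024 * 1024;
        SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                &limits, sizeof(limits));

        // Start the target suspended so it is inside the job before it runs any code.
        TCHAR cmd[] = TEXT("unstable_app.exe");   // hypothetical target
        STARTUPINFO si = { sizeof(si) };
        PROCESS_INFORMATION pi;
        if (CreateProcess(NULL, cmd, NULL, NULL, FALSE, CREATE_SUSPENDED,
                          NULL, NULL, &si, &pi))
        {
            AssignProcessToJobObject(job, pi.hProcess);
            ResumeThread(pi.hThread);
        }
        return 0;
    }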
Use the Application Verifier (AppVerifier) tool from Microsoft.
In my case I needed to simulate memory no longer being available, so I did the following in the tool:
Added my application
Unchecked Basic
Checked Low Resource Simulation
Changed TimeOut to 120000 - my application will run normally for 2 minutes before anything goes into effect.
Changed HeapAlloc to 100 - 100% chance of heap allocation error
Set Stacks to true - the stack will not be able to grow any larger
Save
Start my application
After 2 minutes my program could no longer allocate new memory and I was able to see how everything was handled.
Depending on your applications, it might be easier to limit the memory the language interpreter uses. For example, with Java you can set the maximum amount of RAM the JVM will be allocated (e.g. with the -Xmx flag).
Otherwise it is possible to set it once for each process with the Windows API:
SetProcessWorkingSetSize Function
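For example, a minimal sketch of that call; the 1 MB / 64 MB values are arbitrary, and note this caps the working set (resident pages) rather than total allocated memory, as the next answer points out:

    #include <windows.h>

    int main(void)
    {
        // Ask the system to keep this process between 1 MB and 64 MB resident.
        SetProcessWorkingSetSize(GetCurrentProcess(),
                                 1 * 1024 * 1024,     // minimum working set, bytes
                                 64 * 1024 * 1024);   // maximum working set, bytes
        // ... rest of the program ...
        return 0;
    }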
No way to do this that I know of, although I'm very curious to read if anyone has a good answer. I have been thinking about adding something like this to one of the apps my company builds, but have found no good way to do it.
The one thing I can think of (although not directly on point) is that I believe you can limit the total memory usage for a COM+ application in Windows. It would require the app to be written to run in COM+, of course, but it's the closest way I know of.
The working set stuff is good (Job Objects also control working sets), but that's not total memory usage, only real memory usage (paged in) at any one time. It may work for what you want, but afaik it doesn't limit total allocated memory.
Per-process limits
From an end-user perspective, there are some helpful answers (and comments) at the superuser question “Is it possible to limit the memory usage of a particular process on Windows”, including discussions of how to set recursive quota limits on any or all of:
CPU assignment (quantity, affinity, NUMA groups),
CPU usage,
RAM usage (both ‘committed’ and ‘working set’), and
network usage,
… mostly via the built-in Windows ‘Job Objects’ system (as mentioned in @Adam Mitz’s answer and @Stephen Martin’s comment above), using:
the registry (for persistence, when desired) or
free tools, such as the open-source Process Governor.
(Note: nested Job Objects ~may~ not have been available under all earlier versions of Windows, but the un-nested version appears to date back to Windows XP)
Per-user limits
As far as overall per-user quotas:
??
It is possible that each user session is automatically assigned to a job group itself; if true, per-user limits should be able to be applied to that job group. Update: nope; Job Objects can only be nested at the time they are created or associated with a specific process, and in some cases a child Job Object is allowed to ‘break free’ from its parent and become independent, so they can’t facilitate ‘per-user’ resource limits.
(NTFS does support per-user file system ~storage~ quotas, though)
Per-system limits
Besides simple BIOS or ‘energy profile’ restrictions:
VM hypervisor or Kubernetes-style container resource limit controls may be the most straightforward (in terms of end-user understandability, at least) option.
Footnotes, regarding per-process and other resource quotas / QoS for non-Windows systems:
‘Classic’ Mac OS (including ‘classic’ applications running on 2000s-era versions of Mac OS X): per-application memory limits can be easily set within the ‘Memory’ section of the Finder ‘Get Info’ window for the target program; as a system using a cooperative multitasking concurrency model, per-process CPU limits were impossible.
BSD: ? (probably has some overlap with Linux and non-proprietary macOS methods?)
macOS (aka ‘Mac OS X’): no user-facing interface; system support includes, depending on version, the ‘Multiprocessing Services API’, Grand Central Dispatch, POSIX threads / pthread, ‘operation objects’, and possibly others.
Linux: ‘Resource Manager’/limits.conf, control groups/‘cgroups’, process priority/‘niceness’/renice, others?
IBM z/OS and other mainframe-style systems: resource controls / allocation was built-in from nearly the beginning