GetIpAddrTable() leaks memory. How to resolve that? - windows

On my Windows 7 box, this simple program causes the memory use of the application to creep up continuously, with no upper bound. I've stripped out everything non-essential, and it seems clear that the culprit is the Microsoft Iphlpapi function "GetIpAddrTable()". On each call, it leaks some memory. In a loop (e.g. checking for changes to the network interface list), it is unsustainable. There seems to be no async notification API which could do this job, so now I'm faced with possibly having to isolate this logic into a separate process and recycle the process periodically -- an ugly solution.
Any ideas?
// IphlpLeak.cpp - demonstrates that GetIpAddrTable leaks memory internally: run this and watch
// the memory use of the app climb up continuously with no upper bound.
#include <stdio.h>
#include <windows.h>
#include <assert.h>
#include <Iphlpapi.h>

#pragma comment(lib, "Iphlpapi.lib")

void testLeak() {
    static unsigned char buf[16384];
    DWORD dwSize(sizeof(buf));
    if (GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false) == ERROR_INSUFFICIENT_BUFFER)
    {
        assert(0); // we never hit this branch.
        return;
    }
}

int main(int argc, char* argv[]) {
    for (int i = 0; true; i++) {
        testLeak();
        printf("i=%d\n", i);
        Sleep(1000);
    }
    return 0;
}

@Stabledog:
I've run your example, unmodified, for 24 hours and did not observe the program's Commit Size increasing indefinitely. It always stayed below 1024 KB. This was on Windows 7 (32-bit, without Service Pack 1).

Just for the sake of completeness, what happens to memory usage if you comment out the entire if block and the sleep? If there's no leak there, then I would suggest you're correct as to what's causing it.
Worst case, report it to MS and see if they can fix it - you have a nice simple test case to work from which is more than what I see in most bug reports.
Another thing you may want to try is to check the return code against NO_ERROR rather than one specific error condition. If you get back a different error than ERROR_INSUFFICIENT_BUFFER, the leak may be related to that error path:
DWORD dwRetVal = GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false);
if (dwRetVal != NO_ERROR) {
    printf("ERROR: %d\n", dwRetVal);
}

I've been all over this issue now: it appears that there is no acknowledgment from Microsoft on the matter, but even a trivial application grows without bounds on Windows 7 (not XP, though) when calling any of the APIs which retrieve the local IP addresses.
So the way I solved it -- for now -- was to launch a separate instance of my app with a special command-line switch that tells it "retrieve the IP addresses and print them to stdout". I scrape stdout in the parent app, the child exits and the leak problem is resolved.
But it wins "dang ugly solution to an annoying problem", at best.
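For reference, here is a minimal sketch of the parent-side logic, assuming a hypothetical --dump-ips switch and a one-address-per-line output convention (none of this is from the original program): the parent relaunches its own executable, reads the child's stdout through an anonymous pipe, and lets the child exit so whatever GetIpAddrTable leaked dies with it.
#include <windows.h>
#include <stdio.h>
#include <string>

std::string QueryIpAddressesViaChild()
{
    SECURITY_ATTRIBUTES sa = { sizeof(sa), NULL, TRUE }; // inheritable pipe handles
    HANDLE hRead = NULL, hWrite = NULL;
    CreatePipe(&hRead, &hWrite, &sa, 0);
    SetHandleInformation(hRead, HANDLE_FLAG_INHERIT, 0);  // parent keeps the read end

    char exePath[MAX_PATH];
    GetModuleFileNameA(NULL, exePath, MAX_PATH);
    std::string cmdLine = std::string("\"") + exePath + "\" --dump-ips"; // hypothetical switch

    STARTUPINFOA si = { sizeof(si) };
    si.dwFlags = STARTF_USESTDHANDLES;
    si.hStdOutput = hWrite;
    si.hStdError = GetStdHandle(STD_ERROR_HANDLE);
    PROCESS_INFORMATION pi = {};

    std::string output;
    if (CreateProcessA(NULL, &cmdLine[0], NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi)) {
        CloseHandle(hWrite);                 // so ReadFile sees EOF once the child exits
        char buf[512];
        DWORD got = 0;
        while (ReadFile(hRead, buf, sizeof(buf), &got, NULL) && got > 0)
            output.append(buf, got);
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
    }
    CloseHandle(hRead);
    return output;                           // one IP address per line, by convention
}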

Related

Is there a race between starting and seeing yourself in WinApi's EnumProcesses()?

I just found this code in the wild:
def _scan_for_self(self):
    win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
    basename = self.cmdline.split()[0]
    pids = win32process.EnumProcesses()
    if not pids:
        UserLog.warn("WindowsProcess", "no pids", pids)
    for pid in pids:
        try:
            handle = win32api.OpenProcess(
                win32con.PROCESS_QUERY_INFORMATION | win32con.PROCESS_VM_READ,
                pywintypes.FALSE, pid)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
        try:
            modlist = win32process.EnumProcessModules(handle)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
This line caught my eye:
win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
It suggests that if you call EnumProcesses() too fast after starting, you won't see yourself. Is there any truth to this?
There is a race, but it's not the race the code tried to protect against.
A successful call to CreateProcess returns only after the kernel object representing the process has been created and enqueued into the kernel's process list. A subsequent call to EnumProcesses accesses the same list, and will immediately observe the newly created process object.
That is, unless the process object has since been destroyed. This isn't entirely unusual since processes in Windows are initialized in-process. The documentation even makes note of that:
Note that the function returns before the process has finished initialization. If a required DLL cannot be located or fails to initialize, the process is terminated.
What this means is that if a call to EnumProcesses immediately following a successful call to CreateProcess doesn't observe the newly created process, it does so because it was late rather than early. If you are late already then adding a delay will only make you more late.
Which swiftly leads to the actual race here: Process IDs uniquely identify processes only for a finite time interval. Once a process object is gone, its ID is up for grabs, and the system will reuse it at some point. The only reliable way to identify a process is by holding a handle to it.
Now it's anyone's guess what the author of _scan_for_self was trying to accomplish. As written, the code takes more time to do something that's probably altogether wrong1 anyway.
1 Turns out my gut feeling was correct. This is just your average POSIX developer who, in the process of learning that POSIX is insufficient, would rather call out Microsoft than actually use an all-around superior API.
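If the goal was simply to track a process you started yourself, there is no need to scan at all: CreateProcess already returns a handle, and that handle stays valid no matter how PIDs get reused. A minimal sketch (the notepad.exe child is just a placeholder):
#include <windows.h>
#include <stdio.h>

int main()
{
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    wchar_t cmd[] = L"notepad.exe";   // placeholder child process

    if (!CreateProcessW(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        printf("CreateProcess failed: %lu\n", GetLastError());
        return 1;
    }

    // No EnumProcesses, no sleep: pi.hProcess identifies the child directly.
    WaitForSingleObject(pi.hProcess, INFINITE);

    DWORD exitCode = 0;
    GetExitCodeProcess(pi.hProcess, &exitCode);
    printf("child exited with code %lu\n", exitCode);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}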
The documentation for EnumProcesses (Win32 API - EnumProcesses function) does not mention anything about a delay needed to see the current process in the list it returns.
The example from Microsoft how to use EnumProcess to enumerate all running processes (Enumerating All Processes), also does not contain any delay before calling EnumProcesses.
A small test application I created in C++ (see below) always reports that the current process is in the list (tested on Windows 10):
#include <Windows.h>
#include <Psapi.h>
#include <iostream>
#include <vector>

const DWORD MAX_NUM_PROCESSES = 4096;
DWORD aProcesses[MAX_NUM_PROCESSES];

int main(void)
{
    // Get the list of running process Ids:
    DWORD cbNeeded;
    if (!EnumProcesses(aProcesses, MAX_NUM_PROCESSES * sizeof(DWORD), &cbNeeded))
    {
        return 1;
    }

    // Check if current process is in the list:
    DWORD curProcId = GetCurrentProcessId();
    bool bFoundCurProcId{ false };
    DWORD numProcesses = cbNeeded / sizeof(DWORD);
    for (DWORD i = 0; i < numProcesses; ++i)
    {
        if (aProcesses[i] == curProcId)
        {
            bFoundCurProcId = true;
        }
    }
    std::cout << "bFoundCurProcId: " << bFoundCurProcId << std::endl;
    return 0;
}
Note: I am aware that the program reporting the expected result does not prove that there is no race. Maybe I just couldn't catch it manifesting. But trying to run code like that can sometimes give you a hint (especially if the result had shown that there is a race).
The fact that I never had a problem running this test (I did it many times), together with the lack of any mention of a needed delay in Microsoft's documentation, makes me believe that it is not required.
My conclusion is that either:
- there is a unique issue when using it from Python (I doubt it), or
- the code you found is doing something unnecessary.
There is no race.
EnumProcesses calls an NT API function that switches to kernel mode to walk the kernel's linked list of processes. Your own process has been added to that list before it starts running.

Linux OOM killer does not work

I would like to test whether the kernel OOM killer works correctly on my embedded Linux system. I used a test application to fill all memory and see whether the OOM killer will kill my application when the system runs out of memory.
The test program I used:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE 1024*1024

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        myblock = (void *) malloc(MEGABYTE);
        if (!myblock) break;
        memset(myblock, 1, MEGABYTE);   /* touch the block so the pages are really committed */
        printf("Currently allocating %d MB\n", ++count);
    }
    exit(0);
}
Results:
I always get:
MyApplication triggered out of memory codition (oom killer not called): gfp_mask=0x1200d2, order=0, oomkilladj=0
I tried to change /etc/sysctl by adding:
vm.oom_kill_allocating_task=1
vm.panic_on_oom=0
vm.overcommit_memory=0
How can I make the OOM killer work correctly on my system?
Kernel version: 2.6.30 #7 SMP PREEMPT
The Linux “OOM killer” is a solution to the overcommit problem.
If you just “fill all memory”, then overcommit will not show up. The malloc call will eventually return a null pointer, the convention to indicate that the memory request cannot be fulfilled.
In order to cause an overcommit-related problem, you must allocate too much memory without writing to it, and then decide to write to all of it, so that the system finds itself forced to honor promises it made without having the capacity to fulfill them.
EDIT after source code was provided:
To be completely precise, in order to trigger a problem with overcommit and force the Linux OOM killer to take action, you should have several processes that in a first phase all reserve memory with malloc() (but do not write to it yet). Then have all of them write to the memory they have reserved at the same time. This will force Linux to honor the memory promises outside of any memory allocation, and it will have no choice but to kill a process that wasn't allocating (since none of them will be allocating at that moment).
Also, if you still want to see how and when the OOM killer acts, I would suggest adding a fork() before the while loop. That will create several processes, and eventually the OOM killer will kill one of them.
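To make the two-phase idea concrete, here is a minimal sketch along the lines described above (the block count is an arbitrary assumption; size it to exceed physical RAM plus swap on the target): memory is first reserved without being written, then touched afterwards so the kernel must back pages it has already promised.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE (1024 * 1024)
#define BLOCKS   4096            /* assumption: larger than RAM + swap on the target */

int main(void)
{
    static void *blocks[BLOCKS];
    int reserved, i;

    /* Phase 1: reserve memory without writing to it; overcommit lets this succeed. */
    for (reserved = 0; reserved < BLOCKS; reserved++) {
        blocks[reserved] = malloc(MEGABYTE);
        if (!blocks[reserved])
            break;
    }
    printf("Reserved %d MB without touching it\n", reserved);

    /* Phase 2: write to every block, forcing the kernel to honor its promises.
     * Somewhere in here the OOM killer should pick a victim. */
    for (i = 0; i < reserved; i++) {
        memset(blocks[i], 1, MEGABYTE);
        printf("Touched %d MB\n", i + 1);
    }

    puts("Survived without being OOM-killed");
    return 0;
}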

Read/Write memory on OS X 10.8.2 with vm_read and vm_write

This is my code that works only on Xcode (version 4.5):
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mach/mach_init.h>
#include <mach/mach_vm.h>
#include <sys/types.h>
#include <mach/mach.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <Security/Authorization.h>

int main(int argc, const char * argv[]) {
    char test[14] = "Hello World! ";   // 0x7fff5fbff82a
    char value[14] = "Hello Hacker!";
    char test1[14];
    pointer_t buf;
    uint32_t sz;
    task_t task;

    task_for_pid(current_task(), getpid(), &task);

    if (vm_write(current_task(), 0x7fff5fbff82a, (pointer_t)value, 14) == KERN_SUCCESS) {
        printf("%s\n", test);
        //getchar();
    }
    if (vm_read(task, 0x7fff5fbff82a, sizeof(char) * 14, &buf, &sz) == KERN_SUCCESS) {
        memcpy(test1, (const void *)buf, sz);
        printf("%s", test1);
    }
    return 0;
}
I was also trying ptrace and other things, which is why I include the other headers too.
The first problem is that this works only when run from Xcode: with the debugger I can find the position (memory address) of a variable (test, in this case), then I overwrite it with the string in value, and finally copy the new contents of test into test1.
I don't completely understand how vm_write() works, and the same goes for task_for_pid(). The second problem is that I need to read and write the memory of another process; this is only a test to see whether the functions work within the same process, and they do (only under Xcode).
How can I do that with other processes? I need to read from an address (how can I find the address of "something"?); that is the first goal.
For your problems, there are solutions:
The first problem: OS X has address space layout randomization. If you want your memory layout to be fixed and predictable, you have to compile your code with the NOPIE setting. The PIE setting (Position Independent Executable) is what enables ASLR, which "slides" the memory layout by a random value that changes on every run.
I actually don't understand how vm_write works (not completely) and the same for task_for_pid():
The Mach APIs operate on the lower-level abstractions of "task" and "thread", which correspond roughly to the BSD "process" and "(u)thread" (there are some exceptions, e.g. kernel_task, which does not have a PID, but let's ignore that for now). task_for_pid obtains the task port (think of it as a "handle"), and once you have the port you are free to do whatever you wish. Basically, the vm_* functions operate on any task port - you can use them on your own process (mach_task_self(), that is) or on a port obtained from task_for_pid.
task_for_pid actually doesn't necessarily require root (i.e. "sudo"). It requires getting past taskgated on OS X, which traditionally verifies membership in the procmod or procview groups. You can configure taskgated (/System/Library/LaunchDaemons/com.apple.taskgated.plist) for debugging purposes. Ultimately, by the way, getting the task port will require an entitlement (the same as it now does on iOS). That said, the easiest way, rather than mucking around with system authorizations, etc., is to simply become root.
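As a rough illustration of that flow, reading a few bytes from another process could look like the sketch below (the pid/address arguments and the fixed 14-byte length are assumptions for the example; run it as root or with taskgated's blessing):
#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <pid> <address>\n", argv[0]);
        return 1;
    }
    int pid = atoi(argv[1]);
    vm_address_t addr = (vm_address_t)strtoull(argv[2], NULL, 0);

    // Get the target's task port; requires root or taskgated approval.
    task_t task;
    if (task_for_pid(mach_task_self(), pid, &task) != KERN_SUCCESS) {
        fprintf(stderr, "task_for_pid failed (not root / blocked by taskgated?)\n");
        return 1;
    }

    // Read 14 bytes from the remote address; vm_read allocates the buffer for us.
    vm_offset_t data = 0;
    mach_msg_type_number_t dataCnt = 0;
    if (vm_read(task, addr, 14, &data, &dataCnt) == KERN_SUCCESS) {
        printf("read %u bytes: %.*s\n", dataCnt, (int)dataCnt, (const char *)data);
        vm_deallocate(mach_task_self(), data, dataCnt);   // release what vm_read gave us
    }
    return 0;
}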
Did you try to run your app with "sudo"?
You can't read/write other app's memory without sudo.

ReadFile doesn't work asynchronously on Win7 and Win2k8

According to MSDN, ReadFile can read data in two different ways: synchronously and asynchronously.
I need the second one. The following code demonstrates usage with an OVERLAPPED struct:
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void Read()
{
    HANDLE hFile = CreateFileA("c:\\1.avi", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
    {
        printf("Failed to open the file\n");
        return;
    }

    int dataSize = 256 * 1024 * 1024;
    char* data = (char*)malloc(dataSize);
    memset(data, 0xFF, dataSize);

    OVERLAPPED overlapped;
    memset(&overlapped, 0, sizeof(overlapped));

    printf("reading: %d\n", (int)time(NULL));
    BOOL result = ReadFile(hFile, data, dataSize, NULL, &overlapped);
    printf("sent: %d\n", (int)time(NULL));

    DWORD bytesRead;
    result = GetOverlappedResult(hFile, &overlapped, &bytesRead, TRUE); // wait until completion - returns immediately
    printf("done: %d\n", (int)time(NULL));

    free(data);
    CloseHandle(hFile);
}

int main()
{
    Read();
}
On Windows XP output is:
reading: 1296651896
sent: 1296651896
done: 1296651899
It means that ReadFile didn't block: it returned immediately, within the same second, while the actual reading continued for 3 seconds. That is normal asynchronous reading.
But on windows 7 and windows 2008 I get following results:
reading: 1296661205
sent: 1296661209
done: 1296661209.
That is the behavior of a synchronous read.
MSDN says that async ReadFile sometimes can behave as sync (when the file is compressed or encrypted for example). But the return value in this situation should be TRUE and GetLastError() == NO_ERROR.
On Windows 7 I get FALSE and GetLastError() == ERROR_IO_PENDING. So WinApi tells me that it is an async call, but when I look at the test I see that it is not!
I'm not the only one who found this "bug": read the comment on ReadFile MSDN page.
So what's the solution? Does anybody know? It has been 14 months since Denis found this strange behavior.
I don't know the size of the "c:\1.avi" file, but the buffer you give to Windows (256 MB!) is probably big enough to hold the whole file. So Windows decides to read the entire file and put it in the buffer the way it likes. You don't say to Windows "I want async", you say "I know how to handle async".
Just change the buffer size to, say, 1024, and your program will behave exactly the same but read only 1024 bytes (and still return ERROR_IO_PENDING).
In general, you use asynchronous I/O because you want to do something else during the operation. Look at the sample here: Testing for the End of a File, as it demonstrates an async ReadFile. If you change the sample's buffer to a big value, it should behave exactly like yours.
PS: I suggest you don't rely on time samples to check things; use return codes and events.
According to this, I would suspect that it should return TRUE in your case. But it may also be that the completion modes default settings are different on Win7/Win2k8.
Try setting a different mode with SetFileCompletionNotificationModes().
Have you tried to use an event, as @Simon Mourier suggested? I know the documentation says the event is not required, but the example in the links @Simon Mourier provided uses an event for asynchronous reads.
Windows 7/Server 2008 have different behavior to resolve a race condition that can occur in GetOverlappedResultEx. When you compile for these OSs, Windows detects this and uses the different behavior. I find this wicked confusing.
Here is a link:
http://msdn.microsoft.com/en-us/library/dd371711(VS.85).aspx
I'm sure you've read this many times in the past, but some of the text has changed since Win7, especially regarding the hEvent field in the OVERLAPPED struct:
http://msdn.microsoft.com/en-us/library/ms684342(v=VS.85).aspx
Functions such as GetOverlappedResult and the synchronization wait functions reset auto-reset events to the nonsignaled state. Therefore, you should use a manual reset event; if you use an auto-reset event, your application can stop responding if you wait for the operation to complete and then call GetOverlappedResult with the bWait parameter set to TRUE.
Could you do an experiment: allocate a manual-reset event in your OVERLAPPED struct instead of an auto-reset event? (I don't see the allocation in your snippet - don't forget to create the event and to set hEvent after zeroing the struct.)
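For what it's worth, a sketch of that experiment applied to the question's Read() (same assumed c:\1.avi path and 256 MB buffer) might look like this:
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void ReadWithManualResetEvent()
{
    HANDLE hFile = CreateFileA("c:\\1.avi", GENERIC_READ, 0, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return;

    int dataSize = 256 * 1024 * 1024;
    char* data = (char*)malloc(dataSize);

    OVERLAPPED overlapped;
    memset(&overlapped, 0, sizeof(overlapped));
    // Manual-reset event, initially nonsignaled, as the quoted documentation advises.
    overlapped.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    BOOL result = ReadFile(hFile, data, dataSize, NULL, &overlapped);
    if (!result && GetLastError() == ERROR_IO_PENDING)
    {
        // ... the thread is free to do other work here ...
        WaitForSingleObject(overlapped.hEvent, INFINITE);
    }

    DWORD bytesRead = 0;
    GetOverlappedResult(hFile, &overlapped, &bytesRead, FALSE);
    printf("read %lu bytes\n", bytesRead);

    CloseHandle(overlapped.hEvent);
    CloseHandle(hFile);
    free(data);
}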
This probably has something to do with caching. Try to open the file non-cached (FILE_FLAG_NO_BUFFERING)
EDIT
This is actually documented in the MSDN documentation for ReadFile:
Note: If a file or device is opened for asynchronous I/O, subsequent calls to functions such as ReadFile using that handle generally return immediately, but can also behave synchronously with respect to blocked execution. For more information see http://support.microsoft.com/kb/156932.

Critical Sections leaking memory on Vista/Win2008?

It seems that using Critical Sections quite a bit in Vista/Windows Server 2008 leads to the OS not fully reclaiming the memory.
We found this problem with a Delphi application and it is clearly because of using the CS API. (see this SO question)
Has anyone else seen it with applications developed with other languages (C++, ...)?
The sample code just initialized 10,000,000 critical sections, then deleted them. This works fine on XP/Win2003 but does not release all of the peak memory on Vista/Win2008 until the application has ended.
The more you use CS, the more your application retains memory for nothing.
Microsoft have indeed changed the way InitializeCriticalSection works on Vista, Windows Server 2008, and probably also Windows 7.
They added a "feature" that retains some memory used for debug information when you allocate a bunch of critical sections. The more you allocate, the more memory is retained. It might be asymptotic and eventually flatten out (I'm not fully sold on that one).
To avoid this "feature", you have to use the new API InitializeCriticalSectionEx and pass the flag CRITICAL_SECTION_NO_DEBUG_INFO.
The advantage of this is that it might be faster since, very often, only the spin count will be used without having to actually wait.
The disadvantages are that your old applications can be incompatible, you need to change your code, and it is now platform dependent (you have to check the version to determine which API to use). And you also lose the ability to debug critical sections if you ever need it.
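A minimal sketch of that workaround, assuming you want a single binary that still runs on XP/2003 (the helper name and the run-time lookup are mine, and your SDK must be new enough to define CRITICAL_SECTION_NO_DEBUG_INFO):
#include <windows.h>

// Prefer InitializeCriticalSectionEx with CRITICAL_SECTION_NO_DEBUG_INFO where
// the OS provides it (Vista+); fall back to InitializeCriticalSection otherwise.
typedef BOOL (WINAPI *InitCSExFn)(LPCRITICAL_SECTION, DWORD, DWORD);

void InitCriticalSectionNoDebug(CRITICAL_SECTION* cs)
{
    // Resolve at run time so the same binary still loads on XP/Win2003.
    static InitCSExFn pInitEx = (InitCSExFn)GetProcAddress(
        GetModuleHandleW(L"kernel32.dll"), "InitializeCriticalSectionEx");

    if (pInitEx)
        pInitEx(cs, 0 /* spin count */, CRITICAL_SECTION_NO_DEBUG_INFO);
    else
        InitializeCriticalSection(cs);
}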
Test kit to freeze a Windows Server 2008:
- build this C++ example as CSTest.exe
#include "stdafx.h"
#include "windows.h"
#include <iostream>
using namespace std;
void TestCriticalSections()
{
const unsigned int CS_MAX = 5000000;
CRITICAL_SECTION* csArray = new CRITICAL_SECTION[CS_MAX];
for (unsigned int i = 0; i < CS_MAX; ++i)
InitializeCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
EnterCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
LeaveCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
DeleteCriticalSection(&csArray[i]);
delete [] csArray;
}
int _tmain(int argc, _TCHAR* argv[])
{
TestCriticalSections();
cout << "just hanging around...";
cin.get();
return 0;
}
- ...run this batch file (it needs sleep.exe from the Server SDK)
@rem you may adapt the sleep delay depending on speed and # of CPUs
@rem sleep 2 on a duo-core 4GB. sleep 1 on a 4-CPU 8GB.
@for /L %%i in (1,1,300) do @echo %%i & @start /min CSTest.exe & @sleep 1
@echo still alive?
@pause
@taskkill /im cstest.* /f
- ...and watch a Win2008 server with 8 GB and a quad-core CPU freeze before the 300 launched instances are reached.
- ...repeat on a Windows 2003 server and see it handle the same load like a charm.
Your test is most probably not representative of the problem. Critical sections are considered "lightweight mutexes" because a real kernel mutex is not created when you initialize the critical section. This means your 10M critical sections are just structs with a few simple members. However, when two threads access a CS at the same time, in order to synchronize them a mutex is indeed created - and that's a different story.
I assume that in your real app threads do collide, as opposed to your test app. Now, if you're really treating critical sections as lightweight mutexes and create a lot of them, your app might be allocating a large number of real kernel mutexes, which are far heavier than the lightweight critical section object. And since mutexes are kernel objects, creating an excessive number of them can really hurt the OS.
If this is indeed the case, you should reduce the usage of critical sections where you expect a lot of collisions. This has nothing to do with the Windows version, so my guess might be wrong, but it's still something to consider. Try monitoring the OS handles count, and see how your app is doing.
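One quick way to watch the handle count from inside the app itself (just a sketch; the answer presumably also had Task Manager or Process Explorer in mind) is GetProcessHandleCount:
#include <windows.h>
#include <stdio.h>

int main()
{
    // Poll this around the critical-section-heavy workload; a steadily rising
    // number suggests kernel objects are piling up.
    DWORD handles = 0;
    if (GetProcessHandleCount(GetCurrentProcess(), &handles))
        printf("current handle count: %lu\n", handles);
    return 0;
}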
You're seeing something else.
I just built & ran this test code. Every memory usage stat is constant - private bytes, working set, commit, and so on.
int _tmain(int argc, _TCHAR* argv[])
{
    while (true)
    {
        CRITICAL_SECTION* cs = new CRITICAL_SECTION[1000000];
        for (int i = 0; i < 1000000; i++) InitializeCriticalSection(&cs[i]);
        for (int i = 0; i < 1000000; i++) DeleteCriticalSection(&cs[i]);
        delete [] cs;
    }
    return 0;
}
