Critical Sections leaking memory on Vista/Win2008? - windows-vista

It seems that heavy use of Critical Sections on Vista/Windows Server 2008 leads to the OS not fully reclaiming the memory.
We found this problem with a Delphi application, and it is clearly caused by the CS API. (see this SO question)
Has anyone else seen it with applications developed in other languages (C++, ...)?
The sample code simply initialized 10,000,000 critical sections and then deleted them. This works fine on XP/Win2003, but on Vista/Win2008 the peak memory is not fully released until the application exits.
The more critical sections you use, the more memory your application retains for nothing.

Microsoft has indeed changed the way InitializeCriticalSection works on Vista, Windows Server 2008, and probably also Windows 7.
They added a "feature" that retains some memory used for debug information when you allocate a lot of critical sections. The more you allocate, the more memory is retained. It might be asymptotic and eventually flatten out, but I'm not fully convinced of that.
To avoid this "feature", you have to use the new API InitializeCriticalSectionEx and pass the flag CRITICAL_SECTION_NO_DEBUG_INFO.
The advantage of doing so is that it might be faster: very often only the spin count is used, without any actual waiting.
The disadvantages are that your old applications can be incompatible, you need to change your code, and it is now platform dependent (you have to check the OS version to determine which call to use). You also lose the ability to debug the critical sections if you need to.
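A minimal sketch of what this could look like, assuming a Vista-era SDK; the helper name InitCsNoDebugInfo and the runtime lookup via GetProcAddress are my own illustration, not part of the original report:

#include <windows.h>

#ifndef CRITICAL_SECTION_NO_DEBUG_INFO
#define CRITICAL_SECTION_NO_DEBUG_INFO 0x01000000  // value from the Vista SDK headers
#endif

// Initialize a critical section without the Vista+ debug-info bookkeeping,
// falling back to the classic call where InitializeCriticalSectionEx does not exist.
BOOL InitCsNoDebugInfo(CRITICAL_SECTION* cs, DWORD spinCount)
{
    typedef BOOL (WINAPI *PInitCsEx)(LPCRITICAL_SECTION, DWORD, DWORD);
    PInitCsEx pInitCsEx = (PInitCsEx)GetProcAddress(
        GetModuleHandleW(L"kernel32.dll"), "InitializeCriticalSectionEx");
    if (pInitCsEx)
        return pInitCsEx(cs, spinCount, CRITICAL_SECTION_NO_DEBUG_INFO);
    // Pre-Vista: the debug-info retention does not apply, so the plain call is fine.
    return InitializeCriticalSectionAndSpinCount(cs, spinCount);
}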
Test kit to freeze a Windows Server 2008:
- build this C++ example as CSTest.exe
#include "stdafx.h"
#include "windows.h"
#include <iostream>
using namespace std;
void TestCriticalSections()
{
const unsigned int CS_MAX = 5000000;
CRITICAL_SECTION* csArray = new CRITICAL_SECTION[CS_MAX];
for (unsigned int i = 0; i < CS_MAX; ++i)
InitializeCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
EnterCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
LeaveCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
DeleteCriticalSection(&csArray[i]);
delete [] csArray;
}
int _tmain(int argc, _TCHAR* argv[])
{
TestCriticalSections();
cout << "just hanging around...";
cin.get();
return 0;
}
- Run this batch file (needs sleep.exe from the Server SDK):
rem you may adapt the sleep delay depending on speed and # of CPUs
rem sleep 2 on a dual-core 4GB box, sleep 1 on a 4-CPU 8GB box
for /L %%i in (1,1,300) do (echo %%i & start /min CSTest.exe & sleep 1)
echo still alive?
pause
taskkill /im cstest.* /f
- ...and watch a Win2008 server with 8 GB and a quad-core CPU freeze before all 300 instances have been launched.
- ...then repeat on a Windows 2003 server and watch it handle it like a charm.

Your test is most probably not representative of the problem. Critical sections are considered "lightweight mutexes" because no real kernel object is created when you initialize one; your 10M critical sections are just structs with a few simple members. However, when two threads access a CS at the same time, a kernel synchronization object is indeed created to synchronize them - and that's a different story.
I assume that in your real app threads do collide, unlike in your test app. Now, if you really are treating critical sections as lightweight mutexes and create a lot of them, your app might be allocating a large number of real kernel objects, which are far heavier than the lightweight critical-section struct. And since these are kernel objects, creating an excessive number of them can really hurt the OS.
If this is indeed the case, you should reduce the use of critical sections where you expect a lot of collisions. This has nothing to do with the Windows version, so my guess might be wrong, but it's still something to consider. Try monitoring the OS handle count and see how your app is doing.
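A rough sketch of that kind of monitoring, assuming GetProcessHandleCount (available since XP SP1) is acceptable; in a real app you would call this periodically and watch the trend:

#include <windows.h>
#include <stdio.h>

// Print the current number of kernel handles owned by this process. A count
// that keeps climbing while critical sections are being contended hints at
// kernel objects being created behind the scenes.
void DumpHandleCount()
{
    DWORD handles = 0;
    if (GetProcessHandleCount(GetCurrentProcess(), &handles))
        printf("handle count: %lu\n", handles);
}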

You're seeing something else.
I just built & ran this test code. Every memory usage stat is constant - private bytes, working set, commit, and so on.
int _tmain(int argc, _TCHAR* argv[])
{
    while (true)
    {
        CRITICAL_SECTION* cs = new CRITICAL_SECTION[1000000];
        for (int i = 0; i < 1000000; i++) InitializeCriticalSection(&cs[i]);
        for (int i = 0; i < 1000000; i++) DeleteCriticalSection(&cs[i]);
        delete [] cs;
    }
    return 0;
}

Related

Slow thread creation on Windows

I have upgraded a number crunching application to a multi-threaded program, using the C++11 facilities. It works well on Mac OS X but does not benefit from multithreading on Windows (Visual Studio 2013). Using the following toy program
#include <iostream>
#include <thread>
#include <chrono>

void t1(int& k) {
    k += 1;
}

void t2(int& k) {
    k += 1;
}

int main(int argc, const char *argv[])
{
    int a{ 0 };
    int b{ 0 };

    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i) {
        std::thread thread1{ t1, std::ref(a) };
        std::thread thread2{ t2, std::ref(b) };
        thread1.join();
        thread2.join();
    }
    auto end_time = std::chrono::high_resolution_clock::now();

    auto time_stack = std::chrono::duration_cast<std::chrono::microseconds>(
        end_time - start_time).count();

    std::cout << "Time: " << time_stack / 10000.0 << " micro seconds" << std::endl;
    std::cout << a << " " << b << std::endl;
    return 0;
}
I have discovered that it takes 34 microseconds to start a thread on Mac OS X and 340 microseconds to do the same on Windows. Am I doing something wrong on the Windows side? Is it a compiler issue?
Not a compiler problem (nor an operating system problem, strictly speaking).
It is a well-known fact that creating threads is an expensive operation. This is especially true under Windows (used to be true under Linux prior to clone as well).
Also, creating and joining a thread is necessarily slow and does not tell you much about creating a thread as such. Joining presumes that the thread has exited, which can only happen after it has been scheduled to run, so your measurements include delays introduced by scheduling. Seen that way, the times you measure are actually pretty good (they could easily be 20 times longer!).
However, it does not matter much that spawning threads is slow anyway.
Creating 20,000 threads, as in your benchmark, would be a serious error in a real program. While it is not strictly illegal or disallowed to create thousands (even millions) of threads, the "correct" way of using threads is to create roughly no more threads than there are CPU cores. One does not create very short-lived threads all the time either.
You might have a few short-lived ones, and you might create a few extra threads (which e.g. block on I/O), but you will not want to create hundreds or thousands of these. Every additional thread (beyond the number of CPU cores) means more context switches, more scheduler work, more cache pressure, and 1 MB of address space plus 64 kB of physical memory gone per thread (due to stack reserve and commit granularity).
Now assume you create, say, 10 threads at program start: it does not matter at all whether this takes 3 milliseconds altogether. The program takes at least several hundred milliseconds to start up anyway; nobody will notice the difference.
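As a small sketch of the "no more threads than cores" advice above (a sketch only, assuming a simple fork-join workload; the function name RunOnAllCores is illustrative):

#include <thread>
#include <vector>

// Run the same worker on roughly one thread per CPU core, instead of creating
// and joining thousands of short-lived threads.
void RunOnAllCores(void (*work)(unsigned workerIndex))
{
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;  // hardware_concurrency may return 0 if unknown
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(work, i);
    for (std::thread& t : pool)
        t.join();
}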
Visual C++ uses the Concurrency Runtime (MS specific) to implement std::thread features. When you directly call any Concurrency Runtime feature/function, it creates a default runtime object (I won't go into the details). The same happens when you call a std::thread function: it behaves as if a ConcRT function had been invoked.
The creation of the default runtime (or rather, its scheduler) takes some time, and hence the first thread creation appears slow. Try creating a std::thread object, letting it run, and only then executing the benchmarking code (the whole of the above code, for example).
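For illustration, a minimal sketch of that warm-up, assuming the claim about lazy runtime initialization holds; call it once before taking start_time:

#include <thread>

// Spawn and join one throwaway thread so any lazy runtime/scheduler setup
// happens outside the timed loop.
void WarmUpThreading()
{
    std::thread warmup([] {});
    warmup.join();
}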
EDIT:
Skim over it - http://www.codeproject.com/Articles/80825/Concurrency-Runtime-in-Visual-C
Do step-into debugging to see when the ConcRT library is invoked and what it is doing.

Thread-Ids in Windows greater than 0xFFFF

We have a big, old software project. This software originally ran on an old OS, so it has an OS wrapper. Today it runs on Windows.
In the OS wrapper we have structs to manage threads. One member of this struct is the thread ID, but it is defined as a uint16_t. The thread IDs are generated with the Win API createThreadEx.
For some months now, at one of our customers, thread IDs have been appearing that are greater than
numeric_limits<uint16_t>::max()
We run into big trouble if we try to change this member to a uint32_t, and even if we fix it, we have to test the fix.
So my question is: how is it possible in Windows to get thread IDs greater than 0xffff? Under what circumstances does this happen?
Windows thread IDs are 32 bit unsigned integers, of type DWORD. There's no requirement for them to be less than 0xffff. Whatever thought process led you to that belief was flawed.
If you want to stress test your system to create a scenario where you have thread IDs that go above 0xffff then you simply need to create a large number of threads. To make this tenable, without running out of virtual address space, create threads with very small stacks. You can create the threads suspended too because you don't need the threads to do anything.
Of course, it might still be a little tricky to force the system to allocate that many threads. I found that my simple test application would not readily generate thread IDs above 0xffff when run as a 32 bit process, but would do so as a 64 bit process. You could certainly create a 64 bit process that would consume the low-numbered thread IDs and then allow your 32 bit process to go to work and so deal with lower numbered thread IDs.
Here's the program that I experimented with:
#include <Windows.h>
#include <iostream>

DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
    return 0;
}

int main()
{
    for (int i = 0; i < 10000; i++)
    {
        DWORD threadID;
        if (CreateThread(NULL, 64, ThreadProc, NULL, CREATE_SUSPENDED, &threadID) == NULL)
            return 1;
        std::cout << std::hex << threadID << std::endl;
    }
    return 0;
}
Re: "We run into big trouble if we try to change this member to a uint32_t, and even if we fix it, we have to test the fix."
Your current software's use of a 16-bit object to store a value that requires 32 bits is a bug. So you have to fix it and test the fix. There are at least two practical fixes:
- Changing the declaration of the id, and all uses of it. To find all the places the id is copied, it can really help to introduce a dedicated type that is not implicitly convertible to integer, e.g. a C++11 enum class (see the sketch below).
- Adding a layer of indirection. This might be possible without changing the data, only the threading library implementation.
A deeper fix might be to replace the current threading with C++11 standard library threading.
Either way you're in for a bit of work and/or some cost.
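A hedged sketch of the dedicated-type idea mentioned above, assuming C++11; the names OsThreadId, MakeOsThreadId and RawValue are illustrative:

#include <cstdint>

// An enum class with a 32-bit underlying type is not implicitly convertible to
// integer, so every place that still squeezes the id into a uint16_t stops
// compiling and is therefore found.
enum class OsThreadId : std::uint32_t {};

inline OsThreadId MakeOsThreadId(std::uint32_t raw)
{
    return static_cast<OsThreadId>(raw);
}

inline std::uint32_t RawValue(OsThreadId id)
{
    return static_cast<std::uint32_t>(id);
}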

Read/Write memory on OS X 10.8.2 with vm_read and vm_write

This is my code that works only on Xcode (version 4.5):
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <mach/mach.h>
#include <mach/mach_init.h>
#include <mach/mach_vm.h>
#include <Security/Authorization.h>

int main(int argc, const char * argv[]) {
    char test[14] = "Hello World! "; //0x7fff5fbff82a
    char value[14] = "Hello Hacker!";
    char test1[14];
    pointer_t buf;
    uint32_t sz;
    task_t task;

    task_for_pid(current_task(), getpid(), &task);

    if (vm_write(current_task(), 0x7fff5fbff82a, (pointer_t)value, 14) == KERN_SUCCESS) {
        printf("%s\n", test);
        //getchar();
    }
    if (vm_read(task, 0x7fff5fbff82a, sizeof(char) * 14, &buf, &sz) == KERN_SUCCESS) {
        memcpy(test1, (const void *)buf, sz);
        printf("%s", test1);
    }
    return 0;
}
I was also trying ptrace and other things, which is why I include the other headers too.
The first problem is that this works only from Xcode: with the debugger I can find the position (memory address) of a variable (in this case test), so I overwrite the string with the one in value and then copy the new value of test into test1.
I don't fully understand how vm_write works, and the same goes for task_for_pid(). The second problem is that I need to read and write in another process; this is only a test to see whether the functions work within the same process, and they do (only from Xcode).
How can I do that with other processes? I need to read a location (how can I find the address of "something"?); that is the first goal.
For your problems, there are solutions:
The first problem: OS X has address space layout randomization. If you want your memory layout to be fixed and predictable, you have to compile your code with the no-PIE setting. PIE (Position Independent Executable) is what enables ASLR, which "slides" the memory by some random value that changes on every launch.
Regarding how vm_write and task_for_pid() work:
The Mach APIs operate on the lower-level abstractions of "task" and "thread", which correspond roughly to the BSD "process" and "(u)thread" (there are some exceptions, e.g. kernel_task, which does not have a PID, but let's ignore that for now). task_for_pid obtains the task port (think of it as a "handle"), and once you have the port you are free to do whatever you wish. Basically, the vm_* functions operate on any task port - you can use them on your own process (mach_task_self(), that is) or on a port obtained from task_for_pid.
task_for_pid does not necessarily require root (i.e. "sudo"). It requires getting past taskgated on OS X, which traditionally verified membership in the procmod or procview groups. You can configure taskgated (/System/Library/LaunchDaemons/com.apple.taskgated.plist) for debugging purposes. Ultimately, by the way, getting the task port will require an entitlement (the same as it now does on iOS). That said, the easiest way, rather than mucking around with system authorizations and the like, is simply to run as root.
Did you try to run your app with sudo?
You can't read or write another app's memory without it.
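To sketch the cross-process case under those constraints (running as root or otherwise past taskgated), something along these lines could work; the target PID and address are placeholders, the function name ReadRemote is mine, and error handling is minimal:

#include <mach/mach.h>
#include <mach/mach_init.h>
#include <mach/mach_vm.h>
#include <sys/types.h>
#include <string.h>

// Read 'len' bytes at 'addr' from the process 'targetPid' into 'out'.
// Requires obtaining the target's task port, i.e. sufficient privileges.
int ReadRemote(pid_t targetPid, mach_vm_address_t addr, void *out, size_t len)
{
    task_t task;
    if (task_for_pid(mach_task_self(), targetPid, &task) != KERN_SUCCESS)
        return 0;  // no task port: not root / taskgated said no
    vm_offset_t data = 0;
    mach_msg_type_number_t count = 0;
    if (mach_vm_read(task, addr, len, &data, &count) != KERN_SUCCESS)
        return 0;
    memcpy(out, (const void *)data, count);
    // mach_vm_read allocates a buffer in our address space; release it.
    vm_deallocate(mach_task_self(), data, count);
    return 1;
}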

GetIpAddrTable() leaks memory. How to resolve that?

On my Windows 7 box, this simple program causes the application's memory use to creep up continuously with no upper bound. I've stripped out everything non-essential, and it seems clear that the culprit is the Microsoft Iphlpapi function GetIpAddrTable(): on each call it leaks some memory, which is unsustainable in a loop (e.g. checking for changes to the network interface list). There seems to be no async notification API that could do this job, so now I'm faced with possibly having to isolate this logic into a separate process and recycle that process periodically -- an ugly solution.
Any ideas?
// IphlpLeak.cpp - demonstrates that GetIpAddrTable leaks memory internally: run this and watch
// the memory use of the app climb up continuously with no upper bound.
#include <stdio.h>
#include <windows.h>
#include <assert.h>
#include <Iphlpapi.h>

#pragma comment(lib, "Iphlpapi.lib")

void testLeak() {
    static unsigned char buf[16384];
    DWORD dwSize(sizeof(buf));
    if (GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false) == ERROR_INSUFFICIENT_BUFFER)
    {
        assert(0); // we never hit this branch.
        return;
    }
}

int main(int argc, char* argv[]) {
    for (int i = 0; true; i++) {
        testLeak();
        printf("i=%d\n", i);
        Sleep(1000);
    }
    return 0;
}
@Stabledog:
I ran your example, unmodified, for 24 hours but did not observe the program's commit size increasing indefinitely. It always stayed below 1024 kilobytes. This was on Windows 7 (32-bit, without Service Pack 1).
Just for the sake of completeness, what happens to memory usage if you comment out the entire if block and the Sleep? If there's no leak then, I would suggest you're correct about what's causing it.
Worst case, report it to MS and see if they can fix it - you have a nice simple test case to work from, which is more than what I see in most bug reports.
Another thing you may want to try is to check the return value against NO_ERROR rather than one specific error condition. If you get back a different error than ERROR_INSUFFICIENT_BUFFER, there may be a leak on that path:
DWORD dwRetVal = GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false);
if (dwRetVal != NO_ERROR) {
    printf("ERROR: %d\n", dwRetVal);
}
I've been all over this issue now: it appears that there is no acknowledgment from Microsoft on the matter, but even a trivial application grows without bounds on Windows 7 (not XP, though) when calling any of the APIs which retrieve the local IP addresses.
So the way I solved it -- for now -- was to launch a separate instance of my app with a special command-line switch that tells it "retrieve the IP addresses and print them to stdout". I scrape stdout in the parent app, the child exits and the leak problem is resolved.
But it wins "dang ugly solution to an annoying problem", at best.

How can I find out how much address space the application is consuming and report this to the user?

I'm writing the memory manager for an application, as part of a team of twenty-odd coders. We're running out of memory quota and we need to be able to see what's going on, since we only appear to be using about 700 MB. I need to be able to report where it's all going - fragmentation etc. Any ideas?
You can use existing memory debugging tools for this. I found Memory Validator quite useful; it is able to track both API-level (heap, new, ...) and OS-level (virtual memory) allocations and show virtual memory maps.
The other option, which I also found very useful, is to dump a map of the whole virtual address space based on the VirtualQuery function. My code for this looks like this:
#include <windows.h>
#include <stdio.h>

void PrintVMMap()
{
    size_t start = 0;
    // TODO: make portable - not compatible with /3GB, 64b OS or 64b app
    size_t end = 1U << 31; // map 32b user space only - kernel space not accessible
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    size_t pageSize = si.dwPageSize;
    size_t longestFreeApp = 0;

    int index = 0;
    for (size_t addr = start; addr < end; )
    {
        MEMORY_BASIC_INFORMATION buffer;
        SIZE_T retSize = VirtualQuery((void *)addr, &buffer, sizeof(buffer));
        if (retSize == sizeof(buffer) && buffer.RegionSize > 0)
        {
            // dump information about this region
            printf("%p %10Iu state %08lx protect %08lx\n",
                   buffer.BaseAddress, buffer.RegionSize, buffer.State, buffer.Protect);
            // track longest free region - useful fragmentation indicator
            if (buffer.State & MEM_FREE)
            {
                if (buffer.RegionSize > longestFreeApp) longestFreeApp = buffer.RegionSize;
            }
            addr += buffer.RegionSize;
            index += buffer.RegionSize / pageSize;
        }
        else
        {
            // always proceed
            addr += pageSize;
            index++;
        }
    }
    printf("Longest free VM region: %Iu\n", longestFreeApp);
}
You can also find out information about the heaps in a process with Heap32ListFirst/Heap32ListNext, and about loaded modules with Module32First/Module32Next, from the Tool Help API.
'Tool Help' originated on Windows 9x. The original process information API on Windows NT was PSAPI, which offers functions which partially (but not completely) overlap with Tool Help.
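As a hedged sketch of the Tool Help route (the function name DumpHeaps is mine, and Heap32First/Heap32Next can be quite slow on large heaps), walking every heap and summing the block sizes looks roughly like this:

#include <windows.h>
#include <tlhelp32.h>
#include <stdio.h>

// Enumerate the heaps of the current process and report how many bytes the
// walked blocks account for in each one.
void DumpHeaps()
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPHEAPLIST, GetCurrentProcessId());
    if (snap == INVALID_HANDLE_VALUE)
        return;

    HEAPLIST32 hl = { sizeof(hl) };
    for (BOOL okList = Heap32ListFirst(snap, &hl); okList; okList = Heap32ListNext(snap, &hl))
    {
        SIZE_T total = 0;
        HEAPENTRY32 he = { sizeof(he) };
        for (BOOL okEntry = Heap32First(&he, hl.th32ProcessID, hl.th32HeapID);
             okEntry; okEntry = Heap32Next(&he))
        {
            total += he.dwBlockSize;
        }
        printf("heap %p: %Iu bytes in walked blocks\n", (void*)hl.th32HeapID, total);
    }
    CloseHandle(snap);
}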
Our (huge) application (a Win32 game) started throwing "Not enough quota" exceptions recently, and I was charged with finding out where all the memory was going. It is not a trivial job - this question and this one were my first attempts at finding out. Heap behaviour is unexpected, and accurately tracking how much quota you've used and how much is available has so far proved impossible. In fact, it's not particularly useful information anyway - "quota" and "somewhere to put things" are subtly and annoyingly different concepts. The accepted answer is as good as it gets, although enumerating heaps and modules is also handy. I used DebugDiag from MS to view the true horror of the situation, and understand how hard it is to actually thoroughly track everything.
