How to detect high-contention critical sections? - winapi

My application uses many critical sections, and I want to know which of them might cause high contention. I want to avoid bottlenecks and ensure scalability, especially on multi-core, multi-processor systems.
I already found one accidentally when I noticed many threads hanging while waiting to enter a critical section while the application was under heavy load. That was rather easy to fix, but how do I detect such high-contention critical sections before they become a real problem?
I know there is a way to create a full dump and get that info from it (somehow?), but that is a rather intrusive approach. Are there methods an application can use on the fly to diagnose itself for such issues?
I could use data from the _RTL_CRITICAL_SECTION_DEBUG structure, but there are notes that relying on it could be unsafe across different Windows versions: http://blogs.msdn.com/b/oldnewthing/archive/2005/07/01/434648.aspx
Can someone suggest a reliable and not too complex method to get such info?

What you're talking about makes perfect sense during testing, but isn't really feasible in production code.
I mean.. you CAN do things in production code, such as reading the LockCount and RecursionCount values (these fields are documented), subtracting RecursionCount from LockCount, and presto, you have the number of threads waiting to get their hands on the CRITICAL_SECTION object.
You may even want to go deeper. The RTL_CRITICAL_SECTION_DEBUG structure IS documented in the SDK. The only thing that ever changed regarding this structure was that some reserved fields were given names and were put to use. I mean.. it's in the SDK headers (winnt.h), documented fields do NOT change. You misunderstood Raymond's story. (He's partially at fault, he likes a sensation as much as the next guy.)
My general point is, if there's heavy lock contention in your application, you should, by all means, ferret it out. But don't ever make the code inside a critical section bigger if you can avoid it. And reading the debug structure (or even LockCount/RecursionCount) should only ever happen while you're holding the object. It's fine in a debug/testing build, but it should not go into production.
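For illustration, a minimal debug-only sketch of that check (call it only while you hold the section; the exact meaning of LockCount is version-dependent, it changed in Windows Vista, so treat the result as a rough hint rather than an exact count):

#include <windows.h>

// Debug/testing only: rough estimate of how many threads are queued on cs.
// Call while holding cs. LockCount's encoding changed in Windows Vista and
// later, so this subtraction is a diagnostic hint, not an exact waiter count.
LONG EstimateWaiters(const CRITICAL_SECTION& cs)
{
    return cs.LockCount - cs.RecursionCount;
}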

There are other ways to handle concurrency besides critical sections (e.g. semaphores). One of the best is non-blocking synchronization, which means structuring your code so that it does not require blocking even with shared resources. You should read up on concurrency. Also, you can post a code snippet here and someone can give you advice on ways to improve your concurrency code.

Take a look at Intel Thread Profiler. It should be able to help to spot such problems.
Also, you may want to instrument your code by wrapping critical sections in a proxy that dumps data to disk for analysis. It really depends on the app itself, but it could capture, at a minimum, how long a thread has been waiting for the CS.
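As a rough sketch of that instrumentation idea (the class and member names here are made up for the example), a proxy can time each EnterCriticalSection call and accumulate or log how long callers waited:

#include <windows.h>

// Hypothetical proxy around a CRITICAL_SECTION that records wait time.
// Dump WaitTicks() to disk periodically, or log per-call waits instead.
class TimedCriticalSection {
public:
    TimedCriticalSection()  { InitializeCriticalSection(&cs_); }
    ~TimedCriticalSection() { DeleteCriticalSection(&cs_); }

    void Enter() {
        LARGE_INTEGER start, end;
        QueryPerformanceCounter(&start);
        EnterCriticalSection(&cs_);          // contention shows up as time spent here
        QueryPerformanceCounter(&end);
        InterlockedExchangeAdd64(&waitTicks_, end.QuadPart - start.QuadPart);
    }
    void Leave() { LeaveCriticalSection(&cs_); }

    LONGLONG WaitTicks() const { return waitTicks_; }  // scale by QueryPerformanceFrequency

private:
    CRITICAL_SECTION cs_;
    LONGLONG volatile waitTicks_ = 0;
};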

Clean up after killing a thread

After reading this article https://developer.ibm.com/tutorials/l-memory-leaks/ I'm wondering whether there is a way to cancel thread execution and avoid memory leaks, since my understanding is that the join functionality is what releases the allocated space. That should also be possible with other calls. What interests me is how join releases the memory space when other functions can't. Is there a function that tells which thread a memory space is assigned to? Can this mapping be given out? I know one should not do crazy things with that, since it represents a potential safety issue. But still, are there ways to achieve that?
For example, if I have a third-party lib, I can identify its threads, but I have the problem that I cannot identify the memory allocated by the lib, or I do not know how to do that (the lib is a binary).
If the library doesn't support that, you can't. Your understanding of the issue is slightly off. It doesn't matter who allocated the memory, it matters whether the memory still needs to be allocated or not. If the library provides some way to get to the point where the memory no longer needs to be allocated, that provided way would also provide a way to free the memory. If the library doesn't provide any way to get to the point where the memory no longer needs to be allocated, some way to free it would not be helpful.
Coding such stuff is a rabbit hole and should be done on the OS level.
Can't be done. The OS has no way to know when the code that allocated some chunk of memory still needs it and when it doesn't. Only the code that allocated the memory can possibly know that.
POSIX allows canceling but not identifying the individual threads, and not all POSIX functionality works on Linux. POSIX is just a layer over the low-level stuff in the OS.
Right, so POSIX is not the place where this goes. It requires understanding of the application and so must be done at the application layer. If you need this functionality, code it. If you need it in other people's code and they don't supply it, talk to them. Presumably, if their code is decent and appropriate, it has some way to do what you need. If not, your complaint is with the code that doesn't do what you need.
My thought was that somewhere in Linux the system tracks which heap allocations were made by which thread, if some option is enabled, since I know that by default there is nothing.
That doesn't help. Which thread allocated memory tells you absolutely nothing about when it is no longer needed. Only the same code that decided it was needed can tell when it is no longer needed. So if this is needed in some code that allocates memory, that code must implement this. If the person who implemented that code did not provide this kind of facility, then that means they decided it wasn't needed. You may wish to ask them why they made that decision. Their answer may well surprise you.
But I see there is no answer to a serious question.
The answer is to code what you need. If it's someone else's code and they didn't code it, then they didn't think you would need it. They're most likely right. But if they're wrong, then don't use their code.
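For code you do control, a minimal POSIX sketch of "coding what you need": the thread that allocates also registers a cleanup handler, so cancellation (or a normal exit) frees the memory. The names worker and release_buffer are made up for the example.

#include <pthread.h>
#include <stdlib.h>

static void release_buffer(void* p) { free(p); }   // cleanup owned by the allocating code

static void* worker(void*)
{
    void* buffer = malloc(4096);
    pthread_cleanup_push(release_buffer, buffer);   // runs if the thread is cancelled
    // ... long-running work using buffer; pthread_testcancel() marks a cancellation point ...
    pthread_testcancel();
    pthread_cleanup_pop(1);                         // normal path: also frees the buffer
    return nullptr;
}

int main()
{
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);
    pthread_cancel(t);                              // no leak: the cleanup handler frees buffer
    pthread_join(t, nullptr);
}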

How to measure the point in time at which a slice of data in memory was accessed?

Suppose I'm reading large chunks of data into memory and processing them sequentially. Is there a way to pinpoint when a given segment/chunk of the memory was accessed, by using some kind of system tool that will log memory address accesses?
An approach I'm considering - which doesn't rely on measurement utilities - is logging what data is being processed at any point of time, and inferring the usage based on looking at the data itself. But that is not a generic solution.
These are some of the ideas that have been brewing in my head to do what you want. Never had the time to explore these in more detail though.
The simplest method is to add a watchpoint for the address inside gdb, if you need a quick-fix kind of solution.
Another way is to mark the pages READONLY for the chunks of data you want to check access for. On Linux this can be done using the mprotect call. This assumes you are debugging the code, as any write to the page will cause a segfault; you could install a signal handler to catch it (see the sketch after this list).
Another way to do the same may be to use the ptrace system call, which may be more trouble than it's worth.
If you just want to count accesses to a memory address, you can use the perf_event_open system call on newer Linux kernels. See the documentation for PERF_COUNT_HW_CACHE_OP_READ and PERF_COUNT_HW_CACHE_OP_WRITE. You are on your own with that one though, and it may be even less worthwhile than the other methods. However, since the question is marked with the performance tag, this may be what you are looking for.
If you just want a system tool, you might want to look at the perf tool and dig into the manuals to see if it can do the same thing that I described with perf_event_open. This tool is a wrapper around that system call, so I am guessing that it should have support for the functionality I mentioned in the previous point.
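A minimal Linux sketch of the mprotect idea above (debug-only; the handler simply logs the faulting address and re-enables writes so the program can continue):

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char*  watched;        // the page we want write notifications for
static size_t page_size;

static void on_fault(int, siginfo_t* info, void*)
{
    // fprintf is not strictly async-signal-safe; acceptable for a debug hack.
    fprintf(stderr, "write to %p\n", info->si_addr);
    mprotect(watched, page_size, PROT_READ | PROT_WRITE);  // let the write proceed
}

int main()
{
    page_size = sysconf(_SC_PAGESIZE);
    watched = (char*)mmap(nullptr, page_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    mprotect(watched, page_size, PROT_READ);  // read-only: the next write faults
    watched[0] = 42;                          // handler logs the access, then it completes
    return 0;
}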

Find memory leak in very complex Ruby app

Hi everyone!
It's nice to work with Ruby and write some code, but over the past week I noticed that we have a problem in our application: memory usage is growing like an O(x*3) function.
Our application is very complex; it is based on EventMachine and other external libs. On top of that, it is running under the amd64 version of FreeBSD using Ruby 1.8.7-p382.
I've tried to research on my own how to find the memory leak in our app.
I've found many tools and libs, but they don't work under 64-bit FreeBSD, and I have no idea how to go about finding leaks in a huge Ruby application. It's OK if you have a few files with 200-300 lines of code, but here we have around 30 files averaging 200-300 lines of code each.
I realize I would need too much time to find those leaks doing it the dumb way: believe/research/assume that some part of the code may actually be leaking and wrap it in tracking code, like the ruby-prof gem technique. But that is a painfully slow way, because as I said we have too much code.
So, my question is: how do I find a memory leak in a very complex Ruby app without putting my whole life into this work?
Thx in advance
One thing to try, even though it can massively degrade performance, is to manually trigger the garbage collector by calling GC.start every so often. How often is kind of subjective, as the more you run it the slower the app, and the less you run it the higher the memory footprint.
For whatever reason, the garbage collector may go on vacation from time to time, presumably not wanting to interfere if there is some heavy processing going on. As such you may have to manually call to have your trash taken away.
One way to avoid creating trash is to use memory more efficiently. Don't create hashes when arrays will do the job, don't create arrays when a single string will suffice, and so on. It will be important to profile your application to see what kind of objects are cluttering up your heap before you just start hacking away randomly.
If you can, try and use 1.9.2 which has made significant gains in terms of memory management. Ruby Enterprise Edition is also an option if you need 1.8.7 compatibility, as it's essentially a better garbage collector for that version.
How hard would it be to run your app on a Linux box? If you don't have the same memory problems there, it is probably something specific to your Ruby runtime. If you do have the same problems, you can use all the tools and libs that are Linux-only.
Another alternative: can you wrap your unit tests with some memory-tracking code? Most unit test frameworks make it easy to add some code before/after each test. Or you could just run each test 1000000000 times and see if the memory goes out of control. If it does, you know something that happens in that test is causing the leak, and you can continue to isolate the problem.
Have you tried counting the number of objects you have, using ObjectSpace.each_object? Although you're intending to use small batches, maybe you have more objects than you think.
count = ObjectSpace.each_object() {}
# => 7216

Are there any consequences to never deleting critical sections?

I am refining a large body of native code which uses a few static critical sections and never calls DeleteCriticalSection, leaving them to process exit to clean up.
There are no leaks and no concerns about the total number of CS getting too high, I'm just wondering if there are any long-term Windows consequences to not cleaning them up. We have regression test suites that will launch a program thousands of times a day, although end users are not likely to do anything like that.
Because of the range of deployed machines, we have to consider Windows XP as well, and this native code is run from a managed application.
A critical section is just a block of memory until contention is detected, at which time an event object is created for synchronization. Process exit would clean up any lingering events. If you were creating these dynamically at runtime and not freeing them, it would be bad. If the ones not getting cleaned up are a fixed number per process, I wouldn't worry about it.
In principle, every process resource is cleaned up when the process exits. Kernel resources like event objects definitely follow this principle.
The short answer is probably not. The long answer is, this is a lazy programming practice and should be fixed.
To use DeleteCriticalSection correctly, one needs to shut down in an orderly manner so that no other thread owns or attempts to own the section before/after it is deleted. And programmers get lazy about defining and implementing how shutdown will work for their program.
There are many things you can do with no immediate measurable consequences - but that does not make them right. Also, a similar attitude towards other handles/objects in the same code base will have a cumulative effect and could add up to "consequences".
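A minimal sketch of what "orderly shutdown" means here (the Shutdown function and worker handles are made up for the example): the section is deleted only after every thread that could touch it has finished.

#include <windows.h>

CRITICAL_SECTION g_cs;   // initialized once at startup with InitializeCriticalSection(&g_cs)

void Shutdown(HANDLE* workers, DWORD count)
{
    // 1. Signal the worker threads to stop (application-specific), then wait for them.
    WaitForMultipleObjects(count, workers, TRUE, INFINITE);

    // 2. No thread can now own or try to enter g_cs, so deleting it is safe.
    DeleteCriticalSection(&g_cs);
}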

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware of when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking into a lock-free strategy, or are there other, more important points that should set off a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz. I want red flags to look for in code; I know the application doesn't perform as well as it should, and I need to know where to start looking, what should worry me, and where I should put my efforts. I know it's kind of a general question, but I can't post the entire program, and if I could pick out one section of code then I wouldn't need to ask in the first place.
I'm using Delphi 7, although the application will be ported / remade in .NET (C#) next year, so I'd rather hear comments that are applicable as general practice, or, if they must be language-specific, that apply to either one of those languages.
One thing to definitely avoid is lots of write access to the same cache lines from different threads.
For example: if you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to be synchronized whenever another CPU writes to the variable.
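A minimal sketch of the usual remedy, shown here in C++ terms (the 64-byte line size and the thread count are assumptions for the example): give each thread its own counter on its own cache line and combine them only when the total is needed.

#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;   // typical x86 cache-line size (assumption)

// One counter per worker thread, padded so that incrementing one counter
// does not invalidate the cache line holding another thread's counter.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[8];               // one slot per worker thread (assumed 8 workers)

long Total()                             // combine only when the total is actually needed
{
    long sum = 0;
    for (const PaddedCounter& c : counters)
        sum += c.value.load(std::memory_order_relaxed);
    return sum;
}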
One thing that decreases performance is having two threads doing heavy hard drive access. The hard drive would jump between providing data for one thread and the other, and both threads would wait for the disk all the time.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
bool value = askSomeSharedResourceForSomeValue();
if (value)
DoSomethingIfTrue();
else
DoSomethingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
value = askSomeSharedResourceForSomeValue();
}
if (value)
DoSomethingIfTrue();
else
DoSomethingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads than there are cores typically means that the program is not performing optimally.
So a program which spawns loads of threads is usually not designed in the best fashion. A good example of this practice is the classic socket examples, where every incoming connection gets its own thread to handle the connection. It is a very unscalable way to do things. The more threads there are, the more time the OS has to spend on context switching between threads.
You should first be familiar with Amdahl's law.
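Roughly: if a fraction P of the work can be parallelized across N cores, Amdahl's law bounds the overall speedup to

speedup = 1 / ((1 - P) + P / N)

so even with P = 0.9 and an unlimited number of cores, the speedup tops out at 10x.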
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUIs is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object that both use, a file that both use, a network connection that both use, or a processor that both use. You cannot avoid these dependencies on shared resources entirely, but if possible, try to avoid sharing resources.
Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded application, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, then having 100% CPU across several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilization on a dual-processor machine is a sign that only one thread is running, so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking into a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.
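To illustrate that style in C++ terms (the class is made up for the example), the lock is held only for the push or pop itself, never for the work done on the item:

#include <mutex>
#include <optional>
#include <queue>

// Thread-safe queue where the mutex is held only long enough to move one item.
template <typename T>
class SmallLockQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(item));
    }
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();
        return item;            // the caller processes the item outside the lock
    }
private:
    std::mutex m_;
    std::queue<T> q_;
};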
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is threadsafe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A reader/writer lock can dramatically improve performance. One writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.
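To illustrate the pattern in C++ terms (std::shared_mutex plays roughly the role ReaderWriterLockSlim plays in .NET; the map and function names are made up for the example): many readers can hold the lock concurrently, while a writer gets exclusive access.

#include <map>
#include <shared_mutex>
#include <string>

std::shared_mutex rw;
std::map<std::string, int> shared_data;

int read_value(const std::string& key)           // many readers may run at once
{
    std::shared_lock<std::shared_mutex> lock(rw);
    auto it = shared_data.find(key);
    return it == shared_data.end() ? 0 : it->second;
}

void write_value(const std::string& key, int v)  // a writer excludes readers and writers
{
    std::unique_lock<std::shared_mutex> lock(rw);
    shared_data[key] = v;
}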
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplify server software development when it comes to load balancing and distributed queue processing, but developing your own load-sharing infrastructure is not that complicated compared with what you will encounter in a multi-threaded application in general.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads specific to your language, framework and IDE. Your own logger might do fine too (resume time, sleep time + duration). From there you can check for poorly performing threads that don't execute much or wait too long for something to happen; you might want to make the event they are waiting for occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that, you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. There is no sense in optimizing your multiprocessing if the threads themselves are performing badly.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't always equal performance.
Things are a lot better in certain operating systems as opposed to others, but if you can have something sleep or relinquish its time until it's signaled...or not start a new process for virtually everything, you're saving yourself from bogging the application down in context switching.
