I recently found myself trying to diagnose why a particular Ruby program was running slow. In the end it turned out be caused by a scaling issue causing a lot of contention on a particular mutex.
I was wondering if there are any tools that I could have used to make this issue easier to diagnose? I know I could have used ruby-prof to get detailed output of what all 100+ threads of this program were spending their time on, but I'm curious whether there is any tool that is specifically focused on just measuring mutex contention in Ruby?
So If figured out how to do this with DTrace.
Given a Ruby program like this:
# mutex.rb
mutex = Mutex.new
threads = []
threads << Thread.new do
loop do
mutex.synchronize do
sleep 2
end
end
end
threads << Thread.new do
loop do
mutex.synchronize do
sleep 4
end
end
end
threads.each(&:join)
We can use a DTrace script like this:
/* mutex.d */
ruby$target:::cmethod-entry
/copyinstr(arg0) == "Mutex" && copyinstr(arg1) == "synchronize"/
{
self->file = copyinstr(arg2);
self->line = arg3;
}
pid$target:ruby:rb_mutex_lock:entry
/self->file != NULL && self->line != NULL/
{
self->mutex_wait_start = timestamp;
}
pid$target:ruby:rb_mutex_lock:return
/self->file != NULL && self->line != NULL/
{
mutex_wait_ms = (timestamp - self->mutex_wait_start) / 1000;
printf("Thread %d acquires mutex %d after %d ms - %s:%d\n", tid, arg1, mutex_wait_ms, self->file, self->line);
self->file = NULL;
self->line = NULL;
}
When we run this script against the Ruby program, we then get information like this:
$ sudo dtrace -q -s mutex.d -c 'ruby mutex.rb'
Thread 286592 acquires mutex 4313316240 after 2 ms - mutex.rb:14
Thread 286591 acquires mutex 4313316240 after 4004183 ms - mutex.rb:6
Thread 286592 acquires mutex 4313316240 after 2004170 ms - mutex.rb:14
Thread 286592 acquires mutex 4313316240 after 6 ms - mutex.rb:14
Thread 286592 acquires mutex 4313316240 after 4 ms - mutex.rb:14
Thread 286592 acquires mutex 4313316240 after 4 ms - mutex.rb:14
Thread 286591 acquires mutex 4313316240 after 16012158 ms - mutex.rb:6
Thread 286592 acquires mutex 4313316240 after 2002593 ms - mutex.rb:14
Thread 286591 acquires mutex 4313316240 after 4001983 ms - mutex.rb:6
Thread 286592 acquires mutex 4313316240 after 2004418 ms - mutex.rb:14
Thread 286591 acquires mutex 4313316240 after 4000407 ms - mutex.rb:6
Thread 286592 acquires mutex 4313316240 after 2004163 ms - mutex.rb:14
Thread 286591 acquires mutex 4313316240 after 4003191 ms - mutex.rb:6
Thread 286591 acquires mutex 4313316240 after 2 ms - mutex.rb:6
Thread 286592 acquires mutex 4313316240 after 4005587 ms - mutex.rb:14
...
We can collect this output and use it to derive information about which mutexes are causing the most contention.
Related
NOTE: I have added the C++ tag to this because a) the code is C++ and b) people using C++ may well have used IO completion ports. So please don't shout.
I am playing with IO completion ports, and have eventually fully understood (and tested, to prove) - both with help from RbMm - the meaning of the NumberOfConcurrentThreads parameter within CreateIoCompletionPort().
I have the following small program which creates 10 threads all waiting on the completion port. I tell my completion port to only allow 4 threads to be runnable at once (I have four CPUs). I then enqueue 8 packets to the port. My thread function outputs a message if it dequeues a packet with an ID > 4; in order for this message to be output, I have to stop at least one of the four currently running threads, which happens when I enter '1' at the console.
Now this is all fairly simple code. I have one big concern however, and that is that if all of the threads that are processing a completion packet get bogged down, it will mean no more packets can be dequeued and processed. That is what I am simulating with my infinite loop - the fact that no more packets are dequeued until I enter '1' at the console highlights this potential problem!
Would a better solution not be to have my four threads dequeuing packets (or as many threads as CPUs), then when one is dequeued, farm the processing of that packet off to a worker thread from a separate pool, thereby removing the risk of all threads in the IOCP being bogged down thus no more packets being dequeued?
I ask this as all the examples of IO completion port code I have seen use a method similar to what I show below, not using a separate thread pool which I propose. This is what makes me think that I am missing something because I am outnumbered!
Note: this is a somewhat contrived example, because Windows will allow an additional packet to be dequeued if one of the runnable threads enters a wait state; I show this in my code with a commented out cout call:
The system also allows a thread waiting in GetQueuedCompletionStatus
to process a completion packet if another running thread associated
with the same I/O completion port enters a wait state for other
reasons, for example the SuspendThread function. When the thread in
the wait state begins running again, there may be a brief period when
the number of active threads exceeds the concurrency value. However,
the system quickly reduces this number by not allowing any new active
threads until the number of active threads falls below the concurrency
value.
But I won't be calling SuspendThread in my thread functions, and I don't know which functions other than cout will cause the thread to enter a wait state, thus I can't predict if one or more of my threads will ever get bogged down! Hence my idea of a thread pool; at least context switching would mean that other packets get a chance to be dequeued!
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include <thread>
#include <vector>
#include <algorithm>
#include <atomic>
#include <ctime>
#include <iostream>
using namespace std;
int main()
{
HANDLE hCompletionPort1;
if ((hCompletionPort1 = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)) == NULL)
{
return -1;
}
vector<thread> vecAllThreads;
atomic_bool bStop(false);
// Fill our vector with 10 threads, each of which waits on our IOCP.
generate_n(back_inserter(vecAllThreads), 10, [hCompletionPort1, &bStop] {
thread t([hCompletionPort1, &bStop]()
{
// Thread body
while (true)
{
DWORD dwBytes = 0;
LPOVERLAPPED pOverlapped = 0;
ULONG_PTR uKey;
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
break; // Special completion packet; end processing.
//cout << uKey; // EVEN THIS WILL CAUSE A "wait" which causes MORE THAN 4 THREADS TO ENTER!
if (uKey >4)
cout << "Started processing packet ID > 4!" << endl;
while (!bStop)
; // INFINITE LOOP
}
}
});
return move(t);
}
);
// Queue 8 completion packets to our IOCP...only four will be processed until we set our bool
for (int i = 1; i <= 8; ++i)
{
PostQueuedCompletionStatus(hCompletionPort1, 0, i, new OVERLAPPED);
}
while (!bStop)
{
int nVal;
cout << "Enter 1 to cause current processing threads to end: ";
cin >> nVal;
bStop = (nVal == 1);
}
for (int i = 0; i < 10; ++i) // Tell all 10 threads to stop processing on the IOCP
{
PostQueuedCompletionStatus(hCompletionPort1, 0, 0, 0); // Special packet marking end of IOCP usage
}
for_each(begin(vecAllThreads), end(vecAllThreads), mem_fn(&thread::join));
return 0;
}
EDIT #1
What I mean by "separate thread pool" is something like the following:
class myThread {
public:
void SetTask(LPOVERLAPPED pO) { /* start processing pO*/ }
private:
thread m_thread; // Actual thread object
};
// The threads in this thread pool are not associated with the IOCP in any way whatsoever; they exist
// purely to be handed a completion packet which they then process!
class ThreadPool
{
public:
void Initialise() { /* create 100 worker threads and add them to some internal storage*/}
myThread& GetNextFreeThread() { /* return one of the 100 worker thread we created*/}
} g_threadPool;
The code that each of my four threads associated with the IOCP then change to
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
break; // Special completion packet; end processing.
// Pick a new thread from a pool of pre-created threads and assign it the packet to process
myThread& thr = g_threadPool.GetNextFreeThread();
thr.SetTask(pOverlapped);
// Now, this thread can immediately return to the IOCP; it doesn't matter if the
// packet we dequeued would take forever to process; that is happening in the
// separate thread thr *that will not intefere with packets being dequeued from IOCP!*
}
This way, there is no possible way that I can end up in the situation where no more packets are being dequeued!
It seems there is conflicting opinion on whether a separate thread pool should be used. Clearly, as the sample code I have posted shows, there is potential for packets to stop being dequeued from the IOCP if the processing of the packets does not enter a wait state; given, the infinite loop is perhaps unrealistic but it does demonstrate the point.
The linux scheduler calls for an scheduler algorithm that finds the next task in the task list.
What does the scheduling algorithm return if there are no tasks left?
Below is a piece of code
struct rt_info* sched_something(struct list_head *head, int flags)
{
/* some logic */
return some_task; // What task's value if there are no tasks left.
}
There is special "idle" task with PID=0, sometimes it is called swapper (Why do we need a swapper task in linux?). This task is scheduled to CPU core when there is no other task ready to run. There are several idle tasks - one for each CPU core
Source kernel/sched/core.c http://lxr.free-electrons.com/source/kernel/sched/core.c?v=3.16#L3160
3160 /**
3161 * idle_task - return the idle task for a given cpu.
3162 * #cpu: the processor in question.
3163 *
3164 * Return: The idle task for the cpu #cpu.
3165 */
3166 struct task_struct *idle_task(int cpu)
3167 {
3168 return cpu_rq(cpu)->idle;
3169 }
3170
So, pointer to this task is stored in runqueue (struct rq) kernel/sched/sched.h:
502 * This is the main, per-CPU runqueue data structure. ... */
508 struct rq {
557 struct task_struct *curr, *idle, *stop;
There is some init code in sched/core.c:
4517 /**
4518 * init_idle - set up an idle thread for a given CPU
4519 * #idle: task in question
4520 * #cpu: cpu the idle task belongs to
4524 */
4525 void init_idle(struct task_struct *idle, int cpu)
I think, idle task will run some kind of loop with special asm commands to inform CPU core that there is no useful job...
This post http://duartes.org/gustavo/blog/post/what-does-an-idle-cpu-do/ says idle task executes cpu_idle_loop kernel/sched/idle.c (there may be custom version of loop for arch and CPU - called with cpu_idle_poll(void) -> cpu_relax();):
40 #define cpu_relax() asm volatile("rep; nop")
45 static inline int cpu_idle_poll(void)
..
50 while (!tif_need_resched())
51 cpu_relax();
221 if (cpu_idle_force_poll || tick_check_broadcast_expired())
222 cpu_idle_poll();
223 else
224 cpuidle_idle_call();
I have this weird problem where thread I created does not terminate even after it exits from the thread function. I create the thread so:
typedef void(*Task)(void*);
AsyncWorker(Task proc, void* arg): thd_(NULL) {
thd_ = new std::thread(proc, arg);
}
~AsyncWorker() {
if (thd_) {
if(thd_->joinable())
thd_->join(); // does not return from here
delete thd_;
}
}
This is the task that the thread executes:
static void RunLoop(void* arg)
{
if (!arg)
return;
SomeObject* thiz = static_cast<SomeObject*>(arg);
while( !(thiz->done_) ) {
thiz->DoInLoop();
}
return;
}
I set the member SomeObject::done_ to true from the main thread and delete AsyncWorker. When I step through the debugger I can see that the thread has exited from the RunLoop function but call to join in the dtor hangs. The call stack for both the thread and the main thread shows
[External Code]
[No symbols loaded for ntdll.dll]
What could be the problem? The SomeObject::DoInLoop method does wait on a mutex but I signal the mutex before deleting AsyncWorker object so that the thread can go past that and in any case if the thread has exited from the thread proc it is clearly not holding on to any mutexes, right? What is frustrating is that the call stack does not tell me where it is stuck.
Initially, I thought it was a problem how I was using std::thread (I am using them for the first time) but the I tried the same with Windows threads and got the same problem. So I must be doing something wrong.
Edit: I initially tagged the problem as vs2012 but I am actually using vs2013 sp1.
I have the following scenario:
I have multiple worker threads running that all go through a certain section of code, and they're allowed to do so simultaneously. No critical section surrounds this piece of code right now as it's not required for these threads.
I have a main thread that also -occassionally- wants to enter that section of code, but when it does, none of the other worker threads should use that section of code.
Naive solution: surround the section of code with a critical section. But that would kill a lot of parallelism between the worker threads, which is important in my case.
Is there a better solution?
Use RW locks. RW locks allow multiple readers and only a single writer. Your workers would call read-lock at the start of the critical section and the main thread would write-lock.
By definition, when calling read-lock, the calling process will wait for any writing threads to finish. When calling write-lock, the calling process will wait for any reading or writing threads to finish.
Example using POSIX threads:
pthread_rwlock_t lock;
/* worker threads */
void *do_work(void *args) {
for (int i = 0; i < 100; ++i) {
pthread_rwlock_rdlock(&lock);
// do some work...
pthread_rwlock_unlock(&lock);
sleep(1);
}
pthread_exit(0);
}
/* main thread */
int main(void) {
pthread_t workers[4];
pthread_rwlock_init(&lock);
int i;
// spawn workers...
for (i = 0; i < 4; ++i) {
pthread_create(workers[i]; NULL, do_worker, NULL);
}
for (i = 0; i < 100, ++i) {
pthread_rwlock_wrlock(&lock);
// do some work...
pthread_rwlock_unlock(&lock);
sleep(1);
}
return 0;
}
As far as I understand it, your worker threads are started asynchronously. So when the main thread wants to run this code section, you have to ensure that no worker thread is executing it. Therefore you have to stop all worker threads before the main thread can enter that code section, and allow them to enter it again afterwards.
This could be done - using Grand Central Dispatch - if your worker threads would be assigned to a dispatch group, see https://developer.apple.com/library/mac/#documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html.
The main thread could then send the message dispatch_group_wait to this dispatch group, wait for all worker thread to leave this code section, execute it, and then requeue the worker threads.
I'm using pthreads in a Windows application. I noticed my program was deadlocking--a quick inspection showed that the following had occurred:
Thread 1 spawned Thread 2. Thread 2 spawned Thread 3. Thread 2 waited on a mutex from Thread 3, which wasn't unlocking.
So, I went to debug in gdb and got the following when backtracing the third thread:
Thread 3 (thread 3456.0x880):
#0 0x7c8106e9 in KERNEL32!CreateThread ()
from /cygdrive/c/WINDOWS/system32/kernel32.dll
Cannot access memory at address 0x131
It was stuck, deadlocked, somehow, in the Windows CreateThread function! Obviously it couldn't unlock the mutex when it wasn't even able to start executing code. Yet, despite the fact that it was apparently stuck here, pthread_create returned zero (success).
What makes this particularly odd is that the same application on Linux has no such issues. What in the world would cause a thread to hang during the creation process (!?) but return successfully as if it had been created properly?
Edit: in response to the request for code, here's some code (simplified):
The creation of the thread:
if ( pthread_create( &h->lookahead->thread_handle, NULL, (void *)lookahead_thread, (void *)h->thread[h->param.i_threads] ) )
{
log( LOG_ERROR, "failed to create lookahead thread\n");
return ERROR;
}
while ( !h_lookahead->b_thread_active )
usleep(100);
return SUCCESS;
Note that it waits until b_thread_active is set, so somehow b_thread_active is being set, so the thread being called has to have done something...
... here's the lookahead_thread function:
void lookahead_thread( mainstruct *h )
{
h->lookahead->b_thread_active = 1;
while( !h->lookahead->b_exit_thread && h->lookahead->b_thread_active )
{
if ( synch_frame_list_get_size( &h->lookahead->next ) > delay )
_lookahead_slicetype_decide (h);
else
usleep(100); // Arbitrary number to keep thread from spinning
}
while ( synch_frame_list_get_size( &h->lookahead->next ) )
_lookahead_slicetype_decide (h);
h->lookahead->b_thread_active = 0;
}
lookahead_slicetype_decide (h); is the thing that the thread does.
The mutex, synch_frame_list_get_size:
int synch_frame_list_get_size( synch_frame_list_t *slist )
{
int fno = 0;
pthread_mutex_lock( &slist->mutex );
while (slist->list[fno]) fno++;
pthread_mutex_unlock( &slist->mutex );
return fno;
}
The backtrace of thread 2:
Thread 2 (thread 332.0xf18):
#0 0x00478853 in pthread_mutex_lock ()
#1 0x004362e8 in synch_frame_list_get_size (slist=0x3ef3a8)
at common/frame.c:1078
#2 0x004399e0 in lookahead_thread (h=0xd33150)
at encoder/lookahead.c:288
#3 0x0047c5ed in ptw32_threadStart#4 ()
#4 0x77c3a3b0 in msvcrt!_endthreadex ()
from /cygdrive/c/WINDOWS/system32/msvcrt.dll
#5 0x7c80b713 in KERNEL32!GetModuleFileNameA ()
from /cygdrive/c/WINDOWS/system32/kernel32.dll
#6 0x00000000 in ??
I would try double checking your mutexes in thread 2 and thread 3. Pthreads are implemented for windows using the standard windows api; So there will be slight differences between the windows and linux versions. This is a bizarre problem, but then again, that happens a lot in threading.
Could you try posting a snippet of the code where the locking is done in thread 2, and in the function that thread 3 should start in?
Edit in response to code
Did you ever unlock the mutex in thread 2? Your trace shows it locking a mutex, then creating a thread to do all that work which tries to also lock on the mutex. I'm guessing after thread 2 returns SUCESS it does? Also, why are you using flags and sleeping, perhaps barriers or conditional variables for process synchronization may be more robust.
Another note, is b_thread_active flag marked as volatile? Perhaps the compiler is caching something to not allow it to break out?