Pthread RWLock on MAC Deadlocking but not on Linux? - macos

I've been experimenting with rwlock's on Mac and am experiencing something that seems to me shouldn't be happening. There's some weird combination of using read/write locks with recursive read locks that is deadlocking, but shouldn't be.
I posted the code on pastebin because it's more than just a snippet. The way this code is written shouldn't deadlock, and indeed doesn't when running on linux. Why does this deadlock on a mac?
http://pastebin.com/Ui9iS1ke
Any ideas?

See the bug I reported with apple.
https://bugreport.apple.com/cgi-bin/WebObjects/RadarWeb.woa/7/wo/0blX77DJS8lBTTxVnTsNDM/5.83.28.0.13

Here's the open radar bug.
http://openradar.appspot.com/8588290

Aaron: I just ran into this. I found that one can workaround this by using thread local storage. Create a wrapper around the rwlock that increments a thread local key:
#interface ReadWriteLock : NSObject {
pthread_key_t readKey;
pthread_key_t writeKey;
pthread_rwlock_t rwLock;
}
-(void)lockRead;
-(void)unlockRead;
-(void)lockWrite;
-(void)unlockWrite;
#end
Then increment the readKey using pthread_setspecific when you call lockRead, decrement it when you call unlockRead, only rd_lock when the key goes from 0 to 1 and only rw_unlock when the key goes from 1 to 0. Copy this for the writeLock logic.
Since the pthread_setspecific and pthread_getspecific are thread-local, you don't need to lock around access to these. Make sure to call the appropriate pthread creation / initialization functions in init, and make sure to dispose of all of the pthread_* members in dealloc.
Unfortunately I can't give you the full source to my solution, but the above method works ( I've tested it heavily ).

Related

Boost interprocess process crash when Memory allocation will lead to dead lock [duplicate]

I have a need for interprocess synchronization around a piece of hardware. Because this code will need to work on Windows and Linux, I'm wrapping with Boost Interprocess mutexes. Everything works well accept my method for checking abandonment of the mutex. There is the potential that this can happen and so I must prepare for it.
I've abandoned the mutex in my testing and, sure enough, when I use scoped_lock to lock the mutex, the process blocks indefinitely. I figured the way around this is by using the timeout mechanism on scoped_lock (since much time spent Googling for methods to account for this don't really show much, boost doesn't do much around this because of portability reasons).
Without further ado, here's what I have:
#include <boost/interprocess/sync/named_recursive_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
typedef boost::interprocess::named_recursive_mutex MyMutex;
typedef boost::interprocess::scoped_lock<MyMutex> ScopedLock;
MyMutex* pGate = new MyMutex(boost::interprocess::open_or_create, "MutexName");
{
// ScopedLock lock(*pGate); // this blocks indefinitely
boost::posix_time::ptime timeout(boost::posix_time::microsec_clock::local_time() + boost::posix_time::seconds(10));
ScopedLock lock(*pGate, timeout); // a 10 second timeout that returns immediately if the mutex is abandoned ?????
if(!lock.owns()) {
delete pGate;
boost::interprocess::named_recursive_mutex::remove("MutexName");
pGate = new MyMutex(boost::interprocess::open_or_create, "MutexName");
}
}
That, at least, is the idea. Three interesting points:
When I don't use the timeout object, and the mutex is abandoned, the ScopedLock ctor blocks indefinitely. That's expected.
When I do use the timeout, and the mutex is abandoned, the ScopedLock ctor returns immediately and tells me that it doesn't own the mutex. Ok, perhaps that's normal, but why isn't it waiting for the 10 seconds I'm telling it too?
When the mutex isn't abandoned, and I use the timeout, the ScopedLock ctor still returns immediately, telling me that it couldn't lock, or take ownership, of the mutex and I go through the motions of removing the mutex and remaking it. This is not at all what I want.
So, what am I missing on using these objects? Perhaps it's staring me in the face, but I can't see it and so I'm asking for help.
I should also mention that, because of how this hardware works, if the process cannot gain ownership of the mutex within 10 seconds, the mutex is abandoned. In fact, I could probably wait as little as 50 or 60 milliseconds, but 10 seconds is a nice "round" number of generosity.
I'm compiling on Windows 7 using Visual Studio 2010.
Thanks,
Andy
When I don't use the timeout object, and the mutex is abandoned, the ScopedLock ctor blocks indefinitely. That's expected
The best solution for your problem would be if boost had support for robust mutexes. However Boost currently does not support robust mutexes. There is only a plan to emulate robust mutexes, because only linux has native support on that. The emulation is still just planned by Ion Gaztanaga, the library author.
Check this link about a possible hacking of rubust mutexes into the boost libs:
http://boost.2283326.n4.nabble.com/boost-interprocess-gt-1-45-robust-mutexes-td3416151.html
Meanwhile you might try to use atomic variables in a shared segment.
Also take a look at this stackoverflow entry:
How do I take ownership of an abandoned boost::interprocess::interprocess_mutex?
When I do use the timeout, and the mutex is abandoned, the ScopedLock ctor returns immediately and tells me that it doesn't own the mutex. Ok, perhaps that's normal, but why isn't it waiting for the 10 seconds I'm telling it too?
This is very strange, you should not get this behavior. However:
The timed lock is possibly implemented in terms of the try lock. Check this documentation:
http://www.boost.org/doc/libs/1_53_0/doc/html/boost/interprocess/scoped_lock.html#idp57421760-bb
This means, the implementation of the timed lock might throw an exception internally and then returns false.
inline bool windows_mutex::timed_lock(const boost::posix_time::ptime &abs_time)
{
sync_handles &handles =
windows_intermodule_singleton<sync_handles>::get();
//This can throw
winapi_mutex_functions mut(handles.obtain_mutex(this->id_));
return mut.timed_lock(abs_time);
}
Possibly, the handle cannot be obtained, because the mutex is abandoned.
When the mutex isn't abandoned, and I use the timeout, the ScopedLock ctor still returns immediately, telling me that it couldn't lock, or take ownership, of the mutex and I go through the motions of removing the mutex and remaking it. This is not at all what I want.
I am not sure about this one, but I think the named mutex is implemented by using a shared memory. If you are using Linux, check for the file /dev/shm/MutexName. In Linux, a file descriptor remains valid until that is not closed, no matter if you have removed the file itself by e.g. boost::interprocess::named_recursive_mutex::remove.
Check out the BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING and BOOST_INTERPROCESS_TIMEOUT_WHEN_LOCKING_DURATION_MS compile flags. Define the first symbol in your code to force the interprocess mutexes to time out and the second symbol to define the timeout duration.
I helped to get them added to the library to solve the abandoned mutex issue. It was necessary to add it due to many interprocess constructs (like message_queue) that rely on the simple mutex rather than the timed mutex. There may be a more robust solution in the future, but this solution has worked just fine for my interprocess needs.
I'm sorry I can't help you with your code at the moment; something is not working correctly there.
BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING is not so good. It throws an exception and does not help much. To workaround exceptional behaviour I wrote this macro. It works just alright for common purposed. In this sample named_mutex is used. The macro creates a scoped lock with a timeout, and if the lock cannot be acquired for EXCEPTIONAL reasons, it will unlock it afterwards. This way the program can lock it again later and does not freeze or crash immediately.
#define TIMEOUT 1000
#define SAFELOCK(pMutex) \
boost::posix_time::ptime wait_time \
= boost::posix_time::microsec_clock::universal_time() \
+ boost::posix_time::milliseconds(TIMEOUT); \
boost::interprocess::scoped_lock<boost::interprocess::named_mutex> lock(*pMutex, wait_time); \
if(!lock.owns()) { \
pMutex->unlock(); }
But even this is not optimal, because the code to be locked now runs unlocked once. This may cause problems. You can easily extend the macro however. E.g. run code only if lock.owns() is true.
boost::interprocess::named_mutex has 3 defination:
on windows, you can use macro to use windows mutex instead of boost mutex, you can try catch the abandoned exception, and you should unlock it!
on linux, the boost has pthread_mutex, but it not robust attribute in 1_65_1version
so I implemented interprocess_mutex myself use system API(windows Mutex and linux pthread_mutex process shared mode), but windows Mutex is in the kernel instead of file.
Craig Graham answered this in a reply already but I thought I'd elaborate because I found this, didn't read his message, and beat my head against it to figure it out.
On a POSIX system, timed lock calls:
timespec ts = ptime_to_timespec(abs_time);
pthread_mutex_timedlock(&m_mut, &ts)
Where abs_time is the ptime that the user passes into interprocess timed_lock.
The problem is, that abs_time must be in UTC, not system time.
Assume that you want to wait for 10 seconds; if you're ahead of UTC your timed_lock() will return immediately,
and if you're behind UTC, your timed_lock() will return in hours_behind - 10 seconds.
The following ptime times out an interprocess mutex in 10 seconds:
boost::posix_time::ptime now = boost::posix_time::second_clock::universal_time() +
boost::posix_time::seconds(10);
If I use ::local_time() instead of ::universal_time(), since I'm ahead of UTC, it returns immediately.
The documentation fails to mention this.
I haven't tried it, but digging into the code a bit, it looks like the same problem would occur on a non-POSIX system.
If BOOST_INTERPROCESS_POSIX_TIMEOUTS is not defined, the function ipcdetail::try_based_timed_lock(*this, abs_time) is called.
It uses universal time as well, waiting on while(microsec_clock::universal_time() < abs_time).
This is only speculation, as I don't have quick access to a Windows system to test this on.
For full details, see https://www.boost.org/doc/libs/1_76_0/boost/interprocess/sync/detail/common_algorithms.hpp

How to synchronize a kernel workqueue thread?

I'm pretty new to Linux device drivers and kernel. I basically want to synchronize a workque thread (Lets call it A()) with another function (Lets call it B()). My purpose here is to fail B when A is running.
Currently, what I have done is as follows.
A(){
active = true; // a variable shared b/w both A and B
...
...
...
active = false;
}
B(){
if(active){
return -EBUSY
}
}
Is this the right way to synchronize these 2 functions? Is there any other strategy I should follow?
for Linux-kernel it's bad code. Try reading about mutex and semaphore.
http://www.linuxdevcenter.com/pub/a/linux/2007/05/24/semaphores-in-linux.html?page=5
Why do you want to do so?
A: mmc_rescan() is defined as INIT_DELAYED_WORK(&host->detect, mmc_rescan);
B: First line of code in mmc_suspend_host() is cancel_delayed_work(&host->detect);
So your A is canceled in B. It was done for a reason, brought by this commit.
So what's the reason to replace this cancellation with another synchronization? If you don't have this commit in your kernel, just pull it (cherry-pick), it probably fixes your problem.
UPDATE
Please see next commits, they change MMC suspend behavior:
mmc: core: Push common suspend|resume code into each bus_ops
mmc: core: Initiate suspend|resume from mmc bus instead of mmc host
mmc: core: Remove deprecated mmc_suspend|resume_host APIs
Maybe you just need to back-port those patches to fix your problem. Or at least they will make it a bit simpler to fix it.
You should define a mutex and use it whenever you have shared variable in your code, here is an implementation of mutex and conditions functions in Linux kernel Linux Mutexes and Conditions

What's the meaning of mutex_.AssertHeld() in source of leveldb

Recently I read the source of leveldb, the source url is https://leveldb.googlecode.com/files/leveldb-1.13.0.tar.gz
And when I read db/db_impl.cc,there comes the following code:
mutex_.AssertHeld()
I follow it into file port/port_posix.h,and I find the following :
void AssertHeld() { }
Then I grep in the souce dir,but can't find anyother implementation of the AssertHeld() anymore.
So here is my question,what does the mutex_.AssertHeld() do in db/db_impl.cc? THX
As you have observed it does nothing in the default implementation. The function seems to be a placeholder for checking whether a particular thread holds a mutex and optionally abort if it doesn't. This would be equivalent to the normal asserts we use for variables but applied on mutexes.
I think the reason it is not implemented yet is we don't have an equivalent light weight function to assert whether a thread holds a lock in pthread_mutex_t used in the default implementation. Some platforms which has that capability could fill this implementation as part of porting process. Searching online I did find some implementation for this function in the windows port of leveldb. I can see one way to implement it using a wrapper class over pthread_mutex_t and setting some sort of a thread id variable to indicate which thread(s) currently holds the mutex, but it will have to be carefully implemented given the race conditions that can arise.

how does procexp close a mutex held by another process?

I am trying to close a mutex that is being held by a process on Windows using Win32 functions. This can be done using procexp but I need to do it programmatically without using the procexp GUI.
Method1:
I tried injecting a dll into the processs using EasyHook and then tried the following from the injected thread:
- OpenMutex
- ReleaseMutex
It gave me the ERROR_NOT_OWNER error probably because the release was called on a different thread than the one that called AcquireMutex.
Method2:
After injecting the dll, I tried to hook for CreateMutex using mHook. The hooked CreateMutex just called back the original CreateMutex. But this would just crash the application.
I can use procexp to close the mutex but I need to do it programmatically. How does procexp do it? How can it be done programmatically without any kernel mode code?
Use NtQuerySystemInformation() to retrieve an array of open handles, loop through the array until you find the desired mutex handle in the target process, then close it using DuplicateHandle() by specifying the DUPLICATE_CLOSE_SOURCE flag.
The following article explains it in more detail:
HOWTO: Enumerate handles
Just adding a complete answer. I had to add the following the following code to handles.cpp after the mutex is recognized:
HANDLE realHandle;
ret = DuplicateHandle(processHandle, (HANDLE)handle.Handle, GetCurrentProcess(), &realHandle, 0, TRUE, DUPLICATE_CLOSE_SOURCE);
if(!ret)
printf("DuplicateHandle Problem!");
if (!CloseHandle(realHandle))
{
printf("Problem closing the copied handle");
}
printf("", realHandle);
}

Problem with .release behavior in file_operations

I'm dealing with a problem in a kernel module that get data from userspace using a /proc entry.
I set open/write/release entries for my own defined /proc entry, and manage well to use it to get data from userspace.
I handle errors in open/write functions well, and they are visible to user as open/fopen or write/fwrite/fprintf errors.
But some of the errors can only be checked at close (because it's the time all the data is available). In these cases I return something different than 0, which I supposed to be in some way the value 'close' or 'fclose' will return to user.
But whatever the value I return my close behave like if all is fine.
To be sure I replaced all the release() code by a simple 'return(-1);' and wrote a program that open/write/close the /proc entry, and prints the close return value (and the errno). It always return '0' whatever the value I give.
Behavior is the same with 'fclose', or by using shell mechanism (echo "..." >/proc/my/entry).
Any clue about this strange behavior that is not the one claimed in many tutorials I found?
BTW I'm using RHEL5 kernel (2.6.18, redhat modified), on a 64bit system.
Thanks.
Regards,
Yannick
The release() isn't allowed to cause the close() to fail.
You could require your userspace programs to call fsync() on the file descriptor before close(), if they want to find out about all possible errors; then implement your final error checking in the fsync() handler.

Resources