How can I tell whether it's safe/necessary to cudaFree() or not? - memory-management

I've allocated some GPU global memory with cudaMalloc(), say, in the constructor of some class. Now it's time to destruct the instance I've constructed, and I have my instance's data pointer. The thing is, I'm worried maybe some mischievous code elsewhere has called cudaDeviceReset(), after which my cudaFree() will probably fail (I'll get an invalid device pointer error). So, how can I tell whether my pointer is eligible for cudaFree()ing?

I don't believe you can do much about that.
About the best you can do is try to engineer the lifespan of objects which will call the CUDA APIs in their destructors so that they do so before context destruction. In practice, that means having them fall out of scope in a well-defined fashion before the context is automatically or manually torn down.
For a call like cudaFree(), which is somewhat "fire and forget" anyway, the best thing to do might be to write your own wrapper for the call and explicitly catch and tastefully ignore any obvious error conditions which would arise if the call was made after context destruction.
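For instance, a minimal sketch of such a wrapper (which error codes to swallow is my assumption; adjust it to whatever your runtime actually reports):

#include <cuda_runtime.h>

// Free a device pointer, but don't treat "the context is already gone" as fatal.
inline void safeCudaFree(void* ptr) noexcept {
    if (ptr == nullptr) return;
    cudaError_t err = cudaFree(ptr);
    if (err == cudaSuccess) return;
    // After cudaDeviceReset() or during program teardown the pointer is no longer
    // valid in any context; there is nothing useful left to do, so clear the
    // sticky error state and move on.
    if (err == cudaErrorInvalidValue || err == cudaErrorCudartUnloading) {
        cudaGetLastError();
        return;
    }
    cudaGetLastError();
    // Anything else is a genuine problem: log cudaGetErrorString(err) here rather
    // than throwing from a destructor.
}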

Given what talonmies says, one might consider doing the converse:
wrap your cudaDeviceReset() calls to also maintain a 'generation counter'.
Counter increases are protected by a lock.
While you hold the lock, you reset the device and increment the generation counter.
Wrap cudaMalloc() to also record the generation index obtained during allocation (you might need a class/struct for that); allocation takes the lock too.
Wrap cudaFree() to take the lock and only really cudaFree() if the reset generation has not changed.
... now, you might say "Is all that locking worth it? At worst, you'll get an error, it's not such a big deal." And, to be honest, I'm not sure it's worth it. You could make this somewhat less painful by using a reader-writer lock instead of a simple lock, where allocate and free are just readers that can all proceed concurrently.
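A rough sketch of what that could look like, assuming C++17 and a single global counter (all the names here, cuda_guard, checked_free and so on, are made up for illustration):

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <shared_mutex>

namespace cuda_guard {

std::shared_mutex mtx;          // readers: malloc/free; writer: device reset
std::uint64_t generation = 0;   // bumped on every cudaDeviceReset()

struct Allocation {
    void*         ptr = nullptr;
    std::uint64_t gen = 0;      // generation the pointer was allocated in
};

inline Allocation checked_malloc(std::size_t bytes) {
    std::shared_lock lock(mtx);        // shared: many allocations may proceed at once
    Allocation a;
    cudaMalloc(&a.ptr, bytes);
    a.gen = generation;
    return a;
}

inline void checked_free(const Allocation& a) {
    std::shared_lock lock(mtx);
    if (a.gen == generation)           // only free if no reset happened since allocation
        cudaFree(a.ptr);
}

inline void checked_device_reset() {
    std::unique_lock lock(mtx);        // exclusive: no malloc/free in flight
    cudaDeviceReset();
    ++generation;
}

} // namespace cuda_guard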

Related

How does COM avoid deadlocks when 2 objects call into each other?

Let's say there are two apartment-threaded COM objects, located in different apartments. Or maybe they're in different processes altogether. If one object calls a method on another, which in turn calls a method back on the first object, how does COM prevent the whole thing from deadlocking?
What you describe is called reentrancy.
The truth is that COM doesn’t do anything explicit to prevent reentrancy issues. It’s up to the implementer of each object to take precautions where needed, as applicable.
Funny enough, reentrancy in COM is far less common in real life than you would think. Object graphs in COM tend to be mostly trees, which do not exhibit reentrancy. When you have cycles it’s almost always because of objects exposing event-type functionality of some sort, typically Connection Points.
Event callbacks are very limited in scope and they trigger under the explicit control of each object’s code, so the programmer is able to easily time them so they occur at safe places (for example at/near the end of a method’s body after all the real work is done). This prevents serious reentrancy issues from developing.
But nothing stops you from coding something dangerous. For example, if an object triggers an event while its internal object state is inconsistent, all bets are off.
You mention deadlocks. Deadlocks require a locking mechanism of some sort (for example a Critical Section) and should be extremely rare to impossible in COM apartments for the reasons listed above. Any object that triggers an event while holding a lock is asking for serious trouble, and a deadlock is not the biggest of its worries: by virtue of being an STA object the reentrant call will run on the same thread, and it will be able to acquire the locks again and proceed right through, which means it’s very likely that the object will corrupt its internal state, cause a crash, or worse. Note that locks in an STA thread only make sense if the resources controlled by the lock are accessible to threads outside the object’s STA.
And finally, nothing in COM stops you from causing an infinite recursion loop and subsequent stack overflow either. For example, take two COM objects Obj1 and Obj2, with Obj2 implementing an event. We can have Obj1 call pObj2->SomeMethod(…), which causes Obj2 to fire the event; then have Obj1 listen ("sink") to that event, and have that event handler call SomeMethod() again.
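To make the shape of that loop concrete, here is a bare-bones sketch in plain C++ with no real COM plumbing (the names Obj1, Obj2, SomeMethod and OnEvent just follow the example above):

#include <functional>

struct Obj2 {
    std::function<void()> sink;        // stands in for a connection point / event sink
    void SomeMethod() {
        // ... do some work ...
        if (sink) sink();              // fire the event synchronously
    }
};

struct Obj1 {
    Obj2* other = nullptr;
    void Kick()    { other->SomeMethod(); }
    void OnEvent() { other->SomeMethod(); }   // the event handler calls back in
};

int main() {
    Obj1 a;
    Obj2 b;
    a.other = &b;
    b.sink = [&a] { a.OnEvent(); };
    a.Kick();   // SomeMethod -> event -> OnEvent -> SomeMethod -> ... stack overflow
}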
UPDATE:
Profound thanks to Remy Lebeau for pointing out in his comment something I had forgotten to discuss, via a link to the CodeGuru article Understanding COM Apartments, Part I. In the process I also learned something new that I should have known about.
There is one aspect of reentrancy and locking to consider, and that is what happens during inter-apartment calls (whether STA<->STA, STA<->MTA, or even STA<->OutOfProc). During an inter-apartment call the caller's STA thread needs to stall and wait for an answer to the call request; the response cannot (by definition) execute on the same thread. But it can't just fully block (e.g. WaitForSingleObject) waiting for the response, because the thread needs to be able to respond to and process not only potential callbacks to the original object, but also callbacks to any other object inside the same apartment. If it were to fully block, the COM infrastructure itself would be introducing the potential for a deadlock, and you wouldn't even need a dependency cycle between objects. So the COM marshalling infrastructure uses a more complex form of wait that can unblock for a few other situations (Hans Passant points to CoWaitForMultipleHandles, which looks right to me, but I don't know the infrastructure to that level). If an applicable callback occurs, the marshalling infrastructure will unblock and allow that call to enter the apartment and proceed.
This is a form of locking induced by the COM infrastructure itself, rather than one coded explicitly as part of the object's implementation, which is why I hadn't thought of bringing it up. So COM does in fact "do something to prevent deadlocks", but to prevent deadlock potentials induced by its own infrastructure.
The part that I hadn't consciously realized was that this mechanism is very selective. It only lets through COM calls that form part of the same causality chain, that is, a callback that is a direct consequence of the call the thread was waiting on. Other COM calls into the apartment have to queue up and wait for that call chain to conclude, and for the STA thread to return to the thread's message loop. [1]
[1] It makes complete sense that it needs to be that way, but I don't think I ever realized it.

What is the use-case for TryEnterCriticalSection?

I've been using Windows CRITICAL_SECTION since the 1990s and I've been aware of the TryEnterCriticalSection function since it first appeared. I understand that it's supposed to help me avoid a context switch and all that.
But it just occurred to me that I have never used it. Not once.
Nor have I ever felt I needed to use it. In fact, I can't think of a situation in which I would.
Generally when I need to get an exclusive lock on something, I need that lock and I need it now. I can't put it off until later. I certainly can't just say, "oh well, I won't update that data after all". So I need EnterCriticalSection, not TryEnterCriticalSection.
So what exactly is the use case for TryEnterCriticalSection?
I've Googled this, of course. I've found plenty of quick descriptions of how to use it, but almost no real-world examples of why. I did find this example from Intel that, frankly, doesn't help much:
#include <windows.h>

CRITICAL_SECTION cs;   // assumed to be initialized elsewhere with InitializeCriticalSection(&cs)

void threadfoo()
{
    while (TryEnterCriticalSection(&cs) == FALSE)
    {
        // some useful work
    }
    // critical section of code
    LeaveCriticalSection(&cs);
    // other work
}
What exactly is a scenario in which I can do "some useful work" while I'm waiting for my lock? I'd love to avoid thread-contention but in my code, by the time I need the critical section, I've already been forced to do all that "useful work" in order to get the values that I'm updating in shared data (for which I need the critical section in the first place).
Does anyone have a real-world example?
As an example you might have multiple threads that each produce a high volume of messages (events of some sort) that all need to go on a shared queue.
Since there's going to be frequent contention on the lock on the shared queue, each thread can have a local queue and then, whenever the TryEnterCriticalSection call succeeds for the current thread, it copies everything it has in its local queue to the shared one and releases the CS again.
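A rough sketch of that pattern, assuming the critical section is initialized elsewhere and with a hypothetical produceMessage() standing in for the real event source:

#include <windows.h>
#include <string>
#include <utility>
#include <vector>

CRITICAL_SECTION g_queueCs;                 // protects g_sharedQueue; initialized elsewhere
std::vector<std::string> g_sharedQueue;

std::string produceMessage();               // hypothetical message source

void producerLoop()
{
    std::vector<std::string> local;         // per-thread buffer, no lock needed
    for (;;)
    {
        local.push_back(produceMessage());

        // Opportunistically flush: if the lock is contended, keep buffering
        // locally instead of stalling the producer. (A real implementation
        // would also force a flush once the local buffer grows too large.)
        if (TryEnterCriticalSection(&g_queueCs))
        {
            for (auto& m : local)
                g_sharedQueue.push_back(std::move(m));
            local.clear();
            LeaveCriticalSection(&g_queueCs);
        }
    }
}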
In C++11 there is std::lock, which employs a deadlock-avoidance algorithm.
In C++17 this has been elaborated into the std::scoped_lock class.
The algorithm tries to lock the mutexes in one order, and then in another, until it succeeds. It takes try_lock to implement this approach.
Having a try_lock method is what the C++ Lockable named requirement adds, whereas mutexes with only lock and unlock are merely BasicLockable.
So if you build a C++ mutex on top of CRITICAL_SECTION and want to implement Lockable, or if you want to implement deadlock avoidance directly on CRITICAL_SECTION, you'll need TryEnterCriticalSection.
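For example, a minimal sketch of such a wrapper (the class name cs_mutex is made up):

#include <windows.h>
#include <mutex>

class cs_mutex {
public:
    cs_mutex()  { InitializeCriticalSection(&cs_); }
    ~cs_mutex() { DeleteCriticalSection(&cs_); }
    cs_mutex(const cs_mutex&) = delete;
    cs_mutex& operator=(const cs_mutex&) = delete;

    void lock()     { EnterCriticalSection(&cs_); }
    void unlock()   { LeaveCriticalSection(&cs_); }
    bool try_lock() { return TryEnterCriticalSection(&cs_) != FALSE; }  // this is what makes it Lockable

private:
    CRITICAL_SECTION cs_;
};

// std::scoped_lock over two of these runs the deadlock-avoidance algorithm,
// which in turn relies on try_lock, i.e. on TryEnterCriticalSection.
cs_mutex a, b;

void transfer()
{
    std::scoped_lock guard(a, b);
    // ... touch data protected by both locks ...
}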
Additionally, you can implement a timed mutex on top of TryEnterCriticalSection. You can do a few iterations of TryEnterCriticalSection, then call Sleep with an increasing delay, until TryEnterCriticalSection succeeds or the deadline has expired. It is not a very good idea, though. In reality, timed mutexes based on user-space Windows synchronization objects are implemented on top of SleepConditionVariableSRW, SleepConditionVariableCS, or WaitOnAddress.
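For what it's worth, a sketch of that spin-then-sleep approach could look like the following (illustrative only, for the reasons just given; the function name is made up):

#include <windows.h>

bool tryEnterWithTimeout(CRITICAL_SECTION* cs, DWORD timeoutMs)
{
    const ULONGLONG deadline = GetTickCount64() + timeoutMs;
    DWORD delay = 0;
    for (;;)
    {
        // A few quick attempts before we start sleeping.
        for (int i = 0; i < 4; ++i)
            if (TryEnterCriticalSection(cs))
                return true;

        if (GetTickCount64() >= deadline)
            return false;

        Sleep(delay);                              // back off...
        delay = (delay == 0) ? 1 : delay * 2;      // ...with increasing delay
        if (delay > 100) delay = 100;
    }
}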
Because Windows critical sections are recursive, TryEnterCriticalSection allows a thread to check whether it already owns a CS without risk of stalling.
Another case would be if you have a thread that occasionally needs to perform some locked work but usually does something else; you could use TryEnterCriticalSection and only perform the locked work if you actually got the lock.

Go destructors?

I know there are no destructors in Go since technically there are no classes. As such, I use initClass to perform the same functions as a constructor. However, is there any way to create something to mimic a destructor in the event of a termination, for the use of, say, closing files? Right now I just call defer deinitClass, but this is rather hackish and I think a poor design. What would be the proper way?
In the Go ecosystem, there exists a ubiquitous idiom for dealing with objects which wrap precious (and/or external) resources: a special method designated for freeing that resource, called explicitly — typically via the defer mechanism.
This special method is typically named Close(), and the user of the object has to call it explicitly when they're done with the resource the object represents. The io standard package even has a special interface, io.Closer, declaring that single method. Objects implementing I/O on various resources such as TCP sockets, UDP endpoints and files all satisfy io.Closer, and are expected to be explicitly Closed after use.
Calling such a cleanup method is typically done via the defer mechanism, which guarantees the method will run regardless of whether the code that executes after resource acquisition panics or not.
You might also notice that not having implicit "destructors" nicely balances not having implicit "constructors" in Go. This actually has nothing to do with not having "classes" in Go: the language designers just avoid magic as much as practically possible.
Note that Go's approach to this problem might appear to be somewhat low-tech, but in fact it's the only workable solution for a runtime featuring garbage collection. In a language with objects but without GC, say C++, destroying an object is a well-defined operation because an object is destroyed either when it goes out of scope or when delete is called on its memory block. In a runtime with GC, the object will be destroyed at some mostly indeterminate point in the future by the GC scan, and may not be destroyed at all. So if the object wraps some precious resource, that resource might get reclaimed way past the moment in time the last live reference to the enclosing object was lost, and it might even not get reclaimed at all—as has been well explained by #twotwotwo in their respective answer.
Another interesting aspect to consider is that Go's GC is fully concurrent (with the regular program execution). This means a GC thread which is about to collect a dead object might (and usually will) not be the thread(s) which executed that object's code when it was alive. In turn, this means that if Go types could have destructors, the programmer would need to make sure whatever code the destructor executes is properly synchronized with the rest of the program—if the object's state affects some data structures external to it. This might actually force the programmer to add such synchronization even if the object does not need it for its normal operation (and most objects fall into that category). And think about what happens if those external data structures happened to be destroyed before the object's destructor was called (the GC collects dead objects in a non-deterministic order). In other words, it's much easier to control — and to reason about — object destruction when it is explicitly coded into the program's flow: both for specifying when the object has to be destroyed, and for guaranteeing proper ordering of its destruction with respect to the destruction of the data structures external to it.
If you're familiar with .NET, it deals with resource cleanup in a way which resembles that of Go quite closely: your objects which wrap some precious resource have to implement the IDisposable interface, and a method, Dispose(), exported by that interface, must be called explicitly when you're done with such an object. C# provides some syntactic sugar for this use case via the using statement which makes the compiler arrange for calling Dispose() on the object when it goes out of the scope declared by the said statement. In Go, you'll typically defer calls to cleanup methods.
One more note of caution. Go wants you to treat errors very seriously (unlike most mainstream programming languages with their "just throw an exception and don't give a fsck about what happens due to it elsewhere and what state the program will be in" attitude), and so you might consider checking the error returns of at least some calls to cleanup methods.
A good example is instances of the os.File type representing files on a filesystem. The fun part is that calling Close() on an open file might fail for legitimate reasons, and if you were writing to that file, this might indicate that not all the data you wrote had actually landed in it on the file system. For an explanation, please read the "Notes" section of the close(2) manual.
In other words, just doing something like
fd, err := os.Open("foo.txt")
defer fd.Close()
is okay for read-only files in 99.9% of cases, but for files opened for writing you might want to implement more involved error checking and some strategy for dealing with errors (mere reporting, wait-then-retry, ask-then-maybe-retry, or whatever).
runtime.SetFinalizer(ptr, finalizerFunc) sets a finalizer: not a destructor, but another mechanism that may eventually free up resources. Read the documentation there for details, including the downsides. Finalizers might not run until long after the object is actually unreachable, and they might not run at all if the program exits first. They also postpone freeing memory for another GC cycle.
If you're acquiring some limited resource that doesn't already have a finalizer, and the program would eventually be unable to continue if it kept leaking, you should consider setting a finalizer. It can mitigate leaks. Unreachable files and network connections are already cleaned up by finalizers in the stdlib, so it's only other sorts of resources where custom ones can be useful. The most obvious class is system resources you acquire through syscall or cgo, but I can imagine others.
Finalizers can help get a resource freed eventually even if the code using it omits a Close() or similar cleanup, but they're too unpredictable to be the main way to free resources. They don't run until GC does. Because the program could exit before next GC, you can't rely on them for things that must be done, like flushing buffered output to the filesystem. If GC does happen, it might not happen soon enough: if a finalizer is responsible for closing network connections, maybe a remote host hits its limit on open connections to you before GC, or your process hits its file-descriptor limit, or you run out of ephemeral ports, or something else. So it's much better to defer and do cleanup right when it's necessary than to use a finalizer and hope it's done soon enough.
You don't see many SetFinalizer calls in everyday Go programming, partly because the most important ones are in the standard library and mostly because of their limited range of applicability in general.
In short, finalizers can help by freeing forgotten resources in long-running programs, but because not much about their behavior is guaranteed, they aren't fit to be your main resource-management mechanism.
There are Finalizers in Go. I wrote a little blog post about it. They are even used for closing files in the standard library as you can see here.
However, I think using defer is more preferable because it's more readable and less magical.

What is the design rationale behind HandleScope?

V8 requires a HandleScope to be declared in order to clean up any Local handles that were created within scope. I understand that HandleScope will dereference these handles for garbage collection, but I'm interested in why each Local class doesn't do the dereferencing themselves like most internal ref_ptr type helpers.
My thought is that HandleScope can do it more efficiently by dumping a large number of handles all at once rather than one by one as they would in a ref_ptr type scoped class.
Here is how I understand the documentation and the handles-inl.h source code. I, too, might be completely wrong since I'm not a V8 developer and documentation is scarce.
The garbage collector will, at times, move stuff from one memory location to another and, during one such sweep, also check which objects are still reachable and which are not. In contrast to reference-counting types like std::shared_ptr, this is able to detect and collect cyclic data structures. For all of this to work, V8 has to have a good idea about what objects are reachable.
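As a quick illustration of the cycle problem with reference counting (plain C++, nothing V8-specific): the two nodes below keep each other alive forever, whereas a tracing GC would reclaim both.

#include <memory>

struct Node {
    std::shared_ptr<Node> other;
};

int main() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->other = b;
    b->other = a;
    // When a and b go out of scope, each Node is still referenced by the other,
    // so neither refcount drops to zero and neither Node is ever freed.
}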
On the other hand, objects are created and deleted quite a lot during the internals of some computation. You don't want too much overhead for each such operation. The way to achieve this is by creating a stack of handles. Each object listed in that stack is available from some handle in some C++ computation. In addition to this, there are persistent handles, which presumably take more work to set up and which can survive beyond C++ computations.
Having a stack of references requires that you use this in a stack-like way. There is no “invalid” mark in that stack. All the objects from bottom to top of the stack are valid object references. The way to ensure this is the LocalScope. It keeps things hierarchical. With reference counted pointers you can do something like this:
#include <memory>
using std::shared_ptr;

struct Object { explicit Object(int) {} };

shared_ptr<Object> f() {
    shared_ptr<Object> a(new Object(1));
    shared_ptr<Object> b(new Object(2));
    return b;                // a (object 1) is destroyed here, object 2 survives
}

void g() {
    shared_ptr<Object> c = f();
}                            // object 2 is destroyed here
Here the object 1 is created first, then the object 2 is created, then the function returns and object 1 is destroyed, then object 2 is destroyed. The key point here is that there is a point in time when object 1 is invalid but object 2 is still valid. That's what LocalScope aims to avoid.
Some other GC implementations examine the C stack and look for pointers they find there. This has a good chance of false positives, since stuff which is in fact data could be misinterpreted as a pointer. For reachability this might seem rather harmless, but when rewriting pointers since you're moving objects, this can be fatal. It has a number of other drawbacks, and relies a lot on how the low level implementation of the language actually works. V8 avoids that by keeping the handle stack separate from the function call stack, while at the same time ensuring that they are sufficiently aligned to guarantee the mentioned hierarchy requirements.
To offer yet another comparison: an object referenced by just one shared_ptr becomes collectible (and actually will be collected) once its C++ block scope ends. An object referenced by a v8::Handle will become collectible when leaving the nearest enclosing scope which contains a HandleScope object. So programmers have more control over the granularity of stack operations. In a tight loop where performance is important, it might be useful to maintain just a single HandleScope for the whole computation, so that you don't have to touch the handle stack data structure so often. On the other hand, doing so will keep all the objects around for the whole duration of the computation, which would be very bad indeed if this were a loop iterating over many values, since all of them would be kept around till the end. But the programmer has full control, and can arrange things in the most appropriate way.
Personally, I'd make sure to construct a HandleScope:
At the beginning of every function which might be called from outside your code. This ensures that your code will clean up after itself.
In the body of every loop which might see more than three or so iterations, so that you only keep variables from the current iteration.
Around every block of code which is followed by some callback invocation, since this ensures that your stuff can get cleaned if the callback requires more memory.
Whenever I feel that something might produce considerable amounts of intermediate data which should get cleaned (or at least become collectible) as soon as possible.
In general I'd not create a HandleScope for every internal function if I can be sure that every other function calling this will already have set up a HandleScope. But that's probably a matter of taste.
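As a rough sketch of the per-iteration scope suggested above (V8's API has changed considerably across versions, so treat the exact calls as illustrative rather than definitive):

#include <cstdint>
#include <v8.h>

void ProcessMany(v8::Isolate* isolate,
                 v8::Local<v8::Context> context,
                 v8::Local<v8::Array> items) {
    for (uint32_t i = 0; i < items->Length(); ++i) {
        v8::HandleScope scope(isolate);      // one scope per iteration
        v8::Local<v8::Value> item = items->Get(context, i).ToLocalChecked();
        // ... work with item ...
    }   // handles created during this iteration become collectible here
}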
Disclaimer: This may not be an official answer, more of a conjecture on my part; but the v8 documentation is hardly useful on this topic, so I may be proven wrong.
From my understanding, gained while developing various v8-based backend applications, it's a means of handling the difference between the C++ and JavaScript environments.
Imagine the following sequence, in which a self-dereferencing pointer could break the system.
JavaScript calls up a C++-wrapped v8 function: let's say helloWorld()
The C++ function creates a v8::Handle with the value "hello world =x"
C++ returns the value to the v8 virtual machine
The C++ function does its usual cleaning up of resources, including dereferencing of handles
Another C++ function / process overwrites the freed memory space
V8 reads the handle: and the data is no longer the same, "hell!#(#..."
And that's just the surface of the complicated inconsistency between the two. Hence, to tackle the various issues of connecting the JavaScript VM (virtual machine) to the C++ interfacing code, I believe the development team decided to simplify the issue via the following...
All variable handles are to be stored in "buckets", aka HandleScopes, to be built / compiled / run / destroyed by their respective C++ code when needed.
Additionally, all function handles are to refer only to C++ static functions (I know this is irritating), which ensures the "existence" of the function call regardless of constructors / destructors.
Think of it from a development point of view: it marks a very strong distinction between the JavaScript VM development team and the C++ integration team (the Chrome dev team?), allowing both sides to work without interfering with one another.
Lastly, it could also be for the sake of simplicity in emulating multiple VMs: as v8 was originally meant for Google Chrome, a simple HandleScope creation and destruction whenever we open / close a tab makes for much easier GC management, especially in cases where you have many VMs running (one per tab in Chrome).

Object must be locked to be used?

I was pondering language features and I was wondering if the following feature had been implemented in any languages.
A way of declaring that an object may only be accessed within a mutex. So, for example, in Java you would only be able to access an object inside a synchronized block, and in C# inside a lock block.
A compiler error would ensue if the object was used outside of a Mutex block.
Any thoughts?
UPDATE
I think some people have misunderstood the question. I'm not asking whether you can lock objects; I'm asking whether there is a mechanism to state, at the declaration of an object, that it may only be accessed from within a lock/synchronized statement.
There are two ways to do that.
Your program either refuses to run a method unless the protecting mutex is locked by the calling thread (that's a runtime check), or it refuses to compile (that's a compile-time check).
The first way is what C# lock does.
The second method requires a compiler able to evaluate every possible execution path. That's hardly feasible.
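That said, you can approximate the spirit of the compile-time flavor at the library level. Here is my own illustrative C++ sketch (not something described in the question or the other answers, and all the names are made up): the wrapped object is only reachable through an accessor that holds the lock, so access outside the mutex simply doesn't compile.

#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class Guarded {
public:
    template <typename... Args>
    explicit Guarded(Args&&... args) : value_(std::forward<Args>(args)...) {}

    // Accessor that owns the lock for its whole lifetime.
    class Access {
    public:
        Access(T& value, std::mutex& m) : lock_(m), value_(value) {}
        T* operator->() { return &value_; }
        T& operator*()  { return value_; }
    private:
        std::unique_lock<std::mutex> lock_;
        T& value_;
    };

    Access lock() { return Access(value_, mutex_); }

private:
    std::mutex mutex_;
    T value_;   // never handed out except through Access
};

// Usage: the only way to reach the vector is via the locking accessor.
Guarded<std::vector<int>> shared;

void addItem(int x) {
    shared.lock()->push_back(x);   // lock acquired, element added, lock released
}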
In Java you can add the synchronized keyword to a method, but that is only syntactic sugar to wrapping the entire method body in a synchronized(this)-block (for non-static methods).
So for Java there is no language construct that enforces that behavior. You can try to .wait() on this with a zero timeout to ensure that the calling code has acquired the monitor, but that's just checking after the fact.
In Objective-C, you can use the @property and @synthesize directives to let the compiler generate the code for accessors. By default they are protected by a mutex.
Demanding locks on everything as you describe would create the potential for deadlocks, as one might be forced to take a lock sooner than one would otherwise.
That said, there are approaches similar to what you describe - Software Transactional Memory, in particular, avoids the deadlock issue by allowing rollbacks and retries.

Resources