How do you efficiently debug reference count problems in shared memory?

Assume you have a reference counted object in shared memory. The reference count represents the number of processes using the object, and processes are responsible for incrementing and decrementing the count via atomic instructions, so the reference count itself is in shared memory as well (it could be a field of the object, or the object could contain a pointer to the count, I'm open to suggestions if they assist with solving this problem). Occasionally, a process will have a bug that prevents it from decrementing the count. How do you make it as easy as possible to figure out which process is not decrementing the count?
One solution I've thought of is giving each process a UID (maybe their PID). Then when processes decrement, they push their UID onto a linked list stored alongside the reference count (I chose a linked list because you can atomically append to head with CAS). When you want to debug, you have a special process that looks at the linked lists of the objects still alive in shared memory, and whichever apps' UIDs are not in the list are the ones that have yet to decrement the count.
The disadvantage of this solution is that it has O(N) memory usage, where N is the number of processes. If the number of processes using the shared memory area is large, and you have a large number of objects, this quickly becomes very expensive. I suspect there might be a halfway solution where, with partial fixed-size information, you could assist debugging by narrowing down the list of possible processes, even if you couldn't pinpoint a single one. Or if you could just detect which process hasn't decremented when only a single process hasn't (i.e. unable to handle detection of 2 or more processes failing to decrement the count), that would probably still be a big help.
(There are more 'human' solutions to this problem, like making sure all applications use the same library to access the shared memory region, but if the shared area is treated as a binary interface and not all processes are going to be applications written by you, that's out of your control. Also, even if all apps use the same library, one app might have a bug outside the library corrupting memory in such a way that it's prevented from decrementing the count. Yes, I'm using an unsafe language like C/C++ ;)
Edit: In single process situations, you will have control, so you can use RAII (in C++).

You could do this using only a single extra integer per object.
Initialise the integer to zero. When a process increments the reference count for the object, it XORs its PID into the integer:
object.tracker ^= self.pid;
When a process decrements the reference count, it does the same.
If the reference count is ever left at 1, then the tracker integer will be equal to the PID of the process that incremented it but didn't decrement it.
This works because XOR is commutative and associative ( A ^ B == B ^ A, and (A ^ B) ^ C == A ^ (B ^ C) ), so if a process XORs the tracker with its own PID an even number of times, it's the same as XORing it with PID ^ PID - that's zero, which leaves the tracker value unaffected.
You could alternatively use an unsigned value (which is defined to wrap rather than overflow) - adding the PID when incrementing the usage count and subtracting it when decrementing it.

Fundamentally, shared state in shared memory is not a robust solution, and I don't know of a way of making it robust.
Ultimately, if a process exits all its non-shared resources are cleaned up by the operating system. This is incidentally the big win from using processes (fork()) instead of threads.
However, shared resources are not. File handles that other processes still have open are obviously not closed, and neither is shared memory. Shared resources are only cleaned up after the last process sharing them exits.
Imagine you have a list of PIDs in the shared memory. A process could scan this list looking for zombies, but then PIDs can get reused, or the app might have hung rather than crashed, or...
My recommendation is that you use pipes or other message passing primitives between each process (sometimes there is a natural master-slave relationship, other times all need to talk to all). Then you take advantage of the operating system closing these connections when a process dies, and so your peers get signalled in that event. Additionally you can use ping/pong timeout messages to determine if a peer has hung.
If, after profiling, it is too inefficient to send the actual data in these messages, you could use shared memory for the payload as long as you keep the control channel over some kind of stream that the operating system clears up.

The most efficient tracing systems for resource ownership don't even use reference counts, let alone lists of reference-holders. They just have static information about the layouts of every data type that might exist in memory, also the shape of the stack frame for every function, and every object has a type indicator. So a debugging tool can scan the stack of every thread, and follow references to objects recursively until it has a map of all the objects in memory and how they refer to each other. But of course systems that have this capability also have automatic garbage collection anyway. They need help from the compiler to gain all that information about the layout of objects and stack frames, and such information cannot actually be reliably obtained from C/C++ in all cases (because object references can be stored in unions, etc.) On the plus side, they perform way better than reference counting at runtime.
Per your question, in the "degenerate" case, all (or almost all) of your process's state would be held in shared memory - apart from local variables on the stack. And at that point you would have the exact equivalent of a multi-threaded program in a single process. Or to put it another way, processes that share enough memory start to become indistinguishable from threads.
This implies that you needn't specify the "multiple processes, shared memory" part of your question. You face the same problem anyone faces when they try to use reference counting. Those who use threads (or make unrestrained use of shared memory; same thing) face another set of problems. Put the two together and you have a world of pain.
In general terms, it's good advice not to share mutable objects between threads, where possible. An object with a reference count is mutable, because the count can be modified. In other words, you are sharing mutable objects between (effective) threads.
I'd say that if your use of shared memory is complex enough to need something akin to GC, then you've almost got the worst of both worlds: the expense of process creation without the advantages of process isolation. You've written (in effect) a multi-threaded application in which you are sharing mutable objects between threads.
Local sockets are a very cross-platform and very fast API for interprocess communication; the only one that works basically identically on all Unices and Windows. So consider using that as a minimal communication channel.
By the way, are you consistently using smart pointers in the processes that hold references? That's your only hope of getting reference counting even half right.

Use the following:

int pids[MAX_PROCS];
int counter;

Increment:

do {
    i = index such that pids[i] == 0;   // optimistic search for a free slot
} while (cas(&pids[i], 0, my_pid) == false);
my_pos = i;
atomic_inc(&counter);

Decrement:

pids[my_pos] = 0;
atomic_dec(&counter);

So you know all the processes using this object. Make MAX_PROCS big enough, and search for a free slot starting at a random index, so that if the number of processes is significantly lower than MAX_PROCS the search will be very fast.

Besides doing things yourself, you can also use a tool like AQTime, which has a reference-counting memory checker.

Related

Is a dynamically allocated 2D array automatically deleted after the program exits?

I am learning about destructors right now because of an assignment about matrices (we're supposed to make a Matrix class and overload operators to do matrix operations; the person I'm going to mention in the next bit and I were also planning to make it perform Gauss-Jordan elimination, if that's relevant), which are represented in this assignment as dynamic 2D arrays.
I heard someone talk about using a destructor for the deletion of the arrays. I started reading about destructors, and the only event that seemed likely to invoke one in an application like this was the termination of the program, so I am left kind of confused as to why he'd need a destructor. What's the significance of a destructor in an application like this?
The answer to the question in the title is:
Yes. And No.
No:
If the process creates an object with new and terminates without calling delete on that same object, the object is never destructed. Any action that would be done by the destructor is simply not done.
This action can be stuff that is required for consistency of external data. Like pushing something to a database. Or like flushing a cache to disk. What action is missed depends entirely on the destructor.
Yes:
The memory that was occupied by the process is not lost to the system. Your process requested some chunks of memory from the system's kernel, so that it was able to construct its objects within that memory. The kernel keeps track of which memory pages it has allocated to which process, and it does not care a bit what that process does with it. The kernel is entirely oblivious to which objects were constructed within the memory.
When a process exits, the kernel will simply reclaim any memory that's still allocated to the process. As such, you don't permanently lose memory by forgetting to delete objects at shutdown.
However, this reclaiming affects memory use, only. The contents of any cache that wasn't flushed remains unflushed. And the external files that were in an inconsistent state when the process terminated will remain in that inconsistent state forever.
So, bottom line: Memory will be reclaimed by the kernel anyway. But it's generally not a good idea to forget cleanup. It's better not to get into the habit of being lazy, because that habit will bite you severely down the road.

MPI: Ensure an exclusive access to a shared memory (RMA)

I would like to know which is the best way to ensure an exclusive access to a shared resource (such as memory window) among n processes in MPI. I've tried MPI_Win_lock & MPI_Win_fence but they don't seem to work as expected, i.e: I can see that multiple processes enter a critical region (code between MPI_Win_lock & MPI_Win_unlock that contains MPI_Get and/or MPI_Put) at the same time.
I would appreciate your suggestions. Thanks.
In MPI-2 you cannot truly do atomic read-modify-write operations; these were only introduced in MPI-3, with MPI_Fetch_and_op. That is why your critical data is being modified by several processes at once.
Furthermore, take care with MPI_Win_lock. As described here:
The name of this routine is misleading. In particular, this routine need not block, except when the target process is the calling process.
The actual blocking call is MPI_Win_unlock, meaning that only after returning from this procedure can you be sure that the values from put and get are correct. Perhaps this is better described here:
MPI passive target operations are organized into access epochs that are bracketed by MPI Win lock and MPI Win unlock calls. Clever MPI implementations [10] will combine all the data movement operations (puts, gets, and accumulates) into one network transaction that occurs at the unlock.
This same document can also provide a solution to your problem, which is that critical data is not written atomically. It does this through the use of a mutex, which is a mechanism that ensures only one process can access data at the time.
I recommend you read this document; the solution they propose is not difficult to implement.

MPI shared memory access

In a parallel MPI program on, for example, 100 processors:
Suppose there is a global counter that must be known to all MPI processes; each of them can add to this number, and the others should see the change instantly and add to the changed value.
Synchronization is not feasible and would have a lot of latency issues.
Would it be OK to open a shared memory among all the processes and use this memory for accessing this number also changing that?
Would it be OK to use MPI_WIN_ALLOCATE_SHARED or something like that or is this not a good solution?
Your question suggests to me that you want to have your cake and eat it too. This will end in tears.
I say that you want to have your cake and eat it too because you state that you want to synchronise the activities of 100 processes without synchronisation. You want to have 100 processes incrementing a shared counter, (presumably) to have all the updates applied correctly and consistently, and to have increments propagated to all processes instantly. No matter how you tackle this problem it is one of synchronisation; either you write synchronised code or you offload the task to a library or run-time which does it for you.
Is it reasonable to expect MPI RMA to provide automatic synchronisation for you? No, not really. Note first that mpi_win_allocate_shared is only valid if all the processes in the communicator which make the call are in shared memory. Given that you have the hardware to support 100 processes in the same shared memory, you still have to write code to ensure synchronisation; MPI won't do it for you. If you do have 100 processes, any or all of which may increment the shared counter, there is nothing in the MPI standard, or any implementations that I am familiar with, which will prevent a data race on that counter.
Even shared-memory parallel programs (as opposed to MPI providing shared-memory-like parallel programs) have to take measures to avoid data races and other similar issues.
You could certainly write an MPI program to synchronise accesses to the shared counter but a better approach would be to rethink your program's structure to avoid too-tight synchronisation between processes.

Programmatically determine amount of time remaining before preemption

I am trying to implement some custom lock-free structures. It operates similarly to a stack, so it has a take() and a free() method and operates on a pointer and an underlying array. Typically it uses optimistic concurrency: free() writes a dummy value to pointer+1, increments the pointer, and writes the real value to the new address. take() reads the value at the pointer in a spin/sleep style until it doesn't read the dummy value, and then decrements the pointer. In both operations, changes to the pointer are done with compare-and-swap; if that fails, the whole operation starts again. The purpose of the dummy value is to ensure consistency, since the write operation can be preempted after the pointer is incremented.
This situation leads me to wonder whether it is possible to prevent preemption in that critical place by somehow determining how much time is left before the thread will be preempted by the scheduler in favour of another thread. I'm not worried about hardware interrupts. I'm trying to eliminate the possible sleep from my reading function so that I can rely on a pure spin.
Is this at all possible?
Are there other means to handle this situation?
EDIT: To clarify how this may be helpful: if the critical operation is interrupted, it will effectively be like taking out an exclusive lock, and all other threads will have to sleep before they can continue with their operations.
EDIT: I am not hellbent on having it solved like this; I am merely trying to see if it's possible. The probability of that operation being interrupted in that location for a very long time is extremely low, and if it does happen, it will be OK if all the other operations need to sleep so that it can complete.
Some regard this as premature optimization, but this is just my pet project. Regardless, that does not exclude research and science from attempting to improve techniques. Even though computer science has reasonably matured, and every new technology we use today is just an implementation of what was already known 40 years ago, we should not stop being creative in addressing even the smallest of concerns, like trying to make a reasonable set of operations atomic without too many performance implications.
Such information surely exists somewhere, but it is of no use for you.
Under "normal conditions", you can expect upwards of a dozen DPCs and upwards of 1,000 interrupts per second. These do not respect your time slices, they occur when they occur. Which means, on the average, you can expect 15-16 interrupts within a time slice.
Also, scheduling does not strictly go quantum by quantum. The scheduler under present Windows versions will normally let a thread run for 2 quantums, but may change its opinion in the middle if some external condition changes (for example, if an event object is signalled).
Insofar, even if you know that you still have so and so many nanoseconds left, whatever you think you know might not be true at all.
Cannot be done without time travel. You're stuffed.

How to show percentage of 'memory used' in a win32 process?

I know that memory usage is a very complex issue on Windows.
I am trying to write a UI control for a large application that shows a 'percentage of memory used' number, in order to give the user an indication that it may be time to clear up some memory, or more likely restart the application.
One implementation used ullAvailVirtual from MEMORYSTATUSEX as a base, then used HeapWalk() to walk the process heap looking for additional free memory. The HeapWalk() step was needed because we noticed that, after a while of running, memory allocated and then freed on the heap was never returned to the system and so was not reflected in the ullAvailVirtual number. After hours of intensive work, the ullAvailVirtual number would no longer accurately report the amount of memory available.
However, this method proved not ideal, due to occasional odd errors that HeapWalk() would return, even when the process heap was not corrupted. Further, since this is a UI control, the heap walking code was executing every 5-10 seconds. I tried contacting Microsoft about why HeapWalk() was failing, escalated a case via MSDN, but never got an answer other than "you probably shouldn't do that".
So, as a second implementation, I used PagefileUsage from PROCESS_MEMORY_COUNTERS as a base. Then I used VirtualQueryEx to walk the virtual address space adding up all regions that weren't MEM_FREE and returned a value for GetMappedFileNameA(). My thinking was that the PageFileUsage was essentially 'private bytes' so if I added to that value the total size of the DLLs my process was using, it would be a good approximation of the amount of memory my process was using.
This second method seems to (sorta) work, at least it doesn't cause crashes like the heap walker method. However, when both methods are enabled, the values are not the same. So one of the methods is wrong.
So, StackOverflow world...how would you implement this?
which method is more promising, or do you have a third, better method?
should I go back to the original method, and further debug the odd errors?
should I stay away from walking the heap every 5-10 seconds?
Keep in mind the whole point is to indicate to the user that it is getting 'dangerous', and they should either free up memory or restart the application. Perhaps a 'percentage used' isn't the best solution to this problem? What is? Another idea I had was a color based system (red, yellow, green, which I could base on more factors than just a single number)
Yes, the Windows memory manager was optimized to fulfill requests for memory as quickly and efficiently as possible; it was not optimized to easily measure how much space is used. The first downfall is that heap blocks that are released are rarely unmapped. They are simply marked as "free", to be used by the next allocation. That's why VirtualQueryEx() cannot work.
The problem with HeapWalk is that you have to lock the heap (HeapLock) so that it can walk it without the heap allocation changing. That lock can have very detrimental side-effects. Quoting:
Walking a heap may degrade performance, especially on symmetric multiprocessing (SMP) computers. The side effects may last until the process ends.
Even then, the number you get back is pretty meaningless. A program never runs out of free space; it runs out of a large enough contiguous chunk of memory to fulfill the request. No happy answers, I'm afraid. Except one: a 64-bit operating system costs less than two hundred bucks.
The place to start is probably GetProcessMemoryInfo(). This fills in a structure for you that has, among other things, the current working set in bytes.
Have a look at the following article: .NET and running processes
It uses WMI to check the memory usage of processes, specifically using the System.Diagnostics.Process class.
And here is another link, on how to use WMI in C#: WMI Made Easy for C#
Hope this helps.
