Castalia running error of type cRuntimeError - omnet++

My omnetpp.ini sets sim-time-limit = 1000s, and the simulation runs normally until I get this error:
Running Castalia: Configuration 1/1 Run 1/1 Complete 0%terminate called after throwing an instance of 'cRuntimeError'
what(): Object Data is currently in (cMessageHeap)simulation.scheduled-events, it cannot be deleted. If this error occurs inside cMessageHeap, it needs to be changed to call drop() before it can delete that object. If this error occurs inside cMessageHeap's destructor and Data is a class member, cMessageHeap needs to call drop() in the destructor
When I checked the trace, it had recorded all the information correctly until 673.xxxx. In another execution it stopped at 164.xxxx, and in another at 978.xxxx.
Have you seen this error before? Could you point me to a possible solution?

Related

How to debug MPI program before bad termination?

I am currently developing a program written in C++ with the MPI+pthread paradigm.
I added some functionality to my program, but now I get a bad termination message from one MPI process, like this:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 37805 RUNNING AT node165
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0#node162] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:0#node162] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2#node166] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:2#node166] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2#node166] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node162: task 0: Exited with exit code 7
[proxy:0:0#node162] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node166: task 2: Exited with exit code 7
[mpiexec#node162] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec#node162] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec#node162] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec#node162] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
My problem is that I have no idea why I get this kind of message, and thus how to correct it.
I use only some basic MPI functions, and I ensure that no thread makes MPI calls (only my "master process" is allowed to call such functions).
I also checked that no process sends a message to itself, and that the destination process exists before a message is sent.
My question is quite simple: how can I find out where the problem comes from, so that I can debug my application?
Thank you a lot.
One of your processes has had a segmentation fault. This means it read from or wrote to an area of memory that it is not permitted to access.
MPI functions are often difficult to get right the first time, so the fault may well come from an MPI call: for example, a send or receive with an incorrect size or buffer location.
The best solution is to fire up a parallel debugger so that you can watch all the processes. It looks like you are using a proper HPC system, so there is a chance that one is already installed; DDT and TotalView are the most popular.
Take a look at How to debug an MPI program
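A quick way to see why EXIT CODE: 11 reads as a segmentation fault: signal number 11 is SIGSEGV. The sketch below assumes the launcher reports the terminating signal number as the exit code, which the banner itself does not state; describe_exit_code is a hypothetical helper name, not part of MPI.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a numeric exit code to a signal name, if it matches one.

    Assumption: the launcher printed the signal number as the exit code.
    """
    try:
        return signal.Signals(code).name
    except ValueError:
        # The number does not correspond to any signal on this platform.
        return "not a signal number"


if __name__ == "__main__":
    print(describe_exit_code(11))  # SIGSEGV
```

Once the crash is known to be a memory error, a debugger or a core dump from the failing rank is the next step for locating the faulty access.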
In my experience writing C++ with MPI, this problem frequently occurred when I did not call MPI_Finalize() before every return statement.

Safe await on function in another process

TL;DR
How can I safely await a function (it takes a str and an int as arguments and doesn't require any other context) that executes in a separate process?
Long story
I have an aiohttp.web web API that uses a Boost.Python wrapper for a C++ extension. It runs under gunicorn (I plan to deploy it on Heroku) and is tested with Locust.
About the extension: it has just one function that does a non-blocking operation. It takes one string (and one integer for timeout management), does some calculations with it, and returns a new string. For every input string there is only one possible output (except on timeout, but in that case a C++ exception must be raised and translated by Boost.Python into a Python-compatible one).
In short, a handler for specific URL executes the code below:
res = await loop.run_in_executor(executor, func, *args)
where executor is the ProcessPoolExecutor instance and func is the function from the C++ extension module. (In the real project, this code is in a coroutine method of a class, and func is a classmethod that only executes the C++ function and returns the result.)
Error catching
When a new request arrives, I extract its POST data with request.post() and then store it in an instance of a custom class named Call (because I have no idea what else to name it). That call object contains all the input data (the string), the time the request was received, and the unique id that comes with the request.
It then proceeds to a class named Handler (not the aiohttp request handler), which passes its input to another class's method that has loop.run_in_executor inside. Handler has a logging system that works like middleware: it reads the id and receiving time of every incoming call object and logs them with a message saying whether the call is just starting to execute, executed successfully, or ran into trouble. Handler also has a try/except and stores all errors inside the call object, so the logging middleware knows what error occurred or what output the extension returned.
Testing
I have a unit test that just creates 256 coroutines with this code inside and an executor with 256 workers, and it works well.
But when testing with Locust, a problem appears. I use 4 Gunicorn workers and 4 executor workers for this kind of testing. At some point the application just starts to return wrong output.
My Locust TaskSet is configured to log every failed response with all available information: output string, error string, input string (which the application returns too), and id. All simulated requests are the same, but the id is unique for each one.
The situation is better when I set Gunicorn's max_requests option to 100 requests, but failures still occur.
Interestingly, I can sometimes trigger a "wrong output" period simply by stopping and starting the Locust test.
I need a 100% guarantee that my web API works as I expect.
UPDATE & solution
I just asked my teammate to review the C++ code: the problem was in global variables. Somehow that wasn't a problem for 256 parallel coroutines, but it was for Gunicorn.
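A minimal sketch of the pattern described in this question, assuming the function handed to the executor is pure (no global state), which is what the fix came down to. Here transform is a hypothetical stand-in for the C++ extension function and call_in_executor is an illustrative name; neither comes from the original project.

```python
import asyncio
from concurrent import futures


def transform(text: str, timeout_ms: int) -> str:
    """Hypothetical stand-in for the extension function.

    It depends only on its arguments, so the same input always gives the
    same output no matter which worker process runs it.
    """
    if timeout_ms <= 0:
        raise TimeoutError("timeout budget exhausted")
    return text[::-1]


async def call_in_executor(executor: futures.Executor, text: str, timeout_ms: int) -> str:
    """Await transform() in the given executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, transform, text, timeout_ms)


async def main() -> None:
    # Four worker processes, matching the Locust setup described above.
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        results = await asyncio.gather(
            *(call_in_executor(executor, "abc", 100) for _ in range(8))
        )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

The function passed to a ProcessPoolExecutor has to be picklable, so it is defined at module level, and its arguments and return value are pickled on the way in and out. An exception raised in the worker is re-raised at the await, so it can be caught in the coroutine.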

Error while working with the inet radio modules

I have two problems while working with the inet radio modules. I think they are somehow related.
Problem 1
When I am using subscribe function in my MAC layer
radioModule->subscribe(IRadio::radioModeChangedSignal, this);
radioModule->subscribe(IRadio::transmissionStateChangedSignal, this);
I get this error
Error in module (inet::physicallayer::Radio) MyNetwork.sta[0].nic[0].radio(id=19) during network initialization: inet::MyMac: Unsupported signal data type long for signal radioModeChanged (id=34).
Problem 2
My receiver module has a problem with these functions in inet.physicallayer.common.RadioMedium.cc
const IListening *listening = receiverRadio->getReceiver()->createListening(receiverRadio, arrival->getStartTime(), arrival->getEndTime(), arrival->getStartPosition(), arrival->getEndPosition());
[...]
communicationCache->setCachedListening(receiverRadio, transmission, listening);
I get this error
<!> Error in module (inet::physicallayer::Radio) MyNetwork.sta[0].nic[0].radio (id=19) at event #33, t=2: ASSERT: condition shareCount == 0 false in function parsimUnpack, cpacket.cc line 146.
Regarding Problem 1: you have to override the method
virtual void receiveSignal(cComponent *source, simsignal_t signalID, long l, cObject *details)
in your MyMac class. Without this method, a simple module doesn't know what to do with the received signal, so it throws an error.
Problem 2 is connected with packet handling, maybe decapsulation. The code shown is not the source of it. Set debug-on-errors=true in your omnetpp.ini, then run the simulation in debug mode. You should see the place in your code that causes this error.

Retry on Runtime Errors

I have come across this problem a few times and never been able to resolve it but now I need to solve it once and for all.
I have a procedure which has been throwing runtime errors. This is not a problem, as I have an error handler set at the top of the function and the handler itself at the bottom, something like this:
retryConcat:
On Local Error GoTo concatErr
'Some Code here
Exit Sub
concatErr:
If MsgBox("Could not append the receipt for this transaction to the Receipt viewer logs.", vbExclamation + vbRetryCancel, "Warning - Missing Receipt") = vbRetry Then
err.Clear
GoTo retryConcat
End If
The error handler contains a message box allowing the user to retry if required. Now here is the part which confuses me. The first time an error is thrown, it shows the message box and allows the user to retry as expected. The program then jumps to the appropriate line and tries again. However, the second time through, when the error is thrown it does not jump to the error handler; it jumps out of the procedure, and the error handler in the parent catches it instead!
So my question is: why does it jump to the parent error handler on subsequent throws? This happens in many places all over my code. In many cases where I can manually check for errors, I can put the code in a while loop to solve it, but with runtime errors I am forced to use error trapping, which acts in this rather annoying way.
Any help or advice would be appreciated.
You need to use Resume retryConcat.
When an error occurs, execution jumps into the error handler at concatErr:. You then show the message box, and if the user chooses to retry, the code jumps to retryConcat. Because you used GoTo, it DOES NOT exit the error handler, so the next time the error occurs, it is already in the error handler and has no choice but to raise the error up the chain to the calling procedure.
Using Resume retryConcat allows it to exit the error handler and resume at the required point, meaning that the next time the error occurs, it can handle it again.
It is probably easier to understand if you imagine the error handler as a state, not a section of code.
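The "handler is a state" idea can be sketched as a toy model. This is not VBA and not how the runtime is implemented; it is a hypothetical Python simulation in which a single flag, handler_active, stands for "currently inside the error handler". All names here are illustrative.

```python
class PropagatedToCaller(Exception):
    """Raised by the toy model when an error leaves the procedure."""


def run(attempt_outcomes, use_resume):
    """Simulate one procedure with a retry label and one error handler.

    attempt_outcomes: sequence of booleans, True meaning that attempt fails.
    use_resume: True models `Resume retryConcat`, False models `GoTo retryConcat`.
    Returns how many times the local handler ran before a successful attempt.
    """
    handler_active = False  # the "state" the answer talks about
    handled = 0
    for fails in attempt_outcomes:
        if not fails:
            return handled  # normal exit, like Exit Sub
        if handler_active:
            # Already in the handler state: the error goes up to the caller.
            raise PropagatedToCaller(f"escaped after {handled} handled error(s)")
        handler_active = True  # entering the handler
        handled += 1
        if use_resume:
            handler_active = False  # Resume leaves the handler state
        # With GoTo, the flag stays set while the code retries.
    return handled


if __name__ == "__main__":
    print(run([True, True, False], use_resume=True))  # 2
    try:
        run([True, True, False], use_resume=False)
    except PropagatedToCaller as exc:
        print("caller saw:", exc)
```

With use_resume=True both failures are handled locally; with use_resume=False the second failure escapes to the caller, which matches the behaviour described in the question.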

-[UIDeviceRGBColor isEqualToString:]: unrecognized selector

Can someone tell me what exactly is that about?
I have a table; inside one table cell I have a picker view, and there are some text fields in other cells.
When I scroll the table up and down 8-10 times, the app crashes and gives me this error:
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[UIDeviceRGBColor isEqualToString:]: unrecognized selector sent to instance 0x5834850'
Short answer: it is trying to call -isEqualToString: on an instance of UIDeviceRGBColor, which doesn't respond to it.
Long answer: you're either asking for the wrong object at some point, or quite possibly trying to access an object that has been released but whose pointer has not been set to nil. Sometimes when this happens you'll get a straight crash, as the memory in the new location isn't a proper object. Sometimes a new object takes its place. The best way to find out is to turn on Zombies.
This is a good overview of how to use Zombies: http://iosdevelopertips.com/debugging/tracking-down-exc_bad_access-errors-with-nszombieenabled.html
You may start seeing messages saying "-[NSCFString isEqualToString:] message sent to deallocated instance". If so then this is a memory management problem and you need to double check your retains & releases. If you don't get this message then you are likely calling the wrong method and so getting the wrong object back.
