Why does OracleAQ Leave Dead Subscribers in the Queue? - oracle

This is Oracle 11.2.0.3.
We use Oracle's JMS over OracleAQ. This works fine, except that we started noticing the queue filling up with thousands, then millions, of messages over time. Some of these are in the PROCESSED state, but most are READY. We traced this behavior to "zombie" (dead) subscribers to the topic. When a Java process is terminated without getting the chance to unregister itself, it leaves its subscriber record in the queue, and Oracle doesn't seem to detect that it is dead. As a result, MONTHS later, a new message sent into our multi-subscriber queue is still multiplied by the number of subscribers, which Oracle thinks is much higher than it actually is. (We first noticed this when we reached the maximum subscriber limit.)
We've got the qmon processes running - I even tried increasing the minimum # of processes, to no effect. The queue clean-up happens really nicely as long as there are no dead subscribers in the queue.
Has anyone seen this before and, hopefully, found a solution?

OK, I could not find a better solution than this:
1) Create your subscriber with a name, and keep track of that name.
2) Make sure your application has a shutdown hook that executes the procedure below, which unsubscribes and de-registers the subscriber.
3) In case of an unexpected shutdown or crash, when the unsubscription could not be done, there must be a cleanup task that executes the code below:
DECLARE
  aqAgent SYS.AQ$_AGENT;
BEGIN
  -- Note: as written, this loops over every subscriber of the queue,
  -- not only the dead ones.
  FOR idx IN (SELECT consumer_name
                FROM DBA_QUEUE_SUBSCRIBERS a
               WHERE a.queue_name = '<Your Oracle AQ Name>') LOOP
    aqAgent := SYS.AQ$_AGENT(idx.consumer_name, NULL, NULL);
    DBMS_AQADM.REMOVE_SUBSCRIBER('<Your Oracle AQ Name>', aqAgent);
  END LOOP;
END;
/
This will make sure that your system remains foolproof.
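To make the mechanism concrete, here is a minimal sketch in Python of a toy multi-consumer queue, assuming only that a message is retained until every registered subscriber has dequeued it and that removing a subscriber also drops its references from existing messages. It is not Oracle AQ code; TopicModel, remove_stale and the subscriber names are hypothetical. It also shows a cleanup that removes only the subscribers not known to be alive, rather than all of them.

```python
class TopicModel:
    """Toy multi-consumer queue: a message stays until all subscribers dequeue it."""

    def __init__(self):
        self.subscribers = set()
        self.pending = {}  # message id -> subscribers that still owe a dequeue
        self._next_id = 0

    def add_subscriber(self, name):
        self.subscribers.add(name)

    def publish(self):
        msg_id = self._next_id
        self._next_id += 1
        if self.subscribers:
            # One pending entry per registered subscriber, dead or alive.
            self.pending[msg_id] = set(self.subscribers)
        return msg_id

    def _release(self, name):
        # Drop this subscriber's claim; purge messages nobody is waiting for.
        for msg_id in list(self.pending):
            self.pending[msg_id].discard(name)
            if not self.pending[msg_id]:
                del self.pending[msg_id]

    def dequeue_all(self, name):
        self._release(name)

    def remove_subscriber(self, name):
        self.subscribers.discard(name)
        self._release(name)


def remove_stale(topic, live_names):
    """Remove only subscribers that are not in the list of live names."""
    stale = sorted(topic.subscribers - set(live_names))
    for name in stale:
        topic.remove_subscriber(name)
    return stale
```

In this model, a subscriber that crashed without unsubscribing keeps every later message pending until remove_stale is called with the names the application still tracks as alive.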

Related

zeromq: ZMQ_CONFLATE==1 does not stop queues from saving old messages

With ZeroMQ and CPPZMQ 4.3.2, I want to drop old messages for all my sockets including
PAIR
Pub/Sub
REQ/REP
So I use m_socks[channel].setsockopt(ZMQ_CONFLATE, 1) on all my sockets before binding/connecting.
Test
However, when I made the following test, it seems that the old messages are still flushed out on each reconnection. In this test,
I use a thread to keep sending generated sinewave to a receiver thread
Every 10 seconds I double the sinewave's frequency
Then after 10 seconds I stop the process
Below is the pseudocode of the sender
// on sender end
auto thenSec = high_resolution_clock::now();
while (m_isRunning) {
    // generate sinewave, double the frequency every 10s or so
    auto nowSec = high_resolution_clock::now();
    if (duration_cast<seconds>(nowSec - thenSec).count() > 10) {
        m_sine.SetFreq(m_sine.GetFreq() * 2);
        thenSec = nowSec;
    }
    m_sine.Generate(audio);
    // send to rendering thread
    m_messenger.send("inproc://sound-ear.pair",
                     (const void*)(audio),
                     audio_size,
                     zmq::send_flags::dontwait
    );
}
Note that I already use DONTWAIT to mitigate blocking.
On the receiver side I have a zmq::poller_event handler that simply receives the last message on event polling.
In the stop sequence I reset the sinewave frequency to its lowest value, say, 440Hz.
Expected
The expected behaviour would be:
If I stop both the sender and the receiver after 10s when the frequency is doubled,
and I restart both,
then I should see the sinewave reset to 440Hz.
Observed
But the observed behaviour is that the received sinewave is still of the doubled frequency after restarting the communication, i.e., 880Hz.
Question
Am I doing it wrong or should I use some kind of killswitch to force drop all messages in this case?
OK, I think I solved it myself. Kind of.
Actual solution
I finally realized that the behaviour I want is to flush all messages when I stop the rendering. According to the official doc (How can I flush all messages that are in the ZeroMQ socket queue?), this can only be achieved by:
setting the ZMQ_LINGER option of both the sender's and the receiver's sockets to 0, meaning nothing is kept when those sockets are closed;
closing the sockets on both the sender and receiver ends, which also involves bootstrapping pollers and all references to the sockets.
This seems a lot of unnecessary work if I'm to restart rendering my data again, right after the stop sequence. But I found no other way to solve this cleanly.
Initial effort
It seems to me that ZMQ_CONFLATE does not make a difference on PAIR. I really have to tweak high water marks on sender and receiver ends using ZMQ_SNDHWM and ZMQ_RCVHWM.
However, I said "kind of solved" because tweaking the HWM is, in the end, not the optimal solution for a realtime application:
even with ZMQ_SNDHWM / ZMQ_RCVHWM set to the minimum of 1, there is still a sizable latency in realtime terms.
Also, with the lowest HWM the consumer thread can fall into an underrun situation, i.e., perceivable jitter.
If I'm not doing anything wrong, I guess the optimal solution would still be shared memory for my targeted scenario. This is sad because I really enjoyed the simplicity of ZMQ's multicast messaging patterns and hate to deal with thread locking littered everywhere.
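For illustration, the "keep only the newest message, and flush on stop" behaviour wanted here can be sketched as a single-slot shared mailbox. This is a minimal Python sketch of the shared-memory idea, not ZeroMQ code; the class name LatestValue and its methods are hypothetical.

```python
import threading


class LatestValue:
    """Single-slot mailbox: a new value overwrites the unread one."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._fresh = False

    def put(self, value):
        # The sender never blocks and never queues: it just overwrites.
        with self._lock:
            self._value = value
            self._fresh = True

    def take(self):
        """Return the newest unread value, or None if nothing is pending."""
        with self._lock:
            if not self._fresh:
                return None
            self._fresh = False
            return self._value

    def flush(self):
        """Drop whatever is pending, e.g. as part of the stop sequence."""
        with self._lock:
            self._value = None
            self._fresh = False
```

The one lock is confined to this class, so the sender and receiver threads do not need any locking of their own.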

Should dbms_aq.dequeue loop forever?

Here are my first steps with Oracle Advanced Queueing...
Scenario: I have a running application where many, many independent processes report back to a central controller that handles the next steps. Simplified, the processes are started via cron or via a callback from a process that has just finished. The callbacks come from remote hosts via http -> php -> DB, basically one http call after the process has finished on the remote host.
The complete controller logic was written in pl/sql with a singleton concept in mind, so only one process should execute the controller logic at the same time. In fact in 99% of all calls this is not necessary, but that's not the kind of thing I could change at the moment (nor the architecture in general).
To ensure this there is currently a bad mutex implementation, in pseudo-code:
$mutex = false;
while( not $mutex )
{
    $mutex = getMutex();
    if( $mutex )
        executeController();
    else
        sleep(5);
}
Wherein the mutex is a one field table having the values 0 (=> "free") or 1 ( => "busy" )
The result of this "beautiful" construction is a log file full of "Hey! Got no mutex! Waiting...". And the more processes wait, the longer they wait, with no control over who's next. Sometimes the load gets so heavy that Apache first forks and finally dies...
Solution
So my first "operation" would be to replace the mutex with Oracle Advanced Queueing with the controller as single-consumer. Benefits: No more "busy waiting" within the apache layer, strict first come first serve.
( Because all the DB-Actions take place in the same oracle-schema, this could be achieved with standard-objects, pl/sql-methods as well. But why reinvent the wheel, if there are dbms-packages?)
As far as I have read, using the listen feature (polling the queued items) in this context is far better than the registration feature (scheduling an action when a message arrives).
Basically everything works fine. I managed to:
create the message type
create the queue-table
create the queue
start the queue
add USER as subscriber
create a procedure for enqueueing
create a procedure for processing & dequeueing
create a procedure for listening to the queue and calling the "process & dequeue"-function when a message arrives.
Of course the listener shall be active 24/7, so I specified no "wait" time. In general, depending on the time of day, it will get "something to do" at least every few minutes, more likely every few seconds, sometimes more often.
Now here is my problem (if it actually is a problem). I just wrote it according to the examples I found so far:
CREATE OR REPLACE PROCEDURE demo_aq_listener IS
  qlist       dbms_aq.aq$_agent_list_t;
  agent_w_msg sys.aq$_agent;
BEGIN
  qlist(0) := sys.aq$_agent(USER, 'demo_aq_queue', NULL);
  LOOP
    dbms_aq.listen(agent_list => qlist, agent => agent_w_msg);
    DEMO_AQ_DEQUEUE();--process & dequeue
  END LOOP;
END;
/
Calling the procedure basically does what I expect: it stays "up" and processes the queued messages.
But is this the way to do it? What does it do if there are no queued messages? Is it "sleeping" within the dbms_aq.listen routine, or "looping as fast as it can", so that I have just implemented another form of "busy waiting"? Might there be a timeout (maybe at OS level or elsewhere) that I just haven't reached?
Here is the complete code with queue-definition etc.: demo_dbms_aq_with_listener.sql
UPDATE
Through further testing I just realized that my lack of understanding seems to be far greater than I hoped :(
At the "execution level", not using the listener at all and just looping the dequeue function has the same effect: it waits for the first/next message.
CREATE OR REPLACE PROCEDURE demo_aq_listener IS
BEGIN
  LOOP
    DEMO_AQ_DEQUEUE();
  END LOOP;
END;
/
At least this is easier to test, calling only
BEGIN
  DEMO_AQ_DEQUEUE();
END;
/
This also just waits for the first message, which leaves me totally confused as to whether I need the listener at all and whether what I'm doing makes any sense at all :(
Conclusion
I don't need the listener at all, because I have a single consumer who can treat all messages in the same way.
But the key question stays the same: is it OK to keep DBMS_AQ.DEQUEUE on "maybe active waiting" in a loop, knowing it'll get messages all day long at short intervals?
(you'll find DEMO_AQ_DEQUEUE() in the linked sql file above)
Better late than never: everything's fine, it is idle waiting:
1) Whilst the DEQUEUE is in sleep mode (WAIT FOREVER), I can see the session is waiting on the event - "Streams AQ: waiting for messages in the queue", that is an IDLE wait class and not actually consuming ANY resources, correct ?
Correct. It's similar to waiting on a row lock on a table. You just "sit there"
https://asktom.oracle.com/pls/apex/asktom.search?tag=writing-a-stand-alone-application-to-continuously-monitor-a-database-queue-aq
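As a language-neutral analogy (not Oracle code), the difference between the sleep(5) polling loop and a blocking dequeue can be sketched with Python's standard queue module: the single consumer sleeps inside get() until an item arrives, and items are handled strictly first come, first served. The names controller, run_demo and STOP are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells the consumer to exit


def controller(jobs, handled):
    """Single consumer: blocks in get() while the queue is empty."""
    while True:
        job = jobs.get()  # no timeout: the thread sleeps until an item arrives
        if job is STOP:
            break
        handled.append(job)  # stand-in for the real controller logic


def run_demo(job_names):
    jobs = queue.Queue()
    handled = []
    worker = threading.Thread(target=controller, args=(jobs, handled))
    worker.start()
    for name in job_names:
        jobs.put(name)
    jobs.put(STOP)
    worker.join()
    return handled
```

The loop around get() is not busy waiting: it only iterates once per delivered item.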

Can I mark IBM MQ messages as dirty?

I do have the following (multi-threaded) process in place:
Browse MQ queue (with lock) and get the next available message
Do something with it which might or might not fail
a. If successful, remove message from queue and start over or b. if not successful, leave message on queue
My problem arises from the fact that my application could die unexpectedly between step 2 and 3 and the application would then produce a duplicated message upon restart.
Is there a way to mark a message as 'dirty' or 'processing' on the queue (while or after reading it) with the mark persisting even if the application restarts?
I have tried to use the marks provided by MQ, but they do not survive a restart. Another possibility would be to move the message to a 'processing' queue, remove it on success or move it back to the source queue on failure, but this requires a second queue and is not trivial code anymore.
Rough code example:
MQGetMessageOptions gmo = new MQGetMessageOptions();
gmo.options = MQConstants.MQGMO_BROWSE_FIRST | MQConstants.MQGMO_LOCK;
MQMessage message = new MQMessage();
message.correlationId = MQC.MQCI_NONE;
message.messageId = MQC.MQMI_NONE;
queue.get(message, gmo);
boolean success = processMessage(message);
// Application gets killed here after successful message processing.
// Produces duplicate after restart.
if (success) {
    MQGetMessageOptions gmo2 = new MQGetMessageOptions();
    gmo2.options = MQConstants.MQGMO_MSG_UNDER_CURSOR;
    queue.get(new MQMessage(), gmo2);
}
Basically, I'd like to achieve this:
get message non-destructively from queue (only if not marked as "processing")
mark message as "processing" on queue
process message (including sending to some destination)
if successful delete from queue, or remove "processing" state on queue otherwise
If the application dies right after a successful third step 'process message', the message would be marked as "processing" and would not be processed again (as it might have been already).
Note: I do not want this process to have any knowledge about the message processing (other than success).
Have you tried SYNCPOINT? A commit or backout kind of operation might help in this scenario.
Your solution is a horrible design. If you are updating a database then why are you not using 2 phase commits (i.e. XA transactions)?
Just have your MQAdmin set up the queue manager to use the resource manager of the particular database you are using; then it is as simple as:
Start transaction (2 phase commit)
Get message (destructive get NOT browse) from the queue
Update database
Commit transaction
Hence, everything in the transaction, MQGET and database update, will either be committed together or backed out together.
If your application were to crash, then the resource manager will automatically back out everything in the transaction.
Let's say you don't want to use 2 phase commit, or you are not updating a database (e.g. you are updating a file); then you can use a single phase UOW (Unit of Work).
Use MQGMO option of MQGMO_SYNCPOINT
Get message (destructive get NOT browse) from the queue
Update whatever you are updating
Issue MQCMIT
Things to know about MQ:
If an application issues an MQDISC or ends normally, with current uncommitted operations, an implied MQCMIT is executed by IBM MQ, i.e. all operations done under SYNCPOINT are committed.
If an application ends abnormally, with current uncommitted operations, an implied MQBACK is executed by IBM MQ, i.e. all operations done under SYNCPOINT are rolled back.
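The unit-of-work flow described above can be sketched as a toy model in Python. This is not the IBM MQ API: SyncpointQueue and consume_one are hypothetical names, and the explicit backout() call stands in for the implied MQBACK that the queue manager performs when an application ends abnormally.

```python
from collections import deque


class SyncpointQueue:
    """Toy model of a destructive get under syncpoint (not the IBM MQ API)."""

    def __init__(self, messages):
        self._messages = deque(messages)
        self._in_flight = []  # messages got but not yet committed

    def get(self):
        # Destructive get: the message leaves the queue but is remembered.
        msg = self._messages.popleft()
        self._in_flight.append(msg)
        return msg

    def commit(self):
        # The gets become permanent.
        self._in_flight.clear()

    def backout(self):
        # Put uncommitted messages back in their original order.
        while self._in_flight:
            self._messages.appendleft(self._in_flight.pop())

    def depth(self):
        return len(self._messages)


def consume_one(q, process):
    """Get, process, then commit; back out if processing fails."""
    msg = q.get()
    try:
        process(msg)
    except Exception:
        q.backout()
        return False
    q.commit()
    return True
```

With this shape there is no separate "processing" mark to maintain: a message is either back on the queue or gone, depending on whether the unit of work was committed.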

How to use multiple sessions per connection in a multi-threaded application?

Suppose I have one connection c and many session objects s1, s2 .. sn, each working in different threads t1, t2 ... tn.
                        c
                        |
-------------------------------------------------
|            |            |                     |
(t1,s1)   (t2,s2)      (t3,s3)     ......    (tn,sn)
Now suppose one of the thread t3 wants to send a message to a particular queue q3 and then listen to the reply asynchronously. So it does the following:
1: c.stop();
2: auto producer = s3.createProducer(s3.createQueue(q3));
3: auto text = s3.createTextMessage(message);
4: auto replyQueue = s3.createTemporaryQueue();
5: text.setJMSReplyTo(replyQueue);
6: producer.send(text);
7: auto consumer = s3.createConsumer(replyQueue);
8: consumer.setMessageListener(myListener);
9: c.start();
I called c.stop() at the beginning and c.start() at the end because I'm not sure whether any of the other threads has called start on the connection (making all the sessions asynchronous; is that right?), and as per the documentation:
"If synchronous calls, such as creation of a consumer or producer, must be made on an asynchronous session, the Connection.Stop must be called. A session can be resumed by calling the Connection.Start method to start delivery of messages."
So calling stop in the beginning of the steps and then start in the end seems reasonable and thus the code seems correct (at least to me). However, when I thought about it more, I think the code is buggy, as it doesn't make sure no other threads call start before t3 finishes all the steps.
So my questions are:
Do I need to use a mutex to ensure it? Or does XMS handle it automatically (which means my reasoning is wrong)?
How do I design my application so that I don't have to call stop and start every time I want to send a message and listen for the reply asynchronously?
As per the quoted text above, I cannot call createProducer() and createConsumer() if the connection is in asynchronous mode. What other methods can I not call? The documentation doesn't categorise the methods in this way.
Also, the documentation doesn't say clearly what makes a session asynchronous. It says this:
"A session is not made asynchronous by assigning a message listener to a consumer. A session becomes asynchronous only when the Connection.Start method is called."
I see two problems here:
Calling c.start() makes all sessions asynchronous, not just one.
If I call c.start() but don't assign any message listener to a consumer, are the session(s) still asynchronous?
It seems I've lots of questions, so it'd be great if anyone could provide me with links to the parts or sections of the documentation which explains XMS objects with such minute details.
This says,
"According to the specification, calling stop(), close() on a Connection, setMessageListener() on a Session etc. must wait till all message processing finishes, that is till all onMessage() calls which have already been entered exit. So if anyone attempts to do that operation inside onMessage() there will be a deadlock by design."
But I'm not sure if that information is authentic, as I didn't find this info on IBM documentation.
I prefer the KISS rule. Why don't you use 1 connection per thread? Then the code would not have to worry about conflicts between threads.

PostMessage occasionally loses a message

I wrote a multi-threaded Windows application with two threads:
A – is a Windows form that handles user interaction and processes the data from B.
B – occasionally generates data and passes it to A.
A thread-safe queue is used to pass the data from thread B to A. The enqueue and dequeue functions are guarded using a Windows critical section object.
If the queue is empty when the enqueue function is called, the function will use PostMessage to tell A that there is data in the queue. The function checks to make sure the call to PostMessage is executed successfully and repeatedly calls PostMessage if it is not successful (PostMessage has yet to fail).
This worked well for quite some time until one specific computer started to lose the occasional message. By lose I mean that PostMessage returns successfully in B but A never receives the message. This causes the software to appear frozen.
I have already come up with a couple of acceptable workarounds. I am interested in knowing why Windows is losing these messages and why this is only happening on the one computer.
Here are the relevant portions of the code.
// Only called by B
procedure TSharedQueue.Enqueue(AItem: TSQItem);
var
  B: boolean;
begin
  EnterCriticalSection(FQueueLock);
  if FCount > 0 then
  begin
    FLast.FNext := AItem;
    FLast := AItem;
  end
  else
  begin
    FFirst := AItem;
    FLast := AItem;
  end;
  if (FCount = 0) or (FCount mod 10 = 0) then // just in case a message is lost
    repeat
      B := PostMessage(FConsumer, SQ_HAS_DATA, 0, 0);
      if not B then
        Sleep(1000); // this line of code has never been reached
    until B;
  Inc(FCount);
  LeaveCriticalSection(FQueueLock);
end;

// Only called by A
function TSharedQueue.Dequeue: TSQItem;
begin
  EnterCriticalSection(FQueueLock);
  if FCount > 0 then
  begin
    Result := FFirst;
    FFirst := FFirst.FNext;
    Result.FNext := nil;
    Dec(FCount);
  end
  else
    Result := nil;
  LeaveCriticalSection(FQueueLock);
end;

// procedure called when SQ_HAS_DATA is received
procedure TfrmMonitor.SQHasData(var AMessage: TMessage);
var
  Item: TSQItem;
begin
  while FMessageQueue.Count > 0 do
  begin
    Item := FMessageQueue.Dequeue;
    // use the Item somehow
  end;
end;
Is FCount also protected by FQueueLock? If not, then your problem lies with FCount being incremented after the posted message is already processed.
Here's what might be happening:
B enters critical section
B calls PostMessage
A receives the message but doesn't do anything since FCount is 0
B increments FCount
B leaves critical section
A sits there like a duck
A quick remedy would be to increment FCount before calling PostMessage.
Keep in mind that things can happen quicker than one would expect (i.e. the message posted with PostMessage being caught and processed by another thread before you have a chance to increment FCount a few lines later), especially when you're in a true multi-threaded environment (multiple CPUs). That's why I asked earlier if the "problem machine" had multiple CPUs/cores.
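The ordering problem can be reproduced deterministically with a small Python model, assuming the worst case in which the consumer handles the notification before the producer executes its next line. This is a sketch, not the original Delphi code: NotifyingQueue and its fields are hypothetical, and the direct call to _handler() stands in for PostMessage being received immediately.

```python
class NotifyingQueue:
    """Toy model where the notification is handled the instant it is sent."""

    def __init__(self, count_before_notify):
        self.items = []
        self.count = 0
        self.count_before_notify = count_before_notify
        self.seen_by_handler = []

    def _handler(self):
        # Mirrors SQHasData: drain only while the counter says there is data.
        while self.count > 0:
            self.seen_by_handler.append(self.items.pop(0))
            self.count -= 1

    def enqueue(self, item):
        self.items.append(item)
        was_empty = self.count == 0
        if self.count_before_notify:
            self.count += 1  # fixed order: count first, then notify
        if was_empty:
            self._handler()  # the notification is processed right away
        if not self.count_before_notify:
            self.count += 1  # original order: count after notifying
```

With count_before_notify=False the handler sees a count of 0, does nothing, and no further notification is sent because the queue is no longer empty; with count_before_notify=True every item is drained.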
An easy way to troubleshoot problems like these is to scaffold the code with additional logging that records every time you enter a method, enter/leave a critical section, etc. Then you can analyze the log to see the true order of events.
On a separate note, a nice little optimization that can be done in a producer/consumer scenario like this is to use two queues instead of one. When the consumer wakes up to process the full queue, you swap the full queue with an empty one and just lock/process the full queue while the new empty queue can be populated without the two threads trying to lock each other's queues. You'd still need some locking in the swapping of the two queues though.
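A minimal Python sketch of the two-queue swap, with hypothetical names (SwapQueue, enqueue, drain): the lock is held only for the append and for the swap, and the consumer then works through its batch without holding any lock.

```python
import threading


class SwapQueue:
    """Producer appends to one list; consumer swaps it for an empty one."""

    def __init__(self):
        self._lock = threading.Lock()
        self._filling = []

    def enqueue(self, item):
        with self._lock:
            self._filling.append(item)

    def drain(self):
        """Swap under the lock, then hand the full batch to the caller."""
        with self._lock:
            batch, self._filling = self._filling, []
        return batch  # processed by the consumer outside the lock
```

The consumer would call drain() each time it is woken up and iterate over the returned list, while the producer keeps filling the fresh one.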
"If the queue is empty when the enqueue function is called, the function will use PostMessage to tell A that there is data in the queue."
Are you locking the message queue before checking the queue size and issuing the PostMessage? You may be experiencing a race condition where you check the queue and find it non-empty when in fact A is processing the very last message and is about to go idle.
To see if you're in fact experiencing a race condition and not a problem with PostMessage, you could switch to using an event. The worker thread (A) would wait on the event instead of waiting for a message. B would simply set that event instead of posting a message.
"This worked well for quite some time until one specific computer started to lose the occasional message."
By any chance, does this specific computer have a different number of CPUs or cores than the others where you see no problem? Sometimes when you switch from a single-CPU machine to a machine with more than one physical CPU/core, new race conditions or deadlocks may arise.
Could there be a second instance unknowingly running and eating the messages, marking them as handled?
