IIB Collector Node and transactions - ibm-integration-bus

I am using a Collector Node in my message flow. It is configured to collect 50 messages or wait for 30 seconds. Under load testing, WebSphere MQ sometimes reports that a long-running transaction has been detected, and the pid corresponds to the pid of the application's execution group. The question is: is it possible that the Collector Node does not commit its internal transaction while waiting for the messages or for the timeout expiry?

The MQInput node is where the transactionality is specified. This is described in the IIB v10 KC page Developing integration solutions > Developing message flows > Message flow behavior > Changing message flow behavior > Configuring transactionality for message flows > Configuring MQ nodes for transactions
If you set the property to Yes (the default option): if a transaction is not already inflight, the node starts a transaction.
The Collector Node does not commit until it times out or reaches the count. See the IIB v10 KC page Reference > Message flow development > Built-in nodes > Collector node
All input messages that are received under sync point from a transaction or thread by the Collector node are stored in internal queues. Storing the input messages under sync point ensures that the messages remain in a consistent state for the outgoing thread to process; such messages are available only at the end of the transaction or thread that propagates the input messages.
A new transaction is created when a message collection is complete, and is propagated to the next node.

Whenever you configure a node (one that is eligible as per the IBM documentation) to work under a transaction, it does not commit until the unit of work is completed. In your case, since 50 messages (if they arrive within 30 seconds) are requested in one unit of work, the message flow containing the Collector node and all other nodes in that flow commits only once all 50 messages have been processed successfully. During this period, the queue manager has to keep this in-flight state in its logs, which is why the log space may need to be increased. So any large unit of work causes this issue, irrespective of the node used.
Since your issue deals with an MQ long-running transaction, ensure you have enough MQ log space for transaction handling by the queue manager.
To increase the MQ log space, open the file below and increase the number of primary and secondary log files:
==> IBM\WebSphere MQ\qmgrs\QMNAME\qm.ini
Below are the values that you have to increase. By default they are 3 and 2. Ensure you have enough disk space for whatever numbers you increase them to. Restart your queue manager once the qm.ini file has been updated.
Log:
LogPrimaryFiles=3
LogSecondaryFiles=2
Link to the MQ log configuration documentation:
https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_9.0.0/com.ibm.mq.con.doc/q018710_.htm
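Before raising the numbers, it can help to estimate how much disk the log files may occupy. The following is only a rough sketch: it assumes circular logging and a LogFilePages value of 4096 pages of 4 KB each (check the value in your own qm.ini), and the function name and the increased values are purely illustrative.

```python
PAGE_SIZE_BYTES = 4 * 1024  # LogFilePages is expressed in 4 KB pages


def log_space_mb(primary_files, secondary_files, log_file_pages=4096):
    """Estimate the maximum disk space (in MB) the log files can occupy.

    log_file_pages=4096 is an assumption; use the value from your qm.ini.
    """
    total_files = primary_files + secondary_files
    return total_files * log_file_pages * PAGE_SIZE_BYTES / (1024 * 1024)


print(log_space_mb(3, 2))   # the values shown in the stanza above
print(log_space_mb(10, 5))  # hypothetical increased values
```

With these assumptions each log file is 16 MB, so the estimate grows by 16 MB for every file you add.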
Hope this helps.

Related

Is there any way to check the current bulk queue size in OpenSearch?

My OpenSearch cluster sometimes returns the error "429 Too Many Requests" when writing data. I know there is a queue, and when the queue is full it returns that error. So is there any API to check the bulk queue status, such as its current size? Example: queue 150/200 (nearly full)
Yes, you can use the following API call
GET _cat/thread_pool?v
You will get something like this, where you can see the node name, the thread pool name (look for write), the number of active requests currently being carried out, the number of requests waiting in the queue and finally the number of rejected requests.
node_name name active queue rejected
node01 search 0 0 0
node01 write 8 2 0
The write thread pool can process as many requests at the same time as 1 + number of CPUs. If all active threads are busy and new requests come in, they go into the queue (default size 10000). If both active and queue are full, requests start to be rejected.
Your mileage may vary, but when optimizing this, you're looking at:
keeping rejected at 0
minimizing the number of requests in the queue
making sure that active requests get carried out as fast as possible.
Instead of increasing the queue, it's usually preferable to increase the number of CPUs. If you have heavy ingest pipelines kicking in, it's often a good idea to add ingest nodes whose job is to execute those pipelines instead of running them on the data nodes.
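To get the "queue 150/200" style of reading asked for in the question, the table returned by the API can be parsed with a few lines of code. This is a minimal sketch that works on the text output shown above; the function names are illustrative, and the queue limit is passed in as a parameter (defaulting to the 10000 mentioned above) because the sample output does not include it. The cat thread pool API should also be able to report the limit itself through a queue_size column (for example with h=node_name,name,active,queue,queue_size,rejected), but verify that against your version.

```python
def parse_cat_thread_pool(text):
    """Parse the whitespace-separated table returned by _cat/thread_pool?v."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def write_pool_usage(text, queue_size=10000):
    """Return (node, queued, queue_size, rejected) for every write pool row."""
    usage = []
    for row in parse_cat_thread_pool(text):
        if row["name"] == "write":
            usage.append(
                (row["node_name"], int(row["queue"]), queue_size, int(row["rejected"]))
            )
    return usage


sample = """
node_name name   active queue rejected
node01    search 0      0     0
node01    write  8      2     0
"""

for node, queued, size, rejected in write_pool_usage(sample):
    print(f"{node}: write queue {queued}/{size}, rejected {rejected}")
```

Polling this periodically and alerting when the queued count approaches the limit, or when rejected starts to grow, gives an early warning before the 429 errors appear.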

Spring Batch - restart behavior upon worker crash

I've been exploring how Spring Batch works in certain failure cases when remote partitioning is used.
Let's say I have 3 worker nodes and 1 manager node. The manager node creates 30 partitions that the workers can pick up. The messaging layer is Kafka.
The workers are up, waiting for work to arrive on the specific topic. The manager node creates the partitions, puts them into the DB and sends the messages on the Kafka topic which has 3 partitions.
All nodes have started processing, but suddenly one node crashes. The node that crashed will have the step execution states set to STARTED/STARTING for the partitions it had initially picked up.
Another node will come to the rescue, since the Kafka partitions will get revoked and reassigned, so one of the 2 remaining nodes will read the partition that the crashed node was reading.
In this case, nothing will happen of course because the original Kafka offset was committed by the crashed node even though the processing hasn't finished. Let's say when partitions get reassigned, I set the consumer back to the topic's beginning - for the partitions it manages.
Awesome, this way the consumer will start consuming messages from the partition of the crashed node.
And here's the catch. Even though some of the step executions that the crashed node processed are in the COMPLETED state, the new node that took over will reprocess those step executions once more, even though they were already finished by the crashed node.
This seems strange to me.
Maybe I'm trying to solve this the wrong way, I'm not sure, but I would appreciate any suggestions on how to make the workers fault-tolerant against crashes.
Thanks!
If a StepExecution is marked as COMPLETED in the job repository, it will not be reprocessed. No data will be run again. A new StepExecution may be created (I don't have the code in front of me right now) but when Spring Batch evaluates what to do based on the previous run, it won't process it again. That's a key feature of how Spring Batch's partitioning works. You can send the workers 100 messages to process each partition, but it will only actually get processed once due to the synchronization in the job repository. If you are seeing other behavior, we would need more information (details from your job repository and configuration specifics).
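The effect of that synchronization can be illustrated with a toy model. This is not Spring Batch code and does not reflect its internals; it is only a sketch assuming a shared repository that records one status per partition, and every name in it is hypothetical.

```python
class ToyJobRepository:
    """Stand-in for a shared job repository: partition name -> status."""

    def __init__(self):
        self.status = {}


def handle_message(repo, partition, process):
    """Process a partition unless the repository already marks it COMPLETED."""
    if repo.status.get(partition) == "COMPLETED":
        return False  # duplicate delivery: nothing is run again
    repo.status[partition] = "STARTED"
    process(partition)
    repo.status[partition] = "COMPLETED"
    return True


repo = ToyJobRepository()
processed = []

# The same partition message is delivered twice, e.g. after a Kafka rebalance.
for name in ["partition0", "partition1", "partition0"]:
    handle_message(repo, name, processed.append)

print(processed)
```

In this model a partition left in STARTED by a crashed worker is picked up again, while a COMPLETED one is skipped even if its message is redelivered.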

NiFi - data stuck in queues when load balancing is used

In Apache NiFi, dockerized version 1.15, a cluster of 3 NiFi nodes is created. When load balancing is used via default port 6342, flow files get stuck in some of the queues, in the queue in which load balancing is enabled. But, when "List queue" is tried, the message "The queue has no FlowFiles." is issued:
The part of the NiFi processor group where the issue happens:
Configuration of NiFi queue in which flow files seem to be stuck:
Another problem, maybe not related, is that after this happens, some of the flow files reach the subsequent NiFi processors, but get stuck before the MergeContent processors. This time, the queues can be listed:
The part of code when the second issue occurs:
The configuration of the queue:
The listing of the FlowFiles in the queue:
The MergeContent processor configuration. The parameter "max_num_for_merge_smxs" is set to 100:
Load balancing is used because data are gathered from the SFTP server, and that processor runs only on the Primary node.
If you need more information, please let me know.
Thank you in advance!
Edited:
I put the load-balancing queues between the ConsumeMQTT (running on the Primary node only) and UpdateAttribute processors. Flow files seemingly stay in the load-balancing queue, but when the queue is listed, the message is "The queue has no FlowFiles.". Please check:
Changed position of the load-balancing queue:
The message that there are no flow files in the queues:
Take notice that the processors before and after the queue are stopped while doing "List queue".
Edit 2:
I changed the configuration in the nifi.properties to the following:
nifi.cluster.load.balance.connections.per.node=20
nifi.cluster.load.balance.max.thread.count=60
nifi.cluster.load.balance.comms.timeout=30 sec
I also restarted the NiFi containers, so I will monitor the behaviour. For now, there are no stuck Flow files in the load-balancing queues, they go to the processor that follows the queue.
"The queue has no FlowFiles" is normal behaviour of a queue that is feeding into a Merge - the flowfiles are pending to be merged.
The most likely cause of them being "stuck" before a Merge is that you have Round Robin distributed the FlowFiles across many nodes, and then you are setting a Minimum count on the Merge. This minimum is per node and there are not enough FlowFiles on each node to hit the Minimum, so they are stuck waiting for more FlowFiles to trigger the Merge.
-- Edit
"The queue has no FlowFiles" is also expected on a queue that is active - in your flow, the load balancing queue is drained immediately into the output queue of your merge PGs Input port - so there are no FFs sitting around in the load balancing queue. If you were to STOP the Input ports inside the merge PG, you should be able to list them on the LB queue.
It sounds like you are doing GetSFTP (Primary) and then distributing the files. The better approach would be to use ListSFTP (Primary) -> Load Balance -> FetchSFTP - this would avoid shuffling large files, and would instead load balance the file names between all nodes, with each node then fetching a subset of the files.
Secondly, I would review your Merge config - you have a parameter #{max_num_for_merge_smxs} defined, but this is set as the Minimum Number of Entries for the Merge - so you are telling Merge to only ever merge when at least #{max_num_for_merge_smxs} FlowFiles have been reached.
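The per-node minimum described above is easy to check with some arithmetic. The sketch below is a simplified model, assuming an even round-robin split and ignoring other Merge settings such as Max Bin Age; the FlowFile totals are hypothetical and the function names are illustrative.

```python
def flowfiles_per_node(total_flowfiles, node_count):
    """Round-robin distribution: each node gets an (almost) equal share."""
    base, extra = divmod(total_flowfiles, node_count)
    return [base + (1 if i < extra else 0) for i in range(node_count)]


def nodes_that_can_merge(total_flowfiles, node_count, minimum_entries):
    """Count the nodes whose local queue reaches the merge minimum."""
    shares = flowfiles_per_node(total_flowfiles, node_count)
    return sum(1 for count in shares if count >= minimum_entries)


print(flowfiles_per_node(150, 3))         # share per node in a 3-node cluster
print(nodes_that_can_merge(150, 3, 100))  # minimum of 100 entries per node
print(nodes_that_can_merge(300, 3, 100))
```

With 150 FlowFiles spread over 3 nodes, no node reaches a minimum of 100, so nothing merges until more data arrives, even though the cluster as a whole holds more than 100 FlowFiles.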

Spring Boot Kafka: Commit cannot be completed since the group has already rebalanced

Today, in my Spring Boot and single instance Kafka application I faced the following issue:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
be completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either
by increasing the session timeout or by reducing the maximum size of
batches returned in poll() with max.poll.records.
What may be the reason for this and how can I fix it? As far as I understand, my consumer was blocked for a long time and didn't respond to the heartbeat, and I should adjust Kafka properties in order to address it. Could you please tell me which exact properties I should adjust and where, for example on the Kafka side or on my application's Spring Kafka side?
By default, Kafka will return a batch of records of at least fetch.min.bytes (default 1), up to either max.poll.records (default 500) or fetch.max.bytes (default 52428800); otherwise it will wait fetch.max.wait.ms (default 500) before returning a batch of data. Your consumer is expected to do some work on that data and then call poll() again. Your consumer's work is expected to be completed within max.poll.interval.ms (default 300000, i.e. 5 minutes). If poll() is not called before this timeout expires, the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
So to fix your issue, reduce the number of messages returned (max.poll.records), or increase the max.poll.interval.ms property to avoid timing out and rebalancing. Both are consumer properties, so they are set in your application's consumer configuration rather than on the Kafka broker.
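To pick a value, you can estimate how many records fit into the poll interval from how long one record takes to process. This is a back-of-the-envelope sketch: the per-record processing times are hypothetical, and the safety factor of 0.5 is an assumption to leave headroom for slow records, not a Kafka setting.

```python
def max_safe_poll_records(max_poll_interval_ms, per_record_ms, safety_factor=0.5):
    """Largest batch that can be processed within a fraction of the poll interval."""
    budget_ms = max_poll_interval_ms * safety_factor
    return max(1, int(budget_ms // per_record_ms))


# 300000 ms poll interval, hypothetical processing time of 1000 ms per record
print(max_safe_poll_records(300000, 1000))
# Same interval, hypothetical processing time of 200 ms per record
print(max_safe_poll_records(300000, 200))
```

If the result is below your current max.poll.records, either lower that setting to the computed value or raise max.poll.interval.ms until the batch fits.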

What does this mean in a storm topology "storm Insufficient Capacity on queue to emit"? And how can I increase the queue size or fix this?

I have a Storm topology running and I get this message in the debug logs for one of the bolts: "storm Insufficient Capacity on queue to emit". This bolt sends messages to another bolt on a particular stream. The next bolt is trying to write data to a DB and hence is slower.
Does this mean that the next bolt's internal queue is full and hence no more messages will be emitted? How can I increase that queue size? Also, will these messages be retried once the next bolt has processed its messages?
