Kafka commit during rebalancing - Go

The scenario:
Kafka version 2.4.1.
The Kafka partitions are actively processing messages.
CPU usage is low, memory usage is moderate, and no throttling is observed.
Go applications are deployed on k8s using Confluent's Go client version 1.7.0.
When k8s deletes some of the pods, the Kafka consumer group goes into rebalancing.
The message that was being processed during this rebalancing gets stuck mid-way and takes around 17 minutes to finish; the usual processing time is 3-4 seconds at most.
There is no DB throttling, and the load is not even 10% of our peak.
The k8s pods have 1 core and 1 GB of memory.
Messages are consumed and processed in the same thread.
Earlier we found that one of the brokers in the 6-node cluster was unhealthy and we replaced it; after that we started facing this issue.
Question: Why did the message get stuck? Is it because rebalancing made the processing thread hang, or is it something else?
Thanks in advance for your answers!

Messages are stuck due to the rebalancing that is happening for your consumer group (CG). Rebalancing is a normal procedure in Kafka and is always triggered when a member joins or leaves the CG. During a rebalance, consumers stop processing messages for some period of time, so processing of events from a topic happens with some delay. But if the CG gets stuck in PreparingRebalance, you will not process any data at all.
You can identify the CG state by running a Kafka command, for example:
kafka-consumer-groups.sh --bootstrap-server $BROKERS:$PORT --group $CG --describe --state
and it should show you the status of the CG, for example:
GROUP COORDINATOR (ID) ASSIGNMENT-STRATEGY STATE #MEMBERS
name-of-consumer-group brokerX.com:9092 (1) Empty 0
In the above example, the STATE is Empty.
The consumer group can be in one of 5 states:
Stable - the CG is stable and all members are connected successfully
Empty - there are no members in the group (usually means the application is down or has crashed)
PreparingRebalance - members are (re)joining the CG (it may indicate a client issue when members keep crashing, but it is also the normal state of a CG before it becomes Stable)
CompletingRebalance - the rebalance started in PreparingRebalance is being completed
Dead - the consumer group has no members and its metadata has been removed.
To determine whether a PreparingRebalance issue is on the cluster side or the client side, stop the client and run the command again to verify the CG state. If the CG still shows members, restart the broker listed in the command output as the coordinator of that CG (brokerX.com:9092 in the example above). If the CG becomes Empty once you stop all clients connected to it, then something is off with the client code/data that causes members to keep leaving and rejoining the CG, which is why the CG is always in the PreparingRebalance state; you will need to investigate why that is happening.
From what I recall, there was a bug in Kafka version 2.4.1 that was fixed in 2.4.1.1; you can read about it here:
https://issues.apache.org/jira/browse/KAFKA-9752
https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-msk-now-offers-version-2-4-1-1-fixing-a-perpetual-rebalance-bug-in-apache-kafka-2-4-1/
The troubleshooting steps above should show you how to verify whether you are facing this bug or just a problem in the client code.
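If it is easier to check from code than from the CLI, the same state can be read programmatically. Below is a minimal sketch using the Java AdminClient (the application in the question uses Confluent's Go client, so this is purely a diagnostic sketch; the bootstrap server and group name are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class CgStateCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "brokerX.com:9092"); // placeholder
        String group = "name-of-consumer-group";                                   // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription description = admin
                    .describeConsumerGroups(Collections.singletonList(group))
                    .describedGroups()
                    .get(group)
                    .get();
            // state() returns Stable, Empty, PreparingRebalance, CompletingRebalance or Dead
            System.out.println(group + " -> state=" + description.state()
                    + ", members=" + description.members().size());
        }
    }
}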

Related

Good way to notify distributor that the worker can't process any more messages

I currently have a distributor service and a worker service that communicate via SQS. The distributor searches for expired values in a postgresql database and sends them to SQS.
The worker picks up jobs from SQS and calls an API to get more data. This API has a quota limit.
When the worker reaches this quota, what is the best way to notify the distributor that no more work should be sent?
My initial ideas:
Go back into postgres and reinsert the data with a much later expiration date such that the next time it is processed I will surely have enough quota units. (I have to do this anyway as there are 3 requests to get all the data and failing 1/3 means I would have to reinsert the job).
Send a stop signal through the queue that the distributor will receive. The only issue to get around here is that I would not want the message to stay in the queue longer than needed, and having more than one distributor running at the same time complicates message consumption.
Check the size of the queue before inserting something into it. This won't work well because SQS cannot give a good estimate until about 1 minute after the producers stop sending to the queue.
Some sort of distributed lock in the database (a rough sketch of this idea is shown below).
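For idea 4, here is one hedged sketch of what a database-backed signal could look like: since the data already lives in PostgreSQL, the worker could hold a session-level advisory lock while the quota is exhausted, and the distributor could probe that lock before sending. The lock key, JDBC usage, and overall design are illustrative assumptions, not something from the original setup:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class QuotaGate {
    // Arbitrary, agreed-upon advisory-lock key; purely illustrative.
    private static final long QUOTA_LOCK_KEY = 42L;

    // Worker side: hold the lock (on a long-lived connection) while the API quota is exhausted.
    static void signalQuotaExhausted(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("SELECT pg_advisory_lock(" + QUOTA_LOCK_KEY + ")");
        }
    }

    static void signalQuotaAvailable(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("SELECT pg_advisory_unlock(" + QUOTA_LOCK_KEY + ")");
        }
    }

    // Distributor side: only send work if nobody is holding the lock.
    static boolean mayDistribute(Connection conn) throws Exception {
        boolean acquired;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT pg_try_advisory_lock(" + QUOTA_LOCK_KEY + ")")) {
            rs.next();
            acquired = rs.getBoolean(1);
        }
        if (acquired) {
            // We only probed; release immediately so the worker can take the lock when needed.
            try (Statement st = conn.createStatement()) {
                st.execute("SELECT pg_advisory_unlock(" + QUOTA_LOCK_KEY + ")");
            }
        }
        return acquired;
    }
}

Advisory locks are held per database session, so both sides would need to keep their connections open for the duration of the signal; a simple flag row with a timestamp would achieve the same effect if holding a connection open is not desirable.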

What will happen if my kafka consumer group is changed after each restart

Let’s say for instance, my kafka consumer (in Consumer Group 1) is reading messages from Kafka Topic A.
Now suppose that consumer consumes 12 messages before failing.
When the consumer starts up again, it now has a different consumer group (i.e. consumer group 2).
Question 1: On restart, will it continue from where it left off (because that offset is stored by Kafka and/or ZooKeeper), or will it start consuming from the first message?
Question 2: Is there a way to ensure that on restart (when the consumer has a different consumer group), it still starts consuming from where it left off before restarting?
Just to give you the context, I am trying to update in-memory caches on each node/server on receiving a message on a Kafka topic. In order to do that, I am using a different consumer group for each node/server so that each message is consumed by all the nodes/servers to update the in-memory cache. Please let me know if there are better ways to do this. Thanks!
Consumer offsets are maintained per consumer group, so if you have a different consumer group on each restart, you can make use of the auto.offset.reset property.
The auto.offset.reset property specifies what to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw an exception to the consumer if no previous offset is found for the consumer's group
anything else: throw an exception to the consumer
Given the context about your current approach, I believe you should revisit the design: it is fine to have a different consumer group per node, but make sure each node keeps the same consumer group name even after a restart. This is a suggestion based on the information provided; there could be better solutions after going into the details of the design/implementation.
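A minimal sketch of what that could look like with the plain Java consumer: the group id is derived from a stable per-node identity so it survives restarts, and auto.offset.reset only kicks in the first time the group is seen. The bootstrap server, group prefix and deserializers are illustrative assumptions:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CacheUpdateConsumer {
    public static KafkaConsumer<String, String> create(String nodeName) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");       // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cache-updater-" + nodeName);   // stable per node
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");           // or "latest"/"none"
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }
}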

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager

I'm working on a project that uses Flink (version 1.4.2) for bulk data ingestion into my graph database (JanusGraph). Data ingestion has two phases: vertex data ingestion and edge data ingestion into the graph DB. Vertex data ingestion runs without any issue, but during edge ingestion I'm getting an error saying Lost connection to task manager taskmanagerName. The detailed error traceback from flink-taskmanager-b6f46f6c8-fgtlw is attached below:
2019-08-01 18:13:26,025 ERROR org.apache.flink.runtime.operators.BatchTask
- Error in task code: CHAIN Join(Remap EDGES id: TO) -> Map (Key Extractor) -> Combine (Deduplicate edges including bi-directional edges) (62/80)
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'flink-taskmanager-b6f46f6c8-gcxnm/10.xx.xx.xx:6121'.
This indicates that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:146)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
... 6 more
For ease of understanding, let's say:
flink-taskmanager-b6f46f6c8-gcxnm as TM1 and
flink-taskmanager-b6f46f6c8-fgtlw as TM2
While debugging, I found that TM1 requested a ResultPartition (RP) from TM2 and TM2 started to send the ResultPartition to TM1. But on checking the logs from TM1, we found that it waited a long time to get the RP from TM2 and after some time started to deregister the accepted task. We believe that deregistering the task after the Netty remote transport exception caused TM2 to report the Lost connection to task manager error for this specific job.
Both task managers are running on separate EC2 instances (m4.2xlarge). I have verified the CPU and memory utilization of both instances and all metrics are within limits.
Can you please tell me why the task manager is acting like this and how to fix the issue?
Thanks in advance
Flushing Buffers to Netty
In the picture above, the credit-based flow control mechanics actually sit inside the “Netty Server” (and “Netty Client”) components and the buffer the RecordWriter is writing to is always added to the result subpartition in an empty state and then gradually filled with (serialised) records. But when does Netty actually get the buffer? Obviously, it cannot take bytes whenever they become available since that would not only add substantial costs due to cross-thread communication and synchronisation, but also make the whole buffering obsolete.
In Flink, there are three situations that make a buffer available for consumption by the Netty server:
a buffer becomes full when writing a record to it, or
the buffer timeout hits, or
a special event such as a checkpoint barrier is sent.
Flush after Buffer Full
The RecordWriter works with a local serialisation buffer for the current record and will gradually write these bytes to one or more network buffers sitting at the appropriate result subpartition queue. Although a RecordWriter can work on multiple subpartitions, each subpartition has only one RecordWriter writing data to it. The Netty server, on the other hand, is reading from multiple result subpartitions and multiplexing the appropriate ones into a single channel as described above. This is a classical producer-consumer pattern with the network buffers in the middle and as shown by the next picture. After (1) serialising and (2) writing data to the buffer, the RecordWriter updates the buffer’s writer index accordingly. Once the buffer is completely filled, the record writer will (3) acquire a new buffer from its local buffer pool for any remaining bytes of the current record - or for the next one - and add the new one to the subpartition queue. This will (4) notify the Netty server of data being available if it is not aware yet (we can assume it already got the notification if there are more finished buffers in the queue). Whenever Netty has capacity to handle this notification, it will (5) take the buffer and send it along the appropriate TCP channel.
Image 1
Flush after Buffer Timeout
In order to support low-latency use cases, we cannot only rely on buffers being full in order to send data downstream. There may be cases where a certain communication channel does not have too many records flowing through and unnecessarily increase the latency of the few records you actually have. Therefore, a periodic process will flush whatever data is available down the stack: the output flusher. The periodic interval can be configured via StreamExecutionEnvironment#setBufferTimeout and acts as an upper bound on the latency (for low-throughput channels). The following picture shows how it interacts with the other components: the RecordWriter serialises and writes into network buffers as before but concurrently, the output flusher may (3,4) notify the Netty server of data being available if Netty is not already aware (similar to the “buffer full” scenario above). When Netty handles this notification (5) it will consume the available data from the buffer and update the buffer’s reader index. The buffer stays in the queue - any further operation on this buffer from the Netty server side will continue reading from the reader index next time.
Image 2
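For completeness, a small sketch of where the output-flusher interval mentioned above is configured in a Flink job; 100 ms is only an illustrative value (a timeout of 0 flushes after every record, and -1 flushes only when a buffer is full):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setBufferTimeout(100); // flush network buffers at least every 100 ms
        // ... build and execute the job as usual
    }
}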
Reference:
The link below may help you out:
flink-network-stack-details
Can you check the GC logs of TM1 and TM2 to see if there were full GCs, which may cause heartbeat timeouts?

Kafka Streams: InvalidStateStoreException

If the stateful stream application is started with 6 threads on a single node, would the above exception occur?
Is there any process that needs to be followed, if a stateful stream application started on node 1 consuming a particular topic, is made to run on different node?
If the stateful stream application is started on 2 nodes and if the above exception occurs, would the stream application terminate immediately?
If yes, where can this exception be caught in a try-catch block?
If the exception can be caught, and if we add sleep for 10 mins, would the store automatically get to a valid state?
If not, is there a method that can be used to check the store state and wait until it becomes valid?
Follow-up:
If the stateful stream application is started with 6 threads on a single node, would the above exception occur?
It can
Essentially I was wondering: if we keep the entire topic consumption on a single node, would that avoid re-building the store from an internal topic if a re-balancing occurs due to one of the threads going down/terminating?
store is not ready yet: you can wait until the store is ready -- best to register a restore callback (check the docs for details) to get informed when restore is finished and you can retry to query the store.
Sorry, just to be clear on the above: is it StateRestoreCallback or StateRestoreListener? I assume it is the latter. Also, is it required to override StateRestoreCallback and include logic to restore the store?
InvalidStateStoreException can have different causes, thus, it's hard to answer your question without more context.
If the stateful stream application is started with 6 threads on a single node, would the above exception occur?
It can.
Is there any process that needs to be followed, if a stateful stream application started on node 1 consuming a particular topic, is made to run on different node?
No.
If the stateful stream application is started on 2 nodes and if the above exception occurs, would the stream application terminate immediately?
It depends on where the exception is thrown:
If it is thrown inside a processing StreamThread, the corresponding StreamThread would die, but the application would not terminate automatically. You should register an uncaught exception handler on the KafkaStreams instance and react to a dying thread with custom code (for example, terminating the application).
If it is thrown from a KafkaStreams interactive query call, the StreamThreads would not be affected.
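A minimal sketch of registering such a handler, assuming the pre-2.8 KafkaStreams API where the handler is a plain Thread.UncaughtExceptionHandler (newer versions take a StreamsUncaughtExceptionHandler instead); topology and props are assumed to be built elsewhere:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

public class StreamsShutdownHandler {
    // topology and props are assumed to be created by the rest of the application
    public static KafkaStreams start(Topology topology, Properties props) {
        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.setUncaughtExceptionHandler((thread, throwable) -> {
            System.err.println("Stream thread " + thread.getName() + " died: " + throwable);
            // Custom reaction: here we shut the whole application down from a separate
            // thread to avoid blocking the dying StreamThread.
            new Thread(streams::close).start();
        });
        streams.start();
        return streams;
    }
}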
Where can this exception be caught in a try-catch block?
Usually yes, it can be caught, especially if you refer to the interactive queries feature (i.e., around the store retrieval and query calls).
If we add sleep for 10 mins, would the store automatically get to a valid state?
If you refer to interactive queries feature, sleeping is not a good strategy. There are multiple causes for the exception and you need to react accordingly:
The store is not local but on a different node: you can figure this out by checking the store metadata.
The store is not ready yet: you can wait until the store is ready -- it is best to register a restore listener (check the docs for details) to get informed when the restore is finished, and then retry querying the store.
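A minimal sketch of both points, assuming Kafka Streams 2.5+ (for StoreQueryParameters); the store name and value types are placeholders:

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.errors.InvalidStateStoreException;
import org.apache.kafka.streams.processor.StateRestoreListener;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreAccess {
    // Get notified when restoration of local state stores finishes.
    public static void registerRestoreListener(KafkaStreams streams) {
        streams.setGlobalStateRestoreListener(new StateRestoreListener() {
            @Override
            public void onRestoreStart(TopicPartition tp, String storeName,
                                       long startingOffset, long endingOffset) {
                System.out.println("Restore started for " + storeName + " " + tp);
            }

            @Override
            public void onBatchRestored(TopicPartition tp, String storeName,
                                        long batchEndOffset, long numRestored) {
                // optional progress callback
            }

            @Override
            public void onRestoreEnd(TopicPartition tp, String storeName, long totalRestored) {
                System.out.println("Restore finished for " + storeName + " " + tp);
            }
        });
    }

    // Retry the store lookup instead of sleeping a fixed 10 minutes.
    public static ReadOnlyKeyValueStore<String, Long> waitForStore(KafkaStreams streams,
                                                                   String storeName)
            throws InterruptedException {
        while (true) {
            try {
                return streams.store(StoreQueryParameters.fromNameAndType(
                        storeName, QueryableStoreTypes.<String, Long>keyValueStore()));
            } catch (InvalidStateStoreException e) {
                // Store not ready (or migrated to another instance); back off briefly and retry.
                Thread.sleep(1000);
            }
        }
    }
}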
Update
Essentially I was wondering: if we keep the entire topic consumption on a single node, would that avoid re-building the store from an internal topic if a re-balancing occurs due to one of the threads going down/terminating?
Yes (for non-EOS case). Other threads would detect the local store and reuse it.
StateRestoreCallback OR StateRestoreListener
Yes, it's StateRestoreListener. You would implement StateRestoreCallback only if you write a custom state store.

One slow ActiveMQ consumer causing other consumers to be slow

I'm looking for help regarding a strange issue where a slow consumer on a queue causes all the other consumers on the same queue to start consuming messages at 30-second intervals. That is, all consumers except the slow one don't consume messages as fast as they can; instead they wait for some magical 30-second barrier before consuming.
The basic flow of my application goes like this:
a number of producers place messages onto a single queue. Messages can have different JMSXGroupIDs (see the producer sketch just after this list)
a number of consumers listen to messages on that single queue
as standard practice the JMSXGroupIDs get distributed across the consumers
at some point one of the consumers becomes slow and can't process messages very quickly
the slow consumer ends up filling its prefetch buffer on the broker and AMQ recognises that it is slow (default behaviour)
at that point - or some 'random' but close time later - all consumers except the slow one start to only consume messages at the same 30s intervals
if the slow consumer becomes fast again then things very quickly return to normal operation and the 30s barrier goes away
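For readers unfamiliar with message groups, a minimal sketch of how a producer pins messages to a group via the JMSXGroupID property; the broker URL, queue name and group id are placeholders:

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class GroupedProducer {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(session.createQueue("work.queue"));

        TextMessage message = session.createTextMessage("payload");
        // All messages carrying the same JMSXGroupID are dispatched to the same consumer.
        message.setStringProperty("JMSXGroupID", "customer-42");
        producer.send(message);

        connection.close();
    }
}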
I'm at a loss for what could be causing this issue, or how to fix it, please help.
More background and findings
I've managed to reliably reproduce this issue on AMQ 5.8.0, 5.9.0 (where the issue was originally noticed) and 5.9.1, on fresh installs and existing ops-managed installs, and on different machines, some VMs and some not. All Linux installs, with different OSs and Java versions.
It doesn't appear to be affected by anything prefetch related, that is: changing the prefetch value from 1 to 10 to 1000 didn't stop the issue from happening
[red herring?] Enabling debug logs on the amq instance shows logs relating to the periodic check for messages that can be expired. The queue doesn't have an expiry policy so I can only think that the scheduled expireMessagesPeriod time is just waking amq up in such a way that it then sends messages to the non-slow consumers.
If the 30s mode is entered then left then entered again the seconds-past-the-minute time is always the same, for example 14s and 44s past the minute. This is true across all consumers and all machines hosting those consumers. Those barrier points do change after restarts of amq.
While not strictly a solution to the problem, further investigation has uncovered the root cause of this issue.
TL;DR - It's known behaviour and won't be fixed before Apollo
More Details
Ultimately this is caused by the maxPageSize property and the fact that AMQ will only apply selection criteria to messages in memory. Generally these are message selectors (property = value), but in my case they are JMSXGroupID=>Consumer assignments.
As messages are received by the queue they get paged into memory and placed into a collection (named pagedInPendingDispatch in the source). To dispatch messages AMQ will scan through this list of messages and try to find a consumer that will accept it. That includes checking the group id, message selector and prefetch buffer space. For our use case we aren't using message selectors but we are using groups. If no consumer can take the message then it is left in the collection and will be checked again at the next tick.
In order to stop the pagedInPendingDispatch collection from eating up all the resources available, there is a suggested limit to the size of this queue, configured via the maxPageSize property. This property isn't actually a maximum; it's more a hint as to whether, under normal conditions, new message arrivals should be paged into memory or paged to disk.
With these two pieces of information and a slow consumer it turns out that eventually all the messages in the pagedInPendingDispatch collection end up only being consumable by the slow consumer, and hence the collection effectively gets blocked and no other messages get dispatched. This explains why the slow consumer wasn't affected by the 30s interval, it had maxPageSize messages waiting delivery already.
This doesn't explain why I was seeing the non-slow consumers receive messages every 30s though. As it turns out, paging messages into memory has two modes, normal and forced. Normal follows the process outlined above, where the size of the collection is compared to the maxPageSize property; when forced, however, messages are always paged into memory. This mode exists to allow you to browse through messages that aren't in memory. As it happens, this forced mode is also used by the expiry mechanism to allow AMQ to expire messages that aren't in memory.
So what we have now is a collection of messages in memory that are all targeted for dispatch to the same consumer, a consumer that won't accept them because it is slow or blocked. We also have a backlog of messages awaiting delivery to all consumers. Every expireMessagesPeriod milliseconds a task runs that force-pages messages into memory to check whether they should be expired or not. This adds those messages to the paged-in collection, which now contains maxPageSize messages for the slow consumer and N more messages destined for any consumer. Those messages get delivered.
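For reference, a sketch of where the two properties discussed above live when an embedded 5.x broker is configured programmatically (the equivalent policyEntry attributes exist in activemq.xml). The values shown are, to my knowledge, the stock defaults, not a recommended fix; note that a 30000 ms expireMessagesPeriod lines up with the 30-second interval observed above:

import java.util.Arrays;
import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.broker.region.policy.PolicyEntry;
import org.apache.activemq.broker.region.policy.PolicyMap;

public class BrokerTuning {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();

        PolicyEntry queuePolicy = new PolicyEntry();
        queuePolicy.setQueue(">");                   // apply to all queues (wildcard)
        queuePolicy.setMaxPageSize(200);             // default; raising it is a common workaround
        queuePolicy.setExpireMessagesPeriod(30000);  // default 30 s expiry scan discussed above

        PolicyMap policyMap = new PolicyMap();
        policyMap.setPolicyEntries(Arrays.asList(queuePolicy));
        broker.setDestinationPolicy(policyMap);

        broker.start();
    }
}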
QED.
References
Ticket referring to this issue but for message selectors instead
Docs relating to the configuration properties
Somebody else with this issue but for selectors
