Kafka Streams instances going into DEAD state - apache-kafka-streams

We are using Kafka 0.10.2.0 and Kafka Streams 1.1.0.
We have a Kafka cluster of 16 machines, and the topic consumed by Kafka Streams has 256 partitions. We spawned 400 instances of the Kafka Streams application.
We see that all of the StreamThreads go into the DEAD state.
[2018-05-25 05:59:29,282] INFO stream-thread [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD (org.apache.kafka.streams.processor.internals.StreamThread)
[2018-05-25 05:59:29,282] INFO stream-client [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7] State transition from REBALANCING to ERROR (org.apache.kafka.streams.KafkaStreams)
[2018-05-25 05:59:29,282] WARN stream-client [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7] All stream threads have died. The instance will be in error state and should be closed. (org.apache.kafka.streams.KafkaStreams)
[2018-05-25 05:59:29,282] INFO stream-thread [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7-StreamThread-1] Shutdown complete (org.apache.kafka.streams.processor.internals.StreamThread)
Please note that when we have only 100 Kafka Streams instances, things work as expected and we see the instances consuming messages from the topic.

Related

Kafka streams keep logging 'Discovered transaction coordinator' after a node crash (with config StreamsConfig.EXACTLY_ONCE_V2)

I have a Kafka (kafka_2.13-2.8.0) cluster with 3 partitions and 3 replicas distributed across 3 nodes.
A producer cluster is sending messages to the topic.
I also have a consumer cluster using Kafka streams to consume messages from the topic.
To test fault tolerance, I killed a node. After that, all consumers get stuck and keep printing the following info:
[read-1-producer] o.a.k.c.p.internals.TransactionManager : [Producer clientId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-StreamThread-1-producer, transactionalId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-1] Discovered transaction coordinator myhost:9092 (id: 3 rack: null)
What I have found so far is that it is related to the StreamsConfig.EXACTLY_ONCE_V2 configuration, because if I change it to StreamsConfig.AT_LEAST_ONCE the consumer works as expected.
To keep exactly-once (EOS) consumption, did I miss any configuration for the producer, cluster, or consumer?
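For context, that guarantee is controlled by the processing.guarantee Streams property. A minimal sketch of setting it (the application id and bootstrap servers are placeholders), assuming brokers on 2.5 or newer as EXACTLY_ONCE_V2 requires:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosConfigSketch {
    public static Properties eosProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-app-3");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "myhost:9092");  // placeholder
        // Exactly-once v2; switching this to StreamsConfig.AT_LEAST_ONCE is what
        // made the consumers recover in the report above.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```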

How to stop Preparing to rebalance group with old generation in Kafka?

I am using Kafka for my web application and I found the messages below in kafka.log:
[2021-07-06 08:49:03,658] INFO [GroupCoordinator 0]: Preparing to rebalance group qpcengine-group in state PreparingRebalance with old generation 105 (__consumer_offsets-28) (reason: removing member consumer-1-7eafeb56-e6fe-4161-9c88-e69c06a0ab37 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2021-07-06 08:49:03,658] INFO [GroupCoordinator 0]: Group qpcengine-group with generation 106 is now empty (__consumer_offsets-28) (kafka.coordinator.group.GroupCoordinator)
But Kafka seems to keep looping like this forever for this one consumer.
How can I stop it?
If you only have one partition, you don't need to use a consumer group.
Just try to use assign (not subscribe).
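A rough sketch of what the answer suggests, assigning the single partition directly instead of subscribing with a group (topic name and bootstrap servers are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignInsteadOfSubscribe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // No group.id: offsets are not committed to the group coordinator here,
        // so auto-commit is disabled.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() pins the consumer to the partition directly, bypassing
            // group management and therefore rebalances entirely.
            consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));
            consumer.seekToBeginning(consumer.assignment()); // or seek to a stored offset
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```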

kafka-streams instance on startup continuously logs "Found no committed offset for partition traces-1"

I have a kafka-streams app with 2 instances. This is a brand-new Kafka cluster with all topics created and no messages written to them yet.
I start the first instance and see that it transitions from REBALANCING to the RUNNING state.
Now I start the next instance and notice that it continuously logs the following:
2020-01-14 18:03:57.896 [streaming-app-f2457059-c9ec-4c21-a177-be54f8d59cb2-StreamThread-2] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=streaming-app-f2457059-c9ec-4c21-a177-be54f8d59cb2-StreamThread-2-consumer, groupId=streaming-app] Found no committed offset for partition traces-1

KafkaStreams shuts down with no exceptions

I have four instances of a Kafka Streams application running with the same application id. All the input topics have a single partition. To achieve scalability, I have routed the data through an intermediate dummy topic with multiple partitions. I have set request.timeout.ms to 4 minutes.
The Kafka Streams instances go into the ERROR state without any exception being thrown. It is difficult to figure out what the exact issue is. Any ideas?
[INFO ] 2018-01-09 12:30:11.579 [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] StreamThread:939 - stream-thread [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] Shutting down
[INFO ] 2018-01-09 12:30:11.579 [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] StreamThread:888 - stream-thread [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN.
[INFO ] 2018-01-09 12:30:11.595 [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] KafkaProducer:972 - Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
[INFO ] 2018-01-09 12:30:11.605 [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] StreamThread:972 - stream-thread [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] Stream thread shutdown complete
[INFO ] 2018-01-09 12:30:11.605 [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] StreamThread:888 - stream-thread [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD.
[WARN ] 2018-01-09 12:30:11.605 [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] KafkaStreams:343 - stream-client [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4] All stream threads have died. The Kafka Streams instance will be in an error state and should be closed.
[INFO ] 2018-01-09 12:30:11.605 [new-03-cb952917-bd06-4932-8c7e-62986126a5b4-StreamThread-1] KafkaStreams:268 - stream-client [app-new-03-cb952917-bd06-4932-8c7e-62986126a5b4] State transition from RUNNING to ERROR.
The asker shared their solution in the comments:
Once I changed the consumer group id it worked.
It is also worth noting that a related issue (which may or may not have the same root cause) was introduced in recent versions and now appears to be fixed in Kafka 2.5.1 and 2.6.0 and above.
As such, people experiencing this today may want to check whether they are on a high (or low) enough version to avoid the issue.
You may also need to set the default.production.exception.handler Kafka Streams property to a class that implements ProductionExceptionHandler and, unlike the default class DefaultProductionExceptionHandler, logs the error before triggering a permanent failure state.
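A minimal sketch of such a handler (class name and logging are illustrative); it logs the failed record and then returns FAIL, which is the same outcome as the default handler but with a log line first:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingProductionExceptionHandler implements ProductionExceptionHandler {
    private static final Logger log = LoggerFactory.getLogger(LoggingProductionExceptionHandler.class);

    @Override
    public ProductionExceptionHandlerResponse handle(ProducerRecord<byte[], byte[]> record,
                                                     Exception exception) {
        // Log the failure before letting the stream thread go into its failure path.
        log.error("Failed to produce to topic {} partition {}", record.topic(), record.partition(), exception);
        return ProductionExceptionHandlerResponse.FAIL;
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // No configuration needed for this sketch.
    }
}
```

It would then be registered via props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG, LoggingProductionExceptionHandler.class).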

Kafka Stream Rebalancing : State transition from REBALANCING to ERROR

I have 4 topics, each with a single partition, and three instances of the application. I tried to achieve scalability by writing a custom PartitionGrouper which would create 3 tasks as below:
1st instance: topic1 partition0, topic4 partition0
2nd instance: topic2 partition0
3rd instance: topic3 partition0
I configured NUM_STANDBY_REPLICAS_CONFIG to 1 since it maintains state locally (and also to eliminate InvalidStateStoreException).
The above setup worked fine with two instances. When I increased it to three instances, I started facing issues with rebalancing.
StickyTaskAssignor:58 - Unable to assign 1 of 1 standby tasks for task [1009710637_0]. There is not enough available capacity. You should increase the number of threads and/or application instances to maintain the requested number of standby replicas.
[INFO ] 2017-12-25 20:05:42.221 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] StreamThread:888 - stream-thread [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED.
[INFO ] 2017-12-25 20:05:42.221 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] KafkaStreams:268 - stream-client [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449] State transition from REBALANCING to REBALANCING.
[INFO ] 2017-12-25 20:05:42.276 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] StreamThread:195 - stream-thread [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] partition assignment took 55 ms.
current active tasks: [1009710637_0]
current standby tasks: [1240464215_0, 1833680710_0]
previous active tasks: []
[INFO ] 2017-12-25 20:05:42.631 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] StreamThread:939 - stream-thread [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] Shutting down
[INFO ] 2017-12-25 20:05:42.631 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] StreamThread:888 - stream-thread [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] State transition from PARTITIONS_ASSIGNED to PENDING_SHUTDOWN.
[INFO ] 2017-12-25 20:05:42.633 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] KafkaProducer:972 - Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
[INFO ] 2017-12-25 20:05:42.638 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] StreamThread:972 - stream-thread [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] Stream thread shutdown complete
[INFO ] 2017-12-25 20:05:42.638 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] StreamThread:888 - stream-thread [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD.
[WARN ] 2017-12-25 20:05:42.638 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] KafkaStreams:343 - stream-client [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449] All stream threads have died. The Kafka Streams instance will be in an error state and should be closed.
[INFO ] 2017-12-25 20:05:42.638 [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449-StreamThread-1] KafkaStreams:268 - stream-client [app-03-cfaf7841-dc19-4ee4-9d05-ae4928c21449] State transition from REBALANCING to ERROR.
I assume that your PartitionGrouper breaks something. It is quite hard to write a correct custom partition grouper, as you need to know a lot about Kafka Streams internals. Thus, it is not recommended in the first place.
The error itself means that a StandbyTask could not be assigned to a thread successfully because there are not enough threads. In general, a StandbyTask cannot be assigned to the thread that runs the corresponding "active" task or another copy of the same StandbyTask: that would not increase fault tolerance but only waste memory, because if the thread dies, all of its tasks die with it.
Why you get this error in particular is unclear (happy debugging :)).
However, for your use case you should simply start separate application instances, each subscribing to an individual topic and using a different application.id, to scale out your application.
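A rough sketch of that suggestion, assuming one deployment per single-partition topic (topic names, application ids, and bootstrap servers are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;

public class PerTopicAppSketch {
    // One independent Streams instance per topic, each with its own application.id,
    // instead of a custom PartitionGrouper spreading tasks across one application.
    static KafkaStreams buildFor(String topic, String appId) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, appId);               // e.g. "app-topic1"
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream(topic, Consumed.with(Serdes.String(), Serdes.String()))
               .foreach((key, value) -> { /* processing goes here */ });
        return new KafkaStreams(builder.build(), props);
    }

    public static void main(String[] args) {
        // In practice each deployment runs one of these, e.g.:
        //   java PerTopicAppSketch topic1 app-topic1
        buildFor(args[0], args[1]).start();
    }
}
```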
I have come across such an issue. It can happen that the Kafka Streams instance is closed (before it can reach the RUNNING state), which leads the Streams API to invoke the close method, and the state changes from REBALANCING to PENDING_SHUTDOWN and then to NOT_RUNNING.
This can occur if you are building your KafkaStreams instance in a try-with-resources block, which automatically closes the stream after the try block is executed.
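A sketch of that pitfall and one way around it (topic names and configs are placeholders): in the first variant, close() is invoked as soon as the try block ends, usually before the instance ever reaches RUNNING; the second variant keeps the instance alive and closes it from a shutdown hook.

```java
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class TryWithResourcesPitfall {

    // Trivial pass-through topology; topic names are placeholders.
    static Topology topology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }

    static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pitfall-demo");      // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        return props;
    }

    // Pitfall: the instance is closed as soon as the try block exits.
    static void pitfall() {
        try (KafkaStreams streams = new KafkaStreams(topology(), props())) {
            streams.start();
        } // streams.close() runs here, shutting the threads down almost immediately
    }

    // Instead: keep the instance alive and only close it on shutdown.
    static void keepRunning() throws InterruptedException {
        KafkaStreams streams = new KafkaStreams(topology(), props());
        CountDownLatch latch = new CountDownLatch(1);
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            streams.close();
            latch.countDown();
        }));
        streams.start();
        latch.await(); // block the main thread until the shutdown hook fires
    }

    public static void main(String[] args) throws InterruptedException {
        keepRunning();
    }
}
```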
