Kafka Streams keeps logging 'Discovered transaction coordinator' after a node crash (with config StreamsConfig.EXACTLY_ONCE_V2) - apache-kafka-streams

I have a Kafka (kafka_2.13-2.8.0) cluster of 3 nodes, with a topic that has 3 partitions and a replication factor of 3.
A producer cluster is sending messages to the topic.
I also have a consumer cluster using Kafka Streams to consume messages from the topic.
To test fault tolerance, I killed one node. After that, all consumers get stuck and keep printing the INFO message below:
[read-1-producer] o.a.k.c.p.internals.TransactionManager : [Producer clientId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-StreamThread-1-producer, transactionalId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-1] Discovered transaction coordinator myhost:9092 (id: 3 rack: null)
What I have found so far is that it is related to the StreamsConfig.EXACTLY_ONCE_V2 setting, because if I change it to StreamsConfig.AT_LEAST_ONCE the consumer works as expected.
To keep exactly-once (EOS) processing, did I miss any configuration on the producer, the cluster, or the consumer?
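For context, here is a minimal sketch (not a confirmed fix) of the Streams-side setting in question, with a note on the broker-side transaction-log settings that exactly-once relies on; the application id and bootstrap server are assumptions taken from the log line above:

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class EosConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-app-3");   // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "myhost:9092");  // assumed bootstrap server
        // The setting being toggled between EXACTLY_ONCE_V2 and AT_LEAST_ONCE:
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}

// Broker side (server.properties), shown only as a comment: EOS keeps transaction
// state in the internal __transaction_state topic, so
// transaction.state.log.replication.factor (default 3) and
// transaction.state.log.min.isr (default 2) determine whether the transaction
// coordinator can fail over when a node dies.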

Related

Spring Kafka consumer stops receiving messages

I have a Spring microservice using Kafka.
Here are the consumer's 5 config properties:
BOOTSTRAP_SERVERS_CONFIG -> <ip>:9092
KEY_DESERIALIZER_CLASS_CONFIG -> StringDeserializer.class
VALUE_DESERIALIZER_CLASS_CONFIG -> StringDeserializer.class
GROUP_ID_CONFIG -> "Group1"
MAX_POLL_INTERVAL_MS_CONFIG -> Integer.MAX_VALUE
It has been observed that when the microservice is restarted, the Kafka consumer stops receiving messages. Please help me with this.
I believe your max.poll.interval.ms is the issue. Integer.MAX_VALUE milliseconds is roughly 24 days! This setting is the maximum time the consumer is given to process records between polls, so the group coordinator will wait that long before evicting the consumer when its processing thread dies. Try setting it to something much smaller than Integer.MAX_VALUE, for example 30 seconds (30000 ms).
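A hedged sketch of the same consumer configuration with a bounded max.poll.interval.ms (30 seconds here; pick a value comfortably above your real per-batch processing time); the bootstrap address is the placeholder from the question:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerConfigSketch {
    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<ip>:9092");  // placeholder host from the question
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "Group1");
        // 30 s between poll() calls instead of Integer.MAX_VALUE (~24.8 days)
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 30000);
        return new KafkaConsumer<>(props);
    }
}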

Kafka Streams instances going into DEAD state

We are using Kafka 0.10.2.0 and Kafka Streams 1.1.0.
We have a Kafka cluster of 16 machines, and the topic being consumed by Kafka Streams has 256 partitions. We spawned 400 instances of the Kafka Streams application.
We see that all of the StreamThreads go into the DEAD state.
[2018-05-25 05:59:29,282] INFO stream-thread [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD (org.apache.kafka.streams.processor.internals.StreamThread)
[2018-05-25 05:59:29,282] INFO stream-client [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7] State transition from REBALANCING to ERROR (org.apache.kafka.streams.KafkaStreams)
[2018-05-25 05:59:29,282] WARN stream-client [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7] All stream threads have died. The instance will be in error state and should be closed. (org.apache.kafka.streams.KafkaStreams)
[2018-05-25 05:59:29,282] INFO stream-thread [ksapp-19f923d7-5f9e-4137-b79f-ee20945a7dd7-StreamThread-1] Shutdown complete (org.apache.kafka.streams.processor.internals.StreamThread)
Please note that when we have only 100 Kafka Streams instances, things work as expected and we see the instances consuming messages from the topic.

UNKNOWN_PRODUCER_ID when using Apache Kafka Streams (Scala)

I am running 3 instances of a service that I wrote using:
Scala 2.11.12
Kafka Streams 1.1.0
kafka-streams-scala 0.2.1 (by Lightbend)
The service uses Kafka Streams with the following topology (high level; a rough code sketch follows the list):
InputTopic
Parse to a known type
Drop messages that failed to parse
Split every single message into 6 new messages
On each message run: map.groupByKey.reduce(with local store).toStream.to
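For illustration, a rough sketch of that shape in the Java DSL (the real service is Scala with kafka-streams-scala; the topic names, the trivial "split into 6" transform, and the re-keying are placeholders, not the actual business logic):

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;

public class TopologySketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("input-topic")
               // "parse to a known type" / "drop messages that failed to parse":
               // dropping blank values stands in for real parsing here
               .filter((key, value) -> value != null && !value.isEmpty())
               // "split every single message into 6 new messages"
               .flatMapValues(value -> Arrays.asList(
                       value + "-1", value + "-2", value + "-3",
                       value + "-4", value + "-5", value + "-6"))
               // "map.groupByKey.reduce(with local store).toStream.to"
               .map((key, value) -> KeyValue.pair(value, value))  // re-key, which forces a repartition topic
               .groupByKey()
               .reduce((agg, value) -> value, Materialized.as("state_store_1"))
               .toStream()
               .to("output-topic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my_service_name");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}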
Everything works as expected, but I can't get rid of a WARN message that keeps showing up:
15:46:00.065 [kafka-producer-network-thread | my_service_name-1ca232ff-5a9c-407c-a3a0-9f198c6d1fa4-StreamThread-1-0_0-producer] [WARN ] [o.a.k.c.p.i.Sender] - [Producer clientId=my_service_name-1ca232ff-5a9c-407c-a3a0-9f198c6d1fa4-StreamThread-1-0_0-producer, transactionalId=my_service_name-0_0] Got error produce response with correlation id 28 on topic-partition my_service_name-state_store_1-repartition-1, retrying (2 attempts left). Error: UNKNOWN_PRODUCER_ID
As you can see, I get those errors from the internal topics that Kafka Streams manages. It seems like some kind of retention period on the producer metadata in the internal topics, or some kind of producer id reset.
I couldn't find anything regarding this issue, only this description of the error itself:
Error: UNKNOWN_PRODUCER_ID
Code: 59
Retriable: False
Description: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producer id are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.
Hope you can help,
Thanks
Edit:
It seems that the WARN message does not pop up on version 1.0.1 of Kafka Streams.

Strange 'emitted' numbers behavior / zero stat numbers in Topology stats (Storm 1.0.3)

This is what my Storm UI stats look like.
The problem is that I have no idea where those numbers (of emitted tuples) are coming from.
My topology is pretty simple: Kafka spout -> bolt (persisting data into HBase).
The topology works: when I put data into the Kafka topic, it gets processed by the bolt and persisted in HBase, which I then verify with a scan in the HBase shell (so new records are being inserted).
However, each time I submit a new message into Kafka and it is persisted by the bolt, my topology does not increase the number of emitted tuples by 1.
Periodically all the numbers increase by 20, without any new messages being sent into Kafka. I.e. my Kafka topic gets no messages for hours, yet the number of tuples emitted keeps growing in chunks of 20 over time, while the number of records in HBase stays the same.
I get no exceptions/errors anywhere in the Apache Storm logs.
I am not calling ack() or fail() on any of my tuples in my bolt implementation (it is a BasicBolt type, which acks automatically; see the small bolt sketch below).
The capacity and latency in the bolt metrics always stay at zero, even when I load a lot of messages into Kafka.
The Kafka offset checker ($KAFKA/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker) shows that all messages are processed and the Kafka lag for the given topic/group is 0.
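For reference, the shape of such a bolt in Java (the actual bolt here is written in Python via Petrel; the class name and body are illustrative): a BaseBasicBolt acks each tuple automatically after execute() returns, so no explicit ack()/fail() appears in the code.

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PersistenceBoltSketch extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // write the tuple to HBase here; Storm acks the tuple automatically when
        // this method returns, and throwing FailedException would fail it instead
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output streams declared
    }
}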
So my questions:
What are those 'stealth' tuples that increase 'emitted' in both the spout and the bolt over time in steps of 20?
Is it possible to enable 'debugging' in the Storm UI to see what those tuples are?
Why are capacity/latency in the bolt metrics always zero while the bolt is confirmed to persist data?
Environment details
I’m using Java 8 + Apache Storm 1.0.3
[devops#storm-wk1-prod]~/storm/supervisor/stormdist% storm version
Running: /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/opt/apache-storm-1.0.3 -Dstorm.log.dir=/opt/apache-storm-1.0.3/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /opt/apache-storm-1.0.3/lib/storm-core-1.0.3.jar:/opt/apache-storm-1.0.3/lib/kryo-3.0.3.jar:/opt/apache-storm-1.0.3/lib/reflectasm-1.10.1.jar:/opt/apache-storm-1.0.3/lib/asm-5.0.3.jar:/opt/apache-storm-1.0.3/lib/minlog-1.3.0.jar:/opt/apache-storm-1.0.3/lib/objenesis-2.1.jar:/opt/apache-storm-1.0.3/lib/clojure-1.7.0.jar:/opt/apache-storm-1.0.3/lib/disruptor-3.3.2.jar:/opt/apache-storm-1.0.3/lib/log4j-api-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-core-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-slf4j-impl-2.1.jar:/opt/apache-storm-1.0.3/lib/slf4j-api-1.7.7.jar:/opt/apache-storm-1.0.3/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-1.0.3/lib/servlet-api-2.5.jar:/opt/apache-storm-1.0.3/lib/storm-rename-hack-1.0.3.jar:/opt/storm/conf org.apache.storm.utils.VersionInfo
Storm 1.0.3
URL https://git-wip-us.apache.org/repos/asf/storm.git -r eac433b0beb3798c4723deb39b3c4fad446378f4
Branch (no branch)
Compiled by ptgoetz on 2017-02-07T20:22Z
From source with checksum c78e52de4b8a22d99551d45dfe9c1a4b
My storm.yaml:
I'm running 2 instances with the Storm supervisor, each having the following config:
storm.zookeeper.servers:
- "10.138.0.8"
- "10.138.0.9"
- "10.138.0.16"
storm.zookeeper.port: 2181
nimbus.seeds: ["10.138.0.10"]
storm.local.dir: "/var/log/storm"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
worker.childopts: "-Xmx768m"
nimbus.childopts: "-Xmx512m"
supervisor.childopts: "-Xmx256m"
topology.yaml
nimbus.host: "10.138.0.10"
# In Storm 0.7.x, this is necessary in order to give workers time to
# initialize. In Storm 0.8.0 and later, it may not be necessary because Storm
# has added a separate, longer timeout for the initial launch of a worker.
supervisor.worker.timeout.secs: 60
topology.workers: 1
topology
import tbolts
import tspouts

def create(builder):
    """Create topology through the Petrel library."""
    # spout getting data from the Kafka instance;
    # we run 2 tasks of the Kafka spout
    builder.setSpout("kafka", tspouts.KafkaSpout(), 2)
    # persistence bolt;
    # we run 4 tasks of the persistence bolt
    builder.setBolt("persistence", tbolts.PersistenceBolt(), 4).shuffleGrouping("kafka")
The reason your emit count jumps up by 20 is that Storm only samples every 20th tuple by default to update its metrics. This sampling rate is controlled by the topology.stats.sample.rate config variable and can be changed per topology. You could set it to 1.0 (it is 0.05 by default) and you would get an accurate emit count; however, this introduces significant processing overhead and may cause your acker and/or metrics consumer instances to become overloaded. Use with caution.
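For illustration, here is how that setting could be applied at submit time from Java (the asker submits via Petrel, but the key topology.stats.sample.rate is the same; the topology name and the empty builder are placeholders):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SampleRateSketch {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        // Sample every tuple instead of every 20th one (default 0.05).
        // As noted above, this adds noticeable overhead; use with caution.
        conf.put(Config.TOPOLOGY_STATS_SAMPLE_RATE, 1.0);

        TopologyBuilder builder = new TopologyBuilder();
        // builder.setSpout("kafka", ...); builder.setBolt("persistence", ...).shuffleGrouping("kafka");

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}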

Apache Storm Worker Process dies

I have installed storm-0.9.2 on a 5-node cluster. I have a simple topology with 1 spout and a varying number of bolts (4, 9, 22, 31). For each configuration I have configured (#bolts + 1) workers. Thus for 4 bolts I have 5 workers, for 22 bolts 23 workers, etc.
I have observed failed worker processes in the worker log files, with a corresponding EndOfStreamException in the zookeeper.out log file. When I do get a clean test run, the number of tuples processed by each bolt is evenly distributed across the workers. On a non-clean test run, the workers that failed attempt to reconnect; however, since the number of tuples is finite, there are no more tuples to process.
What are the possible causes for a worker process to die?
Excerpt from zookeeper.out log file:
2014-10-27 17:40:33,198 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x1495431347c001e, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:744)
2014-10-27 17:40:33,201 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1007] - Closed socket connection for client /192.168.0.1:45693 which had sessionid 0x1495431347c001e
Cluster Environment:
Storm 0.9.2
Zookeeper 3.4.6
Ubuntu 13.10
To me, it looks like a problem with your ZooKeeper. A couple of ideas:
Your ZooKeeper timeout configuration is too small.
Your ZooKeeper ensemble doesn't have enough nodes to handle your workload.
For diagnosis, start by increasing the default timeouts for your ZooKeeper connections. If that does not help, try expanding your ZooKeeper cluster.
You can also consult the ZooKeeper documentation. Please let us know if that solves your problem.
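As a hedged starting point (the right values depend on your cluster), these are the relevant storm.yaml keys, with the Storm 0.9.x defaults noted in the comments; raising them gives workers more slack before their ZooKeeper sessions expire:

storm.zookeeper.session.timeout: 30000     # default 20000 ms
storm.zookeeper.connection.timeout: 30000  # default 15000 ms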
