Using dead letter queue with Kafka MirrorMaker2 - apache-kafka-connect

Kafka Connect provides a dead letter queue (DLQ) feature that can be configured (errors.deadletterqueue.topic.name) to store failing records. I tried configuring it on a MirrorMaker2 setup, but it doesn't seem to work as expected. My expectation is that messages that fail to replicate to the target cluster are stored in the dead letter queue topic.
To test this, I simulated failures by bringing down the target cluster and expected MirrorMaker2 to create a DLQ topic on the source cluster containing the failed messages, but the dead letter queue topic was never created. The Kafka documentation is not very clear on whether this configuration option works for MirrorMaker2.
Below is the configuration I used:
clusters = sourceKafkaCluster,targetKafkaCluster
sourceKafkaCluster.bootstrap.servers = xxx
targetKafkaCluster.bootstrap.servers = yyy
sourceKafkaCluster->targetKafkaCluster.enabled = true
targetKafkaCluster->sourceKafkaCluster.enabled = false
# Not sure which of the settings below is correct.
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.name=dlq_topic_1
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.topic.name=dlq_topic_1
errors.deadletterqueue.topic.replication.factor=1
Does the deadletterqueue configuration option work with MirrorMaker2?

Related

Kafka streams keep logging 'Discovered transaction coordinator' after a node crash (with config StreamsConfig.EXACTLY_ONCE_V2)

I have a Kafka (kafka_2.13-2.8.0) cluster with 3 partitions and 3 replicas distributed across 3 nodes.
A producer cluster is sending messages to the topic.
I also have a consumer cluster that uses Kafka Streams to consume messages from the topic.
To test fault tolerance, I killed a node. All consumers then got stuck and kept logging the following message:
[read-1-producer] o.a.k.c.p.internals.TransactionManager : [Producer clientId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-StreamThread-1-producer, transactionalId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-1] Discovered transaction coordinator myhost:9092 (id: 3 rack: null)
What I have found so far is that this is related to the StreamsConfig.EXACTLY_ONCE_V2 setting, because if I change it to StreamsConfig.AT_LEAST_ONCE the consumers work as expected.
To keep exactly-once semantics (EOS), am I missing any configuration for the producer, cluster, or consumer?

Reset Spring Boot Kafka Stream Application on modifying topics

I'm using spring-kafka to run Kafka Streams in a Spring Boot application via StreamsBuilderFactoryBean. I changed the number of partitions in some of the topics from 100 to 20 by deleting and recreating them, but now when I run the application I get the following error:
Existing internal topic MyAppId-KSTREAM-AGGREGATE-STATE-STORE-0000000092-changelog has invalid partitions: expected: 20; actual: 100. Use 'kafka.tools.StreamsResetter' tool to clean up invalid topics before processing.
I couldn't access the class kafka.tools.StreamsResetter, and calling StreamsBuilderFactoryBean.getKafkaStreams().cleanUp() threw a NullPointerException. How do I perform the cleanup the error message asks for?
The relevant documentation is here.
Step 1: Local Cleanup
For Spring Boot with StreamsBuilderFactoryBean, the first step can be done by simply passing a CleanupConfig to the constructor:
// Before
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config));
// After
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config), new CleanupConfig(true, true));
This makes the factory bean call KafkaStreams.cleanUp() both before start() and after stop().
Step 2: Global Cleanup
For step two, with all instances of the application stopped, simply use the tool as explained in the documentation:
# In kafka directory
bin/kafka-streams-application-reset.sh --application-id "MyAppId" --bootstrap-servers 1.2.3.4:9092 --input-topics x --intermediate-topics first_x,second_x,third_x --zookeeper 1.2.3.4:2181
What this does:
For any specified input topics: Reset the application’s committed consumer offsets to "beginning of the topic" for all partitions (for consumer group application.id).
For any specified intermediate topics: Skip to the end of the topic, i.e. set the application’s committed consumer offsets for all partitions to each partition’s logSize (for consumer group application.id).
For any internal topics: Delete the internal topic (this will also delete the corresponding committed offsets).

Using flume to read IBM MQ data

I want to read data from IBM MQ and put it into HDFS.
I looked into the JMS source of Flume, and it seems it can connect to IBM MQ, but I don't understand what “destinationType” and “destinationName” mean in the list of required properties. Can someone please explain?
Also, how should I configure my Flume agents?
flumeAgent1 (runs on the same machine as MQ) reads MQ data ---- flumeAgent2 (runs on the Hadoop cluster) writes into HDFS
OR is a single agent on the Hadoop cluster enough?
Can someone help me understand how MQ can be integrated with Flume?
Reference
https://flume.apache.org/FlumeUserGuide.html
Thanks,
Chhaya
Regarding the Flume agent architecture: in its most minimal form, an agent consists of a source in charge of receiving or polling for events and converting them into Flume events, which are put into a channel. A sink then takes those events from the channel in order to persist the data somewhere or send it to another agent. All the components of one agent (source, channel, sink) run on the same machine. Different agents, however, may be distributed across machines.
That said, your scenario seems to require a single agent based on a JMS source, a channel (typically a Memory Channel), and an HDFS sink.
The JMS source, as stated in the documentation, has only been tested with ActiveMQ, but should work with any other JMS provider. The documentation also provides an example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE
a1 is the name of the single agent. c1 is the name of the channel, whose configuration still has to be completed, and the sink configuration is missing entirely. Both can easily be completed by adding:
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = ...
a1.sinks.k1...
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1...
r1 is the JMS source. As can be seen, destinationName is simply the name of the destination as a string, and destinationType can only take two values: queue or topic. I think the important parameters are providerURL, initialContextFactory and connectionFactory, which must be adapted for IBM MQ.

In Storm Spout, Naming the Consumer Group

I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka Consumer Group Name. Looking at the storm-kafka code, I traced the id setting and found that it is never used when building a consumer configuration; it is only used to create the ZooKeeper path at which offset information is stored. This link gives an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, the storm-kafka integration is implemented using Kafka's SimpleConsumer API, and it stores consumer offsets in ZooKeeper in its own format (JSON).
If you write a spout config like the one below,
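SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts,
    "topic name",        // Kafka topic to consume
    "/kafka/consumers",  // zkRoot: just an example, the path under which consumer offsets are stored
    "yourTopic");        // id: becomes the sub-path for this spout's offsets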
it will write the consumer offsets in subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses the same ZooKeeper that your Storm cluster uses.
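To make the path layout concrete, here is a minimal standalone sketch of how such an offset path could be composed from the zkRoot and id arguments. It does not call storm-kafka; the class name OffsetPathSketch and the "partition_N" node naming are assumptions made for illustration only.

```java
public class OffsetPathSketch {

    // Joins zkRoot, the spout id and a partition node name into one path.
    // The "partition_" prefix is an assumed naming scheme, not confirmed above.
    public static String offsetPath(String zkRoot, String id, int partition) {
        // Drop a trailing slash so the result never contains "//".
        String root = zkRoot.endsWith("/")
                ? zkRoot.substring(0, zkRoot.length() - 1)
                : zkRoot;
        return root + "/" + id + "/partition_" + partition;
    }

    public static void main(String[] args) {
        // Same example values as in the SpoutConfig above.
        System.out.println(offsetPath("/kafka/consumers", "yourTopic", 0));
    }
}
```

With the example values this yields a path below /kafka/consumers/yourTopic, which is why two topologies that share the same zkRoot and id would also share their stored offsets.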

Kafka Storm spout changing topology and consuming from the old offset

I am using the Kafka spout to consume messages. If I have to change the topology and redeploy it, will it resume from the old messages or start from new ones? The Kafka spout lets us specify the timestamp from which to consume, but how will I know that timestamp?
spoutConfig.forceStartOffsetTime(-1);
It will choose the latest offset written around that timestamp to start consuming. You can
force the spout to always start from the latest offset by passing in -1, and you can force
it to start from the earliest offset by passing in -2.
references
If you are using KafkaSpout ensure the following:
In your SpoutConfig, “id” and “zkRoot” do NOT change after redeploying the new version of the topology. Storm uses “zkRoot” and “id” to store the topic offsets in ZooKeeper.
KafkaConfig.forceFromStart is set to false.
KafkaSpout stores the offsets in ZooKeeper. Be very careful during redeployment: if forceFromStart is set to true in the KafkaConfig of the KafkaSpout (which can be the case when you first deploy the topology), the spout will ignore the offsets stored in ZooKeeper. Make sure you set it to false.
Consider writing your topology so that the KafkaConfig.forceFromStart value is read from a properties file when your Topology’s main() method executes. This will allow your administrators to control whether the Kafka messages are replayed or not.
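One way to do that is sketched below using only java.util.Properties. The class name SpoutStartSettings and the property keys forceFromStart and startOffsetTime are hypothetical choices for this example; copying the loaded values onto the real KafkaConfig fields is left out because it depends on the storm-kafka version in use.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class SpoutStartSettings {
    public final boolean forceFromStart;
    public final long startOffsetTime;

    public SpoutStartSettings(boolean forceFromStart, long startOffsetTime) {
        this.forceFromStart = forceFromStart;
        this.startOffsetTime = startOffsetTime;
    }

    // Reads both settings from a properties source. Missing keys fall back to
    // "do not force" and "latest" (-1), which is an assumption of this sketch.
    public static SpoutStartSettings load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        boolean force = Boolean.parseBoolean(
                props.getProperty("forceFromStart", "false").trim());
        long start = Long.parseLong(
                props.getProperty("startOffsetTime", "-1").trim());
        return new SpoutStartSettings(force, start);
    }

    public static void main(String[] args) throws IOException {
        // In a topology's main() this would typically be a FileReader on a
        // file owned by the administrators; a string keeps the demo standalone.
        Reader demo = new StringReader("forceFromStart=true\nstartOffsetTime=-2\n");
        SpoutStartSettings settings = load(demo);
        System.out.println(settings.forceFromStart + " " + settings.startOffsetTime);
    }
}
```

Keeping the parsing in one small class means a redeployment only needs an edited properties file rather than a rebuilt topology jar.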
Basically, the sequence of events will be:
The first time, start the topology reading from the beginning with the properties below:
forceFromStart = true
startOffsetTime = -2
The above properties force it to start from the beginning of the topic. Remember to set both, because forceFromStart tells Storm to read the startOffsetTime property, use its value to determine where to start reading, and ignore the ZooKeeper offset.
From now on your topology will run and ZooKeeper will maintain the offset. If a worker dies, it will be restarted by the supervisor and resume reading from the offset in ZooKeeper.
Now, if you want to restart your topology and read from where it left off before shutdown, use the property below and restart the topology:
forceFromStart = false
With the above property, you are telling Storm not to read the startOffsetTime value and to use instead the ZooKeeper offset that was maintained before you shut down your topology.
From now on, every time you restart the topology it will read from where it left off.
If you want to restart your topology and read from the head of the topic, use the properties below and restart the topology:
forceFromStart = true
startOffsetTime = -1
With the above properties, you are telling Storm to ignore the ZooKeeper offset and start from the latest offset, that is, the tip of the topic.
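The rules described above can be summarised in a small standalone model. This is a sketch of the decision as explained in this answer, not storm-kafka's actual code; the class StartOffsetModel and its names are hypothetical, and falling back to the latest offset when nothing is stored in ZooKeeper is an assumption.

```java
import java.util.OptionalLong;

public class StartOffsetModel {
    public static final long LATEST_TIME = -1L;   // tip of the topic
    public static final long EARLIEST_TIME = -2L; // beginning of the topic

    public enum Start { STORED_OFFSET, EARLIEST, LATEST, AROUND_TIMESTAMP }

    // Decides where reading starts, given the two settings and the offset
    // stored in ZooKeeper (empty if the topology never committed one).
    public static Start resolve(boolean forceFromStart,
                                long startOffsetTime,
                                OptionalLong storedOffset) {
        if (!forceFromStart) {
            // startOffsetTime is not consulted; resume from ZooKeeper.
            // Assumption: with nothing stored, start at the latest offset.
            return storedOffset.isPresent() ? Start.STORED_OFFSET : Start.LATEST;
        }
        // forceFromStart = true: the stored offset is ignored.
        if (startOffsetTime == EARLIEST_TIME) {
            return Start.EARLIEST;
        }
        if (startOffsetTime == LATEST_TIME) {
            return Start.LATEST;
        }
        return Start.AROUND_TIMESTAMP;
    }

    public static void main(String[] args) {
        // First deployment: replay the topic from the beginning.
        System.out.println(resolve(true, EARLIEST_TIME, OptionalLong.empty()));
        // Normal redeployment: continue from the stored offset.
        System.out.println(resolve(false, EARLIEST_TIME, OptionalLong.of(42L)));
        // Forced skip to the tip of the topic.
        System.out.println(resolve(true, LATEST_TIME, OptionalLong.of(42L)));
    }
}
```

The three calls in main() correspond to the three situations in the sequence above: first start, restart from where it left off, and restart from the tip.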
