Monitoring Kafka Spout with the KafkaOffsetMonitor tool - performance

I am using the KafkaSpout that came with the storm-0.9.2 distribution for my project. I want to monitor the throughput of this spout. I tried using KafkaOffsetMonitor, but it does not show any consumers reading from my topic.
I suspect this is because I have specified a root path in ZooKeeper for the spout to store its consumer offsets. How will KafkaOffsetMonitor know where to look for data about my KafkaSpout instance?
Can someone explain exactly where ZooKeeper stores data about Kafka topics and consumers? ZooKeeper exposes a filesystem-like hierarchy, so how does it arrange the data for different topics and their partitions? What is the consumer group id, and how is it interpreted by ZooKeeper when storing consumer offsets?
If anyone has ever used KafkaOffsetMonitor to monitor the throughput of a KafkaSpout, please tell me how I can get the tool to find my spout.
Thanks a lot,
Palak Shah

Kafka-Spout maintains its offset in its own znode rather than under the znode where Kafka stores the offsets for regular consumers. We had a similar need, where we had to monitor the offsets of both the kafka-spout consumers and the regular Kafka consumers, so we ended up writing our own tool. You can get the tool here:
https://github.com/Symantec/kafka-monitoring-tool

I have never used KafkaOffsetMonitor, but I can answer the other part.
zookeeper.connect is the property where you can specify the znode root for Kafka; by default it keeps all data at '/'.
You can access the ZooKeeper filesystem using zkCli.sh, the ZooKeeper command line client.
You should look at /consumers and /brokers; the following would give you the offset:
get /consumers/my_test_group/offsets/my_topic/0
You can poll this offset continuously to know the rate of consumption at the spout.
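As a rough illustration, here is a minimal poller sketch using the plain ZooKeeper Java client; the connect string, group, topic, partition, and polling interval are placeholders, and keep in mind that the storm-kafka spout stores its offsets under its own zkRoot/id znode rather than /consumers.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SpoutOffsetPoller {
    public static void main(String[] args) throws Exception {
        // Connect to the same ZooKeeper ensemble the consumers use (placeholder address).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // Regular consumers keep offsets under /consumers/<group>/offsets/<topic>/<partition>;
        // the storm-kafka spout keeps them under its own <zkRoot>/<spout-id> znodes instead.
        String path = "/consumers/my_test_group/offsets/my_topic/0";

        long previous = -1;
        while (true) {
            long current = Long.parseLong(new String(zk.getData(path, false, new Stat())));
            if (previous >= 0) {
                System.out.println("Consumed roughly " + (current - previous) + " messages in the last 10s");
            }
            previous = current;
            Thread.sleep(10000);
        }
    }
}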

Related

Storm bolt following a kafka bolt

I have a Storm topology where I have to send output to Kafka as well as update a value in Redis. For this I have a KafkaBolt as well as a RedisBolt.
Below is what my topology looks like -
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaSpout");
tp.setBolt("ResultToRedisBolt", ResultsToRedisBolt, 3).shuffleGrouping("EvaluatorBolt", "ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt", "ResultStream");
The problem is that both of the end bolts (Redis and Kafka) listen to the same stream from the preceding bolt (ResultStream), so both can fail independently. What I really need is to update the value in Redis only if the result was successfully published to Kafka. Is there a way to get an output stream from the KafkaBolt containing only the messages that were published to Kafka successfully? I could then listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.
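If you go with the two-topology approach, a rough sketch of the second topology might look like the following; the broker address, result topic, group id, and the reuse of ResultsToRedisBolt are assumptions for illustration, using the storm-kafka-client spout.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

// Second topology: consume the already-published results and update Redis.
TopologyBuilder redisTopology = new TopologyBuilder();

KafkaSpoutConfig<String, String> resultSpoutConfig =
        KafkaSpoutConfig.builder("kafka-broker:9092", "result-topic")
                .setProp(ConsumerConfig.GROUP_ID_CONFIG, "redis-writer")
                .build();

redisTopology.setSpout("resultKafkaSpout", new KafkaSpout<>(resultSpoutConfig), 3);
// ResultsToRedisBolt is the same bolt as in the original topology, but it now only
// sees records that were actually written to Kafka.
redisTopology.setBolt("ResultToRedisBolt", ResultsToRedisBolt, 3).shuffleGrouping("resultKafkaSpout");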

In Kafka, how to make consumers consume from a local partition?

Just to make the scenario simple.
number of consumers == number of partitions == number of Kafka brokers
If I deploy the consumers on the same machines as the brokers, how can I make each consumer consume only the messages stored locally? The purpose is to cut out all the network overhead.
I think this would work if each consumer knew the partition id on its own machine, but I don't know how to find that. Or is there another direction to solve this problem?
Thanks.
bin/kafka-topics.sh --zookeeper [zk address] --describe --topic [topic_name] tells you which broker hosts the leader for each partition. Then you can use manual partition assignment for each consumer to make sure it consumes from a local partition.
Probably not worth the effort because partition leadership can change and then you would have to rebalance all your consumers to be local again. You can save the same amount of network bandwidth with less effort by just reducing the replication factor from 3 to 2.
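A minimal sketch of the manual-assignment approach described above, assuming you have already looked up which partition's leader is on this host; the broker address, topic name, and partition id are placeholders.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LocalPartitionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        int localPartitionId = 0; // the partition whose leader runs on this machine

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // assign() bypasses the group coordinator, so this consumer reads exactly this partition.
        consumer.assign(Collections.singletonList(new TopicPartition("my_topic", localPartitionId)));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.partition() + ": " + record.value());
            }
        }
    }
}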
Maybe you could use the Admin Client API.
First, you can use the describeTopics() method to get information about the topics in the cluster. From the DescribeTopicsResult you can access TopicPartitionInfo, which carries information about each topic's partitions. From there you can get the leader Node through leader(). Node exposes host(), which you can compare with the host your consumer is running on, or id(), which the consumer can compare with the broker id running on the same machine (in general that is information you can define upfront). More info on the Admin Client API is in the following JavaDoc:
https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/admin/AdminClient.html
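As a rough sketch of that lookup with the Admin Client; the bootstrap address and topic name are placeholders, and it assumes the broker's advertised host matches the local hostname.

import java.net.InetAddress;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class LocalPartitionFinder {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        AdminClient admin = AdminClient.create(props);
        TopicDescription topic = admin.describeTopics(Collections.singletonList("my_topic"))
                .all().get().get("my_topic");

        String localHost = InetAddress.getLocalHost().getHostName();
        for (TopicPartitionInfo partition : topic.partitions()) {
            // leader() returns the Node currently acting as leader for this partition
            if (partition.leader().host().equals(localHost)) {
                System.out.println("Partition " + partition.partition() + " is led by this host");
            }
        }
        admin.close();
    }
}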

Swapping an existing topology with a new one

I have added new bolts to my Storm topology and want to swap the existing topology with the new one. How can I achieve this in such a way that when the second topology starts, it does not read the same messages again?
If you're reading from Kafka using the Storm-provided Kafka spout, it stores its offsets in ZooKeeper. If you keep the id defined in SpoutConfig the same, then every time the Kafka spout restarts it should check ZooKeeper and resume from the last committed offset, achieving your goal of not reading the same messages again.
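For illustration, keeping the id stable with the old storm-kafka spout might look like this; the ZooKeeper address, topic, zkRoot, and id are placeholders, and on Storm 0.9.x the classes live in the storm.kafka package instead of org.apache.storm.kafka.

import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.ZkHosts;

BrokerHosts zkHosts = new ZkHosts("zk1:2181");
// zkRoot and id determine where in ZooKeeper the spout stores its offsets.
// Keep both the same across deployments so the redeployed topology resumes
// from the last committed offset instead of re-reading old messages.
SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "my_topic", "/kafka_spout", "my-stable-spout-id");
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);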

Dynamic topic in Kafka channel using Flume

Is it possible to have a kafka channel with a dynamic topic - something like the kafka sink where you can specify the topic header, or the HDFS sink where you can use a value from a header?
I know I can multiplex to use multiple channels (with a bunch of channel configurations), but that is undesirable because I'd like to have a single dynamic HDFS sink, rather than an HDFS sink for each kafka channel.
My understanding is that the Flume Kafka channel can only be mapped to a single topic because it is both producing and consuming logs on that particular topic.
Looking at the code in KafkaChannel.java from Flume 1.6.0, I can see that only one topic is ever subscribed to (with one consumer per thread).

Kafka Spout Reading same message multiple times

If I increase the parallelism of a Kafka spout in my storm topology, how can I stop it from reading the same message in a topic multiple times?
Storm's Kafka spout persists consumer offsets to Zookeeper, so as long as you don't clear your Zookeeper store then it shouldn't read the same message more than once. If you are seeing a message being read multiple times, perhaps check that offsets are being persisted to your zookeeper instance?
I think that by default when running locally, the Kafka spout starts its own local Zookeeper instance (separate from Kafka's Zookeeper), which may have its state reset each time you restart the topology.
You should check whether the message is getting acknowledged (acked) properly. If not, the spout will treat it as failed and will replay the message.
If the problem is on the inflow from Kafka into Storm, then please share more information.
If the data flow is out from Storm to Kafka, then check the TopologyBuilder in your code.
It should not use allGrouping; if it does, change it to shuffleGrouping.
Example:
builder.setBolt("OUTPUTBOLT", new OutBoundBolt(boltConfig), 4)
       .allGrouping("previous_bolt"); // wrong: change allGrouping to shuffleGrouping
All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
You need to specify a consumer group. Once it is specified, Kafka will deliver each message to only one of your spouts. All spouts should belong to the same consumer group.
When creating a consumer, specify the following property:
props.put("group.id", a_groupId);
If your Kafka spout is opaque, then you need to set topology.max.spout.pending to a small value (less than 10),
because "pending" means the tuple has not been acked or failed yet; if a batch has fewer remaining tuples than the pending count, the spout will keep trying to fill up to the max spout pending size.
You can handle this problem by using a Transactional Spout, if that meets your needs.
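For example, lowering max spout pending in the topology config could look like this (a sketch assuming Storm 1.x package names):

import org.apache.storm.Config;

Config conf = new Config();
// Limit the number of tuples that can be pending (not yet acked or failed) per spout task.
conf.setMaxSpoutPending(5);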
