Dynamic topic in Kafka channel using Flume

Is it possible to have a Kafka channel with a dynamic topic - something like the Kafka sink, where you can specify the topic in a header, or the HDFS sink, where you can use a value from a header?
I know I can use a multiplexing channel selector with multiple channels (and a bunch of channel configurations), but that is undesirable because I'd like to have a single dynamic HDFS sink rather than an HDFS sink for each Kafka channel.

My understanding is that the Flume Kafka channel can only be mapped to a single topic because it is both producing and consuming logs on that particular topic.
Looking at the code in KafkaChannel.java from Flume 1.6.0, I can see that only one topic is ever subscribed to (with one consumer per thread).
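For contrast, here is a minimal agent sketch; the property names follow the Flume 1.6 user guide, and the agent/component names and addresses are placeholders, so double-check them against your version. The Kafka sink can take a per-event topic from a "topic" header, while the Kafka channel accepts exactly one fixed topic property:

# Kafka sink: an event's "topic" header overrides the configured default topic.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = broker1:9092,broker2:9092
a1.sinks.k1.topic = default-topic

# Kafka channel: a single topic, used for both producing and consuming events.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.brokerList = broker1:9092,broker2:9092
a1.channels.c1.zookeeperConnect = zk1:2181
a1.channels.c1.topic = flume-channel-topic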

Related

How to read messages from multiple RabbitMQ queues into Logstash?

I have 3 different queues in RabbitMQ from which I have to read messages and send them to Elasticsearch under the same index. I am confused about whether it is possible to read from multiple queues in a single config file. I am already reading one queue at a time, but I am getting real-time messages from different queues and need to process all three queues at the same time.
You can; you need one input section for each queue.
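A rough sketch of such a single config; the host, queue names, and index are placeholders, and the option names should be checked against your logstash-input-rabbitmq version:

input {
  rabbitmq { host => "rabbit-host" queue => "queue_a" }
  rabbitmq { host => "rabbit-host" queue => "queue_b" }
  rabbitmq { host => "rabbit-host" queue => "queue_c" }
}
output {
  elasticsearch { hosts => ["http://localhost:9200"] index => "my-shared-index" }
}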

Storm bolt following a Kafka bolt

I have a Storm topology where I have to send output to Kafka as well as update a value in Redis. For this I have a KafkaBolt as well as a RedisBolt.
Below is what my topology looks like:
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaStream");
tp.setBolt("ResultToRedisBolt",ResultsToRedisBolt,3).shuffleGrouping("EvaluatorBolt","ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt","ResultStream");
The problem is that both of the end bolts (Redis and Kafka) listen to the same stream from the preceding bolt (ResultStream), so both can fail independently. What I really need is to update the value in Redis only if the result has been successfully published to Kafka. Is there a way to get an output stream from the KafkaBolt carrying the messages that were successfully published to Kafka? I could then listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.
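A rough sketch of that split, reusing the naming from the question; the spout and bolt objects (including a Kafka spout subscribed to the results topic) are assumed to be constructed elsewhere, just as in the snippet above:

// Topology 1: evaluate and publish results to Kafka only.
TopologyBuilder resultWriter = new TopologyBuilder();
resultWriter.setSpout("kafkaSpout", kafkaSpout, 3);
resultWriter.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaSpout");
resultWriter.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3)
        .shuffleGrouping("EvaluatorBolt", "ResultStream");

// Topology 2: consume the results topic (i.e. only what was actually written to Kafka)
// and update Redis.
TopologyBuilder redisUpdater = new TopologyBuilder();
redisUpdater.setSpout("resultTopicSpout", resultTopicSpout, 3);
redisUpdater.setBolt("ResultToRedisBolt", ResultsToRedisBolt, 3)
        .shuffleGrouping("resultTopicSpout");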

Flume - Would a source accept events even when the sink is non-operational?

New to Flume.
Let's say I have an agent which has a single Avro source, a single HDFS sink, and a single file channel.
Let's say at some point the sink fails to write to HDFS. Will the source continue to accept events until the channel fills up?
Or would the source stop accepting events even though the file channel is not full?
I have tested this pretty extensively, and you will have a hard time with this situation. When the sink fails, Flume starts throwing exceptions, and depending on the velocity of your stream the channel will fill up as well, causing more exceptions. The best way to control for failure is to use a failover sink processor and configure a sink group; that way, if one sink fails, you have a backup sink set up with very minimal data loss. In my experience, I have set up an Avro sink that goes to a second Flume agent hop in my topology, and if that Flume agent goes down, my failover sinks are two different Hadoop clusters, to one of which I write the Flume events via the HDFS sink. You then have to backfill these events; I have found the netcat source to be effective for this.
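A minimal sketch of such a failover sink group, assuming an agent named a1 with a primary Avro sink and a backup HDFS sink already defined; the property names follow the Flume user guide:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = avroSink hdfsBackupSink
a1.sinkgroups.g1.processor.type = failover
# The sink with the higher priority is used first; the backup takes over only on failure.
a1.sinkgroups.g1.processor.priority.avroSink = 10
a1.sinkgroups.g1.processor.priority.hdfsBackupSink = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000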

Kafka spout reading the same message multiple times

If I increase the parallelism of a Kafka spout in my Storm topology, how can I stop it from reading the same message in a topic multiple times?
Storm's Kafka spout persists consumer offsets to Zookeeper, so as long as you don't clear your Zookeeper store, it shouldn't read the same message more than once. If you are seeing a message being read multiple times, check that offsets are actually being persisted to your Zookeeper instance.
I think that by default when running locally, the Kafka spout starts its own local Zookeeper instance (separate from Kafka's Zookeeper), which may have its state reset each time you restart the topology.
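For reference, a sketch of how the pre-1.x storm-kafka spout is usually wired so that every parallel task reads and writes the same offset path in Zookeeper; the addresses, topic, zkRoot, and consumer id below are placeholders:

import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

// Zookeeper ensemble used by the Kafka brokers.
BrokerHosts hosts = new ZkHosts("zk1:2181");
// zkRoot ("/kafka-spout") and id ("my-consumer") determine where offsets are stored;
// keep them stable across restarts and identical for all parallel spout tasks.
SpoutConfig spoutConfig = new SpoutConfig(hosts, "my_topic", "/kafka-spout", "my-consumer");
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafkaSpout", new KafkaSpout(spoutConfig), 3);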
You should check whether the message is getting acknowledged properly. If not, the spout will treat it as failed and will replay the message.
If the data is flowing in from Kafka into Storm, please share more information.
If the data is flowing out from Storm to Kafka, then check the TopologyBuilder in your code:
it should not use allGrouping; if it does, change it to shuffleGrouping.
Example:
builder.setBolt("OUTPUTBOLT", new OutBoundBolt(boltConfig), 4)
       .allGrouping("previous_bolt"); // this is wrong; change it to shuffleGrouping
All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
You need to specify a consumer group. Once it is specified, Kafka gives each message to only one of your spout instances, so all spout instances should belong to the same consumer group.
When creating a consumer, specify the following property:
props.put("group.id", a_groupId);
If your Kafka spout is opaque, then you need to set topology.max.spout.pending to a small value (for example, below 10),
because "pending" means the tuple has not been acked or failed yet; if fewer tuples remain in a batch than the pending count, the spout keeps trying to fill up to the max-spout-pending size.
You can handle this problem by using a transactional spout if that meets your needs.
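For example, a sketch of capping pending tuples when submitting the topology; this uses the 0.9.x-era backtype.storm package, and the value 5 and the topology/builder names are arbitrary:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;

Config conf = new Config();
// Limit how many emitted tuples per spout task may be un-acked at any time
// (the programmatic equivalent of topology.max.spout.pending).
conf.setMaxSpoutPending(5);
StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());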

Monitoring Kafka Spout with KafkaOffsetMonitoring tool

I am using the KafkaSpout that came with the storm-0.9.2 distribution for my project. I want to monitor the throughput of this spout. I tried using the KafkaOffsetMonitor tool, but it does not show any consumers reading from my topic.
I suspect this is because I have specified a root path in Zookeeper for the spout to store the consumer offsets. How will KafkaOffsetMonitor know where to look for data about my KafkaSpout instance?
Can someone explain exactly where Zookeeper stores data about Kafka topics and consumers? Zookeeper is organized like a filesystem, so how does it arrange the data for different topics and their partitions? What is the consumer group id, and how is it interpreted by Zookeeper when storing consumer offsets?
If anyone has ever used KafkaOffsetMonitor to monitor the throughput of a KafkaSpout, please tell me how I can get the tool to find my spout.
Thanks a lot,
Palak Shah
Kafka-Spout maintains its offset in its own znode rather than under the znode where Kafka stores the offsets for regular consumers. We had a similar need, where we had to monitor the offsets of both the kafka-spout consumers and the regular Kafka consumers, so we ended up writing our own tool. You can get the tool from here:
https://github.com/Symantec/kafka-monitoring-tool
I have never used KafkaOffsetMonitor, but I can answer the other part.
zookeeper.connect is the property where you can specify the znode root for Kafka; by default it keeps all data at '/'.
You can browse the Zookeeper filesystem using zkCli.sh, the Zookeeper command-line client.
You should look under /consumers and /brokers; the following would give you the offset:
get /consumers/my_test_group/offsets/my_topic/0
You can poll this offset continuously to know the rate of consumption at the spout.
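As a rough illustration, the same offset can be polled from Java to estimate the consumption rate; the connect string, group, topic, and partition below are placeholders, and this assumes the znode stores the offset as a plain numeric string, as the old consumer layout does:

import org.apache.zookeeper.ZooKeeper;

public class OffsetRate {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);
        String path = "/consumers/my_test_group/offsets/my_topic/0";
        long first = Long.parseLong(new String(zk.getData(path, false, null)));
        Thread.sleep(10000);
        long second = Long.parseLong(new String(zk.getData(path, false, null)));
        // Messages consumed from partition 0 per second over the 10-second window.
        System.out.println((second - first) / 10.0 + " msgs/sec");
        zk.close();
    }
}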
