If I increase the parallelism of a Kafka spout in my storm topology, how can I stop it from reading the same message in a topic multiple times?
Storm's Kafka spout persists consumer offsets to Zookeeper, so as long as you don't clear your Zookeeper store, it shouldn't read the same message more than once. If you are seeing a message being read multiple times, check that offsets are actually being persisted to your Zookeeper instance.
I think that by default when running locally, the Kafka spout starts its own local Zookeeper instance (separate from Kafka's Zookeeper), which may have its state reset each time you restart the topology.
You should check whether the message is being acknowledged properly. If it is not, the spout will treat it as failed and will replay the message.
If the data flows from Kafka into Storm, please share more information.
If the data flows out of Storm into Kafka,
check the TopologyBuilder in your code.
The output bolt should not use allGrouping; if it does, change it to shuffleGrouping.
Example:
builder.setBolt("OUTPUTBOLT", new OutBoundBolt(boltConfig), 4)
       .allGrouping("previous_bolt"); // wrong: change this to shuffleGrouping
All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
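To see why allGrouping produces duplicates in the output, here is a rough sketch in plain Java (no Storm dependency). The class and method names are made up for illustration, and the "pick one task at random" rule is a simplification of what shuffleGrouping really does; the point is only that all grouping targets every task while shuffle grouping targets exactly one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical model of which bolt tasks receive one tuple under each grouping.
class GroupingSketch {
    // All grouping: the tuple is replicated to every task of the bolt.
    static List<Integer> allGrouping(int taskCount) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < taskCount; task++) {
            targets.add(task);
        }
        return targets;
    }

    // Shuffle grouping (simplified): the tuple goes to exactly one task.
    static List<Integer> shuffleGrouping(int taskCount, Random random) {
        return Collections.singletonList(random.nextInt(taskCount));
    }

    public static void main(String[] args) {
        System.out.println("allGrouping targets:     " + allGrouping(4));
        System.out.println("shuffleGrouping targets: " + shuffleGrouping(4, new Random()));
    }
}
```

With a parallelism of 4, as in the example above, every tuple would reach four tasks of the output bolt under all grouping, so each one would be written to Kafka four times.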
You need to specify a consumer group. Once it is specified, Kafka will deliver each message to only one of your spouts. All spouts should belong to the same consumer group.
When creating a consumer, specify the following property:
props.put("group.id", a_groupId);
If your Kafka spout is Opaque, then you need to set topology.max.spout.pending to less than 10,
because "pending means the tuple has not been acked or failed yet". So if a batch contains fewer tuples than the pending count, the spout keeps trying to reach the max spout pending size.
You can handle this problem by using a Transactional spout, if it meets your needs.
Related
I have a Storm topology where I have to send output to kafka as well as update a value in redis. For this I have a Kafkabolt as well as a RedisBolt.
Below is what my topology looks like -
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaStream");
tp.setBolt("ResultToRedisBolt",ResultsToRedisBolt,3).shuffleGrouping("EvaluatorBolt","ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt","ResultStream");
The problem is that both of the end bolts (Redis and Kafka) are listening to the same stream from the preceding bolt (ResultStream), hence both can fail independently. What I really need is to update the value in Redis only if the result has been successfully published to Kafka. Is there a way to have an output stream from a KafkaBolt where I can get the messages published successfully to Kafka? I could then listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.
In my topology I have a spout with a socket opened on port 5555 to receive messages.
If I have 10 supervisors in my Storm cluster, will each one of them be listening to their 5555 ports?
In the end, to which supervisor should I send messages?
Multiple comments here:
Storm uses a pull based model for data ingestion via Spouts. If you open a socket you will block the Spout until data is available (and this is bad; see this SO question for more details: Why should I not loop or block in Spout.nextTuple())
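One common way around the blocking problem is to let a separate thread own the socket and hand messages over through a queue, so that nextTuple() only does a non-blocking poll. Below is a minimal sketch of that hand-off in plain Java; NonBlockingSource, onMessage and pollNext are hypothetical names, and in a real spout the body of pollNext would sit inside nextTuple() and emit the message through the collector.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical buffer between a receiver thread and a spout's nextTuple().
class NonBlockingSource {
    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();

    // Called by the receiver thread (the one reading from the socket).
    void onMessage(String message) {
        buffer.offer(message);
    }

    // What nextTuple() would do: return immediately, with a message or with null.
    String pollNext() {
        return buffer.poll();
    }

    public static void main(String[] args) {
        NonBlockingSource source = new NonBlockingSource();
        System.out.println(source.pollNext()); // nothing buffered yet, returns at once
        source.onMessage("hello");
        System.out.println(source.pollNext());
    }
}
```

When pollNext returns null the spout simply emits nothing for that call, so the spout thread is never stuck waiting on the socket.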
About Spout deployment (Supervisors):
first, it depends on the parallelism of your spout (i.e., parallelism_hint, default value is one)
second, supervisors do not execute Spout code: Supervisors start up worker JVMs that execute Spouts/Bolts (see config parameter number_of_workers for a topology)
third, Storm uses a load-balanced round-robin scheduler; thus, it might happen that two Spout executors are scheduled to the same worker JVM (or to different workers on the same host); in this case, you will get a port conflict (only one executor will be able to open the port)
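The port conflict from the third point can be reproduced without Storm. This is only a sketch of what would happen if two spout executors on one host both tried to listen on the same port; it uses an OS-assigned port instead of 5555 so it does not collide with anything running locally, and the class name is made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.ServerSocket;

// Hypothetical demo: a second listener on an already bound port is rejected.
class PortConflictDemo {
    static boolean secondBindFails() {
        try (ServerSocket first = new ServerSocket(0)) { // 0 = let the OS pick a free port
            int port = first.getLocalPort();
            try (ServerSocket second = new ServerSocket(port)) {
                return false; // the second listener unexpectedly got the port
            } catch (BindException e) {
                return true;  // address already in use
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind failed: " + secondBindFails());
    }
}
```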
Data distribution should not matter in this case: if you really go with push, you can choose any host to send the data to; Storm does not care. Of course, if you need some kind of key-based partitioning, you might want to send data from a single partition to a single Spout instance; as an alternative, just forward the data within the Spout and use fieldsGrouping to get your partitions for the consuming Bolt. However, if you use pull based data ingestion by the Spout, you can ensure that each Spout pulls data from certain partitions and the problem resolves naturally.
To sum up: using push based data ingestion might be a bad idea.
I have added new bolts to my Storm topology and want to swap the existing topology with the new one. How can I achieve this in such a way that when the second topology starts, it does not read the same messages again?
If you're reading from Kafka using the Storm provided Kafka Spout, it stores its offset in Zookeeper. If you keep the id defined in SpoutConfig the same, every time the Kafka Spout restarts it should check Zookeeper and restart from the last committed offset, which achieves your goal of not reading the same messages again.
As I understand things, ZooKeeper will persist tuples emitted by bolts so if a bolt crashes (or a computer with the bolt crashes, or the entire cluster crashes), the tuple emitted by the bolt will not be lost. Once everything is restarted, the tuples will be fetched from ZooKeeper, and everything will continue on as if nothing bad ever happened.
What I don't yet understand is if the same thing is true for spouts. If a spout emits a tuple (i.e., the emit() function within a spout is executed), and the computer the spout is running on crashes shortly thereafter, will that tuple be resurrected by ZooKeeper? Or do we need Kafka in order to guarantee this?
P.S. I understand that the tuple emitted by the spout must be assigned a unique ID in the call to emit().
P.P.S. I see sample code in books that uses something like ConcurrentHashMap<UUID, Values> to track which spouted tuples have not yet been acked. Is this somehow automatically persisted with ZooKeeper? If not, then I shouldn't really be doing that, should I? What should I be doing instead? Using Kafka?
Florian Hussonnois answered my question thoroughly and clearly in this storm-user thread. This was his answer:
Actually, the tuples aren't persisted into "zookeeper". If your
"spout" emits a tuple with a unique id, it will be automatically
followed internally by Storm (i.e. by the ackers). Thus, in case the
emitted tuple fails because of a bolt failure, Storm invokes the
method 'fail' on the origin spout task with the unique id as argument.
It's then up to you to re-emit the failed tuple.
In sample codes, spouts use a Map to track which tuples are fully
processed by your entire topology in order to be able to re-emit in
case of a bolt failure.
However, if the failure doesn't come from a bolt but from your spout,
the in-memory Map will be lost and your topology will not be able to
re-emit failed tuples.
For such a scenario you can rely on Kafka. In fact, the Kafka Spout
stores its read offset in zookeeper. In that way, if a spout task
goes down it will be able to read its offset from zookeeper after
restarting.
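For anyone wondering what the Map-based tracking described in the answer looks like, here is a rough sketch in plain Java with no Storm dependency. PendingTracker and its method names are hypothetical; in a real spout, emit would sit in nextTuple() and ack/fail would be the spout callbacks of the same name. Note that everything here lives in memory only, which is exactly why it is lost when the spout itself crashes.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory tracker for tuples that have been emitted but not yet acked.
class PendingTracker {
    private final Map<UUID, String> pending = new ConcurrentHashMap<>();
    private final Queue<UUID> retries = new ArrayDeque<>();

    // Remember the payload under a fresh unique id at emit time.
    UUID emit(String payload) {
        UUID id = UUID.randomUUID();
        pending.put(id, payload);
        return id;
    }

    // Fully processed: forget the tuple.
    void ack(UUID id) {
        pending.remove(id);
    }

    // Failed downstream: schedule the tuple for re-emission.
    void fail(UUID id) {
        if (pending.containsKey(id)) {
            retries.add(id);
        }
    }

    // Returns the next payload to re-emit, or null if there is none.
    String nextRetry() {
        UUID id = retries.poll();
        return id == null ? null : pending.get(id);
    }

    int pendingCount() {
        return pending.size();
    }
}
```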
I am using the kafkaSpout that came with storm-0.9.2 distribution for my project. I want to monitor the throughput of this spout. I tried using the KafkaOffsetMonitoring, but it does not show any consumers reading from my topic.
I suspect this is because I have specified the root path in Zookeeper for the spout to store the consumer offsets. How will KafkaOffsetMonitor know where to look for data about my kafkaSpout instance?
Can someone explain exactly where Zookeeper stores data about Kafka topics and consumers? Zookeeper is organised like a filesystem, so how does it arrange the data of different topics and their partitions? What is a consumer group id, and how is it interpreted by Zookeeper when storing consumer offsets?
If anyone has ever used kafkaOffsetMonitor to monitor throughput of a kafkaSpout, please tell me how I can get the tool to find my spout?
Thanks a lot,
Palak Shah
Kafka-Spout maintains its offset in its own znode rather than under the znode where Kafka stores the offsets for regular consumers. We had a similar need, where we had to monitor the offsets of both the kafka-spout consumers and regular Kafka consumers, so we ended up writing our own tool. You can get the tool from here:
https://github.com/Symantec/kafka-monitoring-tool
I have never used KafkaOffsetMonitor, but I can answer the other part.
zookeeper.connect is the property where you can specify the znode for Kafka; by default it keeps all data at '/'.
You can access the Zookeeper filesystem using zkCli.sh, the Zookeeper command line.
You should look at /consumers and /brokers; the following would give you the offset:
get /consumers/my_test_group/offsets/my_topic/0
You can poll this offset continuously to know the rate of consumption at spout.
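Turning two polled offsets into a rate is simple arithmetic. A small sketch, assuming you already have two offset readings for the same partition and the time between them (the class and method names are made up, and fetching the offsets from Zookeeper is left out):

```java
// Hypothetical helper: consumption rate from two offset samples of one partition.
class OffsetRate {
    static double messagesPerSecond(long offsetBefore, long offsetAfter, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return (offsetAfter - offsetBefore) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Offset moved from 1000 to 1600 within 2 seconds.
        System.out.println(OffsetRate.messagesPerSecond(1000, 1600, 2000)); // prints 300.0
    }
}
```

For a topic with several partitions you would compute this per partition and add the results up.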