I want to know whether a stateful bolt stops processing tuples while it is in the prepare and commit states, i.e., whether that is the synchronization mechanism. I checked the source code, and it looks like the bolt takes a different processing path depending on whether it receives a checkpoint tuple or a normal tuple. So while a checkpoint action is being executed, normal computation is not performed, and that is the synchronization mechanism. Is that correct?
I just wanted to confirm something that I think is implied between the lines of the documentation. Would it be correct to say that a commit in Kafka Streams is independent of whether the offset/message has been processed by the entire set of processing nodes of the application topology, and depends solely on the commit interval? In other words, whereas in a typical Kafka consumer application one would commit when a message is fully processed rather than merely fetched, in Kafka Streams simply being fetched is enough for the commit interval to kick in and commit that message/offset? That is, even if that offset/message has not yet been processed by the entire set of processing nodes of the application topology?
Or do messages become eligible to be committed only once the entire set of processing nodes of the topology has processed them and they are ready to go out to a topic or an external system?
In a sense, the question could be summed up as: when are offsets/messages eligible to be committed in Kafka Streams? Is it conditional? If so, what is the condition?
You have to understand that a Kafka Streams program, i.e., its Topology, may contain multiple sub-topologies (https://docs.confluent.io/current/streams/architecture.html#stream-partitions-and-tasks). Sub-topologies are connected to each other via topics.
A record can be committed, if it's fully processed by a sub-topology. For this case, the record's intermediate output is written into the topic that connects two sub-topologies before committing happens. The downstream sub-topology would read from the "connecting topic" and commit offsets for this topic.
Committing indeed happens based on commit.interval.ms only. If a fetch returns, let's say, 100 records (offsets 0 to 99), and 30 records have been processed by the sub-topology when commit.interval.ms hits, Kafka Streams would first make sure that the output of those 30 messages is flushed to Kafka (i.e., Producer.flush()) and would afterward commit offset 30. The other 70 messages are just in an internal buffer of Kafka Streams and would be processed after the commit. If the buffer is empty, a new fetch would be sent. Each thread tracks commit.interval.ms independently and commits all its tasks once the commit interval has passed.
Because committing happens on a sub-topology basis, it can be that an input topic record is committed while the output topic does not have the result data yet, because the intermediate results have not yet been processed by a downstream sub-topology.
You can inspect the structure of your program via Topology#describe() to see what sub-topologies your program has.
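To make the 100-record example above concrete, here is a small toy model in Python. It is only a sketch of the ordering described (process, flush output, then commit the next offset); the class and field names are made up for illustration and are not part of the Kafka Streams API.

```python
from dataclasses import dataclass, field


@dataclass
class SubTopologyTask:
    """Toy model of one task of a sub-topology (hypothetical, not Kafka code)."""
    buffer: list = field(default_factory=list)          # fetched, not yet processed
    pending_output: list = field(default_factory=list)  # produced, not yet flushed
    flushed_output: list = field(default_factory=list)  # visible in the output topic
    next_offset: int = 0       # offset of the next record to process
    committed_offset: int = 0  # position stored by the last commit

    def fetch(self, offsets):
        # A fetch only fills the internal buffer; nothing is committed here.
        self.buffer.extend(offsets)

    def process_one(self):
        offset = self.buffer.pop(0)
        self.pending_output.append(f"result-{offset}")
        self.next_offset = offset + 1

    def commit(self):
        # Flush the output of everything processed so far, then commit.
        self.flushed_output.extend(self.pending_output)
        self.pending_output.clear()
        self.committed_offset = self.next_offset


task = SubTopologyTask()
task.fetch(range(100))   # offsets 0 to 99
for _ in range(30):      # 30 records processed before the interval hits
    task.process_one()
task.commit()            # commit interval fires

print(task.committed_offset, len(task.buffer), len(task.flushed_output))
# prints: 30 70 30
```

The 70 buffered records are untouched by the commit; only what the sub-topology has fully processed moves the committed offset.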
Whether using streams or just a simple consumer, the key thing is that auto-commit happens in the polling thread, not a separate thread - the offset of a batch of messages is only committed on the subsequent poll, and commit.interval.ms just defines the minimum time between commits, ie a large value means that commit won't happen on every poll.
The implication is that as long as you are not spawning additional threads, you will only ever be committing offsets for messages that have been completely processed, whatever that processing involves.
I am running a streaming Beam pipeline where I stream files/records from GCS using AvroIO and then create minutely/hourly buckets to aggregate events and add them to BQ. If the pipeline fails, how can I recover correctly and process only the unprocessed events? I do not want to double count events.
One approach I was thinking of was writing to Spanner or Bigtable, but it may be the case that the write to BQ succeeds while the DB write fails, or vice versa.
How can I maintain state in a reliable, consistent way in a streaming pipeline so that only unprocessed events are processed?
I want to make sure the final aggregated data in BQ has the exact count for each event type, with no undercounting or overcounting.
How does a Spark Streaming pipeline solve this (I know it has a checkpointing directory for managing the state of queries and dataframes)?
Are there any recommended techniques for solving this kind of problem accurately in streaming pipelines?
Based on clarification from the comments, this question boils down to 'can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs start from scratch?'. The short answer is no. Even if the user is willing to store some state in external storage, it needs to be committed atomically/consistently with the streaming engine's internal state. Streaming engines like Dataflow and Flink store the required state internally, which is needed to 'resume' a job. With Flink you could resume from the latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors; you need to cancel a job explicitly). Dataflow does provide an exactly-once processing guarantee with update.
Somewhat relaxed guarantees would be feasible with careful use of external storage. The details really depend on specific goals (often it is not worth the extra complexity).
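As one possible illustration of such a relaxed guarantee (my own assumption, not something the answer above spells out): if each aggregate is written under a deterministic key such as (window start, event type), a replayed write overwrites the earlier row instead of adding to it. The sketch below uses a plain dict as a stand-in for the external store, and the function name is hypothetical. It only helps when a rerun recomputes the same total for a window.

```python
def write_window_total(store, window_start, event_type, total):
    """Idempotent upsert: the key is derived from the window, not from the run."""
    store[(window_start, event_type)] = total


store = {}  # stand-in for an external key-value store or table

# First run writes the hourly aggregate.
write_window_total(store, "2020-01-01T10:00", "click", 42)

# A second run replays the same window and writes the same aggregate again.
write_window_total(store, "2020-01-01T10:00", "click", 42)

print(store)
# prints: {('2020-01-01T10:00', 'click'): 42}
```

This does not give exactly-once processing across runs; it only makes the final write safe to repeat.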
So, Apache Storm + Trident provide the exactly-once semantics. Imagine I have the following topology:
TridentSpout -> SumMoneyBolt -> SaveMoneyBolt -> Persistent Storage.
SumMoneyBolt sums monetary values in memory, then passes the result to SaveMoneyBolt, which should save the final value to a remote storage/database.
Now it is very important that we calculate these values and store only once to the database. We do not want to accidentally double count the money.
So how does Storm with Trident handle network partitioning and/or failure scenarios when the write request to the database has been successfully sent, the database has successfully received the request, logged the transaction, and while responding to the client, the SaveMoneyBolt has either died or partitioned from the network before having received the database response?
I assume that if SaveMoneyBolt had died, Trident would retry the batch, but we cannot afford double counting.
How are such scenarios handled?
Trident gives a unique transaction id (txid) to each batch. If a batch is retried, it will have the same txid. Batch updates are also ordered, i.e., the state update for a batch will not happen until the update for the previous batch is complete. So by storing the txid along with the values in the state, Trident can de-duplicate the updates and provide exactly-once semantics.
Trident comes with a few built-in Map state implementations that handle all of this automatically.
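The txid idea can be sketched in a few lines of Python. This is a toy model, not Trident's API: the class and method names are made up, and in a real store the value and the txid would have to be written together atomically.

```python
class TransactionalCounter:
    """Toy state that keeps the last applied txid next to the value."""

    def __init__(self):
        self.value = 0
        self.last_txid = None

    def apply_batch(self, txid, amount):
        # A retried batch carries the same txid, so it is recognised and skipped.
        if txid == self.last_txid:
            return False
        # Assumption: value and txid are persisted in one atomic write.
        self.value += amount
        self.last_txid = txid
        return True


state = TransactionalCounter()
state.apply_batch(1, 100)            # batch 1 applied
state.apply_batch(2, 50)             # batch 2 applied, response lost
retried = state.apply_batch(2, 50)   # batch 2 retried with the same txid

print(state.value, retried)
# prints: 150 False
```

Comparing against only the last txid is enough here because batch updates are applied in order, as described above.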
For more information take a look at the docs :
I have started to use Storm recently but could not find any resources on the net about global grouping option's fault tolerance.
According to my understanding from the documents, when running a topology in which a bolt (Bolt A) uses global grouping, the tuples from all tasks of Bolt B are received by a single task of Bolt A. As it is using the global grouping option, there is only one task of Bolt A in the topology.
The question is as follows: what will happen if we store some historical data of the stream within Bolt A and the worker process that contains the task of Bolt A dies? In other words, will the data stored in this bolt be lost?
Thanks in advance
Once all the downstream tasks have acked the tuple, it means that they have successfully processed the message, and it need not be replayed if there is a shutdown. If you are keeping any state in memory, then you should store it in a persistent store. The message should be acked only once the state change caused by the message has been persisted.
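A rough sketch of the "persist first, then ack" ordering, written as a toy Python model rather than real Storm code. The dict stands in for a persistent store and the list for the acker; both, like the class name, are placeholders I made up for illustration.

```python
class CountingBolt:
    """Toy bolt that keeps counts in memory and mirrors them to a store."""

    def __init__(self, store, acked):
        self.store = store          # stand-in for a durable store
        self.acked = acked          # stand-in for acking tuples back to Storm
        self.counts = dict(store)   # recover in-memory state on (re)start

    def execute(self, tuple_id, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        self.store[word] = self.counts[word]   # 1. persist the state change
        self.acked.append(tuple_id)            # 2. only then ack the tuple


store, acked = {}, []
bolt = CountingBolt(store, acked)
bolt.execute(1, "a")
bolt.execute(2, "a")

# Simulate the worker dying: the old bolt object is gone, a new one starts.
restored = CountingBolt(store, acked)

print(restored.counts, acked)
# prints: {'a': 2} [1, 2]
```

Note that if the process dies between the persist and the ack, the tuple will be replayed and counted again, so this ordering alone gives at-least-once rather than exactly-once updates.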
As I understand things, ZooKeeper will persist tuples emitted by bolts so if a bolt crashes (or a computer with the bolt crashes, or the entire cluster crashes), the tuple emitted by the bolt will not be lost. Once everything is restarted, the tuples will be fetched from ZooKeeper, and everything will continue on as if nothing bad ever happened.
What I don't yet understand is if the same thing is true for spouts. If a spout emits a tuple (i.e., the emit() function within a spout is executed), and the computer the spout is running on crashes shortly thereafter, will that tuple be resurrected by ZooKeeper? Or do we need Kafka in order to guarantee this?
P.S. I understand that the tuple emitted by the spout must be assigned a unique ID in the call to emit().
P.P.S. I see sample code in books that uses something like ConcurrentHashMap<UUID, Values> to track which spouted tuples have not yet been acked. Is this somehow automatically persisted with ZooKeeper? If not, then I shouldn't really be doing that, should I? What should I be doing instead? Using Kafka?
Florian Hussonnois answered my question thoroughly and clearly in this storm-user thread. This was his answer:
Actually, the tuples aren't persisted into "zookeeper". If your
"spout" emits a tuple with a unique id, it will be automatically
followed internally by Storm (i.e., by the ackers). Thus, in case the emitted
tuple comes to fail because of a bolt failure, Storm invokes the
method 'fail' on the origin spout task with the unique id as argument.
It's then up to you to re-emit the failed tuple.
In sample codes, spouts use a Map to track which tuples are fully
processed by your entire topology in order to be able to re-emit in
case of a bolt failure.
However, if the failure doesn't come from a bolt but from your spout,
the in memory Map will be lost and your topology will not be able to
re-emit failed tuples.
For such a scenario you can rely on Kafka. In fact, the Kafka Spout
stores its read offset in ZooKeeper. That way, if a spout task
goes down it will be able to read its offset from zookeeper after