I need to group messages received from a system based on certain criteria. For performance reasons, I want to avoid persisting the individual messages before I can group them. I've seen that JMS implementations provide transaction batching over a set of messages, as described in:
Document 1
Document 2
But I also want acknowledgement of the batch to be controlled by my code: if there is some issue while grouping, I should be able to roll back the batch I am reading so the messages can be processed on the next attempt.
From the above links, since the transaction is managed by the container over a set of onMessage calls, I would not control the transaction commit and rollback.
Can someone let me know if I am misreading this, and what would be the way to achieve it?
Related
As per what I have read on the internet, a method annotated with Spring's @KafkaListener will commit the offset every 5 seconds by default.
Suppose that after 5 seconds the offset is committed but processing is still going on, and in between the consumer crashes because of some issue. In that case, after rebalancing, the partition will be assigned to another consumer, which will start processing from the next message because the previous message's offset was already committed.
This will result in the loss of that message.
So, do I need to commit the offset manually after processing completes? What would be the recommended approach?
Conversely, if processing is done and the consumer crashes just before the commit, how do I avoid message duplication in that case?
Please suggest an approach that avoids both message loss and duplication. I am using Spring's @KafkaListener with the default configuration.
As usual this depends on your use case and how you would like to deal with issues during your processing. The usage of auto-commit will change the delivery semantics of your application.
Enabling auto-commit gives you more of an "at-most-once" semantics, as you read the data and commit the offset before you have actually processed the data. If your processing fails, the message has already been committed and you will not read it again; it is therefore "lost" to your application (to your particular consumer group, to be more precise).
Disabling auto-commit gives you more of an "at-least-once" semantics, as you commit the offsets only after processing the data. Imagine you fetch 100 messages from the topic. 50 of them are processed successfully, and your application fails while processing the 51st message. Because you disabled auto-commit and only commit all or none of the messages at the end of processing, you have not committed any of the 100 messages, so the next time your application reads the same 100 messages again. However, you have now created 50 duplicates, as they were already processed successfully the previous time.
To conclude, you need to figure out whether your use case can better tolerate data loss or duplicates. Dealing with duplicates is safe if your application is idempotent.
You are asking how to prevent both data loss and duplicates, which means you are referring to exactly-once semantics. This is a big topic in distributed streaming systems; check the spring-kafka docs to see whether it is supported, under which configuration, and depending on the output operation of your application.
Please also check the comment of GaryRussell on this post:
"the Spring team does not recommend using auto commit; the listener container Ackmode (BATCH or RECORD) will commit the offsets in a deterministic manner; recent versions of the framework disable auto commit (unless specifically enabled)"
If the consumer takes 5+ seconds to process the message then you have a problem in the code that needs to be fixed.
Auto-commit is risky in production, as it can lead to problem scenarios (message loss, etc.).
It is better to go with manual commits so you have more control.
Make the consumer idempotent so that duplicate messages and the consumer's work-in-progress (WIP) state are not a problem. For example, maintain a processing status in the consumer's DB so that if processing is half done, the consumer can clear the WIP state on restart and process afresh. Similarly, if the processing status is Complete, then on restart it will see the Complete status and simply commit the duplicate message's offset to Kafka.
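A rough sketch of that status-tracking idea (the names below are illustrative; a real implementation would use the consumer's database instead of an in-memory map):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IdempotentProcessor {

    enum Status { IN_PROGRESS, COMPLETE }

    // stand-in for the consumer's own database table of processing statuses
    private final Map<String, Status> statusStore = new ConcurrentHashMap<>();

    void handle(String messageId, String payload) {
        Status status = statusStore.get(messageId);
        if (status == Status.COMPLETE) {
            // duplicate delivery after a crash-before-commit: skip and just re-commit the offset
            return;
        }
        if (status == Status.IN_PROGRESS) {
            // half-done work left over from a previous crash: clear it and start afresh
            rollbackPartialWork(messageId);
        }
        statusStore.put(messageId, Status.IN_PROGRESS);
        process(payload);
        statusStore.put(messageId, Status.COMPLETE);
        // the Kafka offset would be committed after this point
    }

    private void rollbackPartialWork(String messageId) {
        // undo any partial writes associated with this message id
    }

    private void process(String payload) {
        // actual business logic
    }
}
```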
We use Kafka topics as both events and a repository. Using the kafka-streams API we define a simple K-Table that represents all the events in the topic.
In our use case we publish events to the topic and subsequently reference the K-Table as the backing repository. The main issue is that the published events are not immediately visible on the K-Table.
We tried transactions and exactly once semantics as described here (https://kafka.apache.org/26/documentation/streams/core-concepts#streams_processing_guarantee) but there is always a delay we cannot control.
1) Publish event
2) Undetermined amount of time passes
3) The published event is visible in the K-Table
Is there a way to eliminate the delay, or otherwise know that a specific event has been consumed by the K-Table?
NOTE: We tried both partition and global tables with similar results.
Thanks
Because Kafka is an asynchronous system, the observed delay is expected and you cannot do anything to avoid it.
However, when you publish a message to a topic, KafkaProducer allows you to pass a Callback to the send() method; the callback is executed after the message has been written to the topic and provides the record's metadata, such as topic, partition, and offset.
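For illustration, a minimal sketch of that callback usage with the plain Java client might look like this (the broker address, topic name, and key/value are made up):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CallbackPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "some-key", "some-value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();   // the write failed
                        } else {
                            // the record is durably written; the metadata tells you where
                            System.out.printf("written to %s-%d at offset %d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}
```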
After Kafka Streams has processed messages, it will eventually commit the offsets (you can configure the commit interval, too). Thus, you know the message is in the KTable once its offset has been committed. By default, committing happens only every 30 seconds, and it is not recommended to use a very short commit interval because it implies a large overhead. Hence, I am not sure whether this would help in your case, as it seems you want a more timely "response".
As an alternative, you can also disable caching on the KTable and use a toStream().process() step: after each update to the KTable, the changelog stream provided by toStream() will contain the record, and you can access the record metadata (including its offset) in the Processor via the given ProcessorContext object. This should also allow you to figure out when the record is available in the KTable.
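A rough sketch of that alternative, using the older Processor API that matches the 2.6 docs linked in the question (the topic and store names are invented for the example):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.state.KeyValueStore;

public class KTableVisibility {

    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // disable caching so every single update is forwarded downstream
        KTable<String, String> table = builder.table(
                "events",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("events-store")
                        .withCachingDisabled());

        // every update that reaches the KTable also flows through this processor,
        // so at this point the record (and its offset) is visible in the table
        table.toStream().process(() -> new AbstractProcessor<String, String>() {
            @Override
            public void process(String key, String value) {
                System.out.printf("key %s visible in KTable (partition %d, offset %d)%n",
                        key, context().partition(), context().offset());
            }
        });

        return builder;
    }
}
```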
My team is considering whether we can use MassTransit as the primary solution for sagas over RabbitMQ (vs. NServiceBus). I admit that our experience with solutions like MassTransit and NServiceBus is minimal, and we have only just started to introduce messaging into our system, so I am sorry if my question is simple or even stupid.
However, when I reviewed the MassTransit documentation, I was not sure whether it is possible to solve one of our cases.
The case looks like:
One of our components will produce up to 100 messages, which will be "sent" to a queue. These messages are the result of a single operation in the system. All of the messages will have the same correlation id, and the same internal publication id as well.
1) Is it possible to define a single saga instance (by correlation id) which waits until it receives all the messages from the queue and then processes them as a single batch?
2) Otherwise, is there any solution to ensure that all of the sent messages were processed? (A consistent batch?) I assume that the correlation id will serve as the way to find an existing saga instance (singleton). Ideally, I would like to complete a saga instance when the system has processed every message that belongs to a single group (to one publication).
I looked at CompositeEvent too, but I am not sure whether I could use it to "ensure" that every message was processed before completing the saga for a specific correlation id.
Can you explain how this could be achieved, and which mechanism I should look into in order to correlate many messages with the same id to a single saga and then complete it once all of the messages have been consumed?
Thank you in advance for any response
What you describe is how correlation by id works. It is like that out of the box.
So, in short - when you configure correlation for your messages correctly, all messages with the same correlation id will be handled by the same saga instance.
Concerning the second question: unless you publish a separate event that informs the saga about how many messages it should expect, how would it know that? You can certainly schedule a long timeout and assume that all the messages will be received by the saga within it, but that is not reliable.
Composite events won't help here, since they are for handling messages of different types as one once all of them have arrived; they do not count the number of messages of each type. A composite event just waits for one message of each type.
The ability to receive a series of messages and then operate on them in a batch is a common case, so much so that there is a sample showing how to do just that:
Batch Sample
Each saga instance has a unique correlation identifier, and as long as those messages can be correlated to that single instance, MassTransit will manage the concurrency (either optimistic or pessimistic, and depending upon the saga storage engine).
I'd suggest reviewing the state machine in the sample, and seeing how that compares to your scenario.
One option for a messaging provider is a message queue, which provides FIFO ordering, i.e. a queue. Why would the ordering of messages be important? I wonder whether it is because of the priority of the messages or something similar. I would appreciate it if anyone could explain with an example.
Your answer is right - logically some operations are interdependent and you must maintain the order of calls.
But I think that there is an even more important purely technical aspect to this that I want to point out: You need to know the order to be able to achieve ACID transactions.
Take the following scenario:
You have a process service that orchestrates 5 other entity/utility services. The process gets triggered and starts executing, but the 3rd call fails. More often than not it is too expensive to have a common transactional context between services (in order to do a two-phase commit), so the solution is to use compensation, i.e. to call the opposite operations of all services that already performed a write operation before the failure. If you cannot guarantee the order of the messages, you cannot possibly know what you should roll back and what you should not (unless you explicitly look in the underlying systems and track the changes yourself, which is not a sane approach).
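As a schematic illustration (the names are invented, not tied to any framework), compensation only works because the completed steps are remembered in order:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ProcessOrchestrator {

    interface Step {
        void execute();
        void compensate();
    }

    void run(Step... steps) {
        Deque<Step> completed = new ArrayDeque<>();
        try {
            for (Step step : steps) {
                step.execute();
                completed.push(step);   // remember the order for a possible rollback
            }
        } catch (RuntimeException failure) {
            // undo everything that already succeeded, newest first
            while (!completed.isEmpty()) {
                completed.pop().compensate();
            }
            throw failure;
        }
    }
}
```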
Hope this helps!
Here's what I wrote for my answer:
By implementing a queue data structure, consumers will receive messages in the order in which they were sent. For example, an order system in an enterprise sends some messages to a sales system. Let these be "GetPayment" and "Make a Shipment". If these messages are not queued, the sales system could malfunction by being told to "Make a Shipment" before "Getting a Payment".
The idea is to maintain the enterprise-level workflow.
PS: Plamen has a more in-depth answer.
Whatever gets into the message buffer first should be served first. Message queues are used to retain the order in which messages were received; queues are first in, first out.
The application server creates a new transaction before calling the MDB's onMessage method. I am also performing a database update in the onMessage method. Transactions create additional overhead, and processing several messages in one transaction could improve performance.
Is it possible to make the application server use one transaction for several messages? Or are there other approaches to this problem?
And, by the way, I can't use multiple instances, because I need to preserve the message order.
I guess you can store the messages in a list and, depending on how many messages you want to process in one transaction, check the size of the list and process them as a batch.
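A rough sketch of that idea as a JMS MDB (the queue name, batch size, and flush() logic are placeholders; it assumes the pool is limited to a single instance, as the question requires):

```java
import java.util.ArrayList;
import java.util.List;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationLookup", propertyValue = "jms/incomingQueue"),
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue")
})
public class BatchingMdb implements MessageListener {

    private static final int BATCH_SIZE = 50;

    // only safe because the pool is limited to a single MDB instance
    private final List<Message> buffer = new ArrayList<>();

    @Override
    public void onMessage(Message message) {
        buffer.add(message);
        if (buffer.size() >= BATCH_SIZE) {
            flush();   // one batched database update instead of fifty small ones
        }
        // caveat: each onMessage call still runs in its own container transaction,
        // so buffered-but-unflushed messages are already acknowledged and would be
        // lost if the server crashes before the next flush
    }

    private void flush() {
        // perform the batched database work here, then clear the buffer
        buffer.clear();
    }
}
```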