How to implement exactly once read semantics using Kafka Transaction API? - spring-boot

I am new to Kafka.
I want to design a system using Spring Boot + Kafka which will consume messages in batches with an exactly-once read guarantee. The same message should not be processed again. I was planning to use ZooKeeper's metadata, which maintains the partition offsets, but then I learned that Kafka has introduced a Transaction API which takes care of this.
I read about this API in the documentation, but I did not find any practical example which shows that messages are processed only once.
Is there any link where I can learn Transaction API theoretically as well as implement it practically? Appreciate your help.

Related

Read data directly from REST Api with Kafka

Is such a situation even possible?
There is an application "XYZ" (in which there is no Kafka) that exposes a REST api. It is a SpringBoot application with which Angular application communicates.
A new application (SpringBoot) is created which wants to use Kafka and needs to fetch data from "XYZ" application. And it wants to do this using Kafka.
The "XYZ" application has an example endpoint [GET] api/message/all which displays all messages.
Is there a way to "connect" Kafka directly to this endpoint and read data from it? In short, the idea is for Kafka to consume data directly from the endpoint. This is communication between two microservices, where one microservice does not use Kafka.
What suggestions do you have for solving this situation? I guess this option is not possible. Is it necessary to add a publisher in application XYZ which will send data to the queue, so that only then is it available for consumption by the new application?
Getting them via the REST-Interface might not be a very good idea.
Simply put, in the messaging world, message delivery guarantees are a big topic, and the standard ways to solve that with Kafka are usually:
1. Producing messages from your service to a Kafka topic using the Producer API.
2. Using Kafka Connect to read from an outbox table.
Since you most likely have a database already attached to your API service, the problem of dual writes can arise if you choose to produce the messages directly to a topic. This means that the write to the database might fail while the write to Kafka succeeds, or vice versa, so you can end up with inconsistent state. Depending on your use case this might or might not be a problem.
Nevertheless, to overcome that, the outbox pattern can come in handy.
With the outbox pattern, you write your messages to a table, a so-called outbox table, and then use Kafka Connect to poll this table. Kafka Connect is basically a cluster of workers that read this database table and forward its entries to a Kafka topic. You might want to look at Confluent Cloud, which offers a fully managed Kafka Connect service, so you don't have to manage the cluster of workers yourself. Once you have the messages in a Kafka topic, you can consume them with the standard Kafka Consumer API or Streams API.
What you're looking for is a Source-Connector.
More precisely, a source connector for your specific database, e.g. MongoDB: https://www.confluent.io/hub/mongodb/kafka-connect-mongodb
For now, most source-connectors produce in an at-least-once fashion. This means that the topic you configure the connector to write to might contain a message twice. So make sure that if you need them to be consumed exactly once, you think about deduplicating these messages.
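To make the deduplication idea concrete, here is a minimal sketch of an idempotent handler in plain Java. It assumes every message carries a stable, unique id set by the producer (for example the primary key of the outbox row); the class and method names are hypothetical and this is not a Kafka or Kafka Connect API. The in-memory set is only for illustration, since a real service would most likely keep the seen ids in a durable store.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: skips messages whose id has already been handled.
class DeduplicatingHandler {
    // Ids that were already processed. Assumption: in production this would be
    // a durable store (e.g. a table with a unique constraint), not a HashSet.
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> processedPayloads = new ArrayList<>();

    // Returns true if the message was processed, false if it was a duplicate.
    synchronized boolean handle(String messageId, String payload) {
        if (!seenIds.add(messageId)) {
            return false; // redelivered message: skip the side effects
        }
        processedPayloads.add(payload); // stand-in for the real business logic
        return true;
    }

    synchronized List<String> processedPayloads() {
        return List.copyOf(processedPayloads);
    }
}
```

With this in place, a message that the connector writes twice reaches the business logic only once, as long as both copies carry the same id.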

spring integration reading many files

We have a requirement to parse lots of incoming files (arriving in a directory), process them, and put the outcome for each file onto AWS Kinesis.
The frequency of files can be 60,000 per day and files can arrive every 15 seconds. Each file may contain about 1000 entries.
Can spring-integration handle this load?
Would there be any issues processing this kind of volume?
As the files are coming in on to an inbound-channel-adapter can we execute a service-activator for each file?
I believe we need to use task-executors on channels with poller? Any examples?
Would task-executors call the service-activators in a multi-threaded manner?
Any pointers would be helpful. Links to any code examples would be nice.
This is not the kind of question one asks here on SO: it is too broad and there are too many questions in a single thread. I assume that even if I answer all of them, you are going to ask more, and SO is not good for Q&A chat. Anyway:
Yes, Spring Integration can handle this. You can use simple FileReadingMessageSource to poll the directory periodically.
Each file (an outbound message payload) can be fed to the FileSplitter to parse it line by line.
After splitter you indeed can use an ExecutorChannel to process those lines in parallel.
The Service Activator can be called in a multi-threaded environment as long as it is thread-safe.
In the end you can use KinesisMessageHandler to send records to AWS Kinesis. And yes, this one can be used from different threads as well.
You can find all of this information in the Spring Integration Reference Manual. Some of the Samples may help you as well, and the Spring Integration AWS Extension is there for you too.
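As a rough illustration of the "split, then process in parallel" idea, here is a plain-JDK sketch. It does not use the Spring Integration API: a fixed thread pool stands in for the ExecutorChannel, and the `handler` function stands in for the Service Activator. All names are hypothetical, and the handler is assumed to be thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical stand-in for "splitter -> executor channel -> service activator".
class ParallelLineProcessor {
    // Applies the handler to every line on a pool of worker threads and
    // returns the results in the same order as the input lines.
    static <R> List<R> processLines(List<String> lines, Function<String, R> handler, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String line : lines) {
                tasks.add(() -> handler.apply(line)); // handler must be thread-safe
            }
            List<R> results = new ArrayList<>();
            // invokeAll waits for all tasks and preserves the task order
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while processing lines", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A line handler failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `ParallelLineProcessor.processLines(lines, String::length, 4)` would compute the length of every line on four threads. In a real flow the handler would parse the entry and hand the record to the Kinesis sender.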

Using Akka.net / Actor System for an ETL process

I'm new to the world of actor modeling and I am in love with the idea. But does a pattern exist for processing a batch of messages for bulk storage in a safe manner?
I'm afraid of the case where I have read 400 of an expected 500 messages into a list and the system shuts down: I don't want to lose those 400 messages from the (persisted) mailbox. In a service bus world you could ask for a batch of messages and commit all of them only once they have been processed. Thank you.
You may want to combine your actor system with a service bus or reliable queue, like RabbitMQ or Azure Service Bus, and use the actor system only for message processing.
Within Akka.NET itself, you have the persistence extension, which can be used for storing actor state in a persistent backend of your choice. It also contains a dedicated kind of actor, AtLeastOnceDeliveryActor, that may be used to resend messages until they are confirmed.
You can extend split and aggregate in your ESB to do it; I did something similar with Mule ESB a long time ago.

Spring Integration JMS Threadsafe

I'm pretty new to Spring Integration and still trying to get my head around it. Right now I'm just trying to understand if the example I've found here is actually safe across multiple threads:
https://github.com/spring-projects/spring-integration-samples/blob/master/basic/jms/src/test/java/org/springframework/integration/samples/jms/ChannelAdapterDemoTest.java
My use case is as follows:
Send request to queue with JMS Reply-to as a temporary queue
Wait for response to be received on the temporary queue
Need this to happen synchronously within a method -- I don't want to split it up and make it asynchronous across several methods
Will the above example work for this? If not, am I barking up the wrong tree?
Thanks in advance.
That sample is pretty simple; it just sends the message to stdout so, yes, it's perfectly thread safe.
For the request/reply scenario you are talking about, you need to use a <gateway/> - see the other example in that sample project. In that case, you can see that the message is handled by 'demoBean' which, again, is perfectly thread safe.
For a real application, thread safety depends on the code in the services invoked by the flow receiving the message.
If you wish, you can use Spring Integration on the client side too (with an outbound gateway).
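To show why a per-request temporary reply queue keeps a synchronous call safe across threads, here is a plain-JDK sketch. It uses in-memory BlockingQueues instead of JMS, so none of this is the Spring Integration or JMS API; the class names and the echo responder are hypothetical. Each caller creates its own reply queue and blocks on it, so concurrent callers cannot receive each other's replies.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory model of "send with Reply-To, then wait for the reply".
class SyncRequestReply {
    // A request carries its own reply destination, like a JMS Reply-To header.
    static final class Request {
        final String body;
        final BlockingQueue<String> replyTo;

        Request(String body, BlockingQueue<String> replyTo) {
            this.body = body;
            this.replyTo = replyTo;
        }
    }

    private final BlockingQueue<Request> requestQueue;

    SyncRequestReply(BlockingQueue<Request> requestQueue) {
        this.requestQueue = requestQueue;
    }

    // Sends one request and blocks until the reply arrives or the timeout expires.
    String sendAndReceive(String body, long timeoutMillis) {
        BlockingQueue<String> temporaryReplyQueue = new LinkedBlockingQueue<>();
        requestQueue.add(new Request(body, temporaryReplyQueue));
        try {
            String reply = temporaryReplyQueue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (reply == null) {
                throw new IllegalStateException("No reply within " + timeoutMillis + " ms");
            }
            return reply;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for reply", e);
        }
    }

    // Stand-in for the remote service: answers every request with "echo: <body>".
    static Thread startEchoResponder(BlockingQueue<Request> requestQueue) {
        Thread responder = new Thread(() -> {
            try {
                while (true) {
                    Request request = requestQueue.take();
                    request.replyTo.add("echo: " + request.body);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        responder.setDaemon(true);
        responder.start();
        return responder;
    }
}
```

The whole exchange happens inside `sendAndReceive`, which matches the requirement of keeping the request and the wait for the response in a single method.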

When to use persistence with Java Messaging and Queuing Systems

I'm performing a trade study on (Java) Messaging & Queuing systems for an upcoming re-design of a back-end framework for a major web application (on Amazon's EC2 Cloud, x-large instances). I'm currently evaluating ActiveMQ and RabbitMQ.
The plan is to have 5 different queues, with one being a dead-letter queue. The number of messages sent per day will be anywhere between 40K and 400K. As I plan for the message content to be a pointer to an XML file location on a data store, I expect the messages to be about 64 bytes. However, for evaluation purposes, I would also like to consider sending raw XML in the messages, with an average file size of 3KB.
My main questions: When/how many messages should be persisted on a daily basis? Is it reasonable to persist all messages, considering the amounts I specified above? I know that persisting will decrease performance, perhaps by a lot. But, by not persisting, a lot of RAM is being used. What would some of you recommend?
Also, I know that there is a lot of information online regarding ActiveMQ (JMS) vs RabbitMQ (AMQP). I have done a ton of research and testing. It seems like either implementation would fit my needs. Considering the information that I provided above (file sizes and # of messages), can anyone point out a reason(s) to use a particular vendor that I may have missed?
Thanks!
> When/how many messages should be persisted on a daily basis? Is it reasonable to persist all messages, considering the amounts I specified above?
JMS persistence doesn't replace a database; it should be considered a short-lived buffer between producers and consumers of data. That said, the volume and size of messages you mention won't tax the persistence adapters of any modern, properly configured JMS system, and persistence can be used to buffer messages for extended durations as necessary (just use a reliable message store architecture).
> I know that persisting will decrease performance, perhaps by a lot. But, by not persisting, a lot of RAM is being used. What would some of you recommend?
In my experience, enabling message persistence isn't a significant performance hit and is almost always done to guarantee message delivery. For most applications, the processes upstream (producers) or downstream (consumers) end up being the bottlenecks (especially database I/O), not the JMS persistence stores.
> Also, I know that there is a lot of information online regarding ActiveMQ (JMS) vs RabbitMQ (AMQP). I have done a ton of research and testing. It seems like either implementation would fit my needs. Considering the information that I provided above (file sizes and # of messages), can anyone point out a reason(s) to use a particular vendor that I may have missed?
I have successfully used ActiveMQ on many projects for both low and high volume messaging. I'd recommend using it along with a routing engine like Apache Camel to streamline integration and complex routing patterns.
A messaging system should be used as temporary storage. Applications should be designed to pull messages off the queue as soon as possible: the more messages that accumulate, the lower the performance. If you consume messages promptly, you get better performance as well as lower memory usage. Whether messages are persistent or not, memory will still be used, because messages are kept in memory for better performance and are backed up to disk only if they are persistent.
The decision on message persistence depends on how critical a message is and whether it needs to survive a restart of the messaging provider.
You may want to have a look at IBM WebSphere MQ. It can meet your requirements. It has JMS as well as proprietary APIs for developing applications.
ActiveMQ is a good choice for open source JMS; more expensive ones I can recommend are TIBCO EMS or maybe Solace.
But JMS is actually built for once-only delivery, and longer persistence is left out of the specification. You could of course use a database, but that's heavyweight and possibly expensive.
What I would recommend (note: I work for CodeStreet) is our 'ReplayService for JMS'. It lets you store any type of JMS message (or native WebSphere MQ ones) in high-performance file-based disk storage. Each message is automatically assigned a nanosecond timestamp and a globalMsgID that you can overwrite on publication. So the XML messages could be recorded by the ReplayServer, and your actual message could just contain the globalMsgID as a reference, and perhaps some properties.
Once a receiver receives the globalMsgID, it could then replay that message from the ReplayServer, if needed.
On the other hand, 400K messages of 3 KB XML each (roughly 1.2 GB per day) should be easily doable for ActiveMQ or others. Also, you should compress your XML messages before sending.
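As a sketch of the compression step, the JDK's java.util.zip classes can gzip the XML before it goes into the message body and restore it on the receiving side. The helper class name is hypothetical, and wiring the resulting bytes into a JMS BytesMessage (or an AMQP body) is left out. How much you save depends on the payload; repetitive XML tends to compress well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for gzip-compressing XML payloads before sending.
class XmlCompression {
    // Encodes the XML as UTF-8 and gzips it into a byte array.
    static byte[] compress(String xml) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return buffer.toByteArray();
    }

    // Reverses compress(): gunzips the bytes and decodes them as UTF-8.
    static String decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The producer would send `XmlCompression.compress(xml)` as the message body and the consumer would call `XmlCompression.decompress(bytes)` before parsing.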
