Spring Kafka and Transactions

I'd like to use Spring Kafka with transactions, but I don't really understand how it is supposed to be configured or how it works.
Here is my configuration:
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.RETRIES_CONFIG, String.valueOf(Integer.MAX_VALUE));
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
props.put(ProducerConfig.ACKS_CONFIG, "all");
This config is used in a DefaultKafkaProducerFactory with a transaction id prefix:
defaultKafkaProducerFactory.setTransactionIdPrefix("my_app.");
Problem 1:
How am I supposed to choose this transaction id prefix?
If I understand correctly, this prefix is used by Spring to generate a transactional id for each producer it creates.
Why couldn't we just use UUID.randomUUID()?
Problem 2:
If the producer is destroyed, it will generate a new transactional id.
Therefore, if the application crashes, on restart it will reuse old transactional ids.
Is that normal?
Problem 3:
My application is deployed in the cloud, where it can be scaled up/down automatically.
This means my prefix cannot be fixed, since the producers on each instance would have conflicting transactional ids.
Should I add a random part to it?
Do I need to recover the same prefix when an instance is scaled down/up or crashes and restarts?
Problem 4:
Last but not least, we are using credentials for our Kafka cluster.
This does not seem to work:
Current ACLs for resource `TransactionalId:my_app*`:
User:CN... has Allow permission for operations: All from hosts: *
How should I set my ACLs, knowing that my transactional ids are generated?
Edit 1
After further reading, if I understand correctly:
If you have a consumer C0 reading from partition P0, and the broker starts a consumer rebalance, P0 could be assigned to another consumer C1.
This consumer C1 should use the same transactional id as the previous C0 to prevent duplicates (zombie fencing)?
How do you achieve this in spring-kafka? The transaction id seems to have nothing to do with the consumer, and thus with the partition being read.
Thanks

You can't use a random TID because of zombie fencing: if the server crashes, you could have a partial transaction in the topic which will never be completed, and nothing more will be consumed from any partitions with a write for that transaction.
That is by design - for the above reason.
Again, you can't randomize; for the above reason.
Cloud Foundry, for example, has an environment variable that indicates the instance index. If you are using a cloud platform that doesn't include something like that, you would have to simulate it somehow. Then, use it in the transaction id:
spring.kafka.producer.transaction-id-prefix=foo-${instance.index}-
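If you build the producer factory yourself instead of using Boot properties, the equivalent would be something like the following sketch (the INSTANCE_INDEX variable and producerProps() helper are placeholders for whatever your platform and configuration provide):

// Sketch: derive the transaction id prefix from a platform-provided instance index.
// "INSTANCE_INDEX" is hypothetical - e.g. Cloud Foundry exposes CF_INSTANCE_INDEX.
String instanceIndex = System.getenv("INSTANCE_INDEX");
DefaultKafkaProducerFactory<String, String> pf =
        new DefaultKafkaProducerFactory<>(producerProps()); // producerProps(): your config map
pf.setTransactionIdPrefix("foo-" + instanceIndex + "-");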
ACLs - I can't answer that; I am not familiar with Kafka permissions; it might be better to ask a separate question for that.
I think we need to add some logic to Spring to ensure the same transaction id is always used for a particular topic/partition.
https://github.com/spring-projects/spring-kafka/issues/800#issuecomment-419501929
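To illustrate the idea (a conceptual sketch only, not spring-kafka's actual implementation): the transactional id must be a deterministic function of group/topic/partition, so that whichever instance takes over a partition after a rebalance produces with the same id and fences the old "zombie" producer:

// Conceptual sketch: a deterministic transactional id per group/topic/partition.
// The exact format here is illustrative, not what spring-kafka generates.
static String transactionalId(String prefix, String group, String topic, int partition) {
    return prefix + group + "." + topic + "." + partition;
}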
EDIT
Things have changed since this answer was written (KIP-447); if your brokers are 2.5.0 or later, see https://docs.spring.io/spring-kafka/docs/2.5.5.RELEASE/reference/html/#exactly-once and https://docs.spring.io/spring-kafka/docs/2.6.0-SNAPSHOT/reference/html/#exactly-once
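With KIP-447, the transactional id no longer needs to encode the group/topic/partition, because fencing is based on consumer group metadata instead. In spring-kafka 2.5+ this is selected on the listener container; a minimal sketch, assuming spring-kafka 2.5.x and brokers 2.5.0 or later:

// Sketch, assuming spring-kafka 2.5.x with brokers >= 2.5.0:
// select the KIP-447 ("beta") exactly-once mode on the container factory,
// so zombie fencing no longer depends on a partition-scoped transactional id.
ConcurrentKafkaListenerContainerFactory<String, String> containerFactory =
        new ConcurrentKafkaListenerContainerFactory<>();
containerFactory.getContainerProperties().setEosMode(ContainerProperties.EOSMode.BETA);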

Related

Kafka: Sarama, idempotence and transactional.id

Does Shopify/sarama provide an option similar to transactional.id in the JVM API?
The library supports idempotence (Config.Producer.Idempotent, similar to enable.idempotence), but I don't understand how to use it without transactional.id.
Please correct me if I'm wrong - the documentation about these options in Sarama is a bit lacking. But according to the JVM docs, idempotence without the identifier is limited to a single producer session. In other words, we lose the guarantee when the producer fails and restarts.
I found relevant properties in the source code and some tests (for example), but I don't understand how to use them externally.
Shopify/sarama provides Kafka exactly-once (idempotency) semantics with an idempotence-enabled producer, but for that the configuration below needs to be in place.
From Shopify/sarama/config.go:
if c.Producer.Idempotent {
    if !c.Version.IsAtLeast(V0_11_0_0) {
        return ConfigurationError("Idempotent producer requires Version >= V0_11_0_0")
    }
    if c.Producer.Retry.Max == 0 {
        return ConfigurationError("Idempotent producer requires Producer.Retry.Max >= 1")
    }
    if c.Producer.RequiredAcks != WaitForAll {
        return ConfigurationError("Idempotent producer requires Producer.RequiredAcks to be WaitForAll")
    }
    if c.Net.MaxOpenRequests > 1 {
        return ConfigurationError("Idempotent producer requires Net.MaxOpenRequests to be 1")
    }
}
How Shopify/sarama does this: there is a producerEpoch id in the AsyncProducer's transactionManager (see the file Shopify/sarama/async_producer.go). This id is initialised when the producer is initialised and incremented each time a message is successfully produced; read the bumpEpoch() function in async_producer.go to see this.
This is the sequence id for that producer session with the broker, and it is sent with each message, incrementing when a message is published successfully.
Read this example. It describes how idempotence works.
You are correct about the producer-session point: exactly-once is promised only for a single producer session. When restarting the producer just after a sequence failure, there can be a duplicate.
When the producer restarts, a new PID gets assigned, so idempotency is promised only for a single producer session. Even though the producer retries requests on failures, each message is persisted in the log exactly once. There can still be duplicates depending on the source from which the producer is getting data; Kafka won't take care of duplicate data received by the producer. So, in some cases, you may require an additional de-duplication system.
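For comparison with the JVM API the question refers to, here is a minimal sketch of the Java client's transactional producer (broker address, topic, and the transactional id are placeholders). The fixed transactional.id is exactly what survives a producer restart, and it is the piece with no direct equivalent in Sarama:

// Sketch of the JVM client's transactional producer, for comparison.
// "localhost:9092", "my-topic" and "my-fixed-txn-id" are placeholders.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-fixed-txn-id"); // fixed id -> fencing across restarts

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions(); // fences any older producer with the same transactional.id
producer.beginTransaction();
producer.send(new ProducerRecord<>("my-topic", "key", "value"));
producer.commitTransaction();
producer.close();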

Should I use SeekToCurrentErrorHandler with API restTemplate retry mechanism?

I am trying to write a Kafka consumer application in spring-kafka. As a consumer, I have to make sure I am not missing any records and that all records get processed.
My application design is like this:
Topics --> read records from topic --> dump into table A in an Oracle database --> pick records from table A --> call REST API to update records in system table B --> update the API response in table A --> commit records
Retry mechanism on the API level:
Now, if any of the records fails, meaning the response code is not as desired (400, 500, etc.), we would retry those records 2 times.
Retry mechanism on the topic level:
But what if I get an error while committing offsets? That means I need some kind of retry mechanism on the topic level as well. I went over the documentation and found an option: SeekToCurrentErrorHandler.
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory());
    factory.getContainerProperties().setAckOnError(false);
    factory.getContainerProperties().setAckMode(AckMode.RECORD);
    factory.setErrorHandler(new SeekToCurrentErrorHandler(new FixedBackOff(1000L, 2L)));
    return factory;
}
Now, as I understand it: suppose I am not able to commit any offsets; then, after adding the above code, delivery will be retried up to 2 times (3 delivery attempts) with a back off of 1 second. So does this mean my whole flow will be replayed twice? If that is true, do I need to add a retry mechanism on the API level separately?
I am just trying to understand how I can make my consumer application more resilient, so that I don't miss any records and have an error mechanism to handle errors/missed records. Please suggest.
It's best to avoid situations where the offsets can't be committed (make sure the max.poll.interval.ms is sufficient).
But, yes, if committing the offsets fails (and commitSync is true) then the record will be redelivered to the application. If commitSync is false, the failure will simply be logged (or sent to your listener) and the "next" offset for that partition will (presumably) be committed later.
Adding retry at the application level (e.g. using a RetryTemplate in the listener adapter - via the container factory) will still suffer from the same problem; it also can cause a rebalance if the retries take too long.
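For completeness, application-level retry via the container factory would look something like this sketch (assuming a spring-kafka version where the factory still exposes setRetryTemplate); the same rebalance caveat applies if retries take too long:

// Sketch: stateless application-level retry via the container factory.
// Long back-offs here can exceed max.poll.interval.ms and trigger a rebalance.
RetryTemplate retryTemplate = new RetryTemplate();
retryTemplate.setRetryPolicy(new SimpleRetryPolicy(3)); // 3 attempts total
FixedBackOffPolicy backOff = new FixedBackOffPolicy();
backOff.setBackOffPeriod(1000L); // 1 second between attempts
retryTemplate.setBackOffPolicy(backOff);
factory.setRetryTemplate(retryTemplate);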
If you really want to avoid reprocessing in this situation, you need to make your listener code idempotent - e.g. store the topic/partition/offset someplace to indicate you have already processed that record.
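A minimal sketch of that idea (the processed_offsets table, jdbcTemplate bean, and process() method are assumptions): record each processed topic/partition/offset and skip anything you have already seen:

// Sketch: idempotent listener that skips already-processed records.
// "processed_offsets", jdbcTemplate and process() are hypothetical.
@KafkaListener(topics = "my-topic")
public void listen(ConsumerRecord<String, String> record) {
    Integer seen = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM processed_offsets WHERE topic = ? AND part = ? AND off = ?",
            Integer.class, record.topic(), record.partition(), record.offset());
    if (seen != null && seen > 0) {
        return; // already processed - this is a redelivery after a failed commit
    }
    process(record); // your business logic
    jdbcTemplate.update(
            "INSERT INTO processed_offsets (topic, part, off) VALUES (?, ?, ?)",
            record.topic(), record.partition(), record.offset());
}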

Spring Integration Aggregation Error

I have a Spring Integration implementation where I have two client subscribers listening to the same JMS topic. I am using a JDBC message store (with different REGIONs) in both implementations to save the incoming messages. While processing the data I get the exception:
org.springframework.dao.EmptyResultDataAccessException: Incorrect result size: expected 1, actual 0
I know about this Jira issue: https://jira.spring.io/browse/INT-2912
I can't upgrade the Spring version right now, and I am unable to understand the workaround: "The work-around is either to always use a different groupKey or to use separate tables for each Message Store. We will need to add a REGION column to the INT_GROUP_TO_MESSAGE as well."
How can I create a different groupKey?
My implementation is as follows:
<bean id="jdbcMessageStore"
      class="org.springframework.integration.jdbc.JdbcMessageStore"
      p:dataSource-ref="datasource"
      p:region="REPORTS"/>
<si:aggregator
    send-partial-result-on-expiry="false"
    message-store="jdbcMessageStore"
    discard-channel="discardedLogger"/>
The "groupKey" mentioned there is the correlation strategy result; by default it just uses the correlationId header.
You can use correlation-strategy-expression="'foo' + headers['correlationId']" in one application and correlation-strategy-expression="'bar' + headers['correlationId']" in the other, to use a different group key for each app.
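Applied to the configuration above, that would look something like this sketch (the 'foo' prefix is arbitrary; use a different literal in each application):

<!-- Sketch: prefixing the correlation id gives each app its own groupKey -->
<si:aggregator
    send-partial-result-on-expiry="false"
    message-store="jdbcMessageStore"
    correlation-strategy-expression="'foo' + headers['correlationId']"
    discard-channel="discardedLogger"/>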

spring integration message released twice from aggregator

I have a Spring Integration flow that starts with a file inbound channel adapter, picks up files, and passes them through the system as messages.
After a few components, the messages are aggregated at an "Aggregator", from which they are released based on release strategies or by a group timeout of 30 sec.
The downstream processing has another bunch of components, up to the final one.
The problem I am facing is this: when I send 33 files, which create 33 "groups/buckets" based on correlation ids and are aggregated at the "Aggregator", some of the files/messages seem to be "released" twice. I conclude this because I have a channel interceptor that shows a few messages passing through the "released" channel (which appears right after the aggregator) a second time, after having completed the downstream processing successfully the first time. Additionally, this behavior causes my application to not find a file and throw an exception. This leads me to conclude that the message bucket/group/corrId is somehow being "released" twice.
I have tried to debug this many ways, but essentially I want to know how a corrId/bucket, after being released and having successfully gone through all downstream components in a single thread, can be "released" again.
My question is, how can I debug this? I want to know what is making this message/bucket re-appear in the aggregator.
My aggregator is as follows,
<int:aggregator id="bufferedFiles" input-channel="inQueueForStage"
    output-channel="released" expire-groups-upon-completion="true"
    send-partial-result-on-expiry="true" release-strategy="releaseHandler"
    release-strategy-method="canRelease"
    group-timeout-expression="size() > 0 ? T(com.att.datalake.ifr.loader.utils.MessageUtils).getAggregatorTimeout(one, #sourceSnapshot) : -1">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}"
        task-executor="executor" />
</int:aggregator>
Explanation of the aggregator: the size() > 0 applies to EACH correlation bucket. Each of the 33 files I am sending will spawn/generate/create a new bucket because of the file name, so the aggregator will have 33 buckets/groups/corrIds, each containing only one file.
So the aggregator SpEL expression simply says that if no release strategy fires, release the bucket/group after 30 secs, provided the group indeed has at least some files.
My inbound channel adapter is as follows:
<int-file:inbound-channel-adapter id="files"
    channel="dispatchFiles" directory="${source.dir}" scanner="directoryScanner">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}" />
</int-file:inbound-channel-adapter>
Logs
Here is the log of the message completing the flow the first time. The completion time suggests it reached the last component, a "completionHandler" SA.
Explanation of the log: "cor" is the bucket/corrId that is being released twice. The reason I get the final exception is that the first time around, the file is removed from its original location and processed; so the second time, when this erroneous release happens, there is nothing there to process.
From the pictures it can be seen that the first batch/corrId/bucket is processed and finished around 11:09, and the second one is started around 11:10.
An important point I noticed: this behavior only happens when I have a global channel interceptor in which I am doing somewhat long processing. When this interceptor is commented out, the errors go away.
Question:
Is it possible for the aggregator to double-release a batch/corrId under any circumstances? How can I make the aggregator emit logs?
Thanks
Edit 10:15pm
The channel following the aggregator has an interceptor, as follows:
@Override
public Message<?> preSend(Message<?> message, MessageChannel channel) {
    LOGGER.info("******** Releasing from aggregator(interceptor), corrID:{} at time:{} ********",
            MessageUtils.getCorrelationId(message), new Date());
    finalReporter.callback(channel.toString(), message);
    return message;
}
From the aggregator down to the final completionHandler SA, I have single-threaded processing:
Aggregator -> releasedChannel -> some SA1 -> some channel -> ..... -> completionChannel -> completeSA
When I run with 33 partitions, let's follow corrId = "alh". The first time it is released, it looks like the following.
What it shows is that thread-5 released it and should process all the downstream components. But it leaves off mid-way, starts doing other things, and the message is picked up again by a different thread a little later, as follows.
That seems/seemed to be the problem.
Solution update:
I did the following 3 things to sort of work around it, for the moment:
For some reason, my interceptors were doing return super.preSend(message, channel) instead of simply return message. I changed it to the latter.
I had global channel interceptors; I removed the global ones and kept individual ones.
If the channel interceptors had any issues before returning, would that cause a new release?
Although I still see the scenario depicted in the pictures above, I am not getting double processing attempts, and as such it avoids the errors. I am still trying to make sense of this.
I understand it's too specific and difficult to explain; still, thanks for the time and comments...
However, yes, I think @GaryRussell is right: since you use expire-groups-upon-completion="true", some partial groups may be released by the group-timeout-expression, and new messages with the same correlationId will then form a new group, which is released by the next group timeout. Your size() > 0 isn't good either: it means a partial group will be released after that group timeout. Maybe size() > 1? Note that a group can't be size() == 0, because it is created on the first message; so, if the group exists, it contains at least one message. (Yes, a group can be empty, but in that case the aggregator should be marked with expire-groups-upon-completion="false"; the group is then marked as completed and doesn't allow new messages.)
After struggling with debugging and various blind scenarios, I believe I at least have a workaround and a possible root cause. I will try to outline all the things I modified.
Root cause:
My interceptors were calling a common class with a common callback method. This method, based on the name of the channel the request was coming from, would decide the appropriate action to take. The actions were essentially collecting data, incrementing counters, and persisting some information to the database.
It seems that some of them were having errors, and consequently the thread was dying and the message was being re-released. I am not entirely sure about this, so please correct me if that's not the case.
But after I fixed those errors, the re-release issue seems to have subsided or vanished altogether.
The reason it was hard to diagnose is that I could not see the errors thrown during the callback method invocations; maybe I was catching them, or maybe they were lost.
I also found that the issue only occurred with channel interceptors AFTER the aggregator. Interceptors before the aggregator did not present any issues, maybe because they were simpler...
To debug, I removed the interceptors and made the callback directly from various components (SAs), removed the global interceptors, and tried to add individual interceptors for specific channels.
Thanks for all the help.

Does Spring's PlatformTransactionManager require transactions to be committed in a specific order?

I am looking to retrofit our existing transaction API to use Spring’s PlatformTransactionManager, such that Spring will manage our transactions. I chained my DataSources as follows:
DataSourceTransactionManager -> LazyConnectionDataSourceProxy -> dbcp.PoolingDataSource -> OracleDataSource
In experimenting with the DataSourceTransactionManager, I have found that where PROPAGATION_REQUIRES_NEW is used, Spring's transaction management seems to require that transactions be committed/rolled back in LIFO fashion, i.e. you must commit/roll back the most recently created transaction first.
Example:
@Test
public void testSpringTxns() {
    // start a new txn
    TransactionStatus txnAStatus = dataSourceTxnManager.getTransaction(propagationRequiresNewDefinition); // specifies PROPAGATION_REQUIRES_NEW
    Connection connectionA = DataSourceUtils.getConnection(dataSourceTxnManager.getDataSource());
    // start another new txn
    TransactionStatus txnBStatus = dataSourceTxnManager.getTransaction(propagationRequiresNewDefinition);
    Connection connectionB = DataSourceUtils.getConnection(dataSourceTxnManager.getDataSource());
    assertNotSame(connectionA, connectionB);
    try {
        //... do stuff using connectionA
        //... do other stuff using connectionB
    } finally {
        dataSourceTxnManager.commit(txnAStatus);
        dataSourceTxnManager.commit(txnBStatus); // results in java.lang.IllegalStateException: Cannot deactivate transaction synchronization - not active
    }
}
Sadly, this doesn't fit at all well with our current transaction API, which allows you to create transactions, represented by Java objects, and commit them in any order.
My question:
Am I right in thinking that this LIFO behaviour is fundamental to Spring's transaction management (even for completely separate transactions)? Or is there a way to tweak its behaviour so that the above test passes?
I know the proper way would be to use annotations, AOP, etc., but at present our code is not Spring-managed, so that is not really an option for us.
Thanks!
Yes, I have met the same problem when using Spring:
java.lang.IllegalStateException: Cannot deactivate transaction synchronization - not active.
As described above, Spring's transaction management requires that transactions be committed/rolled back in LIFO fashion (stack behavior). Once I did that, the problem disappeared.
Thanks.
Yes, I found this same behavior in my own application. Only one transaction is "active" at a time, and when you commit/rollback the current transaction, the next active transaction is the next most recently started transaction (LIFO/stack behavior). I wasn't able to find any way to control this, it seems to be built into the Spring Framework.
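A minimal sketch of the implied fix for the test above: commit in reverse order of creation, so the synchronization stack unwinds innermost-first:

try {
    // ... work on connectionA and connectionB as before
} finally {
    dataSourceTxnManager.commit(txnBStatus); // commit B (the newest) first
    dataSourceTxnManager.commit(txnAStatus); // then A - no IllegalStateException
}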
