Losing an event when Reactor switches schedulers - Spring

Context: loading and streaming an Excel file as a Flux, processing the Flux records, and inserting them into a database using R2DBC.
implementation("org.apache.poi:poi:4.1.2") - the Apache library that provides the Excel domain model of workbooks/sheets/rows/cells
implementation("org.apache.poi:poi-ooxml:4.1.2")
implementation("com.monitorjbl:xlsx-streamer:2.1.0") - a streaming wrapper that avoids loading the entire Excel file into memory and parses the file in chunks
Converting the file to a Flux (extract the header as the first row, then glue it to every subsequent event/row emitted by the Flux):
override fun extract(inputStream: InputStream): Flux<Map<String, String>> {
    val workbook: Workbook = StreamingReader.builder()
        .rowCacheSize(10)   // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)   // buffer size to use when reading InputStream to file (defaults to 1024)
        .open(inputStream)
    val xsltRows = Flux.fromIterable(workbook).flatMap { Flux.fromIterable(it) }
    return xsltRows.next()
        .map { rowToHeader(it) }
        .flatMapMany { header -> xsltRows.map { combineToMap(header, it) } }
}
Subsequently I process this Flux into domain models for Spring R2DBC repositories and insert the entries into a MySQL database.
The problem: I am missing a single Excel row (out of roughly 2k). It is always the same row, but there is nothing special about the data in it.
Recall the combineToMap method, which associates the header name with every cell value; it also logs the row's logical sequence number as it appears in the file:
private fun combineToMap(header: Map<Int, String>, row: Row): Map<String, String> {
    val mapRow: MutableMap<String, String> = mutableMapOf()
    val logicalRowNum = row.rowNum + 1
    logger.info("Processing row: $logicalRowNum")
    for (cell in row) {
        if (cell.columnIndex >= header.keys.size) {
            continue
        }
        val headerName = header[cell.columnIndex].takeUnless { it.isNullOrBlank() }
            ?: throw IllegalStateException("No header name for ${cell.columnIndex} column index for header " +
                "$header and cell ${cell.stringCellValue} row index ${row.rowNum}")
        mapRow[headerName] = cell.stringCellValue
        mapRow["row"] = logicalRowNum.toString()
    }
    return mapRow
}
When I added the log line I noticed the following:
2020-11-22 15:49:56.684 INFO 20034 --- [ Test worker] c.b.XSLXFileRecordsExtractor : Processing row: 255
2020-11-22 15:49:56.687 INFO 20034 --- [ Test worker] c.b.XSLXFileRecordsExtractor : Processing row: 256
2020-11-22 15:49:56.689 INFO 20034 --- [ Test worker] c.b.XSLXFileRecordsExtractor : Processing row: 257
2020-11-22 15:50:02.458 INFO 20034 --- [tor-tcp-epoll-1] c.b.XSLXFileRecordsExtractor : Processing row: 259
2020-11-22 15:50:02.534 INFO 20034 --- [tor-tcp-epoll-1] c.b.XSLXFileRecordsExtractor : Processing row: 260
2020-11-22 15:50:02.608 INFO 20034 --- [tor-tcp-epoll-1] c.b.XSLXFileRecordsExtractor : Processing row: 261
Note that the scheduler is switched after row 257, and during the switch row 258 is lost. The tor-tcp-epoll-1 pool is, as I understand it, the Spring R2DBC internal pool.
In my downstream, if I return a static Mono.just(entity) instead of calling repository.save, I get row 258 back; notice that the scheduler is not switched either.
2020-11-22 16:01:14.000 INFO 21959 --- [ Test worker] c.b.XSLXFileRecordsExtractor : Processing row: 257
2020-11-22 16:01:14.006 INFO 21959 --- [ Test worker] c.b.XSLXFileRecordsExtractor : Processing row: 258
2020-11-22 16:01:14.009 INFO 21959 --- [ Test worker] c.b.XSLXFileRecordsExtractor : Processing row: 259
Is this a problem with the Excel libraries or with my implementation? Why am I losing the record when the thread pool is switched?
P.S. I am not specifying any schedulers or parallel blocks or anything else that messes with threads anywhere in my flow, apart from calling the Spring R2DBC repositories.
I will attempt a rewrite using implementation("org.apache.commons:commons-csv:1.8") and observe whether the same happens, but if anyone can spot anything obvious or has experienced something similar elsewhere, I would be infinitely grateful.

In the end I switched to commons-csv, which does not have the same problem:
2020-11-22 18:34:03.719 INFO 15733 --- [ Test worker] c.b.CSVFileRecordsExtractor : Processing row: 256
2020-11-22 18:34:09.062 INFO 15733 --- [tor-tcp-epoll-1] c.b.CSVFileRecordsExtractor : Processing row: 257
2020-11-22 18:34:09.088 INFO 15733 --- [tor-tcp-epoll-1] c.b.CSVFileRecordsExtractor : Processing row: 258
For the original approach, I tried publishing everything on one scheduler for xlsx-streamer and poi, and even forced Spring R2DBC to publish on the same single-threaded scheduler, and it still skipped the record.
What I could observe is that the moment the database callbacks start to arrive, regardless of which thread pool they come in on, is exactly when the record gets lost; it seems the iterator's context gets broken.
I mean, the xlsx library never claimed to be reactive, so there were no expectations there.
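For completeness, a rough sketch of what a commons-csv based extractor can look like (a Java sketch; class and method names are illustrative, not the actual CSVFileRecordsExtractor):

import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import reactor.core.publisher.Flux;

public class CsvRecordsExtractor {

    public Flux<Map<String, String>> extract(InputStream inputStream) {
        return Flux.using(
                // open the parser once per subscription; withFirstRecordAsHeader() replaces the
                // manual "take the first row as the header" step from the xlsx version
                () -> CSVFormat.DEFAULT.withFirstRecordAsHeader()
                        .parse(new InputStreamReader(inputStream, StandardCharsets.UTF_8)),
                // CSVParser is an Iterable<CSVRecord>; records are pulled lazily as the Flux is consumed
                parser -> Flux.fromIterable(parser).map(CSVRecord::toMap),
                parser -> {
                    try {
                        parser.close();
                    } catch (Exception ignored) {
                        // nothing useful to do on close failure in this sketch
                    }
                });
    }
}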

Related

TransientDataAccessResourceException - R2DBC pgdb connection remains in idle in transaction

I have a Spring Boot application using WebFlux and r2dbc-postgres. I have discovered a strange issue when trying to do some DB operations in a flatMap().
Code example:
@Transactional
public Mono<Void> insertDummyFooBars() {
    return Flux.fromIterable(IntStream.rangeClosed(1, 260).boxed().collect(Collectors.toList()))
            .log()
            .flatMap(i -> this.repository.save(FooBar.builder().foo("test-" + i).build()))
            .log()
            .concatMap(i -> this.repository.findAll())
            .then();
}
It seems that flatMap can process at most 256 elements in batches (Queues.SMALL_BUFFER_SIZE defaults to 256). So when I tried to run the code above (with 260 elements) I got a TransientDataAccessResourceException with the following message:
Cannot exchange messages because the request queue limit is exceeded; nested exception is io.r2dbc.postgresql.client.ReactorNettyClient$RequestQueueException
There is no "Releasing R2DBC Connection" after this exception. The pgdb connection/session remains in the "idle in transaction" state, and the app can no longer run properly once the pool's max size is reached and all of the connections are "idle in transaction". I think the connection should be released whether or not an exception happened.
If I use concatMap instead of flatMap, it works as expected: no exception, and the connection is released! It is also fine with flatMap when there are 256 or fewer elements.
Is it possible to force pgdb connection closure? What should I do if I have a lot of DB operations in flatMap like this? Should I replace all of them with concatMap? Is there a global solution for this?
Versions:
Postgres: 12.6, Spring Boot: 2.7.6
Demo project
LOG:
2022-12-08 16:32:13.092 INFO 17932 --- [actor-tcp-nio-1] reactor.Flux.Iterable.1 : | onNext(256)
2022-12-08 16:32:13.092 DEBUG 17932 --- [actor-tcp-nio-1] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [INSERT INTO foo_bar (foo) VALUES ($1)]
2022-12-08 16:32:13.114 INFO 17932 --- [actor-tcp-nio-1] reactor.Flux.FlatMap.2 : onNext(FooBar(id=258, foo=test-1))
2022-12-08 16:32:13.143 DEBUG 17932 --- [actor-tcp-nio-1] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT foo_bar.* FROM foo_bar]
2022-12-08 16:32:13.143 INFO 17932 --- [actor-tcp-nio-1] reactor.Flux.Iterable.1 : | request(1)
2022-12-08 16:32:13.143 INFO 17932 --- [actor-tcp-nio-1] reactor.Flux.Iterable.1 : | onNext(257)
2022-12-08 16:32:13.144 DEBUG 17932 --- [actor-tcp-nio-1] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [INSERT INTO foo_bar (foo) VALUES ($1)]
2022-12-08 16:32:13.149 INFO 17932 --- [actor-tcp-nio-1] reactor.Flux.Iterable.1 : | onComplete()
2022-12-08 16:32:13.149 INFO 17932 --- [actor-tcp-nio-1] reactor.Flux.Iterable.1 : | cancel()
2022-12-08 16:32:13.160 ERROR 17932 --- [actor-tcp-nio-1] reactor.Flux.FlatMap.2 : onError(org.springframework.dao.TransientDataAccessResourceException: executeMany; SQL [INSERT INTO foo_bar (foo) VALUES ($1)]; Cannot exchange messages because the request queue limit is exceeded; nested exception is io.r2dbc.postgresql.client.ReactorNettyClient$RequestQueueException: [08006] Cannot exchange messages because the request queue limit is exceeded)
2022-12-08 16:32:13.167 ERROR 17932 --- [actor-tcp-nio-1] reactor.Flux.FlatMap.2 :
org.springframework.dao.TransientDataAccessResourceException: executeMany; SQL [INSERT INTO foo_bar (foo) VALUES ($1)]; Cannot exchange messages because the request queue limit is exceeded; nested exception is io.r2dbc.postgresql.client.ReactorNettyClient$RequestQueueException: [08006] Cannot exchange messages because the request queue limit is exceeded
at org.springframework.r2dbc.connection.ConnectionFactoryUtils.convertR2dbcException(ConnectionFactoryUtils.java:215) ~[spring-r2dbc-5.3.24.jar:5.3.24]
at org.springframework.r2dbc.core.DefaultDatabaseClient.lambda$inConnectionMany$8(DefaultDatabaseClient.java:147) ~[spring-r2dbc-5.3.24.jar:5.3.24]
at reactor.core.publisher.Flux.lambda$onErrorMap$29(Flux.java:7105) ~[reactor-core-3.4.25.jar:3.4.25]
at reactor.core.publisher.Flux.lambda$onErrorResume$30(Flux.java:7158) ~[reactor-core-3.4.25.jar:3.4.25]
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94) ~[reactor-core-3.4.25.jar:3.4.25]
I have tried to change Queues.SMALL_BUFFER_SIZE, and I also tried adding a concurrency value to the flatMap. It works when I reduce the value to 255, but I don't think that is a good solution.
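For reference, a minimal sketch of the two variants described above as working (the FooBarRepository name is assumed from the demo project; the entity builder comes from the code in the question):

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Service
public class DummyFooBarWriter {

    private final FooBarRepository repository; // hypothetical reactive repository, as in the question

    public DummyFooBarWriter(FooBarRepository repository) {
        this.repository = repository;
    }

    // concatMap subscribes to one inner save at a time, so the driver's request queue never fills up.
    @Transactional
    public Mono<Void> insertSequentially() {
        return Flux.range(1, 260)
                .concatMap(i -> repository.save(FooBar.builder().foo("test-" + i).build()))
                .then();
    }

    // flatMap with a small, explicit concurrency keeps the number of in-flight statements
    // on the single transactional connection well below the 256 request-queue limit.
    @Transactional
    public Mono<Void> insertWithBoundedConcurrency() {
        return Flux.range(1, 260)
                .flatMap(i -> repository.save(FooBar.builder().foo("test-" + i).build()), 4)
                .then();
    }
}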

Why does my service activator poll multiple messages?

Given the setup https://gist.github.com/gel-hidden/0a8627cf93f5396d6b73c2a6e71aad3e, I would expect that when I send a message, the ServiceActivator would be called with a delay of 10 000 ms between messages.
The first channel takes in a list, then splits the messages and calls another QueueChannel. But for some reason each poll pulls all of the split messages. I know I am missing something stupid, or I'm just too stupid to understand what's happening.
Related test case: https://gist.github.com/gel-hidden/de7975fffd0853ec8ce49f9d6fa6531d
Output:
2022-10-26 15:22:02.708 INFO 78647 --- [ scheduling-1] com.example.demo.DemoApplicationTests : Received message Hello
2022-10-26 15:22:02.708 INFO 78647 --- [ scheduling-1] com.example.demo.UpdateLocationFlow : Doing some work for model with id 2
2022-10-26 15:22:03.009 INFO 78647 --- [ scheduling-1] com.example.demo.UpdateLocationFlow : Completed some work for model with id 2
2022-10-26 15:22:03.017 INFO 78647 --- [ scheduling-1] com.example.demo.DemoApplicationTests : Received message World
2022-10-26 15:22:03.018 INFO 78647 --- [ scheduling-1] com.example.demo.UpdateLocationFlow : Doing some work for model with id 3
2022-10-26 15:22:03.319 INFO 78647 --- [ scheduling-1] com.example.demo.UpdateLocationFlow : Completed some work for model with id 3
2022-10-26 15:22:04.322 INFO 78647 --- [ scheduling-1] o.s.i.a.AggregatingMessageHandler : Expiring MessageGroup with correlationKey[1]
My thinking is that the log should look something like:
00:01 Doing some work for model with id 2
00:02 Completed some work for model with id 2
00:12 Doing some work for model with id 3
00:13 Completed some work for model with id 3
So, it is a bug in Spring Integration around lifecycle management of the IntegrationFlowAdapter. It just starts twice.
As a workaround I suggest pulling your @ServiceActivator handle() into an individual component with its own @Poller configuration and an inputChannel and outputChannel, as sketched below. In other words, it must go outside of your UpdateLocationFlow. This way the IntegrationFlowAdapter won't have control over its lifecycle and won't start it twice.
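A rough sketch of that workaround (the class and channel names are illustrative; adjust them to the flow in the gist):

import org.springframework.integration.annotation.Poller;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.messaging.Message;
import org.springframework.stereotype.Component;

@Component
public class UpdateLocationHandler {

    // The poller lives on this standalone endpoint, outside the IntegrationFlowAdapter,
    // so the flow's lifecycle management cannot start it a second time.
    @ServiceActivator(
            inputChannel = "updateLocationInputChannel",
            outputChannel = "updateLocationOutputChannel",
            poller = @Poller(fixedDelay = "10000", maxMessagesPerPoll = "1"))
    public Message<?> handle(Message<?> message) {
        // do the actual work for the model here
        return message;
    }
}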
Meanwhile I'm looking into how to fix it.
Thank you for reporting this!

Kafka consumer does not fetch new records when using topic pattern and large messages

I hope one of you can help me.
I'm using Spring Boot 2.3.4 with Spring Kafka 2.5.6. I recently had to reset an offset and saw some strange behavior. We consumed the messages, but after every X (varying) messages there was a 10-second pause before consumption continued.
This is my configuration:
spring:
  kafka:
    bootstrap-servers: localhost:9092
    consumer:
      enable-auto-commit: false
      auto-offset-reset: earliest
      heartbeat-interval: 1000
      max-poll-records: 50
      group-id: kafka-fetch-demo
      fetch-max-wait: 10000
    listener:
      type: single
      concurrency: 1
      poll-timeout: 1000
      no-poll-threshold: 2
      monitor-interval: 10
      ack-mode: manual
    producer:
      acks: all
      batch-size: 0
      retries: 0
This is an example listener:
@KafkaListener(id = LISTENER_ID, idIsGroup = false, topicPattern = "#{demoProperties.getTopicPattern()}")
public void onEvent(Acknowledgment acknowledgment, ConsumerRecord<byte[], String> record) {
    log.info("Received record on topic {}, partition {} and offset {}",
            record.topic(),
            record.partition(),
            record.offset());
    acknowledgment.acknowledge();
}
Analysis
I figured out that the 10-second timeout comes from the fetch.max.wait.ms property. However, I'm not able to figure out why this property applies here.
As far as I understand, the fetch-max-wait property only determines the maximum time the broker waits before providing the consumer with new records when fetch.min.bytes has not yet been reached (which in my case is set to the default of 1 and should always be satisfied).
Furthermore, I observed that this problem only occurs when using topic patterns and "larger" messages.
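For reference, a minimal sketch of how those two fetch settings can be pinned programmatically to test that hypothesis (the bean wiring and values are illustrative, not part of the demo project):

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaFetchTuningConfig {

    // The broker answers a fetch as soon as fetch.min.bytes are available OR
    // fetch.max.wait.ms elapses, so lowering fetch.max.wait.ms (or verifying that
    // fetch.min.bytes really is 1) narrows down where the 10-second pause comes from.
    @Bean
    public ConsumerFactory<byte[], String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kafka-fetch-demo");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);      // the default
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);  // instead of 10000
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        return new DefaultKafkaConsumerFactory<>(props);
    }
}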
Reproduction
I uploaded an demo application on Github to reproduce the issue: https://github.com/kraennix/kafka-fetch-demo.
How I reproduced it:
I put a thousand messages of 17.1 KB each on a Kafka topic.
I start my consuming application, which listens to this topic via a topic pattern. Then you can see the stalling behaviour.
Note: if I do the same with "small" messages (89 bytes), it works as expected.
Logs
In the logs you can see the successful commit, but then it says "Skipping fetch":
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Commit list: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Committing: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Sending OffsetCommit request with {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}} to coordinator localhost:9092 (id: 2147483647 rack: null)
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Using older server API v7 to send OFFSET_COMMIT {group_id=kafka-fetch-demo,generation_id=4,member_id=consumer-kafka-fetch-demo-1-cf8e747f-531d-457a-aca8-18960c518ef9,group_instance_id=null,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,committed_offset=488,committed_leader_epoch=-1,committed_metadata=}]}]} with correlation id 62 to node 2147483647
2021-01-16 15:04:40.778 TRACE 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Completed receive from node 2147483647 for OFFSET_COMMIT with correlation id 62, received {throttle_time_ms=0,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,error_code=0}]}]}
2021-01-16 15:04:40.779 DEBUG 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Committed offset 488 for partition publish.LargeTopic.2.test-0
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
When the size of the messages changes, you might need to adjust the two properties below:
heartbeat-interval: 1000
max-poll-records: 50
Your heartbeat interval is 1 second and the maximum fetch wait (fetch-max-wait) is 10 seconds. If the messages are large and you process the consumed records on the same thread, the heartbeat check can fail by the time the next poll is triggered. Make sure to process the messages on an executor, e.g. using a Callable; see the sketch below.
Increase the heartbeat interval to 5-10 seconds and reduce max-poll-records to 15 when the message size is high. Hope this helps.
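A rough sketch of what handing the processing off to an executor might look like (class name and pool size are illustrative; whether acknowledging before the work completes is acceptable depends on your delivery requirements):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class OffloadingListener {

    // Separate pool for the (potentially slow) processing of large payloads,
    // so the listener/polling thread returns to poll() quickly.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    @KafkaListener(id = "offloading-listener", topicPattern = "#{demoProperties.getTopicPattern()}")
    public void onEvent(Acknowledgment acknowledgment, ConsumerRecord<byte[], String> record) {
        workers.submit(() -> process(record));
        // Acknowledged on the listener thread, before processing finishes: simple,
        // but effectively "at most once" for records still sitting in the worker pool.
        acknowledgment.acknowledge();
    }

    private void process(ConsumerRecord<byte[], String> record) {
        // heavy work with the large payload goes here
    }
}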

IntelliJ IDEA debug mode cannot be terminated

When I use IntelliJ IDEA's debug mode, I set a breakpoint and it pauses there successfully. When I do not want to go on and push the stop button, the process should be terminated, but the process carries on. For example:
logger.info("fixGsDataAccount start");
logger.info("before delete all cnt:{}",cnt);
logger.info("query duplicate data include self cnt:{}",cnt);
logger.info("delete duplicate data end!");
logger.info("after delete all cnt:{}",afterDeleteCnt);
logger.info("fixGsDataAccount end");
I set a breakpoint at the 3rd line:
logger.info("query duplicate data include self cnt:{}",cnt);
and push the stop button; the log should stop there, like below:
05-07 17:57:47.536 [main] [INFO ] fixGsDataAccount start -
c.wzt.web.datafix.FixGsDataAccount:33
05-07 17:57:47.540 [main] [INFO ] before delete all cnt:675 -
c.wzt.web.datafix.FixGsDataAccount:35
05-07 17:57:47.545 [main] [INFO ] query duplicate data include self cnt:1 -
c.wzt.web.datafix.FixGsDataAccount:37
but it shows the following:
05-07 17:57:47.536 [main] [INFO ] fixGsDataAccount start -
c.wzt.web.datafix.FixGsDataAccount:33
05-07 17:57:47.540 [main] [INFO ] before delete all cnt:675 -
c.wzt.web.datafix.FixGsDataAccount:35
05-07 17:57:47.545 [main] [INFO ] query duplicate data include self cnt:1 -
c.wzt.web.datafix.FixGsDataAccount:37
05-07 17:57:47.546 [main] [INFO ] fixGsDataAccount end -
c.wzt.web.datafix.FixGsDataAccount:53
The last logger statement still prints.
Enable the "Kill the debug process immediately" option in the debugger settings.

Tuning @JmsListener

Does @JmsListener use a poller under the hood, or is it message-driven? When testing with concurrency=1, it seems to read one message per second:
2016-06-23 09:09:46.117 INFO 13044 --- [enerContainer-1] c.s.s.core.service.PolicyChangedHandler : Received: 1: This is a test
2016-06-23 09:09:46.922 INFO 13044 --- [enerContainer-1] c.s.s.core.service.PolicyChangedHandler : Received: 2: This is a test
2016-06-23 09:09:47.730 INFO 13044 --- [enerContainer-1] c.s.s.core.service.PolicyChangedHandler : Received: 3: This is a test
2016-06-23 09:09:48.535 INFO 13044 --- [enerContainer-1] c.s.s.core.service.PolicyChangedHandler : Received: 4: This is a test
2016-06-23 09:09:49.338 INFO 13044 --- [enerContainer-1] c.s.s.core.service.PolicyChangedHandler : Received: 5: This is a test
2016-06-23 09:09:50.155 INFO 13044 --- [enerContainer-1] c.s.s.core.service.PolicyChangedHandler : Received: 6: This is a test
If it is polling, how do I adjust the polling rate or increase the number of messages read per poll?
If it is message-driven, I don't understand why it is so slow.
Yes, Spring's @JmsListener uses polling under the hood by default.
See DefaultMessageListenerContainer.
See also the default receiveTimeout, which is 1 second.
The receive timeout for each attempt can be configured through the "receiveTimeout" property. From the setReceiveTimeout javadoc:
Set the timeout to use for receive calls, in milliseconds. The default is 1000 ms, that is, 1 second.
NOTE: This value needs to be smaller than the transaction timeout used by the transaction manager (in the appropriate unit, of course). 0 indicates no timeout at all; however, this is only feasible if not running within a transaction manager and generally discouraged since such a listener container cannot cleanly shut down. A negative value such as -1 indicates a no-wait receive operation.
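If you need to change it, both the per-attempt receive timeout and the consumer concurrency can be set on the listener container factory; a minimal sketch, assuming the broker ConnectionFactory is provided by Spring Boot:

import javax.jms.ConnectionFactory;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.config.DefaultJmsListenerContainerFactory;

@Configuration
public class JmsTuningConfig {

    // The container still polls, but an explicit receiveTimeout and a concurrency range
    // make the behaviour tunable; @JmsListener picks this factory up by its default bean name.
    @Bean
    public DefaultJmsListenerContainerFactory jmsListenerContainerFactory(ConnectionFactory connectionFactory) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        factory.setConcurrency("3-10");   // min-max concurrent consumers
        factory.setReceiveTimeout(1000L); // per-attempt receive timeout in ms (1 second is the default)
        return factory;
    }
}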
