Spring Boot / MongoDB returns error code 251

We use Spring Boot 3.0.2 (which in turn uses the mongo driver 4.8.2). We use MongoDB 6.0.4, replica set size 1.
We use the reactive stack and the transactional operator to demarcate the transactions like this:
Mono.just(initialState)
    .flatMap(queryDatabase)
    .flatMap(createMongoDocument1InCollectionA)
    .flatMap(updateMongoDocument2InCollectionB)
    .as(transactionalOperator::transactional)
    .retryWhen(retrySpec)
We use read and write concern majority (even though it is not really relevant with a replica set of size 1). All other settings, e.g. session synchronization, are at their defaults.
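For context, the transaction manager and operator setup corresponds roughly to the following sketch (the configuration class and bean method names are illustrative, not necessarily our exact code):

import com.mongodb.ReadConcern;
import com.mongodb.TransactionOptions;
import com.mongodb.WriteConcern;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.ReactiveMongoDatabaseFactory;
import org.springframework.data.mongodb.ReactiveMongoTransactionManager;
import org.springframework.transaction.reactive.TransactionalOperator;

@Configuration
class MongoTxConfig {

    @Bean
    ReactiveMongoTransactionManager transactionManager(ReactiveMongoDatabaseFactory factory) {
        // transactions read and write with majority concern, as described above
        TransactionOptions options = TransactionOptions.builder()
                .readConcern(ReadConcern.MAJORITY)
                .writeConcern(WriteConcern.MAJORITY)
                .build();
        return new ReactiveMongoTransactionManager(factory, options);
    }

    @Bean
    TransactionalOperator transactionalOperator(ReactiveMongoTransactionManager txManager) {
        return TransactionalOperator.create(txManager);
    }
}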
If we run this code in parallel in multiple threads, it often (about 20% of the time) fails with the following message:
Command failed with error 251 (NoSuchTransaction): 'Given transaction number 8 does not match any in-progress transactions. The active transaction number is 7' on server localhost:49633. The full response is {"errorLabels": ["TransientTransactionError"], "ok": 0.0, "errmsg": "Given transaction number 8 does not match any in-progress transactions. The active transaction number is 7", "code": 251, "codeName": "NoSuchTransaction", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1675160699, "i": 11}}, "signature": {"hash": {"$binary": {"base64": "75m2LsbMsyLuFwtSHZPInaFs4Lo=", "subType": "00"}}, "keyId": 7194760344735055879}}, "operationTime": {"$timestamp": {"t": 1675160699, "i": 11}}}
The retry normally helps here, but not always (when the maximum number of attempts is exceeded). We did not observe this error with just one thread executing a single request. We also noticed that if the parallel requests do not try to modify the same Mongo document, this error occurs much less often (maybe about 5%).
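The retrySpec in the snippet above is not shown here; as a rough sketch, one that retries only errors carrying the TransientTransactionError label (the attempt count and backoff are illustrative, not our real values) could look like this:

import java.time.Duration;
import com.mongodb.MongoException;
import reactor.util.retry.Retry;

class RetryConfig {
    // retry only errors labelled TransientTransactionError, with capped exponential backoff
    static final Retry RETRY_SPEC = Retry.backoff(5, Duration.ofMillis(100))
            .filter(ex -> ex instanceof MongoException
                    && ((MongoException) ex).hasErrorLabel(MongoException.TRANSIENT_TRANSACTION_ERROR_LABEL));
}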
We observed the same error occurring even more often (about 60%) with Spring Boot 2.7.6 (mongo driver 4.6.1) and MongoDB 6.0.4 (or MongoDB 4.2.0). The error message was:
Query failed with error code 251 and error message 'Given transaction number 1 does not match any in-progress transactions. The active transaction number is -1' on server localhost:49372;
This issue would cause significant disruption in a high-concurrency production environment.
Any help explaining and fixing this issue would be much appreciated.

Related

Elasticsearch 7.x circuit breaker - data too large - troubleshoot

The problem:
Since upgrading from ES-5.4 to ES-7.2 I started getting "data too large" errors when sending concurrent bulk requests (and/or search requests) from my multi-threaded Java application (using the elasticsearch-rest-high-level-client-7.2.0.jar Java client) to an ES cluster of 2-4 nodes.
My ES configuration:
Elasticsearch version: 7.2
custom configuration in elasticsearch.yml:
thread_pool.search.queue_size = 20000
thread_pool.write.queue_size = 500
I use only the default 7.x circuit-breaker values, such as:
indices.breaker.total.limit = 95%
indices.breaker.total.use_real_memory = true
network.breaker.inflight_requests.limit = 100%
network.breaker.inflight_requests.overhead = 2
The error from elasticsearch.log:
{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
        "bytes_wanted": 3144831050,
        "bytes_limit": 3060164198,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
    "bytes_wanted": 3144831050,
    "bytes_limit": 3060164198,
    "durability": "PERMANENT"
  },
  "status": 429
}
Thoughts:
I'm having a hard time pinpointing the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb VM), the problem becomes very visible, so one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.
Questions:
I would like to understand which scenarios could have led to this error,
and what action I can take in order to handle it properly
(change circuit-breaker values, change the es.yml configuration, change/limit my ES requests).
The reason is that the node's heap is pretty full, and being caught by the circuit breaker is a good thing because it prevents the nodes from running into OOMs, going stale and crashing...
Elasticsearch 6.2.0 introduced the circuit breaker and improved it in 7.0.0. With the version upgrade from ES-5.4 to ES-7.2, you are running straight into this improvement.
I see 3 solutions so far:
Increase heap size if possible
Reduce the size of your bulk requests if feasible (a sketch follows this list)
Scale out your cluster, as the shards are consuming a lot of heap, leaving nothing to process the large request. More nodes will help the cluster distribute the shards and requests among more nodes, which leads to a lower average heap usage on all nodes.
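One way to keep individual bulk requests small with the same high-level REST client is a BulkProcessor; the following is a minimal sketch (the flush thresholds are illustrative, not tuned recommendations):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

class BulkHelper {
    // Flush either every 500 actions or every 5 MB, whichever comes first,
    // so a single bulk request never grows large enough to trip the parent breaker.
    static BulkProcessor bulkProcessor(RestHighLevelClient client) {
        return BulkProcessor.builder(
                (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                new BulkProcessor.Listener() {
                    @Override public void beforeBulk(long executionId, BulkRequest request) { }
                    @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) { }
                    @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) { }
                })
                .setBulkActions(500)
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
                .setConcurrentRequests(1)
                .build();
    }
}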
As an UGLY workaround (not solving the issue) one could increase the limit after reading and understanding the implications:
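With the same client, raising the parent breaker limit dynamically could look roughly like this (the 98% value is purely illustrative and, again, only hides the underlying memory pressure):

import java.io.IOException;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

class BreakerLimitWorkaround {
    // Transient setting: it reverts on a full cluster restart, which suits a temporary workaround.
    static void raiseParentBreakerLimit(RestHighLevelClient client) throws IOException {
        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
        request.transientSettings(Settings.builder()
                .put("indices.breaker.total.limit", "98%")
                .build());
        client.cluster().putSettings(request, RequestOptions.DEFAULT);
    }
}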
So I've spent some time researching how exactly ES implemented the new circuit-breaker mechanism, and tried to understand why we were suddenly getting those errors.
The circuit-breaker mechanism has existed since the very first versions.
We started experiencing issues around it when moving from version 5.4 to 7.2.
In version 7.2 ES introduced a new way of calculating the circuit break: circuit-breaking based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767).
In our internal upgrade of ES to version 7.2, we changed the JDK from 8 to 11.
Also as part of our internal upgrade we changed the jvm.options default configuration, replacing the officially recommended CMS GC with the G1GC garbage collector, which at the time had fairly new support in Elasticsearch.
Considering all the above, I found this bug, fixed in version 7.4, regarding the use of the circuit breaker together with G1GC: https://github.com/elastic/elasticsearch/pull/46169
How to fix:
Change the configuration back to the CMS GC,
or take the fix: it is just a configuration change that can easily be applied and tested in your deployment.

In Substrate what does code: 1012 "Transaction is temporarily banned" mean?

The full text of the message is :
{code: 1012, message: "Transaction is temporarily banned"}
This would indicate that the transaction is held somewhere in the Substrate runtime mempool, or something of that nature, but it is not entirely clear what causes can trigger this, or what the eventual outcome might be.
For example,
1) is it that too many transactions have been sent from a given account, IP address or other? Has some threshold been reached?
2) is the transaction actually invalid, or not?
3) The use of the word "temporary" suggests a delay in processing, not an outright rejection of the transaction. Therefore does this suggest that the transaction is valid, but delayed? If so, for how long?
The comments in the Substrate runtime's core/rpc/src/author/errors.rs and core/transaction-pool/graph/src/errors.rs are no clearer about what the outcome is.
In front of the mempool there is a transaction blacklist, which can trigger this error. Specifically, it means that a transaction with the same hash was either:
Part of a recently mined block
Detected as invalid during block production and removed from the pool.
Additionally, this error can occur when:
The transaction reaches its longevity, i.e. it is not mined for TransactionValidation::longevity blocks after being imported into the pool.
By default longevity is set to u64::max, so this should normally not be the problem.
In any case, -ltxpool=log should reveal more details around this error.
A transaction is only temporarily banned because it will be removed from the blacklist when either:
30 minutes pass
There are more than 4,000 transactions on the blacklist
Check out core/transaction-pool/graph/src/rotator.rs.

Can MAX_UTILIZATION for PROCESSES reached cause "Unable to get managed connection" Exception?

A JBoss 5.2 application server log was filled with thousands of the following exception:
Caused by: javax.resource.ResourceException: Unable to get managed connection for jdbc_TestDB
at org.jboss.resource.connectionmanager.BaseConnectionManager2.getManagedConnection(BaseConnectionManager2.java:441)
at org.jboss.resource.connectionmanager.TxConnectionManager.getManagedConnection(TxConnectionManager.java:424)
at org.jboss.resource.connectionmanager.BaseConnectionManager2.allocateConnection(BaseConnectionManager2.java:496)
at org.jboss.resource.connectionmanager.BaseConnectionManager2$ConnectionManagerProxy.allocateConnection(BaseConnectionManager2.java:941)
at org.jboss.resource.adapter.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:96)
... 9 more
Caused by: javax.resource.ResourceException: No ManagedConnections available within configured blocking timeout ( 30000 [ms] )
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.getConnection(InternalManagedConnectionPool.java:311)
at org.jboss.resource.connectionmanager.JBossManagedConnectionPool$BasePool.getConnection(JBossManagedConnectionPool.java:689)
at org.jboss.resource.connectionmanager.BaseConnectionManager2.getManagedConnection(BaseConnectionManager2.java:404)
... 13 more
I've stripped off the first part of the exception, which is basically our internal JDBC wrapper code which tries to get a DB connection from the pool.
Looking at the Oracle DB side I ran the query:
select resource_name, current_utilization, max_utilization, limit_value
from v$resource_limit
where resource_name in ('sessions', 'processes');
This produced the output:
RESOURCE_NAME  CURRENT_UTILIZATION  MAX_UTILIZATION  LIMIT_VALUE
processes      1387                 1500             1500
sessions       1434                 1586             2272
Given the fact that the PROCESSES limit of 1500 was reached, would this cause the JBoss exceptions we experienced? I've also been investigating the possibility of connection leaks, but haven't found any evidence of that so far.
What is the recommended course of action here? Is simply increasing the limit a valid solution?
Usually when MAX_UTILIZATION reaches the PROCESSES limit, the listener will refuse new connections to the database. You can see the related errors in the alert log. To solve this on the database side, you should increase the PROCESSES parameter.
Hmm, strange. Is it possible that exception wrapping in JBoss hides the original error? You should get an SQL exception whose text starts with ORA-. Maybe your JDBC wrapper does not handle errors properly.
The recommended actions are to:
check the configured size of the connection pool against the PROCESSES and SESSIONS Oracle startup parameters.
check Oracle's view v$session, especially the columns STATUS, LAST_CALL_ET, SQL_ID and PREV_SQL_ID.
translate SQL_ID (PREV_SQL_ID) into SQL text via v$sql (see the sketch after this list).
if your application has a connection leak, SQL_ID and PREV_SQL_ID might point you to the place in your source code where a connection was last used (i.e. where it was leaked).
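A minimal sketch of the v$session / v$sql check as a standalone JDBC program (the connection URL, credentials and exact query are placeholders, not a definitive diagnostic):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class SessionLeakCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; point this at the same database as jdbc_TestDB.
        String url = "jdbc:oracle:thin:@//dbhost:1521/TESTDB";
        String sql = "select s.status, s.last_call_et, nvl(s.sql_id, s.prev_sql_id) sql_id, q.sql_text"
                   + "  from v$session s left join v$sql q on q.sql_id = nvl(s.sql_id, s.prev_sql_id)"
                   + " where s.type = 'USER'";
        try (Connection con = DriverManager.getConnection(url, "scott", "tiger");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // Long LAST_CALL_ET values on INACTIVE sessions are good leak candidates.
                System.out.printf("%-10s %8d %-15s %s%n",
                        rs.getString("STATUS"), rs.getLong("LAST_CALL_ET"),
                        rs.getString("SQL_ID"), rs.getString("SQL_TEXT"));
            }
        }
    }
}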

Spring Batch Restart - Not picking up correct row

I have a spring batch job that has the following attributes:
commit-interval: 25
skip-limit: 3
In my integration tests, I have injected a fake writer that throws the skippable exception; it in turn is injected with a list of ids that cause the exception to be thrown.
In the setup of my test I create 135 rows, and I configure rows
"9", "11", "44", "51", "70"
to be the ones that cause the ItemWriter to throw the exception.
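For illustration, the fake writer is conceptually something like the following sketch (class and field names are hypothetical, and the actual skippable exception type depends on the step configuration):

import java.util.List;
import java.util.Set;
import org.springframework.batch.item.ItemWriter;

public class FailingItemWriter implements ItemWriter<Long> {

    private final Set<Long> idsToFailOn;

    public FailingItemWriter(Set<Long> idsToFailOn) {
        this.idsToFailOn = idsToFailOn;
    }

    @Override
    public void write(List<? extends Long> items) throws Exception {
        for (Long id : items) {
            if (idsToFailOn.contains(id)) {
                // configured as a skippable exception in the step definition
                throw new IllegalStateException("Simulated failure for id " + id);
            }
        }
    }
}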
All works well on the first run and, as expected, the job fails after the 3 commits of 50, on row 51, or rather when "something" in the writer has detected a skippable exception that has now exceeded the limit of 3. Also, I have asserted that 9, 11 and 44 are registered in the skip listener, which I would expect.
I realise that the batch job has not individually wrapped the items in transactions before it fails, like it did for 9, 11 and 44, because it already knows that the skip limit is reached.
However, when I restart the job, the starting row is 74, not 51 as I would expect.
Therefore rows 51 to 73 are skipped?
I cannot figure this one out, or why it would skip the chunk that failed completely.
Any help would be appreciated.
David.
The fix for this bug will be included in the next release of Spring Batch:
https://jira.springsource.org/browse/BATCH-2122

How JTA/JTS handle transaction time out issue?

Below is my understanding of how JTA/JTS handle transaction timeouts, but I cannot find any document or material to back it up. Is my understanding right? Do you know of any material that covers this issue?
The application server iterates through all transactions to check for timeouts. If a transaction times out, the application server marks it for rollback and logs the details, but it neither throws an exception nor interrupts the transaction at that moment. When the transaction's thread later attempts to access another transactional resource (such as JDBC or JMS), the resource, which implements the JTA interfaces, checks the rollback flag before going further, and only then is a RollbackException thrown.
==========
Test Case 1:
Set transaction timeout to 10 secs
I. Transaction begin
II. Sleep 20 secs
III. System out "Sleep end"
Result: the timeout occurs at the 10th second and the server logs the timeout details, but no exception is thrown. "Sleep end" is printed.
==========
Test Case 2:
Set transaction timeout to 10 secs
I. Transaction begin
II. Sleep 20 secs
III. Access db 1st time
IV. Access db 2nd time
V. System out "Sleep end"
Result: the timeout occurs at the 10th second and the server logs the timeout details, but no exception is thrown at that point. An exception is thrown on the first DB access, so "Sleep end" is not printed.
==========
Test Case 3:
Set transaction timeout to 10 secs
I. Transaction begin
II. Access db and db deadlock
Result: the timeout occurs at the 10th second and the server logs the timeout details. No exception is thrown and the transaction thread is stuck, so transaction timeout control cannot handle the DB deadlock. I am quite confused about this.
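For concreteness, Test Case 2 corresponds roughly to the following sketch (the JNDI lookup and data source wiring are container-specific and only schematic here):

import javax.naming.InitialContext;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

public class TransactionTimeoutTest {

    public void run(DataSource dataSource) throws Exception {
        UserTransaction utx = (UserTransaction)
                new InitialContext().lookup("java:comp/UserTransaction");
        utx.setTransactionTimeout(10);        // 10-second timeout for the next transaction
        utx.begin();
        Thread.sleep(20_000);                 // timeout fires here: transaction is only marked rollback-only and logged
        dataSource.getConnection().close();   // first DB access: the resource sees the rollback mark and an exception is thrown
        System.out.println("Sleep end");      // not reached in Test Case 2
        utx.commit();
    }
}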
In my understanding, the behavior above should be the same whether using Spring transaction management (JTA) or EJB. Am I right?
Thanks for your help!
Tested, and confirmed that my understanding is correct.
Summarize the result as below:
• Transaction timeout control only affects transactional activities (e.g. accessing a DB or sending a JMS message).
• The application server does not interrupt the current transaction thread immediately when the timeout occurs; instead, it only logs the details. A timeout exception is thrown when the transaction commits or attempts to access the next transactional resource.
• A DB deadlock cannot be handled by transaction timeout control, but DB2 has a deadlock detection mechanism that releases the deadlock and rolls back one of the transactions in some cases.