For the last few weeks, we have been having the following issue:
Some of our transactions, when sent via sendRawTransaction(), never get picked up by the network (if we look up the txid in the explorer, it's never there), and yet web3.js doesn't error out.
We use "@solana/web3.js": "^1.44.1"
This started happening about 2-3 weeks ago:
The issue affects certain sets of transactions that all share the same instructions and the same number of signers and accounts.
It repros 100% of the time for all transactions in those sets. No matter the state of the network or how many times we retry, they never get picked up.
We don't get any error back from web3.js, such as a transaction limit being hit.
They all work on devnet, but not on mainnet!
For one of these transactions, I removed one instruction and its signer and it started working, so I imagine we're hitting some limit, but I can't tell which one or even how to determine it.
When network congestion is high, validators can drop transactions without returning any error. To work around this, re-send the same signed transaction on an interval while you wait for confirmation, for as long as the transaction's blockhash is still valid. This raises the chances of your transaction being processed by a validator.
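A minimal sketch of that rebroadcast loop, written here against Solana's plain JSON-RPC sendTransaction method (the endpoint and signed-transaction string are placeholders; with web3.js the equivalent is calling sendRawTransaction again on an interval until your confirmation check passes):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: re-broadcast one signed transaction until it confirms or
// its blockhash expires. RPC_URL and SIGNED_TX are placeholders.
public class Rebroadcast {
    static final String RPC_URL = "https://api.mainnet-beta.solana.com";
    static final String SIGNED_TX = "<BASE58_SIGNED_TRANSACTION>";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String body = "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"sendTransaction\","
                + "\"params\":[\"" + SIGNED_TX + "\",{\"skipPreflight\":true}]}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(RPC_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Re-send every 2s for ~60s, roughly the lifetime of a recent
        // blockhash. Break out early once your confirmation check passes.
        for (int attempt = 0; attempt < 30; attempt++) {
            HttpResponse<String> resp =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("attempt " + attempt + ": " + resp.body());
            Thread.sleep(2_000);
        }
    }
}

Re-sending the identical signed transaction is safe: the signature deduplicates it, so at most one copy lands on chain.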
I know the DefaultMessageListenerContainer polls by design, and that the receiveTimeout, which sets the polling interval, defaults to 1 second.
The way I understand it, the DMLC will issue a get and wait for the receiveTimeout interval (1 second) before it times out and issues another get.
From what I have read, we can set this receiveTimeout to a much larger value with no effect on how quickly messages are picked up from MQ, because the active get sits on the listener until a message arrives... and once/if the timeout interval expires, it simply issues another get, which again remains active on the queue until a message arrives.
So my question is: what is the benefit of a smaller receiveTimeout interval? If we are always going to process a message when it arrives, why on earth would we want to poll the queue every second?
We are running many large applications, and the polling is simply running the CPU usage/bill through the roof, and I cannot find a justification for this.
Yes - the 1 second receive timeout can be very CPU intensive with a large number of queues.
The general idea for the DefaultMessageListenerContainer was to wait for a bit (1 second seems to be a very short wait period), and then, if you don't get a message, it actually tears everything down and does a full reconnect. This is a kind of poor man's error handling: "If I haven't heard from the broker, assume that something is broken, drop everything and reconnect." If the reconnect were not so expensive, it might not be a bad strategy. Or if you have only one queue. Or maybe you are expecting 10 messages a second and do want to reconnect if a second goes by silently. But if you have a reasonable number of destinations, the reconnect traffic can get downright abusive.
For IBM MQ, failures on the JMS connection/session are reliably picked up. You don't have the, "it just sits there not getting any messages for some reason" scenario. So setting the timeout to 10 minutes (whatever) would be fine.
Note that if you are running in a JEE application server, and your JMS connections are managed by the JCA, then that layer is responsible for detecting bad connections and you don't have to worry about it up in the application layer.
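A minimal sketch of raising the timeout in a Spring Java config (queue name and values are illustrative, and it assumes a broker like IBM MQ that reports connection failures reliably, as described above):

import javax.jms.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.listener.DefaultMessageListenerContainer;

// Sketch: a long receiveTimeout keeps each blocking receive open for
// minutes instead of the 1-second default, cutting idle-poll CPU when
// you have many queues.
@Configuration
public class JmsConfig {

    @Bean
    public DefaultMessageListenerContainer listenerContainer(ConnectionFactory cf) {
        DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
        container.setConnectionFactory(cf);
        container.setDestinationName("MY.QUEUE");  // placeholder queue name
        container.setMessageListener((javax.jms.MessageListener) msg -> {
            // handle the message
        });
        container.setReceiveTimeout(600_000);      // 10 minutes instead of 1 second
        container.setRecoveryInterval(5_000);      // back-off between reconnect attempts
        return container;
    }
}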
For Camel and Spring Boot setups, the example configurations on GitHub might be useful.
I recently encountered a thorny problem while using Kafka Streams' TimeWindowedKStream aggregation method. I stopped my program for 5 minutes and then restarted it, and found that a small part of my data was lost, with the following prompt: "Skipping record for expired window". All of the data is normal data that I want saved; there is no large delay. What can I do to prevent this data from being discarded? It seems that Kafka Streams picked up a later observed stream time.
The error message means that a window was already closed -- thus you would need to increase the grace period, as pointed out by @groo. Data expiration is based on event time, so stopping your program and resuming it later should not, by itself, change much.
However, if there is a repartition topic before the aggregation and you stop your program for some time, there might be more out-of-order data inside the repartition topic when you resume, because the input topic is read much faster than in the "live run" -- this increased disorder during catch-up could be the issue.
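For illustration, here is how you widen the grace period on a windowed aggregation with the current Kafka Streams API (topic name and durations are placeholders; on older releases the equivalent is TimeWindows.of(size).grace(afterWindowEnd)):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class GraceExample {
    // A wider grace period keeps windows open longer, so out-of-order
    // records seen during catch-up after a restart are still accepted.
    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic"); // placeholder

        KTable<Windowed<String>, Long> counts = input
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeAndGrace(
                        Duration.ofMinutes(5),     // window size
                        Duration.ofMinutes(30)))   // grace: max accepted lateness
                .count();

        return builder.build();
    }
}

The trade-off is latency: a window's final result is only settled once its grace period has passed.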
I have a Chainlink node and I'm hitting this error:
number of unconfirmed transactions exceeds ETH_MAX_QUEUED_TRANSACTIONS. WARNING: Hitting ETH_MAX_QUEUED_TRANSACTIONS is a sanity limit and should never happen under normal operation. This error is very unlikely to be a problem with Chainlink, and instead more likely to be caused by a problem with your eth node's connectivity. Check your eth node: it may not be broadcasting transactions to the network, or it might be overloaded and evicting Chainlink's transactions from its mempool. Increasing ETH_MAX_QUEUED_TRANSACTIONS is almost certainly not the correct action to take here unless you ABSOLUTELY know what you are doing, and will probably make things worse: cannot create transaction; too many unstarted transactions in the queue (250/250). WARNING: Hitting ETH_MAX_QUEUED_TRANSACTIONS is a sanity limit and should never happen under normal operation. This error is very unlikely to be a problem with Chainlink, and instead more likely to be caused by a problem with your eth node's connectivity. Check your eth node: it may not be broadcasting transactions to the network, or it might be overloaded and evicting Chainlink's transactions from its mempool. Increasing ETH_MAX_QUEUED_TRANSACTIONS is almost certainly not the correct action to take here unless you ABSOLUTELY know what you are doing, and will probably make things worse
What is the best action item to take to solve this?
This means that there are more than ETH_MAX_QUEUED_TRANSACTIONS rows in the eth_txes table of the psql database. You can do a few things, but it's important to know what they all do.
Delete the unstarted txes
Log into your database:
psql <DATABASE_URL_STRING> using a psql client
Run the following:
DELETE FROM eth_txes WHERE state = 'unstarted';
Or find all the stuck transactions with:
SELECT * FROM eth_txes;
Note: this will drop all of the unstarted transactions from the mempool of your node.
Bump up ETH_MAX_QUEUED_TRANSACTIONS
Edit your .env with:
ETH_MAX_QUEUED_TRANSACTIONS=<SOME_NUMBER_MORE_THAN_250>
Restart your node
This won't drop any transactions; it just bumps up how many can be "in flight" at once. Usually this isn't the right move.
Intermittently (once or twice a month) I am seeing the error
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for cart-topic-0: 5109 ms has passed since batch creation plus linger time
in my logs, because of which the corresponding message is not sent by the Kafka producer.
Though all the brokers are up and available, I'm not sure why this error is being observed. The load is not even high during this period.
I have set the retries property to 10 in the producer configs, but still the message was not retried. Is there anything else I need to add for the Kafka send method? I have gone through the similar issues raised, but there is no proper conclusion for this error.
Can someone please help me figure out how to fix this?
From the KIP proposal, which has since been implemented:
We propose adding a new timeout delivery.timeout.ms. The window of enforcement includes batching in the accumulator, retries, and the inflight segments of the batch. With this config, the user has a guaranteed upper bound on when a record will either get sent, fail or expire from the point when send returns. In other words we no longer overload request.timeout.ms to act as a weak proxy for accumulator timeout and instead introduce an explicit timeout that users can rely on without exposing any internals of the producer such as the accumulator.
So basically, with this change you can now configure an explicit delivery timeout, in addition to retries, for every async send you execute.
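A minimal sketch of the relevant producer settings (broker address and values are illustrative, not tuned recommendations); note that delivery.timeout.ms should be at least linger.ms + request.timeout.ms:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerSetup {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Upper bound on the whole send: batching, retries, and in-flight time.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        // Per-request timeout; retries fit inside the delivery timeout.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        props.put(ProducerConfig.RETRIES_CONFIG, 10);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);

        return new KafkaProducer<>(props);
    }
}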
I had an issue where retries were not being obeyed, but in my particular case it was because we were calling the get() method on send for synchronous behaviour. We hadn't realized it would impact retries.
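For illustration, the two send styles side by side (topic name is a placeholder, and producer construction is omitted):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendStyles {
    public static void send(KafkaProducer<String, String> producer) throws Exception {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("cart-topic", "key", "value");

        // Blocking style: get() waits for the broker's ack (or failure)
        // before the next line runs, serializing every send.
        RecordMetadata meta = producer.send(record).get();

        // Async style: hand over the record and react in a callback; the
        // producer keeps batching and retrying in the background.
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                exception.printStackTrace(); // e.g. a TimeoutException after delivery.timeout.ms
            }
        });
    }
}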
In investigating the issue through various paths, I came across the definition of the sorts of errors that are retriable:
https://kafka.apache.org/11/javadoc/org/apache/kafka/common/errors/RetriableException.html
What had confused me is that timeouts were listed as retriable.
I would normally have suggested looking into whether the delivery of your batches was taking too long and messages in your buffer were expiring due to increased volume, but you've mentioned that the volume isn't particularly high.
Did you determine whether increasing request.timeout.ms has an impact on the frequency of occurrence? That might be more of a treating-the-symptom step than addressing the cause, though.
My application joins about 50 rooms for one user, on one connection, all at once. After a couple of rooms join successfully, I start to get a server error back on some of the rooms.
The error is always the same; here it is:
Error: Server Error
at Object.i.build (https://cdn.goinstant.net/v1/platform.min.js:4:7501)
at Connection._onResponse (https://cdn.goinstant.net/v1/platform.min.js:7:25694)
at Connection._onMessage (https://cdn.goinstant.net/v1/platform.min.js:7:28812)
at Connection._onMessage (https://cdn.goinstant.net/v1/platform.min.js:3:4965)
at r.e (https://cdn.goinstant.net/v1/platform.min.js:1:4595)
at r.emit (https://cdn.goinstant.net/v1/platform.min.js:2:6668)
at r.e (https://cdn.goinstant.net/v1/platform.min.js:1:4595)
at r.emit (https://cdn.goinstant.net/v1/platform.min.js:3:7482)
at r.onPacket (https://cdn.goinstant.net/v1/platform.min.js:3:14652)
at r.<anonymous> (https://cdn.goinstant.net/v1/platform.min.js:3:12614)
It's not isolated to any particular rooms: sometimes half of them pass, sometimes nearly all pass, but there are almost always a couple that break.
What I have found is that with fewer than 10 rooms, it won't break.
Is there any rate limiting on joining rooms that could be causing this? I'd rather not put a delay between each room join but I can if I need to.
Update: It definitely has to do with how fast I'm connecting to the rooms. Spacing them out by 1s each makes it work every time. I need to connect faster than that, though; is there a fix for this?
Even a 100ms delay seems to work.
This isn't a case of rate-limiting or anything along those lines. It's a bug and we are working to fix it as soon as we can. We'll update you here once we have a solution deployed. If you'd like for us to email you a notification directly, drop us a message via our contact form (https://goinstant.com/contact). Just make reference to this issue and I'll make sure a note is added to email you directly as soon as the fix goes live.
Sorry for any inconvenience this may be causing you.
Regards,
Thomas
Developer, GoInstant