Azure Queues dequeue count growing with Rebus core

I am using Rebus (https://github.com/rebus-org, v5.3.1), and until now everything has worked fine for years. (We upgraded from .NET Core 2.0 to 3.1 a couple of weeks ago.)
For the last couple of days, there have been a lot of messages whose "dequeue count" just keeps growing. We use Azure Storage Queues. The queue handles the other messages as it should, but at the moment we still have 6000+ messages with a dequeue count higher than 250.
I can't really find anything that looks off, and it doesn't add any rows to the error queue.
I have tried restarting the server so that Rebus is completely restarted, but no luck.

In my case it was an underlying function that threw an exception, and it seems like Rebus swallowed it somewhere.
In my Rebus handler I changed the code to just return without doing anything, and the backlog started to disappear.
I still don't know why Rebus didn't stop retrying after 5 tries (in my case) and move the message to the error queue.
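For anyone hitting the same thing, here is a minimal sketch of how the retry and error-queue settings can be pinned down explicitly in Rebus 5.x, and why a swallowed exception defeats them. The queue name, connection string, MyMessage type, and DoWorkAsync are placeholders, and the exact transport call depends on which Azure transport package and version you use:

using Rebus.Activation;
using Rebus.Config;
using Rebus.Retry.Simple;

var activator = new BuiltinHandlerActivator();

activator.Handle<MyMessage>(async message =>
{
    // If this throws and the exception reaches Rebus, the failed delivery is
    // counted, and after maxDeliveryCount attempts the message is moved to the
    // error queue. If the exception is swallowed (caught and ignored, or thrown
    // from a fire-and-forget task), Rebus never sees a failure: the message
    // simply reappears when the Azure Storage Queue visibility timeout expires,
    // and the dequeue count climbs forever.
    await DoWorkAsync(message);
});

Configure.With(activator)
    // storageConnectionString and "my-queue" are placeholders
    .Transport(t => t.UseAzureStorageQueues(storageConnectionString, "my-queue"))
    .Options(o => o.SimpleRetryStrategy(errorQueueAddress: "error", maxDeliveryCount: 5))
    .Start();

If the dequeue count grows far past maxDeliveryCount without the message ever reaching the error queue, that is a strong sign the failure never propagates out of the handler, so Rebus's retry tracking never fires.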

Related

Redis intermittent "crash" on Laravel Horizon. Redis stops working every few weeks/months

I have an issue with Redis that affects the running of my Laravel Horizon queue, and I am unsure how to debug it at this stage, so I am looking for some advice.
Issue
Approx. every 3 - 6 weeks my queues stop running. Every time this happens, the first set of exceptions I see are:
Redis Exception: socket error on read socket
Redis Exception: read error on connection to 127.0.0.1:6379
Both of these are caused by Horizon running the command:
artisan horizon:work redis
Theory
We push around 50k - 100k jobs through the queue each day and I am guessing that Redis is running out of resources over the 3-6 week period. Maybe general memory, maybe something else?
I am unsure if this is due to a leak within my system or something else.
Current Fix
At the moment, I simply run the command redis-cli FLUSHALL to completely clear the database and we are back working again for another 3 - 6 weeks. This is obviously not a great fix!
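Not an answer to the root cause, but before reaching for FLUSHALL it is worth capturing a snapshot of the server while the queues are stalled. These are all standard Redis commands, so they apply to a default Horizon setup:

redis-cli INFO memory     # used_memory vs. maxmemory, mem_fragmentation_ratio
redis-cli INFO clients    # connected_clients, blocked_clients
redis-cli INFO stats      # evicted_keys, rejected_connections
redis-cli DBSIZE          # total key count - does it only ever grow?
redis-cli SLOWLOG GET 10  # recent commands that blocked the event loop

If used_memory keeps climbing between incidents, or DBSIZE only ever grows, that supports the resource-leak theory (e.g. stale Horizon metadata keys accumulating) and points at a much narrower fix than a blanket FLUSHALL.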
Other Details
Currently Redis runs on the web server (not a dedicated Redis server). I am open to changing that, but it would not fix the root cause of the issue.
Help!
At this stage, I am really unsure where to start in terms of debugging and identifying the issue. I feel that is probably a good first step!

Memurai hangs while processing

I am using Memurai 2.0.2 as the cache in my distributed application. The application runs different services on different machines, and all services have the Memurai connection details.
The problem is that sometimes the Memurai process just hangs: it keeps running, but no queries are served and I am not able to open a connection to it. Its log file contains the error:
Error trying to rename the existing AOF to old tempfile: Broken pipe
This generally occurs when I restart the Memurai service, although I am not sure what the reason for it is. Memurai works fine once I restart its service again.
What can be the issue here? What steps can I take to avoid/ minimize its occurrence?
Memurai 2.0.2 is fairly outdated now. Perhaps get the latest version (3.1.4 at the time of this response) at https://www.memurai.com/get-memurai
For whoever is looking for an answer: this happened because another service restarted the Memurai service while a background rewrite of the AOF was in progress. This left zombie processes behind, and when Memurai started again, this error came up.
The solution we went with was to check whether any background rewriting is happening by reading the aof_rewrite_scheduled and aof_rewrite_in_progress flags from the persistence section of INFO. If either of these flags is true, we don't stop the service.
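A sketch of what that check might look like from a .NET service using StackExchange.Redis (Memurai speaks the Redis protocol, so the standard client works; the endpoint is a placeholder). The two flag names come straight from the persistence section of INFO:

using System.Linq;
using StackExchange.Redis;

static bool SafeToStopMemurai()
{
    // Connect to the Memurai instance; "localhost:6379" is a placeholder.
    using var mux = ConnectionMultiplexer.Connect("localhost:6379");
    var server = mux.GetServer("localhost", 6379);

    // Flatten the persistence section of INFO into a dictionary.
    var persistence = server.Info("persistence")
        .SelectMany(group => group)
        .ToDictionary(kv => kv.Key, kv => kv.Value);

    // Only stop the service when no AOF rewrite is scheduled or running.
    return persistence["aof_rewrite_in_progress"] == "0"
        && persistence["aof_rewrite_scheduled"] == "0";
}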

Camel File Component stops consuming files sometimes

Hi. I'm having a pretty nasty issue with the app I'm responsible for. The app was running on OSGi/Karaf + Spring + Apache Camel 2.14.1. I removed OSGi/Karaf, upgraded Spring, and moved it all to Spring Boot. Camel was upgraded to 2.24.1. Then we started to see occasional performance issues in prod. I wasn't able to see the logs until yesterday, but what I saw there confused me a lot: at some point the app stopped processing files for one of the routes, while the second route kept working fine...
And it lasted for almost 2 hours. Yes, we have a pretty high load on it, but it has always been like that. I just can't figure out what exactly could possibly cause this.
A few small details though... The files this failing route monitors are put into the folder through a symlink. So if the monitored folder is /test, there is a link to it in $HOME/test (for a different user), and another process connects over SFTP and puts files into that folder. (It's not a problem with that process; I'm 100% sure the files were on the server, but this thing just didn't see them.)
To be honest, I have no idea where to dig. The server is pretty old and the file system is pretty fragmented; we also have disk usage at 100% (but our admins don't think it's related, plus it was like that before and there were no issues). Java changed, by the way, from 1.6 to 1.8. I also checked the memory, and it doesn't seem like there is an issue: one full GC in 12 hours, and minor collections are not that frequent. I would really appreciate any thoughts... Thank you very much!

Azure in role cache exceptions when service scales

I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the service scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure DB connection failures from the processing going on behind the cache. (If I don't find the entry I want in the cache, I get it from the db and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connections using the Transient Fault Handling application block.
The retry process uses a fixed interval retrying every second for 5 tries.
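For reference, here is a hedged sketch of the cache-aside-plus-retry setup described above (Order, LoadOrderFromDb, and the cache name are stand-ins); the retry policy mirrors the fixed interval of one second for 5 tries:

using System;
using Microsoft.ApplicationServer.Caching;
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

// Fixed-interval policy: 5 tries, 1 second apart, for transient SQL failures.
var retryPolicy = new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(
    new FixedInterval(5, TimeSpan.FromSeconds(1)));

var cache = new DataCacheFactory().GetCache("orders"); // cache name is a placeholder

Order GetOrder(string key)
{
    // Cache read - this is where the DataCacheExceptions surface while the
    // in-role cache cluster is moving partitions during a scale event.
    var cached = cache.Get(key) as Order;
    if (cached != null) return cached;

    // Cache miss: load from the database under the retry policy, then
    // populate the cache for subsequent readers.
    var order = retryPolicy.ExecuteAction(() => LoadOrderFromDb(key));
    cache.Put(key, order);
    return order;
}

Since in-role cache partitions are redistributed across instances when the role scales, a burst of misses during scaling is expected; every miss then becomes a database hit, which may explain the accompanying connection failures and CPU spikes.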
The failures are typically:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
Any help appreciated.
I have also noticed that I am getting a high cache miss rate (> 70%), and when the system is struggling, there is high CPU utilisation (> 80%).
Well, I haven't been able to find out any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
Looking at the last few days' processing stats, it is clear the high CPU usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily, low cpu usage, low memory usage, no exceptions.
So, whilst still not discovering what the source of the problems were, I seem to have overcome them by providing a bigger environment for the service to run in.
Late news! I noticed this morning that from about 06:30 the CPU usage started to climb, along with the time taken for the service to process as it should. Errors started appearing, and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this. Remote desktop to the instances shows no exceptions in the event log, and other logging is not showing application problems, so I am still stumped.

Propel and Persistent Connections

I'm having issues with a large number of concurrent connections to an Amazon RDS database, using Propel as the ORM with PHP. The application runs fine during load testing with 20 to 50 connections open at a time, then seems to hit a wall, mushrooms up to the maximum number of connections almost immediately, and everything dies.
I believe Propel is using mysql_pconnect, but I can't find where it designates that, or a simple way to turn it off. I may be chasing a red herring here, but I'm stumped, and there are enough comments on the net about pconnect causing problems with too many connections that I thought it would be worth a shot to remove it.
Does anyone know how to do this? I have been searching using various phrases but can't seem to find anything.
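Not the root cause in the end (see below), but for the persistent-connection question itself: recent Propel 1.x versions use PDO rather than mysql_pconnect, and persistence is governed by the PDO ATTR_PERSISTENT option in runtime-conf.xml. Something like the following (the datasource id, DSN, and credentials are placeholders) should turn it off; worth double-checking against your Propel version's docs:

<datasource id="mydb">
  <adapter>mysql</adapter>
  <connection>
    <dsn>mysql:host=myinstance.rds.amazonaws.com;dbname=mydb</dsn>
    <user>dbuser</user>
    <password>secret</password>
    <options>
      <!-- maps to PDO::ATTR_PERSISTENT; false disables persistent connections -->
      <option id="ATTR_PERSISTENT">false</option>
    </options>
  </connection>
</datasource>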
As it turns out, the error was being caused by the RDS redo log. The redo log is the same size for every RDS instance size. On the larger instance sizes, it's possible to fill the redo log and wrap back around to the beginning before the data has been written out to the database. At that point MySQL does its 'furiously flushing' thing to get caught up, does not process any new requests, and they pile up like crazy. This eventually caused our app to crash. More, smaller RDS servers fixed the issue, though I'm not very happy with Amazon over this. They need to make the size of the redo logs configurable.
