Handling repeated events in a log

I have a logging system where some events are repeated indefinitely. For example:
12:03 - Restart attempted
12:03 - Restart failed
12:02 - Restart attempted
12:02 - Restart failed
12:01 - Restart attempted
12:01 - Restart failed
This might go on for days. I imagine there are standard ways that systems deal with spammy events like this.
What are the common ways logging systems deal with this kind of event without flooding the log?

One approach would be to coalesce matching entries that repeat within some time delta of each other, something like:
12:03 - Restart attempted [3 times since 12:01]
12:03 - Restart failed [3 times since 12:01]
12:02 - Something
11:23 - Restart attempted [17 times since 11:21]
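A minimal sketch of that coalescing idea in Python, assuming records arrive in chronological order as (timestamp, message) pairs; the two-minute window and the output format are illustrative choices, not taken from any particular logging system:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=2)  # coalesce repeats arriving within this gap

def flush(msg, state):
    first, latest, count = state
    if count == 1:
        return f"{latest:%H:%M} - {msg}"
    return f"{latest:%H:%M} - {msg} [{count} times since {first:%H:%M}]"

def coalesce(records):
    """Fold repeats of the same message that arrive within WINDOW of the
    previous occurrence into one entry; a run is emitted when it ends or
    when the input is exhausted. Assumes chronological input."""
    runs = {}  # message -> (first_ts, latest_ts, count)
    for ts, msg in records:
        if msg in runs and ts - runs[msg][1] <= WINDOW:
            first, _, count = runs[msg]
            runs[msg] = (first, ts, count + 1)
        else:
            if msg in runs:                      # gap too large: close the old run
                yield flush(msg, runs.pop(msg))
            runs[msg] = (ts, ts, 1)
    for msg, state in runs.items():              # close whatever is still open
        yield flush(msg, state)

if __name__ == "__main__":
    t = lambda s: datetime.strptime(s, "%H:%M")
    sample = [(t("12:01"), "Restart attempted"), (t("12:01"), "Restart failed"),
              (t("12:02"), "Restart attempted"), (t("12:02"), "Restart failed"),
              (t("12:03"), "Restart attempted"), (t("12:03"), "Restart failed")]
    for line in coalesce(sample):
        print(line)   # e.g. "12:03 - Restart attempted [3 times since 12:01]"
```

Classic syslog daemons do something similar at the daemon level ("last message repeated N times"), and systemd-journald rate-limits per service (RateLimitIntervalSec/RateLimitBurst) rather than counting.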

Logging systems typically either compel you to fix the underlying problem, or have flags to suppress the repeats. And I think properly so.
Error logs are typically chronically underadministered and undermonitored anyway. If it's an application I'm in any way involved with, I'd just as soon see as much flag-waving as possible if it gets someone's attention.
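For the suppression side, many logging frameworks let you drop repeats with a filter before they ever reach the log. A minimal sketch using Python's standard logging module; the class name and the 60-second interval are my own choices:

```python
import logging
import time

class SuppressRepeats(logging.Filter):
    """Drop records whose exact message was already emitted within `interval`
    seconds. Illustrative only; real systems usually also emit a periodic
    'suppressed N messages' summary so repeats are not silently lost."""
    def __init__(self, interval=60.0):
        super().__init__()
        self.interval = interval
        self._last_emit = {}  # message -> time of last emission

    def filter(self, record):
        now = time.monotonic()
        msg = record.getMessage()
        if now - self._last_emit.get(msg, float("-inf")) < self.interval:
            return False       # repeat within the window: suppress
        self._last_emit[msg] = now
        return True            # first occurrence (or window expired): log it

log = logging.getLogger("restart-watchdog")
log.addFilter(SuppressRepeats(interval=60.0))
```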

Related

Redis intermittent "crash" on Laravel Horizon. Redis stops working every few weeks/months

I have an issue with Redis that affects the running of the Laravel Horizon queue, and I am unsure how to debug it at this stage, so I am looking for some advice.
Issue
Approx. every 3 - 6 weeks my queues stop running. Every time this happens, the first set of exceptions I see are:
Redis Exception: socket error on read socket
Redis Exception: read error on connection to 127.0.0.1:6379
Both of these are caused by Horizon running the command:
artisan horizon:work redis
Theory
We push around 50k - 100k jobs through the queue each day and I am guessing that Redis is running out of resources over the 3-6 week period. Maybe general memory, maybe something else?
I am unsure if this is due to a leak within my system or something else.
Current Fix
At the moment, I simply run the command redis-cli FLUSHALL to completely clear the database and we are back working again for another 3 - 6 weeks. This is obviously not a great fix!
Other Details
Currently Redis runs within the webserver (not a dedicated Redis server). I am open to changing that but it is not fixing the root cause of the issue.
Help!
At this stage, I am really unsure where to start in terms of debugging and identifying the issue; figuring that out is probably a good first step!
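One way to start narrowing this down is to watch Redis memory and key counts over the weeks instead of waiting for the failure. A small sketch using the redis-py client; the host, port, interval, and the fields printed are assumptions about what will turn out to be relevant:

```python
import time
import redis  # pip install redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Poll and print basic resource numbers; pipe the output to a file and graph it.
while True:
    mem = r.info("memory")
    clients = r.info("clients")
    print(
        time.strftime("%Y-%m-%d %H:%M:%S"),
        "used:", mem.get("used_memory_human"),
        "maxmemory:", mem.get("maxmemory_human"),
        "keys:", r.dbsize(),
        "clients:", clients.get("connected_clients"),
        flush=True,
    )
    time.sleep(60)
```

If used_memory creeps toward maxmemory (or maxmemory is 0 and the box itself runs low on RAM), the FLUSHALL "fix" makes sense: it clears out keys that something (stale Horizon/queue data, an ever-growing cache) never releases. redis-cli INFO memory and MEMORY STATS give the same numbers interactively.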

Laravel 8 - Queue jobs timeout, Fixed by clearing cache & restarting horizon

My queue jobs all run fairly seamlessly on our production server, but about every 2 - 3 months I start getting a lot of timeout exceeded/too many attempts exceptions.
Our app is running with event sourcing and many events are queued, so needless to say we have a lot of jobs passing through the system (100 - 200k per day generally).
I have not found the root cause of the issues yet, but a simple re-deploy through Laravel Envoyer fixes the issue. This is most likely due to the cache:clear command being run.
Currently, the cache is handled by Redis and is on the same server as the app. I was considering moving the cache to its own server/instance but this still does not help me with the root cause.
Does anyone have any ideas what might be going on here and how I can diagnose/fix it? I am guessing the cache is just getting overloaded/running out of space/leaking etc. over time but not really sure where to go from here.
Check:
The version of your Redis (and update the predis package)
The version of your Laravel
Your server
I hope this gives you some solutions.
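Beyond version checks, it is worth ruling out the "cache running out of space" theory directly by looking at the instance's memory limit and eviction policy. A short sketch with redis-py; the connection details are assumptions:

```python
import redis  # pip install redis

r = redis.Redis(host="127.0.0.1", port=6379)

print(r.config_get("maxmemory"))         # 0 means no limit (bounded only by the host's RAM)
print(r.config_get("maxmemory-policy"))  # 'noeviction' makes writes fail once the limit is hit
print(r.info("memory")["used_memory_human"])
```

If cache and queues share one Redis database and the policy is an allkeys-* eviction, a full cache could also evict queue or Horizon keys, which would look a lot like jobs timing out until a cache:clear empties things again.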

Informatica BDE ingestion job runs for 10+ hours and when killed and rerun completes in 3 hrs

About my profile -
I am doing L3 support for some of the BDE Informatica ingestion jobs that run on our cluster. Our goal is to help application teams meet the SLA. We support job streams that run on top of the Hadoop layer (Hive).
Problem Statement -
We have observed that on some days BDE Informatica ingestion jobs run painfully slow, while on other days they complete their cycle in 3 hours. If a job is taking that much time, we usually kill and rerun it, which helps, but it does not help us fix the root cause.
Limitations of our profile -
Unfortunately, I don't have access to the application code or the Informatica tool, but I have to work with the development team and ask relevant questions so that we can narrow down the root cause.
Next Steps -
What sort of scenarios can cause this delay?
What tools can I use to check what may be the cause of the delay?
A few possible questions I may ask the development team:
Are the tables analysed properly before running the job stream?
Is there any significant change in the volume of data? (This is a bit unlikely, as the job runs quickly on rerun.)
I am aware this is a very broad question asking for help with the approach rather than with a specific problem, but it is a start toward fixing this issue for good, or at least approaching it in a rational manner.
You need to check the Informatica logs to see if it's hanging at the same step each time.
Assuming it's not, are you triggering the jobs at the same time each day... say midnight, and it usually completes by 3am... but sometimes it runs till 10am, at which point you kill and restart?
If so, I suggest you baseline the storage medium activity: under minimal load, during a 3-hour quick run, and during the 10-hour run. Is there a difference in demand?
It sounds like some kind of contention is causing a conflict; a process may be waiting forever instead of resuming when the desired resource becomes available. Speak to the DBAs.
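For the baselining step, a concrete (if simplistic) way to record storage activity on a node during a quick run versus a slow run is to sample the OS disk counters. A sketch using psutil, with the interval and output format as assumptions; HDFS/YARN-level metrics would be the cluster-wide equivalent:

```python
import time
import psutil  # pip install psutil

INTERVAL = 60  # seconds between samples

prev = psutil.disk_io_counters()
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters()
    read_mb = (cur.read_bytes - prev.read_bytes) / INTERVAL / 1e6
    write_mb = (cur.write_bytes - prev.write_bytes) / INTERVAL / 1e6
    print(time.strftime("%Y-%m-%d %H:%M:%S"),
          f"read {read_mb:.1f} MB/s", f"write {write_mb:.1f} MB/s", flush=True)
    prev = cur
```

Comparing the two traces should show whether the slow days are starved for I/O or simply waiting on something, which is exactly the distinction the contention theory above depends on.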

Purposefully exiting process for a dyno restart on heroku

I've got a phantomjs app (http://css.benjaminbenben.com) running on Heroku - it works well for some time, but then I have to run heroku restart because requests start timing out.
I'm looking for a stop-gap solution (I've gone from around 6 to 4500 daily visitors over the last week), and I was considering exiting the process after it had served a set number of requests to fire a restart.
Will this work? And would this be considered bad practice?
(in case you're interested, the app source is here - https://github.com/benfoxall/wtcss)
It'd work, as long as you don't crash within (I think) 10 minutes of the last crash. If crashes are too frequent, the process will stay down.
It's not bad practice, but it's not great practice. You should figure out what is causing your server to hang, of course.
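As a sketch of the request-counting idea (in Python with Flask purely for illustration; the real app is Node + PhantomJS, and MAX_REQUESTS is an arbitrary threshold): once the counter reaches the limit, the process exits after finishing the current response and the dyno manager starts a replacement.

```python
import os
from flask import Flask  # illustrative stand-in for the real Node app

MAX_REQUESTS = int(os.environ.get("MAX_REQUESTS", "1000"))

app = Flask(__name__)
handled = 0

@app.route("/")
def index():
    return "ok"

@app.after_request
def count_and_maybe_exit(response):
    global handled
    handled += 1
    if handled >= MAX_REQUESTS:
        # Runs after the response has been sent, then exits so the dyno is restarted.
        response.call_on_close(lambda: os._exit(0))
    return response

if __name__ == "__main__":
    app.run(port=int(os.environ.get("PORT", "5000")))
```

Keep the threshold high enough that exits are spaced well apart, or you will hit the crash-backoff behaviour mentioned above and the process will stay down.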

Interesting questions related to lighttpd on Amazon EC2

This problem appeared today and I have no idea what is going on. Please share your ideas.
I have 1 EC2 DB server (MySQL + NFS file sharing + Memcached).
And I have 3 EC2 web servers (lighttpd) which mount the NFS folders from the DB server.
Everything has been going smoothly for months, but suddenly there is an interesting phenomenon.
Every 8 to 10 minutes, PHP files become unreachable. This lasts about a minute and then things go back to normal. Static files like .html are unaffected. All servers have the same problem at exactly the same time.
I have spent one whole day analyzing the cause. Finally, I found that when the problem appears, the number of file descriptors held by lighttpd suddenly increases a lot.
I used ls /proc/1234/fd | wc -l to check the number of fd.
The number of fds is around 250 in normal times. However, when the problem appears, it rises to around 1500 and then drops back to normal.
It sounds funny, right? Do you have any idea what's going on?
========================
The CPU graph of one of the web servers: http://pencake.images.s3.amazonaws.com/4be1055884133.jpg
Thoughts:
Have a look at dmesg output.
The number of file descriptors jumping up sounds to me like something is blocking, including the processing of connections to lighttpd/PHP, which builds up until the blocking condition ends.
When you say the PHP file is unreachable, do you mean the file is missing? Or does the PHP script stall during execution? What do the lighttpd log files say is happening on the calls to this PHP script? Are there any other hints in the lighttpd logs?
What is the maximum number of file descriptors for the process/user?
I and others have had bizarre networking behavior on EC2 instances from time to time. Give us more details on it. Maybe set up some additional monitoring of the connectivity between your instances. Consider moving your problem instance to another instance in the hope of the problem magically disappearing. (Shot in the dark.)
And finally...
DoS attack? I doubt it; it would be offline or not. It is way too early in the debugging process for you to infer malice on someone else's part.
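To catch the spike as it happens rather than eyeballing ls /proc/1234/fd, a small polling script can log the fd count against the process's limits (run it as root or as the lighttpd user; the pid, interval, and threshold below are placeholders):

```python
import os
import time

PID = 1234        # lighttpd's pid, as in the ls /proc/1234/fd example
INTERVAL = 5      # seconds between samples
THRESHOLD = 1000  # roughly the reported spike level

def fd_count(pid):
    return len(os.listdir(f"/proc/{pid}/fd"))

def open_file_limits(pid):
    # The "Max open files" line of /proc/<pid>/limits holds the soft and hard limits.
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                return line.split()[3:5]
    return None

print("open-file limits (soft, hard):", open_file_limits(PID))
while True:
    count = fd_count(PID)
    stamp = time.strftime("%H:%M:%S")
    print(f"{stamp}  fds={count}", flush=True)
    if count > THRESHOLD:
        # Snapshot what the descriptors point at (sockets, NFS files, pipes, ...).
        os.system(f"ls -l /proc/{PID}/fd > /tmp/lighttpd-fds-{stamp.replace(':', '')}.txt")
    time.sleep(INTERVAL)
```

Whether the extra descriptors turn out to be sockets or files on the NFS mount should tell you whether connections are piling up behind stalled PHP requests or lighttpd is waiting on the shared filesystem, which ties in with the dmesg and log-file checks above.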
