I have a Cloudera 5.x cluster running in Azure. Everything was running fine, and then a few days ago I started getting "NODE_MANAGER_UNEXPECTED_EXITS" health notifications via email every hour.
This seems to happen at the 43rd minute of every hour.
Most of the forums I've come across have suggested OutOfMemory errors - though I'm not seeing any of those in the log files. For good measure I've tried upping the Java heap space allocation for the NodeManager, but this has not solved the problem.
I've stopped all jobs on the cluster - it is essentially sitting idle, but every hour I'm getting these alerts.
Example of the health alert that comes in the email:
NODE_MANAGER_UNEXPECTED_EXITS Role health test bad Critical The health test result for NODE_MANAGER_UNEXPECTED_EXITS has become bad: This role encountered 1 unexpected exit(s) in the previous 5 minute(s). Critical threshold: any.
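In case it's useful, this is roughly how I confirmed the timing: a small Python sketch that pulls shutdown/exit timestamps out of the NodeManager log (the log path and the messages it greps for are assumptions and may differ between CDH versions).

# Rough sketch: pull shutdown/exit timestamps from the NodeManager log to confirm the hourly pattern.
# The log path and the messages to grep for are assumptions and may differ between CDH versions.
import re

LOG = "/var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER.log.out"  # hypothetical path

with open(LOG, errors="replace") as f:
    for line in f:
        if "SHUTDOWN_MSG" in line or "Exiting with status" in line:
            match = re.match(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", line)
            if match:
                print(match.group(0), line.strip())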
Any help is greatly appreciated
I have an issue with Redis that affects the running of the Laravel Horizon queue, and I am unsure how to debug it at this stage, so I am looking for some advice.
Issue
Approx. every 3 - 6 weeks my queues stop running. Every time this happens, the first set of exceptions I see are:
Redis Exception: socket error on read socket
Redis Exception: read error on connection to 127.0.0.1:6379
Both of these are caused by Horizon running the command:
artisan horizon:work redis
Theory
We push around 50k - 100k jobs through the queue each day and I am guessing that Redis is running out of resources over the 3-6 week period. Maybe general memory, maybe something else?
I am unsure if this is due to a leak within my system or something else.
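To test that theory, I plan to snapshot a few Redis stats on a schedule so I can see whether memory and key counts creep up over the weeks. A minimal sketch, assuming the redis-py client and my local host/port:

# Minimal sketch: log a few Redis stats so growth over the weeks becomes visible.
# Assumes the redis-py client; run it from cron every hour or so and graph the output.
import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

info = r.info()  # server-wide stats from the INFO command
print(int(time.time()),
      "used_memory:", info.get("used_memory_human"),
      "maxmemory:", info.get("maxmemory_human"),
      "connected_clients:", info.get("connected_clients"),
      "db0_keys:", info.get("db0", {}).get("keys"))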
Current Fix
At the moment, I simply run the command redis-cli FLUSHALL to completely clear the database and we are back working again for another 3 - 6 weeks. This is obviously not a great fix!
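Before the next flush I would like to at least see what is accumulating, so I can delete selectively rather than wiping everything. A rough sketch using SCAN via redis-py (grouping keys by their first prefix segment is just a guess at how my keys are named):

# Rough sketch: count keys by their first prefix segment to see what is piling up,
# instead of running FLUSHALL blind. Assumes redis-py; SCAN avoids blocking the server.
from collections import Counter
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

counts = Counter()
for key in r.scan_iter(count=1000):
    prefix = key.split(b":")[0]  # group by the first ':'-separated segment
    counts[prefix] += 1

for prefix, n in counts.most_common(20):
    print(prefix.decode(errors="replace"), n)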
Other Details
Currently Redis runs on the web server (not a dedicated Redis server). I am open to changing that, but it would not fix the root cause of the issue.
Help!
At this stage, I am really unsure where to start in terms of debugging and identifying the issue. I feel that is probably a good first step!
About my profile -
I am doing L3 support for some of the BDE Informatica ingestion jobs that run on our cluster. Our goal is to help application teams meet their SLAs. We support job streams that run on top of the Hadoop layer (Hive).
Problem Statement -
We have observed that on some days the BDE Informatica ingestion jobs run painfully slowly, while on other days they complete their cycle in 3 hours. If a job is taking that much time, we usually kill and rerun it, which gets us going again, but it does not help us fix the root cause.
Limitations of our profile -
Unfortunately, I don't have access to the application code or the Informatica tool, but I can connect with the development team and ask relevant questions so that we can narrow down the root cause.
Next Steps -
What sort of scenarios can cause this delay?
What tools can I use to check what may be cause of the delay?
A few possible questions which I may ask the development team are -
Are the tables analysed properly before running the job stream? (See the sketch after this list.)
Is there any significant change in the volume of data? (This is a bit unlikely, as the job runs quickly on rerun.)
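If I can get read access to Hive, something like the sketch below would let me check the table statistics myself before going to the development team. It assumes the pyhive client, the host and table names are placeholders, and numRows / totalSize only show up in the DESCRIBE FORMATTED output when the table has been analysed.

# Rough sketch: check whether a Hive table has statistics (numRows, totalSize, etc.).
# Assumes the pyhive client; the host and table name are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-server2.example.com", port=10000, username="support")
cursor = conn.cursor()
cursor.execute("DESCRIBE FORMATTED my_ingest_table")

for row in cursor.fetchall():
    # Table parameters such as numRows / totalSize appear in this output when stats exist
    if any(field and ("numRows" in str(field) or "totalSize" in str(field)) for field in row):
        print(row)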
I am aware this is a very broad question asking for help with the approach rather than with a specific problem, but this is just a start towards fixing this issue for good, or at least approaching it in a rational manner.
You need to check the Informatica logs to see if it's hanging at the same step each time.
Assuming it's not, are you triggering the jobs at the same time each day... say midnight, and it usually completes by 3 am... but sometimes it runs till 10 am, at which point you kill and restart?
If so, I suggest you baseline the storage medium's activity under minimal load, during a quick 3-hour run, and during the 10-hour run. Is there a difference in demand?
It sounds like contention that is causing a conflict: a process may be waiting forever instead of resuming when the desired resource becomes available. Speak to the DBAs.
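If you want some numbers to bring to those conversations, you can also pull application runtimes from the YARN ResourceManager REST API and compare a fast day against a slow one. A rough sketch in Python (the ResourceManager address, the job-name filter, and the dates are placeholders):

# Rough sketch: compare application runtimes between a fast day and a slow day using the
# YARN ResourceManager REST API. The RM address, name filter, and dates are placeholders.
from datetime import datetime, timezone
import requests

RM = "http://resourcemanager.example.com:8088"  # placeholder

def day_window_ms(day):
    start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    begin = int(start.timestamp() * 1000)
    return begin, begin + 24 * 3600 * 1000

def app_runtimes(begin_ms, end_ms, name_filter="INFA"):
    resp = requests.get(RM + "/ws/v1/cluster/apps",
                        params={"startedTimeBegin": begin_ms, "startedTimeEnd": end_ms})
    apps = (resp.json().get("apps") or {}).get("app", [])
    return [(a["name"], a["elapsedTime"] / 60000.0)  # minutes
            for a in apps if name_filter in a["name"]]

for label, day in [("fast day", "2019-05-01"), ("slow day", "2019-05-07")]:
    begin, end = day_window_ms(day)
    for name, minutes in app_runtimes(begin, end):
        print(label, name, round(minutes, 1), "min")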
I have an issue where, from time to time, one of the EC2 instances in my cluster has its ECS agent disconnect. This silently removes the EC2 instance from the cluster (i.e. it is no longer eligible to run any services) and silently drains my cluster of serving instances. I have my cluster backed by an Auto Scaling group that spawns servers to keep the healthy count up, but the servers with a disconnected ECS agent are not marked as unhealthy, so the Auto Scaling group thinks everything is alright.
I have the feeling there must be something (easy) to mitigate this, or else I have a big problem with choosing ECS and using it in production.
We had this issue for a long time. With each new AWS ECS-optimized AMI it got better, but as of 3 months ago it still happened from time to time. As mcheshier mentioned, make sure to always use the latest AMI, or at least the latest AWS ECS agent.
The only way we were able to resolve it was through:
Timed autoscale rotations
We would try to prevent it by scaling up and down at random times
Good CloudWatch alerts
We happened to have our application set up as a bunch of microservices that were all queue (SQS) based, so we could scale up and down based on the queues. We had decent monitoring set up that let us approximate queue rates across the number of ECS containers. When we detected that the rate was off, we would rotate that whole ECS instance. I.e. say our cluster deployed 4 running containers of worker-1, and we approximate that each worker does 1000 messages per 5 minutes. If our queue rate was 3000 per 5 minutes and we had 4 workers, then 1 was not working as expected. We had some scripts set up in Lambda to find the faulty one and terminate the entire instance that ran that container.
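For the agent-disconnect case specifically, the check can be more direct than our queue-rate heuristic. A rough sketch of the kind of thing you could run from Lambda or cron, using boto3 (the cluster name is illustrative, and you would want to gate any automatic health-marking or termination carefully):

# Rough sketch: find container instances whose ECS agent is disconnected and tell the
# Auto Scaling group they are unhealthy so they get replaced. Assumes boto3; the cluster
# name is illustrative, and you would want to add guards before doing this automatically.
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-ecs-cluster"  # illustrative

arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
if arns:
    described = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)
    for ci in described["containerInstances"]:
        if not ci["agentConnected"]:
            instance_id = ci["ec2InstanceId"]
            print("ECS agent disconnected on", instance_id)
            # Mark it unhealthy so the Auto Scaling group replaces it
            autoscaling.set_instance_health(InstanceId=instance_id,
                                            HealthStatus="Unhealthy",
                                            ShouldRespectGracePeriod=True)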
I hope this helps. I realize it's specific to our in-house application, but the advice I can give you and anyone else is to take the initiative and put as many metrics out there as you can. This will let you do some neat analytics and look for kinks in the system, this being one of them.
This is in relation to my previous post (here) regarding the OOM I'm experiencing on a driver after running some Spark steps.
I have a cluster with 2 nodes in addition to the master, running the job in client mode. It's a small job that is not very memory intensive.
I've paid particular attention to the Hadoop processes via htop; they are the user-generated ones and also the highest memory consumers. The main culprit is the amazon.emr.metric.server process, followed by the state pusher process.
As a test I killed the process, and the memory shown by Ganglia dropped quite drastically, whereupon I was able to run 3-4 consecutive jobs before the OOM happened again. This behaviour repeats whenever I manually kill the process.
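For reference, this is roughly how I watch it between htop sessions: a small psutil sketch (assuming psutil is available on the node) that prints the top resident-memory consumers.

# Rough sketch: print the top resident-memory consumers, as a scriptable alternative to htop.
# Assumes psutil is installed on the node.
import psutil

procs = []
for p in psutil.process_iter(["pid", "name", "username", "memory_info"]):
    mem = p.info.get("memory_info")
    if mem is None:
        continue
    procs.append((mem.rss, p.info["pid"], p.info["username"] or "?", p.info["name"] or "?"))

for rss, pid, user, name in sorted(procs, reverse=True)[:10]:
    print("%8.1f MB  pid=%s  user=%s  %s" % (rss / 1024 / 1024, pid, user, name))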
My question really is regarding the default behaviour of these processes and whether what I'm witnessing is the norm or whether something crazy is happening.
Nothing is being written to the Apache error log, and I cannot find any scheduled tasks that may be causing a problem. The restart occurs around the same time: three times over the past week at 12:06 am, and also in the 3-4 am time frame.
I am running Apache 2.2.9 on Windows Server 2003.
The same behavior was happening prior to the past week, when an error was being written to the Apache error log indicating that the MaxRequestsPerChild limit was being reached. I found this article,
http://httpd.apache.org/docs/2.2/platform/windows.html
suggesting setting MaxRequestsPerChild to 0, which I did; the error stopped being reported to the error log, but the restarting behavior continued, although not as frequently.