YARN kills containers when there is heavy load on the cluster. How does Apache Twill react when one of its runnables running in a container gets killed? Does it run with a reduced number of instances of the runnable, or does it relaunch it?
By default, Twill will keep trying to re-launch the instances indefinitely. As of version 0.10.0, you can specify a maximum number of retries.
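A minimal sketch of what that can look like, assuming the TwillPreparer#withMaxRetries method added in 0.10.0; the ZooKeeper connection string and the runnable below are placeholders, not from the thread:

// Sketch only: assumes Twill >= 0.10.0 and TwillPreparer#withMaxRetries.
// The ZooKeeper quorum and the runnable are hypothetical.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.twill.api.AbstractTwillRunnable;
import org.apache.twill.api.TwillController;
import org.apache.twill.api.TwillRunnerService;
import org.apache.twill.yarn.YarnTwillRunnerService;

public class MaxRetriesExample {

  // Hypothetical runnable used only for illustration.
  public static class EchoRunnable extends AbstractTwillRunnable {
    @Override
    public void run() {
      System.out.println("Running inside a YARN container");
    }
  }

  public static void main(String[] args) throws Exception {
    TwillRunnerService runner =
        new YarnTwillRunnerService(new YarnConfiguration(), "zkhost:2181");
    runner.start();

    TwillController controller = runner.prepare(new EchoRunnable())
        // Re-launch killed instances of "EchoRunnable" at most 3 times
        // instead of retrying indefinitely (the default behaviour).
        .withMaxRetries("EchoRunnable", 3)
        .start();

    controller.awaitTerminated();
  }
}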
We are trying to migrate our Laravel setup to use Docker. Dockerizing the Laravel app was straightforward; however, we ran into an issue where, if we do a deployment while scheduled jobs are running, they get killed because the container is destroyed. What's the best practice here? Having a separate container to run the Laravel scheduler doesn't seem like it would solve the problem.
Run the scheduled jobs in a different container so you can scale it independently of the Laravel app.
Run multiple containers of the scheduled-job service so you can stop some of them to upgrade while the old ones continue processing jobs.
Docker sends a SIGTERM signal to the container and waits for it to exit cleanly before issuing SIGKILL (the time between the two signals is configurable, 10 seconds by default). This gives your current job the chance to finish cleanly (or to save a checkpoint and continue later).
The plan is to stop old containers and start new containers gradually so there are no lost jobs and no downtime. If you use an orchestrator like Docker Swarm or Kubernetes, it will handle most of these logistics for you.
Note: the Laravel scheduler is based on cron and fires off processes that will be killed by Docker. To prevent this, have the scheduler push a job onto a Laravel queue instead. The queue worker is a foreground process, so it gets the chance to stop/save cleanly from the SIGTERM it receives before being killed.
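Laravel queue workers are PHP processes, but the mechanism being relied on here is generic: catch the SIGTERM that docker stop sends, finish (or checkpoint) the job in hand, then exit before the grace period expires and SIGKILL arrives. Purely as an illustration of that idea (not Laravel code), a minimal Java sketch with a hypothetical job-processing step:

// Sketch of a generic worker loop that shuts down cleanly on SIGTERM.
// The job source and processing step are hypothetical placeholders.
import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulWorker {

  private static final AtomicBoolean running = new AtomicBoolean(true);

  public static void main(String[] args) {
    Thread mainThread = Thread.currentThread();

    // Shutdown hooks run when the process receives SIGTERM (e.g. during the
    // grace period of "docker stop"): ask the loop to stop, then wait for it to drain.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      running.set(false);
      try {
        mainThread.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }));

    while (running.get()) {
      processNextJob();  // finish the job in hand before checking the flag again
    }
    System.out.println("Finished current job, exiting cleanly");
  }

  // Hypothetical stand-in for taking one job off a queue and handling it.
  private static void processNextJob() {
    try {
      Thread.sleep(500);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}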
I am running a long-lived Spark job on AWS EMR using YARN as the resource manager. After running for a while, some of the nodes stop responding, and looking at Ganglia I can see that we have run out of memory.
Once this happens, the application is killed and the memory is recovered. However, if I try to monitor the memory using sc.getExecutorStorageStatus()[executor].memUsed() and sc.getExecutorStorageStatus()[executor].memRemaining(), the system reports that only 140 MB of the 25 GB is in use (right before the crash). Looking at the EMR cluster itself, it appears the Hadoop and YARN processes are the ones consuming the resources.
Is there a way to programmatically determine the resources utilized by YARN during the runtime of a Spark application?
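One possible approach, sketched here as an assumption rather than something taken from the thread, is to ask the ResourceManager for the application's resource usage report through the YarnClient API. The application id below is hypothetical, and ApplicationId.fromString needs Hadoop 2.8+ (older releases can use ConverterUtils.toApplicationId):

// Sketch: query the ResourceManager for the resources YARN has allocated
// to a running application. The application id below is hypothetical.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnUsageProbe {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // ApplicationId.fromString requires Hadoop 2.8+.
    ApplicationId appId = ApplicationId.fromString("application_1234567890123_0001");
    ApplicationReport report = yarn.getApplicationReport(appId);
    ApplicationResourceUsageReport usage = report.getApplicationResourceUsageReport();

    System.out.println("Containers in use : " + usage.getNumUsedContainers());
    System.out.println("Memory in use (MB): " + usage.getUsedResources().getMemory());
    System.out.println("VCores in use     : " + usage.getUsedResources().getVirtualCores());

    yarn.stop();
  }
}

Inside the Spark driver the application id does not need to be hard-coded; sc.applicationId returns it.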
My Hadoop job had been running for over 10 hours, but because I put it in the wrong queue, its containers keep getting killed by the scheduler.
How do I change the queue of a currently running Hadoop job without restarting it?
Thank you
If you are running YARN, you can change the current job's queue with:
yarn application -movetoqueue <app_id> -queue <queue_name>
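If a shell call is not convenient (for example from a monitoring service), the same move can presumably be done through the YarnClient API; a small sketch, with a hypothetical application id and queue name:

// Sketch: programmatic equivalent of "yarn application -movetoqueue".
// The application id and target queue below are hypothetical.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MoveToQueue {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    ApplicationId appId = ApplicationId.fromString("application_1234567890123_0001");
    yarn.moveApplicationAcrossQueues(appId, "production");  // target queue name

    yarn.stop();
  }
}

Whether the move is accepted depends on the scheduler in use; not every scheduler or queue configuration supports moving a running application.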
I want to set up a series of Spark steps on an EMR Spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there are no jobs running. I don't want to terminate the cluster, because doing so would force me to buy a whole new hour of whatever cluster I'm running. Can anyone please help me terminate a Spark step on EMR without terminating the entire cluster?
That's easy:
yarn application -kill [application id]
You can list your running applications with:
yarn application -list
You can kill the application from the ResourceManager web UI (via the links at the top right, under cluster status).
In the ResourceManager, click on the application you want to kill; on the application page there is a small "kill" link (top left) you can click to kill the application.
Obviously you can also SSH in, but I think this way is faster and easier for some users.
Can someone tell me how to kill a container? I see that nodes are still running containers even after the application has finished, and I want to know the command to kill them. Because of this issue, my subsequent applications stay in the ACCEPTED state.
Thanks
hadoop job -list
This lists the jobs that are running, along with their job IDs.
To kill a job:
hadoop job -kill <JobID>
If the YARN application has finished and some containers are still running, I'd say this is a bug somewhere. Is this an MR app? I don't think there is any command to kill individual containers, and in any case those should be handled by the NodeManager. The ResourceManager and NodeManager should kill all containers when the application finishes.
You didn't provide any info on what this app is, the Hadoop version, the operating system, etc. Having said that, I once had a problem on my Ubuntu hosts, which were hit by the HADOOP-9752 bug that prevented the NodeManager from killing a container.