How do I use snakemake --cluster-cancel? - cluster-computing

I currently have a snakemake pipeline running with multiple jobs on a cluster. I want to cancel my jobs early, and the snakemake documentation says that I can use the --cluster-cancel option. However, it doesn't have any example of how to use it. The cluster I am using cancels jobs with qdel. So, I tried using snakemake --cluster-cancel "qdel", but when I do this it returns the error
snakemake: error: unrecognized arguments: --cluster-cancel

--cluster-cancel was introduced in Snakemake 7.0.0, so the "unrecognized arguments" error means the version you are running is older than that. If you can use 7.0.0 or newer on the cluster, the option should work as documented.
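If so, the invocation would look roughly like the sketch below (the qsub flags are placeholders; adapt them to your site's submission requirements):
snakemake --jobs 10 \
    --cluster "qsub -cwd -V" \
    --cluster-cancel "qdel"
Snakemake appends the external job ID(s) as arguments to the --cluster-cancel command, so a plain "qdel" is usually enough.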

I haven't used the --cluster-cancel feature, but I have the following routine whenever I want to kill a bunch of snakemake jobs on my cluster (a qdel-based equivalent is sketched after the steps).
Interrupt the snakemake pipeline: ^C
Kill existing jobs on the cluster: bkill 0 (kills all jobs) or bkill [jobid]
Unlock the snakemake pipeline for future use: snakemake --unlock
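On a qdel-based cluster like the one in the question, a rough equivalent would be the following (SGE-style qdel syntax; the exact flags differ between schedulers):
# after interrupting snakemake with Ctrl-C:
qdel -u "$USER"       # delete all of your own queued and running jobs
snakemake --unlock    # release the working-directory lock for the next run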

Related

Does snakebite handle retry in case of cluster failure?

Snakebite has recently emerged as an alternative to the HDFS CLI. As I understand it, the HDFS CLI does not retry a command if it fails due to a cluster issue.
My question is: does snakebite handle retries in case of cluster failure? By retry I mean that it attempts the command several times if it keeps failing.
Yes, the CLI interface does retry. But once enough retries have failed, the command does eventually fail.
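If you need more resilience than the built-in retries provide, an external retry loop around a snakebite CLI call is straightforward. This is only an illustrative sketch; the path and the retry count are placeholders:
# retry a snakebite listing up to three times before giving up
for attempt in 1 2 3; do
    snakebite ls /user/data && break
    echo "attempt $attempt failed, retrying in 5s..." >&2
    sleep 5
done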

Running multiple worker daemons in SLURM

I want to run multiple worker daemons on a single machine. According to damienfrancois's answer to "what is the minimum number of computers for a slurm cluster", this can be done. The problem is that currently I can only get one worker daemon running on one machine. For example, when I run
sudo slurmd -N linux1 -cDvv
sudo slurmd -N linux2 -cDvv
linux1 goes down when I run linux2. Is it possible to run multiple worker daemons on one machine?
Here is my slurm.conf file
As your intention seems to be just to test the behavior of Slurm, I would recommend using front-end mode, where you can create dummy compute nodes on the same machine.
The Slurm FAQ has more details, but basically you must configure your installation to work with this mode:
./configure --enable-front-end
And configure the nodes in slurm.conf
NodeName=test[1-100] NodeHostName=localhost
In that guide, they also explain how to launch more than one real daemon on the same node by changing the ports, but for my testing purposes it was not necessary.
Good luck!
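As a rough sketch of what running in front-end mode looks like (the flags are standard foreground/verbose options, but your setup may differ), a single slurmd then represents all of the dummy nodes:
sudo slurmctld -Dvv &    # controller in the foreground, verbose
sudo slurmd -Dvv &       # one front-end slurmd serving test[1-100]
sinfo                    # the dummy nodes should now be listed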
I had the same issue; I resolved it by modifying the paths of the log files as described under "multiple slurmd support".
For example, in your slurm.conf,
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
must be
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
SlurmdSpoolDir=/var/spool/slurmd.%n
Now you can launch multiple slurmd daemons.
Note: I tried with your slurm.conf and I think some parameters are missing; for example, you should define two NodeName entries instead of one and specify which Port to use for each node.
This works for me
# COMPUTE NODES
NodeName=linux[1-10] NodeHostname=linux0 Port=17004 CPUs=1 State=UNKNOWN
NodeName=linux[11-19] NodeHostname=linux0 Port=17005 CPUs=1 State=UNKNOWN
# PARTITIONS
PartitionName=main Nodes=linux1 Default=YES MaxTime=INFINITE State=UP
PartitionName=dev Nodes=linux11 Default=YES MaxTime=INFINITE State=UP
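With a configuration like that, one slurmd per virtual node can then be started by name; a sketch for the two nodes referenced by the partitions:
sudo slurmd -N linux1 -cDvv &
sudo slurmd -N linux11 -cDvv &
sinfo    # both nodes should register and appear in their partitions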

Pig job gets killed on Amazon EMR.

I have been trying to run a pig job with multiple steps on Amazon EMR. Here are the details of my environment:
Number of nodes: 20
AMI Version: 3.1.0
Hadoop Distribution: 2.4.0
The Pig script has multiple steps and it spawns a long-running MapReduce job with both a map phase and a reduce phase. After running for some time (sometimes an hour, sometimes three or four), the job is killed. The information on the resource manager for the job is:
Kill job received from hadoop (auth:SIMPLE) at
Job received Kill while in RUNNING state.
Obviously, I did not kill it :)
My question is: how do I go about identifying what exactly happened? How do I diagnose the issue? Which log files should I look at (and what should I grep for)? Any pointers to where the appropriate log files live would be greatly appreciated. I am new to YARN/Hadoop 2.0.
There can be a number of reasons. Enable debugging on your cluster and check the stderr logs for more information.
aws emr create-cluster --name "Test cluster" --ami-version 3.9 --log-uri s3://mybucket/logs/ \
--enable-debugging --applications Name=Hue Name=Hive Name=Pig
More details here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
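Once logging is enabled, the step logs are written under the log URI, typically in a layout like s3://mybucket/logs/<cluster-id>/steps/<step-id>/ containing stderr, stdout and syslog files. A hedged sketch of pulling them with the AWS CLI (bucket, cluster id and step id are placeholders):
aws s3 ls s3://mybucket/logs/j-XXXXXXXXXXXXX/steps/
aws s3 cp s3://mybucket/logs/j-XXXXXXXXXXXXX/steps/s-YYYYYYYYYYYYY/stderr.gz - | zcat | less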

Error: Jobflow entered COMPLETED while waiting to ssh

I just started practicing with AWS EMR.
I have a sample word-count application set up, run, and completed from the web interface.
Following the guide here, I have set up the command-line interface.
So when I run the command:
./elastic-mapreduce --list
I receive
j-27PI699U14QHH     COMPLETED     ec2-54-200-169-112.us-west-2.compute.amazonaws.com     Word count
COMPLETED Setup hadoop debugging
COMPLETED Word count
Now, I want to see the log files. I run the command
./elastic-mapreduce --ssh --jobflow j-27PI699U14QHH
Then I receive the following error:
Error: Jobflow entered COMPLETED while waiting to ssh
Can someone please help me understand what's going on here?
Thanks,
When you set up a job on EMR, Amazon provisions a cluster on demand for you for a limited amount of time. During that time, you are free to ssh to your cluster and look at the logs as much as you want, but once your job has finished running, your cluster is taken down! At that point, you won't be able to ssh anymore because your cluster simply won't exist.
The workflow typically looks like this:
Create your jobflow
It will sit in the STARTING state for a few minutes. At that point, if you try to run ./elastic-mapreduce --ssh --jobflow <jobid>, it will simply wait because the cluster is not available yet.
After a while the status will switch to RUNNING. If you had already started the ssh command above it should automatically connect you to your cluster. Otherwise you can initiate your ssh command now and it should connect you directly without any wait.
Depending on the nature of your job, the RUNNING step could take a while or be very short; it depends on how much data you're processing and on the nature of your computations.
Once all your data has been processed, the status will switch to SHUTTING_DOWN. At that point, if you had already sshed in, you will get disconnected. If you try to use the ssh command at that point, it will not connect.
Once the cluster has finished shutting down it will enter a terminal state of either COMPLETED or FAILED depending on whether your job succeeded or not. At that point your cluster is no longer available, and if you try to ssh you will get the error you are seeing.
Of course there are exceptions: you could set up an EMR cluster in interactive mode, for example if you just want Hive installed so you can ssh in and run Hive queries, in which case you would have to take the cluster down manually. But if you just want a MapReduce job to run, then you will only be able to ssh for the duration of the job.
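With the legacy elastic-mapreduce CLI, interactive use typically means keeping the cluster alive after setup; a hedged sketch, assuming your CLI version supports the --alive and --hive-interactive flags:
./elastic-mapreduce --create --alive --hive-interactive --log-uri s3://myawsbucket
./elastic-mapreduce --ssh --jobflow <jobid printed by the create command>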
That being said, if all you want to do is debugging, there is no need to ssh in the first place! When you create your jobflow, you have the option to enable debugging, so you could do something like this:
./elastic-mapreduce --create --enable-debugging --log-uri s3://myawsbucket
What that means is that all the logs for your job will end up being written to the specified S3 bucket (you have to own this bucket, of course, and have permission to write to it). Also, if you do that, you can afterwards go into the EMR section of the AWS console, where you will see a debug button next to your job; this should make your life much easier.

Submit a job to Oracle Grid Engine in Jenkins continuous integration test system

I know how to run a bash script on Jenkins. However, if I use qsub to submit the bash script to OGE system, how does Jenkins know that my job terminates or not?
You can use "-sync y" on your qsub to cause the qsub to wait until the job(s) are finished.
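In a Jenkins "Execute shell" build step, that makes the grid job run synchronously, and the job's exit status becomes the step's exit status. A minimal sketch, assuming an SGE-style qsub and a hypothetical run_tests.sh script:
qsub -sync y -cwd -b y ./run_tests.sh   # blocks until the job finishes
exit $?                                 # non-zero if the job failed, so Jenkins marks the build failed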
Jenkins allows you to submit the results of a build using a web-based API. I currently use this to monitor a job remotely for a grid at my organization. If you can perform a web POST to the Jenkins server, you could use the script below to accomplish this.
#!/bin/bash
# Post the result of an externally-run job to a Jenkins job's postBuildResult endpoint.
MESSAGE="Some message about job success"
RUNTIME="Some calculation to estimate runtime"
USERNAME="userNameForJenkinsLogin"
PASSWORD="passwordForJenkinsLogin"
JENKINS_HOST="URLToJenkins"
TEST_NAME="Name of Test"
# result 0 marks success; use a non-zero value to mark a failure.
curl -i -X POST -d "<run><log>$MESSAGE</log><result>0</result><duration>$RUNTIME</duration></run>" "http://$USERNAME:$PASSWORD@$JENKINS_HOST/jenkins/job/$TEST_NAME/postBuildResult"
The Jenkins SGE Cloud plugin submits builds to the Sun Grid Engine (SGE) batch scheduler. Both the open source version of SGE and the commercial Univa Grid Engine (UGE) are supported.
This plugin adds a new type of build step, "Run job on SGE", that submits batch jobs to SGE. The build step monitors the job status and periodically appends the progress to the build's Console Output. Should the build fail, the errors and the exit status of the job also appear. If the job is terminated in Jenkins, it is also terminated in SGE.
Builds are submitted to SGE by a new type of cloud, SGE Cloud. The cloud is given a label like any other slave. When a job with a matching label is run, SGE Cloud submits the build to SGE.
