SGE submitted job state doesn't change from "qw" - job-scheduling

I'm using Sun Grid Engine on Ubuntu 14.04 to queue my jobs to be run on a multicore CPU.
I've installed and set up SGE on my system. I created a "hello_world" directory containing two shell scripts, "hello_world.sh" and "hello_world_qsub.sh": the first contains a simple command, and the second contains the qsub command that submits the first script as a job.
Here's what "hello_world.sh" includes:
#!/bin/bash
echo "Hello world" > /home/theodore/tmp/hello_world/hello_world_output.txt
And here's what "hello_world_qsub.sh" includes:
#!/bin/bash
qsub \
-e /home/hello_world/hello_world_qsub.error \
-o /home/hello_world/hello_world_qsub.log \
./hello_world.sh
After making the second script executable and running it from the specified directory with "./hello_world_qsub.sh", the output is reasonable:
Your job 1 ("hello_world.sh") has been submitted
But the output of "qstat" command is frustrating:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1 0.50000 hello_worl mhr qw 05/16/2016 20:26:23 1
And the "state" column always remains on "qw" and never changes to "r".
Here's the output of "qstat -j 1" command:
==============================================================
job_number: 1
exec_file: job_scripts/1
submission_time: Mon May 16 20:26:23 2016
owner: mhr
uid: 1000
group: mhr
gid: 1000
sge_o_home: /home/mhr
sge_o_log_name: mhr
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sge_o_shell: /bin/bash
sge_o_workdir: /home/mhr/hello_world
sge_o_host: localhost
account: sge
stderr_path_list: NONE:NONE:/home/hello_world/hello_world_qsub.error
mail_list: mhr@localhost
notify: FALSE
job_name: hello_world.sh
stdout_path_list: NONE:NONE:/home/hello_world/hello_world_qsub.log
jobshare: 0
env_list:
script_file: ./hello_world.sh
scheduling info: queue instance "mainqueue#localhost" dropped because it is temporarily not available
All queues dropped because of overload or full
And here's the output of "qhost" command:
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
localhost - - - - - - -
What should I do to make my jobs run and finish their task?

From your qhost output, it looks like your machine "localhost" is properly configured in SGE. However, on "localhost" sge_execd is either not running or not configured properly. If it were, qhost would report statistics for "localhost".
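A quick way to confirm this is to check whether sge_execd is alive on that host and restart it if needed. This is a rough sketch assuming the stock Ubuntu gridengine packages; the service name can differ between installations:
# is the execution daemon running on this host?
ps -ef | grep [s]ge_execd
# if not, try restarting it (service name assumed from the Ubuntu gridengine-exec package)
sudo service gridengine-exec restart
# afterwards, qhost should show real load/memory figures instead of "-",
# and qstat -f should list the queue instances as available
qhost
qstat -f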

Related

problem with snakemake submitting jobs with multiple wildcard on SGE

I used snakemake on an LSF cluster before and everything worked just fine. However, recently I migrated to an SGE cluster and I am getting a very strange error when I try to run a job with more than one wildcard.
When I try to submit a job based on this rule
rule download_reads:
    threads: 1
    output: "data/{sp}/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh {wildcards.sp} {wildcards.accesion} data/{wildcards.sp}/raw_reads/{wildcards.accesion}"
I get the following error (snakemake_clust.sh details below):
./snakemake_clust.sh data/Ecol1/raw_reads/SRA123456_1.fastq.gz
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 10
Job counts:
count jobs
1 download_reads
1
[Thu Jul 30 12:08:57 2020]
rule download_reads:
output: data/Ecol1/raw_reads/SRA123456_1.fastq.gz
jobid: 0
wildcards: sp=Ecol1, accesion=SRA123456
scripts/download_reads.sh Ecol1 SRA123456 data/Ecol1/raw_reads/SRA123456
Unable to run job: ERROR! two files are specified for the same host
ERROR! two files are specified for the same host
Exiting.
Error submitting jobscript (exit code 1):
Shutting down, this might take some time.
When I replace the sp wildcard with a constant, it works as expected:
rule download_reads:
    threads: 1
    output: "data/Ecol1/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh Ecol1 {wildcards.accesion} data/Ecol1/raw_reads/{wildcards.accesion}"
I.e. I get
Submitted job 1 with external jobid 'Your job 50731 ("download_reads") has been submitted'.
I wonder why I have this problem; I am sure I used exactly the same rule on the LSF-based cluster before without any problem.
some details
The snakemake submitting script looks like this
#!/usr/bin/env bash
mkdir -p logs
snakemake $@ -p --jobs 10 --latency-wait 120 --cluster "qsub \
-N {rule} \
-pe smp64 \
{threads} \
-cwd \
-b y \
-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
-b y makes the command be executed as it is, -cwd changes the working directory on the compute node to the working directory from which the job was submitted. The other flags / specifications are, I hope, clear.
Also, I am aware of the --drmaa flag, but I think our cluster is not well configured for that. --cluster has so far been the more robust solution.
-- edit 1 --
When I execute exactly the same Snakefile locally (on the frontend, without the --cluster flag), the script gets executed as expected. It seems to be a problem with the interaction between snakemake and the scheduler.
-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
This is a random guess... Multiple wildcards are concatenated with a space before being substituted into logs/{rule}.{wildcards}.err. So even though you use double quotes, SGE treats the resulting string as two files and throws the error. What if you use single quotes instead? Like:
-o 'logs/{rule}.{wildcards}.out' \
-e 'logs/{rule}.{wildcards}.err'
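For concreteness, the whole submission wrapper with that change would look roughly like this (a sketch based on the script in the question, keeping the smp64 parallel environment and the other flags as they were):
#!/usr/bin/env bash
mkdir -p logs
snakemake "$@" -p --jobs 10 --latency-wait 120 --cluster "qsub \
    -N {rule} \
    -pe smp64 {threads} \
    -cwd \
    -b y \
    -o 'logs/{rule}.{wildcards}.out' \
    -e 'logs/{rule}.{wildcards}.err'"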
Alternatively, you could concatenate the wildcards in the rule and use the result on the command line. E.g.:
rule one:
    params:
        wc = lambda wc: '_'.join(wc)
    output: ...
Then use:
-o 'logs/{rule}.{params.wc}.out' \
-e 'logs/{rule}.{params.wc}.err'
(This second solution, if it works, kind of sucks though)

How can I show progress for a long-running Ansible task?

I have some Ansible tasks that perform unfortunately long operations - things like running a synchronization operation with an S3 folder. It's not always clear whether they're progressing or just stuck (or whether the ssh connection has died), so it would be nice to have some sort of progress output displayed. If the command's stdout/stderr were displayed directly, I'd see that, but Ansible captures the output.
Piping output back is a difficult problem for Ansible to solve in its current form. But are there any Ansible tricks I can use to provide some sort of indication that things are still moving?
Current ticket is https://github.com/ansible/ansible/issues/4870
I came across this problem today on OSX, where I was running a docker shell command which took a long time to build and there was no output whilst it built. It was very frustrating to not understand whether the command had hung or was just progressing slowly.
I decided to pipe the output (and error) of the shell command to a port, which could then be listened to via netcat in a separate terminal.
myplaybook.yml
- name: run some long-running task and pipe to a port
  shell: myLongRunningApp > /dev/tcp/localhost/4000 2>&1
And in a separate terminal window:
$ nc -lk 4000
Output from my
long
running
app will appear here
Note that I pipe the error output to the same port; I could just as easily pipe it to a different port.
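For example, splitting stdout and stderr onto two ports (the port numbers are arbitrary) could look like this:
shell: myLongRunningApp > /dev/tcp/localhost/4000 2> /dev/tcp/localhost/4001
And then listen on each port in its own terminal:
nc -lk 4000   # stdout
nc -lk 4001   # stderr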
Also, I ended up setting a variable called nc_port, which allows changing the port in case that port is in use. The Ansible task then looks like:
shell: myLongRunningApp > /dev/tcp/localhost/{{nc_port}} 2>&1
Note that the command myLongRunningApp is being executed on localhost (i.e. that's the host set in the inventory) which is why I listen to localhost with nc.
Ansible has since implemented the following:
---
# Requires ansible 1.8+
- name: 'YUM - async task'
  yum:
    name: docker-io
    state: installed
  async: 1000
  poll: 0
  register: yum_sleeper

- name: 'YUM - check on async task'
  async_status:
    jid: "{{ yum_sleeper.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30
For further information, see the official documentation on the topic (make sure you're selecting your version of Ansible).
There's a couple of things you can do, but as you have rightly pointed out, Ansible in its current form doesn't really offer a good solution.
Official-ish solutions:
One idea is to mark the task as async and poll it. Obviously this is only suitable if it is capable of running in such a manner without causing failure elsewhere in your playbook. The async docs are in the official Ansible documentation, and here's an example lifted from them:
- hosts: all
  remote_user: root
  tasks:
  - name: simulate long running op (15 sec), wait for up to 45 sec, poll every 5 sec
    command: /bin/sleep 15
    async: 45
    poll: 5
This can at least give you a 'ping' to know that the task isn't hanging.
The only other officially endorsed method would be Ansible Tower, which has progress bars for tasks but isn't free.
Hacky-ish solutions:
Beyond the above, you're pretty much going to have to roll your own. Your specific example of synching an S3 bucket could be monitored fairly easily with a script periodically calling the AWS CLI and counting the number of items in a bucket, but that's hardly a good, generic solution.
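As a rough sketch of that idea (assuming the AWS CLI is installed and using a made-up bucket name), you could poll the object count from another terminal while the sync runs:
# count the objects in the bucket every 30 seconds (bucket name is an example)
while true; do
    aws s3 ls s3://my-sync-bucket --recursive | wc -l
    sleep 30
done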
The only thing I could imagine being somewhat effective would be watching the incoming ssh session from one of your nodes.
To do that, you could configure the ansible user on that machine to connect via screen and actively watch it. Alternatively, you could use the log_output option in the sudoers entry for that user, allowing you to tail the file. Details of log_output can be found on the sudoers man page.
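A minimal sketch of that sudoers change (the ansible user name and the drop-in file name are assumptions; review the sudoers man page before applying it):
# hypothetical: enable sudo I/O logging for the "ansible" user via a sudoers drop-in
echo 'Defaults:ansible log_output' | sudo tee /etc/sudoers.d/ansible-iolog
sudo visudo -c   # verify the sudoers files still parse cleanly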
If you're on Linux you may use systemd-run to create a transient unit and inspect the output with journalctl, like:
sudo systemd-run --unit foo \
bash -c 'for i in {0..10}; do
echo "$((i * 10))%"; sleep 1;
done;
echo "Complete"'
And in another session
sudo journalctl -xf --unit foo
It would output something like:
Apr 07 02:10:34 localhost.localdomain systemd[1]: Started /bin/bash -c for i in {0..10}; do echo "$((i * 10))%"; sleep 1; done; echo "Complete".
-- Subject: Unit foo.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit foo.service has finished starting up.
--
-- The start-up result is done.
Apr 07 02:10:34 localhost.localdomain bash[10083]: 0%
Apr 07 02:10:35 localhost.localdomain bash[10083]: 10%
Apr 07 02:10:36 localhost.localdomain bash[10083]: 20%
Apr 07 02:10:37 localhost.localdomain bash[10083]: 30%
Apr 07 02:10:38 localhost.localdomain bash[10083]: 40%
Apr 07 02:10:39 localhost.localdomain bash[10083]: 50%
Apr 07 02:10:40 localhost.localdomain bash[10083]: 60%
Apr 07 02:10:41 localhost.localdomain bash[10083]: 70%
Apr 07 02:10:42 localhost.localdomain bash[10083]: 80%
Apr 07 02:10:43 localhost.localdomain bash[10083]: 90%
Apr 07 02:10:44 localhost.localdomain bash[10083]: 100%
Apr 07 02:10:45 localhost.localdomain bash[10083]: Complete

script from cron doesn't run

I have a script:
-rwx------. 1 root root 135 Oct 15 12:00 /backup/purge.sh
#!/bin/bash
volume=`echo "list volumes" | bconsole|grep -i "Append\|Full"|awk '{print $4}'`
echo "purge volume=$volume yes" | bconsole
If I run it manually it runs.
If I put the script in crontab, it doesn't run; however, the log says it ran.
Oct 15 16:07:01 sdfdsfdsf CROND[36326]: (root) CMD (/backup/purge.sh)
The schedule:
07 16 * * * /backup/purge.sh
If I run manually:
/backup/purge.sh
Connecting to Director weewr:9101
1000 OK: 1 werewrewrewr Version: 7.0.5 (28 July 2014)
Enter a period to cancel a command.
purge volume=Vol-0001 yes
This command can be DANGEROUS!!!
It purges (deletes) all Files from a Job,
JobId, Client or Volume; or it purges (deletes)
all Jobs from a Client or Volume without regard
to retention periods. Normally you should use the
PRUNE command, which respects retention periods.
Automatically selected Catalog: MyCatalog
Using Catalog "MyCatalog"
1 File on Volume "Vol-0001" purged from catalog.
There are no more Jobs associated with Volume "Vol-0001". Marking it purged.
bconsole wasn't in cron's PATH, so I used the full path for the bconsole command, like this:
#!/bin/bash
volume=`echo "list volumes" | /sbin/bconsole|grep -i "Append\|Full"|awk '{print $4}'`
echo "purge volume=$volume yes" | /sbin/bconsole

SH - How can I make a script read an output and follow a condition based on that output?

I have a KVM virtual machine that is shutting down by itself at random times. Until I fix it I have to keep it up and running.
I know if the virtual machine is running or not by executing the command virsh dominfo kvm110. Output:
[root@dal01 ~]# virsh dominfo kvm110
Id: 54
Name: kvm110
UUID: ea136d63-4806-4d8c-a9b3-7b9f412552c3
OS Type: hvm
State: running
CPU(s): 7
CPU time: 885.4s
Max memory: 10485760 KiB
Used memory: 10485760 KiB
Persistent: no
Autostart: disable
Managed save: no
Security model: none
Security DOI: 0
I want to make a script that executes the above command every second, reads the State: line, and then does one of two things depending on whether the VM is running or not running. I have no coding experience in sh or Bash, but I'd imagine the script would be something as simple as this:
def check():
    if "status" = "running" :
        running = true
        print ("VM is running")
    else:
        print ("VM is NOT running") + print time.localtime() #prints timestamp
        run command "virsh start kvm110"
check()
I didn't include the part where the script runs virsh dominfo kvm110 and parses the line State: out of it because I don't know how to do that.
Could someone help me out?
Use grep. If it matches its argument, it exits successfully (code 0), otherwise unsuccessfully (code 1, or 2 for an error). You can test whether a command succeeded or failed with an if statement:
if virsh dominfo kvm110 | grep -q '^State: *running'
then
    echo "VM is running"
else
    echo "VM is NOT running, " $(date) #prints timestamp
fi
The -q means "quiet". It tells grep not to print out the matching text.
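Building on that, a minimal sketch of the full watcher the question asks for (check every second, restart when not running; run it as a user allowed to call virsh) could look like:
#!/bin/bash
# loop forever: check the VM state every second and restart it if it has stopped
while true; do
    if virsh dominfo kvm110 | grep -q '^State: *running'; then
        echo "VM is running"
    else
        echo "VM is NOT running, $(date)"   # timestamp of the outage
        virsh start kvm110
    fi
    sleep 1
done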

How do I make cloud-init startup scripts run every time my EC2 instance boots?

I have an EC2 instance running an AMI based on the Amazon Linux AMI. Like all such AMIs, it supports the cloud-init system for running startup scripts based on the User Data passed into every instance. In this particular case, my User Data input happens to be an Include file that sources several other startup scripts:
#include
http://s3.amazonaws.com/path/to/script/1
http://s3.amazonaws.com/path/to/script/2
The first time I boot my instance, the cloud-init startup script runs correctly. However, if I do a soft reboot of the instance (by running sudo shutdown -r now, for instance), the instance comes back up without running the startup script the second time around. If I go into the system logs, I can see:
Running cloud-init user-scripts
user-scripts already ran once-per-instance
[ OK ]
This is not what I want -- I can see the utility of having startup scripts that only run once per instance lifetime, but in my case these should run every time the instance starts up, like normal startup scripts.
I realize that one possible solution is to manually have my scripts insert themselves into rc.local after running the first time. This seems burdensome, however, since the cloud-init and rc.d environments are subtly different and I would now have to debug scripts on first launch and all subsequent launches separately.
Does anyone know how I can tell cloud-init to always run my scripts? This certainly sounds like something the designers of cloud-init would have considered.
In 11.10, 12.04 and later, you can achieve this by making the 'scripts-user' run 'always'.
In /etc/cloud/cloud.cfg you'll see something like:
cloud_final_modules:
- rightscale_userdata
- scripts-per-once
- scripts-per-boot
- scripts-per-instance
- scripts-user
- keys-to-console
- phone-home
- final-message
This can be modified after boot, or cloud-config data overriding this stanza can be inserted via user-data. I.e., in user-data you can provide:
#cloud-config
cloud_final_modules:
- rightscale_userdata
- scripts-per-once
- scripts-per-boot
- scripts-per-instance
- [scripts-user, always]
- keys-to-console
- phone-home
- final-message
That can also be '#included' as you've done in your description.
Unfortunately, right now, you cannot modify the 'cloud_final_modules', but only override it. I hope to add the ability to modify config sections at some point.
There is a bit more information on this in the cloud-config doc at
https://github.com/canonical/cloud-init/tree/master/doc/examples
Alternatively, you can put files in /var/lib/cloud/scripts/per-boot, and they'll be run by the 'scripts-per-boot' path.
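For example, a per-boot script could be dropped in like this (the file name and payload are only illustrations):
# hypothetical per-boot script; anything executable in this directory runs on every boot
sudo tee /var/lib/cloud/scripts/per-boot/99-hello.sh >/dev/null <<'EOF'
#!/bin/sh
echo "booted at $(date)" >> /var/log/per-boot.log
EOF
sudo chmod +x /var/lib/cloud/scripts/per-boot/99-hello.sh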
In /etc/init.d/cloud-init-user-scripts, edit this line:
/usr/bin/cloud-init-run-module once-per-instance user-scripts execute run-parts ${SCRIPT_DIR} >/dev/null && success || failure
to
/usr/bin/cloud-init-run-module always user-scripts execute run-parts ${SCRIPT_DIR} >/dev/null && success || failure
Good luck !
cloud-init supports this now natively; see the runcmd vs bootcmd command descriptions in the documentation (http://cloudinit.readthedocs.io/en/latest/topics/examples.html#run-commands-on-first-boot):
"runcmd":
#cloud-config
# run commands
# default: none
# runcmd contains a list of either lists or a string
# each item will be executed in order at rc.local like level with
# output to the console
# - runcmd only runs during the first boot
# - if the item is a list, the items will be properly executed as if
# passed to execve(3) (with the first arg as the command).
# - if the item is a string, it will be simply written to the file and
# will be interpreted by 'sh'
#
# Note, that the list has to be proper yaml, so you have to quote
# any characters yaml would eat (':' can be problematic)
runcmd:
- [ ls, -l, / ]
- [ sh, -xc, "echo $(date) ': hello world!'" ]
- [ sh, -c, echo "=========hello world'=========" ]
- ls -l /root
- [ wget, "http://slashdot.org", -O, /tmp/index.html ]
"bootcmd":
#cloud-config
# boot commands
# default: none
# this is very similar to runcmd, but commands run very early
# in the boot process, only slightly after a 'boothook' would run.
# bootcmd should really only be used for things that could not be
# done later in the boot process. bootcmd is very much like
# boothook, but possibly with more friendly.
# - bootcmd will run on every boot
# - the INSTANCE_ID variable will be set to the current instance id.
# - you can use 'cloud-init-per' command to help only run once
bootcmd:
- echo 192.168.1.130 us.archive.ubuntu.com >> /etc/hosts
- [ cloud-init-per, once, mymkfs, mkfs, /dev/vdb ]
also note the "cloud-init-per" command example in bootcmd. From it's help:
Usage: cloud-init-per frequency name cmd [ arg1 [ arg2 [ ... ] ]
run cmd with arguments provided.
This utility can make it easier to use boothooks or bootcmd
on a per "once" or "always" basis.
If frequency is:
* once: run only once (do not re-run for new instance-id)
* instance: run only the first boot for a given instance-id
* always: run every boot
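As an illustration, cloud-init-per simply wraps a command so that it only runs at the chosen frequency; the names and commands below are made up, echoing the mkfs example above:
# run only on the first boot of a given instance-id
cloud-init-per instance format_vdb mkfs -t ext4 /dev/vdb
# run on every boot
cloud-init-per always log_boot sh -c 'date >> /var/log/boots.log'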
One possibility, although somewhat hackish, is to delete the lock file that cloud-init uses to determine whether or not the user-script has already run. In my case (Amazon Linux AMI), this lock file is located in /var/lib/cloud/sem/ and is named user-scripts.i-7f3f1d11 (the hash part at the end changes every boot). Therefore, the following user-data script added to the end of the Include file will do the trick:
#!/bin/sh
rm /var/lib/cloud/sem/user-scripts.*
I'm not sure if this will have any adverse effects on anything else, but it has worked in my experiments.
Please use the wrapper below above your bash script in the user data. In this example I'm printing "Hello World." to a file. Stop the instance before changing its user data.
The script:
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0
--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"
#cloud-config
cloud_final_modules:
- [scripts-user, always]
--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"
#!/bin/bash
/bin/echo "Hello World." >> /var/tmp/sdksdfjsdlf
--//
I struggled with this issue for almost two days, tried all of the solutions I could find and finally, combining several approaches, came up with the following:
MyResource:
  Type: AWS::EC2::Instance
  Metadata:
    AWS::CloudFormation::Init:
      configSets:
        setup_process:
          - "prepare"
          - "run_for_instance"
      prepare:
        commands:
          01_apt_update:
            command: "apt-get update"
          02_clone_project:
            command: "mkdir -p /replication && rm -rf /replication/* && git clone https://github.com/awslabs/dynamodb-cross-region-library.git /replication/dynamodb-cross-region-library/"
          03_build_project:
            command: "mvn install -DskipTests=true"
            cwd: "/replication/dynamodb-cross-region-library"
          04_prepare_for_apac:
            command: "mkdir -p /replication/replication-west && rm -rf /replication/replication-west/* && cp /replication/dynamodb-cross-region-library/target/dynamodb-cross-region-replication-1.2.1.jar /replication/replication-west/replication-runner.jar"
      run_for_instance:
        commands:
          01_run:
            command: !Sub "java -jar replication-runner.jar --sourceRegion us-east-1 --sourceTable ${TableName} --destinationRegion ap-southeast-1 --destinationTable ${TableName} --taskName -us-ap >/dev/null 2>&1 &"
            cwd: "/replication/replication-west"
  Properties:
    UserData:
      Fn::Base64:
        !Sub |
          #cloud-config
          cloud_final_modules:
          - [scripts-user, always]
          runcmd:
          - /usr/local/bin/cfn-init -v -c setup_process --stack ${AWS::StackName} --resource MyResource --region ${AWS::Region}
          - /usr/local/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource MyResource --region ${AWS::Region}
This is the setup for DynamoDb cross-region replication process.
If someone wants to do this with CDK, here's a Python example.
For Windows, user data has a special persist tag, but for Linux you need to use multi-part user data to set up cloud-init first. This Linux example worked with the cloud-config part type (see ref blog) instead of cloud-boothook, which requires a cloud-init-per call (see also bootcmd) that I couldn't test out (e.g. cloud-init-per always).
Linux example:
# Create some userdata commands
instance_userdata = ec2.UserData.for_linux()
instance_userdata.add_commands("apt update")
# ...
# Now create the first part to make cloud-init run it always
cinit_conf = ec2.UserData.for_linux()
cinit_conf.add_commands('#cloud-config')
cinit_conf.add_commands('cloud_final_modules:')
cinit_conf.add_commands('- [scripts-user, always]')

multipart_ud = ec2.MultipartUserData()
#### Setup to run every time instance starts
multipart_ud.add_part(ec2.MultipartBody.from_user_data(cinit_conf, content_type='text/cloud-config'))
#### Add the commands desired to run every time
multipart_ud.add_part(ec2.MultipartBody.from_user_data(instance_userdata))

ec2.Instance(
    self, "myec2",
    user_data=multipart_ud,
    # other required config...
)
Windows example:
instance_userdata = ec2.UserData.for_windows()
# Bootstrap
instance_userdata.add_commands("Write-Output 'Run some commands'")
# ...
# Making all the commands persistent - ie: running on each instance start
data_script = instance_userdata.render()
data_script += "<persist>true</persist>"
ud = ec2.UserData.custom(data_script)

ec2.Instance(
    self, "myWinec2",
    user_data=ud,
    # other required config...
)
Another approach is to use #cloud-boothook in your user data script. From the docs:
Cloud Boothook
Begins with #cloud-boothook or Content-Type: text/cloud-boothook.
This content is boothook data. It is stored in a file under /var/lib/cloud and then executed immediately.
This is the earliest "hook" available. There is no mechanism provided for running it only one time. The boothook must take care
of this itself. It is provided with the instance ID in the environment
variable INSTANCE_ID. Use this variable to provide a once-per-instance
set of boothook data.
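A small sketch of such a guard, using INSTANCE_ID so that a boothook's payload effectively runs once per instance (the marker path and payload are illustrative):
#cloud-boothook
#!/bin/sh
# the boothook itself runs on every boot, but the guarded part runs once per instance
marker="/var/lib/cloud/boothook.ran.$INSTANCE_ID"
if [ ! -e "$marker" ]; then
    echo "first boot of instance $INSTANCE_ID at $(date)" >> /var/log/boothook.log
    touch "$marker"
fi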
