Use shell output for error handling for condor

I need to submit multiple simulations to Condor (a multi-client execution grid) from the shell, and since this may take a while I decided to write a shell script to do it for me. I am very new to shell scripting, and this is what I came up with in a day:
for H in {0..50}
do
    for S in {0..10}
    do
        ./p32 -data ../data.txt -out ../result -position $S -group $H
        echo "> Ready to submit"
        condor_submit profile.sub
        echo "> Waiting 15 minutes for group $H Pos $S"
        for W in {1..15}
        do
            echo "Starting minute $W"
            sleep 60
        done
    done
    echo "Deleting data_3 to free up space"
    mkdir /tmp/data_3
    if [ "$H" -lt 10 ]
    then
        tar cfvz /tmp/data_3/group_000$H.tar.gz ../result/data_3/group_000$H
        rm -r ../result/data_3/group_000$H
    else
        tar cfvz /tmp/data_3/group_00$H.tar.gz ../result/data_3/group_00$H
        rm -r ../result/data_3/group_00$H
    fi
done
This script runs through groups 0..50 and, for each group, submits positions 0..10 to a program that generates a Condor submission profile. I then submit this profile and let it execute for 15 minutes (with an echo every minute to ensure the SSH pipe doesn't break). Once the 15 minutes are up, I compress the output to a volume with more space and erase the original files.
The reason for implementing this is that our Condor system can only handle up to 10,000 submissions at once, and one submission (condor_submit profile.sub) executes 7000+ simulations.
Now my problem is with the condor_submit line. When I checked this morning I (luckily) spotted that calling condor_submit profile.sub may fail if the network is too busy. The error is:
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <IP_NUMBER:PORT_NUMBER>
This means that from time to time a whole iteration gets lost! How can I work around this? The only way I see is to read the last line(s) of the terminal output in the shell and check whether they match the expected response, i.e.:
7392 job(s) submitted to cluster CLUSTER_NUMBER.
But how would I read in the last line and go about checking for errors?
Any help is very much appreciated.

Does condor_submit give a non-zero exit code when it fails? If so, you can try calling it like this:
while ! condor_submit profile.sub; do
    sleep 5
done
which will retry submitting the current profile every 5 seconds until it succeeds.
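If the failure can persist (not just a momentary network hiccup), an uncapped loop may spin forever. A variant with a capped number of attempts, as a sketch: `retry_capped`, `MAX_TRIES`, and `DELAY` are names invented here, not Condor conventions.

```shell
# retry_capped CMD ARGS...: run CMD until it succeeds, at most MAX_TRIES
# times, sleeping DELAY seconds between attempts.
# Returns CMD's last exit status if every attempt fails.
MAX_TRIES=${MAX_TRIES:-10}
DELAY=${DELAY:-30}

retry_capped() {
    attempt=1
    until "$@"; do
        status=$?
        if [ "$attempt" -ge "$MAX_TRIES" ]; then
            echo "Giving up after $attempt attempts" >&2
            return "$status"
        fi
        echo "Attempt $attempt failed; retrying in ${DELAY}s" >&2
        attempt=$((attempt + 1))
        sleep "$DELAY"
    done
}
```

You would then call `retry_capped condor_submit profile.sub` in place of the bare condor_submit, and decide what to do with the iteration when it finally gives up.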


Stop command after a given time and return its result in Bash

I need to execute several calls to a C++ program that records frames from a videogame. I have about 1800 test games, and some of them work and some of them don't.
When they don't work, the console returns a Segmentation fault error, but when they do work, the program opens a window and plays the game, and at the same time it records every frame.
The problem is that when it does work, this process does not end until you close the game window.
I need to make a Bash script that will test every game I have and write the names of the ones that work in a text file and the names of the ones that don't work in another file.
For the moment I have tried with this, using the timeout command:
count=0
# Run for every file in the ROMS folder
for filename in ../ROMs/*.bin; do
    # Increase the counter
    (( count++ ))
    # Run the command with a timeout to prevent it from being infinite
    timeout 5 ./doc/examples/videoRecordingExample "$filename"
    # Check if execution succeeds/fails and print in a text file
    if [ $? == 0 ]; then
        echo "Game $count named $filename" >> successGames.txt
    else
        echo "Game $count named $filename" >> failedGames.txt
    fi
done
But it doesn't seem to be working: it writes all the names to the same file. I believe this is because the exit status inside the if refers to timeout and not to the execution of the C++ program itself.
Then I tried without the timeout, and every time a game worked I closed the window manually; then the result was as expected. I tried this with only 10 games, but to test all 1800 I would need it to be completely automatic.
So, is there any way of making this process automatic? Like some command to stop the execution and at the same time know whether it was successful or not?
instead of
timeout 5 ./doc/examples/videoRecordingExample "$filename"
you could try this:
./doc/examples/videoRecordingExample "$filename" && sleep 5 && pkill videoRecordingExample
The argument order in your timeout call is actually correct. The syntax for timeout is:
timeout [OPTION] DURATION COMMAND [ARG]...
so in timeout 5 ./doc/examples/videoRecordingExample "$filename" the COMMAND is already just after the DURATION. The real problem is the exit status: when timeout has to kill the command because the time ran out, it exits with status 124. A working game only stops because of the timeout, so it yields 124, not 0; check for 124 to detect the games that work.
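Note that timeout exits with status 124 when it had to kill the command, so a game that was still running (and recording) after 5 seconds yields 124 rather than 0. A sketch of the question's loop using that status (the program path comes from the question; `|| status=$?` keeps the loop safe under `set -e`):

```shell
#!/bin/bash
count=0
for filename in ../ROMs/*.bin; do
    count=$((count + 1))     # avoids (( count++ )), which returns nonzero when count is 0
    status=0
    timeout 5 ./doc/examples/videoRecordingExample "$filename" || status=$?
    # 124 means timeout killed the program: the game was still running
    # after 5 seconds, which is the "it works" case here.
    if [ "$status" -eq 124 ]; then
        echo "Game $count named $filename" >> successGames.txt
    else
        echo "Game $count named $filename" >> failedGames.txt
    fi
done
```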

whether a shell script can be executed if another instance of the same script is already running

I have a shell script which usually runs nearly 10 minutes for a single run, but I need to know: if another request to run the script comes in while an instance is already running, does the new request have to wait for the existing instance to complete, or is a new instance started?
I need a new instance to be started whenever a request arrives for the same script.
How can I do that?
The shell script is a polling script which looks for a file in a directory and executes it. The execution takes nearly 10 minutes or more, but if a new file arrives during execution, it also has to be executed simultaneously.
The shell script is below; how do I modify it to handle multiple requests?
#!/bin/bash
while [ 1 ]; do
    newfiles=`find /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/ -newer /afs/rch/usr$
    touch /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/.my_marker
    if [ -n "$newfiles" ]; then
        echo "found files $newfiles"
        name2=`ls /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/ -Art |tail -n 2 |head $
        echo " $name2 "
        mkdir -p -m 0755 /afs/rch/usr8/fsptools/WWW/dumpspace/$name2
        name1="/afs/rch/usr8/fsptools/WWW/dumpspace/fipsdumputils/fipsdumputil -e -$
        $name1
        touch /afs/rch/usr8/fsptools/WWW/dumpspace/tempfiles/$name2
    fi
    sleep 5
done
When writing scripts like the one you describe, I take one of two approaches.
First, you can use a pid file to indicate that a second copy should not run. For example:
#!/bin/sh
pidfile=/var/run/${0##*/}.pid

# remove pid if we exit normally or are terminated
trap "rm -f $pidfile" 0 1 3 15

# Write the pid as a symlink
if ! ln -s "pid=$$" "$pidfile"; then
    echo "Already running. Exiting." >&2
    exit 0
fi

# Do your stuff
I like using symlinks to store pid because writing a symlink is an atomic operation; two processes can't conflict with each other. You don't even need to check for the existence of the pid symlink, because a failure of ln clearly indicates that a pid cannot be set. That's either a permission or path problem, or it's due to the symlink already being there.
Second option is to make it possible .. nay, preferable .. not to block additional instances, and instead configure whatever it is that this script does to permit multiple servers to run at the same time on different queue entries. "Single-queue-single-server" is never as good as "single-queue-multi-server". Since you haven't included code in your question, I have no way to know whether this approach would be useful for you, but here's some explanatory meta bash:
#!/usr/bin/env bash
workdir=/var/tmp                 # Set a better $workdir than this.
a=( $(get_list_of_queue_ids) )   # A command? A function? Up to you.
for qid in "${a[@]}"; do
    # Set a "lock" for this item .. or don't, and move on.
    if ! ln -s "pid=$$" "$workdir/$qid.working"; then
        continue
    fi
    # Do your stuff with just this $qid.
    ...
    # And finally, clean up after ourselves
    remove_qid_from_queue "$qid"
    rm "$workdir/$qid.working"
done
The effect of this is to transfer the idea of "one at a time" from the handler to the data. If you have a multi-CPU system, you probably have enough capacity to handle multiple queue entries at the same time.
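Where util-linux is available, the pid-file idea from the first approach can also be delegated to flock(1), which takes an exclusive lock on a file descriptor and releases it automatically on exit. A minimal sketch, assuming /tmp/myscript.lock as an arbitrary lock path:

```shell
#!/bin/sh
# Open a lock file on fd 9 and try to take an exclusive, non-blocking lock.
# If another instance already holds it, exit instead of running twice.
exec 9>/tmp/myscript.lock
if ! flock -n 9; then
    echo "Already running. Exiting." >&2
    exit 0
fi
# Do your stuff; the lock is released when the script exits.
```

Unlike a pid file, a flock lock cannot be left stale: the kernel drops it when the holding process dies, even on a crash.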
ghoti's answer shows some helpful techniques, if modifying the script is an option.
Generally speaking, for an existing script:
Unless you know with certainty that:
the script has no side effects other than to output to the terminal or to write to files with shell-instance specific names (such as incorporating $$, the current shell's PID, into filenames) or some other instance-specific location,
OR that the script was explicitly designed for parallel execution,
I would assume that you cannot safely run multiple copies of the script simultaneously.
It is not reasonable to expect the average shell script to be designed for concurrent use.
From the viewpoint of the operating system, several processes may of course execute the same program in parallel; no need to worry about that.
However, it is conceivable that a (careless) programmer wrote the program in such a way that it produces incorrect results when two copies run in parallel.

Ruby shell script realtime output

script.sh
echo First!
sleep 5
echo Second!
sleep 5
echo Third!
another_script.rb
%x[./script.sh]
I want another_script.rb to print the output of script.sh as it happens. That means printing "First!", waiting five seconds, printing "Second!', waiting 5 seconds, and so on.
I've read through the different ways to run an external script in Ruby, but none seem to do this. How can I fulfill my requirements?
You can always execute this in Ruby:
system("sh", "script.sh")
Note it's important to specify how to execute this unless you have a proper #!/bin/sh header as well as the execute bit enabled.
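For illustration, here is the shell side of that setup with a shortened script.sh: given the shebang line and the execute bit, the script can be invoked directly, and Ruby's system (unlike %x[...]) streams its output as it happens.

```shell
# Create a script with a shebang line and set the execute bit,
# so it can be invoked directly (e.g. via system("./script.sh") in Ruby).
cat > script.sh <<'EOF'
#!/bin/sh
echo First!
EOF
chmod +x script.sh
./script.sh    # prints: First!
```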

Shell script that continuously checks a text file for log data and then runs a program

I have a Java program that often stops due to errors, which are logged in a .log file. What would be a simple shell script to detect a particular text in the last/latest line, say
[INFO] Stream closed
and then run the following command
java -jar xyz.jar
This should keep happening forever (possibly checking every two minutes or so), because xyz.jar writes the log file.
The text "Stream closed" can appear many times in the log file. I only want to take action when it appears in the last line.
How about

while true
do
    sleep 120
    if tail -1 logfile | grep -qF "[INFO] Stream Closed"
    then
        java -jar xyz.jar &
    fi
done

Note the -F: without it, the unescaped [INFO] is a character class matching a single letter. Also, grep returns 0 when the text is found, so that (not 1) is the case in which to restart.
There may be a case where the last log line is "Stream Closed" but the process is still alive and logging messages. We can avoid this by also checking whether the process is alive: only if the process has exited and the last log line is "Stream Closed" do we need to restart the application.
#!/bin/bash
java -jar xyz.jar &
PID=$!
while true
do
    sleep 20
    # still running? keep waiting
    kill -0 $PID 2>/dev/null && continue
    # the process has exited; restart only if the last log line says why
    if tail -1 logfile | grep -qF "Stream Closed"; then
        java -jar xyz.jar &
        PID=$!
    fi
done
I would prefer checking whether the corresponding process is still running and restart the program on that event. There might be other errors that cause the process to stop. You can use a cronjob to periodically (like every minute) perform such a check.
Also, you might want to improve your java code so that it does not crash that often (if you have access to the code).
I solved this using a watchdog script that checks directly (with grep) whether the program(s) are running. By calling the watchdog every minute (from cron under Ubuntu), I basically guarantee (programs and environment are VERY stable) that no program will stay offline for more than 59 seconds.
The script checks a list of programs, using the names in an array, to see whether each one is running and, if not, starts it.
#!/bin/bash
#
# watchdog
#
# Run as a cron job to keep an eye on what_to_monitor which should always
# be running. Restart what_to_monitor and send notification as needed.
#
# This needs to be run as root or a user that can start system services.
#
# Revisions: 0.1 (20100506), 0.2 (20100507)

# first prog to check
NAME[0]=soc_gt2
# 2nd
NAME[1]=soc_gt0
# 3rd, etc etc
NAME[2]=soc_gp00
# START=/usr/sbin/$NAME
NOTIFY=you@gmail.com
NOTIFYCC=you2@mail.com
GREP=/bin/grep
PS=/bin/ps
NOP=/bin/true
DATE=/bin/date
MAIL=/bin/mail
RM=/bin/rm

for nameTemp in "${NAME[@]}"; do
    $PS -ef | $GREP -v grep | $GREP $nameTemp >/dev/null 2>&1
    case "$?" in
        0)
            # It is running in this case so we do nothing.
            echo "$nameTemp is RUNNING OK. Relax."
            $NOP
            ;;
        1)
            echo "$nameTemp is NOT RUNNING. Starting $nameTemp and sending notices."
            START=/usr/sbin/$nameTemp
            $START >/dev/null 2>&1 &
            NOTICE=/tmp/watchdog.txt
            echo "$nameTemp was not running and was started on `$DATE`" > $NOTICE
            # $MAIL -n -s "watchdog notice" -c $NOTIFYCC $NOTIFY < $NOTICE
            $RM -f $NOTICE
            ;;
    esac
done
exit
I do not use the log verification, though you could easily incorporate it into your own version (just swap the grep for a log check, for example).
If you run it from the command line (or PuTTY, if you are connected remotely), you will see what was working and what wasn't. I have been using it for months now without a hiccup; just call it whenever you want to see what's working (regardless of whether it is running under cron).
You could also place all your critical programs in one folder, list the directory, and check that every file there has a program running under the same name; or read a text file line by line, with every line corresponding to a program that is supposed to be running; etc.
A good way is to use the awk command:
tail -f somelog.log | awk '/\[INFO\] Stream Closed/ { system("java -jar xyz.jar") }'
This continually monitors the log stream, and when the regular expression matches it fires off whatever system command you have set, which can be anything you would type into a shell. (The brackets must be escaped; unescaped, [INFO] is a character class matching a single letter.)
If you really want to be thorough you can put that line into a .sh file and run that .sh file from a process-monitoring daemon like upstart to ensure that it never dies.
Nice and clean =D

I want to make a conditional cronjob

I have a cron job that runs every hour. It accesses an XML feed. If the XML feed is unavailable (which seems to happen once a day or so) it creates a "failure" file. This "failure" file has some metadata in it and is erased at the next hour, when the script runs again and the XML feed works again.
What I want is to make a 2nd cron job that runs a few minutes after the first one, looks into the directory for a "failure" file and, if it's there, retries the 1st cron job.
I know how to set up cron jobs, I just don't know how to make scripting conditional like that. Am I going about this in entirely the wrong way?
Possibly. Maybe what you'd be better off doing is having the original script sleep and retry a (limited) number of times.
Sleep is a shell command and shells support looping, so it could look something like:
for ((retry=0; retry<12; retry++)); do
    try the thing
    if [[ -e my_xml_file ]]; then break; fi
    sleep 300
    # five minutes later...
done
As the command to run, try:
/bin/bash -c 'test -e failurefile && retrycommand -someflag -etc'
It runs retrycommand if failurefile exists
Why not have your script touch a status file when it has successfully completed? Run it every 5 minutes, and make the script's first check whether the status file is less than 60 minutes old: if it is young, quit; if it is old, fetch.
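That freshness check can be sketched with find's -mmin test; the status-file path and the fetch step are placeholders:

```shell
#!/bin/sh
status=/tmp/feed.status
# find -mmin -60 prints the file only if it was modified
# less than 60 minutes ago
if [ -e "$status" ] && [ -n "$(find "$status" -mmin -60)" ]; then
    echo "status file is fresh; skipping fetch"
else
    # ... fetch the XML feed here ...
    touch "$status"    # record the successful fetch
fi
```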
I agree with MarkusQ that you should retry in the original job instead of creating another job to watch the first job.
Take a look at this tool to make retrying easier: https://github.com/kadwanev/retry
You can wrap the original cron command in a retry very easily, and the continued existence of the failure file would indicate that it failed even after the retries.
If somebody needs a bash script that pings an endpoint (for example, to run scheduled API tasks via cron) and retries when the response status is bad, then:
#!/bin/bash
echo "Start pinch.sh script."

# run 5 times
for ((i=1; i<=5; i++))
do
    # request https://www.google.com with curl, discard the body,
    # and capture the response status code in a bash variable
    http_response=$(curl -o /dev/null -s -w "%{response_code}" https://www.google.com)
    # check for the expected code
    if [ "$http_response" != "200" ]
    then
        # request failed; wait 300 seconds, then start another iteration
        echo "The pinch is Failed. Sleeping for 5 minutes."
        sleep 300
    else
        # success: exit the loop
        echo "The pinch is OK. Finishing."
        break
    fi
done
exit 0
