snakemake rule calls a shell script but exits after first command - shell

I have a shell script that works well if I just run it from command line. When I call it from a rule within snakemake it fails.
The script runs a for loop over a file of identifiers and uses those to grep the sequences from a fastq file followed by multiple sequence alignment and makes a consensus.
Here is the script. I placed some echo statements in there and for some reason it doesn't call the commands. It stops at the grep statement.
I have tried adding set +o pipefail; in the rule but that doesn't work either.
#!/bin/bash
function Usage(){
echo -e "\
Usage: $(basename $0) -r|--read2 -l|--umi-list -f|--outfile \n\
where: ... \n\
" >&2
exit 1
}
# Check argument count
[[ "$#" -lt 2 ]] && Usage
# parse arguments
while [[ "$#" -gt 1 ]];do
case "$1" in
-r|--read2)
READ2="$2"
shift
;;
-l|--umi-list)
UMI="$2"
shift
;;
-f|--outfile)
OUTFILE="$2"
shift
;;
*)
Usage
;;
esac
shift
done
# Set defaults
# Check arguments
[[ -f "${READ2}" ]] || (echo "Cannot find input file ${READ2}, exiting..." >&2; exit 1)
[[ -f "${UMI}" ]] || (echo "Cannot find input file ${UMI}, exiting..." >&2; exit 1)
#Create output directory
OUTDIR=$(dirname "${OUTFILE}")
[[ -d "${OUTDIR}" ]] || (set -x; mkdir -p "${OUTDIR}")
# Make temporary directories
TEMP_DIR="${OUTDIR}/temp"
[[ -d "${TEMP_DIR}" ]] || (set -x; mkdir -p "${TEMP_DIR}")
#RUN consensus script
for f in $( more "${UMI}" | cut -f1);do
NAME=$(echo $f)
grep "${NAME}" "${READ2}" | cut -f1 -d ' ' | sed 's/#M/M/' > "${TEMP_DIR}/${NAME}.name"
echo subsetting reads
seqtk subseq "${READ2}" "${TEMP_DIR}/${NAME}.name" | seqtk seq -A > "${TEMP_DIR}/${NAME}.fasta"
~/software/muscle3.8.31_i86linux64 -msf -in "${TEMP_DIR}/${NAME}.fasta" -out "${TEMP_DIR}/${NAME}.muscle.fasta"
echo make consensus
~/software/EMBOSS-6.6.0/emboss/cons -sequence "${TEMP_DIR}/${NAME}.muscle.fasta" -outseq "${TEMP_DIR}/${NAME}.cons.fasta"
sed -i 's/n//g' "${TEMP_DIR}/${NAME}.cons.fasta"
sed -i "s/EMBOSS_001/${NAME}.cons/" "${TEMP_DIR}/${NAME}.cons.fasta"
done
cat "${TEMP_DIR}/*.cons.fasta" > "${OUTFILE}"
Snakemake rule:
rule make_consensus:
input:
r2=get_extracted,
lst="{prefix}/{sample}/reads/cell_barcode_umi.count"
output:
fasta="{prefix}/{sample}/reads/fasta/{sample}.R2.consensus.fa"
shell:
"sh ./scripts/make_consensus.sh -r {input.r2} -l {input.lst} -f {output.fasta}"
Edit Snakemake error messages I changed some of the paths to a neutral filepath
RuleException:
CalledProcessError in line 29 of ~/user/scripts/consensus.smk:
Command ' set -euo pipefail; sh ./scripts/make_consensus.sh -r ~/user/file.extracted.fastq -l ~/user/cell_barcode_umi
.count -f ~/user/file.consensus.fa ' returned non-zero exit status 1.
File "~/user/scripts/consensus.smk", line 29, in __rule
_make_consensus
File "~/user/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
If there are better ways to do this than using a shell for loop please let me know!
thanks!
Edit
Script ran as standalone: first grep
grep AGGCCGTTCT_TGTGGATG R_extracted/wgs_5_OL_debug.R2.extracted.fastq | cut -f1 -d ' ' | sed 's/#M/M/' > ./fasta/temp/AGGCCGTTCT_TGTGGATG.name
Script ran through snakemake: first 2 grep statements
grep :::::::::::::: R_extracted/wgs_5_OL_debug.R2.extracted.fastq | cut -f1 -d ' ' | sed 's/#M/M/' > ./fasta/temp/::::::::::::::.name
I'm now trying to figure out where those :::: in snakemake are coming from. All ideas welcome

It stops at the grep statement
My guess is that the grep command in make_consensus.sh doesn't capture anything. grep returns exit code 1 in such cases and the non-zero exit status propagates to snakemake. (see also Handling SIGPIPE error in snakemake)
Loosely related... There is an inconsistency between the shebang of make_consensus.sh that says the script should be executed with bash (#!/bin/bash) and the actual execution using sh (sh ./scripts/make_consensus.sh). (In practice it shouldn't make any difference since sh is probably redirected to bash anyway)

Related

Calling bash script from bash script

I have made two programms and I'm trying to call the one from the other but this is appearing on my screen:
cp: cannot stat ‘PerShip/.csv’: No such file or directory
cp: target ‘tmpship.csv’ is not a directory
I don't know what to do. Here are the programms. Could somebody help me please?
#!/bin/bash
shipname=$1
imo=$(grep "$shipname" shipsNAME-IMO.txt | cut -d "," -f 2)
cp PerShip/$imo'.csv' tmpship.csv
dist=$(octave -q ShipDistance.m 2>/dev/null)
grep "$shipname" shipsNAME-IMO.txt | cut -d "," -f 2 > IMO.txt
idnumber=$(cut -b 4-10 IMO.txt)
echo $idnumber,$dist
#!/bin/bash
rm -f shipsdist.csv
for ship in $(cat shipsNAME-IMO.txt | cut -d "," -f 1)
do
./FindShipDistance "$ship" >> shipsdist.csv
done
cat shipsdist.csv | sort | head -n 1
The code and error messages presented suggest that the second script is calling the first with an empty command-line argument. That would certainly happen if input file shipsNAME-IMO.txt contained any empty lines or otherwise any lines with an empty first field. An empty line at the beginning or end would do it.
I suggest
using the read command to read the data, and manipulating IFS to parse out comma-delimited fields
validating your inputs and other data early and often
making your scripts behave more pleasantly in the event of predictable failures
More generally, using internal Bash features instead of external programs where the former are reasonably natural.
For example:
#!/bin/bash
# Validate one command-line argument
[[ -n "$1" ]] || { echo empty ship name 1>&2; exit 1; }
# Read and validate an IMO corresponding to the argument
IFS=, read -r dummy imo tail < <(grep -F -- "$1" shipsNAME-IMO.txt)
[[ -f PerShip/"${imo}.csv" ]] || { echo no data for "'$imo'" 1>&2; exit 1; }
# Perform the distance calculation and output the result
cp PerShip/"${imo}.csv" tmpship.csv
dist=$(octave -q ShipDistance.m 2>/dev/null) ||
{ echo "failed to compute ship distance for '${imo}'" 2>&1; exit 1; }
echo "${imo:3:7},${dist}"
and
#!/bin/bash
# Note: the original shipsdist.csv will be clobbered
while IFS=, read -r ship tail; do
# Ignore any empty ship name, however it might arise
[[ -n "$ship" ]] && ./FindShipDistance "$ship"
done < shipsNAME-IMO.txt |
tee shipsdist.csv |
sort |
head -n 1
Note that making the while loop in the second script part of a pipeline will cause it to run in a subshell. That is sometimes a gotcha, but it won't cause any problem in this case.

How to process multiple command line arguments in Bash? [duplicate]

This question already has answers here:
How to iterate over arguments in a Bash script
(9 answers)
How do I parse command line arguments in Bash?
(40 answers)
Closed 5 years ago.
I am having problem allowing my script to take more than three arguments. My script will take commands like this, for example:
./myscript.sh -i -v -r filename
so far if it only takes two arguments plus filename:
./myscript.sh -i -v filename
If I run the full commands, [-i] [-v] [-r] it gives this errors...
"mv: invalid option -- 'r'
Try 'mv --help' for more information."
here is my code so far....
#!/bin/bash
dirfile='.trash'
verbose=
function helpfunction()
{
echo "=============== HELP COMMAND ============="
echo ""
echo "-h | --help"
echo "-i | --interactive"
echo "-v | --verbose"
echo "-r | --recursive"
echo ""
}
if [ $# -eq 0 ]; then
echo "no commands"
exit 1
fi
option="${1}"
case ${option} in
-h | --help)
helpfunction
exit
;;
#interactive -i or --interactive
-i) FILE="${*}"
echo "File name is $FILE"
echo -n "Do you want to remove this file (y/n)? "
read answer
if echo "$answer" | grep -iq "^y" ;then
mv $FILE $dirfile
fi
;;
#verbose -v or --verbose
-v) FILE="${*}"
if [ -f "$FILE" ]; then
-v $FILE
fi
;;
#recursive -r or --recursive
-r) FILE="${*}"
if [ -d "${*}" ]; then
rm -r "$FILE"
fi
;;
#unremove -u or --unremove
-u) FILE="${*}"
for file in $dirfile*;
do
if [[ -f $file ]]; then
echo "file found"
else
echo "File not found"
fi
done
;;
#list -l or --list
-l) FILE="${*}"
for entry in "$dirfile"/*
do
echo "$entry"
done
;;
*)
echo "`basename ${0}`:usage: [-f file] | [-d directory]"
exit 1 # Command to come out of the program with status 1
;;
esac
When you say
FILE="${*}"
You're taking all the remaining arguments as one string, and assigning that to FILE. You can see this by adding 'set -x' to the top of your script:
opus$ ./myscript.sh -i -v -r filename
+ '[' 4 -eq 0 ']'
+ option=-i
+ case ${option} in
+ FILE='-i -v -r filename'
+ echo 'File name is -i -v -r filename'
File name is -i -v -r filename
+ echo -n 'Do you want to remove this file (y/n)? '
Do you want to remove this file (y/n)? + read answer
y
+ grep -iq '^y'
+ echo y
+ mv -i -v -r filename
mv: invalid option -- 'r'
Try 'mv --help' for more information.
Saying $1 gives the first argument. After you discover that, you can use shift to load the next argument as $1, and cycle through them like that.
Alternatively, you could just use the builtin getops argument parser.
Joe
You are getting error at this line
mv $FILE $dirfile
Here is the execution of your script in debug mode.
bash -x ./myscript.sh -i -v -r test.txt
+ dirfile=.trash
+ verbose=
+ '[' 4 -eq 0 ']'
+ option=-i
+ case ${option} in
+ FILE='-i -v -r test.txt'
+ echo 'File name is -i -v -r test.txt'
File name is -i -v -r test.txt
+ echo -n 'Do you want to remove this file (y/n)? '
Do you want to remove this file (y/n)? + read answer
y
+ echo y
+ grep -iq '^y'
+ mv -i -v -r test.txt .trash
mv: invalid option -- 'r'
Try 'mv --help' for more information.
-r is not a valid argument for mv.
Update your script like FILE=$4
if you want -i -v -r as optional, you can extract the last argument following this Getting the last argument passed to a shell script
You can also use ${!#} for getting the last parameter.

A script to find all the users who are executing a specific program

I've written the bash script (searchuser) which should display all the users who are executing a specific program or a script (at least a bash script). But when searching for scripts fails because the command the SO is executing is something like bash scriptname.
This script acts parsing the ps command output, it search for all the occurrences of the specified program name, extracts the user and the program name, verifies if the program name is that we're searching for and if it's it displays the relevant information (in this case the user name and the program name, might be better to output also the PID, but that is quite simple). The verification is accomplished to reject all lines containing program names which contain the name of the program but they're not the program we are searching for; if we're searching gedit we don't desire to find sgedit or gedits.
Other issues I've are:
I would like to avoid the use of a tmp file.
I would like to be not tied to GNU extensions.
The script has to be executed as:
root# searchuser programname <invio>
The script searchuser is the following:
#!/bin/bash
i=0
search=$1
tmp=`mktemp`
ps -aux | tr -s ' ' | grep "$search" > $tmp
while read fileline
do
user=`echo "$fileline" | cut -f1 -d' '`
prg=`echo "$fileline" | cut -f11 -d' '`
prg=`basename "$prg"`
if [ "$prg" = "$search" ]; then
echo "$user - $prg"
i=`expr $i + 1`
fi
done < $tmp
if [ $i = 0 ]; then
echo "No users are executing $search"
fi
rm $tmp
exit $i
Have you suggestion about to solve these issues?
One approach might looks like such:
IFS=$'\n' read -r -d '' -a pids < <(pgrep -x -- "$1"; printf '\0')
if (( ! ${#pids[#]} )); then
echo "No users are executing $1"
fi
for pid in "${pids[#]}"; do
# build a more accurate command line than the one ps emits
args=( )
while IFS= read -r -d '' arg; do
args+=( "$arg" )
done </proc/"$pid"/cmdline
(( ${#args[#]} )) || continue # exited while we were running
printf -v cmdline_str '%q ' "${args[#]}"
user=$(stat --format=%U /proc/"$pid") || continue # exited while we were running
printf '%q - %s\n' "$user" "${cmdline_str% }"
done
Unlike the output from ps, which doesn't distinguish between ./command "some argument" and ./command "some" "argument", this will emit output which correctly shows the arguments run by each user, with quoting which will re-run the given command correctly.
What about:
ps -e -o user,comm | egrep "^[^ ]+ +$1$" | cut -d' ' -f1 | sort -u
* Addendum *
This statement:
ps -e -o user,pid,comm | egrep "^\s*\S+\s+\S+\s*$1$" | while read a b; do echo $a; done | sort | uniq -c
or this one:
ps -e -o user,pid,comm | egrep "^\s*\S+\s+\S+\s*sleep$" | xargs -L1 echo | cut -d ' ' -f1 | sort | uniq -c
shows the number of process instances by user.

Bash - output of command seems to be an integer but "[" complains

I am checking to see if a process on a remote server has been killed. The code I'm using is:
if [ `ssh -t -t -i id_dsa headless#remoteserver.com "ps -auxwww |grep pipeline| wc -l" | sed -e 's/^[ \t]*//'` -lt 3 ]
then
echo "PIPELINE STOPPED SUCCESSFULLY"
exit 0
else
echo "PIPELINE WAS NOT STOPPED SUCCESSFULLY"
exit 1
fi
However when I execute this I get:
: integer expression expected
PIPELINE WAS NOT STOPPED SUCCESSFULLY
1
The actual value returned is "1" with no whitespace. I checked that by:
vim <(ssh -t -t -i id_dsa headless#remoteserver.com "ps -auxwww |grep pipeline| wc -l" | sed -e 's/^[ \t]*//')
and then ":set list" which showed only the integer and a line feed as the returned value.
I'm at a loss here as to why this is not working.
If the output of the ssh command is truly just an integer preceded by optional tabs, then you shouldn't need the sed command; the shell will strip the leading and/or trailing whitespace as unnecessary before using it as an operand for the -lt operator.
if [ $(ssh -tti id_dsa headless#remoteserver.com "ps -auxwww | grep -c pipeline") -lt 3 ]; then
It is possible that result of the ssh is not the same when you run it manually as when it runs in the shell. You might try saving it in a variable so you can output it before testing it in your script:
result=$( ssh -tti id_dsa headless#remoteserver.com "ps -auxwww | grep -c pipeline" )
if [ $result -lt 3 ];
The return value you get is not entirely a digit. Maybe some shell-metacharacter/linefeed/whatever gets into your way here:
#!/bin/bash
var=$(ssh -t -t -i id_dsa headless#remoteserver.com "ps auxwww |grep -c pipeline")
echo $var
# just to prove my point here
# Remove all digits, and look wether there is a rest -> then its not integer
test -z "$var" -o -n "`echo $var | tr -d '[0-9]'`" && echo not-integer
# get out all the digits to use them for the arithmetic comparison
var2=$(grep -o "[0-9]" <<<"$var")
echo $var2
if [[ $var2 -lt 3 ]]
then
echo "PIPELINE STOPPED SUCCESSFULLY"
exit 0
else
echo "PIPELINE WAS NOT STOPPED SUCCESSFULLY"
exit 1
fi
As user mbratch noticed I was getting a "\r" in the returned value in addition to the expected "\n". So I changed my sed script so that it stripped out the "\r" instead of the whitespace (which chepner pointed out was unnecessary).
sed -e 's/\r*$//'

Different pipeline behavior between sh and ksh

I have isolated the problem to the below code snippet:
Notice below that null string gets assigned to LATEST_FILE_NAME='' when the script is run using ksh; but the script assigns the value to variable $LATEST_FILE_NAME correctly when run using sh. This in turn affects the value of $FILE_LIST_COUNT.
But as the script is in KornShell (ksh), I am not sure what might be causing the issue.
When I comment out the tee command in the below line, the ksh script works fine and correctly assigns the value to variable $LATEST_FILE_NAME.
(cd $SOURCE_FILE_PATH; ls *.txt 2>/dev/null) | sort -r > ${SOURCE_FILE_PATH}/${FILE_LIST} | tee -a $LOG_FILE_PATH
Kindly consider:
1. Source Code: script.sh
#!/usr/bin/ksh
set -vx # Enable debugging
SCRIPTLOGSDIR=/some/path/Scripts/TEST/shell_issue
SOURCE_FILE_PATH=/some/path/Scripts/TEST/shell_issue
# Log file
Timestamp=`date +%Y%m%d%H%M`
LOG_FILENAME="TEST_LOGS_${Timestamp}.log"
LOG_FILE_PATH="${SCRIPTLOGSDIR}/${LOG_FILENAME}"
## Temporary files
FILE_LIST=FILE_LIST.temp #Will store all extract filenames
FILE_LIST_COUNT=0 # Stores total number of files
getFileListDetails(){
rm -f $SOURCE_FILE_PATH/$FILE_LIST 2>&1 | tee -a $LOG_FILE_PATH
# Get list of all files, Sort in reverse order, and store names of the files line-wise. If no files are found, error is muted.
(cd $SOURCE_FILE_PATH; ls *.txt 2>/dev/null) | sort -r > ${SOURCE_FILE_PATH}/${FILE_LIST} | tee -a $LOG_FILE_PATH
if [[ ! -f $SOURCE_FILE_PATH/$FILE_LIST ]]; then
echo "FATAL ERROR - Could not create a temp file for file list.";exit 1;
fi
LATEST_FILE_NAME="$(cd $SOURCE_FILE_PATH; head -1 $FILE_LIST)";
FILE_LIST_COUNT="$(cat $SOURCE_FILE_PATH/$FILE_LIST | wc -l)";
}
getFileListDetails;
exit 0;
2. Output when using shell sh script.sh:
+ getFileListDetails
+ rm -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300506.log
+ cd /some/path/Scripts/TEST/shell_issue
+ sort -r
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300506.log
+ ls 1.txt 2.txt 3.txt
+ [[ ! -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp ]]
cd $SOURCE_FILE_PATH; head -1 $FILE_LIST
++ cd /some/path/Scripts/TEST/shell_issue
++ head -1 FILE_LIST.temp
+ LATEST_FILE_NAME=3.txt
cat $SOURCE_FILE_PATH/$FILE_LIST | wc -l
++ cat /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
++ wc -l
+ FILE_LIST_COUNT=3
exit 0;
+ exit 0
3. Output when using ksh ksh script.sh:
+ getFileListDetails
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300507.log
+ rm -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ 2>& 1
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300507.log
+ sort -r
+ 1> /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ cd /some/path/Scripts/TEST/shell_issue
+ ls 1.txt 2.txt 3.txt
+ 2> /dev/null
+ [[ ! -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp ]]
+ cd /some/path/Scripts/TEST/shell_issue
+ head -1 FILE_LIST.temp
+ LATEST_FILE_NAME=''
+ wc -l
+ cat /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ FILE_LIST_COUNT=0
exit 0;+ exit 0
OK, here goes...this is a tricky and subtle one. The answer lies in how pipelines are implemented. POSIX states that
If the pipeline is not in the background (see Asynchronous Lists), the shell shall wait for the last command specified in the pipeline to complete, and may also wait for all commands to complete.)
Notice the keyword may. Many shells implement this in a way that all commands need to complete, e.g. see the bash manpage:
The shell waits for all commands in the pipeline to terminate before returning a value.
Notice the wording in the ksh manpage:
Each command, except possibly the last, is run as a separate process; the shell waits for the last command to terminate.
In your example, the last command is the tee command. Since there is no input to tee because you redirect stdout to ${SOURCE_FILE_PATH}/${FILE_LIST} in the command before, it immediately exits. Oversimplified speaking, the tee is faster than the earlier redirection, which means that your file is probably not finished writing to by the time you are reading from it. You can test this (this is not a fix!) by adding a sleep at the end of the whole command:
$ ksh -c 'ls /tmp/* | sort -r > /tmp/foo.txt | tee /tmp/bar.txt; echo "[$(head -n 1 /tmp/foo.txt)]"'
[]
$ ksh -c 'ls /tmp/* | sort -r > /tmp/foo.txt | tee /tmp/bar.txt; sleep 0.1; echo "[$(head -n 1 /tmp/foo.txt)]"'
[/tmp/sess_vo93c7h7jp2a49tvmo7lbn6r63]
$ bash -c 'ls /tmp/* | sort -r > /tmp/foo.txt | tee /tmp/bar.txt; echo "[$(head -n 1 /tmp/foo.txt)]"'
[/tmp/sess_vo93c7h7jp2a49tvmo7lbn6r63]
That being said, here are a few other things to consider:
Always quote your variables, especially when dealing with files, to avoid problems with globbing, word splitting (if your path contains spaces) etc.:
do_something "${this_is_my_file}"
head -1 is deprecated, use head -n 1
If you only have one command on a line, the ending semicolon ; is superfluous...just skip it
LATEST_FILE_NAME="$(cd $SOURCE_FILE_PATH; head -1 $FILE_LIST)"
No need to cd into the directory first, just specify the whole path as argument to head:
LATEST_FILE_NAME="$(head -n 1 "${SOURCE_FILE_PATH}/${FILE_LIST}")"
FILE_LIST_COUNT="$(cat $SOURCE_FILE_PATH/$FILE_LIST | wc -l)"
This is called Useless Use Of Cat because the cat is not needed - wc can deal with files. You probably used it because the output of wc -l myfile includes the filename, but you can use e.g. FILE_LIST_COUNT="$(wc -l < "${SOURCE_FILE_PATH}/${FILE_LIST}")" instead.
Furthermore, you will want to read Why you shouldn't parse the output of ls(1) and How can I get the newest (or oldest) file from a directory?.

Resources