Different pipeline behavior between sh and ksh - shell

I have isolated the problem to the below code snippet:
Notice below that an empty string gets assigned (LATEST_FILE_NAME='') when the script is run with ksh, but the value is assigned to the variable $LATEST_FILE_NAME correctly when the script is run with sh. This in turn affects the value of $FILE_LIST_COUNT.
Since the script is written for KornShell (ksh), I am not sure what might be causing the issue.
When I remove the tee command from the line below, the ksh script works fine and correctly assigns the value to the variable $LATEST_FILE_NAME.
(cd $SOURCE_FILE_PATH; ls *.txt 2>/dev/null) | sort -r > ${SOURCE_FILE_PATH}/${FILE_LIST} | tee -a $LOG_FILE_PATH
Kindly consider:
1. Source Code: script.sh
#!/usr/bin/ksh
set -vx # Enable debugging
SCRIPTLOGSDIR=/some/path/Scripts/TEST/shell_issue
SOURCE_FILE_PATH=/some/path/Scripts/TEST/shell_issue
# Log file
Timestamp=`date +%Y%m%d%H%M`
LOG_FILENAME="TEST_LOGS_${Timestamp}.log"
LOG_FILE_PATH="${SCRIPTLOGSDIR}/${LOG_FILENAME}"
## Temporary files
FILE_LIST=FILE_LIST.temp #Will store all extract filenames
FILE_LIST_COUNT=0 # Stores total number of files
getFileListDetails(){
    rm -f $SOURCE_FILE_PATH/$FILE_LIST 2>&1 | tee -a $LOG_FILE_PATH
    # Get list of all files, Sort in reverse order, and store names of the files line-wise. If no files are found, error is muted.
    (cd $SOURCE_FILE_PATH; ls *.txt 2>/dev/null) | sort -r > ${SOURCE_FILE_PATH}/${FILE_LIST} | tee -a $LOG_FILE_PATH
    if [[ ! -f $SOURCE_FILE_PATH/$FILE_LIST ]]; then
        echo "FATAL ERROR - Could not create a temp file for file list.";exit 1;
    fi
    LATEST_FILE_NAME="$(cd $SOURCE_FILE_PATH; head -1 $FILE_LIST)";
    FILE_LIST_COUNT="$(cat $SOURCE_FILE_PATH/$FILE_LIST | wc -l)";
}
getFileListDetails;
exit 0;
2. Output when run as sh script.sh:
+ getFileListDetails
+ rm -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300506.log
+ cd /some/path/Scripts/TEST/shell_issue
+ sort -r
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300506.log
+ ls 1.txt 2.txt 3.txt
+ [[ ! -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp ]]
cd $SOURCE_FILE_PATH; head -1 $FILE_LIST
++ cd /some/path/Scripts/TEST/shell_issue
++ head -1 FILE_LIST.temp
+ LATEST_FILE_NAME=3.txt
cat $SOURCE_FILE_PATH/$FILE_LIST | wc -l
++ cat /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
++ wc -l
+ FILE_LIST_COUNT=3
exit 0;
+ exit 0
3. Output when run as ksh script.sh:
+ getFileListDetails
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300507.log
+ rm -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ 2>& 1
+ tee -a /some/path/Scripts/TEST/shell_issue/TEST_LOGS_201304300507.log
+ sort -r
+ 1> /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ cd /some/path/Scripts/TEST/shell_issue
+ ls 1.txt 2.txt 3.txt
+ 2> /dev/null
+ [[ ! -f /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp ]]
+ cd /some/path/Scripts/TEST/shell_issue
+ head -1 FILE_LIST.temp
+ LATEST_FILE_NAME=''
+ wc -l
+ cat /some/path/Scripts/TEST/shell_issue/FILE_LIST.temp
+ FILE_LIST_COUNT=0
exit 0;+ exit 0

OK, here goes...this is a tricky and subtle one. The answer lies in how pipelines are implemented. POSIX states that
If the pipeline is not in the background (see Asynchronous Lists), the shell shall wait for the last command specified in the pipeline to complete, and may also wait for all commands to complete.
Notice the keyword may. Many shells implement this in such a way that they wait for all commands to complete; see e.g. the bash manpage:
The shell waits for all commands in the pipeline to terminate before returning a value.
Notice the wording in the ksh manpage:
Each command, except possibly the last, is run as a separate process; the shell waits for the last command to terminate.
In your example, the last command in the pipeline is the tee command. Since tee receives no input (stdout of the preceding command is redirected to ${SOURCE_FILE_PATH}/${FILE_LIST}), it exits immediately. To oversimplify: tee finishes before the earlier redirection does, which means the file has probably not been completely written by the time you read from it. You can test this (this is not a fix!) by adding a sleep at the end of the whole command:
$ ksh -c 'ls /tmp/* | sort -r > /tmp/foo.txt | tee /tmp/bar.txt; echo "[$(head -n 1 /tmp/foo.txt)]"'
[]
$ ksh -c 'ls /tmp/* | sort -r > /tmp/foo.txt | tee /tmp/bar.txt; sleep 0.1; echo "[$(head -n 1 /tmp/foo.txt)]"'
[/tmp/sess_vo93c7h7jp2a49tvmo7lbn6r63]
$ bash -c 'ls /tmp/* | sort -r > /tmp/foo.txt | tee /tmp/bar.txt; echo "[$(head -n 1 /tmp/foo.txt)]"'
[/tmp/sess_vo93c7h7jp2a49tvmo7lbn6r63]
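The real fix is to stop making tee a separate pipeline stage that receives no input. If the intent was to write the sorted list to the temp file and also append it to the log, one possible restructuring (a sketch reusing the variable names from your script, not a tested drop-in replacement) is to let tee write the temp file while its stdout goes to the log, so tee becomes the stage the shell actually waits for:
# tee writes the file list; the same list is appended to the log file
(cd "$SOURCE_FILE_PATH" && ls *.txt 2>/dev/null) | sort -r | tee "${SOURCE_FILE_PATH}/${FILE_LIST}" >> "$LOG_FILE_PATH"
# or, if the log should not contain the list at all, simply drop the tee
(cd "$SOURCE_FILE_PATH" && ls *.txt 2>/dev/null) | sort -r > "${SOURCE_FILE_PATH}/${FILE_LIST}"
Either way the temp file is complete before the next command reads it.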
That being said, here are a few other things to consider:
Always quote your variables, especially when dealing with files, to avoid problems with globbing, word splitting (if your path contains spaces) etc.:
do_something "${this_is_my_file}"
head -1 is deprecated, use head -n 1
If you only have one command on a line, the ending semicolon ; is superfluous...just skip it
LATEST_FILE_NAME="$(cd $SOURCE_FILE_PATH; head -1 $FILE_LIST)"
No need to cd into the directory first, just specify the whole path as argument to head:
LATEST_FILE_NAME="$(head -n 1 "${SOURCE_FILE_PATH}/${FILE_LIST}")"
FILE_LIST_COUNT="$(cat $SOURCE_FILE_PATH/$FILE_LIST | wc -l)"
This is called Useless Use Of Cat because the cat is not needed - wc can deal with files. You probably used it because the output of wc -l myfile includes the filename, but you can use e.g. FILE_LIST_COUNT="$(wc -l < "${SOURCE_FILE_PATH}/${FILE_LIST}")" instead.
Furthermore, you will want to read Why you shouldn't parse the output of ls(1) and How can I get the newest (or oldest) file from a directory?.

Related

What is the meaning of the number of + signs in stderr in Bash when "set -x"

In Bash, help set shows:
-x Print commands and their arguments as they are executed.
Here's test code:
# make script
echo '
#!/bin/bash
set -x
n=$(echo "a" | wc -c)
for i in $(seq $n)
do
file=test_$i.txt
eval "ls -l | head -$i"> $file
rm $file
done
' > test.sh
# execute
chmod +x test.sh
./test.sh 2> stderr
# check
cat stderr
Output
+++ echo a
+++ wc -c
++ n=2
+++ seq 2
++ for i in $(seq $n)
++ file=test_1.txt
++ eval 'ls -l | head -1'
+++ ls -l
+++ head -1
++ rm test_1.txt
++ for i in $(seq $n)
++ file=test_2.txt
++ eval 'ls -l | head -2'
+++ ls -l
+++ head -2
++ rm test_2.txt
What is the meaning of the number of + signs at the beginning of each row in the file? It seems kind of obvious, but I want to avoid misinterpreting it.
In addition, can a single + sign appear there? If so, what does it mean?
The number of + represents subshell nesting depth.
Note that the entire test.sh script is being run in a subshell because it doesn't begin with #!/bin/bash. This has to be on the first line of the script, but it's on the second line because you have a newline at the beginning of the echo argument that contains the script.
When a script is run this way, it's executed by the original shell in a subshell, approximately like
( source test.sh )
Change that to
echo '#!/bin/bash
set -x
n=$(echo "a" | wc -c)
for i in $(seq $n)
do
file=test_$i.txt
eval "ls -l | head -$i"> $file
rm $file
done
' > test.sh
and the top-level commands being run in the script will have a single +.
So for example the command
n=$(echo "a" | wc -c)
produces the output
++ echo a
++ wc -c
+ n=' 2'
echo a and wc -c are executed in the subshell created for the command substitution, so they get two +, while n=<result> is executed in the original shell with a single +.
From man bash:
-x
After expanding each simple command, for command, case command, select command, or arithmetic for command, display the expanded value of PS4, followed by the command and its expanded arguments or associated word list.
So what's PS4 here?
PS4
The value of this parameter is expanded as with PS1 and the value is printed before each command bash displays during an execution trace. The first character of the expanded value of PS4 is replicated multiple times, as necessary, to indicate multiple levels of indirection. The default is + .
The meaning of "indirection" is not further explained, as far as I can find...
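A quick way to see both levels side by side (a minimal illustration, not taken from the question):
bash -c 'set -x; outer=1; inner=$(echo nested)'
# trace printed to stderr:
# + outer=1
# ++ echo nested
# + inner=nested
The plain assignment is traced with a single +, while the echo running inside the command substitution gets two.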

Calling bash script from bash script

I have written two programs and I'm trying to call one from the other, but this is appearing on my screen:
cp: cannot stat ‘PerShip/.csv’: No such file or directory
cp: target ‘tmpship.csv’ is not a directory
I don't know what to do. Here are the programs. Could somebody help me, please?
#!/bin/bash
shipname=$1
imo=$(grep "$shipname" shipsNAME-IMO.txt | cut -d "," -f 2)
cp PerShip/$imo'.csv' tmpship.csv
dist=$(octave -q ShipDistance.m 2>/dev/null)
grep "$shipname" shipsNAME-IMO.txt | cut -d "," -f 2 > IMO.txt
idnumber=$(cut -b 4-10 IMO.txt)
echo $idnumber,$dist
#!/bin/bash
rm -f shipsdist.csv
for ship in $(cat shipsNAME-IMO.txt | cut -d "," -f 1)
do
    ./FindShipDistance "$ship" >> shipsdist.csv
done
cat shipsdist.csv | sort | head -n 1
The code and error messages presented suggest that the second script is calling the first with an empty command-line argument. That would certainly happen if input file shipsNAME-IMO.txt contained any empty lines or otherwise any lines with an empty first field. An empty line at the beginning or end would do it.
I suggest
using the read command to read the data, and manipulating IFS to parse out comma-delimited fields
validating your inputs and other data early and often
making your scripts behave more pleasantly in the event of predictable failures
More generally, using internal Bash features instead of external programs where the former are reasonably natural.
For example:
#!/bin/bash
# Validate one command-line argument
[[ -n "$1" ]] || { echo empty ship name 1>&2; exit 1; }
# Read and validate an IMO corresponding to the argument
IFS=, read -r dummy imo tail < <(grep -F -- "$1" shipsNAME-IMO.txt)
[[ -f PerShip/"${imo}.csv" ]] || { echo no data for "'$imo'" 1>&2; exit 1; }
# Perform the distance calculation and output the result
cp PerShip/"${imo}.csv" tmpship.csv
dist=$(octave -q ShipDistance.m 2>/dev/null) ||
  { echo "failed to compute ship distance for '${imo}'" 1>&2; exit 1; }
echo "${imo:3:7},${dist}"
and
#!/bin/bash
# Note: the original shipsdist.csv will be clobbered
while IFS=, read -r ship tail; do
    # Ignore any empty ship name, however it might arise
    [[ -n "$ship" ]] && ./FindShipDistance "$ship"
done < shipsNAME-IMO.txt |
  tee shipsdist.csv |
  sort |
  head -n 1
Note that making the while loop in the second script part of a pipeline will cause it to run in a subshell. That is sometimes a gotcha, but it won't cause any problem in this case.
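The gotcha referred to is that variables assigned inside a loop that is part of a pipeline are lost once the pipeline ends, because the loop body runs in a subshell. A standalone illustration:
count=0
printf 'a\nb\n' | while read -r line; do count=$((count + 1)); done
echo "$count"   # prints 0 in bash: the loop ran in a subshell
Here the loop only feeds data forward through the pipe and nothing downstream depends on its variables, so the subshell does no harm.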

Different output of command substitution

Why does adding | wc -l alter the result, as in the following?
tst:
#!/bin/bash
pgrep tst | wc -l
echo $(pgrep tst | wc -l)
echo $(pgrep tst) | wc -l
$ ./tst
1
2
1
and even
$ bash -x tst
+ wc -l
+ pgrep tst
0
++ pgrep tst
++ wc -l
+ echo 0
0
++ pgrep tst
+ echo
pgrep and subshells can have weird interactions, but in this case that's just a red herring; the actual cause is missing double-quotes around the command substitution:
$ cat tst2
#!/bin/bash
pgrep tst | wc -l
echo "$(pgrep tst | wc -l)"
echo "$(pgrep tst)" | wc -l
$ ./tst2
1
2
2
What's going on in the original script is that in the command
echo $(pgrep tst) | wc -l
pgrep prints two process IDs (the main shell running the script, and a subshell created to handle the echo part of the pipeline). It prints each one as a separate line, something like:
11730
11736
The command substitution captures that, but since it's not in double-quotes the newline between them gets converted to an argument break, so the whole thing becomes equivalent to:
echo 11730 11736 | wc -l
As a result, echo prints both IDs on a single line, and wc -l correctly reports one line.
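You can see the same newline-to-space effect without pgrep at all (a standalone illustration):
printf 'two\nlines\n' | wc -l            # 2
echo $(printf 'two\nlines\n') | wc -l    # 1 - unquoted: the newline becomes a word break
echo "$(printf 'two\nlines\n')" | wc -l  # 2 - quoted: the newline is preserved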
The command substitution induces an additional process that has tst in its name, which is included in the input to wc -l.

snakemake rule calls a shell script but exits after first command

I have a shell script that works well if I run it from the command line, but when I call it from a rule within snakemake it fails.
The script runs a for loop over a file of identifiers, uses them to grep the matching sequences from a fastq file, performs a multiple sequence alignment, and builds a consensus.
Here is the script. I placed some echo statements in there, and for some reason it doesn't call the commands. It stops at the grep statement.
I have tried adding set +o pipefail; in the rule but that doesn't work either.
#!/bin/bash
function Usage(){
echo -e "\
Usage: $(basename $0) -r|--read2 -l|--umi-list -f|--outfile \n\
where: ... \n\
" >&2
exit 1
}
# Check argument count
[[ "$#" -lt 2 ]] && Usage
# parse arguments
while [[ "$#" -gt 1 ]];do
case "$1" in
-r|--read2)
READ2="$2"
shift
;;
-l|--umi-list)
UMI="$2"
shift
;;
-f|--outfile)
OUTFILE="$2"
shift
;;
*)
Usage
;;
esac
shift
done
# Set defaults
# Check arguments
[[ -f "${READ2}" ]] || (echo "Cannot find input file ${READ2}, exiting..." >&2; exit 1)
[[ -f "${UMI}" ]] || (echo "Cannot find input file ${UMI}, exiting..." >&2; exit 1)
#Create output directory
OUTDIR=$(dirname "${OUTFILE}")
[[ -d "${OUTDIR}" ]] || (set -x; mkdir -p "${OUTDIR}")
# Make temporary directories
TEMP_DIR="${OUTDIR}/temp"
[[ -d "${TEMP_DIR}" ]] || (set -x; mkdir -p "${TEMP_DIR}")
#RUN consensus script
for f in $(more "${UMI}" | cut -f1); do
    NAME=$(echo $f)
    grep "${NAME}" "${READ2}" | cut -f1 -d ' ' | sed 's/#M/M/' > "${TEMP_DIR}/${NAME}.name"
    echo subsetting reads
    seqtk subseq "${READ2}" "${TEMP_DIR}/${NAME}.name" | seqtk seq -A > "${TEMP_DIR}/${NAME}.fasta"
    ~/software/muscle3.8.31_i86linux64 -msf -in "${TEMP_DIR}/${NAME}.fasta" -out "${TEMP_DIR}/${NAME}.muscle.fasta"
    echo make consensus
    ~/software/EMBOSS-6.6.0/emboss/cons -sequence "${TEMP_DIR}/${NAME}.muscle.fasta" -outseq "${TEMP_DIR}/${NAME}.cons.fasta"
    sed -i 's/n//g' "${TEMP_DIR}/${NAME}.cons.fasta"
    sed -i "s/EMBOSS_001/${NAME}.cons/" "${TEMP_DIR}/${NAME}.cons.fasta"
done
cat "${TEMP_DIR}/*.cons.fasta" > "${OUTFILE}"
Snakemake rule:
rule make_consensus:
    input:
        r2=get_extracted,
        lst="{prefix}/{sample}/reads/cell_barcode_umi.count"
    output:
        fasta="{prefix}/{sample}/reads/fasta/{sample}.R2.consensus.fa"
    shell:
        "sh ./scripts/make_consensus.sh -r {input.r2} -l {input.lst} -f {output.fasta}"
Edit: Snakemake error messages (I changed some of the paths to a neutral filepath):
RuleException:
CalledProcessError in line 29 of ~/user/scripts/consensus.smk:
Command ' set -euo pipefail; sh ./scripts/make_consensus.sh -r ~/user/file.extracted.fastq -l ~/user/cell_barcode_umi.count -f ~/user/file.consensus.fa ' returned non-zero exit status 1.
File "~/user/scripts/consensus.smk", line 29, in __rule_make_consensus
File "~/user/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
If there are better ways to do this than using a shell for loop, please let me know!
Thanks!
Edit
Script run standalone: the first grep command
grep AGGCCGTTCT_TGTGGATG R_extracted/wgs_5_OL_debug.R2.extracted.fastq | cut -f1 -d ' ' | sed 's/#M/M/' > ./fasta/temp/AGGCCGTTCT_TGTGGATG.name
Script run through snakemake: the first 2 grep statements
grep :::::::::::::: R_extracted/wgs_5_OL_debug.R2.extracted.fastq | cut -f1 -d ' ' | sed 's/#M/M/' > ./fasta/temp/::::::::::::::.name
I'm now trying to figure out where those :::: in snakemake are coming from. All ideas welcome
It stops at the grep statement
My guess is that the grep command in make_consensus.sh doesn't match anything. grep returns exit code 1 in that case, and the non-zero exit status propagates to snakemake (see also Handling SIGPIPE error in snakemake).
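If empty matches are expected and should not abort the job, one common workaround (a sketch, not tested against your data) is to neutralise the exit status of that one pipeline, for example:
grep "${NAME}" "${READ2}" | cut -f1 -d ' ' | sed 's/#M/M/' > "${TEMP_DIR}/${NAME}.name" || true
The trailing || true applies to the whole pipeline, so a no-match grep no longer turns into a non-zero exit status under set -euo pipefail.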
Loosely related... there is an inconsistency between the shebang of make_consensus.sh, which says the script should be executed with bash (#!/bin/bash), and the actual invocation using sh (sh ./scripts/make_consensus.sh). (In practice it shouldn't make any difference, since sh is probably just a link to bash anyway.)

Bash: Subshell behaviour of ls

I am wondering why I do not get the same output from:
ls -1 -tF | head -n 1
and
echo $(ls -1 -tF | head -n 1)
I am trying to get the last modified file, but when I use the command inside a subshell I sometimes get more than one file as the result.
Why is that, and how can I avoid it?
The problem arises because you are using an unquoted command substitution, and the -F flag makes ls append shell special characters to filenames.
-F, --classify
append indicator (one of */=>@|) to entries
Executable files are appended with *.
When you run
echo $(ls -1 -tF | head -n 1)
then
$(ls -1 -tF | head -n 1)
will return a filename; if that file happens to be executable, ls appends a *, and if the name is also a prefix of another filename, the unquoted result will glob to both.
For example if you have
test.sh
test.sh.backup
then it will return
test.sh*
which when echoed expands to
test.sh test.sh.backup
Quoting the command substitution prevents this glob expansion:
echo "$(ls -1 -tF | head -n 1)"
returns
test.sh*
I just found the error:
If you use echo $(ls -1 -tF | head -n 1), the file globbing mechanism may result in additional matches.
So echo "$(ls -1 -tF | head -n 1)" would avoid this.
Because if the result is an executable it contains a * at the end.
I tried to explain the reason for -F in a comment, but decided to put it here instead:
I added the following lines to my .bashrc to have a shortcut for listing the last modified files or directories:
function L {
    myvar=$1; h=${myvar:="1"}
    echo "last ${h} modified file(s):"
    export L=$(ls -1 -tF | fgrep -v / | head -n ${h} | sed 's/\(\*\|=\|#\)$//g')
    ls -l $L
}
function LD {
    myvar=$1; h=${myvar:="1"}
    echo "last ${h} modified directories:"
    export LD=$(ls -1 -tF | fgrep / | head -n $h | sed 's/\(\*\|=\|#\)$//g')
    ls -ld $LD
}
alias ol='L; xdg-open $L'
alias cdl='LD; cd $LD'
So now I can use L (or L 5) to list the last (or last 5) modified files, but not directories.
And with L; jmacs $L I can open my editor to edit it. Previously I used my alias lt='ls -lrt', but then I had to retype the name...
Now, after mkdir ..., I use cdl to change to that directory.
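For what it is worth, here is a glob-based sketch that avoids parsing ls (and therefore the -F markers) altogether; it assumes you only care about regular files in the current directory:
newest=
for f in *; do
    [[ -f $f ]] || continue
    [[ -z $newest || $f -nt $newest ]] && newest=$f
done
printf 'newest file: %s\n' "$newest"
Because it relies only on the [[ ]] -nt test, filenames containing spaces or glob characters are handled safely.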
