How to do floating point comparisons in an if-statement within a GNU parallel block? - bash

I want to run a batch process in parallel. For this I pipe a list to parallel. When I have an if-statement that compares two floating point numbers (taken from here), the code doesn't run anymore. How can this be solved?
LIMIT=25
ps | parallel -j2 '
echo "Do stuff for {} to determine NUM"
NUM=33.3333 # set to demonstrate
if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
echo "react..."
fi
echo "Do stuff..."
'
Prints:
Do stuff for \ \ PID\ TTY\ \ \ \ \ \ \ \ \ \ TIME\ CMD to determine NUM
Do stuff...
(standard_in) 2: syntax error
#... snip

LIMIT is not set inside the shell started by parallel. Running echo "$NUM > $LIMIT" | bc -l expands to echo "33.3333 > " | bc -l, which results in the syntax error reported by bc. You need to export/pass/put its value into the shell run from inside parallel. Try this:
LIMIT=25
ps | parallel -j2 '
LIMIT="'"$LIMIT"'"
echo "Do stuff for {} to determine NUM"
NUM=33.3333 # set to demonstrate
if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
echo "react..."
fi
echo "Do stuff..."
'
Or better, use env_parallel, which is designed for exactly such problems.
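A minimal sketch with env_parallel, assuming it has been activated for bash (e.g. by sourcing env_parallel.bash in ~/.bashrc) so that variables like LIMIT are copied into the child shells:
LIMIT=25
ps | env_parallel -j2 '
echo "Do stuff for {} to determine NUM"
NUM=33.3333 # set to demonstrate
if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
echo "react..."
fi
echo "Do stuff..."
'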
Side note: GNU parallel was designed for executing jobs in parallel using one or more computers. For scripts running on one computer it is better to stick with the xargs command, which is more commonly available (so you don't need to install some package each time you move your script to another machine).
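For the single-machine case, a rough xargs equivalent of the example above might look like this (a sketch, assuming GNU xargs for -d '\n' and bash for the (( )) arithmetic; LIMIT must be exported so the child shells can see it):
LIMIT=25
export LIMIT
ps | xargs -d '\n' -n 1 -P 2 bash -c '
echo "Do stuff for $1 to determine NUM"
NUM=33.3333 # set to demonstrate
if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
echo "react..."
fi
echo "Do stuff..."
' _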

While GNU Parallel is designed to deal correctly with commands spanning multiple lines, I personally find that hard to read. I prefer using a function:
doit() {
arg="$1"
echo "Do stuff for $a to determine NUM"
NUM=33.3333 # set to demonstrate
if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
echo "react..."
fi
echo "Do stuff..."
}
export -f doit
LIMIT=25
export LIMIT
ps | parallel -j2 doit
Instead of the exports you can use env_parallel:
ps | env_parallel -j2 doit
If your environment is too big, use env_parallel --session before starting:
#!/bin/bash
env_parallel --session
# Define functions and variables _after_ running --session
doit() {
[...]
}
LIMIT=25
ps | env_parallel -j2 doit

Related

Make the bash script faster

I have a fairly large list of websites in "file.txt" and want to check whether the words "Hello World!" appear on each site in the list, using a loop and curl.
i.e. in "file.txt":
blabla.com
blabla2.com
blabla3.com
then my code :
#!/bin/bash
put() {
printf "list : "
read list
run=$(cat $list)
}
put
scan_list() {
for run in $(cat $list);do
if [[ $(curl -skL ${run}) =~ "Hello World!" ]];then
printf "${run} Hello World! \n"
else
printf "${run} No Hello:( \n"
fi
done
}
scan_list
This takes a lot of time; is there a way to make the checking process faster?
Use xargs:
% tr '\12' '\0' < file.txt | \
xargs -0 -r -n 1 -t -P 3 sh -c '
if curl -skL "$1" | grep -q "Hello World!"; then
echo "$1 Hello World!"
exit
fi
echo "$1 No Hello:("
' _
Use tr to convert the newlines in file.txt to nulls (\0).
Pass through xargs with -0 option to parse by nulls.
The -r option prevents the command from being run if the input is empty. This is only available on Linux, so for macOS or *BSD you will need to check that file.txt is not empty before running.
The -n 1 permits only one argument (one URL) per execution.
The -t option is for debugging; it prints each command before it is run.
We allow 3 simultaneous commands in parallel with the -P 3 option.
Using sh -c with a single quoted multi-line command, we substitute $1 for the entries from the file.
The _ fills in the $0 argument, so our entries are $1.
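If you prefer wrapping the check in a function, as in the earlier answers, a sketch assuming bash is available (check is just an illustrative name):
check() {
if curl -skL "$1" | grep -q "Hello World!"; then
echo "$1 Hello World!"
else
echo "$1 No Hello:("
fi
}
export -f check
tr '\12' '\0' < file.txt | xargs -0 -r -n 1 -P 3 bash -c 'check "$1"' _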

Passing args to defined bash functions through GNU parallel

Let me show you a snippet of my Bash script and how I try to run parallel:
parallel -a "$file" \
-k \
-j8 \
--block 100M \
--pipepart \
--bar \
--will-cite \
_fix_col_number {} | _unify_null_value {} >> "$OUTPUT_DIR/$new_filename"
So, I am basically trying to process each line in a file in parallel using Bash functions defined inside my script. However, I am not sure how to pass each line to my defined functions "_fix_col_number" and "_unify_null_value". Whatever I do, nothing gets passed to the functions.
I am exporting the functions like this in my script:
declare -x NUM_OF_COLUMNS
export -f _fix_col_number
export -f _add_tabs
export -f _unify_null_value
The mentioned functions are:
_unify_null_value()
{
_string=$(echo "$1" | perl -0777 -pe "s/(?<=\t)\.(?=\s)//g" | \
perl -0777 -pe "s/(?<=\t)NA(?=\s)//g" | \
perl -0777 -pe "s/(?<=\t)No Info(?=\s)//g")
echo "$_string"
}
_add_tabs()
{
_tabs=""
for (( c=1; c<=$1; c++ ))
do
_tabs="$_tabs\t"
done
echo -e "$_tabs"
}
_fix_col_number()
{
line_cols=$(echo "$1" | awk -F"\t" '{ print NF }')
if [[ $line_cols -gt $NUM_OF_COLUMNS ]]; then
new_line=$(echo "$1" | cut -f1-"$NUM_OF_COLUMNS")
echo -e "$new_line\n"
elif [[ $line_cols -lt $NUM_OF_COLUMNS ]]; then
missing_columns=$(( NUM_OF_COLUMNS - line_cols ))
new_line="${1//$'\n'/}$(_add_tabs $missing_columns)"
echo -e "$new_line\n"
else
echo -e "$1"
fi
}
I tried removing {} from parallel. Not really sure what I am doing wrong.
I see two problems in the invocation plus additional problems with the functions:
With --pipepart there are no arguments. The blocks read from -a file are passed over stdin to your functions. Try the following commands to confirm this:
seq 9 > file
parallel -a file --pipepart echo
parallel -a file --pipepart cat
Theoretically, you could read stdin into a variable and pass that variable to your functions, ...
parallel -a file --pipepart 'b=$(cat); someFunction "$b"'
... but I wouldn't recommend it, especially since your blocks are 100MB each.
Bash interprets the pipe | in your command before parallel even sees it. To run a pipe, quote the entire command:
parallel ... 'b=$(cat); _fix_col_number "$b" | _unify_null_value "$b"' >> ...
_fix_col_number seems to assume its argument to be a single line, but receives 100MB blocks instead.
_unify_null_value does not read stdin, so _fix_col_number {} | _unify_null_value {} is equivalent to _unify_null_value {}.
That being said, your functions can be drastically improved. They start a lot of processes which becomes incredibly expensive for larger files. You can do some trivial improvements like combining perl ... | perl ... | perl ... into a single perl. Likewise, instead of storing everything in variables, you can process stdin directly: Just use f() { cmd1 | cmd2; } instead of f() { var=$(echo "$1" | cmd1); var=$(echo "$var" | cmd2); echo "$var"; }.
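For example, _unify_null_value could become a plain stdin filter with a single perl call, along these lines (a sketch):
# reads stdin, writes stdout; one perl process instead of three plus echo
_unify_null_value() {
perl -0777 -pe 's/(?<=\t)(\.|NA|No Info)(?=\s)//g'
}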
However, don't waste time on small things like these. A complete rewrite in sed, awk, or perl is easy and should outperform every optimization on the existing functions.
Try
n="INSERT NUMBER OF COLUMNS HERE"
tabs=$(perl -e "print \"\t\" x $n")
perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" file |
cut -f "1-$n"
If you still find this too slow, leave out file; pack the command into a function, export that function and then call parallel -a file -k --pipepart nameOfTheFunction. The option --block is not necessary as pipepart will evenly split the input based on the number of jobs (can be specified with -j).
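Packed into a function for --pipepart, that could look roughly like this (a sketch; fix_cols is an illustrative name, and NUM_OF_COLUMNS is assumed to be set as in the original script):
fix_cols() {
tabs=$(perl -e "print \"\t\" x $NUM_OF_COLUMNS")
perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" |
cut -f "1-$NUM_OF_COLUMNS"
}
export -f fix_cols
export NUM_OF_COLUMNS
parallel -a "$file" -k --pipepart fix_cols >> "$OUTPUT_DIR/$new_filename"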

How to extract code into a function when using xargs -P?

At first, I wrote the following code, and it runs well.
# version1
all_num=10
thread_num=5
a=$(date +%H%M%S)
seq 1 ${all_num} | xargs -n 1 -I {} -P ${thread_num} sh -c 'echo abc{}'
b=$(date +%H%M%S)
echo -e "startTime:\t$a"
echo -e "endTime:\t$b"
Now I want to extract the code into a function, but it is wrong. How can I fix it?
get_file(i){
echo "abc"+i
}
all_num=10
thread_num=5
a=$(date +%H%M%S)
seq 1 ${all_num} | xargs -n 1 -I {} -P ${thread_num} sh -c "$(get_file {})"
b=$(date +%H%M%S)
echo -e "startTime:\t$a"
echo -e "endTime:\t$b"
Because /bin/sh isn't guaranteed to support either printing text that, when evaluated, defines your function, or exporting functions through the environment, we need to do this the hard way: duplicating the text of the function inside the copy of sh started by xargs.
Other questions already exist on this site describing how to accomplish this with bash, which is considerably easier. See for example How can I use xargs to run a function in a command substitution for each match?
#!/bin/sh
all_num=10
thread_num=5
batch_size=1 # but with a larger all_num, turn this up to start fewer copies of sh
a=$(date +%H%M%S) # warning: this is really inefficient
seq 1 ${all_num} | xargs -n "${batch_size}" -P "${thread_num}" sh -c '
get_file() { i=$1; echo "abc ${i}"; }
for arg do
get_file "$arg"
done
' _
b=$(date +%H%M%S)
printf 'startTime:\t%s\n' "$a"
printf 'endTime:\t%s\n' "$b"
Note:
echo -e is not guaranteed to work with /bin/sh. Moreover, for a shell to be truly compliant, echo -e is required to write -e to its output. See Why is printf better than echo? on UNIX & Linux Stack Exchange, and the APPLICATION USAGE section of the POSIX echo specification.
Putting {} in a sh -c '...{}...' position is a Really Bad Idea. Consider the case where you're passed in a filename that contains $(rm -rf ~)'$(rm -rf ~)' -- it can't be safely inserted in an unquoted context, or a double-quoted context, or a single-quoted context, or a heredoc.
Note that seq is also nonstandard and not guaranteed to be present on all POSIX-compliant systems. i=0; while [ "$i" -lt "$all_num" ]; do echo "$i"; i=$((i + 1)); done is an alternative that will work on all POSIX systems.
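For comparison, here is what the bash route mentioned above might look like (a sketch, assuming bash is available; export -f is a bash extension):
#!/bin/bash
get_file() { echo "abc$1"; }
export -f get_file
all_num=10
thread_num=5
a=$(date +%H%M%S)
seq 1 "$all_num" | xargs -n 1 -P "$thread_num" bash -c 'get_file "$1"' _
b=$(date +%H%M%S)
printf 'startTime:\t%s\n' "$a"
printf 'endTime:\t%s\n' "$b"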

Running my function in parallel using xargs

Hi all, I have the following bash script that calls hmmscan from the hmmer3 software. hmmscan requires six command line arguments to be specified; in this case the code that I have written is as follows:
hmmscan_fun () {
local file=$1
local marker_profiles=$2
local n_threads=$3
local out_dir=$4
fname=$(echo $file | rev | cut -d'/' -f1 | rev)
echo 'filename'
echo $out_dir$fname".txt"
echo 'n threads'
echo $n_threads
echo 'marker profiles'
echo $marker_profiles
echo $out_dir$fname".txt" >> $out_dir"out.txt"
hmmscan -o $out_dir$fname".txt" --tblout $out_dir$fname".hmm" -E 1e-10 --cpu $n_threads $marker_profiles $file
}
Basically I'm iterating over a list of files found in a directory and running hmmscan on each file, appending the file name to the output names so that each input file gets its own outputs.
My question is that the loop is quite lengthy and I would like to parallelize this process to scale with the number of CPUs that I provide on the command line. I want to do so using xargs; it is important that I use xargs since I do not have GNU parallel and unfortunately I cannot install anything. Please help. Basically I'm stuck on how to call a function with xargs and how to pass many command line arguments to it.
I assume you have access to a development machine where you are allowed to install software. On that machine you install GNU Parallel > 20180222.
Then you run:
parallel --embed > myscript.sh
Then you change the last lines of myscript.sh to something like:
hmmscan_fun () {
local file=$1
local marker_profiles=$2
local n_threads=$3
local out_dir=$4
fname=$(echo $file | rev | cut -d'/' -f1 | rev)
echo 'filename'
echo $out_dir$fname".txt"
echo 'n threads'
echo $n_threads
echo 'marker profiles'
echo $marker_profiles
echo $out_dir$fname".txt" >> $out_dir"out.txt"
hmmscan -o $out_dir$fname".txt" --tblout $out_dir$fname".hmm" -E 1e-10 --cpu $n_threads $marker_profiles $file
}
export -f hmmscan_fun
parallel hmmscan_fun {1} {2} 32 myoutdir ::: files* ::: marker1 marker2
And then you move the script to the production machine and run it there.
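If even a development machine is not available, an xargs-only sketch along the same lines is possible (assuming bash, an xargs with -0 and -P, and placeholder values for the marker file, thread count, and output directory):
export -f hmmscan_fun
printf '%s\0' files* |
xargs -0 -n 1 -P 4 bash -c 'hmmscan_fun "$1" marker_profiles.hmm 8 myoutdir/' _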

snakemake rule calls a shell script but exits after first command

I have a shell script that works well if I just run it from command line. When I call it from a rule within snakemake it fails.
The script runs a for loop over a file of identifiers and uses those to grep the sequences from a fastq file followed by multiple sequence alignment and makes a consensus.
Here is the script. I placed some echo statements in there and for some reason it doesn't call the commands. It stops at the grep statement.
I have tried adding set +o pipefail; in the rule but that doesn't work either.
#!/bin/bash
function Usage(){
echo -e "\
Usage: $(basename $0) -r|--read2 -l|--umi-list -f|--outfile \n\
where: ... \n\
" >&2
exit 1
}
# Check argument count
[[ "$#" -lt 2 ]] && Usage
# parse arguments
while [[ "$#" -gt 1 ]];do
case "$1" in
-r|--read2)
READ2="$2"
shift
;;
-l|--umi-list)
UMI="$2"
shift
;;
-f|--outfile)
OUTFILE="$2"
shift
;;
*)
Usage
;;
esac
shift
done
# Set defaults
# Check arguments
[[ -f "${READ2}" ]] || (echo "Cannot find input file ${READ2}, exiting..." >&2; exit 1)
[[ -f "${UMI}" ]] || (echo "Cannot find input file ${UMI}, exiting..." >&2; exit 1)
#Create output directory
OUTDIR=$(dirname "${OUTFILE}")
[[ -d "${OUTDIR}" ]] || (set -x; mkdir -p "${OUTDIR}")
# Make temporary directories
TEMP_DIR="${OUTDIR}/temp"
[[ -d "${TEMP_DIR}" ]] || (set -x; mkdir -p "${TEMP_DIR}")
#RUN consensus script
for f in $( more "${UMI}" | cut -f1);do
NAME=$(echo $f)
grep "${NAME}" "${READ2}" | cut -f1 -d ' ' | sed 's/#M/M/' > "${TEMP_DIR}/${NAME}.name"
echo subsetting reads
seqtk subseq "${READ2}" "${TEMP_DIR}/${NAME}.name" | seqtk seq -A > "${TEMP_DIR}/${NAME}.fasta"
~/software/muscle3.8.31_i86linux64 -msf -in "${TEMP_DIR}/${NAME}.fasta" -out "${TEMP_DIR}/${NAME}.muscle.fasta"
echo make consensus
~/software/EMBOSS-6.6.0/emboss/cons -sequence "${TEMP_DIR}/${NAME}.muscle.fasta" -outseq "${TEMP_DIR}/${NAME}.cons.fasta"
sed -i 's/n//g' "${TEMP_DIR}/${NAME}.cons.fasta"
sed -i "s/EMBOSS_001/${NAME}.cons/" "${TEMP_DIR}/${NAME}.cons.fasta"
done
cat "${TEMP_DIR}/*.cons.fasta" > "${OUTFILE}"
Snakemake rule:
rule make_consensus:
input:
r2=get_extracted,
lst="{prefix}/{sample}/reads/cell_barcode_umi.count"
output:
fasta="{prefix}/{sample}/reads/fasta/{sample}.R2.consensus.fa"
shell:
"sh ./scripts/make_consensus.sh -r {input.r2} -l {input.lst} -f {output.fasta}"
Edit: Snakemake error messages (I changed some of the paths to a neutral filepath):
RuleException:
CalledProcessError in line 29 of ~/user/scripts/consensus.smk:
Command ' set -euo pipefail; sh ./scripts/make_consensus.sh -r ~/user/file.extracted.fastq -l ~/user/cell_barcode_umi
.count -f ~/user/file.consensus.fa ' returned non-zero exit status 1.
File "~/user/scripts/consensus.smk", line 29, in __rule
_make_consensus
File "~/user/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
If there are better ways to do this than using a shell for loop please let me know!
thanks!
Edit
Script ran as standalone: first grep
grep AGGCCGTTCT_TGTGGATG R_extracted/wgs_5_OL_debug.R2.extracted.fastq | cut -f1 -d ' ' | sed 's/#M/M/' > ./fasta/temp/AGGCCGTTCT_TGTGGATG.name
Script ran through snakemake: first 2 grep statements
grep :::::::::::::: R_extracted/wgs_5_OL_debug.R2.extracted.fastq | cut -f1 -d ' ' | sed 's/#M/M/' > ./fasta/temp/::::::::::::::.name
I'm now trying to figure out where those :::: in snakemake are coming from. All ideas welcome
It stops at the grep statement
My guess is that the grep command in make_consensus.sh doesn't capture anything. grep returns exit code 1 in such cases and the non-zero exit status propagates to snakemake. (see also Handling SIGPIPE error in snakemake)
Loosely related... There is an inconsistency between the shebang of make_consensus.sh, which says the script should be executed with bash (#!/bin/bash), and the actual execution using sh (sh ./scripts/make_consensus.sh). (In practice it shouldn't make any difference, since sh is probably a link to bash anyway.)
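If the empty grep is indeed the cause, a common idiom is to treat grep's "no match" exit status (1) as success while still failing on real errors (status 2 and above); a sketch of the affected line in make_consensus.sh:
# exit status 1 (no match) is tolerated; anything else still fails under set -euo pipefail
{ grep "${NAME}" "${READ2}" || [ $? -eq 1 ]; } | cut -f1 -d ' ' | sed 's/#M/M/' > "${TEMP_DIR}/${NAME}.name"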
