running my function in parallel using xargs - bash

Hi all, I have the following bash function that calls hmmscan from the HMMER3 software. hmmscan requires six command line arguments to be specified; in my case the code that I have written is as follows:
hmmscan_fun () {
  local file=$1
  local marker_profiles=$2
  local n_threads=$3
  local out_dir=$4
  fname=$(echo $file | rev | cut -d'/' -f1 | rev)
  echo 'filename'
  echo $out_dir$fname".txt"
  echo 'n threads'
  echo $n_threads
  echo 'marker profiles'
  echo $marker_profiles
  echo $out_dir$fname".txt" >> $out_dir"out.txt"
  hmmscan -o $out_dir$fname".txt" --tblout $out_dir$fname".hmm" -E 1e-10 --cpu $n_threads $marker_profiles $file
}
Basically I'm iterating over a list of files found in a directory and running hmmscan over each file, and I'm appending each file name to the output names so that I'll have a different output name corresponding to each input file.
My question is that the loop is quite lengthy and I would like to parallelize this process to scale with the number of CPUs that I provide at the command line. I want to do so using xargs; it is important that I use xargs since I do not have GNU parallel and unfortunately I cannot install anything. Please help. Basically I'm stuck on how to call a function with xargs and how to pass many command line arguments to it.

I assume you have access to a development machine where you are allowed to install software. On that machine you install GNU Parallel > 20180222.
Then you run:
parallel --embed > myscript.sh
Then you change the last lines of myscript.sh to something like:
hmmscan_fun () {
  local file=$1
  local marker_profiles=$2
  local n_threads=$3
  local out_dir=$4
  fname=$(echo $file | rev | cut -d'/' -f1 | rev)
  echo 'filename'
  echo $out_dir$fname".txt"
  echo 'n threads'
  echo $n_threads
  echo 'marker profiles'
  echo $marker_profiles
  echo $out_dir$fname".txt" >> $out_dir"out.txt"
  hmmscan -o $out_dir$fname".txt" --tblout $out_dir$fname".hmm" -E 1e-10 --cpu $n_threads $marker_profiles $file
}
export -f hmmscan_fun
parallel hmmscan_fun {1} {2} 32 myoutdir ::: files* ::: marker1 marker2
And then you move the script to the production machine and run it there.
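For completeness, since the question specifically asks about xargs (no GNU Parallel available): the usual pattern is to export the function and have each xargs-spawned bash invoke it. This is only a sketch; the paths, profile database, and CPU count below are placeholders, and --cpu is set to 1 per job because xargs -P already supplies the parallelism:
export -f hmmscan_fun
export marker_profiles=/path/to/markers.hmm   # placeholder: your HMM profile database
export out_dir=/path/to/out/                  # placeholder: the function expects a trailing slash
n_cpus=8                                      # placeholder: number of files to process at once
find /path/to/inputs -maxdepth 1 -type f -print0 |
  xargs -0 -n 1 -P "$n_cpus" bash -c 'hmmscan_fun "$1" "$marker_profiles" 1 "$out_dir"' _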

Related

bash or zsh: how to pass multiple inputs to interactive piped parameters?

I have 3 different files that I want to compare
words_freq
words_freq_deduped
words_freq_alpha
For each file, I run a command like so, which I iterate on constantly to compare the results.
For example, I would do this:
$ cat words_freq | grep -v '[soe]'
$ cat words_freq_deduped | grep -v '[soe]'
$ cat words_freq_alpha | grep -v '[soe]'
and then review the results, and then do it again, with an additional filter
$ cat words_freq | grep -v '[soe]' | grep a | grep r | head -n20
a
$ cat words_freq_deduped | grep -v '[soe]' | grep a | grep r | head -n20
b
$ cat words_freq_alpha | grep -v '[soe]' | grep a | grep r | head -n20
c
This continues on until I've analyzed my data.
I would like to write a script that could take the piped portion and apply it to each of these files, as I iterate on the grep/head portions of the command.
e.g. The following would dump the results of running the 3 commands above AND also compare the 3 results, and dump additional calculations on them
$ myScript | grep -v '[soe]' | grep a | grep r | head -n20
the letters were in all 3 runs, and it took 5 seconds
a
b
c
How can I do this using bash/python or zsh for the myScript part?
EDIT: After asking the question, it occurred to me that I could use eval to do it, like so, which I've added as an answer as well
The following approach allows me to process multiple files by using eval, which I know is frowned upon - any other suggestions are greatly appreciated!
$ myScript "grep -v '[soe]' | grep a | grep r | head -n20"
myScript
#!/usr/bin/env bash
function doIt(){
  FILE=$1
  CMD="cat $1 | $2"
  echo processing file "$FILE"
  eval "$CMD"
  echo
}
doIt words_freq "$@"
doIt words_freq_deduped "$@"
doIt words_freq_alpha "$@"
You can't stop your shell from interpreting pipes itself, so using it like that isn't very practical - you'd need to either quote the whole command and then eval it, which makes it hard to pass arguments with spaces, or quote every pipe individually and then eval that. Either way, these solutions are kinda hacky.
I'd suggest doing one of these two:
Keep your editor open, and put whatever you want to run inside the doIt function itself before you run it. Then run it in your shell without any arguments:
#!/usr/bin/env bash
doIt() {
  # grep -v '[soe]' < "$1"
  grep -v '[soe]' < "$1" | grep a | grep r | head -n20
}
doIt words_freq
doIt words_freq_deduped
doIt words_freq_alpha
Or, you could always use a for loop in your shell, which you can find again with Ctrl+r in your history when you want to reuse it:
$ for f in words_freq*; do grep -v '[soe]' < "$f" | grep a | grep r | head -n20; done
But if you really want your approach, I tried to make it accept spaces, but it ended up being even hackier:
#!/usr/bin/env bash
doIt() {
  local FILE=$1
  shift
  echo processing file "$FILE"
  local args=()
  for n in $(seq 1 $#); do
    arg=$1
    shift
    if [[ $arg == '|' ]]; then
      args+=('|')
    else
      args+=("\"$arg\"")
    fi
  done
  eval "cat '$FILE' | ${args[@]}"
}
doIt words_freq "$@"
doIt words_freq_deduped "$@"
doIt words_freq_alpha "$@"
With this version you can use it like this:
$ ./myScript grep "a a" "|" head -n1
Notice that it needs you to quote the |, and that it now handles arguments with spaces.
I may not have understood the problem correctly.
My understanding is that you want to write a script without pipes, by including the filtering logic in the script and feeding the filtering patterns in as arguments.
Here is a gawk script (gawk is the standard Linux awk) that makes one sweep over the 3 input files, without piping.
script.awk
BEGIN {
    # set record separator to something unlikely to be matched,
    # causing each file to be read entirely as a single record
    RS = "!#!#!#!#!#!#!#";
}
# fire when the file does not match excludeRegEx but matches both includeRegEx1 and includeRegEx2
$0 !~ excludeRegEx && $0 ~ includeRegEx1 && $0 ~ includeRegEx2 {
    system("head -n20 " FILENAME);   # run the shell command "head -n20" on the current file
}
Running script.awk
awk -v excludeRegEx='[soe]' \
    -v includeRegEx1='a' \
    -v includeRegEx2='r' \
    -f script.awk words_freq words_freq_deduped words_freq_alpha

Recursive Arch Linux Shell Script To Get Dependencies

I wrote a shell script to print out to a file all of the dependencies of a specified package. Obviously it's not working (or else I wouldn't be here lol). I'm new to shell scripting / bash programming. I am running on Arch Linux and have searched the web to get me to where I'm at. But now I get a bunch of errors about an "empty string package name". It starts off good, then it's an endless loop of doom. My current code is this:
#!/bin/bash
echo -n "Enter a package: "
read p
echo "Searching through package...$p"
get_dependencies() {
  # Make sure we get the package too...
  pacman -Sp "$1" >> myPackages.list
  # Get dependency list from current package and output to tmp file
  pacman -Si "$1" | awk -F'[:<=>]' '/^Depends/ {print $2}' | xargs -n1 | sort -u > depList.list
  # Read from that output file and store in array
  listArray=()
  while read -r input ; do
    listArray+=("$input")
  done < "depList.list"
  # Get the number of dependencies
  numList=${#listArray[@]}
  echo "$numList dependencies from $1"
  echo "Delving deeper.."
  # Loop through each depend and get all those dependencies
  for i in "${listArray[@]}" ; do
    get_dependencies "$i"
  done
}
# Get dependencies of package that user typed
get_dependencies "$p"
# Finished
echo "Done!"
To avoid cyclic dependencies, you can keep track of what packages you've visited and which you have yet to visit in separate files. Comparing those files before descending into the dependencies will hopefully keep your script out of trouble.
deps() {
  pacman -Si "$1" |
    awk -F'[:<=>]' '/^Depends/ {print $2}' |
    xargs -n1 |
    sort -u |
    grep -v None
}
alldeps() {
  # needed files, potentially cached for later
  unseen_f=unseen.$1.txt
  seen_f=seen.$1.txt
  deps_f=deps.$1.txt
  # start off with the root package
  echo $1 > $unseen_f
  # while we still have unseen packages, find depends
  while [ $(sed /^$/d $unseen_f | wc -l) -gt 0 ]; do
    # read in all unseen, and get their deps
    for d in $(cat $unseen_f); do
      echo $d >> $seen_f
      deps $d >> $deps_f
    done
    # those in deps but not in seen go to unseen
    # we'll finish when unseen is empty: nothing in deps we haven't seen
    comm -23 <(sort -u $deps_f) <(sort -u $seen_f) > $unseen_f
  done
  cat $seen_f
  #sort -u $seen_f $deps_f
  # rm $seen_f $deps_f $unseen_f
}
alldeps xterm

Storing a line in a variable

Hi, I have the following batch script where I submit each file for separate processing, as follows:
for file in ../Positive/*.txt_rn; do
bsub <<EOF
#BSUB -L /bin/bash
#BSUB -W 150:00
#BSUB -M 10000
#BSUB -n 3
#BSUB -e /somefolder/errors/%J.err
#BSUB -o /somefolder/errors/%J.out
while read line; do
name=`cat \$line | awk '{print $1":"$2"-"$3}'`
four=`cat \$line | awk '{print $4}' | cut -d\: -f4`
fasta=\$name".fa"
op=\$name".rs"
echo \$name | xargs samtools faidx /somefolder/rn4/Rattus_norvegicus/UCSC/rn4/Sequence/WholeGenomeFasta/genome.fa > \$fasta
Process -F \$fasta -M "list_"\$four".txt" -p 0.003 | awk '(\$5 >= 0.67)' > \$op
if [ -s "\$op" ]
then
cat "\$line" >> ../Positive_Strand/$file".cons"
fi
rm \$lne
rm \$op
rm \$fasta
done < $file
EOF
done
I am somehow unable to store the values of the columns from the line (which is in the $line variable) into the $name and $four variables, and hence unable to carry on with further processing. Also, any suggestions to edit the code for a better version of it would be welcome.
If you change EOF to 'EOF' then you will more properly disable shell interpretation. Your problem is that your back-ticks (`) are not escaped.
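To see the difference the quoting makes, here is a small standalone illustration (not from the original post; the name variable is just an example):
name=world
# Unquoted delimiter: the shell expands $name (and backticks) before cat sees the text.
cat <<EOF
hello $name
EOF
# Quoted delimiter: the body is passed through literally.
cat <<'EOF'
hello $name
EOF
# The first prints "hello world"; the second prints "hello $name".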
I've fixed your indentation and cleaned up some of your code. Note that the syntax highlighting here doesn't understand cat <<'EOF'. If you paste that into vim with highlighting enabled, you'll see that block is all the same color since it's just a string.
bsub_helper() {
  cat <<'EOF'
#BSUB -L /bin/bash
#BSUB -W 150:00
#BSUB -M 10000
#BSUB -n 3
#BSUB -e /somefolder/errors/%J.err
#BSUB -o /somefolder/errors/%J.out
while read line; do
  name=`cat $line | awk '{print $1":"$2"-"$3}'`
  four=`cat $line | awk '{print $4}' | cut -d: -f4`
  fasta="$name.fa"
  op="$name.rs"
  genome="/somefolder/rn4/Rattus_norvegicus/UCSC/rn4/Sequence/WholeGenomeFasta/genome.fa"
  echo $name | xargs samtools faidx "$genome" > "$fasta"
  Process -F "$fasta" -M "list_$four.txt" -p 0.003 | awk '($5 >= 0.67)' > "$op"
  if [ -s "$op" ]
  then
    cat "$line" >> "../Positive_Strand/$file.cons"
  fi
  rm "$lne" "$op" "$fasta"
EOF
  echo " done < \"$1\""
}
for file in ../Positive/*.txt_rn; do
  bsub_helper "$file" | bsub
done
I created a helper function because I needed to get the input in two commands. I am assuming that $file is the only variable in that block that you want interpreted. I also surrounded that variable (among others) with quotes so that the code can support file names with spaces in them. The final line of the helper has nested double quotes for this reason.
I left your echo $name | xargs … line alone because it's so odd. Without quotes around $name, xargs will take each whitespace-separated entry as its own file. With quotes, xargs will only supply one (likely invalid) file name to samtools.
If $name is a single file, try:
samtools faidx "$genome" "$name" > "$fasta"
If $name is multiple files and none of them have spaces, try:
samtools faidx "$genome" $name > "$fasta"
The only reason to use xargs here would be if you have too much content for one command line, but if you're running echo $name | xargs then you'll run into the same problem.

A script to find all the users who are executing a specific program

I've written a bash script (searchuser) which should display all the users who are executing a specific program or script (at least a bash script). But searching for scripts fails because the command the OS is executing is something like bash scriptname.
This script works by parsing the ps command output: it searches for all occurrences of the specified program name, extracts the user and the program name, verifies that the program name is the one we're searching for and, if it is, displays the relevant information (in this case the user name and the program name; it might be better to also output the PID, but that is quite simple). The verification is done to reject lines whose program names merely contain the name of the program but are not the program we are searching for; if we're searching for gedit we don't want to find sgedit or gedits.
Other issues I have are:
I would like to avoid the use of a tmp file.
I would like not to be tied to GNU extensions.
The script has to be executed as:
root# searchuser programname <enter>
The script searchuser is the following:
#!/bin/bash
i=0
search=$1
tmp=`mktemp`
ps -aux | tr -s ' ' | grep "$search" > $tmp
while read fileline
do
  user=`echo "$fileline" | cut -f1 -d' '`
  prg=`echo "$fileline" | cut -f11 -d' '`
  prg=`basename "$prg"`
  if [ "$prg" = "$search" ]; then
    echo "$user - $prg"
    i=`expr $i + 1`
  fi
done < $tmp
if [ $i = 0 ]; then
  echo "No users are executing $search"
fi
rm $tmp
exit $i
Do you have any suggestions on how to solve these issues?
One approach might look like this:
IFS=$'\n' read -r -d '' -a pids < <(pgrep -x -- "$1"; printf '\0')
if (( ! ${#pids[@]} )); then
  echo "No users are executing $1"
fi
for pid in "${pids[@]}"; do
  # build a more accurate command line than the one ps emits
  args=( )
  while IFS= read -r -d '' arg; do
    args+=( "$arg" )
  done </proc/"$pid"/cmdline
  (( ${#args[@]} )) || continue   # exited while we were running
  printf -v cmdline_str '%q ' "${args[@]}"
  user=$(stat --format=%U /proc/"$pid") || continue   # exited while we were running
  printf '%q - %s\n' "$user" "${cmdline_str% }"
done
Unlike the output from ps, which doesn't distinguish between ./command "some argument" and ./command "some" "argument", this will emit output which correctly shows the arguments run by each user, with quoting which will re-run the given command correctly.
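As a quick aside, you can see the NUL-delimited format this relies on by inspecting your current shell's own /proc entry:
# prints each argument of the current shell on its own line,
# so an argument containing spaces still comes out as a single item
tr '\0' '\n' < /proc/$$/cmdline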
What about:
ps -e -o user,comm | egrep "^[^ ]+ +$1$" | cut -d' ' -f1 | sort -u
* Addendum *
This statement:
ps -e -o user,pid,comm | egrep "^\s*\S+\s+\S+\s*$1$" | while read a b; do echo $a; done | sort | uniq -c
or this one:
ps -e -o user,pid,comm | egrep "^\s*\S+\s+\S+\s*sleep$" | xargs -L1 echo | cut -d ' ' -f1 | sort | uniq -c
shows the number of process instances by user.

Bash script checking cpu usage of specific process

First off, I'm new to this. I have some experience with Windows scripting and AppleScript but not much with bash. What I'm trying to do is grab the PID and %CPU of a specific process, then compare the %CPU against a set number, and if it's higher, kill the process. I feel like I'm close, but now I'm getting the following error:
[[: 0.0: syntax error: invalid arithmetic operator (error token is ".0")
What am I doing wrong? Here's my code so far:
#!/bin/bash
declare -i app_pid
declare -i app_cpu
declare -i cpu_limit
app_name="top"
cpu_limit="50"
app_pid=`ps aux | grep $app_name | grep -v grep | awk {'print $2'}`
app_cpu=`ps aux | grep $app_name | grep -v grep | awk {'print $3'}`
if [[ ! $app_cpu -gt $cpu_limit ]]; then
  echo "crap"
else
  echo "we're good"
fi
Obviously I'm going to replace the echos in the if/then statement, but it's acting as if the statement is true regardless of what the cpu load actually is (I tested this by changing the -gt to -lt and it still echoed "crap").
Thank you for all the help. Oh, and this is on a OS X 10.7 if that is important.
I recommend taking a look at the facilities of ps to avoid the multiple horrible things you are doing.
On my system (ps from procps on linux, GNU awk) I would do this:
ps -C "$app-name" -o pid=,pcpu= |
awk --assign maxcpu="$cpu_limit" '$2>maxcpu {print "crappy pid",$1}'
The problem is that bash can't handle decimals. You can just multiply them by 100 and work with plain integers instead:
#!/bin/bash
declare -i app_pid
declare -i app_cpu
declare -i cpu_limit
app_name="top"
cpu_limit="5000"
app_pid=`ps aux | grep $app_name | grep -v grep | awk {'print $2'}`
app_cpu=`ps aux | grep $app_name | grep -v grep | awk {'print $3*100'}`
if [[ $app_cpu -gt $cpu_limit ]]; then
  echo "crap"
else
  echo "we're good"
fi
Keep in mind that CPU percentage is a suboptimal measurement of application health. If you have two processes running infinite loops on a single core system, no other application of the same priority will ever go over 33%, even if they're thrashing around.
#!/bin/sh
PROCESS="java"
PID=`pgrep $PROCESS | tail -n 1`
CPU=`top -b -p $PID -n 1 | tail -n 1 | awk '{print $9}'`
echo $CPU
I came up with this, using top and bc.
Use it by passing in ex: ./script apache2 50 # max 50%
If there are many PIDs matching your program argument, only one will be calculated, based on how top lists them. I could have extended the script by catching them all and averaging the percentage or something, but this will have to do.
You can also pass in a number, ./script.sh 12345 50, which will force it to use an exact PID.
#!/bin/bash
# 1: ['command\ name' or PID number(,s)] 2: MAX_CPU_PERCENT
[[ $# -ne 2 ]] && exit 1
PID_NAMES=$1
# get all PIDS as nn,nn,nn
if [[ ! "$PID_NAMES" =~ ^[0-9,]+$ ]] ; then
PIDS=$(pgrep -d ',' -x $PID_NAMES)
else
PIDS=$PID_NAMES
fi
# echo "$PIDS $MAX_CPU"
MAX_CPU="$2"
MAX_CPU="$(echo "($MAX_CPU+0.5)/1" | bc)"
LOOP=1
while [[ $LOOP -eq 1 ]] ; do
sleep 0.3s
# Depending on your 'top' version and OS you might have
# to change head and tail line-numbers
LINE="$(top -b -d 0 -n 1 -p $PIDS | head -n 8 \
| tail -n 1 | sed -r 's/[ ]+/,/g' | \
sed -r 's/^\,|\,$//')"
# If multiple processes in $PIDS, $LINE will only match\
# the most active process
CURR_PID=$(echo "$LINE" | cut -d ',' -f 1)
# calculate cpu limits
CURR_CPU_FLOAT=$(echo "$LINE"| cut -d ',' -f 9)
CURR_CPU=$(echo "($CURR_CPU_FLOAT+0.5)/1" | bc)
echo "PID $CURR_PID: $CURR_CPU""%"
if [[ $CURR_CPU -ge $MAX_CPU ]] ; then
echo "PID $CURR_PID ($PID_NAMES) went over $MAX_CPU""%"
echo "[[ $CURR_CPU""% -ge $MAX_CPU""% ]]"
LOOP=0
break
fi
done
echo "Stopped"
Erik, I used a modified version of your code to create a new script that does something similar. Hope you don't mind it.
A bash script to get the CPU usage by process
usage:
nohup ./check_proc bwengine 70 &
bwengine is the process name we want to monitor; 70 means only log when the process is using over 70% of the CPU.
Check the logs at: /var/log/check_procs.log
The output should be like:
DATE | TOTAL CPU | CPU USAGE | Process details
Example:
03/12/14 17:11 |20.99|98| ProdPROXY-ProdProxyPA.tra
03/12/14 17:11 |20.99|100| ProdPROXY-ProdProxyPA.tra
Link to the full blog:
http://felipeferreira.net/?p=1453
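The full script lives behind that link; below is only a minimal sketch of what such a monitor might look like (the polling interval, ps options, and log format are my assumptions, not the author's code, and the TOTAL CPU column from the example output is omitted):
#!/bin/bash
# Sketch of a CPU-usage logger: ./check_proc <process-name> <max-percent>
proc_name=$1        # e.g. bwengine
threshold=$2        # e.g. 70 (percent)
logfile=/var/log/check_procs.log
while sleep 10; do
  # take the busiest process matching the name and read its %CPU and command
  read -r cpu comm <<< "$(ps -C "$proc_name" -o pcpu=,comm= --sort=-pcpu | head -n 1)"
  [ -z "$cpu" ] && continue
  # log a line only when usage exceeds the threshold (compare the integer part)
  if [ "${cpu%.*}" -ge "$threshold" ]; then
    echo "$(date '+%d/%m/%y %H:%M') |$cpu| $comm" >> "$logfile"
  fi
done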
It is also useful to have app_user information available to test whether the current user has the rights to kill/modify the running process. This information can be obtained along with the needed app_pid and app_cpu by using read, eliminating the need for awk or any other 3rd-party parser:
read app_user app_pid tmp_cpu stuff <<< \
$( ps aux | grep "$app_name" | grep -v "grep\|defunct\|${0##*/}" )
You can then get your app_cpu * 100 with:
app_cpu=$((${tmp_cpu%.*} * 100))
Note: Including defunct and ${0##*/} in grep -v prevents against multiple processes matching $app_name.
I use top to check some details. It provides a few more details like CPU time.
On Linux this would be:
top -b -n 1 | grep $app_name
On Mac, with its BSD version of top:
top -l 1 | grep $app_name
