How to count number of forked (sub-?)processes

How to count number of forked (sub-?)processes - bash

Somebody else has written (TM) some bash script that forks very many sub-processes. It needs optimization. But I'm looking for a way to measure "how bad" the problem is.
Can I / How would I get a count that says how many sub-processes were forked by this script all-in-all / recursively?
This is a simplified version of what the existing, forking code looks like - a poor man's grep:
#!/bin/bash
file=/tmp/1000lines.txt
match=$1
let cnt=0
while read line
do
cnt=`expr $cnt + 1`
lineArray[$cnt]="${line}"
done < $file
totalLines=$cnt
cnt=0
while [ $cnt -lt $totalLines ]
do
cnt=`expr $cnt + 1`
matches=`echo ${lineArray[$cnt]}|grep $match`
if [ "$matches" ] ; then
echo ${lineArray[$cnt]}
fi
done
It takes the script 20 seconds to look for $1 in 1000 lines of input. This code forks way too many sub-processes. In the real code, there are longer pipes (e.g. progA | progB | progC) operating on each line using grep, cut, awk, sed and so on.
This is a busy system with lots of other stuff going on, so a count of how many processes were forked on the entire system during the run-time of the script would be of some use to me, but I'd prefer a count of processes started by this script and descendants. And I guess I could analyze the script and count it myself, but the script is long and rather complicated, so I'd just like to instrument it with this counter for debugging, if possible.
To clarify:
I'm not looking for the number of processes under $$ at any given time (e.g. via ps), but the number of processes run during the entire life of the script.
I'm also not looking for a faster version of this particular example script (I can do that). I'm looking for a way to determine which of the 30+ scripts to optimize first to use bash built-ins.

You can count the forked processes simply trapping the SIGCHLD signal. If You can edit the script file then You can do this:
set -o monitor # or set -m
trap "((++fork))" CHLD
So fork variable will contain the number of forks. At the end You can print this value:
echo $fork FORKS
For a 1000 lines input file it will print:
3000 FORKS
This code forks for two reasons. One for each expr ... and one for `echo ...|grep...`. So in the reading while-loop it forks every time when a line is read; in the processing while-loop it forks 2 times (one because of expr ... and one for `echo ...|grep ...`). So for a 1000 lines file it forks 3000 times.
But this is not exact! It is just the forks done by the calling shell. There are more forks, because `echo ...|grep...` forks to start a bash to run this code. But after it is also forks twice: one for echo and one for grep. So actually it is 3 forks, not one. So it is rather 5000 FORKS, not 3000.
If You need to count the forks of the forks (of the forks...) as well (or You cannot modify the bash script or You want it to do from an other script), a more exact solution can be to used
strace -fo s.log ./x.sh
It will print lines like this:
30934 execve("./x.sh", ["./x.sh"], [/* 61 vars */]) = 0
Then You need to count the unique PIDs using something like this (first number is the PID):
awk '{n[$1]}END{print length(n)}' s.log
In case of this script I got 5001 (the +1 is the PID of the original bash script).
COMMENTS
Actually in this case all forks can be avoided:
Instead of
cnt=`expr $cnt + 1`
Use
((++cnt))
Instead of
matches=`echo ${lineArray[$cnt]}|grep $match`
if [ "$matches" ] ; then
echo ${lineArray[$cnt]}
fi
You can use bash's internal pattern matching:
[[ ${lineArray[cnt]} =~ $match ]] && echo ${lineArray[cnt]}
Mind that bash =~ uses ERE not RE (like grep). So it will behave like egrep (or grep -E), not grep.
I assume that the defined lineArray is not pointless (otherwise in the reading loop the matching could be tested and the lineArray is not needed) and it is used for other purpose as well. In that case I may suggest a little bit shorter version:
readarray -t lineArray <infile
for line in "${lineArray[#]}";{ [[ $line} =~ $match ]] && echo $line; }
First line reads the complete infile to lineArray without any loop. The second line is process the array element-by-element.
MEASURES
Original script for 1000 lines (on cygwin):
$ time ./test.sh
3000 FORKS
real 0m48.725s
user 0m14.107s
sys 0m30.659s
Modified version
FORKS
real 0m0.075s
user 0m0.031s
sys 0m0.031s
Same on linux:
3000 FORKS
real 0m4.745s
user 0m1.015s
sys 0m4.396s
and
FORKS
real 0m0.028s
user 0m0.022s
sys 0m0.005s
So this version uses no fork (or clone) at all. I may suggest to use this version only for small (<100 KiB) files. In other cases grap, egrep, awk over performs the pure bash solution. But this should be checked by a performance test.
For a thousand lines on linux I got the following:
$ time grep Solaris infile # Solaris is not in the infile
real 0m0.001s
user 0m0.000s
sys 0m0.001s

Related

Use PS0 and PS1 to display execution time of each bash command

It seems that by executing code in PS0 and PS1 variables (which are eval'ed before and after a prompt command is run, as I understand) it should be possible to record time of each running command and display it in the prompt. Something like that:
user#machine ~/tmp
$ sleep 1
user#machine ~/tmp 1.01s
$
However, I quickly got stuck with recording time in PS0, since something like this doesn't work:
PS0='$(START=$(date +%s.%N))'
As I understand, START assignment happens in a sub-shell, so it is not visible in the outer shell. How would you approach this?

I was looking for a solution to a different problem and came upon this question, and decided that sounds like a cool feature to have. Using #Scheff's excellent answer as a base in addition to the solutions I developed for my other problem, I came up with a more elegant and full featured solution.
First, I created a few functions that read/write the time to/from memory. Writing to the shared memory folder prevents disk access and does not persist on reboot if the files are not cleaned for some reason
function roundseconds (){
# rounds a number to 3 decimal places
echo m=$1";h=0.5;scale=4;t=1000;if(m<0) h=-0.5;a=m*t+h;scale=3;a/t;" | bc
}
function bash_getstarttime (){
# places the epoch time in ns into shared memory
date +%s.%N >"/dev/shm/${USER}.bashtime.${1}"
}
function bash_getstoptime (){
# reads stored epoch time and subtracts from current
local endtime=$(date +%s.%N)
local starttime=$(cat /dev/shm/${USER}.bashtime.${1})
roundseconds $(echo $(eval echo "$endtime - $starttime") | bc)
}
The input to the bash_ functions is the bash PID
Those functions and the following are added to the ~/.bashrc file
ROOTPID=$BASHPID
bash_getstarttime $ROOTPID
These create the initial time value and store the bash PID as a different variable that can be passed to a function. Then you add the functions to PS0 and PS1
PS0='$(bash_getstarttime $ROOTPID) etc..'
PS1='\[\033[36m\] Execution time $(bash_getstoptime $ROOTPID)s\n'
PS1="$PS1"'and your normal PS1 here'
Now it will generate the time in PS0 prior to processing terminal input, and generate the time again in PS1 after processing terminal input, then calculate the difference and add to PS1. And finally, this code cleans up the stored time when the terminal exits:
function runonexit (){
rm /dev/shm/${USER}.bashtime.${ROOTPID}
}
trap runonexit EXIT
Putting it all together, plus some additional code being tested, and it looks like this:
The important parts are the execution time in ms, and the user.bashtime files for all active terminal PIDs stored in shared memory. The PID is also shown right after the terminal input, as I added display of it to PS0, and you can see the bashtime files added and removed.
PS0='$(bash_getstarttime $ROOTPID) $ROOTPID experiments \[\033[00m\]\n'

As #tc said, using arithmetic expansion allows you to assign variables during the expansion of PS0 and PS1. Newer bash versions also allow PS* style expansion so you don't even need a subshell to get the current time. With bash 4.4:
# PS0 extracts a substring of length 0 from PS1; as a side-effect it causes
# the current time as epoch seconds to PS0time (no visible output in this case)
PS0='\[${PS1:$((PS0time=\D{%s}, PS1calc=1, 0)):0}\]'
# PS1 uses the same trick to calculate the time elapsed since PS0 was output.
# It also expands the previous command's exit status ($?), the current time
# and directory ($PWD rather than \w, which shortens your home directory path
# prefix to "~") on the next line, and finally the actual prompt: 'user#host> '
PS1='\nSeconds: $((PS1calc ? \D{%s}-$PS0time : 0)) Status: $?\n\D{%T} ${PWD:PS1calc=0}\n\u#\h> '
(The %N date directive does not seem to be implemented as part of \D{...} expansion with bash 4.4. This is a pity since we only have a resolution in single second units.)
Since PS0 is only evaluated and printed if there is a command to execute, the PS1calc flag is set to 1 to do the time difference (following the command) in PS1 expansion or not (PS1calc being 0 means PS0 was not previously expanded and so didn't re-evaluate PS1time). PS1 then resets PS1calc to 0. In this way an empty line (just hitting return) doesn't accumulate seconds between return key presses.
One nice thing about this method is that there is no output when you have set -x active. No subshells or temporary files in sight: everything is done within the bash process itself.

I took this as puzzle and want to show the result of my puzzling:
First I fiddled with time measurement. The date +%s.%N (which I didn't realize before) was where I started from. Unfortunately, it seems that bashs arithmetic evaluation seems not to support floating points. Thus, I chosed something else:
$ START=$(date +%s.%N)
$ awk 'BEGIN { printf("%fs", '$(date +%s.%N)' - '$START') }' /dev/null
8.059526s
$
This is sufficient to compute the time difference.
Next, I confirmed what you already described: sub-shell invocation prevents usage of shell variables. Thus, I thought about where else I could store the start time which is global for sub-shells but local enough to be used in multiple interactive shells concurrently. My solution are temp. files (in /tmp). To provide a unique name I came up with this pattern: /tmp/$USER.START.$BASHPID.
$ date +%s.%N >/tmp/$USER.START.$BASHPID ; \
> awk 'BEGIN { printf("%fs", '$(date +%s.%N)' - '$(cat /tmp/$USER.START.$BASHPID)') }' /dev/null
cat: /tmp/ds32737.START.11756: No such file or directory
awk: cmd. line:1: BEGIN { printf("%fs", 1491297723.111219300 - ) }
awk: cmd. line:1: ^ syntax error
$
Damn! Again I'm trapped in the sub-shell issue. To come around this, I defined another variable:
$ INTERACTIVE_BASHPID=$BASHPID
$ date +%s.%N >/tmp/$USER.START.$INTERACTIVE_BASHPID ; \
> awk 'BEGIN { printf("%fs", '$(date +%s.%N)' - '$(cat /tmp/$USER.START.$INTERACTIVE_BASHPID)') }' /dev/null
0.075319s
$
Next step: fiddle this together with PS0 and PS1. In a similar puzzle (SO: How to change bash prompt color based on exit code of last command?), I already mastered the "quoting hell". Thus, I should be able to do it again:
$ PS0='$(date +%s.%N >"/tmp/${USER}.START.${INTERACTIVE_BASHPID}")'
$ PS1='$(awk "BEGIN { printf(\"%fs\", "$(date +%s.%N)" - "$(cat /tmp/$USER.START.$INTERACTIVE_BASHPID)") }" /dev/null)'"$PS1"
0.118550s
$
Ahh. It starts to work. Thus, there is only one issue - to find the right start-up script for the initialization of INTERACTIVE_BASHPID. I found ~/.bashrc which seems to be the right one for this, and which I already used in the past for some other personal customizations.
So, putting it all together - these are the lines I added to my ~/.bashrc:
# command duration puzzle
INTERACTIVE_BASHPID=$BASHPID
date +%s.%N >"/tmp/${USER}.START.${INTERACTIVE_BASHPID}"
PS0='$(date +%s.%N >"/tmp/${USER}.START.${INTERACTIVE_BASHPID}")'
PS1='$(awk "BEGIN { printf(\"%fs\", "$(date +%s.%N)" - "$(cat /tmp/$USER.START.$INTERACTIVE_BASHPID)") }" /dev/null)'"$PS1"
The 3rd line (the date command) has been added to solve another issue. Comment it out and start a new interactive bash to find out why.
A snapshot of my cygwin xterm with bash where I added the above lines to ./~bashrc:
Notes:
I consider this rather as solution to a puzzle than a "serious productive" solution. I'm sure that this kind of time measurement consumes itself a lot of time. The time command might provide a better solution: SE: How to get execution time of a script effectively?. However, this was a nice lecture for practicing the bash...
Don't forget that this code pollutes your /tmp directory with a growing number of small files. Either clean-up the /tmp from time to time or add the appropriate commands for clean-up (e.g. to ~/.bash_logout).

Arithmetic expansion runs in the current process and can assign to variables. It also produces output, which you can consume with something like \e[$((...,0))m (to output \e[0m) or ${t:0:$((...,0))} (to output nothing, which is presumably better). 64-bit integer support in Bash supports will count POSIX nanoseconds until the year 2262.
$ PS0='${t:0:$((t=$(date +%s%N),0))}'
$ PS1='$((( t )) && printf %d.%09ds $((t=$(date +%s%N)-t,t/1000000000)) $((t%1000000000)))${t:0:$((t=0))}\n$ '
0.053282161s
$ sleep 1
1.064178281s
$
$
PS0 is not evaluated for empty commands, which leaves a blank line (I'm not sure if you can conditionally print the \n without breaking things). You can work around that by switching to PROMPT_COMMAND instead (which also saves a fork):
$ PS0='${t:0:$((t=$(date +%s%N),0))}'
$ PROMPT_COMMAND='(( t )) && printf %d.%09ds\\n $((t=$(date +%s%N)-t,t/1000000000)) $((t%1000000000)); t=0'
0.041584565s
$ sleep 1
1.077152833s
$
$
That said, if you do not require sub-second precision, I would suggest using $SECONDS instead (which is also more likely to return a sensible answer if something sets the time).

As correctly stated in the question, PS0 runs inside a sub-shell which makes it unusable for this purpose of setting the start time.
Instead, one can use the history command with epoch seconds %s and the built-in variable $EPOCHSECONDS to calculate when the command finished by leveraging only $PROMPT_COMMAND.
# Save start time before executing command (does not work due to PS0 sub-shell)
# preexec() {
# STARTTIME=$EPOCHSECONDS
# }
# PS0=preexec
# Save end time, without duplicating commands when pressing Enter on an empty line
precmd() {
local st=$(HISTTIMEFORMAT='%s ' history 1 | awk '{print $2}');
if [[ -z "$STARTTIME" || (-n "$STARTTIME" && "$STARTTIME" -ne "$st") ]]; then
ENDTIME=$EPOCHSECONDS
STARTTIME=$st
else
ENDTIME=0
fi
}
__timeit() {
precmd;
if ((ENDTIME - STARTTIME >= 0)); then
printf 'Command took %d seconds.\n' "$((ENDTIME - STARTTIME))";
fi
# Do not forget your:
# - OSC 0 (set title)
# - OSC 777 (notification in gnome-terminal, urxvt; note, this one has preexec and precmd as OSC 777 features)
# - OSC 99 (notification in kitty)
# - OSC 7 (set url) - out of scope for this question
}
export PROMPT_COMMAND=__timeit
Note: If you have ignoredups in your $HISTCONTROL, then this will not report back for a command that is re-run.

Following #SherylHohman use of variables in PS0 I've come with this complete script. I've seen you don't need a PS0Time flag as PS0Calc doesn't exists on empty prompts so _elapsed funct just exit.
#!/bin/bash
# string preceding ms, use color code or ascii
_ELAPTXT=$'\E[1;33m \uf135 '
# extract time
_printtime () {
local _var=${EPOCHREALTIME/,/};
echo ${_var%???}
}
# get diff time, print it and end color codings if any
_elapsed () {
[[ -v "${1}" ]] || ( local _VAR=$(_printtime);
local _ELAPSED=$(( ${_VAR} - ${1} ));
echo "${_ELAPTXT}$(_formatms ${_ELAPSED})"$'\n\e[0m' )
}
# format _elapsed with simple string substitution
_formatms () {
local _n=$((${1})) && case ${_n} in
? | ?? | ???)
echo $_n"ms"
;;
????)
echo ${_n:0:1}${_n:0,-3}"ms"
;;
?????)
echo ${_n:0:2}","${_n:0,-3}"s"
;;
??????)
printf $((${_n:0:3}/60))m+$((${_n:0:3}%60)),${_n:0,-3}"s"
;;
???????)
printf $((${_n:0:4}/60))m$((${_n:0:4}%60))s${_n:0,-3}"ms"
;;
*)
printf "too much!"
;;
esac
}
# prompts
PS0='${PS1:(PS0time=$(_printtime)):0}'
PS1='$(_elapsed $PS0time)${PS0:(PS0time=0):0}\u#\h:\w\$ '
img of result
Save it as _prompt and source it to try:
source _prompt
Change text, ascii codes and colors in _ELAPTXT
_ELAPTXT='\e[33m Elapsed time: '

bash loop taking extremely long time

I have a list of times that I am looping through in the format HH:MM:SS to find the nearest but not past time. The code that I have is:
for i in ${times[#]}; do
hours=$(echo $i | sed 's/\([0-9]*\):.*/\1/g')
minutes=$(echo $i | sed 's/.*:\([0-9]*\):.*/\1/g')
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
if [[ hours -ge currentHours ]]; then
if [[ minutes -ge currentMinutes ]]; then
break
fi
fi
done
The variable times is an array of all the times that I am sorting through (its about 20-40 lines). I'd expect this to take less than 1 second however it is taking upwards of 5 seconds. Any suggestions for decreasing the time of the regular expression would be appreciated.
times=($(cat file.txt))
Here is a list of the times that are stored in a text file and are imported into the times variable using the above line of code.
6:05:00
6:35:00
7:05:00
7:36:00
8:08:00
8:40:00
9:10:00
9:40:00
10:11:00
10:41:00
11:11:00
11:41:00
12:11:00
12:41:00
13:11:00
13:41:00
14:11:00
14:41:00
15:11:00
15:41:00
15:56:00
16:11:00
16:26:00
16:41:00
16:58:00
17:11:00
17:26:00
17:41:00
18:11:00
18:41:00
19:10:00
19:40:00
20:10:00
20:40:00
21:15:00
21:45:00

One of the key things to understand in looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while it can often speed up your scripts to use a single invocation of awk or sed to process a large stream of input, starting those invocations inside a tight loop will greatly outweigh the performance of those tools once they're running.
Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc -- then causes a subprocess to be fork()ed off for that process, and then the executable associated with that process to be exec()'d -- something involves a great deal of OS-level overhead (the binary needs to be linked, loaded, etc).
This loop would be better written as:
IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
if (( hours >= currentHours )) && (( minutes >= currentMinutes )); then
break
fi
done <file.txt
In this form only one external command is run, date +"%H:%M", outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")

To be honest I'm not sure why it's taking an extremely long time but there are certainly some things which could be made more efficient.
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
for time in "${times[#]}"; do
IFS=: read -r hours minutes seconds <<<"$time"
if [[ hours -ge currentHours && minutes -ge currentMinutes ]]; then
break
fi
done
This uses read, a built-in command, to split the text into variables, rather than calling external commands and creating subshells.
I assume that you want the script to run so quickly that it's safe to reuse currentHours and currentMinutes within the loop.
Note that you can also just use awk to do the whole thing:
awk -F: -v currentHours="$(date +"%H") -v currentMinutes="$(date +"%M")" '
$1 >= currentHours && $2 >= currentMinutes { print; exit }' file.txt
Just to make the program produce some output, I added a print, so that the last line is printed.

awk to the rescue!
awk -v time="12:12:00" '
function pad(x) {split(x,ax,":"); return (ax[1]<10)?"0"x:x}
BEGIN {time=pad(time)}
time>pad($0) {next}
{print; exit}' times
12:41:00
with 0 padding the hour you can do string only comparison.

Why is the output from these parallel processes not messed up?

Everything is executing perfectly.
words.dict file contains one word per line:
$ cat words.dict
cat
car
house
train
sun
today
station
kilometer
house
away
chapter.txt file contains plain text:
$ cat chapter.txt
The cars are very noisy today.
The train station is one kilometer away from his house.
This script below adds in result.txt file all words from words.dict not found with grep command in chapter.txt file, using 10 parallel grep:
$ cat psearch.sh
#!/bin/bash --
> result.txt
max_parallel_p=10
while read line ; do
while [ $(jobs | wc -l) -gt "$max_parallel_p" ]; do sleep 1; done
fgrep -q "$line" chapter.txt || printf "%s\n" "$line" >> result.txt &
done < words.dict
wait
A test:
$ ./psearch.sh
$ cat result.txt
cat
sun
I thought the tests would generate mixed words in result.txt
csat
un
But it really seems to work.
Please have a look and explain me why?

Background jobs are not threads. With a multi-threaded process then you can get that effect. The reason is that each process has just one standard output stream (stdout). In a multi-threaded program all threads share the same output stream, so an unprotected write to stdout can lead to garbled output as you describe. But you do not have a multi-threaded program.
When you use the & qualifier bash creates a new child process with its own stdout stream. Generally (depends on implementation details) this is flushed on a newline. So even though the file might be shared, the granularity is by line.
There is a slim chance that two processes could flush to the file at exactly the same time, but your code, with subprocesses and a sleep, makes it highly unlikely.
You could try taking out the newline from the printf, but given the inefficiency of the rest of the code, and the small dataset, it is still unlikely. It is quite possible that each process is complete before the next starts.

Generating sequences of numbers and characters with bash

I have written a script that accept two files as input. I want to run all in parallel at the same time on different CPUs.
inputs:
x00.A x00.B
x01.A x01.B
...
x30.A x30.B
instead of running 30 times:
./script x00.A x00.B
./script x01.A x01.B
...
./script x30.A x30.B
I wanted to use paste and seq to generate and send them to the script.
paste & seq | xargs -n1 -P 30 ./script
But I do not know how to combine letters and numbers using paste and seq commands.

for num in $(seq -f %02.f 0 30); do
./script x$num.A x$num.B &
done
wait
Although I personally prefer to not use GNU seq or BSD jot but (ksh/bash) builtins:
num=-1; while (( ++num <= 30 )); do
./script x$num.A x$num.B &
done
wait
The final wait is just needed to make sure they all finish, after having run spread across your available CPU cores in the background. So, if you need the output of ./script, you must keep the wait.
Putting them into the background with & is the simplest way for parallelism. If you really want to exercise any sort of control over lots of backgrounded jobs like that, you will need some sort of framework like GNU Parallel instead.

You can use pure bash for generating the sequence:
printf "%s %s\n" x{00..30}.{A..B} | xargs -n1 -P 30 ./script
Happy holidays!

Looping files in bash

I want to loop over these kind of files, where the the files with same Sample_ID have to be used together
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over and create output.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz>$destdir/Sample_52412_R1_R2.sam
How should I pattern match the file names Sample_ID_R1 and Sample_ID_R2 to be used in one command?
Thanks,

for fname in *_R1.fastq.gz
do
base=${fname%_R1*}
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam"
done
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more that $maxproc at once.
maxproc=3
for fname in * # _R1.fastq.gz
do
while [ $(jobs | wc -l) -ge "$maxproc" ]
do
sleep 1
done
base=${fname%_R1*}
echo starting new job with ongoing=$(jobs | wc -l)
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs which is a bash builtin function. Thus, it has to be run under bash, not dash which is the default for scripts under Debian-like distributions.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to count number of forked (sub-?)processes - bash

Related

Use PS0 and PS1 to display execution time of each bash command

bash loop taking extremely long time

Why is the output from these parallel processes not messed up?

Generating sequences of numbers and characters with bash

Looping files in bash

Categories

Resources