Strange behaviour of bash script for running parallel subprocesses in bash

The following script is used for running parallel subprocesses in bash; it is slightly changed from Running a limited number of child processes in parallel in bash?
#!/bin/bash
set -o monitor # means: run background processes in separate process groups...
N=1000
todo_array=($(seq 0 $((N-1))))
max_jobs=5
trap add_next_job CHLD
index=0
function add_next_job {
    if [[ $index -lt ${#todo_array[@]} ]]
    then
        do_job $index &
        index=$(($index+1))
    fi
}
function do_job {
    echo $1 start
    time=$(echo "scale=0;x=$RANDOM % 10;scale=5;x/20+0.05" |bc);sleep $time;echo $time
    echo $1 done
}
while [[ $index -lt $max_jobs ]] && [[ $index -lt ${#todo_array[@]} ]]
do
    add_next_job
done
wait
Each job picks a random number from 0.05 to 0.50 in steps of 0.05 and sleeps that many seconds.
For example, with N=10, a sample output is
1 start
4 start
3 start
2 start
0 start
.25000
2 done
5 start
.30000
3 done
6 start
.35000
0 done
7 start
.40000
1 done
8 start
.40000
4 done
9 start
.05000
7 done
.20000
5 done
.25000
9 done
.45000
6 done
.50000
8 done
which has 30 lines in total.
But for a big N such as 1000, the result can be strange. One run gave 2996 lines of output, with 998 start lines, 999 done lines, and 999 lines with a float number; 644 and 652 were missing from the start lines, and 644 was missing from the done lines.
These tests were run on Arch Linux with bash 4.2.10(2). Similar results can be reproduced on Debian stable with bash 4.1.5(1).
EDIT: I tried parallel from moreutils and GNU parallel for this test. parallel from moreutils has the same problem, but GNU parallel works perfectly.
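For comparison, a GNU parallel version of the same test can be sketched roughly as follows (my sketch, not from the original post; a fixed 0.2 s sleep stands in for the random delay):
seq 0 999 | parallel -j 5 'echo {} start; sleep 0.2; echo {} done'
Here -j 5 mirrors max_jobs=5, and GNU parallel buffers each job's output and prints it in one piece, which is why none of the lines get lost.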

I think this is just due to all of the subprocesses inheriting the same file descriptor and trying to append to it in parallel. Very rarely, two of the processes race, both start appending at the same offset, and one overwrites the other. This is essentially the reverse of what one of the comments suggests.
You could easily check this by redirecting through a pipe, such as with your_script | tee file, because pipes have rules about atomicity of data delivered by single write() calls that are smaller than a particular size (PIPE_BUF).
There's another question on SO that's similar to this (I think it just involved two threads both quickly writing numbers) where this is also explained, but I can't find it.
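To make that check concrete, one possible way to verify the counts after piping through tee (a sketch; the script and log names are placeholders):
./parallel_test.sh | tee run.log
grep -c ' start$' run.log   # should print N
grep -c ' done$' run.log    # should print N
wc -l < run.log             # should print 3*N
If the counts come out right through the pipe but wrong when writing straight to a shared file, the missing lines were indeed eaten by the append race.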

The only thing I can imagine is that you're running out of resources; check "ulimit -a" and look for "max user processes". If that's less than the number of processes you want to spawn, you will end up with errors.
Try to set the limits for your user (if you're not running as root) to a higher value. On Redhat-ish systems you can do this by:
Adding this line to /etc/pam.d/login:
session required pam_limits.so
Adding the following content to /etc/security/limits.conf:
myuser soft nproc 1000
myuser hard nproc 1024
where "myuser" is the username who is granted the right, 1000 the default value of "max user processes" and 1024 the maximum number of userprocesses. Soft- and hard-limit shouldn't be too much apart. It only says what the user is allowed to set himself using the "ulimit" command in his shell.
So the myuser will start with a total of a 1000 processes (including the shell, all other spawned processes), but may raise it to 1024 using ulimit:
$ ulimit -u
1000
$ ulimit -u 1024
$ ulimit -u
1024
$ ulimit -u 2000
-bash: ulimit: max user processes: cannot modify limit: Operation not permitted
A reboot is not required; the new limits take effect as soon as a new login session is started.
Good luck!
Alex.

Related

How to compare the user idle session in bash to a limit in minutes?

I am trying to come up with a bash script that checks whether a user's idle time is more than 30 minutes and then kills the session, but I am not able to come up with the right filter.
who -u | cut -c 1-10,38-50 > /tmp/idle$$
for idleSession in `cat /tmp/idle$$ | awk '{print $3}'`
do
    if [ "$idleSession" -gt 30 ]; then
        echo $idleSession
    fi
done
I have found suggestions with egrep but I don't understand that.
I keep getting
user_test.sh: line 6: [: 14:25: integer expression expected
Update: I fixed the typo in the code, but now everything gets printed and the value is not being compared against my limit of 30 minutes.
This Shellshock-clean code prints details of sessions on the current machine that have been idle for more than 30 minutes:
#! /bin/bash -p
# Check if an idle time string output by 'who -u' represents a long idle time
# (more than 30 minutes)
function is_long_idle_time
{
    local -r idle_time=$1
    [[ $idle_time == old ]] && return 0
    [[ $idle_time == *:* ]] || return 1
    local -r hh=${idle_time%:*}
    local -r mm=${idle_time#*:}
    local -r idle_minutes=$((60*10#$hh + 10#$mm))
    (( idle_minutes > 30 )) && return 0 || return 1
}
who_output=$(LC_ALL=C who -u)
while read -r user tty _ _ _ idle_time pid _ ; do
    if is_long_idle_time "$idle_time" ; then
        printf 'user=%s, tty=%s, idle_time=%s, pid=%s\n' \
               "$user" "$tty" "$idle_time" "$pid"
    fi
done <<<"$who_output"
The code assumes that the output of LC_ALL=C who -H -u looks like:
NAME LINE TIME IDLE PID COMMENT
username pts/9 Apr 25 18:42 06:44 3366 (:0)
username pts/10 Apr 25 18:42 old 3366 (:0)
username pts/11 Apr 25 18:44 . 3366 (:0)
username pts/12 Apr 25 18:44 00:25 3366 (:0)
...
It may look different on your system, in which case the code might need to be modified.
The "idle" string output by who -u can take several different forms. See who (The Open Group Base Specifications Issue 7) for details. Processing it is not completely trivial and is done by a function, is_long_idle_time, to keep the main code simpler.
The function extracts the hours (hh (06)) and minutes (mm (44)) from idle strings like '06:44' and calculates the total number of idle minutes (idle_minutes (404)). The base qualifiers (10#) in the arithmetic expression are necessary to prevent the strings '08' and '09' from being treated as invalid octal numbers. See Value too great for base (error token is "08").
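For instance (my illustration, not from the original answer), in an interactive bash session:
$ echo $(( 08 + 1 ))
bash: 08: value too great for base (error token is "08")
$ echo $(( 10#08 + 1 ))
9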
The format of the who -u output can (and does) differ according to the Locale. Running it with LC_ALL=C who -u ensures that it will generate the same output regardless of the user's environment. See Explain the effects of export LANG, LC_CTYPE, LC_ALL.
Within the main loop you get the username, terminal/line, idle time, and PID of all sessions that have been idle for more than 30 minutes. However, it may not be straightforward to use this information to kill idle sessions. On some systems, multiple sessions may be associated with the same PID. Even if you can reliably determine the PIDs of idle sessions, the idleness may be false. For instance, a session that is running a long-running program that has generated no terminal output (yet) will appear to be idle. Killing it might not be a smart thing to do though. Consider using TMOUT instead. See How can one time out a root shell after a certain period of time? (and note that it can be used for any user, not just root).
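A minimal sketch of the TMOUT approach (my addition; the file path is an assumption and will vary by distribution):
# e.g. in /etc/profile.d/autologout.sh
TMOUT=1800       # terminate an interactive bash shell after 30 minutes at an idle prompt
readonly TMOUT   # stop users from simply unsetting it
export TMOUT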

How to start a large number of quick jobs in Bash

I have 3000 very quick jobs to run; on average each takes 2/3 seconds.
The list of jobs is in a file, and I want to control how many I have open.
However, starting a job in the background (the & line) seems to take some time itself, so some jobs are already finishing before the "INTOTAL" amount have even been started...
Therefore, I am not using my 32 cores efficiently.
Is there a better approach than the one below?
#!/bin/sh
#set -x
INTOTAL=28
while true
do
    NUMRUNNING=`tasklist | egrep Prod.exe | wc -l`
    JOBS=`cat jobs.lst | wc -l`
    if [ $JOBS -gt 0 ]
    then
        MAXSTART=$(($INTOTAL-$NUMRUNNING))
        NUMTOSTART=$JOBS
        if [ $NUMTOSTART -gt $MAXSTART ]
        then
            NUMTOSTART=$MAXSTART
        fi
        echo 'Starting: '$NUMTOSTART
        for ((i=1;i<=$NUMTOSTART;i++))
        do
            JOB=`head -n1 jobs.lst`
            sed -i 1d jobs.lst
            /Prod $JOB &
        done
        sleep 2
    fi
    sleep 3
done
You may want to have a look at parallel, which you should be able to install on Cygwin according to the release notes. Then running the tasks in parallel can be as easy as:
parallel /Prod {} < jobs.lst
See its man page for an example of this (and have a look through the plethora of examples there for more about the many options it has).
To control how many jobs run at a time, use the -j flag. By default it runs one job per core at a time, so 32 for you. To limit it to 16, for instance:
parallel -j 16 /Prod {} < jobs.lst
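If installing parallel turns out to be awkward, a similar effect can be had with GNU xargs (my suggestion, not part of the original answer), assuming each line of jobs.lst is a single argument to /Prod:
xargs -P 28 -I{} /Prod {} < jobs.lst
Here -P 28 plays the role of INTOTAL, and -I{} makes xargs run one /Prod invocation per input line, keeping up to 28 of them running at once.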

Parallel nested for loop in bash

I am trying to run a C executable through bash. The executable will take a different argument in each iteration, and I want to do it in parallel since I have 12 cores available.
I tried
w=1;
for i in {1..100}
do
    l=$(($i-1));
    for j in {12*l..12*i}
    do
        ./run $w/100 > "$w"_out &
    done
    expr=$w % 12;
    if ["$expr" -eq "0"]
    then wait;
    fi;
done
run is the C executable. I want to run it with an increasing argument w in each step, and I want to wait until all processes are done whenever 12 of the cores are in use. So basically, I will run 12 executables at the same time, wait until they are completed, and then move on to the next 12.
Hope I made my point clear.
Cheers.
Use GNU parallel instead:
parallel ./myscript {1} ::: {1..100}
You can specify the number of parallel processes with the -P option, but it defaults to the number of cores in the system.
You can also specify -k to keep the output order when you redirect the output to a file.
To redirect the output to individual files, you can specify the output redirection, but you have to quote it, so that it is not parsed by the shell. For example:
parallel ./run {1} '>' {1}_out ::: {1..10}
is equivalent to running ./run 1 > 1_out through ./run 10 > 10_out.
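If GNU parallel is not available, a pure-bash sketch of the batch-of-12 behaviour described in the question (my illustration, following the ./run N > N_out convention from the answer above) could look like this:
for w in {1..100}
do
    ./run "$w" > "${w}_out" &
    (( w % 12 == 0 )) && wait   # pause until the current batch of 12 has finished
done
wait   # wait for the final partial batch
A plain wait with no arguments blocks until every background job started by the script has exited.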

Looping files in bash

I want to loop over these kinds of files, where the files with the same Sample_ID have to be used together:
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over them and create the output.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz>$destdir/Sample_52412_R1_R2.sam
How should I pattern match the file names Sample_ID_R1 and Sample_ID_R2 to be used in one command?
Thanks,
for fname in *_R1.fastq.gz
do
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam"
done
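One assumption worth making explicit (my note, not part of the original answer): the glob *_R1.fastq.gz matches files in the current directory, so the loop should be run from the directory that holds the fastq files, for example after
cd "$sourcedir" || exit 1
using the sourcedir variable defined in the question.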
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3
for fname in *_R1.fastq.gz
do
    while [ $(jobs | wc -l) -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo starting new job with ongoing=$(jobs | wc -l)
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs, which is a bash builtin. Thus, it has to be run under bash, not dash, which is the default /bin/sh for scripts under Debian-like distributions.
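One more detail (my addition, not in the original answer): the script above returns as soon as the last job has been launched. Appending a final wait makes it block until every background bwa run has completed:
wait   # block until all remaining background jobs finish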

How to count number of forked (sub-?)processes

Somebody else has written (TM) some bash script that forks very many sub-processes. It needs optimization. But I'm looking for a way to measure "how bad" the problem is.
Can I / How would I get a count that says how many sub-processes were forked by this script all-in-all / recursively?
This is a simplified version of what the existing, forking code looks like - a poor man's grep:
#!/bin/bash
file=/tmp/1000lines.txt
match=$1
let cnt=0
while read line
do
    cnt=`expr $cnt + 1`
    lineArray[$cnt]="${line}"
done < $file
totalLines=$cnt
cnt=0
while [ $cnt -lt $totalLines ]
do
    cnt=`expr $cnt + 1`
    matches=`echo ${lineArray[$cnt]}|grep $match`
    if [ "$matches" ] ; then
        echo ${lineArray[$cnt]}
    fi
done
It takes the script 20 seconds to look for $1 in 1000 lines of input. This code forks way too many sub-processes. In the real code, there are longer pipes (e.g. progA | progB | progC) operating on each line using grep, cut, awk, sed and so on.
This is a busy system with lots of other stuff going on, so a count of how many processes were forked on the entire system during the run-time of the script would be of some use to me, but I'd prefer a count of processes started by this script and descendants. And I guess I could analyze the script and count it myself, but the script is long and rather complicated, so I'd just like to instrument it with this counter for debugging, if possible.
To clarify:
I'm not looking for the number of processes under $$ at any given time (e.g. via ps), but the number of processes run during the entire life of the script.
I'm also not looking for a faster version of this particular example script (I can do that). I'm looking for a way to determine which of the 30+ scripts to optimize first to use bash built-ins.
You can count the forked processes simply by trapping the SIGCHLD signal. If you can edit the script file, then you can do this:
set -o monitor # or set -m
trap "((++fork))" CHLD
So the fork variable will contain the number of forks. At the end you can print this value:
echo $fork FORKS
For a 1000 lines input file it will print:
3000 FORKS
This code forks for two reasons: once for each expr ... and once for each `echo ...|grep ...`. So in the reading while-loop it forks every time a line is read; in the processing while-loop it forks twice per line (once because of expr ... and once for `echo ...|grep ...`). So for a 1000-line file it forks 3000 times.
But this is not exact! That count is just the forks done by the calling shell. There are more forks, because `echo ...|grep ...` forks to start a shell to run that pipeline, and that shell then also forks twice: once for echo and once for grep. So it is actually 3 forks per such line, not one, which makes it closer to 5000 forks, not 3000.
If you need to count the forks of the forks (of the forks...) as well (or you cannot modify the bash script, or you want to do it from another script), a more exact solution can be to use
strace -fo s.log ./x.sh
It will print lines like this:
30934 execve("./x.sh", ["./x.sh"], [/* 61 vars */]) = 0
Then you need to count the unique PIDs using something like this (the first number on each line is the PID):
awk '{n[$1]}END{print length(n)}' s.log
In the case of this script I got 5001 (the +1 is the PID of the original bash script).
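Assuming GNU grep and that the traced shell's child creation shows up as fork()/clone() lines in the log (an assumption on my part), you could also count the process-creation calls directly:
grep -cE '\b(clone|fork|vfork)\(' s.log
This counts child-creation events rather than unique PIDs, so the two numbers should roughly agree.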
COMMENTS
Actually in this case all forks can be avoided:
Instead of
cnt=`expr $cnt + 1`
Use
((++cnt))
Instead of
matches=`echo ${lineArray[$cnt]}|grep $match`
if [ "$matches" ] ; then
echo ${lineArray[$cnt]}
fi
You can use bash's internal pattern matching:
[[ ${lineArray[cnt]} =~ $match ]] && echo ${lineArray[cnt]}
Mind that bash's =~ uses ERE, not BRE (which is what plain grep uses). So it will behave like egrep (or grep -E), not grep.
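A quick illustration of the difference (my example, not from the original answer):
$ [[ aaa =~ ^a+$ ]] && echo matched    # ERE: + is a quantifier
matched
$ echo aaa | grep '^a+$'               # BRE: + is literal, so no match
$ echo aaa | grep -E '^a+$'            # ERE again
aaa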
I assume that lineArray is not pointless (otherwise the matching could be tested in the reading loop and lineArray would not be needed) and that it is used for some other purpose as well. In that case I can suggest a slightly shorter version:
readarray -t lineArray <infile
for line in "${lineArray[@]}"; { [[ $line =~ $match ]] && echo "$line"; }
The first line reads the complete infile into lineArray without any loop. The second line processes the array element by element.
MEASURES
Original script for 1000 lines (on Cygwin):
$ time ./test.sh
3000 FORKS
real 0m48.725s
user 0m14.107s
sys 0m30.659s
Modified version
FORKS
real 0m0.075s
user 0m0.031s
sys 0m0.031s
The same on Linux:
3000 FORKS
real 0m4.745s
user 0m1.015s
sys 0m4.396s
and
FORKS
real 0m0.028s
user 0m0.022s
sys 0m0.005s
So this version uses no fork (or clone) at all. I would suggest using this version only for small (<100 KiB) files; in other cases grep, egrep, or awk outperforms the pure bash solution. But this should be checked by a performance test.
For a thousand lines on Linux I got the following:
$ time grep Solaris infile # Solaris is not in the infile
real 0m0.001s
user 0m0.000s
sys 0m0.001s
