parallelize bash script


I need the sum of an integer contained in several webpages. getPages() parses the integer and sets it to $subTotal. getPages() is called in a for loop in background, but how do I get the sum of $subTotal? Is this a subshelling problem?
This is what I've tried so far.
#!/bin/bash
total=0
getPages(){
subTotal=$(lynx -dump http://"$(printf "%s:%s" $1 $2)"/file.html | awk -F, 'NR==1 {print $1}' | sed 's/\s//g')
total=$(($total+$subTotal))
echo "SubTotal: " $subTotal "Total: " $total
}
# /output/ SubTotal: 22 Total: 22
# /output/ SubTotal: 48 Total: 48 //Note Total should be 70
ARRAY=(
'pf2.server.com:6599'
'pf5.server.com:1199'
...
)
for server in ${ARRAY[@]} ; do
KEY=${server%%:*}
VALUE=${server##*:}
getPages $KEY $VALUE &
done
wait
echo $total
exit 0
# /output/ 0
Any advice appreciated.

Yes, this is a subshelling problem. Everything executed in a ... & list (i.e. your getPages $KEY $VALUE &) is executed in a subshell, which means that changes of variables there do not affect the parent shell.
I think one could do something using coprocesses (i.e. communication by streams), or maybe using GNU parallel or pexec.
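Without extra tools, one way to get the subtotals back is to let each background job write its result to its own file and sum the files after wait. This is only a sketch: fetch_subtotal is a hypothetical stand-in for the lynx/awk/sed pipeline in getPages, and the values 22 and 48 are the ones from your sample output.

```shell
#!/bin/bash
# fetch_subtotal is a hypothetical stand-in for the real lynx | awk | sed pipeline.
fetch_subtotal() {
  case "$1" in
    pf2.server.com:6599) echo 22 ;;
    pf5.server.com:1199) echo 48 ;;
  esac
}

servers=('pf2.server.com:6599' 'pf5.server.com:1199')
tmpdir=$(mktemp -d)

i=0
for server in "${servers[@]}"; do
  # Each job runs in its own subshell, so it reports back through a file.
  fetch_subtotal "$server" > "$tmpdir/$i" &
  i=$((i + 1))
done
wait  # every result file is complete once all jobs have finished

total=0
for f in "$tmpdir"/*; do
  read -r subTotal < "$f"
  total=$((total + subTotal))
done
rm -rf "$tmpdir"
echo "Total: $total"
```

The summing happens in the parent shell after wait, so total keeps its value there. With the stand-in values this prints Total: 70.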
Here is an example with pexec, using the default output to communicate from the single processes. I used a simpler command as the servers you listed are not accessible from here. This counts the lines on some webpages and sums them up.
ARRAY=(
'www.gmx.de:80'
'www.gmx.net:80'
'www.gmx.at:80'
'www.gmx.li:80'
)
(( total = 0 ))
while read subtotal
do
(( total += subtotal ))
echo "subtotal: $subtotal, total: $total"
done < <(
pexec --normal-redirection --environment hostname --number ${#ARRAY[*]} \
--parameters "${ARRAY[@]}" --shell-command -- '
lynx -dump http://$hostname/index.html | wc -l'
)
echo "total: $total"
We are using some tricks here:
We pipe the output of the parallel processes back to the main process and read it in a loop there.
To avoid creating a subshell for the while loop, we use bash's process substitution feature (<( ... )) together with input redirection (<) instead of a simple pipe.
We do the arithmetic in a (( ... )) arithmetic command. I could have used let instead, but then I would have had to quote everything or avoid spaces. (Your total=$(( total + subtotal )) would have worked, too.)
The options to pexec:
--normal-redirection redirects all the output streams of the subprocesses together into the output stream of pexec. (I'm not sure whether this could cause confusion if two processes write at the same time.)
--environment hostname passes the differing parameter for each execution as an environment variable. Otherwise it would be a simple command-line argument.
--number ${#ARRAY[*]} (which becomes --number 4 in our case) makes sure that all the processes are started in parallel, instead of only as many as we have CPUs or some other heuristic. (This is for network-roundtrip-bound work. For CPU-bound or bandwidth-bound work, a smaller number would be better.)
--shell-command makes sure the command is evaluated by a shell instead of being executed directly. This is necessary because of the pipeline in it.
--parameters "${ARRAY[@]}" lists the actual arguments, i.e. the elements of the array. A separate instance of the command is started for each of them.
After the final -- comes the command, as a single '-quoted string, to avoid premature interpretation of $hostname by the outer shell. The command simply downloads the file and pipes it to wc -l, counting the lines.
Example output:
subtotal: 1120, total: 1120
subtotal: 968, total: 2088
subtotal: 1120, total: 3208
subtotal: 1120, total: 4328
total: 4328
Here is (part of) the output of ps -f while this is running:
2799 pts/1 Ss 0:03 \_ bash
5427 pts/1 S+ 0:00 \_ /bin/bash ./download-test.sh
5428 pts/1 S+ 0:00 \_ /bin/bash ./download-test.sh
5429 pts/1 S+ 0:00 \_ pexec --number 4 --normal-redirection --environment hostname --parame...
5430 pts/1 S+ 0:00 \_ /bin/sh -c ? lynx -dump http://$hostname/index.html | wc -l
5434 pts/1 S+ 0:00 | \_ lynx -dump http://www.gmx.de:80/index.html
5435 pts/1 S+ 0:00 | \_ wc -l
5431 pts/1 S+ 0:00 \_ /bin/sh -c ? lynx -dump http://$hostname/index.html | wc -l
5436 pts/1 S+ 0:00 | \_ lynx -dump http://www.gmx.net:80/index.html
5437 pts/1 S+ 0:00 | \_ wc -l
5432 pts/1 S+ 0:00 \_ /bin/sh -c ? lynx -dump http://$hostname/index.html | wc -l
5438 pts/1 S+ 0:00 | \_ lynx -dump http://www.gmx.at:80/index.html
5439 pts/1 S+ 0:00 | \_ wc -l
5433 pts/1 S+ 0:00 \_ /bin/sh -c ? lynx -dump http://$hostname/index.html | wc -l
5440 pts/1 S+ 0:00 \_ lynx -dump http://www.gmx.li:80/index.html
5441 pts/1 S+ 0:00 \_ wc -l
We can see that really everything runs in parallel, as much as possible on my one-processor system.
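The point about avoiding a subshell for the while loop can be seen in isolation. In this sketch printf stands in for the pexec output, and bash's default settings are assumed (in particular, the lastpipe option is off):

```shell
#!/bin/bash
# Variant 1: a plain pipe. The while loop runs in a subshell,
# so its changes to total are lost when the loop ends.
total=0
printf '%s\n' 22 48 | while read -r n; do (( total += n )); done
piped=$total
echo "after pipe: $piped"

# Variant 2: process substitution plus input redirection.
# The loop runs in the current shell, so total survives.
total=0
while read -r n; do (( total += n )); done < <(printf '%s\n' 22 48)
echo "after process substitution: $total"
```

The first variant reports 0 and the second 70, which is the same effect you saw with the background jobs.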

A shorter version using GNU Parallel:
ARRAY=(
'www.gmx.de:80'
'www.gmx.net:80'
'www.gmx.at:80'
'www.gmx.li:80'
)
parallel lynx -dump http://{}/index.html \| wc -l ::: "${ARRAY[@]}" | awk '{s+=$1} END {print s}'
If the host:port is in a file:
cat host_port | parallel lynx -dump http://{}/index.html \| wc -l | awk '{s+=$1} END {print s}'
Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
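The awk stage at the end of both pipelines does the summing, and it can be tried on its own. In this sketch the four numbers are the subtotals from the pexec example output earlier on this page, used here in place of real wc -l results:

```shell
#!/bin/bash
# Fixed numbers stand in for the per-host output of "lynx ... | wc -l".
sum=$(printf '%s\n' 1120 968 1120 1120 | awk '{s+=$1} END {print s}')
echo "total: $sum"
```

Because awk receives every line on its standard input, the order in which the parallel jobs finish does not matter for the sum.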

Related

Bash, looping through space delimited array for creating files causes it to hang [duplicate]

I have a shell script which emails me errors encountered via crontasks which looks like the following:
exec >&-;
output="$(cat)";
shopt -s nocasematch
if [[ "$output" == *"error"* || "$output" == *"warning"* ]]; then
echo "$output" | mail -s "Error" my@email.com;
fi
exit 0;
My crontab looks like:
*/1 * * * * /opt/sh/email.sh /usr/bin/php /home/sites/website/app/console my:cli:command >> /var/log/cron.d/ my.cli.command/log 2>&1
The script works, but "cat" seems to hang:
root 23083 0.0 0.0 139752 1112 ? S Mar20 0:00 \_ CROND
500 23091 0.0 0.0 106096 1016 ? Ss Mar20 0:00 | \_ /bin/sh -c /usr/bin/php /var/www/website/app/console my:cli:command 2>&1 | /usr/local/bin/email.sh
500 23096 0.0 0.3 463528 27292 ? S Mar20 0:35 | \_ /usr/bin/php /var/www/website/app/console my:cli:command
500 23097 0.0 0.0 106096 1048 ? S Mar20 0:00 | \_ /bin/bash /usr/local/bin/email.sh
500 23101 0.0 0.0 100936 496 ? S Mar20 0:00 | \_ cat
root 12167 0.0 0.0 139752 1276 ? S Mar22 0:00 \_ CROND
500 12183 0.0 0.0 106096 1104 ? Ss Mar22 0:00 | \_ /bin/sh -c /usr/bin/php /var/www/website/app/console my:cli:command 2>&1 | /usr/local/bin/email.sh
500 12185 0.0 0.4 463528 36612 ? S Mar22 0:32 | \_ /usr/bin/php /var/www/website/app/console my:cli:command
500 12186 0.0 0.0 106096 1104 ? S Mar22 0:00 | \_ /bin/bash /usr/local/bin/email.sh
500 12194 0.0 0.0 100936 516 ? S Mar22 0:00 | \_ cat
root 1675 0.0 0.0 139752 1128 ? S Mar25 0:00 \_ CROND
Any ideas out there?

It is hanging because you're not giving cat any input to concatenate, so it will just listen to STDIN forever.
From the man page:
The cat utility reads files sequentially, writing them to the standard output. The file operands are processed in command-line order. If file is a single dash (`-') or absent, cat reads from the standard input. If file is a UNIX domain socket, cat connects to it and then reads it until EOF. This complements the UNIX domain binding capability available in inetd(8).
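To illustrate: with no file operand, cat keeps reading until it sees end-of-file, which on a pipe happens only once the writing side has been closed (normally when the writer exits). A minimal sketch, where producer is a hypothetical short-lived stand-in for the PHP command:

```shell
#!/bin/bash
# producer is a hypothetical stand-in for the PHP command in the crontab line.
producer() {
  echo "starting job"
  sleep 1
  echo "Warning: something looks off"
}

# cat returns only after producer has finished and the pipe is closed.
output=$(producer | cat)

shopt -s nocasematch
if [[ $output == *error* || $output == *warning* ]]; then
  result="would send mail"
else
  result="nothing to report"
fi
echo "$result"
```

If the producer never exits, cat never sees end-of-file and the script stays blocked at the output=... line.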

where does the 2nd process come from [duplicate]

This question already has answers here:
Different results between ps aux and `ps aux` inside a script
(2 answers)
Closed 5 years ago.
Put the following code into a shell script e.g. x.sh and chmod +x x.sh
./x.sh
I expected only one line of output, but I actually got two:
tmp $ ./x.sh
52140 ttys003 0:00.00 /bin/sh ./x.sh
52142 ttys003 0:00.00 /bin/sh ./x.sh
My question is where does the 52142 come from?
#!/bin/sh
myself=$(basename $0)
running=$(ps -A | grep "$myself" |grep -v grep)
echo "${running}"
Note: this is on MacOS 10.12
Updated the question with new experiment:
#!/bin/sh
myself=$(basename $0)
running=$(ps -fe | grep $myself |grep -v grep)
echo ==== '$(ps -fe | grep "$myself" |grep -v grep)' ====
echo "${running}"
echo
running=$(ps -fe | cat)
echo ==== '$(ps -fe | cat) |grep $myself' ====
echo "${running}"|grep $myself
echo
running=$(ps -fe)
echo ==== '$(ps -fe) |grep $myself' ====
echo "${running}" |grep $myself
The output on MacOS 10.12 is:
==== $(ps -fe | grep "$myself" |grep -v grep) ====
501 59912 81738 0 9:01AM ttys003 0:00.00 /bin/sh ./x.sh
501 59914 59912 0 9:01AM ttys003 0:00.00 /bin/sh ./x.sh
==== $(ps -fe | cat) |grep $myself ====
501 59912 81738 0 9:01AM ttys003 0:00.00 /bin/sh ./x.sh
501 59918 59912 0 9:01AM ttys003 0:00.00 /bin/sh ./x.sh
==== $(ps -fe) |grep $myself ====
501 59912 81738 0 9:01AM ttys003 0:00.00 /bin/sh ./x.sh
From the above, it seems the extra subshell is also related to the pipe.

$(command) is command substitution. The command is executed in a subshell.
A subshell is a child process (a fork) of your original shell (or shell script). If you check with ps, its command name is the same as that of the original (parent) shell script.
You can verify it by changing your x.sh into:
echo "$(ps -fe --forest>foo.txt)"
With --forest, ps prints a tree structure that includes the child processes. Open foo.txt and search for x.sh, and you will see the tree.
If I run here on my machine, I get:
kent 20866 707 0 15:55 \_ urxvt
kent 20867 20866 0 15:55 \_ zsh
kent 21457 20867 0 15:56 \_ sh ./x.sh
kent 21459 21457 0 15:56 \_ sh ./x.sh #subshell
kent 21460 21459 0 15:56 \_ ps -fe --forest
If we add one more layer, change your script to:
(
echo "$(ps -fe --forest>foo.txt)"
)
Now x.sh creates two nested subshells. If I run it and check foo.txt:
... 25882 27657 0 16:05 ... \_ -zsh
... 31027 25882 0 16:16 ... \_ sh ./x.sh
... 31028 31027 0 16:16 ... \_ sh ./x.sh #subshell1
... 31029 31028 0 16:16 ... \_ sh ./x.sh #subshell2
... 31031 31029 0 16:16 ... \_ ps -fe --forest
The PPID column (the third column in the output above) shows the same parent-child chain.
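The fork can also be observed from inside the script, without ps. This sketch assumes bash 4 or later, where BASHPID is available (it would not work with a plain /bin/sh):

```shell
#!/bin/bash
# Requires bash 4 or later for BASHPID.
main_pid=$$
inside_dollar=$(echo "$$")          # $$ is inherited: still the script's PID
inside_bashpid=$(echo "$BASHPID")   # PID of the forked subshell itself

echo "script:                 $main_pid"
echo "\$\$ inside \$(...):      $inside_dollar"
echo "BASHPID inside \$(...): $inside_bashpid"
```

$$ keeps the PID of the script even inside the command substitution, while BASHPID reports the PID of the process that actually runs the command, which is the second x.sh line in the ps output.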

bash command substitution issue with subshell

I was trying to prevent a script from being run by more than one user simultaneously and did not want to use commands only available on some OS'es or shells (pgrep, pidof, ...) and bumped into an issue that I am not sure whether it is a bug or not...
Please ignore the specifics I used in my script: the issue is about the command substitution in bash when using ps.
When I run the following (note the shebang in ksh):
#!/bin/ksh
CMD=`basename $0`
echo $CMD
ps -ef | grep "$CMD"
ps -ef | grep "$CMD" | wc -l
RUNS=`ps -ef | grep "$CMD" | wc -l`
echo $RUNS
if [ $RUNS -gt 2 ]; then
echo The script is currently being run by another user.
#exit 1
fi
RUNS=`ps -ef | grep "$CMD"`
echo "$RUNS"
RUNS=`echo "$RUNS" | wc -l`
echo $RUNS
if [ $RUNS -gt 2 ]; then
echo The script is currently being run by another user.
#exit 1
fi
ps -ef | grep "$CMD" | wc -l > lock
RUNS=`cat lock`
echo $RUNS
if [ $RUNS -gt 2 ]; then
echo The script is currently being run by another user.
exit 1
fi
I get this correct output:
testksh.sh7
abriere 19126 5669 0 14:15 pts/21 00:00:00 /bin/ksh ./testksh.sh7
abriere 19129 19126 0 14:15 pts/21 00:00:00 grep testksh.sh7
2
2
abriere 19126 5669 0 14:15 pts/21 00:00:00 /bin/ksh ./testksh.sh7
abriere 19137 19126 0 14:15 pts/21 00:00:00 grep testksh.sh7
2
2
I get this after replacing the shebang for bash and renaming the script accordingly:
testbash.sh7
abriere 5631 5669 0 14:12 pts/21 00:00:00 /bin/bash ./testbash.sh7
abriere 5634 5631 0 14:12 pts/21 00:00:00 grep testbash.sh7
2
3
The script is currently being run by another user.
abriere 5631 5669 0 14:12 pts/21 00:00:00 /bin/bash ./testbash.sh7
abriere 5643 5631 0 14:12 pts/21 00:00:00 /bin/bash ./testbash.sh7
abriere 5645 5643 0 14:12 pts/21 00:00:00 grep testbash.sh7
3
The script is currently being run by another user.
2
Note the extra line in the ps output.
The following line in bash:
RUNS=`ps -ef | grep "$CMD" | wc -l`
does not return the same value as:
ps -ef | grep "$CMD" | wc -l
Ksh does not have this issue.
As you can see, there are workarounds: I use one in the last section of my script.
I ran the scripts on Linux, AIX and SunOS and they gave me the same results; only Cygwin did not, but the ps command does not return the script in either shell.
Is this a bug? Even if bash runs command substitution within a subshell (see question 21331042), I still consider the variable assigned the value of the command substitution should return the same value as the command itself...

Why does the parent shell not really finish (becomes "S") after spawning a subshell using parentheses?

Recently, I noticed that "(cmd list)" will make the current (parent) shell become defunct until the subshell quits. I would be grateful if someone could tell me why this is the case.
Here is how to reproduce it:
$ cat test.sh
#!/bin/bash
(echo hello; sleep 60 )&
$ ./test.sh
hello
$ ps aux | grep -i '\(test.sh\|sleep\)'
dsuser 32621 0.0 0.0 113124 700 pts/0 S 00:56 0:00 /bin/bash ./test.sh
dsuser 32622 0.0 0.0 107896 620 pts/0 S 00:56 0:00 sleep 60
dsuser 32624 0.0 0.0 112644 1012 pts/0 R+ 00:56 0:00 grep --color=auto -i \(test.sh\|sleep\)
$ pkill sleep
./test.sh: line 3: 32622 Terminated sleep 60
$ ps aux | grep -i '\(test.sh\|sleep\)'
dsuser 32627 0.0 0.0 112644 1012 pts/0 R+ 00:57 0:00 grep --color=auto -i \(test.sh\|sleep\)
Please note that "test.sh" exists until the subshell (sleep) is killed. Also note that, in the following test, "test.sh" is reaped immediately.
$ cat test.sh
#!/bin/bash
echo hello; sleep 60 &
$ ./test.sh
hello
$ ps aux | grep -i '\(test.sh\|sleep\)'
dsuser 32631 0.0 0.0 107896 620 pts/0 S 00:57 0:00 sleep 60
dsuser 32633 0.0 0.0 112644 1012 pts/0 S+ 00:57 0:00 grep --color=auto -i \(test.sh\|sleep\)

The parent process remains in the process table until its children complete. At that point it's reaped. This is just part of how process management works in Unixy systems (some variance applies). You could just execute the command within the same shell using the exec command instead if you want the command to replace the parent process.
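As a small sketch of the exec suggestion (the sh -c command is only a placeholder for a real program such as sleep): when the last command of a subshell is started with exec, it takes over the subshell's process instead of becoming its child, so the PID stays the same. BASHPID assumes bash 4 or later.

```shell
#!/bin/bash
# The subshell first prints its own PID, then replaces itself via exec.
# The exec-ed sh prints its PID ($$), which is the same process.
{ read -r subshell_pid; read -r exec_pid; } < <(
  echo "$BASHPID"
  exec sh -c 'echo "$$"'
)
echo "subshell: $subshell_pid, after exec: $exec_pid"
```

Applied to the first test.sh, this would correspond to writing (echo hello; exec sleep 60) &, so that no separate bash process is left waiting for sleep.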

xautolock doesn't start a second time

I'll give an example to describe my problem.
#!/bin/sh
if (( $# == 1 ))
then
xmessage "before kill"
killall xautolock
xmessage "after kill"
var=$1
let "var += 1"
xautolock -time $var -locker "\"./test1.sh\"" &
xmessage "after run"
exit 0
fi
The first time I start xautolock from bash:
$ xautolock -time 1 -locker "./test1.sh 1" &
The option -time 1 means that xautolock will start the program passed as the argument of the -locker option after 1 minute of idle time.
After starting xautolock from bash:
$ ps ax | grep -E "xaut|test"
6038 pts/1 S 0:00 xautolock -time 1 -locker ./test1.sh 1
6046 pts/2 S+ 0:00 grep -E xaut|test
After starting xmessage "before kill" :
$ ps ax | grep -E "xaut|test"
6038 pts/1 S 0:00 xautolock -time 1 -locker ./test1.sh 1
6223 pts/1 S 0:00 /bin/sh /home/mhd/Texts/Programming/Programms/test1.sh 1
6240 pts/2 S+ 0:00 grep -E xaut|test
After starting xmessage "after kill":
$ ps ax | grep -E "xaut|test"
6223 pts/1 S 0:00 /bin/sh /home/mhd/Texts/Programming/Programms/test1.sh 1
6373 pts/2 S+ 0:00 grep -E xaut|test
After starting xmessage "after run":
$ ps ax | grep -E "xaut|test"
6223 pts/1 S 0:00 /bin/sh /home/mhd/Texts/Programming/Programms/test1.sh 1
6470 pts/2 S+ 0:00 grep -E xaut|test
Why isn't xautolock in the list of processes after this step? How can I start it a second time in a Bash script?

xautolock closes stdout and stderr by default. If you pass the option "-noclose" to xautolock, it will not close stdout and stderr, and you can start xautolock a second time in the Bash script. But I don't understand why xautolock does not start a second time in my sample script when it has closed stdout and stderr.
