Maintaining a set number of concurrent jobs w/ args from a file in bash - bash

I found this script on the net. I don't know much about working in bash and it's all a bit weird to me, but..
Here's my script:
CONTOR=0
for i in `cat targets`
do
  CONTOR=`ps aux | grep -c php`
  while [ $CONTOR -ge 250 ]; do
    CONTOR=`ps aux | grep -c php`
    sleep 0.1
  done
  if [ $CONTOR -le 250 ]; then
    php b $i > /dev/null &
  fi
done
My targets are urls, and the b php file is a crawler which saves some links into a file. The problem is that the maximum number of threads stays around 50-60, because the crawler finishes very fast and the bash script doesn't have time to open all 250 of my threads. Is there any chance to do something to open all 250 threads? Is it possible to start more than one thread per ps aux check? Right now it seems to open one thread after each ps aux.

First: Bash has no multithreading support whatsoever. foo & starts a separate process, not a thread.
Second: launching ps to check for children is both prone to false positives (treating unrelated invocations of php as if they were jobs in the current process) and extremely inefficient if done in a loop (since every invocation involves a fork()/exec()/wait() cycle).
Thus, don't do it that way: Use a release of GNU xargs with -P, or (if you must) GNU parallel.
Assuming your targets file is newline-delimited, and has no special quoting or characters, this could be as simple as:
xargs -d $'\n' -n 1 -P 250 php b <targets
...or, for pure POSIX shells:
xargs -d "
" -n 1 -P 250 php b <targets

With GNU Parallel it looks like this (choose the style you like best):
cat targets | parallel -P 250 php b
parallel -a targets -P 250 php b
parallel -P 250 php b :::: targets
There is no risk of false positives if there are other php processes running. And unlike xargs there is no risk if the file targets contains space, " or '.

Related

GNU Parallel: Run bash code that reads (seq number) from pipe?

I would like parallel to read the seq-numbers pipe, so I would like to run something like this:
seq 2000 | parallel --max-args 0 --jobs 10 "{ read test; echo $test; }"
Would be equivalent to running:
echo 1
echo 2
echo 3
echo 4
...
echo 2000
But unfortunately, the pipe was not read by parallel, meaning that it was instead run like:
echo
echo
echo
...
echo
And the output is empty.
Does anyone know how to make parallel read (seq numbers) pipe? Thanks.
An alternative with GNU xargs that does not require GNU parallel:
seq 2000 | xargs -P 10 -I {} "echo" "hello world {}"
Output:
hello world 1
hello world 2
hello world 3
hello world 4
hello world 5
...
From man xargs:
-P max-procs: Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
-I replace-str: Replace occurrences of replace-str in the initial-arguments with names read from standard input.
You want the input to be piped into the command you run, so use --pipe:
seq 2000 |
parallel --pipe -N1 --jobs 10 'read test; echo $test;'
But if you really just need it for a variable, I would do one of these:
seq 2000 | parallel --jobs 10 echo
seq 2000 | parallel --jobs 10 echo {}
seq 2000 | parallel --jobs 10 'test={}; echo $test'
I will encourage you to spend 20 minutes on reading chapter 1+2 of https://doi.org/10.5281/zenodo.1146014 Your command line will love you for it.
Using xargs instead of parallel while still using a shell (instead of starting up a new copy of the /bin/echo executable per line to run) would look like:
seq 2000 | xargs -P 10 \
sh -c 'for arg in "$@"; do echo "hello world $arg"; done' _
This is likely to be faster than the existing answer by Cyrus, because starting executables takes time. Even though starting a new copy of /bin/sh takes longer than starting a copy of /bin/echo, because this isn't using -I {}, it's able to pass many arguments to each copy of /bin/sh, thus amortizing that startup cost over more numbers. And that way we're able to use the echo built into sh, instead of the separate echo executable.
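The amortization can be made explicit with -n, which controls how many arguments each sh invocation receives; a sketch (the batch size of 500 is an arbitrary choice):

```shell
# Each sh -c invocation receives up to 500 arguments and loops over them
# with the shell's built-in echo, so only ~4 shells are started for 2000
# inputs instead of one process per input.
seq 2000 | xargs -n 500 -P 4 \
    sh -c 'for arg in "$@"; do echo "hello world $arg"; done' _
```

The trailing _ fills $0, so "$@" holds only the numbers.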

Run jobs in sequence rather than consecutively using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
Where n is the number of processors used, m is the memory requested, and t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do g09sub $i -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com, 2.com, 3.com, 4.com, 5.com, 6.com, 7.com, 8.com, 9.com, and 10.com all start at the same time and then as each of those finishes have another .com file start. So that no more than 10 jobs from any one folder will be running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.
Try xargs or GNU parallel
xargs
ls *.com | xargs -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tells xargs that {} will represent the input file name
-P 10 sets the maximum number of jobs run at once
parallel
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} represents the input file name
--jobs 10 sets the maximum number of jobs run at once
Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command
is supplied as an argument), in blocks of ten shell jobs at a time.
Where that option was not available to me, using the jobs function worked rather crudely. eg:
for entry in *.com; do
while [ $(jobs | wc -l) -gt 9 ]; do
sleep 1 # this is in seconds; your sleep may support 'arbitrary floating point number'
done
g09sub ${entry} -n 2 -m 4gb -t 200:00:00 &
done
$(jobs | wc -l) counts the number of jobs spawned into the background with &, so the -gt 9 test keeps at most 10 running at once.
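On bash 4.3 or newer, `wait -n` blocks until any one background job exits, which avoids the sleep-polling altogether. A minimal sketch, with a short sleep standing in for the g09sub call and a touch imitating the .log file each job leaves behind:

```shell
#!/usr/bin/env bash
# Keep at most 3 jobs in flight; start the next as soon as any one exits.
tmpdir=$(mktemp -d)
max=3
for entry in 1 2 3 4 5 6; do
    while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do
        wait -n                                   # block until a job exits (bash 4.3+)
    done
    { sleep 0.2; touch "$tmpdir/$entry.log"; } &  # stand-in for: g09sub "$entry" ... &
done
wait                                              # reap the remaining jobs
ls "$tmpdir"/*.log | wc -l                        # all six jobs have finished
```

For the real workload, raise max to 10 and replace the braced group with the g09sub command.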

How can you make sure that exactly n processes are running in bash?

I have a program that processes files in a really disk-usage-heavy way. I want to call this process on many files, and experience shows that performance is best when no more than 3 processes are started at the same time (otherwise they compete too much for disk usage as a resource and slow each other down). Is there an easy way to call commands from a list and start executing a new one when fewer than n (3) of the processes (started by the listed commands) are running at the same time?
You could use xargs. From the manpage:
--max-procs=max-procs
-P max-procs
Run up to max-procs processes at a time; the default is 1. If
max-procs is 0, xargs will run as many processes as possible at
a time. Use the -n option with -P; otherwise chances are that
only one exec will be done.
For example, assuming your commands are one per line:
printf 'sleep %dm\n' 1 2 3 4 5 6 | xargs -L1 -P3 -I {} sh -c {}
Then, in a terminal:
$ pgrep sleep -fa
11987 sleep 1m
11988 sleep 2m
11989 sleep 3m
$ # a little while later
$ pgrep sleep -fa
11988 sleep 2m
11989 sleep 3m
12045 sleep 4m
The -L1 option uses one line at a time as the argument, and -I {} indicates that {} will be replaced with that line. To actually run the command, we pass it to sh as an argument to -c.

Running programs in parallel using xargs

I currently have the following script.
#!/bin/bash
# script.sh
for i in {0..99}; do
  script-to-run.sh input/ output/ $i
done
I wish to run it in parallel using xargs. I have tried
script.sh | xargs -P8
But doing the above only executes one command at a time. No luck with -n8 either.
Adding & at the end of the line to be executed in the script's for loop would try to run the script 99 times at once. How do I execute the loop only 8 at a time, up to 100 total?
From the xargs man page:
This manual page documents the GNU version of xargs. xargs reads items
from the standard input, delimited by blanks (which can be protected
with double or single quotes or a backslash) or newlines, and executes
the command (default is /bin/echo) one or more times with any initial-
arguments followed by items read from standard input. Blank lines on
the standard input are ignored.
Which means that for your example xargs is waiting and collecting all of the output from your script and then running echo <that output>. Not exactly all that useful nor what you wanted.
The -n argument is how many items from the input to use with each command that gets run (nothing, by itself, about parallelism here).
To do what you want with xargs you would need to do something more like this (untested):
printf %s\\n {0..99} | xargs -n 1 -P 8 script-to-run.sh input/ output/
Which breaks down like this.
printf %s\\n {0..99} - Print one number per-line from 0 to 99.
Run xargs
taking at most one argument per run command line
and run up to eight processes at a time
With GNU Parallel you would do:
parallel script-to-run.sh input/ output/ {} ::: {0..99}
Add in -P8 if you do not want to run one job per CPU core.
Unlike xargs, it will do The Right Thing, even if the input contains space, ', or " (not the case here, though). It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed that you will not get half a line from two different jobs.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU, but that leaves CPUs idle once their own batch finishes. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
You can use this simple one-line command:
seq 0 99 | xargs -n 1 -P 8 script-to-run.sh input/ output/
Here's an example running commands in parallel in conjuction with find:
find -name "*.wav" -print0 | xargs -0 -t -I % -P $(nproc) flac %
-print0 terminates filenames with a null byte rather than a newline, so we can use -0 in xargs to prevent filenames with spaces from being treated as two separate arguments.
-t means verbose; it makes xargs print every command it executes. It can be useful; remove it if not needed.
-I % means replace occurrences of % in the command with arguments read from standard input.
-P $(nproc) means run a maximum of nproc instances of our command in parallel (nproc prints the number of available processing units).
flac % is our command; the -I % from earlier means % is replaced with each filename, so this becomes flac foo.wav
See also: Manual for xargs(1)

bash pipe limit

I got a txt list of urls I want to download:
n=1
end=`cat done1 | wc -l`
while [ $n -lt $end ]
do
  nextUrls=`sed -n "${n}p" < done1`
  wget -N -nH --random-wait -t 3 -a download.log -A$1 $nextUrls
  let "n++"
done
I want to do it faster with pipes, but if I do this:
wget -N -nH --random-wait -t 3 -a download.log -A$1 $nextUrls &
my RAM fills up and blocks my PC completely.
Does anyone know how to limit the pipes created to like 10 at the same time?
You are not creating pipes (|), you are creating background processes (&). Every time your while loop executes its body, you create a new wget process without waiting for it to exit, which (depending on the value of end) may create a lot of wget processes very fast. Either do it sequentially (remove the &) or you can try executing n processes in parallel and waiting for them.
BTW, useless use of cat: you can simply do:
end=$(wc -l < done1)
(The redirection matters: `wc -l done1` would print the filename after the count and break the numeric comparison.)
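The "executing n processes in parallel and waiting for them" suggestion can be sketched like this; echo stands in for the wget call, and seq 25 stands in for the done1 url list:

```shell
#!/usr/bin/env bash
# Start background jobs in batches of 10 and wait for each whole batch
# before starting the next, so at most 10 run at any moment.
batch=10
n=0
count=0
while IFS= read -r url; do
    echo "fetching $url" &      # stand-in for: wget ... "$url" &
    n=$((n + 1))
    count=$((count + 1))
    if [ "$n" -ge "$batch" ]; then
        wait                    # block until the whole batch has finished
        n=0
    fi
done < <(seq 25)                # stand-in for: done < done1
wait                            # reap the final partial batch
echo "launched $count jobs"
```

Note the downside: a new batch only starts once every job in the previous one has finished, whereas xargs -P keeps all 10 slots continuously full.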
i got a txt list of urls i want to download... i want to do it faster..
So here's the shortest way to do that. The following command downloads the URLs from the list contained in the file txt_list_of_urls, running 10 wget processes in parallel:
xargs -a txt_list_of_urls -P 10 -r -n 1 wget -nv