How can you make sure that exactly n processes are running in bash? - bash

I have a program that processes files in a really disk-heavy way. I want to call this process on many files, and experience shows that performance is best when no more than 3 processes are started at the same time (otherwise they compete too much for the disk and slow each other down). Is there an easy way to call commands from a list and start a new one whenever fewer than n (3) of the processes started by the listed commands are running?

You could use xargs. From the manpage:
--max-procs=max-procs
-P max-procs
Run up to max-procs processes at a time; the default is 1. If
max-procs is 0, xargs will run as many processes as possible at
a time. Use the -n option with -P; otherwise chances are that
only one exec will be done.
For example, assuming your commands are one per line:
printf 'sleep %dm\n' 1 2 3 4 5 6 | xargs -L1 -P3 -I {} sh -c {}
Then, in a terminal:
$ pgrep sleep -fa
11987 sleep 1m
11988 sleep 2m
11989 sleep 3m
$ # a little while later
$ pgrep sleep -fa
11988 sleep 2m
11989 sleep 3m
12045 sleep 4m
The -L1 option uses one line at a time as the argument, and -I {} indicates that {} will be replaced with that line. To actually run the command, we pass it to sh as an argument to -c.
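If your commands already live in a file (say a hypothetical commands.txt, one command per line), the same pattern applies; a minimal sketch:
xargs -L1 -P3 -I {} sh -c {} < commands.txt   # commands.txt is a placeholder for your command list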

Related

GNU Parallel: Run bash code that reads (seq number) from pipe?

I would like parallel to read the (seq numbers) pipe, so I would like to run something like this:
seq 2000 | parallel --max-args 0 --jobs 10 "{ read test; echo $test; }"
Would be equivalent to running:
echo 1
echo 2
echo 3
echo 4
...
echo 2000
But unfortunately, the pipe was not read by parallel, meaning that it was instead run like:
echo
echo
echo
...
echo
And the output is empty.
Does anyone know how to make parallel read the (seq numbers) pipe? Thanks.
An alternative with GNU xargs that does not require GNU parallel:
seq 2000 | xargs -P 10 -I {} "echo" "hello world {}"
Output:
hello world 1
hello world 2
hello world 3
hello world 4
hello world 5
.
.
.
From man xargs:
-P max-procs: Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
-I replace-str: Replace occurrences of replace-str in the initial-arguments with names read from standard input.
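A variant without -I also works: with -n 1, xargs appends one number from the input to each echo (a sketch, same output as above):
seq 2000 | xargs -P 10 -n 1 echo "hello world"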
You want the input to be piped into the command you run, so use --pipe:
seq 2000 |
parallel --pipe -N1 --jobs 10 'read test; echo $test;'
But if you really just need it for a variable, I would do one of these:
seq 2000 | parallel --jobs 10 echo
seq 2000 | parallel --jobs 10 echo {}
seq 2000 | parallel --jobs 10 'test={}; echo $test'
I encourage you to spend 20 minutes reading chapters 1 and 2 of https://doi.org/10.5281/zenodo.1146014. Your command line will love you for it.
Using xargs instead of parallel while still using a shell (instead of starting up a new copy of the /bin/echo executable per line to run) would look like:
seq 2000 | xargs -P 10 \
  sh -c 'for arg in "$@"; do echo "hello world $arg"; done' _
This is likely to be faster than the existing answer by Cyrus, because starting executables takes time. Even though starting a new copy of /bin/sh takes longer than starting a copy of /bin/echo, this command doesn't use -I {}, so it can pass many arguments to each copy of /bin/sh, amortizing that startup cost over more numbers; and that way we get to use the echo built into sh instead of the separate echo executable.
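If you want to cap how many numbers each copy of sh receives (to keep the 10 workers evenly loaded), -n bounds the batch size; a sketch, where 200 is an arbitrary choice:
seq 2000 | xargs -P 10 -n 200 \
  sh -c 'for arg in "$@"; do echo "hello world $arg"; done' _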

Making a Bash script that can open multiple terminals and run wget in each

I have to download bulks of over 100,000 docs from a databank using this script:
#!/usr/bin/bash
IFS=$'\n'
set -f
for line in $(cat < "$1")
do
    wget https://www.uniprot.org/uniprot/${line}.txt
done
The first time it took over a week to download all the files (all under 8 KB), so I tried opening multiple terminals and running a split of the total.txt (10 equal splits of 10,000 files in 10 terminals), and in just 14 hours I had all the documents downloaded. Is there a way to make a script do that for me?
this is a sample of what the list looks like:
D7E6X7
A0A1L9C3F2
A3K3R8
W0K0I7
gnome-terminal -e command
or
xterm -e command
or
konsole -e command
or
terminal -e command
There is another alternative to make it faster.
Right now your downloads are synchronous, i.e. the next download does not start until the current one has finished.
Search for how to make a command asynchronous / run it in the background on Unix.
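A minimal sketch of that idea, keeping the structure of the original script (note that this launches every download at once, with no limit on concurrency):
#!/usr/bin/bash
# Sketch only: each wget is put in the background immediately.
while IFS= read -r line; do
    wget "https://www.uniprot.org/uniprot/${line}.txt" &
done < "$1"
wait   # block until all background downloads have finished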
When you were doing this by hand, opening multiple terminals made sense. If you want to script this, you can run multiple processes from one terminal/script. You could use xargs to start multiple processes simultaneously:
xargs -a list.txt -n 1 -P 8 -I # bash -c "wget https://www.uniprot.org/uniprot/#.txt"
Where:
-a list.txt tells xargs to use the list.txt file as input.
-n 1 tells xargs to use a maximum of one argument (from the input) for each command it runs.
-P 8 tells xargs to run 8 commands at a time; you can change this to suit your system/requirements.
-I # tells xargs to use "#" to represent the input (i.e. the line from your file).
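If starting one wget per document turns out to be slow, a hedged variant hands each worker a batch of IDs and loops over them in a small shell script (the batch size of 100 is an arbitrary choice):
xargs -a list.txt -n 100 -P 8 sh -c \
  'for id in "$@"; do wget "https://www.uniprot.org/uniprot/${id}.txt"; done' _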

Run jobs in sequence rather than consecutively using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
Where n is the number of processors used, m is the memory requested, and t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do g09sub $i -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com, 2.com, 3.com, 4.com, 5.com, 6.com, 7.com, 8.com, 9.com, and 10.com all start at the same time and then as each of those finishes have another .com file start. So that no more than 10 jobs from any one folder will be running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.
Try xargs or GNU parallel
xargs
ls *.com | xargs -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tells xargs that {} will represent the input file name
-P 10 sets the maximum number of jobs running at once (note that -P must be given to xargs, not to g09sub)
parallel
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} represents the input file name
--jobs 10 sets the maximum number of jobs running at once
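If any of the .com file names contain spaces or other special characters, parsing ls output breaks; a null-delimited sketch (GNU xargs assumed) avoids that:
printf '%s\0' *.com | xargs -0 -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00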
Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command
is supplied as an argument), in blocks of ten shell jobs at a time.
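For the g09sub case, such a command file can be generated on the fly; a sketch, assuming GNU parallel is available on the cluster:
for f in *.com; do
    printf 'g09sub %q -n 2 -m 4gb -t 200:00:00\n' "$f"
done | parallel -j 10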
When that option was not available to me, using the jobs builtin worked, rather crudely, e.g.:
for entry in *.com; do
    while [ $(jobs | wc -l) -gt 9 ]; do
        sleep 1 # this is in seconds; your sleep may support 'arbitrary floating point number'
    done
    g09sub ${entry} -n 2 -m 4gb -t 200:00:00 &
done
$(jobs | wc -l) counts the number of jobs spawned in the background by ${cmd} &
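On bash 4.3 or newer, wait -n can replace the sleep polling; a sketch of that variant:
for entry in *.com; do
    while (( $(jobs -rp | wc -l) >= 10 )); do
        wait -n    # bash 4.3+: block until any one background job exits
    done
    g09sub "${entry}" -n 2 -m 4gb -t 200:00:00 &
done
wait    # wait for the remaining jobs to finish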

Maintaining a set number of concurrent jobs w/ args from a file in bash

I found this script on the net. I don't really know how to work in bash, it's too weird for me, but..
Here's my script:
CONTOR=0
for i in `cat targets`
do
    CONTOR=`ps aux | grep -c php`
    while [ $CONTOR -ge 250 ]; do
        CONTOR=`ps aux | grep -c php`
        sleep 0.1
    done
    if [ $CONTOR -le 250 ]; then
        php b $i > /dev/null &
    fi
done
My targets are URLs, and the b PHP file is a crawler which saves some links into a file. The problem is that the maximum number of threads reached is only 50-60, and that's because the crawler finishes very fast and the bash script doesn't have time to open all 250 of my threads. Is there any way to get all 250 threads open? Is it possible to start more than one thread per ps aux check? Right now it seems to open 1 thread after each ps aux.
First: Bash has no multithreading support whatsoever. foo & starts a separate process, not a thread.
Second: launching ps to check for children is both prone to false positives (treating unrelated invocations of php as if they were jobs in the current process) and extremely inefficient if done in a loop (since every invocation involves a fork()/exec()/wait() cycle).
Thus, don't do it that way: Use a release of GNU xargs with -P, or (if you must) GNU parallel.
Assuming your targets file is newline-delimited, and has no special quoting or characters, this could be as simple as:
xargs -d $'\n' -n 1 -P 250 php b <targets
...or, for pure POSIX shells:
xargs -d "
" -n 1 -P 250 php b <targets
With GNU Parallel it looks like this (choose the style you like best):
cat targets | parallel -P 250 php b
parallel -a targets -P 250 php b
parallel -P 250 php b :::: targets
There is no risk of false positives if there are other php processes running. And unlike xargs there is no risk if the file targets contain space, " or '.

Bash - starting and killing processes

I need some advice on a "simple" bash script.
I want to start around 500 instances of a program "myprog", and kill all of them after x number of seconds
In short, I have a loop that starts the program in background, and after sleep x (number of seconds) pkill is called with the program name.
My questions are:
How can I verify that after 10 seconds all 500 instances are running? A ps and grep combination with counting, or is there another way?
How can I get a count of how many processes pkill (or a similar kill function) actually killed (so I can tell whether any processes terminated before the actual time limit)?
How can I redirect the output of pkill (or similar kill functions) so that it doesn't print the killed process information, avoiding 500 lines of ./initializeTest: line 250: 7566 Terminated ./$myprog? Redirecting to /dev/null didn't do the trick.
In bash there is the ulimit command that controls the resources of a (sub)shell.
This, for example, is guaranteed to use at most 10 seconds of cpu time and then die:
(ulimit -t 10; ./do_something)
That doesn't answer your question but hopefully it is helpful.
1, 2: Use pgrep. I don't remember off the top of my head whether pgrep has a -c parameter, so you might need to pipe its output to wc -l.
3: That output is produced by your shell's job control. I think if you run that as a script (not in an interactive shell), there shouldn't be such output. For an interactive shell, there are a number of ways to turn it off, but they are shell-dependent, so refer to your shell's manual.
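For example (sketches; myprog stands in for your program's name):
pgrep -c myprog       # -c prints the count directly where supported
pgrep myprog | wc -l  # fallback: count the matching PIDs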
Well, my 2 cents:
ps and grep can do the job. I found that kill -0 $pid is better, by the way :) (it tells you whether a process is running or not).
You can use ps/grep or kill -0. For your problem, I would start all processes in the background, get their PIDs with $!, store them in an array or a list, then use kill -0 to get the status of all the processes.
Use &> or 2>&1, as that message is probably written to stderr.
my2c
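A rough sketch of the $!/array idea (assuming the program is ./myprog and the time limit is 10 seconds; adjust to taste):
pids=()
for ((i = 0; i < 500; i++)); do
    ./myprog &           # start one instance in the background
    pids+=("$!")         # remember its PID
done

sleep 10
alive=0
for pid in "${pids[@]}"; do
    kill -0 "$pid" 2>/dev/null && alive=$((alive + 1))   # kill -0 only tests; no signal is delivered
done
echo "$alive of ${#pids[@]} instances still running"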
To make sure that each process gets its fair share of 10 seconds before it is killed, I would wrap each command within a subshell with its own sleep && kill.
function run_with_tmout {
    CMD=$1; TMOUT=$2
    $CMD &
    PID=$!
    sleep $TMOUT
    kill $PID
}

for ((i=0; i < 500; i++)); do
    run_with_tmout ./myprog 10 &
done

# wait for all child processes to end
wait && echo "all done"
For a more complete example, see this example from bashcookbook.com which first checks if the process is still running, then tries kill -s SIGTERM before resorting to SIGKILL.
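The shape of that pattern, sketched from memory rather than copied from the cookbook:
kill -s TERM "$PID" 2>/dev/null        # ask politely first
sleep 2
if kill -0 "$PID" 2>/dev/null; then    # still alive after the grace period?
    kill -s KILL "$PID"                # force it
fi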
I have been using something like the following to get a list of pids.
PS=$(ps ax | sed s/^' '*// | grep java | grep program_name | cut -d' ' -f1)
Then I use kill $PS to stop them.
#!/bin/bash
PS=$(ps ax | sed s/^' '*// | grep java | grep program_name | cut -d' ' -f1)
kill $PS
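The same PID list can usually be produced more directly with pgrep matching the full command line; a sketch (program_name is a placeholder pattern):
PS=$(pgrep -f 'java.*program_name')   # -f matches against the whole command line
kill $PS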
