Run jobs in sequence rather than all at once using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
where -n is the number of processors used, -m is the memory requested, and -t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do
    g09sub "$i" -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com through 10.com all start at the same time and then, as each of those finishes, have another .com file start, so that no more than 10 jobs from any one folder are running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.

Try xargs or GNU parallel
xargs
ls *.com | xargs -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tells xargs that {} stands for the input file name
-P 10 runs at most 10 jobs at once
parallel
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} stands for the input file name
--jobs 10 (equivalent to -P 10) runs at most 10 jobs at once
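If any of the .com names might contain spaces, a null-delimited variant avoids parsing ls entirely (a minimal sketch, assuming GNU xargs and GNU parallel; the g09sub options are copied from the question):
# Null-delimited file names, so spaces and other odd characters are safe
printf '%s\0' *.com | xargs -0 -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
# The same idea with GNU parallel
printf '%s\0' *.com | parallel -0 -j 10 g09sub {} -n 2 -m 4gb -t 200:00:00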

Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command is supplied as an argument), in blocks of ten shell jobs at a time.
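Applied to the question, that pattern might look like the sketch below (the g09sub options are taken from the question; %q quotes each file name for the shell):
# Build one g09sub command line per .com file, then let parallel run 10 at a time
for f in *.com; do
    printf 'g09sub %q -n 2 -m 4gb -t 200:00:00\n' "$f"
done | parallel -j 10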
Where that option was not available to me, using the jobs builtin worked, rather crudely, e.g.:
for entry in *.com; do
    while [ "$(jobs -r | wc -l)" -gt 9 ]; do
        sleep 1   # seconds; your sleep may accept arbitrary floating point values
    done
    g09sub "${entry}" -n 2 -m 4gb -t 200:00:00 &
done
$(jobs -r | wc -l) counts the jobs the loop has spawned into the background with &; the -r flag restricts the count to jobs that are still running.
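On bash 4.3 or newer, wait -n gives a slightly cleaner version of the same idea (a sketch along the same lines):
for entry in *.com; do
    # Once 10 submissions are in flight, wait for any one of them to finish
    while [ "$(jobs -r | wc -l)" -ge 10 ]; do
        wait -n
    done
    g09sub "${entry}" -n 2 -m 4gb -t 200:00:00 &
done
wait   # wait for whatever is still running at the end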

Related

Making a Bash script that can open multiple terminals and run wget in each

I have to download bulks of over 100,000 docs from a databank using this script:
#!/usr/bin/bash
IFS=$'\n'
set -f
for line in $(cat < "$1")
do
    wget "https://www.uniprot.org/uniprot/${line}.txt"
done
The first time, it took over a week to download all the files (all under 8 KB), so I tried opening multiple terminals and running a split of total.txt (10 equal splits of 10,000 files in 10 terminals), and in just 14 hours I had all the documents downloaded. Is there a way to make a script do that for me?
This is a sample of what the list looks like:
D7E6X7
A0A1L9C3F2
A3K3R8
W0K0I7
gnome-terminal -e command
or
xterm -e command
or
konsole -e command
or
terminal -e command
There is another alternative to make it fast.
Right now your downloads are sequential, i.e. the next download does not start until the current one has finished.
Search for how to make a command asynchronous / run in the background on Unix.
When you were doing this by hand, opening multiple terminals made sense. If you want to script this, you can run multiple processes from one terminal/script. You could use xargs to start multiple processes simultaneously:
xargs -a list.txt -n 1 -P 8 -I # bash -c "wget https://www.uniprot.org/uniprot/#.txt"
Where:
-a list.txt tells xargs to use the list.txt file as input.
-n 1 tells xargs to use a maximum of one argument (from the input) for each command it runs.
-P 8 tells xargs to run 8 commands at a time; you can change this to suit your system/requirements.
-I # tells xargs to use "#" to represent the input (i.e. the line from your file).
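The same download with GNU parallel, if it happens to be installed, would be roughly (a sketch; the URL and the job count are taken from the question and the xargs example above):
# One wget per accession in list.txt, with at most 8 running at a time
parallel -a list.txt -j 8 wget https://www.uniprot.org/uniprot/{}.txt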

BASH - transfer large files and process after transfer limiting the number of processes

I have several large files that I need to transfer to a local machine and process. The transfer takes about as long as the processing of the file, and I would like to start processing it immediately after it transfers. But the processing could take longer than the transfer, and I don't want the processes to keep building up, but I would like to limit it to some number, say 4.
Consider the following:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
    scp user@host:$FILE ./
    myCommand $FILE &
done
This will transfer each file and start processing it after the transfer while allowing the next file to start transferring. However, if myCommand $FILE takes much longer than the time to transfer one file, these could keep piling up and bogging down the local machine. So I would like to limit myCommand to maybe 2-4 parallel instances. Subsequent attempts to invoke myCommand should wait until a "slot" opens up. Is there a good way to do this in BASH (using xargs or other utilities is acceptable)?
UPDATE:
Thanks for the help in getting this far. Now I'm trying to implement the following logic:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
echo "Starting on $FILE" # should go to terminal output
scp user#host:$FILE ./
echo "Processing $FILE" # should go to terminal output
echo $FILE # should go through pipe to parallel
done | parallel myCommand
You can use GNU Parallel for that. Just echo the commands you want run into parallel and it will run one job per CPU core your machine has.
for f in ... ; do
    scp ...
    echo ./process "$f"
done | parallel
If you specifically want 4 processes at a time, use parallel -j 4.
If you want a progress bar, use parallel --bar.
Alternatively, echo just the filename with null-termination, and add the processing command into the invocation of parallel:
for f in ... ; do
    scp ...
    printf "%s\0" "$f"
done | parallel -0 -j4 ./process
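Putting the pieces together for the UPDATE in the question (a sketch; user@host, myCommand and the file list are placeholders from the question), the progress messages can be sent to stderr so they reach the terminal instead of going down the pipe to parallel:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
    echo "Starting on $FILE" >&2       # stderr, so it shows on the terminal
    scp user@host:"$FILE" ./
    echo "Processing $FILE" >&2        # stderr, so it shows on the terminal
    printf '%s\0' "$FILE"              # only the file name goes to parallel
done | parallel -0 -j 4 myCommand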

How to issue shell commands to slave machines from master and wait until all are finished?

I have 4 shell commands I need to run and they do not depend on each other.
I have 4 slave machines. So, I want to run one of the 4 commands on each of the 4 machines, and then I want to wait until all 4 of them are finished.
How do I distribute this processing? This is what I tried:
$1 is a file containing the IP addresses of the slave machines.
for host in $(cat $1)
do
    echo $host
    # ssh into each machine and launch command
    ssh username@$host <command>;
done
But this seems as if it is waiting for the command to finish before moving on to the next host and launching the next command.
How do I accomplish this distributed processing of commands that do not depend on each other?
I would use GNU Parallel like this - running hostname in parallel on each of 4 servers:
parallel -j 4 --nonall -S 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4 hostname
If you need to pass parameters, use --onall and put arguments after :::
parallel -j 4 --onall -S 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4 echo ::: hello
Add --tag if you want the output lines tagged by the hostname/IP.
Add -k if you want to keep the output in order.
Add : to the server list to run on local host too.
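For example, combining those options (a sketch; the IP addresses are the same placeholders as above):
# Run hostname once on each server and on the local host, tag the lines and keep them in order
parallel -j 4 --nonall --tag -k -S :,192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4 hostname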
If you aren't concerned about how many commands run concurrently, just put each one in the background with &, then wait on them as a group.
while IFS= read -r host; do
    ssh username@$host <command> &
done < "$1"
wait
Note the use of a while loop instead of a for loop; see Bash FAQ 001.
The ssh part of your script needs to be like:
$ ssh -f user@host "sh -c 'sleep 30 ; nohup ls > foo 2>&1 &'"
This one sleeps for 30 seconds and writes the output of ls to the file foo; 30 seconds is enough time for you to go and check it yourself. Just build your loop around that.
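Built into the loop from the question, that could look like the following sketch (username, the host list in $1, and <command> are placeholders; output.log is just an illustrative file name):
while IFS= read -r host; do
    # -f backgrounds ssh after authentication; nohup keeps the remote
    # command running after the ssh session goes away
    ssh -f "username@$host" "sh -c 'nohup <command> > output.log 2>&1 &'"
done < "$1"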

How can you make sure that exactly n processes are running in bash?

I have a program that processes files in a really disk-heavy way. I want to call this program on many files, and experience shows that performance is best when no more than 3 processes are started at the same time (otherwise they compete too much for the disk and slow each other down). Is there an easy way to call commands from a list and start a new one only when fewer than n (3) of the processes started by the listed commands are running at the same time?
You could use xargs. From the manpage:
--max-procs=max-procs
-P max-procs
    Run up to max-procs processes at a time; the default is 1. If
    max-procs is 0, xargs will run as many processes as possible at
    a time. Use the -n option with -P; otherwise chances are that
    only one exec will be done.
For example, assuming your commands are one per line:
printf 'sleep %dm\n' 1 2 3 4 5 6 | xargs -L1 -P3 -I {} sh -c {}
Then, in a terminal:
$ pgrep sleep -fa
11987 sleep 1m
11988 sleep 2m
11989 sleep 3m
$ # a little while later
$ pgrep sleep -fa
11988 sleep 2m
11989 sleep 3m
12045 sleep 4m
The -L1 option uses one line at a time as the argument, and -I {} indicates that {} will be replaced with that line. To actually run the command, we pass it to sh as an argument to -c.
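If the commands live in a file instead, one per line, the same pattern can read them from there (a sketch; commands.txt is an illustrative name, and the job count is set to 3 as in the question):
# Run the commands from commands.txt, at most 3 at a time
xargs -L1 -P 3 -I {} sh -c {} < commands.txt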

Maintaining a set number of concurrent jobs w/ args from a file in bash

I found this script on the net. I don't know much about working in bash and it is all a bit weird to me, but here it is.
Here's my script:
CONTOR=0
for i in `cat targets`
do
    CONTOR=`ps aux | grep -c php`
    while [ $CONTOR -ge 250 ]; do
        CONTOR=`ps aux | grep -c php`
        sleep 0.1
    done
    if [ $CONTOR -le 250 ]; then
        php b $i > /dev/null &
    fi
done
My targets are URLs, and the b php file is a crawler which saves some links into a file. The problem is that the maximum number of threads I reach is 50-60, because the crawler finishes very fast and the bash script doesn't have time to open all 250 of my threads. Is there any way to get all 250 threads open? Is it possible to start more than one thread per ps aux check? Right now it seems to open one thread after each ps aux execution.
First: Bash has no multithreading support whatsoever. foo & starts a separate process, not a thread.
Second: launching ps to check for children is both prone to false positives (treating unrelated invocations of php as if they were jobs in the current process) and extremely inefficient if done in a loop (since every invocation involves a fork()/exec()/wait() cycle).
Thus, don't do it that way: Use a release of GNU xargs with -P, or (if you must) GNU parallel.
Assuming your targets file is newline-delimited, and has no special quoting or characters, this could be as simple as:
xargs -d $'\n' -n 1 -P 250 php b <targets
...or, for pure POSIX shells:
xargs -d "
" -n 1 -P 250 php b <targets
With GNU Parallel it looks like this (choose the style you like best):
cat targets | parallel -P 250 php b
parallel -a targets -P 250 php b
parallel -P 250 php b :::: targets
There is no risk of false positives if there are other php processes running. And unlike xargs, there is no risk if the file targets contains spaces, " or '.
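Since crawler runs like this can take a long time, GNU parallel's --joblog (and --resume on a re-run) may also be worth adding so an interrupted run can be picked up where it left off (a sketch, assuming the same targets file):
# Record each job in crawl.log; with --resume, a re-run skips targets already completed
parallel -P 250 --joblog crawl.log --resume -a targets php b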
