BASH - transfer large files and process after transfer, limiting the number of processes

I have several large files that I need to transfer to a local machine and process. The transfer takes about as long as the processing of each file, and I would like to start processing a file immediately after it transfers. But the processing could take longer than the transfer, so I don't want the processing jobs to keep building up; I would like to limit them to some number, say 4.
Consider the following:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
    scp user@host:$FILE ./
    myCommand $FILE &
done
This will transfer each file and start processing it after the transfer while the next file starts transferring. However, if myCommand $FILE takes much longer than the transfer of one file, the processing jobs could keep piling up and bog down the local machine. So I would like to limit myCommand to maybe 2-4 parallel instances; further invocations of myCommand should wait until a "slot" is open. Is there a good way to do this in bash (using xargs or other utilities is acceptable)?
UPDATE:
Thanks for the help in getting this far. Now I'm trying to implement the following logic:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
    echo "Starting on $FILE"  # should go to terminal output
    scp user@host:$FILE ./
    echo "Processing $FILE"   # should go to terminal output
    echo $FILE                # should go through the pipe to parallel
done | parallel myCommand

You can use GNU Parallel for that. Just echo the commands you want run into parallel and it will run one job per CPU core your machine has.
for f in ... ; do
    scp ...
    echo ./process "$f"
done | parallel
If you specifically want 4 processes at a time, use parallel -j 4.
If you want a progress bar, use parallel --bar.
Alternatively, print just the filename, null-terminated, and add the processing command to the invocation of parallel:
for f in ... ; do
    scp ...
    printf "%s\0" "$f"
done | parallel -0 -j4 ./process
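For the loop in the update above, one way to keep the status messages on the terminal (a sketch, assuming those messages are only meant for a human watching the transfer) is to send them to stderr with >&2, so that only the filenames travel down the pipe to parallel:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
    echo "Starting on $FILE" >&2    # stderr: shows on the terminal, not in the pipe
    scp user@host:$FILE ./
    echo "Processing $FILE" >&2     # stderr as well
    printf "%s\0" "$FILE"           # only the filename goes down the pipe
done | parallel -0 -j4 myCommand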

Related

How to run an Inotify shell script as an asynchronous process

I have an inotify shell script which monitors a directory and executes certain commands when a new file comes in. I need to turn this inotify script into a parallelized process, so that the script does not wait for one file's processing to complete when multiple files come into the directory.
I have tried using nohup, & and xargs to achieve this. The problem was that xargs runs the same script as a number of processes, so whenever a new file comes in, all n running processes try to process it. Essentially I only want one of the processes, whichever is idle, to handle the new file. Something like a worker pool, where whichever worker is free or idle picks up the task.
This is my shell script.
#!/bin/bash
# script.sh
inotifywait --monitor -r -e close_write --format '%w%f' ./ | while read FILE
do
echo "started script";
sleep $(( $RANDOM % 10 ))s;
#some more process which takes time when a new file comes in
done
I tried to execute the script with xargs like this:
xargs -n1 -P3 bash sample.sh
So whenever a new file comes in, it gets processed three times because of -P3, but ideally I want whichever process is idle to pick up the task.
Please shed some light on how to approach this problem?
There is no reason to have a pool of idle processes. Just run one per new file when you see new files appear.
#!/bin/bash
inotifywait --monitor -r -e close_write --format '%w%f' ./ |
while read -r file
do
echo "started script";
( sleep $(( $RANDOM % 10 ))s
#some more process which takes time when a new "$file" comes in
) &
done
Notice the addition of & and the parentheses to group the sleep and the subsequent processing into a single subshell which we can then background.
Also, notice how we always prefer read -r and lower-case variable names (see Correct Bash and shell script variable capitalization).
Maybe this will work:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-dir-processor
If you have a dir in which users drop files that needs to be processed you can do this on GNU/Linux (If you know what inotifywait is called on other platforms file a bug report):
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |
parallel -u echo
This will run the command echo on each file put into my_dir or subdirs of my_dir.
To run at most 5 processes use -j5.
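Adapting that to the script in this question might look like the following sketch, where process_one.sh is a hypothetical script holding the per-file work from the while loop:
# -u prints output as soon as it is available; -j5 caps it at 5 parallel jobs
# process_one.sh (hypothetical) does what the body of the while loop did for one file
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f ./ |
parallel -u -j5 bash process_one.sh {}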

Run jobs in sequence rather than consecutively using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
Where n is the number of processors used, m is the memory requested, and t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do
    g09sub $i -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com, 2.com, 3.com, 4.com, 5.com, 6.com, 7.com, 8.com, 9.com, and 10.com all start at the same time and then as each of those finishes have another .com file start. So that no more than 10 jobs from any one folder will be running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.
Try xargs or GNU parallel
xargs
ls *.com | xargs -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tells xargs that {} will represent the input file name
-P 10 sets the maximum number of jobs run at once
parallel
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} represents the input file name
--jobs 10 sets the maximum number of jobs run at once
Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command
is supplied as an argument), in blocks of ten shell jobs at a time.
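Applied to the .com files in this question, that could look like the sketch below (it assumes the file names contain no spaces, since each echoed line becomes one shell command):
# build one g09sub command line per input file and hand the list to parallel
for i in *.com; do
    echo "g09sub $i -n 2 -m 4gb -t 200:00:00"
done | parallel -j 10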
Where that option was not available to me, using the jobs builtin worked, rather crudely, e.g.:
for entry in *.com; do
    while [ $(jobs | wc -l) -gt 9 ]; do
        sleep 1 # this is in seconds; your sleep may support 'arbitrary floating point number'
    done
    g09sub ${entry} -n 2 -m 4gb -t 200:00:00 &
done
$(jobs | wc -l) counts the number of jobs currently running in the background, i.e. those started with &
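A somewhat cleaner variant of the same idea, sketched here assuming bash 4.3 or newer (which added wait -n), blocks until any one background job finishes instead of polling with sleep:
for entry in *.com; do
    while [ "$(jobs -rp | wc -l)" -ge 10 ]; do
        wait -n   # block until any one background job finishes
    done
    g09sub ${entry} -n 2 -m 4gb -t 200:00:00 &
done
wait   # let the last batch finish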

Parallel shell execution and parameters

I'm monitoring a folder which is receiving log files.
For each log file received, I need to send it to a remote server via SCP. SCP transfer is done via transfer.sh script.
Since I need to perform a transfer for each file, it's probable that a single file may delay other new files. I would like to "create" a new parallel process for each file in my directory.
MONITOR_FOLDER='/repository/'
PATTERN='log_*'
for log_file in $MONITOR_FOLDER$PATTERN
do
echo "$(date +%c) monitor() Processing $log_file CDR file..."
parallel --will-cite -n0 "sh transfer.sh $log_file 1" ::: {1..1}
done
The glob $MONITOR_FOLDER$PATTERN can match 0 or more files.
When there is more than 1 file, I want to create a parallel process per file.
The following command display the correct list.
ls $MONITOR_FOLDER | grep 'log_*'
Question:
1) For each entry, use it as a parameter for my shell script and at the same time create a new process, without using the loop.
I'm monitoring a folder which is receiving log files.
For each log file received, I need to send it to a remote server via SCP. SCP transfer is done via transfer.sh script.
That part is easy:
MONITOR_FOLDER='/repository/'
PATTERN='log_*'
parallel -j0 'echo "$(date +%c) monitor() Processing {} CDR file..."; sh transfer.sh {} 1' ::: $MONITOR_FOLDER$PATTERN
Or:
ls $MONITOR_FOLDER | grep 'log_*' | parallel -j0 'echo "$(date +%c) monitor() Processing {} CDR file..."; sh transfer.sh {} 1'
Since I need to perform a transfer for each file, it's probable that a single file may delay other new files. I would like to "create" a new parallel process for each file in my directory.
This is also easy if you allow for a file to be copied more than once and to have as many scp's running as there are files. Simply add & to the command:
MONITOR_FOLDER='/repository/'
PATTERN='log_*'
for log_file in $MONITOR_FOLDER$PATTERN
do
echo "$(date +%c) monitor() Processing $log_file CDR file..."
sh transfer.sh $log_file 1 &
done
Now it gets more tricky if:
You at most want 12 scp's running at the same time
You only want to copy a file once
But you can probably use this: http://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-dir-processor
inotifywait -q -m -r -e MOVED_TO -e CLOSE_WRITE --format %w%f $MONITOR_FOLDER |\
grep 'log_*' | parallel -j12 'echo "$(date +%c) monitor() Processing {} CDR file..."; sh transfer.sh {} 1'
It will just sit there waiting for a new file to be written. So if you want to stop it, you will have to kill it.
I think the problem is in your code:
for log_file in $MONITOR_FOLDER$PATTERN
Please go through how a for loop walks over its word list in bash.
For example:
for i in 1 2 3 4 5 # it will iterate from 1 to 5
but
for i in $VAR # it will iterate over the words of `echo $VAR`, i.e. its value
Thus in your case, when nothing matches the pattern, the variable log_file gets the literal value /repository/log_* rather than an actual file name.
To make your code work you may do something like:
for log_file in `ls $MONITOR_FOLDER$PATTERN`

Maintaining a set number of concurrent jobs w/ args from a file in bash

I found this script on the net. I don't really know how to work in bash and it is quite weird to me, but...
Here's my script:
CONTOR=0
for i in `cat targets`
do
    CONTOR=`ps aux | grep -c php`
    while [ $CONTOR -ge 250 ]; do
        CONTOR=`ps aux | grep -c php`
        sleep 0.1
    done
    if [ $CONTOR -le 250 ]; then
        php b $i > /dev/null &
    fi
done
My targets are URLs, and the b PHP file is a crawler which saves some links into a file. The problem is that the maximum number of threads only reaches 50-60, and that's because the crawler finishes very fast and the bash script doesn't have time to open all 250 of my threads. Is there any way to open all 250 threads? Is it possible to run more than one thread per ps aux check? Right now it seems to open 1 thread after each execution of ps aux.
First: Bash has no multithreading support whatsoever. foo & starts a separate process, not a thread.
Second: launching ps to check for children is both prone to false positives (treating unrelated invocations of php as if they were jobs in the current process) and extremely inefficient if done in a loop (since every invocation involves a fork()/exec()/wait() cycle).
Thus, don't do it that way: Use a release of GNU xargs with -P, or (if you must) GNU parallel.
Assuming your targets file is newline-delimited, and has no special quoting or characters, this could be as simple as:
xargs -d $'\n' -n 1 -P 250 php b <targets
...or, for pure POSIX shells:
xargs -d "
" -n 1 -P 250 php b <targets
With GNU Parallel it looks like this (choose the style you like best):
cat targets | parallel -P 250 php b
parallel -a targets -P 250 php b
parallel -P 250 php b :::: targets
There is no risk of false positives if there are other php processes running. And unlike xargs, there is no risk if the file targets contains spaces, " or '.

GNU parallel processing

I have the following script that I want to run using GNU parallel, it is a for loop that needs to be run n times. How can I do this using GNU parallel?
SHARK=tshark
# Create file list
FILELIST=`ls $1`
TEMPDIR=/tmp/foobar
mkdir $TEMPDIR
i=1
for I in $FILELIST; do
echo "$i $I $2"
$SHARK -r $I -w $TEMPDIR/~$I-$i -R "$2" &>/dev/null
i=`echo $i+1|bc`
done
There are a number of ways of doing this, either with sub-shells and sub-processes, see e.g.
Running shell script in parallel
or by installing neat utilities designed to do this, e.g:
|P|P|S|S| - (Distributed) Parallel Processing Shell Script
GNU Parallel
I would try to get it done first with sub-shells, and then try the others if you still need more power.
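For the GNU Parallel route specifically, a rough sketch of the loop above might look like the following, where {} stands for each capture file and {#} is parallel's job sequence number standing in for the hand-rolled counter $i (adjust -j4 to taste; $1 and $2 are assumed to be the same positional parameters as in the original script):
SHARK=tshark
TEMPDIR=/tmp/foobar
mkdir -p $TEMPDIR
# {} = input file, {#} = job number; tshark output is discarded as in the original loop
ls $1 | parallel -j4 "echo {#} {} $2; $SHARK -r {} -w $TEMPDIR/~{}-{#} -R '$2' > /dev/null 2>&1"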
