I have multiple files, $tmpdir/$i.dirlist, each containing entries to be run through rsync.
Each file has (depending on the amount of data) 10, sometimes 50, and even 150 rsync entries.
I'm now wondering how to manage this with a for or while loop plus an if: from each file ($tmpdir/$i.dirlist, say we have 100 files) run only 2 entries at a time, wait for some processes to complete, and whenever the total number of running rsync processes is below 200, launch new entries, so that a fixed number of processes, defined by a parameter, is maintained. In this case 200.
Any idea how to do it?
Edit: about the rsync entries.
Each file $tmpdir/*.dirlist contains (in this example 200) entries with directory paths like:
==> /tmp/rsync.23611/0.dirlist <==
system/root/etc/ssl
system/root/etc/dbus-1
system/root/etc/lirc
system/root/etc/sysctl.d
==> /tmp/rsync.23611/1.dirlist <==
system/root/etc/binfmt.d
system/root/etc/cit
system/root/etc/gdb
==> /tmp/rsync.23611/2.dirlist <==
system/root/usr/local
system/root/usr/bin
system/root/usr/lib
Now, to run it, I simply use a for loop:
for i in $(seq 1 $rsyncs); do
    while read r; do
        rsync $rsyncopts backup@$host:$remotepath/$ri $r 2>&1 |
            tee $tmpdir/$i.dirlist.log
    done < $tmpdir/$i.dirlist &
done
And here is an example of the kind of throttling I mean:
for ARG in $*; do
    command $ARG &
    NPROC=$(($NPROC+1))
    if [ "$NPROC" -ge 4 ]; then
        wait
        NPROC=0
    fi
done
Assuming the maximum value of $i is 100, your code above still stays below the maximum of 200 processes you want to allow.
So a solution would be to run twice as many processes. I suggest you divide your main loop for i in $(seq 1 $rsyncs); do ... into two loops running concurrently, introduced respectively by for i in $(seq 1 2 $rsyncs); do ... for the odd values of $i, and for i in $(seq 2 2 $rsyncs); do ... for the even values of $i.
for i in $(seq 1 2 $rsyncs); do    # i = 1 3 5 ...
    while read r; do
        rsync $rsyncopts backup@$host:$remotepath/$ri $r 2>&1 |
            tee $tmpdir/$i.dirlist.log
    done < $tmpdir/$i.dirlist &
done &                             # added an ampersand here
for i in $(seq 2 2 $rsyncs); do    # i = 2 4 6 ...
    while read r; do
        rsync $rsyncopts backup@$host:$remotepath/$ri $r 2>&1 |
            tee $tmpdir/$i.dirlist.log
    done < $tmpdir/$i.dirlist &
done
Edit: Since my approach above doesn't convince you, let us try something completely different. First, create a list of all the processes you want to run and store these in an array:
processes=()    # create an empty bash array
for i in $(seq 1 $rsyncs); do
    while read r; do
        # add the full rsync command line to the array
        processes+=("rsync $rsyncopts backup@$host:$remotepath/$ri $r 2>&1 | tee $tmpdir/$i.dirlist.log")
    done < $tmpdir/$i.dirlist
done
Once you have that array, launch, say, 200 processes, and then enter a loop that waits for a process to finish and launches the next one:
for ((j=0; j<200; j++)); do
    eval "${processes[j]}" &    # launch processes in background (eval, because the stored command contains a pipe)
done
while [ -n "${processes[j]}" ]; do
    wait -n                     # wait for any one process to finish (bash >= 4.3)
    eval "${processes[j++]}" &  # launch one more process
done
Please try this and tell us.
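For reference, here is one more sketch of the fixed-pool idea that doesn't need an array at all: throttle on the number of running jobs and let wait -n (bash 4.3 or newer) block until any one of them finishes. The rsync line is copied from your question, so adjust the source/destination to your real command; note this keeps a steady cap of 200 rsyncs overall rather than exactly 2 per file:
maxjobs=200
for i in $(seq 1 $rsyncs); do
    while read -r r; do
        # if $maxjobs rsyncs are already running, block until one of them finishes
        while [ "$(jobs -rp | wc -l)" -ge "$maxjobs" ]; do
            wait -n
        done
        rsync $rsyncopts backup@$host:$remotepath/$ri $r 2>&1 |
            tee -a $tmpdir/$i.dirlist.log &    # -a so entries from the same dirlist append
    done < $tmpdir/$i.dirlist
done
wait    # wait for whatever is still running at the end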
Related
I have a bash loop that I run to copy 2 files from the HPC to my local drive, recursively over the processors and all the timesteps. On the HPC the timesteps are saved as
1 2 3
whereas the bash loop interprets them as
1.0 2.0 3.0
probably because of the 0.5 increment. Is there a way to change $j to a whole number (without the decimal) when running the script?
Script I use:
for i in $(seq 0 1 23)
do
    mkdir Run1/processor$i
    for j in $(seq 0 0.5 10);
    do
        mkdir Run1/processor$i/$j
        scp -r xx@login.hpc.xx.xx:/scratch/Run1/processor$i/$j/p Run1/processor$i/$j/
        scp -r xx@login.hpc.xx.xx:/scratch/Run1/processor$i/$j/U Run1/processor$i/$j/
    done
done
Result:
scp: /scratch/Run1/processor0/1.0/p: No such file or directory
The correct directory that exists is
/scratch/Run1/processor0/1
Thanks!
Well, yes!
But it depends on what you want the end result to be.
I will assume you want to floor the decimal number. I can think of 2 options:
pipe the number to cut
do a little bit of perl
for i in $(seq 0 1 23); do
    for j in $(seq 0 0.5 10); do
        # pipe to cut
        echo /scratch/Run1/processor$i/$(echo $j | cut -f1 -d".")/U Run1/processor"$i/$j"/
        # pipe to perl
        echo /scratch/Run1/processor$i/$(echo $j | perl -nl -MPOSIX -e 'print floor($_);')/U Run1/processor"$i/$j"/
    done
done
result:
...
/scratch/Run1/processor23/9/U Run1/processor23/9/
/scratch/Run1/processor23/9/U Run1/processor23/9.5/
/scratch/Run1/processor23/9/U Run1/processor23/9.5/
/scratch/Run1/processor23/10/U Run1/processor23/10/
/scratch/Run1/processor23/10/U Run1/processor23/10/
Edit:
I experimented a little and found another way:
echo /scratch/Run1/processor$i/${j%%.[[:digit:]]}/U Run1/processor"$i/$j"/
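A quick check of what that expansion does to the values seq 0 0.5 10 produces:
j=1.0; echo "${j%%.[[:digit:]]}"    # -> 1   (the trailing ".0" is stripped)
j=9.5; echo "${j%%.[[:digit:]]}"    # -> 9   (note it also floors the half steps)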
I have a for loop in bash that writes values to a file. However, because there are a lot of values, the process takes a long time, which I think could be reduced by improving the code.
nk=1152
nb=24
for k in $(seq 0 $((nk-1))); do
    for i in $(seq 0 $((nb-1))); do
        for j in $(seq 0 $((nb-1))); do
            echo -e "$k\t$i\t$j"
        done
    done
done > file.dat
I've moved the output redirection to after the entire loop rather than using echo -e "$k\t$i\t$j" >> file.dat inside it, to avoid opening and closing the file many times. However, the script still writes to the file rather slowly, at ~10 kbps.
Is there a better way to improve the IO?
Many thanks
Jacek
It looks like the seq calls are fairly punishing since each one is a separate process. Try this, using just shell math instead:
for ((k=0;k<=$nk-1;k++)); do
    for ((i=0;i<=$nb-1;i++)); do
        for ((j=0;j<=$nb-1;j++)); do
            echo -e "$k\t$i\t$j"
        done
    done
done > file.dat
It takes just 7.5s on my machine.
Another way is to compute the sequences just once and use them repeatedly, saving a lot of shell calls:
nk=1152
nb=24
kseq=$(seq 0 $((nk-1)))
bseq=$(seq 0 $((nb-1)))
for k in $kseq; do
    for i in $bseq; do
        for j in $bseq; do
            echo -e "$k\t$i\t$j"
        done
    done
done > file.dat
This is not really "better" than the first option, but it shows how much of the time is spent spinning up instances of seq versus actually getting stuff done.
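If you want to see the difference on your own machine, timing the three variants is enough; the file names here are only placeholders for the snippets above:
time bash with_seq.sh       # original version, seq spawned in the inner loops
time bash with_arith.sh     # C-style for (( ... )) loops, no seq at all
time bash with_cached.sh    # seq run only twice, its output reused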
Bash isn't always the best for this. Consider this Ruby equivalent which runs in 0.5s:
#!/usr/bin/env ruby
nk=1152
nb=24
nk.times do |k|
  nb.times do |i|
    nb.times do |j|
      puts "%d\t%d\t%d" % [ k, i, j ]
    end
  end
end
What is most time consuming is calling seq in a nested loop. Keep in mind that each time you call seq, the shell loads the command from disk, forks a process to run it, captures the output, and stores the whole output sequence in memory.
Instead of calling seq you could use an arithmetic loop:
#!/usr/bin/env bash
declare -i nk=1152
declare -i nb=24
declare -i i j k
for ((k=0; k<nk; k++)); do
    for (( i=0; i<nb; i++)); do
        for (( j=0; j<nb; j++)); do
            printf '%d\t%d\t%d\n' "$k" "$i" "$j"
        done
    done
done > file.dat
Running seq in a subshell consumes most of the time.
Switch to a different language that provides all the needed features without shelling out. For example, in Perl:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $nk = 1152;
my $nb = 24;
for my $k (0 .. $nk - 1) {
    for my $i (0 .. $nb - 1) {
        for my $j (0 .. $nb - 1) {
            say "$k\t$i\t$j"
        }
    }
}
The original bash solution runs for 22 seconds, the Perl one finishes in 0.1 seconds. The output is identical.
@Jacek: I don't think the I/O is the problem, but rather the number of child processes spawned. I would store the result of seq 0 $((nb-1)) in an array and loop over the array, i.e.
nb_seq=( $(seq 0 $((nb-1))) )
...
for i in "${nb_seq[@]}"; do
    for j in "${nb_seq[@]}"; do
seq is bad :) I once made this function specially for a case like this:
$ que () { printf -v _N %$1s; _N=(${_N// / 1}); printf "${!_N[*]}"; }
$ que 10
0 1 2 3 4 5 6 7 8 9
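For anyone puzzled by the one-liner, here is the same function spread out with comments (same behaviour, just a slightly safer printf):
que () {
    printf -v _N "%$1s"         # _N becomes a string of $1 spaces
    _N=( ${_N// / 1} )          # replace each space with " 1": an array of $1 elements
    printf '%s' "${!_N[*]}"     # expand the array *indices*: 0 1 2 ... $1-1
}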
And you can try writing everything to a variable first and then the whole variable to a file at once:
store+="$k\t$i\t$j\n"
printf "$store" > file
No, it's even worse that way :)
(This question was closed as a duplicate of: Parallelize Bash script with maximum number of processes.)
I have the following bash script, which starts a program several times in parallel and passes a control variable to each execution.
The program utilizes several resources, so after it has been started 10 times in parallel I want to wait until those last 10 are finished.
I am currently doing this very roughly, by simply sleeping after every 10 iterations for the longest time that 10 parallel runs could possibly take.
Is there a straightforward way to implement this behavior?
steps=$((500/20))
echo $steps
START=0
for((i=START;i < steps; i++))
do
    for((j=START;j < steps;j++))
    do
        for((k=START;k < steps;k++))
        do
            n=$(($j*steps +$k))
            idx=$(($i*$steps*$steps + $n))
            if ! ((idx % 10)); then
                echo "waiting for the last 10 programs"
                sleep 10
            else
                ./someprogram $idx &
            fi
        done
    done
done
Well, since you already have code in place to check every 10th iteration (idx % 10), the wait builtin seems perfect. From the docs:
wait: wait [-n] [id ...]
[...] Waits for each process identified by an ID, which may be a process ID or a
job specification, and reports its termination status. If ID is not
given, waits for all currently active child processes, and the return
status is zero.
So, by calling wait each time idx % 10 == 0, you are actually waiting for all previously started child processes to finish. And if you are not spawning anything other than someprogram, you'll be waiting for the last (up to 10) of those to finish.
Your script with wait:
#!/bin/bash
steps=$((500/20))
START=0
for ((i=START; i<steps; i++)); do
    for ((j=START; j<steps; j++)); do
        for ((k=START; k<steps; k++)); do
            idx=$((i*steps*steps + j*steps + k))
            if ! ((idx % 10)); then
                wait
            else
                ./someprogram $idx &
            fi
        done
    done
done
Also, notice you don't have to use $var (dollar prefix) inside arithmetic expansion $((var+1)).
I'm assuming your actual script does some additional processing before calling someprogram, but if all you need is to call someprogram on consecutive indexes, 10 instances at a time, you might consider using xargs or GNU parallel.
For example, with xargs:
seq 0 1000 | xargs -n1 -P10 ./someprogram
or, with additional arguments to someprogram:
seq 0 1000 | xargs -n1 -P10 -I{} ./someprogram --option --index='{}' someparam
With GNU parallel:
seq 0 1000 | parallel -P10 ./someprogram '{}'
seq 0 1000 | parallel -P10 ./someprogram --option --index='{}' someparam
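Tying that back to your loop: the three nested loops just enumerate idx = 0 .. steps^3 - 1, so if computing idx and running someprogram really is all that happens per iteration, the whole script collapses to something like:
steps=$((500/20))
seq 0 $((steps*steps*steps - 1)) | xargs -n1 -P10 ./someprogram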
I have a loop in a bash script. It runs a programme that by default outputs a text file when it works, and no file if it doesn't. I'm running it a large number of times (> 500K), so I want to merge the output files row by row. If one iteration of the loop creates a file, I want to take the LAST line of that file, append it to a master output file, and then delete the original so I don't end up with thousands of files in one directory. The loop I have so far is:
oFile=/path/output/outputFile_
oFinal=/path/output.final
for counter in {101..200}
do
    $programme $counter -out $oFile$counter
    if [ -s $oFile$counter ]    ## This returns TRUE if file isn't empty, right?
    then
        out=$(tail -1 $oFile$counter)
        final=$out$oFile$counter
        $final >> $oFinal
    fi
done
However, it doesn't work properly, as it seems to not return all the files I want. So is the conditional wrong?
You can be clever and pass the programme a process substitution instead of a "real" file:
oFinal=/path/output.final
for counter in {101..200}
do
$programme $counter -out >(tail -n 1)
done > $oFinal
$programme will treat the process substitution as a file, and all the lines written to it will be processed by tail
Testing: my "programme" outputs 2 lines if the given counter is even
$ cat programme
#!/bin/bash
if (( $1 % 2 == 0 )); then
    {
        echo ignore this line
        echo $1
    } > $2
fi
$ ./programme 101 /dev/stdout
$ ./programme 102 /dev/stdout
ignore this line
102
So, this loop should output only the even numbers between 101 and 200
$ for counter in {101..200}; do ./programme $counter >(tail -1); done
102
104
[... snipped ...]
198
200
Success.
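If you would rather keep your original structure (write a file, append its last line, delete it), here is a sketch of the minimal fix. The main bug in your snippet is that $final >> $oFinal tries to execute the contents of $final as a command instead of writing it:
oFile=/path/output/outputFile_
oFinal=/path/output.final
for counter in {101..200}
do
    $programme $counter -out "$oFile$counter"
    if [ -s "$oFile$counter" ]; then              # true if the file exists and is not empty
        tail -n 1 "$oFile$counter" >> "$oFinal"   # append only the last line
        rm "$oFile$counter"                       # drop the per-iteration file
    fi
done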
Consider the following script:
#!/bin/bash
function long_running {
    for i in $(seq 1 10); do
        echo foo
        sleep 100
    done
}
long_running | head -n 1
This produces the expected output (one line "foo") but sleeps (for the specified 100 seconds) before terminating. I would like the script to terminate immediately when head does. How can I force bash to actually quit immediately? Even changing the last line to
long_running | (head -n 1; exit)
or similar doesn't work; I can't get set -e, another common suggestion, to work even if I force a failure with, say, (head -n 1; false) or the like.
(This is a simplified version of my real code, which obviously doesn't sleep; it creates a fairly complex set of nested pipelines searching for various solutions to a constraint problem. As I only need one solution and don't care which I get, I'd like to be able to make the script terminate by adding head -n 1 to the invocation...)
How about sending the function to head like this -
#!/bin/bash
function long_running {
    for i in $(seq 1 10); do
        echo foo
        sleep 100
    done
}
head -n 1 <(long_running)
Obviously, if you increase -n to a greater number, the sleep will kick in, but the script will still exit once head completes.
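Putting it together, a sketch of the whole modified script; the point is that bash does not wait for the process substitution, so the script is free to continue (and exit) as soon as head is done:
#!/bin/bash
function long_running {
    for i in $(seq 1 10); do
        echo foo
        sleep 100
    done
}

head -n 1 <(long_running)    # head exits after the first line ...
echo "done"                  # ... and this runs immediately; the substituted
                             # process is left behind rather than waited for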