I have done the calculations of a certain variable through multiple processors in cluster, now I want to get all the result from them to a single processor. Which technique I have to use for that.
It is a bit unclear what you are asking, but GNU Parallel may be what you are looking for:
parallel --slf cluster-hosts.txt do_my_procession --my_arg {} ::: {1..1000}
Example:
parallel -S localhost echo {} ::: {1..1000}
See more on: https://www.gnu.org/software/parallel/parallel_tutorial.html#Remote-execution
Related
I have a script pytonscript.py that I want to run on 500 samples. I have 50 CPUs available and want to run the script in parallel using 1 CPU for each sample, so that 50 samples are constantly running with 1 CPU each. Any ideas how to set this up without typing 500 lines with the different inputs? I know how to make a loop for each sample, but not how to make 50 samples running in parallel. I guess GNU parallel is a way?
Input samples in folder samples:
sample1
sample2
sample2
...
sample500
pytonscript.py -i samples/sample1.sam.bz2 -o output_folder
What about GNU xargs?
printf '%s\0' samples/sample*.sam.bz |
xargs -0L1 -P 50 pytonscript.py -o output_dir -i
This launches a new instance of the python script for each file, concurrently, maintaining a maximum of 50 at once.
If the wildcard glob expansion isn't specific enough, you can use bash's extglob: shopt -s exglob; # samples/sample+([0-9]).sam.bz
Is there a quick, easy, and efficient way of running iterations in this for loop in parallel?
for i in `seq 1 5000`; do
repid="$(printf "%05d" "$i")"
inp="${repid}.inp"
out="${repid}.out"
/command "$inp" "$out"
done
If you want to take advantage of all your lovely CPU cores that you paid Intel so handsomely for, turn to GNU Parallel:
seq -f "%05g" 5000 | parallel -k echo command {}.inp {}.out
If you like the look of that, run it again without the -k (which keeps the output in order) and without the echo. You may need to enclose the command in single quotes:
seq -f "%05g" 5000 | parallel '/command {}.inp {}.out'
It will run 1 instance per CPU core in parallel, but, if you want say 32 in parallel, use:
seq ... | parallel -j 32 ...
If you want an "estimated time of arrival", use:
parallel --eta ...
If you want a progress meter, use:
parallel --progress ...
If you have bash version 4+, it can zero-pad brace expansions. And if your ARGMAX is big enough, so you can more simply use:
parallel 'echo command {}.inp {}.out' ::: {00001..05000}
You can check your ARGMAX with:
sysctl -a kern.argmax
and it tells you how many bytes long your parameter list can be. You are going to need 5,000 numbers at 5 digits plus a space each, so 30,000 minimum.
If you are on macOS, you can install GNU Parallel with homebrew:
brew install parallel
for i in `seq 1 5000`; do
repid="$(printf "%05d" "$i")"
inp="${repid}.inp"
out="${repid}.out"
/command "$inp" "$out" &
done
Is there a more idiomatic way of doing the following:
cat some_lines.txt | while read x; do ./process_line.sh $x; done
ie. applying a script to each line of stdin?
I could include the while read x; boilerplate in the script itself, but that doesn't really feel right either.
If you're running an external process and have GNU xargs, consider:
xargs -n1 -d $'\n' ./process_line.sh <some_lines.txt
If you don't like the while read loop's verbosity, and are running a shell function (where a fork() isn't natively needed, and thus where using an external tool like xargs or GNU parallel has a substantial performance cost), you can avoid it by wrapping the loop in a function:
for_each_line() {
local line
while IFS= read -r line; do
"$#" "$line" </dev/null
done
}
...can be run as:
process_line() {
echo "Processing line: $1"
}
for_each_line process_line <some_lines.txt
GNU Parallel is made for this kind of tasks - provided there is no problem in running the processing in parallel:
cat some_lines.txt | parallel ./process_line.sh {}
By default it will run one job per cpu-core. This can be adjusted with --jobs.
There is an overhead of running it through GNU Parallel in the order of 5 ms per job. One of the benefits you get is that you are guaranteed the output from the different jobs are not jumbled together and you can therefore use use the output as if the jobs had not been run in parallel:
cat some_lines.txt | parallel ./process_line.sh {} | do_post_processing
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
I have a sh script that, if it is invoked by the command ./textgenerate text0, a file text0.txt would be generated.
Now that I need text0.txt, text1.txt, ..., text1000.txt, how could I write the script to do that. i.e., how to replace the 0 part to changing variables, say 1~1000?
I know a certain "trick" to implement this, for example generate the script 1000 times using Microsoft Excel and paste it into the shell. But is there an elegent and efficient way to do this?
for i in {0..1000}; do
./textgenerate "text${i}"
done
This loops over the range [0..1,000] and assigns each value to $i in the body of the loop. ./textgenerate will be invoked 1,001 times.
Very simply and in parallel with GNU Parallel:
parallel ./textgenerate text{}.txt ::: {0..1000}
Or, if you don't have a recent bash to expand the {0..1000}, you could equally do this:
seq 0 1000 | parallel ./textgenerate text{}.txt
And, if you want to see what it would do, without actually doing anything:
parallel --dry-run ... as above ...
And, if you want a progress bar:
parallel --bar ... as above ...
You can also let printf process the looping.
. <(printf "./textgenerate text%s\n" {0..1000})
I created a script to verify a (big) number of items and it was doing the verification in a serial way (one after the other) with the end result of the script taking about 9 hours to complete. Looking around about how to improve this, I found GNU parallel but I'm having problems making it work.
The list of items is in a text file so I was doing the following:
readarray items < ${ALL_ITEMS}
export -f process_item
parallel process_item ::: "${items[#]}"
Problem is, I receive an error:
GNU parallel: Argument list too long
I understand by looking at similar posts 1, 2, 3 that this is more a Linux limitation than a GNU parallel one. From the answers to those posts I also tried to extrapolate a workaround by piping the items to head but the result is that only a few items (the parameter passed to head) are processed.
I have been able to make it work using xargs:
cat "${ALL_ITEMS}" | xargs -n 1 -P ${THREADS} -I {} bash -c 'process_item "$#"' _ {}
but I've seen GNU parallel has other nice features I'd like to use.
Any idea how to make this work with GNU parallel? By the way, the number of items is about 2.5 million and growing every day (the script run as a cron job).
Thanks
From man parallel:
parallel [options] [command [arguments]] < list_of_arguments
So:
export -f process_item
parallel process_item < ${ALL_ITEMS}
probably does what you want.
You can pipe the file to parallel, or just use the -a (--arg-file) option. The following are equvalent:
cat "${ALL_ITEMS}" | parallel process_item
parallel process_item < "${ALL_ITEMS}"
parallel -a "${ALL_ITEMS}" process_item
parallel --arg-file "${ALL_ITEMS}" process_item
parallel process_item :::: "${ALL_ITEMS}"