I often find myself writing simple for loops to perform an operation on many files, for example:
for i in `find . | grep ".xml$"`; do bzip2 $i; done
It seems a bit depressing that on my 4-core machine only one core is getting used. Is there an easy way I can add parallelism to my shell scripting?
EDIT: To add a bit more context to my problem; sorry I was not clearer to start with!
I often want to run simple(ish) scripts, such as plotting a graph, compressing or uncompressing, or running some program, on reasonably sized datasets (usually between 100 and 10,000 files). The scripts I use to solve such problems look like the one above, but might have a different command, or even a sequence of commands to execute.
For example, just now I am running:
for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done
So my problems are in no way bzip specific! (Although parallel bzip does look cool, I intend to use it in future).
Solution: Use xargs to run in parallel (don't forget the -n option!)
find -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2
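If you would rather not hard-code the worker count, a variant like this should work on GNU systems (a sketch; nproc is part of GNU coreutils and prints the number of available cores):
# One bzip2 per file, with as many parallel workers as there are CPU cores.
find . -name '*.xml' -print0 | xargs -0 -n 1 -P "$(nproc)" bzip2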
This Perl program fits your needs fairly well; you would just do this:
runN -n 4 bzip2 `find . | grep ".xml$"`
GNU make has a nice parallelism feature (e.g. -j 5) that would work in your case. Create a Makefile:
%.xml.bz2 : %.xml
	bzip2 $<   # the recipe line must start with a tab
all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))
then do a
nice make -j 5
Replace '5' with some number, probably one more than the number of CPUs. You might want to 'nice' this just in case someone else wants to use the machine while you are on it.
The answer to the general question is difficult, because it depends on the details of the things you are parallelizing.
On the other hand, for this specific purpose, you should use pbzip2 instead of plain bzip2 (chances are that pbzip2 is already installed or at least in the repositories of your distro). See here for details: http://compression.ca/pbzip2/
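For example, to compress every .xml file in place (a sketch; the -p option sets the number of processor threads, and pbzip2 uses all available cores by default):
# Compress each .xml file, letting pbzip2 use up to 4 threads per file.
find . -name '*.xml' -exec pbzip2 -p4 {} \;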
I find this kind of operation counterproductive. The reason is that the more processes access the disk at the same time, the longer the read/write times become, so the final result takes longer. The bottleneck here won't be the CPU, no matter how many cores you have.
Have you ever performed two big file copies at the same time on the same HD drive? It is usually faster to copy one and then the other.
I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring the CPU load first before going down the "challenging" path we technicians tend to choose far more often than needed.
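If you want to check which it is before optimizing, a quick test with the time keyword on one representative file gives a first hint (a sketch; the filename is a placeholder): if 'real' is much larger than 'user' plus 'sys', the disk rather than the CPU is probably the limit.
# Compare wall-clock time against CPU time for a single compression.
time bzip2 -k some_representative_file.xml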
I did something like this for bash. The parallel make trick is probably a lot faster for one-offs, but here is the main code section to implement something like this in bash; you will need to modify it for your purposes, though:
#!/bin/bash
# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.
set -m
nodes=`grep processor /proc/cpuinfo | wc -l`
job=($(yes 0 | head -n $nodes | tr '\n' ' '))
isin()
{
    local v=$1
    shift 1
    while (( $# > 0 ))
    do
        if [ $v = $1 ]; then return 0; fi
        shift 1
    done
    return 1
}
dowait()
{
    while true
    do
        nj=( $(jobs -p) )
        if (( ${#nj[@]} < nodes ))
        then
            for (( o=0; o<nodes; o++ ))
            do
                if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
            done
            return;
        fi
        sleep 1
    done
}
let x=0
while (( x < NNN ))
do
    for (( o=0; o<nodes; o++ ))
    do
        if (( job[o] == 0 )); then break; fi
    done
    if (( o == nodes )); then
        dowait;
        continue;
    fi
    CMD &
    let job[o]=$!
    let x++
done
wait
If you had to solve the problem today you would probably use a tool like GNU Parallel (unless there is a specialized parallelized tool for your task like pbzip2):
find . | grep ".xml$" | parallel bzip2
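If the paths may contain spaces or newlines, a null-delimited variant is safer, and -j controls how many jobs run at once (a sketch):
# Null-delimited version, limited to 4 simultaneous bzip2 processes.
find . -name '*.xml' -print0 | parallel -0 -j 4 bzip2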
To learn more:
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
I think you could do the following:
for i in `find . | grep ".xml$"`; do bzip2 $i & done
But that would instantly spin off as many processes as you have files, and isn't as optimal as just running four processes at a time.
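One crude way to keep this approach but cap the concurrency is to wait after every batch (a rough sketch; it assumes the jobs take roughly similar time, since each batch must finish completely before the next one starts):
n=0
for i in `find . -name '*.xml'`; do
    bzip2 "$i" &
    n=$((n+1))
    # After every 4 background jobs, wait for the whole batch to finish.
    if [ "$n" -ge 4 ]; then wait; n=0; fi
done
wait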
Related
I have 3000 very quick jobs to run that on average take 2-3 seconds each.
The list of jobs is in a file, and I want to control how many I have open.
However, the process of starting a job in the background (the & line) seems to take some time itself; therefore, some jobs are already finishing before "INTOTAL" of them have been started...
Therefore, I am not using my 32 cores efficiently.
Is there a better approach than the one below?
#!/bin/sh
#set -x
INTOTAL=28
while true
do
    NUMRUNNING=`tasklist | egrep Prod.exe | wc -l`
    JOBS=`cat jobs.lst | wc -l`
    if [ $JOBS -gt 0 ]
    then
        MAXSTART=$(($INTOTAL-$NUMRUNNING))
        NUMTOSTART=$JOBS
        if [ $NUMTOSTART -gt $MAXSTART ]
        then
            NUMTOSTART=$MAXSTART
        fi
        echo 'Starting: '$NUMTOSTART
        for ((i=1;i<=$NUMTOSTART;i++))
        do
            JOB=`head -n1 jobs.lst`
            sed -i 1d jobs.lst
            /Prod $JOB &
        done
        sleep 2
    fi
    sleep 3
done
You may want to have a look at parallel, which you should be able to install on Cygwin according to the release notes. Then running the tasks in parallel can be as easy as:
parallel /Prod {} < jobs.lst
See here for an example of this in its man page (and have a look through the plethora of examples for more about the many options it has).
To control how many jobs to run at a time use the -j flag. By default it will run 1 job per core at a time, so 32 for you. To limit to 16 for instance:
parallel -j 16 /Prod {} < jobs.lst
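If installing parallel turns out to be awkward, xargs can do something similar (a sketch; it assumes the job arguments in jobs.lst contain no embedded whitespace):
# Run up to 16 copies of /Prod at once, one line of jobs.lst per invocation.
xargs -P 16 -n 1 /Prod < jobs.lst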
I have big log files (1-2 GB and more). I'm new to programming, and bash is useful and easy for me; when I need something, I can usually get it done (sometimes with help from people here). Simple scripts work fine, but when I need complex operations they run very slowly; maybe bash is just slow, or maybe my programming skill is bad.
So do I need C for complex processing of my server log files, or do I just need to optimize my scripts?
If I just need optimization, how can I check which parts of my code are slow and which are fine?
For example, I have a while-do loop:
while read -r date month size;
do
...
...
done < file.tmp
How can I use awk to make this run faster?
That depends on how you use bash. To illustrate, consider how you'd sum a possibly large number of integers.
This function does what Bash was meant for: being control logic for calling other utilities.
sumlines_fast() {
    awk '{n += $1} END {print n}'
}
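For example, with a hypothetical file numbers.txt containing one integer per line, you would use it like this:
sumlines_fast < numbers.txt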
It runs in 0.5 seconds on a million line file. That's the kind of bash code you can very effectively use for larger files.
Meanwhile, this function does what Bash is not intended for: being a general purpose programming language:
sumlines_slow() {
    local i=0
    while IFS= read -r line
    do
        (( i += $line ))
    done
    echo "$i"
}
This function is slow, and takes 30 seconds to sum the same million line file. You should not be doing this for larger files.
Finally, here's a function that could have been written by someone who has no understanding of bash at all:
sumlines_garbage() {
    i=0
    for f in `cat`
    do
        i=`echo $f + $i | bc`
    done
    echo $i
}
It treats forks as being free and therefore runs ridiculously slowly. It would take something like five hours to sum the file. You should not be using this at all.
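To tie this back to the while read -r date month size loop in the question: per-line logic of that shape usually maps directly onto awk, with the three fields becoming $1, $2 and $3. As a purely illustrative sketch (the "total size per month" report is made up, since the real loop body isn't shown):
# Sum the third field (size) grouped by the second field (month).
awk '{ total[$2] += $3 } END { for (m in total) print m, total[m] }' file.tmp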
I want to loop over these kinds of files, where the files with the same Sample_ID have to be used together:
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over and create output.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz>$destdir/Sample_52412_R1_R2.sam
How should I pattern match the file names Sample_ID_R1 and Sample_ID_R2 to be used in one command?
Thanks,
for fname in *_R1.fastq.gz
do
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam"
done
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3
for fname in *_R1.fastq.gz
do
    while [ $(jobs | wc -l) -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo starting new job with ongoing=$(jobs | wc -l)
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs, which is a bash builtin. Thus, it has to be run under bash, not dash, which is the default shell for scripts under Debian-like distributions.
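If your bash is 4.3 or newer, the sleep-based polling can be replaced with the wait -n builtin, which blocks until any one background job finishes. A sketch of the same loop (destdir is assumed to be set as in the question):
#!/bin/bash
maxproc=3
for fname in *_R1.fastq.gz
do
    # Block until a slot frees up instead of polling with sleep.
    while [ $(jobs -r | wc -l) -ge "$maxproc" ]; do wait -n; done
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait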
I have seen some ideas for progress bars around SO and externally for specific commands (such as cat). However, my question seems to deviate slightly from the standard...
Currently, I am using the capability of the find command in shell, such as the follow example:
find . -name file -exec cmd "{}" \;
Where "cmd" is generally a zipping capability or removal tool to free up disk space.
When "." is very large, this can take minutes, and I would like some ability to report "status".
Is there some way to have some type of progress bar, percentage completion, or even print periods (i.e., Working....) until completed? If at all possible, I would like to avoid increasing the duration of this execution by adding another find. Is it possible?
Thanks in advance.
Clearly, you can only have a progress meter or percent completion if you know how long the command will take to run, or if it can tell you that it has finished x tasks out of y.
Here's a simple way to show an indicator while something is working:
#!/bin/sh
echo "launching: $#"
spinner() {
    while true; do
        for char in \| / - \\; do
            printf "\r%s" "$char"
            sleep 1
        done
    done
}
# start the spinner
spinner &
spinner_pid=$!
# launch the command
"$#"
# shut off the spinner
kill $spinner_pid
echo ""
So, you'd do (assuming the script is named "progress_indicator")
find . -name file -exec progress_indicator cmd "{}" \;
The trick with find is that you add two -print expressions, one at the start and one at the end. You then use awk (or perl) to update and print a line counter for each unique line. In this example I tell awk to print to stderr. Any duplicate lines must be the result of the conditions we specified, so we treat those specially. In this example, we just print that line:
find . -print -name aa\* -print |
awk '$0 == last {
print "" > "/dev/fd/2"
print
next
}
{
printf "\r%d", n++ > "/dev/fd/2"
last=$0
}'
It's best to leave find to just report pathnames, and do further processing from awk,
or just add another pipeline. (Because the counters are printed to stderr, those will not
interfere.)
If you have the dialog utility installed, you can easily make a nice rolling display:
find . -type f -name glob -exec echo {} \; -exec cmd {} \; |
dialog --progressbox "Files being processed..." 12 $((COLUMNS*3/2))
The arguments to --progressbox are the box's title (optional, can't look like a number); the height in text rows and the width in text columns. dialog has a bunch of options to customize the presentation; the above is just to get you started.
dialog also has a progress bar, otherwise known as a "gauge", but as @glennjackman points out in his answer, you need to know how much work there is to do in order to show progress. One way to do this would be to collect the entire output of the find command, count the number of files in it, and then run the desired task from the accumulated output. However, that means waiting until the find command finishes in order to start doing the work, which might not be desirable.
Just because it was an interesting challenge, I came up with the following solution, which is possibly over-engineered because it tries to work around all the shell gotchas I could think of (and even so, it probably misses some). It consists of two shell files:
# File: run.sh
#!/bin/bash
# Usage: run.sh root-directory find-tests
#
# Fix the following path as required
PROCESS="$HOME/bin/process.sh"
TD=$(mktemp --tmpdir -d gauge.XXXXXXXX)
find "$#" -print0 |
tee >(awk -vRS='\0' 'END{print NR > "'"$TD/_total"'"}';
ln -s "$TD/_total" "$TD/total") |
{ xargs -0 -n50 "$PROCESS" "$TD"; printf "XXX\n100\nDone\nXXX\n"; } |
dialog --gauge "Starting..." 7 70
rm -fR "$TD"
# File: process.sh
#!/bin/bash
TD="$1"; shift
TOTAL=
if [[ -f $TD/count ]]; then COUNT=$(cat "$TD/count"); else COUNT=0; fi
for file in "$#"; do
if [[ -z $TOTAL && -f $TD/total ]]; then TOTAL=$(cat "$TD/total"); fi
printf "XXX\n%d\nProcessing file\n%q\nXXX\n" \
$((COUNT*100/${TOTAL:-100})) "$file"
#
# do whatever you want to do with $file
#
((++COUNT))
done
echo $COUNT > "$TD/count"
Some notes:
There are a lot of GNU extensions scattered through the above. I haven't made a complete list, but it certainly includes the %q printf format (which could just be %s), the flags used to NUL-terminate the filename list, and the --tmpdir flag to mktemp.
run.sh uses tee to simultaneously count the number of files found (with awk) and to start processing the files.
The -n50 argument to xargs causes it to wait only for the first 50 files to avoid delaying startup if find spends a lot of time not finding the first files; it might not be necessary.
The -vRS='\0' argument to awk causes it to use NUL as the line delimiter, to match the -print0 action of find (and the -0 option to xargs); all this is only necessary if filepaths could contain a newline.
awk writes the count to _total and then we symlink _total to total to avoid a really unlikely race condition where total is read before it is completely written. symlinking is atomic, so doing it this way guarantees that total either doesn't exist or is completely written.
It might have been better to have counted the total size of the files rather than just counting them, particularly if the processing work is related to file size (compression, for example). That would be a reasonably simple modification. Also, it would be tempting to use xargs parallel execution feature, but that would require a bit more work coordinating the sum of processed files between the parallel processes.
If you're using a managed environment which doesn't have dialog, the simplest solution is to just run the above script using ssh from an environment which does have dialog. Remove | dialog --gauge "Starting..." 7 70 from run.sh, and put it in your ssh invocation instead: ssh user@host /path/to/run.sh root-dir find-tests | dialog --gauge "Starting..." 7 70
I am trying to solve an optimization problem and to find the most efficient way of performing the following commands:
whois -> sed -> while (exit while) -> perform action
The while loop currently looks like:
while [ "$x" -eq "$smth" ]; do
    x=$((x+1))
done
some action
Maybe it is more efficient to have a while true loop with an if inside (the if condition being the same as the while condition). Also, what is the best way in bash to measure the time required for every single step?
The by far biggest performance penalty and most common performance problem in Bash is unnecessary forking.
while [[ something ]]
do
var+=$(echo "$expression" | awk '{print $1}')
done
will be thousands of times slower than
while [[ something ]]
do
var+=${expression%% *}
done
This is because the former causes two forks per iteration, while the latter causes none.
Things that cause forks include but are not limited to pipe | lines, $(command expansion), <(process substitution), (explicit subshells), and using any command not listed in help (which type somecmd will identify as 'builtin' or 'shell keyword').
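As for measuring where the time goes: the time keyword can wrap an entire loop (or any compound command), which makes it easy to compare two approaches head to head. A sketch, where expression is just a placeholder variable:
expression="first second third"
# Forking version: a command substitution plus a pipe to awk on every iteration.
time for i in {1..1000}; do var=$(echo "$expression" | awk '{print $1}'); done
# Fork-free version: pure parameter expansion.
time for i in {1..1000}; do var=${expression%% *}; done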
Well, for starters you could drop the x=$((x+1)) form and use the (( )) arithmetic command directly; it avoids the extra expansion and assignment and should speed the loop up slightly:
while [ "$x" -eq "$smth" ]
do
    (( x++ ))
done