I have read similar questions about this topic but none of them help me with the following problem:
I have a bash script that looks like this:
#!/bin/bash
for filename in /home/user/Desktop/emak/*.fa; do
mkdir ${filename%.*}
cd ${filename%.*}
mkdir emak
cd ..
done
This script basically does the following:
Iterate through all files in a directory
Create a new directory with the name of each file
Go inside the new directory and create a directory called "emak"
The real task does something much more computationally expensive than creating the "emak" directory...
I have thousands of files to iterate through.
As each iteration is independent of the previous one, I would like
to split the work across different processors (I have 24 cores) so I can process multiple files at the same time.
I have read some previous posts about running in parallel (using GNU Parallel) but I do not see a clear way to apply it in this case.
Thanks.
No need for parallel; you can simply use
N=10
for filename in /home/user/Desktop/emak/*.fa; do
mkdir -p "${filename%.*}/emak" &
(( ++count % N == 0)) && wait
done
The (( ++count % N == 0 )) && wait line pauses after every Nth job, allowing the whole previous batch to complete before continuing.
Something like this with GNU Parallel, whereby you create and export a bash function called doit:
#!/bin/bash
doit() {
dir=${1%.*}
mkdir "$dir"
cd "$dir"
mkdir emak
}
export -f doit
parallel doit ::: /home/user/Desktop/emak/*.fa
You will really see the benefit of this approach if the time taken by your "computationally expensive" part is longer, or especially variable. If it takes, say up to 10 seconds and is variable, GNU Parallel will submit the next job as soon as the shortest of the N parallel processes completes, rather than waiting for all N to complete before starting the next batch of N jobs.
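If you want that keep-the-pipeline-full behaviour without GNU Parallel, newer bash (4.3+) has wait -n, which returns as soon as any one background job exits. A minimal sketch of the pattern, with sleep standing in for the expensive per-file work:

```shell
#!/bin/bash
# Sketch (assumes bash >= 4.3): keep at most N jobs in flight with wait -n,
# which resumes as soon as ANY job exits instead of waiting for a whole batch.
N=4
for i in $(seq 1 12); do
  while (( $(jobs -rp | wc -l) >= N )); do
    wait -n            # block until any one running job finishes
  done
  sleep 0.1 &          # stand-in for the expensive per-file work
done
wait                   # collect the stragglers
echo "all jobs finished"
```

Unlike the batch-wait loop above, a new job starts the moment a slot frees up.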
As a crude benchmark, this takes 58 seconds:
#!/bin/bash
doit() {
echo $1
# Sleep up to 10 seconds
sleep $((RANDOM*11/32768))
}
export -f doit
parallel -j 10 doit ::: {0..99}
and this is directly comparable and takes 87 seconds:
#!/bin/bash
N=10
for i in {0..99}; do
echo $i
sleep $((RANDOM*11/32768)) &
(( ++count % N == 0)) && wait
done
Related
I need to run a set of calculations, changing one parameter each time. A calculation directory contains a control file named 'test.ctrl', a job submission file named 'job-test' and a bunch of data files. Each calculation has to be submitted with the same control file name (written inside job-test), and the output is written to those data files without changing their names, which creates an overwriting problem. For this reason, I want to automate the job submission process with a bash script so that I don't need to submit each calculation by hand.
As an example, I have done the first calculation in directory b1-k1-a1 (I choose this format of dir names to indicate calc. parameters). This test.ctrl file has the parameters:
Beta=1
Kappa=1
Alpha=0 1
and I submitted this job using the 'sbatch job-test' command. For the following calculations, my code should copy this whole directory with the name bX-kY-aZ, make the changes in the control file, and finally submit the job. I naively tried writing the whole thing into the job-test file, as you can see in the MWE below:
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --time=0:15:00 ##hh:mm:ss
for n in $(seq 0 5)
do
for m in $(seq 0 5)
do
for v in $(seq 0 5)
do
mkdir b$n-k$m-a$v
cd b$n-k$m-a$v
cp ~/home/b01-k1-a01/* .
sed "s/Beta=1/Beta=$n/" test.ctrl
sed "s/Kappa=1/Kappa=$m/" test.ctrl
sed "s/Alpha=0 1/Alpha=0 $v/" test.ctrl
cd ..<<EOF
EOF
mpirun soft.x test.ctrl
sleep 5
done
done
done
I would appreciate any suggestions on how to make this work.
It worked after I moved cd .. to the very end of the loops and removed sed, as suggested in the comments. Hence this works now:
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --time=0:15:00 ##hh:mm:ss
for n in $(seq 0 5)
do
for m in $(seq 0 5)
do
for v in $(seq 0 5)
do
mkdir b$n-k$m-a$v
cd b$n-k$m-a$v
cp ~/home/b01-k1-a01/* .
cat >test.ctrl <<EOF
Beta=$n
Kappa=$m
Alpha=0 $v
EOF
mpirun soft.x test.ctrl
sleep 5
cd ..
done
done
done
The immediate problem is that sed without any options does not modify the file at all; it just prints the results to standard output.
It is frightfully unclear what you were hoping the here document was accomplishing. cd does not read its standard input, so it wasn't accomplishing anything at all, anyway.
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --time=0:15:00 ##hh:mm:ss
for n in $(seq 0 5); do
for m in $(seq 0 5); do
for v in $(seq 0 5); do
mkdir "b$n-k$m-a$v"
cd "b$n-k$m-a$v"
cp ~/home/b01-k1-a01/* .
sed -e "s/Beta=1/Beta=$n/" \
-e "s/Kappa=1/Kappa=$m/" \
-e "s/Alpha=0 1/Alpha=0 $v/" ~/home/b01-k1-a01/test.ctrl > test.ctrl
mpirun soft.x test.ctrl
cd ..
sleep 5
done
done
done
Notice also the merging of multiple sed commands into a single script (though as noted elsewhere, maybe printf would be even better if that's everything which you have in the configuration file).
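As a sketch of that printf idea (assuming the control file really is just these three settings; n, m and v stand for the loop variables above):

```shell
#!/bin/sh
# Sketch: generate test.ctrl directly with printf instead of sed-editing a
# template copy. n, m, v are stand-ins for the loop variables.
n=2; m=3; v=4
printf 'Beta=%s\nKappa=%s\nAlpha=0 %s\n' "$n" "$m" "$v" > test.ctrl
cat test.ctrl
```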
I have around 135,000 .TIF files (1.2 KB to 1.4 KB each) sitting on my hard drive. I need to extract text from those files. If I run tesseract as a cron job I get 500 to 600 per hour at most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by @Mark, but I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
if [ -f /mnt/ramdisk/output/$2.txt ]
then
echo skipping $2
return
fi
tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
if [ -f "${2}.txt" ]; then
echo Skipping $1...
return
fi
tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
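The other restart strategy mentioned above, moving each TIF into a processed subdirectory once it's done, could be sketched like this. Here process is a stand-in for the real tesseract call, and plain background jobs replace GNU Parallel so the sketch runs anywhere:

```shell
#!/bin/bash
# Sketch: mark files as done by moving them into processed/ after OCR,
# so a restart will not pick them up again.
mkdir -p processed
process() {
  touch "${1%.tif}.txt"   # stand-in for: tesseract "$1" "${1%.tif}" ...
  mv -- "$1" processed/   # done; a restart skips this file
}
touch a.tif b.tif         # demo inputs
for f in *.tif; do process "$f" & done
wait
```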
If you have millions of files, you could consider using multiple machines in parallel, so just make sure you have ssh logins to each of the machines on your network and then run across 4 machines, including the localhost like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
I'm using this script to copy virtual machines on my ESXi 6.5 host. The first argument of the script is the name of the directory to copy.
I would like to have a second argument: the number of VMs to copy. As it stands, I need to modify the for loop every time I want to copy a different number of VMs. The script below creates 20 VMs by copying the directory of the VM given as the first argument. I run it like this: ./copy.sh CentOS1 but would like to have something like ./copy.sh CentOS1 x, where x is the end condition of my for loop.
#!/bin/sh
for i in $(seq 1 1 20)
do
mkdir ./$1_$i/
cp $1/* $1_$i/
echo "Copying machine '$1_$i' ... DONE!"
done
NOTE: Please do not suggest other for-loop solutions, like those given, for instance, here: https://www.cyberciti.biz/faq/bash-for-loop/ because I checked them and they didn't work.
Thanks.
Use a C-style for loop, if you are using bash.
for ((i=1; i<=$2; i++))
do
mkdir "./$1_$i/"
cp "$1"/* "$1_$i/"
echo "Copying machine '$1_$i' ... DONE!"
done
If you need POSIX compatibility (as implied by your shebang), then you probably can't rely on seq being available either; use a while loop.
i=1
while [ "$i" -le "$2" ]; do
mkdir ./"$1_$i"
cp "$1"/* "$1_$i"
i=$((i+1))
done
In spite of your protestations to the contrary, one of the solutions in your link would work fine:
for ((i=1; i<=$2; i++)); do
# body of loop goes here
done
would loop from 1 to the number given in the second argument
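For completeness, a sketch that also validates the second argument and falls back to 20 when it's omitted. copy_vms is a hypothetical wrapper, and the mkdir/cp calls are replaced by echo so the sketch is safe to run:

```shell
#!/bin/sh
# Sketch: POSIX-compatible loop with a validated, defaulted count argument.
copy_vms() {
  src=$1
  count=${2:-20}                     # default to 20 when omitted
  case $count in
    *[!0-9]*|'') echo "usage: copy_vms <vm-dir> [count]" >&2; return 1 ;;
  esac
  i=1
  while [ "$i" -le "$count" ]; do
    echo "Copying machine '${src}_$i' ... DONE!"   # stand-in for mkdir + cp
    i=$((i+1))
  done
}
copy_vms CentOS1 3
```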
I'm looking for an implementation of a 'cacheme' command, which 'memoizes' the output (stdout) of whatever is in ARGV. If the command was never run, it will run it and memorize the output. If it was run before, it will just replay the output from the cache file (or even better, both output and error, to &1 and &2 respectively).
Let's suppose someone wrote this command, it would work like this.
$ time cacheme sleep 1 # first time it takes one sec
real 0m1.228s
user 0m0.140s
sys 0m0.040s
$ time cacheme sleep 1 # second time it looks for stdout in the cache (dflt expires in 1h)
#DEBUG# Cache version found! (1 minute old)
real 0m0.100s
user 0m0.100s
sys 0m0.040s
This example is a bit silly because it has no output. Ideally it would be tested on a script like sleep-1-and-echo-hello-world.sh.
I created a small script that creates a file in /tmp/ with hash of full command name and username, but I'm pretty sure something already exists.
Are you aware of any of this?
Note: why would I do this? Occasionally I run commands that are network- or compute-intensive; they take minutes to run and the output doesn't change much. If I know this in advance I can just prepend cacheme <cmd>, go for dinner, and when I'm back I can rerun the SAME command over and over on the same machine and get the same answer in an instant.
Improved the solution above somewhat by also adding the expiry age as an optional argument.
#!/bin/sh
# save as e.g. $HOME/.local/bin/cacheme
# and then chmod u+x $HOME/.local/bin/cacheme
VERBOSE=false
PROG="$(basename $0)"
DIR="${HOME}/.cache/${PROG}"
mkdir -p "${DIR}"
EXPIRY=600 # default to 10 minutes
# check if first argument is a number, if so use it as expiration (seconds)
[ "$1" -eq "$1" ] 2>/dev/null && EXPIRY=$1 && shift
[ "$VERBOSE" = true ] && echo "Using expiration $EXPIRY seconds"
CMD="$@"
HASH=$(echo "$CMD" | md5sum | awk '{print $1}')
CACHE="$DIR/$HASH"
test -f "${CACHE}" && [ $(expr $(date +%s) - $(date -r "$CACHE" +%s)) -le $EXPIRY ] || eval "$CMD" > "${CACHE}"
cat "${CACHE}"
I've implemented a simple caching script for bash, because I wanted to speed up plotting from piped shell command in gnuplot. It can be used to cache output of any command. Cache is used as long as the arguments are the same and files passed in arguments haven't changed. System is responsible for cleaning up.
#!/bin/bash
# key on all arguments
KEY="$@"
# plus the last-modified dates of any arguments that are files
for arg in "$@"
do
  if [ -f "$arg" ]
  then
    KEY+=$(date -r "$arg" +\ %s)
  fi
done
# use a hash of the key as the name of the temporary file
FILE="/tmp/command_cache.$(echo -n "$KEY" | md5sum | cut -c -10)"
# use the cached file, or execute the command and cache its output
if [ -f "$FILE" ]
then
  cat "$FILE"
else
  "$@" | tee "$FILE"
fi
You can name the script cache, set executable flag and put it in your PATH. Then simply prefix any command with cache to use it.
Author of bash-cache here with an update. I recently published bkt, a CLI and Rust library for subprocess caching. Here's a simple example:
# Execute and cache an invocation of 'date +%s.%N'
$ bkt -- date +%s.%N
1631992417.080884000
# A subsequent invocation reuses the same cached output
$ bkt -- date +%s.%N
1631992417.080884000
It supports a number of features such as asynchronous refreshing (--stale and --warm), namespaced caches (--scope), and optionally keying off the working directory (--cwd) and select environment variables (--env). See the README for more.
It's still a work in progress but it's functional and effective! I'm using it already to speed up my shell prompt and a number of other common tasks.
I created bash-cache, a memoization library for Bash, which works exactly how you're describing. It's designed specifically to cache Bash functions, but obviously you can wrap calls to other commands in functions.
It handles a number of edge-case behaviors that many simpler caching mechanisms miss. It reports the exit code of the original call, keeps stdout and stderr separately, and retains any trailing whitespace in the output ($() command substitutions will truncate trailing whitespace).
Demo:
# Define function normally, then decorate it with bc::cache
$ maybe_sleep() {
sleep "$@"
echo "Did I sleep?"
} && bc::cache maybe_sleep
# Initial call invokes the function
$ time maybe_sleep 1
Did I sleep?
real 0m1.047s
user 0m0.000s
sys 0m0.020s
# Subsequent call uses the cache
$ time maybe_sleep 1
Did I sleep?
real 0m0.044s
user 0m0.000s
sys 0m0.010s
# Invocations with different arguments are cached separately
$ time maybe_sleep 2
Did I sleep?
real 0m2.049s
user 0m0.000s
sys 0m0.020s
There's also a benchmark function that shows the overhead of the caching:
$ bc::benchmark maybe_sleep 1
Original: 1.007
Cold Cache: 1.052
Warm Cache: 0.044
So you can see the read/write overhead (on my machine, which uses tmpfs) is roughly 1/20th of a second. This benchmark utility can help you decide whether it's worth caching a particular call or not.
How about this simple shell script (not tested)?
#!/bin/sh
mkdir -p cache
cachefile=cache/cache
for i in "$@"
do
cachefile=${cachefile}_$(printf %s "$i" | sed 's/./\\&/g')
done
test -f "$cachefile" || "$@" > "$cachefile"
cat "$cachefile"
Improved upon the solution from error:
Pipes output into the "tee" command, which allows it to be viewed in real time as well as stored in the cache.
Preserves colors (for example in commands like "ls --color") by using "script --flush --quiet /dev/null --command $CMD".
Avoids calling "exec", by using script as well.
Uses bash and [[.
#!/usr/bin/env bash
CMD="$@"
[[ -z $CMD ]] && echo "usage: EXPIRY=600 cache cmd arg1 ... argN" && exit 1
# set -e -x
VERBOSE=false
PROG="$(basename $0)"
EXPIRY=${EXPIRY:-600} # default to 10 minutes, can be overriden
EXPIRE_DATE=$(date -Is -d "-$EXPIRY seconds")
[[ $VERBOSE = true ]] && echo "Using expiration $EXPIRY seconds"
HASH=$(echo "$CMD" | md5sum | awk '{print $1}')
CACHEDIR="${HOME}/.cache/${PROG}"
mkdir -p "${CACHEDIR}"
CACHEFILE="$CACHEDIR/$HASH"
if [[ -e $CACHEFILE ]] && [[ $(date -Is -r "$CACHEFILE") > $EXPIRE_DATE ]]; then
cat "$CACHEFILE"
else
script --flush --quiet --return /dev/null --command "$CMD" | tee "$CACHEFILE"
fi
The solution I came up with in Ruby is this. Does anybody see any optimization?
#!/usr/bin/env ruby
VER = '1.2'
$time_cache_secs = 3600
$cache_dir = File.expand_path("~/.cacheme")
require 'rubygems'
begin
require 'filecache' # gem install ruby-cache
rescue Exception => e
puts 'gem filecache requires installation, sorry. trying to install myself'
system 'sudo gem install -r filecache'
puts 'Try re-running the program now.'
exit 1
end
=begin
# create a new cache called "my-cache", rooted in /home/simon/caches
# with an expiry time of 30 seconds, and a file hierarchy three
# directories deep
=end
def main
cache = FileCache.new("cache3", $cache_dir, $time_cache_secs, 3)
cmd = ARGV.join(' ').to_s # caching on full command, note that quotes are stripped
cmd = 'echo give me an argument' if cmd.length < 1
# caches the command and retrieves it
if cache.get('output' + cmd)
#deb "Cache found!(for '#{cmd}')"
else
#deb "Cache not found! Recalculating and setting for the future"
cache.set('output' + cmd, `#{cmd}`)
end
#deb 'anyway calling the cache now'
print(cache.get('output' + cmd))
end
main
An implementation exists here: https://bitbucket.org/sivann/runcached/src
Caches executable path, output, exit code, remembers arguments. Configurable expiration. Implemented in bash, C, python, choose whatever suits you.
This question already has answers here:
How to limit number of threads/sub-processes used in a function in bash
(7 answers)
Closed 3 years ago.
I have a large set of files for which some heavy processing needs to be done.
This processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job) and takes a few minutes to run.
My current usecase is to start a hadoop job on the input data, but I've had this same problem in other cases before.
In order to fully utilize the available CPU power I want to be able to run several of those tasks in parallel.
However, a very simple example shell script like this will thrash the system due to excessive load and swapping:
find . -type f | while read name ;
do
some_heavy_processing_command ${name} &
done
So what I want is essentially similar to what "gmake -j4" does.
I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've created scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).
What is the simplest/cleanest/best solution to do what I want?
Edit: Thanks to Frederik: Yes indeed this is a duplicate of How to limit number of threads/sub-processes used in a function in bash
The "xargs --max-procs=4" works like a charm.
(So I voted to close my own question)
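For reference, the xargs pattern looks like this; echo stands in for some_heavy_processing_command, and -P is the short form of --max-procs:

```shell
#!/bin/sh
# Sketch: process files with at most 4 concurrent workers via xargs.
# -print0 / -0 keep filenames with spaces safe; -n 1 passes one file
# per invocation, which sh -c receives as $0.
mkdir -p work
touch work/one work/two work/three   # demo inputs
find work -type f -print0 |
  xargs -0 -n 1 -P 4 sh -c 'echo "processing $0"'
```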
I know I'm late to the party with this answer but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 & 5 to be appropriate for your scenario.)
function max2 {
while [ $(jobs | wc -l) -ge 2 ]
do
sleep 5
done
}
find . -type f | while read name ;
do
max2; some_heavy_processing_command ${name} &
done
wait
#! /usr/bin/env bash
set -o monitor
# means: run background processes in a separate processes...
trap add_next_job CHLD
# execute add_next_job when we receive a child complete signal
todo_array=($(find . -type f)) # places output into an array
index=0
max_jobs=2
function add_next_job {
# if still jobs to do then add one
if [[ $index -lt ${#todo_array[*]} ]]
# note: the hash in ${#todo_array[*]} is not a comment - it is bash's
# length operator for the array
then
echo adding job ${todo_array[$index]}
do_job ${todo_array[$index]} &
# replace the line above with the command you want
index=$(($index+1))
fi
}
function do_job {
echo "starting job $1"
sleep 2
}
# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
add_next_job
done
# wait for all jobs to complete
wait
echo "done"
Having said that Fredrik makes the excellent point that xargs does exactly what you want...
With GNU Parallel it becomes simpler:
find . -type f | parallel some_heavy_processing_command {}
Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
I think I found a more handy solution using make:
#!/usr/bin/make -f
THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)
.PHONY: all $(TARGETS)
all: $(TARGETS)
$(TARGETS):
some_heavy_processing_command $@
$(THIS): ; # avoid trying to remake this makefile
Call it e.g. 'test.mak' and add execute rights. If you call ./test.mak it will run some_heavy_processing_command one by one. But if you call ./test.mak -j 4, it will run four subprocesses at once. You can also use it in a more sophisticated way: run it as ./test.mak -j 5 -l 1.5, and it will run a maximum of 5 subprocesses while the system load is under 1.5, throttling the number of processes if the system load exceeds 1.5.
It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.
This code worked quite well for me.
I noticed one issue where the script couldn't end.
If you run into a case where max_jobs is greater than the number of elements in the array, the script will never quit.
To prevent that scenario, I've added the following right after the "max_jobs" declaration:
if [ $max_jobs -gt ${#todo_array[*]} ];
then
# more elements in the array than max jobs; cap max_jobs at the array length
max_jobs=${#todo_array[*]}
fi
Another option:
PARALLEL_MAX=...
function start_job() {
while [ $(ps --no-headers -o pid --ppid=$$ | wc -l) -gt $PARALLEL_MAX ]; do
sleep .1 # Wait for background tasks to complete.
done
"$#" &
}
start_job some_big_command1
start_job some_big_command2
start_job some_big_command3
start_job some_big_command4
...
Here is a very good function I used to control the maximum # of jobs from bash or ksh. NOTE: the - 1 in the pgrep subtracts the wc -l subprocess.
function jobmax
{
typeset -i MAXJOBS=$1
sleep .1
while (( ($(pgrep -P $$ | wc -l) - 1) >= $MAXJOBS ))
do
sleep .1
done
}
nproc=5
for i in {1..100}
do
sleep 1 &
jobmax $nproc
done
wait # Wait for the rest