Running a limited number of child processes in parallel in bash? [duplicate] - bash

This question already has answers here:
How to limit number of threads/sub-processes used in a function in bash
(7 answers)
Closed 3 years ago.
I have a large set of files for which some heavy processing needs to be done.
This processing is single threaded, uses a few hundred MiB of RAM (on the machine used to start the job) and takes a few minutes to run.
My current use case is to start a hadoop job on the input data, but I've had this same problem in other cases before.
In order to fully utilize the available CPU power I want to be able to run several of those tasks in parallel.
However, a very simple example shell script like this will trash the system performance due to excessive load and swapping:
find . -type f | while read name ;
do
some_heavy_processing_command ${name} &
done
So what I want is essentially similar to what "gmake -j4" does.
I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've written scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).
What is the simplest/cleanest/best solution to do what I want?
Edit: Thanks to Frederik: Yes indeed this is a duplicate of How to limit number of threads/sub-processes used in a function in bash
The "xargs --max-procs=4" works like a charm.
(So I voted to close my own question)
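For reference, a minimal sketch of that xargs approach (assuming some_heavy_processing_command takes a single filename argument; -print0/-0 keep filenames with spaces intact):
find . -type f -print0 | xargs -0 -n 1 --max-procs=4 some_heavy_processing_command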

I know I'm late to the party with this answer but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 & 5 to be appropriate for your scenario.)
function max2 {
    while [ "$(jobs | wc -l)" -ge 2 ]
    do
        sleep 5
    done
}
find . -type f | while read -r name
do
    max2; some_heavy_processing_command "${name}" &
done
wait
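If your bash is 4.3 or newer you can block on wait -n instead of polling with sleep. A minimal sketch of that variant (assuming the same some_heavy_processing_command; process substitution keeps the loop, and therefore its jobs, in the main shell so the final wait sees them):
max_jobs=2
while read -r name
do
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]
    do
        wait -n   # returns as soon as any one background job finishes
    done
    some_heavy_processing_command "${name}" &
done < <(find . -type f)
wait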

#! /usr/bin/env bash
set -o monitor
# means: enable job control, so background processes run in separate process groups
trap add_next_job CHLD
# execute add_next_job when we receive a child-complete signal

todo_array=($(find . -type f)) # places output into an array (note: breaks on filenames with whitespace)

index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, add one
    if [[ $index -lt ${#todo_array[*]} ]]
    # the hash in ${#todo_array[*]} is not a comment - it is bash's (awkward) way of getting the array length
    then
        echo adding job ${todo_array[$index]}
        do_job ${todo_array[$index]} &
        # replace the line above with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"
Having said that, Fredrik makes the excellent point that xargs does exactly what you want...

With GNU Parallel it becomes simpler:
find . -type f | parallel some_heavy_processing_command {}
Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
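If you want to cap the number of simultaneous jobs (the default is one per CPU core) and handle odd filenames safely, something like this should work:
find . -type f -print0 | parallel -0 -j4 some_heavy_processing_command {}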

I think I found a handier solution using make:
#!/usr/bin/make -f
THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)
.PHONY: all $(TARGETS)
all: $(TARGETS)
$(TARGETS):
    some_heavy_processing_command $@
$(THIS): ; # Avoid trying to remake this makefile
Save it as e.g. 'test.mak' and add execute rights (and remember that the recipe line must be indented with a tab). If you call ./test.mak it will run some_heavy_processing_command one by one. But you can call it as ./test.mak -j 4, and it will run four subprocesses at once. You can also use it in a more sophisticated way: run it as ./test.mak -j 5 -l 1.5, and it will run at most 5 sub-processes while the system load is under 1.5, but will limit the number of processes if the system load exceeds 1.5.
It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.

This code worked quite well for me.
I noticed one issue, though: if max_jobs is greater than the number of elements in the array, the script will never quit.
To prevent that scenario, I've added the following right after the max_jobs declaration:
if [ $max_jobs -gt ${#todo_array[*]} ]
then
    # max_jobs is greater than the number of array elements; cap it at the array length
    max_jobs=${#todo_array[*]}
fi

Another option:
PARALLEL_MAX=...
function start_job() {
    while [ $(ps --no-headers -o pid --ppid=$$ | wc -l) -gt $PARALLEL_MAX ]; do
        sleep .1 # Wait for background tasks to complete.
    done
    "$@" &
}
start_job some_big_command1
start_job some_big_command2
start_job some_big_command3
start_job some_big_command4
...
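One addition worth making (not in the original snippet): finish with a wait so the script does not exit while the last jobs are still running.
# ... queue the remaining commands, then:
wait   # block until every remaining background job has finished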

Here is a very good function I used to control the maximum number of jobs from bash or ksh. NOTE: the - 1 in the while condition subtracts the wc -l subprocess from the count.
function jobmax
{
    typeset -i MAXJOBS=$1
    sleep .1
    while (( ($(pgrep -P $$ | wc -l) - 1) >= $MAXJOBS ))
    do
        sleep .1
    done
}

nproc=5
for i in {1..100}
do
    sleep 1 &
    jobmax $nproc
done
wait # Wait for the rest

Waiting ANY child process exit on macOS?

I am wondering how to wait for any child process to finish on macOS, since wait -n doesn't work there. I have a script doing several things, and at some point it enters a loop calling another script in the background to exploit some parallelism, but not more than X instances at a time since that wouldn't be efficient. Thus, I need to wait for any child process to finish before creating new processes.
I have seen this question but it doesn't answer the "any" part; it just says how to wait for a specific process to finish.
I've thought of storing all the PIDs and actively checking whether they're still running with ps, but that's slapdash and resource consuming. I've also thought about upgrading bash to a newer version (if that's even possible on macOS without breaking how bash already works), but I would be very disappointed if there were no other way to actually wait for any process to finish; it's such a basic feature... Any ideas?
A basic version of my code would look like this:
for vid_file in $VID_FILES
do
    my_script.sh $vid_file other_args &
    ((TOTAL_PROCESSES=TOTAL_PROCESSES+1))
    if [ $TOTAL_PROCESSES -ge $MAX_PROCESS ]; then
        wait -n
        ((TOTAL_PROCESSES=TOTAL_PROCESSES-1))
    fi
done
My neither elegant nor performant approach to substitute the wait -n:
NUM_PROCC=$MAX_PROCESS
while [ $NUM_PROCC -ge $MAX_PROCESS ]
do
    sleep 5
    NUM_PROCC=$(ps | grep "my_script.sh" | wc -l | tr -d " \t")
    # the grep command counts as one, so we need to subtract it
    ((NUM_PROCC=NUM_PROCC-1))
done
PS: This question could be closed and merged with the one I mentioned above. I've just created this new one because stackoverflow wouldn't let me comment or ask...
PS2: I do understand that my objective could be achieved by other means. If you don't have an answer for the specific question itself but rather a workaround, please let other people answer the question about "waiting any" since it would be very useful for me/everyone in the future as well. I will of course welcome and be thankful for the workaround too!
Thank you in advance!
It seems like you just want to limit the number of processes that are running at the same time. Here's a rudimentary way to do it with bash <= 4.2:
#!/bin/bash
MAX_PROCESS=2
INPUT_PATH=/somewhere
for vid_file in "$INPUT_PATH"/*
do
    while [[ "$(jobs -pr | wc -l)" -ge "$MAX_PROCESS" ]]; do sleep 1; done
    my_script.sh "$vid_file" other_args &
done
wait
Here's the bash >= 4.3 version:
#!/bin/bash
MAX_PROCESS=2
INPUT_PATH=/somewhere
for vid_file in "$INPUT_PATH"/*
do
    [[ "$(jobs -pr | wc -l)" -ge "$MAX_PROCESS" ]] && wait -n
    my_script.sh "$vid_file" other_args &
done
wait
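Note that the stock macOS /bin/bash is version 3.2, while wait -n only appeared in bash 4.3, so check which interpreter actually runs your script before relying on the second variant:
/bin/bash -c 'echo "$BASH_VERSION"'   # the stock macOS bash reports 3.2.x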
GNU make has parallelization capabilities and the following Makefile should work even with the very old make 3.81 that comes with macOS. Replace the 4 leading spaces before my_script.sh by a tab and store this in a file named Makefile:
.PHONY: all $(VID_FILES)
all: $(VID_FILES)
$(VID_FILES):
    my_script.sh "$@" other_args
And then to run 8 jobs max in parallel:
$ make -j8 VID_FILES="$VID_FILES"
Make can do even better: avoid redoing things that have already been done:
TARGETS := $(patsubst %,.%.done,$(VID_FILES))
.PHONY: all clean
all: $(TARGETS)
$(TARGETS): .%.done: %
    my_script.sh "$<" other_args
    touch "$@"
clean:
    rm -f $(TARGETS)
With this last version an empty tag file .foo.done is created for each processed video foo. If, later, you re-run make and video foo did not change, it will not be re-processed. Type make clean to delete all tag files. Do not forget to replace the leading spaces by a tab.
Suggestion 1: Completion indicator file
Suggestion: have my_script.sh create a task-completion indicator file when it finishes, by adding a line like this at the end of my_script.sh:
touch "$0.$(date +%F_%T).done"
And in your deployment script, test whether a completion indicator file exists (note that the glob must not be quoted, or it will never match):
rm -f my_script.sh.*.done
my_script.sh "$vid_file" other_args &
while ! ls my_script.sh.*.done > /dev/null 2>&1; do
    sleep 5
done
Don't forget to clean up the completion indicator files.
Advantages for this approach:
Simple
Supported in all shells
Retains a history/audit trail of completions
Disadvantages for this approach:
Requires modification to the original script my_script.sh
Requires cleanup
Requires a polling loop
Suggestion 2: Using wait command with pgrep command
See the man pages for wait and pgrep to learn more about those commands.
my_script.sh "$vid_file" other_args &
wait $(pgrep -f "my_script.sh $vid_file")
Advantages for this approach:
Simple
Readable
Disadvantages for this approach:
May pick up other users running the same command at the same time
The wait builtin's exact behavior varies between shells; check what your shell supports
With GNU Parallel it would look something like:
parallel my_script.sh {} other_args ::: $VID_FILES
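If you also want to cap the number of simultaneous jobs rather than use GNU Parallel's default of one per core, -j does that:
parallel -j8 my_script.sh {} other_args ::: $VID_FILES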

Tesseract OCR large number of files

I have around 135000 .TIF files (1.2 KB to 1.4 KB) sitting on my hard drive. I need to extract text from those files. If I run tesseract as a cron job I get 500 to 600 per hour at the most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by @Mark; I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
    if [ -f /mnt/ramdisk/output/$2.txt ]
    then
        echo skipping $2
        return
    fi
    tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
    if [ -f "${2}.txt" ]; then
        echo Skipping $1...
        return
    fi
    tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
If you have millions of files, you could consider using multiple machines in parallel, so just make sure you have ssh logins to each of the machines on your network and then run across 4 machines, including the localhost like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
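A fuller multi-machine sketch (remote1, remote2 and remote3 are placeholder hostnames; --trc ships each TIF to the remote, returns the resulting .txt and cleans up the temporary copies):
find . -iname '*.tif' -print0 |
    parallel -0 -S :,remote1,remote2,remote3 --trc {.}.txt 'tesseract {} {.} > /dev/null 2>&1'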

bash script for many files in parallel

I have read similar questions about this topic but none of them help me with the following problem:
I have a bash script that looks like this:
#!/bin/bash
for filename in /home/user/Desktop/emak/*.fa; do
    mkdir ${filename%.*}
    cd ${filename%.*}
    mkdir emak
    cd ..
done
This script basically does the following:
Iterate through all files in a directory
Create a new directory with the name of each file
Go inside the new directory and create a new directory called "emak"
The real task does something much more computationally expensive than creating the "emak" directory...
I have thousands of files to iterate through.
As each iteration is independent from the previous one, I would like to split the work across different processors (I have 24 cores) so I can process multiple files at the same time.
I read some previous posts about running in parallel (using GNU Parallel) but I do not see a clear way to apply it in this case.
thanks
No need for parallel; you can simply use
N=10
for filename in /home/user/Desktop/emak/*.fa; do
    mkdir -p "${filename%.*}/emak" &
    (( ++count % N == 0 )) && wait
done
The second line of the loop body pauses every Nth job to allow all the previous jobs to complete before continuing.
Something like this with GNU Parallel, whereby you create and export a bash function called doit:
#!/bin/bash
doit() {
    dir=${1%.*}
    mkdir "$dir"
    cd "$dir"
    mkdir emak
}
export -f doit
parallel doit ::: /home/user/Desktop/emak/*.fa
You will really see the benefit of this approach if the time taken by your "computationally expensive" part is longer, or especially variable. If it takes, say up to 10 seconds and is variable, GNU Parallel will submit the next job as soon as the shortest of the N parallel processes completes, rather than waiting for all N to complete before starting the next batch of N jobs.
As a crude benchmark, this takes 58 seconds:
#!/bin/bash
doit() {
    echo $1
    # Sleep up to 10 seconds
    sleep $((RANDOM*11/32768))
}
export -f doit
parallel -j 10 doit ::: {0..99}
and this is directly comparable and takes 87 seconds:
#!/bin/bash
N=10
for i in {0..99}; do
    echo $i
    sleep $((RANDOM*11/32768)) &
    (( ++count % N == 0 )) && wait
done

Portable way to build up arguments for a utility in shell?

I'm writing a shell script that's meant to run on a range of machines. Some of these machines have bash 2 or bash 3. Some are running BusyBox 1.18.4, where /bin/bash exists but:
/bin/bash --version doesn't return anything at all
foo=( "hello" "world" ) complains about a syntax error near the unexpected "(", both with and without the extra spaces just inside the parens ... so arrays seem either limited or missing
There are also more modern or more fully featured Linux and bash versions.
What is the most portable way for a shell script to build up arguments at run time for calling some utility like find? I can build up a string, but I feel that arrays would be a better choice. Except there's that second bullet point above...
Let's say my script is foo and you call it like so: foo -o 1 .jpg .png
Here's some pseudo-code
#!/bin/bash
# handle option -o here
shift $(expr $OPTIND - 1)
# build up parameters for find here
parameters=(my-directory -type f -maxdepth 2)
if [ -n "$1" ]; then
    parameters+=(-iname "*$1" -print)
    shift
fi
while [ $# -gt 0 ]; do
    parameters+=(-o -iname "*$1" -print)
    shift
done
find <new positional parameters here> | some-while-loop
If you need to use mostly-POSIX sh, such as the BusyBox ash that is installed under the name bash, you can build up positional parameters directly with set:
$ set -- hello
$ set -- "$@" world
$ printf '%s\n' "$@"
hello
world
For a more apt example:
$ set -- /etc -name '*b*'
$ set -- "$@" -type l -exec readlink {} +
$ find "$@"
/proc/mounts
While your question involves more than just Bash, you may benefit from reading the Wooledge Bash FAQ on the subject:
http://mywiki.wooledge.org/BashFAQ/050
It mentions the use of "set --" for older shells, but also gives a lot of background information. When building a list of arguments, it's easy to create a system that works in simple cases but fails when the data has special characters, so reading up on the subject is probably worthwhile.
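Applied to the pseudo-code above, a rough POSIX-sh sketch (untested; my-directory and some-while-loop are placeholders from the question, and splitting $* like this assumes the extensions contain no whitespace):
#!/bin/sh
# handle option -o here, then:
shift $((OPTIND - 1))

exts=$*          # the remaining arguments are extensions such as .jpg .png
first=1

# rebuild "$@" as the argument list for find
set -- my-directory -type f -maxdepth 2
for ext in $exts; do
    if [ "$first" -eq 1 ]; then
        set -- "$@" -iname "*$ext" -print
        first=0
    else
        set -- "$@" -o -iname "*$ext" -print
    fi
done

find "$@" | some-while-loop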

Can i cache the output of a command on Linux from CLI?

I'm looking for an implementation of a 'cacheme' command, which 'memoizes' the output (stdout) of whatever it has in ARGV. If it has never run the command, it will run it and memorize the output. If it has run it before, it will just copy the output from the cache file (or even better, replay both stdout and stderr to &1 and &2 respectively).
Let's suppose someone wrote this command, it would work like this.
$ time cacheme sleep 1 # first time it takes one sec
real 0m1.228s
user 0m0.140s
sys 0m0.040s
$ time cacheme sleep 1 # second time it looks for stdout in the cache (dflt expires in 1h)
#DEBUG# Cache version found! (1 minute old)
real 0m0.100s
user 0m0.100s
sys 0m0.040s
This example is a bit silly because it has no output. Ideally it would be tested on a script like sleep-1-and-echo-hello-world.sh.
I created a small script that creates a file in /tmp/ named with a hash of the full command and the username, but I'm pretty sure something like this already exists.
Are you aware of anything like this?
Note. Why would I do this? Occasionally I run commands that are network or compute intensive; they take minutes to run and the output doesn't change much. If I know that in advance, I can just prepend cacheme <cmd>, go for dinner, and when I'm back rerun the SAME command over and over on the same machine and get the same answer in an instant.
I improved the solution above somewhat by also adding the expiry age as an optional argument.
#!/bin/sh
# save as e.g. $HOME/.local/bin/cacheme
# and then chmod u+x $HOME/.local/bin/cacheme
VERBOSE=false
PROG="$(basename $0)"
DIR="${HOME}/.cache/${PROG}"
mkdir -p "${DIR}"
EXPIRY=600 # default to 10 minutes
# check if first argument is a number, if so use it as expiration (seconds)
[ "$1" -eq "$1" ] 2>/dev/null && EXPIRY=$1 && shift
[ "$VERBOSE" = true ] && echo "Using expiration $EXPIRY seconds"
CMD="$#"
HASH=$(echo "$CMD" | md5sum | awk '{print $1}')
CACHE="$DIR/$HASH"
test -f "${CACHE}" && [ $(expr $(date +%s) - $(date -r "$CACHE" +%s)) -le $EXPIRY ] || eval "$CMD" > "${CACHE}"
cat "${CACHE}"
I've implemented a simple caching script for bash, because I wanted to speed up plotting from piped shell commands in gnuplot. It can be used to cache the output of any command. The cache is used as long as the arguments are the same and the files passed in the arguments haven't changed. The system is responsible for cleaning up.
#!/bin/bash
# hash all arguments
KEY="$@"
# hash last modified dates of any files passed as arguments
for arg in "$@"
do
    if [ -f "$arg" ]
    then
        KEY+=`date -r "$arg" +\ %s`
    fi
done
# use the hash as a name for the temporary file
FILE="/tmp/command_cache.`echo -n "$KEY" | md5sum | cut -c -10`"
# use the cached file or execute the command and cache it
if [ -f "$FILE" ]
then
    cat "$FILE"
else
    "$@" | tee "$FILE"
fi
You can name the script cache, set executable flag and put it in your PATH. Then simply prefix any command with cache to use it.
Author of bash-cache here with an update. I recently published bkt, a CLI and Rust library for subprocess caching. Here's a simple example:
# Execute and cache an invocation of 'date +%s.%N'
$ bkt -- date +%s.%N
1631992417.080884000
# A subsequent invocation reuses the same cached output
$ bkt -- date +%s.%N
1631992417.080884000
It supports a number of features such as asynchronous refreshing (--stale and --warm), namespaced caches (--scope), and optionally keying off the working directory (--cwd) and select environment variables (--env). See the README for more.
It's still a work in progress but it's functional and effective! I'm using it already to speed up my shell prompt and a number of other common tasks.
I created bash-cache, a memoization library for Bash, which works exactly how you're describing. It's designed specifically to cache Bash functions, but obviously you can wrap calls to other commands in functions.
It handles a number of edge-case behaviors that many simpler caching mechanisms miss. It reports the exit code of the original call, keeps stdout and stderr separately, and retains any trailing whitespace in the output ($() command substitutions will truncate trailing whitespace).
Demo:
# Define the function normally, then decorate it with bc::cache
$ maybe_sleep() {
    sleep "$@"
    echo "Did I sleep?"
  } && bc::cache maybe_sleep

# Initial call invokes the function
$ time maybe_sleep 1
Did I sleep?
real 0m1.047s
user 0m0.000s
sys 0m0.020s

# Subsequent call uses the cache
$ time maybe_sleep 1
Did I sleep?
real 0m0.044s
user 0m0.000s
sys 0m0.010s

# Invocations with different arguments are cached separately
$ time maybe_sleep 2
Did I sleep?
real 0m2.049s
user 0m0.000s
sys 0m0.020s
There's also a benchmark function that shows the overhead of the caching:
$ bc::benchmark maybe_sleep 1
Original: 1.007
Cold Cache: 1.052
Warm Cache: 0.044
So you can see the read/write overhead (on my machine, which uses tmpfs) is roughly 1/20th of a second. This benchmark utility can help you decide whether it's worth caching a particular call or not.
How about this simple shell script (not tested)?
#!/bin/sh
mkdir -p cache
cachefile=cache/cache
for i in "$@"
do
    cachefile=${cachefile}_$(printf %s "$i" | sed 's/./\\&/g')
done
test -f "$cachefile" || "$@" > "$cachefile"
cat "$cachefile"
Improved upon the solution from error:
Pipes output into the tee command, which allows it to be viewed in real time as well as stored in the cache.
Preserves colors (for example in commands like "ls --color") by using "script --flush --quiet /dev/null --command $CMD".
Avoids calling "exec", by using script as well.
Uses bash and [[.
#!/usr/bin/env bash
CMD="$@"
[[ -z $CMD ]] && echo "usage: EXPIRY=600 cache cmd arg1 ... argN" && exit 1
# set -e -x
VERBOSE=false
PROG="$(basename $0)"
EXPIRY=${EXPIRY:-600} # default to 10 minutes, can be overridden
EXPIRE_DATE=$(date -Is -d "-$EXPIRY seconds")
[[ $VERBOSE = true ]] && echo "Using expiration $EXPIRY seconds"
HASH=$(echo "$CMD" | md5sum | awk '{print $1}')
CACHEDIR="${HOME}/.cache/${PROG}"
mkdir -p "${CACHEDIR}"
CACHEFILE="$CACHEDIR/$HASH"
if [[ -e $CACHEFILE ]] && [[ $(date -Is -r "$CACHEFILE") > $EXPIRE_DATE ]]; then
    cat "$CACHEFILE"
else
    script --flush --quiet --return /dev/null --command "$CMD" | tee "$CACHEFILE"
fi
The solution I came up with in Ruby is this. Does anybody see any optimization?
#!/usr/bin/env ruby
VER = '1.2'
$time_cache_secs = 3600
$cache_dir = File.expand_path("~/.cacheme")

require 'rubygems'
begin
  require 'filecache' # gem install ruby-cache
rescue Exception => e
  puts 'gem filecache requires installation, sorry. trying to install myself'
  system 'sudo gem install -r filecache'
  puts 'Try re-running the program now.'
  exit 1
end

=begin
# create a new cache called "my-cache", rooted in /home/simon/caches
# with an expiry time of 30 seconds, and a file hierarchy three
# directories deep
=end

def main
  cache = FileCache.new("cache3", $cache_dir, $time_cache_secs, 3)
  cmd = ARGV.join(' ').to_s # caching on the full command; note that quotes are stripped
  cmd = 'echo give me an argument' if cmd.length < 1
  # caches the command and retrieves it
  if cache.get('output' + cmd)
    #deb "Cache found! (for '#{cmd}')"
  else
    #deb "Cache not found! Recalculating and setting for the future"
    cache.set('output' + cmd, `#{cmd}`)
  end
  #deb 'anyway calling the cache now'
  print(cache.get('output' + cmd))
end

main
An implementation exists here: https://bitbucket.org/sivann/runcached/src
It caches the executable path, output and exit code, and remembers the arguments. Configurable expiration. Implemented in bash, C and Python; choose whatever suits you.
