I have a script that I want to run on a number of files
my_script file_name
but I have many files, so I have written some code that is meant to process several of them at the same time. It first creates 5 roughly 'equal' lists of the files I want to process and then runs this:
my_function() {
while read i; do
my_script $i
done < $1
}
my_function list_1 &
my_function list_2 &
my_function list_3 &
my_function list_4 &
my_function list_5 &
wait
This works for the first file in each list, but then each background job finishes without processing the rest. If I change the function to a simple echo, it works fine:
my_function() {
while read i; do
echo $i
done < $1
}
it prints all the files in each list as I would expect.
Why does it not work if I use 'my_script'? And is there a nicer way of doing this?
GNU Parallel is made for this:
parallel my_script ::: files*
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
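If you want to stay close to your five-list setup, here is a minimal sketch (reusing the list_1 ... list_5 files from the question; with no ::: source, parallel reads its arguments from stdin):
cat list_[1-5] | parallel -j 5 my_script
There is then no real need to pre-split the files into 'equal' lists at all, since parallel balances the jobs itself.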
Edit:
If the reason for not installing GNU Parallel is not covered by
http://oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html
would you then be kind enough to elaborate why?
There must be an exit statement in my_script. Replace the exit statement(s) with return statement(s).
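For illustration only (this assumes my_script is actually a shell function or a sourced script; an exit inside a separately executed script would not stop the loop):
my_script() {
    echo "processing $1"
    exit 0      # exits the background subshell running my_function,
                # so the while loop never reaches the second file
    # return 0  # a return, by contrast, only leaves the function and the loop continues
}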
Another thing to check is the possibility that the same file appears in more than one list. That could cause contention in processing: the file is already being processed when another process attempts to open it. Check for duplicate files with:
sort list_[1-5] | uniq -d
As an alternative to GNU parallel, there is https://github.com/mauvilsa/run_parallel, which is simply a bash function, so it does not require root access or compiling.
To use it, first source the file
source run_parallel.inc.sh
Then in your example, execute it as
run_parallel -T 5 my_function 'list_{%}'
It could also do the splitting of the lists for you as
run_parallel -T 5 -l full_list -n split my_function '{#}'
To see the usage explanation and some examples, execute run_parallel without any arguments.
I have a sh script that, when invoked as ./textgenerate text0, generates a file text0.txt.
Now I need text0.txt, text1.txt, ..., text1000.txt. How could I write a script to do that, i.e., how do I replace the 0 part with a changing variable, say 1 to 1000?
I know a certain "trick" to implement this, for example generating the command 1000 times in Microsoft Excel and pasting it into the shell. But is there an elegant and efficient way to do this?
for i in {0..1000}; do
./textgenerate "text${i}"
done
This loops over the range [0..1,000] and assigns each value to $i in the body of the loop. ./textgenerate will be invoked 1,001 times.
Very simply and in parallel with GNU Parallel:
parallel ./textgenerate text{} ::: {0..1000}
Or, if you don't have a recent bash to expand the {0..1000}, you could equally do this:
seq 0 1000 | parallel ./textgenerate text{}
And, if you want to see what it would do, without actually doing anything:
parallel --dry-run ... as above ...
And, if you want a progress bar:
parallel --bar ... as above ...
You can also let printf handle the looping:
. <(printf "./textgenerate text%s\n" {0..1000})
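To see what ends up being sourced, run the printf part on its own; printf repeats its format string once per brace-expanded argument:
printf "./textgenerate text%s\n" {0..2}
# ./textgenerate text0
# ./textgenerate text1
# ./textgenerate text2
The process substitution then feeds those lines to . (source), which executes them one after another in the current shell.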
I have a delimited (|) input file (TableInfo.txt) that has data as shown below
dbName1|Table1
dbName1|Table2
dbName2|Table3
dbName2|Table4
...
I have a shell script (LoadTables.sh) that parses each line and calls an executable, passing args from the line (dbName, TableName). This process reads data from a SQL Server and loads it into HDFS.
while IFS= read -r line;do
fields=($(printf "%s" "$line"|cut -d'|' --output-delimiter=' ' -f1-))
query=$(< ../sqoop/"${fields[1]}".sql)
sh ../ProcessName "${fields[0]}" "${fields[1]}" "$query"
done < ../TableInfo.txt
Right now my process runs sequentially for each line in the file, which is time-consuming depending on the number of entries.
Is there any way I can execute the process in parallel? I have heard about using xargs, GNU parallel, and the ampersand-and-wait approach, but I am not familiar with how to construct and use them. Any help is appreciated.
Note: I don't have GNU parallel installed on the Linux machine, so xargs seems like the only option, as I have heard some downsides to the ampersand-and-wait approach.
Put an & on the end of any line you want to move to the background. Replacing the silly (buggy) array-splitting method used in your code with read's own field-splitting, this looks something like:
while IFS='|' read -r db table; do
../ProcessName "$db" "$table" "$(<"../sqoop/${table}.sql")" &
done < ../TableInfo.txt
...FYI, re: what I meant about "buggy" --
fields=( $(foo) )
...performs not only string-splitting but also globbing on the output of foo; thus, a * in the output is replaced with a list of filenames in the current directory; a name such as foo[bar] can be replaced with files named foob, fooa or foor; the failglob shell option can cause such an expansion to result in a failure; the nullglob shell option can cause it to expand to nothing; etc.
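A quick way to see the difference, using a made-up line containing a glob character (for illustration only):
line='dbName1|Table_*'
fields=($(printf "%s" "$line" | cut -d'|' --output-delimiter=' ' -f1-))
echo "${fields[@]}"              # Table_* may have been replaced by matching filenames here
IFS='|' read -r db table <<<"$line"
echo "$db" "$table"              # prints: dbName1 Table_*  (the data is left untouched)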
If you have GNU xargs, consider the following:
# assuming you have "nproc" to get the number of CPUs; otherwise, hardcode
xargs -P "$(nproc)" -d $'\n' -n 1 bash -c '
db=${1%|*}; table=${1##*|}
query=$(<"../sqoop/${table}.sql")
exec ../ProcessName "$db" "$table" "$query"
' _ < ../TableInfo.txt
Context
I need to optimize deduplication using 'sort -u', and my Linux machine has an old implementation of the 'sort' command (version 5.97) that has no '--parallel' option. Although 'sort' implements parallelizable algorithms (e.g. merge sort), I need to make such parallelization explicit. Therefore I do it by hand via the 'xargs' command, which outperforms the single 'sort -u' approach by roughly 2.5x ... when it works.
Here is the intuition of what I am doing.
I am running a bash script that splits an input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The resulting parts are passed to the 'xargs' command in order to perform parallel deduplication via the sortu.sh script (details at the end). sortu.sh wraps the invocation of 'sort -u' and prints the resulting file name (e.g. "sortu.sh file.txt.part1" outputs "file.txt.part1.sorted"). The resulting sorted parts are then passed to 'sort --merge -u', which merges/deduplicates them under the assumption that they are already sorted.
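For reference, that final merge/deduplication step looks something like this (the output name file.txt.dedup is just for illustration):
sort --merge -u file.txt.part1.sorted file.txt.part2.sorted \
    file.txt.part3.sorted file.txt.part4.sorted > file.txt.dedup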
The problem I am experiencing is with the parallelization via 'xargs'. Here is a simplified version of my code:
AVAILABLE_CORES=4
PARTS="file.txt.part1
file.txt.part2
file.txt.part3
file.txt.part4"
SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
--max-procs=$AVAILABLE_CORES \
bash sortu.sh \
)
...
#More code for merging the resulting parts $SORTED_PARTS
...
The expected result is a list of sorted parts in the variable SORTED_PARTS:
echo "$SORTED_PARTS"
file.txt.part1.sorted
file.txt.part2.sorted
file.txt.part3.sorted
file.txt.part4.sorted
Symptom
Nevertheless, sometimes a sorted part is missing. For instance, file.txt.part2.sorted:
echo "$SORTED_PARTS"
file.txt.part1.sorted
file.txt.part3.sorted
file.txt.part4.sorted
This symptom is non-deterministic, both in its occurrence (i.e. a run on the same file.txt succeeds one time and fails another) and in which file is affected (i.e. it is not always the same sorted part that goes missing).
Problem
I have a race condition where all the sortu.sh instances finish and 'xargs' sends EOF before the stdout is flushed.
Question
Is there a way to ensure stdout is flushed before 'xargs' sends EOF?
Constraints
I am not able to use either the parallel command or the "--parallel" option of the sort command.
sortu.sh code
#!/bin/bash
SORTED=$1.sorted
sort -u $1 > $SORTED
echo $SORTED
The script below doesn't write intermediate contents out to disk at all, and it parallelizes the split, the individual sorts, and the merge, performing all of them at once.
This version has been backported to bash 3.2; a version built for newer releases of bash wouldn't need eval.
#!/bin/bash
nprocs=5 # maybe call the nproc command instead?
fd_min=10 # on bash 4.1, can use automatic FD allocation instead
# create a temporary directory; delete on exit
tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX")
trap 'rm -rf "$tempdir"' 0
# close extra FDs and clear traps, before optionally executing another tool.
#
# Doing this in subshells ensures that only the main process holds write handles on the
# individual sorts, so that they exit when those handles are closed.
cloexec() {
local fifo_fd
for ((fifo_fd=fd_min; fifo_fd < (fd_min+nprocs); fifo_fd++)); do
: "Closing fd $fifo_fd"
# in modern bash; just: exec {fifo_fd}>&-
eval "exec ${fifo_fd}>&-"
done
if (( $# )); then
trap - 0
exec "$@"
fi
}
# For each parallel process:
# - Run a sort -u invocation reading from an FD and writing from a FIFO
# - Add the FIFO's name to a merge sort command
merge_cmd=(sort --merge -u)
for ((i=0; i<nprocs; i++)); do
mkfifo "$tempdir/fifo.$i" # create FIFO
merge_cmd+=( "$tempdir/fifo.$i" ) # add to sort command line
fifo_fd=$((fd_min+i))
: "Opening FD $fifo_fd for sort to $tempdir/fifo.$i"
# in modern bash: exec {fifo_fd}> >(cloexec; exec sort -u >"$tempdir/fifo.$i")
printf -v exec_str 'exec %q> >(cloexec; exec sort -u >%q)' "$fifo_fd" "$tempdir/fifo.$i"
eval "$exec_str"
done
# Run the big merge sort recombining output from all the FIFOs
cloexec "${merge_cmd[@]}" &
merge_pid=$!
# Split input stream out to all the individual sort processes...
awk -v "nprocs=$nprocs" \
-v "fd_min=$fd_min" \
'{ print $0 >("/dev/fd/" (fd_min + (NR % nprocs))) }'
# ...when done, close handles on the FIFOs, so their sort invocations exit
cloexec
# ...and wait for the merge sort to exit
wait "$merge_pid"
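Assuming the script above is saved as psort.sh (a name chosen just for illustration), it reads the unsorted data on stdin and writes the merged, deduplicated result to stdout, so the whole split/sort/merge pipeline becomes a single invocation:
chmod +x psort.sh
./psort.sh < file.txt > file.txt.dedup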
I am trying to run a C executable through bash. The executable will take a different argument in each iteration, and I want to do it in parallel since I have 12 cores available.
I tried
w=1;
for i in {1..100}
do
l=$(($i-1));
for j in {12*l..12*i}
do
./run $w/100 > "$w"_out &
done
expr=$w % 12;
if ["$expr" -eq "0"]
then wait;
fi;
done
run is the C executable. I want to run it with an increasing argument w in each step, and I want to wait until all processes are done once 12 of the cores are in use. So basically, I will run 12 executables at the same time, wait until they have completed, and then move on to the next 12.
Hope I made my point clear.
Cheers.
Use GNU parallel instead:
parallel ./myscript {1} ::: {1..100}
You can specify the number of parallel processes with the -P option, but it defaults to the number of cores in the system.
You can also specify -k to keep the output order when you redirect the output to a file.
To redirect the output to individual files, you can specify the output redirection, but you have to quote it, so that it is not parsed by the shell. For example:
parallel ./run {1} '>' {1}_out ::: {1..10}
is equivalent to running ./run 1 > 1_out through ./run 10 > 10_out.
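Applied to the run executable from the question (12 jobs at a time, each job getting its w/100 argument and its own output file), an untested sketch along the same lines:
parallel -j 12 ./run '{1}/100' '>' '{1}_out' ::: {1..100}
# each job runs, e.g.:  ./run 7/100 > 7_out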
I'm evaluating if GNU Parallel can be used to search files stored on a system in parallel. There can be only one file for each day of year (doy) on the system (so a maximum of 366 files per year). Let's say there are 3660 files on the system (about 10 years worth of data). The system could be a multi-CPU multi-core Linux or a multi-CPU Solaris.
I'm storing the search commands to run on the files in an array (one command per file). This is what I'm doing right now (using bash), but I have no control over how many searches start in parallel (I definitely don't want to start all 3660 searches at once):
#!/usr/bin/env bash
declare -a cmds
declare -i cmd_ctr=0
while [[ <condition> ]]; do
if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
cmds[$cmd_ctr]="<cmd_to_run>"
let cmd_ctr++
fi
done
declare -i arr_len=${#cmds[@]}
for (( i=0; i<${arr_len}; i++ ));
do
# Get the command and run it in background
eval ${cmds[$i]} &
done
wait
If I were to use parallel (which will automatically figure out the max. CPUs/cores and start only so many searches in parallel), how can I reuse the array cmds with parallel and rewrite the above code? The other alternative is to write all commands to a file and then do cat cmd_file | parallel
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Using-shell-variables says:
parallel echo ::: "${V[@]}"
You do not want the echo, so:
parallel ::: "${cmds[@]}"
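A self-contained sketch of that pattern, with hypothetical grep commands standing in for <cmd_to_run>:
declare -a cmds
cmds+=( "grep -c ERROR $cur_archive_path/log.100" )   # hypothetical commands for illustration
cmds+=( "grep -c ERROR $cur_archive_path/log.101" )
parallel ::: "${cmds[@]}"    # each array element is run as one complete job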
If you do not need $cmds for anything else, then use 'sem' (which is an alias for parallel --semaphore) https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore
while [[ <condition> ]]; do
if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
sem -j+0 <cmd_to_run>
fi
done
sem --wait
You have not described what <condition> might be. If you are simply doing something like a for loop, you could replace the whole script with:
parallel 'if [ -s {} ] ; then cmd_to_run {}; fi' ::: $cur_archive_path/log.{1..3660}
(based on https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Composed-commands).