Passing args to defined bash functions through GNU parallel - bash

Let me show you a snippet of my Bash script and how I try to run parallel:
parallel -a "$file" \
-k \
-j8 \
--block 100M \
--pipepart \
--bar \
--will-cite \
_fix_col_number {} | _unify_null_value {} >> "$OUTPUT_DIR/$new_filename"
So, I am basically trying to process each line in a file in parallel using Bash functions defined inside my script. However, I am not sure how to pass each line to my defined functions "_fix_col_number" and "_unify_null_value". Whatever I do, nothing gets passed to the functions.
I am exporting the functions like this in my script:
declare -x NUM_OF_COLUMNS
export -f _fix_col_number
export -f _add_tabs
export -f _unify_null_value
The mentioned functions are:
_unify_null_value()
{
    _string=$(echo "$1" | perl -0777 -pe "s/(?<=\t)\.(?=\s)//g" | \
        perl -0777 -pe "s/(?<=\t)NA(?=\s)//g" | \
        perl -0777 -pe "s/(?<=\t)No Info(?=\s)//g")
    echo "$_string"
}

_add_tabs()
{
    _tabs=""
    for (( c=1; c<=$1; c++ ))
    do
        _tabs="$_tabs\t"
    done
    echo -e "$_tabs"
}

_fix_col_number()
{
    line_cols=$(echo "$1" | awk -F"\t" '{ print NF }')
    if [[ $line_cols -gt $NUM_OF_COLUMNS ]]; then
        new_line=$(echo "$1" | cut -f1-"$NUM_OF_COLUMNS")
        echo -e "$new_line\n"
    elif [[ $line_cols -lt $NUM_OF_COLUMNS ]]; then
        missing_columns=$(( NUM_OF_COLUMNS - line_cols ))
        new_line="${1//$'\n'/}$(_add_tabs $missing_columns)"
        echo -e "$new_line\n"
    else
        echo -e "$1"
    fi
}
I tried removing {} from parallel. Not really sure what I am doing wrong.

I see two problems in the invocation plus additional problems with the functions:
With --pipepart there are no arguments. The blocks read from -a file are passed over stdin to your functions. Try the following commands to confirm this:
seq 9 > file
parallel -a file --pipepart echo
parallel -a file --pipepart cat
Theoretically, you could read stdin into a variable and pass that variable to your functions, ...
parallel -a file --pipepart 'b=$(cat); someFunction "$b"'
... but I wouldn't recommend it, especially since your blocks are 100MB each.
Bash interprets the pipe | in your command before parallel even sees it. To run a pipe, quote the entire command:
parallel ... 'b=$(cat); _fix_col_number "$b" | _unify_null_value "$b"' >> ...
_fix_col_number seems to assume its argument to be a single line, but receives 100MB blocks instead.
_unify_null_value does not read stdin, so _fix_col_number {} | _unify_null_value {} is equivalent to _unify_null_value {}.
That being said, your functions can be drastically improved. They start a lot of processes, which becomes incredibly expensive for larger files. You can make some trivial improvements like combining perl ... | perl ... | perl ... into a single perl. Likewise, instead of storing everything in variables, you can process stdin directly: just use f() { cmd1 | cmd2; } instead of f() { var=$(echo "$1" | cmd1); var=$(echo "$var" | cmd2); echo "$var"; }.
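For illustration, a minimal sketch (not part of the original answer) of _unify_null_value rewritten this way, filtering stdin with a single perl process that performs all three substitutions:
_unify_null_value()
{
    # reads stdin, writes stdout; one perl instead of three
    perl -0777 -pe 's/(?<=\t)(\.|NA|No Info)(?=\s)//g'
}
Written like this it can sit directly in a pipe, e.g. ... | _fix_col_number | _unify_null_value.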
However, don't waste time on small things like these. A complete rewrite in sed, awk, or perl is easy and should outperform every optimization on the existing functions.
Try
n="INSERT NUMBER OF COLUMNS HERE"
tabs=$(perl -e "print \"\t\" x $n")
perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" file |
cut -f "1-$n"
If you still find this too slow, leave out file; pack the command into a function, export that function and then call parallel -a file -k --pipepart nameOfTheFunction. The option --block is not necessary as pipepart will evenly split the input based on the number of jobs (can be specified with -j).
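A minimal sketch of that last suggestion, under the same assumptions as the snippet above (the function name fix_file and the output file are illustrative):
n="INSERT NUMBER OF COLUMNS HERE"
export n

fix_file()
{
    # reads a block on stdin, writes the fixed block to stdout
    tabs=$(perl -e "print \"\t\" x $n")
    perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" |
        cut -f "1-$n"
}
export -f fix_file

parallel -a file -k --pipepart fix_file > output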

Related

Make the bash script faster

I have a fairly large list of websites in "file.txt" and want to check whether the words "Hello World!" appear on each site in the list, using a loop and curl.
i.e. in "file.txt":
blabla.com
blabla2.com
blabla3.com
then my code :
#!/bin/bash
put() {
    printf "list : "
    read list
    run=$(cat $list)
}
put
scan_list() {
    for run in $(cat $list); do
        if [[ $(curl -skL ${run}) =~ "Hello World!" ]]; then
            printf "${run} Hello World! \n"
        else
            printf "${run} No Hello:( \n"
        fi
    done
}
scan_list
This takes a lot of time; is there a way to make the checking process faster?
Use xargs:
% tr '\12' '\0' < file.txt | \
xargs -0 -r -n 1 -t -P 3 sh -c '
    if curl -skL "$1" | grep -q "Hello World!"; then
        echo "$1 Hello World!"
        exit
    fi
    echo "$1 No Hello:("
' _
Use tr to convert the newlines in file.txt to nulls (\0).
Pass the result through xargs with the -0 option so it splits the input on nulls.
The -r option prevents the command from being run if the input is empty. This is only available on Linux, so for macOS or *BSD you will need to check that file.txt is not empty before running.
The -n 1 option permits only one argument (URL) per command execution.
The -t option is for debugging; it prints each command before it is run.
We allow 3 simultaneous commands in parallel with the -P 3 option.
Using sh -c with a single quoted multi-line command, we substitute $1 for the entries from the file.
The _ fills in the $0 argument, so our entries are $1.

How to extract code into a function when using xargs -P?

At first, I wrote this code, and it ran well.
# version1
all_num=10
thread_num=5
a=$(date +%H%M%S)
seq 1 ${all_num} | xargs -n 1 -I {} -P ${thread_num} sh -c 'echo abc{}'
b=$(date +%H%M%S)
echo -e "startTime:\t$a"
echo -e "endTime:\t$b"
Now I want to extract the code into a function, but it doesn't work. How do I fix it?
get_file(i){
echo "abc"+i
}
all_num=10
thread_num=5
a=$(date +%H%M%S)
seq 1 ${all_num} | xargs -n 1 -I {} -P ${thread_num} sh -c "$(get_file {})"
b=$(date +%H%M%S)
echo -e "startTime:\t$a"
echo -e "endTime:\t$b"
Because /bin/sh isn't guaranteed to support either printing text that, when evaluated, defines your function, or exporting functions through the environment, we need to do this the hard way, duplicating the text of the function inside the copy of sh started by xargs.
Other questions already exist in this site describing how to accomplish this with bash, which is quite considerably easier. See f/e How can I use xargs to run a function in a command substitution for each match?
#!/bin/sh
all_num=10
thread_num=5
batch_size=1 # but with a larger all_num, turn this up to start fewer copies of sh
a=$(date +%H%M%S) # warning: this is really inefficient
seq 1 ${all_num} | xargs -n "${batch_size}" -P "${thread_num}" sh -c '
    get_file() { i=$1; echo "abc ${i}"; }
    for arg do
        get_file "$arg"
    done
' _
b=$(date +%H%M%S)
printf 'startTime:\t%s\n' "$a"
printf 'endTime:\t%s\n' "$b"
Note:
echo -e is not guaranteed to work with /bin/sh. Moreover, for a shell to be truly compliant, echo -e is required to write -e to its output. See Why is printf better than echo? on UNIX & Linux Stack Exchange, and the APPLICATION USAGE section of the POSIX echo specification.
Putting {} in a sh -c '...{}...' position is a Really Bad Idea. Consider the case where you're passed in a filename that contains $(rm -rf ~)'$(rm -rf ~)' -- it can't be safely inserted in an unquoted context, or a double-quoted context, or a single-quoted context, or a heredoc.
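A small illustration of that risk (not from the original answer; a harmless echo INJECTED stands in for the destructive command):
# the payload is pasted into the script text and gets executed
printf '%s\n' 'x; echo INJECTED' | xargs -I {} sh -c 'echo {}'
# prints "x" and then "INJECTED"

# passed as a positional parameter instead, it stays inert data
printf '%s\n' 'x; echo INJECTED' | xargs -I {} sh -c 'echo "$1"' _ {}
# prints the literal string: x; echo INJECTED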
Note that seq is also nonstandard and not guaranteed to be present on all POSIX-compliant systems. i=0; while [ "$i" -lt "$all_num" ]; do echo "$i"; i=$((i + 1)); done is an alternative that will work on all POSIX systems.
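For completeness, a hedged sketch of the pipeline above with seq replaced by that loop (bounds adjusted so it produces 1..all_num like seq 1 ${all_num}):
i=1
while [ "$i" -le "$all_num" ]; do
    echo "$i"
    i=$((i + 1))
done | xargs -n "${batch_size}" -P "${thread_num}" sh -c '
    get_file() { i=$1; echo "abc ${i}"; }
    for arg do
        get_file "$arg"
    done
' _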

How to do floating point comparisons in an if-statement within a GNU parallel block?

I want to run a batch process in parallel. For this I pipe a list to parallel. When I add an if-statement that compares two floating point numbers (taken from here), the code doesn't run anymore. How can this be solved?
LIMIT=25
ps | parallel -j2 '
    echo "Do stuff for {} to determine NUM"
    NUM=33.3333 # set to demonstrate
    if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
        echo "react..."
    fi
    echo "Do stuff..."
'
Prints:
Do stuff for \ \ PID\ TTY\ \ \ \ \ \ \ \ \ \ TIME\ CMD to determine NUM
Do stuff...
(standard_in) 2: syntax error
#... snipp
LIMIT is not set inside the shell started by parallel. Running echo "$NUM > $LIMIT" | bc -l therefore expands to echo "33.3333 > " | bc -l, which produces the syntax error reported by bc. You need to export/pass/put its value into the shell run from inside parallel. Try this:
LIMIT=25
ps | parallel -j2 '
    LIMIT="'"$LIMIT"'"
    echo "Do stuff for {} to determine NUM"
    NUM=33.3333 # set to demonstrate
    if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
        echo "react..."
    fi
    echo "Do stuff..."
'
Or better use env_parallel, designed for such problems.
Side note: GNU parallel was designed for executing jobs in parallel using one or more computers. For scripts running on one computer it is better to stick with the xargs command, which is more commonly available (so you don't need to install some package each time you move your script to another machine).
While GNU Parallel is designed to deal correctly with commands spanning multiple lines, I personally find that hard to read. I prefer using a function:
doit() {
    arg="$1"
    echo "Do stuff for $arg to determine NUM"
    NUM=33.3333 # set to demonstrate
    if (( $(echo "$NUM > $LIMIT" | bc -l) )); then
        echo "react..."
    fi
    echo "Do stuff..."
}
export -f doit
LIMIT=25
export LIMIT
ps | parallel -j2 doit
Instead of the exports you can use env_parallel:
ps | env_parallel -j2 doit
If your environment is too big, use env_parallel --session before starting:
#!/bin/bash
env_parallel --session
# Define functions and variables _after_ running --session
doit() {
[...]
}
LIMIT=25
ps | env_parallel -j2 doit
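(The --session call records which variable and function names already exist at that point, so later env_parallel invocations only transfer names defined after it, which keeps the copied environment small.)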

How to use shell variables with GNU parallel?

This is the content of list.csv:
Apple,Red,10
Banana,Yellow,3
Coconut,White,18
Suppose I have this GNU parallel command:
parallel -a list.csv -j0 -C, \
color=`echo {2} | sed 's/e/eee/g' | ./capitalize.sh` ";" \
echo "{2}" ";" \
echo "$color" ";"
To get:
Red
REEED
Yellow
YEEELLOW
White
WHITEEE
Why isn't the color variable being defined/printed?
EDIT 20151218:
Now that I got the quoting right, I'd like to introduce a function reading a variable from another function, and reading $0.
This is a working example without GNU parallel (I made grep case-insensitive before posting, to facilitate testing without ./capitalize.sh).
while read line; do
    doit() {
        color=`echo $1 | cut -d, -f2 | sed 's/e/eee/g' | ./capitalize.sh`
    }
    export -f doit
    get_key() {
        key=`grep -i $color $0 | cut -d, -f2`
    }
    export -f get_key
    # note that I would use parallel's `-C,` here instead of `cut`.
    doit $line # get CSV's 2nd element and make it look like the one in script.
    get_key    # extract this element's value from the script's comments.
    echo "color: $color"
    echo "key: $key"
done < list.csv
#Key database in the shell script
# REEED,r-key
# YEEELLOW,y-key
# WHITEEE,w-key
Working output:
color: REEED
key: r-key
color: YEEELLOW
key: y-key
color: WHITEEE
key: w-key
This should work:
parallel -a list.csv -j0 -C, 'color=`echo {2} | sed "s/e/eee/g" | ./capitalize.sh`' ";" echo "{2}" ";" echo '"$color"' ";"
You are being hit by inadequate quoting. It might be easier to use a function:
doit() {
    color=`echo $2 | sed 's/e/eee/g' | ./capitalize.sh`
    echo "$2"
    echo "$color"
}
export -f doit
parallel -a list.csv -j0 -C, doit
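Assuming ./capitalize.sh uppercases its input, this prints the pairs shown in the question (Red/REEED, Yellow/YEEELLOW, White/WHITEEE); note that without -k the order of the rows in the output is not guaranteed.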
If this is the real goal you might want to use {= =} instead which is made for similar situations:
parallel -a list.csv -j0 -C, echo {2}";" echo '{=2 s/e/eee/g; $_=uc($_) =}'
If you are using $color several times, then --rpl can introduce a shorthand:
parallel --rpl '{clr} s/e/eee/g; $_=uc($_)' -a list.csv -j0 -C, echo {2}";" echo '{2clr} and again: {2clr}'
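For the Banana,Yellow,3 line, for example, {2clr} expands to YEEELLOW: the second column with every e tripled and the result uppercased.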
From the xargs aficionados I would really like to see a solution using xargs that:
guarantees not mixing output from different jobs - even if the lines are 60k long (e.g. the value of $color is 60k long)
sends stdout to stdout, and stderr to stderr
does not skip jobs even if the list of jobs (list.csv) is bigger than the number of available processes in the process table - even if capitalize.sh takes a full minute to run (xargs -P0)
The idea is to use a single function to do everything.
#!/bin/bash
#Key database in the shell script
# REEED,r-key
# YEEELLOW,y-key
# WHITEEE,w-key
doit() {
    # get CSV's 2nd element and make it look like the one in script.
    color=`echo $3 | cut -d, -f2 | sed 's/e/eee/g' | ./capitalize.sh`
    # extract this element's value from the script's comments.
    key=`grep -i $color $1 | cut -d, -f2`
    echo "color: $color"
    echo "key: $key"
}
export -f doit
#note that I would use parallel's `-C,` here instead of `cut`.
parallel -C, doit $0 < list.csv

bash functions with loops and pipes

I have a bash script that pipes the contents of a file into a series of user-defined functions, each of which performs a sed operation on stdin, sending output to stdout.
For example:
#!/bin/bash
MOD_config ()
{
    sed 's/config/XYZ/g'
}
MOD_ABC ()
{
    sed 's/ABC/WHOA!/g'
}
cat demo.txt \
| MOD_config \
| MOD_ABC
So far so good. Everything is working great.
Now I want to allow additional pairs of pattern changes specified via the script's command line. For example, I'd like to allow the user to run any of these:
demo.sh # changes per script (MOD_config and MOD_ABC)
demo.sh CDE 345 # change all 'CDE' to '345'
demo.sh CDE 345 AAAAA abababa # .. also changes 'AAAAA' to 'abababa'
so I tried adding this to the script:
USER_MODS ()
{
    if [ $# -lt 1 ]; then
        # just echo stdin to stdout if no cmd line args exist
        grep .
    else
        STARTING_ARGC=$#
        I=1
        while [ $I -lt $STARTING_ARGC ]; do
            sed "s/$1/$2/g"
            shift
            shift
            I=`expr $I + 2`
        done
    fi
}
cat demo.txt \
| MOD_config \
| MOD_ABC \
| USER_MODS
This approach works only if I have no command line args, or if I have only two. However, adding additional args on the command line has no effect.
Not sure exactly how to send stdout of one iteration of the while loop to the stdin of the next iteration. I think that's the crux of my problem.
Is there a fix for this? Or should I take a different approach altogether?
To have a dynamic list of pipes, you'll want a recursive solution. Have a function which applies one set of modifications and then calls itself with two fewer arguments. If the function has no arguments then simply call cat to copy stdin to stdout unchanged.
mod() {
    if (($# >= 2)); then
        search=$1
        replace=$2
        shift 2
        sed "s/$search/$replace/g" | mod "$@"
    else
        cat
    fi
}

# Apply two base modifications, plus any additional user mods ("$@")
mod config XYZ ABC 'WHOA!' "$@"
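A hedged usage sketch (the snippet above leaves the input source implicit; here demo.txt is read on stdin and the script's arguments are forwarded, matching the original script):
# inside demo.sh, after defining mod():
mod config XYZ ABC 'WHOA!' "$@" < demo.txt

# e.g. ./demo.sh CDE 345 builds the pipeline:
#   sed 's/config/XYZ/g' | sed 's/ABC/WHOA!/g' | sed 's/CDE/345/g' | cat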
A remark: with more than 2 arguments your later seds do get executed, but only after the first one has already consumed all of the input, so they have nothing left to read. Instead you want to build up a chain of sed commands.
#!/bin/bash
mod_config() { sed 's/config/XYZ/g'; }
mod_abc() { sed 's/ABC/WHOA!/g'; }
user_mods() {
    local IFS=';'
    local sed_subs=()
    while (($# >= 2)); do
        sed_subs+=( "s/$1/$2/g" )
        shift 2
    done
    # at this point you have an array of sed s commands (maybe empty!).
    # Just join them with a semicolon using the IFS already set
    sed "${sed_subs[*]}"
}
cat demo.txt \
| mod_config \
| mod_abc \
| user_mods "$@" # <--- don't forget to pass the arguments to your function
And pray that your users aren't going to input stuff that will confuse sed, e.g., a slash!
(And sorry, I lowercased all your variables. Uppercases are sooooo ugly).
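Concretely, with ./demo.sh CDE 345 AAAAA abababa the user_mods function collects s/CDE/345/g and s/AAAAA/abababa/g and runs them as one sed 's/CDE/345/g;s/AAAAA/abababa/g', so the input passes through a single extra sed no matter how many pairs are given.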
Try this; it uses a recursive call to go down the list of replacement pairs, calling USER_MODS each time.
#!/bin/bash
MOD_config ()
{
    sed 's/config/XYZ/g'
}
MOD_ABC ()
{
    sed 's/ABC/WHOA!/g'
}
USER_MODS ()
{
    if [ $# -lt 1 ]; then
        # just echo stdin to stdout if no args exist
        grep .
    else
        # grab the next two arguments
        arg1=$1
        arg2=$2
        # remove them from the argument list
        shift
        shift
        # do the replacement for these two and recursively pipe to the function with
        # the new argument list
        sed "s/$arg1/$arg2/g" | USER_MODS "$@"
    fi
}
cat demo.txt \
| MOD_config \
| MOD_ABC \
| USER_MODS "$@"
