Passing Arguments to GNU parallel - bash

I'm trying to use awk and GNU parallel to filter files based on the values in columns 1 and 2 and dump the result into a single .csv.gz file. Thanks to the answer here, I managed to write myscript.sh to do the job in parallel.
#!/bin/bash
doit() {
    pigz -dc $1 | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find $1 -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
and then run the script in the terminal.
./myscript.sh /path/to/files
I'm wondering how I can pass 0.5 and 1.5 as arguments to myscript.sh, so that I can call it like this:
./myscript.sh /path/to/files 0.5 1.5

This may be an easier, or more explicit, way of passing variables and parameters around:
#!/bin/bash
dir="$1"
# Pick up second and third parameters, defaulting to 0.5 and 1.5 if unspecified
a=${2:-0.5}
b=${3:-1.5}
doit() {
    file=$1
    a=$2
    b=$3
    echo "File: $file, a=$a, b=$b"
    awk -F, -v a="$a" -v b="$b" '$1>a && $2<b' "$file"
}
export -f doit
find "$dir" -name '*.tst' | parallel doit {} "$a" "$b"

#!/bin/bash
doit() {
    # $1, $2 and $3 are the shell arguments passed to doit
    # '$1' and '$2' inside the single quotes are fields in awk
    pigz -dc "$1" | awk -F, '$1>'$2' && $2<'$3
}
export -f doit
find $1 -name '*.csv.gz' | parallel doit {} $2 $3 | pigz > output.csv.gz
Call as:
paste -d, <(seq 10 | shuf) <(seq 10 | shuf) | gzip > h.csv.gz
./myscript.sh . 5 6
zcat output.csv.gz

Related

Bash function skips pv (pipe viewer)

Thanks to the answer here, I managed to write myscript.sh to filter files based on the values in columns 1 and 2 in parallel.
#!/bin/bash
doit() {
    pigz -dc $1 | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find $1 -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
and then run it in the terminal.
./myscript.sh /path/to/files
However, the pv (pipe viewer) command in the following script doesn't show the progress of the data sent through the pipe.
#!/bin/bash
doit() {
    pv $1 | pigz -dc | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find $1 -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
I'm wondering how I can show the pipeline progress (not the job progress from parallel --progress) using pv?
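By default parallel buffers each job's output until that job finishes, so a pv progress bar started inside doit is not drawn live. One workaround, sketched here on the script from the question, is to move pv downstream of parallel; it then shows the overall throughput of the combined filtered stream rather than per-file progress (and no ETA, since the total size isn't known up front):
#!/bin/bash
doit() {
    pigz -dc "$1" | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
# pv reports how much filtered data is flowing into pigz
find "$1" -name '*.csv.gz' | parallel doit | pv | pigz > output.csv.gz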

bash or zsh: how to pass multiple inputs to interactive piped parameters?

I have 3 different files that I want to compare
words_freq
words_freq_deduped
words_freq_alpha
For each file, I run a command like so, which I iterate on constantly to compare the results.
For example, I would do this:
$ cat words_freq | grep -v '[soe]'
$ cat words_freq_deduped | grep -v '[soe]'
$ cat words_freq_alpha | grep -v '[soe]'
and then review the results, and then do it again, with an additional filter
$ cat words_freq | grep -v '[soe]' | grep a | grep r | head -n20
a
$ cat words_freq_deduped | grep -v '[soe]' | grep a | grep r | head -n20
b
$ cat words_freq_alpha | grep -v '[soe]' | grep a | grep r | head -n20
c
This continues on until I've analyzed my data.
I would like to write a script that could take the piped portion and apply it to each of these files, as I iterate on the grep/head portions of the command.
e.g. The following would dump the results of running the 3 commands above AND also compare the 3 results, and dump additional calculations on them
$ myScript | grep -v '[soe]' | grep a | grep r | head -n20
the letters were in all 3 runs, and it took 5 seconds
a
b
c
How can I do this using bash/python or zsh for the myScript part?
EDIT: After asking the question, it occurred to me that I could use eval to do it, like so, which I've added as an answer as well
The following approach allows me to process multiple files by using eval, which I know is frowned upon - any other suggestions are greatly appreciated!
$ myScript "grep -v '[soe]' | grep a | grep r | head -n20"
myScript
#!/usr/bin/env bash
function doIt(){
    FILE=$1
    CMD="cat $FILE | $2"
    echo processing file "$FILE"
    eval "$CMD"
    echo
}
doIt words_freq "$@"
doIt words_freq_deduped "$@"
doIt words_freq_alpha "$@"
You can't stop your shell from interpreting the pipes itself, so using it like that isn't very practical: you'd either have to quote the whole pipeline and then eval it, which makes it hard to pass arguments with spaces, or quote every pipe character individually and eval the result. Either way, these solutions are kind of hacky.
I'd suggest doing one of these two:
Keep your editor open, and put whatever you want to run inside the doIt function itself before you run it. Then run it in your shell without any arguments:
#!/usr/bin/env bash
doIt() {
    # grep -v '[soe]' < "$1"
    grep -v '[soe]' < "$1" | grep a | grep r | head -n20
}
doIt words_freq
doIt words_freq_deduped
doIt words_freq_alpha
Or you could always use a for loop in your shell, which you can recall from your history with Ctrl+r whenever you want to use it:
$ for f in words_freq*; do grep -v '[soe]' < "$f" | grep a | grep r | head -n20; done
But if you really want your approach, I tried to make it accept spaces, but it ended up being even hackier:
#!/usr/bin/env bash
doIt() {
    local FILE=$1
    shift
    echo processing file "$FILE"
    local args=()
    for n in $(seq 1 $#); do
        arg=$1
        shift
        if [[ $arg == '|' ]]; then
            args+=('|')
        else
            args+=("\"$arg\"")
        fi
    done
    eval "cat '$FILE' | ${args[@]}"
}
doIt words_freq "$@"
doIt words_freq_deduped "$@"
doIt words_freq_alpha "$@"
With this version you can use it like this:
$ ./myScript grep "a a" "|" head -n1
Notice that it needs you to quote the |, and that it now handles arguments with spaces.
I may not have understood the problem correctly.
My understanding is that you want to write a script without pipes, moving the filtering logic into the script itself and feeding the filtering patterns in as arguments.
Here is a gawk script (standard Linux awk).
With one sweep on 3 input files, without piping.
script.awk
BEGIN {
    # set the record separator to something unlikely to be matched,
    # so each file is read entirely as a single record
    RS = "!#!#!#!#!#!#!#"
}
# select files that do not match excludeRegEx but do match both includeRegEx1 and includeRegEx2
$0 !~ excludeRegEx && $0 ~ includeRegEx1 && $0 ~ includeRegEx2 {
    system("head -n20 " FILENAME)   # run "head -n20" on the current file
}
Running script.awk
awk -v excludeRegEx='[soe]' \
-v includeRegEx1='a' \
-v includeRegEx2='r' \
-f script.awk words_freq words_freq_deduped words_freq_alpha

Bash - Counter for multiple parameters in file

I created a command which works, but not exactly as I want, so I would like to improve it to produce the right output.
My command:
awk '{print $1}' ios-example.com.access | sort | uniq -c | sort -nr
Output of my command:
8 192.27.69.191
2 82.202.69.253
Input file:
https://pajda.fit.vutbr.cz/ios/ios-19-1-logs/blob/master/ios-example.com.access.log
Output I need (hashtags instead of numbers):
198.27.69.191 (8): ########
82.202.69.253 (2): ##
cat ios-example.com.access | sort | uniq -c | awk 'ht="#"{for(i=1;i<$1;i++){ht=ht"#"} str=sprintf("%s (%d): %s", $2,$1, ht); print str}'
expecting file with content like:
ipadress1
ipadress1
ipadress1
ipadress2
ipadress2
ipadress1
ipadress2
ipadress1
Using xargs with sh and printf, with comments in between the lines. There is a live version at tutorialspoint.
# sorry cat
cat <<EOF |
8 192.27.69.191
2 82.202.69.253
EOF
# for each 2 arguments
xargs -n2 sh -c '
# format the output as "$2 ($1): "
printf "%s (%s): " "$2" "$1"
# repeat the character `#` $1 times
seq "$1" | xargs printf "#%.0s"
# lastly a newline
printf "\n"
' --
I think we could shorten that a bit with:
xargs -n2 sh -c 'printf "%s (%s): %s\n" "$2" "$1" $(printf "#%.0s" $(seq $1))' --
or maybe just echo, if the input is sufficiently safe:
xargs -n2 sh -c 'echo "$2 ($1): $(printf "#%.0s" $(seq $1))"' --
You can upgrade your command by adding another awk to the list, or you can just use a single awk for the whole thing:
awk '{ a[$1]++ }
END {
    for (i in a) {
        printf "%s (%d): ", i, a[i]
        for (j=0; j<a[i]; ++j) printf "#"
        printf "\n"
    }
}' file
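Run against a file like the sample sketched in the question (one address per line), this prints one line per distinct address, in no particular order, for example:
ipadress1 (5): #####
ipadress2 (3): ###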

Splitting out a large file

I would like to process a 200 GB file with lines like the following:
...
{"captureTime": "1534303617.738","ua": "..."}
...
The objective is to split this file into multiple files grouped by hours.
Here is my basic script:
#!/bin/sh
echo "Splitting files"
echo "Total lines"
sed -n '$=' $1
echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while read p; do
    date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
    echo $p >> split.$date
done <$1
Some facts:
80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.
Could you help me to optimize this bash script?
Thank you
This awk solution might come to your rescue:
awk -F'"' '{file=strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1
It essentially replaces your while-loop.
Furthermore, you can replace the complete script with:
# Start of the awk file
BEGIN { FS="\"" }
(NR==1) { tmin = tmax = $4 }
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }
END {
    print "Total lines processed: ", NR
    print "First date: " strftime("%Y%m%d%H", tmin)
    print "Last date: " strftime("%Y%m%d%H", tmax)
}
Which you then can run as:
awk -f <awk_file.awk> <jq-file>
Note: the usage of strftime indicates that you need to use GNU awk.
You can start optimizing by replacing this
sed 's/{"captureTime": "//' | sed 's/","ua":.*//'
with this
sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
-n suppress automatic printing of pattern space
-E use extended regular expressions in the script
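As a quick check, feeding the sample line from the question through that sed prints just the epoch timestamp, ready to be handed to date -d:
echo '{"captureTime": "1534303617.738","ua": "..."}' | sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
1534303617.738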

Bash Xargs Sleep (Multiple Command Line Arguments)

OK, so I have the following script that updates Route 53 DNS entries. Unfortunately there is a limit to the number of calls per second you can make, so I need to make the final xargs command sleep for about a second between each iteration.
I've tried a couple of things like '{../cli53 blah; sleep 10; }' and I can't seem to get it to work. Does anyone have any suggestions please?
#!/bin/bash
set root='dirname $0'
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v 'i-' | xargs --verbose -n 4 /usr/local/bin/cli53 rrcreate -x 5 contoso.com
Edit: Thanks Etan for the answer. Here is my solution for anyone else that needs it:
I had to include the -I %variable% switch in the xargs statement as well, to make sure the input was passed as parameters to cli53, but it all looks to be working nicely now.
#!/bin/bash
set root='dirname $0'
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v '^i-' |
xargs --verbose -n 4 -I myvar /bin/sh -c '{ /usr/local/bin/cli53 rrcreate -x 5 contoso.com 'myvar'; sleep 1; printf "\n\n"; }'
The simplest solution would be to simply put the cli53 and sleep calls in a script and use xargs to execute the script.
If you don't want to do that you should be able to do what you were trying to do with this:
... | xargs ... /bin/sh -c '{ /usr/local/bin/cli53 ... "$@"; sleep 10; }' -
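For the first suggestion, a minimal sketch of such a wrapper script (the file name is made up; the cli53 arguments and the one-second pause are taken from the question) might be:
#!/bin/sh
# update-record.sh - create one record set, then pause to stay under the rate limit
/usr/local/bin/cli53 rrcreate -x 5 contoso.com "$@"
sleep 1
and the pipeline would then end with ... | xargs --verbose -n 4 ./update-record.sh instead.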
