Thanks to the answer here, I managed to write myscript.sh to filter files based on the values in column 1 and column 2 in parallel.
#!/bin/bash
doit() {
pigz -dc "$1" | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
and then run it in the terminal.
./myscript.sh /path/to/files
However, the pv (pipe viewer) command in the following script doesn't show the progress of the data sent through the pipe.
#!/bin/bash
doit() {
pv "$1" | pigz -dc | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
I'm wondering how I can show the pipeline's progress (not the job progress from parallel --progress) using pv.
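For what it's worth (this isn't from the original thread), the likely culprit is that parallel groups each job's output, including pv's progress writes to stderr, and only releases it when the job finishes, so a per-job pv never updates the terminal live. A minimal workaround sketch is to keep doit as before and move a single pv into the outer pipeline, where its stderr stays attached to the terminal:
#!/bin/bash
doit() {
pigz -dc "$1" | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
# one pv for the merged stream: a live byte counter for the filtered output
# (no ETA, since the final size isn't known in advance)
find "$1" -name '*.csv.gz' | parallel doit | pv | pigz > output.csv.gz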
Related
I'm trying to use awk and GNU parallel to filter the files based on the values in column 1 and column 2 and dump the result into a single .csv.gz file. Thanks to the answer here, I managed to write myscript.sh to do the job in parallel.
#!/bin/bash
doit() {
pigz -dc "$1" | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
and then run the script in the terminal.
./myscript.sh /path/to/files
I'm wondering how I can pass 0.5 and 1.5 as arguments to myscript.sh?
./myscript.sh /path/to/files 0.5 1.5
This may be an easier, or more explicit, way of passing variables and parameters around:
#!/bin/bash
dir="$1"
# Pick up second and third parameters, defaulting to 0.5 and 1.5 if unspecified
a=${2:-0.5}
b=${3:-1.5}
doit() {
file=$1
a=$2
b=$3
echo "File: $file, a=$a, b=$b"
cat "$1" | awk -F, -v a="$a" -v b="$b" '$1>a && $2<b'
}
export -f doit
find "$dir" -name '*.tst' | parallel doit {} "$a" "$b"
Alternatively, you can splice the shell arguments directly into the awk program (mind the quoting):
#!/bin/bash
doit() {
# $1, $2, $3 here are the shell arguments passed to doit
# the single-quoted '$1' and '$2' are field references inside the awk program
pigz -dc "$1" | awk -F, '$1>'$2' && $2<'$3
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit {} "$2" "$3" | pigz > output.csv.gz
Test it like this (note the -d, so that paste emits comma-separated data to match awk -F,):
paste -d, <(seq 10 | shuf) <(seq 10 | shuf) | gzip > h.csv.gz
./myscript.sh . 5 6
zcat output.csv.gz
I have 3 different files that I want to compare
words_freq
words_freq_deduped
words_freq_alpha
For each file, I run a command like so, which I iterate on constantly to compare the results.
For example, I would do this:
$ cat words_freq | grep -v '[soe]'
$ cat words_freq_deduped | grep -v '[soe]'
$ cat words_freq_alpha | grep -v '[soe]'
and then review the results, then do it again with an additional filter:
$ cat words_freq | grep -v '[soe]' | grep a | grep r | head -n20
a
$ cat words_freq_deduped | grep -v '[soe]' | grep a | grep r | head -n20
b
$ cat words_freq_alpha | grep -v '[soe]' | grep a | grep r | head -n20
c
This continues on until I've analyzed my data.
I would like to write a script that could take the piped portion and pass it to each of these files, as I iterate on the grep/head portions of the command.
e.g. the following would dump the results of running the 3 commands above, compare the 3 results, and print additional calculations on them:
$ myScript | grep -v '[soe]' | grep a | grep r | head -n20
the letters were in all 3 runs, and it took 5 seconds
a
b
c
How can I do this using bash/python or zsh for the myScript part?
EDIT: After asking the question, it occurred to me that I could use eval to do it, as shown below; I've added this as an answer as well.
The following approach allows me to process multiple files by using eval, which I know is frowned upon - any other suggestions are greatly appreciated!
$ myScript "grep -v '[soe]' | grep a | grep r | head -n20"
myScript
#!/usr/bin/env bash
function doIt(){
FILE=$1
CMD="cat $1 | $2"
echo processing file "$FILE"
eval "$CMD"
echo
}
doIt words_freq "$#"
doIt words_freq_deduped "$#"
doIt words_freq_alpha "$#"
You can't stop your shell from interpreting the pipes itself, so using it like that isn't very practical: you'd either have to quote the whole pipeline and eval it, which makes it hard to pass arguments containing spaces, or quote every pipe symbol individually and eval that instead. Either way, these solutions are hacky.
I'd suggest doing one of these two:
Keep your editor open and put whatever you want to run inside the doIt function itself before you run it. Then run it in your shell without any arguments:
#!/usr/bin/env bash
doIt() {
# grep -v '[soe]' < "$1"
grep -v '[soe]' < "$1" | grep a | grep r | head -n20
}
doIt words_freq
doIt words_freq_deduped
doIt words_freq_alpha
Or you could always use a for loop in your shell, which you can pull back out of your history with Ctrl+r whenever you want to reuse it:
$ for f in words_freq*; do grep -v '[soe]' < "$f" | grep a | grep r | head -n20; done
But if you really want your approach, I tried to make it accept spaces, but it ended up being even hackier:
#!/usr/bin/env bash
doIt() {
local FILE=$1
shift
echo processing file "$FILE"
local args=()
for n in $(seq 1 $#); do
arg=$1
shift
if [[ $arg == '|' ]]; then
args+=('|')
else
args+=("\"$arg\"")
fi
done
eval "cat '$FILE' | ${args[#]}"
}
doIt words_freq "$#"
doIt words_freq_deduped "$#"
doIt words_freq_alpha "$#"
With this version you can use it like this:
$ ./myScript grep "a a" "|" head -n1
Notice that it needs you to quote the |, and that it now handles arguments with spaces.
I may not have understood the problem fully. My reading is that you want to write a script without pipes, by including the filtering logic in the script and feeding the filtering patterns in as arguments.
Here is a gawk script (gawk is the standard awk on Linux) that makes one sweep over the 3 input files, without piping.
script.awk
BEGIN {
# set the record separator to something unlikely to be matched,
# so that each file is read in its entirety as a single record
RS = "!#!#!#!#!#!#!#";
}
# select files that do not match excludeRegEx but do match both includeRegEx1 and includeRegEx2
$0 !~ excludeRegEx && $0 ~ includeRegEx1 && $0 ~ includeRegEx2 {
system("head -n20 " FILENAME); # run the shell command "head -n20" on the current input file
}
Running script.awk
awk -v excludeRegEx='[soe]' \
-v includeRegEx1='a' \
-v includeRegEx2='r' \
-f script.awk words_freq words_freq_deduped words_freq_alpha
I have a function that generates a random file name
#generate random file names
get_rand_filename() {
if [ "$ASCIIONLY" == "1" ]; then
for ((i=0; i<$((MINFILENAMELEN+RANDOM%MAXFILENAMELEN)); i++)) {
printf \\$(printf '%03o' ${AARR[RANDOM%aarrcount]});
}
else
# no need to escape double quotes for filename
cat /dev/urandom | tr -dc '[ -~]' | tr -d '[$></~:`\\]' | head -c$((MINFILENAMELEN+RANDOM%MAXFILENAMELEN)) #| sed 's/\(["]\)/\\\1/g'
fi
printf "%s" $FILEEXT
}
export -f get_rand_filename
When I call it from within another function
cf(){
fD=$1
echo "the target dir recieved is " $fD
CFILE="$(get_rand_filename)"
echo "the file name is "$CFILE
}
export -f cf
when I call
echo "$targetdir" | xargs -0 sh -c 'cf $1' sh
I only get the $FILEEXT suffix (no random file name),
when I call
cf "$targetdir"
I get a valid result
I need to be able to handle spaces in the $targetdir and file name string.
echo "$targetdir" | xargs -0 sh -c 'cf $1' sh
You should invoke bash rather than sh. Function exporting is a bash feature.
$ foo() { echo bar; }
$ export -f foo
$ sh -c 'foo'
sh: 1: foo: not found
$ bash -c 'foo'
bar
Also, get rid of the -0 option since the input isn't NUL-separated. Use -d'\n' instead. And quote "$1" for robustness.
echo "$targetdir" | xargs -d'\n' bash -c 'cf "$1"' bash
Actually, you could use -0 if you change the input format.
printf '%s\0' "$targetdir" | xargs -0 bash -c 'cf "$1"' bash
For what it's worth, mktemp creates random temporary files, and does it safely. It makes sure the file doesn't already exist and then creates it to prevent anybody else from snatching up the name in the split second between the name being generated and it being returned to the caller.
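A minimal sketch of that approach (the template path is just an example; the XXXXXX run is replaced with random characters):
# create (not merely name) a unique temporary file and keep its path
tmpfile=$(mktemp /tmp/mydata.XXXXXX) || exit 1
echo "created $tmpfile"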
I have a bash function
agg_generror () {
echo $1
find ${folder} -name "${prefix}_*_${1}_${suffix}.count" | xargs -I % sh -c 'cat %; echo "";' | awk 'BEGIN{e=0;t=0} {e+=$1;t+=$2} END{print e/t}' > generror_${1}
}
which if I call directly
agg_generror 17.5
works and doesn't complain.
But if I do
echo 17.5 | xargs -I % sh -c 'agg_generror %'
It fails with
17.5
awk: fatal: division by zero attempted
Why might the behaviour differ in the two cases?
Most likely because xargs runs sh -c in a new shell, where ${folder}, ${prefix}, and ${suffix} are unset (they were never exported), so find matches no files, awk sums nothing, t stays 0, and the END block divides by zero. Running the loop in the current shell avoids the problem entirely:
while read; do agg_generror $REPLY; done < input.txt
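If you'd rather keep the xargs form, a sketch of the other fix (assuming folder, prefix, and suffix are ordinary variables in the calling shell): export the function and every variable it reads, and invoke bash rather than sh, since export -f is a bash feature:
# make the function and its variables visible to the child shell
export -f agg_generror
export folder prefix suffix
echo 17.5 | xargs -I % bash -c 'agg_generror %'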
OK, so I have the following script that updates Route 53 DNS entries. Unfortunately there is a limit to the number of calls per second you can make, so I need to make the final xargs command sleep for about a second between each iteration.
I've tried a couple of things like '{ ../cli53 blah; sleep 10; }' and I can't seem to get it to work. Does anyone have any suggestions?
#!/bin/bash
root=$(dirname "$0")
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v 'i-' | xargs --verbose -n 4 /usr/local/bin/cli53 rrcreate -x 5 contoso.com
Edit: Thanks, Etan, for the answer. Here is my solution for anyone else who needs it:
I had to include the -I switch in the xargs statement as well, to make sure the input was passed as parameters to cli53, but it all looks to be working nicely now.
#!/bin/bash
root=$(dirname "$0")
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v '^i-' |
xargs --verbose -n 4 -I myvar /bin/sh -c '{ /usr/local/bin/cli53 rrcreate -x 5 contoso.com 'myvar'; sleep 1; printf "\n\n"; }'
The simplest solution would be to simply put the cli53 and sleep calls in a script and use xargs to execute the script.
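For example (a sketch; the wrapper's name is made up):
#!/bin/sh
# throttled-cli53.sh: one cli53 call, then a pause to stay under the rate limit
/usr/local/bin/cli53 rrcreate -x 5 contoso.com "$@"
sleep 1
and then replace the tail of the pipeline with: ... | xargs --verbose -n 4 ./throttled-cli53.sh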
If you don't want to do that you should be able to do what you were trying to do with this:
... | xargs ... /bin/sh -c '{ /usr/local/bin/cli53 ... "$@"; sleep 10; }' -