I'm trying to use awk and GNU parallel to filter files based on the values in column 1 and column 2, and dump the result into a single .csv.gz file. Thanks to the answer here, I managed to write a script that does the job in parallel.
doit() {
    pigz -dc "$1" | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
and then run the script in the terminal.
./ /path/to/files
I'm wondering how I can pass 0.5 and 1.5 as arguments to the script, like this:
./ /path/to/files 0.5 1.5

This may be an easier, or more explicit, way of passing variables and parameters around:
# Pick up second and third parameters, defaulting to 0.5 and 1.5 if unspecified
dir=$1
a=${2:-0.5}
b=${3:-1.5}
doit() {
    # Inside doit, $1 $2 $3 are the file and the two thresholds passed by parallel
    file=$1; a=$2; b=$3
    echo "File: $file, a=$a, b=$b"
    cat "$1" | awk -F, -v a="$a" -v b="$b" '$1>a && $2<b'
}
export -f doit
find "$dir" -name '*.tst' | parallel doit {} "$a" "$b"

doit() {
    # $1 $2 $3 are arguments to doit
    # '$1' and '$2' are variables in awk
    pigz -dc "$1" | awk -F, '$1>'"$2"' && $2<'"$3"
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit {} "$2" "$3" | pigz > output.csv.gz
Call as:
paste -d, <(seq 10 | shuf) <(seq 10 | shuf) | gzip > h.csv.gz
./ . 5 6
zcat output.csv.gz
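
As a small sketch of the `-v` approach on its own, the filter can be isolated in a helper that takes the two thresholds as arguments. The name `filter_csv` and the variables `lo` and `hi` are hypothetical, chosen here for illustration. Passing the values with `-v` keeps them out of the awk program text, so no quote splicing is needed.

```shell
# filter_csv LO HI: read comma-separated rows on stdin and print those
# whose first column is greater than LO and whose second column is less than HI.
# (filter_csv, lo and hi are illustrative names, not part of the original script.)
filter_csv() {
    awk -F, -v lo="$1" -v hi="$2" '$1 > lo && $2 < hi'
}
```

Inside the script this would presumably be used as `pigz -dc "$1" | filter_csv "$2" "$3"` within `doit`, with `export -f filter_csv` added so that the shells started by parallel can see it.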


Bash function skips pv (pipe viewer)

Thanks to the answer here, I managed to write a script that filters files in parallel based on the values in column 1 and column 2.
doit() {
    pigz -dc "$1" | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
and then run it in the terminal.
./ /path/to/files
However, the pv (pipe viewer) command in the following script doesn't show the progress of the data sent through the pipe.
doit() {
    pv "$1" | pigz -dc | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
I'm wondering how I can show the pipeline progress (not the job progress you get with parallel --progress) using pv?

bash or zsh: how to pass multiple inputs to interactive piped parameters?

I have 3 different files that I want to compare
For each file, I run a command like so, which I iterate on constantly to compare the results.
For example, I would do this:
$ cat words_freq | grep -v '[soe]'
$ cat words_freq_deduped | grep -v '[soe]'
$ cat words_freq_alpha | grep -v '[soe]'
and then review the results, and then do it again, with an additional filter
$ cat words_freq | grep -v '[soe]' | grep a | grep r | head -n20
$ cat words_freq_deduped | grep -v '[soe]' | grep a | grep r | head -n20
$ cat words_freq_alpha | grep -v '[soe]' | grep a | grep r | head -n20
This continues on until I've analyzed my data.
I would like to write a script that could take the piped portions, and pass it to each of these files, as I iterate on the grep/head portions of the command.
e.g. The following would dump the results of running the 3 commands above AND also compare the 3 results, and dump additional calculations on them
$ myScript | grep -v '[soe]' | grep a | grep r | head -n20
the letters were in all 3 runs, and it took 5 seconds
How can I do this using bash/python or zsh for the myScript part?
EDIT: After asking the question, it occurred to me that I could use eval to do it, like so, which I've added as an answer as well
The following approach allows me to process multiple files by using eval, which I know is frowned upon - any other suggestions are greatly appreciated!
$ myScript "grep -v '[soe]' | grep a | grep r | head -n20"
#!/usr/bin/env bash
function doIt(){
    FILE=$1
    CMD="cat $1 | $2"
    echo processing file "$FILE"
    eval "$CMD"
}
doIt words_freq "$@"
doIt words_freq_deduped "$@"
doIt words_freq_alpha "$@"
You can't stop your shell from interpreting the pipes itself, so using it like that isn't very practical. You'd need to either quote the whole pipeline and eval it, which makes it hard to pass arguments containing spaces, or quote every individual pipe and eval the result. Both solutions are somewhat hacky.
I'd suggest doing one of these two:
Keep your editor open, and put whatever you want to run inside the doIt function itself before you run it. Then run it in your shell without any arguments:
#!/usr/bin/env bash
doIt() {
    # grep -v '[soe]' < "$1"
    grep -v '[soe]' < "$1" | grep a | grep r | head -n20
}
doIt words_freq
doIt words_freq_deduped
doIt words_freq_alpha
Or you could use a "for" loop directly in your shell, and recall it from your history with Ctrl+r whenever you need it:
$ for f in words_freq*; do grep -v '[soe]' < "$f" | grep a | grep r | head -n20; done
But if you really want your approach, I tried to make it accept spaces, but it ended up being even hackier:
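#!/usr/bin/env bash
doIt() {
    local FILE=$1
    shift
    echo processing file "$FILE"
    local args=()
    for n in $(seq 1 $#); do
        arg=$1
        shift
        if [[ $arg == '|' ]]; then
            # a quoted "|" on the command line becomes a real pipe
            args+=('|')
        else
            # re-quote every other word so that spaces survive the eval
            args+=("$(printf '%q' "$arg")")
        fi
    done
    eval "cat '$FILE' | ${args[*]}"
}
doIt words_freq "$@"
doIt words_freq_deduped "$@"
doIt words_freq_alpha "$@"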
With this version you can use it like this:
$ ./myScript grep "a a" "|" head -n1
Note that you need to quote the |, and that it now handles arguments with spaces.
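
A third option that avoids eval entirely, sketched here under the assumption that editing a function between runs is acceptable: keep the pipeline being iterated on in one function and loop over the files. The names `filter` and `run_on_all` are hypothetical.

```shell
# The pipeline under iteration lives in one place; edit this function between runs.
# (filter and run_on_all are illustrative names.)
filter() {
    grep -v '[soe]' | grep a | grep r | head -n20
}

# Apply the filter to every file given as an argument, labelling each result.
run_on_all() {
    for f in "$@"; do
        echo "processing file $f"
        filter < "$f"
    done
}
```

It would be called as `run_on_all words_freq words_freq_deduped words_freq_alpha`. Because the pipeline is ordinary shell code rather than a string, arguments with spaces need no special handling.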
I'm not sure I understood the problem correctly.
As I understand it, you want a script without pipes, with the filtering logic built into the script and the filtering patterns passed in as arguments.
Here is a gawk script (gawk is the standard awk on Linux).
It handles the 3 input files in one sweep, without piping.
# set record separator to something unlikely matched, causing each file to be read entirely as a single record
BEGIN { RS = "^$" }               # "^$" is the usual gawk idiom for this (assumed; the original value was lost)
$0 !~ excludeRegEx &&             # if file does not match excludeRegEx
$0 ~ includeRegEx1 &&             # and match includeRegEx1
$0 ~ includeRegEx2 {              # and match includeRegEx2
    system("head -n20 " FILENAME) # call shell command "head -n20 " on current filename
}
Running script.awk
awk -v excludeRegEx='[soe]' \
-v includeRegEx1='a' \
-v includeRegEx2='r' \
-f script.awk words_freq words_freq_deduped words_freq_alpha
Bash - Counter for multiple parameters in file

I created a command which works, but not exactly as I want. I would like to change it so that it produces the right output.
My command:
awk '{print $1}' | sort | uniq -c | sort -nr
Output of my command:
Input file:
Output I need(hashtags instead of numbers): (8): ######## (2): ##
cat | sort | uniq -c | awk 'ht="#"{for(i=1;i<$1;i++){ht=ht"#"} str=sprintf("%s (%d): %s", $2,$1, ht); print str}'
expecting file with content like:
Using xargs with sh and printf. Comments in between the lines. Live version at tutorialspoint.
# sorry cat
cat <<EOF |
# for each 2 arguments
xargs -n2 sh -c '
# format the output as "$2 ($1): "
printf "%s (%s): " "$2" "$1"
# repeat the character `#` $1 times
seq "$1" | xargs printf "#%.0s"
# lastly a newline
printf "\n"
' --
I think we could shorten that a bit with:
xargs -n2 sh -c 'printf "%s (%s): %s\n" "$2" "$1" $(printf "#%.0s" $(seq $1))' --
or maybe just echo, if the input is sufficiently safe:
xargs -n2 sh -c 'echo "$2 ($1): $(printf "#%.0s" $(seq $1))"' --
You can upgrade your command by adding another awk to the list, or you can just use a single awk for the whole thing:
awk '{a[$1]++}
END { for(i in a) {
    printf "%s (%d): ", i, a[i]
    for(j=0;j<a[i];++j) printf "#"; printf "\n"
  }
}' file
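
For a version that can be tried without an input file, here is a minimal sketch of the single-awk approach wrapped in a function that reads stdin. The name `histogram` is hypothetical, and the sample input in the usage note is made up for illustration. Since `for (k in count)` visits keys in an unspecified order, the output is sorted by count afterwards.

```shell
# histogram: count occurrences of the first field on stdin and print
# "name (count): ####" with one '#' per occurrence, highest count first.
# (histogram is an illustrative name.)
histogram() {
    awk '{ count[$1]++ }
    END {
        for (k in count) {
            bar = ""
            for (j = 0; j < count[k]; j++) bar = bar "#"
            printf "%s (%d): %s\n", k, count[k], bar
        }
    }' | sort -t'(' -k2,2nr
}
```

For example, `printf 'a\nb\na\na\nb\nc\n' | histogram` prints one line each for a, b and c, with three, two and one hash marks respectively.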

Splitting out a large file

I would like to process a 200 GB file with lines like the following:
{"captureTime": "1534303617.738","ua": "..."}
The objective is to split this file into multiple files grouped by hours.
Here is my basic script:
echo "Splitting files"
echo "Total lines"
sed -n '$=' $1
echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while read p; do
date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
echo $p >> split.$date
done <$1
Some facts:
80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.
Could you help me to optimize this bash script?
Thank you
This awk solution might come to your rescue:
awk -F'"' '{file=strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1
It essentially replaces your while-loop.
Furthermore, you can replace the complete script with:
# Start AWK file
BEGIN { FS = "\"" }
# track the smallest and largest timestamps seen (tmin is seeded from the first line)
(NR == 1 || $4 < tmin) { tmin = $4 }
($4 > tmax) { tmax = $4 }
{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }
END {
    print "Total lines processed: ", NR
    print "First date: " strftime("%Y%m%d%H", tmin)
    print "Last date: " strftime("%Y%m%d%H", tmax)
}
Which you then can run as:
awk -f <awk_file.awk> <jq-file>
Note: the usage of strftime indicates that you need to use GNU awk.
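
If GNU awk is not available, a rough alternative is to bucket by epoch hour with plain arithmetic instead of `strftime`. This is only a sketch under two assumptions: naming the output files after the epoch second at which each hour starts is acceptable, and the lines are mostly in time order, so the output file is closed only when the hour changes rather than after every line. The function name `split_by_hour` and the output directory argument are hypothetical.

```shell
# split_by_hour INPUT OUTDIR: append each line of INPUT to
# OUTDIR/split.<epoch second at which its hour starts>.
# (split_by_hour and OUTDIR are illustrative; field 4 is the captureTime
# value when the line is split on double quotes.)
split_by_hour() {
    awk -F'"' -v dir="$2" '
    {
        # round the timestamp down to the start of its hour
        bucket = sprintf("%.0f", int($4 / 3600) * 3600)
        file = dir "/split." bucket
        if (file != prev) {
            # hour changed: release the previous file handle
            if (prev != "") close(prev)
            prev = file
        }
        print >> file
    }' "$1"
}
```

Lines whose fourth field is not numeric end up in `split.0`, which may be a convenient place to collect the invalid JSON lines for inspection.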
You can start optimizing by replacing this:
sed 's/{"captureTime": "//' | sed 's/","ua":.*//'
with this:
sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
-n suppress automatic printing of pattern space
-E use extended regular expressions in the script

Bash Xargs Sleep (Multiple Command Line Arguments)

OK, so I have the following script that updates Route 53 DNS entries. Unfortunately, there is a limit on the number of calls you can make per second, so I need to make the final xargs command sleep for about a second between each iteration.
I've tried a couple of things like ' {../cli53 blah; sleep 10; } ' and I can't seem to get it to work. Does anyone have any suggestions, please?
set root='dirname $0'
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v 'i-' | xargs --verbose -n 4 /usr/local/bin/cli53 rrcreate -x 5
Edit: Thanks Etan for the Answer. Here is my solution for anyone else that needs it:
I had to include the -I %variable% switch in the xargs statement as well, to make sure that the input was passed as parameters to cli53, but it all looks to be working nicely now.
set root='dirname $0'
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v '^i-' |
xargs --verbose -n 4 -I myvar /bin/sh -c '{ /usr/local/bin/cli53 rrcreate -x 5 'myvar'; sleep 1; printf "\n\n"; }'
The simplest solution would be to simply put the cli53 and sleep calls in a script and use xargs to execute the script.
If you don't want to do that you should be able to do what you were trying to do with this:
... | xargs ... /bin/sh -c '{ /usr/local/bin/cli53 ... "$@"; sleep 10; }' -
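
To see the pattern work in isolation, here is a minimal sketch in which `printf` stands in for the real cli53 call. The function name `throttled` and the delay argument are hypothetical. xargs appends each group of four input words after the fixed arguments, so inside the inline script `$1` is the delay and the remaining positional parameters are the record.

```shell
# throttled DELAY: read whitespace-separated words on stdin, four per call,
# print the command that would be run, then sleep DELAY seconds.
# (throttled is an illustrative name; printf stands in for cli53.)
throttled() {
    xargs -n 4 sh -c 'delay=$1; shift; printf "%s\n" "rrcreate $*"; sleep "$delay"' _ "$1"
}
```

The `_` fills `$0` of the inline shell, playing the same role as the trailing `-` in the command above. To run the real tool, the `printf` would presumably be replaced by `/usr/local/bin/cli53 rrcreate -x 5 "$@"`.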
