How to evaluate conditions passed to awk?

A script written in Bash passes arguments to an Awk script, such as sample_name==10.
Awk then finds which column in a table corresponds to sample_name and rewrites the left-hand side of the expression accordingly, e.g. $1 == 10.
But I don't know how to actually evaluate the condition when it's stored as a variable. The problem is mainly that we want to be able to pass in all kinds of conditions, including regular expressions.
So I have coded some workarounds, which have caused the script to grow well beyond its original size.
for (c in where_col) {
    if ((where_math[c] == "==" && $where_idx[c] == where_val[c]) ||
        (where_math[c] == ">=" && $where_idx[c] >= where_val[c]) ||
        (where_math[c] == "<=" && $where_idx[c] <= where_val[c]) ||
        (where_math[c] == "!=" && $where_idx[c] != where_val[c]) ||
        (where_math[c] == ">"  && $where_idx[c] >  where_val[c]) ||
        (where_math[c] == "~"  && $where_idx[c] ~  where_val[c]) ||
        (where_math[c] == "<"  && $where_idx[c] <  where_val[c])) {
        # some action
    }
}
Although it works now, I'm looking for a way to do this more cleanly.

You would probably do this by meta-programming:
You'd generate the awk script to execute. The additional variable-expansion step allows you to insert, e.g., <= into the code. But it takes some extra thought about reliability, as you don't want to allow generating invalid or insecure scripts.
You can probably do this inline with a here-doc in Bash quite easily.
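For instance, a minimal sketch of that approach (the build_filter name and the operator whitelist are mine, not from the question): only a whitelisted operator and a numeric column index are spliced into the generated program text, while the value travels safely through -v.

```shell
# Sketch: generate the awk program in the shell, whitelisting the operator
# so arbitrary code cannot be injected into the generated script.
build_filter() {
    local col=$1 op=$2 val=$3    # col is assumed to be a numeric column index
    case $op in
        '=='|'!='|'<'|'<='|'>'|'>='|'~') ;;   # allowed operators
        *) echo "unsupported operator: $op" >&2; return 1 ;;
    esac
    # Only $col and $op are spliced into the program text; the value goes
    # in via -v, so it can even be a regular expression when the op is ~.
    awk -v val="$val" "\$$col $op val"
}

printf '%s\n' 5 10 15 | build_filter 1 '>=' 10    # prints 10 and 15
```

Anything not in the whitelist is rejected before awk ever runs, which closes the injection hole that naive string splicing opens.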

Awk does not have the eval-type function you seek, but (as you are doing) it can be used to write an evaluator. Maybe something along the lines of writing it with the language instead of in the language will get you closer. Otherwise I'm not sure awk is your path of least resistance.
awk -v "lhs=$lhs" -v "op=$op" -v "rhs=$rhs" '
op == "==" {result = lhs == rhs}
op == ">=" {result = lhs >= rhs}
op == "<=" {result = lhs <= rhs}
op == "!=" {result = lhs != rhs}
op == ">"  {result = lhs >  rhs}
op == "~"  {result = lhs ~  rhs}
op == "<"  {result = lhs <  rhs}
END { # some action involving result
}'


Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it does not already exist.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" $2
then
return 1
else
echo "$1" >> $2
fi
Example usage
append 'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt
I am on macOS, by the way, which has presented some problems, as some of the solutions I've seen posted elsewhere are very Linux-specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
return 1
else
printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
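For completeness, a sketch that wraps the fragment above into a full function (the name append comes from the question's usage example; the || true masks read's non-zero status at end-of-file):

```shell
append() {
    local contents
    # Slurp the whole file; read exits non-zero at EOF because no NUL
    # delimiter is found, so mask that status.
    IFS= read -r -d '' contents <"$2" || true
    if [[ $contents == *"$1"* ]]; then
        return 1                          # already present
    else
        printf '%s\n' "$1" >>"$2"
    fi
}

f=$(mktemp)
append $'sometext\nthat spans' "$f"       # appended
append $'sometext\nthat spans' "$f"       # already there: returns 1
cat "$f"
rm -f "$f"
```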
Using awk:
awk '
BEGIN {
n = 0 # length of pattern in lines
m = 0 # number of matching lines
}
NR == FNR {
pat[n++] = $0
next
}
{
if ($0 == pat[m])
m++
else if (m > 0 && $0 == pat[0])
m = 1
else
m = 0
}
m == n {
exit
}
END {
if (m < n) {
for (i = 0; i < n; i++)
print pat[i] >>FILENAME
}
}
' - "$2" <<EOF
$1
EOF
If necessary, one would need to properly escape any regex metacharacters in the pattern, since it is used here as FS / OFS. The trick: slurp the whole input as a single record (RS = "^$"), and make the pattern itself the field separator. If the pattern occurs, NF > 1 and the record prints unchanged; otherwise incrementing NF appends one OFS, i.e. the pattern, to the record:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

Count occurrences in a csv with Bash

I have to create a script that, given a country and a sport, outputs the number of medalists and the number of medals won, after reading a csv file.
The csv is called "athletes.csv" and has this header:
id|name|nationality|sex|date_of_birth|height|weight|sport|gold|silver|bronze|info
When you call the script you have to pass the nationality and sport as parameters.
The script I have created is this one:
#!/bin/bash
participants=0
medals=0
while IFS=, read -ra array
do
if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
then
participants=$participants++
medals=$(($medals+${array[8]}+${array[9]}+${array[10]))
fi
done < athletes.csv
echo $participants
echo $medals
where array[3] is the nationality, array[8] is the sport and array[9] to [11] are the numbers of medals won.
When I run the script with the correct parameters I get 0 participants and 0 medals.
Could you help me understand what I'm doing wrong?
Note: I cannot use awk or grep.
Thanks in advance
Try this:
#! /bin/bash -p
nation_arg=$1
sport_arg=$2
declare -i participants=0
declare -i medals=0
declare -i line_num=0
while IFS=, read -r _ _ nation _ _ _ _ sport ngold nsilver nbronze _; do
(( ++line_num == 1 )) && continue # Skip the header
[[ $nation == "$nation_arg" && $sport == "$sport_arg" ]] || continue
participants+=1
medals+=ngold+nsilver+nbronze
done <athletes.csv
declare -p participants
declare -p medals
The code uses named variables instead of numbered positional parameters and array indexes to try to improve readability and maintainability.
Using declare -i means that strings assigned to the declared variables are treated as arithmetic expressions. That reduces clutter by avoiding the need for $(( ... )).
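The difference is easy to demonstrate in isolation:

```shell
declare -i medals=0
medals+=3+2+1        # with -i, the string "3+2+1" is evaluated arithmetically
echo "$medals"       # prints 6

unset medals         # also drops the -i attribute
medals=0
medals+=3+2+1        # without -i, += is plain string concatenation
echo "$medals"       # prints 03+2+1
```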
The code assumes that the field separator in the CSV file is a comma (,), not a pipe (|) as in the header. If the separator really is |, replace IFS=, with IFS='|'.
I'm assuming that the field delimiter of your CSV file is a comma but you can set it to whatever character you need.
Here's a fixed version of your code:
#!/bin/bash
participants=0
medals=0
{
# skip the header
read
# process the records
while IFS=',' read -ra array
do
if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
then
(( participants++ ))
medals=$(( medals + array[8] + array[9] + array[10] ))
fi
done
} < athletes.csv
echo "$participants" "$medals"
Remark: as $1 and $2 are left unquoted, they are subject to glob matching (on the right side of [[ ... == ... ]]). For example, you'll be able to show the total number of medals won by the US with:
./script.sh 'US' '*'
But I have to say, doing text processing in pure shell isn't considered good practice; there exist dedicated tools for that. Here's an example with awk:
awk -v FS=',' -v country="$1" -v sport="$2" '
BEGIN {
participants = medals = 0
}
NR == 1 { next }
$3 == country && $8 == sport {
participants++
medals += $9 + $10 + $11
}
END { print participants, medals }
' athletes.csv
There's also a potential problem remaining: the CSV format might need a real CSV parser to be read accurately. There exist a few awk libraries for that, but IMHO it's simpler to use a CSV-aware tool that provides the functionality you need.
Here's an example with Miller:
mlr --icsv --ifs=',' filter -s country="$1" -s sport="$2" '
begin {
@participants = 0;
@medals = 0;
}
if ($nationality == @country && $sport == @sport) {
@participants += 1;
@medals += $gold + $silver + $bronze;
}
false;  # discard every record; we only want the end-block output
end { print @participants . " " . @medals; }
' athletes.csv

Matching a number against a comma-separated sequence of ranges

I'm writing a bash script which takes a number, and also a comma-separated sequence of values and ranges, e.g. 3,15,4-7,19-20. I want to check whether the number is contained in the set corresponding to the sequence. For simplicity, assume no comma-separated elements intersect and that the elements are sorted in ascending order.
Is there a simple way to do this in bash other than the brute-force naive way? Maybe some shell utility that does something like that for me, perhaps something related to lpr, which already knows how to process page-range sequences, etc.
Is awk cheating?:
$ echo -n 3,15,4-7,19-20 |
awk -v val=6 -v RS=, -F- '(NF==1&&$1==val) || (NF==2&&$1<=val&&$2>=val)' -
Output:
4-7
Another version:
$ echo 19 |
awk -v ranges=3,15,4-7,19-20 '
BEGIN {
split(ranges,a,/,/)
}
{
for(i in a) {
n=split(a[i],b,/-/)
if((n==1 && $1==a[i]) || (n==2 && $1>=b[1] && $1<=b[2]))
print a[i]
}
}' -
Outputs:
19-20
The latter is better as you can feed it more values from a file etc. Then again the former is shorter. :D
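A sketch of the multi-value case (the sample values here are mine):

```shell
# Feed several values at once; each line of input is checked against
# every element of the comma-separated range list.
printf '%s\n' 3 6 19 42 |
awk -v ranges=3,15,4-7,19-20 '
BEGIN { split(ranges, a, /,/) }
{
    for (i in a) {
        n = split(a[i], b, /-/)
        if ((n == 1 && $1 == a[i]) || (n == 2 && $1 >= b[1] && $1 <= b[2]))
            print $1, "matches", a[i]
    }
}' -
```

Since the elements are assumed not to intersect, each input line yields at most one match (42 yields none).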
Pure bash:
check() {
IFS=, a=($2)
for b in "${a[@]}"; do
IFS=- c=($b); c+=(${c[0]})
(( $1 >= c[0] && $1 <= c[1] )) && break
done
}
$ check 6 '3,15,4-7,19-20' && echo "yes" || echo "no"
yes
$ check 42 '3,15,4-7,19-20' && echo "yes" || echo "no"
no
As bash is tagged, why not just
inrange() { for r in ${2//,/ }; do ((${r%-*}<=$1 && $1<=${r#*-})) && break; done; }
Then test it as usual:
$ inrange 6 3,15,4-7,19-20 && echo yes || echo no
yes
$ inrange 42 3,15,4-7,19-20 && echo yes || echo no
no
A function based on @JamesBrown's method:
function match_in_range_seq {
(( $# == 2 )) && [[ -n "$(echo -n "$2" | awk -v val="$1" -v RS=, -F- '(NF==1&&$1==val) || (NF==2&&$1<=val&&$2>=val)' - )" ]]
}
Will return 0 (in $?) if the second argument (the range sequence) contains the first argument, 1 otherwise.
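A quick usage check, mirroring the earlier yes/no tests:

```shell
function match_in_range_seq {
    (( $# == 2 )) && [[ -n "$(echo -n "$2" | awk -v val="$1" -v RS=, -F- \
        '(NF==1&&$1==val) || (NF==2&&$1<=val&&$2>=val)' -)" ]]
}

match_in_range_seq 6  '3,15,4-7,19-20' && echo yes || echo no    # yes
match_in_range_seq 42 '3,15,4-7,19-20' && echo yes || echo no    # no
```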
Another awk idea using two input (-v) variables:
# use of function wrapper is optional but cleaner for the follow-on test run
in_range() {
awk -v value="$1" -v r="$2" '
BEGIN { n=split(r,ranges,",")
for (i=1;i<=n;i++) {
low=high=ranges[i]
if (ranges[i] ~ "-") {
split(ranges[i],x,"-")
low=x[1]
high=x[2]
}
if (value >= low && value <= high) {
print value,"found in the range:",ranges[i]
exit
}
}
}'
}
NOTE: the exit assumes no overlapping ranges, i.e., a value will not be found in more than one range
Take for a test spin:
ranges='3,15,4-7,19-20'
for value in 1 6 15 32
do
echo "########### value = ${value}"
in_range "${value}" "${ranges}"
done
This generates:
########### value = 1
########### value = 6
6 found in the range: 4-7
########### value = 15
15 found in the range: 15
########### value = 32
NOTES:
OP did not mention what to generate as output if no range match is found; code could be modified to output a 'not found' message as needed
in a comment OP mentioned possibly running the search for a number of values; code could be modified to support such a requirement but would need more input (eg, format of list of values, desired output and how to be used/captured by calling process, etc)

Values based comparison in Unix

I have two variables like below .
a=rw,bg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock
b=bg,rg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock
My if condition fails because it compares fields by position: it looks for the rw value from variable a at position 1 of variable b, but the fields of b are in a different order.
How can I compare the two lines even though the order of the fields is not the same?
This script seems to work:
a="rw,bg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock"
b="bg,rg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock"
{ echo "$a"; echo "$b"; } |
awk -F, \
'NR == 1 { for (i = 1; i <= NF; i++) a[$i] = 1 }
NR == 2 { for (i = 1; i <= NF; i++)
{
if ($i in a)
delete a[$i]
else
{
mismatch++
print "Unmatched item (row 2):", $i
}
}
}
END {
for (i in a)
{
print "Unmatched item (row 1):", i
mismatch++
}
if (mismatch > 0)
print mismatch, "Mismatches"
else
print "Rows are the same"
}'
Example runs:
$ bash pairing.sh
Unmatched item (row 2): rg
Unmatched item (row 1): rw
2 Mismatches
$ sed -i.bak 's/,rg,/,rw,/' pairing.sh
$ bash pairing.sh
Rows are the same
$
There are undoubtedly ways to make the script more compact, but the code is fairly straightforward. If a field appears twice in the second row but only once in the first row, the second occurrence will be reported as an unmatched item. The code doesn't check for duplicates while processing the first row; it's an easy exercise to make it do so. The code doesn't print the input data for verification; it probably should.
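A shorter alternative sketch (mine, not part of the answer above): normalize each list by sorting its fields, then compare the normalized forms; diff then shows exactly which fields differ. Unlike the awk script, this treats duplicated fields as significant.

```shell
a=rw,bg,hard,timeo=600,nfsvers=3
b=bg,rg,hard,timeo=600,nfsvers=3

# norm: one field per line, sorted, so field order no longer matters
norm() { tr ',' '\n' <<<"$1" | sort; }

if [ "$(norm "$a")" = "$(norm "$b")" ]; then
    echo "Rows are the same"
else
    echo "Mismatches:"
    diff <(norm "$a") <(norm "$b") || true   # diff exits non-zero when rows differ
fi
```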

How to pass a condition file to an awk program?

I have one file of conditions and one input file for awk. I want to execute the conditions one by one in the awk program. How can I pass each condition to the awk program?
cond.txt
---------
a[NR-1]==a[NR-2] && b[NR-1] > 0 && c[NR-1] > 0 && d[NR-1] > 0
a[NR-1]==a[NR-2] && b[NR-1] > 0 && c[NR-1] > 0 && d[NR-1] < 0
a[NR-1]==a[NR-2] && b[NR-1] > 0 && c[NR-1] < 0 && d[NR-1] > 0
a[NR-1]==a[NR-2] && b[NR-1] > 0 && c[NR-1] < 0 && d[NR-1] < 0
---------
prog.sh
--------
while read aa
do
((d++))
awk -v CC="$aa" -F ";" '{ a[NR]=$1;b[NR]=$2;c[NR]=$3;d[NR]=$4;if(CC){ printf "%s;%8.2f;%8.2f;%8.2f; \n",$1,$2,$3,$4 } }' input > output-$d
done <cond.txt
input
--------
p; -415.98; 428.49; -422.24;
p; 232.55; 234.85; 233.7;
p; -440.35; 444.42; -442.38;
p; 17.05; 17.09; 17.07;
p; 351.25; -355.35; -353.3;
p; 366.89; -371.28; 369.08;
n; 11.97; 12.17; 12.07;
n; 506.93; 509.15; 508.04;
n; 306.9; 314.7; 310.8;
n; 381.1; 381.94; 381.52;
n; 84.12; 84.33; 84.22;
n; 237.36; 240.73; 239.05;
n; 345.51; 352.49; 349;
Thank you in advance
That is an unusual and somewhat unconventional arrangement, and potentially insecure. But you can use the shell to generate the Awk script from the input strings. You almost had it -- just use the shell's variable interpolation to inject the condition into the string you pass to Awk.
#!/bin/bash
while read -r cond; do
((++d))
awk -F ';' '{ a[NR]=$1; b[NR]=$2; c[NR]=$3; d[NR]=$4 }
'"$cond"' { printf "%s;%8.2f;%8.2f;%8.2f; \n", $1, $2, $3, $4 }' input > "output-$d"
done <cond.txt
The if(CC) test goes away: the shell interpolates each condition directly into the program text, where it acts as an Awk pattern. Note that the a, b, c, d array assignments have to stay, because the conditions in cond.txt refer to them.
The ((arithmetic)) is a Bash feature so I changed the shebang. Notice also the use of read -r to avoid the pesky legacy default behavior of read.
The conditions seem to be rather tightly coupled to your solution so I would perhaps embed them in a here document rather than store them in a separate file.
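A sketch of that here-document variant (the three sample input lines and the temp directory are mine; the array-filling rule is kept because the conditions refer to the a, b, c and d arrays):

```shell
#!/bin/bash
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'p; 1.5; 2.5; 3.5;' 'p; 4.5; 5.5; 6.5;' 'p; 7.5; 8.5; 9.5;' > input

d=0
while read -r cond; do
    ((++d))
    awk -F ';' '
        { a[NR]=$1; b[NR]=$2; c[NR]=$3; d[NR]=$4 }
        '"$cond"' { printf "%s;%8.2f;%8.2f;%8.2f;\n", $1, $2, $3, $4 }
    ' input > "output-$d"
done <<'EOF'
a[NR-1]==a[NR-2] && b[NR-1] > 0 && c[NR-1] > 0 && d[NR-1] > 0
a[NR-1]==a[NR-2] && b[NR-1] > 0 && c[NR-1] > 0 && d[NR-1] < 0
EOF

wc -l output-1 output-2    # output-1 gets one matching record, output-2 none
```

Keeping the conditions next to the loop that uses them avoids the loose coupling of a separate cond.txt.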
I am not sure -v is the right way to do it. But you can achieve the same goal through bash variable like this:
while read condition; do awk '{ if('"$condition"') { ... }}' input; done < cond.txt
Why? To parse the same file 4 times testing a different condition each time, just do this instead:
awk -F';' '
BEGIN { for (i=1; i<=4; i++) { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ } }
{ a[FNR]=$1; b[FNR]=$2; c[FNR]=$3; d[FNR]=$4 }
(ARGIND==1 && a[FNR-1]==a[FNR-2] && b[FNR-1] > 0 && c[FNR-1] > 0 && d[FNR-1] > 0) ||
(ARGIND==2 && a[FNR-1]==a[FNR-2] && b[FNR-1] > 0 && c[FNR-1] > 0 && d[FNR-1] < 0) ||
(ARGIND==3 && a[FNR-1]==a[FNR-2] && b[FNR-1] > 0 && c[FNR-1] < 0 && d[FNR-1] > 0) ||
(ARGIND==4 && a[FNR-1]==a[FNR-2] && b[FNR-1] > 0 && c[FNR-1] < 0 && d[FNR-1] < 0) {
printf "%s;%8.2f;%8.2f;%8.2f; \n",$1,$2,$3,$4 > ("output-" ARGIND)
}
' input
The above uses gawk's ARGIND to determine which iteration of reading the file you're on. If you don't have that, just add a line at the top that says FNR==1{ARGIND++}.
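The fallback can be sanity-checked on its own: FNR resets to 1 at the start of every file, so bumping a counter there counts input files exactly like ARGIND would (the temp file here is mine):

```shell
f=$(mktemp)
printf 'x\n' > "$f"
# Portable substitute for gawk's ARGIND: increment on the first record of each file.
awk 'FNR == 1 { ai++ } { print ai }' "$f" "$f" "$f"    # prints 1, 2, 3
rm -f "$f"
```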
