awk match by variable with dot in it - bash

I have a script that iterates over a file containing domains (google.com, youtube.com, etc.). The purpose of the script is to check how many times each domain appears in the 12th column of a tab-separated value file.
while read domain; do
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
done < domains.txt
However, awk seems to be interpreting the dots in the domains as a special character. The following error message is shown:
awk: syntax error at source line 1
context is
$12 ~ >>> google. <<< com
awk: bailing out at source line 1
I am a beginner in bash so any help would be greatly appreciated!

When you write:
domain='google.com'
awk -F '\t' '$12 == '$domain'' data.txt
the $domain is outside of any quotes:
awk -F '\t' '$12 == '$domain' ' data.txt
            <       >       < >
            start   end     start/end
and so exposed to the shell for interpretation first and THEN it becomes part of the body of the awk script before awk sees it. So what awk sees is:
awk -F '\t' '$12 == google.com' data.txt
and google.com is not a valid symbol (e.g. variable or function) name nor string nor number. What you MEANT to do was:
awk -F '\t' '$12 == "'"$domain"'"' data.txt
so the shell would see "$domain" instead of just $domain (see https://mywiki.wooledge.org/Quotes for why that's important) and awk would finally see:
awk -F '\t' '$12 == "google.com"' data.txt
which is fine, as "google.com" is now a string rather than a symbol. However, you should never allow shell variables to expand to become part of an awk script, as there are other caveats. What you should really have done is:
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt
See How do I use shell variables in an awk script? for more information.
By the way, even after fixing the above problem do not do this:
while read domain; do
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt | wc -l
done < domains.txt
as it'll be immensely slow and contains insidious bugs (see "Why is using a shell loop to process text considered bad practice?"). Do something like this instead (untested):
awk -F'\t' '
    NR==FNR {
        cnt[$1] = 0
        next
    }
    $12 in cnt {
        cnt[$12]++
    }
    END {
        for ( dom in cnt ) {
            print dom, cnt[dom]
        }
    }
' domains.txt data.txt
That will be far more efficient, robust, and portable than calling awk inside a shell read loop.
See What are NR and FNR and what does "NR==FNR" imply? for how that awk script works. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn awk.
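To see the two-file approach run end to end, here is a small sketch. The file names match the question, but the contents are made up for illustration, and the final `sort` is only there because `for (dom in cnt)` iterates in an unspecified order.

```shell
# Hypothetical sample data: file names follow the question, contents are invented.
tmp=$(mktemp -d)
printf '%s\n' google.com youtube.com example.org > "$tmp/domains.txt"

# Build three 12-column, tab-separated rows; only column 12 matters here.
for d in google.com youtube.com google.com; do
    printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\t%s\n' "$d"
done > "$tmp/data.txt"

# First file seeds every domain with 0, second file increments the known ones.
counts=$(awk -F'\t' '
    NR==FNR { cnt[$1] = 0; next }
    $12 in cnt { cnt[$12]++ }
    END { for (dom in cnt) print dom, cnt[dom] }
' "$tmp/domains.txt" "$tmp/data.txt" | sort)

printf '%s\n' "$counts"
rm -rf "$tmp"
```

Domains that never occur in data.txt (example.org in this made-up data) are still reported, with a count of 0, because the first pass initialises them.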

awk -F '\t' '$12 == '$domain'' data.txt | wc -l
The single quotes are building an awk program. They are not something visible to awk. So awk sees this:
$12 == google.com
Since there aren't any quotes around google.com, that is a syntax error. You just need to add quotation marks.
awk -F '\t' '$12 == "'"$domain"'"' data.txt
The quotes jammed together like that are a little confusing, but it's just this:
'....' stuff to send to awk. Single quotes are for the shell.
'..."...' a double quote inside the awk program for awk to see
'...'"..." stuff in double quotes _outside_ the awk program for the shell
We can combine those like this:
'..."'"$var"'"...'
That's a bunch of literal awk code ending in a double quote, followed by the expansion of the shell parameter var (double-quoted, as usual in the shell, for safety), followed by more literal awk code starting with a double quote. So the end result is a string passed to awk that includes the value of var inside double quotes.
But you don't have to be so fancy or confusing since awk provides the -v option to set variables from the shell:
awk -v domain="$domain" '$12 == domain' data.txt
Since the domain is not quoted inside the awk code, it is interpreted as the name of a variable. (Periods are not legal in variable names, which is why you got a syntax error with your domains; if you hadn't, though, awk would have treated them as empty and been looking for lines whose twelfth field was likewise blank.)
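As a quick check that the dots are harmless once the value arrives via -v, the sketch below compares `==` with `~` (the operator that appears in the question's error message). The two input rows are invented for the demonstration.

```shell
domain='google.com'

# Two made-up rows: column 12 is an exact match in the first row only.
rows=$(printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tgoogle.com\na\tb\tc\td\te\tf\tg\th\ti\tj\tk\tgoogleXcom\n')

# == compares strings, so the dot is literal: one line matches.
exact=$(printf '%s\n' "$rows" | awk -F '\t' -v domain="$domain" '$12 == domain' | wc -l)

# ~ treats the value as a regex, where "." matches any character: both lines match.
regex=$(printf '%s\n' "$rows" | awk -F '\t' -v domain="$domain" '$12 ~ domain' | wc -l)

# $(( )) strips the padding some wc implementations add.
echo "exact=$((exact)) regex=$((regex))"
```

So for counting exact domains, `==` is the safer operator; `~` would also count values that merely resemble the domain.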

Use a combination of cut to print the 12th column of the TAB-delimited file, sort and uniq to count the items:
cut -f12 data.txt | sort | uniq -c
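The pipeline above counts every value found in column 12. If only the domains listed in domains.txt are of interest, one possible variation (an assumption about the goal, not part of the original answer) is to filter with grep before counting. The sample data is invented.

```shell
tmp=$(mktemp -d)
printf '%s\n' google.com youtube.com > "$tmp/domains.txt"
for d in google.com other.net youtube.com google.com; do
    printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t%s\n' "$d"
done > "$tmp/data.txt"

# -F: fixed strings (dots are literal), -x: whole-line match, -f: patterns from a file.
# The trailing awk only normalises the padding that uniq -c puts before the count.
listed=$(cut -f12 "$tmp/data.txt" | grep -Fx -f "$tmp/domains.txt" | sort | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$listed"
rm -rf "$tmp"
```

Unlike the awk version with a pre-seeded array, this does not report domains with zero occurrences.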

This should give the count of how many lines of the input have "google.com" in $12:
{m,g}awk -v __="${domain}" '
BEGIN { _*=\
( _ ="\t[^\t]*")*gsub(".",(_)_,_)*sub(".","",_)*\
gsub("[.:&=/-]","[&]",__)*sub("[[][^[]+$",__"\t?",_)*(\
FS=_ } { _+=NF } END { print _-NR }'

Related

How can we use '~|~' delimiter to split the records using scripting command?

Please suggest how I can split columns separated by the ~|~ delimiter (file: abc.dat).
a~|~1~|~x
b~|~1~|~y
c~|~2~|~z
I am trying the awk command below, but I get a count of 0.
awk -F'~|~' '$2 == 1' ${file} | wc -l
With your shown samples, please try the following. There is no need to pipe awk into wc; the counting can be done within awk itself.
awk -F'~\\|~' '$2 == 1{count++} END{print count}' "$file"
Explanation: the field separator is set to ~|~ (with the | escaped). Whenever the 2nd field is 1, the variable count is incremented; the END block then prints its value.
To save the value into a shell variable, use:
var=$(awk -F'~\\|~' '$2 == 1{count++} END{print count}' "$file")
You can also use ~[|]~ as FS value, as the pipe char used inside a bracket expression always matches itself, a pipe char:
counter=$(awk 'BEGIN{FS="~[|]~"} $2==1{cnt++} END{print cnt}' file)
See the online awk demo:
s='a~|~1~|~x
b~|~1~|~y
c~|~2~|~z'
counter=$(awk 'BEGIN{FS="~[|]~"} $2==1{cnt++} END{print cnt}' <<< "$s")
echo $counter
# => 2
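To see why the original -F'~|~' produced a count of 0, it helps to print what awk actually does with that separator. The sketch below uses one line from the sample; the variable names are only for illustration.

```shell
line='a~|~1~|~x'

# As a regex, ~|~ means "~ or ~", so every single tilde is a separator
# and the literal pipe ends up as a field of its own.
naive=$(printf '%s\n' "$line" | awk -F'~|~' '{print NF ": " $2}')

# A bracket expression makes the pipe literal, so ~|~ is one separator.
fixed=$(printf '%s\n' "$line" | awk -F'~[|]~' '{print NF ": " $2}')

echo "naive -> $naive"
echo "fixed -> $fixed"
```

With the unescaped separator the second field is the pipe character itself, so `$2 == 1` is never true.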

AWK Finding a way to print lines containing a word from a comma separated string

I want to write a bash script that only prints lines that, in their second column, contain a word from a semicolon-separated string. Example:
words="abc;def;ghi;jkl"
>cat log1.txt
hello;abc;1234
house;ab;987
mouse;abcdef;654
What I want is to print only lines that contain a whole word from the "words" variable. That means that "ab" won't match, and neither will "abcdef". It seems so simple, yet after trying for many hours I was unable to find a solution.
For example, I tried this as my awk command, but it matched any substring.
-F \; -v b="TSLA;NVDA" 'b ~ $2 { print $0 }'
I will appreciate any help. Thank you.
EDIT:
A sample input would look like this
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
A variable like this would be set
filter="PG;TSLA"
And according to this filter, I want to echo these lines
2;PG;sell;138.60
4;TSLA;sell;707.03
Grep is a good choice here (note that -w matches a whole word anywhere on the line, not only in the second column):
grep -Fw -f <(tr ';' '\n' <<<"$words") log1.txt
With awk I'd do
awk -F ';' -v w="$words" '
BEGIN {
n = split(w, a, /;/)
# next line moves the words into the _index_ of an array,
# to make the file processing much easier and more efficient
for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' log1.txt
You may use this awk:
words="abc;def;ghi;jkl"
awk -F';' -v s=";$words;" 'index(s, FS $2 FS)' log1.txt
hello;abc;1234
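The same index() idea can be checked against the sample from the question's EDIT. This is just a sketch feeding the four sample lines on stdin; the variable names `input` and `matched` are illustrative.

```shell
filter='PG;TSLA'
input='1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03'

# Wrapping both the list and the field in separators forces whole-word matches:
# ";PG;" occurs in ";PG;TSLA;", whereas ";P;" or ";PGX;" would not.
matched=$(printf '%s\n' "$input" | awk -F';' -v s=";$filter;" 'index(s, FS $2 FS)')

printf '%s\n' "$matched"
```

Because index() does a plain substring search, no character in the words is treated as a regex metacharacter.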

How to write a bash script that dumps itself out to stdout (for use as a help file)?

Sometimes I want a bash script that's mostly a help file. There are probably better ways to do things, but sometimes I want to just have a file called "awk_help" that I run, and it dumps my awk notes to the terminal.
How can I do this easily?
Another idea, use #!/bin/cat -- this will literally answer the title of your question since the shebang line will be displayed as well.
Turns out it can be done as pretty much a one-liner, thanks to @CharlesDuffy for the suggestions!
Just put the following at the top of the file, and you're done
cat "$BASH_SOURCE" | grep -v EZREMOVEHEADER
So for my awk_help example, it'd be:
cat "$BASH_SOURCE" | grep -v EZREMOVEHEADER
# Basic form of all awk commands
awk search pattern { program actions }
# advanced awk
awk 'BEGIN {init} search1 {actions} search2 {actions} END { final actions }' file
# awk boolean example for matching "(me OR you) OR (john AND ! doe)"
awk '( /me|you/ ) || (/john/ && ! /doe/ )' /path/to/file
# awk - print # of lines in file
awk 'END {print NR,"coins"}' coins.txt
# Sum up gold ounces in column 2, and find out value at $425/ounce
awk '/gold/ {ounces += $2} END {print "value = $" 425*ounces}' coins.txt
# Print the last column of each line in a file, using a comma (instead of space) as a field separator:
awk -F ',' '{print $NF}' filename
# Sum the values in the first column and pretty-print the values and then the total:
awk '{s+=$1; print $1} END {print "--------"; print s}' filename
# functions available
length($0) > 72, toupper,tolower
# count the # of times the word PASSED shows up in the file /tmp/out
cat /tmp/out | awk 'BEGIN {X=0} /PASSED/{X+=1; print $1 X}'
# awk regex operators
https://www.gnu.org/software/gawk/manual/html_node/Regexp-Operators.html
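One thing to watch with this approach: after the first line has printed the file, bash carries on and tries to execute the notes as commands. A minimal sketch of one way around that, assuming an `exit` tagged with the same marker so that it is filtered out of the output too (the tiny help file here is hypothetical):

```shell
# Write a tiny hypothetical help script, then run it with bash.
tmp=$(mktemp -d)
cat > "$tmp/awk_help" <<'EOF'
grep -v EZREMOVEHEADER "$BASH_SOURCE"; exit   # EZREMOVEHEADER
# Print the last column of each line
awk '{print $NF}' filename
EOF

# bash stops at "exit", so the note lines below it are never executed.
help_text=$(bash "$tmp/awk_help")

printf '%s\n' "$help_text"
rm -rf "$tmp"
```

Passing the file straight to grep also avoids the extra cat process.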
I found another solution that works on Mac/Linux and works exactly as one would hope.
Just use the following as your "shebang" line, and it'll output everything from line 2 on down:
test.sh
#!/usr/bin/tail -n+2
hi there
how are you
Running this gives you what you'd expect:
$ ./test.sh
hi there
how are you
Another possible solution: just use less, so that your file opens in a searchable pager:
#!/usr/bin/less
This way you can still grep it for something too, e.g.
$ ./test.sh | grep something

Using awk to search for a line that starts with but also contains a string

I have a file that has multiple lines that starts with a keyword. I only want to modify one of them and it's easy to distinguish the two. I want the one that is under the [dbinfo] section. The domain name is static so I know that won't change.
awk -F '=' '$1 ~ /^dbhost/ {print $NF};' myfile.txt
myfile.txt
[ual]
path=/web/
dbhost=ez098sf
[dbinfo]
dbhost=ec0001.us-east-1.localdomain
dbname=ez098sf_default
dbpass=XXXXXX
You can use this awk command to first check for presence of [dbinfo] section and then modify dbhost parameter:
awk -v h='newhost' 'BEGIN{FS=OFS="="}
$0 == "[dbinfo]" {sec=1} sec && $1 == "dbhost"{$2 = h; sec=0} 1' file
[ual]
path=/web/
dbhost=ez098sf
[dbinfo]
dbhost=newhost
dbname=ez098sf_default
dbpass=XXXXXX
You want to utilize a little bit of a state machine here:
awk -F '=' '
$0 ~ /^\[.*\]/ {in_db_info=($0=="[dbinfo]")}
$0 ~ /^dbhost/{if (in_db_info) print $2;}' myfile.txt
You can also do it with sed:
sed '/\[dbinfo\]/,/\[/s/\(^dbhost=\).*/\1domain.com/' myfile.txt
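For reference, here is a compact sketch of the state-machine idea run against the sample file from the question (fed on stdin rather than from myfile.txt). It reads the value rather than modifying it, and compares `$1` exactly instead of using a prefix regex, which is a small deviation from the answer above.

```shell
config='[ual]
path=/web/
dbhost=ez098sf
[dbinfo]
dbhost=ec0001.us-east-1.localdomain
dbname=ez098sf_default
dbpass=XXXXXX'

# A section header flips the flag; dbhost is printed only while the flag is set.
host=$(printf '%s\n' "$config" | awk -F '=' '
    /^\[.*\]/ { in_db_info = ($0 == "[dbinfo]") }
    in_db_info && $1 == "dbhost" { print $2 }
')

echo "$host"
```

The dbhost line under [ual] is skipped because the flag is still 0 at that point.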

How to execute awk command in shell script

I have an awk command that extracts the 16th column from 3rd line in a csv file and prints the first 4 characters.
awk -F"," 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,0,4)}'
This works fine.
But when I execute it from a shell script, I get an error:
#!/bin/ksh
YEAR=awk -F"," 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,0,4)}'
Error message:
-F,: not found
Use command substitution to assign the output of a command to a variable, as shown below:
YEAR=$(awk -F"," 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,0,4)}')
You are asking the shell to run:
VAR=value command [arguments...]
which means: launch command, with VAR=value added to its environment.
(For example, LC_ALL=C grep '[0-9]*' /some/file.txt runs grep on file.txt with the LC_ALL variable set to C just for the duration of that grep call.)
So here you are asking the shell to launch the command -F"," (that is, -F, once the shell has processed the quotes) with the arguments 'NR==3..., and with the variable YEAR set to the value awk for the duration of that command.
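The effect of the VAR=value prefix can be observed directly. In this small sketch, `sh -c` stands in for an arbitrary command, and the names `inside` and `outside` are only illustrative.

```shell
unset YEAR

# The assignment is placed in the environment of this one command only.
inside=$(YEAR=awk sh -c 'echo "$YEAR"')

# Afterwards the variable is still unset in the calling shell.
outside=${YEAR-unset}

echo "inside=$inside outside=$outside"
```

This is why the original line never assigned anything useful to YEAR: the shell treated `YEAR=awk` as a temporary environment setting for the command that followed.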
Just replace it with :
#!/bin/ksh
YEAR="$(awk -F',' 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,1,4)}')"
(I didn't try it, but I hope it works for you and your sample.csv file.)
(Note that you pass 0 as the start position to substr to mean the first character. awk strings are indexed from 1, and implementations disagree on how to treat 0, so use 1.)
From your description, it looks like you want to extract the year from the 16th field, which might contain leading spaces. You can accomplish it by calling AWK once:
YEAR=$(awk -F, 'NR==3{sub(/^[ \t]*/, "", $16); print substr($16,1,4)}' sample.csv)
Better yet, you don't even have to use awk. Since you are already writing shell script, let's do it all in shell script:
{ read line; read line; read line; } < sample.csv # Get the third line
IFS=,; set -- $line # Break the line into comma-separated fields
IFS=' '; set -- ${16} # Trick to remove leading spaces: field 16 becomes field 1
YEAR=${1:0:4} # Extract the first 4 char from field 1
Do this:
year=$(awk -F, 'NR==3{sub(/^[ \t]+/,"",$16); print substr($16,1,4); exit }' sample.csv)
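A quick way to try that last command without the real file is to build a throwaway CSV. The contents below are invented; only the shape (line 3, field 16, leading spaces) follows the question.

```shell
tmp=$(mktemp -d)

# Made-up CSV: three lines, 16 fields each; field 16 of line 3 has leading spaces.
{
    echo 'h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16'
    echo 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p'
    echo 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,   2015-06-30'
} > "$tmp/sample.csv"

# Strip leading blanks from field 16 of line 3, keep its first four characters, stop reading.
year=$(awk -F, 'NR==3 { sub(/^[ \t]+/, "", $16); print substr($16, 1, 4); exit }' "$tmp/sample.csv")

echo "year=$year"
rm -rf "$tmp"
```

The `exit` means awk stops after line 3 instead of reading the rest of the file.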
