trouble with variable in awk match command - bash

Apologies if below is messy or there's a cleaner way to do it, I'm still learning!
I'm using curl to grab a page containing numbers and HTML. To get to the table with the numbers, I'm using the command below:
echo $curlo | awk '/<th>00/ { match($0, /<th>00/); print substr($0, RSTART - 10, RLENGTH + 40000); }' | sed 's/d1ffce/\'$'\n/g'| sed 's/88ff7f/\'$'\n/g' | grep -o '[0-9]*'
This begins the output at th00, prints the next 40000 characters (the page varies in size but will never be that long), replaces some hex colour codes, and then prints only the numbers.
However, th00 will change to th01, th02, etc. with the hour, so I'm trying to use a variable. For testing I set cnt=00 and replaced it in the command with the variable:
echo $curlo | awk '"/<th>$cnt/" { match($0, "/<th>$cnt/"); print substr($0, RSTART - 10, RLENGTH + 40000); }' | sed 's/d1ffce/\'$'\n/g'| sed 's/88ff7f/\'$'\n/g' | grep -o '[0-9]*'
But the output is completely different. If I echo $cnt, it prints 00 fine. I've also tried placing the whole th00 in the cnt variable, with the same result.
For comparison, the first command gives me 382 lines; the second gives 896.
This is using the bash shell, by the way.

Shell variables aren't expanded inside single quotes. But it's better to assign an awk variable with the -v option:
echo "$curlo" | awk -v cnt="$cnt" 'match($0, "<th>" cnt) {
    str = substr($0, RSTART-10, RLENGTH+40000);
    gsub(/d1ffce|88ff7f/, "\n", str);        # drop the colour codes so their digits are not printed
    gsub(/^[^0-9]+|[^0-9]+$/, "", str);      # trim non-digits at both ends
    gsub(/[^0-9]+/, "\n", str);              # one number per line
    print str; }'
There's also no need to pipe to sed and grep -o, since awk can do the same things with gsub().
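To see why the original attempt behaved differently, here is a minimal sketch (the one-line sample is made up, not the real page). Inside single quotes awk receives the literal text $cnt, so match() finds nothing and returns 0, whereas a variable passed with -v is concatenated into the pattern as intended. Also note that a non-empty string constant used as a pattern is always true, which is probably why the second command ran its action on every line.

```shell
cnt=00
line='<th>00</th>'   # hypothetical sample input

# Single quotes: awk sees the literal characters $cnt, not 00
literal=$(printf '%s\n' "$line" | awk '{ print match($0, "<th>$cnt") }')

# -v: awk gets a real variable named cnt holding 00
viavar=$(printf '%s\n' "$line" | awk -v cnt="$cnt" '{ print match($0, "<th>" cnt) }')

echo "literal=$literal viavar=$viavar"
```

With this sample, the first call reports no match (0) and the second reports a match at position 1.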

Related

awk match by variable with dot in it

I have a script that will iterate over a file containing domains (google.com, youtube.com, etc). The purpose of the script is to check how many times each domain is included in the 12th column of a tab-separated value file.
while read domain; do
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
done < domains.txt
However, awk seems to be interpreting the dots in the domains as special characters. The following error message is shown:
awk: syntax error at source line 1
context is
$12 ~ >>> google. <<< com
awk: bailing out at source line 1
I am a beginner in bash so any help would be greatly appreciated!
When you write:
domain='google.com'
awk -F '\t' '$12 == '$domain'' data.txt
the $domain is outside of any quotes:
awk -F '\t' '$12 == '$domain' ' data.txt
            <       >       < >
            start   end     start/end
and so exposed to the shell for interpretation first and THEN it becomes part of the body of the awk script before awk sees it. So what awk sees is:
awk -F '\t' '$12 == google.com' data.txt
and google.com is not a valid symbol (e.g. variable or function) name nor string nor number. What you MEANT to do was:
awk -F '\t' '$12 == "'"$domain"'"' data.txt
so the shell would see "$domain" instead of just $domain (see https://mywiki.wooledge.org/Quotes for why that's important) and awk would finally see:
awk -F '\t' '$12 == "google.com"' data.txt
which is fine, as "google.com" is now a string, not a symbol. BUT you should never allow shell variables to expand to become part of an awk script, as there are other caveats, so what you should really have done is:
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt
See How do I use shell variables in an awk script? for more information.
By the way, even after fixing the above problem do not do this:
while read domain; do
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt | wc -l
done < domains.txt
as it'll be immensely slow and contains insidious bugs (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice). Do something like this instead (untested):
awk -F'\t' '
NR==FNR {
cnt[$1] = 0
next
}
$12 in cnt {
cnt[$12]++
}
END {
for ( dom in cnt ) {
print dom, cnt[dom]
}
}
' domains.txt data.txt
That will be far more efficient, robust, and portable than calling awk inside a shell read loop.
See What are NR and FNR and what does "NR==FNR" imply? for how that awk script works. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn awk.
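Since the script above is marked untested, here is a small self-contained run of the same two-file idiom. The sample files are hypothetical and are created in a temporary directory; the awk body is the one shown above, only condensed.

```shell
tmp=$(mktemp -d)

# Hypothetical inputs: two domains to count, three data rows with 12 tab-separated fields
printf 'google.com\nyoutube.com\n' > "$tmp/domains.txt"
for d in google.com example.org google.com; do
    printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\t%s\n' "$d"
done > "$tmp/data.txt"

counts=$(awk -F'\t' '
    NR==FNR { cnt[$1] = 0; next }       # first file: register every domain with a zero count
    $12 in cnt { cnt[$12]++ }           # second file: count rows whose 12th field is registered
    END { for (dom in cnt) print dom, cnt[dom] }
' "$tmp/domains.txt" "$tmp/data.txt" | sort)   # sort because for-in order is unspecified

printf '%s\n' "$counts"
rm -rf "$tmp"
```

For these inputs it reports google.com with 2 and youtube.com with 0; example.org is ignored because it is not in domains.txt.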
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
The single quotes are building an awk program. They are not something visible to awk. So awk sees this:
$12 == google.com
Since there aren't any quotes around google.com, that is a syntax error. You just need to add quotation marks.
awk -F '\t' '$12 == "'"$domain"'"' data.txt
The quotes jammed together like that are a little confusing, but it's just this:
'....' stuff to send to awk. Single quotes are for the shell.
'..."...' a double quote inside the awk program for awk to see
'...'"..." stuff in double quotes _outside_ the awk program for the shell
We can combine those like this:
'..."'"$var"'"...'
That's a bunch of literal awk code ending in a double-quote, followed by the expansion of the shell parameter var, which is double-quoted as usual in the shell for safety, followed by more literal awk code starting with a double quotes. So the end result is a string passed to awk that includes the value of the var inside double quotes.
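One way to convince yourself of this is to build the same spliced string in the shell and print it, instead of handing it to awk. This is only an illustration; the variable name spliced is made up.

```shell
domain='google.com'

# Same quoting as above: literal awk code, then the expanded variable, then more literal awk code
spliced='$12 == "'"$domain"'"'

printf '%s\n' "$spliced"
```

This prints `$12 == "google.com"`, which is exactly the program text awk would receive.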
But you don't have to be so fancy or confusing since awk provides the -v option to set variables from the shell:
awk -v domain="$domain" '$12 == domain' data.txt
Since the domain is not quoted inside the awk code, it is interpreted as the name of a variable. (Periods are not legal in variable names, which is why you got a syntax error with your domains; if you hadn't, though, awk would have treated them as empty and been looking for lines whose twelfth field was likewise blank.)
Use a combination of cut to print the 12th column of the TAB-delimited file, sort and uniq to count the items:
cut -f12 data.txt | sort | uniq -c
This should give the count of how many lines of the input have "google.com" in $12:
{m,g}awk -v __="${domain}" '
BEGIN { _*=\
( _ ="\t[^\t]*")*gsub(".",(_)_,_)*sub(".","",_)*\
gsub("[.:&=/-]","[&]",__)*sub("[[][^[]+$",__"\t?",_)*(\
FS=_ } { _+=NF } END { print _-NR }'

Processing text with multiple delims in awk

I have a text which looks like -
Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320
I want the number from the count:1 column (so 1 here), and I wish to store these numbers in an array.
nums=($(echo -n "$grepResult" | awk -F ':' '{ print $4 }' | awk -F '|' '{ print $1 }'))
This seems very repetitive and not very efficient. Any ideas how to simplify it?
You can use awk once, with the field separator set to |, then loop over all the fields and split each on :
If the first part is count, print the second part of the split value.
This way the count: part can occur anywhere in the string, and it is printed every time it occurs.
nums=($(echo -n "$grepResult" | awk -F'|' '
{
for(i=1; i<=NF; i++) {
split($i, a, ":")
if (a[1] == "count") {
print a[2]
}
}
}
'))
for i in "${nums[@]}"
do
echo "$i"
done
Output
1
5
If you want to combine both splits, you can use [|:] as a character class for the field separator and print field number 8, as mentioned in the comments.
Note that it does not check if it starts with count:
nums=($(echo -n "$grepResult" | awk -F '[|:]' '{print $8}'))
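To check where field 8 comes from, a quick sketch that numbers every field produced by the [|:] separator on one of the sample lines:

```shell
line='Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320'

# Print each field with its index; the empty field 2 comes from the doubled ||
fields=$(printf '%s\n' "$line" |
    awk -F'[|:]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')

printf '%s\n' "$fields"
```

Field 7 is the label count and field 8 is its value, so $8 only works as long as every line has the same layout.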
With gnu awk you can use a capture group to get a bit more precise match where on the left and right can be either the start/end of string or a pipe char. The 2nd group matches 1 or more digits:
nums=($(echo -n "$grepResult" | awk 'match($0, /(^|\|)count:([0-9]+)(\||$)/, a) {print a[2]}' ))
Try sed
nums=($(sed 's/.*count://;s/|.*//' <<< "$grepResult"))
Explanation:
There are two sed commands separated by a ; symbol.
The first command, 's/.*count://', removes everything up to and including 'count:'.
The second command, 's/|.*//', removes everything from the first '|' onwards.
Command order is important here.
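A short sketch of why the order matters, using the two sample lines from the question. The results are kept in plain variables here rather than an array, purely to keep the example small.

```shell
grepResult='Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320'

# Intended order: strip up to count: first, then strip from the next | onwards
right=$(printf '%s\n' "$grepResult" | sed 's/.*count://;s/|.*//')

# Swapped order: the first | is cut before count: is ever reached
wrong=$(printf '%s\n' "$grepResult" | sed 's/|.*//;s/.*count://')

printf 'right:\n%s\nwrong:\n%s\n' "$right" "$wrong"
```

The intended order yields 1 and 5; the swapped order leaves only Application. on each line.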

Show with star symbols how many times a user has logged in

I'm trying to create a simple shell script showing how many times a user has logged in to their Linux machine over at least one week. The output of the shell script should be like this:
2021-12-16
****
2021-12-15
**
2021-12-14
*******
I have tried this so far, but it only shows numeric counts, and I want it to show * symbols.
user="$1"
last -F | grep "${user}" | sed -E "s/${user}.*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) //" | awk '{print $1"-"$2"-"$4}' | uniq -c
Any help?
You might want to refactor all of this into a simple Awk script, where repeating a string n times is also easy.
user="$1"
last -F |
awk -v user="$1" 'BEGIN { split("Jan:Feb:Mar:Apr:May:Jun:Jul:Aug:Sep:Oct:Nov:Dec", m, ":");
for(i=1; i<=12; i++) mon[m[i]] = sprintf("%02i", i) }
$1 == user { ++count[$8 "-" mon[$5] "-" sprintf("%02i", $6)] }
END { for (date in count) {
padded = sprintf("%-" count[date] "s", "*");
gsub(/ /, "*", padded);
print date, padded } }'
The BEGIN block creates an associative array mon which maps English month abbreviations to month numbers.
sprintf("%02i", number) produces the value of number with zero padding to two digits (i.e. adds a leading zero if number is a single digit).
The $1 == user condition matches the lines where the first field is equal to the user name we passed in. (Your original attempt had two related bugs here; it would look for the user name anywhere in the line, so if the user name happened to match on another field, it would erroneously match on that; and the regex you used would match a substring of a longer field).
When that matches, we just update the value in the associative array count whose key is the current date.
Finally, in the END block, we simply loop over the keys of count and print each date with its count. Again, we use sprintf to produce a field with a suitable length. We play a little trick here by space-padding to the specified width, because sprintf does that out of the box, and then replace the spaces with more asterisks.
Your desired output shows the asterisks on a separate line from the date; obviously, it's easy to change that if you like, but I would advise against it in favor of a format which is easy to sort, grep, etc (perhaps to then reformat into your desired final human-readable form).
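The padding trick can be tried in isolation. A minimal sketch, with the count hard-coded to 4 instead of coming from the count array:

```shell
stars=$(awk 'BEGIN {
    n = 4                              # stand-in for count[date]
    s = sprintf("%-" n "s", "*")       # one asterisk, left-justified and space-padded to width n
    gsub(/ /, "*", s)                  # turn the padding into asterisks
    print s
}')

echo "$stars"
```

This prints four asterisks.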
If you have GNU sed you're almost there. Just pipe the output of uniq -c to this GNU sed command:
sed -En 's/^\s*(\S+)\s+(\S+).*/printf "\2\n%\1s" ""/e;s/ /*/g;p'
Explanation: in the output of uniq -c we substitute a line like:
6 Dec-15-2021
by:
printf "Dec-15-2021\n%6s" ""
and we use the e GNU sed flag (this is a GNU sed extension so you need GNU sed) to pass this to the shell. The output is:
Dec-15-2021
where the second line contains 6 spaces. This output is copied back into the sed pattern space. We finish by a global substitution of spaces by stars and print:
Dec-15-2021
******
A simple solution, using a temp file:
#!/bin/bash
user="$1"
tempfile="/tmp/last.txt"
IFS='
'
last -F | grep "${user}" | sed -E "s/"${user}".*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) //" | awk '{print $1"-"$2"-"$4}' | uniq -c > $tempfile
for LINE in $(cat $tempfile)
do
qtde=$(echo "$LINE" | awk '{print $1}')
data=$(echo "$LINE" | awk '{print $2}')
echo -e "$data "
for ((i=1; i<=qtde; i++))
do
echo -e "*\c"
done
echo -e "\n"
done

How to convert date with awk

My file temp.txt
ID53,20150918,2015-09-19,,0,CENTER
ID54,20150911,2015-09-14,,0,CENTER
ID55,20150911,2015-09-14,,0,CENTER
I need to convert the 2nd field (yyyymmdd) to seconds since the epoch.
I tried this, but only the first line is replaced:
awk -F"," '{ ("date -j -f ""%Y%m%d"" ""20150918"" ""+%s""") | getline $2; print }' OFS="," temp.txt
and tried to like this
awk -F"," '{system("date -j -f ""%Y%m%d"" "$2" ""+%s""") | getline $2; print }' temp.txt
the output is:
1442619474
sh: 0: command not found
ID53,20150918,2015-09-19,,0,CENTER
1442014674
ID54,20150911,2015-09-14,,0,CENTER
1442014674
ID55,20150911,2015-09-14,,0,CENTER
Using gsub did not work either:
awk -F"," '{gsub($2,"system("date -j -f ""%Y%m%d"" "$2" ""+%s""")",$2); print}' OFS="," temp.txt
awk: syntax error at source line 1
context is
{gsub($2,"system("date -j -f ""%Y%m%d"" "$2" >>> ""+% <<< s""")",$2); print}
awk: illegal statement at source line 1
extra )
I need the output to look like this. How can I do it?
ID53,1442619376,2015-09-19,,0,CENTER
ID54,1442014576,2015-09-14,,0,CENTER
ID55,1442014576,2015-09-14,,0,CENTER
This GNU awk script should do it. If GNU awk is not yet installed on your Mac, I suggest installing MacPorts and then GNU awk. You can also install decent versions of bash, date and other important utilities, for which the defaults on OS X are really disappointing.
BEGIN { FS = ","; OFS = FS; }
{
y = substr($2, 1, 4);
m = substr($2, 5, 2);
d = substr($2, 7, 2);
$2 = mktime(y " " m " " d " 00 00 00");
print;
}
Put it in a file (e.g. txt2ts.awk) and process your file with:
$ awk -f txt2ts.awk data.txt
ID53,1442527200,2015-09-19,,0,CENTER
ID54,1441922400,2015-09-14,,0,CENTER
ID55,1441922400,2015-09-14,,0,CENTER
Note that we do not get the same timestamps. Most likely this is because mktime() here computes midnight in the local time zone, whereas your date -j -f call fills the unspecified hours, minutes and seconds from the current time, so your values include the time of day at which you ran the command.
Explanations: substr(s, m, n) returns the n-character substring of s that starts at position m (starting with 1). mktime("YYYY MM DD HH MM SS") converts the date string into a timestamp (seconds since the epoch). FS and OFS are the input and output field separators, respectively. The commands between the curly braces of the BEGIN pattern are executed at the beginning only, while the others are executed on each line of the file.
You could use substr:
printf "%s-%s-%s", substr($6,1,4), substr($6,5,2), substr($6,7,2)
Assuming that the 6th field was 20150914, this would produce 2015-09-14
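Applied to the file from the question, where the yyyymmdd value is in the 2nd field rather than the 6th, a minimal sketch could look like this. It only reformats the date and does not compute a timestamp.

```shell
iso=$(printf '%s\n' 'ID54,20150911,2015-09-14,,0,CENTER' |
    awk -F, '{
        # substr() positions start at 1: year is chars 1-4, month 5-6, day 7-8
        printf "%s-%s-%s\n", substr($2, 1, 4), substr($2, 5, 2), substr($2, 7, 2)
    }')

echo "$iso"
```

For this sample line the result is 2015-09-11.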

How can I pass variables from awk to a shell command?

I am trying to run a shell command from within awk for each line of a file, and the shell command needs one input argument. I tried to use system(), but it didn't recognize the input argument.
Each line of this file is the path of a file, and I want to run a command to process that file. So, as a simple example, I want to run the wc command for each line and pass $1 to wc.
awk '{system("wc $1")}' myfile
You are close. You have to concatenate the command string with the awk variable:
awk '{system("wc "$1)}' myfile
You cannot grab the output of an awk system() call, you can only get the exit status. Use the getline/pipe or getline/variable/pipe constructs
awk '{
cmd = "your_command " $1
while ((cmd | getline line) > 0) {
do_something_with(line)
}
close(cmd)
}' file
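A concrete version of that pattern, as a sketch only: the command being built (printf piped to wc -c) is a made-up stand-in for your_command, and it is safe here only because the sample fields contain no shell metacharacters.

```shell
out=$(printf 'alpha\nbeta\n' | awk '{
    cmd = "printf %s " $1 " | wc -c"        # hypothetical command built from the first field
    while ((cmd | getline line) > 0) {      # > 0 stops on both end of output and errors
        sub(/^[ \t]+/, "", line)            # some wc implementations pad the number
        print $1, line
    }
    close(cmd)                              # needed so each new command starts fresh
}')

printf '%s\n' "$out"
```

Each input word is printed together with its length as reported by wc -c.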
FYI here's how to use awk to process files whose names are stored in a file (providing wc-like functionality in this example):
gawk '
NR==FNR { ARGV[ARGC++]=$0; next }
{ nW+=NF; nC+=(length($0) + 1) }
ENDFILE { print FILENAME, FNR, nW, nC; nW=nC=0 }
' file
The above uses GNU awk for ENDFILE. With other awks just store the values in an array and print in a loop in the END section.
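A possible portable variant along those lines, as a sketch: it uses FNR==1 to notice the start of each queued file and prints everything in END. The sample files are hypothetical, and one known limitation is that an empty file never triggers FNR==1 and is therefore not reported.

```shell
tmp=$(mktemp -d)
printf 'one two\nthree\n' > "$tmp/a.txt"
printf 'four five six\n'  > "$tmp/b.txt"
printf 'a.txt\nb.txt\n'   > "$tmp/list"     # the file that holds the file names

report=$(cd "$tmp" && awk '
    NR==FNR { ARGV[ARGC++] = $0; next }     # first file: queue each listed name as an input file
    FNR==1  { names[++n] = FILENAME }       # a new input file has started
    { nL[n]++; nW[n] += NF; nC[n] += length($0) + 1 }
    END { for (i = 1; i <= n; i++) print names[i], nL[i], nW[i], nC[i] }
' list)

printf '%s\n' "$report"
rm -rf "$tmp"
```

Each output line holds the file name followed by its line, word and character counts.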
I would suggest another solution:
awk '{print $1}' myfile | xargs wc
the difference is that it executes wc once with multiple arguments. It often works (for example, with kill command)
Or use the pipe | as in bash, then retrieve the output in a variable with awk's getline, like this:
zcat /var/log/fail2ban.log* | gawk '/.*Ban.*/ {print $7};' | sort | uniq -c | sort | gawk '{ "geoiplookup " $2 "| cut -f2 -d: " | getline geoip; print $2 "\t\t" $1 " " geoip}'
That line will print all the banned IPs from your server along with their origin (country) using the geoip-bin package.
The last part of that one-liner is the one that affects us :
gawk '{ "geoiplookup " $2 "| cut -f2 -d: " | getline geoip; print $2 "\t\t" $1 " " geoip}'
It simply says: run the command "geoiplookup 182.193.192.4 | cut -f2 -d:" ($2 gets substituted, as you may guess) and put the result of that command in geoip (the | getline geoip bit). Next, print the IP, the count, and whatever is in the geoip variable.
The complete example and the results can be found here, an article I wrote.
