Different results when running awk from command line versus writing awk script using #!/bin/awk -f - shell

I am writing a simple awk script to read a file holding a single number (single line with a single field), subtract a constant and then write the result to another file. This is a warmup exercise to do a more complex problem. So, if the input file has X, then the output file has X-C
When I write the following in the command line, it works:
awk '{$1 = $1 - 10; print $0}' test.dat > out.dat
The output looks like this (for X = 30 and C = 10):
20
However, I wrote the following awk script:
#!/bin/awk -f
C=10
{$1 = $1 - C; print $0}
Next, when I run the awk script using:
./script.awk test.dat > out.dat
I get an output file with two lines as follows:
X
X-C
for example, if X=30 and C=10 I get an output file having
30
20
Why is the result different in the two cases? I tried removing "-f" in the shebang, but I receive an error when I do this.

This is your awk program:
C=10
{$1 = $1 - C; print $0}
Recall that awk programs take the form of a list of pattern-action pairs.
A missing action results in the default action being performed (print the record). A missing pattern is considered to always be true.
Your program is equivalent to:
C=10 { print $0 }
1 { $1 = $1 - C; print $0 }
The first pattern, C=10, assigns 10 to the variable C; because an assignment evaluates to the value assigned, the pattern's value is 10. Since 10 is not false, the pattern matches and the default action happens.
The second line has a default pattern that returns true. So the action always happens.
These two pattern-action pairs are invoked for every record that is input. So, with one record input, there will be two copies printed on output.
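The fix, then, is to move the assignment into a BEGIN rule, which runs exactly once before any input is read. A minimal sketch of the corrected script:

```shell
# script.awk, corrected: C is set once in BEGIN instead of acting as a pattern
cat > script.awk <<'EOF'
#!/bin/awk -f
BEGIN { C = 10 }
{ $1 = $1 - C; print $0 }
EOF
chmod +x script.awk

printf '30\n' > test.dat
awk -f script.awk test.dat    # prints: 20
```

Running it as ./script.awk behaves the same, provided the shebang's /bin/awk path matches where awk actually lives on your system.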

Related

How do pipes inside awk work (Sort with keeping header)

The following command outputs the header of a file and sorts the records after the header. But how does it work? Can anyone explain this command?
awk 'NR == 1; NR > 1 {print $0 | "sort -k3"}'
Here is the command broken down, for explanation purposes. For learning more awk concepts, I suggest going through Stack Overflow's awk learning section.
awk '                 ##Starting awk program from here.
NR == 1;              ##If this is the first line, print it.
                      ##awk works on pattern-action pairs; since NO action is given here, printing the current line happens by default.
NR > 1{               ##For every line after the first, do the following.
  print $0 | "sort -k3"  ##Keep sending lines into a pipe; sort collects them and prints them sorted by their 3rd field.
}'
Understanding the awk command:
Overall, an awk program is built out of (pattern){action} pairs which state that if pattern returns a non-zero value, action is executed. One does not necessarily need to write both: if pattern is omitted, it defaults to 1, and if action is omitted, it defaults to print $0.
When looking at the command in question:
awk 'NR == 1; NR > 1 {print $0 | "sort -k3"}'
We notice that there are two pattern-action pairs. The first reads NR == 1 and states that if we are processing the first record (pattern), then print the record (default action). The second is a bit more tricky: the pattern is clear, but the action needs some explaining.
awk has four output statements that can redirect the output. One of these reads expression | cmd. It essentially means that awk will write output to a stream that is piped as input to the command cmd. It will keep writing to that stream until the stream is explicitly closed using a close(cmd) statement, or until awk terminates.
In the OP's case, the action reads { print $0 | "sort -k3" }, meaning that it prints all records $0 to a stream that is used as input to the shell command sort -k3. Only when the program finishes will sort write its output.
Recap: the OP's command will print the first line of a file, and sort the subsequent lines according to the third column.
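As a side note, the stream can also be closed early. A minimal sketch (using plain sort for brevity): close(cmd) flushes the pipe, so the sorted output appears before anything awk prints afterwards.

```shell
printf 'b\na\nc\n' | awk '
    { print $0 | "sort" }     # feed every record into the pipe
    END { close("sort")       # closing the stream flushes it: sort runs now
          print "done" }'
# prints a, b, c (sorted), then done
```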
Alternative solutions:
Using GNU awk, it is better to do:
awk '(FNR==1){print; next}
     {a[$3]=$0}
     END{PROCINFO["sorted_in"]="#ind_str_asc"
         for(i in a) print a[i]
     }' file
Using pure shell, it is better to do:
cat file | (read -r; printf "%s\n" "$REPLY"; sort -k3)
Related questions:
Is there a way to ignore header lines in a UNIX sort?
| is one of the redirections supported by print and printf; in this case it pipes to the command sort -k3. You can also redirect to a file using >:
awk 'NR == 1; NR > 1 {print $0 > "output.txt"}'
or append to file using >>:
awk 'NR == 1; NR > 1 {print $0 >> "output.txt"}'
The first will write all lines but the first to output.txt; the second will append them to output.txt instead.

Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example; otherwise just pipe it through sort first), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
It has the advantage over the other solutions posted as of this edit of not holding the whole data set in memory. This comes at the cost of requiring sorted input (sorting is what would force lots of data into memory if the input weren't already sorted).
Explanation:
the first line initializes the currentHost and currentApps variables to the values of the first line of the input
the second line handles a line with the same host as the previous one: the app mentioned in the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the info for the previous host is printed, then we reinitialize the variables to the values of the current line of input
the last line prints the info of the current host when we have reached the end of the input
It probably can be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
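A quick run of the script above on a shortened version of the sample input:

```shell
printf 'host1,app1\nhost1,app2\nhost2,app4\n' |
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
# prints:
# host1;app1,app2
# host2;app4
```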
$ awk '
BEGIN { FS=","; ORS="" }
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }
{ print OFS $2; OFS=FS }
END { print ors }
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read host app
do
[ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[@]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique element in the first column, and appends to that array entry all elements from the second column.
Once the file has been parsed, the content of the array is printed.

Bash: Adding the contents of a column to a variable

So I have a file that contains some text, but there are some lines that contain only "Overall>5". (the number after > could be any number from 0 to 5).
The code I wrote:
let sumOfReviews=0
while read line; do awk -F ">" '{if($1=="Overall") ((sumOfReviews+=$2))}'; done<$file
echo $sumOfReviews
I tried splitting these lines into 2 columns at ">" and I want to add the number in the second column (5 in this case) to a variable. The problem is when I print out the value of the variable at the end it shows 0. Any thoughts? Thanks!
If called from the console, the following seems to do the job:
awk -F'>' '/Overall>[0-9]+/ { total += $2 } END { print total }' exampleData.txt
If you want to call it from inside bash, you have to enclose it in $( ... ):
#!/bin/bash
total="$(awk -F'>' '/Overall>[0-9]+/ { total += $2 } END { print total }' exampleData.txt)"
# do something with `total` here.
You cannot simply use awk as some sort of syntax inside bash: awk is a separate programming language, invoked as a completely separate process. You can, however, pass bash parameters into awk.
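For example, the usual way to hand a bash value to awk is the -v option (a sketch with a made-up cutoff variable):

```shell
# Pass the shell variable `cutoff` into awk as the awk variable `c`;
# awk then filters using its own copy of the value.
cutoff=3
printf '1\n4\n2\n5\n' | awk -v c="$cutoff" '$1 > c'
# prints: 4, then 5
```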
This can be done in a one line awk script.
awk 'BEGIN { FS=">"; sumOfReviews=0 } /^Overall>[0-5]/ { sumOfReviews+=$2 } END { print sumOfReviews }' < file
Explanation from Manpage:
An AWK program consists of a sequence of pattern-action statements and optional function definitions.
pattern { action statements }
In this case we have used the BEGIN pattern to set the file separator to ">" and the sumOfReviews variable to 0.
We use the /^Overall>[0-5]/ regular expression pattern to match lines beginning with "Overall>" followed by a number 0-5 and, if it matches, add the $2 field to the sumOfReviews variable.
Finally we use the END pattern to output the final sumOfReviews value.
Example solution in a bash shell script:
#!/bin/bash
noAuthors=4 # set to 4 for example
sumOfReviews=$(awk 'BEGIN { FS=">"; sumOfReviews=0 } /^Overall>[0-5]/ { sumOfReviews+=$2 } END { print sumOfReviews }' < file)
echo $(($sumOfReviews/$noAuthors))
awk and bash are two separate programs; they don't share variables. All you need is a single awk script:
awk -F '>' '$1 == "Overall" {reviews += $2}; END {print reviews}' "$file"

How to use output of a command inside an awk command?

I want to print out the last update of a log file and nothing above it (old logs). Every 5 minutes the log is updated/appended to, and there is no option to overwrite instead of append. The amount of lines per update don't vary now, but I don't want to have to change the script if and when new fields are added. Each appendage starts with "Date: ...."
This is my solution so far. I'm finding the line number of the last occurrence of "Date" and then trying to send that to "awk 'NR>line_num_here filename" -
line=$(grep -n Date stats.log | tail -1 | cut --delimiter=':' --fields=1) | awk "NR>$line" file.log
However, I cannot update $line! It always holds the very first value from the very first time I ran the script. Is there a way to correctly update $line? Or are there any other ways to do this? Maybe a way to directly pipe into awk instead of making a variable?
The problem in your solution is that you need to replace the pipe in front of awk by a ;. These are two separate commands which would normally appear on two separate lines:
line=$(...)
awk "NR>$line" file
However, you can separate them with a ; if they should appear on the same line:
line=$(...); awk "NR>$line" file
But you can significantly simplify the command anyway. Simply use awk twice, like this:
awk -v ln="$(awk '/Date/{l=NR}END{print l}' a.log)" 'NR>ln' a.log
I'm using
awk '/Date/{l=NR}END{print l}' a.log
to obtain the line number of the last occurrence of Date. This value gets passed via -v ln=... to the outer awk command.
Here's a way you could do it, in one invocation of awk and only reading the file once:
awk '/Date/ { n = 1 } { a[n++] = $0 } END { for (i = 1; i < n; ++i) print a[i] }' file
This writes each line to an array a, resetting the counter n back to 1 every time the pattern /Date/ matches. It then loops through the array once the file has been read, printing all the most recently saved values.
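For example, with a hypothetical log containing two appended updates:

```shell
printf 'Date: Jan 1\nold line\nDate: Jan 2\nnew line 1\nnew line 2\n' |
awk '/Date/ { n = 1 } { a[n++] = $0 } END { for (i = 1; i < n; ++i) print a[i] }'
# prints:
# Date: Jan 2
# new line 1
# new line 2
```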

How can I make awk process the BEGIN block for each file it parses?

I have an awk script that I'm running against a pair of files. I'm calling it like this:
awk -f script.awk file1 file2
script.awk looks something like this:
BEGIN {FS=":"}
{ if( NR == 1 )
{
var=$2
FS=" "
}
else print var,"|",$0
}
The first line of each file is colon-delimited. For every other line, I want it to return to the default whitespace field separator.
This works fine for the first file, but fails for subsequent files because FS is not reset to : for each file; the BEGIN block is only processed once.
tl;dr: is there a way to make awk process the BEGIN block once for each file I pass it?
I'm running this on Cygwin bash, in case that matters.
If you're using gawk version 4 or later there's the BEGINFILE block. From the manual:
BEGINFILE and ENDFILE are additional special patterns whose bodies are executed before reading the first record of each command line input file and after reading the last record of each file. Inside the BEGINFILE rule, the value of ERRNO will be the empty string if the file could be opened successfully. Otherwise, there is some problem with the file and the code should use nextfile to skip it. If that is not done, gawk produces its usual fatal error for files that cannot be opened.
For example:
touch a b c
awk 'BEGINFILE { print "Processing: " FILENAME }' a b c
Output:
Processing: a
Processing: b
Processing: c
Edit - a more portable way
As noted by DennisWilliamson you can achieve a similar effect with FNR == 1 at the beginning of your script. In addition to this you could change FS from the command-line directly, e.g.:
awk -f script.awk FS=':' file1 FS=' ' file2
Here the FS variable will retain whatever value it had previously.
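A quick sketch of this with two throwaway files:

```shell
# Variable assignments on the command line take effect when awk reaches
# them in the argument list, so each file can get its own field separator.
printf 'a:b\n' > colon.txt
printf 'c d\n' > space.txt
awk '{ print $2 }' FS=':' colon.txt FS=' ' space.txt
# prints: b, then d
```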
Instead of:
BEGIN {FS=":"}
use:
FNR == 1 {FS=":"}
The FNR variable should do the trick for you. It's the same as NR except it is scoped within the file, so it resets to 1 for every input file.
http://unstableme.blogspot.ca/2009/01/difference-between-awk-nr-and-fnr.html
http://www.unix.com/shell-programming-scripting/46931-awk-different-between-nr-fnr.html
When you want a POSIX-compliant version, the best approach is:
(FNR == 1) { FS=":"; $0=$0 }
This states that, if the file record number (FNR) equals one, we reset the field separator FS. The reassignment $0=$0 then forces awk to re-split the record with the new FS, updating all other fields and the NF built-in variable.
This is equivalent to the GNU awk 4.x BEGINFILE if and only if the record separator (RS) stays unchanged.
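A small sketch of why the $0=$0 matters: without it, the first record keeps the fields it was split into under the old FS.

```shell
# With $0=$0, the first line is re-split using the new FS
printf 'name:alice\nname:bob\n' |
awk '(FNR == 1) { FS=":"; $0=$0 } { print $2 }'
# prints: alice, then bob

# Without $0=$0, the first line was already split on whitespace,
# so $2 is empty for that record
printf 'name:alice\nname:bob\n' |
awk '(FNR == 1) { FS=":" } { print $2 }'
# prints an empty line, then bob
```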
