Need a guide to basic command-line awk syntax - bash

I have read several awk tutorials and seen a number of questions and answers on here, and the problem is that I'm seeing a LOT of variety in how people write their awk one-liners, which has really overcomplicated things in my mind.
So I see things like this:
awk '/pattern/ { print }'
awk '/pattern/ { print $0 }'
awk '/pattern/ { print($0) }'
awk '/pattern/ { print($0); }'
awk 'BEGIN { print }'
awk '/pattern/ BEGIN { print };'
Sometimes I get errors and sometimes not, but because I'm seeing so many different phrasings I'm having trouble fixing syntax errors; I can't figure out what's allowed and what isn't.
Can someone explain this? Does print require parens or not? Are semi-colons required or not? Is BEGIN required or not? What happens when you start an awk script with a /pattern/, and/or just pass it the name of a function like print on its own?

One at a time:
Can someone explain this?
Yes.
Does print require parens or not?
print, like return, is a builtin, not a function, and as such does not use parens at all. When you see print("foo") the parens are associated with the string "foo"; they are NOT in any way part of the print command despite how it looks. It might be clearer (but still not useful in this case) to write it as print ("foo").
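A quick way to see this (all three forms print foo):
awk 'BEGIN { print "foo"; print("foo"); print ("foo") }'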
Are semi-colons required or not?
Not when the statements are on separate lines. As in shell, semi-colons are required to separate statements that occur on a single line.
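For example, these two behave identically (file is just a placeholder name):
awk '{ print $1; print $2 }' file
awk '{
print $1
print $2
}' file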
Is BEGIN required or not?
No. Note that BEGIN is a keyword that represents the condition that exists before the first input file is opened for reading, so BEGIN{print} will just print a blank line since nothing has been read yet. Also, /pattern/ BEGIN is nonsense and should produce a syntax error.
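A quick check you can run in any shell:
printf 'a\nb\n' | awk 'BEGIN { print "before any input" } { print }'
prints "before any input" first, then a and b.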
What happens when you start an awk script with a /pattern/, and/or just pass it the name of a function like print on its own?
An awk script is made up of condition { <action> } sections, with the default condition being TRUE and the default action being print $0. So awk '/pattern/' means: if the regexp "pattern" matches the current record, invoke the default action, which is to print that record. And awk '{ print }' means: the default condition of TRUE applies, so execute the specified action and print the current record. Note also that print by default prints the current record, so print $0 is synonymous with just print.
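For example, relying on those defaults (file is a placeholder name):
awk '1' file              # condition 1 (TRUE), default action: prints every record
awk '/pattern/' file      # default action: prints records matching pattern
awk '{ print }' file      # default condition TRUE: prints every record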
If you are considering starting to use awk, get the book Effective Awk Programming by Arnold Robbins and at least read the first chapter or two.

Function calls require (). Statements do not (though they may appear to allow them).
print and printf are statements, so they do not require () (but do support it: "The entire list of items may be optionally enclosed in parentheses.")
From print we also find out that
The simple statement ‘print’ with no items is equivalent to ‘print $0’: it prints the entire current record.
So we now know that the first three statements are identical.
From Actions we find out that
An action consists of one or more awk statements, enclosed in curly braces (‘{…}’).
and that
The statements are separated by newlines or semicolons.
This tells us that the semicolon is a "separator" and not a terminator, so we don't need one at the end of an action, and therefore the fourth is also identical.
BEGIN is a special pattern, and
[a] BEGIN rule is executed once only, before the first input record is read.
So the fifth is different because it operates once at the start and not on every line.
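A quick check (no input is even read, since the program consists of only a BEGIN rule):
awk 'BEGIN { print }' /dev/null
prints a single empty line and exits.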
And the last is a syntax error because it has two patterns next to each other without an intervening action or separator.

All of those awk commands (except the last 2) can be shortened to:
awk '/pattern/' file
since print $0 is the default action in awk.
Semicolon is optional just before }.
You cannot place BEGIN after /pattern/.

Related

How to use file chunks based on characters instead of lines for grep?

I am trying to parse log files of the form below:
---
metadata1=2
data1=2,data3=5
END
---
metadata2=1
metadata1=4
data9=2,data3=2, data0=4
END
Each section between the --- and END is an entry. I want to select the entire entry that contains a field such as data1. I was able to solve it with the following command, but it is painfully slow.
pcregrep -M '(?s)[\-].*data1.*END' temp.txt
What am I doing wrong here?
Parsing this file with pcregrep might be challenging. pcregrep does not have the ability to break the file into logical records. The pattern that was specified will try to find matching records by combining multiple records together, sometimes including unmatched records in the output.
For example, if the input is "--- data=a END --- data1=a END", then the above command will select both records, as it will form a match between the initial '---' and the trailing 'END'.
For this kind of input, consider using AWK. It has the ability to read input with a custom record separator (RS), which makes it easy to convert the input into records and apply the pattern. If you prefer, you can use Perl or Python.
Using awk's RS to create "records", it is possible to apply the pattern test to every record:
awk -v RS='END\n' '/data1/ { print $0 }' < log1
awk -v RS='END\n' '/data1/ { print NR, $0 }' < log1
The second command includes the record number in the output, if that is useful.
While AWK is not as fast as pcregrep, in this case it will not have trouble processing a large input set.
I would use awk:
awk 'BEGIN{RS=ORS="END\n"}/\ydata1/' file
Explanation:
awk works based on input records. By default such a record is a line of input, but this behaviour can be changed by setting the record separator (and output record separator for the output).
By setting them to END\n, we can search whole records of your input.
The regular expression /\ydata1/ searches those records for the presence of the term data1; the \y matches a word boundary, to prevent matching metadata1.
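Run against the sample log above, the output should be just the first entry (the \y keeps metadata1 in the second entry from matching):
---
metadata1=2
data1=2,data3=5
END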

extract data between similar patterns

I am trying to use sed to print the contents between two patterns including the first one. I was using this answer as a source.
My file looks like this:
>item_1
abcabcabacabcabcabcabcabacabcabcabcabcabacabcabc
>item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
>item_3
cdecde
>item_4
defdefdefdefdefdefdef
I want it to start searching from item_2 (and include it) and finish at the next occurring > (not included). So my code is sed -n '/item_2/,/>/{/>/!p;}'.
The result wanted is:
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
but I get it without item_2.
Any ideas?
Using awk, split the input by >s and print the part(s) matching item_2:
$ awk 'BEGIN{RS=">";ORS=""} /item_2/' file
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
I would go for the awk method suggested by oguz for its simplicity. Now if you are interested in a sed way, out of curiosity, you could fix what you have already tried with a minor change:
sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }' input_file
The empty regex // recalls the previous regex, which is handy here to avoid duplicating /item_2/. But keep in mind that // is actually dynamic: it recalls the latest regex evaluated at runtime, which is not necessarily the closest regex on its left (although that is often the case). Depending on the program flow (branching, address range), the content of the same // can change and... actually here we have an interesting example! (and I'm not saying that because it's my baby ^^)
On a line where /^>item_2/ matches, the s/.// command is executed and the latest regex before // becomes /./, so the following address range is equivalent to /./,/>/.
On a line where /^>item_2/ does not match, the latest regex before // is /^>item_2/ so the range is equivalent to /^>item_2/,/>/.
To avoid confusion here as the effect of // changes during execution, it's important to note that an address range evaluates only its left side when not triggered and only its right side when triggered.
This might work for you (GNU sed):
sed -n ':a;/^>item_2/{s/.//;:b;p;n;/^>/!bb;ba}' file
Turn off implicit printing -n.
If a line begins with >item_2, remove the first character, print the line, and fetch the next line.
If that line does not begin with a >, repeat the last two instructions.
Otherwise, repeat the whole set of instructions.
If there will always be only one line following >item_2, then:
sed '/^>item_2/!d;s/.//;n' file

Regex not working as field separator on awk

I have this text file foo.txt which contains words mixed with punctuation marks.
What I want to do is filter out every punctuation mark using awk, so I used a regex as the field separator, like this: awk -F '[^a-zA-Z]+' '{ print $0 }' foo.txt. The problem I'm facing is that the text stays just like the original; nothing is filtered.
Anyone knows why this happens?
Input
¿Hello? How... are foo you?'
Bye ,, hehe '" .lol
Result Expected
Hello How are foo you
Bye hehe lol
P.D
I know I can achieve the same result using sed with something like sed 's/[[:punct:]]//g' foo.txt or sed s/[^A-Za-z]/" "/g foo.txt, but I want to know why the awk command is not working. I've already investigated everywhere and I can't find an answer; I'm not going to be able to sleep.
If you want to know where you can find the rules behind this, I would like to point to the Awk POSIX standard:
However, you have to piece the answer together from two locations:
DESCRIPTION
The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non- <blank> non- <newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable or the -F sepstring option. The awk utility shall denote the first field in a record $1, the second $2, and so on. The symbol $0 shall refer to the entire record; setting any other field causes the re-evaluation of $0. Assigning to $0 shall reset the values of all other fields and the NF built-in variable.
Variables and Special Variables
References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value. Such references shall not create new fields. However, assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS. Each field variable shall have a string value or an uninitialized value when created. Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters.
It is a bit awkward to find the rule for recomputing $0 when new fields are introduced, but this is essentially the rule.
Furthermore, the statement print $0 prints the entire record. So according to the above, you first need to recompute your $0, as shown in the answer of @oguzismail.
So changing the field separator can be done in the following way:
awk 'BEGIN{FS="oldFS"; OFS="newFS"}{$1=$1}1' <file>
Remark: you do not need to check whether the line contains any fields, as in NF{$1=$1}, since {$1=$1} will just introduce an empty field without an extra OFS.
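As a concrete sketch for the foo.txt in this question (OFS is left at its default single space; note that leading or trailing punctuation on a line yields empty first/last fields, so a stray blank may appear at either end of a line):
awk -F'[^a-zA-Z]+' '{$1=$1} 1' foo.txt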

gsub issue with awk (gawk)

I need to search a text file for a string, and make a replacement that includes a number that increments with each match.
The string to be "found" could be a single character, or a word, or a phrase.
The replacement expression will not always be the same (as it is in my examples below), but will always include a number (variable) that increments.
For example:
1) I have a test file named "data.txt". The file contains:
Now is the time
for all good men
to come to the
aid of their party.
2) I placed the awk script in a file named "cmd.awk". The file contains:
/f/ {sub ("f","f(" ++j ")")}1
3) I use awk like this:
awk -f cmd.awk data.txt
In this case, the output is as expected:
Now is the time
f(1)or all good men
to come to the
aid of(2) their party.
The problem comes when there is more than one match on a line. For example, if I was searching for the letter "i" like:
/i/ {sub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.
which is wrong because it doesn't include the "i" in "time" or "their".
So, I tried "gsub" instead of "sub" like:
/i/ {gsub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.
Now it makes the replacement for all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
The desired output should be:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
Note: The number won't always begin with "1" so I might use awk like this:
awk -f cmd.awk -v j=26 data.txt
To get the output:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
And just to be clear, the number in the replacement will not always be inside parenthesis. And the replacement will not always include the matched string (actually it would be quite rare).
The other problem I am having with this is...
I want to use an awk-variable (not environment variable) for the "search string", so I can specify it on the awk command line.
For example:
1) I placed the awk script in a file named "cmd.awk". The file contains something like:
/??a??/ {gsub (a,a "(" ++j ")")}1
2) I would use awk like this:
awk -f cmd.awk -v a=i data.txt
To get the output:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
The question here is: how do I represent the variable "a" in the /search/ expression?
awk version:
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
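Run against the sample data.txt (the file name appended after the FS/OFS assignments), this should reproduce the desired numbering; seeding the counter with -v k=26 would give the offset variant:
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i data.txt
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.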
gensub() sounds ideal here: it allows you to replace the Nth match, so what sounds like a solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work if the replacement does not contain the original text (or worse, contains it multiple times); see below.
So in awk, lacking perl's "s///e" evaluation feature, and its stateful regex /g modifier (as used by Steve) the best remaining option is to break the lines into chunks (head, match, tail) and stick them back together again:
BEGIN {
if (j=="") j=1
if (a=="") a="f"
}
match($0,a) {
str=$0; newstr=""
do {
newstr=newstr substr(str,1,RSTART-1) # head
mm=substr(str,RSTART,RLENGTH) # extract match
sub(a,a"("j++")",mm) # replace
newstr=newstr mm
str=substr(str,RSTART+RLENGTH) # tail
} while (match(str,a))
$0=newstr str
}
{print}
This uses match() as an expression instead of a // pattern so you can use a variable. (You can also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)
You can define j and a on the command line.
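For example, with the script above saved as cmd.awk (note that it post-increments j, so pass the first number you actually want as the starting value):
awk -v a=i -f cmd.awk data.txt
awk -v a=i -v j=27 -f cmd.awk data.txt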
gawk supports \y which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word, just take care to add extra escapes from a unix command line (I'm not quite sure what Windows might require or permit).
Limited gensub() version
As referenced above:
match($0,a) {
idx=1; str=$0
do {
prev=str
str=gensub(a,a"(" j ")",idx++,prev)
} while (str!=prev && j++)
$0=str
}
The problems here are:
if you replace substring "i" with substring "k" or "k(1)" then the gensub() index for the next match will be off by 1. You could work around this if you either know that in advance, or work backward through the string instead.
if you replace substring "i" with substring "ii" or "ii(i)" then a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match)
Dealing with both conditions robustly is not worth the code.
I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.
To include a count of the letter i beginning at 26, try:
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt
This could also be a shell var:
var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt
Results:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
To include a count of specific words, add word boundaries (i.e. \b) around the words, try:
perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt
Results:
Now is the(6) time
for all good men
to come to the(7)
aid of their party.

Running an awk by splitting the lines

This is such a basic question about awk, but I am facing issues with it and I don't know why. The problem is that when I run the awk command on a single line, such as
awk 'BEGIN {} {print $0;}' FILE
Then the code runs perfectly.
But if I split the code across lines, such as
awk '
BEGIN
{
}
{
print $0;
}' FILE
It gives me an error stating that BEGIN should have an action part. Since it is the same code, just reformatted, why am I getting this error? It's really important for me to solve this, as I will be writing long awk programs and it would be difficult to format and squeeze them onto a single line every time. Could you please help me with this? Thank you. Note: I am running this awk in a shell environment.
Add the '{' right after the BEGIN and you will not get the error message.
The opening brace { for BEGIN needs to be on the same line as BEGIN. So change what you have
awk '
BEGIN
{
to
awk '
BEGIN {
and you won't get the error message.
The manual does state that "BEGIN and END rules must have actions;", so that may be another problem. This
awk 'BEGIN {} ...
seems a bit odd to me (and there's really no reason to have this if nothing is happening)
@Birei's helpful comment below explains that the way these statements parse "will be different in both cases. The open '{' in the next line is parsed as an action without pattern (not related with BEGIN), while in the same line it means an empty action of the BEGIN rule."
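For reference, either of these multi-line forms parses cleanly once the brace stays on the BEGIN line (or the empty BEGIN rule is dropped altogether):
awk '
BEGIN {
}
{
    print $0
}' FILE
awk '
{
    print $0
}' FILE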

Resources