How do pipes inside awk work (Sort with keeping header) - shell

The following command outputs the header of a file and sorts the records after the header. But how does it work? Can anyone explain this command?
awk 'NR == 1; NR > 1 {print $0 | "sort -k3"}'

Could you please go through the following once (for explanation purposes only). For learning more awk concepts I suggest going through Stack Overflow's nice awk learning section.
awk ' ##Starting the awk program here.
NR == 1; ##If this is the first line, print it.
##awk works on a condition-then-action model; since there is NO ACTION mentioned here, the default action of printing the current line happens.
NR > 1{ ##For every line after the 1st, do the following.
print $0 | "sort -k3" ##Keep writing lines to the pipe; sort collects them all and, before printing, sorts them by their 3rd field.
}'

Understanding the awk command:
Overall, an awk program is built out of (pattern){action} pairs which state that if pattern returns a non-zero value, action is executed. One does not necessarily need to write both. If pattern is omitted, it defaults to 1 and if action is omitted, it defaults to print $0.
When looking at the command in question:
awk 'NR == 1; NR > 1 {print $0 | "sort -k3"}'
We notice that there are two pattern-action pairs. The first reads NR == 1 and states that if we are processing the first record (pattern), then print the record (default action). The second is a bit more tricky. The pattern is clear; the action, on the other hand, needs some explaining.
awk knows 4 output statements that can redirect the output. One of these reads expression | cmd. It essentially means that awk will write output to a stream that is piped as input to the command cmd. It will keep on writing to that stream until the stream is explicitly closed using a close(cmd) statement or until awk itself terminates.
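A minimal sketch of the explicit-close form (the same command as in the question, just closing the stream by hand rather than relying on awk exiting; file is a placeholder name):
awk '
NR == 1                          # print the header line straight away
NR > 1 { print $0 | "sort -k3" } # keep feeding body lines to the same open pipe
END    { close("sort -k3") }     # close the stream; only now does sort emit its sorted output
' file
Note that the command string must match character for character ("sort -k3") for awk to reuse the same open stream.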
In the case of the OP, the action reads { print $0 | "sort -k3" }, meaning that it will print all records $0 to a stream that is used as input of the shell command sort -k3. Only when the stream is closed, which here happens when the program finishes, will sort write its output.
Recap: the command of the OP will print the first line of a file, and sort the subsequent lines according to the third column.
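For instance, with a small hypothetical file (a header plus three data lines), the header stays on top and the remaining lines come out ordered by the third field:
$ cat file
id  name  score
1   bob   30
2   ann   10
3   cat   20
$ awk 'NR == 1; NR > 1 {print $0 | "sort -k3"}' file
id  name  score
2   ann   10
3   cat   20
1   bob   30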
Alternative solutions:
Using GNU awk, it is better to do:
awk '(FNR==1);(FNR>1){a[$3]=$0}
END{PROCINFO["sorted_in"]="@ind_str_asc"
for(i in a) print a[i]
}' file
Using pure shell, it is better to do:
cat file | (read -r; printf "%s\n" "$REPLY"; sort -k3)
Related questions:
Is there a way to ignore header lines in a UNIX sort?

| is one of the redirections supported by print and printf; in this case it pipes to the command sort -k3. You might also use redirection to write to a file using >:
awk 'NR == 1; NR > 1 {print $0 > "output.txt"}'
or append to file using >>:
awk 'NR == 1; NR > 1 {print $0 >> "output.txt"}'
The first will write all lines but the first to the file output.txt; the second will append all lines but the first to output.txt. In both cases the first line itself is still printed to standard output.
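The file form is also handy for splitting input by a key. A small sketch (the output names derived from $1 and the input name file are only illustrative): the > target is truncated the first time that filename is used, and the stream then stays open, so later prints to the same name keep appending within the run.
awk '{ print $0 > ($1 ".txt") }' file
With very many distinct keys you can hit the open-file limit, in which case closing finished streams with close($1 ".txt") avoids it.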

Related

How to add an if statement before calculation in AWK

I have a series of files that I am looping through, calculating the mean of a column within each file after performing a series of filters. Each filter is piped into the next, BEFORE calculating the mean on the final output. All of this is done within a subshell to assign the result to a variable for later use.
for example:
variable=$(filter1 | filter 2 | filter 3 | calculate mean)
to calculate the mean I use the following code
... | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
So, my problem is that depending on the file, the number of rows after the final filter is reduced to 0, i.e. the pipe passes nothing to AWK and I end up with awk: fatal: division by zero attempted printed to the screen, and the variable then remains empty. I later print the variable to a file and in this case I end up with a BLANK in the text file. Instead, what I am attempting to do is state that if NR==0 then assign 0 to the variable, so that my final output in the text file is 0.
To do this I have tried to add an if statement at the start of my awk command
... | awk '{if (NR==0) print 0}BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
but this doesn't change the output/error and I am left with BLANKs.
I did move the BEGIN statement around, but this caused other errors (syntax and output errors).
Expected results:
given that the column from a file has 5 lines and looks as below, I would filter on apple and pipe into the calculation:
apple 10
apple 10
apple 10
apple 10
apple 10
code:
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /apple/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
then I would expect the variable to be set to 10 (10*5/5 = 10)
In the following scenario where I filter on banana
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
given that the pipe passes nothing to AWK I would want the variable to be 0
is it just easier to accept the blank space and change it later when printed to file - i.e. replace BLANK with 0?
The default value of a variable which you treat as a number in AWK is 0, so you don't need BEGIN {s=0}.
You should put the condition in the END block. NR is not the total number of rows, but the index of the current row, so it only holds the total row count once you are in the END block.
awk '{s += $5} END { if (NR == 0) { print 0 } else { print s/NR } }'
Or, using a ternary:
awk '{s += $5} END { print (NR == 0 ? 0 : s/NR) }'
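As a quick sanity check of both cases (a sketch; the first command pipes in nothing at all, mimicking a filter chain that matched no lines, the second pipes in two rows whose 5th field is 10 and 20):
$ printf '' | awk '{s += $5} END { print (NR == 0 ? 0 : s/NR) }'
0
$ printf 'a b c d 10\na b c d 20\n' | awk '{s += $5} END { print (NR == 0 ? 0 : s/NR) }'
15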
Also, a side note about your {OFS="\t"; if($1 ~ /banana/) print $0} examples: most of that code is unnecessary. You can just pass the condition:
awk -F'\t' '$1 ~ /banana/'
When an awk program is only a condition, it uses that as a condition for whether or not to print a line. So you can use conditions as a quick way to filter through the text.
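For example, with some made-up tab-separated input, the bare condition alone does the filtering:
$ printf 'apple\t10\nbanana\t20\n' | awk -F'\t' '$1 ~ /banana/'
banana  20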
The correct way to write:
awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
is (assuming a regexp comparison for $1 really is appropriate, which it probably isn't):
awk 'BEGIN{FS=OFS="\t"} $1 ~ /banana/{ s+=$5; c++ } END{print (c ? s/c : 0)}' file.in
Is that what you're looking for?
Or are you trying to get the mean per column 1 like this:
awk 'BEGIN{FS=OFS="\t"} { s[$1]+=$5; c[$1]++ } END{ for (k in s) print k, s[k]/c[k] }' file.in
or something else?

Exiting an AWK statement after printing a block of text

My problem is that I have a very large database (10GB) and I want to save as much time as possible searching through it. I have an awk statement that is searching through the database and depending on the pattern, writes the data into another file.
I have an input file that will be fed into my script as a Terminal argument variable. There are several lines of data within it that will be used as the pattern for the awk statement.
Within the database, all the lines that match the pattern are sorted next to each other, so essentially, after printing, there is no need to search any further into the database because everything has already been found. Once awk finds the first line matching the pattern, all the other matching lines come sequentially after it.
This problem is hard to explain with just words, so I've created a few examples of what my files, code, and the database look and operate like.
The input file via Terminal looks like this:
group_1
group_2
group_3
...
The 10GB database looks like this:
group_1 DATA ...
group_1 DATA ...
group_1 DATA ...
group_2 DATA ...
group_2 DATA ...
group_2 DATA ...
group_2 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...
...
The script code with the awk statement in question looks like this:
IFS=$'\n'
set -f
for var in $(cat < "$1")
do
awk -v seq="$var" '{if (match($1, seq)) {print $0}}' filepath/database > pattern_matched.file
done
A brief explanation of what this code does: it takes the Terminal argument (a filename in this case) and opens it so the for loop can iterate over its lines. The pattern group_1, for example, is placed in var and the search through the database begins. If the first column matches the pattern, the line is saved into the file pattern_matched.file.
Currently, it searches through the entire 10GB of data and prints the matching lines into the file as intended, but it wastes a lot of time. After printing the lines that match the pattern, I want awk to stop searching through the database and move on to the next pattern from the input file. An example behavior for group_2 would be awk checking the first 3 lines of the database and seeing that none of them match the pattern. However, line 4 contains the pattern, so it prints the line and the subsequent matching lines after it. When awk reaches line 8, it exits the awk statement and the for loop can then iterate to the next pattern to be searched for, group_3.
awk '{print $0; exit}' filename
Something like this does not work since it only prints the first instance and breaks out, I want something that can print all the matches and as soon as it finds the next non-pattern match, it breaks out.
Thanks in advance.
UPDATE:
The current problem now is that the solution given below makes logical sense. If it enters the if-statement, it would print the line into the file and iterate to the next line. If the line did not match, it would enter the else-if statement and exit the awk. This makes a lot of sense to me, but for some reason, once the flag variable has been set to 1 by the if-statement for the first matched line, it enters the else-if statement. Since the else-if condition evaluates to true, it exits before even scanning the next line. I confirmed this behavior with print statements everywhere in the awk statement.
This is my code with print statements:
awk -v seq="$seqid" '{if(match($1, seq)) {print "matched" ; print $1 ; flag=1} else if (flag) {print "not matched" ; exit}}'
which outputs the "weird behavior" shown in the attached screenshot (not reproduced here).
Can't you just read the input file (input_file) into awk:
$ cat input_file
group_1
group_3
Awk script:
$ awk 'NR==FNR{a[$0];next} $1 in a' input_file database
group_1 DATA ...
group_1 DATA ...
group_1 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...
Your shell code:
for var in $(cat < "$1")
do
awk 'script' filepath/database > pattern_matched.file
done
is using an anti-pattern to read the input file stored in $1 (see http://mywiki.wooledge.org/BashFAQ/001) and will overwrite pattern_matched.file on every iteration of the loop. You should, I suspect, have written it as:
while IFS= read -r var
do
awk 'script' filepath/database
done < "$1" > pattern_matched.file
Your awk code:
awk -v seq="$var" '{if (match($1, seq)) {print $0}}'
is using match() unnecessarily, since you just want to do a regexp comparison and aren't using the variables that match() populates to help you isolate the matching string (RSTART/RLENGTH), and it's using a default null condition, putting the real condition in the action space, and then hard-coding the default action of printing the current record. It's equivalent to just:
awk -v seq="$var" '$1 ~ seq'
but I'm not convinced you actually need a regexp comparison - given your example you should be doing a string comparison instead:
awk -v seq="$var" '$1 == seq'
Given that your posted example may be misleading, you'd just choose whichever of these is appropriate based on whether you want a regexp or string comparison and a partial or full match on $1:
awk -v seq="$var" '$1 == seq' # full string
awk -v seq="$var" 'index($1,seq)' # partial string
awk -v seq="$var" '$1 ~ ("^"seq"$")' # full regexp
awk -v seq="$var" '$1 ~ seq' # partial regexp
Let's say we go with that first full-string match; then, to exit once the matching $1 has been processed, you would use:
awk -v seq="$var" '$1 == seq{print; f=1; next} f{exit}'
which would make your full code:
while IFS= read -r var
do
awk -v seq="$var" '$1 == seq{print; f=1; next} f{exit}' filepath/database
done < "$1" > pattern_matched.file
BUT I doubt if you need a shell loop at all and you could just do this instead:
awk 'NR==FNR{seqs[$1]; next} $1 in seqs' "$1" filepath/database > pattern_matched.file
or some other variant that just has awk (or maybe just join) read the input files once. You can make the above exit after all seqs[] have been processed by:
awk '
NR==FNR { seqs[$1]; numSeqs++; next }
$1 in seqs { print; if ($1 != prev) numSeqs--; prev = $1; next }
numSeqs == 0 { exit }
' "$1" filepath/database > pattern_matched.file
or similar.
I think this should do the trick:
awk -v seq="$var" '{if (match($1, seq)) {print $0; found=1} else if (found) { exit }}'
Similar to David C. Rankin's answer, but there is no need to pass the rd=0 argument to awk, since in awk any uninitialized variable is initialized to zero when it's first used.
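A quick illustration of that default (a sketch; neither s nor x is initialized anywhere, yet both behave as 0 in a numeric context):
$ printf '1\n2\n3\n' | awk '{ s += $1 } END { print s, x + 0 }'
6 0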
As we do not really know what you intend to do with your program, I will just give you an awk solution:
awk -v seq="$var" '($1!=seq) { if(p) exit; next }($1==seq){p=1}p'
This uses the flag p to check whether it has already met the sequence seq. A simple if condition determines whether it should exit awk or move on to the next record. Exiting is done after seq has been found; moving to the next record is done before.
However, since you place this in a loop, it will read the file over and over again. If you want to make a sub-selection, you could use the solution of James Brown.
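As a usage sketch of the one-liner above, assuming the sample database layout from the question and a single pattern, it prints the four group_2 lines and then exits at the first group_3 line instead of scanning the rest of the 10GB file:
$ var=group_2
$ awk -v seq="$var" '($1!=seq) { if(p) exit; next }($1==seq){p=1}p' filepath/database
group_2 DATA ...
group_2 DATA ...
group_2 DATA ...
group_2 DATA ...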

Using a value stored in a different file in awk

I have a value stored in a file named cutoff1
If I cat cutoff1 it will look like
0.34722
I want to use the value stored in cutoff1 inside an awk script. Something like following
awk '{ if ($1 >= 'cat cutoff1' print $1 }' hist1.dat >hist_oc1.dat
I think I am making some mistakes. If I do it manually, it will look like
awk '{ if ($1 >= 0.34722) print $1 }' hist1.dat >hist_oc1.dat
How can I use the value stored in cutoff1 file inside the above mentioned awk script?
The easiest ways to achieve this are
awk -v cutoff="$(cat cutoff1)" '($1 >= cutoff){print $1}' hist.dat
awk -v cutoff="$(< cutoff1)" '($1 >= cutoff){print $1}' hist.dat
or
awk '(NR==FNR){cutoff=$1;next}($1 >= cutoff){print $1}' cutoff1 hist.dat
or
awk '($1 >= cutoff){print $1}' cutoff="$(cat cutoff1)" hist.dat
awk '($1 >= cutoff){print $1}' cutoff="$(< cutoff1)" hist.dat
Note: thanks to Glenn Jackman for pointing to man bash on command substitution: "Bash performs the expansion by executing command and replacing the command substitution with the standard output of the command, with any trailing newlines deleted. Embedded newlines are not deleted, but they may be removed during word splitting. The command substitution $(cat file) can be replaced by the equivalent but faster $(< file)."
Since awk can read multiple files, just add the filename before your data file and treat the first line specially. No need for an external variable declaration.
awk 'NR==1{cutoff=$1; next} $1>=cutoff{print $1}' cutoff data
PS: just noticed that it's similar to #kvantour's second answer, but keeping it here as a different flavor.
You could use getline to read a value from another file at your convenience. First the main file to process:
$ cat > file
wait
wait
did you see that
nothing more to see here
And cutoff:
$ cat cutoff
0.34722
An awk script that reads a line from cutoff when it meets the string see in a record:
$ awk '/see/{if((getline val < "cutoff") > 0) print val}1' file
wait
wait
0.34722
did you see that
nothing more to see here
Explained:
$ awk '
/see/ { # when string see is in the line
if((getline val < "cutoff") > 0) # read a value from cutoff if there are any available
print val # and output the value from cutoff
}1' file # output records from file
As there was only one value, it was printed only once even though see was seen twice.
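To see that sequential behaviour more clearly, here is a variant sketch where the cutoff file (named cutoff2 here, purely hypothetical) holds two values; each record matching see consumes the next value in order:
$ printf '0.34722\n0.99999\n' > cutoff2
$ awk '/see/{if((getline val < "cutoff2") > 0) print val}1' file
wait
wait
0.34722
did you see that
0.99999
nothing more to see here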

bash grep for string and ignore above one line

One of my scripts returns output as below:
NameComponent=Apache
Fixed=False
NameComponent=MySQL
Fixed=True
So in the above output, I am trying to ignore the two lines shown below using grep -vB1 'False', which seems not to work:
NameComponent=Apache
Fixed=False
Is it possible to perform this using grep, or is there a better way with awk?
<some-command> |tac |sed -e '/False/ { N; d}' |tac
NameComponent=MySQL
Fixed=True
For every line that matches "False", the code in the {} gets executed. N pulls the next line into the pattern space as well, and then d deletes the whole thing before moving on to the next line. Note: using multiple pipes is not considered good practice.
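If you would rather avoid the triple pipe entirely, here is a hedged awk sketch that assumes the output always comes in NameComponent/Fixed pairs: remember the NameComponent line and only print the pair when the Fixed line is not False:
<some-command> | awk '
/^NameComponent=/   { prev = $0; next }      # remember the first line of the pair
$0 != "Fixed=False" { print prev; print }    # emit the pair unless it is marked False
'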
@Karthi1234: If your Input_file is the same as the provided samples, then try:
awk -F' |=' '($2 != "Apache" && $2 != "False")' Input_file
First we make the field separator a space or =, then we check that the 2nd field's value is neither the string Apache nor False; since no action is mentioned, awk performs its default print action.
EDIT: as per OP's request, here is the changed code; try:
awk '!/Apache/ && !/False/' Input_file
You could change the strings too in case these are not the ones you want; the logic stays the same.
EDIT2: e.g., you could change the values of string1 and string2 and add more conditions if needed as per your requirement.
awk '!/string1/ && !/string2/' Input_file
If I understand the question correctly, you will always have a line before "Fixed=..." and you want to print both lines if and only if "Fixed=True".
The following awk should do the trick:
< command > | awk 'BEGIN {prev="NA"} {if ($0=="Fixed=True") {print prev; print $0;} prev=$0;}'
Note that if the first line is "Fixed=True" it will print the string "NA" as the first line.

Deleting the first two lines of a file using BASH or awk or sed or whatever

I'm trying to delete the first two lines of a file by just not printing them to another file. I'm not looking for something fancy. Here's my (failed) attempt at awk:
awk '{ (NR > 2) {print} }' myfile
That throws out the following error:
awk: { NR > 2 {print} }
awk: ^ syntax error
Example:
contents of 'myfile':
blah
blahsdfsj
1
2
3
4
What I want the result to be:
1
2
3
4
Use tail:
tail -n+3 file
from the man page:
-n, --lines=K
output the last K lines, instead of the last 10; or use -n +K
to output lines starting with the Kth
How about:
tail +3 file
OR
awk 'NR>2' file
OR
sed '1,2d' file
You're nearly there. Try this instead:
awk 'NR > 2 { print }' myfile
awk is rule based, and the rule appears bare (i.e., without braces) before the block it would execute if it passes.
Also, as Jaypal has pointed out, in awk if all you want to do is print the line that matches the rule, you can even omit the action, thus simplifying the command to:
awk 'NR > 2' myfile
awk is based on pattern{action} statements. In your case, the pattern is NR>2 and the action you want to perform is print. This action is also the default action of awk.
So even though
awk 'NR>2{print}' filename
would work fine, you can shorten it to
awk 'NR>2' filename
