Printing contents with a specific range with Awk - bash

I have a text file containing:
Location 1 40.733596 -74.003139
Location 2 43.758102 -73.975734
Location 3 41.732456 -74.003755
Location 4 42.345907 -71.087001
where the first column is just a location count, the second column represents the latitude and third represents the longitude.
I'm trying to write an awk command to only print out the location within a specific latitude and longitude range.
awk -F '\t' '$2>40,$2<=42,$3>=-71,$3<=74 {print $1,$2,$3}'LatLon.txt
in the pattern segment of the awk command I'm trying to specify the range for the column 2 and column 3 where it prompts bash to only print the location within 40-42 lat and -71 to -74 lon range.
I'm getting an error mentioning:
awk: bailing out at source line 1
due to the pattern segment of my awk line. How do I properly specify the range?

Your code:
awk -F '\t' '$2>40,$2<=42,$3>=-71,$3<=74 {print $1,$2,$3}'LatLon.txt
This has a few errors in it:
You need to combine conditionals with && rather than commas
Your test on $3 won't pass even once corrected, since you're asking for values between -71 and 74, yet all the given values are lower than -71
You need a space between the awk code and your file.
This code should work for you:
awk -F '\t' '(40 < $2 && $2 <= 42) && (-74 <= $3 && $3 <= -71)' LatLon.txt
You may notice the lack of an action here. The default action is to print the line as-is, so this is roughly comparable to the action you gave (though {print $1,$2,$3} re-concatenates those fields using OFS which defaults to a space rather than a tab; you could do OFS="\t"; print $1,$2,$3 to preserve that or just print $0 which is what happens by default without an action.)
The parentheses are technically unnecessary. They are provided for legibility.
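To try the corrected condition against the sample rows, here is a minimal sketch that passes the bounds in as awk variables. The helper name filter_latlon, the variable names and the widened longitude bound of -74.5 are assumptions made for illustration: with the exact -74 to -71 bounds none of the four sample rows matches, because Locations 1 and 3 sit just below -74 and Locations 2 and 4 are above latitude 42. It also assumes the file is tab-separated with "Location N" as the first column.

```shell
#!/bin/sh
# filter_latlon LAT_MIN LAT_MAX LON_MIN LON_MAX FILE   (hypothetical helper)
# Prints rows with LAT_MIN < $2 <= LAT_MAX and LON_MIN <= $3 <= LON_MAX.
filter_latlon() {
    awk -F '\t' -v lat_min="$1" -v lat_max="$2" -v lon_min="$3" -v lon_max="$4" '
        ($2 > lat_min && $2 <= lat_max) && ($3 >= lon_min && $3 <= lon_max)
    ' "$5"
}

# Sample data from the question, written with real tabs between the columns.
data=$(mktemp)
trap 'rm -f "$data"' EXIT
printf 'Location 1\t40.733596\t-74.003139\n' >  "$data"
printf 'Location 2\t43.758102\t-73.975734\n' >> "$data"
printf 'Location 3\t41.732456\t-74.003755\n' >> "$data"
printf 'Location 4\t42.345907\t-71.087001\n' >> "$data"

# Assumed bounds: latitude (40, 42], longitude [-74.5, -71].
filter_latlon 40 42 -74.5 -71 "$data"   # prints the rows for Location 1 and Location 3
```

Passing the limits with -v keeps the awk program itself unchanged when the range changes.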

Related

awk to get first column if the a specific number in the line is greater than a digit

I have a data file (file.txt) contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15, like here I will have 2nd and 3rd line first column only is expected.
345
456
I tried something like cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line, which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1)+0 > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's three-argument match function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. If the 2nd element of arr is greater than 15, it prints the 1st element of arr, as required.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let the content of file.txt be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= starts, and substr, to get the 2 characters after ETA=. The offset 4 is used because ETA= is 4 characters long and index gives its start position. I add +0 to convert the result to a number and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
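If some rows might lack ETA= entirely, index returns 0 and substr then reads characters 4 and 5 of the line, which can pass the test by accident. A minimal sketch of a guarded variant follows; the helper name eta_over, the limit parameter and the extra tag-less sample row are assumptions for illustration, not part of the answer above.

```shell
#!/bin/sh
# eta_over LIMIT FILE   (hypothetical helper)
# Prints field 1 of every line whose 2-digit ETA hour is greater than LIMIT,
# skipping lines that contain no "ETA=" at all.
eta_over() {
    awk -v limit="$1" '
        { pos = index($0, "ETA=") }      # pos is 0 when the tag is absent
        pos && substr($0, pos + 4, 2) + 0 > limit { print $1 }
    ' "$2"
}

sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
78 99 bottles, no tag on this made-up row
EOF

eta_over 15 "$sample"   # prints 345
```

Without the pos guard, the last row would be reported, because characters 4 and 5 of that line are "99".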
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below) and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
v["ETA"]+0 > 15 {
print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. The simplest fix is probably to split the line separately, on whitespace, to obtain the first column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one). Adding 0 forces a numeric interpretation, which simply ignores any non-numeric text after the number at the beginning of the field. So for example, on the first line, we are literally converting 12:00, team=xyz,user1=tom,dom=dby.com to a number, which yields 12, and then checking if 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
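One caveat worth checking here: awk only compares a field numerically when the whole field looks like a number, so a field such as 2:00,... compared directly against 15 is compared as a string. The sketch below shows the difference on a made-up input line; the helper name compare_eta is hypothetical.

```shell
#!/bin/sh
# compare_eta LINE   (hypothetical helper)
# Prints two flags: the result of comparing the text after "ETA=" with 15
# as-is, and the result after forcing that text to a number with +0.
compare_eta() {
    printf '%s\n' "$1" | awk -F 'ETA=' '{ print ($2 > 15), ($2 + 0 > 15) }'
}

compare_eta '123 pro=tegs, ETA=2:00, team=xyz'   # prints "1 0"
```

The first flag is 1 because the string "2:00, team=xyz" sorts after the string "15"; the second is 0 because 2 is not greater than 15.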
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

How to add an if statement before calculation in AWK

I have a series of files that I am looping through and calculating the mean on a column within each file after performing a series of filters. Each filter is piped into the next, BEFORE calculating the mean on the final output. All of this is done within a subshell to assign it to a variable for later use.
for example:
variable=$(filter1 | filter 2 | filter 3 | calculate mean)
to calculate the mean I use the following code
... | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
So, my problem is that depending on the file, the number of rows after the final filter is reduced to 0, i.e. the pipe passes nothing to AWK and I end up with awk: fatal: division by zero attempted printed to screen, and the variable then remains empty. I later print the variable to file and in this case I end up with BLANK in a text file. Instead what I am attempting to do is state that if NR==0 then assign 0 to the variable so that my final output in the text file is 0.
To do this I have tried to add an if statement at the start of my awk command
... | awk '{if (NR==0) print 0}BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
but this doesn't change the output/ error and I am left with BLANKs
I did move the begin statement but this caused other errors (syntax and output errors)
Expected results:
given that column from a file has 5 lines and looks thus, I would filter on apple and pipe into the calculation
apple 10
apple 10
apple 10
apple 10
apple 10
code:
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /apple/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
then I would expect the variable to be set to 10 (10*5/5 = 10)
In the following scenario where I filter on banana
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
given that the pipe passes nothing to AWK I would want the variable to be 0
is it just easier to accept the blank space and change it later when printed to file - i.e. replace BLANK with 0?
The default value of a variable which you treat as a number in AWK is 0, so you don't need BEGIN {s=0}.
You should put the condition in the END block. NR is not the total number of rows but the number of the current row, so it only equals the total once all input has been read, which is the case in the END block.
awk '{s += $5} END { if (NR == 0) { print 0 } else { print s/NR } }'
Or, using a ternary:
awk '{s += $5} END { print (NR == 0 ? 0 : s/NR) }'
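To see how this behaves inside a command substitution, here is a minimal sketch. The helper name mean_of_column and its column argument are assumptions for illustration, and the sample rows use the two-column layout shown in the question, so the sketch averages column 2 rather than $5.

```shell
#!/bin/sh
# mean_of_column COL   (hypothetical helper)
# Reads whitespace-separated rows on stdin and prints the mean of column COL,
# or 0 when no rows arrive.
mean_of_column() {
    awk -v col="$1" '{ s += $col } END { print (NR ? s / NR : 0) }'
}

variable=$(printf 'apple 10\napple 10\napple 10\n' | mean_of_column 2)
echo "$variable"    # 10

variable=$(printf '' | mean_of_column 2)   # empty input, as after a filter that matches nothing
echo "$variable"    # 0
```

Because the zero case is handled inside awk, the variable never ends up empty and nothing needs to be patched afterwards.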
Also, a side note about your BEGIN{OFS='\t'} ($1 ~ /banana/) { print $0 } examples: most of that code is unnecessary. You can just pass the condition:
awk -F'\t' '$1 ~ /banana/'
When an awk program is only a condition, it uses that as a condition for whether or not to print a line. So you can use conditions as a quick way to filter through the text.
The correct way to write:
awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
is (assuming a regexp comparison for $1 really is appropriate, which it probably isn't):
awk 'BEGIN{FS=OFS="\t"} $1 ~ /banana/{ s+=$5; c++ } END{print (c ? s/c : 0)}' file.in
Is that what you're looking for?
Or are you trying to get the mean per column 1 like this:
awk 'BEGIN{FS=OFS="\t"} { s[$1]+=$5; c[$1]++ } END{ for (k in s) print k, s[k]/c[k] }' file.in
or something else?

passing a parameter in awk command won't work

I run the script below with ./command script.sh 11. The first line of code below stores the output (321) successfully in parameter x (checked with echo on line 2). On line 3 I try to use parameter x to retrieve the last two columns on all lines where the value in the first column is equal to x (in doc2.csv). This won't work, but when I replace z=$x with z=321 it works fine. Why won't this code work when passing the parameter?
#!/bin/bash
x="$(awk -v y=$1 -F\; '$1 == y' ~/Documents/doc1.csv | cut -d ';' -f2)"
echo $x
awk -v z=$x -F, '$1 == z' ~/Documents/doc2.csv | cut -d ',' -f2,3
doc1.csv (all columns have unique values)
33;987
22;654
11;321
...
doc2.csv
321,156843,ABCD
321,637253,HYEB
123,256843,BHJN
412,486522,HDBC
412,257843,BHJN
862,256843,BHLN
...
Like others have mentioned, there are probably some extra characters coming along for the ride in field 2 of your cut command.
If you just use awk to print the column you want instead of the entire line and cutting that you shouldn't have any problems. If you still do then you will need to look into dos2unix.
n=33;
x=$(awk -v y=$n -F\; '$1 == y {print $2}' d1);
echo ${x};
awk -v z=$x -F, '$1 == z' d2
d1 and d2 contain doc1 and doc2 contents as you outlined.
As you can see all I did was stop using cut on the output of awk and just told awk to print the second field if the first field is equal to the input variable.
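If the extra characters do turn out to be DOS line endings and dos2unix is not at hand, one option is to strip the trailing carriage return inside awk before comparing. This is a minimal sketch under that assumption; the helper name lookup_value and the CRLF sample file are made up for illustration.

```shell
#!/bin/sh
# lookup_value KEY FILE   (hypothetical helper)
# Prints column 2 of the semicolon-separated rows whose column 1 equals KEY,
# after removing a trailing carriage return from each line.
lookup_value() {
    awk -F ';' -v key="$1" '
        { sub(/\r$/, "") }          # drop a DOS line ending, if present
        $1 == key { print $2 }
    ' "$2"
}

doc1=$(mktemp)
trap 'rm -f "$doc1"' EXIT
printf '33;987\r\n22;654\r\n11;321\r\n' > "$doc1"   # doc1.csv rows with CRLF endings

x=$(lookup_value 11 "$doc1")
printf '[%s]\n' "$x"    # [321]
```

Printing the value between brackets is a quick way to spot a hidden carriage return: with one present, the closing bracket usually overwrites the start of the line on the terminal.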
By the way awk is pretty powerful if you weren't aware... You can do this entire program within awk.
n=11; awk -v x=$n -F\; 'NR==FNR{ if($1==x){ y[$2]; } next} $1 in y{print $2, $3}' d1 <( sed 's/,/;/g' d2)
NR==FNR Is a trick that effectively says "If we are still in the first file, do this"... the key is not forgetting to use next to skip the rest of the awk command. Once we get to the second file FNR flips back to 1 but NR keeps incrementing up so they'll never be equal again.
So for the first file we just load up the second column values into an array where the first column matches our passed variable. You could optimize this since you said d1 was always unique lines.
So once we get into the next file the logic skips everything and runs $1 in y. This just checks if the first column is in the array we have created. If it is awk prints column 2 and 3.
<( sed 's/,/;/g' d2) just means we want to treat the output of the sed command as a file. The sed command is just converting the commas in d2 to semicolons so that it matches the FS that awk expects.
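As an alternative to rewriting d2 with sed, awk lets you change FS between files with a variable assignment on the command line, which also avoids the bash-only process substitution. A minimal sketch, assuming the same d1 and d2 contents as above; the helper name join_docs and the array name want are hypothetical.

```shell
#!/bin/sh
d1=$(mktemp); d2=$(mktemp)
trap 'rm -f "$d1" "$d2"' EXIT
printf '33;987\n22;654\n11;321\n' > "$d1"
printf '321,156843,ABCD\n321,637253,HYEB\n123,256843,BHJN\n' > "$d2"

# join_docs KEY FILE1 FILE2   (hypothetical helper)
# FILE1 is semicolon-separated, FILE2 is comma-separated.
join_docs() {
    awk -v x="$1" '
        NR == FNR { if ($1 == x) want[$2]; next }   # first file: remember matching IDs
        $1 in want { print $2, $3 }                 # second file: print columns 2 and 3
    ' FS=';' "$2" FS=',' "$3"
}

join_docs 11 "$d1" "$d2"
# 156843 ABCD
# 637253 HYEB
```

Each FS=... assignment takes effect just before awk opens the file that follows it, so every file is split with its own separator.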
Hopefully you've learned a bit about awk, read more here http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/ and a great redirection cheat sheet is available here http://www.catonmat.net/download/bash-redirections-cheat-sheet.pdf .

Insert column delimiters before pattern in a sorted file on a mac

I have a result file that contains values from different XML files.
The file has 5 columns separated by ";" when all patterns match.
First column = neutral Index
Second column = specific Index1
Third column = file does contain Index1
Fourth column = specific Index2
Fifth column = file does contain Index2
Not matching pattern with Index2 (like last three lines) should also have 5 columns, while the last two columns should be like the first two lines.
The sorted file looks like:
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;AAA.2F1;file_C
BBB;BBB.2G1;file_D
CCC;CCC.1B1;file_H
YYY;YYY.2M1;file_N
The desired result would be:
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;;;AAA.2F1;file_C
BBB;;;BBB.2G1;file_D
CCC;CCC.1B1;file_H;;
YYY;;;YYY.2M1;file_N
If you have any idea/hint, your help is appreciated! Thanks in advance!
Updated Answer
In the light of the updated requirement, I think you want something like this:
awk -F';' 'NF==3 && $2~/\.1/{$0=$0 ";;"}
NF==3 && $2~/\.2/{$0=$1 ";;;" $2 ";" $3} 1' file
which can be written as a one-liner:
awk -F';' 'NF==3 && $2~/\.1/{$0=$0 ";;"} NF==3 && $2~/\.2/{$0=$1 ";;;" $2 ";" $3} 1' YourFile
Original Answer
I would do that with awk:
awk -F';' 'NF==3{$0=$1 ";;;" $2 ";" $3}1' YourFile
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;;;AAA.2F1;file_C
BBB;;;BBB.2G1;file_D
YYY;;;YYY.2M1;file_N
That says: "run awk on YourFile using ';' as the field separator. If there are only 3 fields on a line, recreate the line using the existing first field, three semicolons and then the other two fields. The 1 at the end means print the current line".
If you don't use awk much, NF refers to the number of fields, $0 refers to the entire current line, $1 refers to the first field on the line, $2 refers to the second field etc.

awk combine 2 commands for csv file formatting

I have a CSV file which has 4 columns. I want to first:
print the first 10 items of each column
only print the items in the third column
My method is to pipe the first awk command into another, but I didn't get exactly what I wanted:
awk 'NR < 10' my_file.csv | awk '{ print $3 }'
The only missing thing was the -F.
awk -F "," 'NR < 10' my_file.csv | awk -F "," '{ print $3 }'
You don't need to run awk twice.
awk -F, 'NR<=10{print $3}'
This prints the third field for every line whose record number (line) is less than or equal to 10.
Note that < is different from <=. The former matches records one through nine, the latter matches records one through ten. If you need ten records, use the latter.
Note that this will walk through your entire file, so if you want to optimize your performance:
awk -F, '{print $3} NR>=10{exit}'
This prints the third column, then exits once the record number reaches 10, so it does not step through your entire file.
Note also that awk's "CSV" matching is very simple; awk does not understand quoted fields, so the record:
red,"orange,yellow",green
has four fields, two of which have double quotes in them. YMMV depending on your input.
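A quick way to check how awk will split a given record is to print NF for it. The sketch below does that for the record above; the helper name count_fields is hypothetical.

```shell
#!/bin/sh
# count_fields RECORD   (hypothetical helper)
# Prints how many fields awk sees when splitting RECORD on commas.
count_fields() {
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

count_fields 'red,"orange,yellow",green'   # prints 4, not 3
```

If your data can contain quoted commas like this, plain -F, splitting will shift the columns, and a CSV-aware tool is the safer choice.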
