awk to get first column if a specific number in the line is greater than a digit - shell

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I expect to get the first column ($1) only if the ETA= number is greater than 15. Here, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line that has ETA at the end.

Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456

With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. If the 2nd element of arr is greater than 15, it prints the 1st element of arr, as required.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file

I would use GNU AWK for this task in the following way. Let the content of file.txt be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use string functions: index to find where ETA= starts, then substr to get the 2 characters after ETA=. The offset 4 is used because ETA= is 4 characters long and index gives the start position. I use +0 to convert the result to a number and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
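The disclaimer above matters because index returns 0 when ETA= is absent, and substr would then read unrelated characters from the start of the line. The following is a minimal sketch of a guarded variant, not the answerer's code: the function name eta_over, the limit argument, and the third sample line (which has no ETA=) are assumptions made for illustration.

```shell
# eta_over LIMIT: print field 1 of each input line whose ETA hour exceeds LIMIT.
# Lines without "ETA=" are skipped instead of being misread.
eta_over() {
    awk -v limit="$1" '
        {
            p = index($0, "ETA=")      # 0 when the tag is absent
            if (p == 0) next           # skip lines that have no ETA
            if (substr($0, p + 4, 2) + 0 > limit) print $1
        }'
}

printf '%s\n' \
    '123 pro=tegs, ETA=12:00, team=xyz' \
    '345 pro=rbs, team=abc,ETA=23:00' \
    '9994567 pro=abc, team=none' |
    eta_over 15
```

Without the guard, the third line would be tested on its 4th and 5th characters ("45") and printed by mistake; with it, only 345 is printed.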

Whenever input contains tag=value pairs, as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
v["ETA"]+0 > 15 {
print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
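As one possible illustration of that kind of enhancement, here is a minimal sketch that compares the full HH:MM value rather than only the hour. The function name eta_after and its cutoff argument are made up for this example, and split("", v) is used in place of delete v purely as a conservative portability choice.

```shell
# eta_after HH:MM: print field 1 of each line whose ETA is later than the cutoff.
eta_after() {
    awk -v cutoff="$1" '
        BEGIN {
            FS = "[, =]+"
            split(cutoff, c, ":")
            limit = c[1] * 60 + c[2]          # cutoff in minutes
        }
        {
            split("", v)                      # clear the tag array
            for (i = 2; i < NF; i += 2)
                v[$i] = $(i + 1)
        }
        ("ETA" in v) {
            split(v["ETA"], t, ":")
            if (t[1] * 60 + t[2] > limit) print $1
        }'
}

printf '%s\n' \
    '123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com' \
    '345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00' \
    '456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis' |
    eta_after 22:30
```

With the question's sample data and a cutoff of 22:30, this prints only 345.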

It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You will probably want to split the line separately, on whitespace, to obtain the first column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one). Adding 0 forces a numeric conversion, which uses the number at the beginning of the field and ignores any non-numeric text after it; without the +0, awk would compare the field as a string, because text such as 12:00 does not look like a number. So for example, on the first line, we are literally converting 12:00, team=xyz,user1=tom,dom=dby.com to a number, which effectively checks if 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
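One thing worth verifying with this approach is how awk chooses between a numeric and a string comparison. As I understand the POSIX rules, a field is only compared numerically when it looks like a number in its entirety. The sketch below uses a made-up record with a single-digit hour to show the difference; the record itself is an assumption, not part of the question's data.

```shell
line='789 pro=abc, team=xyz,ETA=9:00'   # hypothetical extra record

# Without +0: "9:00" does not look like a number, so awk compares strings,
# and the string "9:00" sorts after "15".
printf '%s\n' "$line" | awk -F 'ETA=' '$2 > 15 { print "string compare: selected" }'

# With +0: the leading digits are converted first, so the test is 9 > 15.
printf '%s\n' "$line" | awk -F 'ETA=' '$2+0 > 15 { print "numeric compare: selected" }'
```

Only the first command prints anything, which suggests that $2+0 > 15 is the safer form whenever the hour may have fewer than two digits.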

Using awk, you could match ETA= followed by 1 or more digits. Then take the match without the ETA= part, check whether the number is greater than 15, and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Related

Processing text with multiple delims in awk

I have a text which looks like -
Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320
I want the number from the count field (1 for count:1), and I wish to store these numbers in an array.
nums=($(echo -n "$grepResult" | awk -F ':' '{ print $4 }' | awk -F '|' '{ print $1 }'))
This seems very repetitive and not very efficient. Any ideas how to simplify this?
You can use awk once and set the field separator to |. Then loop over all the fields and split each one on :
If the first part is count, print the second part of the split value.
This way the count: part can occur anywhere in the string, and it is printed each time it occurs.
nums=($(echo -n "$grepResult" | awk -F'|' '
{
for(i=1; i<=NF; i++) {
split($i, a, ":")
if (a[1] == "count") {
print a[2]
}
}
}
'))
for i in "${nums[@]}"
do
echo "$i"
done
Output
1
5
If you want to combine both splits, you can use [|:] as a character class and print field number 8 for a precise match, as mentioned in the comments.
Note that it does not check whether the field starts with count:
nums=($(echo -n "$grepResult" | awk -F '[|:]' '{print $8}'))
With GNU awk you can use a capture group to get a more precise match, where the left and right side can be either the start/end of the string or a pipe character. The 2nd group matches 1 or more digits:
nums=($(echo -n "$grepResult" | awk 'match($0, /(^|\|)count:([0-9]+)(\||$)/, a) {print a[2]}' ))
Try sed
nums=($(sed 's/.*count://;s/|.*//' <<< "$grepResult"))
Explanation:
There are two sed commands separated by the ; symbol.
The first command, 's/.*count://', removes all characters up to and including 'count:'.
The second command, 's/|.*//', removes all characters from '|' onwards, including the '|' itself.
The order of the commands is important here.

How to reverse order of columns in tabular data using bash

Given tabular output from some program in bash, I would like to change the order of the columns printed. Assume the number of columns might vary.
Sample input
Name Surname Age
Oli Aaa 15
Boa Bbb 25
Expected output
Age Surname Name
15 Aaa Oli
25 Bbb Boa
What I tried
It seems an easy task when the number of columns is known, but I don't know what to do when the number of columns is just N. For 3 columns a simple AWK script would do:
cat table.txt | awk '{print $3, $2, $1}' > reversed_table.txt
It would be good to achieve this using only POSIX-compliant tools.
"using only POSIX-compliant tools"
awk is POSIX.
"but I don't know what to do when number of columns is just N"
Now that's easy. First, awk is really flexible: awk '{ i=5; print $i; }' will print the 5th column, just like that.
Second, you can get the number of columns with NF.
Now it's just a matter of writing a simple for loop that iterates from NF down to the first field, and voilà!
awk '{ for(i = NF; i >= 1; --i) printf "%s", $i "\t"; printf "\n" }'
A bit better version without a trailing tabulator:
awk '{ for(i = NF; i >= 1; --i) printf "%s", $i (i==1 ? "" : OFS); print ""; }'
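To try the loop against the sample table from the question, it can be wrapped in a small shell function. This is only a sketch: the name reverse_cols is made up, and using ORS for the last field is one possible way to fold the newline into the same printf.

```shell
# reverse_cols: print the fields of every input line in reverse order.
# A blank input line has NF == 0, so the loop never runs and the line is dropped.
reverse_cols() {
    awk '{
        for (i = NF; i >= 1; --i)
            printf "%s%s", $i, (i == 1 ? ORS : OFS)
    }'
}

printf '%s\n' 'Name Surname Age' 'Oli Aaa 15' 'Boa Bbb 25' | reverse_cols
```

For the sample input this prints "Age Surname Name", "15 Aaa Oli" and "25 Bbb Boa", one per line, and it works unchanged for any number of columns.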
Here is a generic solution. There are 2 variables named swap1 and swap2; each field number listed in swap1 is exchanged with the field number at the same position in swap2. For example, to exchange the 3rd field with the 5th field and the 4th field with the 6th field (any number of pairs can be listed this way):
swap1 -->  3   4
           |   |
           |   |
           |   |
swap2 -->  5   6
Following is the code:
awk -v swap1="3,4" -v swap2="5,6" '
BEGIN{
num=split(swap1,field1,",")
num1=split(swap2,field2,",")
for(i=1;i<=num;i++){
array1[field1[i]]=i
}
}
FNR==1{
print
next
}
{
for(i=1;i<=NF;i++){
if(i in array1){
tmp=$field1[array1[i]]
$field1[array1[i]]=$field2[array1[i]]
$field2[array1[i]]=tmp
}
}
}
1
' Input_file | column -t
This might work for you (GNU sed and rev):
sed 's/.*/echo "&" | rev/e;s/\S\+/$(echo "&"|rev)/g;s/.*/echo "&"/e' file
Reverse each line and re-reverse each separate word within the line.
awk '{use = $NF;$NF = "";print use,$2,$1}' OFS="\t" file
Age Surname Name
15 Aaa Oli
25 Bbb Boa
I looked into this one:
Printing everything except the first field with awk

Count the number of words in a lines in a file but ignoring Enter Key

I am using the below command to count the number of words in the line but it gives the number of words plus one as the output.
awk '{ print length($0); }' abc.txt
How can I ignore the last character or enter key while counting the number of words in a line?
This gives the number of characters:
awk '{ print length($0); }' abc.txt
This gives the number of words, where fields are separated by runs of whitespace (the default field separator):
awk '{ print NF; }' abc.txt
Perhaps you can use the NF variable in awk like so?
echo "this is a test" | awk '{print NF}'
Prints 4.
I'm not sure what you expect to get out of printing the length of the whole line. That will give you the number of characters, not the number of words.
This awk script should work:
#!/usr/bin/awk -f
BEGIN { num=0 }
{
num=num + NF
}
END {print num}
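The difference between the three counts discussed in this thread can be seen side by side. This is a minimal sketch with a made-up two-line sample; the variable name sample is an assumption for the example.

```shell
sample='this is a test
two words'

# Characters per line (the newline is the record separator and is not counted)
printf '%s\n' "$sample" | awk '{ print length($0) }'    # prints 14 and 9

# Words per line
printf '%s\n' "$sample" | awk '{ print NF }'            # prints 4 and 2

# Words in the whole input
printf '%s\n' "$sample" | awk '{ total += NF } END { print total + 0 }'   # prints 6
```

The + 0 in the END block only ensures that an empty input prints 0 rather than an empty line.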

Average of first ten numbers of text file using bash

I have a file of two columns. The first column is dates and the second contains a corresponding number. The two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file. Then do the same for the 2nd-4th numbers, then the 3rd-5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay@db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, '{
x = $2;
i = NR % n;
ma += (x - q[i]) / n;
q[i] = x;
if(NR>=n)print ma;
}' file.txt
2
2
4
Or use the one below, which is useful for plotting because it keeps the reference axis (in your case the date) at the center of each averaging window.
Script
akshay@db-3325:/tmp$ cat avg.awk
BEGIN {
m=int((n+1)/2)
}
{L[NR]=$2; sum+=$2}
NR>=m {d[++i]=$1}
NR>n {sum-=L[NR-n]}
NR>=n{
a[++k]=sum/n
}
END {
for (j=1; j<=k; j++)
print d[j],a[j] # remove d[j], if you just want values only
}
Output
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
A little math trick is used here: $2 is stored in a[NR%3] for each record, so the elements are overwritten cyclically. The sum of a[0], a[1], and a[2] is then always the sum of the past 3 numbers.
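The same cyclic-buffer idea can be sketched for an arbitrary window size by keeping a running sum instead of adding fixed elements. The function name moving_avg and its window argument are assumptions for this example, not part of the answer above.

```shell
# moving_avg N: print the average of column 2 over a sliding window of N rows.
moving_avg() {
    awk -F, -v n="$1" '
        {
            slot = NR % n
            sum += $2 - buf[slot]   # add the new value, drop the one leaving the window
            buf[slot] = $2
        }
        NR >= n { print sum / n }'
}

printf '%s\n' date1,1 date2,1 date3,4 date4,1 date5,7 | moving_avg 3
```

With the question's data and a window of 3 this prints 2, 2 and 4, matching the expected output.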
updated based on the changes made due to the helpful feedback from Ed Morton
Here's a quick and dirty script to do what you've asked for. It doesn't have much flexibility in it, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f
// {
Numbers[NR]=$2;
if ( NR >= 3 ) {
printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
}
}
BEGIN {
FS=","
}
Explanation:
// { : Match all lines. "/" is the match operator, and in this case we have an empty match, which means "do this thing on every line".
Numbers[NR]=$2 : Use the record number (NR) as the key and store the value from column 2.
if ( NR >= 3 ) : Continue only if we have read 3 or more values from the file.
printf : Do the maths and print the result as an integer.
BEGIN block: Change the field separator to a comma ",".

Print a comma except on the last line in Awk

I have the following script
awk '{printf "%s", $1"-"$2", "}' $a >> positions;
where $a stores the name of the file. I am actually writing multiple column values into one row. However, I would like to print a comma only if I am not on the last line.
Single pass approach:
cat "$a" | # look, I can use this in a pipeline!
awk 'NR > 1 { printf(", ") } { printf("%s-%s", $1, $2) }'
Note that I've also simplified the string formatting.
Enjoy this one:
awk '{printf t $1"-"$2} {t=", "}' $a >> positions
Yes, it looks a bit tricky at first sight, so I'll explain. First of all, let's change printf to print for clarity:
awk '{print t $1"-"$2} {t=", "}' file
and have a look what it does, for example, for file with this simple content:
1 A
2 B
3 C
4 D
so it will produce the following:
1-A
, 2-B
, 3-C
, 4-D
The trick is the leading t variable, which is empty at the beginning. It is set ({t=...}) only after it has been printed ({print t ...}), so every record after the first is prefixed with the separator. As awk continues iterating, we get the desired sequence.
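One caveat with printf t $1"-"$2 is that the data itself becomes the format string, so a % in the input would be interpreted as a conversion. A minimal sketch of the same prefix-variable trick with an explicit format string follows; the function name join_pairs, the variable name sep, and the sample value 50% are assumptions chosen for illustration.

```shell
# join_pairs: join "col1-col2" pairs from all lines with ", " and end with one newline.
join_pairs() {
    awk '{ printf "%s%s-%s", sep, $1, $2; sep = ", " } END { print "" }'
}

printf '%s\n' '1 A' '2 50%' '3 C' | join_pairs
```

This prints 1-A, 2-50%, 3-C on a single line, with the percent sign passed through as ordinary data.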
I would do it by finding the number of lines before running the script, e.g. with coreutils and bash:
awk -v nlines=$(wc -l < $a) '{printf "%s", $1"-"$2} NR != nlines { printf ", " }' $a >>positions
If your file only has 2 columns, the following coreutils alternative also works. Example data:
paste <(seq 5) <(seq 5 -1 1) | tee testfile
Output:
1 5
2 4
3 3
4 2
5 1
Now, after replacing tabs with newlines, paste easily assembles the data into the desired format:
<testfile tr '\t' '\n' | paste -sd-,
Output:
1-5,2-4,3-3,4-2,5-1
You might think that awk's ORS and OFS would be a reasonable way to handle this:
$ awk '{print $1,$2}' OFS="-" ORS=", " input.txt
But this results in a trailing ORS, because print appends ORS after every record, including the last one. You can work around this with a bit of hackery, but the resultant complexity eliminates the elegance of the one-liner.
So here's my take on this. Since you say you're "writing multiple column values", it's possible that mucking with ORS and OFS would cause problems. So we can achieve the desired output entirely with formatting.
$ cat input.txt
3 2
5 4
1 8
$ awk '{printf "%s%d-%d",t,$1,$2; t=", "} END{print ""}' input.txt
3-2, 5-4, 1-8
This is similar to Michael's and rook's single-pass approaches, but it uses a single printf and correctly uses the format string for formatting.
This will likely perform negligibly better than Michael's solution because an assignment should take less CPU than a test, and noticeably better than any of the multi-pass solutions because the file only needs to be read once.
Here's a better way, without resorting to coreutils:
awk 'FNR==NR { c++; next } { ORS = (FNR==c ? "\n" : ", "); print $1, $2 }' OFS="-" file file
awk '{a[NR]=$1"-"$2}END{for(i=1;i<NR;i++) printf "%s, ", a[i]; print a[NR]}' $a > positions

Resources