calculating on specific fields with awk - bash

I have a CSV file with this kind of information:
2013 Cat.1 10 Structure1 Code1 34.10
2014 Cat.1 25 Structure1 Code1 254.24
2013 Cat.2 250 Structure1 Code1 2456.4
2014 Cat.2 234 Structure1 Code1 2345.9
2013 Cat.1 5 Structure2 Code2 59
2013 Cat.1 1 Structure2 Code2 18
2014 Cat.1 8 Structure2 Code2 123
2014 Cat.1 1 Structure2 Code2 18
2013 Cat.2 64 Structure2 Code2 59
2013 Cat.2 8 Structure2 Code2 18
2014 Cat.2 70 Structure2 Code2 123
2014 Cat.2 11 Structure2 Code2 18
and the result file I would like looks like this:
2013 Cat.1 10 Structure1 Code1 34.10
2014 Cat.1 25 Structure1 Code1 254.24
2013 Cat.2 250 Structure1 Code1 2456.4
2014 Cat.2 234 Structure1 Code1 2345.9
2013 Cat.1 6 (5+1) Structure2 Code2 77 (59+18)
2014 Cat.1 9 (8+1) Structure2 Code2 141 (123+18)
2013 Cat.2 72 (64+8) Structure2 Code2 77 (59+18)
2014 Cat.2 81 (70+11) Structure2 Code2 141 (123+18)
Is this possible using awk? I only have 2 entries per group for the second structure in this example, but there could be many more...
I'm very new to programming, and to awk in particular.
Thanks for any answer!

awk to the rescue!
Not the full solution, but it may give you some ideas:
$ awk '{
k = $1 FS $2 FS $4 FS $5
a[k] += $3
as[k] = as[k] ? as[k] "+" $3 : "(" $3
b[k] += $6
bs[k] = bs[k] ? bs[k] "+" $6 : "(" $6
}
END {
for (k in a) {
print k, a[k], as[k] ")", b[k], bs[k] ")"
}
}' file
will give you
2014 Cat.2 Structure2 Code2 81 (70+11) 141 (123+18)
2014 Cat.1 Structure2 Code2 9 (8+1) 141 (123+18)
2014 Cat.2 Structure1 Code1 234 (234) 2345.9 (2345.9)
2014 Cat.1 Structure1 Code1 25 (25) 254.24 (254.24)
2013 Cat.2 Structure2 Code2 72 (64+8) 77 (59+18)
2013 Cat.1 Structure2 Code2 6 (5+1) 77 (59+18)
2013 Cat.2 Structure1 Code1 250 (250) 2456.4 (2456.4)
2013 Cat.1 Structure1 Code1 10 (10) 34.1 (34.10)
Note that the column order changed (to reuse k) and single-entry values are also wrapped in parens. Both can be handled with little effort, for example:
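Here is a minimal sketch of that cleanup (untested beyond the sample data): it keeps the original column order and only adds the "(x+y)" breakdown when a key has more than one row. The output order of for (k in a) is still unspecified, and trailing zeros such as 34.10 are lost once the values are summed.
awk '{
k = $1 FS $2 FS $4 FS $5
n[k]++
pre[k] = $1 FS $2; post[k] = $4 FS $5
a[k] += $3; as[k] = as[k] ? as[k] "+" $3 : $3
b[k] += $6; bs[k] = bs[k] ? bs[k] "+" $6 : $6
}
END {
for (k in a) {
c3 = (n[k] > 1) ? a[k] " (" as[k] ")" : a[k]
c6 = (n[k] > 1) ? b[k] " (" bs[k] ")" : b[k]
print pre[k], c3, post[k], c6
}
}' file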

Another awk answer, GNU awk specific. I assume you don't actually want to print out the addition formula.
gawk '
{ data[$1 OFS $2][$4 OFS $5][1] += $3
data[$1 OFS $2][$4 OFS $5][2] += $6 }
END {
for (k1 in data) {
for (k2 in data[k1]) {
print k1, data[k1][k2][1], k2, data[k1][k2][2]
}
}
}
' file | sort -k4,5 -k2,2 -k1,1 | column -t
2013 Cat.1 10 Structure1 Code1 34.1
2014 Cat.1 25 Structure1 Code1 254.24
2013 Cat.2 250 Structure1 Code1 2456.4
2014 Cat.2 234 Structure1 Code1 2345.9
2013 Cat.1 6 Structure2 Code2 77
2014 Cat.1 9 Structure2 Code2 141
2013 Cat.2 72 Structure2 Code2 77
2014 Cat.2 81 Structure2 Code2 141
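Note that arrays of arrays are a gawk extension; with a POSIX awk, roughly the same thing can be done with a single composite key. A rough sketch, untested beyond the sample data:
awk '{
k = $1 OFS $2 OFS $4 OFS $5   # composite key: year, category, structure, code
qty[k] += $3                  # sum the count column
amt[k] += $6                  # sum the amount column
}
END {
for (k in qty) {
split(k, f, OFS)
print f[1], f[2], qty[k], f[3], f[4], amt[k]
}
}' file | sort -k4,5 -k2,2 -k1,1 | column -t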

Here is a possible answer:
awk 'BEGIN{FS="[ ]+"; OFS="\t";}
NR==FNR{
key = $1"-"$2"-"$4"-"$5
idx[key] = idx[key]+1
a[key][idx[key]] = $3
c[key][idx[key]] = $6
}
NR!=FNR{
key = $1"-"$2"-"$4"-"$5
if(idx[key]==1){$1=$1; print ;next;}
if(idx[key]<0){next;}
line1 =" ("a[key][1]
line2 =" ("c[key][1]
sum1 = a[key][1]
sum2 = c[key][1]
for(i = 2; i< idx[key]; i++) {
line1 = line1"+"a[key][i]
line2 = line2"+"c[key][i]
sum1 = sum1+a[key][i]
sum2 = sum2+c[key][i]
}
sum1 = sum1 + a[key][idx[key]]
sum2 = sum2 + c[key][idx[key]]
line1 = sum1""line1"+"a[key][idx[key]]")"
line2 = sum2""line2"+"c[key][idx[key]]")"
print $1, $2, line1, $4, $5, line2
idx[key] = -1
}' inputFile inputFile
In this script, one or more blanks are interpreted as field separators (FS="[ ]+"). In the output, fields are separated by a tab (OFS="\t").
Note that the script is called with inputFile given twice as an argument.
If your input really is a CSV file, then try exporting it with , as the field separator and set FS=OFS=",".
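That is, only the BEGIN block of the script would need to change for that case (assuming no field contains an embedded comma):
BEGIN{FS=","; OFS=",";}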
Example output for the input given in the question:
2013 Cat.1 10 Structure1 Code1 34.10
2014 Cat.1 25 Structure1 Code1 254.24
2013 Cat.2 250 Structure1 Code1 2456.4
2014 Cat.2 234 Structure1 Code1 2345.9
2013 Cat.1 6 (5+1) Structure2 Code2 77 (59+18)
2014 Cat.1 9 (8+1) Structure2 Code2 141 (123+18)
2013 Cat.2 72 (64+8) Structure2 Code2 77 (59+18)
2014 Cat.2 81 (70+11) Structure2 Code2 141 (123+18)

This one-liner will do the job:
awk 'BEGIN{g=1;s="%4s %5s %-12s %10s %5s %-12s\n"} f{printf s ,$1,$2,$3+a" ("a"+"$3")",$4,$5,$6+b" ("b"+"$6")";f=0;g=0} /Structure2/{a=$3;b=$6;f=g;g=1} /Structure1/{printf s,$1,$2,$3,$4,$5,$6}' file
2013 Cat.1 10 Structure1 Code1 34.10
2014 Cat.1 25 Structure1 Code1 254.24
2013 Cat.2 250 Structure1 Code1 2456.4
2014 Cat.2 234 Structure1 Code1 2345.9
2013 Cat.1 6 (5+1) Structure2 Code2 77 (59+18)
2014 Cat.1 9 (8+1) Structure2 Code2 141 (123+18)
2013 Cat.2 72 (64+8) Structure2 Code2 77 (59+18)
2014 Cat.2 81 (70+11) Structure2 Code2 141 (123+18)
I added formatting for alignment; I used a width of 12 (%-12s) for the third and sixth columns - you can increase it if the numbers get larger.

Related

Linux Bash Print largest number in column from monthly rotated log file

I have monthly rotated log files which look like the output below. The files are named transc-2301.log (transc-YYMM); there is a file for each month of the year. I need a simple bash command to find the file for the current month and display the largest number (max) in column 3. In the example below, the output should be 87
01/02/23 10:45 19 26
01/02/23 11:45 19 45
01/02/23 12:45 19 36
01/02/23 13:45 22 64
01/02/23 14:45 19 72
01/02/23 15:45 19 54
01/02/23 16:45 19 80
01/02/23 17:45 17 36
01/03/23 10:45 18 24
01/03/23 11:45 19 26
01/03/23 12:45 19 48
01/03/23 13:45 20 87
01/03/23 14:45 20 29
01/03/23 15:45 18 26
Since your filenames are sortable, you can easily pick the file for the current month as the last one in the sorted sequence. Then a quick awk returns the result.
for file in transc-*.log; do :; done
awk '($3>m){m=$3}END{print m}' "$file"
Alternatively, you can let awk do the heavy lifting on the filename:
awk 'BEGIN{ARGV[1]=ARGV[ARGC-1];ARGC=2}($3>m){m=$3}END{print m}' transc-*.log
or if you don't like the glob-expansion trick:
awk '($3>m){m=$3}END{print m}' "transc-$(date "+%y%m").log"
I would harness GNU AWK for this task in the following way. Let transc-2301.log content be
01/02/23 10:45 19 26
01/02/23 11:45 19 45
01/02/23 12:45 19 36
01/02/23 13:45 22 64
01/02/23 14:45 19 72
01/02/23 15:45 19 54
01/02/23 16:45 19 80
01/02/23 17:45 17 36
01/03/23 10:45 18 24
01/03/23 11:45 19 26
01/03/23 12:45 19 48
01/03/23 13:45 20 87
01/03/23 14:45 20 29
01/03/23 15:45 18 26
then
awk 'BEGIN{m=-1;FS="[[:space:]]{2,}";logname=strftime("transc-%y%m.log")}FILENAME==logname{m=$3>m?$3:m}END{print m}' transc*.log
gives output (as of 18 Jan 2023)
87
Warning: I assume your file uses two or more whitespace characters as the separator; if this does not hold, adjust FS accordingly. Warning: set m to a value lower than the lowest value which might appear in the column of interest. Explanation: I use the strftime function to work out which file should be processed and pass all transc*.log files, but the action is only taken for the selected file. The action is: set m to $3 if it is higher than the current m, otherwise keep the current m. After processing the files, in END, I print the value of m.
(tested in GNU Awk 5.0.1)
mawk '_<(__ = +$NF) { _=__ } END { print +_ }'
gawk 'END { print +_ } (_=_<(__=+$NF) ?__:_)<_'
87
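For readability, an ungolfed version of those one-liners, tracking the maximum of the last column (a sketch; assumes the values are non-negative and uses the current month's file as in the earlier answer):
awk '$NF > max { max = $NF } END { print max + 0 }' "transc-$(date "+%y%m").log"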

How can I sort csv data alphabetically then numerically by column?

If I have a set of data that has repeating name values, each with different variations, how can I sort by the top entries of each of those repeating values? Hopefully that makes sense; I demonstrate what I mean below.
Take, for example, this set of data in a tab-separated CSV file:
Ranking ID Year Make Model Total
1 128 2010 Infiniti G37 128
2 124 2015 Jeep Wrangler 124
3 15 014 Audi S4 120
4 113 2012 Acura Tsx sportwagon 116
5 83 2014 Honda Accord 112
6 112 2008 Acura TL 110
7 65 2009 Honda Fit 106
8 91 2010 Mitsu Lancer 102
9 50 2015 Acura TLX 102
10 31 2007 Honda Fit 102
11 216 2007 Chrystler 300 96
12 126 2010 Volkswagen Eos 92
13 13 2016 Honda Civic 1.5t 92
If you look at the Make column, you can see names like Acura and Honda repeat, with differences in the Model and Total columns. Assume that there are 200 or so rows of this in the CSV file. How can I sort the file so that the items are grouped by Make, with only the three highest values in the Total column being displayed for each Make?
Expected output below
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 106
4 83 2014 Honda Accord 112
5 31 2007 Honda Fit 102
6 13 2016 Honda Civic 1.5t 92
...
Here is my awk code so far; I can't get past this part to even attempt grouping the makes by the Total column:
BEGIN {
FS = OFS = "\t";
}
FNR == 1 {
print;
next;
}
FNR > 1 {
a[NR] = $4;
}
END {
PROCINFO["sorted_in"] = "#val_str_desc"
for(i = 1; i < FN-1; i++) {
print a[i];
}
}
Currently, my code reads the text file, prints the headers (column titles) and then stops there; it doesn't go on to print the rest of the data in alphabetical order. Any ideas?
The following assumes bash (if you don't use bash, replace $'\t' with a quoted real tab character) and GNU coreutils. It also assumes that you want to sort alphabetically by the Make column first, then numerically in decreasing order by Total, and finally keep at most the first 3 entries of each Make.
Sorting is a job for sort; head and tail can be used to isolate the header line, and awk can keep at most 3 of each Make and re-number the first column:
$ head -n1 data.tsv; tail -n+2 data.tsv | sort -t$'\t' -k4,4 -k6,6rn |
awk -F'\t' -vOFS='\t' '$4==p {n+=1} $4!=p {n=1;p=$4} {$1=++r} n<=3'
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 102
4 15 014 Audi S4 120
5 216 2007 Chrystler 300 96
6 83 2014 Honda Accord 112
7 65 2009 Honda Fit 106
8 31 2007 Honda Fit 102
10 128 2010 Infiniti G37 128
11 124 2015 Jeep Wrangler 124
12 91 2010 Mitsu Lancer 102
13 126 2010 Volkswagen Eos 92
Note that this is different from your expected output: Make is sorted in alphabetic order (Audi comes after Acura, not Honda) and only the 3 largest Total are kept (112, 106, 102 for Honda, not 112, 102, 92).
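If the gaps in the re-numbered Ranking column bother you (9 is skipped above because r is incremented even for rows that fail the n<=3 filter), a sketch of the tweak is to renumber only the surviving rows:
$ head -n1 data.tsv; tail -n+2 data.tsv | sort -t$'\t' -k4,4 -k6,6rn |
awk -F'\t' -vOFS='\t' '$4==p {n+=1} $4!=p {n=1;p=$4} n<=3 {$1=++r; print}'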
If you use GNU awk, and your input file is small enough to fit in memory, you can also do all this with just awk, thanks to its multidimensional arrays and its asorti function, which sorts arrays by their indices:
$ awk -F'\t' -vOFS='\t' 'NR==1 {print; next} {l[$4][$6][$0]}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for(m in l) {
n = asorti(l[m], t, "@ind_num_desc"); n = (n>3) ? 3 : n
for(i=1; i<=n; i++) for(s in l[m][t[i]]) {$0 = s; $1 = ++r; print}
}
}' data.tsv
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 102
4 15 014 Audi S4 120
5 216 2007 Chrystler 300 96
6 83 2014 Honda Accord 112
7 65 2009 Honda Fit 106
8 31 2007 Honda Fit 102
9 128 2010 Infiniti G37 128
10 124 2015 Jeep Wrangler 124
11 91 2010 Mitsu Lancer 102
12 126 2010 Volkswagen Eos 92
Using GNU awk for arrays of arrays and sorted_in:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 {
print
next
}
{
rows[$4][$6][++numRows[$4,$6]] = $0
}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for ( make in rows ) {
PROCINFO["sorted_in"] = "#ind_num_desc"
cnt = 0
for ( total in rows[make] ) {
for ( rowNr=1; rowNr<=numRows[make,total]; rowNr++ ) {
if ( ++cnt <= 3 ) {
row = rows[make][total][rowNr]
print row, cnt
}
}
}
}
}
$ awk -f tst.awk file
Ranking ID Year Make Model Total
4 113 2012 Acura Tsx sportwagon 116 1
6 112 2008 Acura TL 110 2
9 50 2015 Acura TLX 102 3
3 15 014 Audi S4 120 1
11 216 2007 Chrystler 300 96 1
5 83 2014 Honda Accord 112 1
7 65 2009 Honda Fit 106 2
10 31 2007 Honda Fit 102 3
1 128 2010 Infiniti G37 128 1
2 124 2015 Jeep Wrangler 124 1
8 91 2010 Mitsu Lancer 102 1
12 126 2010 Volkswagen Eos 92 1
The above will handle cases where multiple cars of one make have the same total by always just printing the top 3 rows for that make, e.g. given this input where 4 Acuras all have a 116 total:
$ cat file
Ranking ID Year Make Model Total
1 128 2010 Infiniti G37 128
2 124 2015 Jeep Wrangler 124
3 15 014 Audi S4 120
4 113 2012 Acura Tsx sportwagon 116
4 113 2012 Acura Foo 116
4 113 2012 Acura Bar 116
4 113 2012 Acura Other 116
5 83 2014 Honda Accord 112
6 112 2008 Acura TL 110
7 65 2009 Honda Fit 106
8 91 2010 Mitsu Lancer 102
9 50 2015 Acura TLX 102
10 31 2007 Honda Fit 102
11 216 2007 Chrystler 300 96
12 126 2010 Volkswagen Eos 92
13 13 2016 Honda Civic 1.5t 92
this is the output showing just 3 of those 4 116 Acuras:
$ awk -f tst.awk file
Ranking ID Year Make Model Total
4 113 2012 Acura Tsx sportwagon 116 1
4 113 2012 Acura Foo 116 2
4 113 2012 Acura Bar 116 3
3 15 014 Audi S4 120 1
11 216 2007 Chrystler 300 96 1
5 83 2014 Honda Accord 112 1
7 65 2009 Honda Fit 106 2
10 31 2007 Honda Fit 102 3
1 128 2010 Infiniti G37 128 1
2 124 2015 Jeep Wrangler 124 1
8 91 2010 Mitsu Lancer 102 1
12 126 2010 Volkswagen Eos 92 1
If that's not what you want, then move the if ( ++cnt <= 3 ) test to the outer loop, or handle it however else you want, for example as sketched below.
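Here is a sketch of that variant END block (counting the 3 highest distinct totals per make, so tied rows all get printed; cnt then ranks the total rather than the row):
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for ( make in rows ) {
PROCINFO["sorted_in"] = "@ind_num_desc"
cnt = 0
for ( total in rows[make] ) {
if ( ++cnt > 3 ) { break }
for ( rowNr=1; rowNr<=numRows[make,total]; rowNr++ ) {
print rows[make][total][rowNr], cnt
}
}
}
}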

concat two files side-by-side, append difference between fields, and print in tabular format

Consider that I have two files as below; I need to concatenate them and find the difference in the new file.
a.txt
a 2019 66
b 2020 50
c 2018 48
b.txt
a 2019 50
b 2019 40
c 2018 45
Desired output:
a 2019 66 a 2019 50 16
b 2020 50 b 2019 40 10
c 2018 48 c 2018 45 3
I tried:
awk -F, -v OFS=" " '{$7=$3-$6}1' file3.txt
it prints
a 2019 66 a 2019 50 0
b 2020 50 b 2019 40 0
c 2018 48 c 2018 45 0
Also, can you help with printing in tabular format?
Your awk command seems fine except for -F,. You should paste the files together first.
$ paste a.txt b.txt | awk '{print $0,$3-$6}' | column -t
a 2019 66 a 2019 50 16
b 2020 50 b 2019 40 10
c 2018 48 c 2018 45 3
Within a single awk, could you please try the following.
awk 'FNR==NR{a[FNR]=$0;b[FNR]=$NF;next} {print a[FNR],$0,b[FNR]-$NF}' a.txt b.txt | column -t
Output will be as follows.
a 2019 66 a 2019 50 16
b 2020 50 b 2019 40 10
c 2018 48 c 2018 45 3
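For readability, here is the same idea written out with comments (a sketch):
awk '
FNR==NR {            # true only while reading the first file, a.txt
a[FNR] = $0          # remember the whole line by its line number
b[FNR] = $NF         # and its last field, the value to subtract from
next
}
{                    # now reading the second file, b.txt
print a[FNR], $0, b[FNR] - $NF
}
' a.txt b.txt | column -t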

Replace exact numbers in a column keeping order

I have this file, and I would like to replace the numbers in the 3rd column so that they appear in order. Also, I need to skip the first row (the header of the file).
Initial file:
#results from program A
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 3 CTTG 61 145M
2005 17 3 TTCG 30 145M
91823 17 4 ATGAAGC 22 146M
91823 17 4 GTAGGCC 19 146M
16523 17 5 GGGGGTCGGT 45 30M1D115M
Modified file:
#results from program A
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 2 CTTG 61 145M
2005 17 2 TTCG 30 145M
91823 17 3 ATGAAGC 22 146M
91823 17 3 GTAGGCC 19 146M
16523 17 4 GGGGGTCGGT 45 30M1D115M
Do you know how I could do it?
Could you please try the following.
awk 'prev!=$1{++count}{$3=count;prev=$1;$1=$1} 1' OFS="\t" Input_file
To leave the header line untouched, use the following:
awk 'FNR==1{print;next}prev!=$1{++count}{$3=count;prev=$1;$1=$1} 1' OFS="\t" Input_file
2nd solution: in case your Input_file's 1st field is NOT in order, the following may help.
awk 'FNR==NR{if(!a[$1]++){b[$1]=++count};next} {$3=b[$1];$1=$1} 1' OFS="\t" Input_file Input_file
To leave the header line untouched with the 2nd solution above, use the following.
awk 'FNR==1{if(++val==1){print};next}FNR==NR{if(!a[$1]++){b[$1]=++count};next} {$3=b[$1];$1=$1} 1' OFS="\t" Input_file Input_file
another minimalist awk
$ awk '{$3=c+=p!=$1;p=$1}1' file | column -t
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 2 CTTG 61 145M
2005 17 2 TTCG 30 145M
91823 17 3 ATGAAGC 22 146M
91823 17 3 GTAGGCC 19 146M
16523 17 4 GGGGGTCGGT 45 30M1D115M
with header version
$ awk 'NR==1; NR>1{$3=c+=p!=$1;p=$1; print | "column -t"}' file
#results from program A
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 2 CTTG 61 145M
2005 17 2 TTCG 30 145M
91823 17 3 ATGAAGC 22 146M
91823 17 3 GTAGGCC 19 146M
16523 17 4 GGGGGTCGGT 45 30M1D115M
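For reference, the minimalist one-liner expanded with comments (a sketch; equivalent to $3=c+=p!=$1):
awk '{
if (p != $1) c++       # new value in column 1: advance the counter
$3 = c                 # overwrite the 3rd column with the counter
p = $1                 # remember the current first field
print
}' file | column -t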

extract data date wise and do average calculation

How can I extract data date-wise and calculate the average per date from the output shown below? The last column is the value to be averaged.
Sun Jul 5 00:00:02 IST 2015, 97
Sun Jul 5 00:02:01 IST 2015, 97
Sun Jul 5 00:04:02 IST 2015, 97
Mon Jul 6 00:00:01 IST 2015, 73
Mon Jul 6 00:02:02 IST 2015, 93
Mon Jul 6 00:04:02 IST 2015, 97
Tue Jul 7 00:00:02 IST 2015, 97
Tue Jul 7 00:02:02 IST 2015, 97
Tue Jul 7 00:04:01 IST 2015, 97
Wed Jul 8 00:00:01 IST 2015, 98
Wed Jul 8 00:02:02 IST 2015, 98
Wed Jul 8 00:04:01 IST 2015, 98
Thu Jul 9 00:00:02 IST 2015, 100
Thu Jul 9 00:02:01 IST 2015, 100
Thu Jul 9 00:04:01 IST 2015, 100
Fri Jul 10 00:00:01 IST 2015, 100
Fri Jul 10 00:02:02 IST 2015, 100
Fri Jul 10 00:04:02 IST 2015, 100
Sat Jul 11 00:00:01 IST 2015, 73
Sat Jul 11 00:02:01 IST 2015, 73
Sat Jul 11 00:04:02 IST 2015, 73
I want output like:
Jun 6 - 97
Jun 7 - 86.66
...
You can use this awk:
awk -F ', ' '{
split($1, a, " ");
k=a[2] OFS a[3];
if(!(k in c))
b[++n]=k;
c[k]++;
sum[k]+=$2
}
END{
for(i=1; i<=n; i++)
printf "%s - %.2f\n", b[i], (sum[b[i]]/c[b[i]])
}' file
Jul 5 - 97.00
Jul 6 - 87.67
Jul 7 - 97.00
Jul 8 - 98.00
Jul 9 - 100.00
Jul 10 - 100.00
Jul 11 - 73.00
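If the output order does not matter, the bookkeeping array b can be dropped; a shorter sketch (note the order of a plain for (k in sum) loop is unspecified):
awk -F ', ' '{
split($1, d, " ")
k = d[2] " " d[3]      # month and day, e.g. "Jul 5"
sum[k] += $2
cnt[k]++
}
END {
for (k in sum) printf "%s - %.2f\n", k, sum[k] / cnt[k]
}' file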
