How to find MIN, MAX, AVG grouped by date, formatted to 2 digits after the decimal point

Sample Source Data:
DATE|CPU%|MEMPHY%|MEMSWA%
2022-09-05 00|39.67|26.99|1.10
2022-09-05 00|44.94|25.42|1.10
2022-09-05 01|6.30|24.28|1.10
2022-09-05 01|4.68|26.45|1.10
2022-09-05 02|7.86|25.37|1.10
2022-09-05 02|10.66|24.38|1.10
I want to print MIN, MAX, AVG for each hour, for each of CPU%, MEMPHY%, and MEMSWA%.
The output should look like this (2 digits after the decimal point, pipe-delimited):
DATE|CPU%(MIN)|CPU%(MAX)|CPU%(AVG)|MEMPHY%(MIN)|MEMPHY%(MAX)|MEMPHY%(AVG)|MEMSWA%(MIN)|MEMSWA%(MAX)|MEMSWA%(AVG)
2022-09-05 00|39.67|44.94|42.30|25.42|26.99|26.20|1.10|1.10|1.10
2022-09-05 01|4.68|6.30|5.49|24.28|26.45|25.36|1.10|1.10|1.10
2022-09-05 02|7.86|10.66|9.26|24.38|25.37|24.87|1.10|1.10|1.10
I already tried the command below, but it:
cannot group by date
cannot format decimals to 2 digits after the decimal point
cannot apply the same awk logic to $3 and $4 in the same one-line command
cat test.txt | grep -v DATE |
awk -F'|' 'NR == 1 {min = max = $2}
{min = $2 < min ? $2 : min; max = $2 > max ? $2 : max; total += $2}
END {print $1"|CPU%Min:",min,"|CPU%Max:",max,"|CPU%Avg:",total/NR}'
Result:
2022-09-05 02|CPU%Min: 4.68 |CPU%Max: 44.94 |CPU%Avg: 19.0183
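A minimal sketch that addresses all three points, grouping on the hour key in $1, formatting with printf "%.2f", and handling columns 2-4 in one pass. It assumes the data is in test.txt; note that "%.2f" rounds rather than truncates, so some averages may come out 0.01 higher than in the sample output above:
awk -F'|' '
NR == 1 { next }                             # skip the header row
{
    key = $1                                 # "YYYY-MM-DD HH" is the group key
    if (!(key in cnt)) order[++n] = key      # remember first-seen order
    for (i = 2; i <= 4; i++) {
        if (cnt[key] == 0 || $i + 0 < min[key, i]) min[key, i] = $i + 0
        if (cnt[key] == 0 || $i + 0 > max[key, i]) max[key, i] = $i + 0
        sum[key, i] += $i
    }
    cnt[key]++
}
END {
    print "DATE|CPU%(MIN)|CPU%(MAX)|CPU%(AVG)|MEMPHY%(MIN)|MEMPHY%(MAX)|MEMPHY%(AVG)|MEMSWA%(MIN)|MEMSWA%(MAX)|MEMSWA%(AVG)"
    for (j = 1; j <= n; j++) {
        key = order[j]
        printf "%s", key
        for (i = 2; i <= 4; i++)
            printf "|%.2f|%.2f|%.2f", min[key, i], max[key, i], sum[key, i] / cnt[key]
        print ""
    }
}' test.txt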

Related

How to calculate the mean of a row from a CSV file, from the nth column?

This may look like a duplicate, but I could not solve the issue I'm having.
I'm trying to find the average of each row of a CSV/TSV file (the values start at the 5th column); the data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw an error,
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
The code below also threw a syntax error:
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Help me resolve the issue so I can calculate the average of each row.
From your code reference:
awk 'NR==1{next}
{
    # missing the last ). This prints the 1st column
    #printf("%s\t", $1
    printf("%s\t", $1 )
    # missing the last ), and it averages 3 columns only
    #printf("%.2f\n", ($5 + $6 + $7)/3
    printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second snippet is not easy to work with: lots of subshells (backticks) and shell looping, and, most of all, I think it only handles integer values and processes the full line of values (not fields 5 to 9). Forget it unless you really don't want awk for this.
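If you do want a pure-shell version anyway, here is a minimal corrected sketch. It assumes bash, a tab-separated input.tsv with exactly five value columns, and that bc is available for the floating-point division (the original loop failed because "do;" is a syntax error and expr handles integers only):
# corrected shell-loop sketch; bc is assumed for floating-point math
while IFS=$'\t' read -r id src rnd txt v1 v2 v3 v4 v5; do
    [ "$id" = "ID" ] && continue          # skip the header row
    avg=$(echo "scale=3; ($v1 + $v2 + $v3 + $v4 + $v5) / 5" | bc)
    printf '%s\t%s\n' "$id" "$avg"
done < input.tsv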
For fun:
awk 'NR==1{
    # Header
    print $0 OFS "Avg"
    Count = NF - 4        # fields 5..NF hold the values
    next
}
{
    # print each field of the line, summing everything after column 4
    Avg = 0
    for (i = 1; i <= NF; i++) {
        if (i >= 5) Avg += $i
        printf("%s ", $i)
    }
    # print the average
    printf("%.2f\n", Avg/Count)
}
' input.tsv
This assumes every line carries the full set of values; if some lines have fewer value fields (and empty fields shouldn't count), compute the divisor per line as (NF - 4) in the second block instead of fixing Count from the header.
You could use this awk script:
awk 'NR>1{
    for(i=5;i<=NF;i++)
        sum+=$i
}
{
    print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
    sum=0
}' file | column -t
The first block sums the fields starting from the 5th one.
The second block prints the header line and the average value.
column -t displays the result in aligned columns.
This should work as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
    sum = cnt = 0
    for (i=5; i<=NF; i++) {
        sum += $i
        cnt++
    }
    avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0;foreach(@F[4..8]){$s+=$_} $F[4]=$s==0?"Avg":$s/5;print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" } ' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
>

Using the output of one awk in another awk command

I have one file (an Excel file saved as CSV) which has some columns (not fixed; they change dynamically), and I need to get the values of a couple of particular columns. I'm able to get the column numbers using one awk command and then print rows using those column numbers in another awk command. Is there any way I can combine the two into one?
awk -F',' ' {for(i=1;i < 9;i++) {if($i ~ /CLIENT_ID/) {print i}}} {for(s=1;s < 2;s++) {if($s ~ /SEC_DESC/) {print s}}} ' <file.csv> | awk -F "," '!($5~/...[0-9]L/ && $21~/FUT /) {print $0}' <file.csv>
This gives me the output 5 and 9 as the column numbers (for CLIENT_ID and SEC_DESC), which change with different files.
Now, using these column numbers, I get the desired output as follows:
awk -F "," '!($5~/...[0-9]L/ && $21~/FUT /) {print $0}' <file.csv>
How can I combine these into one command? Pass a variable from the first to the second?
Input (a CSV file having various dynamic columns; I'm interested in the following two):
CLIENT_ID SEC_DESC
USZ256 FUT DEC 16 U.S.
USZ256L FUT DEC 16 U.S. BONDS
WNZ256 FUT DEC 16 CBX
WNZ256L FUT DEC 16 CBX BONDS
The output gives me rows 2 and 4, which matched my regex pattern in the second awk command (using column numbers 5 & 21). These column numbers change per file, so I first have to get the column numbers using the first awk and then give them as input to the second awk.
I think I got it.
awk -F',' '
NR == 1 {
    for (i=1; i<=NF; ++i) {
        if ($i == "CLIENT_ID") cl_col = i
        if ($i == "SEC_DESC") sec_col = i
    }
}
NR > 1 && !($cl_col ~ /...[0-9]L/ && $sec_col ~ /FUT /) {print $0}
' RED_FUT_TST.csv
To solve your problem you can test when you're processing the first row, and put the logic to discover the column numbers there. Then when you are processing the data rows, use the column numbers from the first step.
(NR is an awk built-in variable containing the number of the record being processed; NF is the number of fields on that record.)
E.g.:
$ cat red.awk
NR == 1 {
    for (i=1; i<=NF; ++i) {
        if ($i == "CLIENT_ID") cl_col = i;
        if ($i == "SEC_DESC") sec_col = i;
    }
}
NR > 1 && $cl_col ~ /...[0-9]L/ && $sec_col ~ /FUT /
$ awk -F'\t' -f red.awk RED_FUT_TST.csv
USZ256L FUT DEC 16 U.S. BONDS
WNZ256L FUT DEC 16 CBX BONDS
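If you would rather keep two commands and literally pass the column numbers from the first awk into the second, a sketch like this should also work (assuming bash, and reusing the file name from the answers above):
# step 1: read the header once and print the two column numbers
read -r cl sec < <(awk -F',' 'NR == 1 {
    for (i = 1; i <= NF; ++i) {
        if ($i == "CLIENT_ID") c = i
        if ($i == "SEC_DESC") s = i
    }
    print c, s
    exit
}' RED_FUT_TST.csv)
# step 2: hand the numbers to the filtering awk with -v
awk -F',' -v cl="$cl" -v sec="$sec" \
    'NR > 1 && !($cl ~ /...[0-9]L/ && $sec ~ /FUT /)' RED_FUT_TST.csv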

UPDATED: Bash + Awk: Print first X (dynamic) columns and always the last column

#file test.txt
a b c 5
d e f g h 7
gg jj 2
Say X = 3; I need the output like this:
#file out.txt
a b c 5
d e f 7
gg jj 2
NOT this:
a b c 5
d e f 7
gg jj 2 2 <--- WRONG
I've gotten to this stage:
cat test.txt | awk ' { print $1" "$2" "$3" "NF } '
If you're unsure of the total number of fields, then one option would be to use a loop:
awk '{ for (i = 1; i <= 3 && i < NF; ++i) printf "%s ", $i; print $NF }' file
The loop can be avoided by using a ternary:
awk '{ print $1, $2, (NF > 3 ? $3 OFS $NF : $3) }' file
This is slightly more verbose than the approach suggested by 123, but it means you aren't left with trailing whitespace on the lines with three fields. OFS is the Output Field Separator, a space by default, which is what print inserts between fields when you separate them with a comma.
Use a $ combined with NF:
cat test.txt | awk ' { print $1" "$2" "$3" "$NF } '
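Since X is dynamic in the question, the loop version above generalizes by passing X in with -v; a sketch assuming X arrives in a shell variable named x:
x=3
awk -v x="$x" '{
    for (i = 1; i <= x && i < NF; ++i) printf "%s ", $i
    print $NF
}' test.txt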

Awk & Sort: Output as Comma-Delimited?

I am trying to get this to output as comma delimited. The current version doesn't work at all (I get a blank file as an output), and previous versions (where I keep the awk BEGIN statements but don't have the sort delimiter) will just output as tab delimited, not comma delimited. In the previous versions, without attempting to get the comma delimiters, I do get the expected answer (with the complicated filters, etc), so I'm not asking for help with that portion of it. I realize this is a very ugly way to filter and the numbers are also ugly/very large.
The background of the question: find the regions in the file lamina.bed that overlap with the region chr12:5000000-6000000, sort descending by column 4, and output as comma-delimited. Chromosome is the first column, the start position of the region is column 2, the end position is column 3, and the value is column 4. We are supposed to use awk (in a Unix bash shell). Thank you in advance for your help!
awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000)' /vol1/opt/data/lamina.bed | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn > ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
cat ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
sample lines of input (tab delimited, including the lines on chr12 that should work):
#chrom start end value
chr1 11323785 11617177 0.86217008797654
chr1 12645605 13926923 0.934891485809683
chr1 14750216 15119039 0.945945945945946
chr12 3306736 5048326 0.913561847988077
chr12 5294045 5393088 0.923076923076923
chr12 5505370 6006665 0.791318864774624
chr12 7214638 7827375 0.8562874251497
chr12 8139885 10173149 0.884353741496599
To get comma-separated output, use the following:
$ awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1;print}' file | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn
chr12,5294045,5393088,0.923076923076923
chr12,3306736,5048326,0.913561847988077
chr12,5505370,6006665,0.791318864774624
The only change above is the addition of the action:
{$1=$1;print}
awk will only reformat a line with a new field separator if one or more of the fields on the line have been changed in some way. $1=$1 is sufficient to indicate that field 1 has been changed. Consequently, the new field separators are inserted.
Also, the two calls to awk can be combined into a single call:
awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1; if($1 == "chr12") print}' file | sort -t$"," -k4rn
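As an aside, the four OR-ed clauses are an unrolled interval-overlap test: a region [$2, $3] overlaps chr12:5000000-6000000 exactly when it starts before the window ends and ends after the window starts. A sketch using that single condition (and a plain -t, for sort) would be:
awk 'BEGIN{FS="\t"; OFS=","}
     $1 == "chr12" && $2 <= 6000000 && $3 >= 5000000 {$1=$1; print}' file |
sort -t, -k4rn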
Simpler Example
In the following, the input is tab-separated and the output field separator, OFS, is set to a comma. In this first example, the awk command print is used:
$ echo $'a\tb\tc' | awk -v OFS=, '{print}'
a b c
Despite OFS being set to a comma, the output retains the tab separators.
Now, we add the simple statement $1=$1 and observe the output:
$ echo $'a\tb\tc' | awk -v OFS=, '{$1=$1;print}'
a,b,c
The output is now comma-separated. Again, that is because awk only reformats a line with the new OFS if it thinks that a field on the line has been changed in some way. The assignment of $1 to itself is sufficient to trigger that reformat.
Note that it is not sufficient to make a change that affects the line as a whole. For example, the following does not trigger a reformat:
$ echo $'a\tb\tc' | awk -v OFS=, '{$0=$0;print}'
a b c
It is necessary to change one or more fields of the line individually. In the following, sub operates on $0 as a whole and, consequently, no reformat is triggered:
$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW");print}'
NEW b c
In the example below, however, sub operates specifically on field $1 and hence triggers a reformat:
$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW", $1);print}'
NEW,b,c

Using sed, awk, or sort for CSV manipulation

I have a csv file that needs a lot of manipulation. Maybe by using awk and sed?
input:
"Sequence","Fat","Protein","Lactose","Other Solids","MUN","SCC","Batch Name"
1,4.29,3.3,4.69,5.6,11,75,"35361305a"
2,5.87,3.58,4.41,5.32,10.9,178,"35361305a"
3,4.01,3.75,4.75,5.66,12.2,35,"35361305a"
4,6.43,3.61,3.56,4.41,9.6,275,"35361305a"
final output:
43330075995647
59360178995344
40380035995748
64360275964436
I'm able to get through some of it going step by step.
How do I test specific columns for a value over 9.9 and replace it with 9.9?
Also, is there a way to combine any of these steps?
remove first line:
tail -n +2 test.csv > test1.txt
remove commas:
sed 's/,/ /g' test1.txt > test2.txt
remove quotes:
sed 's/"//g' test2.txt > test3.txt
remove columns 1 and 8 and
reorder remaining columns as 1,2,6,5,4,3:
sort test3.txt | uniq -c | awk '{print $3 "\t" $4 "\t" $8 "\t" $7 "\t" $6 "\t" $5}' > test4.txt
test new columns 1,2,4,5,6 - if the value is over 9.9, replace it with 9.9
How should I do this step?
Solutions for the following parts were found in a previous question: reformatting a text file.
columns 1,2,4,5,6 round decimals to tenths
column 3 needs to be four characters long, using zero to left fill
remove periods and spaces
awk '{$0=sprintf("%.1f%.1f%4s%.1f%.1f%.1f", $1,$2,$3,$4,$5,$6);gsub(/ /,"0");gsub(/\./,"")}1' test5.txt > test6.txt
This produces the output you want from the original file. Note that in the question you specified "column 4 round to whole number", but in the desired output you rounded it to one decimal place instead:
awk -F'[,"]+' 'function m(x) { return x < 9.9 ? x : 9.9 }
NR > 1 {
    s = sprintf("%.1f%.1f%04d%.1f%.1f%.1f", m($2),m($3),$7,m($6),m($5),m($4))
    gsub(/\./, "", s)
    print s
}' test.csv
I have specified the field separator as any number of commas and double quotes together, so this "parses" your CSV format for you without requiring any additional steps.
The function m returns the minimum of 9.9 and the number you pass to it.
Output:
43330075995647
59360178995344
40380035995748
64360275964436
The first three steps in one go:
awk -F, '{gsub(/"/,"");$1=$1} NR>1' test.csv
1 4.29 3.3 4.69 5.6 11 75 35361305a
2 5.87 3.58 4.41 5.32 10.9 178 35361305a
3 4.01 3.75 4.75 5.66 12.2 35 35361305a
4 6.43 3.61 3.56 4.41 9.6 275 35361305a
tail -n +2 file | sort -u | awk -F , '
{
    $0 = $1 FS $2 FS $6 FS $5 FS $4 FS $3
    for (i = 1; i <= 6; ++i)
        if ($i > 9.9)
            $i = 9.9
    $0 = sprintf("%.1f%.1f%4s%.0f%.1f%.1f", $1, $2, $3, $4, $5, $6)
    gsub(/ /, "0"); gsub(/[.]/, "")
    print
}
'
Or
< file awk -F , '
NR > 1 {
    $0 = $1 FS $2 FS $6 FS $5 FS $4 FS $3
    for (i = 1; i <= 6; ++i)
        if ($i > 9.9)
            $i = 9.9
    $0 = sprintf("%.1f%.1f%4s%.0f%.1f%.1f", $1, $2, $3, $4, $5, $6)
    gsub(/ /, "0"); gsub(/[.]/, "")
    print
}
'
Output:
104309964733
205909954436
304009964838
406409643636
