using awk to average specified rows - shell

I have a data file set up like
a 1
b 2
c 3
d 4
a 5
b 6
c 7
d 6
etc
and I would like to output to a new file
a average of 2nd column from all "a" rows
b average of 2nd column from all "b" rows
etc
where a, b, c... are also numbers.
I have been able to do this for specific values (1.4 in the example below) of the 1st column using awk:
awk '{ if ( $1 == 1.4) total += $2; count++ }
END {print total/10 }' data
though count is not giving me the correct number of rows (i.e. count should be 10; I have manually put in 10 to do the average in the last line).
I assume a for loop will be required but I have not been able to implement that correctly.
Please help. Thanks.

awk '{a[$1]+=$2;c[$1]++}END{for(x in a)printf "average of %s is %.2f\n",x,a[x]/c[x]}'
the output of above line (with your example input) is:
average of a is 3.00
average of b is 4.00
average of c is 5.00
average of d is 5.00
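As for why count was off in the single-value attempt: without braces, the if only governs total += $2, so count++ runs on every input line. A minimal sketch of a corrected single-value version follows; the wrapper name avg_for_key and the sample numbers are only illustrative, not from the question.

```shell
# Hypothetical helper: average column 2 over the rows whose column 1 equals the given key.
avg_for_key() {
    awk -v key="$1" '
        $1 == key { total += $2; count++ }     # both updates happen only on matching rows
        END { if (count) print total / count } # print nothing if no row matched
    '
}

# Illustrative input: two rows with key 1.4 and one row with another key.
printf '1.4 2\n1.4 4\n2.0 9\n' | avg_for_key 1.4
```

Passing the key with -v keeps the awk program itself unchanged when a different value is needed.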

Related

Find mean and maximum in 2nd column for a selection in 1st column

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the mean and maximum values in 2nd column for some selection in 1st column.
ofile.dat
1-2 40 15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4 50 29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4;
#50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6 3 2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6;
#3 is their maximum and 2.5 is their mean i.e. (3+2)/2
Similarly, if I choose a selection range of 3 numbers, then the desired output will be
ofile.dat
1-3 40 19.37
4-6 50 18.7
I have the following script which calculates for single values in the 1st column. But I am looking for multiple selections from 1st column.
awk '{
if (a[$1] < $2) { a[$1]=$2 }} END { for (i in a){}}
{b[$1]+=$2; c[$1]++} END{for (i in b)
printf "%d %2s %5s %5.2f\n", i, OFS, a[i], b[i]/c[i]}' ifile.dat
The original data has the values in the 1st column varying from 1 to 100000. So I need to stratify with an interval of 1000. i.e. 1-1000, 1001-2000, 2001-3000,...
The following awk script provides basic descriptive statistics with grouping.
I suggest looking into a more robust solution (Python, Perl, R, ...) that supports additional measures and more flexibility; there is no point in reinventing the wheel.
The grouping is 1-1000, 1001-2000, and so on, as per the comment above. The code is verbose for clarity.
awk '
{
    # Total counter
    nn++
    # Group id: 1-1000 -> 0, 1001-2000 -> 1, ...
    gsize = 1000
    gid = int(($1 - 1) / gsize)
    v = $2
    # Set up a new group, if needed
    if ( !n[gid] ) {
        n[gid] = 0
        sum[gid] = 0
        max[gid] = min[gid] = v
        name[gid] = (gid*gsize + 1) "-" ((gid+1)*gsize)
    }
    if ( v > max[gid] ) max[gid] = v
    sum[gid] += v
    n[gid]++
}
END {
    # Print all groups: range, max, mean, share of all rows
    for (gid in name) {
        printf "%-20s %4d %6.1f %5.1f\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn
    }
}
' ifile.dat
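As a rough sketch of the same binning idea with the group width passed in as a parameter, the version below prints range, max and mean in ascending range order. The wrapper name bin_stats and the output format are assumptions chosen to match the ofile.dat layout in the question.

```shell
# Hypothetical wrapper: bin column 1 into ranges of the given width,
# then report the max and mean of column 2 for each range.
bin_stats() {
    awk -v gsize="$1" '
        {
            gid = int(($1 - 1) / gsize)              # 1..gsize -> 0, gsize+1..2*gsize -> 1, ...
            if (!(gid in n) || $2 > max[gid]) max[gid] = $2
            sum[gid] += $2
            n[gid]++
            if (gid > top) top = gid                 # remember the highest bin seen
        }
        END {
            for (gid = 0; gid <= top; gid++)         # walk bins in order, skipping empty ones
                if (gid in n)
                    printf "%d-%d %d %.2f\n", gid * gsize + 1, (gid + 1) * gsize, max[gid], sum[gid] / n[gid]
        }
    '
}

# The sample from the question, with a width of 2.
printf '1 10\n3 34\n1 4\n3 32\n5 3\n2 2\n4 20\n3 13\n4 50\n1 40\n2 20\n5 2\n' | bin_stats 2
```

For the real data, bin_stats 1000 < ifile.dat would give the 1-1000, 1001-2000, ... grouping. Looping over bin numbers instead of using for (gid in ...) is what keeps the output ordered, since awk does not guarantee array traversal order.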
Could you please try the following; it was written and tested with the shown samples only.
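sort -n -k1,1 Input_file |
awk -v range="1" '
!b[$1]++{
    c[++count]=$1                    # distinct 1st-column values, in sorted order
}
{
    a[$1]=a[$1]>$2?a[$1]:$2          # running max of 2nd column per value
    d[$1]+=$2                        # running sum per value
    e[$1]++                          # row count per value
    till=$1                          # last (largest) value seen
}
END{
    for(i=1;i<=till;i+=(range+1)){
        for(j=i;j<=i+range;j++){
            max=max>a[c[j]]?max:a[c[j]]
            total+=d[c[j]]
            occr+=e[c[j]]
        }
        print i"-"i+range,max,occr?total/occr:0
        occr=total=max=""
    }
}
'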
For shown samples output will be as follows.
1-2 40 15.2
3-4 50 29.8
5-6 3 2.5
I have kept the range variable at 1 because each group in the sample spans two consecutive values (1-2, 3-4, ...). If your groups are 1-1000, 1001-2000, and so on, set range to 999.

Comparison of 2 CSV files having the same column names with different data

I have two CSV files, each with 2 columns and the same column names. 1.csv was generated first and 2.csv was generated 1 hour later.
I want to see the increase or decrease in Profit % for each business unit compared to the last report. For example, business unit B has an increase of 50% (((15-10)/10)*100).
C, however, has a decrease of 50%. Some new business units (AG and JK) were added in the new hourly report and should be reported as new. One business unit (D) was removed in the next hour and can be treated as no longer required.
So basically I need to know how I can compare the files and extract this data.
Busines Profit %
A 0
B 10
C 10
D 0
E 0
F 1615
G 0
Busines profit %
A 0
B 15
C 5
AG 5
E 0
F 1615
G 0
JK 10
updated requirement:
Business  Profit % old  Profit % new  Variation
A         0             0             0
B         10            15            50%
C         10            5             -50%
D         0                           cleared
AG                      5             New
E         0             0             0
F         1615          1615          0%
G         0             0             0%
JK                      10            New
I'd use awk for the job, something like this:
$ awk 'NR==FNR{                  # process file1 (the older report)
    a[$1]=$2                     # hash second column, key is the first column
    next                         # process the next record of file1
}
{                                # process file2 (the newer report)
    if(!($1 in a))               # if company not found in hash a
        p="new"                  # it must be new
    else
        p=($2-a[$1])/(a[$1]==0?1:a[$1])*100 # otherwise calculate p%
    print $1,p                   # output company and p%
}' file1 file2
A 0
B 50
C -50
AG new
E 0
F 0
G 0
JK new
One-liner version with appropriate semicolons:
$ awk 'NR==FNR{a[$1]=$2;next}{if(!($1 in a))p="new";else p=($2-a[$1])/(a[$1]==0?1:a[$1])*100;print $1,p}' file1 file2
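The updated requirement also asks for the old and new values side by side and for removed units to be flagged as cleared. A possible sketch of that is below. The function name compare_reports, the "-" placeholder for a missing value, the "n/a" marker when the old value is 0 but the new one is not, and the assumption that each file starts with one header line are all mine rather than from the question.

```shell
# Hypothetical helper: compare an old and a new report (business unit, profit %).
# Prints: unit, old value, new value, variation.
compare_reports() {
    awk '
        FNR == 1  { next }                      # assumed: first line of each file is a header
        NR == FNR { old[$1] = $2; next }        # first file: remember the old profit per unit
        {
            seen[$1] = 1                        # unit is present in the new report
            if (!($1 in old))
                print $1, "-", $2, "New"
            else if (old[$1] == 0)              # percentage change from 0 is undefined
                print $1, old[$1], $2, ($2 == 0 ? "0%" : "n/a")
            else
                print $1, old[$1], $2, ($2 - old[$1]) / old[$1] * 100 "%"
        }
        END {
            for (k in old)                      # units that disappeared from the new report
                if (!(k in seen)) print k, old[k], "-", "cleared"
        }
    ' "$1" "$2"
}
```

It would be called as compare_reports 1.csv 2.csv. Cleared units are printed last, in no particular order, because they are only known once the whole new report has been read.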

Compute percentile and max value per variable

Bash Gurus, I need to compute the max and percentile numbers for each item in the list, using awk
aa 1
ab 3
aa 4
ac 5
aa 3
ad 2
ab 4
ac 2
ae 2
ac 5
Expected output
Item 90th percentile max value
aa 3.8 4
ab 3.9 4
ac 5 5
ad 2 2
ae 2 2
Am able to get the sum and max using the below, but not the percentile.
awk '{
item[$1]++;
count[$1]+=$2;
max[$1]=$2;
percentile[$1,.9]=$2
}
END{
for (var in item)
print var,count[var],max[var],percentile[var]
}
'
Please suggest.
Percentile calculation from Statistics for Dummies 2nd ed. :). In GNU awk:
$ cat mnp.awk
BEGIN {
    PROCINFO["sorted_in"]="@ind_str_asc" # for order in output
    if(p=="")                            # if p not defined it's median
        p=0.5
    else
        p=p/100                          # if 90th percentile: p=0.9
}
{
    v[$1][NR]=$2                         # values stored per keyword. NR for unique
    if($2>m[$1])                         # find max val
        m[$1]=$2
}
END {
    for(i in v) {                        # for all keywords
        n=asort(v[i])                    # sort values, n is count
        prc=p*n;                         # percentile figuration
        if(prc==int(prc))
            w=(v[i][prc]+v[i][prc+1])/2
        else
            w=v[i][int(prc)+1]
        print i, m[i], w                 # print keyword, max and nth value
    }
}
Run it:
$ awk -v p=90 -f mnp.awk data.txt
aa 4 4
ab 4 4
ac 5 5
ad 2 2
ae 2 2
TODO: if the data file was sorted, this could be streamlined and not all data would need to be stored to memory.
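The expected output in the question (3.8 for aa, 3.9 for ab) appears to come from linear interpolation between the two closest ranks, which is a different percentile definition from the one used above. A sketch of that variant is below, under the assumption that this is the definition wanted. It sorts the input first, so only one item's values are held in memory at a time. The function names percentile_by_key and emit are made up for illustration.

```shell
# Hypothetical helper: per-item percentile (linear interpolation) and max.
# Reads "item value" lines on stdin; the percentile (0-100) is the first argument.
percentile_by_key() {
    sort -k1,1 -k2,2n |
    awk -v p="$1" '
        function emit(   h, lo, frac, w) {
            if (n == 0) return
            h = (n - 1) * p / 100          # zero-based fractional rank
            lo = int(h)
            frac = h - lo
            w = val[lo + 1]                # lower neighbour (val is 1-based)
            if (frac > 0) w += frac * (val[lo + 2] - val[lo + 1])
            print key, w, val[n]           # val[n] is the max because input is sorted
        }
        $1 != key { emit(); key = $1; n = 0 }   # new item: report the previous one
        { val[++n] = $2 }
        END { emit() }
    '
}

# The sample from the question.
printf 'aa 1\nab 3\naa 4\nac 5\naa 3\nad 2\nab 4\nac 2\nae 2\nac 5\n' | percentile_by_key 90
```

On the sample this gives 3.8 for aa and 3.9 for ab, matching the expected table. It uses only POSIX awk features, at the cost of an external sort.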
datamash is a lovely tool, although it doesn't support the percentile part.
$ datamash -W --sort --group=1 max 2 min 2 < INPUT
aa 4 1
ab 4 3
ac 5 2
ad 2 2
ae 2 2
It supports the following operations
File operations:
transpose, reverse
Numeric Grouping operations:
sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
count, first, last, rand
unique, collapse, countunique
Statistical Grouping operations:
mean, median, q1, q3, iqr, mode, antimode
pstdev, sstdev, pvar, svar, mad, madraw
pskew, sskew, pkurt, skurt, dpo, jarque
Here is an elegant solution I found floating around the internet for finding the max value:
{
max[$1] = !($1 in max) ? $2 : ($2 > max[$1]) ? $2 : max[$1]
}
END {
for (i in max)
print i, max[i]
}
Output:
ab 4
ac 5
ad 2
ae 2
aa 4

How to separate lines depending on the value in column 1

I have a text file that contains the following (a b c d etc... contains some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to separate lines with some characters depending on the content of the first string, knowing that those numbers will always be increasing, but may vary as well. The separation would be when first string is incrementing, going from 1 to 2, then 2 to 6 etc...
The output would be like this (here I would like to use ---------- as a separation):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.

How to calculate gradient with AWK

I have a file which includes two columns such as:
A B
1 2
10 20
100 200
.
.
.
I want to calculate the gradient (or slope) dB/dA with awk. That is, the third column should be the difference between adjacent rows in column B divided by the difference between the corresponding adjacent rows in column A. The results for the above data should be:
A B dB/dA
1 2 (20-2)/(10-1)=2
10 20 (200-20)/(100-10)=2
100 200
.
.
.
How can I do that?
Given your file, you can do this:
$ cat file
A B
1 2
10 20
100 200
$ awk 'BEGIN{OFS="\t"} NR==1{print $1,$2,"dB/dA"} NR>2{print a,b,($2-b)/($1-a)} {a=$1;b=$2} END{if(NR>1) print a,b}' file
A B dB/dA
1 2 2
10 20 2
100 200
With:
BEGIN{OFS="\t"} to set the Output Field Separator to tab
NR==1{print $1,$2,"dB/dA"} to copy the header and add the gradient column
NR>2 to skip the header and the first data line, since the first gradient can only be computed once the second data row has been read
{print a,b,($2-b)/($1-a)} to print the previous line and the gradient between this line and the previous one
{a=$1;b=$2} to save the values in a and b for the next line. This part runs on every line, including the first
END{if(NR>1) print a,b} to print the last row, which has no following row and therefore no gradient
Hope this helps
