Find mean and maximum in 2nd column for a selection in 1st column - shell

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the mean and maximum values in 2nd column for some selection in 1st column.
ofile.dat
1-2 40 15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4 50 29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4;
#50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6 3 2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6;
#3 is their maximum and 2.5 is their mean i.e. (3+2)/2
Similarly, if I choose the selection range to cover 3 values, then the desired output will be
ofile.dat
1-3 40 19.37
4-6 50 18.75
I have the following script, which computes these statistics for single values in the 1st column, but I am looking for selections spanning multiple values:
awk '{
    if (a[$1] < $2) a[$1] = $2
    b[$1] += $2
    c[$1]++
}
END {
    for (i in b)
        printf "%d %2s %5s %5.2f\n", i, OFS, a[i], b[i]/c[i]
}' ifile.dat
The original data has values in the 1st column varying from 1 to 100000, so I need to stratify with an interval of 1000, i.e. 1-1000, 1001-2000, 2001-3000, ...

The following awk script will provide basic descriptive statistics with grouping. I suggest looking into a more robust solution (Python, Perl, R, ...) that supports additional measures and more flexibility; there is no point reinventing the wheel. The grouping logic is 1-1000, 1001-2000, ..., as per the comment above. The code is verbose for clarity.
awk '
{
    # Total counter
    nn++
    # Group id
    gsize = 1000
    gid = int(($1-1)/gsize)
    v = $2
    # Set up a new group, if needed
    if ( !n[gid] ) {
        n[gid] = 0
        sum[gid] = 0
        max[gid] = min[gid] = v
        name[gid] = (gid*gsize + 1) "-" ((gid+1)*gsize)
    }
    if ( v > max[gid] ) max[gid] = v
    sum[gid] += v
    n[gid]++
}
END {
    # Print all groups: name, max, mean, and the fraction of rows in each group
    for (gid in name) {
        printf "%-20s %4d %6.1f %5.1f\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn
    }
}
'
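For the toy ifile.dat above, the same script with gsize set to 2 instead of 1000 should reproduce the question's grouping. A run might look roughly like this (groupstats.awk is a hypothetical file holding the script above with gsize = 2; the fourth column is each group's share of all rows, and group order may vary since for (gid in name) is unordered):
$ awk -f groupstats.awk ifile.dat
1-2                    40   15.2   0.4
3-4                    50   29.8   0.4
5-6                     3    2.5   0.2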

Could you please try the following; it is tested and written with the shown samples only.
sort -k1,1n Input_file |
awk -v range="1" '
!b[$1]++{
    c[++count]=$1
}
{
    a[$1]=a[$1]>$2?a[$1]:$2
    d[$1]+=$2
    e[$1]++
    till=$1
}
END{
    for(i=1;i<=till;i+=(range+1)){
        for(j=i;j<=i+range;j++){
            max=max>a[c[j]]?max:a[c[j]]
            total+=d[c[j]]
            occr+=e[c[j]]
        }
        print i"-"i+range,max,occr?total/occr:0
        occr=total=max=""
    }
}
'
For the shown samples, the output will be as follows.
1-2 40 15.2
3-4 50 29.8
5-6 3 2.5
I have kept the range variable at 1 since each group here spans 2 consecutive values in the 1st column; in your real case (1, 1001, 2001, and so on) set the range variable to 999 in the same way.
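For the real data described in the question (1st column from 1 to 100000, buckets of 1000), the invocation would presumably become the following, with the awk body above saved to a hypothetical bucket.awk:
sort -k1,1n ifile.dat | awk -v range="999" -f bucket.awk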

Related

Compute percentile and max value per variable

Bash Gurus, I need to compute the max and percentile numbers for each item in the list, using awk
aa 1
ab 3
aa 4
ac 5
aa 3
ad 2
ab 4
ac 2
ae 2
ac 5
Expected output
Item 90th percentile max value
aa 3.8 4
ab 3.9 4
ac 5 5
ad 2 2
ae 2 2
I am able to get the sum and max using the script below, but not the percentile.
awk '{
    item[$1]++;
    count[$1]+=$2;
    max[$1]=$2;
    percentile[$1,.9]=$2
}
END{
    for (var in item)
        print var,count[var],max[var],percentile[var]
}
'
Please suggest.
Percentile calculation from Statistics for Dummies, 2nd ed. :). In GNU awk:
$ cat mnp.awk
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc"  # for ordered output
    if(p=="")                             # if p is not defined, use the median
        p=0.5
    else
        p=p/100                           # for the 90th percentile: p=0.9
}
{
    v[$1][NR]=$2                          # values stored per keyword; NR keeps indices unique
    if($2>m[$1])                          # find the max value
        m[$1]=$2
}
END {
    for(i in v) {                         # for all keywords
        n=asort(v[i])                     # sort values, n is the count
        prc=p*n                           # percentile figuration
        if(prc==int(prc))
            w=(v[i][prc]+v[i][prc+1])/2
        else
            w=v[i][int(prc)+1]
        print i, m[i], w                  # print keyword, max and percentile value
    }
}
Run it:
$ awk -v p=90 -f mnp.awk data.txt
aa 4 4
ab 4 4
ac 5 5
ad 2 2
ae 2 2
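To see the percentile figuration on a concrete row: for keyword aa the stored values sort to 1, 3, 4, so n=3 and prc=0.9*3=2.7; since 2.7 is not an integer, w=v[int(2.7)+1]=v[3]=4, which matches the aa line above.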
TODO: if the data file was sorted, this could be streamlined and not all of the data would need to be stored in memory.
datamash is a lovely tool, although it doesn't support the percentile part.
$ datamash -W --sort --group=1 max 2 min 2 < INPUT
aa 4 1
ab 4 3
ac 5 2
ad 2 2
ae 2 2
It supports the following operations:
File operations:
transpose, reverse
Numeric Grouping operations:
sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
count, first, last, rand
unique, collapse, countunique
Statistical Grouping operations:
mean, median, q1, q3, iqr, mode, antimode
pstdev, sstdev, pvar, svar, mad, madraw
pskew, sskew, pkurt, skurt, dpo, jarque
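Since mean and max are both on that list, a grouped mean/max run over the same input should look something like this (the output below is reconstructed by hand from the sample, so treat the exact number formatting as an assumption):
$ datamash -W --sort --group=1 mean 2 max 2 < INPUT
aa 2.6666666666667 4
ab 3.5 4
ac 4 5
ad 2 2
ae 2 2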
Here is an elegant solution I found floating around the internet for finding the max value:
{
    max[$1] = !($1 in max) ? $2 : ($2 > max[$1]) ? $2 : max[$1]
}
END {
    for (i in max)
        print i, max[i]
}
Output:
ab 4
ac 5
ad 2
ae 2
aa 4
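A note on why the !($1 in max) test matters: without it, the first value of each key would be compared against an empty (effectively zero) entry, so a column of all-negative values could report an empty or wrong max. Assuming the block is saved as a hypothetical max.awk, running it is simply:
$ awk -f max.awk data.txt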

Find the closest values: Multiple columns conditions

Following my first question here, I want to extend the condition to find the closest value from the first and second columns of two different files, and print specific columns.
File 1
1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1
File 2
1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d
2 4 5 e
9 4 1 f
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j
So, for example, I need to find the closest value from $1 in file 2 for each $1 in file 1, but then also search for the closest $2.
Output:
1 2 a1*
1 2 b*
1 4 b1
1 4 d
8 5 c1
9 5 g
* First column from file 1 and 2nd column from file 2: for the 1st column (of file 1) the closest value (from the 1st column of file 2) is 1, and the 2nd condition is that it must also be the closest value for the second column, which in this case is 2. I print $1,$2,$5 from file 1 and $1,$2,$4 from file 2.
For the other outputs it is the same procedure.
The solution for finding the closest value is in my other post and was given by @Tensibai.
But any solution will work.
Thanks!
Sounds a little convoluted but works:
function closest(array,searched) {
    distance=999999 # this should be higher than the max distance to avoid returning null
    split(searched,skeys,OFS)
    # Get the first part of the key
    for (x in array) { # loop over the array to get its keys
        split(x,mkeys,OFS) # split the array key
        (mkeys[1]+0 > skeys[1]+0) ? tmp = mkeys[1] - skeys[1] : tmp = skeys[1] - mkeys[1] # +0 to compare as numbers; the ternary reduces code and computes the diff between the key and the target
        if (tmp < distance) { # if the distance is less than the preceding one, update
            distance = tmp
            found1 = mkeys[1] # and save the closest key found so far
        }
    }
    # At this point we have the first part of the key; redo the work for the second part
    distance=999999
    for (x in array) {
        split(x,mkeys,OFS)
        if (mkeys[1] == found1) { # filter on the first part of the key
            (mkeys[2]+0 > skeys[2]+0) ? tmp = mkeys[2] - skeys[2] : tmp = skeys[2] - mkeys[2] # same numeric diff, this time on the second field
            if (tmp < distance) { # if the distance is less than the preceding one, update
                distance = tmp
                found2 = mkeys[2] # and save the closest key found so far
            }
        }
    }
    # Now we have the second field too
    return (found1 OFS found2) # return the combined key from our two searches
}
{
    if (NR>FNR) { # if we changed file (FNR has reset but NR keeps growing), fill the second array
        b[($1 OFS $2)] = $4 # array with "$1 $2" as key and $4 as value
    } else {
        key = ($1 OFS $2) # build the key once to avoid recomputing it when accessing it later
        akeys[max++] = key # store the keys to preserve order at the end, as for (x in array) does not guarantee order
        a[key] = $5 # array with the key stored previously and $5 as value
    }
}
END { # both files parsed, print the result
    for (i in akeys) { # loop over the key array, which has a numeric index, keeping order
        print akeys[i],a[akeys[i]] # print key then value from the first file
        if (akeys[i] in b) { # if the same key exists in the second file
            print akeys[i],b[akeys[i]] # then print it
        } else {
            bindex = closest(b,akeys[i]) # call the function to find the closest key from the second file
            print bindex,b[bindex] # print what we found
        }
    }
}
Note I'm using OFS to combine the fields, so if you change it for output the script will behave properly.
WARNING: This should be fine for relatively short files, but as the array from the second file is traversed twice, each search will take twice as long.
There's room for a better search algorithm if your files are sorted (but that was not the case in the previous question, and you wished to keep the order from the file). A first improvement in that case: break out of the for loop when the distance starts to grow again, as in the sketch below.
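A rough sketch of that early-exit idea for the first field, assuming GNU awk (ordered for (x in array) traversal needs PROCINFO["sorted_in"], which mawk/nawk do not have):
# Sketch only, assuming GNU awk: scan keys in ascending numeric order and
# stop as soon as the distance starts growing again.
function closest_sorted(array, searched,    x, mkeys, skeys, tmp, distance, found) {
    PROCINFO["sorted_in"] = "@ind_num_asc" # numeric ascending traversal (gawk only)
    distance = 999999
    split(searched, skeys, OFS)
    for (x in array) {
        split(x, mkeys, OFS)
        tmp = (mkeys[1]+0 > skeys[1]+0) ? mkeys[1] - skeys[1] : skeys[1] - mkeys[1]
        if (tmp < distance) {
            distance = tmp
            found = mkeys[1]
        } else if (tmp > distance)
            break # keys are sorted, so every remaining distance is larger
    }
    return found
}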
Output from your sample files:
$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g

How to calculate gradient with AWK

I have a file which includes two columns such as:
A B
1 2
10 20
100 200
.
.
.
I want to calculate the gradient (or slope) dB/dA with awk. That means the third column should be the difference between adjacent rows in column B divided by the difference between the corresponding adjacent rows in column A. The results for the above data should be:
A B dB/dA
1 2 (20-2)/(10-1)=2
10 20 (200-20)/(100-10)=2
100 200
.
.
.
How can I do that?
Given your file, you can do this:
$ cat file
A B
1 2
10 20
100 200
$ awk 'BEGIN{OFS="\t"}NR==1{print $1,$2,"dB/dA"}NR>2{print a,b,($2-b)/($1-a)}{a=$1;b=$2}END{print a,b}' file
A B dB/dA
1 2 2
10 20 2
100 200
With:
BEGIN{OFS="\t"} to set the Output Field Separator to a tab
NR==1{print $1,$2,"dB/dA"} to copy the header and add the gradient column
NR>2 to skip the header and the first data line, since each printed row pairs a line with the previous one
{a=$1;b=$2} to save the values in a and b for the next line; this part runs from the 1st line
{print a,b,($2-b)/($1-a)} to print the previous line and the gradient between it and the current one
END{print a,b} to echo the last row, which has no following line to compute a gradient with
Hope this helps

using awk to average specified rows

I have a data file set up like
a 1
b 2
c 3
d 4
a 5
b 6
c 7
d 6
etc
and I would like to output to a new file
a average of 2nd column from all "a" rows
b average of 2nd column from all "b" rows
etc
where a, b, c... are also numbers.
I have been able to do this for specific values (1.4 in the example below) of the 1st column using awk:
awk '{ if ( $1 == 1.4) total += $2; count++ }
END {print total/10 }' data
though count is not giving me the correct number of rows (count should be 10, which is why I have manually put 10 into the average on the last line).
I assume a for loop will be required but I have not been able to implement that correctly.
Please help. Thanks.
awk '{a[$1]+=$2;c[$1]++}END{for(x in a)printf "average of %s is %.2f\n",x,a[x]/c[x]}'
The output of the above line (with your example input) is:
average of a is 3.00
average of b is 4.00
average of c is 5.00
average of d is 5.00
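As an aside on the original attempt: count comes out wrong because count++ sits outside the if body, so it increments on every row rather than only on the matching ones. Keeping both statements inside the braces fixes the single-value version, along these lines:
awk '$1 == 1.4 { total += $2; count++ } END { print total/count }' data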

Bash/Nawk whitespace problems

I have 100 datafiles, each with 1000 rows, and they all look something like this:
0 0 0 0
1 0 1 0
2 0 1 -1
3 0 1 -2
4 1 1 -2
5 1 1 -3
6 1 0 -3
7 2 0 -3
8 2 0 -4
9 3 0 -4
10 4 0 -4
.
.
.
999 1 47 -21
1000 2 47 -21
I have developed a script which is supposed to take the square of each value in columns 2,3,4, and then sum and square root them.
Like so:
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
It then calculates the square of that value, and averages these numbers over every data-file to output the average "calc" for each row and average "fluc" for each row.
The meaning of these numbers is this:
The first number is the step number, the next three are coordinates on the x, y and z axis respectively. I am trying to find the distance the "steps" have taken me from the origin, this is calculated with the formula r = sqrt(x^2 + y^2 + z^2). Next I need the fluctuation of r, which is calculated as f = r^4 or f = (r^2)^2.
These must be averages over the 100 data files, which leads me to:
r = r + sqrt(x^2 + y^2 + z^2)
avg = r/s
and similarly for f where s is the number of read data files which I figure out using sum=$(ls -l *.data | wc -l).
Finally, my last calculation is the deviation between the expected r and the average r, which is calculated as stddev = sqrt(fluc - (r^2)^2) outside of the loop using final values.
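To make the arithmetic concrete: for the row 4 1 1 -2 above, r = sqrt(1^2 + 1^2 + (-2)^2) = sqrt(6) ≈ 2.449, r^2 = 6, and f = (r^2)^2 = 36.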
The script I created is:
#!/bin/bash
sum=$(ls -l *.data | wc -l)
paste -d"\t" *.data | nawk -v s="$sum" '{
    for (i=0; i<=s-1; i++)
    {
        t1 = 2+(i*4)
        t2 = 3+(i*4)
        t3 = 4+(i*4)
        temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
        calc = $calc + sqrt ($temp)
        fluc = $fluc + ($calc*$calc)
    }
    stddev = sqrt(($calc^2) - ($fluc))
    print $1" "calc/s" "fluc/s" "stddev
    temp=0
    calc=0
    stddev=0
}'
Unfortunately, part way through I receive an error:
nawk: cmd. line:9: (FILENAME=- FNR=3) fatal: attempt to access field -1
I am not experienced enough with awk to be able to figure out exactly where I am going wrong, could someone point me in the right direction or give me a better script?
The expected output is one file with:
0 0 0 0
1 (calc for all 1's) (fluc for all 1's) (stddev for all 1's)
2 (calc for all 2's) (fluc for all 2's) (stddev for all 2's)
.
.
.
The following script should do what you want. The only thing that might not work yet is the choice of delimiters. In your original script you seem to have tabs. My solution assumes spaces. But changing that should not be a problem.
It simply pipes the data of all files sequentially into nawk without counting the files first; counting them is not required here. Instead of trying to keep track of positions in the file, it uses arrays to store separate statistical data for each step. In the end it iterates over all step indexes found and outputs them. Since that iteration is not sorted, the output is piped into a Unix sort call, which handles the ordering.
#!/bin/bash
# pipe the data of all files into the nawk processor
cat *.data | nawk '
BEGIN {
    FS=" " # set the delimiter for the columns
}
{
    step = $1 # step is in column 1
    temp = $2*$2 + $3*$3 + $4*$4
    # use arrays indexed by step to store data
    calc[step] = calc[step] + sqrt (temp)
    fluc[step] = fluc[step] + calc[step]*calc[step]
    count[step] = count[step] + 1 # count the number of samples seen for a step
}
END {
    # iterate over all existing steps (this is not sorted!)
    for (i in count) {
        stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
        print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
    }
}' | sort -n -k 1 # that is why we sort here: first column "-k 1" and numerically "-n"
EDIT
As suggested by @EdMorton, awk can take care of loading the files itself. The following enhanced version removes the call to cat and instead passes the file pattern as a parameter to nawk. Also, as suggested by @NictraSavios, the new version introduces special handling for the output of the statistics of the last step. Note that the gathering of the statistics is still done for all steps. It is a little difficult to suppress this while reading the data, since at that point we don't know yet what the last step will be. Although that could be done with some extra effort, you would probably lose a lot of robustness in your data handling, since right now the script does not make any assumptions about:
the number of files provided,
the order of the files processed,
the number of steps in each file,
the order of the steps in a file,
the completeness of steps as a range without "holes".
Enhanced script:
#!/bin/bash
nawk '
BEGIN {
    FS=" " # set the delimiter for the columns (not really required for space, which is the default)
    maxstep = -1
}
{
    step = $1 # step is in column 1
    temp = $2*$2 + $3*$3 + $4*$4
    # remember the maximum step for selected output
    if (step > maxstep)
        maxstep = step
    # use arrays indexed by step to store data
    calc[step] = calc[step] + sqrt (temp)
    fluc[step] = fluc[step] + calc[step]*calc[step]
    count[step] = count[step] + 1 # count the number of samples seen for a step
}
END {
    # iterate over all existing steps (this is not sorted!)
    for (i in count) {
        stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
        if (i == maxstep)
            # handle the last step in a special way
            print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
        else
            # this is the normal handling
            print i" "calc[i]/count[i]
    }
}' *.data | sort -n -k 1 # that is why we sort here: first column "-k 1" and numerically "-n"
You could also use:
awk -f c.awk *.data
where c.awk is
{
    j=FNR
    temp=$2*$2+$3*$3+$4*$4
    calc[j]=calc[j]+sqrt(temp)
    fluc[j]=fluc[j]+calc[j]*calc[j]
}
END {
    N=ARGIND
    for (i=1; i<=FNR; i++) {
        stdev=sqrt(fluc[i]-calc[i]*calc[i])
        print i-1,calc[i]/N,fluc[i]/N,stdev
    }
}
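One caveat: ARGIND is a GNU awk extension, so c.awk as written needs gawk. A portable sketch of the same file count (an assumption, not part of the original answer) is to bump a counter whenever a new file starts, and drop the N=ARGIND line:
FNR==1 { N++ } # FNR resets at the start of each input file, so this fires once per file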
