Calculate Percentile(s) in Bash

I am trying to calculate a range of percentiles (5th-99th) in Bash for a text file that contains 5 values, one per line.
Input
34.5
32.2
33.7
30.4
31.8
Attempted Code
awk '{s[NR-1]=$1} END{print s[int(0.05-0.99)]}' input
(This doesn't work: 0.05-0.99 is evaluated as a single arithmetic expression, so int(0.05-0.99) is int(-0.94) = 0 and the command just prints s[0], the first stored value, rather than a range of percentiles.)
Expected Output
99th 34.5
97th 34.4
95th 34.3
90th 34.2
80th 33.9
70th 33.4
60th 32.8
50th 32.2
40th 32.0
30th 31.9
20th 31.5
10th 31.0
5th 30.7

To calculate percentiles from 5 values, one needs to create a mapping between percentiles and values, and interpolate between the known points. This is called a piecewise linear function (a.k.a. PWLF).
F(100) = 34.5
F(75) = 33.7
F(50) = 32.2
F(25) = 31.8
F(0) = 30.4
Mapping any other x in the range 0..100 requires linear interpolation between F(L) and F(H), where L is the largest breakpoint <= x and H is the next breakpoint above it.
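For example, the 99th percentile falls between F(75) = 33.7 and F(100) = 34.5, so F(99) = 33.7 + (34.5 - 33.7) * (99 - 75) / (100 - 75) = 34.468, which rounds to the expected 34.5. A quick arithmetic check:
$ awk 'BEGIN { printf "%.1f\n", 33.7 + (34.5-33.7)*(99-75)/(100-75) }'
34.5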
awk '
#!/usr/bin/gawk -f
# PWLF interpolation function: takes a value x and two arrays for X & Y
function pwlf(x, px, py) {
    # Shortcut to calculate the low index of X: the largest breakpoint <= x
    p_l = 1 + int(x/25)
    p_h = p_l + 1
    x_l = px[p_l]
    x_h = px[p_h]
    y_l = py[p_l]
    y_h = py[p_h]
    #print "X=", x, p_l, p_h, x_l, x_h, y_l, y_h
    return y_l + (y_h - y_l) * (x - x_l) / (x_h - x_l)
}
# Read the input into the yy array
{ yy[n*25] = $1; n++ }
# Print the table
END {
    # Sort the values of yy (asort is a gawk extension)
    ny = asort(yy)
    # Create the xx array: 0, 25, ..., 100
    for (i=1; i<=ny; i++) xx[i] = 25*(i-1)
    # Prepare the list of requested percentiles
    ns = split("99 97 95 90 80 70 60 50 40 30 20 10 5", pv)
    for (i=1; i<=ns; i++) printf "%dth %.1f\n", pv[i], pwlf(pv[i], xx, yy)
}
' input
Technically this is a bash one-liner, but based on the comments to the OP it is better to place the whole thing into script.awk and execute it in one line. The solution includes the '#!' line needed to invoke it as a standalone awk script:
/path/to/script.awk < input
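Assuming gawk is installed (asort is a gawk extension), making the script executable and running it on the sample input should reproduce the table above:
$ chmod +x /path/to/script.awk
$ /path/to/script.awk < input
99th 34.5
97th 34.4
...
5th 30.7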

Related

Averaging diagonally over a matrix

I have a matrix, e.g. a 5 x 5 matrix:
$ cat input.txt
1 5.6 3.4 2.2 -9.99E+10
2 3 2 2 -9.99E+10
2.3 3 7 4.4 5.1
4 5 6 7 8
5 -9.99E+10 9 11 13
Here I would like to ignore -9.99E+10 values.
I am looking for the average of all entries after dividing the matrix diagonally. There are four possibilities (shown in a graphic in the original question, with 999 in place of -9.99E+10 to save space): splitting along either diagonal, or splitting horizontally/vertically.
I would like to average all the values under each of the differently shaded triangles.
So the desired output is:
$ cat outfile.txt
P1U 3.39 (average of all values in the upper triangle of possibility 1, ignoring -9.99E+10)
P1L 6.88 (average of all values in the lower triangle of possibility 1, ignoring -9.99E+10)
P2U 4.90
P2L 5.59
P3U 3.31
P3L 6.41
P4U 6.16
P4L 4.16
I am finding it difficult to develop a proper algorithm for this in Fortran or in a shell script.
I am thinking of the following algorithm, but can't work out what comes next.
step 1: # assign -9.99E+10 to the lower-diagonal values of a[i,j]
for i in {1..5}; do
    for j in {1..5}; do
        a[i,j+1]=-9.99E+10
    done
done
step 2: # take the average
sum=0
for i in {1..5}; do
    for j in {1..5}; do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f" P1U $sum
step 3: # assign -9.99E+10 to the upper-diagonal values of a[i,j]
for i in {1..5}; do
    for j in {1..5}; do
        a[i-1,j]=-9.99E+10
    done
done
step 4: # take the average
sum=0
for i in {1..5}; do
    for j in {1..5}; do
        sum=sum+a[i,j]
    done
done
printf "%s %5.2f" P1L $sum
Just save all the values in an array indexed by row and column number, and then in the END section repeat this process, setting the beginning and end row/column loop delimiters as needed when defining the loops for each section:
$ cat tst.awk
{
    for (colNr=1; colNr<=NF; colNr++) {
        vals[colNr,NR] = $colNr
    }
}
END {
    sect = "P1U"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        for (colNr=begColNr; colNr<=endColNr-rowNr+1; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)

    sect = "P1L"
    begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
    sum = cnt = 0
    for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
        for (colNr=endColNr-rowNr+1; colNr<=endColNr; colNr++) {
            val = vals[colNr,rowNr]
            if ( val != "-9.99E+10" ) {
                sum += val
                cnt++
            }
        }
    }
    printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
}
$ awk -f tst.awk file
P1U 3.39
P1L 6.88
I assume that, given the above for handling the first diagonal halves, you'll be able to figure out the other diagonal halves; see the sketch below for the second diagonal's loop bounds. The horizontal/vertical halves are trivial: just set begRowNr to int(NR/2)+1, or endRowNr to int(NR/2), or begColNr to int(NF/2)+1, or endColNr to int(NF/2), then loop through the resulting full range of values.
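For instance, the loop bounds for the second diagonal split (taking P2U as the triangle on and above the main diagonal, and P2L as the one on and below it; a sketch, not part of the original answer) could be:
sect = "P2U"
sum = cnt = 0
for (rowNr=1; rowNr<=NR; rowNr++) {
    for (colNr=rowNr; colNr<=NF; colNr++) {
        val = vals[colNr,rowNr]
        if ( val != "-9.99E+10" ) { sum += val; cnt++ }
    }
}
printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
With the sample matrix this prints P2U 4.90; swapping the inner loop to for (colNr=1; colNr<=rowNr; colNr++) gives P2L 5.59.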
You can compute them all in one pass:
$ awk -v NA='-9.99E+10' '{for(i=1;i<=NF;i++) a[NR,i]=$i}
END {
    for(i=1;i<=NR;i++)
        for(j=1;j<=NF;j++) {
            v = a[i,j]
            if (v != NA) {
                if (i+j <= 6) {p["1U"]+=v; c["1U"]++}
                if (i+j >= 6) {p["1L"]+=v; c["1L"]++}
                if (j >= i)   {p["2U"]+=v; c["2U"]++}
                if (i <= 3)   {p["3U"]+=v; c["3U"]++}
                if (i >= 3)   {p["3D"]+=v; c["3D"]++}
                if (j <= 3)   {p["4U"]+=v; c["4U"]++}
                if (j >= 3)   {p["4D"]+=v; c["4D"]++}
            }
        }
    for(k in p) printf "P%s %.2f\n", k, p[k]/c[k]
}' file | sort
P1L 6.88
P1U 3.39
P2U 4.90
P3D 6.41
P3U 3.31
P4D 6.16
P4U 4.16
I forgot to add P2D, but from the pattern it should be clear what needs to be done.
To generalize further, as suggested: assert NF==NR, otherwise the diagonals are not well defined. Let n=NF (and n=NR). You can then replace 6 with n+1 and 3 with ceil(n/2), where ceil can be implemented as function ceil(x) {return x==int(x) ? x : int(x)+1} (note the int(): returning x+1 for a non-integer x would be wrong).
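Putting that together, a generalized one-pass version (a sketch, assuming a square matrix, and including the missing P2D) might look like:
awk -v NA='-9.99E+10' '
function ceil(x) { return x==int(x) ? x : int(x)+1 }
{ for(i=1;i<=NF;i++) a[NR,i]=$i }
END {
    n = NR; m = ceil(n/2)
    for(i=1;i<=n;i++)
        for(j=1;j<=n;j++) {
            v = a[i,j]
            if (v != NA) {
                if (i+j <= n+1) {p["1U"]+=v; c["1U"]++}
                if (i+j >= n+1) {p["1L"]+=v; c["1L"]++}
                if (j >= i)     {p["2U"]+=v; c["2U"]++}
                if (j <= i)     {p["2D"]+=v; c["2D"]++}
                if (i <= m)     {p["3U"]+=v; c["3U"]++}
                if (i >= m)     {p["3D"]+=v; c["3D"]++}
                if (j <= m)     {p["4U"]+=v; c["4U"]++}
                if (j >= m)     {p["4D"]+=v; c["4D"]++}
            }
        }
    for(k in p) printf "P%s %.2f\n", k, p[k]/c[k]
}' file | sort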

Find the N max differences between consecutive numbers in a file using Unix

Integer numbers are stored in a file, one per row/line. I need to find the max and the N max differences between two consecutive numbers in the file.
e.g.
12
15
50
80
Max diff: 35 (50 - 15), and for, say, N=2: 1st max 35 and 2nd max 30 (80 - 50).
#!/usr/bin/awk -f
NR>1 {
    diff = $0 - prev
    for (i = 0; i < N; ++i)
        if (diff > maxdiff[i]) {
            # shift smaller maxima down one slot to keep the array sorted
            for (j = N; --j > i; )
                if (j-1 in maxdiff) maxdiff[j] = maxdiff[j-1]
            maxdiff[j] = diff
            break
        }
}
{ prev = $0 }
END { for (i = 0; i < N; ++i) if (i in maxdiff) print maxdiff[i] }
For example, if the script is named nmaxdiff.awk and the numbers are stored in the file numbers, enter:
nmaxdiff.awk N=2 numbers
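With the four example numbers above, the consecutive differences are 3, 35 and 30, so a run should look like:
$ printf '12\n15\n50\n80\n' > numbers
$ chmod +x nmaxdiff.awk
$ ./nmaxdiff.awk N=2 numbers
35
30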

Binning Together Allele Frequencies From VCF Sequencing Data

I have a sequencing datafile containing base pair locations from the genome, that looks like the following example:
chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5
I would like to compare certain groups defined by the location of the bp found in column 2. I then want the average of the numbers in column 5 of the matching regions.
So, using the example above, let's say I am looking for the average of the 5th column for all samples spanning chr1 810-820 and chr2 310-330. The first five rows should be identified, and their 5th-column numbers averaged, which equals 0.42.
I tried creating an array of ranges and then using awk to call these locations, but have been unsuccessful. Thanks in advance.
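For reference, with the two ranges hardcoded, a minimal awk one-liner (a sketch; the answers below are more general, and data.txt is an assumed file name) produces the expected 0.42:
$ awk '($1=="chr1" && $2>=810 && $2<=820) || ($1=="chr2" && $2>=310 && $2<=330) {s+=$5; n++} END {if (n) printf "%.2f\n", s/n}' data.txt
0.42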
import pandas as pd
from io import StringIO  # Python 3; in Python 2 this was "from StringIO import StringIO"
s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""
sio = StringIO(s)
df = pd.read_csv(sio, sep=" ", header=None)
df.columns = ["a", "b", "c", "d", "e"]
# The query expression is intuitive
r = df.query("(a=='chr1' & 810<b<820) | (a=='chr2' & 310<b<330)")
print(r["e"].mean())
pandas might be better suited for such tabular data processing, and it's Python.
Here's some Python code to do what you are asking for. It assumes that your data lives in a text file called 'data.txt':
#!/usr/bin/env python3
data = open('data.txt').readlines()

def avg(keys):
    key_sum = 0
    key_count = 0
    for item in data:
        fields = item.split()
        krange = keys.get(fields[0], None)
        if krange:
            r = int(fields[1])
            if krange[0] <= r <= krange[1]:
                key_sum += float(fields[-1])
                key_count += 1
    print(key_sum / key_count)

keys = {}  # dict mapping chromosome -> (low, high) range of interest
keys['chr1'] = (810, 820)
keys['chr2'] = (310, 330)
avg(keys)
Sample Output:
0.42
Here's an awk script answer. For input, I created a 2nd file which I called ranges:
chr1 810 820
chr2 310 330
The script itself looks like:
#!/usr/bin/awk -f
FNR==NR { low_r[$1] = $2; high_r[$1] = $3; next }
{ l = low_r[$1]; h = high_r[$1]; if (l == "") next }
$2 >= l && $2 <= h { total += $5; cnt++ }
END {
    if (cnt > 0) print (total/cnt)
    else print "no matched data"
}
Where the breakdown is:
FNR==NR - absorb the ranges file, building low_r and high_r arrays keyed off the first column of that file.
Then, for every row in the data, look up matches in the low_r and high_r arrays. If there's no match, skip any other processing.
Check an inclusive range based on the low and high values, incrementing total and cnt for matched rows.
At the END, print the simple average when there were matches.
When the script (called script.awk) is made executable it can be run like:
$ ./script.awk ranges data
0.42
where I've called the data file data.
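If per-chromosome averages were wanted instead of one combined average, a small variation on the same idea (a sketch, reusing the same ranges file) could be:
#!/usr/bin/awk -f
FNR==NR { low_r[$1] = $2; high_r[$1] = $3; next }
($1 in low_r) && $2 >= low_r[$1] && $2 <= high_r[$1] { total[$1] += $5; cnt[$1]++ }
END { for (k in cnt) printf "%s %.2f\n", k, total[k]/cnt[k] }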

Algorithm to find an interval with the highest summed weight of weighted overlapping intervals

Well, I think it's hard to explain, so I've made a figure to show it.
As the figure shows, there are 6 intervals of time, each with its own weight: the higher the opacity, the higher the weight. I want an algorithm to find the interval with the highest summed weight. In the case of the figure, that is the overlap of intervals 5 and 6, which is the area with the highest opacity.
Split each interval into start and end points.
Sort the points.
Start with a sum of 0.
Iterate through the points using a sweep-line algorithm:
If you get a start point:
Increase the sum by the weight of the corresponding interval.
If the sum is higher than the best sum so far, store this start point and set a flag.
If you get an end point:
If the flag is set, store the stored start point and this end point with the current sum as the best interval so far, and reset the flag.
Decrease the sum by the weight of the corresponding interval.
This is derived from the answer I wrote here, which is based on the unweighted version, i.e. finding the maximum number of overlapping intervals rather than the maximum summed weight. (An awk implementation of this sweep line appears as Algorithm #2 in the benchmark section below.)
Example:
For the figure above, the start/end points will be sorted as (S = start, E = end):
1S, 1E, 2S, 3S, 2E, 3E, 4S, 5S, 4E, 6S, 5E, 6E
Iterating through them, you'll set the flag on 1S, 5S and 6S, and you'll store the respective intervals at 1E, 4E and 5E (the first end points you reach after those start points).
You won't set the flag on 2S, 3S or 4S, as the sum will be lower than the best sum so far.
The algorithm logic can be derived from the figure. Assuming the resolution of the time intervals is 1 minute, an array can be created and used for all the calculations:
create an array of 24 * 60 elements and fill it with 0 weights;
for each time interval, add the weight of this interval to the corresponding elements of the array;
find the maximum summed weight by iterating over the array;
iterate over the array again and output the array indices (times) holding the maximal summed weight.
(Algorithm #1 below implements this.) The algorithm can be modified for a slightly different task, if you need the interval indices in the output: the array would then hold a list of the input interval indices as a second dimension (or in a separate array, depending on the particular language).
UPD. I was curious whether this simple algorithm is significantly slower than the more elegant one suggested by @Dukeling, so I coded both algorithms and created an input generator to estimate their performance.
Generator:
#!/bin/sh
awk -v n=$1 '
BEGIN {
    # note: without srand() the same pseudo-random sequence is produced on every run
    tmax = 24 * 60; wmax = 100;
    for (i = 0; i < n; i++) {
        t1 = int(rand() * tmax);
        t2 = int(rand() * tmax);
        w = int(rand() * wmax);
        if (t2 >= t1) {print t1, t2, w} else {print t2, t1, w}
    }
}' | sort -n > i.txt
Algorithm #1:
#!/bin/sh
awk '
{t1[++i] = $1; t2[i] = $2; w[i] = $3}
END {
    # accumulate each interval weight over every minute it covers
    for (i in t1) {
        for (t = t1[i]; t <= t2[i]; t++) {
            W[t] += w[i];
        }
    }
    # find the maximum summed weight, then print the minute(s) where it occurs
    Wmax = 0.;
    for (t in W) {
        if (W[t] > Wmax) {Wmax = W[t]}
    }
    print Wmax;
    for (t in W) {
        if (W[t] == Wmax) {print t}
    }
}
' i.txt > a1.txt
Algorithm #2:
#!/bin/sh
awk '
{t1[++i] = $1; t2[i] = $2; w[i] = $3}
END {
    # build a list of endpoint events; the "a"/"b" tags keep start/end events distinct at equal times
    for (i in t1) {
        p[t1[i] "a" i] = i "S";
        p[t2[i] "b" i] = i "E";
    }
    n = asorti(p, psorted, "@ind_num_asc");    # sorted index traversal (gawk 4+)
    W = 0.; Wmax = 0.; f = 0;
    for (i = 1; i <= n; i++) {
        P = p[psorted[i]];
        k = int(P);                            # interval number encoded in the event
        if (index(P, "S") > 0) {               # start point: add the weight
            W += w[k];
            if (W > Wmax) {
                f = 1;
                Wmax = W;
                to1 = t1[k]
            }
        }
        else {                                 # end point: close the best interval, drop the weight
            if (f != 0) {
                to2 = t2[k];
                f = 0
            }
            W -= w[k];
        }
    }
    print Wmax, to1 "-" to2
}
' i.txt > a2.txt
Results:
$ ./gen.sh 1000
$ time ./a1.sh
real 0m0.283s
$ time ./a2.sh
real 0m0.019s
$ cat a1.txt
24618
757
$ cat a2.txt
24618 757-757
$ ./gen.sh 10000
$ time ./a1.sh
real 0m3.026s
$ time ./a2.sh
real 0m0.144s
$ cat a1.txt
252452
746
$ cat a2.txt
252452 746-746
$ ./gen.sh 100000
$ time ./a1.sh
real 0m34.127s
$ time ./a2.sh
real 0m1.999s
$ cat a1.txt
2484719
714
$ cat a2.txt
2484719 714-714
The simple one is ~20x slower, which makes sense: its cost grows with the total number of minutes covered by all intervals, while the sweep line only sorts and scans the 2n endpoints.

Finding the range of numbers from one file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the bigfile with an awk script to compute the mean.
So for each small file small you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    if (range_end == -1) {
        range_end = NR;   # the range ran to the end of the file
    }
    print "start =", range_start, "end =", range_end, "mean =", sum / count
}
You can also try the approach below (note: the original used columns 2 and 4, which don't exist in the two-column bigfile shown, so columns 1 and 2 are used here):
for r in *; do
    awk -v r="$r" \
    'NR==1{b=$1;v=$2;next}{if(r >= b && r <= $1){m=(v+$2)/2; print m; exit}; b=$1;v=$2}' bigfile.txt
done
Explanation:
On the first line it saves columns 1 and 2 into temp variables. On every subsequent line it checks whether the filename r falls between the beginning of the range (the previous column 1) and the end of the range (the current column 1).
If so, it works out the mean of the two corresponding column-2 values, prints the result, and exits.
