Bash/Awk: Median over windows with start and stop positions

I have a text file that looks like the one below. The first column is a location, the second a position, and the third a value.
1 10 200
1 11 150
1 12 300
2 13 400
2 14 100
2 15 250
3 16 200
3 17 200
3 18 350
3 19 150
...
I would like to calculate the median of the value field over a certain window; for example, let's say a window size of 4 rows. Below is the expected result for the sample data above:
1 2 10 13 250
2 3 14 17 200
...
For every window (4 rows), the reported values are: the first value of column 1 within the window, the last value of column 1, the first value of column 2, the last value of column 2, and the median of column 3.
I have got it partially working. The script below prints the last value of column 1 in the window, the last value of column 2, and the mean:
win=4
awk -v win="$win" '{sum+=$3} (NR%win)==0 {print $1,$2,sum/win; sum=0}' file.txt
2 13 262.5
3 17 187.5
...
How do I get the initial positions within each window, and the median?

$ awk '{r=(NR-1)%4; a[r]=$3}                      # buffer the window's values
       r==0{f1=$1; s1=$2}                         # remember the window start
       r==3{asort(a); print f1,$1,s1,$2,(a[2]+a[3])/2; delete a}' file
1 2 10 13 250
2 3 14 17 200
Note that the delete is not strictly necessary, since the values are overwritten in each window computation. (Also note that asort is a GNU awk extension.)
You can parameterize the window size; you then need to handle odd and even window lengths:
$ awk -v w=5 '{r=(NR-1)%w; a[r]=$3}
r==0{f1=$1; s1=$2}
r==(w-1){asort(a);
print f1,$1,s1,$2,(w%2?a[int(w/2)+1]:(a[w/2]+a[w/2+1])/2);
delete a}' file
1 2 10 14 200
2 3 15 19 200
Note that this doesn't handle the case where the last window is not full size.
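One way to also cover a short final window is to flush whatever is left over in an END block (a sketch, still gawk because of asort; l1 and l2 are helper variables added here just to remember the last row seen):
$ awk -v w=4 '{r=(NR-1)%w; a[r]=$3; l1=$1; l2=$2}
  r==0{f1=$1; s1=$2}
  r==(w-1){asort(a)
           print f1,l1,s1,l2,(w%2 ? a[int(w/2)+1] : (a[w/2]+a[w/2+1])/2)
           delete a}
  END{n=NR%w
      if(n){asort(a)
            print f1,l1,s1,l2,(n%2 ? a[int(n/2)+1] : (a[n/2]+a[n/2+1])/2)}}' file
1 2 10 13 250
2 3 14 17 200
3 3 18 19 250
The last line is the median over the two leftover rows (9 and 10) of the sample data.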

Related

How to do cumulative and consecutive sums for every column in a tab file (UNIX environment)

I have a tabulated file, something like this:
Q8VYA50 210 69 2 8 3
Q8VYA50 208 69 1 2 8 3
Q9C8G30 316 182 4 4 7
P335430 657 98 1 10 7
What I would like to do is apply a cumulative sum across the fields from the 4th column up to NF, printing in each of those fields the running total up to that field while keeping the original values of the preceding columns, if any. The desired output would be:
Q8VYA50 210 69 2 10 13
Q8VYA50 208 69 1 3 11 14
Q9C8G30 316 182 4 8 15
P335430 657 98 1 11 18
I have tried to do it in different ways, using a sum inside an awk script with a for-loop over the fields where the cumulative sum must be applied, but the result I obtain is wrong.
Is there some way to do it correctly in Unix (Bash)? Thanks in advance!
This is one way I have tried, @Inian:
gawk 'BEGIN {FS=OFS="\t"} {
    for (i=4; i<=NF; i++) {
        sum[i]+=$i; print $1,$2,$3,$i
    }
}' "input_file"
Another way is to do it for every column manually: $4, $5+$4, $6+$5+$4, $7+$6+$5+$4, and so on, but I think that is a "seedy" method.
The following awk may help you here:
awk '{for(i=5;i<=NF;i++){$i+=$(i-1)}} 1' OFS="\t" Input_file
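For readability, here is the same one-liner written out with comments (a sketch; the loop starts at field 5 because field 4 keeps its original value and seeds the running total):
awk 'BEGIN { FS = OFS = "\t" }
{
    # each field from the 5th on becomes the running total of fields 4..i
    for (i = 5; i <= NF; i++)
        $i += $(i - 1)
    print    # same effect as the bare pattern 1
}' Input_file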

Write the number of elements per line of a file and its repetitions with awk

I have a file of integers, all different, in which each line may have a different length, like this:
1 2 3 4 5
16 7 8
9 10 101 102 13 14
15 6 17
24 28 31 30 18
I would like to print the number of elements each line has, together with the number of lines that have that same number of elements; the output for this example should be:
3 2
5 2
6 1
The first column is the number of elements per line; the second is the number of lines that have that many elements. For example, the first line in the file has 5 elements, and so does the 5th one, etc.
Print the count for the number of fields:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file
5 2
6 1
3 2
Pipe to sort for ordered output:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file | sort
3 2
5 2
6 1
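With GNU awk you can also skip the external sort and have gawk itself iterate the array keys in numeric order (a sketch relying on gawk's PROCINFO["sorted_in"], available since gawk 4.0):
gawk '{ a[NF]++ }
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # iterate keys in ascending numeric order
    for (k in a) print k, a[k]
}' file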

Converting a series of matrix files into an index of coordinates in awk

I have a time series of files 0000.vx.dat, 0000.vy.dat, 0000.vz.dat; ...; 0077.vx.dat, 0077.vy.dat, 0077.vz.dat... Each file is a space-separated 2D matrix. I would like to take each triplet of files and combine them all into a coordinate-based data format, i.e.:
[timestep + 1] [i] [j] [vx(i,j)] [vy(i,j)] [vz(i,j)]
Each file number corresponds to a particular time step. Given the amount of data I have in this time series (~ 4 GB), bash wasn't cutting it so it seemed to be time to head over to awk... specifically mawk. It was pretty stupid to try this in bash but here is
my ill-fated attempt:
for x in $(seq 1 78)
do
    tfx=${tf[$x]} # an array of padded zeros
    for y in $(seq 1 1568)
    do
        for z in $(seq 1 1344)
        do
            echo $x $y $z \
                $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vx.dat) \
                $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vy.dat) \
                $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vz.dat) >> $file
        done
    done
done
edit: Thank you, ruakh, for pointing out that I had kept j in shell-variable format, with a $ in front! This is just a snippet of the original script, but I guess it would be considered the guts of it.
Suffice it to say this would have taken about six months because of all the memory overhead in bash associated with O(MxN) algorithms, subshells, pipes, and whatnot. I was looking for something more along the lines of a day at most. Each file is around 18 MB, so it should not be that much of a problem.
I would be happy doing this one timestep at a time in awk, provided that I get one output file per timestep; I could just cat them all together without much issue afterwards, I think. It is important, though, that the timestep number be the first item on the coordinate list. I could achieve this with an awk -v argument (see above) in a bash routine.
What I do not know is how to look up specific elements of matrices in three separate files and put them all together into one output. This is the main hurdle I would like to overcome. I was hoping mawk could provide a nice balance between effort and computational speed. If this seems to be too much for an awk script, I could go to something lower level, and would appreciate anyone answering letting me know whether I should just go to C instead.
Thank you in advance! I really like awk, but am afraid I am a novice.
The three files, 0000.vx.dat, 0000.vy.dat, and 0000.vz.dat would read as follows (except huge and of the correct dimensions):
0000.vx.dat:
1 2 3
4 5 6
7 8 9
0000.vy.dat:
10 11 12
13 14 15
16 17 18
0000.vz.dat:
19 20 21
22 23 24
25 26 27
I would like to be able to input:
awk -v t=1 -f stackoverflow.awk 0000.vx.dat 0000.vy.dat 0000.vz.dat
and get the following output:
1 1 1 1 10 19
1 1 2 2 11 20
1 1 3 3 12 21
1 2 1 4 13 22
1 2 2 5 14 23
1 2 3 6 15 24
1 3 1 7 16 25
1 3 2 8 17 26
1 3 3 9 18 27
edit: Thank you, shellter, for suggesting I put the desired input and output more clearly!
Personally, I use gawk to process most of my text files. However, since you have requested a mawk-compatible solution, here's one way to solve your problem. Run, in your present working directory:
for i in *.vx.dat; do nawk -f script.awk "$i" "${i%%.*}.vy.dat" "${i%%.*}.vz.dat"; done
Contents of script.awk:
FNR==1 {
    # FILENAME is a string such as "0000.vx.dat"; the ++ forces it into
    # a numeric context, yielding the timestep number plus one
    # (e.g. "0077.vx.dat" becomes 78)
    FILENAME++
    c=0
}
{
    for (i=1; i<=NF; i++) {
        c++
        # seed each entry with "timestep row column" on the first file,
        # then append one velocity component per file
        a[c] = (a[c] ? a[c] : FILENAME FS NR FS i) FS $i
    }
}
END {
    for (j=1; j<=c; j++) {
        print a[j] > sprintf("%04d.dat", FILENAME)
    }
}
When you run the above, the result is a single file for each set of three files, containing your coordinates. The output filenames have the form timestep + 1, padded with four 0's for your convenience, plus ".dat"; you can change this to whatever format you like. Here are the results I get from the sample data you've posted. Contents of 0001.dat:
1 1 1 1 10 19
1 1 2 2 11 20
1 1 3 3 12 21
1 2 1 4 13 22
1 2 2 5 14 23
1 2 3 6 15 24
1 3 1 7 16 25
1 3 2 8 17 26
1 3 3 9 18 27
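And since the filenames are zero-padded, the per-timestep files can afterwards be concatenated in timestep order with a plain glob, as the question anticipates (the combined filename here is arbitrary):
cat [0-9][0-9][0-9][0-9].dat > all_timesteps.dat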

How to display a specific value a(col#,row#) on a gnuplot chart?

I realize gnuplot 4.6 does not have a specific data point addressing capability and I would have to use a script to extract a given value and store it as a variable (for example, to extract a value in the 7th column in the 4th row from the last, I simplistically could use 'tail -4 data.out | head -1 | awk '{print $7}'). How could I store/assign that value as a gnuplot variable and then display it on a chart with the set label 1 sprintf("a = %3.4f",a) at x,y command?
Gnuplot understands backticks the same as your shell. So, to get at the particular value in your datafile:
a=`tail -4 data.dat | head -1 | awk '{print $7}'`
set label 1 sprintf("a=%3.4f",a) at x,y
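As an aside, the same thing can be written without backticks using gnuplot's system() function, which returns the command's output as a string (a sketch; real() converts that string to a number):
a = real(system("tail -4 data.dat | head -1 | awk '{print $7}'"))
set label 1 sprintf("a=%3.4f",a) at x,y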
When reading something like "can't be done with gnuplot", it "hurts" and encourages me to nevertheless find a gnuplot-only solution, which consequently is platform-independent. (See the above comments about Linux and Windows (in)compatibility issues.) Admittedly, this sometimes gets cumbersome and less efficient, but sometimes it is not much longer than the solution with external tools.
Basically, you can use stats and every for this (check help stats and help every). However, if you need the mth row from the last, you first need to know how many lines you have in total, which is why you have to run stats twice. Check the following example:
Data: SO11560130.dat
1 21 3 4 5 6 78
2 25 3 4 5 6 72
3 23 3 4 5 6 73
4 29 3 4 5 6 74
5 27 3 4 5 6 77
6 28 3 4 5 6 75
7 22 3 4 5 6 73
8 24 3 4 5 6 78
9 26 3 4 5 6 78
Script: (works with gnuplot 4.6.0, March 2012)
### extract specific value from given row/column
reset
FILE = "SO11560130.dat"
M = 4 # row from last
COL = 7 # column no.
stats FILE u 0 nooutput # get total number of lines in variable STATS_records
n = STATS_records - M # index 0-based
stats FILE u (x0=$1,y0=$2,a=column(COL)) every ::n::n nooutput # get the value and coordinates
set label 1 sprintf("a=%3.4f",a) at x0,y0 offset 0,1
plot FILE u 1:2 w lp pt 7 lc rgb "red" notitle
### end of script
Result: a plot of columns 1 and 2, with the extracted value displayed as a label just above the corresponding point.

Cell-wise summation of tables in a linux shell script

I have a set of tables in the following format:
1000 3 0 15 14
2000 3 0 7 13
3000 2 3 14 12
4000 3 1 11 14
5000 1 1 9 14
6000 3 1 13 11
7000 3 0 10 15
They are in simple text files.
I want to merge these files into a new table in the same format, where each cell (X,Y) is the sum of all cells (X,Y) from the original set of tables. One slightly complicating factor is that the numbers from the first column should not be summed, since these are labels.
I suspect this can be done with AWK, but I'm not particularly versed in this language and can't find a solution on the web. If someone suggests another tool, that's also fine.
I want to do this from a bash shell script.
Give this a try:
#!/usr/bin/awk -f
# Note: asort() is a GNU awk extension, so run this with gawk.
{
    for (i=2; i<=NF; i++)
        a[$1,i] += $i        # accumulate each cell, keyed by label and column
    b[$1] = $1               # remember the labels
    if (NF > maxNF) maxNF = NF
}
END {
    n = asort(b, c)          # sort the labels for output
    for (i=1; i<=n; i++) {
        printf "%s ", b[c[i]]
        for (j=2; j<=maxNF; j++) {
            printf "%d ", a[c[i],j]
        }
        print ""
    }
}
Run it like this:
./sumcell.awk table1 table2 table3
or
./sumcell.awk table*
The output using your example input twice would look like this:
$ ./sumcell.awk table1 table1
1000 6 0 30 28
2000 6 0 14 26
3000 4 6 28 24
4000 6 2 22 28
5000 2 2 18 28
6000 6 2 26 22
7000 6 0 20 30
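Note that asort() is a GNU awk extension. If you are stuck with a POSIX awk, one workaround (a sketch) is to print the labels in arbitrary order and let sort order them numerically:
awk '{ for (i=2; i<=NF; i++) a[$1,i] += $i    # accumulate each cell
       if (NF > maxNF) maxNF = NF
       seen[$1] = 1 }
END  { for (lab in seen) {
           printf "%s", lab
           for (j=2; j<=maxNF; j++) printf " %d", a[lab,j]
           print "" } }' table1 table2 | sort -n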
Sum each line, presuming at least one numeric column on each line.
while read line ; do
    label=($line)
    printf "${label[0]} "
    expr $(
        printf "${label[1]}"
        for c in "${label[@]:2}" ; do
            printf ' + '$c
        done
    )
done < table
EDIT: Of course I didn't see the comment about combining based on the label, so this is incomplete.
perl -anE'$h{$F[0]}[$_]+=$F[$_]for 1..4}{say$_,"@{$h{$_}}"for sort{$a<=>$b}keys%h' file_1 file_2
