Calculating the sum of every third column from many files - bash

I have many files with three columns in a form of:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
The first two columns are the same in each file. I want to calculate the sum of the 3rd column across all files, to get something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files,
awk 'FNR==NR { _a[FNR]=$3;} NR!=FNR { $3 += _a[FNR]; print; }' file*
works perfectly (I found this solution via Google). How can I change it to handle many files?

All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above uses GNU awk for ARGIND. With other awks, just add FNR==1{ARGIND++} at the start of the script.
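Following that advice, the portable version would look something like this (a sketch, untested):
awk 'FNR==1{ARGIND++} {sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*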

Since the first two columns are the same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a holds the cumulative sum of the 3rd column across all files.
Array b stores the 1st and 2nd column values.
At the end, we print the contents of arrays b and a.
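For reference, the same one-liner spread over several lines with comments (note that length() on an array is a GNU awk extension):
awk '
NR==FNR { b[FNR] = $1 FS $2 }        # first file: remember columns 1 and 2 for each row
        { a[FNR] += $3 }             # every file (including the first): accumulate column 3
END {
    for (i = 1; i <= length(a); i++) # length(array) requires GNU awk
        print b[i] FS a[i]
}' file*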

file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
More readable version
The variable start decides from which column to start summing; if you set it to 2, the script sums column 2, column 3, and so on, across all files. Since every file has the same number of fields and rows, this works well.
awk -v start=3 '
NF {
    for (i=1; i<=NF; i++)
        a[FNR, i] = i>=start ? a[FNR, i]+$i : $i
}
END {
    for (j=1; j<=FNR; j++) {
        s = ""
        for (i=1; i<=NF; i++) {
            s = (s ? s OFS : "") ((j,i) in a ? a[j,i] : "")
        }
        print s
    }
}
' f1 f2
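For example, with start=2 the 2nd column is summed as well; on f1 and f2 above, this should give (worked out by hand):
$ awk -v start=2 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 6 10
3 12 2
4 2 3
5 4 5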

Related

AWK assign upper value for rank assignment during tie

I'm working on rank assignment to a list of values that is sorted in a file.
A miniature example is
Input:
1
2
2
2
3
4
Instead of normal ranking when there is a tie, I need to assign the upper value. So the required output is
1 1
2 4 #Note that it is not 2, since we have three 2's the upper bound is 4
2 4
2 4
3 5
4 6
I tried something like below, but it is not consistent.
$ awk ' BEGIN{t=0} NR==FNR { a[$1]++; next } { print $1,a[$1]+t; t=a[$1] } ' rank_in.txt rank_in.txt
1 1
2 4
2 6
2 6
3 4
4 2
This answer does normal ranking, so this question is not a duplicate.
Instead of doing a double pass or keeping everything in memory, we just use uniq and reconstruct the ranks:
uniq -c file | awk '{n=n+$1;for(i=1;i<=$1;++i) print $2,n}' -
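To see why this works, here is roughly what uniq -c emits for the sample (exact spacing may vary):
$ uniq -c file
   1 1
   3 2
   1 3
   1 4
The awk part keeps a running total n of the counts; for each group, n is exactly the upper rank, and the group's value is printed count times.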
Two passes with just awk:
$ awk 'NR==FNR{rank[$1]=NR; next} {print $1, rank[$1]}' file file
1 1
2 4
2 4
2 4
3 5
4 6
or one pass with a pipe:
$ nl file | sort -k2,2 -k1,1nr | awk '$2!=prev{rank=$1; prev=$2} {print $2, rank}'
1 1
2 4
2 4
2 4
3 5
4 6
If you don't have nl on your system you could use cat -n or awk '{print NR, $0}' to generate the line numbers.
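For example, the cat -n variant of the same pipeline would be:
$ cat -n file | sort -k2,2 -k1,1nr | awk '$2!=prev{rank=$1; prev=$2} {print $2, rank}'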
Try this awk:
awk 'FNR==NR {++fq[$1]; next} p != $1{s+=fq[$1]} {print p=$1, s}' file file
1 1
2 4
2 4
2 4
3 5
4 6
Assumptions:
input data is already sorted
Sample data:
$ cat rank.dat
1
2
2
2
3
4
One awk idea requiring a single pass through the file:
awk '
function print_rank() {
for ( i=1 ; i<=cnt ; i++ )
print id,rank
}
$1 != id { print_rank() # if we have a new id, print last id
cnt=0 # reset counter
}
{ id=$1 # keep track of current id
rank++ # increment rank by 1 for each new row processed
cnt++ # keep track of number of times we see this id
}
END { print_rank() } # flush last id to stdout
' rank.dat
This generates:
1 1
2 4
2 4
2 4
3 5
4 6
Another awk
$ awk ' NR==FNR { a[$1]++; next } { print $1, FNR + --a[$1] } ' rank_in.txt rank_in.txt
1 1
2 4
2 4
2 4
3 5
4 6
$
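The trick: on the second pass, FNR plus the number of remaining occurrences of the same value is the position of that value's last occurrence, i.e. the upper rank. A commented sketch of the same command:
awk '
NR==FNR { a[$1]++; next }              # first pass: count occurrences of each value
        { print $1, FNR + --a[$1] }    # second pass: current row + remaining duplicates = upper rank
' rank_in.txt rank_in.txt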

Generate table from "awk" command of different files in a bash program?

I need to make a table of numbers, where the numbers are obtained from different files. My code is:
#!/bin/sh
for K in 1.7e-2; do
dir0=Kn_${K};
for P in 1.4365 2.904; do
dir1=P${P};
for r in 0.30 0.35; do
dir2=${r};
awk '/result is =/{print $NF}' ./First/${dir0}/${dir1}/R\=${dir2}/Results.dat
done;
done;
done;
exit;
I obtain:
1
2
3
4
but I need
1 3
2 4
I was reading some posts on this topic, but they are about existing files, not about data that is generated on the fly.
Thanks for your help and support.
I obtained the data as:
1
2
3
4
5
6
7
8
With the pipe pr -4ts$'\t' (thanks @karakfa)
I obtained:
1 3 5 7
2 4 6 8
then, with a script to transpose:
#!/bin/sh
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for (j =1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}' $1
I obtained
1 2
3 4
5 6
7 8
I have a problem with how the numbers are interleaved.
I need:
1 5
2 6
3 7
4 8
But I don't know what the problem is with the pr command and its options. Thanks for your help.
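One hedged suggestion: pr fills its columns top to bottom by default, so asking for two columns instead of four should pair the first half of the list with the second half, which is the layout you want. With the eight sample numbers, this sketch should give (tab-separated):
$ seq 8 | pr -2ts$'\t'
1 5
2 6
3 7
4 8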

Average of multiple files in shell

I want to calculate the average of 15 files: ifile1.txt, ifile2.txt, ....., ifile15.txt. The number of columns and rows is the same in each file. Part of the data looks like:
ifile1.txt ifile2.txt ifile3.txt
3 5 2 2 . 1 2 1 3 . 4 3 4 1 .
1 4 2 1 . 1 3 0 2 . 5 3 1 5 .
4 6 5 2 . 2 5 5 1 . 3 4 3 1 .
5 5 7 1 . 0 0 1 1 . 4 3 4 0 .
. . . . . . . . . . . . . . .
I would like to produce a new file showing the average of these 15 files.
ofile.txt
2.66 3.33 2.33 2 . (i.e. average of 3 1 4, average of 5 2 3 and so on)
2.33 3.33 1 2.66 .
3 5 4.33 1.33 .
3 2.33 4 0.66 .
. . . . .
I was trying the following, but I am getting an error:
awk'{for (i=1; i<=NF; i++)} rows=FNR;cols=NF} END
{for (i=1; i<=rows; i++){for (j=1; j<=cols; j++)
s+=$i;print $0,s/NF;s=0}}' ifile* > ofile.txt
As written:
awk'{for (i=1; i<=NF; i++)} rows=FNR;cols=NF} END
…
you get 'command not found' as the error because you must leave a space between awk and the script inside the quotes. When you fix that, you start getting into problems because there are two } and only one { on the first line of the script.
When you get around to tackling the problem, you're going to need a 2D array, indexed by line number and column number, summing the values from the files. You'll also need to know the number of files processed, and the number of columns. You can then arrange to iterate over the 2D array in the END block.
awk 'FNR == 1 { nfiles++; ncols = NF }
{
    for (i = 1; i <= NF; i++) sum[FNR,i] += $i
    if (FNR > maxnr) maxnr = FNR
}
END {
    for (line = 1; line <= maxnr; line++)
    {
        for (col = 1; col <= ncols; col++)
            printf " %f", sum[line,col]/nfiles;
        printf "\n"
    }
}' ifile*.txt
Given the three data files from the question:
ifile1.txt
3 5 2 2
1 4 2 1
4 6 5 2
5 5 7 1
ifile2.txt
1 2 1 3
1 3 0 2
2 5 5 1
0 0 1 1
ifile3.txt
4 3 4 1
5 3 1 5
3 4 3 1
4 3 4 0
The script I showed produces:
2.666667 3.333333 2.333333 2.000000
2.333333 3.333333 1.000000 2.666667
3.000000 5.000000 4.333333 1.333333
3.000000 2.666667 4.000000 0.666667
If you want to control the number of decimal places to 2, then use %.2f in place of %f.
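That is, the printf inside the inner loop would become:
printf " %.2f", sum[line,col]/nfiles;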
$ { head -n1 ifile1.txt; paste ifile*.txt;} | awk 'NR==1{d=NF; next;} {for (i=1;i<=d;i++) {s=0; for (j=i;j<=NF;j+=d) s+=$j; printf "%.2f%s",s/(NF/d),j==NF+d?"\n":"\t";}}'
2.67 3.33 2.33 2.00
2.33 3.33 1.00 2.67
3.00 5.00 4.33 1.33
3.00 2.67 4.00 0.67
This script computes each row and prints the results before moving on to the next row. Because of this, the script does not need to hold all the data in memory at once. This is important if the data files are large.
How it works
{ head -n1 ifile1.txt; paste ifile*.txt;}
The head command prints just the first line of ifile1.txt. Then, paste prints the first rows of all files merged together, then the second rows merged, and so on:
$ paste ifile*.txt
3 5 2 2 1 2 1 3 4 3 4 1
1 4 2 1 1 3 0 2 5 3 1 5
4 6 5 2 2 5 5 1 3 4 3 1
5 5 7 1 0 0 1 1 4 3 4 0
|
The pipe symbol causes the output of the above commands to be sent as input to awk. Addressing each of the awk commands in turn:
NR==1{d=NF; next;}
For the first row, we save the number of columns in variable d. Then, we skip the rest of the commands and start over on the next line of input.
for (i=1;i<=d;i++) {s=0; for (j=i;j<=NF;j+=d) s+=$j; printf "%.2f%s",s/(NF/d),j==NF+d?"\n":"\t";}
This adds up the numbers from the respective files and prints the average.
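For instance, for the first pasted row (3 5 2 2 1 2 1 3 4 3 4 1) with d=4 and NF=12, the first printed value is:
# i=1: j runs over 1, 5, 9, so s = $1 + $5 + $9 = 3 + 1 + 4 = 8
# average = s/(NF/d) = 8/(12/4) = 8/3, printed as 2.67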
As a multiline script:
{
head -n1 ifile1.txt
paste ifile*.txt
} |
awk '
NR==1 {d=NF; next;}
{
    for (i=1;i<=d;i++)
    {
        s=0; for (j=i;j<=NF;j+=d)
            s+=$j;
        printf "%.2f%s",s/(NF/d),j==NF+d?"\n":"\t";
    }
}'
You need to save the sums of the fields in an array while you're reading the original files. You can't access $0 and i in the END block, since there's no input line at that point.
awk '{rows=FNR; cols=NF; for (i = 1; i <= NF; i++) { total[FNR, i] += $i }}
FILENAME != lastfn { count++; lastfn = FILENAME }
END { for (i = 1; i <= rows; i++) {
for (j = 1; j <= cols; j++) {
printf("%s ", total[i, j]/count)
}
printf("\n")
}
}' ifile* > ofile.txt

Grouping the rows of a text file based on 2 columns

I have a text file like this:
1 abc 2
1 rgt 2
1 yhj 2
3 gfk 4
5 kji 6
3 plo 4
3 vbn 4
5 olk 6
I want to group the rows on the basis of the first and third columns, like this:
1 abc,rgt,yhj 2
3 gfk,plo,ybn 4
5 kji,olk 6
so that I can see what the values of col2 are for a particular pair of col1 and col3.
How can I do this using shell script?
This should do it:
awk -F " " '{ a[$1" "$3]=a[$1" "$3]$2","; }END{ for (i in a)print i, a[i]; }' file.txt | sed 's/,$//g' | awk -F " " '{ tmp=$3;$3=$2;$2=tmp;print }' |sort
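Broken down stage by stage, the same pipeline reads roughly like this (a sketch with comments, functionally identical):
awk -F " " '{ a[$1" "$3] = a[$1" "$3] $2 "," }
            END { for (i in a) print i, a[i] }' file.txt |  # emit: col1 col3 v1,v2,...,
  sed 's/,$//g' |                                           # drop the trailing comma
  awk -F " " '{ tmp = $3; $3 = $2; $2 = tmp; print }' |     # swap fields 2 and 3 so the list sits in the middle
  sort                                                      # order the groups by the first column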
Just using awk:
#!/usr/bin/env awk -f
{
    # key on columns 1 and 3, joined by an unlikely separator (ASCII FS, \x1C)
    k = $1 "\x1C" $3
    if (k in a2) {
        # seen this key before: append this row's column 2 to the list
        a2[k] = a2[k] "," $2
    } else {
        # first time we see this key: remember all three columns
        a1[k] = $1
        a2[k] = $2
        a3[k] = $3
        # and record insertion order so output preserves order of first appearance
        b[++i] = k
    }
}
END {
    for (j = 1; j <= i; ++j) {
        k = b[j]
        print a1[k], a2[k], a3[k]
    }
}
One line:
awk '{k=$1"\x1C"$3;if(k in a2){a2[k]=a2[k]","$2}else{a1[k]=$1;a2[k]=$2;a3[k]=$3;b[++i]=k}}END{for(j=1;j<=i;++j){k=b[j];print a1[k],a2[k],a3[k]}}' file
Output:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6

Another split file in bash - based on difference between rows of column x

Hello stackoverflow users!
Generally, I would like to tune up a script I am using, to make it less sensitive to missing data.
My example data looks like this (a tab-delimited CSV file with headers):
ColA ColB ColC
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
I use an awk script found elsewhere, as follows:
awk 'BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$2 }
$2 == delim {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f }'
This gives me the output I want: it omits the 1st line, looks at the 2nd column and sets the delimiter, which in this example is '0':
file_no00.txt
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
file_no01.txt
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
file_no02.txt
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
To make the script more robust (imagine that the rows with 0's are deleted), I would need to split the file according to the difference between rows 'n+1' and 'n': if (value_row_n+1) - (value_row_n) < 0, then split the file. Of course I would also need to maintain the file naming. The preferred way is bash with awk. Any advice? Thanks in advance!
Cheers!
Here is an awk command that you can use:
cat file
ColA ColB ColC
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
awk 'NR == 1 {
    next
}
!p || $2 < p {
    f = sprintf("file_no%02d.txt",fn++);
    print "Creating " f
}
{
    p = $2;
    print $0 > f
}' file
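Run against the sample file above (where the rows with 0's have already been removed), the command should print:
Creating file_no00.txt
Creating file_no01.txt
Creating file_no02.txt
and write the three corresponding files.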
I suggest small modifications to your current script:
awk 'BEGIN { fn=0; f=sprintf("file_no%02d.txt",fn++); print "Creating " f }
NR==1 { next }
NR==2 { delim=$2 }
$2 - delim < 0 {
    f=sprintf("file_no%02d.txt",fn++);
    print "Creating " f
}
{ print $0 > f; delim = $2 }' infile
First, create the first file name just before the processing starts.
Second, in the last action, save the value of the current line so it can be compared with the value of the next line.
Third, instead of comparing with zero, subtract the previous value from the current one and check whether the result is less than zero.
It yields:
==> file_no00.txt <==
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
==> file_no01.txt <==
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
==> file_no02.txt <==
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
