Sum of values larger than average per column in multiple matrices - bash

I have some matrix in files given as parameters. I need to find the average of each column and sum only the numbers in column that are bigger or equal to the column average.
For example:
f1:
10 20 30
5 8
9
f2:
1 1 2 2 3
5
6 6
1 1 1 1 1
f3:
1 2 3 4 5
6 7 8 4 10
8
10 9 8 7 6
and the output should be
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10
You run the program like this:
MS.1 f1 f2 f3
So far I got this:
#!/bin/awk -f
BEGIN {
M=0
M1=0
counter=1
fname=ARGV[1]
printf fname":"
}
(fname==FILENAME) {
split($0,A," ")
for(i=1;i<=length(A);i++) {
B[i]=B[i]+A[i]
if(A[i]<=0||A[i]>=0)
C[i]=C[i]+1
}
for(i=1;i<=length(B);i++) {
if((C[i]<0||C[i]>0))
D[i]=B[i]/C[i]
}
for(i=1;i<=length(A);i++) {
if(A[i]>=D[i])
E[i]=E[i]+" "+A[i]
}
}
(fname!=FILENAME) {
for(i=1;i<=length(E);i++) {
printf " "E[i]
}
printf "\n"
for(i=1;i<=length(B);i++) {
B[i]=0
}
for(i=1;i<=length(C);i++) {
C[i]=0
}
fname=FILENAME
printf fname":"
}
END {
for(i=1;i<=length(B);i++) {
printf " "B[i]
}
printf "\n"
}
but it only works for the first file and then it messes up.
My output is
f1: 19 20 30
f2: 30 26 31 1 1 1
f3: 24 16 16 11 16 0
I know I got a problem with all the array things.

here the combination of bash and awk will simplify the script
save this as script.sh
#!/bin/bash
for f in $#; do
awk 'NR==FNR {for(i=1;i<=NF;i++) {a[i]=$i; sum[i]+=$i; c[i]++}; next}
{for(i=1;i<=NF;i++) if(c[i] && $i>=sum[i]/c[i]) csum[i]+=$i}
END {printf "%s", FILENAME;
for(i=1;i<=length(csum);i++) printf "%s", OFS csum[i];
print ""}' "$f"{,}
done;
and run with
$ ./script.sh f1 f2 f3

A solution using gawk, assuming default blanks separator for awk (6 6 has two columns, for example)
cat script.awk
{
for(i=1; i<=NF; ++i){
d[FILENAME][FNR][i] = $i
sum[FILENAME][i] += $i
++rows[FILENAME][i]
}
if(NF>cols[FILENAME]) cols[FILENAME]=NF
++rows_total[FILENAME]
}
END{
for(fname in rows_total){
printf "%s:", fname
for(c=1; c<=cols[fname]; ++c){
avg = sum[fname][c] / rows[fname][c]
sumtmp = 0
for(r=1; r<=rows_total[fname]; ++r){
if(d[fname][r][c] >= avg) sumtmp+=d[fname][r][c]
}
printf " %d", sumtmp
}
printf "\n"
}
}
awk -f script.awk f1 f2 f3
you get,
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10

Related

How do I put this AWK function in a for loop to extract columns?

I have tens of files (such as fA.txt, fB.txt and fc.txt) and want an output as shown in fALL.txt.
fA.txt:
id V W X Y Z
a 1 2 4 8 16
b 3 6 13 17 18
c 5 1 20 4 8
fB.txt:
id F G H J K
a 2 5 9 7 12
b 4 9 12 3 19
c 6 13 2 40 7
fC.txt:
id L M N O P
a 7 2 19 8 16
b 8 6 12 23 47
c 91 11 15 19 80
desired output
fALL.txt:
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
I have seen the following AWK code on this site that works for input files with only 2 columns.
'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3 For my case, I have modified the above as follows and it works for extracting the second columns:
'NR==FNR{a[FNR]=$1 OFS $2; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1; i<=FNR; i++) print a[i]}' file1 file2 file3 I have tried putting the above into a for loop to extract the subsequent columns but have not been successful. Any helpful hints will be greatly appreciated.
The first block of data in the desired output is the second column from each of the input files with a header concatenated from the filename and the column header in the respective input file. The subsequent blocks are the third, fourth, fifth columns from each input file.
This gnu awk should work for you.
cat tab.awk
BEGIN {
OFS = "\t"
for (i=1; i<=ARGC; ++i) {
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn)
fhdr[i] = fn
}
}
!seen[$1]++ {
keys[++k] = $1
}
{
for (i=2; i<=NF; ++i)
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
}
END {
for (i=2; i<=NF; ++i) {
for (j=1; j<=k; j++) {
key = keys[j]
print key, map[key][i]
}
print ""
}
}
Then use it as:
awk -f tab.awk f{A,B,C}.txt
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
Explanation:
BEGIN {
OFS = "\t" # Use output field separator as tab
for (i=1; i<=ARGC; ++i) { # for each filename in input
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn) # remove anything after dot with a _
fhdr[i] = fn # and save it in fhdr associative array
}
}
!seen[$1]++ { # if this id is not found in seen array
keys[++k] = $1 # store in seen and in keys array by index
}
{
for (i=2; i<=NF; ++i) # for each field starting from 2nd column
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
# build 2 dimensional array map where key is $1,i and value is column value
# for 1st record prefix column value with part filename stored in fhdr array
# we keep appending value in this array with OFS delimiter
}
END { # do this in the end
for (i=2; i<=NF; ++i) { # for each column position from 2 onwards
for (j=1; j<=k; j++) { # for each id stored in keys array
key = keys[j]
print key, map[key][i] # print id and value text built above
}
print "" # print a line break
}
}

distribute data in both increment and decrement order

I have a file which has n number of rows, i want it's data to be distributed in 7 files as per below order
** my input file has n number of rows, this is just an example.
Input file
1
2
3
4
5
6
7
8
9
10
11
12
13
14
1
5
16
17
.
.
28
Output file
1 2 3 4 5 6 7
14 13 12 11 10 9 8
15 16 17 18 19 20 21
28 27 26 25 24 23 22
so if i open the first file it should have rows
1
14
15
28
similarly if i open the second file it should have rows
2
13
16
27
similarly output for the other files as well.
Can anybody please help, with below code it is doing what is required but not in required order.
awk '{print > ("te1234"++c".txt");c=(NR%n)?c:0}' n=7 test6.txt
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
EDIT: Since OP has changed sample of Input_file totally different so adding this solution now, again this is written and tested with shown samples only.
With xargs + single awk: (recommended one)
xargs -n7 < Input_file |
awk '
FNR%2!=0{
for(i=1;i<=NF;i++){
print $i >> (i".txt")
close(i".txt")
}
next
}
FNR%2==0{
for(i=NF;i>0;i--){
count++
print $i >> (count".txt")
close(i".txt")
}
count=""
}'
Initial solution:
xargs -n7 < Input_file |
awk '
FNR%2==0{
for(i=NF;i>0;i--){
val=(val?val OFS:"")$i
}
$i=val
val=""
}
1' |
awk '
{
for(i=1;i<=NF;i++){
print $i >> (i".txt")
close(i".txt")
}
}'
Above could be done with single awk too will add xargs + awk(single) solution in few mins too.
Could you please try following, written and tested with shown samples in GNU awk.
awk '{for(i=1;i<=NF;i++){print $i >> (i".txt");close(i".txt")}}' Input_file
The output file counter could descend for each second group of seven:
awk 'FNR%n==1 {asc=!asc}
{
out="te1234" (asc ? ++c : c--) ".txt";
print >> out;
close(out)
}' n=7 test6.txt
$ ls
file tst.awk
$ cat tst.awk
{ rec = (cnt % 2 ? $1 sep rec : rec sep $1); sep=FS }
!(NR%n) {
++cnt
nf = split(rec,flds)
for (i=1; i<=nf; i++) {
out = "te1234" i ".txt"
print flds[i] >> out
close(out)
}
rec=sep=""
}
.
$ awk -v n=7 -f tst.awk file
.
$ ls
file te12342.txt te12344.txt te12346.txt tst.awk
te12341.txt te12343.txt te12345.txt te12347.txt
$ cat te12341.txt
1
14
15
28
$ cat te12342.txt
2
13
16
27
If you can have input that's not an exact multiple of n then move the code that's currently in the !(NR%n) block into a function and call that function there and in an END section.
This might work for you (GNU sed & parallel):
parallel 'echo {1}~14w file{1}; echo {2}~14w file{1}' ::: {1..7} :::+ {14..8} |
sed -n -f - file &&
paste file{1..7}
Create a sed script to write files named filen where n is 1 thru 7 (see above first set of parameters in the parallel command and also in the paste command).
The sed script uses the n~m address where n is the starting address and m is the modulo thereafter.
The distributed files are created first and the paste command then joins them all together to produce a single output file (tab separated by default, use paste -d option to get desired delimiter).
Alternative using Bash & sed:
for ((n=1,m=14;n<=7;n++,m--));do echo "$n~14w file$n";echo "$m~14w file$n";done |
sed -nf - file &&
paste file{1..7}

Using awk to compute distinct sums depending on column value

I have a file with 5 columns:
1 1311 2 171115067 1.1688e-08
1 1313 3 171115067 1.75321e-08
1 1314 4 171115067 2.33761e-08
2 1679 5 135534747 3.68909e-08
2 1680 2 135534747 1.47564e-08
3 688 34 191154276 1.77867e-07
3 689 38 191154276 1.98792e-07
3 690 39 191154276 2.04024e-07
I would like to get the accumulated value $2*$3/$4 per index which is given in field $1:
So, as an example: For the index 1, I should have (1311*2+1313*3+1314*4)/171115067 and for the index 2 in $1 it should read (1679*5+1680*2)/135534747
What I tried is:
awk '{sum+=($2*$3)/$4} END { print "Result = ",sum}'
But that gives me the sum of the multiplication for all together divided by each time which not what I need
EDIT: As per OP's comment added fllowing solution too, which will give overall sum too for all column 1s.
awk '
prev!=$1 && prev{
if(fourth){
printf("%.9f\n",mul/fourth)
sum+=sprintf("%.9f\n",mul/fourth)
}
else{
print 0
}
mul=fourth=prev=""
}
{
mul+=$2*$3
fourth=$4
prev=$1
total_sum[$1]+=($2*$3)
}
END{
if(prev){
if(fourth){
printf("%.9f\n",mul/fourth)
sum+=sprintf("%.9f\n",mul/fourth)
}
else{
print 0
}
}
print "total= ",sum
}' Input_file
Could you please try following.
awk '
prev!=$1 && prev{
if(fourth){
printf("%.9f\n",mul/fourth)
}
else{
print 0
}
mul=fourth=prev=""
}
{
mul+=$2*$3
fourth=$4
prev=$1
}
END{
if(prev){
if(fourth){
printf("%.9f\n",mul/fourth)
}
else{
print 0
}
}
}' Input_file
If your data is sorted you can do:
awk '(NR==1) { num=0; den=$4; tmp=$1 }
($1!=tmp) { print "Result",tmp,":",num/den;
num=0; den=$4; tmp=$1 }
{ num+= $2*$3 }
END { print "Result",tmp,":",num/den }' file
If your data is not sorted you can do:
awk '{ sum[$1]+= $2*$3/$4 }
END { for(i in sum) { print "Result",i,":",sum[i] }' file
and this outputs:
Result 1 : 6.90588e-05
Result 2 : 8.67305e-05
Result 3 : 0.000400117
Using Perl
$ cat sara.txt
1 1311 2 171115067 1.1688e-08
1 1313 3 171115067 1.75321e-08
1 1314 4 171115067 2.33761e-08
2 1679 5 135534747 3.68909e-08
2 1680 2 135534747 1.47564e-08
3 688 34 191154276 1.77867e-07
3 689 38 191154276 1.98792e-07
3 690 39 191154276 2.04024e-07
$ perl -lane ' $kv{join(",",$F[0],$F[3])}+=$F[1]*$F[2]; END { for(sort keys %kv) { #x=split(",");print "$x[0],",$kv{$_}/$x[1]} print eval(join("+",values %kv)) } ' sara.txt
1,6.90587930518123e-05
2,8.67305267482441e-05
3,0.000400116605291111
100056
$

Calculating the sum of every third column from many files

I have many files with three columns in a form of:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
First two columns are the same in each file. I want to calculate a sum of 3 columns from every file to receive something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files awk 'FNR==NR { _a[FNR]=$3;} NR!=FNR { $3 += _a[FNR]; print; }' file*
work perfectly (I found this solution via google). How to change it on many files?
All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above used GNU awk for ARGIND. With other awks just add FNR==1{ARGIND++} at the start.
Since the first two columns are same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a is used to have the cumulative sum of the 3rd column of all files.
Array b is used to the 1st and 2nd column values
In the end, we print the contents of array a and b
file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
Better Readable
variable start decides from which column start summing, suppose if you set 2 it will start summing from column2, column3 ...and so on, from all files, since you have equal no of fields and rows, it works well
awk -v start=3 '
NF{
for(i=1; i<=NF; i++)
a[FNR, i] = i>=start ? a[FNR, i]+$i : $i
}
END{
for(j=1; j<=FNR; j++)
{
s = "";
for(i=1; i<=NF; i++)
{
s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "")
}
print s
}
}
' f1 f2

Extracting columns from data file based on header using a header file

I have a big data file (not csv) with many columns with a header row. The column headers are strings containing letters and numbers. I would like to write a script that extracts the data columns based on their header, if the header is present in a second file. I have researched this question, and wrote a script adapted from an answer found at AWK extract columns from file based on header selected from 2nd file. I understand a good part of what it does, but I'll admit that I don't understand it completely. I am aware that it was designed for a csv file... I tried using it with my files, but I cannot get it to work. Here is the code (contained in a bash script):
(note: $motif_list and $affinity_matrix are the paths to both files and have been previously defined in the bash script)
43 awk -v motif_list="$motif_list" -v affinity_matrix="$affinity_matrix" '
44 BEGIN {
45 j=1
46 while ((getline < motif_list) > 0)
47 {
48 col_names[j++] = $1
49 }
50 n=j-1;
51 close(motif_list)
52 for (i=1; i<=n; i++) s[col_names[i]] = i
53 }
54
55 NR==1 {
56 for (f=1; f<=NF; f++)
57 if ($f in s) c[s[$f]]=f
58 next
59 }
60
61 {
62 sep=" "
63 for (f=1; f<=n; f++)
64 {
65 printf("%c%s",sep,$c[f])
66 sep=FS
67 }
68 print " "
69 }' "$affinity_matrix" > $affinity_columns
(I also changed the separator from "" to " ", but that might not be the right way to do it)
As an example, here are sample input and output tables:
Input:
A B C D E F
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
Output:
A C
1 3
1 3
1 3
1 3
1 3
Any input would be much appreciated!
Thanks
The general approach (untested since you didn't provide any sample input/output) is:
awk '
NR==FNR { names[$0]; next }
FNR==1 {
for (i=1;i<=NF;i++) {
if ($i in names) {
nrs[i]
}
}
}
{
c = 0
for (i=1;i<=NF;i++) {
if (i in nrs) {
printf "%s%s", (c++ ? OFS : ""), $i
}
}
if (c) {
print ""
}
}
' motif_list affinity_matrix

Resources