Calculating sum of gradients with awk - bash

I have a file that contains 4 columns such as:
A B C D
1 2 3 4
10 20 30 40
100 200 300 400
.
.
.
I can calculate the gradient of columns B to D versus A with commands like the following (shown here for column B):
awk 'NR>1{print $0,($2-b)/($1-a)}{a=$1;b=$2}' file
How can I print the sum of the gradients as the 5th column in the file? The results should be:
A B C D sum
1 2 3 4 1+2+3+4=10
10 20 30 40 (20-2)/(10-1)+(30-3)/(10-1)+(40-4)/(10-1)=9
100 200 300 400 (200-20)/(100-10)+(300-30)/(100-10)+(400-40)/(100-10)=9
.
.
.

awk 'NR == 1 { print $0, "sum"; next } { if (NR == 2) { sum = $1 + $2 + $3 + $4 } else { t = $1 - a; sum = ($2 - b) / t + ($3 - c) / t + ($4 - d) / t } print $0, sum; a = $1; b = $2; c = $3; d = $4 }' file
Output:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
With ... | column -t:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
Update:
#!/usr/bin/awk -f
NR == 1 {
    print $0, "sum"
    next
}
{
    sum = 0
    if (NR == 2) {
        for (i = 1; i <= NF; ++i)
            sum += $i
    } else {
        t = $1 - a[1]
        for (i = 2; i <= NF; ++i)
            sum += ($i - a[i]) / t
    }
    print $0, sum
    for (i = 1; i <= NF; ++i)
        a[i] = $i
}
Usage:
awk -f script.awk file

If you apply the same logic to the first line of numbers as you do to the rest, taking the initial value of each column as 0, you get 9 as the result of the sum (as it was in your question originally). This approach uses a loop to accumulate the sum of the gradient from the second field up to the last one. It uses the fact that on the first time round, the uninitialised values in the array a evaluate to 0:
awk 'NR==1 { print $0, "sum"; next }
{
    s = 0
    for (i=2; i<=NF; ++i) s += ($i-a[i])/($1-a[1])  # accumulate sum
    for (i=1; i<=NF; ++i) a[i] = $i                 # fill array to be used for next iteration
    print $0, s
}' file
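As a quick aside, the reason the first pass through that loop is safe is that an uninitialised awk variable compares equal to both the empty string and zero; a minimal check (not part of the original answer):
awk 'BEGIN { if (x == "" && x == 0) print "x is both \"\" and 0" }'   # prints: x is both "" and 0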
You can pack it all onto one line if you want but remember to separate the statements with semicolons. It's also slightly shorter to only use a single for loop with an if:
awk 'NR==1{print $0,"sum";next}{s=0;t=$1-a[1];for(i=1;i<=NF;++i){if(i>1)s+=($i-a[i])/t;a[i]=$i};print $0,s}' file
Output:
A B C D sum
1 2 3 4 9
10 20 30 40 9
100 200 300 400 9

Related

How do I put this AWK function in a for loop to extract columns?

I have tens of files (such as fA.txt, fB.txt and fc.txt) and want an output as shown in fALL.txt.
fA.txt:
id V W X Y Z
a 1 2 4 8 16
b 3 6 13 17 18
c 5 1 20 4 8
fB.txt:
id F G H J K
a 2 5 9 7 12
b 4 9 12 3 19
c 6 13 2 40 7
fC.txt:
id L M N O P
a 7 2 19 8 16
b 8 6 12 23 47
c 91 11 15 19 80
desired output
fALL.txt:
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
I have seen the following AWK code on this site that works for input files with only 2 columns.
awk 'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
For my case, I have modified the above as follows and it works for extracting the second columns:
awk 'NR==FNR{a[FNR]=$1 OFS $2; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1; i<=FNR; i++) print a[i]}' file1 file2 file3
I have tried putting the above into a for loop to extract the subsequent columns but have not been successful. Any helpful hints will be greatly appreciated.
The first block of data in the desired output is the second column from each of the input files with a header concatenated from the filename and the column header in the respective input file. The subsequent blocks are the third, fourth, fifth columns from each input file.
This GNU awk script should work for you.
cat tab.awk
BEGIN {
OFS = "\t"
for (i=1; i<=ARGC; ++i) {
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn)
fhdr[i] = fn
}
}
!seen[$1]++ {
keys[++k] = $1
}
{
for (i=2; i<=NF; ++i)
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
}
END {
for (i=2; i<=NF; ++i) {
for (j=1; j<=k; j++) {
key = keys[j]
print key, map[key][i]
}
print ""
}
}
Then use it as:
awk -f tab.awk f{A,B,C}.txt
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
Explanation:
BEGIN {
OFS = "\t" # Use output field separator as tab
for (i=1; i<=ARGC; ++i) { # for each filename in input
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn) # replace the dot and extension with an underscore (fA.txt -> fA_)
fhdr[i] = fn # and save it in fhdr associative array
}
}
!seen[$1]++ { # if this id is not found in seen array
keys[++k] = $1 # record it in the keys array, preserving first-seen order
}
{
for (i=2; i<=NF; ++i) # for each field starting from 2nd column
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
# build a two-dimensional array map keyed by id and column position i, holding the column value
# for the 1st record of each file, prefix the column value with the filename-derived header from fhdr
# we keep appending values to this array with OFS as the delimiter
}
END { # do this in the end
for (i=2; i<=NF; ++i) { # for each column position from 2 onwards
for (j=1; j<=k; j++) { # for each id stored in keys array
key = keys[j]
print key, map[key][i] # print id and value text built above
}
print "" # print a line break
}
}
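Note that map[$1][i] (arrays of arrays) and ARGIND are GNU awk features. As a rough sketch only (assuming a POSIX awk and plain file arguments with no var=value assignments), the same idea can be written with SUBSEP keys and a manually tracked file index:
BEGIN {
    OFS = "\t"
    for (i = 1; i < ARGC; ++i) {        # ARGV[1]..ARGV[ARGC-1] are the input files
        fn = ARGV[i]
        sub(/\.[^.]+$/, "_", fn)
        fhdr[i] = fn
    }
}
FNR == 1 { ++argind }                    # emulate GNU awk's ARGIND
!seen[$1]++ { keys[++k] = $1 }
{
    for (i = 2; i <= NF; ++i)
        map[$1, i] = map[$1, i] (map[$1, i] == "" ? "" : OFS) (FNR == 1 ? fhdr[argind] : "") $i
}
END {
    for (i = 2; i <= NF; ++i) {
        for (j = 1; j <= k; j++) {
            key = keys[j]
            print key, map[key, i]
        }
        print ""
    }
}
Invoke it the same way, e.g. awk -f tab_posix.awk f{A,B,C}.txt (tab_posix.awk being whatever name you save it under).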

Passing for loop using non-integers to awk

I am trying to write code which will achieve:
Where $7 is less than or equal to $i (0 to 1 in increments of 0.05), print the line and pass it to wc -l. The way I tried to do this was:
for i in $(seq 0 0.05 1); do awk '{if ($7 <= $i) print $0}' file.txt | wc -l ; done
This just ends up returning the line count of the full file (~40 million lines) for each value of $i. When using $7 <= 0.00, for example, it should return ~67K.
I feel like there may be a way to do this within awk, but I have not seen any suggestions which allow for non-integers.
Thanks in advance.
Inside single quotes the shell does not expand $i, and awk treats $i as a field reference (effectively $0, since i is unset), so the comparison never sees your loop value. Pass $i to awk as a variable with -v, like so:
for i in $(seq 0 0.05 1); do awk -v i=$i '{if ($7 <= i) print $0}' file.txt | wc -l ; done
Some made up data:
$ cat file.txt
1 2 3 4 5 6 7 a b c d e f
1 2 3 4 5 6 0.6 a b c
1 2 3 4 5 6 0.57 a b c d e f g h i j
1 2 3 4 5 6 1 a b c d e f g
1 2 3 4 5 6 0.21 a b
1 2 3 4 5 6 0.02 x y z
1 2 3 4 5 6 0.00 x y z l j k
One possible 100% awk solution:
awk '
BEGIN { line_count=0 }
{
    printf "================= %s\n", $0
    for (i=0; i<=20; i++) {
        if ($7 <= i/20) {
            printf "matching seq : %1.2f\n", i/20
            line_count++
            seq_count[i]++
            next
        }
    }
}
END {
    printf "=================\n\n"
    for (i=0; i<=20; i++) {
        if (seq_count[i] > 0)
            printf "seq = %1.2f : %8s (count)\n", i/20, seq_count[i]
    }
    printf "\nseq = all : %8s (count)\n", line_count
}
' file.txt
# the output:
================= 1 2 3 4 5 6 7 a b c d e f
================= 1 2 3 4 5 6 0.6 a b c
matching seq : 0.60
================= 1 2 3 4 5 6 0.57 a b c d e f g h i j
matching seq : 0.60
================= 1 2 3 4 5 6 1 a b c d e f g
matching seq : 1.00
================= 1 2 3 4 5 6 0.21 a b
matching seq : 0.25
================= 1 2 3 4 5 6 0.02 x y z
matching seq : 0.05
================= 1 2 3 4 5 6 0.00 x y z l j k
matching seq : 0.00
=================
seq = 0.00 : 1 (count)
seq = 0.05 : 1 (count)
seq = 0.25 : 1 (count)
seq = 0.60 : 2 (count)
seq = 1.00 : 1 (count)
seq = all : 6 (count)
BEGIN { line_count=0 } : initialize a total line counter
print statement is merely for debug purposes; will print out every line from file.txt as it's processed
for (i=0; i<=20; i++) : depending on the implementation, some versions of awk may have rounding/accuracy problems when a sequence is built from non-integer increments (eg, stepping by 0.05), so we use whole integers for our sequence and divide by 20 (for this particular case) to get our 0.05 increments during the follow-on testing (see the short sketch after this list)
$7 <= i/20 : if field #7 is less than or equal to (i/20) ...
printf "matching seq ... : print the sequence value we just matched on (i/20)
line_count++ : add '1' to our total line counter
seq_count[i]++ : add '1' to our sequence counter array
next : break out of our sequence loop (since we found our matching sequence value (i/20)) and process the next line in the file
END ... : print out our line counts
for (i=0; ...) / if / printf : loop through our array of sequences, printing the line count for each sequence (i/20)
printf "\nseq = all... : print out our total line count
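As a quick illustration of that rounding concern (a sketch, not part of the solution above), the following prints i/20 alongside a value built by repeatedly adding 0.05, so you can see on your own awk whether the accumulated value stays exactly in step with the intended increments:
awk 'BEGIN {
    x = 0
    for (i = 0; i <= 20; i++) {
        printf "i=%2d  i/20=%.17g  accumulated=%.17g\n", i, i/20, x
        x += 0.05
    }
}'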
NOTE: Some of the awk code can be further reduced but I'll leave this as is since it's a little easier to understand if you're new to awk.
One (obvious?) benefit of a 100% awk solution is that our sequence/looping construct is internal to awk thus allowing us to limit ourselves to one loop through the input file (file.txt); when the sequence/looping construct is outside of awk we find ourselves having to process the input file once for each pass through the sequence/loop (eg, for this exercise we would have to process the input file 21 times !!!).
Using a bit of guesswork as to what you actually want to accomplish, I came up with this:
awk '{ for (i=20; 20*$7<=i && i>0; i--) bucket[i]++ }
END { for (i=1; i<=20; i++) print bucket[i] " lines where $7 <= " i/20 }'
With the mock data from the previous answer I get this output:
2 lines where $7 <= 0.05
2 lines where $7 <= 0.1
2 lines where $7 <= 0.15
2 lines where $7 <= 0.2
3 lines where $7 <= 0.25
3 lines where $7 <= 0.3
3 lines where $7 <= 0.35
3 lines where $7 <= 0.4
3 lines where $7 <= 0.45
3 lines where $7 <= 0.5
3 lines where $7 <= 0.55
5 lines where $7 <= 0.6
5 lines where $7 <= 0.65
5 lines where $7 <= 0.7
5 lines where $7 <= 0.75
5 lines where $7 <= 0.8
5 lines where $7 <= 0.85
5 lines where $7 <= 0.9
5 lines where $7 <= 0.95
6 lines where $7 <= 1

Calculating the sum of every third column from many files

I have many files with three columns, of the form:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
The first two columns are the same in each file. I want to calculate the sum of the 3rd column from every file, to get something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files,
awk 'FNR==NR { _a[FNR]=$3;} NR!=FNR { $3 += _a[FNR]; print; }' file*
works perfectly (I found this solution via Google). How do I change it to work with many files?
All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above uses GNU awk for ARGIND. With other awks, just add FNR==1{ARGIND++} at the start.
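For reference, a sketch of that portable variant (assuming a non-GNU awk, where ARGIND is just an ordinary variable, and no var=value assignments among the arguments):
awk 'FNR==1{ARGIND++} {sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*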
Since the first two columns are the same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a holds the cumulative sum of the 3rd column across all files.
Array b stores the 1st and 2nd column values.
At the end, we print the contents of arrays b and a.
file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
More readable version
The variable start decides which column summing starts from: set it to 2 and it will sum column 2, column 3, and so on, from all files. Since you have an equal number of fields and rows in every file, this works well (see the usage note after the script).
awk -v start=3 '
NF {
    for (i=1; i<=NF; i++)
        a[FNR, i] = i>=start ? a[FNR, i]+$i : $i
}
END {
    for (j=1; j<=FNR; j++) {
        s = ""
        for (i=1; i<=NF; i++) {
            s = (s ? s OFS : "") ((j,i) in a ? a[j,i] : "")
        }
        print s
    }
}
' f1 f2
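For instance (a quick check, not from the original answer), running the same command with -v start=2 in place of -v start=3 on f1 and f2 above should sum columns 2 and 3 and print:
1 0 3
2 6 10
3 12 2
4 2 3
5 4 5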

Sort values and output the indices of their sorted columns

I've got a file that looks like:
20 30 40
80 70 60
50 30 40
Each column represents a procedure. I want to know how the procedures did for each row. My ideal output would be
3 2 1
1 2 3
1 3 2
i.e. in row 1, the third column had the highest value, followed by the second, then the first (smallest). The order can be reversed; it doesn't matter.
How would I do this?
I'd do it with some other Unix tools (read, cat, sort, cut, tr, sed, and bash of course):
while read line
do
cat -n <(echo "$line" | sed 's/ /\n/g') | sort -r -k +2 | cut -f1 | tr '\n' ' '
echo
done < input.txt
The output looks like this:
3 2 1
1 2 3
1 3 2
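Note that sort compares the values as text here, which happens to work for this sample; for general numeric data (e.g. mixing 100 and 20) a numeric sort key is safer. A small variation, not from the original answer:
while read -r line
do
    cat -n <(echo "$line" | sed 's/ /\n/g') | sort -rn -k2 | cut -f1 | tr '\n' ' '
    echo
done < input.txt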
Another solution using Python:
$ python
Python 2.7.6 (default, Jan 26 2014, 17:25:18)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> with open('file.txt') as f:
... lis=[x.split() for x in f]
...
>>> for each in lis:
... each = [i[0] + 1 for i in sorted(enumerate(each), key=lambda x:x[1], reverse=True)]
... print ' '.join([str(item) for item in each])
...
3 2 1
1 2 3
1 3 2
Using Gnu Awk version 4:
$ awk 'BEGIN{ PROCINFO["sorted_in"]="@val_num_desc" }
{
    split($0,a," ")
    for (i in a) printf "%s%s", i, OFS
    print ""
}' file
3 2 1
1 2 3
1 3 2
If you have GNU awk then you can do something like:
awk '{
    y = a = x = j = i = 0;
    delete tmp;
    delete num;
    delete ind;
    for (i = 1; i <= NF; i++) {
        num[$i, i] = i
    }
    x = asorti(num)
    for (y = 1; y <= x; y++) {
        split(num[y], tmp, SUBSEP)
        ind[++j] = tmp[2]
    }
    for (a = x; a >= 1; a--) {
        printf "%s%s", ind[a], (a==1?"\n":" ")
    }
}' file
$ cat file
20 30 40
0.923913 0.913043 0.880435 0.858696 0.826087 0.902174 0.836957 0.880435
80 70 60
50 30 40
Running the same script on this file gives:
3 2 1
1 2 6 8 3 4 7 5
1 2 3
1 3 2
Solution via perl
#!/usr/bin/perl
open(FH, '<', '/home/chidori/input.txt') or die "Can't open file $!\n";
while (my $line = <FH>) {
    chomp($line);
    my @unsorted_array = split(/\s/, $line);
    my $count = scalar @unsorted_array;
    my @sorted_array = sort { $a <=> $b } @unsorted_array;
    my %hash = map { $_ => $count-- } @sorted_array;
    foreach my $value (@unsorted_array) {
        print "$hash{$value} ";
    }
    print "\n";
}

Grouping the rows of a text file based on 2 columns

I have a text file like this:
1 abc 2
1 rgt 2
1 yhj 2
3 gfk 4
5 kji 6
3 plo 4
3 vbn 4
5 olk 6
I want to group the rows on the basis of the first and third columns, like this:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
such that I can see the values of col2 for a particular pair of col1 and col3.
How can I do this using a shell script?
This should do it (build a comma-separated list of col2 keyed on "col1 col3", strip the trailing comma, swap the last two fields back into the original column order, and sort):
awk -F " " '{ a[$1" "$3]=a[$1" "$3]$2","; }END{ for (i in a)print i, a[i]; }' file.txt | sed 's/,$//g' | awk -F " " '{ tmp=$3;$3=$2;$2=tmp;print }' |sort
Just using awk:
#!/usr/bin/env awk -f
{
    k = $1 "\x1C" $3
    if (k in a2) {
        a2[k] = a2[k] "," $2
    } else {
        a1[k] = $1
        a2[k] = $2
        a3[k] = $3
        b[++i] = k
    }
}
END {
    for (j = 1; j <= i; ++j) {
        k = b[j]
        print a1[k], a2[k], a3[k]
    }
}
One line:
awk '{k=$1"\x1C"$3;if(k in a2){a2[k]=a2[k]","$2}else{a1[k]=$1;a2[k]=$2;a3[k]=$3;b[++i]=k}}END{for(j=1;j<=i;++j){k=b[j];print a1[k],a2[k],a3[k]}}' file
Output:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
